# Classification: Bach Chorales Harmony

The dataset contains pitch-class information for 60 Bach chorales. Each chorale is represented as a "harmonic time series": every row is one time event, with 12 binary columns indicating whether each of the 12 pitch classes of the twelve-tone equal temperament tuning system is present. One further column holds the bass pitch, and another holds the meter, a number from 1 to 5 that indicates how accented the time event is (5 is the most accented, 1 the least). Finally, the last column contains the chord label resonating during the given time event.
The dataset contains 5665 time events and was obtained from [here](https://archive.ics.uci.edu/ml/datasets/Bach+Choral+Harmony).

In [16]:
import numpy as np
import sys
import pandas as pd
import pathmagic  # noqa
import dataset_manipulation as daux

Start by loading the dataset (CSV format) into a pandas DataFrame.

In [18]:
raw_dataset_path = r'jsbach_chorals_harmony.data'
dataset = pd.read_csv(raw_dataset_path, header=None)

Split the dataset into training (80%) and test (20%) sets.

In [19]:
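# split_dataframe is assumed to live in the local dataset_manipulation module (imported as daux)
training, test = daux.split_dataframe(dataset, split_probability=0.8)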
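The `split_dataframe` helper comes from a local module whose source is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch that assigns each row to the first part with probability `split_probability`. The function name `split_dataframe_sketch`, the `seed` parameter, and the row-wise random assignment are all assumptions, not the actual implementation.

```python
import numpy as np
import pandas as pd


def split_dataframe_sketch(df, split_probability=0.8, seed=None):
    """Hypothetical stand-in for split_dataframe: random row-wise split."""
    rng = np.random.default_rng(seed)
    # Each row lands in the first part with probability split_probability
    mask = rng.random(len(df)) < split_probability
    return df[mask], df[~mask]


# Small synthetic frame, used only to illustrate the call
demo = pd.DataFrame({"a": range(1000), "b": range(1000)})
train_part, test_part = split_dataframe_sketch(demo, split_probability=0.8, seed=0)
```

With this approach the split sizes are only approximately 80% and 20%, because each row is assigned independently.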

Extract the published performance (column 8 of the dataset) for the training and test data.

In [20]:
y_training = training[8].copy()
y_test = test[8].copy()

Remove unnecessary columns so we are left only with the desired features. The removed columns contain: 
- Vendor name - column 0 
- Model name - column 1
- Published relative performance - column 8
- Estimated relative performance - column 9

In [21]:
drop_array = np.array([0, 1, 8, 9])  # Select columns to drop
X_training = training.drop(drop_array, axis=1)  # axis=1 for columns, axis=0 for rows
X_test = test.drop(drop_array, axis=1)

We now preprocess the data for ridge regression, which consists of:
1. Subtracting the mean from $y$

\begin{equation*}
y \leftarrow y - \frac{1}{n} \sum_{i=1}^n y_i
\end{equation*}


In [22]:
y_training = y_training - np.mean(y_training)
y_test = y_test - np.mean(y_test)

2. Standardizing the dimensions of $x_i$

\begin{equation*}
x_{ij} \leftarrow \frac{(x_{ij} - \bar{x}_{.j})}{\hat{\sigma}_j}, \quad \hat{\sigma}_j = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_{.j})^2}
\end{equation*} 

That is, for every example, subtract the empirical mean of each dimension and divide by that dimension's empirical standard deviation.


In [23]:
# For training data
X_training_column_mean = np.mean(X_training, axis=0)
X_training_column_mean = np.tile(X_training_column_mean, (X_training.shape[0], 1))
std_training_column = np.sqrt(np.mean(np.square(X_training - X_training_column_mean), axis=0))
std_training_column = np.tile(std_training_column, (X_training.shape[0], 1))
X_training = (X_training - X_training_column_mean) / std_training_column


# For test data
X_test_column_mean = np.mean(X_test, axis=0)
X_test_column_mean = np.tile(X_test_column_mean, (X_test.shape[0], 1))
std_test_column = np.sqrt(np.mean(np.square(X_test - X_test_column_mean), axis=0))
std_test_column = np.tile(std_test_column, (X_test.shape[0], 1))
X_test = (X_test - X_test_column_mean) / std_test_column
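The same standardization can also be written without `np.tile`, because pandas aligns a per-column Series with the DataFrame columns automatically. The sketch below uses a small made-up frame (the values are illustrative only) and `ddof=0`, so the standard deviation uses the $\frac{1}{n}$ factor from the formula above.

```python
import numpy as np
import pandas as pd

# Toy feature matrix, for illustration only
X = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0], 1: [10.0, 10.0, 30.0, 30.0]})

col_mean = X.mean(axis=0)        # empirical mean of each column
col_std = X.std(axis=0, ddof=0)  # ddof=0 matches the 1/n definition

# Column-wise subtraction and division via index alignment
X_standardized = (X - col_mean) / col_std
```

After this step every column has mean 0 and standard deviation 1. Note that a column with zero variance would produce a division by zero, so constant columns need separate handling.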

Because $y$ is centered and the features are standardized, there is no need to insert an offset column of ones at the beginning of the data; the example code is included (commented out) anyway.

In [24]:
#X_training.insert(0, 0, 1)  # Insert a column of ones for the offset
#X_test.insert(0, 0, 1)
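To see why the offset is unnecessary, the sketch below fits ridge regression in closed form on synthetic data with an explicit column of ones and checks that the fitted offset is numerically zero. The data, the true weights, and the regularization strength `lam = 1.0` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with a non-zero offset of 4.0
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=50)

# Same preprocessing as above: center y, standardize each column of X
y_centered = y - y.mean()
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Ridge solution w = (X^T X + lam * I)^{-1} X^T y with a column of ones prepended
lam = 1.0  # arbitrary regularization strength
X_with_ones = np.column_stack([np.ones(len(X_standardized)), X_standardized])
gram = X_with_ones.T @ X_with_ones + lam * np.eye(X_with_ones.shape[1])
weights = np.linalg.solve(gram, X_with_ones.T @ y_centered)

offset = weights[0]  # coefficient of the column of ones
```

The column of ones is orthogonal to the centered feature columns, and the centered targets sum to zero, so the offset coefficient vanishes up to floating-point error.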

Save all the processed data in four CSV files.

In [25]:
y_training.to_csv(r'y_training', index=False, header=False)
y_test.to_csv(r'y_test', index=False, header=False)
X_training.to_csv(r'X_training', index=False, header=False)
X_test.to_csv(r'X_test', index=False, header=False)
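Since the files are written without an index or header row, they need `header=None` when read back. A minimal round-trip sketch on a toy frame follows; the temporary directory and the file name `X_demo` are placeholders, not files produced by this notebook.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame standing in for one of the processed matrices
X_demo = pd.DataFrame(np.arange(6, dtype=float).reshape(3, 2))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_demo")
    # Write values only: no index column, no header row
    X_demo.to_csv(path, index=False, header=False)
    # header=None tells pandas that the first line is data, not column names
    X_loaded = pd.read_csv(path, header=None)

round_trip_ok = np.allclose(X_demo.to_numpy(), X_loaded.to_numpy())
```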