# Classification: Bach Chorales Harmony

The dataset contains pitch-class information for 60 Bach chorales. Each chorale is represented as a "harmonic time series": every row is one time event, with 12 columns of binary information (present or not), one for each of the 12 pitch classes of the twelve-tone equal temperament tuning system. It also contains one column with the bass pitch and one column with meter information, a number from 1 to 5 that indicates how accented the time event is (5 being the most accented and 1 the least). Finally, the last column contains the chord label resonating during the given time event.
The dataset contains 5665 time events and was obtained from [here](https://archive.ics.uci.edu/ml/datasets/Bach+Choral+Harmony).

In [44]:
import numpy as np
import sys
import pandas as pd
import pathmagic  # noqa
import dataset_manipulation as daux

Start by loading the dataset (CSV format) into a pandas DataFrame.

In [45]:
raw_dataset_path = r'jsbach_chorals_harmony.data'
dataset = pd.read_csv(raw_dataset_path, header=None)

Split the dataset into training (80%) and test (20%) sets.

In [46]:
training, test = daux.split_dataframe(dataset, split_probability=0.8)
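The helper `daux.split_dataframe` comes from a local module that is not shown here. As a rough idea of what such a helper might do, here is a minimal sketch that assumes each row is assigned to the training set independently with probability `split_probability`. The name `split_dataframe_sketch` and the `seed` parameter are hypothetical, and the real helper may work differently (for example, by producing an exact 80/20 split).

```python
import numpy as np
import pandas as pd


def split_dataframe_sketch(df, split_probability=0.8, seed=None):
    """Hypothetical stand-in for daux.split_dataframe (the real helper is not shown)."""
    rng = np.random.default_rng(seed)
    # Each row lands in the first part with probability `split_probability`.
    mask = rng.random(len(df)) < split_probability
    return df[mask], df[~mask]


# Toy frame with made-up content, only to exercise the function.
demo = pd.DataFrame({"a": range(1000)})
train_demo, test_demo = split_dataframe_sketch(demo, split_probability=0.8, seed=0)
print(len(train_demo), len(test_demo))
```

With this approach the split sizes are only approximately 80% and 20%, since every row is drawn independently.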

Extract the chord label (column 16, the last column of the dataset) for the training and test data.

In [47]:
y_training = training[16].copy()
y_test = test[16].copy()

Remove unwanted information so we are left only with the desired features. The removed columns contain: 
- Choral ID: Corresponding to the file names from [Bach Central](http://www.bachcentral.com) - column 0
- Chord label: Chord resonating during the given event - column 16

I was inclined to remove columns 14 (chord bass note) and 15 (accent meter), but in the end I kept them, since the authors intended the classification to be carried out using these features and the information they provide.

In [48]:
drop_array = np.array([0,16])  # Select columns to drop
X_training = training.drop(drop_array, axis=1)  # axis=1 for columns, axis=0 for rows
X_test = test.drop(drop_array, axis=1)
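Because the file is loaded without a header, the columns are labelled with the integers 0 to 16. The following sketch uses a single made-up row (an assumption for illustration, not taken from the data file) to show which columns remain after the label is extracted and the two columns are dropped.

```python
import numpy as np
import pandas as pd

# One made-up row with the 17-column layout described above:
# chorale ID, event number, 12 pitch-class flags, bass, meter, chord label.
row = ["000106b_", 1] + ["YES", " NO"] * 6 + ["F", 3, " F_M"]
toy = pd.DataFrame([row])  # no header, so columns are labelled 0..16

y_toy = toy[16].copy()                        # chord label
X_toy = toy.drop(np.array([0, 16]), axis=1)   # drop chorale ID and chord label

print(list(X_toy.columns))
```

The 15 remaining columns keep their original labels 1 to 15, which is why the bass note is still addressed as column 14 later on.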

Removing unnecessary columns matters less for Naive Bayes, since it is quite robust to irrelevant attributes. We could check this by leaving the irrelevant columns in the dataset and comparing the algorithm's results with and without them.
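As an illustration of that kind of check, here is a minimal sketch of a categorical Naive Bayes classifier with Laplace smoothing, run on made-up toy data in which one feature determines the label and the other is pure noise. The functions `fit_nb`, `predict_nb` and `accuracy` are hypothetical helpers written for this example, not the classifier used in this project, and accuracy is measured on the training rows only to keep the sketch short.

```python
import math
from collections import Counter


def fit_nb(rows, labels, alpha=1.0):
    """Fit a categorical Naive Bayes model with Laplace smoothing `alpha`."""
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # value_counts[c][j][v]: number of class-c rows whose feature j equals v
    value_counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    vocab = [set() for _ in range(n_features)]  # distinct values seen per feature
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[c][j][v] += 1
            vocab[j].add(v)
    return class_counts, value_counts, vocab, alpha


def predict_nb(model, row):
    """Return the class with the highest log posterior for one row."""
    class_counts, value_counts, vocab, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for j, v in enumerate(row):
            # Smoothed estimate of P(feature j = v | class c)
            smoothed = (value_counts[c][j][v] + alpha) / (n_c + alpha * len(vocab[j]))
            score += math.log(smoothed)
        scores[c] = score
    return max(scores, key=scores.get)


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted class matches the given label."""
    hits = sum(predict_nb(model, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


# Toy data (made up): feature 0 determines the label, feature 1 is pure noise.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
labels = ["major" if r[0] == 0 else "minor" for r in rows]

acc_with = accuracy(fit_nb(rows, labels), rows, labels)
rows_without = [r[:1] for r in rows]
acc_without = accuracy(fit_nb(rows_without, labels), rows_without, labels)
print(acc_with, acc_without)
```

On this toy data the noise feature contributes the same likelihood to both classes, so it does not change which class wins. Real data would need a held-out test set for a meaningful comparison.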

No offset (bias) column of ones is inserted at the beginning of the data; the code for it is left commented out.

In [49]:
#X_training.insert(0, 0, 1)  # Insert a column of ones for the offset
#X_test.insert(0, 0, 1)

We also map strings and other non-numerical values to numerical equivalents to make calculations with them easier.
The pitch-class flags are mapped this way:  
'NO' -> 0  
'YES' -> 1  

In [50]:
X_training = X_training.replace([' NO', 'YES'], [0, 1])
X_test = X_test.replace([' NO', 'YES'], [0, 1])
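Note that the replacement above matches the string `' NO'` with its leading space, exactly as it is passed to `replace`. As an alternative that does not depend on the padding, the following sketch strips whitespace before mapping. The toy frame and the name `flag_map` are made up for illustration.

```python
import pandas as pd

# Toy frame imitating two pitch-class columns; the values are made up.
toy = pd.DataFrame({1: ["YES", " NO", " NO"], 2: [" NO", "YES", " NO"]})

flag_map = {"YES": 1, "NO": 0}
# Strip surrounding whitespace in each column, then map the flags to integers.
encoded = toy.apply(lambda col: col.str.strip().map(flag_map))
print(encoded)
```

This sketch should only be applied to the 12 pitch-class columns, since any value missing from `flag_map` would become `NaN`.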

For the bass note we assign a number from 0 to the number of distinct bass notes in the dataset minus 1. The numbers are assigned in increasing order of first appearance. For instance, if F is the first bass note it gets number 0, if G is the second distinct bass note it gets 1, and so on.

In [51]:
# Build a single mapping from the full dataset so that the same bass note gets
# the same number in both the training and the test data.
bass_alphabet = dataset[14].unique()
bass_mapping = dict(zip(bass_alphabet, np.arange(bass_alphabet.size)))

# Apply the mapping to the bass column only (column 14).
X_training[14] = X_training[14].replace(bass_mapping)
X_test[14] = X_test[14].replace(bass_mapping)
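The order-of-appearance numbering described above matches what `pd.factorize` produces. Here is a minimal sketch with made-up notes, which also reuses one mapping for a second series so that both share the same codes. The variable names are hypothetical.

```python
import pandas as pd

# Made-up bass notes, standing in for the bass column of the training data.
train_notes = pd.Series(["F", "G", "F", "C", "G"])

# factorize numbers the distinct values by order of first appearance.
codes, alphabet = pd.factorize(train_notes)
print(codes.tolist(), list(alphabet))

# Reuse the same alphabet for another series so the codes stay consistent.
mapping = {note: i for i, note in enumerate(alphabet)}
test_notes = pd.Series(["C", "F"])
test_codes = test_notes.map(mapping)
print(test_codes.tolist())
```

Sharing one mapping matters because a classifier trained on one encoding cannot interpret bass codes that were numbered differently.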

Save all the processed data in four CSV files.

In [52]:
X_header = ['event_number', 'C', 'C#/Db', 'D', 'D#/Eb', 'E', 'F', 'F#/Gb', 'G', 'G#/Ab', 'A', 'A#/Bb', 'B', 'bass', 'meter']
y_header = ['chord']
y_training.to_csv(r'y_training', index=False, header=y_header)
y_test.to_csv(r'y_test', index=False, header=y_header)
X_training.to_csv(r'X_training', index=False, header=X_header)
X_test.to_csv(r'X_test', index=False, header=X_header)
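To confirm that the header aliases are written as intended, the saved data can be read back and inspected. The following sketch does this with an in-memory buffer instead of a file, using one made-up, already-encoded row.

```python
import io

import pandas as pd

X_header = ['event_number', 'C', 'C#/Db', 'D', 'D#/Eb', 'E', 'F', 'F#/Gb',
            'G', 'G#/Ab', 'A', 'A#/Bb', 'B', 'bass', 'meter']

# One made-up, already-encoded row with the integer column labels 1..15.
toy = pd.DataFrame([[1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 3]],
                   columns=range(1, 16))

buffer = io.StringIO()  # in-memory stand-in for a file on disk
toy.to_csv(buffer, index=False, header=X_header)

buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```

The `header` argument replaces the integer column labels with the given names, so it must contain exactly one name per column.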