<a href="https://colab.research.google.com/github/clementbowe14/ml-class/blob/main/BachChorales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



Load the dataset into a pandas DataFrame.

In [None]:
import pandas as pd

# Load the dataset and display the first ten rows.
bach = pd.read_csv("/content/sample_data/bach.csv")
bach.head(10)

Unnamed: 0,choral_ID,event_number,C,C#,D,D#,E,F,F#,G,G#,A,A#,B,bass,meter,chord_label
0,000106b_,1,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,3,F_M
1,000106b_,2,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,E,5,C_M
2,000106b_,3,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,E,2,C_M
3,000106b_,4,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,3,F_M
4,000106b_,5,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,2,F_M
5,000106b_,6,NO,NO,YES,NO,NO,YES,NO,NO,NO,YES,NO,NO,D,4,D_m
6,000106b_,7,NO,NO,YES,NO,NO,YES,NO,NO,NO,YES,NO,NO,D,2,D_m
7,000106b_,8,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,A,3,F_M
8,000106b_,9,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,A,2,F_M
9,000106b_,10,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,YES,NO,Bb,5,BbM


Count how often each chord label occurs.



In [None]:
bach['chord_label'].value_counts()

D_M     503
G_M     489
C_M     488
F_M     389
A_M     352
       ... 
F_d7      1
DbM7      1
Ebd       1
Abd       1
F#d7      1
Name: chord_label, Length: 102, dtype: int64

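Chord labels that occur only once are removed from the dataset. A class with a single sample cannot appear in both the training and test sets, and the stratified split used later requires at least two samples per class. Note that this removes only the rarest labels; the remaining classes are still imbalanced.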

In [None]:
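# Keep only the chord labels that occur more than once.
label_counts = bach['chord_label'].value_counts()
frequent_labels = label_counts[label_counts > 1].index
bach = bach.loc[bach['chord_label'].isin(frequent_labels)]
bach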

Unnamed: 0,choral_ID,event_number,C,C#,D,D#,E,F,F#,G,G#,A,A#,B,bass,meter,chord_label
0,000106b_,1,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,3,F_M
1,000106b_,2,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,E,5,C_M
2,000106b_,3,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,E,2,C_M
3,000106b_,4,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,3,F_M
4,000106b_,5,YES,NO,NO,NO,NO,YES,NO,NO,NO,YES,NO,NO,F,2,F_M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5660,015505b_,105,NO,NO,YES,NO,NO,NO,NO,YES,NO,NO,YES,NO,G,4,G_m
5661,015505b_,106,NO,NO,YES,NO,NO,NO,NO,YES,NO,YES,NO,NO,G,3,G_m
5662,015505b_,107,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,NO,NO,C,5,C_M
5663,015505b_,108,YES,NO,NO,NO,YES,NO,NO,YES,NO,NO,YES,NO,C,3,C_M
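
The filtering step can be checked on a small example. The sketch below uses a made-up six-row DataFrame (the values are illustrative, not taken from bach.csv) and applies the same value_counts and isin pattern.

```python
import pandas as pd

# Toy stand-in for the chorale data; the column name matches the notebook,
# but the rows are invented for illustration.
toy = pd.DataFrame({"chord_label": ["C_M", "C_M", "F_M", "G_M", "G_M", "Abd"]})

counts = toy["chord_label"].value_counts()
keep = counts[counts > 1].index            # labels seen at least twice
filtered = toy[toy["chord_label"].isin(keep)]

print(sorted(filtered["chord_label"].unique()))
print(len(toy) - len(filtered), "row(s) removed")
```

Here F_M and Abd each occur once, so both rows are dropped and only C_M and G_M remain.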


The value counts after removing the single-instance labels show 94 remaining classes, down from 102.


In [None]:
bach.chord_label.value_counts()

D_M     503
G_M     489
C_M     488
F_M     389
A_M     352
       ... 
C_d7      2
C#d6      2
C_d6      2
C#M4      2
B_m6      2
Name: chord_label, Length: 94, dtype: int64

Next, let's check whether there are any missing values that need to be imputed or removed.

In [None]:
bach.isna().sum()

choral_ID       0
event_number    0
C               0
C#              0
D               0
D#              0
E               0
F               0
F#              0
G               0
G#              0
A               0
A#              0
B               0
bass            0
meter           0
chord_label     0
dtype: int64

Since there are no NaN values in the dataset, the next step in data preparation is encoding the labels. Each unique chord label is assigned an integer value. This task is handled by the LabelEncoder class from sklearn.preprocessing.


In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform learns the label-to-integer mapping and applies it in one step,
# so a separate call to fit is not needed.
labels = le.fit_transform(bach['chord_label'])

Let's take a look at the labels after they were transformed by the label encoder.

In [None]:
labels

array([75, 34, 34, ..., 34, 34, 75])
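
To see how the integer codes relate to the chord names, the following sketch encodes a short, made-up list of labels and maps the codes back with inverse_transform. The names toy_labels and toy_le are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# A handful of invented labels in the same style as the dataset.
toy_labels = ["F_M", "C_M", "C_M", "D_m", "F_M"]

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_labels)

# classes_ holds the distinct labels in sorted order; the integer code
# of a label is its position in this array.
print(toy_le.classes_)
print(encoded)

# inverse_transform recovers the original chord names from the codes.
decoded = toy_le.inverse_transform(encoded)
print(list(decoded))
```

The same inverse_transform call on the fitted le would turn model predictions back into chord names.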

The next task is to extract the features from the dataset to use for the models. First the chord_label and event_number columns are dropped: chord_label is the classification target, and event_number is unrelated to determining the chord. The remaining columns are then one-hot encoded with sklearn's OneHotEncoder class, and the result is stored in a sparse matrix.

In [None]:
# Drop the target (chord_label) and the event counter (event_number).
features = bach.drop(['chord_label', 'event_number'], axis=1)

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(
    categories='auto',
    sparse_output=True,
)

bach_sparse = encoder.fit_transform(features)


Here the data is split into training and test sets, with 80% of the rows used for training and 20% for testing. The split is stratified on the labels, so each chord label appears in roughly the same proportion in both sets.

In [None]:
from sklearn.model_selection import train_test_split

bach_train_features, bach_test_features, bach_train_labels, bach_test_labels = train_test_split(
    bach_sparse, labels, test_size=0.2, random_state=42, stratify=labels
)
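
The effect of stratify can be verified on synthetic data. This sketch assumes a made-up two-class problem with an 80/20 class ratio and checks that the ratio is preserved in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class proportions in each split; both should match the original 0.8 / 0.2.
print(np.bincount(y_tr) / len(y_tr))
print(np.bincount(y_te) / len(y_te))
```

Without stratify, a rare class could end up entirely in one split, which matters for a dataset where many chord labels have only a few samples.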

The first model we will try is the XGBoost classifier, configured with the gpu_hist tree method, the gpu_predictor predictor, and a maximum tree depth of 6.

In [None]:
from xgboost import XGBClassifier

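# Note: 'gpu_hist' and 'predictor' are the GPU settings for XGBoost releases before 2.0;
# from 2.0 onward the equivalent is tree_method='hist' with device='cuda'.
model = XGBClassifier(tree_method='gpu_hist', predictor='gpu_predictor', max_depth=6)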
model.fit(bach_train_features, bach_train_labels)

The fitted model makes predictions on both the training data and the test data, and sklearn's accuracy_score measures the accuracy on each set. A training score much higher than the test score is a sign of overfitting.

In [None]:
from sklearn.metrics import accuracy_score

training_predictions = model.predict(bach_train_features)
predictions = model.predict(bach_test_features)
score = accuracy_score(bach_test_labels, predictions)
training_score = accuracy_score(bach_train_labels, training_predictions)
print("Training Score:", training_score)
print()
print("Test Score:", score)

Training Score: 0.965524861878453

Test Score: 0.7756183745583038
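
For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below, using invented label arrays, computes it by hand and compares the result with accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented true labels and predictions; four of the five agree.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

manual = np.mean(y_true == y_pred)         # fraction of exact matches
library = accuracy_score(y_true, y_pred)

print(manual, library)
```

Because every sample counts equally, frequent chord labels contribute far more to this number than rare ones.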


Next we use sklearn's RandomizedSearchCV with StratifiedKFold cross-validation to tune the XGBoost model. The params dictionary defines the search space over the hyperparameters n_estimators, max_depth, min_split_loss, subsample, and tree_method. A StratifiedKFold object with 10 folds provides the cross-validation splits. The RandomizedSearchCV object is fit to the training data, samples 20 hyperparameter combinations from the search space, and scores each one on the cross-validation folds. The best combination is then used to train an XGBClassifier model.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold

params = {
    'n_estimators': [150, 200, 250, 300],
    'max_depth': [4, 6, 8, 10, 12],
    'min_split_loss': [0, 1, 10],
    'subsample': [1, .9, .8, .75],
    'tree_method': ['exact', 'gpu_hist']
    }

random_state = 10
folds = 10
param_comb = 20

model = XGBClassifier()
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=random_state)
# Pass the splitter itself; the search calls skf.split() on the data it is fit to.
random_search = RandomizedSearchCV(model, param_distributions=params, n_iter=param_comb,
                                   n_jobs=-1, cv=skf)

In [None]:
random_search.fit(bach_train_features, bach_train_labels)



KeyboardInterrupt: ignored
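
The full search is expensive, so here is a scaled-down sketch of the same pattern that runs in a few seconds. It is only an illustration under stated assumptions: it swaps in a DecisionTreeClassifier and synthetic data from make_classification instead of XGBoost and the chorale features, and the names toy_params, toy_cv, and toy_search are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=0)

# A small search space: 4 x 3 = 12 possible combinations.
toy_params = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]}
toy_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=10)

# Sample 5 of the 12 combinations and score each with 3-fold cross-validation.
toy_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=toy_params,
    n_iter=5,
    cv=toy_cv,
    random_state=10,
)
toy_search.fit(X, y)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
```

With the default refit=True, best_estimator_ is retrained on all of the data passed to fit using the best combination found.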

Let's take a look at the parameters of the estimator we got back.

In [None]:
best_estimator = random_search.best_estimator_

estimator_parameters = best_estimator.get_params()

for parameter, value in estimator_parameters.items():
    print(f"{parameter}: {value}")

In [None]:
predictions = best_estimator.predict(bach_test_features)

accuracy_score(bach_test_labels, predictions)

The final approach we will try is the BaggingClassifier. While the XGBoost model builds its trees sequentially, with each new tree correcting the errors of the ones before it, the bagging classifier builds each tree independently of the others. The BaggingClassifier is configured to use 150 decision trees; each tree is trained on a bootstrap sample (drawn with replacement) equal in size to 80% of the training data, using a random subset of 75% of the features.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

clf = DecisionTreeClassifier()

bagging_clf = BaggingClassifier(clf, n_estimators=150, max_samples=0.8, max_features=0.75, bootstrap=True)
bagging_clf.fit(bach_train_features, bach_train_labels)

predictions = bagging_clf.predict(bach_test_features)

accuracy_score(bach_test_labels, predictions)
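
To make the max_samples and max_features settings concrete, the sketch below fits a much smaller ensemble on synthetic data and inspects which rows and features each tree received. The data and the name toy_bag are assumptions for illustration; only the constructor arguments mirror the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 100 samples with 8 features.
X, y = make_classification(n_samples=100, n_features=8, n_informative=4,
                           random_state=0)

toy_bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=5,
    max_samples=0.8,      # each tree draws a sample sized at 80% of the rows
    max_features=0.75,    # each tree sees 75% of the features (6 of 8 here)
    bootstrap=True,       # rows are drawn with replacement
    random_state=0,
)
toy_bag.fit(X, y)

# estimators_samples_ and estimators_features_ record what each tree was given.
for rows, feats in zip(toy_bag.estimators_samples_, toy_bag.estimators_features_):
    distinct_rows = len(set(rows.tolist()))
    print(distinct_rows, "distinct rows; features:", sorted(feats.tolist()))
```

Because sampling is done with replacement, the number of distinct rows a tree sees is usually lower than the sample size.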