<a href="https://colab.research.google.com/github/bernhardtandy/ProjectsMLAI/blob/main/HW2_AndyBernhardt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Projects in Machine Learning and AI Homework 2**
## *Decision Trees, Bagging, and Boosting for Binary Music Genre Classification*
##### **Andy Bernhardt**
##### **bernha@rpi.edu**

---
## **Task 1: Decision Tree Classifier for Binary Music Genre Classification**

In this section, we load the required libraries and the GTZAN dataset, preprocess the data, and use sklearn to fit a decision tree classifier for binary music genre classification ("classical" vs "pop"). We then vary the classifier's hyperparameters (split criterion, maximum depth, and maximum number of features per split) and compare the results.


### Setup

In [84]:
# Import required libraries for project
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.preprocessing import Normalizer

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier

### Load dataset from Google Drive

In [53]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [54]:
# Add HW2 folder to path
import sys
sys.path.append('/content/drive/MyDrive/ProjectsMLAI/HW2')

In [55]:
# Load GTZAN 3-second dataset
df = pd.read_csv("/content/drive/MyDrive/ProjectsMLAI/HW2/GTZAN.csv")

### Dataset description
GTZAN is the "most-used public dataset for evaluation in machine listening research for music genre recognition (MGR)". Most notably, it contains 30-second audio clips for 10 genres (100 audio files per genre), as well as an image for each audio file that serves as a "visual representation" of the file. In addition, the dataset includes two CSV files: one with features extracted from each 30-second audio file ($\approx$100 examples per genre, 10 genres), and a second with features extracted after splitting each audio file into 3-second clips ($\approx$1000 examples per genre, 10 genres). Although the recordings were collected in 2000-2001, the dataset is relatively new ($\approx$ two years old).

In this project, we use the second CSV file (features from 3-second clips) and keep only the rows for the "classical" and "pop" genres (998 examples for "classical" and 1000 examples for "pop", a balanced dataset). This reduces the general multi-class genre classification task to a binary classification task. We first split the dataset 80/20 into training and test sets, and then fit various models to predict the genre from the features. Finally, these models are evaluated on the held-out test set and compared.

Link to dataset: https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification

### Data Preprocessing
- Show dataset
- Filter rows with labels "classical" and "pop"
- Split dataframe into training (80%) and testing (20%) sets
- Isolate features (chroma_stft_mean, chroma_stft_var, rms_mean, rms_var, spectral_centroid_mean, spectral_centroid_var, spectral_bandwidth_mean, spectral_bandwidth_var, rolloff_mean, rolloff_var, zero_crossing_rate_mean, zero_crossing_rate_var, harmony_mean, harmony_var, perceptr_mean, perceptr_var, tempo, mfcc1_mean, mfcc1_var, mfcc2_mean, mfcc2_var, mfcc3_mean, mfcc3_var, mfcc4_mean, mfcc4_var, mfcc5_mean, mfcc5_var, mfcc6_mean, mfcc6_var, mfcc7_mean, mfcc7_var, mfcc8_mean, mfcc8_var, mfcc9_mean, mfcc9_var, mfcc10_mean, mfcc10_var, mfcc11_mean, mfcc11_var, mfcc12_mean, mfcc12_var, mfcc13_mean, mfcc13_var, mfcc14_mean, mfcc14_var, mfcc15_mean, mfcc15_var, mfcc16_mean, mfcc16_var, mfcc17_mean, mfcc17_var, mfcc18_mean, mfcc18_var, mfcc19_mean, mfcc19_var, mfcc20_mean, mfcc20_var)
- Normalize features
- Change labels to 0 ("classical") and 1 ("pop") and isolate labels

In [56]:
# Show dataset
df

Unnamed: 0,filename,length,chroma_stft_mean,chroma_stft_var,rms_mean,rms_var,spectral_centroid_mean,spectral_centroid_var,spectral_bandwidth_mean,spectral_bandwidth_var,...,mfcc16_var,mfcc17_mean,mfcc17_var,mfcc18_mean,mfcc18_var,mfcc19_mean,mfcc19_var,mfcc20_mean,mfcc20_var,label
0,blues.00000.0.wav,66149,0.335406,0.091048,0.130405,0.003521,1773.065032,167541.630869,1972.744388,117335.771563,...,39.687145,-3.241280,36.488243,0.722209,38.099152,-5.050335,33.618073,-0.243027,43.771767,blues
1,blues.00000.1.wav,66149,0.343065,0.086147,0.112699,0.001450,1816.693777,90525.690866,2010.051501,65671.875673,...,64.748276,-6.055294,40.677654,0.159015,51.264091,-2.837699,97.030830,5.784063,59.943081,blues
2,blues.00000.2.wav,66149,0.346815,0.092243,0.132003,0.004620,1788.539719,111407.437613,2084.565132,75124.921716,...,67.336563,-1.768610,28.348579,2.378768,45.717648,-1.938424,53.050835,2.517375,33.105122,blues
3,blues.00000.3.wav,66149,0.363639,0.086856,0.132565,0.002448,1655.289045,111952.284517,1960.039988,82913.639269,...,47.739452,-3.841155,28.337118,1.218588,34.770935,-3.580352,50.836224,3.630866,32.023678,blues
4,blues.00000.4.wav,66149,0.335579,0.088129,0.143289,0.001701,1630.656199,79667.267654,1948.503884,60204.020268,...,30.336359,0.664582,45.880913,1.689446,51.363583,-3.392489,26.738789,0.536961,29.146694,blues
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9985,rock.00099.5.wav,66149,0.349126,0.080515,0.050019,0.000097,1499.083005,164266.886443,1718.707215,85931.574523,...,42.485981,-9.094270,38.326839,-4.246976,31.049839,-5.625813,48.804092,1.818823,38.966969,rock
9986,rock.00099.6.wav,66149,0.372564,0.082626,0.057897,0.000088,1847.965128,281054.935973,1906.468492,99727.037054,...,32.415203,-12.375726,66.418587,-3.081278,54.414265,-11.960546,63.452255,0.428857,18.697033,rock
9987,rock.00099.7.wav,66149,0.347481,0.089019,0.052403,0.000701,1346.157659,662956.246325,1561.859087,138762.841945,...,78.228149,-2.524483,21.778994,4.809936,25.980829,1.775686,48.582378,-0.299545,41.586990,rock
9988,rock.00099.8.wav,66149,0.387527,0.084815,0.066430,0.000320,2084.515327,203891.039161,2018.366254,22860.992562,...,28.323744,-5.363541,17.209942,6.462601,21.442928,2.354765,24.843613,0.675824,12.787750,rock


In [57]:
# Show original label value counts
print(f"Value counts:\n{df.label.value_counts()}")

Value counts:
blues        1000
jazz         1000
metal        1000
pop          1000
reggae       1000
disco         999
classical     998
hiphop        998
rock          998
country       997
Name: label, dtype: int64


In [58]:
# Filter rows with labels "classical" and "pop"
df_filtered = df.loc[df['label'].isin(['classical', 'pop'])]
df_filtered

Unnamed: 0,filename,length,chroma_stft_mean,chroma_stft_var,rms_mean,rms_var,spectral_centroid_mean,spectral_centroid_var,spectral_bandwidth_mean,spectral_bandwidth_var,...,mfcc16_var,mfcc17_mean,mfcc17_var,mfcc18_mean,mfcc18_var,mfcc19_mean,mfcc19_var,mfcc20_mean,mfcc20_var,label
1000,classical.00000.0.wav,66149,0.255331,0.080393,0.032510,0.000075,1599.272683,1.856708e+04,1675.591596,20596.851729,...,55.257095,-1.666343,104.916260,4.525014,69.806412,-0.897889,110.099174,4.160629,194.109070,classical
1001,classical.00000.1.wav,66149,0.231431,0.084894,0.031453,0.000059,1551.352817,2.852497e+04,1485.790068,28831.794511,...,50.988071,-1.546098,65.954590,7.157280,69.336983,2.718532,120.725609,-1.692275,150.527496,classical
1002,classical.00000.2.wav,66149,0.225458,0.082233,0.041776,0.000222,1466.237496,4.501883e+04,1495.076539,10600.321072,...,81.254791,-4.686039,102.037117,-4.411082,60.409714,-2.694283,60.788319,3.038420,213.015579,classical
1003,classical.00000.3.wav,66149,0.260866,0.082233,0.032749,0.000136,1435.850575,3.227031e+04,1585.998216,36208.016702,...,40.191471,2.764065,46.115536,-0.732678,60.197281,-8.223904,64.066719,1.734897,119.727020,classical
1004,classical.00000.4.wav,66149,0.269611,0.084948,0.045156,0.000476,1477.712706,2.144815e+04,1569.311614,20844.078511,...,62.712025,-0.079564,89.584717,2.686025,65.182037,-1.290078,105.829987,-4.315813,79.882378,classical
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7987,pop.00099.5.wav,66149,0.342589,0.089339,0.137082,0.002701,3210.404285,5.229747e+05,3518.891658,106856.507085,...,55.283569,1.613460,56.897308,6.900325,95.293251,-0.688965,37.893326,5.038363,56.635410,pop
7988,pop.00099.6.wav,66149,0.324163,0.096156,0.151230,0.001677,2584.693526,5.699567e+05,3274.529361,135583.646289,...,49.914917,3.775996,73.963264,12.763522,150.864807,7.824332,220.597992,11.147660,241.048462,pop
7989,pop.00099.7.wav,66149,0.347829,0.102577,0.130131,0.002049,2732.619962,9.389428e+05,3305.178275,228680.158389,...,37.255424,0.172578,43.497780,2.358351,79.210114,5.260961,137.600296,9.504729,119.177002,pop
7990,pop.00099.8.wav,66149,0.353859,0.092972,0.148098,0.001058,2891.360207,1.304064e+06,3344.280322,165330.747555,...,57.142181,2.004879,86.238083,4.730950,161.434052,10.439860,157.169983,9.645600,124.777672,pop


In [59]:
# Show label value counts after filtering
print(f"Value counts:\n{df_filtered.label.value_counts()}")

Value counts:
pop          1000
classical     998
Name: label, dtype: int64


In [60]:
# Split dataframe into training (80%) and testing (20%) sets
df_train, df_test = train_test_split(df_filtered, test_size=0.20, random_state=42)

In [61]:
# Show label distribution of df_train
print(f"Value counts:\n{df_train.label.value_counts()}")

Value counts:
pop          807
classical    791
Name: label, dtype: int64


In [62]:
# Isolate features
X_train = df_train.drop(['filename', 'length', 'label'], axis=1)
X_test = df_test.drop(['filename', 'length', 'label'], axis=1)
X_train.shape, X_test.shape
# We check that our feature matrices have the correct shape
# For the training data, this is 1598 examples with 57 features each
# For the testing data, this is 400 examples with 57 features each

((1598, 57), (400, 57))

In [63]:
# Normalize features
transformer = Normalizer(norm='max').fit(X_train)
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)
# Note: Normalizer rescales each sample (row) by its largest absolute value;
# it does not put the individual features (columns) on a common scale
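
To make concrete what `Normalizer(norm='max')` computes, the following sketch reproduces its row-wise arithmetic with NumPy on a small made-up matrix and contrasts it with per-feature min-max scaling. The toy values and variable names are illustrative assumptions, not data from GTZAN.

```python
import numpy as np

# Toy matrix: 3 samples (rows) x 3 features (columns); values are made up
X = np.array([[0.30, 1800.0, 120000.0],
              [0.25, 1500.0,  20000.0],
              [0.35, 3200.0, 520000.0]])

# Row-wise scaling: divide each sample by its own largest absolute value.
# This is the arithmetic behind Normalizer(norm='max').
row_scaled = X / np.abs(X).max(axis=1, keepdims=True)

# Column-wise min-max scaling: map each feature to [0, 1] independently.
# (A per-feature alternative, shown only for contrast.)
col_min, col_max = X.min(axis=0), X.max(axis=0)
col_scaled = (X - col_min) / (col_max - col_min)

print(row_scaled.max(axis=1))                           # largest value in each row
print(col_scaled.min(axis=0), col_scaled.max(axis=0))   # range of each column
```

After row-wise scaling, small-valued features stay tiny relative to the large-valued ones in the same row. A decision tree thresholds one feature at a time, so a per-feature rescaling would leave its splits unchanged, whereas row-wise scaling changes how samples compare within a feature.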

In [64]:
# Map labels to 0 ("classical") and 1 ("pop")
# assign() returns new dataframes, which avoids pandas' SettingWithCopyWarning
label_map = {'classical': 0, 'pop': 1}
df_train = df_train.assign(label=df_train['label'].map(label_map))
df_test = df_test.assign(label=df_test['label'].map(label_map))

In [65]:
# Isolate labels
y_train = df_train.label
y_test = df_test.label
y_train.shape, y_test.shape
# We check that our label vectors have the correct shape

((1598,), (400,))

### Define Functions for Decision Tree Classifier Experiments

In [66]:
def fit_decision_tree(criterion, max_depth, max_features, X_train, y_train):
  """Fit a DecisionTreeClassifier with the given hyperparameters on the training data."""
  decision_tree = tree.DecisionTreeClassifier(criterion=criterion, max_depth=max_depth,
                                              max_features=max_features, random_state=42)
  return decision_tree.fit(X_train, y_train)

In [79]:
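def evaluate_model(model, X_test, y_test):
  """Return confusion matrix, accuracy, precision, recall, F1, and report on the test set.

  Precision, recall, and F1 use sklearn's default positive class, label 1 ("pop").
  """
  pred = model.predict(X_test)
  cm = confusion_matrix(y_test, pred)  # rows: true class, columns: predicted class
  acc = accuracy_score(y_test, pred)
  p = precision_score(y_test, pred)
  r = recall_score(y_test, pred)
  f1 = f1_score(y_test, pred)
  cr = classification_report(y_test, pred, target_names=['classical', 'pop'])
  return cm, acc, p, r, f1, cr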
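
For reference, the scalar metrics returned by `evaluate_model` can be written in terms of the confusion-matrix entries. The symbols $TP$, $FP$, $FN$, and $TN$ are notation introduced here, with "pop" (label 1) as the positive class, which is sklearn's default for binary targets:

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

In sklearn's layout, `cm[i][j]` counts examples of true class `i` predicted as class `j`, so $TN$ is `cm[0][0]`, $FP$ is `cm[0][1]`, $FN$ is `cm[1][0]`, and $TP$ is `cm[1][1]`.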

In [68]:
def vary_parameters(X_train, y_train, X_test, y_test):
  """Fit a tree for every parameter combination; map each setting to its test F1-score."""
  result = {}
  criteria = ['gini', 'entropy']
  max_depths = [1, 2, 5, 10, 100, None]
  max_features = ['sqrt', 'log2', None]
  for c in criteria:
    for md in max_depths:
      for mf in max_features:
        decision_tree = fit_decision_tree(c, md, mf, X_train, y_train)
        # Only the F1-score is kept; the other metrics are discarded
        _, _, _, _, f1, _ = evaluate_model(decision_tree, X_test, y_test)
        result[(c, md, mf)] = f1
  return result
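
The `'sqrt'` and `'log2'` settings limit how many features are considered at each split. As a rough check, the sketch below evaluates those counts for the 57 features used here, assuming the result is truncated to an integer with a minimum of 1, as in sklearn's implementation. The helper name `features_per_split` is hypothetical.

```python
import math

def features_per_split(setting, n_features):
    """Number of candidate features examined at each split (hypothetical helper)."""
    if setting is None:
        return n_features                      # consider every feature
    if setting == 'sqrt':
        return max(1, int(math.sqrt(n_features)))
    if setting == 'log2':
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported setting: {setting!r}")

n_features = 57  # feature columns left after dropping filename, length, and label
counts = {s: features_per_split(s, n_features) for s in ('sqrt', 'log2', None)}
print(counts)  # {'sqrt': 7, 'log2': 5, None: 57}
```

With only 7 or 5 candidate features per split, a shallow tree may never see the most discriminative feature, which is worth keeping in mind when reading the depth-1 results.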

### Fit and Evaluate the Default Sklearn Decision Tree Classifier

In [69]:
decision_tree = fit_decision_tree('gini', None, None, X_train, y_train)
cm, acc, p, r, f1, cr = evaluate_model(decision_tree, X_test, y_test)
print(f"Confusion Matrix:\n{cm}\nAccuracy: {acc}, Precision: {p}, Recall: {r}, F1-score: {f1}\nClassification Report:\n{cr}")

Confusion Matrix:
[[202   5]
 [  5 188]]
Accuracy: 0.975, Precision: 0.9740932642487047, Recall: 0.9740932642487047, F1-score: 0.9740932642487047
Classification Report:
              precision    recall  f1-score   support

   classical       0.98      0.98      0.98       207
         pop       0.97      0.97      0.97       193

    accuracy                           0.97       400
   macro avg       0.97      0.97      0.97       400
weighted avg       0.97      0.97      0.97       400



### Compare Decision Tree Classifiers with Varying Split Criteria, Max Depth, and Max Features

In [50]:
result = vary_parameters(X_train, y_train, X_test, y_test)

In [51]:
result

{('gini', 1, 'sqrt'): 0.893827160493827,
 ('gini', 1, 'log2'): 0.893827160493827,
 ('gini', 1, None): 0.9722921914357683,
 ('gini', 2, 'sqrt'): 0.9258312020460359,
 ('gini', 2, 'log2'): 0.9411764705882352,
 ('gini', 2, None): 0.9690721649484536,
 ('gini', 5, 'sqrt'): 0.9616368286445013,
 ('gini', 5, 'log2'): 0.9720101781170483,
 ('gini', 5, None): 0.9820051413881749,
 ('gini', 10, 'sqrt'): 0.9591836734693877,
 ('gini', 10, 'log2'): 0.9717223650385605,
 ('gini', 10, None): 0.9740932642487047,
 ('gini', 100, 'sqrt'): 0.9695431472081218,
 ('gini', 100, 'log2'): 0.9690721649484536,
 ('gini', 100, None): 0.9740932642487047,
 ('gini', None, 'sqrt'): 0.9695431472081218,
 ('gini', None, 'log2'): 0.9690721649484536,
 ('gini', None, None): 0.9740932642487047,
 ('entropy', 1, 'sqrt'): 0.893827160493827,
 ('entropy', 1, 'log2'): 0.893827160493827,
 ('entropy', 1, None): 0.9722921914357683,
 ('entropy', 2, 'sqrt'): 0.9283819628647215,
 ('entropy', 2, 'log2'): 0.9411764705882352,
 ('entropy', 2, Non

Below, we summarize the results of our parameter-varying experiments in tables.

#### Split criterion: Gini index, Metric: F1-score
The best score in each row is **bold**, the best score in each column is *italic*, and the best score in the table is in [brackets].

| Max Features (rows) / Max Depth (columns) | 1 | 2 | 5 | 10 | 100 | None |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| $\sqrt{n}$ | 0.8938 | 0.9258 | 0.9616 | 0.9592 | **0.9695** | **0.9695** |
| $\log_2{n}$ | 0.8938 | 0.9412 | **0.9720** | 0.9717 | 0.9691 | 0.9691 |
| **None** | *0.9723* | *0.9691* | [***0.9820***] | *0.9741* | *0.9741* | *0.9741* |

#### Split criterion: Information gain, Metric: F1-score
The best score in each row is **bold**, the best score in each column is *italic*, and the best score in the table is in [brackets].

| Max Features (rows) / Max Depth (columns) | 1 | 2 | 5 | 10 | 100 | None |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| $\sqrt{n}$ | 0.8938 | 0.9284 | 0.9664 | **0.9767** | **0.9767** | **0.9767** |
| $\log_2{n}$ | 0.8938 | 0.9412 | 0.9634 | 0.9541 | **0.9695** | **0.9695** |
| **None** | *0.9723* | *0.9723* | [***0.9769***] | [***0.9769***] | [***0.9769***] | [***0.9769***] |
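
The two tables differ only in the impurity measure used to score candidate splits. As a minimal sketch, the functions below compute the Gini index and the entropy (in bits) of a node from its class counts, using the training-set label counts reported earlier (791 "classical", 807 "pop") as the root node. The function names are illustrative, and this is not sklearn's internal code.

```python
import math

def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; empty classes contribute zero."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

root = [791, 807]   # training-set class counts: classical, pop
pure = [791, 0]     # a perfectly pure node, for comparison

print(f"root: gini={gini(root):.4f}, entropy={entropy(root):.4f}")
print(f"pure: gini={gini(pure):.4f}, entropy={entropy(pure):.4f}")
```

Both measures are close to their two-class maxima (0.5 and 1 bit) at the root because the classes are almost balanced, and both are zero for a pure node. Since they rank most candidate splits similarly, it is plausible that the two criteria produce similar trees.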

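From the two tables we observe the following. All scores come from a single test split of 400 clips, so small differences should be read with caution.

- Split criterion: the criterion has little effect. At depth 1 the Gini and information-gain trees score identically for all three max-features settings, and at depth 2 they differ by less than 0.004. The best tree overall uses the Gini index (0.9820 at depth 5 with all features), while the best information-gain trees reach 0.9769.
- Max depth: the F1-score improves markedly from depth 1 to depth 5 when features are subsampled (for example, from 0.8938 to 0.9616 for Gini with $\sqrt{n}$ features) and changes little afterwards. Depths 100 and None give identical scores in every row, which indicates that the limit of 100 is never binding. With Gini and all features, depth 5 (0.9820) scores slightly higher than the unrestricted tree (0.9741), which is consistent with mild overfitting in deeper trees, although the gap corresponds to only a few test clips.
- Max features: considering all 57 features at each split (None) gives the best score in every column of both tables. The penalty for subsampling is largest at depth 1 (0.8938 versus 0.9723), where a single split chosen from a random subset of features can miss the most discriminative one, and shrinks to at most about 0.02 from depth 5 onward.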

---
## **Task 2: Bagging and Boosting for Binary Music Genre Classification**

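In this section, we use sklearn to implement two ensemble methods for binary music genre classification ("classical" vs "pop"): bagging, represented by an extra-trees classifier, and boosting, represented by a gradient boosting classifier. Note that sklearn's `ExtraTreesClassifier` uses `bootstrap=False` by default, so with the settings below each tree is fit on the full training set with randomized split thresholds rather than on a bootstrap sample. We use k-fold cross validation to compare the effectiveness of the two models.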
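
As a generic illustration of bagging, not tied to any particular sklearn estimator, each tree in a bagged ensemble is fit on a bootstrap sample, that is, a sample of the training set drawn with replacement. The sketch below uses only the standard library and a hypothetical helper `bootstrap_indices` to draw one such sample for a training set of 1598 examples and report the fraction of distinct examples it contains.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices uniformly with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

rng = random.Random(42)        # fixed seed so the sketch is repeatable
n_samples = 1598               # size of the training set in this notebook
sample = bootstrap_indices(n_samples, rng)

unique_fraction = len(set(sample)) / n_samples
print(f"distinct examples in the bootstrap sample: {unique_fraction:.3f}")
# For large n_samples this fraction is expected to be close to 1 - 1/e (about 0.632)
```

Because each tree sees a different subset of the data, the trees' errors are partly decorrelated, and averaging their votes reduces variance.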


In [93]:
bagging = ExtraTreesClassifier(n_estimators=100)

In [94]:
# 10-fold cross validation on the training dataset, repeated 3 times
# Output is mean and std F1-score over the models trained on 90% of the data and tested on the remaining 10%
cross_validation = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
cross_validation_scores = cross_val_score(bagging, X_train, y_train, scoring='f1', cv=cross_validation, n_jobs=-1, error_score='raise')
print(f"F1-score: Mean: {np.mean(cross_validation_scores)}, Standard Deviation: {np.std(cross_validation_scores)}")

F1-score: Mean: 0.9856590027689845, Standard Deviation: 0.009292135351084621
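
Stratified k-fold splitting keeps the class proportions of each fold close to those of the full training set. The following is a simplified sketch of that idea in plain Python, not sklearn's implementation: it shuffles the indices of each class and deals them round-robin into `k` folds. The helper name `stratified_folds` and the seed are assumptions for illustration; the label counts are the training-set counts reported above.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign indices to k folds so each fold has a similar class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)   # deal round-robin
    return folds

labels = [0] * 791 + [1] * 807    # classical = 0, pop = 1
folds = stratified_folds(labels, k=10)

for fold in folds[:3]:
    n_pop = sum(labels[i] for i in fold)
    print(len(fold), n_pop)   # fold size and number of "pop" examples
```

Each held-out fold contains roughly 160 of the 1598 training examples. Repeating the 10-fold split 3 times with different shuffles gives the 30 F1-scores whose mean and standard deviation are printed above.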


In [95]:
# Bagging model fit over entire training dataset and evaluated on testing dataset
bagging.fit(X_train, y_train)
cm, acc, p, r, f1, cr = evaluate_model(bagging, X_test, y_test)
print(f"Confusion Matrix:\n{cm}\nAccuracy: {acc}, Precision: {p}, Recall: {r}, F1-score: {f1}\nClassification Report:\n{cr}")

Confusion Matrix:
[[201   6]
 [  1 192]]
Accuracy: 0.9825, Precision: 0.9696969696969697, Recall: 0.9948186528497409, F1-score: 0.9820971867007673
Classification Report:
              precision    recall  f1-score   support

   classical       1.00      0.97      0.98       207
         pop       0.97      0.99      0.98       193

    accuracy                           0.98       400
   macro avg       0.98      0.98      0.98       400
weighted avg       0.98      0.98      0.98       400



In [96]:
boosting = GradientBoostingClassifier(n_estimators=100, random_state=42)

In [97]:
# 10-fold cross validation on the training dataset, repeated 3 times
# Output is mean and std F1-score over the models trained on 90% of the data and tested on the remaining 10%
cross_validation_scores = cross_val_score(boosting, X_train, y_train, scoring='f1', cv=cross_validation, n_jobs=-1, error_score='raise')
print(f"F1-score: Mean: {np.mean(cross_validation_scores)}, Standard Deviation: {np.std(cross_validation_scores)}")

F1-score: Mean: 0.9839594617920908, Standard Deviation: 0.009354268699527922


In [98]:
# Boosting model fit over entire training dataset and evaluated on testing dataset
boosting.fit(X_train, y_train)
cm, acc, p, r, f1, cr = evaluate_model(boosting, X_test, y_test)
print(f"Confusion Matrix:\n{cm}\nAccuracy: {acc}, Precision: {p}, Recall: {r}, F1-score: {f1}\nClassification Report:\n{cr}")

Confusion Matrix:
[[202   5]
 [  0 193]]
Accuracy: 0.9875, Precision: 0.9747474747474747, Recall: 1.0, F1-score: 0.9872122762148338
Classification Report:
              precision    recall  f1-score   support

   classical       1.00      0.98      0.99       207
         pop       0.97      1.00      0.99       193

    accuracy                           0.99       400
   macro avg       0.99      0.99      0.99       400
weighted avg       0.99      0.99      0.99       400



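The two ensembles perform almost identically. In cross-validation, the extra-trees classifier has a mean F1-score of 0.9857 (standard deviation 0.0093) and the gradient boosting classifier has a mean of 0.9840 (standard deviation 0.0094); the difference in means (about 0.002) is much smaller than either standard deviation. On the held-out test set the order is reversed: gradient boosting reaches an F1-score of 0.9872 (5 errors out of 400) and extra-trees reaches 0.9821 (7 errors). Both models make the same kind of mistake, labeling a few "classical" clips as "pop" (6 for extra-trees, 5 for gradient boosting), while almost never labeling "pop" clips as "classical" (1 and 0, respectively). Because the extra-trees classifier was created without a `random_state`, its scores may vary slightly between runs.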

---
## **Task 3: Comparing Decision Tree Classifier, Bagging, and Boosting**

In this section, we will compare the effectiveness of the base decision tree classifier, the extra-trees classifier (bagging), and the gradient boosting classifier (boosting) models implemented above. 


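We compare the three models by their F1-score on the held-out test set, with "pop" as the positive class. F1 is the harmonic mean of precision and recall, so it penalizes a model that does well on only one of the two. On this metric the ranking is gradient boosting (0.9872), extra-trees (0.9821), and the default decision tree (0.9741).

Because the two classes are nearly balanced (998 "classical" and 1000 "pop" examples), accuracy is also a reasonable metric here, and it yields the same ranking (0.9875, 0.9825, and 0.9750). The choice of metric matters more when precision or recall is considered alone. By recall the ranking is unchanged (1.0000, 0.9948, and 0.9741), but by precision the decision tree (0.9741) ranks above extra-trees (0.9697), because extra-trees labels 6 "classical" clips as "pop" while the decision tree labels 5. These differences amount to a handful of test clips, so they should be read as indicative rather than conclusive.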
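
To illustrate how the choice of metric can change the ranking, the sketch below recomputes accuracy, precision, recall, and F1 from the three test-set confusion matrices reported above (rows are true classes, columns are predicted classes, and "pop" is the positive class) and ranks the models under each metric. The helper name `metrics_from_cm` is hypothetical.

```python
def metrics_from_cm(cm):
    """Compute binary metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
        'f1': 2 * tp / (2 * tp + fp + fn),
    }

# Test-set confusion matrices copied from the outputs above
confusion_matrices = {
    'decision tree': [[202, 5], [5, 188]],
    'extra-trees': [[201, 6], [1, 192]],
    'gradient boosting': [[202, 5], [0, 193]],
}
scores = {name: metrics_from_cm(cm) for name, cm in confusion_matrices.items()}

# Rank the models from best to worst under each metric
for metric in ('accuracy', 'precision', 'recall', 'f1'):
    ranking = sorted(scores, key=lambda name: scores[name][metric], reverse=True)
    print(f"{metric:>9}: " + " > ".join(f"{n} ({scores[n][metric]:.4f})" for n in ranking))
```

Only the precision ranking differs from the others, and it hinges on a single additional "classical" clip that extra-trees labels as "pop".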