<h1><b>MACHINE LEARNING

<h3> In this notebook, we will try to predict the genre of a song from a set of audio features

<h3> Run cells using shift + enter or the run button at the top

In [47]:
"""You may need to install sklearn and xgboost if you haven't already from last week's workshop
ONLY UNCOMMENT AND RUN IT IF NEEDED, OTHERWISE YOU CAN SKIP THIS CELL"""
# %pip install scikit-learn
# %pip install xgboost

In [48]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost.sklearn import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

<h2><b>Data loading and preprocessing

In [49]:
# Load the data
df = pd.read_csv('ClassicHit.csv')
df.head()

Unnamed: 0,Track,Artist,Year,Duration,Time_Signature,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Popularity,Genre
0,Hey Jack Kerouac,"10,000 Maniacs",1987,206413,4,0.616,0.511,6,-15.894,1,0.0279,0.0384,0.0,0.15,0.604,132.015,40,Alt. Rock
1,Like the Weather,"10,000 Maniacs",1987,236653,4,0.77,0.459,1,-17.453,1,0.0416,0.112,0.00343,0.145,0.963,133.351,43,Alt. Rock
2,What's the Matter Here?,"10,000 Maniacs",1987,291173,4,0.593,0.816,9,-7.293,1,0.041,0.00449,3.2e-05,0.0896,0.519,99.978,12,Alt. Rock
3,Trouble Me,"10,000 Maniacs",1989,193560,4,0.861,0.385,2,-10.057,1,0.0341,0.154,0.0,0.123,0.494,117.913,47,Alt. Rock
4,Candy Everybody Wants,"10,000 Maniacs",1992,185960,4,0.622,0.876,10,-6.31,1,0.0305,0.0193,0.00684,0.0987,0.867,104.97,43,Alt. Rock


In [50]:
# List the columns and their data types
df.dtypes

Track                object
Artist               object
Year                  int64
Duration              int64
Time_Signature        int64
Danceability        float64
Energy              float64
Key                   int64
Loudness            float64
Mode                  int64
Speechiness         float64
Acousticness        float64
Instrumentalness    float64
Liveness            float64
Valence             float64
Tempo               float64
Popularity            int64
Genre                object
dtype: object

<h3> Since Track and Artist are strings, we will drop those columns. In theory, we could find a way to encode them, but for now we will focus on predicting the genre solely from the numerical features. We will also remove the Year, Duration and Time_Signature columns, since those are generic pieces of information about the songs that will likely not add much useful signal

<h3> Additionally, we will do the same basic preprocessing steps as last week where we drop any duplicate rows and remove rows with a tempo of 0 
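
The cell that follows only drops columns, so here is a minimal sketch of what the duplicate and zero-tempo cleanup could look like. The `toy` frame and its values are made up for illustration; on the real data the same two steps would be applied to `df`, which would also change the row counts printed later in this notebook.

```python
import pandas as pd

# Tiny made-up frame standing in for ClassicHit.csv (values are illustrative only)
toy = pd.DataFrame({
    "Track": ["A", "A", "B", "C"],
    "Tempo": [120.0, 120.0, 0.0, 98.5],
    "Genre": ["Rock", "Rock", "Jazz", "Pop"],
})

# Drop exact duplicate rows, then keep only rows with a non-zero tempo
cleaned = toy.drop_duplicates()
cleaned = cleaned[cleaned["Tempo"] != 0].reset_index(drop=True)

print(len(toy), "rows before,", len(cleaned), "rows after")
```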

In [51]:
# Drop the string columns (Track, Artist) and the generic metadata columns (Year, Duration, Time_Signature)
df = df.drop(['Track', 'Artist', 'Year', 'Duration', 'Time_Signature'], axis=1)
df.dtypes

Danceability        float64
Energy              float64
Key                   int64
Loudness            float64
Mode                  int64
Speechiness         float64
Acousticness        float64
Instrumentalness    float64
Liveness            float64
Valence             float64
Tempo               float64
Popularity            int64
Genre                object
dtype: object

<h3> With some models, when you have a multi-class classification problem like this, you may need to use one-hot encoding or label encoding to represent the classes. This is especially the case when the class labels are strings rather than numbers. For simplicity, we will use label encoding since it doesn't create additional columns that we need to deal with.

<h3> Label encoding works by assigning each class an integer as a way to convert the strings into numbers. While this is convenient, the integers can suggest relationships between genres that may not actually exist (i.e. if Rock is 0 but Classical is 1, a model may think they're similar when they aren't). This is mainly a concern when an encoded column is used as an input feature; the classifiers in this notebook treat the integers in the target column as unordered class ids. One-hot encoding avoids this "inherent bias" problem entirely, but it does create a new column for every class. Since we have 19 genres, this is a lot!
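
To make the difference between the two encodings concrete, here is a small plain-Python sketch (no scikit-learn) of what each one produces. The genre list is made up for illustration. The sorting step mirrors how `LabelEncoder` orders its classes, but this is a sketch of the idea rather than the library's implementation.

```python
genres = ["Rock", "Classical", "Jazz", "Rock"]

# Label encoding: sort the unique class names and give each one an integer id
classes = sorted(set(genres))
label_ids = {name: i for i, name in enumerate(classes)}
label_encoded = [label_ids[g] for g in genres]

# One-hot encoding: one 0/1 column per class, in the same sorted order
one_hot = [[1 if g == name else 0 for name in classes] for g in genres]

print(classes)        # ['Classical', 'Jazz', 'Rock']
print(label_encoded)  # [2, 0, 1, 2]
print(one_hot)        # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

With 3 classes the one-hot version already needs 3 columns per song; with 19 genres it would need 19.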

In [52]:
# Label encode the Genre column
le = LabelEncoder()
df['Genre'] = le.fit_transform(df['Genre'])
df.head()

Unnamed: 0,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Popularity,Genre
0,0.616,0.511,6,-15.894,1,0.0279,0.0384,0.0,0.15,0.604,132.015,40,0
1,0.77,0.459,1,-17.453,1,0.0416,0.112,0.00343,0.145,0.963,133.351,43,0
2,0.593,0.816,9,-7.293,1,0.041,0.00449,3.2e-05,0.0896,0.519,99.978,12,0
3,0.861,0.385,2,-10.057,1,0.0341,0.154,0.0,0.123,0.494,117.913,47,0
4,0.622,0.876,10,-6.31,1,0.0305,0.0193,0.00684,0.0987,0.867,104.97,43,0


<h3> We can now split the data into features and labels

In [53]:
X = df.drop(['Genre'], axis=1)
y = df['Genre']
y

0         0
1         0
2         0
3         0
4         0
         ..
15145    18
15146    18
15147    18
15148    18
15149    18
Name: Genre, Length: 15150, dtype: int32

<h2> Split data into train/test

In [54]:
# Get the counts of each genre, sorted by genre
df['Genre'].value_counts().sort_index()

0      780
1      683
2      833
3      652
4      700
5      575
6      388
7      311
8      778
9      922
10    3669
11     754
12     822
13     718
14     439
15     799
16     381
17     620
18     326
Name: Genre, dtype: int64

<h3> Notice how there are so many more entries for Genre 10 than for any other genre? If we're not careful, the model might become really good at predicting Genre 10 but none of the others. For this reason, when we do our train/test split we want to <i>stratify</i> it, so that each class makes up the same proportion of the training set and the test set as it does of the full dataset
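
Here is a rough plain-Python sketch of what stratifying a split means: each class is shuffled and split separately, so every class keeps the same share in both parts. The helper `stratified_split` and the toy labels are assumptions for illustration; in practice `train_test_split(..., stratify=y)` does this for you.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Imbalanced toy labels: one class dominates, like Genre 10 in the counts above
labels = [10] * 50 + [6] * 10 + [18] * 10
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Every class ends up with 80% of its rows in the training part and 20% in the test part, so the small classes cannot be left out of the test set by bad luck.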

<h3> Additionally, we don't want our results to be determined by a single random split of the data. What if there are other splits where the models perform better or worse? For this reason, we will use k-fold cross-validation to test the models on multiple splits and get a better overall evaluation of each model
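
The sketch below shows the basic mechanics of k-fold splitting in plain Python: the data is cut into k folds, and each fold takes one turn as the test set while the rest is used for training. The helper `k_fold_indices` is hypothetical and unstratified for simplicity. When `cv` is an integer and the model is a classifier, scikit-learn uses a stratified version of this idea.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, k=5))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: test={test_idx} train={train_idx}")
```

Each model is then trained and scored once per fold, which is why the cells below print five scores and their mean.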

In [55]:
# Uncomment below for regular train/test split
# 80% training, 20% testing, random_state=42 for reproducibility
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) 

cv = 5 # Number of cross-validation folds

In [56]:
df.head()

Unnamed: 0,Danceability,Energy,Key,Loudness,Mode,Speechiness,Acousticness,Instrumentalness,Liveness,Valence,Tempo,Popularity,Genre
0,0.616,0.511,6,-15.894,1,0.0279,0.0384,0.0,0.15,0.604,132.015,40,0
1,0.77,0.459,1,-17.453,1,0.0416,0.112,0.00343,0.145,0.963,133.351,43,0
2,0.593,0.816,9,-7.293,1,0.041,0.00449,3.2e-05,0.0896,0.519,99.978,12,0
3,0.861,0.385,2,-10.057,1,0.0341,0.154,0.0,0.123,0.494,117.913,47,0
4,0.622,0.876,10,-6.31,1,0.0305,0.0193,0.00684,0.0987,0.867,104.97,43,0


<h3> Now it is time to scale. Most columns already look scaled except for Key, Loudness, Mode, Tempo, and Popularity. Of these, we want to avoid scaling Key and Mode, since they are just indicators of which key the song is in and whether that key is major or minor.

<h3> For this reason, we only want to scale the Loudness, Tempo and Popularity.
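
As a quick worked example of what standard scaling does to a column, the sketch below subtracts the mean and divides by the population standard deviation, which is the same formula `StandardScaler` applies. The helper name `standardize` is made up; the Tempo values are the five shown in the `df.head()` output above.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Tempo values from the first five rows of the dataset
tempos = [132.015, 133.351, 99.978, 117.913, 104.97]
scaled = standardize(tempos)

print([round(z, 3) for z in scaled])
print("mean after scaling:", round(mean(scaled), 6))
print("std after scaling: ", round(pstdev(scaled), 6))
```

After scaling, the column has mean 0 and standard deviation 1, so Tempo (around 100) and Loudness (around -10) end up on comparable scales.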

In [57]:
# Scale just the Loudness, Tempo and Popularity columns
columns_to_scale = ['Loudness', 'Tempo', 'Popularity']
columns_to_exclude = [col for col in X.columns if col not in columns_to_scale]

# Create the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), columns_to_scale)
    ],
    remainder='passthrough'  # This will keep the excluded columns as they are
)

# Fit and transform the data (left commented out because the pipelines below handle this step)
# X_train = preprocessor.fit_transform(X_train)
# X_test = preprocessor.transform(X_test)

<h3> Notice how that was a lot of code that we had to write? We didn't even get to the part of writing the code for training the model and getting results! What if there was a nice way we could contain it all together? That's where sklearn's pipelines come in.

In [58]:
# Helper function for getting the model results without needing to repeat code
def get_results(model, X_test, y_test):
    y_pred = model.predict(X_test)
    modelAcc = accuracy_score(y_test, y_pred)
    print('Accuracy:', modelAcc)
    print('Precision Recall Fscore Support:', precision_recall_fscore_support(y_test, y_pred, average='weighted'))
    return modelAcc

In [59]:
# Define the hyperSearch function
def hyperSearch(pipeline, param_grid, X, y, cv=5):
    grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='accuracy')
    grid_search.fit(X, y)
    print(f'Best Parameters: {grid_search.best_params_}')
    print(f'Best Score: {grid_search.best_score_}')
    return grid_search.best_estimator_, grid_search.best_score_

Note: the hyperparameters set on each model below were chosen after I ran the GridSearch tuning on them, so the grid search code is left commented out. DO NOT RUN THE GRID SEARCHES UNLESS YOU LIKE YOUR LAPTOP'S CPU TEMPS TO GET REALLY, REALLY HOT
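
To get a feel for why grid search is so expensive, this sketch counts how many models have to be trained for the Decision Tree grid that appears (commented out) in this notebook. It only enumerates the combinations with `itertools.product` and does not train anything.

```python
from itertools import product

# Same grid as the commented-out Decision Tree search in this notebook
param_grid_dt = {
    "dt__max_depth": [None, 10, 20, 30],
    "dt__min_samples_split": [2, 5, 10],
}
cv = 5

# Every combination of values is one candidate model
keys = list(param_grid_dt)
candidates = [dict(zip(keys, combo)) for combo in product(*param_grid_dt.values())]

# Each candidate is trained once per cross-validation fold
total_fits = len(candidates) * cv
print(len(candidates), "candidates x", cv, "folds =", total_fits, "fits")
```

That is 60 fits for the smallest grid here, and the count multiplies with every extra hyperparameter or value added to a grid.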

<h2> Decision Trees

<h3> Decision Trees are a simple model that, at the most basic level, works like a series of if statements that lead to a prediction

![Decision Tree](Decision-Trees.png)
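
To show the "series of if statements" idea literally, here is a hand-written stand-in for a fitted tree. The thresholds and genre names are entirely made up for illustration; a real `DecisionTreeClassifier` learns its own splits from the training data.

```python
def toy_genre_tree(acousticness, energy):
    """Hand-written stand-in for a fitted tree; thresholds and genres are made up."""
    if acousticness > 0.8:
        if energy < 0.3:
            return "Classical"
        return "Folk"
    if energy > 0.7:
        return "Rock"
    return "Pop"

print(toy_genre_tree(acousticness=0.9, energy=0.1))  # Classical
print(toy_genre_tree(acousticness=0.1, energy=0.9))  # Rock
```

The `max_depth` hyperparameter limits how many of these nested questions the tree is allowed to ask.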

In [60]:
# Dataframe to store the results from each model
performanceSummaryDf = pd.DataFrame(columns=['Model', 'Accuracy'])

In [61]:
# Create a pipeline for the Decision Tree Classifier
pipe_dt = Pipeline([
    ('preprocessor', preprocessor),
    ('dt', DecisionTreeClassifier(random_state=42, max_depth=10, min_samples_split=10))
])

# # Do a grid search for the Decision Tree Classifier
# param_grid_dt = {
#     'dt__max_depth': [None, 10, 20, 30],
#     'dt__min_samples_split': [2, 5, 10]
# }
# best_dt, best_dt_score = hyperSearch(pipe_dt, param_grid_dt, X, y, cv=cv)

dt_scores = cross_val_score(pipe_dt, X, y, cv=cv, scoring='accuracy')
print('CV Scores:', dt_scores)
print('Mean Accuracy:', dt_scores.mean())
# Add the average accuracy to the performance summary dataframe
performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['Decision Tree'], 'Accuracy': [dt_scores.mean()]})])
# Use the below code if you want to use regular train/test split instead of cross-validation
# # Fit the pipeline
# pipe_dt.fit(X_train, y_train)

# # Get the results
# dtAcc = get_results(pipe_dt, X_test, y_test)
# performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['Decision Tree'], 'Accuracy': [dtAcc]})])

CV Scores: [0.35280528 0.34422442 0.35775578 0.33663366 0.34422442]
Mean Accuracy: 0.3471287128712871


<h2> Random Forests

<h3> We saw a Decision Tree on its own doesn't do too great. What if we took a bunch of different decision trees and combined them together to try to get a better result? That's what a Random Forest does.


![Random Forest](RF.png)
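
The combining step can be sketched as a simple majority vote. The votes below are made up, and `majority_vote` is a hypothetical helper. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so treat this as the intuition rather than the exact method.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common prediction among the trees."""
    return Counter(predictions).most_common(1)[0][0]

# Made-up votes from five hypothetical trees for one song
tree_votes = ["Rock", "Pop", "Rock", "Metal", "Rock"]
print(majority_vote(tree_votes))  # Rock
```

Individual trees can be wrong in different ways, and the vote smooths those individual mistakes out.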

In [62]:
# Create a pipeline for the Random Forest Classifier
pipe_rf = Pipeline([
    ('preprocessor', preprocessor),
    ('rf', RandomForestClassifier(random_state=42, max_depth=30, min_samples_split=10, n_estimators=300))
])
"""
# Do a grid search for the Random Forest Classifier
param_grid_rf = {
    'rf__n_estimators': [100, 200, 300],
    'rf__max_depth': [None, 10, 20, 30],
    'rf__min_samples_split': [2, 5, 10]
}
best_rf, best_rf_score = hyperSearch(pipe_rf, param_grid_rf, X, y, cv=cv)"""
rf_scores = cross_val_score(pipe_rf, X, y, cv=cv, scoring='accuracy')
print('CV Scores:', rf_scores)
print('Mean Accuracy:', rf_scores.mean())
# Add the average accuracy to the performance summary dataframe
performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['Random Forest'], 'Accuracy': [rf_scores.mean()]})])

# Use the below code if you want to use regular train/test split instead of cross-validation
# # Fit the pipeline
# pipe_rf.fit(X_train, y_train)

# # Get the results
# rfAcc = get_results(pipe_rf, X_test, y_test)
# performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['Random Forest'], 'Accuracy': [rfAcc]})])

CV Scores: [0.42541254 0.42409241 0.42805281 0.39339934 0.40759076]
Mean Accuracy: 0.41570957095709565


<h3> Notice how the results improved when using a whole forest of trees instead of just one!

<h2> XGBoost (Extreme Gradient Boosting)

<h3> XGBoost is similar to random forests in that it combines many decision trees, but XGBoost builds its trees sequentially, with each tree correcting the errors of the trees before it, generally leading to high predictive accuracy and efficiency.

![XGBoost](XGBoost.png)
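
The "each tree corrects its predecessor" idea can be sketched with plain gradient boosting on a tiny made-up regression problem. This is not XGBoost's actual algorithm (which adds regularisation and other refinements), and `fit_stump`, the data, and the learning rate are all assumptions for illustration. Each round fits a one-split tree (a stump) to whatever error is left over.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up 1-D data: low values on the left, high values on the right
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
learning_rate = 0.5

preds = [0.0] * len(xs)
errors = [mse(ys, preds)]
for _ in range(5):
    residuals = [y - p for y, p in zip(ys, preds)]  # what the ensemble still gets wrong
    stump = fit_stump(xs, residuals)                # the next tree targets those errors
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    errors.append(mse(ys, preds))

print([round(e, 4) for e in errors])
```

The printed errors shrink round after round, which is the sequential correction described above. The `learning_rate` and `n_estimators` hyperparameters control how big each correction is and how many rounds are run.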

In [63]:
# Create a pipeline for the XGBoost Classifier
pipe_xgb = Pipeline([
    ('preprocessor', preprocessor),
    ('xgb', XGBClassifier(random_state=42, learning_rate = 0.1, max_depth = 3, n_estimators = 200))
])

# Do a grid search for the XGBoost Classifier
"""param_grid_xgb = {
    'xgb__n_estimators': [100, 200, 300],
    'xgb__max_depth': [3, 5, 7],
    'xgb__learning_rate': [0.1, 0.01, 0.001]
}
best_xgb, best_xgb_score = hyperSearch(pipe_xgb, param_grid_xgb, X, y, cv=cv)"""
xgb_scores = cross_val_score(pipe_xgb, X, y, cv=cv, scoring='accuracy')
print('CV Scores:', xgb_scores)
print('Mean Accuracy:', xgb_scores.mean())
# Add the average accuracy to the performance summary dataframe
performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['XGBoost'], 'Accuracy': [xgb_scores.mean()]})])
# Use the below code if you want to use regular train/test split instead of cross-validation
# # Fit the pipeline
# pipe_xgb.fit(X_train, y_train)

# # Get the results
# xgbAcc = get_results(pipe_xgb, X_test, y_test)
# performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['XGBoost'], 'Accuracy': [xgbAcc]})])

CV Scores: [0.42541254 0.42145215 0.4349835  0.3960396  0.40990099]
Mean Accuracy: 0.4175577557755775


<h2> Support Vector Machines (SVMs)

<h3> SVMs work by trying to determine a decision boundary between the different classes, where each class falls on one side of the boundary. The shape of these boundaries is determined by the kernel function that is used (see below image).

![SVC](SVC.png)
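
As a small illustration of what a kernel function computes, here is the RBF kernel written out in plain Python. It scores how similar two points are: 1 for identical points, approaching 0 as they move apart. The `gamma` value and the points are arbitrary choices for this sketch; scikit-learn's `SVC` picks its own default `gamma`.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

point = [0.6, 0.5]
near = [0.7, 0.5]
far = [3.0, 4.0]

print(rbf_kernel(point, point))  # identical points
print(rbf_kernel(point, near))   # close together
print(rbf_kernel(point, far))    # far apart
```

Because similarity depends on distance, features with large ranges can dominate the kernel, which is one reason scaling matters for SVMs.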

In [64]:
# Create a pipeline for the Support Vector Classifier
pipe_svc = Pipeline([
    ('preprocessor', preprocessor),
    ('svc', SVC(random_state=42, C=10, kernel='rbf'))
])

# Do a grid search for the Support Vector Classifier
"""param_grid_svc = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['poly', 'rbf', 'sigmoid']
}
best_svc, best_svc_score = hyperSearch(pipe_svc, param_grid_svc, X, y, cv=cv)"""
svc_scores = cross_val_score(pipe_svc, X, y, cv=cv, scoring='accuracy')
print('CV Scores:', svc_scores)
print('Mean Accuracy:', svc_scores.mean())
# Add the average accuracy to the performance summary dataframe
performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['SVC'], 'Accuracy': [svc_scores.mean()]})])
# Use the below code if you want to use regular train/test split instead of cross-validation
# # Fit the pipeline
# pipe_svc.fit(X_train, y_train)

# # Get the results
# svcAcc = get_results(pipe_svc, X_test, y_test)
# performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['SVC'], 'Accuracy': [svcAcc]})])

CV Scores: [0.37689769 0.37392739 0.38976898 0.36963696 0.36468647]
Mean Accuracy: 0.374983498349835


<h2> K-Nearest Neighbors (KNN)</h2>

<h3>KNN is an instance-based machine learning algorithm that classifies a data point based on the majority class among its k nearest neighbors in the feature space. It works by calculating the distances between points, often using metrics like Euclidean distance. Because of all these distance calculations, it can be computationally intensive for large datasets.</h3>

![KNN](KNN.png)
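
Here is a minimal plain-Python sketch of the idea, using an inverse-distance-weighted vote similar in spirit to the `weights='distance'` setting in the pipeline below. The helper `knn_predict`, the points, and the labels are made up for illustration, and the zero-distance shortcut is a simplification.

```python
import math
from collections import defaultdict

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by an inverse-distance-weighted vote of its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = defaultdict(float)
    for distance, label in distances[:k]:
        if distance == 0:
            return label  # exact match: take its label directly
        votes[label] += 1.0 / distance  # closer neighbors count for more
    return max(votes, key=votes.get)

train_points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
train_labels = ["Jazz", "Jazz", "Rock", "Rock", "Rock"]

print(knn_predict(train_points, train_labels, query=(0.2, 0.1)))  # Jazz
print(knn_predict(train_points, train_labels, query=(1.0, 0.9)))  # Rock
```

Every prediction has to measure the distance to every training point, which is where the computational cost mentioned above comes from.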

In [65]:
# Create a pipeline for the K-Nearest Neighbors Classifier
pipe_knn = Pipeline([
    ('preprocessor', preprocessor),
    ('knn', KNeighborsClassifier(n_neighbors=7, weights='distance'))
])

# Do a grid search for the K-Nearest Neighbors Classifier
"""param_grid_knn = {
    'knn__n_neighbors': [3, 5, 7],
    'knn__weights': ['uniform', 'distance']
}
best_knn, best_knn_score = hyperSearch(pipe_knn, param_grid_knn, X, y, cv=cv)"""
knn_scores = cross_val_score(pipe_knn, X, y, cv=cv, scoring='accuracy')
print('CV Scores:', knn_scores)
print('Mean Accuracy:', knn_scores.mean())
# Add the average accuracy to the performance summary dataframe
performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['K-Nearest Neighbors'], 'Accuracy': [knn_scores.mean()]})])
# Use the below code if you want to use regular train/test split instead of cross-validation
# # Fit the pipeline
# pipe_knn.fit(X_train, y_train)

# # Get the results
# knnAcc = get_results(pipe_knn, X_test, y_test)
# performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['K-Nearest Neighbors'], 'Accuracy': [knnAcc]})])

CV Scores: [0.27755776 0.27458746 0.28481848 0.26666667 0.27161716]
Mean Accuracy: 0.275049504950495


<h2> Model Performance Summary

In [66]:
# Add a row for average accuracy 
performanceSummaryDf = pd.concat([performanceSummaryDf, pd.DataFrame({'Model': ['Average'], 'Accuracy': [performanceSummaryDf['Accuracy'].mean()]})])
# Sort the dataframe by accuracy
performanceSummaryDf = performanceSummaryDf.sort_values(by='Accuracy', ascending=False)
performanceSummaryDf

Unnamed: 0,Model,Accuracy
0,XGBoost,0.417558
0,Random Forest,0.41571
0,SVC,0.374983
0,Average,0.366086
0,Decision Tree,0.347129
0,K-Nearest Neighbors,0.27505


<h1> Does anyone have any guesses as to why all of the models performed as they did? </h1>

Too many classes? Feature values that are too similar across genres to tell them apart?
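
One way to put these accuracies in context is to compare them with two trivial baselines: always guessing the largest genre, and guessing uniformly at random. The sketch below computes both from the genre counts printed earlier in this notebook.

```python
# Genre counts copied from the value_counts() output earlier in this notebook
genre_counts = [780, 683, 833, 652, 700, 575, 388, 311, 778, 922,
                3669, 754, 822, 718, 439, 799, 381, 620, 326]

total = sum(genre_counts)
majority_baseline = max(genre_counts) / total  # always guess the biggest genre
uniform_baseline = 1 / len(genre_counts)       # guess uniformly at random

print(f"songs: {total}")
print(f"always predict the largest genre: {majority_baseline:.3f}")
print(f"uniform random guess:             {uniform_baseline:.3f}")
```

Always predicting Genre 10 would already score about 0.242, so K-Nearest Neighbors at 0.275 is only slightly better than that baseline, while XGBoost and Random Forest at about 0.42 are clearly learning something from the features.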

<h1> Next Steps

<h3> There are many other things that we can do to improve our model performance. Some options include: </h3>

- <h4> Do more extensive hyperparameter tuning in the GridSearch (takes a lot of time!)
- <h4> Add or remove features to see which ones help or hurt the models
- <h4> Simplify the problem by reducing the number of classes we're predicting
- <h4> and more!