# Machine Learning Modelling

We will now model our data using the insights gathered in the EDA phase. The objective for this model is to classify songs by their year. We have already seen that the classes are difficult to separate in the PCA/t-SNE projections, but hopefully some of the machine learning models that we implement here will do better.

Here is a plan of what I would like to accomplish in this phase:
1. Create our training, validation and testing data (test data will need to be scraped again)
2. Create a simple logistic regression as a baseline model (most likely LASSO)
    1. Look at which features are impacting the model the most
    2. Examine residuals to see if there are any patterns
3. Create 5 different models and try to optimize hyperparameters for each one
    1. Try to incorporate Bayesian machine learning for this
    2. Random forest, XGBoost, KNN, SVM and/or other models
4. Explain the models
    1. Use Shapley plots and partial dependence plots to examine which features are having a heavy impact
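
Step 3 of the plan can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: it uses a plain `GridSearchCV` as a simpler stand-in for the Bayesian search mentioned above, it runs on synthetic data from `make_classification` rather than the song features, and the candidate models and grids are placeholders, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the song features (26 columns, 4 made-up classes); NOT the real data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=26, n_informative=8, n_classes=4, random_state=0
)

# Each entry pairs an estimator with a small, illustrative hyperparameter grid
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None]},
    ),
    "knn": (
        # KNN is distance based, so the features are standardized first
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [3, 7, 15]},
    ),
}

search_results = {}
for name, (estimator, grid) in candidates.items():
    # 3-fold cross-validated search over the grid, scored by accuracy
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X_demo, y_demo)
    search_results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

The same loop would accept other estimators (SVM, XGBoost) by adding entries to `candidates`; a Bayesian optimizer would replace `GridSearchCV` but keep the overall structure.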

In [3]:
# Libraries

# General data handling
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [4]:
# Reading in data
df = pd.read_csv("../data/processed_data.csv")
df.head()

,Unnamed: 0,filename,chroma_stft,spectral_centroid,spectral_bandwidth,rolloff,zero_crossing_rate,tempo,mfcc1,mfcc2,...,mfcc12,mfcc13,mfcc14,mfcc15,mfcc16,mfcc17,mfcc18,mfcc19,mfcc20,label
0,0,cooler_than_me,0.427503,2869.82862,2677.180491,6014.937118,0.133024,129.199219,-12.486107,66.946167,...,2.956805,-4.295158,0.656407,-4.171,0.771177,-1.440347,1.546116,-3.990371,2.771891,2010
1,1,airplanes,0.423003,2371.365987,2335.975387,4883.348314,0.111172,92.285156,-19.895674,88.955681,...,2.093508,-5.614816,2.615297,-4.192569,2.258908,-5.238406,2.319968,-3.514364,2.430466,2010
2,2,in_my_head,0.356209,3011.418486,2711.790253,6264.323144,0.145706,112.347147,-22.311823,56.398441,...,0.38518,-5.598175,3.440188,-7.220667,-0.522158,-4.795784,2.379431,-3.509303,1.03219,2010
3,3,tik_tok,0.389041,2566.800494,2495.921945,5313.32822,0.120453,117.453835,-32.02467,74.238457,...,2.125536,-6.483168,2.310317,-2.31159,-0.336032,-2.30795,0.711338,-3.418213,4.908051,2010
4,4,love_the_the_way_you_lie,0.409369,2796.55357,2507.57635,5479.505082,0.158967,117.453835,-39.791149,74.725708,...,0.675785,-6.952001,2.451728,-6.09828,0.66411,-6.62831,0.223306,-3.690106,-0.426928,2010


In [5]:
# Drop the leftover index column (the first column); keep filename through label
df = df.iloc[:, 1:29]
df.head()

,filename,chroma_stft,spectral_centroid,spectral_bandwidth,rolloff,zero_crossing_rate,tempo,mfcc1,mfcc2,mfcc3,...,mfcc12,mfcc13,mfcc14,mfcc15,mfcc16,mfcc17,mfcc18,mfcc19,mfcc20,label
0,cooler_than_me,0.427503,2869.82862,2677.180491,6014.937118,0.133024,129.199219,-12.486107,66.946167,0.302483,...,2.956805,-4.295158,0.656407,-4.171,0.771177,-1.440347,1.546116,-3.990371,2.771891,2010
1,airplanes,0.423003,2371.365987,2335.975387,4883.348314,0.111172,92.285156,-19.895674,88.955681,-16.760706,...,2.093508,-5.614816,2.615297,-4.192569,2.258908,-5.238406,2.319968,-3.514364,2.430466,2010
2,in_my_head,0.356209,3011.418486,2711.790253,6264.323144,0.145706,112.347147,-22.311823,56.398441,-11.291659,...,0.38518,-5.598175,3.440188,-7.220667,-0.522158,-4.795784,2.379431,-3.509303,1.03219,2010
3,tik_tok,0.389041,2566.800494,2495.921945,5313.32822,0.120453,117.453835,-32.02467,74.238457,-4.40717,...,2.125536,-6.483168,2.310317,-2.31159,-0.336032,-2.30795,0.711338,-3.418213,4.908051,2010
4,love_the_the_way_you_lie,0.409369,2796.55357,2507.57635,5479.505082,0.158967,117.453835,-39.791149,74.725708,-19.090473,...,0.675785,-6.952001,2.451728,-6.09828,0.66411,-6.62831,0.223306,-3.690106,-0.426928,2010
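
Positional slicing works here, but it silently breaks if the column order ever changes. A name-based alternative could look like the sketch below. The tiny frame is hypothetical (made-up values), and it assumes the leftover column is called `Unnamed: 0`, which is the name pandas gives an unnamed index column when a CSV is read back in.

```python
import pandas as pd

# Tiny hypothetical frame mimicking the layout above (values are made up)
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "filename": ["song_a", "song_b"],
    "tempo": [120.0, 95.5],
    "label": [2010, 2000],
})

# Drop the leftover index column by name; errors="ignore" makes the cell safe to re-run
demo_clean = demo.drop(columns=["Unnamed: 0"], errors="ignore")
print(demo_clean.columns.tolist())
```

Passing `index_col=0` to `pd.read_csv` is another way to avoid the extra column in the first place.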


Now that we have our data, let's split it into training and validation sets, holding out 33% of the rows for validation.

In [6]:
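# Features: every column between filename and label
X = df.iloc[:, 1:27]
# Target: the year label, treated as a categorical class
y = df['label']
y = y.astype('category')

# Hold out 33% of the songs for validation (no random_state is set, so the split changes on every run)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33)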
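
The split above is unseeded and not stratified, so class proportions and scores can shift between runs. A reproducible, stratified variant might look like this sketch. The data is synthetic and the four labels are hypothetical placeholders, since this section only shows rows labelled 2010.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 120 rows, 26 features, 4 hypothetical year labels with 30 songs each
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 26))
y_demo = np.repeat([1980, 1990, 2000, 2010], 30)

# stratify keeps the class proportions similar in both splits;
# random_state makes the split repeatable across runs
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=42
)

labels, counts = np.unique(y_va, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

With a small dataset, stratifying matters more, because a purely random split can leave a class badly under-represented in the validation set.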

## 1. Baseline Model

Let's first make a baseline logistic regression model to predict the different classes. From this model, we can examine the important features and the residuals to note any interesting patterns.

https://stats.stackexchange.com/questions/52104/multinomial-logistic-regression-vs-one-vs-rest-binary-logistic-regression
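
For the LASSO variant mentioned in the plan, one possible setup is an L1-penalized logistic regression behind a scaler, with the absolute coefficients used as a rough feature ranking. The sketch below runs on synthetic data, and the value `C=0.1` is an arbitrary illustration, not a tuned choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 26 audio features and 4 made-up classes; NOT the song data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=26, n_informative=6, n_classes=4, random_state=0
)

# Scaling first puts all features on a comparable footing for the L1 penalty;
# the saga solver supports L1 with multiclass targets
lasso_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000),
)
lasso_clf.fit(X_demo, y_demo)

# coef_ has one row per class and one column per feature
coefs = lasso_clf.named_steps["logisticregression"].coef_
importance = np.abs(coefs).sum(axis=0)

# Features whose coefficients are zero for every class were dropped by the penalty
n_zeroed = int((importance == 0).sum())
top_features = np.argsort(importance)[::-1][:5]
print("Features zeroed out:", n_zeroed)
print("Top 5 feature indices:", top_features.tolist())
```

On the real data, `top_features` would index into `X.columns`, which gives a first look at which audio features carry signal.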

In [7]:
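# Fit a logistic regression with scikit-learn's default settings
log_clf = LogisticRegression().fit(X_train, y_train)
# Mean accuracy on the held-out validation set
log_accuracy = log_clf.score(X_val, y_val)
print("Log accuracy: ", log_accuracy)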

Log accuracy:  0.17543859649122806
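
An accuracy of roughly 0.175 is hard to judge on its own, since this section does not state how many year classes there are. Two cheap checks are a chance-level reference from `DummyClassifier` and a confusion matrix showing which classes get mixed up. The sketch below uses synthetic data with four made-up classes, so its numbers say nothing about the songs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (4 made-up classes); NOT the song features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=26, n_informative=8, n_classes=4, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.33, stratify=y_demo, random_state=0
)

# Chance-level reference: always predict the most frequent training class
dummy_acc = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr).score(X_va, y_va)

# Same kind of baseline as above, fit on the stand-in data
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
model_acc = clf.score(X_va, y_va)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_va, clf.predict(X_va), labels=clf.classes_)
print("Dummy accuracy:", round(dummy_acc, 3))
print("Model accuracy:", round(model_acc, 3))
print(cm)
```

If the real model's accuracy sits close to the dummy score, the baseline is learning very little, and large off-diagonal entries in the matrix show which years are confused with each other.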


