# Music Genre Classification 

#### **Importing Libraries**

In [1]:
%matplotlib notebook
import numpy as np 
import pandas as pd 
import seaborn as sns
from sklearn.model_selection import train_test_split 
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import RandomizedSearchCV
from matplotlib import pyplot as plt
from mpl_toolkits import mplot3d
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image 
from sklearn.tree import export_graphviz

## The dataset

#### **Loading the dataset**

Let's load the data provided by a research group from The Echo Nest. One dataset contains metadata about tracks, and the other contains track metrics such as danceability and acousticness on a scale from -1 to 1. The former is in CSV format and the latter in JSON. I merged them on the track id, which serves as the key in both.


In [2]:
# Load the metadata about song tracks and the track metrics
tracks = pd.read_csv('songs.csv')
echonest_metrics = pd.read_json('echonest-metrics.json', precise_float=True)

# Merge the datasets using the track id column
df = pd.merge(left=echonest_metrics, right=tracks[['track_id', 'genre_top']], on='track_id')
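
To illustrate what the merge above does, here is a minimal sketch on two toy tables. The frames `tracks_demo` and `metrics_demo` and all of their values are made up for illustration and are not part of the real dataset. Because `pd.merge` defaults to an inner join, only track ids present in both tables survive.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values)
tracks_demo = pd.DataFrame({
    "track_id": [2, 3, 5],
    "genre_top": ["Hip-Hop", "Hip-Hop", "Rock"],
    "title": ["a", "b", "c"],
})
metrics_demo = pd.DataFrame({
    "track_id": [2, 3, 7],
    "danceability": [0.68, 0.53, 0.41],
})

# Default how="inner": keep only track_ids found in both tables,
# and bring along just the genre column from the metadata
merged_demo = pd.merge(
    left=metrics_demo,
    right=tracks_demo[["track_id", "genre_top"]],
    on="track_id",
)
print(merged_demo)
```

Track 5 (no metrics) and track 7 (no metadata) are dropped, so it can be worth comparing row counts before and after a merge like this.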

In [3]:
df.head()

| | track_id | acousticness | danceability | energy | instrumentalness | liveness | speechiness | tempo | valence | genre_top |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0.416675 | 0.675894 | 0.634476 | 0.010628 | 0.177647 | 0.15931 | 165.922 | 0.576661 | Hip-Hop |
| 1 | 3 | 0.374408 | 0.528643 | 0.817461 | 0.001851 | 0.10588 | 0.461818 | 126.957 | 0.26924 | Hip-Hop |
| 2 | 341 | 0.977282 | 0.468808 | 0.134975 | 0.6877 | 0.105381 | 0.073124 | 119.646 | 0.430707 | Rock |
| 3 | 46204 | 0.953349 | 0.498525 | 0.552503 | 0.924391 | 0.684914 | 0.028885 | 78.958 | 0.430448 | Rock |
| 4 | 46205 | 0.613229 | 0.50032 | 0.487992 | 0.936811 | 0.63775 | 0.030327 | 112.667 | 0.824749 | Rock |



The idea is that each song is a deliberately designed piece of audio. By deriving musical features from the raw audio, we can score each song along several dimensions and use those scores for further analysis.

Let's look at some general information about this merged dataset.


In [4]:
sns.countplot(data=df, x='genre_top')

<matplotlib.axes._subplots.AxesSubplot at 0x10c1e4400>

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4802 entries, 0 to 4801
Data columns (total 10 columns):
track_id            4802 non-null int64
acousticness        4802 non-null float64
danceability        4802 non-null float64
energy              4802 non-null float64
instrumentalness    4802 non-null float64
liveness            4802 non-null float64
speechiness         4802 non-null float64
tempo               4802 non-null float64
valence             4802 non-null float64
genre_top           4802 non-null object
dtypes: float64(8), int64(1), object(1)
memory usage: 572.7+ KB


We can see that the dataset contains about four times as many rock songs as hip-hop songs. None of the columns contain NAs, so we do not need to worry about missing values.
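
The imbalance can also be read off numerically. The sketch below uses a toy series, `genres_demo`, with a made-up 4:1 ratio mirroring the rough proportion described above; the real counts would come from `df['genre_top']`.

```python
import pandas as pd

# Hypothetical 4:1 split standing in for the real genre column
genres_demo = pd.Series(["Rock"] * 4 + ["Hip-Hop"], name="genre_top")

# Share of each class; the largest share is the accuracy that a
# classifier always predicting the majority class would achieve
shares = genres_demo.value_counts(normalize=True)
majority_baseline = shares.max()
print(shares)
print(majority_baseline)
```

With a ratio like this, always guessing "Rock" would already be right about 80% of the time, which is a useful reference point for the accuracy scores later on.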

#### **A Question of Interest**

Framing the problem is essential in machine learning, because it determines which algorithms are suitable and which measures should be used to evaluate the models. The goal of this project is to train models on song data to label each track with its genre, either 'Hip-Hop' or 'Rock'. This is a supervised classification task. To address it, I focused on three classification models: kNN (5 neighbors), SVM (RBF kernel), and Random Forests.

#### **Preparing the Data**

Real-world data is messy, with missing values and undesired formats, so it is necessary to preprocess the data before fitting a model. To perform classification with the models, I encoded the genre as a number so that "Rock" becomes 0 and "Hip-Hop" becomes 1.

In [6]:
df.loc[df['genre_top'] == 'Rock', 'genre'] = 0
df.loc[df['genre_top'] == 'Hip-Hop', 'genre'] = 1
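
An equivalent way to write this encoding, shown as a sketch on a toy frame rather than the real `df`, is a dictionary lookup with `Series.map`. The names `demo` and `genre_codes` are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the merged dataset
demo = pd.DataFrame({"genre_top": ["Rock", "Hip-Hop", "Rock"]})

# One dictionary holds the whole encoding
genre_codes = {"Rock": 0, "Hip-Hop": 1}
demo["genre"] = demo["genre_top"].map(genre_codes)

# Any label missing from the dictionary would become NaN, so check for it
assert demo["genre"].notna().all()
print(demo)
```

Keeping the mapping in one place makes it harder for a misspelled label to slip through silently.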

To improve interpretability and reduce the computational cost of the model, we typically want to remove irrelevant variables and avoid using variables that are correlated with each other. Let's look at the pairwise correlations between columns and use this information to select features for the model. The closer the value is to +1 or -1, the stronger the relationship between the two variables.

In [7]:
# Correlate numeric columns only; genre_top holds strings
cor = df.select_dtypes(include='number').corr()
plt.figure(figsize=(10,10))
sns.heatmap(cor, annot=True, cmap="PiYG", center=0)
plt.show()

Looking at the row/column of our target variable `genre` in this correlation matrix, `speechiness` (0.5), `danceability` (0.48), `instrumentalness` (-0.33) and `valence` (0.25) are the most correlated with it. However, the correlation between `valence` and `danceability` is 0.47, so including both would be partly redundant; hence, we will drop `valence` and use only the first three features.
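
The selection logic described above can be sketched on synthetic data. Everything here is made up for illustration: column `a` is informative, `b` is a near-duplicate of `a`, and `c` is noise. Ranking features by absolute correlation with the target and then checking correlations among the top candidates mirrors the reasoning used for `valence` and `danceability`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# Hypothetical features: informative, redundant copy, and pure noise
a = target + rng.normal(scale=0.5, size=n)
b = a + rng.normal(scale=0.1, size=n)
c = rng.normal(size=n)
demo = pd.DataFrame({"a": a, "b": b, "c": c, "genre": target})

corr = demo.corr()

# Rank features by the strength of their correlation with the target
target_corr = corr["genre"].drop("genre").abs().sort_values(ascending=False)
print(target_corr)

# Two strong candidates that are also highly correlated with each other
# carry overlapping information, so one of them can be dropped
print(corr.loc["a", "b"])
```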

In [None]:
ax = plt.axes(projection='3d')
ax.scatter(df['speechiness'], df['danceability'], df['instrumentalness'], c=df['genre'])

<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x1091f6c18>

The graph above shows the distribution of the two genres (in different colors). There seems to be a pattern, but we cannot be sure from the plot alone. Let's feed the data into the machine and see what it learns!

## Create Models

We will use `speechiness`, `danceability`, `instrumentalness` as input columns to classify each point into classes of music genres. Let's create the inputs and labels.

In [8]:
features = ['speechiness', 'danceability', 'instrumentalness']
X = df[features] #define the inputs
y = df['genre'] #define the classes

Before moving on, we want to hold out a test set for the final evaluation. 
We set aside about 10% of the data as the test set using sklearn's `train_test_split` function. 

In [17]:
# Hold out 10% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=11)
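
Given the class imbalance, one optional variation (not what the cell above does) is to pass `stratify` so that the test set keeps the same genre proportions as the full data. The sketch below uses made-up arrays `X_demo` and `y_demo` with an 80/20 class split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 80 of class 0 and 20 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo preserves the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=11, stratify=y_demo
)
print(len(y_te), y_te.sum())
```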

Now that we have properly processed and formatted the data, we move on to building some models using sklearn implementations. To address our classification question, we will consider kNN, SVM (RBF kernel), and Random Forests.


In [18]:
# Create the details of the models 
knn = KNeighborsClassifier(n_neighbors=5)
svm = SVC(kernel='rbf', C=7, gamma='auto', random_state=1)
rf = RandomForestClassifier(max_depth=5, n_estimators=20)
clfs = [knn,svm,rf]

## Select a Model

Using different types of classifiers, we are trying to see which one predicts `genre` with the highest accuracy. Specifically, we will perform 6-fold cross-validation on the remaining 90% of the data to select the appropriate model. *X_train* corresponds to input features, and *y_train* corresponds to class labels.



In [19]:
def get_cv_scores(clfs, X, y):
    scores_lst = []
    cv = KFold(n_splits=6)
    for clf in clfs:
        scores = cross_val_score(clf, X, y, cv=cv)
        scores_lst.append(scores.mean())
    return scores_lst

Since features with large ranges can dominate distance-based classifiers such as kNN and SVM, scaling the features is important. Let's scale the features before passing them to cross-validation.

In [20]:
def scale_feartures(features):
    scaler = StandardScaler().fit(features)
    scaled_features = scaler.transform(features)
    return scaled_features

In [21]:
X_train = scale_feartures(X_train)
scores_lst = get_cv_scores(clfs, X_train, y_train)
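
An alternative sketch, not the approach used in this notebook, is to put the scaler and the classifier into a single scikit-learn pipeline. The scaler is then refit on the training portion of every fold, and the same fitted pipeline can later be applied to unseen data without a separate scaling step. The data below is synthetic (`make_classification`), and the names `X_demo`, `y_demo`, and `knn_pipe` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three selected features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

cv = KFold(n_splits=6, shuffle=True, random_state=0)
pipe_scores = cross_val_score(knn_pipe, X_demo, y_demo, cv=cv)
print(pipe_scores.mean())
```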

In [22]:
# Make a table
eval_new = pd.DataFrame({
    "Model": ["kNN","SVM(rbf)","Random Forest"],
    "CV-Score": scores_lst 
    })

In [23]:
eval_new.sort_values(by="CV-Score", ascending=False)

| | Model | CV-Score |
|---|---|---|
| 2 | Random Forest | 0.902336 |
| 1 | SVM(rbf) | 0.900949 |
| 0 | kNN | 0.890304 |


According to the CV scores, the Random Forest edges out the other two algorithms, although its margin over the SVM is small (0.9023 vs. 0.9009). A Random Forest trains multiple decision trees and combines their predictions to produce improved results. Since it scores highest, we will use it for the final training.

## Fine-tuning the hyperparameters

Just like we could use k-fold cross-validation to compare and select a model for our dataset, we can also use it for optimizing the hyperparameters of a model.

In [24]:
# specify the parameter values that we want to investigate
parameters = {'bootstrap': [True, False],
              'max_depth': [5, 10, 15, 20, 25, 30],
              'max_features': ['auto', 'sqrt'],
              'min_samples_leaf': [2, 4, 6, 8, 10],
              'min_samples_split': [2, 5, 10, 15],
              'n_estimators': [5, 10, 15, 20]}

In [25]:
# initialize a RandomizedSearchCV object from the sklearn
gs = RandomizedSearchCV(estimator=rf,
                        param_distributions = parameters,
                        n_iter = 50, #the number of different combinations to try
                        n_jobs = -1, #run computations in parallel
                        cv = 5,
                        scoring = 'accuracy',
                        return_train_score = True)

In [26]:
# fit the randomized search on the training data
gs.fit(X_train, y_train)

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid='warn', n_iter=50, n_jobs=-1,
          param_distributions={'bootstrap': [True, False], 'max_depth': [5, 10, 15, 20, 25, 30], 'max_features': ['auto', 'sqrt'], 'min_samples_leaf': [2, 4, 6, 8, 10], 'min_samples_split': [2, 5, 10, 15], 'n_estimators': [5, 10, 15, 20]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='accuracy', verbose=0)

In [27]:
# Put the results in a table
table = pd.DataFrame({
    "mean score": gs.cv_results_['mean_test_score'],
    "params": gs.cv_results_['params']
})

In [28]:
# View the top five choices
table.sort_values(by="mean score", ascending=False).head()

| | mean score | params |
|---|---|---|
| 47 | 0.904189 | {'n_estimators': 20, 'min_samples_split': 15, ... |
| 12 | 0.902569 | {'n_estimators': 20, 'min_samples_split': 5, '... |
| 43 | 0.902337 | {'n_estimators': 20, 'min_samples_split': 5, '... |
| 14 | 0.902337 | {'n_estimators': 20, 'min_samples_split': 5, '... |
| 24 | 0.902106 | {'n_estimators': 20, 'min_samples_split': 15, ... |


Let's take a look at the optimized combination of parameters:

In [29]:
# Examine the best model
print(gs.best_score_)
print(gs.best_params_)

0.9041888451747281
{'n_estimators': 20, 'min_samples_split': 15, 'min_samples_leaf': 10, 'max_features': 'sqrt', 'max_depth': 30, 'bootstrap': True}


## Final Model

Now that we have selected a model and tuned its hyperparameters with cross-validation, we fit the final model on the full training set. The test set stays untouched until the evaluation step. Since the hyperparameters were fixed in the previous step, only this final fit remains.

In [296]:
rf = gs.best_estimator_
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=10, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=8, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, n_estimators=15,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
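
As a side note on the step above: with the default `refit=True`, `RandomizedSearchCV` already refits the best configuration on all the data passed to `fit`, so `best_estimator_` can predict right away. The sketch below demonstrates this on synthetic data with a deliberately tiny, made-up search space; `X_demo`, `y_demo`, and `search` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_depth": [2, 4, 6], "n_estimators": [5, 10]},
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# refit=True (the default) means best_estimator_ is already fitted
best = search.best_estimator_
preds = best.predict(X_demo[:5])
print(search.best_params_, preds)
```

Calling `fit` again on the same training data is therefore harmless but not strictly required.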

## Performance

OK, it's time to see how well our tuned model can make predictions on data it hasn't seen before. We will use the held-out test set to evaluate the model's performance.

In [304]:
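# Note: this refits the scaler on the test set; reusing the scaler fitted on
# the training data would keep both sets on the same scale.
X_test = scale_feartures(X_test)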
preds = rf.predict(X_test)
score = accuracy_score(y_test, preds)
print(score)

0.9106029106029107


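The tuned model reaches an accuracy of about 91.06% on the test set. In other words, for new tracks similar to those in this dataset, we can expect it to tell rock from hip-hop correctly about 91% of the time. Because rock tracks heavily outnumber hip-hop tracks, this figure should be read against the class imbalance noted earlier.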
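
Accuracy alone does not show which genre the errors fall on. A confusion matrix and per-class recall give that breakdown. The sketch below uses made-up labels (`y_true`, `y_pred`), not the notebook's actual predictions, with 0 for Rock and 1 for Hip-Hop.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 rock tracks (0) and 2 hip-hop tracks (1)
y_true = np.array([0] * 8 + [1] * 2)
# Hypothetical predictions: one hip-hop track is mistaken for rock
y_pred = np.array([0] * 8 + [1, 0])

model_acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Fraction of true hip-hop tracks that were recognized as hip-hop
hiphop_recall = recall_score(y_true, y_pred, pos_label=1)

print(model_acc)
print(cm)
print(hiphop_recall)
```

In this toy case the accuracy is 0.9, yet only half of the hip-hop tracks are recovered, which is the kind of gap that an imbalanced dataset can hide.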