<h1><center><font size="6">Kaggle Competition: Songs Classification</font></center></h1>
<h2><center><font size="4">Yanni Zhang </font></center></h2>

<br>



# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Preparation for Analysis</a>  
 - <a href='#21'>Loading packages</a>  
 - <a href='#22'>Loading data</a>   
- <a href='#3'>Data Exploration</a>   
 - <a href='#31'>Missing Values Evaluation</a>  
 - <a href='#32'>Genre and Non-lyrics Features</a>   
 - <a href='#33'>Lyrics Features</a>  
- <a href='#5'>Predictive Model for Classification</a>
 - <a href='#50'>Training and Validation Dataset</a>  
 - <a href='#51'>Model Selection</a>  
 - <a href='#55'>Hyperparameters Optimization</a> 
- <a href='#6'>Model Ensambling</a>
 - <a href='#61'>Ensamble Framework</a>
 - <a href='#62'>Prediction and Submission</a>
- <a href='#7'>References</a>    

# <a id='1'>Introduction</a>  

This project is going to present the process of **analyzing the songs data** and **classifying the songs data into four genres**: pop, rap, rock, and hip hop. The main target of this project is to build a suitable classifier that could be well-performed on the training and test data.

To build a successful classifier, I will first look into the train data and have a deeper understanding of various **features**. Then I will use **dimension reduction methods** to create new features (principal components).I will choose some popular **algorithms** and use **cross-validation** to test their accurarcy. I will also conduct a **grid search optimization** to select the best set of hyperparameters. Finally, I will use **ensambling** to get the final classifer and predict on the test data.

The dataset used for this project is the **Songs** dataset which was scraped from Spotify and Genius APIs respectively.

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='2'>Preparation for Analysis</a>   


Before beginning the analysis, it is important to first load the necessary packages and datasets.

## <a id='21'>Loading Packages</a>

These are the necessary packages for the analysis. The packages contain data processing, visualisation, models, hyperparameter tuning, and model metrics functions.

In [1]:
from catboost import CatBoostClassifier
from pathlib import Path

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import os
import pandas as pd
import random
import sys
import scipy
import seaborn as sns
import xgboost as xgb

from collections import Counter
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px

from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_val_score,StratifiedKFold, learning_curve
sns.set(style='white', context='notebook', palette='deep')
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

<a href="#0"><font size="1">Go to top</font></a>  


## <a id='22'>Loading Data</a>  



There are **songs_train** and **songs_test** data as well as an example **sample_submission** file in the original folder. 

Let's load the **songs_train** and **songs_test** data.

In [2]:
train_df=pd.read_csv('songs_train.csv')
test_df=pd.read_csv('songs_test.csv')

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='3'>Data Exploration</a>  

In order to get an initial picture of the data, I will first present a selection of rows.



In [3]:
train_df.sample(3).head()

Unnamed: 0,song_id,genre,audio_danceability,audio_energy,audio_key,audio_loudness,audio_mode,audio_speechiness,audio_acousticness,audio_instrumentalness,...,lyrics_year,lyrics_years,lyrics_yellow,lyrics_yes,lyrics_yesterday,lyrics_yet,lyrics_yo,lyrics_york,lyrics_young,lyrics_zone
3285,3286,rap,0.826,0.583,2,-5.186,0,0.0626,0.441,0.0,...,0,0,0,0,0,0,0,0,0,0
4878,4879,rap,0.537,0.948,8,-3.063,1,0.355,0.151,0.0,...,0,0,0,0,0,0,18,0,0,0
3777,3778,rap,0.901,0.811,9,-6.391,1,0.175,0.0529,0.0,...,0,0,0,0,0,0,0,0,1,0


In [4]:
test_df.sample(3).head()

Unnamed: 0,song_id,audio_danceability,audio_energy,audio_key,audio_loudness,audio_mode,audio_speechiness,audio_acousticness,audio_instrumentalness,audio_liveness,...,lyrics_year,lyrics_years,lyrics_yellow,lyrics_yes,lyrics_yesterday,lyrics_yet,lyrics_yo,lyrics_york,lyrics_young,lyrics_zone
3040,13041,0.477,0.483,4,-9.413,0,0.0364,0.0027,0.783,0.156,...,0,0,0,0,0,0,0,0,0,0
3772,13773,0.504,0.83,11,-3.383,0,0.125,0.00581,0.0,0.181,...,0,0,0,0,0,0,0,0,0,0
3016,13017,0.641,0.747,4,-5.902,0,0.0841,0.02,0.0,0.11,...,0,0,0,0,0,0,1,0,0,0


In [5]:
print("Train: rows:{} cols:{}".format(train_df.shape[0], train_df.shape[1]))
print("Test:  rows:{} cols:{}".format(test_df.shape[0], test_df.shape[1]))

Train: rows:10000 cols:1235
Test:  rows:7100 cols:1234


**songs_train** has 1000 songs and **songs_test** has 7100 songs. It is easy to find that the **train** and **test** files contain the same columns, except the **songs_test** does not have the genre column.  

Example Columns: 
* **song_id** - the index of the songs (in the dataset);  
* **audio_danceability** - how suitable is the song for dancing based on its musical element (from 0 to 1);
* **audio_energy** - the intendity and loudiness of the song (from 0 to 1);
* **audio_key** - the key of the song;
* **lyrics** - how many times this word appears in the song...

Before building a classifier, it is important to have a good understanding of the data. Therefore, I will explore the some features and check if there are missing data.

## <a id='31'>Missing Values Evaluation</a>  

This function could help to examine whether there are missing values in the train and test datasets.

In [6]:
def missing(data):
    to = data.isnull().sum().sort_values(ascending = False)
    per = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([to, per], axis=1, keys=['Total', 'Percent'])
missing(train_df).sample(3).head()

Unnamed: 0,Total,Percent
lyrics_coupe,0,0.0
lyrics_livin,0,0.0
lyrics_finna,0,0.0


In [7]:
missing(test_df).sample(3).head()

Unnamed: 0,Total,Percent
lyrics_mine,0,0.0
lyrics_hits,0,0.0
lyrics_making,0,0.0


There are no missing values in **train** and **test** datasets. 

## <a id='32'>Genre and Non-lyrics Features</a>  

I will provide some descriptions of `Genre` and `None-lyric Features` in the train data, hoping to find some patterns in these features that could help to categorize the data.

## 1. Genre

In [8]:
genre_num = train_df.groupby(['genre']).size().reset_index(name='count')
fig = px.bar(genre_num, x="genre", y="count", title="Genre Distribution")
fig.show()

According to the bar chart, it is obvious that the four types are roughly even distributed in the train dataset. Each type is around 25% of the total songs. Rock has the highest number (2696), whereas hiphip is the smallest types (has 2289 number of songs). 

## 2. Non-lyrics features

In this part, I have selected some non-lyrics features (13 features) in the training data and grouped them by genres. Then I will provide some basic description about each non-lyrics features and discuss whether they could be the features that differentiate songs.

In [9]:
features = train_df.iloc[:, 1:14]
features.sample(3).head()

Unnamed: 0,genre,audio_danceability,audio_energy,audio_key,audio_loudness,audio_mode,audio_speechiness,audio_acousticness,audio_instrumentalness,audio_liveness,audio_valence,audio_tempo,audio_duration_ms
227,pop,0.722,0.833,6,-3.915,1,0.094,0.0556,0.0,0.195,0.53,93.941,180000
348,pop,0.687,0.702,7,-5.324,0,0.0455,0.0064,4.4e-05,0.204,0.284,129.956,214840
2553,pop,0.896,0.459,1,-8.937,1,0.0515,0.0737,8.4e-05,0.0981,0.484,125.939,168897


In [10]:
grouped = features.groupby(features.genre)
rock = grouped.get_group("rock")
pop = grouped.get_group("pop")
rap = grouped.get_group("rap")
hiphop = grouped.get_group("hip hop")

### Non-lyrics Features' Minimum Values

In [11]:
minrock = pd.DataFrame(rock.min())
minpop = pd.DataFrame(pop.min())
minrap = pd.DataFrame(rap.min())
minhiphop = pd.DataFrame(hiphop.min())
minimum = pd.concat([minrock, minpop, minrap, minhiphop], axis=1)
new_header = minimum.iloc[0] 
minimum = minimum[1:] 
minimum.columns = new_header 

meanrock = pd.DataFrame(rock.mean())
meanpop = pd.DataFrame(pop.mean())
meanrap = pd.DataFrame(rap.mean())
meanhiphop = pd.DataFrame(hiphop.mean())
mean = pd.concat([meanrock, meanpop, meanrap, meanhiphop], axis=1)
mean.columns = new_header

maxrock = pd.DataFrame(rock.max())
maxpop = pd.DataFrame(pop.max())
maxrap = pd.DataFrame(rap.max())
maxhiphop = pd.DataFrame(hiphop.max())
maximum = pd.concat([maxrock, maxpop, maxrap, maxhiphop], axis=1)
new_header = maximum.iloc[0] 
maximum = maximum[1:] 
maximum.columns = new_header 

summary = pd.concat([minimum, mean, maximum],axis=1)
summary.columns = ['rock.min', 'pop.min', 'rap.min', 'hiphop.min', 
                   'rock.mean', 'pop.mean', 'rap.mean', 'hiphop.mean',
                  'rock.max', 'pop.max', 'rap.max', 'hiphop.max']
summary

Unnamed: 0,rock.min,pop.min,rap.min,hiphop.min,rock.mean,pop.mean,rap.mean,hiphop.mean,rock.max,pop.max,rap.max,hiphop.max
audio_danceability,0.112,0.189,0.209,0.228,0.518301,0.650699,0.731149,0.707821,0.965,0.975,0.981,0.975
audio_energy,0.00357,0.125,0.114,0.0924,0.720505,0.702768,0.677197,0.694365,0.997,0.999,0.986,0.993
audio_key,0.0,0.0,0.0,0.0,5.272997,5.277973,5.224082,5.534294,11.0,11.0,11.0,11.0
audio_loudness,-24.149,-22.587,-17.696,-21.779,-7.619894,-6.026587,-6.22974,-6.773828,-1.273,-0.323,0.175,0.221
audio_mode,0.0,0.0,0.0,0.0,0.665801,0.619883,0.559592,0.559633,1.0,1.0,1.0,1.0
audio_speechiness,0.0223,0.0226,0.0232,0.0252,0.057349,0.086333,0.203708,0.233644,0.464,0.848,0.752,0.943
audio_acousticness,3e-06,1.5e-05,3e-05,4e-06,0.155038,0.167035,0.147355,0.155032,0.991,0.987,0.947,0.963
audio_instrumentalness,0.0,0.0,0.0,0.0,0.059778,0.035248,0.012198,0.014272,0.98,0.966,0.981,0.953
audio_liveness,0.0222,0.0185,0.0239,0.0167,0.196599,0.177993,0.19304,0.220958,0.986,0.937,0.899,0.983
audio_valence,0.0344,0.0305,0.0349,0.0348,0.536254,0.545599,0.514213,0.563533,0.985,0.975,0.967,0.969


This table presents the minimum, average and maximum values of each features in each type of genre. However, the difference between each genre is quite small. It is difficult to determine which features can choose as symbolic features and need more analysis. Next, I will pick some typical features to discuss.

### Audio Danceability

In [12]:
fig1 = px.scatter(train_df, y="audio_danceability", color="genre")
fig1.update_layout(title='Songs Audio Danceability by genre')
fig1.show()

According to the scatter plot, audio danceability is varied accross different genres. Rock's dancability is generally lower than other genres.

### Audio Instrumentalness

In [13]:
fig2 = px.scatter(train_df, y="audio_instrumentalness", color="genre")
fig2.update_layout(title='Songs Audio Energy by genre')
fig2.show()

Pop and rock have generally higher instrumentalness than rap and hip hop.

### Audio Loudness

In [14]:
fig = px.scatter(train_df, y="audio_loudness", color="genre")
fig.update_layout(title='Songs Audio loudness by genre')
fig.show()

Rock's loudness is partially lower than other types.

### Audio Speechiness

In [15]:
fig = px.scatter(train_df, y="audio_speechiness", color="genre")
fig.update_layout(title='Songs Audio speechiness by genre')
fig.show()

Pop and rock have relative low speechiness than rap and hiphop.
These features are the ones which could partially differentiate songs and could be taken into the baseline model.

<a href="#0"><font size="1">Go to top</font></a>  


## <a id='33'>Lyrics Features</a>  

Other thousands of features are all lyrics features, which represent the time each word appears. Since there are too many lyrics features, I will use the dimension reduction methods to make them easier to compute.  

In [16]:
genre = train_df[['genre']]
lyrics1 = train_df.iloc[:,14:1236]
lyrics = pd.concat([genre, lyrics1],axis=1)
lyrics.sample(3).head()

Unnamed: 0,genre,lyrics_across,lyrics_act,lyrics_actin,lyrics_acting,lyrics_action,lyrics_afraid,lyrics_ago,lyrics_ah,lyrics_ahead,...,lyrics_year,lyrics_years,lyrics_yellow,lyrics_yes,lyrics_yesterday,lyrics_yet,lyrics_yo,lyrics_york,lyrics_young,lyrics_zone
4028,rap,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,7,0
4802,rap,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7472,rock,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
x = lyrics1
y = genre
x = StandardScaler().fit_transform(x)

# Dimensions reduction by PCA 
pca = PCA(n_components=2)
Y = pca.fit_transform(x)
principalDf = pd.DataFrame(data = Y, columns = ['principal component 1', 'principal component 2'])
finalDf = pd.concat([principalDf, genre], axis = 1)

# Scatter plot the 2-dimensional PCA demographic features grouped by borough
fig = px.scatter(Y, x=finalDf['principal component 1'], y=finalDf['principal component 2'], 
                 symbol=finalDf['genre'],color=finalDf['genre'],size_max=1)
fig.update_layout(title_text="Dimension Reduction for lyrics features")
fig.show()

Now, the dimension reduction integrate the lyrics features into two principal components. Although the Principal Component Analysis (PCA) cannot sperate four genres clearly, we can still see that there are some difference between each genres. For example, pop, rock and hiphop has lower principal component 1 than rap, which rock has low principal component 2.

Based on previous analysis, I will select all non-lyrics features and the two principal components for following model building. The non-lyrics features all seems related to genre classification, while the principal components synthesize the lyrics features which are quite important. 

<a href="#0"><font size="1">Go to top</font></a>  




# <a id='5'>Predictive Model for Classification</a>  



## <a id='50'>Training and Validation Dataset</a>  

To begin with, I will create two new dataset. This first dataset will build on the original **songs_train** and substitute the non-lyrics features to the principal components I have created from last part. The second dataset will build on the original **songs_test**. I will also use dimension reduction on **songs_test** and generate the new dataset which also includes non-lyrics features and the principal components.

Then,I will split the first dataset into training and validation dataset. I will pick 80% of the data for training and left 20% for validation. Next, I will try different predictive models and test their performance.

In [18]:
first = pd.merge(left=principalDf, left_index=True,
                  right=train_df, right_index=True,
                  how='inner')
first.sample(3).head()

Unnamed: 0,principal component 1,principal component 2,song_id,genre,audio_danceability,audio_energy,audio_key,audio_loudness,audio_mode,audio_speechiness,...,lyrics_year,lyrics_years,lyrics_yellow,lyrics_yes,lyrics_yesterday,lyrics_yet,lyrics_yo,lyrics_york,lyrics_young,lyrics_zone
3022,-1.386175,2.753744,3023,rap,0.793,0.481,9,-9.258,1,0.124,...,1,0,1,0,0,0,0,0,0,0
7390,-0.207045,-2.723041,7391,rock,0.443,0.872,3,-4.205,1,0.0444,...,0,0,0,0,0,0,0,0,0,0
2287,-0.917936,-0.763566,2288,pop,0.762,0.863,0,-3.689,0,0.0565,...,0,0,0,0,0,0,0,0,0,0


In [19]:
lyrics_test = train_df.iloc[:,14:1236]
x = StandardScaler().fit_transform(lyrics_test)
pca = PCA(n_components=2)
Y = pca.fit_transform(x)
principalDf_test = pd.DataFrame(data = Y, columns = ['principal component 1', 'principal component 2'])
second = pd.merge(left=principalDf_test, left_index=True,
                  right=test_df, right_index=True,
                  how='inner')
second.sample(3).head()

Unnamed: 0,principal component 1,principal component 2,song_id,audio_danceability,audio_energy,audio_key,audio_loudness,audio_mode,audio_speechiness,audio_acousticness,...,lyrics_year,lyrics_years,lyrics_yellow,lyrics_yes,lyrics_yesterday,lyrics_yet,lyrics_yo,lyrics_york,lyrics_young,lyrics_zone
2862,0.81309,-1.994871,12863,0.812,0.385,10,-9.676,0,0.416,0.787,...,0,0,0,0,0,0,0,0,0,0
5648,0.274302,-2.783884,15649,0.653,0.658,2,-6.428,1,0.0304,0.0215,...,0,0,0,0,0,0,0,0,0,0
4371,-0.910229,-1.559902,14372,0.505,0.762,9,-7.15,1,0.0454,0.0415,...,0,0,0,0,0,0,0,0,0,0



## <a id='51'>Model Selection</a>  

I build the model using Random Forest with a few predictors I have selected in non-lyrics features. These features are the ones I see variances across genres.

In [20]:
Y_train = first["genre"]

X_train = first.drop(labels = ["genre"],axis = 1)
X_train = X_train.drop("song_id", axis=1)

In [21]:
kfold = StratifiedKFold(n_splits=10)

In [None]:
random_state = 10
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state = random_state))

cv_results = []
for classifier in classifiers :
    cv_results.append(cross_val_score(classifier, X_train, y = Y_train, scoring = "accuracy", cv = kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_res = pd.DataFrame({"Algorithm":["SVC","DecisionTree","AdaBoost","RandomForest","GradientBoosting",
                                    "MultipleLayerPerceptron","KNeighboors","LogisticRegression"],
                       "Cross_Validation_Means":cv_means,"Cross_Validation_Std": cv_std})

g = sns.barplot("Cross_Validation_Means","Algorithm",data = cv_res, palette="Set3",orient = "h",**{'xerr':cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")

In [24]:
cv_res

Unnamed: 0,Algorithm,Cross_Validation_Means,Cross_Validation_Errors
0,SVC,0.342,0.04287
1,DecisionTree,0.5495,0.044035
2,AdaBoost,0.5873,0.053738
3,RandomForest,0.6392,0.059749
4,GradientBoosting,0.6448,0.048526
5,MultipleLayerPerceptron,0.4218,0.067706
6,KNeighboors,0.3245,0.025168
7,LogisticRegression,0.2696,0.00049


This table show the cross validation data among each classifers. It is easy to acknowledge that RandomForest and GradientBoosting perform the best. And I will use these two models for further hyperparameter optimization.

<a href="#0"><font size="1">Go to top</font></a>  

## <a id='55'>Hyperparameters Optimization</a>

After selection of classifiers, I will conduct grid search to tune each models' hyperparameters and find the best combination.  

In [38]:
# RFC Parameters Tunning 
RFC = RandomForestClassifier()

rf_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2,3,10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[50,100,300],
              "criterion": ["gini"]}

gsRFC = GridSearchCV(RFC,param_grid = rf_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsRFC.fit(X_train,Y_train)

RFC_best = gsRFC.best_estimator_

gsRFC.best_score_

Fitting 10 folds for each of 81 candidates, totalling 810 fits


0.6531999999999999

In [45]:
# Gradient Boosting Parameters Tunning

GBC = GradientBoostingClassifier()
gb_param_grid = {'loss' : ["deviance"],
              'n_estimators' : [100,200],
              'learning_rate': [0.1],
              'max_depth': [4],
              'min_samples_leaf': [100],
              'max_features': [0.3] }

gsGBC = GridSearchCV(GBC,param_grid = gb_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsGBC.fit(X_train,Y_train)

GBC_best = gsGBC.best_estimator_

gsGBC.best_score_

Fitting 10 folds for each of 2 candidates, totalling 20 fits


0.6537999999999999

<a href="#0"><font size="1">Go to top</font></a>  

# <a id='6'>Model Ensambling</a>

## <a id='61'>Ensamble Framework</a>

In [42]:
voting = VotingClassifier(estimators=[('rfc', RFC_best),('gbc',GBC_best)], 
                          voting='soft', n_jobs=4)

voting = voting.fit(X_train, Y_train)

## <a id='62'>Prediction and Submission</a>

### 1. Prediction

In [46]:
second_id = second["song_id"]
test_second = second.drop("song_id", axis=1)
test_genre = pd.Series(voting.predict(test_second), name="genre")
submission = pd.concat([second_id,test_genre],axis=1)

### 2. Submission

In [44]:
submission.to_csv("submission_new.csv",index=False)

# <a id='7'>References</a>

[1] https://www.kaggle.com/c/titanic

[2] https://www.kaggle.com/code/yassineghouzam/titanic-top-4-with-ensemble-modeling

[3] https://www.kaggle.com/code/gpreda/tutorial-for-classification/notebook

<a href="#0"><font size="1">Go to top</font></a> 