# 1. Introduction

This tutorial introduces some commonly used machine learning techniques. Because the course focuses on SOCAN this semester, I've chosen a Spotify music dataset. The tutorial covers data preprocessing and modelling techniques; the corresponding presentations cover APIs for NLP and audio processing that may be useful when prototyping your ideas.

# 2. Import Packages

In [1]:
import numpy as np
import pandas as pd # this library is used for data processing
import seaborn as sns # used for data visualization

from matplotlib import pyplot as plt
%matplotlib inline
sns.set_style("whitegrid")

import warnings
warnings.filterwarnings("ignore")

# 3. Loading Dataset

The first step in any machine learning project is to load your dataset. We use the pandas library to do this as it provides us with dataframe objects that handle large amounts of data well.

In [2]:
spotify_df = pd.read_csv('../data/SpotifyFeatures.csv')
spotify_df.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Opera,Giuseppe Verdi,"Stiffelio, Act III: Ei fugge! … Lina, pensai c...",7EsKYeHtTc4H4xWiTqSVZA,21,0.986,0.313,490867,0.231,0.000431,C#,0.0964,-14.287,Major,0.0547,86.001,4/4,0.0886
1,Opera,Giacomo Puccini,Madama Butterfly / Act 1: ... E soffitto e pareti,7MfmRBvqaW0I6UTxXnad8p,18,0.972,0.36,176797,0.201,0.028,D#,0.133,-19.794,Major,0.0581,131.798,4/4,0.369
2,Opera,Giacomo Puccini,"Turandot / Act 2: Gloria, gloria, o vincitore",7pBo1GDhIysyUMFXiDVoON,10,0.935,0.168,266184,0.47,0.0204,C,0.363,-8.415,Major,0.0383,75.126,3/4,0.0696
3,Opera,Giuseppe Verdi,"Rigoletto, Act IV: Venti scudi hai tu detto?",02mvYZX5aKNzdqEo6jF20m,17,0.961,0.25,288573,0.00605,0.0,D,0.12,-33.44,Major,0.048,76.493,4/4,0.038
4,Opera,Giuseppe Verdi,"Don Carlo / Act 4: ""Ella giammai m'amò!""",03TW0jwGMGhUabAjOpB1T9,19,0.985,0.142,629760,0.058,0.146,D,0.0969,-23.625,Major,0.0493,172.935,4/4,0.0382


As shown above, each song comes with a number of attributes that could serve as features for answering our question: is a song popular or not?
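Before going further, it can help to check the shape of the loaded data and which columns pandas did not parse as numbers. The sketch below is only an illustration: it reads a tiny in-memory CSV (a hypothetical three-row sample, not the real file) so that it runs without `SpotifyFeatures.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for SpotifyFeatures.csv
csv_text = """genre,artist_name,popularity,acousticness,key,mode
Opera,Giuseppe Verdi,21,0.986,C#,Major
Opera,Giacomo Puccini,18,0.972,D#,Major
Rap,Kendrick Lamar,54,0.0815,C,Minor
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # one dtype per column

# Columns that are not numeric will need encoding before modelling
categorical_cols = sample_df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

On the real dataframe the same calls would be `spotify_df.shape` and `spotify_df.select_dtypes(exclude="number")`.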

# 4. Dataset Statistics 

It is often important to understand the summary statistics of your data to get a better sense of what preprocessing you might need. Here we see, for each numeric feature, the number of examples, the mean, the standard deviation, the minimum and maximum values, and the quartiles.

In [3]:
spotify_df.describe()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
count,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0
mean,44.20913,0.3512,0.554198,236609.2,0.580967,0.13731,0.214638,-9.354658,0.122442,117.423062,0.444795
std,17.276599,0.351385,0.183949,116678.7,0.260577,0.292447,0.196977,5.940994,0.186264,30.712458,0.255397
min,0.0,1e-06,0.0569,15509.0,2e-05,0.0,0.00967,-52.457,0.0222,30.379,0.0
25%,33.0,0.0309,0.437,186253.0,0.405,0.0,0.0977,-11.287,0.0368,92.734,0.232
50%,47.0,0.205,0.57,221173.0,0.618,3.7e-05,0.128,-7.515,0.0506,115.347,0.43
75%,57.0,0.689,0.69,264840.0,0.793,0.0234,0.263,-5.415,0.109,138.887,0.643
max,100.0,0.996,0.987,5552917.0,0.999,0.999,1.0,1.585,0.967,239.848,1.0
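Note that `describe()` summarises only numeric columns by default, which is why `genre`, `key`, `mode`, and `time_signature` are absent from the table above. A minimal sketch of how the text columns could be summarised as well, using a made-up four-row frame (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two text columns
toy_df = pd.DataFrame({
    "popularity": [21, 18, 54, 85],
    "key": ["C#", "D#", "C", "C"],
    "mode": ["Major", "Major", "Minor", "Major"],
})

# Default behaviour: numeric columns only
numeric_summary = toy_df.describe()

# For text columns, describe() reports count, unique, top and freq
categorical_summary = toy_df[["key", "mode"]].describe()

# Frequency of each category in a single column
key_counts = toy_df["key"].value_counts()

print(numeric_summary)
print(categorical_summary)
print(key_counts)
```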


# 5. Data Preprocessing

This is a crucial part of any machine learning project because preprocessing can drastically improve your model's performance. We also need to remove errors in the data, such as null values, that may cause issues later.

## a. Cleaning Null Values

In [4]:
pd.isnull(spotify_df).sum()

genre               0
artist_name         0
track_name          0
track_id            0
popularity          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
dtype: int64

Based on our check, there are no null values in the data, so no further steps are needed to deal with them!
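Had the check turned up missing values, two common options are dropping the affected rows or filling the gaps. The sketch below shows both on a made-up frame (`toy_df` and its values are hypothetical); which option is appropriate depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy_df = pd.DataFrame({
    "tempo": [120.0, np.nan, 90.0, 150.0],
    "key": ["C", "D", None, "C"],
})

null_counts = toy_df.isnull().sum()
print(null_counts)

# Option 1: drop every row that has at least one missing value
dropped_df = toy_df.dropna()

# Option 2: fill numeric gaps with the median and text gaps with the most frequent value
filled_df = toy_df.copy()
filled_df["tempo"] = filled_df["tempo"].fillna(filled_df["tempo"].median())
filled_df["key"] = filled_df["key"].fillna(filled_df["key"].mode()[0])

print(dropped_df)
print(filled_df)
```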

## b. Dealing with Categorical Variables

Categorical data is common, but our models accept only numerical representations of the data. For each categorical variable, we therefore replace every category with an integer code, from 0 to the number of categories minus 1, assigned in order of first appearance.
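As a rough sketch of the same idea, `pd.factorize` assigns integer codes in order of first appearance, which is what the loops in this section do by hand. Integer codes imply an ordering between categories that may not be meaningful, so the sketch also shows one-hot encoding with `pd.get_dummies` as an alternative; that alternative is an addition here, not something the tutorial uses. The frame `toy_df` is hypothetical.

```python
import pandas as pd

# Hypothetical toy column of musical keys
toy_df = pd.DataFrame({"key": ["C#", "D#", "C", "D", "C"]})

# Integer codes in order of first appearance: C# -> 0, D# -> 1, C -> 2, D -> 3
codes, categories = pd.factorize(toy_df["key"])
toy_df["key_code"] = codes

# One-hot alternative: one indicator column per category, no implied order
one_hot = pd.get_dummies(toy_df["key"], prefix="key")

print(toy_df)
print(list(categories))
print(one_hot)
```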

### 1. Key

In [5]:
list_of_keys = spotify_df['key'].unique()
for i in range(len(list_of_keys)):
    spotify_df.loc[spotify_df['key'] == list_of_keys[i], "key"] = i
spotify_df.sample(5)

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
99577,Rap,Kendrick Lamar,Compton,1wf4LnpdAhLaoI2WwYDKAE,54,0.0815,0.342,248093,0.907,0.0,0,0.383,-4.432,Minor,0.415,170.38,4/4,0.225
50797,Electronic,Bonobo,If You Stayed Over - Reprise,5JrslPPXLFnUJzq85P6qTv,33,0.912,0.352,104411,0.526,0.967,9,0.068,-14.956,Minor,0.0482,170.001,4/4,0.0519
47332,Electronic,Vök,Night & Day,6UkBBvlkNhbdFBz6i3KQmC,34,0.014,0.717,195484,0.582,0.000609,8,0.148,-8.121,Major,0.0788,93.988,4/4,0.585
17126,Dance,Dido,Friends,7ohyW5BmiG0jKcvw31fhHH,37,0.67,0.815,203933,0.396,0.0418,5,0.0999,-10.981,Minor,0.0332,123.007,4/4,0.45
12879,Pop,Jason Mraz,I'm Yours,1EzrEOXmMH3G43AXT1y7pA,85,0.595,0.686,242187,0.457,0.0,6,0.105,-8.322,Major,0.0468,150.953,4/4,0.718


### 2. Mode (Binary Variable)

In [6]:
spotify_df.loc[spotify_df["mode"] == 'Major', "mode"] = 1
spotify_df.loc[spotify_df["mode"] == 'Minor', "mode"] = 0
spotify_df.sample(5)

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
94735,Children’s Music,Grouplove,Standing in The Sun,6kCHznMiHnRaNDHXwHmrrV,50,0.00317,0.475,255893,0.594,7e-06,6,0.0752,-7.331,1,0.0292,180.277,4/4,0.626
209813,Movie,Mel Gibson,Funny How Time Slips Away,1OTt0jwBX5riJ0MnvRLcov,1,0.179,0.381,262000,0.634,0.00035,11,0.142,-7.6,1,0.0286,69.323,4/4,0.548
62040,R&B,Shoffy,Movin On,64TUmDi8kUUCF816GhdRXX,58,0.488,0.807,211636,0.339,0.0165,8,0.107,-12.958,1,0.055,110.004,4/4,0.131
148118,Indie,Knox Hamilton,Pretty Way to Fight,6FI9HEkxiYljt06KHdmpKE,46,0.00725,0.525,203667,0.893,0.0576,0,0.219,-4.776,1,0.0665,153.996,4/4,0.336
21306,Alternative,Gorillaz,"The Apprentice (feat. Rag'n'Bone Man, Zebra Ka...",67fRHOlaYQQFG67D9DkdnW,56,0.182,0.446,234933,0.648,1.1e-05,6,0.348,-6.773,0,0.3,86.021,4/4,0.696


### 3. Time Signature

In [7]:
list_of_time_signatures = spotify_df['time_signature'].unique()
for i in range(len(list_of_time_signatures)):
    spotify_df.loc[spotify_df['time_signature'] == list_of_time_signatures[i], 'time_signature'] = i
spotify_df.sample(5)

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
115118,Classical,Johann Sebastian Bach,"St. Matthew Passion, BWV 244, Pt. 1: No. 10, I...",5dc71dIr7dIxy5LOe9J35R,0,0.985,0.206,58587,0.0642,0.958,8,0.117,-26.193,1,0.0397,64.397,0,0.251
9526,Alternative,Portugal. The Man,Feel It Still - Medasin Remix,4m6ObZmZ7wnyrKtmLvlyVE,63,0.553,0.568,194434,0.463,0.00114,9,0.0858,-10.622,1,0.262,157.686,0,0.148
108207,Hip-Hop,Too $hort,Blow the Whistle,2lMg3lCMOGistaWBNGjuT3,62,0.0042,0.907,163133,0.625,0.0,0,0.151,-5.557,1,0.198,99.918,0,0.605
77308,Folk,Patty Griffin,Up To The Mountain (MLK Song),2W0n4u0ySQpkLVeg6rOOr1,42,0.94,0.298,248867,0.204,0.000106,6,0.115,-7.242,1,0.0323,84.126,2,0.218
192163,Soul,Eric Benét,Femininity,1NyMdeeVDYizUcXHConsq6,39,0.102,0.677,288600,0.63,0.0,5,0.0554,-4.778,1,0.05,72.954,0,0.625


### 4. Popularity (Label)

In [8]:
spotify_df['popularity'] = np.where(spotify_df['popularity'] >= 57, 1, 0)
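The cutoff of 57 matches the 75th percentile of `popularity` in the summary statistics, so roughly the top quarter of tracks is labelled popular. A minimal sketch of deriving such a cutoff from the data instead of hard-coding it, and of checking the resulting class balance, on made-up values (`toy_df` and `is_popular` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical popularity scores
toy_df = pd.DataFrame({"popularity": [0, 10, 33, 47, 50, 57, 60, 100]})

# Use the 75th percentile as the cutoff for the positive class
threshold = toy_df["popularity"].quantile(0.75)
toy_df["is_popular"] = np.where(toy_df["popularity"] >= threshold, 1, 0)

# Share of each class; an imbalanced label affects how accuracy should be read
class_share = toy_df["is_popular"].value_counts(normalize=True)

print(threshold)
print(class_share)
```

Knowing the class balance matters later: if about 75% of tracks are labelled 0, a model that always predicts 0 already reaches about 75% accuracy.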

# 6. Training Models

Now that we have preprocessed our data, we are ready to train and evaluate models. The first step is to split the dataset into a training set and a test set. The training set is used to train the algorithm, and the test set is used to evaluate its performance on unseen data. Both the splitting and the models themselves are commonly handled with the scikit-learn library.

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras

## Dataset Split

In [11]:
features = ["acousticness", "danceability", "duration_ms", "energy", "instrumentalness", "key", "liveness",
            "mode", "speechiness", "tempo", "time_signature", "valence"]

In [12]:
training_df = spotify_df.sample(frac = 0.8, random_state= 420)
X_train = training_df[features]
y_train = training_df['popularity']
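# The test set is the 20% of rows that were not sampled into training_df
test_df = spotify_df.drop(training_df.index)
X_test = test_df[features]
y_test = test_df['popularity']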

In [13]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size = 0.2, random_state = 420)
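The same train/validation/test split can also be sketched with two calls to `train_test_split`. The example below runs on made-up data (`toy_df` is hypothetical) and adds `stratify`, which is an assumption on my part and not used in the cells above; it keeps the share of popular tracks the same in every subset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 25% positive labels
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "energy": rng.random(100),
    "tempo": rng.uniform(60, 180, 100),
    "popularity": np.repeat([0, 1], [75, 25]),
})
X = toy_df[["energy", "tempo"]]
y = toy_df["popularity"]

# First split: hold out 20% as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Second split: carve a validation set out of the remaining 80%
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0)

print(len(X_train), len(X_valid), len(X_test))
```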

## Logistic Regression

In [14]:
LR_Model = LogisticRegression()
LR_Model.fit(X_train, y_train)
LR_Predict = LR_Model.predict(X_valid)
LR_Accuracy = accuracy_score(y_valid, LR_Predict)
print("Accuracy: " + str(LR_Accuracy))

LR_AUC = roc_auc_score(y_valid, LR_Predict) 
print("AUC: " + str(LR_AUC))

Accuracy: 0.7497945543198379
AUC: 0.5
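An accuracy near 0.75 together with an AUC of exactly 0.5 is consistent with a model that predicts the majority class for every example. The sketch below illustrates two adjustments that may help; both are assumptions, not something the notebook confirms as the cause: standardising the features before fitting, and computing AUC from predicted probabilities instead of hard labels. It uses synthetic data from `make_classification`, so the numbers it prints say nothing about the Spotify dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 12 features, about 25% positive labels
X, y = make_classification(n_samples=2000, n_features=12, n_informative=6,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scale features, then fit logistic regression
lr_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_pipeline.fit(X_train, y_train)

valid_labels = lr_pipeline.predict(X_valid)               # hard 0/1 predictions
valid_scores = lr_pipeline.predict_proba(X_valid)[:, 1]   # probability of class 1

accuracy = accuracy_score(y_valid, valid_labels)
auc_from_scores = roc_auc_score(y_valid, valid_scores)

# Baseline: always predict the majority class
majority_labels = [0] * len(y_valid)
majority_accuracy = accuracy_score(y_valid, majority_labels)
majority_auc = roc_auc_score(y_valid, majority_labels)

print("Accuracy:", accuracy, "AUC:", auc_from_scores)
print("Majority baseline accuracy:", majority_accuracy, "AUC:", majority_auc)
```

Comparing a model against the majority-class baseline is a quick way to tell whether it has learned anything beyond the class balance.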


## Random Forest

In [15]:
RFC_Model = RandomForestClassifier()
RFC_Model.fit(X_train, y_train)
RFC_Predict = RFC_Model.predict(X_valid)
RFC_Accuracy = accuracy_score(y_valid, RFC_Predict)
print("Accuracy: " + str(RFC_Accuracy))

RFC_AUC = roc_auc_score(y_valid, RFC_Predict) 
print("AUC: " + str(RFC_AUC))

Accuracy: 0.9348874157672711
AUC: 0.879036656108296
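`GridSearchCV` is imported above but not used. As a hedged sketch of how it could tune the random forest, the example below searches a small, arbitrarily chosen grid on synthetic data; the grid values, the `roc_auc` scoring choice, and the 3-fold setting are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 12 Spotify features
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           random_state=0)

# Illustrative grid: every combination is trained and cross-validated
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, None],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # rank parameter combinations by cross-validated AUC
    cv=3,
)
grid_search.fit(X, y)

best_params = grid_search.best_params_
best_auc = grid_search.best_score_
print(best_params, best_auc)
```

After fitting, `grid_search.best_estimator_` is a forest refit on all of the supplied data with the best parameters, and can be used for prediction like any other model.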


## Neural Network

In [25]:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(12, )),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='Adam',
             loss='binary_crossentropy',
             # lowercase 'accuracy' is the string Keras maps to binary accuracy for this loss
             metrics=['accuracy', 'AUC'])

model.fit(X_train, y_train, epochs=10)

Train on 146021 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x16416d4e0>

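To make the network above less of a black box, here is a NumPy-only sketch of what its two `Dense` layers compute for a batch of inputs, along with the binary cross-entropy loss. The weights are random and untrained, and the input batch is made up, so the outputs are meaningless as predictions; the point is only the shape of the computation. With a flat 12-feature input, the `Flatten` layer leaves the data unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 12, 128

# Randomly initialised (untrained) weights, for illustration only
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)


def forward(X):
    """Dense(128, relu) followed by Dense(1, sigmoid)."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU activation
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid squashes to (0, 1)


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_prob = np.clip(y_prob.ravel(), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))


# Hypothetical batch of 5 songs with 12 features each
X_batch = rng.normal(size=(5, n_features))
y_batch = np.array([0, 1, 0, 0, 1])

probs = forward(X_batch)
loss = binary_crossentropy(y_batch, probs)
print(probs.ravel(), loss)
```

Training with `model.fit` repeatedly adjusts the weights to reduce this loss on the training data.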
# 7. Conclusion

You've now gone through the process of applying machine learning models to Spotify data. Keep these concepts of data preprocessing and model selection in mind when deciding how best to solve your own problems.