# 1. Introduction

This tutorial introduces some commonly used machine learning techniques. Because the course this semester focuses on SOCAN, I've chosen a Spotify music dataset. The tutorial covers data preprocessing and modelling techniques. The corresponding presentations cover APIs for NLP and audio processing that may be useful for prototyping your ideas.

# 2. Import Packages

In [43]:
import numpy as np
import pandas as pd # this library is used for data processing
import seaborn as sns # used for data visualization

from matplotlib import pyplot as plt
%matplotlib inline
sns.set_style("whitegrid")

import warnings
warnings.filterwarnings("ignore")

# 3. Loading Dataset

The first step in any machine learning project is to load your dataset. We use the pandas library to do this as it provides us with dataframe objects that handle large amounts of data well.

In [44]:
spotify_df = pd.read_csv('../data/SpotifyFeatures.csv')
spotify_df.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Opera,Giuseppe Verdi,"Stiffelio, Act III: Ei fugge! … Lina, pensai c...",7EsKYeHtTc4H4xWiTqSVZA,21,0.986,0.313,490867,0.231,0.000431,C#,0.0964,-14.287,Major,0.0547,86.001,4/4,0.0886
1,Opera,Giacomo Puccini,Madama Butterfly / Act 1: ... E soffitto e pareti,7MfmRBvqaW0I6UTxXnad8p,18,0.972,0.36,176797,0.201,0.028,D#,0.133,-19.794,Major,0.0581,131.798,4/4,0.369
2,Opera,Giacomo Puccini,"Turandot / Act 2: Gloria, gloria, o vincitore",7pBo1GDhIysyUMFXiDVoON,10,0.935,0.168,266184,0.47,0.0204,C,0.363,-8.415,Major,0.0383,75.126,3/4,0.0696
3,Opera,Giuseppe Verdi,"Rigoletto, Act IV: Venti scudi hai tu detto?",02mvYZX5aKNzdqEo6jF20m,17,0.961,0.25,288573,0.00605,0.0,D,0.12,-33.44,Major,0.048,76.493,4/4,0.038
4,Opera,Giuseppe Verdi,"Don Carlo / Act 4: ""Ella giammai m'amò!""",03TW0jwGMGhUabAjOpB1T9,19,0.985,0.142,629760,0.058,0.146,D,0.0969,-23.625,Major,0.0493,172.935,4/4,0.0382


As we can see above, each song comes with a number of attributes that could serve as features for answering our question: is a song popular or not?
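A quick way to see which columns are already numeric and which hold text is to look at the column types right after loading. The sketch below is only an illustration: it reads a three-row excerpt of the rows shown above from an in-memory string instead of the real CSV file, and the names `sample_df`, `numeric_cols` and `text_cols` are made up for this example.

```python
import io
import pandas as pd

# A three-row excerpt of the columns shown above, used as a stand-in for the full CSV.
csv_text = """genre,artist_name,popularity,acousticness,key,mode,time_signature
Opera,Giuseppe Verdi,21,0.986,C#,Major,4/4
Opera,Giacomo Puccini,18,0.972,D#,Major,4/4
Opera,Giacomo Puccini,10,0.935,C,Major,3/4
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # the type pandas inferred for each column

# Split the column names by type: numeric features vs. text (categorical) columns.
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
text_cols = sample_df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(text_cols)
```

The text columns are the ones that need a numerical encoding before they can be passed to a model.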

# 4. Dataset Statistics 

It is often important to examine the summary statistics of your data to get a better sense of what preprocessing you might need. Here we see, for each numeric column, the number of examples (count), the mean, the standard deviation, the minimum, the quartiles (25%, 50%, 75%) and the maximum.

In [45]:
spotify_df.describe()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
count,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0,228159.0
mean,44.20913,0.3512,0.554198,236609.2,0.580967,0.13731,0.214638,-9.354658,0.122442,117.423062,0.444795
std,17.276599,0.351385,0.183949,116678.7,0.260577,0.292447,0.196977,5.940994,0.186264,30.712458,0.255397
min,0.0,1e-06,0.0569,15509.0,2e-05,0.0,0.00967,-52.457,0.0222,30.379,0.0
25%,33.0,0.0309,0.437,186253.0,0.405,0.0,0.0977,-11.287,0.0368,92.734,0.232
50%,47.0,0.205,0.57,221173.0,0.618,3.7e-05,0.128,-7.515,0.0506,115.347,0.43
75%,57.0,0.689,0.69,264840.0,0.793,0.0234,0.263,-5.415,0.109,138.887,0.643
max,100.0,0.996,0.987,5552917.0,0.999,0.999,1.0,1.585,0.967,239.848,1.0
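The result of `describe()` is itself a dataframe indexed by statistic name, so individual values can be looked up rather than read off by eye. The following is a minimal sketch on a toy frame built from the first five rows shown earlier; `toy_df`, `stats` and `value_range` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Values copied from the first five rows shown earlier; a stand-in for spotify_df.
toy_df = pd.DataFrame({
    "popularity": [21, 18, 10, 17, 19],
    "duration_ms": [490867, 176797, 266184, 288573, 629760],
    "energy": [0.231, 0.201, 0.47, 0.00605, 0.058],
})

stats = toy_df.describe()

# Look up a single value by (statistic, column).
median_popularity = stats.loc["50%", "popularity"]

# Spread of each column: a rough indicator of how different the scales are.
value_range = stats.loc["max"] - stats.loc["min"]

print(median_popularity)
print(value_range)
```

Comparing the ranges makes it easy to spot columns such as `duration_ms` whose scale is very different from the features bounded between 0 and 1.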


# 5. Data Preprocessing

This is a crucial part of any machine learning project because preprocessing can drastically improve your model's performance. We also need to remove problems in the data, such as null values, that may cause errors later on.

## a. Cleaning Null Values

In [46]:
pd.isnull(spotify_df).sum()

genre               0
artist_name         0
track_name          0
track_id            0
popularity          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
dtype: int64

Based on our check, there are no null values in the data, so no further steps are needed to deal with them.
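For datasets that do contain gaps, two common options are to drop the affected rows or to fill the gaps with a substitute value. The sketch below uses hypothetical toy data (the Spotify dataset has no nulls), and the choice of the median and of the placeholder `"Unknown"` is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; the Spotify dataset itself has none.
toy_df = pd.DataFrame({
    "tempo": [86.0, np.nan, 75.1, 76.5],
    "genre": ["Opera", "Opera", None, "Opera"],
})

null_counts = toy_df.isnull().sum()  # same kind of check as above
print(null_counts)

# Option 1: drop every row that has at least one missing value.
dropped_df = toy_df.dropna()

# Option 2: fill numeric gaps with the column median and text gaps with a placeholder.
filled_df = toy_df.copy()
filled_df["tempo"] = filled_df["tempo"].fillna(filled_df["tempo"].median())
filled_df["genre"] = filled_df["genre"].fillna("Unknown")
print(filled_df)
```

Dropping rows is simplest but discards data, while filling keeps every row at the cost of introducing estimated values.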

## b. Dealing with Categorical Variables

Categorical data is common, but our models accept only numerical representations. For each categorical variable we therefore replace every category with an integer code, running from 0 to the number of categories minus 1, assigned in the order in which the categories first appear in the data.
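As a small illustration of this kind of encoding, the sketch below uses `pd.factorize`, which assigns integer codes in order of first appearance. It is shown here as one possible shorthand, not as the method used in the cells that follow, and `toy_keys` is a made-up series based on the keys in the first rows shown earlier. A one-hot encoding via `pd.get_dummies` is included as an alternative that avoids implying an order between categories.

```python
import pandas as pd

# Keys taken from the first rows shown earlier; a stand-in for spotify_df['key'].
toy_keys = pd.Series(["C#", "D#", "C", "D", "D"], name="key")

# Integer codes in order of first appearance.
codes, categories = pd.factorize(toy_keys)
print(list(codes))       # [0, 1, 2, 3, 3]
print(list(categories))  # ['C#', 'D#', 'C', 'D']

# One-hot alternative: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(toy_keys, prefix="key")
print(one_hot.columns.tolist())
```

Integer codes are compact, but a model may read them as ordered quantities; one-hot columns avoid that at the cost of a wider table.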

### 1. Key

In [47]:
list_of_keys = spotify_df['key'].unique()
for i in range(len(list_of_keys)):
    spotify_df.loc[spotify_df['key'] == list_of_keys[i], "key"] = i
spotify_df.sample(5)

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
62215,R&B,Mahalia,No Reply,3yQuFakjbjEF6nanm7anyN,57,0.546,0.704,193813,0.564,5e-06,8,0.0962,-6.292,Minor,0.0699,123.072,3/4,0.586
107162,Hip-Hop,KYLE,iSpy (feat. Kodak Black),3CLd7BfvgsBwBUI8kwFLU6,67,0.455,0.776,227779,0.724,0.0,8,0.109,-4.874,Major,0.216,74.987,4/4,0.677
153319,Country,Shania Twain,Don't Be Stupid (You Know I Love You),2cngJ0gAhCZOpamv9mEyby,54,0.39,0.736,213693,0.721,0.0,3,0.653,-4.748,Major,0.0319,122.002,4/4,0.921
46763,Electronic,Dion Timmer,The Right Type,7a3YBK0s48V0N2LxpeNrQT,38,0.235,0.681,250750,0.69,0.0,2,0.0641,-5.343,Major,0.0395,104.991,4/4,0.249
121950,Comedy,Bill Hicks,Gideons,4pDIKlGFv3KSIExnCgdgNa,15,0.71,0.649,65000,0.86,0.0,10,0.726,-10.343,Minor,0.937,69.629,4/4,0.383


### 2. Mode (Binary Variable)

In [48]:
spotify_df.loc[spotify_df["mode"] == 'Major', "mode"] = 1
spotify_df.loc[spotify_df["mode"] == 'Minor', "mode"] = 0
spotify_df.sample(5)

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
93817,Children’s Music,NoMBe,Drama 2.0,0zKSfdydEV8daVj5409nD4,26,0.166,0.622,221385,0.646,0.00245,8,0.153,-8.476,0,0.0333,97.974,4/4,0.503
204450,Jazz,Walterwarm,Calm Sea,3kjHCxJiILzpp2n1zf0p9z,34,0.835,0.7,107691,0.205,0.958,4,0.114,-12.082,1,0.102,170.133,4/4,0.798
51437,Electronic,Christian Löffler,Haul (feat. Mohna) - Mixed,13o6vo2DYB7WP6UMO8MMG3,36,0.415,0.659,225133,0.587,0.922,6,0.105,-12.571,0,0.036,123.103,4/4,0.1
47280,Electronic,Chris Lake,Give Her Right Back ft. [Dances With White Girls],6nzC3PRVLiPPcUuem4e4DL,40,0.0101,0.856,199996,0.895,0.0975,5,0.0696,-3.908,0,0.0692,125.952,4/4,0.634
220312,World,Random Rab,Vapour Train,3A5kOtCVTLqJ8fDF95rXOA,25,0.242,0.72,277427,0.221,0.865,2,0.15,-11.193,0,0.0602,87.992,4/4,0.443


### 3. Time Signature

In [49]:
list_of_time_signatures = spotify_df['time_signature'].unique()
for i in range(len(list_of_time_signatures)):
    spotify_df.loc[spotify_df['time_signature'] == list_of_time_signatures[i], 'time_signature'] = i
spotify_df.sample(5)

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
195328,Soundtrack,Atli Örvarsson,Nadine's Theme,6BdxHS3BpAB6Yrjazp5VNM,30,0.64,0.496,223627,0.152,0.847,11,0.114,-14.495,1,0.032,122.018,1,0.0432
66503,R&B,Tom Misch,Day 5: For Carol,3JjeC1epY5y2Jysxn5FVCB,49,0.765,0.485,394862,0.497,0.886,3,0.132,-10.022,1,0.0308,100.857,0,0.065
96842,Rap,frumhere,brooklyn baby,1YE6LR1TpF4QUAfFySsQKK,67,0.81,0.579,141500,0.34,0.249,9,0.126,-15.129,0,0.313,135.652,0,0.185
59077,Anime,Xavier Omär,Change On Me (feat. Leuca),2j7qlbFwdLSB1HP3GJNpNi,49,0.26,0.695,225441,0.485,0.0,6,0.111,-8.671,0,0.134,137.974,0,0.352
84374,Classical,Dmitri Shostakovich,"Preludes and Fugues for Piano, Op.87: Prelude ...",1k5NeJPU9qZfRqq528RmQQ,34,0.991,0.404,316027,0.0488,0.883,9,0.086,-26.314,0,0.0495,118.949,0,0.132


### 4. Popularity (Label)

In [50]:
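# Turn popularity into a binary label: 1 (popular) if the score is at least 57,
# which is the 75th percentile reported by describe() above; 0 otherwise.
spotify_df['popularity'] = np.where(spotify_df['popularity'] >= 57, 1, 0)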
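The cut-off of 57 matches the 75% row of the summary statistics, so roughly a quarter of the songs end up with label 1. A minimal sketch of deriving such a threshold from the data instead of hard-coding it is given below; `toy_popularity` is a handful of popularity values taken from the outputs above, and using the 0.75 quantile is an assumption about how the threshold was chosen.

```python
import numpy as np
import pandas as pd

# A few popularity values seen in the outputs above; a stand-in for the full column.
toy_popularity = pd.Series([21, 18, 10, 17, 19, 57, 67, 54])

# Use the 75th percentile as the cut-off instead of a hard-coded number.
threshold = toy_popularity.quantile(0.75)

# 1 = popular (at or above the threshold), 0 = not popular.
labels = np.where(toy_popularity >= threshold, 1, 0)

# Share of positive labels: useful for spotting class imbalance before training.
positive_share = labels.mean()
print(threshold, positive_share)
```

Checking the share of positive labels matters because, with an imbalanced label, accuracy alone can look good even for a model that always predicts the majority class.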

# 6. Training Models

Now that we have preprocessed our data, we are ready to train and evaluate models. The first step is to split the dataset into a training set and a test set. The training set is used to fit the model, and the test set is used to evaluate its performance on unseen data. Both the splitting and the models themselves are commonly handled with the scikit-learn library.
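The split-then-evaluate workflow can be sketched as follows. This is only an illustration on synthetic data: the feature matrix `X`, the label `y`, the 80/20 split and the `random_state` value are all assumptions, and scikit-learn must be installed for it to run.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the binary popularity label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out 20% of the rows; stratify keeps the label ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)  # fit on the training set only

# Evaluate on rows the model has not seen during fitting.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {test_accuracy:.2f}")
```

Fixing `random_state` makes the split reproducible, which helps when comparing several models on the same held-out rows.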

In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
