# Classification for the Spotify dataset
In the following, we perform classification on the given dataset. The goal is to _build a classifier that labels tracks by `genre`_.  
We make three attempts at this classification:
- multiclass classification with n_class = n_genres;
- multiclass classification with n_class < n_genres, where the genres are regrouped according to some sort of 'basic' genre criterion;
- binary classification: non-musical tracks vs. musical tracks. 

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

## Evaluation metrics
We import the metrics used to evaluate performance.  
As a recap, the relevant formulas are:  
$$Acc = \frac{TP + TN}{TP + TN + FP + FN} $$  
The F1 score is the harmonic mean of precision and recall (the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals, and is appropriate when an average of rates is desired). It reaches its best value at 1 and its worst at 0, and precision and recall contribute equally to it: F1 = 2 * (precision * recall) / (precision + recall). In terms of counts:

$$F1_{\text{score}} = \frac{2TP}{2TP + FN + FP} $$  
$$Precision = \frac{TP}{TP + FP} $$  
$$Recall = \frac{TP}{TP + FN} $$
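As a quick sanity check of these formulas, the sketch below evaluates them on raw counts for a binary case. The counts are made-up illustrative numbers, not results from this dataset. Precision, TP / (TP + FP), is computed as well because the F1 definition relies on it.

```python
# Illustrative confusion-matrix counts (hypothetical, not from the Spotify data)
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

# F1 as the harmonic mean of precision and recall ...
f1_harmonic = 2 * precision * recall / (precision + recall)
# ... and in the equivalent count-based form
f1_counts = 2 * TP / (2 * TP + FN + FP)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"F1 (harmonic)={f1_harmonic:.3f}  F1 (counts)={f1_counts:.3f}")
```

Both F1 expressions agree, which is why either form can be used.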

In [3]:
#Importing the performance metrics
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from scikitplot.metrics import plot_roc
from scikitplot.metrics import plot_precision_recall

In [4]:
# importing datasets - one will be used for training, one for testing.
df_original_copy_training = pd.read_csv("dataset (missing + split)/train.csv", skipinitialspace=True)
df_train = pd.read_csv("dataset (missing + split)/train.csv", skipinitialspace=True) #this will be modified

df_test = pd.read_csv("dataset (missing + split)/test.csv", skipinitialspace=True)

# Data preparation: transformation and preprocessing
This has to be done on both the training set and the test set. 

### Data transformation

In [5]:
# Converting duration_ms from ms to min - train
df_train['duration_ms'] *= 1/6e4
# Converting duration_ms from ms to min - test
df_test['duration_ms'] *= 1/6e4
# Expressing popularity as a fraction (value / 100) - train
df_train['popularity'] /= 100
df_train.rename(columns = {'duration_ms':'duration_min'}, inplace = True)
df_train.rename(columns = {'popularity':'popularity_percent'}, inplace = True)
# Expressing popularity as a fraction (value / 100) - test
df_test['popularity'] /= 100
df_test.rename(columns = {'duration_ms':'duration_min'}, inplace = True)
df_test.rename(columns = {'popularity':'popularity_percent'}, inplace = True)
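Since the same transformation is applied to both sets, it could also be wrapped in one function so that train and test cannot drift apart. This is only a sketch: the helper name `transform` and the tiny `demo` frame are hypothetical, and the function returns a copy instead of modifying the frame in place.

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with duration in minutes and popularity divided by 100."""
    out = df.copy()
    out["duration_ms"] = out["duration_ms"] / 6e4   # ms -> min
    out["popularity"] = out["popularity"] / 100     # value / 100
    return out.rename(columns={"duration_ms": "duration_min",
                               "popularity": "popularity_percent"})

# Tiny made-up frame standing in for df_train / df_test
demo = pd.DataFrame({"duration_ms": [180000, 240000], "popularity": [50, 0]})
demo_t = transform(demo)
print(demo_t)
```

With such a helper, the cell above would reduce to `df_train = transform(df_train)` and `df_test = transform(df_test)`.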

### Filling of NaN values

In [6]:
#Dealing with mode attribute missing values - train
#Computing p0 and p1 and filling missing values of mode attribute by sampling

p0=df_train['mode'].value_counts()[0]/(len(df_train)-df_train['mode'].isnull().sum())
p1=df_train['mode'].value_counts()[1]/(len(df_train)-df_train['mode'].isnull().sum())
list_of_nan_indexes_train=df_train[df_train['mode'].isnull()].index.tolist()
for i in list_of_nan_indexes_train:
    if np.random.random() < p1:
        df_train.loc[i,'mode'] = 1.0
    else:
        df_train.loc[i,'mode'] = 0.0
        
#Dealing with mode attribute missing values - test
#Computing p0 and p1 and filling missing values of mode attribute by sampling

p0=df_test['mode'].value_counts()[0]/(len(df_test)-df_test['mode'].isnull().sum())
p1=df_test['mode'].value_counts()[1]/(len(df_test)-df_test['mode'].isnull().sum())
list_of_nan_indexes_test=df_test[df_test['mode'].isnull()].index.tolist()
for i in list_of_nan_indexes_test:
    if np.random.random() < p1:
        df_test.loc[i,'mode'] = 1.0
    else:
        df_test.loc[i,'mode'] = 0.0

In [7]:
#Dealing with time_signature attribute missing values - train
#Computing the array containing the probabilities of every outcome for time_signature
outcomes_of_time_signature = len(df_train['time_signature'].value_counts())
p_array=np.array(df_train['time_signature'].value_counts().sort_index(ascending=True)/(len(df_train)-df_train['time_signature'].isnull().sum()))
# Dictionary mapping the sorted position of each outcome to its time_signature value
# (this hard-codes the observed values as 0, 1, 3, 4 and 5)
dict_ts = {0: 0.0, 1: 1.0, 2: 3.0, 3 : 4.0, 4 : 5.0}
list_of_nan_indexes_ts=df_train[df_train['time_signature'].isnull()].index.tolist()
from scipy.stats import multinomial 
# A plain loop is not elegant, but it takes little time: only ~3000 points have to be filled
for i in list_of_nan_indexes_ts:
    tmp = multinomial.rvs(1, p_array, size=1, random_state=None)
    array_tmp=np.where(tmp[0][:]==1)
    index=array_tmp[0][0] #implement a dict for the substitution
    df_train.loc[i,'time_signature'] = dict_ts[index]
    
#Dealing with time_signature attribute missing values - test
#Computing the array containing the probabilities of every outcome for time_signature
outcomes_of_time_signature = len(df_test['time_signature'].value_counts())
p_array=np.array(df_test['time_signature'].value_counts().sort_index(ascending=True)/(len(df_test)-df_test['time_signature'].isnull().sum()))
# Dictionary mapping the sorted position of each outcome to its time_signature value
# (this hard-codes the observed values as 0, 1, 3, 4 and 5)
dict_ts = {0: 0.0, 1: 1.0, 2: 3.0, 3 : 4.0, 4 : 5.0}
list_of_nan_indexes_ts=df_test[df_test['time_signature'].isnull()].index.tolist()
# A plain loop is not elegant, but it takes little time: only ~3000 points have to be filled
for i in list_of_nan_indexes_ts:
    tmp = multinomial.rvs(1, p_array, size=1, random_state=None)
    array_tmp=np.where(tmp[0][:]==1)
    index=array_tmp[0][0] #implement a dict for the substitution
    df_test.loc[i,'time_signature'] = dict_ts[index]
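The two cells above follow the same idea: draw each missing value from the empirical distribution of the observed ones. A more compact version is sketched below. The helper name `fill_by_sampling`, its `reference` and `seed` parameters, and the `demo` frame are all hypothetical. The sketch reads the outcomes directly from `value_counts`, so no hard-coded dictionary is needed, and it fixes the random seed for reproducibility.

```python
import numpy as np
import pandas as pd

def fill_by_sampling(target, column, reference=None, seed=0):
    """Fill NaNs in target[column] by sampling from the empirical
    distribution of the observed values in reference[column]."""
    reference = target if reference is None else reference
    # value_counts ignores NaN by default, so the probabilities sum to 1
    probs = reference[column].value_counts(normalize=True)
    missing = target[column].isna()
    rng = np.random.default_rng(seed)
    draws = rng.choice(probs.index.to_numpy(), size=int(missing.sum()),
                       p=probs.to_numpy())
    out = target.copy()
    out.loc[missing, column] = draws
    return out

# Made-up column standing in for df_train['time_signature']
demo = pd.DataFrame({"time_signature": [4.0, 4.0, 3.0, np.nan, 4.0, np.nan, 5.0]})
filled = fill_by_sampling(demo, "time_signature")
print(filled["time_signature"].tolist())
```

The cells above fill each set from its own frequencies. Passing `reference=df_train` when filling `df_test` would instead reuse the training distribution, which is a design choice rather than something the notebook prescribes.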

In [8]:
df_train['genre'].value_counts()

genre
j-dance          750
iranian          750
brazil           750
chicago-house    750
forro            750
idm              750
indian           750
study            750
disney           750
afrobeat         750
mandopop         750
techno           750
sleep            750
spanish          750
j-idol           750
industrial       750
happy            750
bluegrass        750
black-metal      750
breakbeat        750
Name: count, dtype: int64

In [9]:
df_test['genre'].value_counts()

genre
industrial       250
breakbeat        250
happy            250
afrobeat         250
study            250
black-metal      250
sleep            250
bluegrass        250
techno           250
brazil           250
j-dance          250
chicago-house    250
disney           250
iranian          250
mandopop         250
idm              250
spanish          250
j-idol           250
indian           250
forro            250
Name: count, dtype: int64

Note that the genres do not appear in the same order in the train and test sets, so the same dictionary map must be used for BOTH. 
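A minimal sketch of this point: the mapping is built once and then applied to both sets, so a given genre always receives the same integer regardless of the order in which it shows up. The series and the name `genre_to_int` are hypothetical stand-ins for the real `genre` columns.

```python
import pandas as pd

# Made-up label columns standing in for df_train['genre'] and df_test['genre']
train_genres = pd.Series(["techno", "sleep", "idm", "sleep"])
test_genres = pd.Series(["sleep", "idm", "techno"])

# Build the mapping once (from the training labels) and reuse it for both sets
genre_to_int = {g: i for i, g in enumerate(sorted(train_genres.unique()))}

y_train_demo = train_genres.map(genre_to_int)
y_test_demo = test_genres.map(genre_to_int)
print(genre_to_int)
print(y_train_demo.tolist(), y_test_demo.tolist())
```

Sorting the unique labels makes the mapping independent of row order. A genre that is present in the test set but absent from the mapping would become NaN under `map`, which makes such a mismatch easy to spot.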

### Outlier criteria
For now, we do not treat any point as an outlier: none of them is missing its classification label, and one of our tasks this time is to distinguish between musical and non-musical genres. 

### Eliminating redundant features

In [8]:
df_train.dtypes

name                      object
duration_min             float64
explicit                    bool
popularity_percent       float64
artists                   object
album_name                object
danceability             float64
energy                   float64
key                        int64
loudness                 float64
mode                     float64
speechiness              float64
acousticness             float64
instrumentalness         float64
liveness                 float64
valence                  float64
tempo                    float64
features_duration_ms       int64
time_signature           float64
n_beats                  float64
n_bars                   float64
popularity_confidence    float64
processing               float64
genre                     object
dtype: object

In [9]:
column2drop = ['features_duration_ms', 'popularity_confidence', 'processing', 'name', 'artists','album_name'] # redundant columns and those that would add the most complexity
df_train.drop(column2drop, axis=1, inplace=True)
df_test.drop(column2drop, axis=1, inplace=True)


In [10]:
attributes = [col for col in df_train.columns if col != 'genre']

### Encoding `str` values

In [11]:
#from sklearn.preprocessing import LabelEncoder

In [12]:
# label encoding genre feature
#le = LabelEncoder()
#df_train['genre'] = le.fit_transform(df_train['genre'])
#df_test['genre'] = le.fit_transform(df_test['genre'])

In [13]:
# Creating genre map: every genre is mapped to an int value (originally introduced to include genre in the Pearson correlation).
# Note: replace() returns a new DataFrame, so df_train and df_test themselves are left unchanged here.

genre_map={"j-dance":0,"iranian":1,"brazil":2,"chicago-house":3,"forro":4,"idm":5,"indian":6,"study":7,"disney":8,"afrobeat":9,"mandopop":10,"techno":11,"sleep":12,"spanish":13,"j-idol":14,"industrial":15,"happy":16,"bluegrass":17,"black-metal":18,"breakbeat":19}
df_train.replace({'genre':genre_map})
df_test.replace({'genre':genre_map})

Unnamed: 0,duration_min,explicit,popularity_percent,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,n_beats,n_bars,genre
0,3.447100,False,0.32,0.383,0.9510,0,-3.743,1.0,0.1040,0.006070,0.000000,0.2610,0.668,110.584,4.0,385.0,96.0,15
1,5.495550,False,0.41,0.464,0.5790,5,-9.136,1.0,0.0596,0.281000,0.827000,0.0992,0.140,171.752,4.0,935.0,235.0,19
2,2.266667,False,0.40,0.611,0.7780,9,-4.803,1.0,0.0326,0.094600,0.000005,0.1390,0.285,90.024,4.0,200.0,49.0,6
3,4.117333,False,0.25,0.500,0.9580,0,-1.695,0.0,0.0350,0.008170,0.318000,0.7320,0.955,130.059,4.0,526.0,132.0,14
4,3.468667,False,0.00,0.802,0.6840,1,-8.839,1.0,0.1230,0.001810,0.010200,0.2360,0.637,130.022,4.0,440.0,110.0,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,1.346883,False,0.17,0.217,0.0754,11,-16.629,0.0,0.0327,0.901000,0.914000,0.1350,0.201,142.026,5.0,175.0,44.0,8
4996,4.358333,False,0.02,0.467,0.7820,10,-8.136,1.0,0.0599,0.001810,0.000057,0.0971,0.203,145.059,4.0,621.0,158.0,1
4997,4.300000,False,0.19,0.524,0.9730,0,-5.214,0.0,0.0469,0.000057,0.005170,0.1070,0.840,140.029,4.0,594.0,149.0,15
4998,6.741767,False,0.19,0.166,0.9750,2,-3.585,0.0,0.1100,0.000032,0.005520,0.0656,0.233,75.005,4.0,550.0,138.0,18


NB: here _partitioning_ is not necessary because the data already come split into a train set and a test set, each with the same number of tracks per genre (so holdout is not performed).

In [14]:
y_train = np.array(df_train['genre'])
y_test = np.array(df_test['genre'])

X_train = df_train.replace({'genre':genre_map})
X_test = df_test.replace({'genre':genre_map})

In [15]:
print(X_train)

       duration_min  explicit  popularity_percent  danceability  energy  key  \
0          4.029333     False                0.46         0.690   0.513    5   
1          7.400000     False                0.00         0.069   0.196    1   
2          5.558433     False                0.03         0.363   0.854    2   
3          4.496667     False                0.23         0.523   0.585    5   
4          5.127517     False                0.25         0.643   0.687    7   
...             ...       ...                 ...           ...     ...  ...   
14995      7.200433     False                0.00         0.554   0.657    1   
14996      3.045767     False                0.44         0.103   0.860    1   
14997      6.668183     False                0.43         0.799   0.535    1   
14998      3.287500     False                0.37         0.511   0.970    5   
14999      3.306817     False                0.36         0.678   0.518    6   

       loudness  mode  speechiness  aco

# Decision tree

In [16]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

#class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', 
#max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, 
#max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, 
#class_weight=None, ccp_alpha=0.0)

For now, we work only on the _training set_, in order to tune the parameters of the tree properly.  
After that, the test set will be used for computing the performance metrics. 
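One common way to tune a tree using the training set only is a cross-validated grid search. The sketch below is an assumption about how this could be done, not the procedure used in this notebook: it runs on synthetic data from `make_classification`, and the parameter grid is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (X_train, y_train)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Illustrative grid: tree depth and minimum leaf size
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                 # 5-fold cross-validation on the training data only
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

The test set stays untouched during the search and would only be used afterwards, with `search.best_estimator_`, to compute the final metrics.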

In [17]:
# Training an untuned tree and computing its performance metrics, as a baseline for comparison after tuning
dt = DecisionTreeClassifier()

In [18]:
%%time
dt.fit(X_train, y_train)

CPU times: user 187 ms, sys: 3.56 ms, total: 191 ms
Wall time: 189 ms


In [19]:
y_train_pred = dt.predict(X_train)

In [20]:
print('Train Accuracy %s' % accuracy_score(y_train, y_train_pred))
print('Train F1-score %s' % f1_score(y_train, y_train_pred, average=None))

Train Accuracy 1.0
Train F1-score [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


In [None]:
print(classification_report(y_train, y_train_pred))

Okay, a perfect score on every class looks suspicious... let's keep going and try to understand what happened. 

In [None]:
zipped = zip(attributes, dt.feature_importances_)
zipped = sorted(zipped, key=lambda x: x[1], reverse=True)
for col, imp in zipped:
    print(col, imp)

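For some reason, the tree does not seem to be computed properly. Rather than too many categories, a likely cause is label leakage: `X_train` is built as `df_train.replace({'genre':genre_map})`, so it still contains the `genre` column, and the tree can reach a perfect training score by splitting on that column alone. Since `attributes` excludes `genre`, the `zip` above also silently drops the importance of that last column. Restricting the inputs to `df_train[attributes]` (and `df_test[attributes]`) should remove the leak. Keep in mind that an unpruned tree can reach a very high training accuracy even without leakage, so the training score alone does not settle the question.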
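One way to check whether the label column is leaking into the inputs is to compare a tree fitted on all columns with one fitted on the attributes only. The sketch below does this on made-up data: the frame `demo` and its columns are illustrative stand-ins, and the labels are random on purpose so that no real attribute can explain them.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "energy": rng.random(200),
    "tempo": rng.random(200),
    "genre": rng.integers(0, 4, size=200),   # labels unrelated to the features
})
features = [c for c in demo.columns if c != "genre"]

# Leaky: the label column is among the inputs, so the tree only needs to split on it
leaky = DecisionTreeClassifier(max_depth=3, random_state=0).fit(demo, demo["genre"])
leaky_importances = dict(zip(demo.columns, leaky.feature_importances_))

# Clean: only the real attributes are used as inputs
clean = DecisionTreeClassifier(max_depth=3, random_state=0).fit(demo[features], demo["genre"])

print(leaky_importances)
print("leaky train acc:", leaky.score(demo, demo["genre"]))
print("clean train acc:", clean.score(demo[features], demo["genre"]))
```

In the leaky fit the importance concentrates on `genre` and the training accuracy is perfect, while the clean fit cannot do better than what the attributes allow. Note that the importances are zipped with the columns actually passed to `fit`, so no column is dropped from the listing.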