# Predicting Music Genres

A colleague and I built a decision tree classifier to predict music genres during my master's program. We used track data obtained from the Spotify API, which includes attributes such as tempo and popularity.

#### Note:
I am proud of what I accomplished with this project, but I would like to redo the analysis one day. On review, there are several things I would do differently, such as the data engineering process. Overall, this was a fun project to complete, and the classification results were interesting to review.

## Load Libraries

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix,classification_report,precision_score

## Load Data

In [8]:
df = pd.read_csv("SpotifyFeatures.csv")

## Data Engineering
When engineering the data, we took these steps:
* We dropped some genres, since the data included 27 of them, and kept the ones we were most interested in.
* We combined genres based on our own knowledge of them. For example, Rap and Hip-Hop are similar, so we merged them into one genre.
* Because combining genres created repeated rows, we dropped duplicate track_ids within each newly labeled genre.
* For simplicity, we dropped any track_id that appeared under more than one genre.

Once the data reflected the genres we were most interested in, each categorical attribute was transformed into one dummy column per value.

### Creating the Data

In [9]:
# "Children's Music" is listed twice because both apostrophe styles are covered
drop_genres = ["A Capella","Anime","Children's Music","Movie","Comedy","Reggae",
              "Ska","Soundtrack","World","Dance","Electronic", "Children’s Music"]
df2 = df[~(df["genre"].isin(drop_genres))]

In [10]:
Genre_dict = {"Pop":"Pop","Jazz":"Jazz/Blues","Blues":"Jazz/Blues","Country":"Country/Folk","Folk":"Country/Folk",
              "Rap":"Rap/Hip-Hop","Hip-Hop":"Rap/Hip-Hop","R&B":"R&B/Soul","Soul":"R&B/Soul",
              "Classical":"Classical/Opera", "Opera":"Classical/Opera","Rock":"Rock/Alt/Indie", "Alternative":"Rock/Alt/Indie",
              "Indie":"Rock/Alt/Indie","Reggaeton":"Reggaeton"}

In [11]:
# Map each original genre to its combined label (touches the genre column only)
df2 = df2.assign(genre=df2["genre"].map(Genre_dict))
    
df2.genre.value_counts()

Rock/Alt/Indie     28078
Rap/Hip-Hop        18527
Jazz/Blues         18464
R&B/Soul           18081
Country/Folk       17963
Classical/Opera    17536
Pop                 9386
Reggaeton           8927
Name: genre, dtype: int64

In [13]:
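# Collapse repeats of the same track within a combined genre
df3 = df2.drop_duplicates(subset=["genre","track_id"])
# Find track_ids that still appear more than once (listed under several genres)
track_counts = df3.track_id.value_counts()
multiple_ids = list(track_counts[track_counts>1].index)
df_Model = df3[~(df3.track_id.isin(multiple_ids))]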
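
To make the effect of these two de-duplication steps concrete, here is a minimal sketch on a hypothetical five-row table. The track IDs and genre assignments are invented for illustration, and the second step uses `drop_duplicates(keep=False)` as a compact alternative to counting IDs.

```python
import pandas as pd

# Hypothetical toy data: "t1" is listed twice under the same merged genre,
# "t2" appears under two different merged genres, "t3" appears once.
toy = pd.DataFrame({
    "track_id": ["t1", "t1", "t2", "t2", "t3"],
    "genre": ["Rap/Hip-Hop", "Rap/Hip-Hop", "Pop", "R&B/Soul", "Jazz/Blues"],
})

# Step 1: collapse exact (genre, track_id) repeats created by merging genres.
step1 = toy.drop_duplicates(subset=["genre", "track_id"])

# Step 2: drop every track that still appears more than once,
# i.e. tracks labelled with several different genres.
step2 = step1.drop_duplicates(subset="track_id", keep=False)

kept_ids = sorted(step2["track_id"])
print(kept_ids)
```

Only the tracks with a single, unambiguous genre survive, which is the property the model data relies on.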

### Formatting Data for the Classification Model

In [15]:
df_Model = df_Model.drop(columns=["artist_name","track_name","track_id"])

In [16]:
#make dummies for every categorical column except the target, genre
categorical = [col for col in df_Model.columns if df_Model[col].dtype == "O" and col != "genre"]
for col in categorical:
    dummy = pd.get_dummies(df_Model[col], prefix = col)
    df_Model = pd.concat([df_Model, dummy], axis = 1)

In [17]:
#drop the original categorical columns now that they are encoded as dummies
df_Model=df_Model.drop(columns=['key','mode','time_signature'])
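
As a possible simplification, the encode-then-drop steps above could be done in one call by passing `columns=` to `pd.get_dummies`. The sketch below uses a hypothetical toy frame whose column names mirror the ones in this notebook; it is not the code we ran.

```python
import pandas as pd

# Hypothetical toy rows; the column names mirror the ones used above.
toy = pd.DataFrame({
    "genre": ["Pop", "Jazz/Blues", "Pop"],
    "tempo": [120.0, 90.5, 101.2],
    "key": ["A", "C#", "A"],
    "mode": ["Major", "Minor", "Major"],
})

# columns=[...] encodes only the listed columns, prefixes each dummy with
# its source column name, and drops the original columns in the same call.
encoded = pd.get_dummies(toy, columns=["key", "mode"])

print(list(encoded.columns))
```

Because `genre` is not listed in `columns=`, the target stays untouched.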

In [18]:
X = df_Model.drop(columns=['genre'])
y = df_Model['genre']

In [19]:
df_Model.iloc[:,0:12].describe()

| | popularity | acousticness | danceability | duration_ms | energy | instrumentalness | liveness | loudness | speechiness | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 82278.0 | 82278.0 | 82278.0 | 82278.0 | 82278.0 | 82278.0 | 82278.0 | 82278.0 | 82278.0 | 82278.0 | 82278.0 |
| mean | 40.05321 | 0.42004 | 0.540753 | 245381.2 | 0.527367 | 0.159602 | 0.193352 | -10.555467 | 0.083809 | 116.498499 | 0.450418 |
| std | 15.743149 | 0.370816 | 0.193743 | 118047.2 | 0.278681 | 0.309763 | 0.166665 | 6.869715 | 0.089789 | 31.06207 | 0.261526 |
| min | 0.0 | 1e-06 | 0.0582 | 15509.0 | 2e-05 | 0.0 | 0.0102 | -47.599 | 0.0222 | 34.151 | 0.0 |
| 25% | 32.0 | 0.0634 | 0.403 | 187853.0 | 0.297 | 0.0 | 0.0961 | -13.42875 | 0.0365 | 91.965 | 0.228 |
| 50% | 42.0 | 0.304 | 0.559 | 223493.0 | 0.569 | 0.000119 | 0.124 | -8.167 | 0.0474 | 112.9505 | 0.439 |
| 75% | 50.0 | 0.827 | 0.691 | 272193.0 | 0.76 | 0.0687 | 0.234 | -5.686 | 0.0833 | 137.18875 | 0.661 |
| max | 100.0 | 0.996 | 0.987 | 5488000.0 | 0.999 | 0.994 | 1.0 | 3.744 | 0.949 | 242.903 | 0.986 |


## Decision Tree Classification
The final dataset was split 70/30 into training and test sets.

Four tests were performed for this analysis:
* Stratified Sampling: The training and test sets were stratified on the target value, genre.
* Random Sampling: The training and test sets were selected at random, without stratification.
* Feature Selection: Only columns whose variance exceeded a fixed threshold were kept, and the split was random.
* Set Parameters for Model: The same data setup as the first test, but with parameters set on the model: the entropy criterion, a maximum depth of 10 levels, and a minimum of 100 samples per leaf.

The best model was the last test performed. Setting those parameters raised accuracy from 0.59 to 0.64 and improved precision for every genre compared with Test 1. Recall improved for most genres, although it stayed at 0.90 for Classical/Opera and fell from 0.65 to 0.61 for Reggaeton.

### Test 1: Stratified Sampling

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify = y, random_state = 9)
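
The `stratify = y` argument is what separates this test from Test 2, and it is why the per-genre support counts differ slightly between the two reports. A minimal sketch of the effect, using hypothetical imbalanced labels rather than the Spotify data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% "A", 20% "B".
y_toy = pd.Series(["A"] * 800 + ["B"] * 200)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, _, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=9
)
_, _, _, y_te_rand = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=9
)

# With stratify, the test set's class shares match the full data set's.
strat_share_b = (y_te_strat == "B").mean()
rand_share_b = (y_te_rand == "B").mean()
print("stratified share of B:", strat_share_b)
print("random share of B:    ", rand_share_b)
```

The stratified split reproduces the 20% share of the minority class in the test set, while the random split only approximates it.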

In [21]:
Model = DecisionTreeClassifier()
Model.fit(X_train,y_train)
Predict = Model.predict(X_test)

print(classification_report(y_test,Predict))

                 precision    recall  f1-score   support

Classical/Opera       0.89      0.90      0.90      5114
   Country/Folk       0.47      0.46      0.47      3406
     Jazz/Blues       0.60      0.59      0.59      4658
            Pop       0.47      0.44      0.45       835
       R&B/Soul       0.34      0.34      0.34      2843
    Rap/Hip-Hop       0.58      0.55      0.56      1859
      Reggaeton       0.60      0.65      0.63      2561
 Rock/Alt/Indie       0.48      0.48      0.48      3408

       accuracy                           0.59     24684
      macro avg       0.55      0.55      0.55     24684
   weighted avg       0.59      0.59      0.59     24684
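
The report shows that some genres are hard to predict but not which genres they get mistaken for. A confusion matrix would answer that. The sketch below uses a handful of invented labels purely to show the layout; on the real data, `y_test` and `Predict` would take the place of the hypothetical `y_true` and `y_pred`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, invented for illustration.
labels = ["Pop", "R&B/Soul", "Rap/Hip-Hop"]
y_true = ["Pop", "Pop", "R&B/Soul", "R&B/Soul", "R&B/Soul", "Rap/Hip-Hop"]
y_pred = ["Pop", "R&B/Soul", "R&B/Soul", "Pop", "Rap/Hip-Hop", "Rap/Hip-Hop"]

# Rows are true genres, columns are predicted genres.
cm = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=labels),
    index=labels, columns=labels,
)
print(cm)
```

Reading across a row shows where the tracks of one true genre ended up.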



In [23]:
Feat = list(Model.feature_importances_)
for i in range(len(list(X.columns))):
    print(X.columns[i], (100*Feat[i]))

popularity 15.730553122082513
acousticness 20.036221151419877
danceability 9.464209800003385
duration_ms 5.424754884629846
energy 6.159624858923034
instrumentalness 7.779357447592285
liveness 4.025358970813704
loudness 6.693220539412597
speechiness 8.393323867763296
tempo 5.092938632891275
valence 5.375230073484281
key_A 0.37315371642105855
key_A# 0.24998599599652577
key_B 0.3318402837448929
key_C 0.4469568579017878
key_C# 0.3554809145471348
key_D 0.31508779845158796
key_D# 0.18385821036690628
key_E 0.32305462825443626
key_F 0.3979128276242854
key_F# 0.3658317364648943
key_G 0.3739689958184168
key_G# 0.3562423811543125
mode_Major 0.4455246533030264
mode_Minor 0.8261847332509951
time_signature_0/4 0.0
time_signature_1/4 0.0196796111274612
time_signature_3/4 0.20542861975229101
time_signature_4/4 0.1701425234795997
time_signature_5/4 0.08487216332427334


In [34]:
print("Number of levels: ", Model.get_depth())
print("Number of leaves: ", Model.get_n_leaves())
print("Parameters: ", Model.get_params())

Number of levels:  37
Number of leaves:  13191
Parameters:  {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': None, 'splitter': 'best'}
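
A tree with 37 levels and 13,191 leaves has very likely memorized much of the training set. One way to check is to compare training and test accuracy. The sketch below does this on synthetic data from `make_classification`, which is a stand-in and not the Spotify set. The sample sizes and noise level are assumptions chosen only to make the gap visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Spotify features (not the real data).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=4,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=9
)

# Default settings: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = full_tree.score(X_tr, y_tr)
test_acc = full_tree.score(X_te, y_te)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large gap between the two scores suggests overfitting. Running the same comparison with `Model`, `X_train`, and `X_test` would show whether that applies here.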


### Test 2: Random Sampling

In [24]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X, y, test_size=0.3, random_state = 9)

In [25]:
Model2 = DecisionTreeClassifier()
Model2.fit(X2_train,y2_train)
Predict2 = Model2.predict(X2_test)

print(classification_report(y2_test,Predict2))

                 precision    recall  f1-score   support

Classical/Opera       0.90      0.90      0.90      5108
   Country/Folk       0.43      0.45      0.44      3385
     Jazz/Blues       0.61      0.59      0.60      4613
            Pop       0.45      0.47      0.46       828
       R&B/Soul       0.34      0.32      0.33      2933
    Rap/Hip-Hop       0.56      0.57      0.57      1844
      Reggaeton       0.61      0.62      0.62      2535
 Rock/Alt/Indie       0.46      0.47      0.46      3438

       accuracy                           0.58     24684
      macro avg       0.55      0.55      0.55     24684
   weighted avg       0.58      0.58      0.58     24684



In [26]:
Feat2 = list(Model2.feature_importances_)
for i in range(len(list(X.columns))):
    print(X.columns[i], (100*Feat2[i]))

popularity 15.488655335512563
acousticness 19.830359654140445
danceability 9.770225327464379
duration_ms 5.42258972349231
energy 6.054167907939453
instrumentalness 7.709026260836674
liveness 4.111797013232579
loudness 7.248277958888516
speechiness 8.283691393711546
tempo 4.876415317095565
valence 5.376836885778197
key_A 0.30040101616819104
key_A# 0.3252654107043228
key_B 0.3710365737667696
key_C 0.3829408490865239
key_C# 0.3360009511396342
key_D 0.34450455867390767
key_D# 0.20722760079493413
key_E 0.3819088678959781
key_F 0.3671074280225338
key_F# 0.35358490239095813
key_G 0.4446468186304299
key_G# 0.29590356267113926
mode_Major 0.5466287170356127
mode_Minor 0.6648527784985548
time_signature_0/4 0.0
time_signature_1/4 0.030275952599622655
time_signature_3/4 0.21075820659505642
time_signature_4/4 0.20098834852787256
time_signature_5/4 0.06392467870572753


In [27]:
#see tree in a text format
from sklearn.tree import export_text

r = export_text(Model2, feature_names=list(X.columns))
print(r)

|--- acousticness <= 0.87
|   |--- popularity <= 39.50
|   |   |--- instrumentalness <= 0.00
|   |   |   |--- danceability <= 0.67
|   |   |   |   |--- popularity <= 34.50
|   |   |   |   |   |--- danceability <= 0.51
|   |   |   |   |   |   |--- popularity <= 18.50
|   |   |   |   |   |   |   |--- acousticness <= 0.63
|   |   |   |   |   |   |   |   |--- popularity <= 0.50
|   |   |   |   |   |   |   |   |   |--- danceability <= 0.30
|   |   |   |   |   |   |   |   |   |   |--- class: Jazz/Blues
|   |   |   |   |   |   |   |   |   |--- danceability >  0.30
|   |   |   |   |   |   |   |   |   |   |--- class: Country/Folk
|   |   |   |   |   |   |   |   |--- popularity >  0.50
|   |   |   |   |   |   |   |   |   |--- popularity <= 6.00
|   |   |   |   |   |   |   |   |   |   |--- loudness <= -6.02
|   |   |   |   |   |   |   |   |   |   |   |--- class: Jazz/Blues
|   |   |   |   |   |   |   |   |   |   |--- loudness >  -6.02
|   |   |   |   |   |   |   |   |   |   |   |--- class: Rock/A



### Test 3: Decision Tree with Feature Selection

In [28]:
from sklearn.feature_selection import VarianceThreshold

In [29]:
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X3 = sel.fit_transform(X)
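
The threshold `.8 * (1 - .8)` equals 0.16, which is the variance of a 0/1 column that is 1 in 80% of rows. `VarianceThreshold` compares this against each column's raw variance, so the result depends on each feature's scale. For example, acousticness has a standard deviation of about 0.37 in the summary table, which gives a variance of about 0.14, so it falls below the cutoff. A minimal sketch with two hypothetical columns on different scales:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000
# Hypothetical columns on very different scales.
unit_scale = rng.uniform(0.0, 1.0, n)      # variance about 1/12, below 0.16
large_scale = rng.uniform(0.0, 200.0, n)   # variance in the thousands
X_toy = np.column_stack([unit_scale, large_scale])

threshold = 0.8 * (1 - 0.8)  # 0.16, the variance of a 0/1 column with p = 0.8
sel_toy = VarianceThreshold(threshold=threshold)
sel_toy.fit(X_toy)

kept = sel_toy.get_support().tolist()
print("column variances:", X_toy.var(axis=0))
print("kept columns:", kept)
```

The column bounded between 0 and 1 is removed regardless of how informative it might be. This likely explains why Test 3 lost features such as acousticness that ranked highly in Tests 1 and 2.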

In [30]:
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y, test_size=0.3, random_state = 7)

In [31]:
Model3 = DecisionTreeClassifier()
Model3.fit(X3_train,y3_train)
Predict3 = Model3.predict(X3_test)

print(classification_report(y3_test,Predict3))

                 precision    recall  f1-score   support

Classical/Opera       0.84      0.83      0.83      5085
   Country/Folk       0.32      0.33      0.33      3396
     Jazz/Blues       0.45      0.44      0.45      4624
            Pop       0.43      0.43      0.43       836
       R&B/Soul       0.21      0.22      0.22      2834
    Rap/Hip-Hop       0.31      0.30      0.31      1896
      Reggaeton       0.39      0.40      0.40      2572
 Rock/Alt/Indie       0.31      0.30      0.30      3441

       accuracy                           0.45     24684
      macro avg       0.41      0.41      0.41     24684
   weighted avg       0.45      0.45      0.45     24684



In [65]:
f_list = list(sel.get_support(indices=True))

In [70]:
for item in f_list:
    print(X.columns[item])

popularity
duration_ms
loudness
tempo
mode_Major
mode_Minor


In [56]:
list(Model3.feature_importances_)

[0.20924807653961158,
 0.21264359169458313,
 0.35481715614037895,
 0.2060900571934613,
 0.0065891703372506095,
 0.010611948094714488]
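
The importances above are in the same order as the selected feature names printed earlier. Pairing them explicitly avoids reading the two lists side by side. The sketch below shows one way to do that on a hypothetical toy frame, where the `_toy` names and the `flat` column are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Hypothetical frame: "flat" has almost no variance and gets filtered out.
X_toy = pd.DataFrame({
    "loudness": rng.normal(-10, 7, 300),
    "flat": rng.normal(0, 0.01, 300),
    "tempo": rng.normal(115, 30, 300),
})
y_toy = (X_toy["loudness"] > -10).astype(int)

sel_toy = VarianceThreshold(threshold=0.16)
X_sel = sel_toy.fit_transform(X_toy)
tree = DecisionTreeClassifier(random_state=0).fit(X_sel, y_toy)

# get_support() is a boolean mask over the original columns, so it can
# index X_toy.columns directly to recover the surviving names.
kept_names = X_toy.columns[sel_toy.get_support()]
importances = pd.Series(tree.feature_importances_, index=kept_names)
print(importances.sort_values(ascending=False))
```

In this notebook the equivalent would be `pd.Series(Model3.feature_importances_, index=X.columns[sel.get_support()])`.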

### Test 4: Set Parameters for the Decision Tree

In [32]:
X4_train, X4_test, y4_train, y4_test = train_test_split(X, y, test_size=0.3, stratify = y, random_state = 9)

In [37]:
Model4 = DecisionTreeClassifier(criterion = 'entropy', max_depth=10, min_samples_leaf=100)
Model4.fit(X4_train,y4_train)
Predict4 = Model4.predict(X4_test)

print(classification_report(y4_test,Predict4))

                 precision    recall  f1-score   support

Classical/Opera       0.93      0.90      0.91      5114
   Country/Folk       0.50      0.61      0.55      3406
     Jazz/Blues       0.65      0.65      0.65      4658
            Pop       0.50      0.54      0.52       835
       R&B/Soul       0.44      0.35      0.39      2843
    Rap/Hip-Hop       0.61      0.68      0.64      1859
      Reggaeton       0.70      0.61      0.65      2561
 Rock/Alt/Indie       0.55      0.55      0.55      3408

       accuracy                           0.64     24684
      macro avg       0.61      0.61      0.61     24684
   weighted avg       0.64      0.64      0.64     24684



In [35]:
print("Number of levels: ", Model4.get_depth())
print("Number of leaves: ", Model4.get_n_leaves())
print("Parameters: ", Model4.get_params())

Number of levels:  10
Number of leaves:  294
Parameters:  {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'entropy', 'max_depth': 10, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 100, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': 'deprecated', 'random_state': None, 'splitter': 'best'}
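
The parameters in Test 4 were set by hand. If I redid the analysis, one option would be to search over candidate values with cross-validation. The sketch below is an illustration only: it runs on synthetic data, and the grid values, the `f1_macro` scoring choice, and the three folds are all assumptions, not settings we tested.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Spotify set).
X_syn, y_syn = make_classification(
    n_samples=1500, n_features=10, n_informative=5, n_classes=3,
    random_state=0,
)

# Hypothetical candidate values; the grid includes the settings from Test 4.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 20, 100],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",
    cv=3,
)
search.fit(X_syn, y_syn)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

On the real data, the search would be fit on `X4_train` and `y4_train` only, leaving the test set for the final report.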
