<a href="https://colab.research.google.com/github/barbararolska/Songs-genre-classifier/blob/main/Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spotify Genre Prediction

Spotify, like most streaming services, aims to give its users a consistent experience. To achieve that, it analyses audio tracks, extracts a set of features from each one, and later uses those features to build the user feed and the recommendation system.

Although you won't see those parameters in the official app, Spotify is transparent about the data. You can fetch them from Spotify using its well-documented API.

These parameters seem somewhat random and unpredictable at first, but further investigation reveals a certain regularity in them. This raises the question of how predictable they are. To answer it, in this project we train several ML models and see how well they predict the genre of a song just by looking at its audio features.

# Collecting Data

The following code is responsible for fetching data from Spotify.
It is not runnable unless you generate and provide a fresh Spotify API token.

We settled on getting roughly 200-400 of the most popular songs from each genre we picked.
The collected data was later compiled into a single CSV file and shared via Google Drive.
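
For readers who want to rerun the collection step, here is a minimal sketch of how a token could be requested through Spotify's Client Credentials flow. The helper name `build_token_request` and the placeholder credentials are assumptions and are not part of the original notebook. The sketch only builds the request, so it runs offline; sending it is left as a commented line.

```python
import base64
from urllib.parse import urlencode

TOKEN_URL = "https://accounts.spotify.com/api/token"

def build_token_request(client_id: str, client_secret: str):
    """Return (url, headers, body) for a Client Credentials token request."""
    # Credentials go into an HTTP Basic header: base64("client_id:client_secret")
    raw = f"{client_id}:{client_secret}".encode("ascii")
    headers = {
        "Authorization": "Basic " + base64.b64encode(raw).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"grant_type": "client_credentials"})
    return TOKEN_URL, headers, body

# Placeholder credentials (hypothetical); real ones come from the Spotify developer dashboard
url, headers, body = build_token_request("my-client-id", "my-client-secret")

# Sending is left commented out so the cell runs without network access:
# token[0] = requests.post(url, headers=headers, data=body).json()["access_token"]
```

The resulting token would then be stored in `token[0]`, which `get_data` reads on every call.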

In [None]:
import random
import requests
from urllib.parse import urlencode
import json
from dataclasses import dataclass, asdict
import time

In [None]:
token = ['']

def get_data(endpoint: str):
    time.sleep(0.1)

    headers = {
        'Authorization': f'Bearer {token[0]}',
    }

    response = requests.get(f'https://api.spotify.com/v1{endpoint}', headers=headers)

    return json.loads(response.content)

In [None]:
token[0] = ''

In [None]:
import inspect

@dataclass
class AudioFeatures:
    danceability: float = 0.0
    energy: float = 0.0
    key: int = 0
    loudness: float = 0.0
    mode: int = 0
    speechiness: float = 0.0
    acousticness: float = 0.0
    instrumentalness : float = 0.0
    liveness: float = 0.0
    valence: float = 0.0
    tempo: float = 0.0
    duration_ms: int = 0
    time_signature: int = 0

    @classmethod
    def from_dict(cls, env) -> "AudioFeatures":
        return cls(**{
            k: v for k, v in env.items()
            if k in inspect.signature(cls).parameters
        })

@dataclass
class Track:
    popularity: int = 0
    name: str = ''
    id: str = ''
    uri: str = ''
    artists: list | None = None
    audio_features: AudioFeatures | None = None

    @classmethod
    def from_dict(cls, env) -> "Track":
        res = cls(**{
            k: v for k, v in env.items()
            if k in inspect.signature(cls).parameters
        })

        if res.audio_features is not None:
            res.audio_features = AudioFeatures.from_dict(res.audio_features)

        return res

    def get_audio_features(self):
        if self.audio_features is None:
            self.audio_features = AudioFeatures.from_dict(get_data(f'/audio-features/{self.id}'))
        return self.audio_features

    @classmethod
    def get_track(cls, id: str):
        return cls.from_dict(get_data(f'/tracks/{id}'))


@dataclass
class Playlist:
    total: int = 0
    name: str = ''
    id: str = ''
    songs: list[Track] | None = None

    @classmethod
    def from_dict(cls, env) -> "Playlist":
        res = cls(**{'total': env.get('tracks', {}).get('total', 0), **{
            k: v for k, v in env.items()
            if k in inspect.signature(cls).parameters
        }})

        if res.songs is not None:
            res.songs = [Track.from_dict(elem) for elem in res.songs]
            res.total = len(res.songs)

        return res

    @classmethod
    def get_playlist(cls, id: str):
        # return get_data(f'/playlists/{id}')
        return Playlist.from_dict(get_data(f'/playlists/{id}'))

    def get_tracks(self):
        if self.songs is not None:
            return self.songs

        data = []
        extras = {}

        for i in range(0, self.total, 50):
            data.extend(
                aux := get_data(f'/playlists/{self.id}/tracks?offset={i}&limit=50')['items']
            )

            ids = [elem['track']['id'] for elem in aux]

            audio_features = get_data('/audio-features?ids=' + ','.join(ids))

            for elem in audio_features['audio_features']:
                extras[elem['id']] = AudioFeatures.from_dict(elem)


        self.songs = [
            Track.from_dict(elem['track'])
            for elem in data
        ]

        for song in self.songs:
            song.audio_features = extras.get(song.id, None)

        return self.songs


# Dataset

In [None]:
from collections.abc import Collection
import pandas as pd

url = "https://drive.google.com/file/d/1X1cbT5S4buo8-ZXklZoXkV0OP7gXPA6j/view?usp=sharing"  # Wojtek's link

# Define the column names
col_names = ["id",
            "name",
            "artist",
            "genre",
            "popularity",
             "danceability",
             "energy",
             "key",
             "loudness",
             "mode",
             "speechiness",
             "acousticness",
             "instrumentalness",
             "liveness",
             "valence",
             "tempo",
             "duration_ms",
             "time_signature"]

# Read data from URL
# https://stackoverflow.com/questions/56611698/pandas-how-to-read-csv-file-from-google-drive-public
url='https://drive.google.com/uc?id=' + url.split('/')[-2]

songs_data = pd.read_csv(url, delimiter=',', names=col_names, quotechar='|', quoting = 0)[1:]

songs_data.head()


Unnamed: 0,id,name,artist,genre,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
1,2MuWTIM3b0YEAskbeeFE1i,Master Of Puppets,Metallica,metal,0.78,0.543,0.836,4,-9.11,0,0.0353,0.000647,0.431,0.153,0.56,105.173,515387,4
2,4OROzZUy6gOWN4UGQVaZMF,The Trooper - 2015 Remaster,Iron Maiden,metal,0.75,0.285,0.908,4,-4.601,0,0.0705,0.0318,0.00909,0.311,0.669,159.756,252733,4
3,51TG9W3y9qyO8BY5RXKgnZ,Symphony Of Destruction,Megadeth,metal,0.69,0.61,0.891,4,-5.452,0,0.0587,0.000805,0.0928,0.0657,0.51,139.903,246733,4
4,3Jnxngdff0lVu2rza1GVx6,Paranoid (2009 - Remaster),Black Sabbath,metal,0.81,0.423,0.685,4,-9.651,0,0.078,6.28e-05,0.00642,0.133,0.328,162.78,167720,4
5,7fcfNW0XxTWlwVlftzfDOR,Walk,Pantera,metal,0.75,0.43,0.704,1,-8.008,0,0.0804,0.000155,0.00221,0.0933,0.703,118.242,315120,4


In [None]:
songs_data.groupby(["genre"])["genre"].count()

genre
classical     263
disco_polo    443
electonica    354
folk          483
jazz          493
latino        414
metal         236
pop           451
rap           418
rock          349
Name: genre, dtype: int64

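## Splitting into sets, choosing columns

The following code partitions the data into a train set and a test set in an 80:20 ratio (`frac=0.8`). Columns from `popularity` onward are used as features and `genre` is the target.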


In [None]:
import numpy as np
import pandas as pd


X=songs_data.iloc[:, 4:]
y=songs_data["genre"]
train=songs_data.sample(frac=0.8,random_state=200)
test=songs_data.drop(train.index)
set_train = train.iloc[:, 4:]
set_test = test.iloc[:, 4:]
targets_train = train["genre"]
targets_test = test["genre"]
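
`sample(frac=0.8)` draws rows at random from the whole table, so the per-genre proportions in the train and test sets can drift slightly from those in the full data. One possible alternative, sketched below on toy labels, is to split every genre separately. The helper `stratified_split` is hypothetical and uses only the standard library.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=200):
    """Return (train_idx, test_idx) keeping roughly train_frac of every class in train."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_frac)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 50 "metal" rows followed by 100 "jazz" rows
labels = ["metal"] * 50 + ["jazz"] * 100
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
```

With scikit-learn, passing `stratify=` to `train_test_split` expresses the same idea in one call.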


We also prepare a standardised dataset, which we will later use to build the k-Nearest Neighbours model.

In [None]:
# Standardising data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
scaled_features = scaler.transform(X)
Xscal = pd.DataFrame(scaled_features, columns = X.columns)
Xscal.head()

Unnamed: 0,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,1.313416,-0.280982,0.79394,-0.335467,-0.06338,-1.228413,-0.580282,-0.919506,0.904709,-0.189672,0.28124,-0.597536,3.451645,0.232121
1,1.184196,-1.762904,1.044818,-0.335467,0.722773,-1.228413,-0.137197,-0.834521,-0.475292,0.915464,0.692404,1.302792,0.172202,0.232121
2,0.925756,0.103859,0.985583,-0.335467,0.5744,-1.228413,-0.285731,-0.919075,-0.20149,-0.800294,0.092632,0.611602,0.097288,0.232121
3,1.442635,-0.970248,0.267794,-0.335467,-0.157704,-1.228413,-0.04279,-0.921099,-0.484026,-0.329562,-0.5939,1.408074,-0.889252,0.232121
4,1.184196,-0.930041,0.333998,-1.171245,0.128756,-1.228413,-0.012579,-0.920848,-0.497796,-0.607245,0.820658,-0.142534,0.951153,0.232121
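
Note that the scaler above is fitted on all rows. When `Xscal` is cross-validated later, each held-out fold has therefore already contributed to the mean and variance used for scaling. A common way to avoid this, sketched here on synthetic data rather than on the Spotify set, is to put the scaler and the classifier into one pipeline so that scaling is refit inside every fold. The synthetic dataset and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature table (not the Spotify data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=14, n_informative=6,
    n_classes=3, random_state=200,
)

# The scaler is refit on the training part of each fold only
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
demo_scores = cross_val_score(knn_pipeline, X_demo, y_demo, cv=10)

print(demo_scores.mean())
```

For a dataset of this size the difference is likely to be small, since the column means and variances barely change between folds, but the pipeline form keeps the evaluation clean.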


# Data insight

## Entropy

In [None]:
import numpy as np
import pandas as pd

def entropy(y):
    # Shannon entropy H(Y) in bits, estimated from empirical frequencies
    p = pd.Series(y).value_counts(normalize=True)
    return np.sum(p * np.log2(1 / p))

def conditional_entropy(x, y):
    # H(X|Y) = sum over y of p(y) * H(X | Y=y)
    p = pd.Series(y).value_counts(normalize=True)
    df = pd.DataFrame({'x': x, 'y': y})
    # Each column of p_con holds the distribution of x given one value of y
    p_con = pd.crosstab(df['x'], df['y'], normalize='columns')
    # Zero cells give 0 * inf = NaN, which sum() skips (i.e. treats as 0)
    h_given_y = (p_con * np.log2(1 / p_con)).sum(axis=0)
    return np.sum(p * h_given_y)

for c in ['danceability','energy','acousticness', 'instrumentalness', 'valence', 'tempo', 'duration_ms']:
    print(c, conditional_entropy(songs_data[c], songs_data['popularity']).round(4))

def mutual_information(X, Y):
    # I(X;Y) = H(X) - H(X|Y), as defined in the lecture
    return entropy(X) - conditional_entropy(X, Y)

mutinfsort = []
for cname in songs_data.columns:
    mutinfsort.append( (mutual_information(songs_data['genre'], songs_data[cname]), cname))
    print(f"{cname.ljust(25)}:\t {mutual_information(songs_data['genre'], songs_data[cname]).round(decimals=3)}")

danceability 5.7729
energy 5.7708
acousticness 5.8233
instrumentalness 4.4156
valence 5.8053
tempo 5.8819
duration_ms 5.8869
id                       :	 3.287
name                     :	 3.269
artist                   :	 3.226
genre                    :	 3.287
popularity               :	 0.757
danceability             :	 1.426
energy                   :	 1.793
key                      :	 0.078
loudness                 :	 3.062
mode                     :	 0.029
speechiness              :	 1.509
acousticness             :	 2.436
instrumentalness         :	 2.047
liveness                 :	 1.288
valence                  :	 1.701
tempo                    :	 3.166
duration_ms              :	 3.207
time_signature           :	 0.066
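
These numbers should be read with some care. The `id` column scores 3.287, exactly the same as `genre` itself, even though a track id carries no musical information: when every value of a column is unique, knowing it pins down the genre of each row in the sample, so the estimated mutual information equals the full entropy of the genre. The high scores of near-continuous features such as `tempo`, `duration_ms` and `loudness` are probably inflated in the same way. The standalone sketch below reproduces the effect on toy data; it reimplements the three functions with the standard library and is not the notebook's code.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(values):
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X|Y): entropy of x within each group of equal y, weighted by group size."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[yi].append(xi)
    n = len(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def mutual_information(x, y):
    """I(X;Y) = H(X) - H(X|Y)."""
    return entropy(x) - conditional_entropy(x, y)

# Toy data: 100 rows, 4 equally frequent genres
genres = ["metal", "jazz", "pop", "rock"] * 25
unique_ids = list(range(len(genres)))   # one distinct value per row, like a track id
constant = [0] * len(genres)            # a column that never changes

h_genre = entropy(genres)
mi_ids = mutual_information(genres, unique_ids)
mi_const = mutual_information(genres, constant)

print(h_genre, mi_ids, mi_const)
```

Binning continuous features before computing mutual information is one way to make the scores comparable across columns.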


## Simple predictors

Before we start training advanced models, it is worth looking into the nature of the data.
As a start, we check how accurately pairs of genres can be separated and predicted using only one parameter at a time.

In [None]:
from sklearn.linear_model import LogisticRegression

genres: list[str] = [str(genre[0]) for genre in  songs_data.groupby(['genre'])['genre'].count().items()]
cols = col_names[4:]

def classify_genre(genre1: str, genre2: str, key: str):
    dataset = train.loc[train['genre'].isin([genre1, genre2])]
    testset = test.loc[test['genre'].isin([genre1, genre2])]

    clf = LogisticRegression(max_iter=1000).fit(dataset[key].to_numpy().reshape(-1, 1), dataset['genre'].to_numpy())

    res = clf.predict(testset[key].to_numpy().reshape(-1, 1))
    accu = sum(res == testset['genre']) / len(res)

    return accu

regress = []

for gen1 in genres:
    for gen2 in genres:
        if gen1 <= gen2:
            continue

        for col in cols:
            accu = classify_genre(gen1, gen2, col)
            regress.append((gen1, gen2, col, accu))
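
The nested loop above visits every unordered pair of genres once by skipping pairs with `gen1 <= gen2`. An equivalent way to enumerate them is `itertools.combinations`, shown here as a small standalone sketch with the genre and feature names copied from the notebook. It also makes the amount of work explicit. The only difference is the order of the two names within a pair.

```python
from itertools import combinations, product

genres = ["classical", "disco_polo", "electonica", "folk", "jazz",
          "latino", "metal", "pop", "rap", "rock"]
features = ["popularity", "danceability", "energy", "key", "loudness", "mode",
            "speechiness", "acousticness", "instrumentalness", "liveness",
            "valence", "tempo", "duration_ms", "time_signature"]

# Every unordered genre pair exactly once, no self-pairs
pairs = list(combinations(genres, 2))

# One logistic regression is fitted per (pair, feature) combination
tasks = [(g1, g2, col) for (g1, g2), col in product(pairs, features)]

print(len(pairs), len(tasks))  # 45 pairs, 630 fits
```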

In [None]:
res = sorted(regress, key=lambda e: e[3], reverse=True)[:30]

for (gen1, gen2, param, accu) in res:
    print(f"{gen1:15s} {gen2:15s} {param:15s} => {accu:.2f}")

metal           classical       acousticness    => 1.00
disco_polo      classical       energy          => 0.99
disco_polo      classical       acousticness    => 0.99
electonica      classical       acousticness    => 0.99
rock            classical       acousticness    => 0.99
jazz            disco_polo      acousticness    => 0.99
metal           jazz            acousticness    => 0.99
disco_polo      classical       loudness        => 0.99
rap             classical       acousticness    => 0.99
rock            classical       loudness        => 0.98
metal           classical       energy          => 0.98
jazz            electonica      acousticness    => 0.98
rock            jazz            acousticness    => 0.98
rock            classical       energy          => 0.97
rap             jazz            acousticness    => 0.97
latino          classical       energy          => 0.97
latino          classical       loudness        => 0.97
rap             classical       speechiness     

And look at that!

Some single features separate certain pairs of genres really accurately by themselves.
In fact, we can differentiate some pairs with near-perfect accuracy.

However, we can clearly see that most of the top predictors listed here
use acousticness as the parameter, and most involve classical music.

It seems that classical music really stands out in this dataset, which gives us the intuition that this genre in particular might be really easy to predict.


### Classical music predictor

To test that hypothesis, we used logistic regression to predict whether a track is a classical piece or not.

In [None]:
clf = LogisticRegression(max_iter=10_000).fit(set_train.to_numpy(), targets_train.to_numpy() == 'classical')

res = clf.predict(set_test.to_numpy())

accu = sum(res == (targets_test == 'classical')) / len(res)
false_positive = sum(((res) & (targets_test != 'classical'))) / len(res)
false_negative = sum(((res == False) & (targets_test == 'classical'))) / len(res)

print(
    f"Accuracy total: {accu:.3f} \n"
    f"False positives: {false_positive:.3f} \n"
    f"False negatives: {false_negative:.3f} \n"
)

Accuracy total: 0.931 
False positives: 0.000 
False negatives: 0.069 
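
These figures deserve a second look. Zero false positives together with a false-negative rate of 0.069 is what a model that (almost) never answers "classical" would produce, and classical tracks make up 263 of 3904 rows (about 6.7%) of the full dataset. The sketch below uses toy labels, not the notebook's predictions, to show how accuracy can look high in that situation while recall is zero. The helper `binary_report` is hypothetical.

```python
def binary_report(y_true, y_pred):
    """Accuracy, precision and recall for boolean labels, plus the majority-class baseline."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    tn = sum((not t) and (not p) for t, p in pairs)
    n = len(pairs)
    positives = tp + fn
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / positives if positives else 0.0,
        # Accuracy of always predicting the more frequent class
        "baseline": max(positives, n - positives) / n,
    }

# Toy data: 7 classical tracks out of 100, and a model that always says "not classical"
y_true = [True] * 7 + [False] * 93
y_pred = [False] * 100

report = binary_report(y_true, y_pred)
print(report)
```

If the same pattern holds on the real test set, checking recall for the classical class, or standardising the features before fitting, would be reasonable next steps.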



# Advanced models

To compare different methods we use prediction accuracy. Since our dataset isn't very large, we use 10-fold cross-validation and report the mean accuracy of each model.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

def Accuracy(real, pred):
    # Fraction of predictions that match the true labels
    return np.sum(real == pred) / len(real)


## Decision Tree



In [None]:
from sklearn.tree import DecisionTreeClassifier



clf_tree = DecisionTreeClassifier(random_state = 200,
                                  class_weight = None)

clf_tree_model = clf_tree.fit(set_train, targets_train)
y_pred_tree = clf_tree_model.predict(set_test)
Accuracy(targets_test, y_pred_tree)


0.5492957746478874

In [None]:
# cross-validation
from sklearn.model_selection import cross_val_score

clf_tree_cv = DecisionTreeClassifier(random_state = 200,
                                  class_weight = None)

scores_tree = cross_val_score(clf_tree_cv, X, y, cv=10)
print(scores_tree)
scores_tree.mean()

[0.46547315 0.43734015 0.54987212 0.57800512 0.56410256 0.58205128
 0.61025641 0.61538462 0.57435897 0.4974359 ]


0.5474280280674142
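
The per-fold scores above range from about 0.44 to 0.62, so the mean alone hides quite a lot of variation. A quick way to summarise that spread, using the ten values printed above, is to report the standard deviation next to the mean.

```python
from statistics import mean, stdev

# Fold accuracies copied from the decision-tree cross-validation output above
fold_scores = [0.46547315, 0.43734015, 0.54987212, 0.57800512, 0.56410256,
               0.58205128, 0.61025641, 0.61538462, 0.57435897, 0.4974359]

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)  # sample standard deviation across folds

print(f"{cv_mean:.3f} +/- {cv_std:.3f}")
```

Differences between models that are smaller than this spread should be treated with caution.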




## K-Nearest Neighbours

A search over $k = 1, \dots, 199$ with 10-fold cross-validation gives the best mean accuracy at $k=10$.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

dikt = {}

for k in range(1,200):
    knn_model = KNeighborsClassifier(n_neighbors = k)
    scores_knn = cross_val_score(knn_model, Xscal, y, cv=10)
    dikt[k] = scores_knn.mean()

m = max(dikt, key=dikt.get)
print(f"({m})", dikt[m])


(10) 0.6006820119352089


In [None]:
knn_score = dikt[m]

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf_forest = RandomForestClassifier(random_state = 200,
                                    n_estimators = 200,
                                    class_weight = "balanced_subsample")

clf_forest_model = clf_forest.fit(set_train, targets_train)
y_pred_forest = clf_forest_model.predict(set_test)
Accuracy(targets_test, y_pred_forest)

0.7272727272727273

In [None]:
rf_scores = cross_val_score(clf_forest, X, y, cv=10)
rf_scores
rf_scores.mean()

0.6918624172076857

## Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
clf_xgb = GradientBoostingClassifier(random_state=200, n_estimators=200)
clf_xgb_model = clf_xgb.fit(set_train, targets_train)
y_pred_xgb = clf_xgb_model.predict(set_test)
Accuracy(targets_test, y_pred_xgb)

0.7157490396927016

In [None]:
xgb_scores = cross_val_score(clf_xgb, X, y, cv=10)
xgb_scores
xgb_scores.mean()

0.6788091022362123

## PCA + KNN

In [None]:
from sklearn.decomposition import PCA


pca = PCA(n_components=10)
pca.fit(Xscal)

print(pca.explained_variance_ratio_)

[0.29429206 0.09173356 0.08423608 0.07658349 0.07156738 0.06644868
 0.06445775 0.06016565 0.05605801 0.04611817]
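
The first component carries about 29% of the variance and the rest is spread fairly evenly over the remaining ones. To read such output more easily, one can accumulate the ratios and ask how many components are needed to reach a given share. The sketch below does this with the values printed above; the helper `components_needed` is hypothetical.

```python
from itertools import accumulate

# Explained-variance ratios copied from the PCA output above
ratios = [0.29429206, 0.09173356, 0.08423608, 0.07658349, 0.07156738,
          0.06644868, 0.06445775, 0.06016565, 0.05605801, 0.04611817]

cumulative = list(accumulate(ratios))

def components_needed(threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for n, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return n
    return None  # threshold not reached with the listed components

print(round(cumulative[-1], 3), components_needed(0.8))
```

Ten components cover roughly 91% of the variance, and eight are needed to pass 80%, so this dataset does not compress into just a few directions.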


In [None]:
from sklearn.neighbors import KNeighborsClassifier

for i in range(1, len(col_names[4:])+1):
    pca = PCA(n_components=i)
    pca.fit(Xscal)
    X_trn_pca =(pca.transform(Xscal))

    m = 0  # best mean CV accuracy found so far
    k = 0  # number of neighbours that achieved it

    for j in range(1, 20):
        knn_model = KNeighborsClassifier(n_neighbors = j)
        scores_knn = cross_val_score(knn_model, X_trn_pca, y, cv=10)
        n = scores_knn.mean()
        if n > m:
            k = j
            m = n

    print(f"{i:2d} k={k:2d} m={m:.3f}")

 1 k=19 m=0.339
 2 k=17 m=0.465
 3 k=18 m=0.473
 4 k=14 m=0.503
 5 k=18 m=0.534
 6 k=19 m=0.534
 7 k=18 m=0.545
 8 k=18 m=0.544
 9 k=15 m=0.547
10 k=10 m=0.572
11 k=13 m=0.580
12 k=12 m=0.596
13 k=12 m=0.598
14 k=10 m=0.601


It turns out that 4 components are enough to reach 50% accuracy with the KNN classifier.
As the number of components grows, the result eventually matches the original KNN model at about 60% accuracy (with all 14 components).

# Results

In [None]:
results = {'Classifier': ["KNN", "Decision Tree", "Random Forest", "Gradient Boosting"],
           'Accuracy': [knn_score, scores_tree.mean(),rf_scores.mean(), xgb_scores.mean()]}

results_table = pd.DataFrame(data=results)

results_table

Unnamed: 0,Classifier,Accuracy
0,KNN,0.600682
1,Decision Tree,0.547428
2,Random Forest,0.691862
3,Gradient Boosting,0.678809




