<a href="https://colab.research.google.com/github/balazsivanyi/ML_miniproject/blob/main/ML_miniproject_balazsivanyi_CLEAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session-aware music recommendation ML algorithm on Spotify datasets



```
# Balazs Andras Ivanyi, Aalborg University Copenhagen
```



### Brief description of project

Originally, I wanted to build a session-based music recommendation application that takes one song as input and produces a playlist (a set of songs) as output. Recent studies have found that deep learning-based “neural” approaches tend to perform worse on classical session-based recommendation tasks than simpler algorithms such as kNN. The purpose was to evaluate a kNN session-based recommender on Spotify's Million Playlist Dataset and compare it to state-of-the-art DL-based recommendation benchmarks. However, I realised that it would be more suitable to start with a less complex implementation and focus on music streaming skip prediction first.

## Install and setup phase

Importing libraries and dependencies for the project. 

In [None]:
%matplotlib inline

import numpy as np 
import matplotlib as mpl 
import matplotlib.pyplot as plt 
import pandas as pd
import seaborn as sb

Setting up the final dataframe from Spotify's skip prediction [challenge](https://www.aicrowd.com/challenges/spotify-sequential-skip-prediction-challenge). The dataset is split into training and test sets, but both are large (60 GB and 14 GB respectively). The tracks' audio features are stored in separate files. I therefore first merged the two track-data files, and used only one segment of the 60 GB of user interaction logs.

In [None]:
#loading in track audio features data
trackData1 = pd.read_csv('/content/drive/MyDrive/ML_miniproject_2021/tf_000000000000.csv')                           
trackData2 = pd.read_csv('/content/drive/MyDrive/ML_miniproject_2021/tf_000000000001.csv')

#combine into one dataframe                               
trackData = pd.concat([trackData1, trackData2], ignore_index=True)
print('Total number of tracks: {}'.format(len(trackData)))

Total number of tracks: 3706388


In [None]:
# Spotify streaming session data
userData = pd.read_csv('/content/drive/MyDrive/ML_miniproject_2021/log_0_20180815_000000000000.csv')
print('Total number of user interaction logs: {}'.format(len(userData)))

#check if streaming data is not missing any songs
set(userData.track_id_clean).issubset(set(trackData.track_id))

Total number of user interaction logs: 3105679


True
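
The full interaction log is far too large to load at once, which is why only one segment is used here. If more of it were needed, one option would be to stream a file in chunks and keep only the columns of interest. The sketch below illustrates the idea on a small in-memory CSV; the rows, the column subset, and the chunk size are assumptions for illustration, not what this notebook used.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for one log file (hypothetical rows).
csv_text = (
    "session_id,session_position,track_id_clean,not_skipped\n"
    "s1,1,t1,True\n"
    "s1,2,t2,False\n"
    "s2,1,t1,False\n"
    "s2,2,t3,True\n"
    "s2,3,t2,False\n"
)

wanted_columns = ["session_id", "track_id_clean", "not_skipped"]

# Read two rows at a time and keep only the wanted columns.
chunks = pd.read_csv(io.StringIO(csv_text), usecols=wanted_columns, chunksize=2)

# Each chunk is a small DataFrame; here they are simply concatenated again.
sample = pd.concat(chunks, ignore_index=True)
print(sample.shape)
```

On the real files, each chunk could be filtered or subsampled before concatenation so that memory use stays bounded.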

The tracks' audio features come from the Spotify API's acoustic features [database](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features). They describe the acoustic properties of each song in a streaming session, and can be analysed to see whether they help predict skipping behaviour. The columns of the acoustic dataset are printed below; the meaning of each feature is described in Spotify's API documentation.

In [None]:
#check columns of track dataframe
trackData.columns

Index(['track_id', 'duration', 'release_year', 'us_popularity_estimate',
       'acousticness', 'beat_strength', 'bounciness', 'danceability',
       'dyn_range_mean', 'energy', 'flatness', 'instrumentalness', 'key',
       'liveness', 'loudness', 'mechanism', 'mode', 'organism', 'speechiness',
       'tempo', 'time_signature', 'valence', 'acoustic_vector_0',
       'acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3',
       'acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6',
       'acoustic_vector_7'],
      dtype='object')

The user interaction dataset records users' interactions during music streaming sessions on Spotify. Each row carries the session's unique identifier, the length of the session, the track played, and that track's position in the session:
```
session_id, session_length, track_id_clean, session_position
```
Moreover, the dataset records whether a specific track was skipped in the streaming session, and whether it was skipped right after, shortly after, or long after it started playing. These and all remaining features are described on the skip prediction challenge's [website](https://aicrowd-production.s3.eu-central-1.amazonaws.com/dataset_files/challenge_204/7dcfad42-65c6-4481-abe8-5a44339fa305_Dataset%20Description.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJ6IZH6GWKDCCDFAQ%2F20220103%2Feu-central-1%2Fs3%2Faws4_request&X-Amz-Date=20220103T140044Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=9a5942aa3186479c13868238ed4466eab638c051e77d1bd699e8535268e559f4).


In [None]:
#check columns of user interaction dataframe
userData.columns

Index(['session_id', 'session_position', 'session_length', 'track_id_clean',
       'skip_1', 'skip_2', 'skip_3', 'not_skipped', 'context_switch',
       'no_pause_before_play', 'short_pause_before_play',
       'long_pause_before_play', 'hist_user_behavior_n_seekfwd',
       'hist_user_behavior_n_seekback', 'hist_user_behavior_is_shuffle',
       'hour_of_day', 'date', 'premium', 'context_type',
       'hist_user_behavior_reason_start', 'hist_user_behavior_reason_end'],
      dtype='object')

Then the two dataframes were merged by their shared column, the track IDs.

In [None]:
#renaming columns to match in the two dataframes
userData = userData.rename(columns={'track_id_clean': 'track_id'})

#matching and merging dataframes by track_id column
mergedData = userData.merge(trackData, how='left', on="track_id")

#remapping major and minor from string to binary values
mergedData['mode'] = mergedData['mode'].map({'major': 1, 'minor': 0})

mergedData.columns

Index(['session_id', 'session_position', 'session_length', 'track_id',
       'skip_1', 'skip_2', 'skip_3', 'not_skipped', 'context_switch',
       'no_pause_before_play', 'short_pause_before_play',
       'long_pause_before_play', 'hist_user_behavior_n_seekfwd',
       'hist_user_behavior_n_seekback', 'hist_user_behavior_is_shuffle',
       'hour_of_day', 'date', 'premium', 'context_type',
       'hist_user_behavior_reason_start', 'hist_user_behavior_reason_end',
       'duration', 'release_year', 'us_popularity_estimate', 'acousticness',
       'beat_strength', 'bounciness', 'danceability', 'dyn_range_mean',
       'energy', 'flatness', 'instrumentalness', 'key', 'liveness', 'loudness',
       'mechanism', 'mode', 'organism', 'speechiness', 'tempo',
       'time_signature', 'valence', 'acoustic_vector_0', 'acoustic_vector_1',
       'acoustic_vector_2', 'acoustic_vector_3', 'acoustic_vector_4',
       'acoustic_vector_5', 'acoustic_vector_6', 'acoustic_vector_7'],
      dtype='object')
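
A left merge silently keeps log rows whose track has no match and duplicates rows if the track table contains repeated IDs. The sketch below shows, on toy frames, how pandas' `validate` and `indicator` arguments to `merge` can make both situations visible; the toy rows (including the unmatched `t9`) are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the log and track tables.
logs = pd.DataFrame({
    "track_id": ["t1", "t2", "t1", "t9"],
    "session_position": [1, 2, 3, 4],
})
tracks = pd.DataFrame({
    "track_id": ["t1", "t2", "t3"],
    "duration": [200.5, 180.0, 240.2],
})

# validate raises if track_id is not unique on the right-hand side;
# indicator adds a "_merge" column telling where each row came from.
merged = logs.merge(
    tracks, how="left", on="track_id", validate="many_to_one", indicator=True
)

# Log rows without matching track metadata.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["track_id"].tolist())
```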

In [None]:
#deleting last columns, which won't be used for feature engineering
mergedData = mergedData.drop(['acoustic_vector_0', 'acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3', 'acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6', 'acoustic_vector_7'], axis=1)

#displaying an excerpt of the dataframe for checking
mergedData.head(10)

Unnamed: 0,session_id,session_position,session_length,track_id,skip_1,skip_2,skip_3,not_skipped,context_switch,no_pause_before_play,short_pause_before_play,long_pause_before_play,hist_user_behavior_n_seekfwd,hist_user_behavior_n_seekback,hist_user_behavior_is_shuffle,hour_of_day,date,premium,context_type,hist_user_behavior_reason_start,hist_user_behavior_reason_end,duration,release_year,us_popularity_estimate,acousticness,beat_strength,bounciness,danceability,dyn_range_mean,energy,flatness,instrumentalness,key,liveness,loudness,mechanism,mode,organism,speechiness,tempo,time_signature,valence
0,31_0000b0c5-94b8-426b-87e2-ef81510b9b17,1,20,t_86abc9b1-2a71-41d8-ab97-ac97ea20276a,True,True,True,False,0,0,0,0,0,0,True,8,2018-08-14,True,user_collection,fwdbtn,fwdbtn,200.546677,2006,99.997576,5.9e-05,0.205253,0.191648,0.390629,4.778084,0.963439,0.930136,0.06890018,10,0.140413,-4.378,0.777542,0,0.157301,0.077345,167.065002,4,0.363948
1,31_0000b0c5-94b8-426b-87e2-ef81510b9b17,2,20,t_33a133e6-240c-467d-a5c5-a6729a545cc2,True,True,True,False,0,0,1,1,0,0,True,8,2018-08-14,True,user_collection,fwdbtn,fwdbtn,246.51973,2015,97.391548,0.000445,0.197883,0.17279,0.228356,4.468809,0.890128,0.968709,0.2377608,11,0.307806,-2.373,0.334852,0,0.470331,0.06011,138.798996,5,0.4759
2,31_0000b0c5-94b8-426b-87e2-ef81510b9b17,3,20,t_cd87b117-d9d0-4562-b469-65ae0e88f8f5,True,True,True,False,0,1,0,0,0,0,True,8,2018-08-14,True,user_collection,fwdbtn,fwdbtn,230.213333,2015,99.961404,0.090172,0.367402,0.333184,0.54119,5.807304,0.643394,0.982674,9.8464e-09,3,0.152956,-5.517,0.627907,0,0.270725,0.03661,90.060997,4,0.491455
3,31_0000b0c5-94b8-426b-87e2-ef81510b9b17,4,20,t_de6bfde1-10b3-4984-add7-b41050bc9353,True,True,True,False,0,1,0,0,0,0,True,8,2018-08-14,True,user_collection,fwdbtn,fwdbtn,207.786667,2017,99.999173,0.422664,0.268346,0.280567,0.277216,5.783823,0.39362,1.011296,6.70491e-07,8,0.095243,-8.903,0.169231,1,0.659099,0.033571,86.777,3,0.226621
4,31_0000b0c5-94b8-426b-87e2-ef81510b9b17,5,20,t_01d7104d-d28c-4c56-9012-d22ef2b8bdc9,False,False,False,True,0,1,0,0,0,0,True,8,2018-08-14,True,user_collection,fwdbtn,trackdone,195.518997,2018,99.999788,0.002812,0.496197,0.445518,0.630482,6.607257,0.694285,1.031873,1.266271e-10,11,0.071866,-6.257,0.773723,0,0.160015,0.025284,97.004997,4,0.215933
5,31_0000b0c5-94b8-426b-87e2-ef81510b9b17,6,20,t_ff674955-20ad-48bf-8494-d5fbe9dd7fac,False,False,True,False,0,0,1,1,0,0,True,8,2018-08-14,True,user_collection,trackdone,fwdbtn,245.053329,2011,93.858501,0.068239,0.349181,0.350332,0.602128,6.290171,0.795717,0.963909,0.1196051,0,0.149576,-3.657,0.570766,0,0.307326,0.103283,126.059998,4,0.264787
6,31_0000b0c5-94b8-426b-87e2-ef81510b9b17,7,20,t_0478077e-cc90-48f5-a989-21714d69151d,True,True,True,False,0,1,0,0,0,0,True,8,2018-08-14,True,user_collection,fwdbtn,fwdbtn,174.226669,2018,99.895486,0.59015,0.543267,0.535417,0.53707,7.862614,0.596715,1.040544,1.141131e-05,7,0.100791,-7.641,0.532663,0,0.532297,0.0718,79.959999,4,0.388015
7,31_0000b0c5-94b8-426b-87e2-ef81510b9b17,8,20,t_3c70d8ac-b601-4f8e-be57-cfdb7f119183,True,True,True,False,0,1,0,0,0,0,True,8,2018-08-14,True,user_collection,fwdbtn,fwdbtn,227.533325,2017,99.999812,0.627452,0.496016,0.589231,0.670188,9.350102,0.654282,1.025306,1.02117e-06,4,0.071018,-5.944,0.820244,1,0.461523,0.153234,180.024002,4,0.437593
8,31_0000b0c5-94b8-426b-87e2-ef81510b9b17,9,20,t_048f0e89-bdcb-4d33-bcea-a4f4c3591cc4,True,True,True,False,0,1,0,0,0,0,True,8,2018-08-14,True,user_collection,fwdbtn,fwdbtn,268.866669,2017,99.999286,0.053429,0.461715,0.462736,0.588504,7.27311,0.730975,1.014365,1.496789e-10,2,0.307783,-6.343,0.510145,1,0.348434,0.086839,87.907997,4,0.190725
9,31_0000b0c5-94b8-426b-87e2-ef81510b9b17,10,20,t_52fc9bcf-ce50-43fc-9498-c2c8421a33e7,True,True,True,False,0,1,0,0,0,0,True,8,2018-08-14,True,user_collection,fwdbtn,fwdbtn,222.653336,2018,99.99971,0.044075,0.57509,0.579509,0.736817,8.41769,0.636002,1.039552,6.660252e-05,11,0.350031,-4.546,0.771015,0,0.164889,0.043701,105.004997,4,0.564775


In [None]:

#check details of merged dataframe
mergedData.info()
mergedData.isna().sum()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3105679 entries, 0 to 3105678
Data columns (total 42 columns):
 #   Column                           Dtype  
---  ------                           -----  
 0   session_id                       object 
 1   session_position                 int64  
 2   session_length                   int64  
 3   track_id                         object 
 4   skip_1                           bool   
 5   skip_2                           bool   
 6   skip_3                           bool   
 7   not_skipped                      bool   
 8   context_switch                   int64  
 9   no_pause_before_play             int64  
 10  short_pause_before_play          int64  
 11  long_pause_before_play           int64  
 12  hist_user_behavior_n_seekfwd     int64  
 13  hist_user_behavior_n_seekback    int64  
 14  hist_user_behavior_is_shuffle    bool   
 15  hour_of_day                      int64  
 16  date                             object 
 17  premium 

session_id                         0
session_position                   0
session_length                     0
track_id                           0
skip_1                             0
skip_2                             0
skip_3                             0
not_skipped                        0
context_switch                     0
no_pause_before_play               0
short_pause_before_play            0
long_pause_before_play             0
hist_user_behavior_n_seekfwd       0
hist_user_behavior_n_seekback      0
hist_user_behavior_is_shuffle      0
hour_of_day                        0
date                               0
premium                            0
context_type                       0
hist_user_behavior_reason_start    0
hist_user_behavior_reason_end      0
duration                           0
release_year                       0
us_popularity_estimate             0
acousticness                       0
beat_strength                      0
bounciness                         0
d

The two dataframes were successfully merged without any missing values in the merged dataframe, so the initial data exploration and preliminary feature engineering could begin.

In [None]:
%matplotlib inline

import numpy as np 
import matplotlib as mpl 
import matplotlib.pyplot as plt 
import pandas as pd
import seaborn as sb
from sklearn.decomposition import PCA

## Exploratory Data Analysis & Feature engineering

I first checked the ratio of skipped to not-skipped songs in the dataset. The classes are imbalanced, with almost twice as many skipped songs, so I needed to balance them later for training. I was also curious how many individual listening sessions the dataset contains: 188450.

In [None]:
#number of skipped songs
mergedData.not_skipped.value_counts()

False    2021717
True     1083962
Name: not_skipped, dtype: int64

In [None]:
#number of different user listening sessions
mergedData.session_id.nunique()

188450

### Pairwise plot of audio features

![img](content/drive/MyDrive/ML_miniproject_2021/pairwise_plot.png)


In [None]:
#audio_features = ['duration', 'release_year', 'us_popularity_estimate', 'acousticness', 'beat_strength', 'bounciness', 'danceability', 'dyn_range_mean',
#'energy', 'flatness', 'instrumentalness', 'key', 'liveness', 'loudness', 'mechanism', 'mode', 'organism', 'speechiness', 'tempo',
#'time_signature', 'valence', 'not_skipped']
#sb.pairplot(data = balancedData.sample(frac=.0001, replace=False, random_state=7)[audio_features], hue='not_skipped', palette='Set2', height=2, plot_kws={"s":7});


I drew a pairwise plot of the acoustic features only, to identify meaningful trends and correlations between them. There were some correlations, such as between danceability and bounciness, but I didn't find them strong enough to use for dimensionality reduction. The acoustic features alone also did not reveal any clear trend in whether songs were skipped.

### Minimum-Redundancy-Maximum-Relevance

As k-nearest neighbour (kNN) models become computationally expensive on large datasets, and their accuracy can degrade with a high number of features, I further reduced the dimensionality through feature selection. I used the Minimum-Redundancy-Maximum-Relevance (mRMR) algorithm, which has gained interest in recent years for its effectiveness and [simplicity](https://eng.uber.com/research/maximum-relevance-and-minimum-redundancy-feature-selection-methods-for-a-marketing-machine-learning-platform/). 

Essentially, it was designed back in [2003](https://www.researchgate.net/publication/4033100_Minimum_Redundancy_Feature_Selection_From_Microarray_Gene_Expression_Data) to find the smallest relevant subset of features for a given machine learning problem, which makes it a minimal-optimal feature selection algorithm. I used it to quickly select the top 10 relevant features from my dataset.
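
To make the idea concrete, here is a minimal sketch of one common mRMR variant: at each step it picks the feature with the highest ratio of relevance (F-statistic against the target) to redundancy (mean absolute correlation with the features already selected). The function name `mrmr_sketch` and the toy data are my own illustration; this is not the implementation of the `mrmr` package used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif


def mrmr_sketch(X, y, k):
    """Greedy mRMR sketch: relevance / mean redundancy (illustrative only)."""
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    # Absolute pairwise correlations, floored to avoid division by zero.
    corr = X.corr().abs().clip(lower=1e-5)
    selected, remaining = [], list(X.columns)
    for _ in range(k):
        if selected:
            redundancy = corr.loc[remaining, selected].mean(axis=1)
        else:
            # Nothing selected yet: rank by relevance alone.
            redundancy = pd.Series(1.0, index=remaining)
        best = (relevance[remaining] / redundancy).idxmax()
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_copy" is almost identical to "a", so it is redundant.
rng = np.random.default_rng(0)
n = 2000
s1, s2 = rng.normal(size=n), rng.normal(size=n)
X_toy = pd.DataFrame({
    "a": s1,
    "a_copy": s1 + rng.normal(scale=0.01, size=n),
    "b": s2,
    "noise": rng.normal(size=n),
})
y_toy = (s1 + s2 > 0).astype(int)

selected = mrmr_sketch(X_toy, y_toy, k=2)
print(selected)
```

In this toy setup the redundancy term keeps `a` and `a_copy` from both being chosen, which is the behaviour that distinguishes mRMR from a plain top-k relevance ranking.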

In [None]:
pip install git+https://github.com/smazzanti/mrmr

Collecting git+https://github.com/smazzanti/mrmr
  Cloning https://github.com/smazzanti/mrmr to /tmp/pip-req-build-atwk290e
  Running command git clone -q https://github.com/smazzanti/mrmr /tmp/pip-req-build-atwk290e


In [None]:
from mrmr import mrmr_classif

#creating data for dimensionality reduction
#note: X still contains the target column 'not_skipped' (and the related skip_1..skip_3 flags)
X = mergedData
y = mergedData['not_skipped'].squeeze()

#using mrmr classification to pick the 10 top-ranked features
selected_features = mrmr_classif(X, y, K = 10)
print(selected_features)

In [None]:
#keeping selected features only in the dataframe
mergedDataReduced = mergedData[[col for col in mergedData.columns if col in selected_features]]
mergedDataReduced.info()

## Simple kNN implementation with reduced data

After selecting the relevant set of features for the kNN implementation, I integrated Weights & Biases to monitor the training process and tune hyperparameters. Before training, the dataset had to be balanced to avoid biased results.

In [None]:
%%capture
!pip install wandb

In [None]:
import wandb
wandb.login()

In [None]:
#balancing out dataset on skipped - not skipped songs
balancedDataReduced = mergedDataReduced.groupby('not_skipped', group_keys=False).apply(lambda x: x.sample(1000000))
balancedDataReduced.not_skipped.value_counts()
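
The balancing step above is random undersampling: each class is cut down to the same number of rows. A minimal sketch on toy data follows, assuming one wants the sample size derived from the smaller class and a fixed seed for reproducibility; it uses pandas' `DataFrameGroupBy.sample` rather than `apply`, and the toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced toy data: 3 not-skipped rows, 7 skipped rows.
toy = pd.DataFrame({
    "not_skipped": [True] * 3 + [False] * 7,
    "duration": list(range(10)),
})

# Size of the smaller class decides how many rows to draw per class.
n_per_class = toy["not_skipped"].value_counts().min()

# Draw the same number of rows from each class, reproducibly.
balanced = toy.groupby("not_skipped").sample(n=n_per_class, random_state=0)
print(balanced["not_skipped"].value_counts())
```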

In [None]:
#splitting inputs and target variable
X = balancedDataReduced.drop(columns = ['not_skipped']) #input
y = balancedDataReduced['not_skipped'].values #target

X.info()
print(y[0:])

For training, I used the Scikit-learn library, which offers a straightforward framework for classification tasks. To train my kNN model, some variables first had to be encoded as categorical codes so that they were in a suitable numerical format.

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

#encoding target variable to binary
lb = LabelEncoder()
y_encoded = lb.fit_transform(y)
print(y_encoded)

#converting string into categorical data
X.session_id = pd.Categorical(X.session_id)
X['session_id'] = X.session_id.cat.codes

X.track_id = pd.Categorical(X.track_id)
X['track_id'] = X.track_id.cat.codes

X.hist_user_behavior_reason_start = pd.Categorical(X.hist_user_behavior_reason_start)
X['hist_user_behavior_reason_start'] = X.hist_user_behavior_reason_start.cat.codes

X.hist_user_behavior_reason_end = pd.Categorical(X.hist_user_behavior_reason_end)
X['hist_user_behavior_reason_end'] = X.hist_user_behavior_reason_end.cat.codes

#checking if datatypes are appropriate for training
X.info()

In [None]:
#splitting dataset between test and training 
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.1, random_state=1, stratify=y)

#creating KNN classifier
knn = KNeighborsClassifier(n_neighbors = 5)

#fitting the classifier to the data
knn.fit(X_train,y_train)

#showing model predictions on the test data
y_pred = knn.predict(X_test)

#checking accuracy of model on the test data
knn.score(X_test, y_test)

y_probas = knn.predict_proba(X_test)

In [None]:
knn.score(X_test, y_test)

After training, the kNN model did not reach sufficient accuracy: it performed only slightly better than a coin toss. Training was also slow even with only 10 features, and more complex sequential and vector-based kNN [models](https://github.com/rn5l/session-rec/blob/master/algorithms/knn/vsknn.py) could have been even more computationally heavy. After some more research, I found that other classification models could be better suited to my task.
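
One thing that may contribute to a weak kNN result is that the classifier is distance-based: integer codes for string categories impose an arbitrary ordering, and unscaled numeric columns with large ranges dominate the distance. The sketch below shows one way this could be handled with a Scikit-learn pipeline that scales numeric columns and one-hot encodes a categorical one. The toy data, the column names, and the category values are assumptions for illustration; I did not run this on the Spotify data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with two numeric and one categorical feature.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "duration": rng.normal(200, 40, n),
    "session_position": rng.integers(1, 21, n),
    "reason_start": rng.choice(["fwdbtn", "trackdone", "clickrow"], n),
})
target = (toy["reason_start"] == "trackdone").to_numpy()

# Scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["duration", "session_position"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["reason_start"]),
])
knn_pipeline = make_pipeline(preprocess, KNeighborsClassifier(n_neighbors=5))

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.25, random_state=0, stratify=target
)
knn_pipeline.fit(X_tr, y_tr)
accuracy = knn_pipeline.score(X_te, y_te)
print(f"Toy accuracy: {accuracy:.3f}")
```

Because the preprocessing lives inside the pipeline, it is fitted on the training split only and then applied unchanged to the test split.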

## Moving away from kNN implementation and using full dataset again

Inspired by the implementation in [this](https://github.com/a-poor/spotify-skip-prediction/blob/master/lgbm_model_single_history.ipynb) notebook, and after studying decision trees further, I decided to move towards Gradient Boosted Trees. 

### Feature analysis with correlation matrices

Before designing the Gradient Boosted Trees model, I wanted to find out whether some features correlate with each other and could be used for dimensionality reduction. However, the correlations in the dataset were still sparse, and as Gradient Boosted Trees are usually capable of handling high-dimensional data, I didn't pursue this path further.

In [None]:
#string-valued columns (track_id, date) are left out, as a correlation is only defined for numeric data
user_cols = ['not_skipped', 'session_position', 'session_length',
             'context_switch', 'premium', 'skip_1', 'skip_2', 'skip_3',
             'hist_user_behavior_is_shuffle',
             'no_pause_before_play', 'short_pause_before_play',
             'long_pause_before_play',
             'hist_user_behavior_n_seekfwd',
             'hist_user_behavior_n_seekback',
             'hour_of_day', 'duration', 'release_year',
             'us_popularity_estimate']

plt.figure(figsize=(16, 6))
heatmap = sb.heatmap(mergedData[user_cols].corr(), vmin=-1, vmax=1, annot=False);
heatmap.set_title('Correlation Heatmap of user interaction logs', fontdict={'fontsize':15}, pad=12);


In [None]:
plt.figure(figsize=(16, 6))
heatmap = sb.heatmap(mergedData[['not_skipped', 'acousticness', 'beat_strength', 'bounciness', 'danceability', 'dyn_range_mean',
              'energy', 'flatness', 'instrumentalness', 'liveness', 'loudness',
              'mechanism', 'organism', 'speechiness', 'tempo', 'valence']].corr(), vmin=-1, vmax=1, annot=False);
heatmap.set_title('Correlation Heatmap of acoustic features', fontdict={'fontsize':15}, pad=12);


### Importing previously cleaned data

For simpler handling of the dataset, I exported it to a single .csv file and imported it again. This .csv file is attached to the submission.

In [None]:
data = pd.read_csv('/content/drive/MyDrive/spotify_skip.csv')

In [None]:
#displaying the first few rows
data.head(5)

### Reformatting features' datatypes for optimal training performance

As in the previous trial with kNN, I reformatted specific features so that all of them would be suitable for training.

In [None]:
#feature variable types

#categorical
cat_variable = ['mode', 'context_type', 'hist_user_behavior_reason_start', 'hist_user_behavior_reason_end']
for c in cat_variable:
    data[c] = data[c].astype('category')

#boolean
bool_variable = ['context_switch', 'no_pause_before_play', 'short_pause_before_play', 'long_pause_before_play',
                'not_skipped', 'premium', 'hist_user_behavior_is_shuffle']
for b in bool_variable:
    data[b] = data[b].astype('bool')

#ID
id_variable = ['session_id', 'track_id', 'date']
for i in id_variable:
    le = LabelEncoder()
    data[i] = le.fit_transform(data[i])

data.head(30)

In [None]:
#importing LightGBM and sklearn utilities for training
import lightgbm as lgbm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
import warnings

In [None]:
#reparsing data so skipped — not skipped ratio is balanced for training
balancedData = data.groupby('not_skipped', group_keys=False).apply(lambda x: x.sample(1000000))

balancedData.not_skipped.value_counts()

In [None]:
#splitting features for training
audio_features = ['duration', 'us_popularity_estimate', 'time_signature', 'key', 'mode', 'acousticness', 'beat_strength', 'bounciness', 
              'danceability', 'dyn_range_mean', 'energy', 'flatness', 'instrumentalness', 'liveness', 'loudness', 'mechanism', 
              'organism', 'speechiness', 'tempo', 'valence']

user_interaction = ['context_switch', 'no_pause_before_play', 'short_pause_before_play', 'long_pause_before_play', 
                    'hist_user_behavior_n_seekfwd', 'hist_user_behavior_n_seekback', 'hist_user_behavior_is_shuffle', 
                    'hour_of_day', 'premium', 'context_type']

### Training with Gradient Boosted Trees algorithm (LightGBM) — first audio features only

Gradient boosted trees are an ensemble of decision trees, similarly to Random Forests. Both algorithms combine multiple decision trees to generalise better than a single tree. While Random Forests use a method called bagging to train the individual trees independently, in parallel, Gradient Boosted Trees use a method called boosting: trees that are weak learners (typically shallow trees) are combined sequentially, and each new tree is fitted to correct the errors of the trees before it. This gives the approach high model capacity, and LightGBM in particular advertises faster training speed and higher efficiency, better accuracy, and the ability to handle large-scale data. This suits my dataset; however, gradient boosted trees are prone to overfitting. I used LightGBM, which is a gradient boosting [library](https://lightgbm.readthedocs.io/en/latest/index.html) for tree-based learning algorithms.
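
The sequential error-correcting idea can be sketched in a few lines. The example below is a deliberately simplified illustration for squared-error regression with depth-1 trees from Scikit-learn; the function name `boost`, the toy data, and the learning rate are assumptions, and it is not how LightGBM is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem: noisy sine curve.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=500)


def boost(X, y, n_rounds, learning_rate=0.1):
    """Fit stumps one after another, each on the current residuals."""
    prediction = np.full(len(y), y.mean())  # start from a constant model
    errors = [np.mean((y - prediction) ** 2)]
    for _ in range(n_rounds):
        residuals = y - prediction  # what the ensemble still gets wrong
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
        prediction += learning_rate * stump.predict(X)
        errors.append(np.mean((y - prediction) ** 2))
    return errors


errors = boost(X_toy, y_toy, n_rounds=50)
print(f"Training MSE: {errors[0]:.3f} -> {errors[-1]:.3f}")
```

Each round shrinks the training error a little; with enough rounds the ensemble can also fit noise, which is why boosted trees need regularisation against overfitting.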





First I wanted to see whether I could train a model based solely on the audio features in the dataset (excluding user interaction data). Even though I did not expect high accuracy from this, I still managed to improve on my previous kNN implementation.

In [None]:
#separating target variable (not_skipped) and input features from dataframe
X_input = balancedData[audio_features] #first training only on audio characters
y_target = balancedData['not_skipped']

In [None]:
#splitting dataframe between train, validation, and test
X_train, X_test, y_train, y_test = train_test_split(X_input, y_target, train_size=0.9)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.8)

X_train.shape, X_val.shape, X_test.shape

In [None]:
#importing weights & biases project
!pip install wandb
import wandb

wandb.init(project="spotify_skip_predict", entity="101010")

In [None]:
#training light gradient boosted tree model on audio features only
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
y_probas = model.predict_proba(X_test)
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
print('Accuracy: {:.3f}'.format(accuracy_score(y_val, y_pred)))

### Training with user interactions and audio features — all the previously selected features

Using the whole dataset this time for training:

In [None]:
#separating target variable (not_skipped) and input features from dataframe
X_input = balancedData[audio_features + user_interaction] #training on all the features
y_target = balancedData['not_skipped']

In [None]:
#splitting dataframe between train, validation, and test
X_train, X_test, y_train, y_test = train_test_split(X_input, y_target, train_size=0.9)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.8)

X_train.shape, X_val.shape, X_test.shape

In [None]:
#training light gradient boosted tree model on all features
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
y_probas = model.predict_proba(X_test)
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
print('Accuracy: {:.3f}'.format(accuracy_score(y_val, y_pred)))
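
The cells above compute `importances` and import precision, recall, and F1, but only print validation accuracy. The sketch below shows how ranked importances and a fuller set of metrics on a held-out test split could be reported. To stay self-contained it uses Scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in for the LightGBM model; the feature names are hypothetical, and LightGBM's default importances are split counts rather than the normalised values shown here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names.
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=6, n_informative=3, random_state=0
)
X_toy = pd.DataFrame(X_toy, columns=[f"feature_{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Importances labelled by column name and sorted, most important first.
ranked = pd.Series(clf.feature_importances_, index=X_toy.columns)
ranked = ranked.sort_values(ascending=False)
print(ranked)

# Several metrics on the held-out test split, not just accuracy.
y_hat = clf.predict(X_te)
report = {
    "accuracy": accuracy_score(y_te, y_hat),
    "precision": precision_score(y_te, y_hat),
    "recall": recall_score(y_te, y_hat),
    "f1": f1_score(y_te, y_hat),
}
print(report)
```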

### Hyperparameter tuning with Weights & Biases

As training on the whole dataset with LightGBM did not drastically improve the model's accuracy, I decided to tune hyperparameters with Weights & Biases. Hyperparameter tuning selects the hyperparameter values of the model itself that achieve the best accuracy. For the optimisation process I used grid search, one of the most traditional ways of hyperparameter tuning. Based on my [research](https://towardsdatascience.com/hyperparameter-tuning-to-reduce-overfitting-lightgbm-5eb81a0b464e), I selected parameters that help balance achieving sufficient accuracy against overfitting.

In [None]:
#grid search sweep for hyperparameter tuning
sweep_config = {
    "name": "audio_user_feat_eng_2",
    "method": "grid",
    "metric": {
        "name": "accuracy",
        "goal": "maximize"
        },
    "parameters": {
        "max_depth": {
            "values": [3, 5, 10, 15]
        },
        "num_leaves": {
            "values": [10, 20, 30]
        },
        "learning_rate": {
            "values": [.05, .1, .2]
        },
        "subsample": {
            "values": [1, .8, .5]
        }
    }
}
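
A grid search trains one model per combination of values, so it helps to count the combinations before starting the sweep. The sketch below enumerates the grid defined above and flags combinations where `num_leaves` exceeds `2 ** max_depth`: a tree of depth `d` can have at most `2 ** d` leaves, so in those runs the larger `num_leaves` values cannot take effect. As far as I understand LightGBM's scikit-learn wrapper, `subsample` also only has an effect when `subsample_freq` is positive, which the cells here leave at its default; treat that as something to verify.

```python
from itertools import product

# Same values as in sweep_config above.
grid = {
    "max_depth": [3, 5, 10, 15],
    "num_leaves": [10, 20, 30],
    "learning_rate": [0.05, 0.1, 0.2],
    "subsample": [1, 0.8, 0.5],
}

# One dict per grid point.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

# Grid points whose num_leaves cannot be reached at the given depth.
capped = [c for c in combos if c["num_leaves"] > 2 ** c["max_depth"]]

print(len(combos), len(capped))  # 108 27
```

Here all 27 capped combinations belong to `max_depth = 3`, where three different `num_leaves` settings collapse to the same effective tree size.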

In [None]:
def train():
    config_defaults = {
      "max_depth": 3,
      "num_leaves": 10,
      "learning_rate": .05,
      "subsample": 1,
    }
    wandb.init(project="spotify_skip_predict", entity="101010", config=config_defaults)
    config = wandb.config

    X_data = balancedData[audio_features + user_interaction]
    Y_data = balancedData['not_skipped']

    #splitting data into train, validation, and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_data, Y_data, train_size=0.9)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.8)

    #fitting model on training set
    model = lgbm.LGBMClassifier(max_depth=config.max_depth,
                                num_leaves=config.num_leaves,
                                learning_rate=config.learning_rate, 
                                subsample=config.subsample)
    model.fit(X_train, y_train)

    #making predictions on validation set
    y_pred = model.predict(X_val)

    #evaluating predictions
    accuracy = accuracy_score(y_val, y_pred)
    print(f"Accuracy: {int(accuracy * 100.)}%")
    wandb.log({"accuracy": accuracy})

In [None]:
sweep_id = wandb.sweep(sweep_config, entity="101010", project="spotify_skip_predict")

In [None]:
wandb.agent(sweep_id, train)

Through hyperparameter tuning I managed to increase validation accuracy only minimally, from 60.6% to 60.87%. The parallel coordinates plot and the accuracy scores can be seen here:

https://wandb.ai/101010/spotify_skip_predict/reports/Hyperparameter-tuning--VmlldzoxMzg4NzI5?accessToken=807v08fdor2kc9cwnd2btjn17rin98q21pob5ql65oo99qfrrlckx2i3m6o8932k


### Adding extra features via feature engineering

As the hyperparameter tuning only minimally improved the accuracy, I wanted to add extra features via feature engineering. To link back to the nature of the original dataset, I wanted to take the sequential character of music streaming into account, since a more sequence-aware feature could increase the accuracy. Thus, I decided to add two new features to the dataset: whether the previous song was skipped, and the previous track's length.

In [None]:
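#adding if previous track was skipped or not
#note: the first track of each session has no previous track (NaN), which the bool cast turns into True
data['prev_not_skipped'] = data.groupby(['session_id'])['not_skipped'].shift(1)
data['prev_not_skipped'] = data['prev_not_skipped'].astype('bool')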

#adding the previous track's length
data['prev_duration'] = data.groupby(['session_id'])['duration'].shift(1)
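
Shifting within a session leaves the first track of every session without a previous value. The toy example below shows where those gaps appear and one possible way to handle them explicitly: treat a missing previous track as "not played through" and keep a separate flag for session starts. The column `is_first_in_session` and the toy sessions are my own illustration, not part of the dataset.

```python
import pandas as pd

# Two hypothetical sessions, "a" with three tracks and "b" with two.
toy = pd.DataFrame({
    "session_id": ["a", "a", "a", "b", "b"],
    "not_skipped": [True, False, True, False, False],
})

# Previous track's label within the same session; NaN at session starts.
prev = toy.groupby("session_id")["not_skipped"].shift(1)

# Explicit flag for rows that have no previous track.
toy["is_first_in_session"] = prev.isna()

# True only where a previous track exists and was not skipped.
toy["prev_not_skipped"] = prev.eq(True)

print(toy)
```

Whether a missing previous track should count as skipped, not skipped, or its own category is a modelling choice; the flag keeps that information available to the model either way.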

In [None]:
engineered_features = ['prev_not_skipped', 'prev_duration', 'session_position']

X_data = data[audio_features + user_interaction + engineered_features]
Y_data = data['not_skipped']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_data, Y_data, train_size=0.9)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, train_size=0.8)

X_train.shape, X_val.shape, X_test.shape

In [None]:
#applying previously tuned hyperparameters
model = lgbm.LGBMClassifier(
    max_depth = 10,
    num_leaves = 20,
    learning_rate = .2,
    subsample = 0.8
)

model.fit(X_train, y_train)
y_pred = model.predict(X_val)
print('Accuracy: {:.3f}'.format(accuracy_score(y_val, y_pred)))

After testing the model with the new features and the optimised hyperparameters, I managed to achieve 80.6% accuracy. This can be considered a decent result given the scope of the problem and the [leaderboard](https://www.aicrowd.com/challenges/spotify-sequential-skip-prediction-challenge/leaderboards) of the skip prediction challenge.

However, I believe that my implementation may still be overfitting somewhat, which could be determined through another round of hyperparameter tuning and a more detailed evaluation of the results.
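
Two checks could make that evaluation more informative. First, the final model above is trained on `data` rather than `balancedData`, and by the counts shown earlier about 65% of the rows are skipped, so a majority-class baseline is a useful reference point. Second, a random row-wise split puts tracks from the same session in both training and test sets; splitting by `session_id` avoids that. The sketch below illustrates both on synthetic data, using Scikit-learn's `GroupShuffleSplit` and `DummyClassifier`; the toy frame is hypothetical and nothing here was run on the Spotify data.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 60 sessions, roughly 35% not-skipped rows.
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "session_id": rng.integers(0, 60, n),
    "duration": rng.normal(200, 40, n),
    "not_skipped": rng.random(n) < 0.35,
})

# Split so that every session lands entirely in train or entirely in test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["session_id"]))
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(train[["duration"]], train["not_skipped"])
baseline_accuracy = baseline.score(test[["duration"]], test["not_skipped"])
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A model's accuracy is then best read relative to this baseline and measured on sessions it has never seen.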