

# **Will you skip this music track or not?**


The public part of the dataset consists of roughly 130 million listening sessions with associated user interactions on the Spotify service. 

The task is to predict whether individual tracks encountered in a listening session will be skipped by a particular user. In order to do this, complete information about the first half of a user’s listening session is provided, while the prediction is to be carried out on the second half. Participants have access to metadata, as well as acoustic descriptors, for all the tracks encountered in listening sessions.
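
To make the setup concrete, here is a minimal sketch of how one session could be divided into the observed first half and the second half to be predicted. The toy values are made up, and the handling of odd-length sessions (the extra track goes to the second half) is my assumption, not something stated in the challenge description above.

```python
import pandas as pd

# Toy session log; column names mirror the mini dataset, values are made up.
toy_log = pd.DataFrame({
    "session_id": ["s1"] * 6,
    "session_position": [1, 2, 3, 4, 5, 6],
    "not_skipped": [True, True, False, True, False, False],
})


def split_session(session: pd.DataFrame):
    """Return (observed first half, second half to predict) of one session.

    Assumption: for odd lengths the extra track goes to the second half.
    """
    session = session.sort_values("session_position")
    n_observed = len(session) // 2
    return session.iloc[:n_observed], session.iloc[n_observed:]


observed, to_predict = split_session(toy_log)
print(observed["session_position"].tolist())    # [1, 2, 3]
print(to_predict["session_position"].tolist())  # [4, 5, 6]
```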

https://www.aicrowd.com/challenges/spotify-sequential-skip-prediction-challenge

Brost, B., Mehrotra, R., & Jehan, T. (2019, May). The music streaming sessions dataset. In The World Wide Web Conference (pp. 2594-2600).



Because the entire dataset is too big for experimenting with data manipulation, Spotify provided a mini dataset for this purpose.

In this notebook, we wrangle the data to inspect its quality and engineer features for machine learning modeling.


# Mount Google drive to Colab

In [1]:

from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Capstone_SpotifyStreaming/notebooks

#!pip install featuretools==0.4.0
#!pip install -U featuretools
!pip install featuretools

import numpy as np
import pandas as pd
import featuretools as ft


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Capstone_SpotifyStreaming/notebooks
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/




In [2]:
# check the installed packages
!pip list -v 

Package                       Version                Location                               Installer
----------------------------- ---------------------- -------------------------------------- ---------
absl-py                       1.3.0                  /usr/local/lib/python3.7/dist-packages pip
aeppl                         0.0.33                 /usr/local/lib/python3.7/dist-packages pip
aesara                        2.7.9                  /usr/local/lib/python3.7/dist-packages pip
aiohttp                       3.8.3                  /usr/local/lib/python3.7/dist-packages pip
aiosignal                     1.3.1                  /usr/local/lib/python3.7/dist-packages pip
alabaster                     0.7.12                 /usr/local/lib/python3.7/dist-packages pip
albumentations                1.2.1                  /usr/local/lib/python3.7/dist-packages pip
altair                        4.2.0                  /usr/local/lib/python3.7/dist-packages pip
appdirs                     

In [3]:
%matplotlib inline

# Load the data and perform some data cleaning/re-coding as described in 2_mini_EDA

In [4]:
# load the track features and the session logs (mini versions)

tf_df = pd.read_csv('../data/raw/data/track_features/tf_mini.csv')
log_df = pd.read_csv('../data/raw/data/training_set/log_mini.csv')

In [5]:
# perform some data cleaning/re-coding as described in 2_mini_EDA

tf_df_dummy = pd.get_dummies(tf_df, columns=['key','time_signature','mode'])
log_df_dummy = pd.get_dummies(log_df.drop(columns = ['session_length',  'hist_user_behavior_reason_end', 'hist_user_behavior_n_seekfwd','hist_user_behavior_n_seekback']), columns=['hist_user_behavior_reason_start', 'context_type'])
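
As a small illustration of what `pd.get_dummies` does in the cell above, the sketch below one-hot encodes a toy track table. The rows and category values are made up for illustration; only the call itself matches the notebook.

```python
import pandas as pd

# Toy track table; 'key' and 'mode' stand in for the categorical columns.
toy_tracks = pd.DataFrame({
    "track_id": ["t_a", "t_b", "t_c"],
    "key": [0, 7, 0],
    "mode": ["major", "minor", "major"],
})

# Each listed column is replaced by one indicator column per distinct value;
# columns that are not listed (here 'track_id') are left untouched.
toy_dummy = pd.get_dummies(toy_tracks, columns=["key", "mode"])
print(toy_dummy.columns.tolist())
# ['track_id', 'key_0', 'key_7', 'mode_major', 'mode_minor']
```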


In [6]:
tf_df_dummy.head().T

Unnamed: 0,0,1,2,3,4
track_id,t_a540e552-16d4-42f8-a185-232bd650ea7d,t_67965da0-132b-4b1e-8a69-0ef99b32287c,t_0614ecd3-a7d5-40a1-816e-156d5872a467,t_070a63a0-744a-434e-9913-a97b02926a29,t_d6990e17-9c31-4b01-8559-47d9ce476df1
duration,109.706673,187.693329,160.839996,175.399994,369.600006
release_year,1950,1950,1951,1951,1951
us_popularity_estimate,99.975414,99.96943,99.602549,99.665018,99.991764
acousticness,0.45804,0.916272,0.812884,0.396854,0.728831
beat_strength,0.519497,0.419223,0.42589,0.400934,0.371328
bounciness,0.504949,0.54553,0.50828,0.35999,0.335115
danceability,0.399767,0.491235,0.491625,0.552227,0.483044
dyn_range_mean,7.51188,9.098376,8.36867,5.967346,5.802681
energy,0.817709,0.154258,0.358813,0.514585,0.721442


In [7]:
log_df_dummy.head().T

Unnamed: 0,0,1,2,3,4
session_id,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e
session_position,1,2,3,4,5
track_id_clean,t_0479f24c-27d2-46d6-a00c-7ec928f2b539,t_9099cd7b-c238-47b7-9381-f23f2c1d1043,t_fc5df5ba-5396-49a7-8b29-35d0d28249e0,t_23cff8d6-d874-4b20-83dc-94e450e8aa20,t_64f3743c-f624-46bb-a579-0f3f9a07a123
skip_1,False,False,False,False,False
skip_2,False,False,False,False,False
skip_3,False,False,False,False,False
not_skipped,True,True,True,True,True
context_switch,0,0,0,0,0
no_pause_before_play,0,1,1,1,1
short_pause_before_play,0,0,0,0,0


# **Calculate acoustic distance of each song from previous song(s)**

The track information dataframe contains an acoustic analysis of each track, including its scores on 8 acoustic features (see: https://benanne.github.io/2014/08/05/spotify-cnns.html). Within each session, I would therefore like to calculate the ***distance*** of each track to the other tracks.

The distance can be calculated in many ways, but here I would like to focus on *Euclidean* and *Mahalanobis* distances. 

Euclidean distance is simply the square root of the sum of squared differences between two vectors. The Mahalanobis distance can be thought of as the Euclidean distance normalized by the multidimensional statistical variance (the covariance matrix).

Both metrics were applied to calculate the distance of the current track to (1) the previous track and (2) the mean of all tracks in the session.
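
The two definitions can be sketched with NumPy as follows. The helper names `euclidean` and `mahalanobis` and the toy 2-dimensional "session" are mine, chosen for illustration; the notebook itself works with the 8 `acoustic_vector_*` columns.

```python
import numpy as np


def euclidean(u, v):
    """Square root of the summed squared differences between u and v."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


def mahalanobis(u, v, inv_cov):
    """sqrt((u - v)^T S^-1 (u - v)), where inv_cov is the inverse covariance S^-1."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(diff @ inv_cov @ diff))


# Toy "session": 6 tracks described by 2 acoustic dimensions (made-up values).
tracks = np.array([
    [0.0, 0.0],
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 5.0],
    [4.0, 3.0],
    [5.0, 6.0],
])
center = tracks.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(tracks, rowvar=False))

# Distance of each track to the previous track and to the session mean.
for i in range(1, len(tracks)):
    to_previous = euclidean(tracks[i], tracks[i - 1])
    to_mean = mahalanobis(tracks[i], center, inv_cov)
    print(i, round(to_previous, 3), round(to_mean, 3))
```

With an identity matrix as `inv_cov`, the Mahalanobis distance reduces to the Euclidean distance, which is a convenient sanity check.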

In [8]:
# extract the 8 acoustic features of each track
df = log_df_dummy.merge(tf_df_dummy[['track_id','acoustic_vector_0', 'acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3','acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7']], left_on = 'track_id_clean', right_on = 'track_id')
df.sort_values(by = ['session_id', 'session_position'],inplace = True)
df.head().T

Unnamed: 0,0,45,50,327,353
session_id,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e
session_position,1,2,3,4,5
track_id_clean,t_0479f24c-27d2-46d6-a00c-7ec928f2b539,t_9099cd7b-c238-47b7-9381-f23f2c1d1043,t_fc5df5ba-5396-49a7-8b29-35d0d28249e0,t_23cff8d6-d874-4b20-83dc-94e450e8aa20,t_64f3743c-f624-46bb-a579-0f3f9a07a123
skip_1,False,False,False,False,False
skip_2,False,False,False,False,False
skip_3,False,False,False,False,False
not_skipped,True,True,True,True,True
context_switch,0,0,0,0,0
no_pause_before_play,0,1,1,1,1
short_pause_before_play,0,0,0,0,0
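
One thing worth checking after a merge like the one above is whether any log rows were lost. `DataFrame.merge` defaults to an inner join, so log rows whose track has no entry in the track table would be dropped silently. The sketch below, on made-up toy frames, uses `how="left"` with `indicator=True` to count such rows; this is an optional check I am adding, not part of the original cell.

```python
import pandas as pd

toy_log = pd.DataFrame({
    "session_id": ["s1", "s1", "s2", "s2"],
    "session_position": [1, 2, 1, 2],
    "track_id_clean": ["t_b", "t_a", "t_a", "t_x"],  # 't_x' has no track features
})
toy_tf = pd.DataFrame({
    "track_id": ["t_a", "t_b"],
    "acoustic_vector_0": [0.1, 0.2],
})

# how="left" keeps every log row; indicator=True adds a '_merge' column
# that flags rows without a match in the track table.
merged = toy_log.merge(
    toy_tf, left_on="track_id_clean", right_on="track_id",
    how="left", indicator=True,
)
merged = merged.sort_values(["session_id", "session_position"])
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(len(toy_log), len(merged), n_unmatched)  # 4 4 1
```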


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 167880 entries, 0 to 134498
Data columns (total 39 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   session_id                                  167880 non-null  object 
 1   session_position                            167880 non-null  int64  
 2   track_id_clean                              167880 non-null  object 
 3   skip_1                                      167880 non-null  bool   
 4   skip_2                                      167880 non-null  bool   
 5   skip_3                                      167880 non-null  bool   
 6   not_skipped                                 167880 non-null  bool   
 7   context_switch                              167880 non-null  int64  
 8   no_pause_before_play                        167880 non-null  int64  
 9   short_pause_before_play                     167880 non-null  int64  
 

In [10]:
df.columns

Index(['session_id', 'session_position', 'track_id_clean', 'skip_1', 'skip_2',
       'skip_3', 'not_skipped', 'context_switch', 'no_pause_before_play',
       'short_pause_before_play', 'long_pause_before_play',
       'hist_user_behavior_is_shuffle', 'hour_of_day', 'date', 'premium',
       'hist_user_behavior_reason_start_appload',
       'hist_user_behavior_reason_start_backbtn',
       'hist_user_behavior_reason_start_clickrow',
       'hist_user_behavior_reason_start_endplay',
       'hist_user_behavior_reason_start_fwdbtn',
       'hist_user_behavior_reason_start_playbtn',
       'hist_user_behavior_reason_start_remote',
       'hist_user_behavior_reason_start_trackdone',
       'hist_user_behavior_reason_start_trackerror', 'context_type_catalog',
       'context_type_charts', 'context_type_editorial_playlist',
       'context_type_personalized_playlist', 'context_type_radio',
       'context_type_user_collection', 'track_id', 'acoustic_vector_0',
       'acoustic_vector_1', 'ac

In [11]:
session_id = pd.unique(df['session_id'])
print('number of sessions in this mini dataset:',len(session_id))


number of sessions in this mini dataset: 10000


In [12]:
# Build grouping keys that combine the session with each skip flag:
# the session id where the flag is True, and an empty string where it is False
df['session_id_skip_1'] = df['session_id'] * df['skip_1']
df['session_id_skip_2'] = df['session_id'] * df['skip_2']
df['session_id_skip_3'] = df['session_id'] * df['skip_3']
df['session_id_not_skipped'] = df['session_id'] * df['not_skipped']
df.head(10).T

Unnamed: 0,0,45,50,327,353,475,537,540,541,601
session_id,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e,0_00006f66-33e5-4de7-a324-2d18e439fc1e
session_position,1,2,3,4,5,6,7,8,9,10
track_id_clean,t_0479f24c-27d2-46d6-a00c-7ec928f2b539,t_9099cd7b-c238-47b7-9381-f23f2c1d1043,t_fc5df5ba-5396-49a7-8b29-35d0d28249e0,t_23cff8d6-d874-4b20-83dc-94e450e8aa20,t_64f3743c-f624-46bb-a579-0f3f9a07a123,t_c815228b-3212-4f9e-9d4f-9cb19b248184,t_e23c19f5-4c32-4557-aa44-81372c2e3705,t_0be6eced-f56f-48bd-8086-f2e0b760fdee,t_f3ecbd3b-9e8e-4557-b8e0-39cfcd7e65dd,t_2af4dfa0-7df3-4b7e-b7ab-353ba48237f9
skip_1,False,False,False,False,False,False,True,True,False,True
skip_2,False,False,False,False,False,False,True,True,True,True
skip_3,False,False,False,False,False,True,True,True,True,True
not_skipped,True,True,True,True,True,False,False,False,False,False
context_switch,0,0,0,0,0,0,0,0,0,0
no_pause_before_play,0,1,1,1,1,1,1,1,1,1
short_pause_before_play,0,0,0,0,0,0,0,0,0,0
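
The multiplication in the cell above relies on Python string semantics: `"s1" * True` is `"s1"` and `"s1" * False` is `""`. A more explicit way to build the same kind of key is `Series.where`, sketched below on a toy frame (the data and the `n_skipped_so_far` column are made up for illustration). Note that every row whose flag is False ends up with the same empty key, so those rows fall into one group that spans all sessions.

```python
import pandas as pd

toy = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s2", "s2"],
    "session_position": [1, 2, 3, 1, 2],
    "skip_1": [False, True, True, True, False],
})

# Keep the session id where the track was skipped, an empty string otherwise.
toy["session_id_skip_1"] = toy["session_id"].where(toy["skip_1"], "")

# Running count of skipped tracks within each session, using the new key.
skipped = toy[toy["session_id_skip_1"] != ""]
toy.loc[skipped.index, "n_skipped_so_far"] = (
    skipped.groupby("session_id_skip_1").cumcount() + 1
)
print(toy)
```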


In [13]:
# # calculate the Mahalanobis distance and the Euclidean distance (stored in the 'acous_rmse_*' columns) of each track's acoustic vector to the mean of each session
# from scipy.spatial import distance
# singular_s_id = []

# for s_id in session_id:
#   temp_data = df.loc[df['session_id'] == s_id,['acoustic_vector_0', 'acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3','acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7']]
#   temp_data_noskip = df.loc[(df['session_id'] == s_id) & (df['not_skipped'] == True),['acoustic_vector_0', 'acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3','acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7']]
  

#   acou_center = temp_data.mean(axis = 0) # the multidimensional mean of the acoustic factors
#   acou_center_noskip = temp_data_noskip.mean(axis = 0) # the multidimensional mean of the acoustic factors among the non-skipped tracks
  
#   df.loc[temp_data.index, 'acous_rmse_to_mean'] = np.sqrt(np.square(temp_data - acou_center).sum(axis = 1))
#   df.loc[temp_data_noskip.index, 'acous_rmse_to_mean_noskip'] = np.sqrt(np.square(temp_data_noskip - acou_center_noskip).sum(axis = 1))
#   df.loc[temp_data.index, 'acous_rmse_to_previous'] = np.sqrt(np.square(temp_data.diff()).sum(axis = 1))
#   iv = []
#   try:
#     iv = np.linalg.inv(np.array(np.cov(temp_data, rowvar = False))) # compute the covariance matrix, and then take its inverse
#   except: # The inverse of the covariance matrix was singular for some sessions due to whatever reason. In those cases, adding a small Gaussian noise can resolve the issue.
#     iv = np.linalg.inv(np.array(np.cov(temp_data + np.random.normal(0,1e-5,temp_data.shape), rowvar = False)))
#     singular_s_id.append(s_id)

#   for n in range(len(temp_data)):
#       df.loc[temp_data.index[n], 'acous_mah_dist_to_mean'] = distance.mahalanobis(temp_data.iloc[n,:], acou_center, iv)
#       if n > 0:
#         df.loc[temp_data.index[n], 'acous_mah_dist_to_previous'] = distance.mahalanobis(temp_data.iloc[n,:], temp_data.iloc[n-1,:], iv)

      



In [14]:
# df.head(10)

In [15]:
# # It appears that the singular sessions have a small number of tracks, which makes sense mathematically.
# for s_id in singular_s_id:
#   temp_data = df.loc[df['session_id'] == s_id,['acoustic_vector_0', 'acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3','acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7']]
#   print(len(temp_data))
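
The reason short sessions produce a singular covariance matrix is that a sample covariance estimated from n tracks has rank at most n - 1, so with 8 acoustic dimensions at least 9 tracks are needed for it to be invertible. The sketch below demonstrates this on random, made-up data. It also shows the Moore-Penrose pseudo-inverse as one possible alternative to adding noise; that is a swapped-in technique, not what the commented-out cell does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 8


def cov_rank(n_tracks):
    """Rank of the 8 x 8 sample covariance of n_tracks random acoustic vectors."""
    session = rng.normal(size=(n_tracks, n_features))  # made-up data
    cov = np.cov(session, rowvar=False)
    return int(np.linalg.matrix_rank(cov))


for n_tracks in (5, 9, 20):
    rank = cov_rank(n_tracks)
    print(n_tracks, rank, "singular" if rank < n_features else "invertible")

# Alternative to adding noise (an assumption, not the notebook's method):
# the pseudo-inverse is defined even when the covariance matrix is singular.
short_session = rng.normal(size=(5, n_features))
pinv_cov = np.linalg.pinv(np.cov(short_session, rowvar=False))
```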

In [16]:
# # print out a singular session's data for visual inspection

# temp_singular_data = df.loc[df['session_id'] == singular_s_id[0],['acoustic_vector_0', 'acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3','acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7']]

# import seaborn as sns
# import matplotlib.pylab as plt
# %matplotlib inline
  
# ax = sns.heatmap( temp_singular_data , linewidth = 0.5 , cmap = 'coolwarm' )
  
# plt.title( singular_s_id[0] )
# plt.show()


In [17]:
# take out the acoustic feature columns for now; they will be combined back later.
log_df_dummy2 = df.drop(columns = ['acoustic_vector_0', 'acoustic_vector_1', 'acoustic_vector_2', 'acoustic_vector_3','acoustic_vector_4', 'acoustic_vector_5', 'acoustic_vector_6','acoustic_vector_7'])

In [18]:
df.columns

Index(['session_id', 'session_position', 'track_id_clean', 'skip_1', 'skip_2',
       'skip_3', 'not_skipped', 'context_switch', 'no_pause_before_play',
       'short_pause_before_play', 'long_pause_before_play',
       'hist_user_behavior_is_shuffle', 'hour_of_day', 'date', 'premium',
       'hist_user_behavior_reason_start_appload',
       'hist_user_behavior_reason_start_backbtn',
       'hist_user_behavior_reason_start_clickrow',
       'hist_user_behavior_reason_start_endplay',
       'hist_user_behavior_reason_start_fwdbtn',
       'hist_user_behavior_reason_start_playbtn',
       'hist_user_behavior_reason_start_remote',
       'hist_user_behavior_reason_start_trackdone',
       'hist_user_behavior_reason_start_trackerror', 'context_type_catalog',
       'context_type_charts', 'context_type_editorial_playlist',
       'context_type_personalized_playlist', 'context_type_radio',
       'context_type_user_collection', 'track_id', 'acoustic_vector_0',
       'acoustic_vector_1', 'ac

# **Use featuretools to do automatic feature engineering**

In [19]:
# First, initialize an EntitySet with an id
es = ft.EntitySet(id="spotify_data")

In [20]:
from woodwork.logical_types import Categorical, PostalCode

es = es.add_dataframe(
    dataframe_name="tf",
    dataframe=tf_df_dummy,
    index="track_id",
)

es

Entityset: spotify_data
  DataFrames:
    tf [Rows: 50704, Columns: 46]
  Relationships:
    No relationships

In [21]:
es['tf'].ww.schema

Unnamed: 0_level_0,Logical Type,Semantic Tag(s)
Column,Unnamed: 1_level_1,Unnamed: 2_level_1
track_id,Unknown,['index']
duration,Double,['numeric']
release_year,Integer,['numeric']
us_popularity_estimate,Double,['numeric']
acousticness,Double,['numeric']
beat_strength,Double,['numeric']
bounciness,Double,['numeric']
danceability,Double,['numeric']
dyn_range_mean,Double,['numeric']
energy,Double,['numeric']


In [22]:
# add dataframe
# the 'session_position' contains order information within each session

es = es.add_dataframe(
    dataframe_name="log", dataframe=log_df_dummy2, make_index = True, index="event_id", time_index="session_position",
)

es

Entityset: spotify_data
  DataFrames:
    tf [Rows: 50704, Columns: 46]
    log [Rows: 167880, Columns: 36]
  Relationships:
    No relationships

In [23]:
es['log'].ww.schema

Unnamed: 0_level_0,Logical Type,Semantic Tag(s)
Column,Unnamed: 1_level_1,Unnamed: 2_level_1
event_id,Integer,['index']
session_id,Categorical,['category']
session_position,Integer,"['time_index', 'numeric']"
track_id_clean,Unknown,[]
skip_1,Boolean,[]
skip_2,Boolean,[]
skip_3,Boolean,[]
not_skipped,Boolean,[]
context_switch,Integer,['numeric']
no_pause_before_play,Integer,['numeric']


In [24]:
# When two DataFrames have a one-to-many relationship, we call the “one” DataFrame, the “parent DataFrame”. A relationship between a parent and child is defined like this:
# (parent_dataframe, parent_column, child_dataframe, child_column)
es = es.add_relationship("tf", "track_id", "log", "track_id_clean")
es

Entityset: spotify_data
  DataFrames:
    tf [Rows: 50704, Columns: 46]
    log [Rows: 167880, Columns: 36]
  Relationships:
    log.track_id_clean -> tf.track_id
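
A parent/child relationship like this only makes sense if the parent key is unique and every child key exists in the parent. The sketch below shows how these two conditions could be checked in plain pandas before adding the relationship. The toy frames are made up, and the check is my addition rather than something featuretools requires you to run.

```python
import pandas as pd

parent = pd.DataFrame({"track_id": ["t_a", "t_b", "t_c"]})
child = pd.DataFrame({"track_id_clean": ["t_a", "t_a", "t_c", "t_b"]})

# The parent key must be unique ...
parent_is_unique = parent["track_id"].is_unique
# ... and every child key should appear in the parent.
child_is_covered = bool(child["track_id_clean"].isin(parent["track_id"]).all())
print(parent_is_unique, child_is_covered)  # True True

# pandas can enforce the many-to-one constraint during a merge; it raises
# pandas.errors.MergeError if the right-hand key is duplicated.
joined = child.merge(
    parent, left_on="track_id_clean", right_on="track_id",
    how="left", validate="many_to_one",
)
```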

In [None]:
# see this guide for specifying groupby options: https://featuretools.alteryx.com/en/stable/guides/specifying_primitive_options.html
# 'session_id' is specified as a GroupBy option because we care about what happened within each session.
# The tf (track information) dataframe does not contain any order or session information, so it does not need to be specified as a GroupBy option.

groupby_columns = ["session_id", "session_id_skip_1", "session_id_skip_2", "session_id_skip_3", "session_id_not_skipped"]
primitive_names = ["diff", "cum_sum", "cum_count", "cum_mean", "cum_min", "cum_max"]

feature_matrix, feature_defs = ft.dfs(
    entityset=es, target_dataframe_name="log", n_jobs=-1,
    groupby_trans_primitives=["Diff", "CumSum", "CumCount", "CumMean", "CumMin", "CumMax"],
    # every groupby primitive gets the same options
    primitive_options={
        name: {"include_groupby_columns": {"log": groupby_columns}, "ignore_groupby_dataframes": ["tf"]}
        for name in primitive_names
    },
    ignore_columns={"log": ["skip_1", "skip_2", "skip_3", "not_skipped"]},
)

feature_matrix

Perhaps you already have a cluster running?
Hosting the HTTP server on port 37013 instead
  f"Port {expected} is already in use.\n"


EntitySet scattered to 2 workers in 10 seconds
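
For intuition about what the groupby transform primitives requested above compute, here are rough pandas counterparts on a toy frame. This is a sketch of the idea only: the toy data and column names are made up, and details such as whether the library's count starts at 0 or 1 are assumptions that should be checked against the generated features.

```python
import pandas as pd

toy = pd.DataFrame({
    "session_id": ["s1", "s1", "s1", "s2", "s2"],
    "session_position": [1, 2, 3, 1, 2],
    "duration": [100.0, 200.0, 150.0, 50.0, 70.0],
}).sort_values(["session_id", "session_position"])

grouped = toy.groupby("session_id")["duration"]
toy["diff"] = grouped.diff()                 # change from the previous track
toy["cum_sum"] = grouped.cumsum()            # running total within the session
toy["cum_count"] = grouped.cumcount() + 1    # tracks seen so far (1-based here)
toy["cum_mean"] = toy["cum_sum"] / toy["cum_count"]
toy["cum_min"] = grouped.cummin()
toy["cum_max"] = grouped.cummax()
print(toy)
```

Every statistic restarts at the first track of a session, which is the behavior the GroupBy option is meant to provide.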




In [None]:
feature_defs

In [None]:
# save feature matrix

feature_matrix.to_csv('../data/processed/feature_matrix.csv')



In [None]:
feature_matrix.ww.schema

In [None]:
# load the feature matrix to check whether it was properly saved.

feature_matrix2 = pd.read_csv('../data/processed/feature_matrix.csv')
feature_matrix2.head().T
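
When checking a CSV round trip, keep in mind that `to_csv` writes the index as the first column, and `read_csv` returns it as an ordinary column unless `index_col` is given. CSV is also plain text, so any typing information attached to the frame is not stored. The sketch below runs a round trip in memory on a made-up toy frame and compares the result with the original.

```python
import io

import pandas as pd

toy_matrix = pd.DataFrame(
    {"session_id": ["s1", "s1"], "premium": [True, False], "duration": [0.5, 0.25]},
    index=pd.Index([0, 1], name="event_id"),
)

buffer = io.StringIO()
toy_matrix.to_csv(buffer)   # the index is written as the first column
buffer.seek(0)

# Without index_col, 'event_id' would come back as an ordinary column.
reloaded = pd.read_csv(buffer, index_col="event_id")
pd.testing.assert_frame_equal(toy_matrix, reloaded)
```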

In [None]:
# inspect the data of a session

a = feature_matrix2[feature_matrix2['session_id'] == '0_00006f66-33e5-4de7-a324-2d18e439fc1e']

a.columns

In [None]:
# inspect whether the featuretools-generated features took the session and the order information into account. The answer appears to be yes.

a[['tf.speechiness', 'DIFF(tf.speechiness) by session_id', 'session_id', 'session_position']]
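
Instead of inspecting the columns by eye, the same question can be answered programmatically by recomputing the within-session difference in pandas and comparing it with the generated column. The helper `check_groupby_diff` below is hypothetical (my own name) and is demonstrated on made-up values; on the real data it would be called with `feature_matrix2` and the column names shown above.

```python
import numpy as np
import pandas as pd


def check_groupby_diff(frame, value_col, diff_col, group_col, order_col):
    """Return True if diff_col equals the within-group difference of value_col."""
    ordered = frame.sort_values([group_col, order_col])
    expected = ordered.groupby(group_col)[value_col].diff()
    return bool(np.allclose(expected, ordered[diff_col], equal_nan=True))


toy = pd.DataFrame({
    "session_id": ["s1", "s1", "s2", "s2"],
    "session_position": [1, 2, 1, 2],
    "tf.speechiness": [0.10, 0.25, 0.40, 0.35],
})
toy["DIFF(tf.speechiness) by session_id"] = [np.nan, 0.15, np.nan, -0.05]

print(check_groupby_diff(
    toy, "tf.speechiness", "DIFF(tf.speechiness) by session_id",
    "session_id", "session_position",
))  # True
```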

# Feature standardization and the training/testing data split will be performed at the model fitting stage.
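
As a preview of that stage, here is a minimal sketch of one way to do it: split on whole sessions so that no session contributes to both sets, then estimate the scaling statistics on the training rows only. The toy data, the 25% test fraction, and the choice to split by session are all my assumptions and not decisions made in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "session_id": ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"],
    "duration": [100.0, 200.0, 150.0, 50.0, 70.0, 90.0, 300.0, 120.0],
})

# Split on whole sessions (assumption: 25% of the sessions go to the test set).
rng = np.random.default_rng(0)
sessions = toy["session_id"].unique()
test_sessions = rng.choice(sessions, size=len(sessions) // 4, replace=False)
is_test = toy["session_id"].isin(test_sessions)
train, test = toy[~is_test].copy(), toy[is_test].copy()

# Estimate mean and standard deviation on the training rows only, then reuse them.
mean = train["duration"].mean()
std = train["duration"].std(ddof=0)
train["duration_z"] = (train["duration"] - mean) / std
test["duration_z"] = (test["duration"] - mean) / std
```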
