

# **Will you skip this music track or not?**


The public part of the dataset consists of roughly 130 million listening sessions with associated user interactions on the Spotify service. 

The task is to predict whether individual tracks encountered in a listening session will be skipped by a particular user. In order to do this, complete information about the first half of a user’s listening session is provided, while the prediction is to be carried out on the second half. Participants have access to metadata, as well as acoustic descriptors, for all the tracks encountered in listening sessions.
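The first-half/second-half setup described above can be sketched with the `session_position` and `session_length` columns of the session log. This is only an illustration on toy data: the helper name `split_session_halves` is hypothetical, and using integer division for odd-length sessions is an assumption, not something the challenge description here confirms.

```python
import pandas as pd


def split_session_halves(log: pd.DataFrame):
    """Split log rows into the observed first half and the half to predict.

    Assumption: for odd session lengths the first half is the shorter one
    (integer division); the official split may differ.
    """
    half = log["session_length"] // 2
    observed = log[log["session_position"] <= half]
    to_predict = log[log["session_position"] > half]
    return observed, to_predict


# Toy log with one session of length 4 and one of length 5 (made-up values).
toy_log = pd.DataFrame({
    "session_id": ["s1"] * 4 + ["s2"] * 5,
    "session_position": [1, 2, 3, 4, 1, 2, 3, 4, 5],
    "session_length": [4] * 4 + [5] * 5,
})

observed, to_predict = split_session_halves(toy_log)
print(observed.groupby("session_id").size().to_dict())    # {'s1': 2, 's2': 2}
print(to_predict.groupby("session_id").size().to_dict())  # {'s1': 2, 's2': 3}
```

Every row lands in exactly one of the two frames, so the split never leaks second-half rows into the observed part.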

https://www.aicrowd.com/challenges/spotify-sequential-skip-prediction-challenge

Brost, B., Mehrotra, R., & Jehan, T. (2019, May). The music streaming sessions dataset. In The World Wide Web Conference (pp. 2594-2600).



Because the full dataset is too large for quick experimentation with data manipulation, Spotify provides a mini dataset for this purpose.

In this script, we first wrangle the data to inspect its quality and then engineer features for machine learning modeling.


# Mount Google Drive to Colab

In [1]:

#from google.colab import drive
#drive.mount('/content/drive')
#%cd /content/drive/MyDrive/Capstone_SpotifyStreaming/notebooks

import numpy as np
import pandas as pd


In [2]:
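# import ProfileReport from pandas_profiling for automated data inspection
# (install the package first with `pip install pandas-profiling` if it is missing)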

#import pandas_profiling
from pandas_profiling import ProfileReport 



In [3]:
# load the track information (mini version)

tf_df = pd.read_csv('../data/raw/data/track_features/tf_mini.csv')
tf_df.head()

Unnamed: 0,track_id,duration,release_year,us_popularity_estimate,acousticness,beat_strength,bounciness,danceability,dyn_range_mean,energy,...,time_signature,valence,acoustic_vector_0,acoustic_vector_1,acoustic_vector_2,acoustic_vector_3,acoustic_vector_4,acoustic_vector_5,acoustic_vector_6,acoustic_vector_7
0,t_a540e552-16d4-42f8-a185-232bd650ea7d,109.706673,1950,99.975414,0.45804,0.519497,0.504949,0.399767,7.51188,0.817709,...,4,0.935512,-0.033284,-0.411896,-0.02858,0.349438,0.832467,-0.213871,-0.299464,-0.675907
1,t_67965da0-132b-4b1e-8a69-0ef99b32287c,187.693329,1950,99.96943,0.916272,0.419223,0.54553,0.491235,9.098376,0.154258,...,3,0.359675,0.145703,-0.850372,0.12386,0.746904,0.371803,-0.420558,-0.21312,-0.525795
2,t_0614ecd3-a7d5-40a1-816e-156d5872a467,160.839996,1951,99.602549,0.812884,0.42589,0.50828,0.491625,8.36867,0.358813,...,4,0.726769,0.02172,-0.743634,0.333247,0.568447,0.411094,-0.187749,-0.387599,-0.433496
3,t_070a63a0-744a-434e-9913-a97b02926a29,175.399994,1951,99.665018,0.396854,0.400934,0.35999,0.552227,5.967346,0.514585,...,4,0.859075,0.039143,-0.267555,-0.051825,0.106173,0.614825,-0.111419,-0.265953,-0.542753
4,t_d6990e17-9c31-4b01-8559-47d9ce476df1,369.600006,1951,99.991764,0.728831,0.371328,0.335115,0.483044,5.802681,0.721442,...,4,0.562343,0.131931,-0.292523,-0.174819,-0.034422,0.717229,-0.016239,-0.392694,-0.455496


In [4]:
tf_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50704 entries, 0 to 50703
Data columns (total 30 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   track_id                50704 non-null  object 
 1   duration                50704 non-null  float64
 2   release_year            50704 non-null  int64  
 3   us_popularity_estimate  50704 non-null  float64
 4   acousticness            50704 non-null  float64
 5   beat_strength           50704 non-null  float64
 6   bounciness              50704 non-null  float64
 7   danceability            50704 non-null  float64
 8   dyn_range_mean          50704 non-null  float64
 9   energy                  50704 non-null  float64
 10  flatness                50704 non-null  float64
 11  instrumentalness        50704 non-null  float64
 12  key                     50704 non-null  int64  
 13  liveness                50704 non-null  float64
 14  loudness                50704 non-null

In [5]:
# count the number of NaN values
tf_df.isnull().sum().sum()

0

In [6]:
# load the listening session data (mini version)

log_df = pd.read_csv('../data/raw/data/training_set/log_mini.csv')
log_df.head(20)

Unnamed: 0,session_id,session_position,session_length,track_id_clean,skip_1,skip_2,skip_3,not_skipped,context_switch,no_pause_before_play,...,long_pause_before_play,hist_user_behavior_n_seekfwd,hist_user_behavior_n_seekback,hist_user_behavior_is_shuffle,hour_of_day,date,premium,context_type,hist_user_behavior_reason_start,hist_user_behavior_reason_end
0,0_00006f66-33e5-4de7-a324-2d18e439fc1e,1,20,t_0479f24c-27d2-46d6-a00c-7ec928f2b539,False,False,False,True,0,0,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone
1,0_00006f66-33e5-4de7-a324-2d18e439fc1e,2,20,t_9099cd7b-c238-47b7-9381-f23f2c1d1043,False,False,False,True,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone
2,0_00006f66-33e5-4de7-a324-2d18e439fc1e,3,20,t_fc5df5ba-5396-49a7-8b29-35d0d28249e0,False,False,False,True,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone
3,0_00006f66-33e5-4de7-a324-2d18e439fc1e,4,20,t_23cff8d6-d874-4b20-83dc-94e450e8aa20,False,False,False,True,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone
4,0_00006f66-33e5-4de7-a324-2d18e439fc1e,5,20,t_64f3743c-f624-46bb-a579-0f3f9a07a123,False,False,False,True,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,trackdone
5,0_00006f66-33e5-4de7-a324-2d18e439fc1e,6,20,t_c815228b-3212-4f9e-9d4f-9cb19b248184,False,False,True,False,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,trackdone,fwdbtn
6,0_00006f66-33e5-4de7-a324-2d18e439fc1e,7,20,t_e23c19f5-4c32-4557-aa44-81372c2e3705,True,True,True,False,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,fwdbtn,fwdbtn
7,0_00006f66-33e5-4de7-a324-2d18e439fc1e,8,20,t_0be6eced-f56f-48bd-8086-f2e0b760fdee,True,True,True,False,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,fwdbtn,fwdbtn
8,0_00006f66-33e5-4de7-a324-2d18e439fc1e,9,20,t_f3ecbd3b-9e8e-4557-b8e0-39cfcd7e65dd,False,True,True,False,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,fwdbtn,fwdbtn
9,0_00006f66-33e5-4de7-a324-2d18e439fc1e,10,20,t_2af4dfa0-7df3-4b7e-b7ab-353ba48237f9,True,True,True,False,0,1,...,0,0,0,True,16,2018-07-15,True,editorial_playlist,fwdbtn,fwdbtn


In [7]:
log_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167880 entries, 0 to 167879
Data columns (total 21 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   session_id                       167880 non-null  object
 1   session_position                 167880 non-null  int64 
 2   session_length                   167880 non-null  int64 
 3   track_id_clean                   167880 non-null  object
 4   skip_1                           167880 non-null  bool  
 5   skip_2                           167880 non-null  bool  
 6   skip_3                           167880 non-null  bool  
 7   not_skipped                      167880 non-null  bool  
 8   context_switch                   167880 non-null  int64 
 9   no_pause_before_play             167880 non-null  int64 
 10  short_pause_before_play          167880 non-null  int64 
 11  long_pause_before_play           167880 non-null  int64 
 12  hist_user_behavi

In [8]:
# count the number of NaN values in the listening session data
log_df.isnull().sum().sum()

0
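With both tables loaded, one plausible first step toward the feature generation mentioned in the introduction is to attach the track descriptors to each log row. The sketch below assumes that `track_id_clean` in the session log refers to `track_id` in the track table; the toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for log_df and tf_df (values are made up).
toy_log = pd.DataFrame({
    "session_id": ["s1", "s1", "s1"],
    "session_position": [1, 2, 3],
    "track_id_clean": ["t_a", "t_b", "t_a"],
})
toy_tracks = pd.DataFrame({
    "track_id": ["t_a", "t_b"],
    "duration": [180.0, 240.0],
})

# Left join keeps every log row; validate raises if a track_id is duplicated
# in the track table, and indicator marks log rows without a matching track.
merged = toy_log.merge(
    toy_tracks,
    left_on="track_id_clean",
    right_on="track_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

n_unmatched = int((merged["_merge"] == "left_only").sum())
merged = merged.drop(columns=["track_id", "_merge"])
print(n_unmatched)  # 0
```

Checking `n_unmatched` before dropping the indicator column gives a quick signal of whether any session row refers to a track without features.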

In [10]:
# Generate ProfileReport of the track data

tf_profile = ProfileReport(tf_df, html={'style' : {'full_width':True}})
tf_profile.to_file(output_file="../reports/profiling/tf_mini_profile.html") 

#Download the profile report: https://drive.google.com/file/d/1-AbpYDsq1X-AGGl-2lHP-2ApgecaNAvJ/view?usp=share_link

Summarize dataset: 100%|███████████| 773/773 [01:30<00:00,  8.57it/s, Completed]
Generate report structure: 100%|██████████████████| 1/1 [00:07<00:00,  7.68s/it]
Render HTML: 100%|████████████████████████████████| 1/1 [00:16<00:00, 16.54s/it]
Export report to file: 100%|██████████████████████| 1/1 [00:00<00:00, 15.95it/s]


In [11]:
# Generate ProfileReport of the listening session data

log_profile = ProfileReport(log_df, html={'style' : {'full_width':True}})
log_profile.to_file(output_file="../reports/profiling/log_mini_profile.html") 

#Download the profile report: https://drive.google.com/file/d/1-BRp061aKDdVDinAGXTUrdtb2UU-Z2dB/view?usp=share_link

Summarize dataset: 100%|█████████████| 60/60 [00:16<00:00,  3.53it/s, Completed]
Generate report structure: 100%|██████████████████| 1/1 [00:04<00:00,  4.31s/it]
Render HTML: 100%|████████████████████████████████| 1/1 [00:01<00:00,  1.29s/it]
Export report to file: 100%|█████████████████████| 1/1 [00:00<00:00, 194.67it/s]


According to the inspections above and the profile reports, neither dataset has missing values, and every column has a reasonable data type. Therefore, no data cleaning is needed.
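The checks behind this conclusion can be bundled into a small helper so they are easy to rerun on other files. This is a minimal sketch: the function name `quality_summary` is hypothetical, the duplicate-row count is an extra check not performed above, and the toy frame deliberately contains one missing value and one duplicate row to show what a failing result looks like.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> dict:
    """Collect basic data-quality indicators for one DataFrame."""
    return {
        "n_rows": len(df),
        "n_missing": int(df.isnull().sum().sum()),      # total NaN cells
        "n_duplicate_rows": int(df.duplicated().sum()),  # exact repeated rows
        "dtypes": df.dtypes.astype(str).to_dict(),       # column -> dtype name
    }


# Toy frame with one NaN and one repeated row (made-up values).
toy = pd.DataFrame({
    "track_id": ["t_a", "t_b", "t_a"],
    "duration": [180.0, None, 180.0],
})

summary = quality_summary(toy)
print(summary["n_missing"], summary["n_duplicate_rows"])  # 1 1
```

Calling `quality_summary(tf_df)` and `quality_summary(log_df)` would then give a compact view of both tables next to the profile reports.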