![spotify_logo](../assets/Spotify_Logo_CMYK_Green.png)

# Spotify Skip Prediction: Feature Engineering and Baseline Modelling
Notebook: 4 of 7

Author: Alex Thach - alcthach@gmail.com  
BrainStation Data Science Capstone Project - Winter 2022  
April 4, 2022

---

# Recap:
In the previous notebook I dived further into the data and completed some multivariate analysis. 
Some things to note from that notebook are:
- Previous history of user interactions might be related to the probability of the current track being skipped
- Merged session logs and track features table successfully, written to a .csv file in the project directory
- `'skip_2'` at best had weak correlations with other features in the dataset
- Skip rate appeared to differ based on:
    - Where in the sequence the song was located in a session
    - Time-of-day
    - Type of playlist song belonged to

---

# Purpose: 
The goal of this notebook is to employ some feature engineering and baseline modelling to gather an initial impression of the problem.

---

# Summary/Highlights:
- Some feature engineering was employed, including exploding the `'date'` column and OneHotEncoding the categorical variables
- Previous user interaction history appears to be a predictor of future skip outcomes, as indicated by improved classification accuracy when the model is given data related to the previous track interaction
- Some features appear to be proxies for skip outcomes in previous tracks

---

In [1]:
# Runs setup script, imports, plotting settings, reads in raw data
%run -i "../scripts/at-setup.py" 

Dataframes in the global name space now include:
session_logs_df
track_features_df


In [2]:
# Reads in data
main_df = pd.read_csv('../data/processed/merged.csv')

# Feature Engineering

Prior to any sort of modelling, it's good practice to take a look at the data types in the dataset, because features that will be used to train a machine learning model must be in numeric form. I'll proceed to take a look at the dataset from the previous notebook.

In [3]:
# Gets the data types of the features in the dataset
main_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 167880 entries, 0 to 167879
Data columns (total 46 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   session_id                       167880 non-null  object 
 1   session_position                 167880 non-null  int64  
 2   session_length                   167880 non-null  int64  
 3   skip_2                           167880 non-null  bool   
 4   context_switch                   167880 non-null  int64  
 5   no_pause_before_play             167880 non-null  int64  
 6   short_pause_before_play          167880 non-null  int64  
 7   long_pause_before_play           167880 non-null  int64  
 8   hist_user_behavior_n_seekfwd     167880 non-null  int64  
 9   hist_user_behavior_n_seekback    167880 non-null  int64  
 10  hist_user_behavior_is_shuffle    167880 non-null  bool   
 11  hour_of_day                      167880 non-null  int64  
 12  da

Most of the features in the dataset are numeric. However, some are not. I'll take a look at the non-numeric columns and see if I can transform them into numeric columns.

---

In [4]:
# Gets columns that are non-numeric (dtype 'object') and prints one per line
categorical_cols = [col for col in main_df.columns if main_df[col].dtype == 'object']
print(*categorical_cols, sep='\n')

session_id
date
context_type
hist_user_behavior_reason_start
hist_user_behavior_reason_end
mode


The output above shows which columns in the dataset are of type 'object'. These are the columns I will look to convert to a numeric type where appropriate.
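
As a side note, pandas can also do this filtering directly. The sketch below uses a small made-up frame (`toy_df` and its values are assumptions for illustration, not the real dataset) and selects every column that is neither numeric nor boolean.

```python
import pandas as pd

# Toy frame standing in for main_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "session_id": ["a", "a", "b"],
    "session_position": [1, 2, 1],
    "skip_2": [True, False, True],
    "context_type": ["catalog", "charts", "catalog"],
})

# Keeps only the columns that are neither numeric nor boolean,
# i.e. the ones that still need converting before modelling
non_numeric_cols = toy_df.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(non_numeric_cols)
```

Excluding numeric and boolean types, rather than including `'object'`, is one way to keep the check independent of how a given pandas version stores strings.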

---
## Initial Impressions

- `'session_id'` is a column I would prefer to keep as is. It serves as a unique identifier for each individual listening session, and I would like to preserve that information in the dataset
- `'date'` might be a column that I look to explode, or split into weekday, month, and year
- `'context_type'` indicates the type of playlist that a song belongs to. If you recall from the previous notebook, skip rate appeared to differ across the different types of playlists, so it is worth converting this column to numeric
- `'hist_user_behavior_reason_start'` and `'hist_user_behavior_reason_end'` also appeared to show differences in skip rate in the EDA from the previous notebook
    - For that reason they are also worth converting to numeric
- Finally, there is `'mode'`, which indicates if a song is in a major or minor key
- Conclusion: I will look to convert all the columns above to numeric, except for `'session_id'`, which I will preserve in case any later operations require it in its original form

---

In [5]:
# Assigning variables
X = main_df
y = main_df['skip_2']

In [6]:
# # Splitting into remainder and test sets
# X_remainder, X_test, y_remainder, y_test = \
#     train_test_split(X, y, test_size = 0.2,
#                      random_state=42)

In [7]:
# # Splitting the remainder into train and validation
# X_train, X_validation, y_train, y_validation = \
#     train_test_split(X_remainder, y_remainder, test_size = 0.3,
#                      random_state=42)

## Exploding the `'date'` column

In [8]:
# Gets date column
main_df['date'].head()

0    2018-07-15
1    2018-07-15
2    2018-07-15
3    2018-07-15
4    2018-07-15
Name: date, dtype: object

Splitting the `'date'` column into finer-grained components exposes trends on different time scales that a single calendar date hides. It allows for possible study of trends on weekly, monthly, or yearly timeframes.

This is accomplished by casting the `'date'` column as datetime and exploding it into weekday, month, and year columns.

In [9]:
# Casts 'date' column as datetime (uses main_df's own column rather than session_logs_df)
main_df['date'] = pd.to_datetime(main_df['date'])

# Explodes 'date' column into weekday, month, year, columns
main_df['weekday'] = main_df['date'].dt.weekday
main_df['month'] = main_df['date'].dt.month
main_df['year'] = main_df['date'].dt.year

In [10]:
# Sanity checks
display(main_df[['date','weekday', 'month', 'year']].head(1))
print("")
display(main_df[['date','weekday', 'month', 'year']].info())

| | date | weekday | month | year |
|---|---|---|---|---|
| 0 | 2018-07-15 | 6 | 7 | 2018 |



<class 'pandas.core.frame.DataFrame'>
Int64Index: 167880 entries, 0 to 167879
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype         
---  ------   --------------   -----         
 0   date     167880 non-null  datetime64[ns]
 1   weekday  167880 non-null  int64         
 2   month    167880 non-null  int64         
 3   year     167880 non-null  int64         
dtypes: datetime64[ns](1), int64(3)
memory usage: 6.4 MB


None

## OneHotEncoding Categorical Columns

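I will encode the categorical columns by employing `'ColumnTransformer'` and `'OneHotEncoder'`. Unlike dummy variables created on the fly, a fitted `'OneHotEncoder'` remembers the categories it saw during fitting, so it produces the same columns on a test set or on new incoming data. Note that with its default settings the encoder raises an error when it meets a category it has not seen before; setting `handle_unknown='ignore'` makes it encode such a category as all zeros instead.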

---

In [11]:
# Creates the list of column transformations and the columns to which they apply
col_transforms = [('enc', OneHotEncoder(), ['context_type',
                                            'hist_user_behavior_reason_start',
                                            'hist_user_behavior_reason_end',
                                            'mode'])]

# Creates the column transformer
col_trans = ColumnTransformer(col_transforms)

# Fits to X
col_trans.fit(X)

ColumnTransformer(transformers=[('enc', OneHotEncoder(),
                                 ['context_type',
                                  'hist_user_behavior_reason_start',
                                  'hist_user_behavior_reason_end', 'mode'])])

In [12]:
# Applies the transformation
transformed = col_trans.transform(X) 

In [13]:
# Puts in a DataFrame
transformed_df = pd.DataFrame(transformed.toarray(), columns=col_trans.get_feature_names_out())

In [14]:
# Re-assigns the transformed dataframe to X
X = pd.concat([X, transformed_df], axis = 1)

In [15]:
# List comprehension to figure out which features are categorical
categorical_cols = ([col for col in X.columns if X[col].dtype == 'object'])
print(categorical_cols)

['session_id', 'context_type', 'hist_user_behavior_reason_start', 'hist_user_behavior_reason_end', 'mode']


In [16]:
# Drops OHE parent columns and the original 'date' column
X = X.drop(columns=['context_type',
                    'hist_user_behavior_reason_start',
                    'hist_user_behavior_reason_end',
                    'mode',
                    'date'], axis=1)

In [17]:
# Gets columns in X after transformations were completed
X.columns

Index(['session_id', 'session_position', 'session_length', 'skip_2',
       'context_switch', 'no_pause_before_play', 'short_pause_before_play',
       'long_pause_before_play', 'hist_user_behavior_n_seekfwd',
       'hist_user_behavior_n_seekback', 'hist_user_behavior_is_shuffle',
       'hour_of_day', 'premium', 'duration', 'release_year',
       'us_popularity_estimate', 'acousticness', 'beat_strength', 'bounciness',
       'danceability', 'dyn_range_mean', 'energy', 'flatness',
       'instrumentalness', 'key', 'liveness', 'loudness', 'mechanism',
       'organism', 'speechiness', 'tempo', 'time_signature', 'valence',
       'acoustic_vector_0', 'acoustic_vector_1', 'acoustic_vector_2',
       'acoustic_vector_3', 'acoustic_vector_4', 'acoustic_vector_5',
       'acoustic_vector_6', 'acoustic_vector_7', 'weekday', 'month', 'year',
       'enc__context_type_catalog', 'enc__context_type_charts',
       'enc__context_type_editorial_playlist',
       'enc__context_type_personalized_pla

In [18]:
# Checks again to see which columns are non-numeric, uses list comprehension
categorical_cols = ([col for col in X.columns if X[col].dtype == 'object'])
print(categorical_cols)

['session_id']


The only column that is of type 'object' is `'session_id'`. I now have a dataset that is ready to be used for some baseline modelling! But let's write the newly transformed dataset to a .csv first.

In [19]:
# Writes transformed dataset to .csv
X.to_csv('../data/processed/features.csv', index_label=False)
y.to_csv('../data/processed/target.csv', index_label=False)

## Baseline Modelling

### Experiment 1: Logistic Regression with Date-exploded features, and OneHotEncoded categorical variables
We'll start off by training a logistic regression model.

In [20]:
# Drops 'skip_2' (the target) and 'session_id' from the features for the sake of this experiment
X1 = X.drop(columns = ['skip_2', 'session_id'], axis = 1)

In [21]:
# Splitting into remainder and test sets
X1_remainder, X1_test, y1_remainder, y1_test = \
    train_test_split(X1, y, test_size = 0.2,
                     random_state=42)

In [22]:
# Splitting the remainder into train and validation
X1_train, X1_validation, y1_train, y1_validation = \
    train_test_split(X1_remainder, y1_remainder, test_size = 0.3,
                     random_state=42)

In [23]:
# Note: This cell will take some time to run!
# Instantiates Logit
logit = LogisticRegression(max_iter=500, verbose=1, n_jobs=-1)

# Fits
logit.fit(X1_train, y1_train)

# Scores on train and validation
print(logit.score(X1_train, y1_train))
print(logit.score(X1_validation, y1_validation))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  1.0min finished


0.8696336637876015
0.8673682120520203


The accuracy of this model is suspiciously high. Although I removed `'skip_2'`, there may be data leakage elsewhere. Recall that some features indicate how a track ended. In other words, the `'enc__hist_user_behavior_reason_end'` columns might be embedding hidden information about whether the track was skipped.

In [24]:
# Gets features that indicate how a track ended
X.filter(like='reason_end').columns

Index(['enc__hist_user_behavior_reason_end_backbtn',
       'enc__hist_user_behavior_reason_end_clickrow',
       'enc__hist_user_behavior_reason_end_endplay',
       'enc__hist_user_behavior_reason_end_fwdbtn',
       'enc__hist_user_behavior_reason_end_logout',
       'enc__hist_user_behavior_reason_end_remote',
       'enc__hist_user_behavior_reason_end_trackdone'],
      dtype='object')

I suspect that these columns allowed the model to look into the future. In other words, the model is seeing how the current track ended, which may signal whether the track was skipped or not.
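
One way to probe a suspected leak like this is to compare the skip rate across the values of the suspect column. The sketch below uses a tiny made-up frame (`toy` and its rows are assumptions chosen to show an extreme case, not results from the real data).

```python
import pandas as pd

# Hypothetical rows in which the end reason perfectly reveals the target
toy = pd.DataFrame({
    "hist_user_behavior_reason_end": ["fwdbtn", "fwdbtn", "trackdone",
                                      "trackdone", "fwdbtn", "trackdone"],
    "skip_2": [True, True, False, False, True, False],
})

# Mean of a boolean column is the proportion of True values, i.e. the skip rate
skip_rate_by_reason = toy.groupby("hist_user_behavior_reason_end")["skip_2"].mean()
print(skip_rate_by_reason)
```

Skip rates close to 0 or 1 for individual categories would suggest that the feature is effectively a restatement of the target.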

In [25]:
# Drops 'skip_2', 'session_id' , and end track features
X2 = X.drop(columns = ['skip_2', 'session_id','enc__hist_user_behavior_reason_end_backbtn',
       'enc__hist_user_behavior_reason_end_clickrow',
       'enc__hist_user_behavior_reason_end_endplay',
       'enc__hist_user_behavior_reason_end_fwdbtn',
       'enc__hist_user_behavior_reason_end_logout',
       'enc__hist_user_behavior_reason_end_remote',
       'enc__hist_user_behavior_reason_end_trackdone'], axis = 1)

In [26]:
# Splitting into remainder and test sets
X2_remainder, X2_test, y2_remainder, y2_test = \
    train_test_split(X2, y, test_size = 0.2,
                     random_state=42)

In [27]:
# Splitting the remainder into train and validation
X2_train, X2_validation, y2_train, y2_validation = \
    train_test_split(X2_remainder, y2_remainder, test_size = 0.3,
                     random_state=42)

In [28]:
# Instantiates Logit
logit2 = LogisticRegression(max_iter=500, verbose=1, n_jobs=-1)

# Fits
logit2.fit(X2_train, y2_train)

# Gets validation score
print(logit2.score(X2_validation, y2_validation))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  1.2min finished


0.7613422019259406


In [29]:
# Gets test score
logit2.score(X2_test, y2_test)

0.7561353347629259

Based on the output above, the classification accuracy on the validation set has dropped from about 87% to about 76%, which is consistent with my suspicion of data leakage.

In [30]:
# Gets coefficients from the model
print("Strongest Predictors of Track Skip, Yes")
display(pd.DataFrame(list(zip(list(np.ravel(logit2.coef_)), logit2.feature_names_in_)), columns=['coef', 'feature']).sort_values(by='coef', ascending=False).head())
print("Strongest Predictors of Track Skip, No")
display(pd.DataFrame(list(zip(list(np.ravel(logit2.coef_)), logit2.feature_names_in_)), columns=['coef', 'feature']).sort_values(by='coef', ascending=False).tail())

Strongest Predictors of Track Skip, Yes


| | coef | feature |
|---|---|---|
| 52 | 1.481308 | enc__hist_user_behavior_reason_start_fwdbtn |
| 49 | 0.521965 | enc__hist_user_behavior_reason_start_backbtn |
| 4 | 0.503234 | short_pause_before_play |
| 5 | 0.308042 | long_pause_before_play |
| 47 | 0.09059 | enc__context_type_user_collection |


Strongest Predictors of Track Skip, No


| | coef | feature |
|---|---|---|
| 10 | -0.140239 | premium |
| 2 | -0.147923 | context_switch |
| 7 | -0.20728 | hist_user_behavior_n_seekback |
| 50 | -0.277215 | enc__hist_user_behavior_reason_start_clickrow |
| 55 | -1.729335 | enc__hist_user_behavior_reason_start_trackdone |


Based on the cell output above, a track starting because the forward button was pressed appears to be a strong predictor of that track being skipped. In contrast, if the current track starts because the previous song was played until the end, the current track is less likely to be skipped.
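
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that can be easier to read. The sketch below applies this to the two largest coefficients reported above. Treat the results as a rough guide only, since the model uses scikit-learn's default regularisation and the features were not scaled.

```python
import math

# Coefficients copied from the logit2 output above
coef_fwdbtn = 1.481308      # enc__hist_user_behavior_reason_start_fwdbtn
coef_trackdone = -1.729335  # enc__hist_user_behavior_reason_start_trackdone

# exp(coef) is the multiplicative change in the odds of a skip when the
# indicator goes from 0 to 1, holding the other features fixed
odds_ratio_fwdbtn = math.exp(coef_fwdbtn)
odds_ratio_trackdone = math.exp(coef_trackdone)

print(round(odds_ratio_fwdbtn, 2))
print(round(odds_ratio_trackdone, 2))
```

An odds ratio above 1 means the feature pushes the prediction towards a skip, and an odds ratio below 1 pushes it away from one.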

---

Let's see what happens if I withhold information about how the song started.

In [31]:
# Drops 'skip_2', 'session_id', end track features, and the fwdbtn/trackdone start track features
X3 = X.drop(columns = ['skip_2', 'session_id','enc__hist_user_behavior_reason_end_backbtn',
                    'enc__hist_user_behavior_reason_end_clickrow',
                    'enc__hist_user_behavior_reason_end_endplay',
                    'enc__hist_user_behavior_reason_end_fwdbtn',
                    'enc__hist_user_behavior_reason_end_logout',
                    'enc__hist_user_behavior_reason_end_remote',
                    'enc__hist_user_behavior_reason_end_trackdone',
                    'enc__hist_user_behavior_reason_start_fwdbtn',
                    'enc__hist_user_behavior_reason_start_trackdone'], axis = 1)

By withholding some information about how the song started, I'm not letting the model know if the previous track was skipped or played in its entirety. In other words, if the song was started by pressing the forward button, the previous song was likely skipped. If the user arrived at the current song because the previous song was played entirely, then there was likely no skip leading up to the current track. I had suspected this since encountering the differences in skip rate between tracks that started due to pressing the forward button and tracks that started because the previous track was played until the end. It would appear that previous skip/play behaviour is somehow linked to skip outcomes in the following song.
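
To make the idea of "previous track outcome" concrete, the sketch below builds an explicit lagged skip column on a tiny made-up frame. This is not a step taken in this notebook; the frame `toy` and the column name `prev_skip_2` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sessions: 'a' has three tracks, 'b' has two
toy = pd.DataFrame({
    "session_id": ["a", "a", "a", "b", "b"],
    "session_position": [1, 2, 3, 1, 2],
    "skip_2": [True, False, True, False, False],
})

# Orders tracks within each session, then shifts the target down by one row
# per session so that each track sees the outcome of the track before it
toy = toy.sort_values(["session_id", "session_position"])
toy["prev_skip_2"] = toy.groupby("session_id")["skip_2"].shift(1)
print(toy)
```

The first track of each session has no predecessor, so its lagged value is missing and would need to be handled before modelling.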

In [32]:
# Splitting into remainder and test sets
X3_remainder, X3_test, y3_remainder, y3_test = \
    train_test_split(X3, y, test_size = 0.2,
                     random_state=42)

In [33]:
# Splitting the remainder into train and validation
X3_train, X3_validation, y3_train, y3_validation = \
    train_test_split(X3_remainder, y3_remainder, test_size = 0.3,
                     random_state=42)

In [34]:
# Note: This cell will take some time to run
# Instantiates Logit
logit3 = LogisticRegression(max_iter=500, verbose=1, n_jobs=-1)

# Fits
logit3.fit(X3_train, y3_train)

# Scores on validation
print(logit3.score(X3_validation, y3_validation))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  1.2min finished


0.5878834508090937


In [35]:
logit3.score(X3_test, y3_test)

0.580325232308792

My suspicion appears to be confirmed by the classification accuracy of this particular model, which fell to about 58%. Previous user interaction, captured by how the current track started (which in turn tells us if the previous song was skipped), appears to be an important predictor of future skip outcomes.

---

# Wrapping Up
- In this notebook I managed to gather an initial impression of the problem
- Some insights drawn from my exploratory data analysis re-emerged during baseline modelling
    - For example, previous user interaction history is a predictor of skip outcomes in the following track
- At this point I have conducted a bit of feature engineering and baseline modelling
- The second logit model, trained without `skip_2` and the track end behaviours, had a test score of about 76%, which outperforms a naive model that guesses skip every single time and would be right about 52% of the time
- In other words, the second logit model would guess about 7 to 8 songs correctly out of 10, whereas the naive model would guess roughly 5 out of 10 correctly
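
The naive "always guess skip" comparison can be reproduced with scikit-learn's `DummyClassifier`. The sketch below uses made-up labels with a 52% skip rate to mirror the figure quoted above; `X_toy` and `y_toy` are assumptions for illustration, not the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 52 skips and 48 non-skips
y_toy = np.array([True] * 52 + [False] * 48)

# The dummy model ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

# 'most_frequent' always predicts the majority class (skip, in this toy data)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```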
---
# Next Steps
- In the next notebook I will continue with feature engineering and run more model experiments to see if I can improve classification accuracy above our baseline Logit model