![spotify_logo](../assets/Spotify_Logo_CMYK_Green.png)

# Spotify Skip Prediction: Final Model Evaluation, Discussion, and Conclusion
Notebook: 7 of 7

Author: Alex Thach - alcthach@gmail.com  
BrainStation Data Science Capstone Project - Winter 2022  
April 4, 2022

---

# Recap
- I explored some additional feature engineering in the previous notebook
- Moving beyond the baseline features and the cumulative average, I aimed to encode information about the general/aggregate characteristics of the listening session, weighted more towards the present moment/current track
- The cumulative average failed to signal what was happening closer to the current song because it weighted all the previous songs equally
- By employing an exponentially-weighted moving average, I was able to give more weight to information closer to the current song
- This gave the model more information about the user interactions and song characteristics closer to the current song, and less about songs from earlier in the session
- To determine how much to weight the current versus past observations in the moving average calculations, I looped through various values of alpha, fed the transformed features into a classifier, and picked the best-performing alpha value
- **I converged on a value that delivered a mean 5-fold cross-validation score of 78% with a Random Forest classifier**
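
To make the contrast between the two averages concrete, here is a minimal pure-Python sketch. It assumes the weighting that pandas applies by default for `ewm(alpha=...)` (`adjust=True`), where the observation `i` steps back gets weight `(1 - alpha) ** i`. The function names and the toy skip sequence are hypothetical and are not taken from the dataset.

```python
def cumulative_mean(values):
    """Running mean that weights every observation so far equally."""
    out, total = [], 0.0
    for n, v in enumerate(values, start=1):
        total += v
        out.append(total / n)
    return out


def ewma(values, alpha):
    """Exponentially weighted mean; the value i steps back gets weight (1 - alpha) ** i."""
    out = []
    num = den = 0.0
    for v in values:
        # Decay everything seen so far, then add the newest value with weight 1
        num = (1 - alpha) * num + v
        den = (1 - alpha) * den + 1
        out.append(num / den)
    return out


# Hypothetical session: three played tracks followed by two skips (1 = skipped)
skips = [0, 0, 0, 1, 1]

cum = cumulative_mean(skips)
ewm = ewma(skips, alpha=0.7)

print([round(v, 3) for v in cum])
print([round(v, 3) for v in ewm])
```

After the two skips, the cumulative mean has only reached 0.4, while the exponentially weighted mean is already above 0.9. This is the "weighted towards the present" behaviour described in the recap.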

# Purpose:

The goal of this notebook is to evaluate and discuss my best-performing model. I will also circle back to the goals and hypotheses generated at the beginning of the project and see how the results compare. Finally, I will discuss the next steps and future directions I'd like to take this project in.

---

In [1]:
# Runs setup script, imports, plotting settings, reads in raw data
%run -i "../scripts/at-setup.py" 

Dataframes in the global name space now include:
session_logs_df
track_features_df


In [2]:
# Reads in data
main_df = pd.read_csv('../data/processed/merged.csv')
features = pd.read_csv('../data/processed/features.csv')
target = pd.read_csv('../data/processed/target.csv')

In [3]:
# Assigns variables prior to feature engineering
X = features
y = target

In [4]:
print("Computing exponentially-weighted moving average for features.")
X1 = X.groupby('session_id', as_index=False).ewm(alpha = 0.7).mean().reset_index().drop('level_1', axis=1)

# Shifts the skip and track-end columns down one row within each session (stored in X1_shifted); they are written back to X1 below
X1_shifted = X1.groupby('session_id')[['skip_2',   
                                            'enc__hist_user_behavior_reason_end_backbtn',
                                            'enc__hist_user_behavior_reason_end_clickrow',  
                                            'enc__hist_user_behavior_reason_end_endplay',  
                                            'enc__hist_user_behavior_reason_end_fwdbtn',  
                                            'enc__hist_user_behavior_reason_end_logout',
                                            'enc__hist_user_behavior_reason_end_remote',  
                                            'enc__hist_user_behavior_reason_end_trackdone']].shift()


# Re-assigns previous columns to shift columns from above
print("Implementing row stagger for skip and track end behaviour features.")

X1 = X1.assign(**X1_shifted.to_dict(orient='series'))

# Restores the original 'session_position' values so that the model has information about where the track is within the session
print("Re-assigning original 'session_position' feature values.")
X1['session_position'] = X['session_position']

# Drops first song from every session
print("Dropping first song from every session.")
X1.drop(X1[X1['session_position'] == 1].index, inplace=True)

# Drops 'session_id' column
print("Dropping 'session_id' column.")
X1.drop('session_id', axis=1, inplace=True)

# Prints completion notification
print("Transformations complete!")

Computing exponentially-weighted moving average for features.
Implementing row stagger for skip and track end behaviour features.
Re-assigning original 'session_position' feature values.
Dropping first song from every session.
Dropping 'session_id' column.
Transformations complete!
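
The row stagger matters because the moving average on a given row includes that row's own `skip_2` value, so leaving it unshifted would leak the target into the features. The sketch below illustrates the idea in plain Python under that assumption. `shift_within_session` and the toy rows are hypothetical and only mimic what `groupby('session_id').shift()` does (pandas would produce `NaN` where this sketch produces `None`).

```python
def shift_within_session(session_ids, values):
    """Move each value down one row within its session; first rows become None."""
    shifted = []
    last_seen = {}  # session_id -> value from that session's previous row
    for sid, val in zip(session_ids, values):
        shifted.append(last_seen.get(sid))  # None for a session's first row
        last_seen[sid] = val
    return shifted


# Hypothetical data: two short sessions with a per-row skip signal
session_ids = ["a", "a", "a", "b", "b"]
skip_signal = [0.0, 1.0, 0.5, 1.0, 0.0]

staggered = shift_within_session(session_ids, skip_signal)
print(staggered)
```

Each row now carries only the signal from the previous track in the same session, and the first row of every session has no value, which is consistent with dropping the first track of each session afterwards.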


In [5]:
# Sanity Check
X1.head(10)

Unnamed: 0,session_position,session_length,skip_2,context_switch,no_pause_before_play,short_pause_before_play,long_pause_before_play,hist_user_behavior_n_seekfwd,hist_user_behavior_n_seekback,hist_user_behavior_is_shuffle,...,enc__hist_user_behavior_reason_start_trackerror,enc__hist_user_behavior_reason_end_backbtn,enc__hist_user_behavior_reason_end_clickrow,enc__hist_user_behavior_reason_end_endplay,enc__hist_user_behavior_reason_end_fwdbtn,enc__hist_user_behavior_reason_end_logout,enc__hist_user_behavior_reason_end_remote,enc__hist_user_behavior_reason_end_trackdone,enc__mode_major,enc__mode_minor
1,2,20.0,0.0,0.0,0.769231,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.230769,0.769231
2,3,20.0,0.0,0.0,0.935252,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.784173,0.215827
3,4,20.0,0.0,0.0,0.980946,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.936486,0.063514
4,5,20.0,0.0,0.0,0.994316,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.981054,0.018946
5,6,20.0,0.0,0.0,0.998298,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.293815,0.706185
6,7,20.0,0.0,0.0,0.99949,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.700511,0.0,0.0,0.299489,0.788253,0.211747
7,8,20.0,0.700153,0.0,0.999847,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.910199,0.0,0.0,0.089801,0.936486,0.063514
8,9,20.0,0.91006,0.0,0.999954,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.973064,0.0,0.0,0.026936,0.280933,0.719067
9,10,20.0,0.973019,0.0,0.999986,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.99192,0.0,0.0,0.00808,0.084279,0.915721
10,11,20.0,0.991906,0.0,0.999996,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.997576,0.0,0.0,0.002424,0.025283,0.974717


The first track in each session still needs to be dropped from `y`.

In [6]:
y

Unnamed: 0,skip_2
0,False
1,False
2,False
3,False
4,False
...,...
167875,False
167876,False
167877,False
167878,False


In [7]:
# Sanity check: inspects the index of first tracks used as the drop mask; `session_logs_df` shares the same index as X1
session_logs_df[session_logs_df['session_position'] == 1].index

Int64Index([     0,     20,     40,     60,     80,     91,    106,    121,
               141,    161,
            ...
            167702, 167722, 167742, 167754, 167774, 167788, 167808, 167828,
            167848, 167860],
           dtype='int64', length=10000)

This means the target variable also needs to be transformed.

In [8]:
# Drops first songs in target variable using the mask from dropping first songs in X1
y.drop(session_logs_df[session_logs_df['session_position'] == 1].index, inplace=True)

In [9]:
# Sanity check: there should be 10,000 fewer rows (one first track per session)
y.shape

(157880, 1)

In [10]:
# Converts the single-column DataFrame to a 1-D Series
y = y['skip_2']

In [11]:
# Checks to see if I have the correct number of rows in feature and target
X1.shape[0] == y.shape[0]

True

Now that the data has been transformed successfully, I can move on to modelling the data.

In [12]:
# Splitting into remainder and test sets
X1_remainder, X1_test, y1_remainder, y1_test = \
    train_test_split(X1, y, test_size = 0.2,
                     random_state=42)

In [13]:
# Sanity check
print(X1_remainder.shape)
print(y1_remainder.shape)

(126304, 67)
(126304,)


In [14]:
# Splitting the remainder into train and validation
X1_train, X1_validation, y1_train, y1_validation = \
    train_test_split(X1_remainder, y1_remainder, test_size = 0.3,
                     random_state=42)
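
As a quick check on what the two successive splits imply, the arithmetic below works out the overall share of data in each set. It is a small sketch using only the `test_size` values from the cells above; the variable names are hypothetical.

```python
test_size = 0.2        # first split: held-out test share
validation_size = 0.3  # second split: share of the remainder used for validation

remainder_share = 1 - test_size
validation_share = remainder_share * validation_size
train_share = remainder_share - validation_share

print(f"train: {train_share:.2f}, validation: {validation_share:.2f}, test: {test_size:.2f}")
```

In other words, roughly 56% of the rows end up in the training set, 24% in the validation set, and 20% in the test set.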

In [15]:
from sklearn.ensemble import RandomForestClassifier

estimators = [('random_forest_clf', RandomForestClassifier(n_jobs= -1, random_state=42))] 

pipe = Pipeline(estimators)

print("Calculating model score.")

mean_cross_val_score = np.mean(cross_val_score(pipe, X1_remainder, y1_remainder, cv=5, verbose=1))

print(f"Mean 5-fold cross value score: {mean_cross_val_score:.2f}")

Calculating model score.


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Mean 5-fold cross value score: 0.78


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   54.2s finished


In [16]:
# Instantiates Random Forest classifier
rf = RandomForestClassifier(n_jobs=-1, random_state=42)

# Fitting model
rf.fit(X1_train, y1_train)

# Scoring model
rf.score(X1_test, y1_test)

0.7801178109956929

## Methodology and Discussion
- Following baseline modelling using a logistic regression, I wanted to see how a Random Forest classifier would perform on this transformed dataset
- Random Forest classifiers typically perform quite well out of the box, which was one of the reasons I picked this type of classifier. I opted not to tune hyperparameters, as Random Forest classifiers often benefit relatively little from hyperparameter tuning
- We can see that the Random Forest classifier performs well above a naive model, which would have a classification accuracy of 52%
- In addition, this model performed slightly better than my baseline model which had a classification accuracy of 76%
- For this slight improvement in classification accuracy, I'm giving up the interpretability that I otherwise would have with the logistic regression model
- For this problem space, I believe that classification accuracy should be prioritized over interpretability as the goal of the project is to predict skip outcome
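
For reference, a naive baseline is commonly taken to be a model that always predicts the most frequent class, so its accuracy equals that class's share of the labels. The sketch below assumes this is what the 52% figure refers to. `majority_class_accuracy` and the toy labels are hypothetical.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical labels: 13 non-skips and 12 skips (a 52/48 split)
toy_labels = [False] * 13 + [True] * 12

baseline = majority_class_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Applied to the real target, the same calculation would give the accuracy floor that any trained classifier needs to beat.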

---
## Wrapping Up

### Reflecting on the Project
Looking back to where the project started, I knew that this was going to be a complex problem space to navigate. Despite my best efforts to simplify the problem and work my way up from there, it appears that I'm still in the 'simplified problem' stage, which speaks to how complex this problem might actually be.

Although my best model outperformed the naive model by a considerable margin, there are some assumptions and limitations that should be considered. 

First, the model depends on aggregate information about user interaction behaviours and track features, weighted towards the present. This means it will likely not be able to predict skip outcomes several tracks ahead. For example, given songs A, B, and C with their associated user interactions and track features, the task might be to predict skips for songs D, E, and F knowing only their track IDs. My model is not capable of predicting in this fashion, as it requires ground truth about the preceding tracks before it can predict a skip. Nor does it provide a solution when there is a high likelihood that a track will be skipped during a listening session; the most it offers is a signal that could support removing the track from the playlist. This is because the model's scope is limited to classification.

I do, however, appreciate that the feature engineering and modelling I conducted helped me understand the problem space more deeply, and it likely serves as a step towards a more sophisticated approach to the problem.

This is where I hope to take the project next. Throughout the course of this project, I grew to appreciate the complexity of this problem. Consider a user skipping several songs to reach a song they wanted to hear. Is there any way for me to gather information about the user's intention in this situation? Did they know which song they were looking for and prefer to skip until they reached it? I'd like to understand why users employ certain skip/play strategies on the platform and see if I can predict their next move. Acting on that prediction could involve classifying the upcoming track and/or replacing it with a song they are more likely to play.

During the course of this project, I've always held on to the notion of ad-hoc, or on-the-fly, music curation as a feature that could emerge from exploring this problem. I would love to dive deeper, try to uncover hidden structures in the data, and try to understand the users' intentions as they interact with the platform. I will likely go back to explore the problem and data more, and comb through literature that tries to solve similar problems. I will then try to employ new methodologies or rework old ones to solve this problem. This will likely involve deep learning, as these methodologies tend to be better equipped to handle this level of complexity.

To sum up, I was able to reach an abbreviated version of the goal: predicting skip outcome in a static fashion. However, I'm more inclined to explore more dynamic solutions, ones that might be of more practical use to the platform and its users. This is the high-level direction I'm hoping to take the project.

