In this section, we will predict my Spotify Wrapped. A Spotify Wrapped has five main components:
1. Top 5 Artists
2. Top 5 Songs
3. Favourite Genre
4. Most Active Listening Time
5. Total Listening Time Estimate

However, Spotify does not provide a genre column in the extended listening history file, so we will focus on the other four (and possibly some bonus categories).

We will use the following strategy:

**Trend Projection** - take the cleaned and explored (EDA'd) data and project the rest of the year based on the trends so far.

We will also use **Random Forest** and **k-Nearest Neighbors** to compare predictions.

In [2]:
%pip install -U scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp39-cp39-macosx_12_0_arm64.whl (11.1 MB)
     |████████████████████████████████| 11.1 MB 3.1 MB/s eta 0:00:01
Collecting joblib>=1.2.0
  Downloading joblib-1.5.1-py3-none-any.whl (307 kB)
     |████████████████████████████████| 307 kB 5.2 MB/s eta 0:00:01
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Collecting scipy>=1.6.0
  Downloading scipy-1.13.1-cp39-cp39-macosx_12_0_arm64.whl (30.3 MB)
     |████████████████████████████████| 30.3 MB 4.4 MB/s eta 0:00:01
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.5.1 scikit-learn-1.6.1 scipy-1.13.1 threadpoolctl-3.6.0
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kern

In [7]:
# Import relevant libraries (it's a lot omg)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.inspection import permutation_importance

In [5]:
df = pd.read_csv("engineered_spotify_data.csv")

In [6]:
df.head()

Unnamed: 0,ts,platform,ms_played,conn_country,master_metadata_track_name,master_metadata_album_artist_name,master_metadata_album_album_name,spotify_track_uri,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode,hour,day,month,minutes_played,track_artist,time_of_day,album_type_x,album_type_y,time_diff,new_session,session_id,duration_minutes,is_top_artist,is_top_track,date,daily_listen_count_x,daily_listen_count_y,is_favourite_hour,same_album_as_previous,album_listen_block
0,2025-01-01 05:11:45+00:00,osx,158443,TT,deja vu,Olivia Rodrigo,SOUR,spotify:track:6HU7h9RYOaPRFeh0R3UeAr,trackdone,endplay,False,True,False,1735679560,False,5,Wednesday,2025-01,2.640717,deja vu - Olivia Rodrigo,Morning,Album,Album,0.0,False,0,2.640717,False,False,2025-01-01,22,22,False,False,0
1,2025-01-01 05:14:52+00:00,osx,188306,TT,Crash My Car,COIN,Dreamland,spotify:track:5SN3mwuodiwY3jPejBuUD5,clickrow,trackdone,False,False,False,1735708304,False,5,Wednesday,2025-01,3.138433,Crash My Car - COIN,Morning,Single,Single,187.0,False,0,3.138433,False,False,2025-01-01,22,22,False,False,0
2,2025-01-01 05:18:50+00:00,osx,177280,TT,Everybody Talks,Neon Trees,Picture Show,spotify:track:2iUmqdfGZcHIhS3b9E9EWq,trackdone,trackdone,False,False,False,1735708522,False,5,Wednesday,2025-01,2.954667,Everybody Talks - Neon Trees,Morning,Single,Single,238.0,False,0,2.954667,False,False,2025-01-01,22,22,False,False,0
3,2025-01-01 05:22:13+00:00,osx,202496,TT,She Looks So Perfect,5 Seconds of Summer,5 Seconds Of Summer,spotify:track:1CQ2cMfrmFM1YdfmjENKVE,trackdone,trackdone,False,False,False,1735708730,False,5,Wednesday,2025-01,3.374933,She Looks So Perfect - 5 Seconds of Summer,Morning,Single,Single,203.0,False,0,3.374933,False,False,2025-01-01,22,22,False,False,0
4,2025-01-01 05:25:51+00:00,osx,218013,TT,Tongue Tied,GROUPLOVE,Never Trust a Happy Song,spotify:track:0GO8y8jQk1PkHzS31d699N,trackdone,trackdone,False,False,False,1735708933,False,5,Wednesday,2025-01,3.63355,Tongue Tied - GROUPLOVE,Morning,Single,Single,218.0,False,0,3.63355,False,False,2025-01-01,22,22,False,False,0


### Trend Projection

In [8]:
# Top 5 Artists
top_artists_time = (
    df.groupby('master_metadata_album_artist_name')['ms_played']
    .sum()
    .sort_values(ascending=False)
    .head(5) / 60000
)

# Project for the full year
df['ts'] = pd.to_datetime(df['ts'])
months_so_far = df['ts'].dt.month.nunique()

# Scale up as if the year continued with the same pattern
# (There's a pattern? LOL)
scale_factor = 12 / months_so_far
top_artists_projected = (top_artists_time * scale_factor).round(2)

print("Projected Top 5 Artists by Listening Time (Full Year): ")
print(top_artists_projected)

Projected Top 5 Artists by Listening Time (Full Year): 
master_metadata_album_artist_name
Ellise            5315.65
Vlad Holiday      4473.00
Nico & Chelsea    3515.39
Nico Collins      2551.81
Waterparks        2532.03
Name: ms_played, dtype: float64


In [9]:
# Top 5 Tracks
top_tracks_time = (
    df.groupby('track_artist')['ms_played']
    .sum()
    .sort_values(ascending=False)
    .head(5) / 60000
)

# Project for the full year (reuses scale_factor from the artists cell)
top_tracks_projected = (top_tracks_time * scale_factor).round(2)

print("Projected Top 5 Songs by Listening Time (Full Year): ")
print(top_tracks_projected)

Projected Top 5 Songs by Listening Time (Full Year): 
track_artist
Eye to Eye - Nico & Chelsea                                    3154.37
505 - Arctic Monkeys                                            998.49
My Life Is Over - Chelsea Collins                               659.09
Cupid's Chokehold / Breakfast in America - Gym Class Heroes     611.01
BRAINDEAD - WesGhost                                            519.48
Name: ms_played, dtype: float64


In [10]:
# Most Active Listening Hour
hourly_listening = (
    df.groupby('hour')['ms_played']
    .sum()
    .sort_values(ascending=False)
)

top_hour = hourly_listening.idxmax()
minutes = round(hourly_listening.max() / 60000, 1)
print(f"My Peak Listening Hour: {top_hour}, with approximately {minutes} minutes listened")

My Peak Listening Hour: 1, with approximately 3297.7 minutes listened
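One thing to keep in mind: the `ts` values carry a `+00:00` offset and the first row has `ts` 05:11:45 with `hour` 5, so the `hour` column appears to be in UTC. Below is a minimal sketch of converting to local time before grouping. The `toy` DataFrame is made up, and the time zone `America/Port_of_Spain` is an assumption based on `conn_country` being `TT`, so swap in whatever zone actually applies.

```python
import pandas as pd

# Hypothetical rows; the first timestamp matches the first row of df.head()
toy = pd.DataFrame({
    "ts": pd.to_datetime(
        ["2025-01-01 05:11:45+00:00", "2025-01-02 01:30:00+00:00"], utc=True
    ),
    "ms_played": [158443, 200000],
})

# Assumption: local time is Trinidad and Tobago (UTC-4, no daylight saving)
toy["local_hour"] = toy["ts"].dt.tz_convert("America/Port_of_Spain").dt.hour

# Total milliseconds played per local hour
local_hourly = toy.groupby("local_hour")["ms_played"].sum()
print(local_hourly)
```

If that assumption holds, a UTC peak hour of 1 would correspond to 9 PM local time.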


In [12]:
# Total Listening Time Estimate
total_minutes = df['minutes_played'].sum()
print(f"Estimated Total Listening Time: {total_minutes:.1f} minutes")

Estimated Total Listening Time: 34403.6 minutes
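Note that the total above is a year-to-date sum, not a full-year projection. A minimal sketch of applying the same months-based scaling used for artists and tracks might look like this. It uses a tiny made-up DataFrame (`toy`) so it runs on its own; the names `months_seen`, `toy_scale`, and `projected_total` are illustrative, and the numbers are not my data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real listening history
toy = pd.DataFrame({
    "ts": pd.to_datetime(["2025-01-05", "2025-02-10", "2025-03-15"], utc=True),
    "minutes_played": [100.0, 150.0, 50.0],
})

# Same idea as scale_factor: months observed so far scaled up to 12 months
months_seen = toy["ts"].dt.month.nunique()
toy_scale = 12 / months_seen

year_to_date = toy["minutes_played"].sum()
projected_total = year_to_date * toy_scale

print(f"Year-to-date: {year_to_date:.1f} minutes")
print(f"Projected full year: {projected_total:.1f} minutes")
```

This treats every observed month as complete and typical, which is a rough assumption if the latest month is only partly over.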


### Preprocessing

In [None]:
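# Encode the categorical target variable.
# NOTE: df.head() above shows no 'top_artist' column, so this assumes the
# artist name column is the intended target.
le = LabelEncoder()
df['top_artist_encoded'] = le.fit_transform(df['master_metadata_album_artist_name'])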

### Random Forest Classifier
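It is a tree-based ensemble model that builds many decision trees, each on a random sample of the rows and features, and combines their predictions (for classification, by averaging the trees' class probabilities).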
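To make that concrete, here is a minimal sketch of fitting a `RandomForestClassifier` with scikit-learn. It runs on synthetic data rather than my listening history: the feature matrix `X`, the labels `y`, and the hyperparameters (`n_estimators=100`, a 75/25 split) are all assumptions for illustration, not the final setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real features (e.g. hour, minutes_played);
# y plays the role of an encoded binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 25% of the rows for evaluation, keeping class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 100 trees, each trained on a bootstrap sample of the training rows
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

rf_accuracy = accuracy_score(y_test, rf.predict(X_test))
print(f"Random Forest accuracy on synthetic data: {rf_accuracy:.2f}")
```

Tree-based models do not need feature scaling, so no `StandardScaler` is used here.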

### K-Nearest Neighbors (KNN)
It classifies each new row by finding the k most similar rows (its nearest neighbors) in the training data and taking the majority label among them.
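A minimal sketch of KNN with scikit-learn is below, again on synthetic data. Because KNN relies on distances, features on very different scales (like `ms_played` versus `hour`) can dominate the result, so this sketch puts a `StandardScaler` in front of the classifier. The data, `n_neighbors=5`, and the pipeline layout are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features; the third column is on a much larger scale
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] *= 1000
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale features first so no single column dominates the distance
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

knn_accuracy = accuracy_score(y_test, knn.predict(X_test))
print(f"KNN accuracy on synthetic data: {knn_accuracy:.2f}")
```

Fitting the scaler inside the pipeline means it only learns from the training rows, which avoids leaking test data into the scaling step.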