### Recommender System

Earlier, we created a dataset of playlists and tracks. We then performed preprocessing, text vectorization (using BoW and GloVe), and topic modeling (using LDA based on BoW) on the track lyrics.

Now, using the GloVe text vectorizations, we will create a content-based recommender system as well as a user-based collaborative filtering recommender system. We will also compare these two models' performances.

In [None]:
!pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.0/772.0 kB 6.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163004 sha256=2f293c679f1046b6512d195f45b92cdf0dbd6b9d4418ab3f13217385bd2b74a1
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.3 surprise-0.1


In [None]:
import pickle
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


from surprise import Dataset, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
from surprise.prediction_algorithms.knns import KNNWithMeans


#### Load dataframe from csv file

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load in the playlists dataset with GloVe word embeddings from csv file

df = pd.read_csv("/content/drive/MyDrive/ML Application/Final Code/data/playlist_with_embeddings_dataset.csv")
df.head()

Unnamed: 0,Playlist ID,Playlist Name,Track Name,Artist Name,Album Name,Track URI,Lyrics,Preprocessed Lyrics,Preprocessed Playlist Name,Preprocessed Track Name,Preprocessed Album Name,Lyrics Embedding,Playlist Name Embedding,Dominant Topic,Dominant Topic Probability
0,30,Garage Rock,Take My Side,Will Butler,Policy,spotify:track:6v4zAuJTlszNdKrbbnEFu8,Where's the fire? Let it burn\nWhere're the ch...,"['fire', 'let', 'burn', 'child', 'child', 'got...","['garage', 'rock']","['take', 'side']",['policy'],[ 1.47042751e-01 1.87495738e-01 2.75591016e-...,[-6.4670e-01 9.8998e-01 -1.4379e-01 -3.0598e-...,Self-Empowerment + Confident,0.921849
1,30,Garage Rock,Everyday it Starts,Parquet Courts,Content Nausea,spotify:track:25JD35LDh7CEJ9gKUNruVj,Everyday it starts\nEveryday it starts\nEveryd...,"['everyday', 'start', 'everyday', 'start', 'ev...","['garage', 'rock']","['everyday', 'start']","['content', 'nausea']",[ 0.14214703 0.13033065 -0.1995055 -0.361942...,[-6.4670e-01 9.8998e-01 -1.4379e-01 -3.0598e-...,Life + Struggles,0.5017
2,30,Garage Rock,Content Nausea,Parquet Courts,Content Nausea,spotify:track:24bk2iKzr3VcymFnzIy3oS,"Content nausea, World War Four\nSeems like it ...","['content', 'nausea', 'world', 'war', 'four', ...","['garage', 'rock']","['content', 'nausea']","['content', 'nausea']",[ 2.36339107e-01 1.71193734e-01 3.91736887e-...,[-6.4670e-01 9.8998e-01 -1.4379e-01 -3.0598e-...,Self-Empowerment + Confident,0.893172
3,30,Garage Rock,Slide Machine,Parquet Courts,Content Nausea,spotify:track:7rjK8CDTtTn2KG9Zja7ETj,I've been down South where they use the slide ...,"['south', 'use', 'slide', 'machine', 'god', 'o...","['garage', 'rock']","['slide', 'machine']","['content', 'nausea']",[-0.04323885 -0.0324073 0.3804018 -0.931053...,[-6.4670e-01 9.8998e-01 -1.4379e-01 -3.0598e-...,Relationships,0.98926
4,30,Garage Rock,Pretty Machines,Parquet Courts,Content Nausea,spotify:track:1zPn4tfkWfowVG3ROo1wUY,"Ah, moonlight\nIt's hard to believe it\nAnd it...","['moonlight', 'hard', 'believe', 'harder', 'ne...","['garage', 'rock']","['pretty', 'machine']","['content', 'nausea']",[ 0.11442823 0.0718515 0.0574954 -0.275225...,[-6.4670e-01 9.8998e-01 -1.4379e-01 -3.0598e-...,Self-Empowerment + Confident,0.749053


In [None]:
# The 'Lyrics Embedding' column is loaded from the csv as strings instead of np.ndarrays
# We need to convert them back (non-string entries, i.e. missing values, become NaN)
df['Lyrics Embedding'] = df['Lyrics Embedding'].apply(lambda x: np.fromstring(x[1:-1], sep=' ') if isinstance(x, str) else np.nan)
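
To illustrate what this conversion does, here is a minimal sketch on a made-up embedding string. The values and the `parse_embedding` helper are hypothetical and not taken from the dataset. An array written to csv is stored as its printed form, so the brackets are stripped and the rest is split on whitespace. The sketch uses `str.split` rather than `np.fromstring`; for well-formed strings both should give the same array.

```python
import numpy as np

def parse_embedding(text):
    """Turn the printed form of a 1-D array, e.g. '[ 0.1 -0.2 ]', back into an ndarray."""
    if not isinstance(text, str):
        return np.nan  # missing embeddings stay missing
    # Remove the surrounding brackets, split on any whitespace (spaces or line breaks)
    return np.array(text.strip().strip('[]').split(), dtype=float)

example = "[ 1.5e-01  2.0e-01\n -3.0e-01]"  # made-up values; long arrays are printed over several lines
vec = parse_embedding(example)
print(vec.shape, vec.dtype)
```

The result is a float array of shape `(3,)`, which is the form `np.stack` needs later when the embeddings are combined into a 2-D matrix.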

In [None]:
print("Dtype:", df['Lyrics Embedding'].dtype)
df.head()

Dtype: object


Unnamed: 0,playlist_id,playlist_name,track_name,artist_name,album_name,track_uri,lyrics,preprocessed_lyrics,lyrics_embedding
0,30,Garage Rock,Take My Side,Will Butler,Policy,spotify:track:6v4zAuJTlszNdKrbbnEFu8,Where's the fire? Let it burn\nWhere're the ch...,"['fire', 'let', 'burn', 'child', 'child', 'got...","[0.15700878, 0.21275039, 0.02718402, -0.339923..."
1,30,Garage Rock,Everyday it Starts,Parquet Courts,Content Nausea,spotify:track:25JD35LDh7CEJ9gKUNruVj,Everyday it starts\nEveryday it starts\nEveryd...,"['everyday', 'start', 'everyday', 'start', 'ev...","[0.14214703, 0.13033065, -0.1995055, -0.361942..."
2,30,Garage Rock,Everyday it Starts,Parquet Courts,Content Nausea,spotify:track:25JD35LDh7CEJ9gKUNruVj,Everyday it starts\nEveryday it starts\nEveryd...,"['everyday', 'start', 'everyday', 'start', 'ev...","[0.14214703, 0.13033065, -0.1995055, -0.361942..."
3,30,Garage Rock,Content Nausea,Parquet Courts,Content Nausea,spotify:track:24bk2iKzr3VcymFnzIy3oS,"Content nausea, World War Four\nSeems like it ...","['content', 'nausea', 'world', 'war', 'four', ...","[0.24021536, 0.20146036, 0.056565303, -0.26903..."
4,30,Garage Rock,Content Nausea,Parquet Courts,Content Nausea,spotify:track:24bk2iKzr3VcymFnzIy3oS,"Content nausea, World War Four\nSeems like it ...","['content', 'nausea', 'world', 'war', 'four', ...","[0.24021536, 0.20146036, 0.056565303, -0.26903..."


#### Calculate cosine similarity matrix (similarity between all pairs of songs based on GloVe vectorization)

In [None]:
# Create df of unique tracks, their lyrics embedding, and an 'Index' column
# 'Index' will help us map between rows of the similarity matrix and Track URIs later on

keep_cols = ['Track Name', 'Artist Name', 'Album Name', 'Track URI', 'Lyrics Embedding']
unique_tracks_df = df.drop_duplicates(subset='Track URI')[keep_cols].reset_index(drop=True)

# Add 'index' column, just in case original indices are lost if the df is grouped, etc.
unique_tracks_df['Index'] = range(len(unique_tracks_df))

unique_tracks_df.head()

Unnamed: 0,Track Name,Artist Name,Album Name,Track URI,Lyrics Embedding,Index
0,Take My Side,Will Butler,Policy,spotify:track:6v4zAuJTlszNdKrbbnEFu8,"[0.147042751, 0.187495738, 0.0275591016, -0.36...",0
1,Everyday it Starts,Parquet Courts,Content Nausea,spotify:track:25JD35LDh7CEJ9gKUNruVj,"[0.14214703, 0.13033065, -0.1995055, -0.361942...",1
2,Content Nausea,Parquet Courts,Content Nausea,spotify:track:24bk2iKzr3VcymFnzIy3oS,"[0.236339107, 0.171193734, 0.0391736887, -0.28...",2
3,Slide Machine,Parquet Courts,Content Nausea,spotify:track:7rjK8CDTtTn2KG9Zja7ETj,"[-0.04323885, -0.0324073, 0.3804018, -0.931053...",3
4,Pretty Machines,Parquet Courts,Content Nausea,spotify:track:1zPn4tfkWfowVG3ROo1wUY,"[0.11442823, 0.0718515, 0.0574954, -0.27522582...",4


In [None]:
# Prepare to compute cosine similarity matrix

# Get NumPy array of all lyrics embeddings
# Drop rows with missing values in the 'Lyrics Embedding' column; there should only be one such row
unique_tracks_df = unique_tracks_df.dropna(subset=['Lyrics Embedding'])
unique_tracks_df['Index'] = range(len(unique_tracks_df))
df = df.dropna(subset=['Lyrics Embedding'])

# Convert the 'Lyrics Embedding' column to a NumPy array (one embedding per unique track)
vectors = unique_tracks_df['Lyrics Embedding'].to_numpy()
print(vectors.shape)
vectors_2d = np.stack(vectors)

(7703,)


In [None]:
# Compute cosine similarity matrix for all pairs of vectors
sim_matrix = cosine_similarity(vectors_2d)
print(sim_matrix)

[[1.         0.89476373 0.97581476 ... 0.90803152 0.97345787 0.93849852]
 [0.89476373 1.         0.90244384 ... 0.84664793 0.89630115 0.85024605]
 [0.97581476 0.90244384 1.         ... 0.90761921 0.97277721 0.9440965 ]
 ...
 [0.90803152 0.84664793 0.90761921 ... 1.         0.86909817 0.88881653]
 [0.97345787 0.89630115 0.97277721 ... 0.86909817 1.         0.91215628]
 [0.93849852 0.85024605 0.9440965  ... 0.88881653 0.91215628 1.        ]]
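
As a quick sanity check on what the matrix above contains, the sketch below computes cosine similarity by hand for three made-up 2-D vectors (toy values, not real embeddings). Each entry is the dot product of two vectors divided by the product of their norms, which is the same quantity `cosine_similarity` returns, so the diagonal is 1 and the matrix is symmetric.

```python
import numpy as np

toy = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])  # made-up vectors for illustration

# Normalize each row to unit length, then take all pairwise dot products
unit = toy / np.linalg.norm(toy, axis=1, keepdims=True)
manual = unit @ unit.T

print(np.round(manual, 3))
```

Row 0 and row 2 are orthogonal, so their similarity is 0, while row 0 and row 1 are 45 degrees apart, giving a similarity of about 0.707.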


#### Content-Based Recommender System

Given a user (represented by a playlist of tracks), the system recommends the top $n$ tracks that are most similar, in semantic content (track lyrics), to the tracks already in the playlist.

Our approach follows the rating-estimation formula learned in class:

$$ \hat{r}_{u,i} = \frac{\sum_{j \in N_i^K} {\rm sim}(i,j) r_{u,j} }{\sum_{j \in N_i^K} {\rm sim}(i,j)}$$

Essentially, given a user (a playlist), we can estimate the rating of a track $i$ that is not in the playlist. We find the $K$ tracks most similar to $i$ (the neighborhood $N_i^K$) and take a similarity-weighted average of their ratings $r_{u,j}$: a neighbor that is in the playlist gets a rating of 5, and, in the implementation below, a neighbor that is not in the playlist gets a rating of 1. Dividing by the sum of the similarities normalizes the estimate so that, for positive similarities, it falls between 1 and 5.
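
A small worked example may make the formula concrete. The similarities and playlist memberships below are made up for illustration, and `weighted_rating` is a hypothetical helper rather than the implementation used in this notebook. It assumes a rating of 5 for neighbors in the playlist and 1 for neighbors outside it.

```python
def weighted_rating(neighbors, in_playlist_rating=5, default_rating=1):
    """neighbors: list of (similarity, is_in_playlist) pairs for the K nearest tracks."""
    # Numerator: each similarity weighted by the neighbor's (assumed) rating
    num = sum(sim * (in_playlist_rating if in_pl else default_rating)
              for sim, in_pl in neighbors)
    # Denominator: plain sum of similarities
    denom = sum(sim for sim, _ in neighbors)
    return num / denom

# Three nearest neighbors of a candidate track i (made-up numbers)
neighbors = [(0.9, True), (0.8, False), (0.7, True)]
print(weighted_rating(neighbors))
```

Here the estimate is $(0.9 \cdot 5 + 0.8 \cdot 1 + 0.7 \cdot 5) / (0.9 + 0.8 + 0.7) = 8.8 / 2.4 \approx 3.67$. The more of a track's nearest neighbors are already in the playlist, the closer its estimated rating gets to 5.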

In [None]:
# HELPER Functions: index_to_uri and uri_to_index
# Use the unique_tracks_df to convert between the two (will be useful for when finding cosine similarity via index, but need to map those results to a specific track)

def index_to_uri(index):
    uri = unique_tracks_df.loc[unique_tracks_df['Index'] == index, 'Track URI'].values[0]
    return uri


def uri_to_index(uri):
    index = unique_tracks_df.loc[unique_tracks_df['Track URI'] == uri, 'Index'].values[0]
    return index


# HELPER Function: get_uris_in_playlist
# Given a playlist id, get the set of all unique tracks in that playlist

def get_uris_in_playlist(playlist_id):
    playlist = df[df['Playlist ID'] == playlist_id]
    return set(playlist['Track URI'])


# HELPER Function: get_topk_index_sim(uri, k)
# Given a track_uri, checks cosine similarity matrix to return [index, similarity] pairs for the top k most similar lyrics

def get_topk_index_sim(uri, k):

    # Get index to search similarity matrix with
    index = uri_to_index(uri)

    # Extract similarities for that index, then sort in descending order
    sims = sim_matrix[index]
    order = np.argsort(sims)[::-1]

    # Drop the track itself explicitly: tracks with identical embeddings also have
    # similarity 1, so the track is not guaranteed to be sorted first
    top_k_indices = order[order != index][:k]

    return [[i, sims[i]] for i in top_k_indices]


# HELPER Function: estimate_rating
# Use the formula described above to estimate ratings of tracks the playlist has not seen before

def estimate_rating(uri, topk, uris_in_plist):
    num = 0
    denom = 0
    for index, sim in topk:
        denom += sim

        neighbor_uri = index_to_uri(index)  # separate name so the 'uri' argument is not overwritten
        if neighbor_uri in uris_in_plist:  # similar track is in the playlist: weight it by the 'rating' of 5
            num += (sim * 5)
        else:  # similar track is not in the playlist: implicit rating of 1
            num += sim
    return num / denom


# HELPER Function: get_recs
# Given a dataframe with estimated ratings for new tracks, returns the top k as a clean recommendations dataframe

def get_recs(ratings_df, k):
    # Sort ratings_df and retain only top k
    ratings_df = ratings_df.sort_values(by='Estimated Rating', ascending=False)
    top_k_ratings_df = ratings_df.head(k)

    # Iterate through top k tracks and return df with relevant info
    rows = []
    for i, row in top_k_ratings_df.iterrows():
        uri = row['Track URI']
        rating = row['Estimated Rating']

        # Find row in original df that matches with track_uri
        match = unique_tracks_df.loc[unique_tracks_df['Track URI'] == uri]

        # Extract fields
        tr_name = match['Track Name'].values[0]
        art_name = match['Artist Name'].values[0]
        alb_name = match['Album Name'].values[0]
        trk_uri = match['Track URI'].values[0]

        # Create new row for official recommendation
        new_row = {'Track URI': trk_uri,'Track Name': tr_name, 'Artist Name': art_name, 'Album Name': alb_name, 'Recommendation Score': rating}

        rows.append(new_row)

    return pd.DataFrame(rows)
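
The `index_to_uri` and `uri_to_index` helpers above scan the DataFrame on every call, which may contribute to the recommender's running time. One possible alternative, sketched here on a toy list of made-up URIs, is to build two dictionaries once and look values up directly. The names `uri_to_idx` and `idx_to_uri` are hypothetical and are not used elsewhere in this notebook.

```python
# Made-up URIs standing in for the unique tracks, in row order
toy_uris = ['spotify:track:aaa', 'spotify:track:bbb', 'spotify:track:ccc']

# Build both mappings once; each later lookup is a single dictionary access
uri_to_idx = {uri: i for i, uri in enumerate(toy_uris)}
idx_to_uri = dict(enumerate(toy_uris))

print(uri_to_idx['spotify:track:bbb'])
print(idx_to_uri[2])
```

In the notebook, the list of URIs would presumably come from `unique_tracks_df['Track URI']`, whose row positions match the `Index` column.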


In [None]:
def content_rec_topk_tracks(playlist_id, topk=10):

    # Get set of all unique tracks in playlist_id
    uris_in_plist = get_uris_in_playlist(playlist_id)

    # Create df with estimated ratings for all unique tracks not in the playlist
    rows = []

    # Iterate through all unique track_uris and calculate rating
    unique_track_uris = df['Track URI'].unique().tolist()
    for i in unique_track_uris:
        if i not in uris_in_plist:  # potential rec
            similar = get_topk_index_sim(i, topk)
            est_rating = estimate_rating(i, similar, uris_in_plist)

            # Add to estimated ratings df
            row = {'Track URI': i, 'Estimated Rating': est_rating}
            rows.append(row)

    # Create dataframe with recommendation info
    ratings_df = pd.DataFrame(rows)

    # Return df with recommendations
    return get_recs(ratings_df, topk)


In [None]:
# TEST: Testing our content-based rec sys
# Find a playlist to test
df.head(1)

Unnamed: 0,Playlist ID,Playlist Name,Track Name,Artist Name,Album Name,Track URI,Lyrics,Preprocessed Lyrics,Preprocessed Playlist Name,Preprocessed Track Name,Preprocessed Album Name,Lyrics Embedding,Playlist Name Embedding,Dominant Topic,Dominant Topic Probability
0,30,Garage Rock,Take My Side,Will Butler,Policy,spotify:track:6v4zAuJTlszNdKrbbnEFu8,Where's the fire? Let it burn\nWhere're the ch...,"['fire', 'let', 'burn', 'child', 'child', 'got...","['garage', 'rock']","['take', 'side']",['policy'],"[0.147042751, 0.187495738, 0.0275591016, -0.36...",[-6.4670e-01 9.8998e-01 -1.4379e-01 -3.0598e-...,Self-Empowerment + Confident,0.921849


In [None]:
# WARNING: This cell takes 25 seconds to run!
# TEST: Testing our content-based rec sys
# Let's find recommendations for 'Garage Rock', which has 'playlist_id' = 30

recs = content_rec_topk_tracks(30, topk=5)

In [None]:
recs.head(5)

Unnamed: 0,Track URI,Track Name,Artist Name,Album Name,Recommendation Score
0,spotify:track:1soBMqyS6BvqVHUXsanF9d,How The West Was Won And Where It Got Us,R.E.M.,New Adventures In Hi-Fi,3.399109
1,spotify:track:6uCmU6ldcsVpLAKNCojVg8,Fire Escape,Andrew McMahon in the Wilderness,Zombies On Broadway,3.39858
2,spotify:track:7gHJaZYdapleFeZbG5YuUz,Almost Home,Ben Rector,Brand New,2.602044
3,spotify:track:0MPVkwXbh5ZYBIePKtCt6n,Three Blocks,Real Estate,Days,2.601927
4,spotify:track:0gmbgwZ8iqyMPmXefof8Yf,How You Remind Me - LP Mix,Nickelback,Silver Side Up,2.601416


In [None]:
# Aliased so it does not shadow surprise's train_test_split imported earlier
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

# Split (playlist, track) rows into train and test sets
train_df, test_df = sk_train_test_split(df, test_size=0.25, random_state=42)

# Get track URIs for a playlist using only the training rows
def get_uris_in_playlist_train(playlist_id):
    playlist = train_df[train_df['Playlist ID'] == playlist_id]
    return set(playlist['Track URI'])

# Evaluate model on the test dataset
def evaluate_model(test_df, k=10):
    y_true = []
    y_pred = []
    for _, row in test_df.iterrows():
        train_uris = get_uris_in_playlist_train(row['Playlist ID'])  # tracks the model is allowed to see
        held_out_uri = row['Track URI']  # track hidden from the model
        if held_out_uri in train_uris:  # duplicate row also present in training data; skip to avoid leakage
            continue
        # Estimate the rating of the held-out track from the training tracks only
        similar = get_topk_index_sim(held_out_uri, k)
        y_true.append(5)  # Assume a high rating for songs actually in the playlist
        y_pred.append(estimate_rating(held_out_uri, similar, train_uris))
    mse = mean_squared_error(y_true, y_pred)
    rmse = sqrt(mse)
    return mse, rmse

# Calculate MSE and RMSE
mse, rmse = evaluate_model(test_df)
print(f"MSE: {mse}, RMSE: {rmse}")
