# Recommender Playlists

Exploration implementing recommender system techniques using Spotify playlist data:

- Popularity Recommender: Recommend popular songs regardless of user's preferences
- Content-based Recommender: Use song attributes (e.g. genre) to recommend similar songs
- Collaborative Recommender: Predict what songs a user might be interested in based on a collection of preference information from multiple users
- Hybrid Recommender: A hybrid approach can be used to overcome some of the common problems in recommender systems such as the cold start problem and the sparsity problem
- Popularity + Hybrid Recommender: An extension of the hybrid approach which applies weighting/mixes in songs based on popularity

For the purposes of this dataset we will focus on the playlist_tracks_df dataset and treat different playlists as different users.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import yaml

In [69]:
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix
from scipy.sparse import vstack
from scipy.sparse.linalg import svds
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [3]:
# To create a playlist and add tracks
import spotipy 
from spotipy.oauth2 import SpotifyOAuth

## Import Data

- Artist and track data was pulled using the Spotify API via the spotipy package
- Data was saved in pickle format using music_data.py and data_functions.py modules
- The data can now be quickly read by multiple workflows

In [4]:
top_artist_df = pd.read_pickle("spotify/top_artists.pkl")
followed_artists_df = pd.read_pickle("spotify/followed_artists.pkl")
top_tracks_df = pd.read_pickle("spotify/top_tracks.pkl")
saved_tracks_df = pd.read_pickle("spotify/saved_tracks.pkl")
playlist_tracks_df = pd.read_pickle("spotify/playlist_tracks.pkl")
recommendation_tracks_df = pd.read_pickle("spotify/recommendation_tracks.pkl")

In [5]:
playlist_tracks_df.head()

Unnamed: 0,id,name,popularity,type,is_local,explicit,duration_ms,disc_number,track_number,artist_id,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,uri,track_href,analysis_url,time_signature
0,6fgbQt13JlpN59PytgTMsA,Snow (Hey Oh),61,audio_features,False,False,334667,1,1,0L8ExT028jH3ddEcZwqJJ5,...,0.0499,0.116,1.7e-05,0.119,0.599,104.655,spotify:track:6fgbQt13JlpN59PytgTMsA,https://api.spotify.com/v1/tracks/6fgbQt13JlpN...,https://api.spotify.com/v1/audio-analysis/6fgb...,4
1,3JOdLCIBzQYwHIvpN3isVf,Grand Theft Autumn / Where Is Your Boy,62,audio_features,False,False,191373,1,3,4UXqAaa6dQYAk18Lv7PEgX,...,0.0608,0.000706,0.0,0.275,0.72,135.45,spotify:track:3JOdLCIBzQYwHIvpN3isVf,https://api.spotify.com/v1/tracks/3JOdLCIBzQYw...,https://api.spotify.com/v1/audio-analysis/3JOd...,4
2,7pAT4dOUzjq8Ziap5ShIqC,Where'd You Go (feat. Holly Brook & Jonah Matr...,58,audio_features,False,True,231867,1,6,7dWYWUbO68rXJOcyA7SpJk,...,0.238,0.262,0.00197,0.113,0.25,179.999,spotify:track:7pAT4dOUzjq8Ziap5ShIqC,https://api.spotify.com/v1/tracks/7pAT4dOUzjq8...,https://api.spotify.com/v1/audio-analysis/7pAT...,4
3,1b7vg5T9YKR3NNqXfBYRF7,Check Yes Juliet,53,audio_features,False,False,220133,1,3,3ao3jf5d70Tf4fPh2bnXVl,...,0.0774,0.0024,0.0,0.163,0.314,166.866,spotify:track:1b7vg5T9YKR3NNqXfBYRF7,https://api.spotify.com/v1/tracks/1b7vg5T9YKR3...,https://api.spotify.com/v1/audio-analysis/1b7v...,4
4,12qZHAeOyTf93YAWvGDTat,All The Small Things,0,audio_features,False,False,168000,1,8,6FBDaR13swtiWwGhX1WQsP,...,0.0505,0.00844,0.0,0.529,0.712,148.119,spotify:track:12qZHAeOyTf93YAWvGDTat,https://api.spotify.com/v1/tracks/12qZHAeOyTf9...,https://api.spotify.com/v1/audio-analysis/12qZ...,4


In [6]:
with open("spotify/playlists.yml", 'r') as stream:
    playlist_ids = yaml.safe_load(stream)

## Evaluation Metric

Here we use the Top-N accuracy metric, which evaluates the accuracy of the top recommendations provided to a user by comparing to the items the user has actually interacted in test set. This evaluation method works as follows:

- For each user
    - For each item the user has interacted in test set
        - Sample n other items the user has never interacted with (assume these are not relevant, but the user may just have not been aware of them)
        - Ask the recommender model to produce a ranked list of recommended items, from a set composed one interacted item and the 100 non-interacted ("non-relevant") items
        - Compute the Top-N accuracy metrics for this user and interacted item from the recommendations ranked list (is the item along the Top-N ranked items)
- Aggregate the global Top-N accuracy metrics

In [8]:
class ModelEvaluator:
    
    def __init__(self, tracks):
        self.tracks = tracks
        
    def evaluate_model_for_playlist(self, model, playlist_id, n=100, seed=42):
        # Getting the items in test set
        tracks_interacted, tracks_not_interacted = get_interacted_tracks(self.tracks, playlist_id)
        train, test = train_test_split(tracks_interacted, test_size=0.2, random_state=seed)
        # Getting a ranked recommendation list from a model for a given user
        ranked_recommendations_df = model.recommend_items(playlist_id)

        hits_at_5_count, hits_at_10_count = 0, 0
        for index, row in test.iterrows():
            non_interacted_sample = tracks_not_interacted.sample(n, random_state=seed)
            evaluation_ids = [row['id']] + non_interacted_sample['id'].tolist()
            evaluation_recommendations_df = ranked_recommendations_df[ranked_recommendations_df['id'].isin(evaluation_ids)]
            # Verifying if the current interacted item is among the Top-N recommended items
            hits_at_5_count += 1 if row['id'] in evaluation_recommendations_df['id'][:5].tolist() else 0
            hits_at_10_count += 1 if row['id'] in evaluation_recommendations_df['id'][:10].tolist() else 0

        playlist_metrics = {'n': n,
                            'evaluation_count': len(test),
                            'hits@5': hits_at_5_count,
                            'hits@10': hits_at_10_count, 
                            'recall@5': hits_at_5_count / len(test),
                            'recall@10': hits_at_10_count / len(test),
                           }
        
        return playlist_metrics

    def evaluate_model(self, model, n=100, seed=42):
        playlists = []
        for playlist_id in self.tracks['playlist_id'].unique():
            playlist_metrics = self.evaluate_model_for_playlist(model, playlist_id, n=n, seed=seed)  
            playlist_metrics['playlist_id'] = playlist_id
            playlists.append(playlist_metrics)

        detailed_playlists_metrics = pd.DataFrame(playlists).sort_values('evaluation_count', ascending=False)
        
        global_recall_at_5 = detailed_playlists_metrics['hits@5'].sum() / detailed_playlists_metrics['evaluation_count'].sum()
        global_recall_at_10 = detailed_playlists_metrics['hits@10'].sum() / detailed_playlists_metrics['evaluation_count'].sum()
        
        global_metrics = {'model_name': model.model_name,
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10,
                         }  
                            
        return global_metrics, detailed_playlists_metrics
    
model_evaluator = ModelEvaluator(playlist_tracks_df)

### Interacted tracks

Now to evaluate a model for a playlist (and overall), we need to get both iteracted and non-interacted tracks for a playlist.

In [7]:
def get_interacted_tracks(tracks, playlist_id):
    interacted_track_ids = set(tracks[tracks['playlist_id'] == playlist_id]['id'])
    tracks_interacted = tracks[tracks['id'].isin(interacted_track_ids)]
    tracks_not_interacted = tracks[~tracks['id'].isin(interacted_track_ids)]

    tracks_interacted = tracks_interacted.drop_duplicates(subset='id', keep="first").reset_index()
    tracks_not_interacted = tracks_not_interacted.drop_duplicates(subset='id', keep="first").reset_index()

    return tracks_interacted, tracks_not_interacted

In [48]:
interacted_tracks, non_interacted_tracks = get_interacted_tracks(playlist_tracks_df, playlist_ids['Chill'])

## Popularity Recommender

A popularity based recommender recommends songs in order of overall popularity, regardless of what the user has listened to. Spotify's 'audio features' API call automatically comes with a 'popularity' feature. Although it is 0 in ~10% of cases (higher than you'd expect - these are probably default null values), this is perfect for creating a Popularity recommender.

As song popularity generally accounts for the "wisdom of the crowds", it usually provides good recommendations overall. However this isn't tailored to the user in particular, as a good recommender system should be.

In [127]:
class PopularityRecommender:
    
    def __init__(self, tracks):
        self.tracks = tracks
        self.model_name = 'Popularity Recommender'
    
    def recommend_items(self, playlist_id, ignore_ids=[]):
        recommendations_df = self.tracks[~self.tracks['id'].isin(ignore_ids)] \
                                .drop_duplicates(subset='id', keep="first").reset_index() \
                                .sort_values('popularity', ascending=False)

        return recommendations_df
    
popularity_model = PopularityRecommender(playlist_tracks_df)

In [47]:
# You can see this is essentially sorted by popularity
popularity_model_recommendations = popularity_model.recommend_items(playlist_ids['Chill'], interacted_tracks['id'].tolist())
popularity_model_recommendations.head()

NameError: name 'popularity_model' is not defined

In [179]:
popularity_model_metrics, popularity_model_details = model_evaluator.evaluate_model(popularity_model)

print(popularity_model_metrics)
popularity_model_details[[x for x in popularity_model_details.columns if x != 'playlist_id']] \
    .sort_values('recall@5', ascending=False) \
    .head(10)

{'model_name': 'Popularity', 'recall@5': 0.09318497913769123, 'recall@10': 0.17385257301808066}


Unnamed: 0,n,evaluation_count,hits@5,hits@10,recall@5,recall@10
28,100,2,2,2,1.0,1.0
2,100,14,7,7,0.5,0.5
12,100,17,7,9,0.411765,0.529412
9,100,17,7,9,0.411765,0.529412
14,100,20,8,11,0.4,0.55
41,100,21,8,11,0.380952,0.52381
0,100,19,6,6,0.315789,0.315789
11,100,36,9,11,0.25,0.305556
46,100,4,1,1,0.25,0.25
15,100,20,5,6,0.25,0.3


## Content-based Recommender

A content-based recommender leverages attributes from items the user has interacted with to recommend similar items. As it depends only on the past this method avoids the cold-start problem.

For text items we can use a popular information retrieval method used in search engines named TF-IDF. This technique converts unstructured text into a vector structure, where each word is represented by a position in the vector, and the value measures how relevant a given word is for an article. We can then compute the cosine similarity between tracks the user has iteracted with and those they haven't.

### TF-IDF

First we need to apply the TF-IDF technique, and use it to build playlist profiles

In [75]:
def get_tfidf(tracks, ngram_range=(1,2), min_df=0.003, max_df=0.5, max_features=5000):
    # Transform list cols to string, we use bigrams later so no need to remove spaces
    tracks['genres_str'] = tracks['genres'].apply(lambda x: ' '.join(x))

    # Vector size 5000 model
    vectorizer = TfidfVectorizer(analyzer='word',
                                 ngram_range=ngram_range,
                                 min_df=min_df,
                                 max_df=max_df,
                                 max_features=max_features,
                                 stop_words=stopwords.words('english'))  # might need to download stopwords, follow prompt

    # Don't include album_genres column here as similar to genres column and we don't want additional genre weighting in this case 
    # vectorizer.fit_transform takes a string
    tfidf_matrix = vectorizer.fit_transform(tracks['name'] + ' ' +
                                            tracks['artist_name'] + ' ' +
                                            tracks['album_name'] + ' ' +
                                            tracks['playlist_name'] + ' ' +
                                            tracks['genres_str']
                                           )
    tfidf_feature_names = vectorizer.get_feature_names()

    return tfidf_matrix, tfidf_feature_names

In [76]:
tfidf_matrix, tfidf_feature_names = get_tfidf(playlist_tracks_df)
tfidf_matrix

<3505x948 sparse matrix of type '<class 'numpy.float64'>'
	with 47341 stored elements in Compressed Sparse Row format>

In [89]:
def get_item_profile(tracks, track_id, tfidf_matrix):
    idx = tracks['id'].tolist().index(track_id)
    item_profile = tfidf_matrix[idx:idx+1]
    return item_profile

def get_item_profiles(tracks, track_ids, tfidf_matrix):
    item_profiles_list = [get_item_profile(tracks, x, tfidf_matrix) for x in track_ids]
    item_profiles = vstack(item_profiles_list)
    return item_profiles

def build_users_profile(tracks, playlist_id, interactions_indexed_df, tfidf_matrix):
    # There isn't any weighting we want to do in this case, 
    # but a common approach is weighting by interaction strength (liking, commenting, etc.)
    interactions_df = interactions_indexed_df.loc[playlist_id]
    user_item_profiles = get_item_profiles(tracks, interactions_df['id'], tfidf_matrix)
    return user_item_profiles

def build_users_profiles(tracks, tfidf_matrix): 
    user_profiles = {}
    for playlist_id in tracks['playlist_id'].unique():
        interacted_tracks, non_interacted_tracks = get_interacted_tracks(tracks, playlist_id)
        user_profiles[playlist_id] = build_users_profile(tracks, playlist_id, interacted_tracks.set_index('playlist_id'), tfidf_matrix)
    return user_profiles

In [90]:
user_profiles = build_users_profiles(playlist_tracks_df, tfidf_matrix)
len(user_profiles)

KeyError: '4hG2oTIaiIjSgemDp9bWc1'

### Content-based Recommender

Now with our matrix of playlist profiles setup, we can apply a content-based recommender.

In [70]:
class ContentRecommender:
    
    def __init__(self, tracks):
        self.tracks = tracks
        self.model_name = 'Content-based Recommender'

#     def _get_similar_items_to_user_profile(self, person_id, topn=1000):
#         #Computes the cosine similarity between the user profile and all item profiles
#         cosine_similarities = cosine_similarity(user_profiles[person_id], tfidf_matrix)
#         #Gets the top similar items
#         similar_indices = cosine_similarities.argsort().flatten()[-topn:]
#         #Sort the similar items by similarity
#         similar_items = sorted([(item_ids[i], cosine_similarities[0,i]) for i in similar_indices], key=lambda x: -x[1])
#         return similar_items
        
#     def recommend_items(self,user_id, items_to_ignore=[], topn=10, verbose=False):
#         similar_items = self._get_similar_items_to_user_profile(user_id)
#         #Ignores items the user has already interacted
#         similar_items_filtered = list(filter(lambda x: x[0] not in items_to_ignore, similar_items))
        
#         recommendations_df = pd.DataFrame(similar_items_filtered, columns=['contentId', 'recStrength']) \
#                                     .head(topn)

#         return recommendations_df
    
content_model = ContentRecommender(playlist_tracks_df)