## Recreating Spotify's Discover Weekly

Spotify launched its Discover Weekly playlists in 2015, and they have been a great success. The secret? A mixture of collaborative filtering, which draws on the music other users listen to, and a profile of each user's music tastes. A Quartz editor got an exclusive look into the inner workings of the system and a beautiful visual representation of their taste profile.

This notebook explores the recommendation aspect of the system. I'll combine content-based recommendation, using micro-genres and artists, with collaborative filtering on user activity from Whisperify, in an attempt to produce song recommendations as good as Whisperify's.


### Setup

If you have cloned this repository and are running this locally, you should create a Spotify app and define a `.env` file in the `/python` directory with the contents below. 
```
CLIENT_ID="your_spotify_app_id"
CLIENT_SECRET="your_spotify_app_secret"
USER_ID="your_spotify_username"
```
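
Before calling the Spotify API, it can help to confirm that all three values were actually loaded. The following is a minimal sketch, not part of the original notebook: `missing_env_vars` is a hypothetical helper, and it assumes `load_dotenv()` has already copied the `.env` contents into the process environment.

```python
import os

REQUIRED_VARS = ("CLIENT_ID", "CLIENT_SECRET", "USER_ID")

def missing_env_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing .env values:", ", ".join(missing))
```

An empty list means the notebook has everything it needs to request a token.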

**Importing the required libraries**

We'll be using the `spotipy` library to obtain Spotify data, `dotenv` to load the env variables from the `.env` file, the `pandas` library to manipulate our data, and `surprise` and `sklearn` for the collaborative filtering and content recommendation models. 

In [73]:
import os, sys
import numpy as np
import pandas as pd

from dotenv import load_dotenv
import spotipy
import spotipy.util as util

from surprise.model_selection import train_test_split
from surprise import Dataset, Reader, SVD, accuracy
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

To recommend songs, we need datasets to train on. I parsed the user data from Whisperify into CSV files formatted similarly to the MovieLens dataset. One file contains the ratings that listeners gave to each song, and the other contains the metadata of each song.

The data in the files has been anonymized.

In [74]:
data = pd.read_csv(os.path.join(sys.path[0], 'pos_rating.csv'))
df_songs = pd.read_csv(os.path.join(sys.path[0], 'songs.csv'))

In [75]:
# Preview of pos_rating.csv
data.head()

Unnamed: 0,user,song,rating,country
0,1,60oMXS1hZtvK1u4wmmOIyT,5,CH
1,1,2MsNSKQNQNRklkKFxxvIav,5,CH
2,1,5qz2JLETV6S7nWakMB37ki,5,CH
3,1,4knqZHNHlEBBEmfETcpTyr,5,CH
4,1,0nNrpT5W9YOvR3bEa7ym12,5,CH


In [76]:
# Preview of songs.csv
df_songs.head()

Unnamed: 0,id,title,artists,genres
0,60oMXS1hZtvK1u4wmmOIyT,Virtual Friends,DROELOE,"bass trap,edm,electronic trap,future bass,pop ..."
1,2MsNSKQNQNRklkKFxxvIav,9 and Three Quarters (Run Away),TOMORROW X TOGETHER,k-pop
2,5qz2JLETV6S7nWakMB37ki,FAST,"CHUNG HA,Mommy Son","k-pop,k-rap"
3,4knqZHNHlEBBEmfETcpTyr,D.D.D,THE BOYZ,"k-pop,k-pop boy group"
4,0nNrpT5W9YOvR3bEa7ym12,Moonwalk,WayV,"chinese idol pop,k-pop"


Now, since the data is anonymized, we need to call the Spotify API to get our own top songs in the same format as the dataframes above.

In [77]:
# loads and sets API keys
load_dotenv()

client_id = os.getenv('CLIENT_ID')
client_secret = os.getenv('CLIENT_SECRET')
user_id = os.getenv("USER_ID")

# redirect uri and scope for obtaining Spotify token
redirect_uri = "http://localhost:9999"
scope = 'user-library-read playlist-modify-public user-top-read user-read-private'

In [78]:
# obtains user token from Spotify
token = util.prompt_for_user_token(user_id, scope=scope, client_id=client_id,
                                   client_secret=client_secret, redirect_uri=redirect_uri)
sp = spotipy.Spotify(auth=token)

# make request for my top tracks
top_tracks = sp.current_user_top_tracks(limit=49, time_range="long_term")
songs = top_tracks["items"]
top_tracks = sp.next(top_tracks) # there are 2 pages of top tracks
songs.extend(top_tracks['items'])

In [79]:
i = 0
df_arr = []
while i < len(songs):
  retsongs = songs[i:min(i + 50, len(songs))]

  artist_in_arr = []
  track_name = []
  artist_arr = []
  for track in retsongs:
    track_name.append(track["name"])
    a_tmp = []
    for artist in track["artists"]:
      artist_arr.append(artist["id"])
      a_tmp.append(artist["id"])
    artist_in_arr.append(a_tmp)
  artists_api = list(set(artist_arr))
  j = 0
  artist_info = []
  while j < len(artists_api):
    ret_art = sp.artists(artists_api[j:min(j + 50, len(artists_api))])
    artist_info += ret_art["artists"]
    j += 50
  genres = {}
  for artist in artist_info:
    genres[artist["id"]] = {'genres': artist["genres"], 'name': artist["name"]}
  for ind in range(len(artist_in_arr)):
    item = artist_in_arr[ind]
    artist_name = []
    genres_list = []
    for artist in item:
      artist_name.append(genres[artist]["name"])
      genres_list += genres[artist]["genres"]
    new_row = [songs[i + ind]["id"], track_name[ind], ",".join(artist_name), ",".join(genres_list)]
    df_arr.append(new_row)
  i += 50

my_songinfo = pd.DataFrame(df_arr,columns=["id", "title", "artists", "genres"])
my_songinfo.head()

Unnamed: 0,id,title,artists,genres
0,6NVjujGb9fnl25fjzm5dTy,Freak (feat. Bonn),"Avicii,Bonn","big room,dance pop,edm,pop,swedish pop"
1,77w4HJEAGzRwHTapyXjFl1,Holes - Radio Edit,Passenger,"folk-pop,neo mellow,pop,pop rock"
2,7FMEEYbIfiEazN02exf120,The Wrong Direction,Passenger,"folk-pop,neo mellow,pop,pop rock"
3,0L9lXMXddmoBbBUeF7A9An,Nervous (The Ooh Song) - Mark McCabe Remix,"Gavin James,Mark McCabe","irish pop,neo mellow,pop,pop rock,viral pop,ir..."
4,7DFNE7NO0raLIUbgzY2rzm,Let Her Go,Passenger,"folk-pop,neo mellow,pop,pop rock"


In [80]:
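rating_data = []
my_user = sp.current_user()
# The API returns top tracks in ranked order, so split the ranking into five
# roughly equal bands and rate them from 5 (top band) down to 1 (bottom band).
total = len(songs) // 5 + 1

for counter, song in enumerate(songs):
    rating_data.append([my_user["id"], song["id"], 5 - counter // total, my_user["country"]])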

my_ratinginfo = pd.DataFrame(rating_data,columns=["user", "song", "rating", "country"])
my_ratinginfo.head()

Unnamed: 0,user,song,rating,country
0,at8official,6NVjujGb9fnl25fjzm5dTy,5,NL
1,at8official,77w4HJEAGzRwHTapyXjFl1,5,NL
2,at8official,7FMEEYbIfiEazN02exf120,5,NL
3,at8official,0L9lXMXddmoBbBUeF7A9An,5,NL
4,at8official,7DFNE7NO0raLIUbgzY2rzm,5,NL


### Content-based recommendation

The first approach we'll implement is content-based recommendation. It relies on item-item relationships: it suggests songs based on how similar they are to songs the user already likes.

**Preprocess song metadata**

We'll be using `TfidfVectorizer`, a text feature extraction algorithm that weights each word by how frequently it appears in a document and discounts it by the number of documents it appears in.

We need to provide the important text of a song as a single string, so we turn the comma-separated genres column into a space-separated one. We could also add the artist name to the metadata column, but since genres on Spotify are currently tied to artists, that addition might make the recommendations too specific.
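
To make the weighting concrete, here is a dependency-free sketch of TF-IDF on a few genre strings from the `songs.csv` preview. It is an illustration under stated assumptions rather than what `TfidfVectorizer` does internally: `tfidf_vectors` is a hypothetical helper, it uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalisation, and it splits on whitespace only, whereas the library's default tokenizer also breaks on punctuation such as hyphens.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed inverse document frequency: rarer terms weigh more.
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Genre strings from the songs.csv preview, with commas replaced by spaces.
genres = ["k-pop,k-rap", "k-pop,k-pop boy group", "chinese idol pop,k-pop"]
docs = [g.replace(",", " ") for g in genres]
vectors = tfidf_vectors(docs)

# "k-pop" appears in every document, so it carries less weight than "k-rap".
print(vectors[0])
```

Because `k-pop` occurs in all three strings, it tells us little about any one song, while the rarer `k-rap` gets the larger weight.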

In [81]:
# Add personal song data onto data from CSV
data = pd.concat([data,my_ratinginfo])
df_songs = pd.concat([df_songs,my_songinfo]).drop_duplicates(subset='id', keep="first")
print(data.shape, df_songs.shape)


(1058023, 4) (341255, 4)


In [82]:
# preprocessing for string recognition, get data into metadata column and join full song title for readability
df_songs['metadata'] = df_songs["genres"].apply(
    lambda x: str(x).replace(",", " "))
df_songs['full'] = df_songs["title"] + " - " + df_songs["artists"]
df_metadata = df_songs[["id", "full", "metadata"]]

# join the song and user rating data, filling any gaps
df_joined = data.merge(df_songs, left_on="song", right_on="id")
df_joined.fillna("", inplace=True)
# Preview Roddy Ricch - The Box in the data after preprocessing
df_joined[df_joined.song == "0nbXyq5TXYPCO7pr3N8S4I"].head()
print(df_joined.shape, df_metadata.shape)

(1058023, 10) (341255, 3)


**Train TF-IDF Model**

All that's left now is to fit the model on the `metadata` column we just created, then define a function that scores how similar every song is to the songs a given user has listened to.
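
One way to score the catalogue against a user's history is to average, for each candidate song, its similarity to every song the user likes. `linear_kernel` computes a plain dot product, which equals cosine similarity when the vectors have unit L2 norm (the `TfidfVectorizer` default). The sketch below shows the idea on sparse dictionaries; the vectors and the helper names `dot` and `average_similarity` are hypothetical.

```python
def dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def average_similarity(liked, catalogue):
    """Mean dot-product similarity of each catalogue vector to the liked vectors."""
    return [
        sum(dot(song, candidate) for song in liked) / len(liked)
        for candidate in catalogue
    ]

# Hypothetical, already L2-normalised genre vectors.
liked = [
    {"pop": 1.0},
    {"pop": 0.6, "folk-pop": 0.8},
]
catalogue = [
    {"folk-pop": 1.0},            # index 0
    {"edm": 1.0},                 # index 1
    {"pop": 0.8, "edm": 0.6},     # index 2
]

scores = average_similarity(liked, catalogue)
# Catalogue indices ordered from most to least similar to the user's taste.
ranked = sorted(range(len(catalogue)), key=lambda i: scores[i], reverse=True)
print(ranked, scores)
```

A song that shares no genre terms with the user's history scores zero and sinks to the bottom of the ranking.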

In [83]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_metadata['metadata'])
print("Matrix Shape:", tfidf_matrix.shape)

Matrix Shape: (341255, 2160)


In [84]:
def get_content_rec(user):
    user_songs = df_joined[df_joined.user == user]
    user_sims = []
    for index, row in user_songs.iterrows():
        new_song_vector = tfidf.transform([row["metadata"]])
        user_sims.append(linear_kernel(new_song_vector, tfidf_matrix))
    user_scores = sum(user_sims)[0] / len(user_songs)
    sim_scores = []
    for i in range(len(user_scores)):
        # gets the song id in column 0, and full song name in column 1
        sim_scores.append(
            [i, user_scores[i], df_metadata.iloc[i, 0], df_metadata.iloc[i, 1]])
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    return sim_scores

In [85]:
# Test out the recommendations
content_rec = pd.DataFrame(get_content_rec(user_id), columns=[
                          "idx", "sim_score", "song", "full"])
content_rec.head(10)

Unnamed: 0,idx,sim_score,song,full
0,230189,0.455747,2OZhWugxgM4wvjI3MXRhJg,Heavenly Father (feat. Noah Kahan) - Recorded ...
1,291875,0.443453,4ERDsm0YhyQuP32Qrvg2ne,"The Truth - Sam Feldt Remix - James Blunt,Sam ..."
2,168573,0.4321,3vRvuatWmKcHhxbv6choLR,"Fiction (feat. Tom Odell) - Kygo,Tom Odell"
3,159891,0.429353,3nHHMV5YnocvTRiYqolTm9,"Pieces (feat. Noah Kahan) - Matoma,Noah Kahan"
4,45755,0.425094,3KxA3AlLpX2RXaARLnpuCt,"Kids - Seeb Remix - OneRepublic,Seeb"
5,174053,0.425094,7dOeiXeTSfA1ixaYmQcWu7,"Rich Love (with Seeb) - OneRepublic,Seeb"
6,25065,0.423517,1ti46UopDrJPwxNl5AMmVB,Best I Can (With Seeb) - Petey Remix - America...
7,275519,0.423517,0GNOl0qrkeLPBdKhA08OIu,"Best I Can (With Seeb) - American Authors,Seeb"
8,26993,0.421221,4sJqSKPc5fZ5OZ8JiVI44N,"Stranger Things (feat. OneRepublic) - Kygo,One..."
9,41016,0.421221,7xbWAw3LMgRMn4omR5yVn3,"Lose Somebody - Kygo,OneRepublic"


### Collaborative Filtering

The first approach actually returns pretty great results, but it mainly chooses songs by artists you already like. What if you want to explore further? Collaborative filtering looks at user-item relationships, and uses data from users like you to recommend songs.

**Singular Value Decomposition**

We'll be using SVD for matrix factorization. We load the ratings data, split it into a training set and a test set, fit the model on the training set, and then check the accuracy of our predictions on the test set.
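
For intuition about what this kind of model learns, here is a small pure-Python sketch of matrix factorization trained with stochastic gradient descent. It predicts a rating as the global mean plus a user bias, an item bias, and the dot product of user and item factor vectors, which is my understanding of the form Surprise's `SVD` uses. The function `train_mf`, the toy ratings, and the hyperparameters are all illustrative assumptions, not the notebook's actual settings.

```python
import math
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors to (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        # Unknown users or items fall back to the global mean plus known biases.
        est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
        if u in p and i in q:
            est += sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return est

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Move each parameter along the error gradient, shrunk by regularisation.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict

def rmse(predict, ratings):
    """Root mean squared error of the predictions over the given ratings."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Hypothetical toy ratings: (user, song, rating on the 1-5 scale).
toy = [
    ("ann", "s1", 5), ("ann", "s2", 4), ("ann", "s3", 1),
    ("bob", "s1", 5), ("bob", "s3", 2),
    ("cat", "s2", 1), ("cat", "s3", 5),
]
predict = train_mf(toy)
print("train RMSE:", round(rmse(predict, toy), 3))
print("bob on s2:", round(predict("bob", "s2"), 2))
```

The last line predicts a rating for a song `bob` has never rated, which is exactly the kind of estimate the recommender later ranks by.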

In [86]:
reader = Reader(rating_scale=(1, 5))
rating_data = Dataset.load_from_df(data[['user', 'song', 'rating']], reader)

trainset, testset = train_test_split(rating_data, test_size=.25)
algorithm = SVD(n_epochs=5, verbose=True)
algorithm.fit(trainset)
predictions = algorithm.test(testset)

accuracy.rmse(predictions)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
RMSE: 0.8110


0.8110085716022891

In [87]:
def pred_user_rating(userId):
    if userId in data.user.unique():
        userId_list = data[data.user == userId].song.tolist()
        predictedL = []
        songs_to_check = list(set(data.song.tolist()))
        for i in songs_to_check:
            if i not in userId_list:
                predicted = algorithm.predict(userId, i)
                predictedL.append((i, predicted[3] / 5))
        top_predictions = sorted(predictedL, key=lambda x: x[1], reverse=True)
        return top_predictions

Our accuracy isn't too great (an RMSE of about 0.81 on a 1-5 scale), and you can see that in the recommended songs as well. I've heard of only one or two of the artists, and the genres aren't ones I actually listen to.

In [88]:
collab_only = pd.DataFrame(pred_user_rating(
    user_id), columns=["song", "col_score"])
collab_only.merge(df_metadata, left_on="song", right_on="id").head(10)

Unnamed: 0,song,col_score,id,full,metadata
0,6n2OY7elbEfj8FTK8oeouw,0.801113,6n2OY7elbEfj8FTK8oeouw,Creature of Habit - Slowly Slowly,aussie emo australian garage punk australian h...
1,3IM7Y0M9ZJp0M1vUZ6z2Wa,0.787811,3IM7Y0M9ZJp0M1vUZ6z2Wa,Race Car Blues - Slowly Slowly,aussie emo australian garage punk australian h...
2,3i8GllbdGuwIXaVrlIoE0r,0.754365,3i8GllbdGuwIXaVrlIoE0r,What do you think? - Agust D,k-rap
3,3nzsuf1WQSBLSnBzokrWfu,0.738172,3nzsuf1WQSBLSnBzokrWfu,Skin - Spacey Jane,australian garage punk australian hip hop aust...
4,0v1x6rN6JHRapa03JElljE,0.737349,0v1x6rN6JHRapa03JElljE,Dynamite - BTS,k-pop k-pop boy group
5,0A31i51V3omHTJMDCJxDfh,0.73694,0A31i51V3omHTJMDCJxDfh,Morbid Stuff - PUP,alternative emo canadian indie canadian pop pu...
6,4R2kfaDFhslZEMJqAFNpdd,0.735002,4R2kfaDFhslZEMJqAFNpdd,cardigan - Taylor Swift,dance pop pop post-teen pop
7,7w3pIUjz7BTTqj9uAws40m,0.729845,7w3pIUjz7BTTqj9uAws40m,So What - LOONA,k-pop k-pop girl group
8,2TZ9kaTiLewawHZK7yylSC,0.7297,2TZ9kaTiLewawHZK7yylSC,No Fear No More - Madeon,big room complextro edm electro house electrop...
9,0SJ7vFES0Lj6pnumh3DhCe,0.729248,0SJ7vFES0Lj6pnumh3DhCe,People - Agust D,k-rap


### Combining recommendation systems

Now that we have both functions, we can run both algorithms, get their recommendations, and then combine the scores from both to find the recommended songs for a user.
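
The notebook simply adds the two scores together. A possible variation, sketched here as an assumption rather than the notebook's method, is to rescale each score to the 0-1 range first and blend them with an adjustable weight, so that neither model dominates just because its scores run higher. The helper names, the `content_weight` parameter, and the sample scores are hypothetical.

```python
def min_max(scores):
    """Rescale a {song: score} dict to the 0-1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {song: 0.0 for song in scores}
    return {song: (s - lo) / (hi - lo) for song, s in scores.items()}

def hybrid_scores(content, collab, content_weight=0.5):
    """Weighted blend of two score dicts over the songs present in both."""
    content_n, collab_n = min_max(content), min_max(collab)
    shared = content_n.keys() & collab_n.keys()
    return {
        song: content_weight * content_n[song] + (1 - content_weight) * collab_n[song]
        for song in shared
    }

# Hypothetical scores; song "d" has no collaborative score and is dropped.
content = {"a": 0.45, "b": 0.40, "c": 0.10, "d": 0.30}
collab = {"a": 0.60, "b": 0.80, "c": 0.75}

blend = hybrid_scores(content, collab)
ranking = sorted(blend, key=blend.get, reverse=True)
print(ranking)
```

Keeping only the songs present in both dictionaries mirrors the inner join that `merge` performs by default.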

In [89]:
content_df = pd.DataFrame(get_content_rec(user_id), columns=[
                          "idx", "sim_score", "song", "full"])
collab_df = pd.DataFrame(pred_user_rating(
    user_id), columns=["song", "col_score"])

# Join both recommendations, and sum the two scores
df_hybrid = content_df.merge(collab_df, on="song")
df_hybrid.fillna(0, inplace=True)
df_hybrid["hybrid"] = df_hybrid["sim_score"] + df_hybrid["col_score"]
df_hybrid = df_hybrid.sort_values(
    "hybrid", ascending=False).drop_duplicates(subset=['full'])

df_hybrid.head(20)

Unnamed: 0,idx,sim_score,song,full,col_score,hybrid
0,230189,0.455747,2OZhWugxgM4wvjI3MXRhJg,Heavenly Father (feat. Noah Kahan) - Recorded ...,0.626822,1.082569
55,46930,0.401318,1ZLrDPgR7mvuTco3rQK8Pk,Way Back Home (feat. Conor Maynard) - Sam Feld...,0.679506,1.080824
73,48955,0.399986,46NBoIAHrmR7qcUGCIFEjR,This One's for You (feat. Zara Larsson) (Offic...,0.6692,1.069186
153,151751,0.393774,4fC6tDEFmSBBCRBUOPeQ8b,"Think About You - Galantis Remix - Kygo,Valeri...",0.674837,1.068611
25,74907,0.408059,2wOVXsIRIoniqipE9I5ZVx,"Only Want You - Sam Feldt Remix - Rita Ora,Sam...",0.659538,1.067596
96,202049,0.399986,6IQPONuERifb4i4Syhmbq6,Lovers on the Sun (feat. Sam Martin) - David G...,0.664111,1.064096
37,43552,0.404845,72LdECwKyf3kGV1pd9HxS7,"Grip - Seeb,Bastille",0.657932,1.062777
16,225147,0.41556,3K7vPyMCcecKRotnu08MMP,Crazier Things (with Noah Kahan) - Chelsea Cut...,0.646182,1.061742
168,213018,0.39298,6FuO9rHsf0ZcRs7TBY8fC9,Beneath The Streetlights And The Moon - JP Cooper,0.668093,1.061072
7,275519,0.423517,0GNOl0qrkeLPBdKhA08OIu,"Best I Can (With Seeb) - American Authors,Seeb",0.637238,1.060755


### Conclusion
The recommended songs are suited to my genres, while also having an element of exploration from users similar to me. There is still more to improve, including using audio features for the content-based recommendation, and finding ways to store more user data for the collaborative filtering model. 