# Spotify Recommendation Systems

In this notebook I try a couple of different recommendation systems for music on Spotify.

1. Content-Based
    - Using Spotify's track audio features, I compute the Euclidean distance between songs to find songs that sound similar to each other, then recommend the most similar ones.  This takes into account features provided by Spotify such as acousticness, loudness, energy, key, and tempo.
    
2. Collaborative Filtering
    - This approach takes a user's history into account.  It identifies songs that a user has liked, finds other users who have liked many of the same songs, and then recommends songs those users liked that the original user has not (presuming the user has not yet heard the song, rather than heard and disliked it).
    - I wanted to attempt this, but I don't believe it's possible to get data on that many individual users, even anonymized, so it is not implemented here.
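
Since collaborative filtering is not implemented in this notebook, here is a minimal sketch of the idea on a tiny, made-up like matrix. The users, songs, the `recommend` helper, and the choice of Jaccard overlap as the similarity measure are all assumptions for illustration; none of this comes from Spotify data.

```python
import numpy as np
import pandas as pd

# Hypothetical like matrix: rows are users, columns are songs, 1 = liked.
likes = pd.DataFrame(
    [[1, 1, 0, 0, 1],
     [1, 1, 1, 0, 0],
     [0, 0, 1, 1, 0],
     [1, 0, 0, 0, 1]],
    index=['ann', 'bob', 'cat', 'dan'],
    columns=['song_a', 'song_b', 'song_c', 'song_d', 'song_e'],
)

def recommend(user, likes, n_neighbors=2):
    """Recommend songs liked by the users whose likes overlap most with `user`."""
    target = likes.loc[user]
    others = likes.drop(index=user)
    # Jaccard similarity: songs both users liked / songs either user liked
    # (assumes every pair of users has at least one liked song between them)
    shared = others.values @ target.values
    either = ((others.values + target.values) > 0).sum(axis=1)
    similarity = pd.Series(shared / either, index=others.index)
    neighbors = similarity.nlargest(n_neighbors).index
    # Count how many neighbors liked each song, keeping only songs the user has not liked
    scores = likes.loc[neighbors].sum(axis=0)
    scores = scores[(target == 0) & (scores > 0)]
    return scores.sort_values(ascending=False).index.tolist()

print(recommend('ann', likes))
```

In this toy example, `ann` overlaps most with `dan` and `bob`, and the only song they like that `ann` has not liked is `song_c`.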

In [1]:
import requests
import spotipy

import sys
sys.path.append('../modules/')
import lyrics_grab
import credentials
from spotipy.oauth2 import SpotifyClientCredentials
import pickle
import numpy as np
import pandas as pd

In [2]:
# connect to Spotify API
auth_manager = SpotifyClientCredentials(client_id=credentials.spotify_client_id,
                                        client_secret=credentials.spotify_client_secret)
sp = spotipy.Spotify(auth_manager=auth_manager)

In [3]:
# load the pickled raw data on metal artists
with open('../data/metal_raw.pickle','rb') as rf:
    metal_raw = pickle.load(rf)

In [4]:
# extract only relevant info
artists = lyrics_grab.extract_artist_info(metal_raw)

In [5]:
# get data from Spotify about each song
# songs = lyrics_grab.extract_song_info(list(artists.keys()))

In [6]:
# with open('../data/song_info.pickle','wb') as out:
#     pickle.dump(songs,out)

In [7]:
# song_ids = lyrics_grab.get_song_ids(songs)

In [8]:
# get characteristics for each song from Spotify (such as loudness, energy, etc.)
# audio_features = lyrics_grab.get_audio_features(song_ids)

In [9]:
# with open('../data/audio_features.pickle','wb') as out:
#     pickle.dump(audio_features,out)

In [10]:
with open('../data/song_info.pickle','rb') as rf:
    songs = pickle.load(rf)
    
with open('../data/audio_features.pickle','rb') as rf:
    audio_features = pickle.load(rf)

In [11]:
audio_df = pd.DataFrame(audio_features)
songs_df = pd.DataFrame(songs)

In [90]:
df = songs_df.merge(audio_df,left_on='id',right_on='id')

In [91]:
df.sort_values('popularity',ascending=False,inplace=True)
df.reset_index(drop=True,inplace=True)

In [92]:
df = df.iloc[:20000,:]

In [93]:
df_large = df.copy()

In [94]:
#only keep columns that will be relevant to end user
df = df[['song_name','artist_name','album_name','popularity','link']]

# eliminate non-metal artists
non_metal_artists = [
    'Rauw Alejandro', 'Jowell & Randy', 'Rachel Platten', "Rag'n'Bone Man",
    'Lenin Ramírez', 'Au/Ra', 'Hot Chelle Rae', 'Don Omar', 'Rae Sremmurd',
    'girl in red', 'Corinne Bailey Rae', 'Ray Parker Jr.', 'Elle King',
    'Omarion', 'Leona Lewis', 'Chance the Rapper', 'Lin-Manuel Miranda',
    'White Noise Baby Sleep', 'Ray LaMontagne', 'Rain Sounds', 'The Weeknd',
    'Rain Sounds For Sleep', 'RaeLynn', 'Carin Leon', 'Rascal Flatts',
    'Isaiah Rashad', 'Ray J', 'Céline Dion', 'Baby Sleep',
    'Taylor Ray Holbrook', 'Vancouver Sleep Clinic',
]
# keep only rows whose artist is not in the exclusion list
df = df[~df.artist_name.isin(non_metal_artists)]

In [95]:
with open('../../lyric-nlp/song_df.pickle','wb') as out:
    pickle.dump(df,out)

In [96]:
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Scale and perform PCA on all of the "audio features".  Then take pairwise distances.

In [97]:
scaler = StandardScaler()

In [98]:
# keep the numeric audio features; popularity is not an audio feature, so drop it
# (dropping on the result instead of in place avoids pandas' SettingWithCopyWarning)
num_data = df_large.select_dtypes('number').drop(columns='popularity')

In [99]:
scaled_data = scaler.fit_transform(num_data)

In [100]:
pca = PCA().fit(scaled_data)

In [101]:
sum(pca.explained_variance_ratio_[:8])

0.8127899073612093
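
The cutoff of 8 components above was read off by summing the explained-variance ratios by hand. As a sketch of one way to pick that number programmatically, the snippet below finds the smallest number of components whose cumulative explained variance reaches a target. The data is synthetic and the 0.80 target is an assumption chosen to mirror the value above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the audio-feature matrix: 500 "songs", 14 features,
# built from 4 latent factors plus noise so that the features are correlated.
latent = rng.normal(size=(500, 4))
features = latent @ rng.normal(size=(4, 14)) + 0.3 * rng.normal(size=(500, 14))

scaled = StandardScaler().fit_transform(features)
pca = PCA().fit(scaled)

# Running total of explained variance, one entry per component
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest component count whose cumulative explained variance reaches the target
target = 0.80
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, cumulative[n_components - 1])
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, in which case it keeps enough components to explain that fraction of the variance.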

In [102]:
pca_data = pca.transform(scaled_data)

In [103]:
pca_data.shape

(20000, 14)

In [104]:
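# pairwise Euclidean distances between songs on the first 8 principal components
# (despite the variable name, these are distances: smaller means more similar)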
song_similarities = pairwise_distances(pca_data[:,:8])

In [105]:
song_similarities.shape

(20000, 20000)

In [106]:
song_dist_df = pd.DataFrame(song_similarities)

In [107]:
# 1st percentile of the distances from song 0 to every song, as a guide for choosing a cutoff
np.percentile(np.array(song_dist_df[0]),1)

1.6910657507014797

In [108]:
# zero out every distance above the cutoff (1.5 here, a bit below the 1st-percentile value) so the matrix becomes sparse
song_dist_df[song_dist_df>1.5] = 0

In [109]:
import scipy.sparse

In [110]:
# make sparse to save space
sparse_mat = scipy.sparse.csc_matrix(song_dist_df)

In [111]:
with open('../../lyric-nlp/song_similarities_sparse.pickle','wb') as out:
    pickle.dump(sparse_mat,out)

In [None]:
# find the index of a given song
df[(df.song_name=='Bat Country') & (df.artist_name=='Avenged Sevenfold')].index

In [113]:
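# take 20 songs from column 445 of the thresholded matrix, ranked by stored distance
# note: ascending=False puts the largest stored distances (just under the cutoff) first;
# sorting the nonzero entries with ascending=True would give the closest songs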
result = pd.DataFrame(sparse_mat[:,445].toarray()).sort_values(by=0,ascending=False).head(20).index

In [114]:
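# look up the selected songs in df and show their accompanying info
# note: df had rows filtered out above, so its positions (iloc) no longer line up exactly with the matrix rows built from df_large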
df.iloc[result,:].sort_values('popularity',ascending=False).head(10)

| | song_name | artist_name | album_name | popularity | link |
|---|---|---|---|---|---|
| 999 | Runaway | Linkin Park | Hybrid Theory (Bonus Edition) | 64 | https://open.spotify.com/track/6xtQ23d8GEXgcxy... |
| 1042 | Red Beam (feat. Sean Kingston) | Trippie Redd | Pegasus | 63 | https://open.spotify.com/track/1xyCNQC00FGkQdY... |
| 2081 | Moving On | Asking Alexandria | From Death To Destiny | 58 | https://open.spotify.com/track/7IiqF2tYiixnpBc... |
| 4446 | Weathered | Creed | Weathered | 51 | https://open.spotify.com/track/33P0xVFGIPfCB2Z... |
| 5529 | Here Without You - Acoustic | 3 Doors Down | Here Without You (Acoustic) | 49 | https://open.spotify.com/track/2HIU8kiFo9sB5ih... |
| 5713 | City Is Ours | Big Time Rush | BTR | 48 | https://open.spotify.com/track/2UI9qShtz4xwB4B... |
| 6423 | In My Sword I Trust | Ensiferum | Unsung Heroes | 47 | https://open.spotify.com/track/54eGNQCFTHq0q9d... |
| 8248 | Bato Sa Buhangin | Cinderella | Sce: T.L Ako Sa'yo | 44 | https://open.spotify.com/track/4hIQjUWvzYYviDq... |
| 9481 | Savior | Any Given Day | Overpower | 43 | https://open.spotify.com/track/2d1ZaZ3XjkTP43N... |
| 11409 | Failed Hope | Slaughter to Prevail | Misery Sermon | 40 | https://open.spotify.com/track/61IdfYsnIuNtlPu... |
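
As an alternative to building and thresholding the full 20,000 × 20,000 distance matrix, the nearest songs can be queried directly. The sketch below uses scikit-learn's `NearestNeighbors` on synthetic points standing in for `pca_data[:, :8]`; the data and the seed position `445` are placeholders, not values verified against the real dataset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic stand-in for pca_data[:, :8]: 1,000 "songs" in 8 dimensions
points = rng.normal(size=(1000, 8))

# Index the points once; the default metric is Euclidean distance
nn = NearestNeighbors(n_neighbors=21).fit(points)

query = 445  # hypothetical row position of the seed song
distances, indices = nn.kneighbors(points[[query]])

# Results come back sorted from nearest to farthest; the seed song
# matches itself at distance 0, so drop it and keep the next 20
keep = indices[0] != query
neighbor_idx = indices[0][keep][:20]
neighbor_dist = distances[0][keep][:20]
print(neighbor_idx[:5], neighbor_dist[:5])
```

The returned positions refer to rows of the array that was fit, so they should be looked up with `iloc` in a frame that has the same row order (here that would be the unfiltered `df_large`), with any artist filtering applied afterwards.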


In [None]:
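# unexecuted leftover cell: df has no 'scores' column in this notebook, so this raises a KeyError as written
artist_scores = df.groupby('artist_name')[['scores','popularity']].mean().reset_index().sort_values('scores',ascending=False).reset_index(drop=True)
artist_scores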

In [115]:
df_large.head()

| | id | song_name | artist_name | album_name | popularity | duration_ms_x | explicit | link | danceability | energy | ... | instrumentalness | liveness | valence | tempo | type | uri | track_href | analysis_url | duration_ms_y | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4ZRrLHqzhGRXYj2qcB4s5S | Tattoo - Remix with Camilo | Rauw Alejandro | Tattoo (Remix with Camilo) | 91 | 222680 | False | https://open.spotify.com/track/4ZRrLHqzhGRXYj2... | 0.848 | 0.637 | ... | 0.0 | 0.0521 | 0.698 | 96.988 | audio_features | spotify:track:4ZRrLHqzhGRXYj2qcB4s5S | https://api.spotify.com/v1/tracks/4ZRrLHqzhGRX... | https://api.spotify.com/v1/audio-analysis/4ZRr... | 222680 | 4 |
| 1 | 3ThrfRAmijYU098H9q9tAs | Enchule | Rauw Alejandro | Enchule | 85 | 185718 | False | https://open.spotify.com/track/3ThrfRAmijYU098... | 0.763 | 0.666 | ... | 0.0 | 0.301 | 0.587 | 90.086 | audio_features | spotify:track:3ThrfRAmijYU098H9q9tAs | https://api.spotify.com/v1/tracks/3ThrfRAmijYU... | https://api.spotify.com/v1/audio-analysis/3Thr... | 185719 | 4 |
| 2 | 50ZC4PM7hywH27RcCfViau | Elegí (feat. Dímelo Flow) | Rauw Alejandro | Elegí (feat. Dímelo Flow) | 85 | 197721 | False | https://open.spotify.com/track/50ZC4PM7hywH27R... | 0.824 | 0.631 | ... | 0.000116 | 0.0531 | 0.678 | 171.965 | audio_features | spotify:track:50ZC4PM7hywH27RcCfViau | https://api.spotify.com/v1/tracks/50ZC4PM7hywH... | https://api.spotify.com/v1/audio-analysis/50ZC... | 197721 | 4 |
| 3 | 60a0Rd6pjrkxjPbaKzXjfq | In the End | Linkin Park | Hybrid Theory (Bonus Edition) | 84 | 216880 | False | https://open.spotify.com/track/60a0Rd6pjrkxjPb... | 0.556 | 0.864 | ... | 0.0 | 0.209 | 0.4 | 105.143 | audio_features | spotify:track:60a0Rd6pjrkxjPbaKzXjfq | https://api.spotify.com/v1/tracks/60a0Rd6pjrkx... | https://api.spotify.com/v1/audio-analysis/60a0... | 216880 | 4 |
| 4 | 75JFxkI2RXiU7L9VXzMkle | The Scientist | Coldplay | A Rush of Blood to the Head | 84 | 309600 | False | https://open.spotify.com/track/75JFxkI2RXiU7L9... | 0.557 | 0.442 | ... | 1.5e-05 | 0.11 | 0.213 | 146.277 | audio_features | spotify:track:75JFxkI2RXiU7L9VXzMkle | https://api.spotify.com/v1/tracks/75JFxkI2RXiU... | https://api.spotify.com/v1/audio-analysis/75JF... | 309600 | 4 |


### By Artist

Now I do the same thing but aggregating to the artist level.

In [178]:
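# average every numeric column per artist; numeric_only=True skips text columns such as id and link,
# which recent pandas versions refuse to average
artist_scores = df_large.groupby('artist_name').mean(numeric_only=True)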

In [179]:
# remove non-metal artists (I don't know how they came up in a Spotify search for genre=metal)

artist_scores.drop(['Rauw Alejandro','Jowell & Randy','Rachel Platten',"Rag'n'Bone Man",
                   'Lenin Ramírez','Au/Ra','Hot Chelle Rae','Don Omar','Rae Sremmurd','girl in red',
                   'Corinne Bailey Rae','Ray Parker Jr.','Elle King','Omarion','Leona Lewis','Chance the Rapper',
                   'Lin-Manuel Miranda','White Noise Baby Sleep','Ray LaMontagne','Rain Sounds','The Weeknd',
                   'Rain Sounds For Sleep','RaeLynn','Carin Leon','Rascal Flatts','Isaiah Rashad','Ray J',
                   'Céline Dion','Baby Sleep','Taylor Ray Holbrook','Vancouver Sleep Clinic',"Tmsoft’s White Noise Sleep Sounds",
                   'Nelson Haynes','Sleepy Clouds','Léon Branche','Deep Sleep Music Collective','Sleep Baby Sleep',
                   'Sleep Tentacles','White Noise Therapy','Binaural Beats Sleep','The Sound Designers','Pathfinder',
                   'Nile While','Jalen Tyree','Duncan Honeybourne','Thelonious Monk','Mrm Team','Oceanografers'],axis=0,inplace=True)

Perform the same steps as before, this time keeping more principal components (12 instead of 8) because there is far less data.

1. scale and perform PCA
2. calculate pairwise distances between artists
3. turn to sparse matrix
4. obtain input of an artist and find index of that artist
5. find n closest artists to that index
6. pass that result back to the dataframe and return information
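
The six steps above can be sketched end to end on a small, made-up artist table. The artist names, feature values, and the `closest_artists` helper are hypothetical; the sparse conversion in step 3 is skipped because nothing is zeroed out in this toy version.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical artist-level table: one row per artist, averaged audio features
toy_artists = pd.DataFrame(
    rng.normal(size=(6, 5)),
    index=pd.Index(['Artist A', 'Artist B', 'Artist C',
                    'Artist D', 'Artist E', 'Artist F'], name='artist_name'),
    columns=['energy', 'loudness', 'tempo', 'valence', 'danceability'],
)

def closest_artists(name, artist_features, n=3, n_components=3):
    # 1. scale and perform PCA
    scaled = StandardScaler().fit_transform(artist_features)
    reduced = PCA(n_components=n_components).fit_transform(scaled)
    # 2. pairwise Euclidean distances between artists
    distances = pairwise_distances(reduced)
    # 3. (sparse conversion skipped: the matrix is small and fully dense)
    # 4. find the row position of the requested artist
    position = artist_features.index.get_loc(name)
    # 5. smallest distances first, excluding the artist itself
    order = np.argsort(distances[position])
    order = order[order != position][:n]
    # 6. return the matching artists with their distances
    return pd.DataFrame({'artist_name': artist_features.index[order],
                         'distance': distances[position, order]})

result = closest_artists('Artist A', toy_artists)
print(result)
```

Sorting in ascending order matters here: the smallest distances are the most similar artists, and the artist's own zero distance has to be excluded explicitly.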

In [180]:
# refit the scaler on the artist-level averages and scale them in one step
scaled_artist_data = scaler.fit_transform(artist_scores)

In [181]:
pca_artist = PCA().fit(scaled_artist_data)

In [182]:
pca_artist_data = pca_artist.transform(scaled_artist_data)

In [183]:
sum(pca_artist.explained_variance_ratio_[:12])

0.9533864611086745

In [184]:
artist_distances = pairwise_distances(pca_artist_data[:,:12])

In [185]:
artist_distances.shape

(1796, 1796)

In [186]:
sparse_artists = scipy.sparse.csc_matrix(artist_distances)

In [187]:
with open('../../lyric-nlp/artists_sparse.pickle','wb') as out:
    pickle.dump(sparse_artists,out)

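In [189]:
# artist names and mean popularity, indexed by row position in the distance matrix
save = artist_scores.reset_index()[['artist_name','popularity']]

In [None]:
# find the row position of a given artist
ind = save[save.artist_name=='Metallica'].index

In [196]:
# rank all artists by their distance to the artist at position 960
# note: ascending=False puts the largest distances first, so this selects the 20 artists farthest from it;
# ascending=True would list the closest ones, starting with the artist itself at distance 0
result = pd.DataFrame(sparse_artists[:,960].toarray()).sort_values(by=0,ascending=False).head(20).index

In [197]:
save.iloc[result,:].head(10)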

| | artist_name | popularity |
|---|---|---|
| 1349 | Sounds Of Nature : Thunderstorm, Rain | 55.0 |
| 1366 | Stephen Barton | 36.0 |
| 1794 | Æther Realm | 35.0 |
| 1761 | YPC Next Level | 33.0 |
| 926 | Lucid Nightmare | 38.0 |
| 1480 | The Haunted House of Horror Sound Effects | 35.0 |
| 696 | Hypnosis Therapy | 37.0 |
| 1168 | Rain Sounds Dreamer | 39.0 |
| 19 | A Winged Victory for the Sullen | 35.0 |
| 577 | Flawless2k | 33.0 |
