# Spotify Recommendation System

This project involves using content-based filtering to build a music recommendation system for Spotify songs. The main purpose is to build myself a new playlist containing songs that have an upbeat reggae/RnB vibe, as I've just recently started to explore this specific genre of music.

### Import Libraries

In [1]:
import spotipy
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

### Data Cleaning

In [2]:
song_original = pd.read_csv('SpotifyFeatures.csv')
song_original.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


In [3]:
song_original.shape

(232725, 18)

In [4]:
song_original.isnull().sum()

genre               0
artist_name         0
track_name          0
track_id            0
popularity          0
acousticness        0
danceability        0
duration_ms         0
energy              0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
valence             0
dtype: int64

In [5]:
len(song_original) - len(song_original['track_id'].drop_duplicates()) # Number of duplicated track id's

55951

In [6]:
song_original = song_original.drop_duplicates('track_id')

In [7]:
song_original.describe()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
count,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0
mean,36.273162,0.404135,0.541068,236127.2,0.557025,0.172073,0.224531,-10.137605,0.127395,117.203679,0.451595
std,17.391016,0.366302,0.190387,130513.2,0.275839,0.322936,0.211027,6.395551,0.204345,31.325091,0.26782
min,0.0,0.0,0.0569,15387.0,2e-05,0.0,0.00967,-52.457,0.0222,30.379,0.0
25%,25.0,0.0456,0.415,178253.0,0.344,0.0,0.0975,-12.851,0.0368,92.006,0.222
50%,37.0,0.288,0.558,219453.0,0.592,7e-05,0.13,-8.191,0.0494,115.0065,0.44
75%,49.0,0.791,0.683,268547.0,0.789,0.0908,0.277,-5.631,0.102,138.79975,0.667
max,100.0,0.996,0.989,5552917.0,0.999,0.999,1.0,3.744,0.967,242.903,1.0


### Feature Engineering

In [8]:
song_data = song_original
# One-hot encoding
genre = pd.get_dummies(song_data['genre'])
key = pd.get_dummies(song_data['key'])
mode = pd.get_dummies(song_data['mode'])

In [9]:
genre.shape

(176774, 27)

In [10]:
key.shape

(176774, 12)

In [11]:
mode.shape

(176774, 2)

In [12]:
cols = ['track_id','acousticness', 'danceability', 'duration_ms', 'energy', 'instrumentalness',
        'liveness', 'loudness', 'speechiness', 'tempo', 'valence']
scaled = song_data[cols] # New df containing only necessary columns
scaled.head()

Unnamed: 0,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
0,0BRjO6ga9RKCKjfDqeFgWV,0.611,0.389,99373,0.91,0.0,0.346,-1.828,0.0525,166.969,0.814
1,0BjC1NfoEOOusryehmNudP,0.246,0.59,137373,0.737,0.0,0.151,-5.559,0.0868,174.003,0.816
2,0CoSDzoNIKCRs124s9uTVy,0.952,0.663,170267,0.131,0.0,0.103,-13.879,0.0362,99.488,0.368
3,0Gc6TVm52BwZD07Ki6tIvf,0.703,0.24,152427,0.326,0.0,0.0985,-12.178,0.0395,171.758,0.227
4,0IuslXpMROHdEPvSl1fTQK,0.95,0.331,82625,0.225,0.123,0.202,-21.15,0.0456,140.576,0.39


In [13]:
sc = MinMaxScaler()
scaled.iloc[:,1:] = sc.fit_transform(scaled.iloc[:,1:]) # Scale numerical features
scaled.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())


Unnamed: 0,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
0,0BRjO6ga9RKCKjfDqeFgWV,0.613454,0.356292,0.015167,0.910909,0.0,0.339614,0.900856,0.03207,0.642704,0.814
1,0BjC1NfoEOOusryehmNudP,0.246988,0.571934,0.022029,0.737732,0.0,0.14271,0.834469,0.068374,0.675801,0.816
2,0CoSDzoNIKCRs124s9uTVy,0.955823,0.650252,0.027969,0.131113,0.0,0.094241,0.686429,0.014818,0.325182,0.368
3,0Gc6TVm52BwZD07Ki6tIvf,0.705823,0.196438,0.024747,0.326313,0.0,0.089697,0.716695,0.018311,0.665238,0.227
4,0IuslXpMROHdEPvSl1fTQK,0.953815,0.294067,0.012142,0.225209,0.123123,0.194208,0.557054,0.024767,0.518516,0.39


In [14]:
scaled.describe()

Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
count,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0,176774.0
mean,0.405758,0.519438,0.039863,0.557573,0.172245,0.216959,0.753001,0.111341,0.408541,0.451595
std,0.367773,0.204256,0.023569,0.276121,0.323259,0.213087,0.113798,0.216284,0.147396,0.26782
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.045783,0.384186,0.029411,0.344331,0.0,0.088688,0.704721,0.015453,0.289977,0.222
50%,0.289157,0.537603,0.036851,0.592584,7e-05,0.121505,0.787637,0.028789,0.398202,0.44
75%,0.794177,0.671709,0.045717,0.789786,0.090891,0.26994,0.833188,0.084462,0.510158,0.667
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [15]:
song_clean = scaled.join([genre, key, mode]) # Append one-hot encoded features to df

In [16]:
song_clean.head()

Unnamed: 0,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,...,C#,D,D#,E,F,F#,G,G#,Major,Minor
0,0BRjO6ga9RKCKjfDqeFgWV,0.613454,0.356292,0.015167,0.910909,0.0,0.339614,0.900856,0.03207,0.642704,...,1,0,0,0,0,0,0,0,1,0
1,0BjC1NfoEOOusryehmNudP,0.246988,0.571934,0.022029,0.737732,0.0,0.14271,0.834469,0.068374,0.675801,...,0,0,0,0,0,1,0,0,0,1
2,0CoSDzoNIKCRs124s9uTVy,0.955823,0.650252,0.027969,0.131113,0.0,0.094241,0.686429,0.014818,0.325182,...,0,0,0,0,0,0,0,0,0,1
3,0Gc6TVm52BwZD07Ki6tIvf,0.705823,0.196438,0.024747,0.326313,0.0,0.089697,0.716695,0.018311,0.665238,...,1,0,0,0,0,0,0,0,1,0
4,0IuslXpMROHdEPvSl1fTQK,0.953815,0.294067,0.012142,0.225209,0.123123,0.194208,0.557054,0.024767,0.518516,...,0,0,0,0,1,0,0,0,1,0


### Access Spotify API 

In order to generate song suggestions, we need an existing music playlist to calculate cosine similarity between songs. I have a small playlist on Spotify containing around 20 songs which we'll import through the Spotify API.

In [17]:
# Removed id and secret
client_id = 
client_secret = 

In [18]:
# Get playlist data
scope = 'user-library-read'
pl_token = spotipy.util.prompt_for_user_token(scope=scope, client_id=client_id,
                                      client_secret=client_secret,
                                      redirect_uri='http://localhost')
data = spotipy.Spotify(auth = pl_token)
pl_data = {}
for i in data.current_user_playlists()['items']:
    pl_data[i['name']] = i['uri'].split(':')[2]

The playlist we'll import is named '🌅' and contains a small mix of upbeat Reggae and old-school RnB songs. 

In [19]:
# Import playlist data into dataframe
playlist = pd.DataFrame()

for i, x in enumerate(data.playlist(pl_data['🌅'])['tracks']['items']):
    playlist.loc[i, 'artist'] = x['track']['artists'][0]['name']
    playlist.loc[i, 'track_name'] = x['track']['name']
    playlist.loc[i, 'track_id'] = x['track']['id']
    playlist.loc[i, 'url'] = x['track']['album']['images'][1]['url']
    playlist.loc[i, 'date_added'] = x['added_at']
    
playlist['date_added'] = pd.to_datetime(playlist['date_added'])

playlist = playlist[playlist['track_id'].isin(song_original['track_id'].values)]
playlist.sort_values('date_added', inplace = True)
playlist.reset_index(inplace = True, drop = True)

In [20]:
playlist

Unnamed: 0,artist,track_name,track_id,url,date_added
0,Inner Circle,Sweat (A La La La La Long),1SssFw74DdHVjRa6ADggdD,https://i.scdn.co/image/ab67616d00001e02cd07cd...,2022-06-09 05:42:21+00:00
1,UB40,Red Red Wine,4uOKFydzAejjSFqYbv1XPt,https://i.scdn.co/image/ab67616d00001e02f1dd69...,2022-06-09 05:52:00+00:00
2,Will Smith,Miami,6e8Ou0wiqAzIpWb2eSxll8,https://i.scdn.co/image/ab67616d00001e02ddf2f9...,2022-06-09 06:04:04+00:00
3,Will Smith,"Men In Black - From ""Men In Black"" Soundtrack",2FK7fxjzQEXD7Z32HSF0Hl,https://i.scdn.co/image/ab67616d00001e02ddf2f9...,2022-06-09 06:04:33+00:00
4,Bob Marley & The Wailers,Could You Be Loved,5O4erNlJ74PIF6kGol1ZrC,https://i.scdn.co/image/ab67616d00001e021c4041...,2022-06-09 06:05:17+00:00
5,The Notorious B.I.G.,Mo Money Mo Problems (feat. Puff Daddy & Mase)...,4INDiWSKvqSKDEu7mh8HFz,https://i.scdn.co/image/ab67616d00001e02fde79b...,2022-06-09 06:10:27+00:00
6,Michael Jackson,Don't Stop 'Til You Get Enough,46eu3SBuFCXWsPT39Yg3tJ,https://i.scdn.co/image/ab67616d00001e02702729...,2022-06-09 06:12:22+00:00
7,Justin Timberlake,Señorita,0aj2QKJvz6CePykmlTApiD,https://i.scdn.co/image/ab67616d00001e02346a57...,2022-06-09 06:13:46+00:00
8,Snoop Dogg,Signs,4HSAJpNocVNJbwbQvtCMdO,https://i.scdn.co/image/ab67616d00001e02e80371...,2022-06-09 06:16:04+00:00
9,Michael Jackson,Rock with You - Single Version,7oOOI85fVQvVnK5ynNMdW7,https://i.scdn.co/image/ab67616d00001e02702729...,2022-06-09 06:18:05+00:00


In [21]:
playlist.shape

(15, 5)

In [22]:
# Merge songs in playlist with their respective features in the dataset
playlist_full = pd.merge(song_clean, playlist[['track_id', 'date_added']],
                         on = 'track_id', how = 'inner')
playlist.reset_index(inplace = True)
playlist_full.tail()

Unnamed: 0,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,...,D,D#,E,F,F#,G,G#,Major,Minor,date_added
10,1SssFw74DdHVjRa6ADggdD,0.01245,0.783285,0.038226,0.727722,0.0,0.148769,0.796285,0.21359,0.671199,...,0,0,0,0,0,0,0,1,0,2022-06-09 05:42:21+00:00
11,1BkY0N8ChFk2mdLbAUu8ZK,0.460843,0.802596,0.034319,0.418407,0.0,0.281048,0.677621,0.080123,0.56156,...,0,0,0,0,0,1,0,1,0,2022-06-19 07:44:10+00:00
12,1Rb4eWCv2mPz7bVyrwbwvP,0.269076,0.909881,0.055967,0.797794,0.000231,0.092222,0.792068,0.071655,0.3415,...,0,0,0,0,0,0,0,0,1,2022-06-09 09:37:05+00:00
13,62GYoGszQfROZswLee6W3O,0.315261,0.850874,0.037425,0.466456,0.023423,0.120495,0.654508,0.035246,0.374344,...,0,0,0,0,0,0,0,1,0,2022-06-09 06:21:47+00:00
14,1HGyhNaRUFEDBiVLbvtbL6,0.161647,0.886278,0.041465,0.733728,0.00026,0.327497,0.77052,0.111981,0.304549,...,0,0,0,0,0,0,0,0,1,2022-06-12 06:02:57+00:00


In [23]:
playlist_full.shape

(15, 53)

In [24]:
# Create new df excluding all songs already in our playlist
nonplaylist = song_clean[~song_clean['track_id'].isin(playlist['track_id'].values)]

In [25]:
nonplaylist.shape

(176759, 52)

### Calculate Cosine Similarity

In [26]:
most_recent_date = playlist_full.iloc[-1,-1]
for i, x in playlist_full.iterrows():
    # Return number of days from recent date since song was added to playlist
    playlist_full.loc[i, 'days_from_recent'] = int((most_recent_date.to_pydatetime() - x.iloc[-1].to_pydatetime()).days)
playlist_full.head()

Unnamed: 0,track_id,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,...,D#,E,F,F#,G,G#,Major,Minor,date_added,days_from_recent
0,6e8Ou0wiqAzIpWb2eSxll8,0.034036,0.879841,0.032898,0.534525,1.7e-05,0.034766,0.82342,0.136325,0.366039,...,0,0,0,0,0,0,1,0,2022-06-09 06:04:04+00:00,2.0
1,0aj2QKJvz6CePykmlTApiD,0.073193,0.841219,0.05047,0.629622,0.0,0.06516,0.83817,0.023285,0.318006,...,0,0,0,0,0,0,1,0,2022-06-09 06:13:46+00:00,2.0
2,2FK7fxjzQEXD7Z32HSF0Hl,0.096486,0.798305,0.038263,0.581573,0.000296,0.088284,0.782299,0.048582,0.363093,...,0,0,1,0,0,0,0,1,2022-06-09 06:04:33+00:00,2.0
3,5ByAIlEEnxYdvpnezg7HTX,0.473896,0.892715,0.051883,0.816813,0.0,0.196228,0.850287,0.237934,0.309033,...,0,0,0,0,0,0,1,0,2022-06-09 09:36:01+00:00,2.0
4,4INDiWSKvqSKDEu7mh8HFz,0.012851,0.84551,0.043704,0.884883,2e-06,0.195218,0.853205,0.056837,0.348935,...,0,0,0,1,0,0,0,1,2022-06-09 06:10:27+00:00,2.0


In [27]:
# Assign weights for each song based on date added; higher weights for recently added songs and vice versa
playlist_full['weight'] = playlist_full['days_from_recent'].apply(lambda x: 1.2 ** (-x))
playlist_full.update(playlist_full.iloc[:,1:-3].mul(playlist_full['weight'].astype(int), 0)) # Multiply features by weight
playlist_full = playlist_full.iloc[:,:-3] # Remove date columns
final_playlist = playlist_full.sum(axis = 0) # Create a series containing sum of all playlist features
final_playlist.shape

(52,)

### Create Song Recommendations

In [28]:
nonplaylist_full = song_original[song_original['track_id'].isin(nonplaylist['track_id'].values)] # Df containing all features
# Calculate cosine similarity between playlist features and songs in nonplaylist dataframe
nonplaylist_full['similarity'] = cosine_similarity(nonplaylist.drop(['track_id'], axis = 1).values,
                                              final_playlist.drop(labels = 'track_id').values.reshape(1, -1))[:,0]
suggestions = nonplaylist_full.sort_values('similarity', ascending = False).head(50) # Sort similar songs in descending order
suggestions.reset_index(inplace = True)
suggestions = suggestions[['track_name', 'artist_name', 'genre', 'track_id']]
suggestions # Dataframe containing top 50 most similar songs to playlist

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Unnamed: 0,track_name,artist_name,genre,track_id
0,Roadside Love,Tribal Theory,Reggae,6Iw1feg4EkzoLeaoTS4Hxe
1,Love I,The Green,Reggae,4Hrw5VbQ9xDPdHdLCqBsSA
2,Masada,Alpha Blondy,Reggae,7rGwJm16BHBkQiFF6PLnGM
3,"Two Birds, One Stone",Mike Pinto,Reggae,3zN34ncJohnYqSWetYg1to
4,Au de Cabeça - Ao Vivo,Natiruts,Reggae,0taUNgt7DfKRNzCV8uyuqN
5,On a Mission,The Skints,Reggae,2AGppDiHy8wbYCm9GE9tNI
6,What a Fire,The Ethiopians,Reggae,5tjg8d1H9RTv53eQScKmwJ
7,Reincarnated Souls,Bunny Wailer,Reggae,4D1Y494qjTjydvwv7TD1F3
8,Rua Kenana,David Grace & Injustice,Reggae,6TSOEqBlEZUdKpcNyvatne
9,A Fool Will Fall,Wailing Souls,Reggae,5ha8kyqTD3WQNwWBAwJwRV
