# What is spotify song recommender system?
Spotify employs several independent ML models and algorithms to generate item representations and user representations.  Let's break down exactly how this process works - starting with the track/artist representations:

Generating Track Representations: Content-based and Collaborative filtering:
   
**1)	Content-based filtering:** 
*   Aiming to describe the track by examining the content itself i.e., 
based on song metadata like acoustics, artists, frequency, beats, language etc. 
*   It is used to recommend songs that are similar to the other songs in the dataset (all the songs).

**2)	Collaborative filtering:**
*   	Aiming to describe the track in its connection with other tracks on the platform by studying user-generated assets.
*   	Recommend songs based on the overlap of songs in playlists in the dataset i.e., it compares the song only with the songs in a particular playlist for each user.

The Spotify recommendation system uses collaborative filtering to recommend songs and podcasts to users.


In [1]:
pip install spotipy #importing spotipy library to use spotify API

Collecting spotipy
  Downloading spotipy-2.23.0-py3-none-any.whl (29 kB)
Collecting redis>=3.5.3
  Downloading redis-5.0.1-py3-none-any.whl (250 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m250.3/250.3 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m00:01[0m
Installing collected packages: redis, spotipy
Successfully installed redis-5.0.1 spotipy-2.23.0
[0mNote: you may need to restart the kernel to use updated packages.


# To access the Spotify Web API, we will use a python-based library known as Spotipy

In [2]:
#importing libraries
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth
import spotipy.util as util

from skimage import io
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime


from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# Importing spotify features dataset
We will require additional data related to the features of songs present in the Spotify application for this implementation. Using these features, we will determine the similarity between our playlist and the songs not in our playlist. Based on the similarity, we will get a new playlist recommended.

For this purpose, I have used a Kaggle dataset. 

**Dataset includes 18 columns:**
genre, 
artist_name, 
track_name, 
track_id, 
popularity, 
acousticness, 
danceability, 
duration_ms, 
energy, 
instrumentalness, key, liveness,  loudness, 
mode, 
speechiness, 
tempo, 
time_signature, and valence

**Categorical attribues are 6:** genre, artist_name, track_name, track_id, key, mode

**Continuous attributes are 12:** popularity, accousticness, danceability, duration_ms, energy, instrumentalness, liveness, loudness, speechiness, tempo, time_signature, and valence

In [3]:
#reading data
spotify_data = pd.read_csv('/kaggle/input/spotifyfeatures/SpotifyFeatures.csv')
spotify_data.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,0,0.611,0.389,99373,0.91,0.0,C#,0.346,-1.828,Major,0.0525,166.969,4/4,0.814
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,1,0.246,0.59,137373,0.737,0.0,F#,0.151,-5.559,Minor,0.0868,174.003,4/4,0.816
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,3,0.952,0.663,170267,0.131,0.0,C,0.103,-13.879,Minor,0.0362,99.488,5/4,0.368
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,0,0.703,0.24,152427,0.326,0.0,C#,0.0985,-12.178,Major,0.0395,171.758,4/4,0.227
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,4,0.95,0.331,82625,0.225,0.123,F,0.202,-21.15,Major,0.0456,140.576,4/4,0.39


# Feature engineering
In the dataset, we can observe that multiple columns represent the possible features for a song. Out of these, few features are categorical (columns having discrete values) like genre, key, mode, etc.

Therefore, the first step would be **to convert these categorical features into one-hot encoding (OHE)** so that our songs can be represented as vectors in a feature space.


# One Hot Encoding (OHE)
One hot encoding is a technique used to represent categorical variables as numerical values in a machine learning model.

One-hot-encoding is a powerful technique to treat categorical data, but it can lead to increased dimensionality, sparsity and overfitting. It is important to use it cautiously

In [4]:
spotify_features_df = spotify_data

genre_OHE = pd.get_dummies(spotify_features_df.genre)
key_OHE = pd.get_dummies(spotify_features_df.key)

print(genre_OHE.head())
print(key_OHE.head())

   A Capella  Alternative  Anime  Blues  Children's Music  Children’s Music  \
0          0            0      0      0                 0                 0   
1          0            0      0      0                 0                 0   
2          0            0      0      0                 0                 0   
3          0            0      0      0                 0                 0   
4          0            0      0      0                 0                 0   

   Classical  Comedy  Country  Dance  ...  Pop  R&B  Rap  Reggae  Reggaeton  \
0          0       0        0      0  ...    0    0    0       0          0   
1          0       0        0      0  ...    0    0    0       0          0   
2          0       0        0      0  ...    0    0    0       0          0   
3          0       0        0      0  ...    0    0    0       0          0   
4          0       0        0      0  ...    0    0    0       0          0   

   Rock  Ska  Soul  Soundtrack  World  
0     0   

# Min-Max Normalization on Continuous Attributes

As we can see that the numerical columns have different ranges, we will perform a max-min normalization to change the values of numeric columns in the dataset to a standard scale.

The equation for max-min normalization is given as follows:

x normalized = (x – x minimum) / (x maximum – x minimum)

We are normalizing all continuous attributes except popularity and time signature

In [5]:
from sklearn.preprocessing import MinMaxScaler

scaled_features = MinMaxScaler().fit_transform([
  spotify_features_df['acousticness'].values,
  spotify_features_df['danceability'].values,
  spotify_features_df['duration_ms'].values,
  spotify_features_df['energy'].values,
  spotify_features_df['instrumentalness'].values,
  spotify_features_df['liveness'].values,
  spotify_features_df['loudness'].values,
  spotify_features_df['speechiness'].values,
  spotify_features_df['tempo'].values,
  spotify_features_df['valence'].values,
  spotify_features_df['popularity'].values,  
  ])

spotify_features_df[['acousticness','danceability','duration_ms','energy','instrumentalness','liveness','loudness','speechiness','tempo','valence','popularity']] = scaled_features.T

spotify_features_df.head()

Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Movie,Henri Salvador,C'est beau de faire un Show,0BRjO6ga9RKCKjfDqeFgWV,1.8e-05,2.5e-05,2.2e-05,1.0,2.8e-05,1.8e-05,C#,2.2e-05,0.0,Major,1.9e-05,0.001699,4/4,2.7e-05
1,Movie,Martin & les fées,Perdu d'avance (par Gad Elmaleh),0BjC1NfoEOOusryehmNudP,4.8e-05,4.2e-05,4.5e-05,1.0,4.6e-05,4e-05,F#,4.2e-05,0.0,Minor,4.1e-05,0.001307,4/4,4.6e-05
2,Movie,Joseph Williams,Don't Let Me Be Lonely Tonight,0CoSDzoNIKCRs124s9uTVy,9.9e-05,8.7e-05,8.5e-05,1.0,8.2e-05,8.2e-05,C,8.2e-05,0.0,Minor,8.2e-05,0.000666,5/4,8.4e-05
3,Movie,Henri Salvador,Dis-moi Monsieur Gordon Cooper,0Gc6TVm52BwZD07Ki6tIvf,8e-05,8.4e-05,8.1e-05,1.0,8.2e-05,8e-05,C#,8.1e-05,0.0,Major,8e-05,0.001207,4/4,8.1e-05
4,Movie,Fabien Nataf,Ouverture,0IuslXpMROHdEPvSl1fTQK,0.000304,0.000267,0.00026,1.0,0.000259,0.000257,F,0.000258,0.0,Major,0.000256,0.001957,4/4,0.000261


# Removing Redundant Features

We drop the features that are not considered to determine the similarity and the categorical features that are already converted into OHE vectors.

In [6]:
#discarding the categorical and unnecessary features 
spotify_features_df = spotify_features_df.drop('genre',axis = 1)
spotify_features_df = spotify_features_df.drop('artist_name', axis = 1)
spotify_features_df = spotify_features_df.drop('track_name', axis = 1)
spotify_features_df = spotify_features_df.drop('key', axis = 1)
spotify_features_df = spotify_features_df.drop('mode', axis = 1)
spotify_features_df = spotify_features_df.drop('time_signature', axis = 1)

# Appending the OHE columns of the categorical features

In [7]:
spotify_features_df = spotify_features_df.join(genre_OHE)
spotify_features_df = spotify_features_df.join(key_OHE)

spotify_features_df.head()

Unnamed: 0,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,...,B,C,C#,D,D#,E,F,F#,G,G#
0,0BRjO6ga9RKCKjfDqeFgWV,1.8e-05,2.5e-05,2.2e-05,1.0,2.8e-05,1.8e-05,2.2e-05,0.0,1.9e-05,...,0,0,1,0,0,0,0,0,0,0
1,0BjC1NfoEOOusryehmNudP,4.8e-05,4.2e-05,4.5e-05,1.0,4.6e-05,4e-05,4.2e-05,0.0,4.1e-05,...,0,0,0,0,0,0,0,1,0,0
2,0CoSDzoNIKCRs124s9uTVy,9.9e-05,8.7e-05,8.5e-05,1.0,8.2e-05,8.2e-05,8.2e-05,0.0,8.2e-05,...,0,1,0,0,0,0,0,0,0,0
3,0Gc6TVm52BwZD07Ki6tIvf,8e-05,8.4e-05,8.1e-05,1.0,8.2e-05,8e-05,8.1e-05,0.0,8e-05,...,0,0,1,0,0,0,0,0,0,0
4,0IuslXpMROHdEPvSl1fTQK,0.000304,0.000267,0.00026,1.0,0.000259,0.000257,0.000258,0.0,0.000256,...,0,0,0,0,0,0,1,0,0,0


# Connecting to Spotify Web API
In the next step, I will fetch my Spotify playlist data. To connect to the Spotify Web API, you will need a unique client id and a client secret key.

In [8]:
pip install flask

[0mNote: you may need to restart the kernel to use updated packages.


In [9]:
pip install werkzeug

[0mNote: you may need to restart the kernel to use updated packages.


In [10]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

#you can get your client id and secret by creating a developers account at spotify
auth_manager = SpotifyClientCredentials(
    client_id= "your_client_id", 
    client_secret = "your_client_secret",
)
sp = spotipy.Spotify(auth_manager = auth_manager)

playlists = sp.user_playlists('your_username')
playlist_dic = {}
playlist_cover_art = {}

while playlists:
    for i, playlist in enumerate(playlists['items']):
        playlist_dic[playlist['name']] = playlist['uri'].split(':')[2]
        playlist_cover_art[playlist['uri'].split(':')[2]] = playlist['images'][0]['url']
        print("%4d %s %s" % (i + 1, playlist['uri'],  playlist['name']))
        
    if playlists['next']:
        playlists = sp.next(playlists)
    else:
        playlists = None
        
print("\n\n\n",playlist_dic)

   1 spotify:playlist:1fsyssTHyfyUycDrIgo6aW 🎶❤️
   2 spotify:playlist:7aO7bWDCp9eEJd05ynQQQW ✨
   3 spotify:playlist:4p4B5IA559YiffpXz4wJBM Taylor Swift
   4 spotify:playlist:1l03bp5kFmrbWDevsEQvjF Fifty shades
   5 spotify:playlist:2KovmFEYmBt8jmXSPwrPYl Roadtrip songs



 {'🎶❤️': '1fsyssTHyfyUycDrIgo6aW', '✨': '7aO7bWDCp9eEJd05ynQQQW', 'Taylor Swift': '4p4B5IA559YiffpXz4wJBM', 'Fifty shades': '1l03bp5kFmrbWDevsEQvjF', 'Roadtrip songs': '2KovmFEYmBt8jmXSPwrPYl'}


In [11]:
#creating the playlist dataframe with extended features using Spotify data
def generate_playlist_df(playlist_name, playlist_dic, spotify_data):
    
    playlist = pd.DataFrame()

    for i, j in enumerate(sp.playlist(playlist_dic[playlist_name])['tracks']['items']):
        playlist.loc[i, 'artist'] = j['track']['artists'][0]['name']
        playlist.loc[i, 'track_name'] = j['track']['name']
        playlist.loc[i, 'track_id'] = j['track']['id']
        playlist.loc[i, 'url'] = j['track']['album']['images'][1]['url']
        playlist.loc[i, 'date_added'] = j['added_at']

    playlist['date_added'] = pd.to_datetime(playlist['date_added'])  
    
    playlist = playlist[playlist['track_id'].isin(spotify_data['track_id'].values)].sort_values('date_added',ascending = False)

    return playlist

playlist_df = generate_playlist_df('Taylor Swift', playlist_dic, spotify_data)

playlist_df.head()

Unnamed: 0,artist,track_name,track_id,url,date_added
23,Taylor Swift,King Of My Heart,7HuBDWi18s4aJM8UFnNheH,https://i.scdn.co/image/ab67616d00001e02da5d5a...,2021-12-25 06:44:55+00:00
15,Taylor Swift,Getaway Car,0VE4kBnHJUgtMf0dy6DRmW,https://i.scdn.co/image/ab67616d00001e02da5d5a...,2021-12-25 06:44:36+00:00
12,Taylor Swift,Bad Blood,273dCMFseLcVsoSWx59IoE,https://i.scdn.co/image/ab67616d00001e029abdf1...,2021-12-24 18:41:14+00:00
7,Taylor Swift,Shake It Off,5xTtaWoae3wi06K5WfVUUH,https://i.scdn.co/image/ab67616d00001e029abdf1...,2021-12-24 18:40:36+00:00
4,Taylor Swift,Blank Space,1p80LdxRV74UKvL8gnD7ky,https://i.scdn.co/image/ab67616d00001e029abdf1...,2021-12-24 18:40:17+00:00


In [12]:
def generate_playlist_vector(spotify_features, playlist_df, weight_factor):
    
    spotify_features_playlist = spotify_features[spotify_features['track_id'].isin(playlist_df['track_id'].values)]
    spotify_features_playlist = spotify_features_playlist.merge(playlist_df[['track_id','date_added']], on = 'track_id', how = 'inner')
    
    spotify_features_nonplaylist = spotify_features[~spotify_features['track_id'].isin(playlist_df['track_id'].values)]
    
    playlist_feature_set = spotify_features_playlist.sort_values('date_added',ascending=False)
    
    
    most_recent_date = playlist_feature_set.iloc[0,-1]
    
    for ix, row in playlist_feature_set.iterrows():
        playlist_feature_set.loc[ix,'days_from_recent'] = int((most_recent_date.to_pydatetime() - row.iloc[-1].to_pydatetime()).days)
        
    
    playlist_feature_set['weight'] = playlist_feature_set['days_from_recent'].apply(lambda x: weight_factor ** (-x))
    
    playlist_feature_set_weighted = playlist_feature_set.copy()
    
    playlist_feature_set_weighted.update(playlist_feature_set_weighted.iloc[:,:-3].mul(playlist_feature_set_weighted.weight.astype(int),0))   
    
    playlist_feature_set_weighted_final = playlist_feature_set_weighted.iloc[:, :-3]
    

    
    return playlist_feature_set_weighted_final.sum(axis = 0), spotify_features_nonplaylist

playlist_vector, nonplaylist_df = generate_playlist_vector(spotify_features_df, playlist_df, 1.2)
print(playlist_vector.shape)
print(nonplaylist_df.head())

(51,)
                 track_id  popularity  acousticness  danceability  \
0  0BRjO6ga9RKCKjfDqeFgWV    0.000018      0.000025      0.000022   
1  0BjC1NfoEOOusryehmNudP    0.000048      0.000042      0.000045   
2  0CoSDzoNIKCRs124s9uTVy    0.000099      0.000087      0.000085   
3  0Gc6TVm52BwZD07Ki6tIvf    0.000080      0.000084      0.000081   
4  0IuslXpMROHdEPvSl1fTQK    0.000304      0.000267      0.000260   

   duration_ms    energy  instrumentalness  liveness  loudness  speechiness  \
0          1.0  0.000028          0.000018  0.000022       0.0     0.000019   
1          1.0  0.000046          0.000040  0.000042       0.0     0.000041   
2          1.0  0.000082          0.000082  0.000082       0.0     0.000082   
3          1.0  0.000082          0.000080  0.000081       0.0     0.000080   
4          1.0  0.000259          0.000257  0.000258       0.0     0.000256   

   ...  B  C  C#  D  D#  E  F  F#  G  G#  
0  ...  0  0   1  0   0  0  0   0  0   0  
1  ...  0  0   0  

In [13]:
def generate_recommendation(spotify_data, playlist_vector, nonplaylist_df):

    non_playlist = spotify_data[spotify_data['track_id'].isin(nonplaylist_df['track_id'].values)]
    non_playlist['sim'] = cosine_similarity(nonplaylist_df.drop(['track_id'], axis = 1).values, playlist_vector.drop(labels = 'track_id').values.reshape(1, -1))[:,0]
    non_playlist_top15 = non_playlist.sort_values('sim',ascending = False).head(15)
    non_playlist_top15['url'] = non_playlist_top15['track_id'].apply(lambda x: sp.track(x)['album']['images'][1]['url'])
    
    return  non_playlist_top15

top15 = generate_recommendation(spotify_data, playlist_vector, nonplaylist_df)  
top15.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Unnamed: 0,genre,artist_name,track_name,track_id,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,sim,url
108413,Pop,Lil Yachty,Yacht Club (feat. Juice WRLD),5R2rsbwCDXORX2tLfprRmM,0.00052,4.7e-05,4.9e-05,1.0,4.8e-05,4.5e-05,G,4.6e-05,0.0,Major,4.7e-05,0.001006,4/4,4.8e-05,0.806559,https://i.scdn.co/image/ab67616d00001e02fc6284...
113610,Pop,Rihanna,"Yeah, I Said It",4kqxy0SvQ2N34nOJ9ggfMu,0.000492,6e-05,6e-05,1.0,5.9e-05,5.6e-05,G,5.9e-05,0.0,Major,5.7e-05,0.001017,4/4,5.8e-05,0.806559,https://i.scdn.co/image/ab67616d00001e0233de85...
16006,Dance,The Kinks,All Day and All of the Night,78JmElAFmrPNhLjovDR9Jm,0.000543,5.7e-05,5.9e-05,1.0,6.1e-05,5.5e-05,G,5.6e-05,0.0,Major,5.6e-05,0.001023,4/4,6.1e-05,0.806559,https://i.scdn.co/image/ab67616d00001e02718bbc...
108147,Pop,Kodak Black,MoshPit (feat. Juice WRLD),4KX8vXbouybtUptEyYxtIk,0.000506,2.7e-05,2.9e-05,1.0,2.9e-05,2.5e-05,G,2.5e-05,0.0,Major,2.6e-05,0.000988,4/4,2.8e-05,0.806559,https://i.scdn.co/image/ab67616d00001e02f9508e...
13655,Dance,Julia Michaels,Heaven,1T575AhHueYinKSDflEsGK,0.000459,5.2e-05,5e-05,1.0,4.9e-05,4.7e-05,G,4.8e-05,0.0,Minor,4.8e-05,0.001089,3/4,4.9e-05,0.806559,https://i.scdn.co/image/ab67616d00001e026cd979...
