# Building a Contemporary Music Recommendation System

## Kaggle Dataset

We begin by importing the necessary libraries and the dataset of songs used in this project. The dataset is retrieved from https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-600k-tracks. It contains roughly 600k Spotify tracks from 1921-2020, by artists with 1 million+ listeners on Spotify, and was created using the Spotify Web API.

In [76]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

songs_data = pd.read_csv("tracks.csv")

songs_data

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],['45tIt06XoI0Iio4LBEVpls'],1922-02-22,0.645,0.4450,0,-13.338,1,0.4510,0.674,0.744000,0.1510,0.1270,104.851,3
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.695,0.2630,0,-22.136,1,0.9570,0.797,0.000000,0.1480,0.6550,102.009,1
2,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.434,0.1770,1,-21.180,1,0.0512,0.994,0.021800,0.2120,0.4570,130.418,5
3,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.321,0.0946,7,-27.961,1,0.0504,0.995,0.918000,0.1040,0.3970,169.980,3
4,08y9GfoqCWfOGsKdwojr5e,Lady of the Evening,0,163080,0,['Dick Haymes'],['3BiJGZsyX9sJchTqcSA7Su'],1922,0.402,0.1580,3,-16.900,0,0.0390,0.989,0.130000,0.3110,0.1960,103.220,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586667,5rgu12WBIHQtvej2MdHSH0,云与海,50,258267,0,['阿YueYue'],['1QLBXKM5GCpyQQSVMNZqrZ'],2020-09-26,0.560,0.5180,0,-7.471,0,0.0292,0.785,0.000000,0.0648,0.2110,131.896,4
586668,0NuWgxEp51CutD2pJoF4OM,blind,72,153293,0,['ROLE MODEL'],['1dy5WNgIKQU6ezkpZs4y8z'],2020-10-21,0.765,0.6630,0,-5.223,1,0.0652,0.141,0.000297,0.0924,0.6860,150.091,4
586669,27Y1N4Q4U3EfDU5Ubw8ws2,What They'll Say About Us,70,187601,0,['FINNEAS'],['37M5pPGs6V1fchFJSgCguX'],2020-09-02,0.535,0.3140,7,-12.823,0,0.0408,0.895,0.000150,0.0874,0.0663,145.095,4
586670,45XJsGpFTyzbzeWK8VzR8S,A Day At A Time,58,142003,0,"['Gentle Bones', 'Clara Benin']","['4jGPdu95icCKVF31CcFKbS', '5ebPSE9YI5aLeZ1Z2g...",2021-03-05,0.696,0.6150,10,-6.212,1,0.0345,0.206,0.000003,0.3050,0.4380,90.029,4


In [77]:
songs_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 586672 entries, 0 to 586671
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                586672 non-null  object 
 1   name              586601 non-null  object 
 2   popularity        586672 non-null  int64  
 3   duration_ms       586672 non-null  int64  
 4   explicit          586672 non-null  int64  
 5   artists           586672 non-null  object 
 6   id_artists        586672 non-null  object 
 7   release_date      586672 non-null  object 
 8   danceability      586672 non-null  float64
 9   energy            586672 non-null  float64
 10  key               586672 non-null  int64  
 11  loudness          586672 non-null  float64
 12  mode              586672 non-null  int64  
 13  speechiness       586672 non-null  float64
 14  acousticness      586672 non-null  float64
 15  instrumentalness  586672 non-null  float64
 16  liveness          58

We can now see the features recorded for each song. Each song has numeric scores for qualitative properties such as danceability and acousticness, which we can use to our advantage when recommending songs to the user.

Next, we remove null values and duplicates from our dataset.

In [78]:
songs_data.isnull().sum()

id                   0
name                71
popularity           0
duration_ms          0
explicit             0
artists              0
id_artists           0
release_date         0
danceability         0
energy               0
key                  0
loudness             0
mode                 0
speechiness          0
acousticness         0
instrumentalness     0
liveness             0
valence              0
tempo                0
time_signature       0
dtype: int64

In [79]:
songs_data = songs_data.dropna()
songs_data

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],['45tIt06XoI0Iio4LBEVpls'],1922-02-22,0.645,0.4450,0,-13.338,1,0.4510,0.674,0.744000,0.1510,0.1270,104.851,3
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.695,0.2630,0,-22.136,1,0.9570,0.797,0.000000,0.1480,0.6550,102.009,1
2,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.434,0.1770,1,-21.180,1,0.0512,0.994,0.021800,0.2120,0.4570,130.418,5
3,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.321,0.0946,7,-27.961,1,0.0504,0.995,0.918000,0.1040,0.3970,169.980,3
4,08y9GfoqCWfOGsKdwojr5e,Lady of the Evening,0,163080,0,['Dick Haymes'],['3BiJGZsyX9sJchTqcSA7Su'],1922,0.402,0.1580,3,-16.900,0,0.0390,0.989,0.130000,0.3110,0.1960,103.220,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586667,5rgu12WBIHQtvej2MdHSH0,云与海,50,258267,0,['阿YueYue'],['1QLBXKM5GCpyQQSVMNZqrZ'],2020-09-26,0.560,0.5180,0,-7.471,0,0.0292,0.785,0.000000,0.0648,0.2110,131.896,4
586668,0NuWgxEp51CutD2pJoF4OM,blind,72,153293,0,['ROLE MODEL'],['1dy5WNgIKQU6ezkpZs4y8z'],2020-10-21,0.765,0.6630,0,-5.223,1,0.0652,0.141,0.000297,0.0924,0.6860,150.091,4
586669,27Y1N4Q4U3EfDU5Ubw8ws2,What They'll Say About Us,70,187601,0,['FINNEAS'],['37M5pPGs6V1fchFJSgCguX'],2020-09-02,0.535,0.3140,7,-12.823,0,0.0408,0.895,0.000150,0.0874,0.0663,145.095,4
586670,45XJsGpFTyzbzeWK8VzR8S,A Day At A Time,58,142003,0,"['Gentle Bones', 'Clara Benin']","['4jGPdu95icCKVF31CcFKbS', '5ebPSE9YI5aLeZ1Z2g...",2021-03-05,0.696,0.6150,10,-6.212,1,0.0345,0.206,0.000003,0.3050,0.4380,90.029,4


The dataset may contain the same song more than once, released on different album versions under different IDs. For instance, Harry Styles' "Adore You" appears twice. We therefore remove duplicates based solely on the song name and the artist.

In [80]:
songs_data[songs_data['name']=='Adore You']

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
86217,5AnCLGg35ziFOloEnXK4uu,Adore You,71,278747,0,['Miley Cyrus'],['5YGY8feqx7naU7z4HrwZM6'],2013-10-04,0.583,0.655,0,-5.407,1,0.0315,0.111,4e-06,0.113,0.201,119.759,4
91884,3jjujdWJ72nww5eGnfs2E7,Adore You,88,207133,0,['Harry Styles'],['6KImCVD70vtIoJWnq6nGn3'],2019-12-13,0.676,0.771,8,-3.675,1,0.0483,0.0237,7e-06,0.102,0.569,99.048,4
92524,1M4qEo4HE3PRaCOM7EXNJq,Adore You,74,207133,0,['Harry Styles'],['6KImCVD70vtIoJWnq6nGn3'],2019-12-06,0.676,0.771,8,-3.675,1,0.0483,0.0237,7e-06,0.102,0.569,99.048,4


In [81]:
songs_data.drop_duplicates(subset = ['name', 'artists'])

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],['45tIt06XoI0Iio4LBEVpls'],1922-02-22,0.645,0.4450,0,-13.338,1,0.4510,0.674,0.744000,0.1510,0.1270,104.851,3
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.695,0.2630,0,-22.136,1,0.9570,0.797,0.000000,0.1480,0.6550,102.009,1
2,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.434,0.1770,1,-21.180,1,0.0512,0.994,0.021800,0.2120,0.4570,130.418,5
3,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.321,0.0946,7,-27.961,1,0.0504,0.995,0.918000,0.1040,0.3970,169.980,3
4,08y9GfoqCWfOGsKdwojr5e,Lady of the Evening,0,163080,0,['Dick Haymes'],['3BiJGZsyX9sJchTqcSA7Su'],1922,0.402,0.1580,3,-16.900,0,0.0390,0.989,0.130000,0.3110,0.1960,103.220,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586667,5rgu12WBIHQtvej2MdHSH0,云与海,50,258267,0,['阿YueYue'],['1QLBXKM5GCpyQQSVMNZqrZ'],2020-09-26,0.560,0.5180,0,-7.471,0,0.0292,0.785,0.000000,0.0648,0.2110,131.896,4
586668,0NuWgxEp51CutD2pJoF4OM,blind,72,153293,0,['ROLE MODEL'],['1dy5WNgIKQU6ezkpZs4y8z'],2020-10-21,0.765,0.6630,0,-5.223,1,0.0652,0.141,0.000297,0.0924,0.6860,150.091,4
586669,27Y1N4Q4U3EfDU5Ubw8ws2,What They'll Say About Us,70,187601,0,['FINNEAS'],['37M5pPGs6V1fchFJSgCguX'],2020-09-02,0.535,0.3140,7,-12.823,0,0.0408,0.895,0.000150,0.0874,0.0663,145.095,4
586670,45XJsGpFTyzbzeWK8VzR8S,A Day At A Time,58,142003,0,"['Gentle Bones', 'Clara Benin']","['4jGPdu95icCKVF31CcFKbS', '5ebPSE9YI5aLeZ1Z2g...",2021-03-05,0.696,0.6150,10,-6.212,1,0.0345,0.206,0.000003,0.3050,0.4380,90.029,4
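
Note that `drop_duplicates` returns a new DataFrame rather than modifying `songs_data` in place, so the result has to be assigned back for the removal to persist. A minimal sketch on a toy frame that mirrors the "Adore You" rows (the `toy` and `deduped` names and the short IDs are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for songs_data, mirroring the three "Adore You" rows
toy = pd.DataFrame({
    "id": ["a1", "a2", "b1"],  # shortened, hypothetical IDs
    "name": ["Adore You", "Adore You", "Adore You"],
    "artists": ["['Harry Styles']", "['Harry Styles']", "['Miley Cyrus']"],
    "popularity": [88, 74, 71],
})

# drop_duplicates returns a new DataFrame; assign it to keep the result.
# By default the first occurrence of each (name, artists) pair is kept.
deduped = toy.drop_duplicates(subset=["name", "artists"]).reset_index(drop=True)

print(len(toy), len(deduped))
```

Here the second Harry Styles row is dropped, while the Miley Cyrus song of the same name is kept because the artist differs.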


## Exploratory Data Analysis

Next, we explore the dataset. Since we hope to build a contemporary song recommendation system, it is useful to understand how music trends change over time. We also group the songs by decade.

In [82]:
songs_data.describe()

Unnamed: 0,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,586601.0,586601.0,586601.0,586601.0,586601.0,586601.0,586601.0,586601.0,586601.0,586601.0,586601.0,586601.0,586601.0,586601.0,586601.0
mean,27.573212,230054.9,0.044091,0.563612,0.542071,5.221594,-10.205789,0.658797,0.10487,0.449803,0.113425,0.213933,0.552306,118.46793,3.87341
std,18.369417,126532.8,0.205298,0.166101,0.25191,3.51942,5.089422,0.474114,0.179902,0.348812,0.266843,0.184328,0.257673,29.762942,0.473112
min,0.0,3344.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,13.0,175083.0,0.0,0.453,0.343,2.0,-12.891,0.0,0.034,0.0969,0.0,0.0983,0.346,95.606,4.0
50%,27.0,214907.0,0.0,0.577,0.549,5.0,-9.242,1.0,0.0443,0.422,2.4e-05,0.139,0.564,117.387,4.0
75%,41.0,263867.0,0.0,0.686,0.748,8.0,-6.481,1.0,0.0763,0.784,0.00955,0.278,0.769,136.324,4.0
max,100.0,5621218.0,1.0,0.991,1.0,11.0,5.376,1.0,0.971,0.996,1.0,1.0,1.0,246.381,5.0
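
Grouping by decade could be sketched as follows. This is only an illustration on four rows taken from the table above; the `decade` column and the `decade_means` name are assumptions, and the year is read from the first four characters because `release_date` mixes full dates with bare years.

```python
import pandas as pd

# Four rows from the dataset, in both release_date formats
toy = pd.DataFrame({
    "release_date": ["1922-02-22", "1922", "2019-12-13", "2020-09-26"],
    "danceability": [0.645, 0.402, 0.676, 0.560],
})

# The first four characters are always the year; floor it to the decade
year = toy["release_date"].str[:4].astype(int)
toy["decade"] = (year // 10) * 10

# Average danceability per decade
decade_means = toy.groupby("decade")["danceability"].mean()
print(decade_means)
```

The same `groupby` could be applied to any of the numeric audio features to compare decades.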


## Model 1: Similar Songs using K-Nearest Neighbors

The k-nearest neighbors (k-NN) algorithm finds the elements closest to a given query point, which makes it a good fit for finding similar songs.

We only recommend songs released after 2015, as we are trying to give the user modern songs. This subset is still quite large, so we take a random sample of tracks from it. The query point for the algorithm is the song that the user inputs, and k is the number of similar songs that the user requests.
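
The idea can be illustrated on a handful of 2-D points before applying it to songs. This is a minimal sketch using Euclidean distance; the `k_nearest` helper and the sample points are hypothetical and not part of the notebook.

```python
import numpy as np

def k_nearest(points, query_point, k):
    """Return the indices of the k rows of `points` closest to `query_point`."""
    # Euclidean distance from every row to the query point
    distances = np.linalg.norm(points - query_point, axis=1)
    # Indices sorted by increasing distance; keep the first k
    return np.argsort(distances)[:k]

points = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [0.2, 0.1],
    [5.0, 5.0],
])
query_point = np.array([0.1, 0.1])

nearest = k_nearest(points, query_point, k=2)
print(nearest)  # [2 0]
```

For songs, each row would hold the audio features of one track and the query point would hold the features of the user's song.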

In [83]:
type(songs_data['release_date'][0])


str

In [88]:
# Standardize release_date strings (some are year-month-day, some are just the year) by keeping only the year.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
songs_data = songs_data.assign(
    release_date=songs_data['release_date'].str.replace(r'-.*', '', regex=True)
)


In [89]:
print(songs_data['release_date'])

0         1922
1         1922
2         1922
3         1922
4         1922
          ... 
586667    2020
586668    2020
586669    2020
586670    2021
586671    2015
Name: release_date, Length: 586601, dtype: object


In [90]:
# Convert the year strings to integers; assign() avoids pandas' SettingWithCopyWarning
songs_data = songs_data.assign(release_date=songs_data['release_date'].astype(int))


In [94]:
df = songs_data[songs_data['release_date'] > 2015]
df

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
39511,6Pkt6qVikqPBt9bEQy8iTz,A Lover's Concerto,41,159560,0,['The Toys'],['6lH5PpuiMa5SpfjoIOlwCS'],2020,0.671,0.867,2,-2.706,1,0.0571,0.436,0.000000,0.1390,0.8390,120.689,4
39529,1hx7X9cMXHWJjknb9O6Ava,The September Of My Years - Live At The Sands ...,26,187333,0,['Frank Sinatra'],['1Mxqyy3pSjf8kZZL4QVxS0'],2018,0.319,0.201,7,-17.796,1,0.0623,0.887,0.000000,0.9040,0.2390,117.153,3
39533,19oquvXf3bc65GSqtPYA5S,It Was A Very Good Year - Live At The Sands Ho...,25,236800,0,['Frank Sinatra'],['1Mxqyy3pSjf8kZZL4QVxS0'],2018,0.269,0.129,7,-18.168,0,0.0576,0.938,0.000005,0.6830,0.1600,82.332,3
39581,55qyghODi24yaDgKBI6lx0,"The Circle Game - Live at The 2nd Fret, Philad...",18,313093,0,['Joni Mitchell'],['5hW4L92KnC6dX9t7tYM4Ve'],2020,0.644,0.212,11,-14.118,1,0.0347,0.881,0.000022,0.7980,0.4410,117.072,3
39583,00xemFYjQNRpOlPhVaLAHa,"Urge For Going - Live at The 2nd Fret, Philade...",18,295093,0,['Joni Mitchell'],['5hW4L92KnC6dX9t7tYM4Ve'],2020,0.627,0.184,1,-15.533,1,0.0450,0.955,0.000162,0.0986,0.2990,115.864,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
586666,1ZwZsVZUiyFwIHMNpI3ERt,Skyscraper,4,106002,0,['Emilie Chin'],['4USdOnfLczwUglA3TrdHs2'],2020,0.626,0.530,5,-13.117,0,0.0284,0.113,0.856000,0.1040,0.2150,120.113,4
586667,5rgu12WBIHQtvej2MdHSH0,云与海,50,258267,0,['阿YueYue'],['1QLBXKM5GCpyQQSVMNZqrZ'],2020,0.560,0.518,0,-7.471,0,0.0292,0.785,0.000000,0.0648,0.2110,131.896,4
586668,0NuWgxEp51CutD2pJoF4OM,blind,72,153293,0,['ROLE MODEL'],['1dy5WNgIKQU6ezkpZs4y8z'],2020,0.765,0.663,0,-5.223,1,0.0652,0.141,0.000297,0.0924,0.6860,150.091,4
586669,27Y1N4Q4U3EfDU5Ubw8ws2,What They'll Say About Us,70,187601,0,['FINNEAS'],['37M5pPGs6V1fchFJSgCguX'],2020,0.535,0.314,7,-12.823,0,0.0408,0.895,0.000150,0.0874,0.0663,145.095,4


In [101]:
# Sample without replacement so that no track appears twice among the candidates
df = df.sample(n = 10000, replace = False)
df

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
231158,2HyzvdOcJLmeAMoxx2VoP3,All Related,54,369450,0,['Nessi Gomes'],['0XkaohuHZ6J8KcJWCdyAk8'],2016,0.539,0.319,1,-15.400,0,0.0282,0.7090,0.008400,0.1110,0.315,144.944,3
389863,367w6feGZwdnti6kX34u80,Молодым,50,208936,0,['LIZER'],['0j6G5eiOcrSdlyqaYwtxwS'],2019,0.716,0.537,6,-7.273,1,0.0571,0.2540,0.000000,0.0773,0.282,89.972,4
399094,1dc3T6mSL8jC6TIIoNwaGH,Volare,47,213240,0,['Izi'],['6289Bbkkk3gaCbh1K7Rv8F'],2017,0.675,0.618,1,-5.268,1,0.0290,0.0320,0.000000,0.1000,0.261,123.942,4
91030,1sppdZoKN39258NWRfb6TN,Think About Me - Single; 2018 Remaster,32,164480,0,['Fleetwood Mac'],['08GQAI4eElDnROBrJRGE0X'],2018,0.658,0.800,0,-7.834,1,0.0289,0.0449,0.000487,0.1060,0.771,118.553,4
238590,3fBJNawfhtgCmo58trRK2z,Cerrando Ciclos,75,195453,0,['Banda MS de Sergio Lizárraga'],['2C6i0I5RiGzDKN9IAF8reh'],2020,0.740,0.417,2,-5.713,1,0.0380,0.1920,0.000008,0.2250,0.607,125.832,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
536015,5fSeTAzIRArFycesAKt2ri,Střízlivej,32,210402,1,"['Viktor Sheen', 'Protiva']","['4cG43cUBRJWWDsRh4SW48i', '2Bu87vdMT0plbktpNo...",2016,0.630,0.796,1,-5.910,1,0.2870,0.0953,0.000010,0.0663,0.540,91.883,4
200533,1U9m9PpLiQdCfUo6KrYFmT,No Puedo vivir Sin Ti,25,222955,0,['Jose Miguel Tu Bachatero'],['3zaTtK1PhRA0U14eAwaThg'],2017,0.885,0.681,4,-5.040,0,0.0431,0.1490,0.000013,0.0888,0.847,127.993,4
373674,6pJI8iZBKPuMjWDstvO081,Benimle Kayboldun,67,189521,0,['Kaan Boşnak'],['03cvjFEHz8eGwYHq1L0Pp2'],2018,0.488,0.520,11,-11.051,0,0.0434,0.2240,0.389000,0.0709,0.696,180.035,4
416490,2avvGgqlddEXMoYSSzML97,她沒在看我,47,178788,0,['E.SO'],['2qXGNIlmY3JrYkxOWyXZsd'],2020,0.772,0.509,1,-8.948,1,0.0642,0.4700,0.000007,0.0895,0.162,99.003,4


In [108]:
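def query(queryID, songsData, columns, k, lookup=None):
    # Find the query song; it may live outside the candidate set (e.g. an older track)
    lookup = songsData if lookup is None else lookup
    match = lookup.loc[lookup['id'] == queryID, columns]
    if match.empty:
        raise ValueError(f"Song id {queryID!r} not found")
    querySong = match.iloc[0].to_numpy(dtype=float)

    # If the query song is among the candidates, drop it so it is not recommended to itself
    new_songs = songsData.loc[songsData['id'] != queryID, columns].copy(deep=True)
    # Euclidean distance from every candidate to the query song, on the feature columns only
    new_songs['dist'] = np.linalg.norm(new_songs.to_numpy(dtype=float) - querySong, axis=1)
    new_songs = new_songs.sort_values('dist')

    return new_songs.head(k).index

In [109]:
columns = ['acousticness','danceability','energy','instrumentalness','liveness','speechiness','valence', 
          'tempo']
# The query song was released in 1922, so look it up in the full dataset rather than in the post-2015 sample
query('08y9GfoqCWfOGsKdwojr5e', df, columns, 3, lookup=songs_data)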
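
One caveat with raw Euclidean distance: `tempo` ranges up to about 246 in this dataset, while the other selected features lie between 0 and 1, so `tempo` is likely to dominate the distance. A possible remedy, which is an assumption here and not something the notebook does, is to standardize each column before computing distances. A minimal sketch on four rows from the dataset, with only two features for brevity:

```python
import numpy as np
import pandas as pd

# Energy and tempo of four tracks from the dataset
toy = pd.DataFrame({
    "energy": [0.518, 0.663, 0.314, 0.615],
    "tempo": [131.896, 150.091, 145.095, 90.029],
})

# z-score each column: subtract the column mean, divide by the column standard deviation
scaled = (toy - toy.mean()) / toy.std(ddof=0)

# Distances from the first track to every track, before and after scaling
raw_dist = np.linalg.norm(toy.to_numpy() - toy.iloc[0].to_numpy(), axis=1)
scaled_dist = np.linalg.norm(scaled.to_numpy() - scaled.iloc[0].to_numpy(), axis=1)

print(raw_dist)
print(scaled_dist)
```

After scaling, every feature has zero mean and unit variance, so each contributes on a comparable scale. In practice the mean and standard deviation would be computed once on the candidate set and then applied to the query song as well.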