**CS4710 Final Project Music Recommendation System**

Group Members: Carter Bassler (cnb8mq), Elise Galletta (eag9sb), Gabe Levy (gel4xet), Kate Newton (krn4jc)

Sites consulted:
- https://www.geeksforgeeks.org/music-recommendation-system-using-machine-learning/
- https://youtube.com/watch?v=tooddaC14q4

Dataset:
- https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks

Import dependencies


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sb
import re
import sys
import itertools
import json
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from spotipy.oauth2 import SpotifyOAuth
import spotipy.util as util


from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

import warnings
warnings.filterwarnings('ignore')


Read in tracks.csv<br>
Dataset contains 600,000+ tracks from 1921-2020<br>
Includes basic information (id, title, artist, length) as well as features of each track (danceability, energy, liveness, acousticness...)

In [2]:
tracks = pd.read_csv('tracks.csv')
tracks.head()

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],['45tIt06XoI0Iio4LBEVpls'],1922-02-22,0.645,0.445,0,-13.338,1,0.451,0.674,0.744,0.151,0.127,104.851,3
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.695,0.263,0,-22.136,1,0.957,0.797,0.0,0.148,0.655,102.009,1
2,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.434,0.177,1,-21.18,1,0.0512,0.994,0.0218,0.212,0.457,130.418,5
3,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.321,0.0946,7,-27.961,1,0.0504,0.995,0.918,0.104,0.397,169.98,3
4,08y9GfoqCWfOGsKdwojr5e,Lady of the Evening,0,163080,0,['Dick Haymes'],['3BiJGZsyX9sJchTqcSA7Su'],1922,0.402,0.158,3,-16.9,0,0.039,0.989,0.13,0.311,0.196,103.22,4


The "Artists" data set contains genre information.<br>
We need to pull this genre information and join it to the tracks data so that each track is assigned a genre list.

In [3]:
artists = pd.read_csv('artists.csv')
artists.head()

Unnamed: 0,id,followers,genres,name,popularity
0,0DheY5irMjBUeLybbCUEZ2,0.0,[],Armid & Amir Zare Pashai feat. Sara Rouzbehani,0
1,0DlhY15l3wsrnlfGio2bjU,5.0,[],ปูนา ภาวิณี,0
2,0DmRESX2JknGPQyO15yxg7,0.0,[],Sadaa,0
3,0DmhnbHjm1qw6NCYPeZNgJ,0.0,[],Tra'gruda,0
4,0Dn11fWM7vHQ3rinvWEl4E,2.0,[],Ioannis Panoutsopoulos,0


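The regex below, taken from the consulted sources, parses the genres string into a list of strings and replaces spaces with underscores so that each genre is a single token.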

In [4]:
artists['genres_upd'] = artists['genres'].apply(lambda x: [re.sub(' ','_',i) for i in re.findall(r"'([^']*)'", x)])
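
To make the effect of this regex concrete, here is a minimal standalone sketch applied to a single made-up genre string. The helper name `parse_genres` and the sample value are assumptions for illustration; they are not code or rows from the notebook.

```python
import re

def parse_genres(raw):
    """Extract single-quoted items from a stringified list and
    replace spaces with underscores (same regex as the cell above)."""
    return [re.sub(' ', '_', g) for g in re.findall(r"'([^']*)'", raw)]

# Hypothetical value in the same format as artists['genres']
sample = "['czech folk', 'classic rock']"
parsed = parse_genres(sample)
empty = parse_genres("[]")   # artists without genres yield an empty list

print(parsed)
print(empty)
```

Artists with no genres produce an empty list, which is why empty lists still need to be handled later.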


Now that the genres are in list form, we can look up a genre by indexing into the list. <br><br>

Note that we haven't yet removed artists with empty genre lists (no genres). For this example we index row 9434, an arbitrarily chosen artist whose genre list is not empty, so that we can show a genre result for testing purposes.

In [5]:
artists['genres_upd'].values[9434][0]

'czech_folk'

The tracks data is about songs, but does not include genre information which is important to the model.

The artists data is at an artist level and includes the important genre information that we want to combine with the track data.

<br>
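Next we create a column called 'artists_upd' that holds the artist names as a list of strings, in the same format as the genres_upd column. This takes two regexes: one for names wrapped in single quotes, and one for names wrapped in double quotes, which is how names containing an apostrophe are stored. The two results (v1, v2) are then combined into the final artists_upd, falling back to v2 only when v1 finds nothing.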


In [6]:
tracks['artists_upd_v1'] = tracks['artists'].apply(lambda x: re.findall(r"'([^']*)'", x))
tracks['artists_upd_v2'] = tracks['artists'].apply(lambda x: re.findall('\"(.*?)\"',x))
tracks['artists_upd'] = np.where(tracks['artists_upd_v1'].apply(lambda x: not x), tracks['artists_upd_v2'], tracks['artists_upd_v1'] )
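
One case the two-regex approach may not cover is a list that mixes both quoting styles, such as a plain name next to a name containing an apostrophe: the first regex finds the plain name, so the fallback never runs. A hedged alternative, which is not what this notebook uses, is to parse the string as a Python literal with `ast.literal_eval`. The helper name `parse_artist_list` and the sample string are assumptions for illustration.

```python
import ast
import re

def parse_artist_list(raw):
    """Parse a stringified Python list such as "['Queen', 'David Bowie']"
    into a real list of strings."""
    return list(ast.literal_eval(raw))

# Hypothetical row that mixes single- and double-quoted names
mixed = "['Queen', \"Destiny's Child\"]"

regex_v1 = re.findall(r"'([^']*)'", mixed)   # same pattern as artists_upd_v1
literal = parse_artist_list(mixed)

print(regex_v1)   # only the single-quoted name is found
print(literal)    # both names are recovered
```

`ast.literal_eval` raises an exception on malformed input, so a real pipeline would need to decide how to handle rows that fail to parse.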


Now that the artists are in the format we want to work in, we need to remove any duplicate songs. We do this by creating a new column called 'artists_song' that concatenates the first listed artist and the song title, with no separator. Examples: The Dave Brubeck QuartetLover, Taylor SwiftLover, etc. Sorting by this key and by release date in descending order puts the most recent release of each song first.

In [7]:
tracks['artists_song'] = tracks.apply(lambda row: row['artists_upd'][0]+str(row['name']),axis = 1)
tracks.sort_values(['artists_song','release_date'], ascending = False, inplace = True)
# tracks[tracks['name']=='Lover']

Now let's drop the duplicates of any song, keeping the first row for each 'artists_song' key. The commented-out line can be used to check the result on tracks that have the name Lover.

In [8]:
tracks.drop_duplicates('artists_song',inplace = True)

# print(tracks[tracks['name']=='Lover'])
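
The sort-then-drop pattern used in the two cells above can be sketched on a toy frame. The rows and release dates below are made up for illustration only.

```python
import pandas as pd

# Hypothetical rows: two releases of the same artist+title key, plus one other song
toy = pd.DataFrame({
    'artists_song': ['Taylor SwiftLover', 'Taylor SwiftLover', 'QueenUnder Pressure'],
    'release_date': ['2019-08-23', '2019-11-13', '2002-10-21'],
})

# Descending sort puts the latest release_date first within each key
toy = toy.sort_values(['artists_song', 'release_date'], ascending=False)

# drop_duplicates keeps the first occurrence by default, i.e. the latest release
deduped = toy.drop_duplicates('artists_song')
print(deduped)
```

Because release_date is stored as a string, the ordering is lexicographic; this matches chronological order only for dates written in a consistent year-first format.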

For some songs, several artists are listed in a single row. We need to explode this column.

After the explode, any song with multiple artists is spread across separate rows, with one artist name per row in the artists_upd column.

We use Under Pressure as an example of a track with multiple artists to make sure the change is as expected.

In [9]:
tracks[tracks['name']=='Under Pressure']
tracks = tracks.explode('artists_upd')

In [10]:
tracks[tracks['name']=='Under Pressure']

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artists_upd_v1,artists_upd_v2,artists_upd,artists_song
68387,3URCEz3axo6lHuDBhg59jz,Under Pressure,47,272333,1,['Thug Life'],['76fO70b6BK2xt3UVOe4BIZ'],1994-01-01,0.792,0.753,...,0.0191,0.0278,0.122,0.231,96.043,4,[Thug Life],[],Thug Life,Thug LifeUnder Pressure
75641,5oidljiMjeJTWUGZ4TfFea,Under Pressure,65,237520,0,"['Queen', 'David Bowie']","['1dfeR4HaWDbWqFHLkxsg1d', '0oSGxfWSnnOXhD2fKu...",2002-10-21,0.678,0.643,...,0.24,0.0,0.0708,0.534,113.46,4,"[Queen, David Bowie]",[],Queen,QueenUnder Pressure
75641,5oidljiMjeJTWUGZ4TfFea,Under Pressure,65,237520,0,"['Queen', 'David Bowie']","['1dfeR4HaWDbWqFHLkxsg1d', '0oSGxfWSnnOXhD2fKu...",2002-10-21,0.678,0.643,...,0.24,0.0,0.0708,0.534,113.46,4,"[Queen, David Bowie]",[],David Bowie,QueenUnder Pressure
304479,5LME7YULt0enp6UAB8VoDn,Under Pressure,62,559533,1,['Logic'],['4xRYI6VqpkE3UwrDrAZL8L'],2014-10-21,0.701,0.856,...,0.239,0.0,0.106,0.563,83.982,4,[Logic],[],Logic,LogicUnder Pressure
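
A minimal sketch of what `explode` does, on a two-row toy frame (the frame itself is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'name': ['Under Pressure', 'willow'],
    'artists_upd': [['Queen', 'David Bowie'], ['Taylor Swift']],
})

# Each list element gets its own row; all other columns are repeated
exploded = toy.explode('artists_upd')
print(exploded)
```

Note that the original index label is repeated for the rows that came from the same song, which matches the repeated index 75641 in the output above.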


Now do a left join with tracks as left table and artists as right table. The artists name is 'artists_upd' in the left table and 'name' in the right table, so we will merge on those values.

In [11]:
artists_exploded = tracks.merge(artists, how = 'left', left_on = 'artists_upd',right_on = 'name')

artists_exploded = artists_exploded[~artists_exploded.genres_upd.isnull()]

In [12]:
print(artists_exploded[artists_exploded['id_x'] =='5oidljiMjeJTWUGZ4TfFea'])

                          id_x          name_x  popularity_x  duration_ms  \
219621  5oidljiMjeJTWUGZ4TfFea  Under Pressure            65       237520   
219622  5oidljiMjeJTWUGZ4TfFea  Under Pressure            65       237520   
219623  5oidljiMjeJTWUGZ4TfFea  Under Pressure            65       237520   

        explicit                   artists  \
219621         0  ['Queen', 'David Bowie']   
219622         0  ['Queen', 'David Bowie']   
219623         0  ['Queen', 'David Bowie']   

                                               id_artists release_date  \
219621  ['1dfeR4HaWDbWqFHLkxsg1d', '0oSGxfWSnnOXhD2fKu...   2002-10-21   
219622  ['1dfeR4HaWDbWqFHLkxsg1d', '0oSGxfWSnnOXhD2fKu...   2002-10-21   
219623  ['1dfeR4HaWDbWqFHLkxsg1d', '0oSGxfWSnnOXhD2fKu...   2002-10-21   

        danceability  energy  ...        artists_upd_v1  artists_upd_v2  \
219621         0.678   0.643  ...  [Queen, David Bowie]              []   
219622         0.678   0.643  ...  [Queen, David Bowie]    

In [13]:
artists_genres = artists_exploded.groupby('id_x')['genres_upd'].apply(list).reset_index()
artists_genres['genre_list'] = artists_genres['genres_upd'].apply(lambda x: list(set(list(itertools.chain.from_iterable(x)))))
artists_genres.head()

Unnamed: 0,id_x,genres_upd,genre_list
0,0004Uy71ku11n3LMpuyf59,[[polish_rock]],[polish_rock]
1,000CSYu4rvd8cQ7JilfxhZ,"[[country_quebecois, rock_quebecois]]","[rock_quebecois, country_quebecois]"
2,000DsoWJKHdaUmhgcnpr8j,[[barnmusik]],[barnmusik]
3,000G1xMMuwxNHmwVsBdtj1,"[[candy_pop, new_wave, new_wave_pop, permanent...","[candy_pop, new_wave_pop, power_pop, permanent..."
4,000KblXP5csWFFFsD6smOy,"[[chamame, folclore_salteno, folklore_argentino]]","[folklore_argentino, chamame, folclore_salteno]"
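
The flattening step in the cell above can be sketched in isolation. The genre lists here are hypothetical; the point is that `itertools.chain.from_iterable` merges the per-artist lists and `set` removes repeats.

```python
import itertools

# Hypothetical genre lists for two artists credited on the same song
per_artist_genres = [['glam_rock', 'rock'], ['art_rock', 'rock']]

flat = list(itertools.chain.from_iterable(per_artist_genres))
unique = sorted(set(flat))   # sorted only to make the order reproducible

print(flat)
print(unique)
```

A plain `set` does not preserve order, which is consistent with genre_list showing the genres in a different order than genres_upd in the output above.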


Now do a left join with tracks as the left table and artists_genres as the right table. The song ID is 'id' in the left table and 'id_x' in the right table, so we will merge on those columns. It is 'id_x' because the previous left join created 'id_x' and 'id_y' columns, since the column name 'id' was repeated across both tables.

In [14]:
tracks = tracks.merge(artists_genres[['id_x','genre_list']], how = 'left', left_on = 'id', right_on='id_x')

Now we have the following:
- all data in the dataframe 'tracks'
- the artist name formatted as a single string in 'artists_upd'
- songs with multiple artists spread across rows, one artist per row
- the genres for each song listed under 'genre_list'


In [15]:
tracks[tracks['artists_upd']=='Taylor Swift']

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,liveness,valence,tempo,time_signature,artists_upd_v1,artists_upd_v2,artists_upd,artists_song,id_x,genre_list
24814,55n9yjI6qqXh5F2mYvUc2y,I Don’t Wanna Live Forever (Fifty Shades Darker),80,247160,0,"['ZAYN', 'Taylor Swift']","['5ZsFI1h6hIdQRw2ti0hz81', '06HL4z0CvFAxyc27GX...",2017-11-09,0.731,0.445,...,0.1400,0.1060,117.970,4,"[ZAYN, Taylor Swift]",[],Taylor Swift,ZAYNI Don’t Wanna Live Forever (Fifty Shades D...,55n9yjI6qqXh5F2mYvUc2y,"[pop, dance_pop, post-teen_pop, uk_pop]"
76299,60hGQrn24APqEFSLObLeDc,Highway Don't Care,57,279067,0,"['Tim McGraw', 'Taylor Swift', 'Keith Urban']","['6roFdX1y5BYSbp60OTJWMd', '06HL4z0CvFAxyc27GX...",2013-01-01,0.477,0.790,...,0.0841,0.5050,157.797,4,"[Tim McGraw, Taylor Swift, Keith Urban]",[],Taylor Swift,Tim McGrawHighway Don't Care,60hGQrn24APqEFSLObLeDc,"[australian_country, country_road, contemporar..."
101740,7dW84mWkdWE5a6lFWxJCBG,‘tis the damn season,74,229840,0,['Taylor Swift'],['06HL4z0CvFAxyc27GXpf02'],2020-12-11,0.575,0.434,...,0.1050,0.3480,145.916,4,[Taylor Swift],[],Taylor Swift,Taylor Swift‘tis the damn season,7dW84mWkdWE5a6lFWxJCBG,"[pop, post-teen_pop]"
101741,0lx2cLdOt3piJbcaXIV74f,willow,85,214707,0,['Taylor Swift'],['06HL4z0CvFAxyc27GXpf02'],2020-12-11,0.392,0.574,...,0.1450,0.5290,81.112,4,[Taylor Swift],[],Taylor Swift,Taylor Swiftwillow,0lx2cLdOt3piJbcaXIV74f,"[pop, post-teen_pop]"
101742,0PurA4JVJ8YQgSVopY8fn6,tolerate it,74,245440,1,['Taylor Swift'],['06HL4z0CvFAxyc27GXpf02'],2020-12-11,0.316,0.361,...,0.0797,0.2210,74.952,5,[Taylor Swift],[],Taylor Swift,Taylor Swifttolerate it,0PurA4JVJ8YQgSVopY8fn6,"[pop, post-teen_pop]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101943,0YgHuReCSPwTXYny7isLja,A Perfectly Good Heart,47,220147,0,['Taylor Swift'],['06HL4z0CvFAxyc27GXpf02'],2006-10-24,0.483,0.751,...,0.1280,0.2680,156.092,4,[Taylor Swift],[],Taylor Swift,Taylor SwiftA Perfectly Good Heart,0YgHuReCSPwTXYny7isLja,"[pop, post-teen_pop]"
101944,6ce41JVa6KFaryfmC2bu2W,22,73,232120,0,['Taylor Swift'],['06HL4z0CvFAxyc27GXpf02'],2012-10-22,0.661,0.729,...,0.0477,0.6680,103.987,4,[Taylor Swift],[],Taylor Swift,Taylor Swift22,6ce41JVa6KFaryfmC2bu2W,"[pop, post-teen_pop]"
101945,2yLa0QULdQr0qAIvVwN6B5,...Ready For It?,73,208187,0,['Taylor Swift'],['06HL4z0CvFAxyc27GXpf02'],2017-11-10,0.613,0.764,...,0.1970,0.4170,160.015,4,[Taylor Swift],[],Taylor Swift,Taylor Swift...Ready For It?,2yLa0QULdQr0qAIvVwN6B5,"[pop, post-teen_pop]"
593723,1MaqkdFNIKPdpQGDzme5ss,Two Is Better Than One (feat. Taylor Swift),66,242840,0,"['Boys Like Girls', 'Taylor Swift']","['0vWCyXMrrvMlCcepuOJaGI', '06HL4z0CvFAxyc27GX...",2009-09-07,0.521,0.702,...,0.0962,0.2500,127.976,4,"[Boys Like Girls, Taylor Swift]",[],Taylor Swift,Boys Like GirlsTwo Is Better Than One (feat. T...,1MaqkdFNIKPdpQGDzme5ss,"[neo_mellow, neon_pop_punk, pop_rock, pop_punk..."


In [16]:
tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 674010 entries, 0 to 674009
Data columns (total 26 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                674010 non-null  object 
 1   name              674009 non-null  object 
 2   popularity        674010 non-null  int64  
 3   duration_ms       674010 non-null  int64  
 4   explicit          674010 non-null  int64  
 5   artists           674010 non-null  object 
 6   id_artists        674010 non-null  object 
 7   release_date      674010 non-null  object 
 8   danceability      674010 non-null  float64
 9   energy            674010 non-null  float64
 10  key               674010 non-null  int64  
 11  loudness          674010 non-null  float64
 12  mode              674010 non-null  int64  
 13  speechiness       674010 non-null  float64
 14  acousticness      674010 non-null  float64
 15  instrumentalness  674010 non-null  float64
 16  liveness          67

After all of this, we can check how many rows have no genre list associated with them. It's a small number (10,307 rows, as the null counts below show), so we will just drop these rows. We drop them rather than fill them in because genre is an important predictor in our model.

In [17]:
tracks.isnull().sum()

id                      0
name                    1
popularity              0
duration_ms             0
explicit                0
artists                 0
id_artists              0
release_date            0
danceability            0
energy                  0
key                     0
loudness                0
mode                    0
speechiness             0
acousticness            0
instrumentalness        0
liveness                0
valence                 0
tempo                   0
time_signature          0
artists_upd_v1          0
artists_upd_v2          0
artists_upd             0
artists_song            0
id_x                10307
genre_list          10307
dtype: int64

In [18]:
tracks.dropna(subset=['genre_list'], inplace=True)

In [19]:
tracks.isnull().sum()

id                  0
name                0
popularity          0
duration_ms         0
explicit            0
artists             0
id_artists          0
release_date        0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
artists_upd_v1      0
artists_upd_v2      0
artists_upd         0
artists_song        0
id_x                0
genre_list          0
dtype: int64

Now we need to normalize the different features of the tracks, since each of them is a float on a different scale.

In [20]:
tracks['year'] = tracks['release_date'].apply(lambda x: x.split('-')[0])

In [21]:
float_cols = tracks.dtypes[tracks.dtypes == 'float64'].index.values

In [22]:
ohe_cols = 'popularity'

In [23]:
tracks['popularity'].describe()


count    663703.000000
mean         26.692141
std          18.799327
min           0.000000
25%          11.000000
50%          26.000000
75%          40.000000
max          99.000000
Name: popularity, dtype: float64

In [24]:
tracks['popularity_red'] = tracks['popularity'].apply(lambda x: int(x/5))
tracks['consolidates_genre_lists'] = tracks['genre_list'].apply(lambda d: d if isinstance(d, list) else [])
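
The popularity reduction above groups the raw score into buckets of width 5. A small sketch of the arithmetic follows; the function name `popularity_bucket` and the `width` parameter are assumptions introduced for illustration.

```python
def popularity_bucket(popularity, width=5):
    """Map a raw popularity score to its bucket, as int(x/5) does above."""
    return int(popularity / width)

scores = (0, 4, 25, 44, 76, 99)
buckets = [popularity_bucket(p) for p in scores]

for score, bucket in zip(scores, buckets):
    print(score, '->', bucket)
```

With a maximum popularity of 99, this yields buckets 0 through 19, which keeps the later one-hot encoding to at most 20 columns.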


In [25]:
tracks.head()

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,time_signature,artists_upd_v1,artists_upd_v2,artists_upd,artists_song,id_x,genre_list,year,popularity_red,consolidates_genre_lists
0,3u1C6nWVRoP5F0w8gGrDL3,사랑의 미로,25,222380,0,['최진희'],['1NSrAf8XJYJVgAXKoxaMet'],1987-06-01,0.367,0.194,...,4,[최진희],[],최진희,최진희사랑의 미로,3u1C6nWVRoP5F0w8gGrDL3,[trot],1987,5,[trot]
2,1jvoY322nxyKXq8OBhgmSY,어떡하죠,44,244360,0,['지선'],['2Mo9NQaNCFCWSR5CnlfmbN'],2011-10-13,0.606,0.341,...,4,[지선],[],지선,지선어떡하죠,1jvoY322nxyKXq8OBhgmSY,[],2011,8,[]
4,7rxpWwcXNgDUXl0wN0gUvp,천국의 기억 장정우 Version,31,280372,0,['장정우'],['5L7zKs2ftwENWOMI7LFaN1'],2003-12-24,0.494,0.656,...,4,[장정우],[],장정우,장정우천국의 기억 장정우 Version,7rxpWwcXNgDUXl0wN0gUvp,[],2003,6,[]
5,0cEvzbXjxkOxgBUmBUcHZW,그것만은..,32,294452,0,['장정우'],['5L7zKs2ftwENWOMI7LFaN1'],2003-12-24,0.311,0.487,...,4,[장정우],[],장정우,장정우그것만은..,0cEvzbXjxkOxgBUmBUcHZW,[],2003,6,[]
8,6uRXQW8BqB3N9WKfe9gfdw,Title 허밍,37,102307,0,['이경섭'],['191huMISbbIeUELiiEGZ7L'],2007-11-09,0.825,0.604,...,4,[이경섭],[],이경섭,이경섭Title 허밍,6uRXQW8BqB3N9WKfe9gfdw,[],2007,7,[]


In [26]:
tracks['year'] = tracks['year'].astype(int)

In [27]:
tracks = tracks[tracks['year'] > 1990]

In [28]:
tracks.head()

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,time_signature,artists_upd_v1,artists_upd_v2,artists_upd,artists_song,id_x,genre_list,year,popularity_red,consolidates_genre_lists
2,1jvoY322nxyKXq8OBhgmSY,어떡하죠,44,244360,0,['지선'],['2Mo9NQaNCFCWSR5CnlfmbN'],2011-10-13,0.606,0.341,...,4,[지선],[],지선,지선어떡하죠,1jvoY322nxyKXq8OBhgmSY,[],2011,8,[]
4,7rxpWwcXNgDUXl0wN0gUvp,천국의 기억 장정우 Version,31,280372,0,['장정우'],['5L7zKs2ftwENWOMI7LFaN1'],2003-12-24,0.494,0.656,...,4,[장정우],[],장정우,장정우천국의 기억 장정우 Version,7rxpWwcXNgDUXl0wN0gUvp,[],2003,6,[]
5,0cEvzbXjxkOxgBUmBUcHZW,그것만은..,32,294452,0,['장정우'],['5L7zKs2ftwENWOMI7LFaN1'],2003-12-24,0.311,0.487,...,4,[장정우],[],장정우,장정우그것만은..,0cEvzbXjxkOxgBUmBUcHZW,[],2003,6,[]
8,6uRXQW8BqB3N9WKfe9gfdw,Title 허밍,37,102307,0,['이경섭'],['191huMISbbIeUELiiEGZ7L'],2007-11-09,0.825,0.604,...,4,[이경섭],[],이경섭,이경섭Title 허밍,6uRXQW8BqB3N9WKfe9gfdw,[],2007,7,[]
9,02q7sJH0d1lC0oDd8gf513,별,44,302022,0,['유미'],['0kgK64SIuInLUon0mn9o4h'],2006-12-12,0.299,0.324,...,4,[유미],[],유미,유미별,02q7sJH0d1lC0oDd8gf513,[],2006,8,[]


In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

In [30]:
#ohe prep
def ohe_prep(df, col, new_col_name):

  new_df = pd.get_dummies(df[col])
  names = new_df.columns
  new_df.columns = [new_col_name + "|" + str(i) for i in names]
  new_df.reset_index(drop=True, inplace = True)
  return new_df
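
As a rough illustration of what `ohe_prep` returns, the sketch below re-declares the function and runs it on a three-row toy frame. The toy data and its index values are assumptions for illustration.

```python
import pandas as pd

def ohe_prep(df, col, new_col_name):
    """One-hot encode df[col] and prefix each column with new_col_name|."""
    new_df = pd.get_dummies(df[col])
    new_df.columns = [new_col_name + "|" + str(i) for i in new_df.columns]
    new_df.reset_index(drop=True, inplace=True)
    return new_df

# Non-default index on purpose, to show that the result is re-indexed from 0
toy = pd.DataFrame({'year': [2019, 2020, 2019]}, index=[10, 20, 30])

# Multiplying by a constant down-weights the whole feature group
encoded = ohe_prep(toy, 'year', 'year') * 0.5
print(encoded)
```

Resetting the index matters because the encoded frames are later concatenated column-wise, which aligns rows by index.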

TF-IDF Section

In [31]:
def tfIDF(df, float_cols):
    # TF-IDF over the genre lists: each list is joined into one space-separated string
    vector= TfidfVectorizer()
    matrix = vector.fit_transform(df['consolidates_genre_lists'].apply(lambda x: " ".join(x)))
    genres = pd.DataFrame(matrix.toarray())
    genres.columns = ['genre' + "|" + i for i in vector.get_feature_names_out()]
    genres.reset_index(drop=True, inplace=True)

    # One-hot encode year and bucketed popularity, down-weighted relative to genre
    year = ohe_prep(df, 'year', 'year') * 0.5
    popularity = ohe_prep(df, 'popularity_red', 'pop') * 0.15

    # Min-max scale the float audio features to [0, 1], then down-weight them
    floats = df[float_cols].reset_index(drop = True)
    scaler=MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns=floats.columns) * 0.1

    # Note: floats_scaled is computed above but is not included in the concatenation below
    concatDf = pd.concat([genres, popularity, year], axis =1)

    concatDf['id'] = df['id'].values

    return concatDf
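
One detail worth checking in the TF-IDF step is how genre strings are split into tokens. Assuming scikit-learn's documented default `token_pattern` for `TfidfVectorizer`, `(?u)\b\w\w+\b`, a token is a run of two or more word characters, so underscores stay inside a token while hyphens and ampersands act as separators. The sketch below applies that pattern directly with `re` instead of calling scikit-learn; the helper name `tokens` is an assumption, and the sample genres are ones that appear in this notebook's outputs.

```python
import re

# Assumed default token_pattern of TfidfVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokens(text):
    """Lowercase the text and split it the way the assumed default pattern would."""
    return re.findall(TOKEN_PATTERN, text.lower())

samples = ['czech_folk', 'k-pop', 'post-teen_pop', 'r&b']
result = {genre: tokens(genre) for genre in samples}

for genre, toks in result.items():
    print(genre, '->', toks)
```

If whole genres should stay intact, a custom `token_pattern` or `tokenizer` argument could be passed to the vectorizer; whether that is desirable depends on how much genres such as k-pop and pop are meant to overlap.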


In [32]:
complete_set = tfIDF(tracks, float_cols=float_cols)

In [33]:
complete_set.head()

Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|48g,genre|_brasileira,genre|_hip_hop,genre|_house,genre|a3,genre|a_cappella,genre|abstract,genre|abstract_beats,...,year|2013,year|2014,year|2015,year|2016,year|2017,year|2018,year|2019,year|2020,year|2021,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1jvoY322nxyKXq8OBhgmSY
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7rxpWwcXNgDUXl0wN0gUvp
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0cEvzbXjxkOxgBUmBUcHZW
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6uRXQW8BqB3N9WKfe9gfdw
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,02q7sJH0d1lC0oDd8gf513


Spotify Section

In [138]:
SPOTIPY_CLIENT_ID = ''
SPOTIPY_CLIENT_SECRET = ''


In [139]:
scope = 'user-library-read'

if len(sys.argv) > 1:
    username = sys.argv[1]
else:
    print("Usage: %s username" % (sys.argv[0],))
    sys.exit()

In [140]:
auth_manager = SpotifyClientCredentials(client_id=SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET)
sp = spotipy.Spotify(auth_manager=auth_manager)

In [141]:
token = util.prompt_for_user_token(scope, client_id= SPOTIPY_CLIENT_ID, client_secret=SPOTIPY_CLIENT_SECRET, redirect_uri='http://localhost:8080/callback')

In [142]:
sp = spotipy.Spotify(auth=token)

In [143]:
id_name = {}
list_photo = {}
for i in sp.current_user_playlists()['items']:

    id_name[i['name']] = i['uri'].split(':')[2]
    list_photo[i['uri'].split(':')[2]] = i['images'][0]['url']

In [144]:
id_name

{'myplaylist': '1vLKhf7F7YNrPaNWvOcOOe',
 'music': '6a4uYH93Rx5niGNsEsNNti',
 'My Playlist #2': '1lupOCTMnviLFP08M9xjfo',
 'playlist': '6nMhyAYkJJ2Hs0bfaxTlj4'}

In [145]:
def create_necessary_outputs(playlist_name,id_dic, df):
    playlist = pd.DataFrame()

    # Collect artist, title, ID, cover art URL, and date added for each track in the playlist
    for ix, i in enumerate(sp.playlist(id_dic[playlist_name])['tracks']['items']):
        playlist.loc[ix, 'artist'] = i['track']['artists'][0]['name']
        playlist.loc[ix, 'name'] = i['track']['name']
        playlist.loc[ix, 'id'] = i['track']['id']
        playlist.loc[ix, 'url'] = i['track']['album']['images'][1]['url']
        playlist.loc[ix, 'date_added'] = i['added_at']

    playlist['date_added'] = pd.to_datetime(playlist['date_added'])

    # Keep only songs that exist in the tracks data set, most recently added first
    playlist = playlist[playlist['id'].isin(df['id'].values)].sort_values('date_added',ascending = False)

    return playlist

In [146]:
id_name

{'myplaylist': '1vLKhf7F7YNrPaNWvOcOOe',
 'music': '6a4uYH93Rx5niGNsEsNNti',
 'My Playlist #2': '1lupOCTMnviLFP08M9xjfo',
 'playlist': '6nMhyAYkJJ2Hs0bfaxTlj4'}

In [155]:
playlist = create_necessary_outputs('music', id_name,tracks)

The next cell shows the current playlist: artist, song name, song ID, URL of the album cover art image, and the date each song was added. Only songs that also appear in the tracks data set are kept.

In [156]:
playlist

Unnamed: 0,artist,name,id,url,date_added
95,IU,BBIBBI,4as4XEOR03oGm1STUKl6pa,https://i.scdn.co/image/ab67616d00001e02a1d785...,2023-12-04 19:12:02+00:00
93,Jessi,NUNU NANA,2cUzIBGMvx2BZ2Q1fzjdl1,https://i.scdn.co/image/ab67616d00001e02e411b1...,2023-12-04 19:12:02+00:00
90,BAEKHYUN,Candy,6kKC35sKUh2FUx4M0qAL44,https://i.scdn.co/image/ab67616d00001e02a28084...,2023-12-04 19:12:02+00:00
84,OH MY GIRL,Dolphin,3NfgdU9mbIVhcJF3XvC0c9,https://i.scdn.co/image/ab67616d00001e024957fc...,2023-12-04 19:12:02+00:00
78,EXO,Ko Ko Bop,5EzitieoPnjyKHAq0gfRMa,https://i.scdn.co/image/ab67616d00001e02b787b5...,2023-12-04 19:12:02+00:00
77,NCT 127,無限的我 무한적아; LIMITLESS,6m96zMB8BrLrpyT7NvApxB,https://i.scdn.co/image/ab67616d00001e02ca74d3...,2023-12-04 19:12:02+00:00
1,Alicia Keys,Girl on Fire,4esOae7i4rqTbAu9o5Pxco,https://i.scdn.co/image/ab67616d00001e02a4fe49...,2023-12-04 19:11:43+00:00
48,Lana Del Rey,Happiness is a butterfly,3lG6OtGDsYAOALxEmubQQm,https://i.scdn.co/image/ab67616d00001e02879e93...,2023-12-04 19:11:43+00:00
41,Shawn Mendes,Mercy,0AS63m1wHv9n4VVRizK6Hc,https://i.scdn.co/image/ab67616d00001e02ea3ef7...,2023-12-04 19:11:43+00:00
0,Taylor Swift,Lover,1dGr1c8CrMLDpV6mPbImSI,https://i.scdn.co/image/ab67616d00001e02e787cf...,2023-12-04 19:11:43+00:00


In [157]:
def generate_playlist_feature(complete_feature_set, playlist_df, weight_factor):
    # Split the feature set into playlist songs (with their date_added) and all other songs
    complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]
    complete_feature_set_playlist = complete_feature_set_playlist.merge(playlist_df[['id','date_added']], on = 'id', how = 'inner')
    complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]

    playlist_feature_set = complete_feature_set_playlist.sort_values('date_added',ascending=False)

    # After the sort, row 0 is the most recently added song; date_added is the last column
    most_recent_date = playlist_feature_set.iloc[0,-1]

    # Age of each song in whole 30-day months relative to the most recent addition
    for ix, row in playlist_feature_set.iterrows():
        playlist_feature_set.loc[ix,'months_from_recent'] = int((most_recent_date.to_pydatetime() - row.iloc[-1].to_pydatetime()).days / 30)

    # Older additions get exponentially smaller weights
    playlist_feature_set['weight'] = playlist_feature_set['months_from_recent'].apply(lambda x: weight_factor ** (-x))

    # Scale the feature columns by the weight; the last four columns
    # (id, date_added, months_from_recent, weight) are excluded
    playlist_feature_set_weighted = playlist_feature_set.copy()
    playlist_feature_set_weighted.update(playlist_feature_set_weighted.iloc[:,:-4].mul(playlist_feature_set_weighted.weight,0))
    playlist_feature_set_weighted_final = playlist_feature_set_weighted.iloc[:, :-4]

    # Sum the weighted rows into a single playlist vector
    return playlist_feature_set_weighted_final.sum(axis = 0), complete_feature_set_nonplaylist

In [158]:
complete_feature_set_playlist_vector, complete_feature_set_nonplaylist = generate_playlist_feature(complete_set, playlist, 1.03)
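
To give a feel for the recency weighting with a factor of 1.03, the sketch below evaluates the same formula, `weight_factor ** (-months)`, for a few ages. The function name `recency_weight` is an assumption for illustration.

```python
def recency_weight(months_from_recent, weight_factor=1.03):
    """Weight applied to a song added `months_from_recent` months
    before the most recent addition to the playlist."""
    return weight_factor ** (-months_from_recent)

ages = (0, 1, 6, 12)
weights = [recency_weight(m) for m in ages]

for months, w in zip(ages, weights):
    print(months, round(w, 4))
```

The most recent songs keep a weight of 1, and the weight shrinks by about 3% for each additional month, so a song added a year earlier still contributes roughly 70% as much as a new one.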


Next, we write the recommendation generation function.

It uses cosine similarity between the playlist's summed feature vector and every song not in the playlist to find songs similar to the user's taste in music.

In [152]:
def generate_playlist_recos(df, features, nonplaylist_features):
    # Copy the slice so the new columns are added to an independent DataFrame
    non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)].copy()
    # Cosine similarity between every non-playlist song and the playlist vector
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    # Keep the 40 most similar songs and look up their cover art
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40)
    non_playlist_df_top_40['url'] = non_playlist_df_top_40['id'].apply(lambda x: sp.track(x)['album']['images'][1]['url'])

    return non_playlist_df_top_40
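
For reference, the similarity score used above can be written out by hand. The sketch below computes cosine similarity with NumPy on a few made-up three-feature vectors; the function name `cosine_to_vector` and the data are assumptions for illustration, and it assumes no row is all zeros.

```python
import numpy as np

def cosine_to_vector(candidates, target):
    """Cosine similarity between each row of `candidates` and the 1-D `target`.
    Assumes no vector has zero length."""
    candidates = np.asarray(candidates, dtype=float)
    target = np.asarray(target, dtype=float)
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(target)
    return candidates @ target / norms

playlist_vector = np.array([1.0, 1.0, 0.0])   # hypothetical summed playlist features
songs = np.array([
    [2.0, 2.0, 0.0],   # same direction as the playlist vector
    [0.0, 0.0, 1.0],   # no features in common
    [1.0, 0.0, 0.0],   # partial overlap
])

sims = cosine_to_vector(songs, playlist_vector)
ranking = np.argsort(-sims)   # song indices from most to least similar

print(sims)
print(ranking)
```

Cosine similarity depends only on direction, not magnitude, which is why summing the weighted playlist rows instead of averaging them does not change the ranking.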

Call generate_playlist_recos from the previous cell to generate 40 recommendations.

In [159]:
myplaylist_top40recs = generate_playlist_recos(tracks, complete_feature_set_playlist_vector, complete_feature_set_nonplaylist)

In [160]:
myplaylist_top40recs

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,...,artists_upd_v2,artists_upd,artists_song,id_x,genre_list,year,popularity_red,consolidates_genre_lists,sim,url
324104,2qG81jL9UIP54uS8gYyP4k,Who (feat. BTS),76,180413,0,"['Lauv', 'BTS']","['5JZ7CnR6gTvEMKX4g70Amv', '3Nrfpe0tUJi4K4DXYW...",2020-03-06,0.632,0.345,...,[],Lauv,LauvWho (feat. BTS),2qG81jL9UIP54uS8gYyP4k,"[k-pop, k-pop_boy_group, post-teen_pop, pop, e...",2020,15,"[k-pop, k-pop_boy_group, post-teen_pop, pop, e...",0.773201,https://i.scdn.co/image/ab67616d00001e0236b12a...
324105,2qG81jL9UIP54uS8gYyP4k,Who (feat. BTS),76,180413,0,"['Lauv', 'BTS']","['5JZ7CnR6gTvEMKX4g70Amv', '3Nrfpe0tUJi4K4DXYW...",2020-03-06,0.632,0.345,...,[],BTS,LauvWho (feat. BTS),2qG81jL9UIP54uS8gYyP4k,"[k-pop, k-pop_boy_group, post-teen_pop, pop, e...",2020,15,"[k-pop, k-pop_boy_group, post-teen_pop, pop, e...",0.773201,https://i.scdn.co/image/ab67616d00001e0236b12a...
411037,0pYacDCZuRhcrwGUA5nTBe,eight(Prod.&Feat. SUGA of BTS),78,167573,0,"['IU', 'SUGA']","['3HqSLMAZ3g3d5poNaI7GOU', '0ebNdVaOfp6N0oZ1gu...",2020-05-06,0.676,0.869,...,[],IU,IUeight(Prod.&Feat. SUGA of BTS),0pYacDCZuRhcrwGUA5nTBe,[k-pop],2020,15,[k-pop],0.77305,https://i.scdn.co/image/ab67616d00001e02c63be0...
411038,0pYacDCZuRhcrwGUA5nTBe,eight(Prod.&Feat. SUGA of BTS),78,167573,0,"['IU', 'SUGA']","['3HqSLMAZ3g3d5poNaI7GOU', '0ebNdVaOfp6N0oZ1gu...",2020-05-06,0.676,0.869,...,[],SUGA,IUeight(Prod.&Feat. SUGA of BTS),0pYacDCZuRhcrwGUA5nTBe,[k-pop],2020,15,[k-pop],0.77305,https://i.scdn.co/image/ab67616d00001e02c63be0...
356225,5dntGTbUtmUO239wQ0k3yM,Mmmh,77,192347,0,['KAI'],['6iVo62B0bdTknRcrktCmak'],2020-11-30,0.659,0.718,...,[],KAI,KAIMmmh,5dntGTbUtmUO239wQ0k3yM,[k-pop],2020,15,[k-pop],0.77305,https://i.scdn.co/image/ab67616d00001e0237ff46...
101771,5kI4eCXXzyuIUXjQra0Cxi,betty,75,294521,1,['Taylor Swift'],['06HL4z0CvFAxyc27GXpf02'],2020-07-24,0.593,0.377,...,[],Taylor Swift,Taylor Swiftbetty,5kI4eCXXzyuIUXjQra0Cxi,"[pop, post-teen_pop]",2020,15,"[pop, post-teen_pop]",0.772047,https://i.scdn.co/image/ab67616d00001e0295f754...
101745,0Jlcvv8IykzHaSmj49uNW8,the 1,78,210252,1,['Taylor Swift'],['06HL4z0CvFAxyc27GXpf02'],2020-07-24,0.777,0.357,...,[],Taylor Swift,Taylor Swiftthe 1,0Jlcvv8IykzHaSmj49uNW8,"[pop, post-teen_pop]",2020,15,"[pop, post-teen_pop]",0.772047,https://i.scdn.co/image/ab67616d00001e0295f754...
101759,5BK0uqwY9DNfZ630STAEaq,gold rush,77,185320,1,['Taylor Swift'],['06HL4z0CvFAxyc27GXpf02'],2020-12-11,0.512,0.462,...,[],Taylor Swift,Taylor Swiftgold rush,5BK0uqwY9DNfZ630STAEaq,"[pop, post-teen_pop]",2020,15,"[pop, post-teen_pop]",0.772047,https://i.scdn.co/image/ab67616d00001e0233b854...
226680,1v6svH1Fyx9C1nIt1mA2DT,All I Want,75,177323,0,['Olivia Rodrigo'],['1McMsnEElThX1knmY4oliG'],2020-01-10,0.376,0.43,...,[],Olivia Rodrigo,Olivia RodrigoAll I Want,1v6svH1Fyx9C1nIt1mA2DT,"[pop, post-teen_pop]",2020,15,"[pop, post-teen_pop]",0.772047,https://i.scdn.co/image/ab67616d00001e02103b7e...
441759,4rHIfYRQw6ittRQXAlurk0,Favorite Mistake,74,172907,0,['Giveon'],['4fxd5Ee7UefO4CUXgwJ7IP'],2020-03-27,0.535,0.409,...,[],Giveon,GiveonFavorite Mistake,4rHIfYRQw6ittRQXAlurk0,"[r&b, pop]",2020,14,"[r&b, pop]",0.76605,https://i.scdn.co/image/ab67616d00001e023317fc...
