# Spotify Song Popularity Predictor and Genre Recommendation System

In [1]:
# !pip install spotipy

In [2]:
import pandas as pd
import numpy as np
import random
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances, cosine_distances, cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from datetime import datetime

%matplotlib inline

## 5. Song Recommender Model 

In [3]:
df_recommender = pd.read_csv("../datasets/data_recommender.csv")
df_recommender

Unnamed: 0,acousticness,artists,danceability,energy,explicit,instrumentalness,key,liveness,loudness,mode,name,popularity,speechiness,tempo,valence,year,duration_mins
0,0.991000,['Mamie Smith'],0.598,0.2240,0,0.000522,5,0.3790,-12.628,0,Keep A Song In Your Soul,0.12,0.0936,149.976,0.6340,1920,2.81
1,0.643000,"[""Screamin' Jay Hawkins""]",0.852,0.5170,0,0.026400,5,0.0809,-7.261,0,I Put A Spell On You,0.07,0.0534,86.889,0.9500,1920,2.50
2,0.993000,['Mamie Smith'],0.647,0.1860,0,0.000018,0,0.5190,-12.098,1,Golfing Papa,0.04,0.1740,97.600,0.6890,1920,2.73
3,0.000173,['Oscar Velazquez'],0.730,0.7980,0,0.801000,2,0.1280,-7.311,1,True House Music - Xavier Santos & Carlos Gomi...,0.17,0.0425,127.997,0.0422,1920,7.03
4,0.295000,['Mixe'],0.704,0.7070,1,0.000246,10,0.4020,-6.036,0,Xuniverxe,0.02,0.0768,122.076,0.2990,1920,2.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159329,0.995000,"['Ludovico Einaudi', 'Johannes Bornlöf']",0.297,0.0287,0,0.908000,8,0.0995,-30.008,1,Nuvole bianche,0.00,0.0564,141.636,0.0678,2021,5.82
159330,0.995000,"['Ludovico Einaudi', 'Johannes Bornlöf']",0.343,0.0165,0,0.878000,9,0.0774,-30.915,0,Una Mattina,0.00,0.0455,126.970,0.1510,2021,3.44
159331,0.988000,"['Ludovico Einaudi', 'Johannes Bornlöf']",0.316,0.0573,0,0.879000,3,0.1200,-24.121,1,Night,0.00,0.0515,81.070,0.0373,2021,5.06
159332,0.795000,['Alessia Cara'],0.429,0.2110,0,0.000000,4,0.1960,-11.665,1,A Little More,0.00,0.0360,94.710,0.2280,2021,2.41


In [4]:
df_recommender.shape

(159334, 17)

### 5.1 Preprocessing

Since the dataframe still contains almost 160,000 rows, it is not feasible to build a recommender over every song, because the pairwise similarity matrix grows with the square of the number of songs. Instead, we take the subset of songs released from 2000 onwards. In addition, duplicated versions of a song by the same artist are dropped, keeping only the version with the highest popularity.

The same title sung by different artists is kept, so that distinct songs that merely share a title are not deleted.
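
As a rough illustration of the de-duplication rule described above, the sketch below applies it to a tiny made-up table. The rows and the names `toy` and `deduped` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical rows): artist A has two versions of "Home",
# artist B has a different song with the same title.
toy = pd.DataFrame({
    "artists": ["['A']", "['A']", "['B']"],
    "name": ["Home", "Home", "Home"],
    "popularity": [0.30, 0.75, 0.10],
})

# Sort so the most popular version comes first, then keep the first row
# of every (artists, name) pair.
deduped = (
    toy.sort_values("popularity", ascending=False)
       .drop_duplicates(subset=["artists", "name"], keep="first")
)
print(deduped)
```

Artist A's less popular version is dropped, while artist B's "Home" survives because the pair (artists, name) differs.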

#### 5.1.1 Main Dataset of Songs

In [5]:
df_recommender = df_recommender.sort_values('popularity', ascending=False)
df_recommender.drop_duplicates(subset=['artists', 'name'], keep='first', inplace=True)

In [6]:
df_recommender_2000 = df_recommender[df_recommender.year >= 2000]
df_recommender_2000.describe()['year']

count    36381.000000
mean      2011.921003
std          6.130138
min       2000.000000
25%       2007.000000
50%       2013.000000
75%       2017.000000
max       2021.000000
Name: year, dtype: float64

In [7]:
df_recommender_2000.shape

(36381, 17)

In [8]:
df_recommender_2000[df_recommender_2000.name.duplicated()]['name'].value_counts().head(50)

2000 Years                                                     51
Happy New Year                                                 21
7 Years                                                        18
Stay                                                           16
New Year                                                       16
Have Yourself a Merry Little Christmas                         15
Year 2000                                                      15
It's the Most Wonderful Time of the Year                       12
Neujahrsgruß / New Year's Address / Allocution du Nouvel An    12
Home                                                           12
2000 Light Years from Home                                     12
Hold On                                                        11
2000 Years Ago                                                 10
A Thousand Years                                               10
Winter Wonderland                                              10
Intro     

In [9]:
drop_features = ['explicit', 'key', 'mode', 'year', 'duration_mins', 'artists', 'popularity']
scale_features = ['loudness', 'tempo']
features = ['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'speechiness', 'valence']

In [10]:
df_recommender_2000 = df_recommender_2000.drop(drop_features, axis=1)
df_recommender_2000

Unnamed: 0,acousticness,danceability,energy,instrumentalness,liveness,loudness,name,speechiness,tempo,valence
18796,0.721000,0.585,0.436,0.000013,0.1050,-8.761,drivers license,0.0601,143.874,0.132
18617,0.221000,0.700,0.722,0.000000,0.2720,-3.558,Mood (feat. iann dior),0.0369,90.989,0.756
18620,0.468000,0.737,0.802,0.000000,0.0931,-4.771,positions,0.0878,144.015,0.682
18712,0.212000,0.863,0.666,0.000493,0.1030,-4.158,BICHOTA,0.1520,163.908,0.838
18634,0.401000,0.731,0.573,0.000052,0.1130,-10.059,DÁKITI,0.0544,109.928,0.145
...,...,...,...,...,...,...,...,...,...,...
129279,0.080200,0.254,0.633,0.000871,0.0551,-8.697,NINETEEN,0.0321,75.980,0.516
129281,0.006780,0.753,0.691,0.705000,0.0893,-9.075,Azrael - Dahu Remix,0.0306,122.007,0.231
129283,0.002270,0.741,0.825,0.068200,0.1670,-2.924,Speak Up (feat. Wynter Gordon),0.0720,128.020,0.319
129287,0.008510,0.626,0.533,0.556000,0.1170,-13.078,Lost Vegas,0.0367,127.018,0.377


In [11]:
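def min_max_scale(X, feature_range=(0, 1)):
    # Linearly rescale X so that its minimum maps to lo and its maximum to hi.
    # The parameter is named feature_range to avoid shadowing the built-in range().
    lo, hi = feature_range
    X_std = (X - X.min()) / (X.max() - X.min())
    X_scaled = X_std * (hi - lo) + lo
    return X_scaled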

In [12]:
for feature in scale_features: 
    df_recommender_2000[feature] = min_max_scale(df_recommender_2000[feature])
    print(df_recommender_2000[feature].describe())

count    36381.000000
mean         0.803281
std          0.077806
min          0.000000
25%          0.778138
50%          0.821970
75%          0.852207
max          1.000000
Name: loudness, dtype: float64
count    36381.000000
mean         0.553333
std          0.128064
min          0.000000
25%          0.450918
50%          0.563180
75%          0.627107
max          1.000000
Name: tempo, dtype: float64
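
The scaling used above follows x' = (x - min) / (max - min), which is the same transformation that scikit-learn's `MinMaxScaler` (imported at the top of the notebook) applies with its default range of (0, 1). A minimal sketch on made-up tempo values, assuming nothing about the real data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical tempo values in beats per minute.
tempo = pd.Series([60.0, 90.0, 120.0, 180.0], name="tempo")

# Manual min-max scaling, as in the notebook's helper.
manual = (tempo - tempo.min()) / (tempo.max() - tempo.min())

# The same result from scikit-learn; it expects a 2-D input, hence to_frame().
scaled = MinMaxScaler().fit_transform(tempo.to_frame()).ravel()

print(manual.tolist())
print(np.allclose(manual, scaled))
```

Either approach puts `loudness` and `tempo` on the same 0 to 1 scale as the other audio features, so that no single feature dominates the similarity score.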


In [13]:
df_recommender_2000 = df_recommender_2000.set_index('name', drop=True)
df_recommender_2000.head()

name,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
drivers license,0.721,0.585,0.436,1.3e-05,0.105,0.789966,0.0601,0.648215,0.132
Mood (feat. iann dior),0.221,0.7,0.722,0.0,0.272,0.880072,0.0369,0.409945,0.756
positions,0.468,0.737,0.802,0.0,0.0931,0.859065,0.0878,0.648851,0.682
BICHOTA,0.212,0.863,0.666,0.000493,0.103,0.869681,0.152,0.738477,0.838
DÁKITI,0.401,0.731,0.573,5.2e-05,0.113,0.767487,0.0544,0.495274,0.145


#### 5.1.2 Genre Dataset

In [14]:
df_genres = pd.read_csv("../datasets/data_by_genres_cleaned.csv", index_col=['genres'])
df_genres.head()

genres,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,duration_mins
21st century classical,0.7546,0.2841,0.15958,0.484374,0.16858,-22.1534,0.06206,91.351,0.14338,0.066,4,1,5.88
432hz,0.485515,0.312,0.391678,0.47725,0.26594,-18.131267,0.071717,118.900933,0.236483,0.412,11,1,17.46
8-bit,0.0289,0.673,0.95,0.63,0.069,-7.899,0.292,192.816,0.997,0.0,5,1,2.22
[],0.535793,0.546937,0.48543,0.278442,0.22097,-11.624754,0.101511,116.06898,0.486361,0.123508,7,1,4.16
a cappella,0.694276,0.516172,0.330533,0.03608,0.222983,-12.656547,0.083627,105.506031,0.454077,0.390862,7,1,3.36


In [1]:
df_genres = df_genres.sort_values('popularity', ascending=False)
df_genres.head(30)


In [16]:
for feature in scale_features: 
    df_genres[feature] = min_max_scale(df_genres[feature])
    print(df_genres[feature].describe())

count    3232.000000
mean        0.760296
std         0.126312
min         0.000000
25%         0.714633
50%         0.792946
75%         0.844711
max         1.000000
Name: loudness, dtype: float64
count    3232.000000
mean        0.387396
std         0.115094
min         0.000000
25%         0.319488
50%         0.390281
75%         0.444143
max         1.000000
Name: tempo, dtype: float64


Re-organising columns

In [30]:
column_order = list(df_recommender_2000.columns)
df_genres = df_genres[column_order]
df_genres.head(30)

genres,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
chinese electropop,0.00257,0.66,0.787,0.0,0.323,0.906223,0.032,0.535536,0.199
korean mask singer,0.238,0.658,0.749,2e-06,0.272,0.947077,0.0634,0.448493,0.442
yaoi,0.00655,0.603,0.964,3e-06,0.143,0.949114,0.0487,0.489123,0.796
dutch rap pop,0.216,0.767,0.643,0.000104,0.0806,0.890308,0.101,0.721269,0.536
rochester mn indie,0.0347,0.655,0.487,0.897,0.271,0.820842,0.033,0.521566,0.0454
dong-yo,0.259,0.829,0.886,0.0,0.0559,0.977775,0.112,0.35651,0.777
afroswing,0.171333,0.758333,0.512,3.3e-05,0.101033,0.831536,0.211,0.270224,0.480333
j-rap,0.0961,0.648,0.904,2e-06,0.0629,0.906901,0.0786,0.402578,0.728
estonian pop,0.285,0.458,0.696,0.0,0.123,0.902451,0.0402,0.748831,0.418
irish pop,0.3086,0.47475,0.513625,0.024295,0.146075,0.815918,0.0577,0.329866,0.265238


### 5.2 Cosine Similarity for Songs

In this section, we explore the cosine similarity between songs and create an algorithm that returns the top 30 most similar songs to a particular song chosen. In this case, we use the song "drivers license" as our reference. 

In [None]:
sim_matrix = cosine_similarity(df_recommender_2000)
songs_sim = pd.DataFrame(sim_matrix, columns=df_recommender_2000.index, index=df_recommender_2000.index)
songs_sim

In [None]:
# Skip position 0 (the song itself) and keep the 30 most similar songs
songs_sim['drivers license'].sort_values(ascending=False)[1:31]

These are the top 30 songs similar to drivers license.
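
To make the similarity measure concrete: the cosine similarity of two feature vectors a and b is (a · b) / (‖a‖ ‖b‖), so it compares the direction of the vectors rather than their magnitude. The sketch below uses three made-up songs (the names `song_a`, `song_b`, `song_c` and their values are hypothetical) and checks a hand computation against `cosine_similarity`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical songs described by three audio features.
toy = pd.DataFrame(
    {
        "energy": [0.9, 0.8, 0.1],
        "valence": [0.8, 0.9, 0.2],
        "acousticness": [0.1, 0.2, 0.9],
    },
    index=["song_a", "song_b", "song_c"],
)

# Cosine similarity of song_a and song_b computed by hand.
a = toy.loc["song_a"].to_numpy()
b = toy.loc["song_b"].to_numpy()
by_hand = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Full pairwise matrix, labelled with song names.
sim = pd.DataFrame(cosine_similarity(toy), index=toy.index, columns=toy.index)

# Rank the other songs by similarity to song_a (the song itself is dropped).
most_similar = sim["song_a"].drop("song_a").sort_values(ascending=False)
print(most_similar)
```

Here `song_b` ranks first because its feature profile points in almost the same direction as `song_a`, while the acoustic `song_c` scores much lower.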

### 5.3 Cosine Similarity for Genres

This is the main focus for the recommender system - to build a playlist of 30 songs which are most similar to a particular genre selected and add the playlist to my Spotify using the Spotify API. 

In [20]:
genres_to_recc = list(df_genres.head(30).index)
genres_to_recc[25:]

['future house',
 'finnish alternative rock',
 'german pop rock',
 'mexican pop punk',
 'scandinavian r&b']

Let's recommend songs with similar features to newcastle nsw indie

In [19]:
genre_choice = 'newcastle nsw indie'

In [20]:
df_genre_recc = pd.concat([pd.DataFrame(df_genres.loc[genre_choice]).T, df_recommender_2000], axis=0)
df_genre_recc.head()

Unnamed: 0,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
newcastle nsw indie,0.00392,0.498,0.81,0.00717,0.0635,0.919472,0.0263,0.123309,0.594
drivers license,0.721,0.585,0.436,1.3e-05,0.105,0.789966,0.0601,0.648215,0.132
Mood (feat. iann dior),0.221,0.7,0.722,0.0,0.272,0.880072,0.0369,0.409945,0.756
positions,0.468,0.737,0.802,0.0,0.0931,0.859065,0.0878,0.648851,0.682
BICHOTA,0.212,0.863,0.666,0.000493,0.103,0.869681,0.152,0.738477,0.838


In [21]:
start_time = datetime.now()
print(f"Recommending {genre_choice} music...")
sim_matrix_genre = cosine_similarity(df_genre_recc)
songs_sim_genre = pd.DataFrame(sim_matrix_genre, columns=df_genre_recc.index, index=df_genre_recc.index)
songs_sim_genre

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Recommending newcastle nsw indie music...
Duration: 0:00:20.597138
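
Only the genre's row of the similarity matrix is needed for the recommendation, yet the cell above computes every song-to-song pair. For 36,381 songs plus the genre row, a dense float64 matrix holds about 1.32 billion entries, roughly 10.6 GB. One possible alternative, sketched here on made-up data (the names `songs`, `genre_vector`, and `top_n` are hypothetical), passes the genre vector and the song table as two separate arguments to `cosine_similarity`, which returns a single row of scores.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical song features, indexed by song name.
songs = pd.DataFrame(
    [[0.9, 0.8, 0.1], [0.2, 0.1, 0.9], [0.8, 0.9, 0.2]],
    index=["song_a", "song_b", "song_c"],
    columns=["energy", "valence", "acousticness"],
)

# Hypothetical genre profile with the same feature columns.
genre_vector = pd.Series([0.85, 0.85, 0.15], index=songs.columns, name="toy genre")

# cosine_similarity(X, Y) compares every row of X with every row of Y,
# so a one-row X gives an array of shape (1, n_songs).
scores = cosine_similarity(genre_vector.to_frame().T, songs)[0]

# Label the scores with song names and keep the best matches.
ranking = pd.Series(scores, index=songs.index).sort_values(ascending=False)
top_n = ranking.head(2)
print(top_n)
```

The scores are the same values that appear in the genre's row of the full matrix, but memory and run time scale with the number of songs instead of its square.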


In [22]:
df_songs_sim_genre = pd.DataFrame(sim_matrix_genre, columns=df_genre_recc.index, index=df_genre_recc.index)
df_songs_sim_genre

Unnamed: 0,newcastle nsw indie,drivers license,Mood (feat. iann dior),positions,BICHOTA,DÁKITI,Therefore I Am,WITHOUT YOU,Whoopty,34+35,...,Machete - Original Mix,Horizon,Loungin' - VIP,Levitate,My Love,NINETEEN,Azrael - Dahu Remix,Speak Up (feat. Wynter Gordon),Lost Vegas,Move - Tune Brothers Remix
newcastle nsw indie,1.000000,0.724687,0.953603,0.905278,0.899158,0.843406,0.876148,0.914448,0.908123,0.908271,...,0.682731,0.832716,0.796043,0.776753,0.808266,0.968849,0.809486,0.925539,0.832912,0.922086
drivers license,0.724687,1.000000,0.831729,0.912169,0.850600,0.960201,0.828661,0.892910,0.852326,0.891436,...,0.689322,0.749565,0.775348,0.725985,0.804517,0.790323,0.761643,0.840220,0.781530,0.741348
Mood (feat. iann dior),0.953603,0.831729,1.000000,0.973148,0.973901,0.908220,0.951011,0.975167,0.979637,0.977329,...,0.736795,0.842029,0.825338,0.799567,0.867232,0.955534,0.829228,0.942901,0.876002,0.944957
positions,0.905278,0.912169,0.973148,1.000000,0.978995,0.949464,0.939219,0.968827,0.979259,0.972211,...,0.713929,0.846456,0.843540,0.803407,0.879320,0.937570,0.834077,0.938627,0.875090,0.925382
BICHOTA,0.899158,0.850600,0.973901,0.978995,1.000000,0.912725,0.969215,0.974853,0.997912,0.975046,...,0.713632,0.849238,0.848772,0.813478,0.903444,0.926328,0.838302,0.939999,0.893827,0.948050
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
NINETEEN,0.968849,0.790323,0.955534,0.937570,0.926328,0.858533,0.870766,0.925681,0.937656,0.910813,...,0.681212,0.844028,0.837631,0.781472,0.842520,1.000000,0.808118,0.929348,0.851987,0.932845
Azrael - Dahu Remix,0.809486,0.761643,0.829228,0.834077,0.838302,0.855005,0.841463,0.850032,0.836242,0.863348,...,0.934347,0.993045,0.982756,0.994410,0.976386,0.808118,1.000000,0.910081,0.986976,0.913608
Speak Up (feat. Wynter Gordon),0.925539,0.840220,0.942901,0.938627,0.939999,0.944521,0.895099,0.949330,0.942803,0.965288,...,0.823879,0.907538,0.907960,0.870232,0.927903,0.929348,0.910081,1.000000,0.922458,0.922860
Lost Vegas,0.832912,0.781530,0.876002,0.875090,0.893827,0.861264,0.890730,0.892009,0.892912,0.896859,...,0.913901,0.988929,0.986437,0.985143,0.994134,0.851987,0.986976,0.922458,1.000000,0.952072


In [23]:
df_songs_sim_genre[genre_choice].sort_values(ascending=False)[1:31]

Beer Never Broke My Heart                                   0.987315
Must've Never Met You                                       0.986413
Young                                                       0.986351
Unstoppable                                                 0.985955
Windows Are Rolled Down                                     0.985551
Get Back                                                    0.985430
Come & Get It                                               0.985272
Primo Victoria                                              0.985169
Steal My Girl                                               0.984567
City                                                        0.984371
Hallelujah                                                  0.984324
Neverender                                                  0.983602
That Was A Crazy Game Of Poker                              0.983592
He's Mine                                                   0.983567
Can I Get an Outlaw               

In [24]:
song_list = list(df_songs_sim_genre[genre_choice].sort_values(ascending=False)[1:31].index)
song_list

['Beer Never Broke My Heart',
 "Must've Never Met You",
 'Young',
 'Unstoppable',
 'Windows Are Rolled Down',
 'Get Back',
 'Come & Get It',
 'Primo Victoria',
 'Steal My Girl',
 'City',
 'Hallelujah',
 'Neverender',
 'That Was A Crazy Game Of Poker',
 "He's Mine",
 'Can I Get an Outlaw',
 'Rill Rill',
 'Watch The Wind Blow By',
 'Home',
 'The Sun',
 'All American Nightmare',
 'Make A Move',
 'Priceless',
 'Right Girl',
 'Blind Eye',
 'The Country Boy Song (feat. Earl Dibbles Jr.)',
 'Tirando la H',
 'Stand Up',
 "Son's Gonna Rise (feat. Robert Randolph) - New Radio Mix",
 'Naughty List (with Dixie D’Amelio)',
 'อย่าหักหลัง']

### 5.4 Spotify API 

In [22]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth

In [23]:
# Replace SPOTIFY_ID and SPOTIFY_SECRET with your own Spotify app credentials
CLIENT_ID = SPOTIFY_ID
CLIENT_SECRET = SPOTIFY_SECRET 

In [24]:
sp = spotipy.Spotify(
    auth_manager=SpotifyOAuth(client_id=CLIENT_ID,
                              client_secret=CLIENT_SECRET,
                              redirect_uri="http://example.com",
                              scope="playlist-modify-public",
                              show_dialog=True,
                             cache_path="token.txt"))
user_id = sp.current_user()["id"]

In [26]:
def find_song(songs):
    # Look up each song title on Spotify and collect the URI of the first match.
    song_uris = []
    for song in songs:
        result = sp.search(q=f"track:{song}", type="track")
        try:
            uri = result['tracks']['items'][0]['uri']
            song_uris.append(uri)
        except IndexError:
            # An empty result list means the search found no matching track
            print(f"{song} doesn't exist in Spotify. Skipped.")
    return song_uris
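
`find_song` takes the first search hit for each title. Because the artist column was dropped during preprocessing and titles such as "Home" or "Stay" are shared by many tracks, that first hit may be a different recording from the one in the dataset. The lookup of the first hit can also be isolated into a small helper that is testable without calling the API. The sketch below is one way to do that. `first_track_uri` is a hypothetical name, and the sample dictionaries only mimic the `result['tracks']['items'][0]['uri']` shape used above.

```python
from typing import Optional


def first_track_uri(search_result: dict) -> Optional[str]:
    """Return the URI of the first track in a search result, or None if empty."""
    items = search_result.get("tracks", {}).get("items", [])
    return items[0]["uri"] if items else None


# Hypothetical responses shaped like the ones indexed in find_song.
hit = {"tracks": {"items": [{"uri": "spotify:track:EXAMPLE"}]}}
miss = {"tracks": {"items": []}}

print(first_track_uri(hit))
print(first_track_uri(miss))
```

Returning `None` instead of raising lets the caller decide whether to skip the song or retry with a different query.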

The loop below builds a recommendation list for each selected genre (here `genres_to_recc[25:]`, the last five of the top 30 genres) and automatically creates the matching playlist on Spotify.

In [28]:
for genre in genres_to_recc[25:]:
    # Put the genre's feature vector on top of the song features
    df_genre_recc = pd.concat([pd.DataFrame(df_genres.loc[genre]).T, df_recommender_2000], axis=0)
    start_time = datetime.now()
    print(f"Recommending {genre} music...")
    sim_matrix_genre = cosine_similarity(df_genre_recc)
    df_songs_sim_genre = pd.DataFrame(sim_matrix_genre, columns=df_genre_recc.index, index=df_genre_recc.index)
    end_time = datetime.now()
    print('Duration: {}'.format(end_time - start_time))

    # Skip position 0 (the genre itself) and keep the 30 most similar songs
    song_list = list(df_songs_sim_genre[genre].sort_values(ascending=False)[1:31].index)
    print(song_list)

    playlist_name = f"Sounds like {genre.title()} Playlist"
    # Use the authenticated user's id rather than a hard-coded one
    playlist = sp.user_playlist_create(user=user_id, name=playlist_name)
    sp.playlist_add_items(playlist_id=playlist['id'], items=find_song(song_list))

Recommending future house music...
Duration: 0:00:06.553475
["Don't It", "The Remedy (I Won't Worry)", 'Rich Kids - [Middle Cla$$ MIX]', 'Dani California', 'Careful with That Mic...', '&', 'Figured You Out', 'El Ultimo Adios', 'the WORLD', 'Galway Girl', 'Young Volcanoes', 'I Just Wanna Be Mad', 'I Hate Hippies', 'Real Good Man', 'Cheers (Drink To That)', 'Hot Mama', 'One Whole Day (feat. Wiz Khalifa)', 'Raised On It', 'สาปแช่ง พวกแย่งแฟน', 'Little Red Wagon', "Burnin' Up", 'Take It From Me', 'After the Moment', 'Chicks Dig It - Single Edit', 'Overprotected - Radio Edit', 'Somebody To You', 'Ocean Breathes Salty', 'Shark In The Water', 'Swing', "I'm Not Famous"]
Recommending finnish alternative rock music...
Duration: 0:00:32.188491
['Small Town Boy', 'D.J. D.J.', 'Raise Hell', 'Shake up Christmas 2011 (Official Coca-Cola Christmas Song)', 'Jim Sharp', 'Kiss You', 'Animals - Karaoke Version In the Style of Nickelback', 'Overprotected - Radio Edit', 'Northern Redneck', 'El Ultimo Adios'

### 5.5 Song Recommender Conclusion

Using the cosine similarity matrix, we were able to build playlists of songs whose musical features were the most similar to the selected genres (the top 30 genres in this case). On closer inspection of the playlists, we note that the recommended songs do not necessarily belong to the selected genre, despite having similar musical features. For example, Katy Perry's track "Last Friday Night" is clearly a pop song, yet it was recommended in the Scandinavian R&B playlist, and Eminem's rap hit "Sing for the Moment" was considered similar to the Swedish Tropical House genre.

This leads us to conclude that although songs may be categorised in a particular genre, their musical features could be similar to songs of another genre. 

In general, the playlists whose songs were most similar to one another tended to be from the pop genres. For example, in the Mexican Pop Punk playlist, close to 50% of the songs could be classified as pop, despite being part of other genres as well. <a href="https://carrieunderwood.fandom.com/wiki/Country_Music">Carrie Underwood</a>, <a href="https://www.allmusic.com/artist/sam-hunt-mn0002674420/biography">Sam Hunt</a> and <a href="https://theboot.com/top-sara-evans-songs/">Sara Evans</a> are typically associated with country music, yet their music is generally described as country pop.

Overall, we see that the evolution of music has blurred the lines between genres, so that songs traditionally classified as pop could also fall under indie, creating the subgenre Indie Pop. We can also point to playlists such as Indonesian R&B, which consists mostly of English songs but also contains Thai and Korean songs. From this, we see that musical features are language-agnostic: a recommended song may not belong to the selected genre, yet still have musical features similar to that genre.

In conclusion, music is fluid: genres transcend languages, and songs transcend the genres that define them. When recommending music similar to one genre, we may inadvertently recommend songs whose features are similar to other genres.