This notebook builds the recommender.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from numpy import dot
from numpy.linalg import norm

from sklearn.preprocessing import StandardScaler

plt.style.use('fivethirtyeight')

In [2]:
#import data
data_df = pd.read_csv('../data/spotify_songs_1922/cleaned_data.csv')

In [3]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155098 entries, 0 to 155097
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   valence           155098 non-null  float64
 1   year              155098 non-null  int64  
 2   acousticness      155098 non-null  float64
 3   artists           155098 non-null  object 
 4   danceability      155098 non-null  float64
 5   duration_ms       155098 non-null  int64  
 6   energy            155098 non-null  float64
 7   explicit          155098 non-null  int64  
 8   id                155098 non-null  object 
 9   instrumentalness  155098 non-null  float64
 10  key               155098 non-null  int64  
 11  liveness          155098 non-null  float64
 12  loudness          155098 non-null  float64
 13  mode              155098 non-null  int64  
 14  name              155098 non-null  object 
 15  popularity        155098 non-null  int64  
 16  release_date      15

In [4]:
#number of songs and unique IDs
print(data_df.shape[0])
print(data_df.id.nunique())

155098
155098


The number of unique IDs matches the number of songs, so every song has its own ID.

In [5]:
#create the dataframe that the cosine similarity calculation is applied to
#keep only the numerical columns and drop the ones that do not represent audio features
data_df_cosim = data_df.select_dtypes(include=['int64','float64']).drop(columns=['year','popularity','explicit','duration_ms'])

#concat with id so that we can use the groupby function on it
data_df_cosim = pd.concat([data_df_cosim, data_df.id], axis=1)

In [6]:
data_df_cosim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155098 entries, 0 to 155097
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   valence           155098 non-null  float64
 1   acousticness      155098 non-null  float64
 2   danceability      155098 non-null  float64
 3   energy            155098 non-null  float64
 4   instrumentalness  155098 non-null  float64
 5   key               155098 non-null  int64  
 6   liveness          155098 non-null  float64
 7   loudness          155098 non-null  float64
 8   mode              155098 non-null  int64  
 9   speechiness       155098 non-null  float64
 10  tempo             155098 non-null  float64
 11  id                155098 non-null  object 
dtypes: float64(9), int64(2), object(1)
memory usage: 14.2+ MB


In [7]:
#put the dataset into a matrix indexed by id (ids are unique, so sum() leaves the values unchanged)
song_sim_matrix = data_df_cosim.groupby('id').sum()

song_sim_matrix.head()

id,valence,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo
000G1xMMuwxNHmwVsBdtj1,0.555,0.0131,0.256,0.895,0.000106,2,0.0821,-4.86,1,0.0707,191.307
000GyYHG4uWmlXieKLij8u,0.146,0.469,0.797,0.898,0.0,1,0.0824,-5.922,0,0.52,89.926
000Npgk5e2SgwGaIsN3ztv,0.494,0.98,0.277,0.145,0.879,3,0.111,-19.898,1,0.0845,75.644
000ZxLGm7jDlWCHtcXSeBe,0.854,0.795,0.685,0.483,0.878,2,0.113,-10.202,0,0.0337,97.694
000jBcNljWTnyjB4YO7ojf,0.969,0.656,0.788,0.808,0.0,5,0.154,-6.59,1,0.0395,113.046
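
Because every `id` is unique, each group holds exactly one row, so the sum does not change any feature value; the call mainly moves `id` into the (sorted) index. The sketch below illustrates this on a made-up three-row frame. The frame `toy_df` and its ids are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for data_df_cosim; the ids are unique
toy_df = pd.DataFrame({
    'valence': [0.5, 0.1, 0.9],
    'key': [2, 1, 5],
    'id': ['b2', 'a1', 'c3'],
})

# Approach used in the notebook: one row per group, so sum() changes nothing
by_groupby = toy_df.groupby('id').sum()

# Alternative: move id into the index and sort it, as groupby does by default
by_set_index = toy_df.set_index('id').sort_index()

# Compare the two results
print(by_groupby.equals(by_set_index))
```

If ids were ever duplicated, the two approaches would diverge: `groupby('id').sum()` would add the duplicated rows together, while `set_index` would keep them as separate rows.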


In [8]:
#creating scaling function
def scaler(data):
    ss = StandardScaler()
    return ss.fit_transform(data)

#scale our dataframe
song_sim_matrix_scaled = scaler(song_sim_matrix)

#convert our numpy matrix into dataframe
song_sim_matrix_df = pd.DataFrame(song_sim_matrix_scaled, index=song_sim_matrix.index,columns=song_sim_matrix.columns)

In [9]:
#show scaled dataframe
song_sim_matrix_df.head()

id,valence,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo
000G1xMMuwxNHmwVsBdtj1,0.093255,-1.297802,-1.616679,1.54047,-0.526461,-0.910407,-0.711547,1.160521,0.643933,-0.180412,2.416146
000GyYHG4uWmlXieKLij8u,-1.467362,-0.084203,1.469889,1.551683,-0.526802,-1.194807,-0.709823,0.973435,-1.552956,2.508301,-0.87937
000Npgk5e2SgwGaIsN3ztv,-0.139502,1.276071,-1.496868,-1.262732,2.300917,-0.626006,-0.545525,-1.488629,0.643933,-0.09783,-1.343625
000ZxLGm7jDlWCHtcXSeBe,1.234146,0.783604,0.830895,0.000578,2.2977,-0.910407,-0.534036,0.219455,-1.552956,-0.401828,-0.626862
000jBcNljWTnyjB4YO7ojf,1.67295,0.413588,1.418541,1.215299,-0.526802,-0.057205,-0.298504,0.855758,0.643933,-0.36712,-0.127826
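
Scaling matters for cosine similarity because the raw features sit on very different scales: `tempo` is in the hundreds while most other features lie between 0 and 1. The sketch below compares the first and third songs from the two `head()` outputs above, once with their raw values and once with their scaled values. The helper `cosine` is introduced here for illustration only.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D arrays."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Raw feature rows copied from the unscaled head() output (first and third songs)
raw_a = np.array([0.555, 0.0131, 0.256, 0.895, 0.000106, 2, 0.0821, -4.86, 1, 0.0707, 191.307])
raw_b = np.array([0.494, 0.98, 0.277, 0.145, 0.879, 3, 0.111, -19.898, 1, 0.0845, 75.644])

# The same two songs copied from the scaled head() output
scaled_a = np.array([0.093255, -1.297802, -1.616679, 1.54047, -0.526461, -0.910407,
                     -0.711547, 1.160521, 0.643933, -0.180412, 2.416146])
scaled_b = np.array([-0.139502, 1.276071, -1.496868, -1.262732, 2.300917, -0.626006,
                     -0.545525, -1.488629, 0.643933, -0.09783, -1.343625])

print(f'raw similarity:    {cosine(raw_a, raw_b)}')
print(f'scaled similarity: {cosine(scaled_a, scaled_b)}')
```

On the raw values the two songs look almost identical, because `tempo` and `loudness` dominate both vectors. After scaling, their differences in acousticness, energy and instrumentalness carry comparable weight and the similarity turns negative.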


In [10]:
#test custom cosine similarity formula
#cos_sim = dot(a, b)/(norm(a)*norm(b))

first_song = song_sim_matrix_df.iloc[0]
second_song = song_sim_matrix_df.iloc[1]

print(f'similarity score of song: {dot(first_song, second_song)/(norm(second_song)*norm(first_song))}')

similarity score of song: -0.033654198083342184


It works! Now to put it in a function that applies the calculation to all the other songs to find the most similar ones.
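
As a further sanity check on the formula, it can be run on vectors whose similarity is known in advance: vectors pointing the same way should score about 1, opposite vectors about -1, and perpendicular vectors about 0. The helper name `cosine_similarity` and the toy vectors are assumptions made for this sketch.

```python
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Same formula as in the cell above: dot(a, b) / (norm(a) * norm(b))."""
    return dot(a, b) / (norm(a) * norm(b))

v = np.array([1.0, 2.0, 3.0])

# Same direction, different length: length does not affect the score
same_direction = cosine_similarity(v, 2 * v)

# Exactly opposite direction
opposite = cosine_similarity(v, -v)

# Perpendicular vectors
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(same_direction, opposite, orthogonal)
```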

In [11]:
#getting the score

def sim_score_generator(song, artist, number_of_recommendations):

    """Take a song and artist and return the N most similar songs as (score, row index, id) tuples."""

    #use the artist and song name to look up the song id
    song_id = data_df[(data_df.main_artist == artist) & (data_df.name == song)]['id'].values[0]

    song_array = song_sim_matrix_df.loc[song_id].to_numpy()

    #list of (sim score, row index, song id) tuples
    sim_score = []

    #iterate through every row of the scaled matrix
    for song_index in range(len(song_sim_matrix_df)):

        #convert the row into an array
        other_song_array = song_sim_matrix_df.iloc[song_index].to_numpy()
        #sim score calculation
        score = dot(song_array, other_song_array)/(norm(song_array)*norm(other_song_array))

        #get other song id
        other_song_id = song_sim_matrix_df.index[song_index]

        #store the score together with the row position and id of the song
        sim_score.append((score, song_index, other_song_id))

    #sort by score, highest first; the top entry is expected to be the query song itself, so skip it
    return sorted(sim_score, reverse=True)[1:number_of_recommendations+1]
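
The loop in `sim_score_generator` computes one pair of songs at a time, which is slow over 155,098 rows. A possible vectorized alternative is sketched below: normalize every row to unit length once, after which cosine similarity against the query is a single matrix-vector product. The function `top_n_similar` and the toy matrix are hypothetical, and the sketch assumes no row is all zeros. Unlike the loop version, it removes the query song by its row index instead of skipping the first sorted entry.

```python
import numpy as np

def top_n_similar(matrix, query_row, n):
    """Return (row indices, scores) of the n rows most similar to matrix[query_row]."""
    # Normalize every row to unit length; cosine similarity is then a plain dot product
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit_rows @ unit_rows[query_row]

    # Sort by descending score, drop the query row itself, keep the top n
    order = np.argsort(-scores)
    order = order[order != query_row][:n]
    return order, scores[order]

# Toy matrix: row 1 is close to row 0, row 2 is perpendicular, row 3 is opposite
toy = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

indices, scores = top_n_similar(toy, query_row=0, n=2)
print(indices, scores)
```

In this notebook, the matrix could plausibly be `song_sim_matrix_df.to_numpy()` and the query row `song_sim_matrix_df.index.get_loc(song_id)`; that wiring has not been run against the dataset here.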

In [12]:
def show_recommendations(top_recommendations):
    #unpack tuples in list of recommendations
    for score, song_idx, other_song_id in top_recommendations:
        #look up the row in data_df whose id matches the recommended song
        song_row = data_df[data_df.id == other_song_id].iloc[0]

        #print the corresponding artist, song, year and popularity
        print(f"Artist: {song_row['main_artist']}")
        print(f"Song: {song_row['name']}")
        print(f"Year & Popularity: {song_row['year']}, {song_row['popularity']}\n")


The recommender will be tested on three songs that most listeners know, which gives a frame of reference for what the recommendations should sound like.

In [13]:
song = 'Ice Ice Baby'
artist = "Vanilla Ice"

#return 10 similar songs with raw similarity scores and assign them to a list
recommendations = sim_score_generator(song,artist,10)

#print similar songs with corresponding artist and song name
show_recommendations(recommendations)

Artist: Vanilla Ice
Song: Ice Ice Baby - Radio Edit
Year & Popularity: 1990, 36

Artist: Tom Tom Club
Song: Genius of Love - Long Version
Year & Popularity: 1981, 32

Artist: Fern Kinney
Song: Groove Me
Year & Popularity: 1979, 36

Artist: Mellow Man Ace
Song: Mentirosa
Year & Popularity: 1989, 44

Artist: Whodini
Song: Growing Up
Year & Popularity: 1986, 23

Artist: M.C. Luscious
Song: Boom I Got Your Boyfriend
Year & Popularity: 1992, 32

Artist: Pat Benatar
Song: Rated 'X'
Year & Popularity: 1979, 26

Artist: Tom Tom Club
Song: Genius of Love
Year & Popularity: 1981, 57

Artist: Prince
Song: Housequake
Year & Popularity: 1987, 44

Artist: Jimmy Cliff
Song: Vietnam - Single Version
Year & Popularity: 1970, 31



In [14]:
song = 'All of Me'
artist = "John Legend"

#return 10 similar songs with raw similarity scores and assign them to a list
recommendations = sim_score_generator(song,artist,10)

#print similar songs with corresponding artist and song name
show_recommendations(recommendations)

Artist: Rosemary Clooney
Song: Snow (with Percy Faith & His Orchestra)
Year & Popularity: 1954, 40

Artist: Martina McBride
Song: In My Daughter's Eyes
Year & Popularity: 2003, 51

Artist: Irma Serrano
Song: Canción de un Preso
Year & Popularity: 1962, 30

Artist: Emeli Sandé
Song: Clown
Year & Popularity: 2012, 45

Artist: Fairuz
Song: Baadak Ala Baly
Year & Popularity: 1963, 37

Artist: Ruth B.
Song: If By Chance
Year & Popularity: 2017, 73

Artist: Johnnyswim
Song: What Are You Doing New Years Eve
Year & Popularity: 2014, 53

Artist: Axel
Song: Amo
Year & Popularity: 2003, 59

Artist: Luis Fernando Ramirez Zuluaga
Song: Mantelito Blanco (Instrumental y Voz)
Year & Popularity: 1930, 0

Artist: Veronica Gonzalez Day
Song: La vie en rose
Year & Popularity: 1949, 0



In [16]:
song = "Mr. Brightside"
artist = "The Killers"

#return 10 similar songs with raw similarity scores and assign them to a list
recommendations = sim_score_generator(song,artist,10)
 
#print similar songs with corresponding artist and song name
show_recommendations(recommendations)


Artist: Taking Back Sunday
Song: My Blue Heaven
Year & Popularity: 2006, 42

Artist: Adema
Song: Giving In
Year & Popularity: 2001, 51

Artist: Scorpions
Song: Under The Same Sun
Year & Popularity: 1993, 32

Artist: Sex Pistols
Song: Liar
Year & Popularity: 1977, 32

Artist: Oasis
Song: Don't Look Back In Anger
Year & Popularity: 1995, 73

Artist: Mitski
Song: Townie
Year & Popularity: 2014, 54

Artist: AFI
Song: 6 To 8
Year & Popularity: 2000, 36

Artist: Seether
Song: Breakdown
Year & Popularity: 2007, 54

Artist: Creed
Song: My Sacrifice
Year & Popularity: 2001, 70

Artist: Third Day
Song: Trust In Jesus
Year & Popularity: 2010, 45



Observation:

The recommender returns songs from different time periods, languages, and popularity levels. However, it is hard to say what makes these recommendations reliable rather than random. To improve this further, my own user data will be used to refine the recommendations.