### Recommendation Engine

Overview: A song recommendation system identifies songs that a user may enjoy based on their listening preferences. To do this, I pull in my listening history and compare song features from it against songs that are not in my listening history, then recommend the songs most similar to the ones I listen to.

Step 1: Combine my listening history with a random library of songs 
<br>Step 2: Preprocess the data so that each desired feature is considered
<br>Step 3: Create a cosine similarity matrix
<br>Step 4: Create a function to make recommendations given a selected song
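
For reference, cosine similarity between two feature vectors is their dot product divided by the product of their lengths, so it compares direction rather than magnitude. A minimal NumPy sketch of that definition (the helper name `cosine_sim` is made up for illustration; the notebook itself uses scikit-learn's `cosine_similarity`):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the two vector lengths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine_sim([1, 0], [0, 1])       # perpendicular vectors
print(same_direction, orthogonal)
```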

### Import Libraries & Data

In [158]:
import pandas as pd
import numpy as np
import pickle

from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity, cosine_distances
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

In [171]:
kaggle = pd.read_csv('../data/kaggle.csv')
%store -r history

In [172]:
history.drop_duplicates(subset = 'trackID', inplace = True)
history.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1426 entries, 0 to 3622
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   endTime            1426 non-null   object 
 1   artistName         1426 non-null   object 
 2   trackName          1426 non-null   object 
 3   msPlayed           1426 non-null   int64  
 4   count              1426 non-null   int64  
 5   artist_id          1426 non-null   object 
 6   genres             1426 non-null   object 
 7   artist_popularity  1426 non-null   float64
 8   followers          1426 non-null   float64
 9   trackID            1426 non-null   object 
 10  danceability       1426 non-null   float64
 11  energy             1426 non-null   float64
 12  key                1426 non-null   float64
 13  loudness           1426 non-null   float64
 14  mode               1426 non-null   float64
 15  speechiness        1426 non-null   float64
 16  acousticness       1426 

In [173]:
history.head(1)

Unnamed: 0,endTime,artistName,trackName,msPlayed,count,artist_id,genres,artist_popularity,followers,trackID,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,2021-10-26 23:02,Daniel Caesar,Transform (feat. Charlotte Day Wilson),277984,1,20wkVLutqVOYrc0kxFs7rA,"['canadian contemporary r&b', 'pop', 'r&b']",76.0,3551511.0,1jQfgl9WRle7D8a3GXLwaD,...,-10.656,1.0,0.031,0.511,1.9e-05,0.256,0.348,68.963,280587.0,4.0


In [174]:
kaggle.dropna(inplace = True)
kaggle.drop_duplicates(subset = 'id', inplace = True)
kaggle.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 162853 entries, 0 to 170652
Data columns (total 24 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        162853 non-null  int64  
 1   valence           162853 non-null  float64
 2   year              162853 non-null  int64  
 3   acousticness      162853 non-null  float64
 4   artists           162853 non-null  object 
 5   danceability      162853 non-null  float64
 6   duration_ms       162853 non-null  int64  
 7   energy            162853 non-null  float64
 8   explicit          162853 non-null  int64  
 9   id                162853 non-null  object 
 10  instrumentalness  162853 non-null  float64
 11  key               162853 non-null  int64  
 12  liveness          162853 non-null  float64
 13  loudness          162853 non-null  float64
 14  mode              162853 non-null  int64  
 15  name              162853 non-null  object 
 16  popularity        16

In [177]:
historytemp = history.drop(columns = ['endTime', 'msPlayed', 'count', 'artist_id', 'duration_ms', 'time_signature', 'trackID'])
historytemp.head()

Unnamed: 0,artistName,trackName,genres,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,Daniel Caesar,Transform (feat. Charlotte Day Wilson),"['canadian contemporary r&b', 'pop', 'r&b']",76.0,3551511.0,0.498,0.292,5.0,-10.656,1.0,0.031,0.511,1.9e-05,0.256,0.348,68.963
1,Nia Sultana,Coconut Water,"['alternative r&b', 'chill r&b', 'indie r&b']",42.0,18048.0,0.724,0.432,11.0,-9.945,1.0,0.203,0.0876,3.2e-05,0.0828,0.696,84.992
2,Nia Sultana,Positions,"['alternative r&b', 'chill r&b', 'indie r&b']",42.0,18048.0,0.639,0.535,1.0,-7.714,1.0,0.0596,0.489,0.00126,0.109,0.321,84.975
3,Wizkid,Essence (feat. Tems),"['afro dancehall', 'afropop', 'azonto', 'niger...",76.0,2822455.0,0.849,0.707,0.0,-6.002,1.0,0.113,0.0266,9e-06,0.618,0.602,104.027
4,Nija,Finesse,"['alternative r&b', 'indie r&b', 'pop r&b', 'r...",48.0,30018.0,0.795,0.486,5.0,-8.364,1.0,0.0475,0.134,2.6e-05,0.26,0.351,132.045


In [184]:
library = kaggle.drop(columns = ['Unnamed: 0','year', 'duration_ms', 'explicit', 'release_date', 'artistName', 'artist_id', 'id'])
library.head()

Unnamed: 0,valence,acousticness,artists,danceability,energy,instrumentalness,key,liveness,loudness,mode,name,popularity,speechiness,tempo,genre,followers
0,0.0594,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berli...",0.279,0.211,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,0.0366,80.954,"['classical', 'post-romantic era', 'russian ro...",560524.0
1,0.963,0.732,['Dennis Day'],0.819,0.341,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,0.415,60.936,[],551.0
2,0.0394,0.961,['KHP Kridhamardawa Karaton Ngayogyakarta Hadi...,0.328,0.166,0.913,3,0.101,-14.85,1,Gati Bali,5,0.0339,110.339,['javanese gamelan'],438.0
3,0.165,0.967,['Frank Parker'],0.275,0.309,2.8e-05,5,0.381,-9.316,1,Danny Boy,3,0.0354,100.109,[],0.0
4,0.253,0.957,['Phil Regan'],0.418,0.193,2e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,0.038,101.665,[],26.0


In [178]:
print(library.shape, historytemp.shape)

(162853, 16) (1426, 16)


In [185]:
library = library[['name', 'artists', 'genre', 'popularity', 'followers', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
library.rename(columns = {'name':'trackName','artists':'artistName', 'popularity': 'artist_popularity', 'genre':'genres'}, inplace = True)
library.head()

Unnamed: 0,trackName,artistName,genres,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...","['Sergei Rachmaninoff', 'James Levine', 'Berli...","['classical', 'post-romantic era', 'russian ro...",4,560524.0,0.279,0.211,10,-20.096,1,0.0366,0.982,0.878,0.665,0.0594,80.954
1,Clancy Lowered the Boom,['Dennis Day'],[],5,551.0,0.819,0.341,7,-12.441,1,0.415,0.732,0.0,0.16,0.963,60.936
2,Gati Bali,['KHP Kridhamardawa Karaton Ngayogyakarta Hadi...,['javanese gamelan'],5,438.0,0.328,0.166,3,-14.85,1,0.0339,0.961,0.913,0.101,0.0394,110.339
3,Danny Boy,['Frank Parker'],[],3,0.0,0.275,0.309,5,-9.316,1,0.0354,0.967,2.8e-05,0.381,0.165,100.109
4,When Irish Eyes Are Smiling,['Phil Regan'],[],2,26.0,0.418,0.193,3,-10.096,1,0.038,0.957,2e-06,0.229,0.253,101.665


In [186]:
combined = pd.concat([library, historytemp], axis=0)
combined.shape

(164279, 16)
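
Note that `pd.concat` keeps each frame's original row labels, so `combined` ends up with repeated index values (both frames start at 0). A small sketch of the effect on toy frames, and of `ignore_index=True` as one way to get a fresh 0..n-1 index at concat time; the toy frame names are illustrative only:

```python
import pandas as pd

# Toy stand-ins for `library` and `historytemp`
library_part = pd.DataFrame({'trackName': ['Danny Boy', 'Gati Bali']})
history_part = pd.DataFrame({'trackName': ['Coconut Water', 'Positions']})

# Default behaviour: row labels from both frames are kept, so 0 and 1 repeat
stacked = pd.concat([library_part, history_part], axis=0)

# ignore_index=True discards the old labels and numbers the rows 0..n-1
reindexed = pd.concat([library_part, history_part], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(reindexed.index.tolist())
```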

In [192]:
combined['Song'] = combined['trackName'] + '-' + combined['artistName'] 
combined.drop(columns = ['trackName', 'artistName'], inplace = True)

In [193]:
combined.head()

Unnamed: 0,genres,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,Song
0,"['classical', 'post-romantic era', 'russian ro...",4.0,560524.0,0.279,0.211,10.0,-20.096,1.0,0.0366,0.982,0.878,0.665,0.0594,80.954,"Piano Concerto No. 3 in D Minor, Op. 30: III. ..."
1,[],5.0,551.0,0.819,0.341,7.0,-12.441,1.0,0.415,0.732,0.0,0.16,0.963,60.936,Clancy Lowered the Boom-['Dennis Day']
2,['javanese gamelan'],5.0,438.0,0.328,0.166,3.0,-14.85,1.0,0.0339,0.961,0.913,0.101,0.0394,110.339,Gati Bali-['KHP Kridhamardawa Karaton Ngayogya...
3,[],3.0,0.0,0.275,0.309,5.0,-9.316,1.0,0.0354,0.967,2.8e-05,0.381,0.165,100.109,Danny Boy-['Frank Parker']
4,[],2.0,26.0,0.418,0.193,3.0,-10.096,1.0,0.038,0.957,2e-06,0.229,0.253,101.665,When Irish Eyes Are Smiling-['Phil Regan']
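
The Kaggle rows store artists as a stringified list (`['Dennis Day']`) while the listening-history rows store a plain name, so the same track can show up under two different `Song` labels. One possible cleanup, sketched with a hypothetical helper `clean_artists` that is not part of the original notebook:

```python
import ast

def clean_artists(value):
    """Turn "['A', 'B']" into 'A, B'; leave plain names untouched (hypothetical helper)."""
    if isinstance(value, str) and value.startswith('['):
        try:
            parsed = ast.literal_eval(value)  # safely parse the list literal
        except (ValueError, SyntaxError):
            return value  # not a valid list literal, keep the original text
        return ', '.join(str(name) for name in parsed)
    return value

print(clean_artists("['Lizzo']"))
print(clean_artists("Lizzo"))
```

Applied before building `Song`, for example with `combined['artistName'].map(clean_artists)`, this would give both sources the same label format.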


In [90]:
# kaggle[kaggle['name'] == 'No Mercy']

In [91]:
# combined[combined['followers'].isnull() == True]

#### Feature Engineering 

Features
<br> danceability, energy, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, genre, popularity, followers

1) Normalize features whose values exceed 1
2) Get dummies or TF-IDF on the genre
3) Combine into a single vector per song
4) Calculate cosine similarity
5) Generate recommendations

In [11]:
features = ['danceability', 'energy', 'loudness', 'mode', 'speechiness', 'acousticness', 
            'instrumentalness', 'liveness', 'valence', 'artist_popularity', 'genres', 'followers', 'tempo', 'key']

In [49]:
#Step 1 Normalize features with values over 1 (each column is divided by its overall L2 norm)
combined['artist_popularity'] = pd.DataFrame(preprocessing.normalize([combined['artist_popularity']]).T)
combined['loudness'] = pd.DataFrame(preprocessing.normalize([combined['loudness']]).T)
combined['followers'] = pd.DataFrame(preprocessing.normalize([combined['followers']]).T)
combined['tempo'] = pd.DataFrame(preprocessing.normalize([combined['tempo']]).T)
combined['key'] = pd.DataFrame(preprocessing.normalize([combined['key']]).T)
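
`preprocessing.normalize` applied this way divides each column by its overall L2 norm, which leaves values very small when the column has roughly 164k rows. If the goal is to bring each feature into the 0 to 1 range, min-max scaling is one alternative. A hedged sketch on three rows copied from the tables above (`toy`, `cols_to_scale`, and `new_song` are illustrative names); keeping the fitted scaler around also allows the same transform to be applied to an input song later:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Three rows taken from the notebook's own output tables
toy = pd.DataFrame({
    'artist_popularity': [76.0, 42.0, 4.0],
    'followers': [3551511.0, 18048.0, 560524.0],
    'key': [5.0, 11.0, 10.0],
    'loudness': [-10.656, -9.945, -20.096],
    'tempo': [68.963, 84.992, 80.954],
})
cols_to_scale = ['artist_popularity', 'followers', 'key', 'loudness', 'tempo']

# Fit once on the library, rescaling each column to the range [0, 1]
scaler = MinMaxScaler()
scaled = toy.copy()
scaled[cols_to_scale] = scaler.fit_transform(toy[cols_to_scale])

# Reuse the fitted scaler on an input song (values outside the fitted
# range will land outside [0, 1])
new_song = pd.DataFrame({
    'artist_popularity': [76.0], 'followers': [2822455.0], 'key': [0.0],
    'loudness': [-6.002], 'tempo': [104.027],
})
new_song_scaled = scaler.transform(new_song[cols_to_scale])

print(scaled)
print(new_song_scaled)
```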

In [13]:
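#Step 2 TF-IDF Genre Category
#Note: ''.join fuses list items with no separator, which likely explains tokens such as 'zelectropoppoppost' in the output
combined['genres'] = [''.join(x) for x in combined['genres']]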

vectorizer = TfidfVectorizer(max_features = 5000)
vectors = vectorizer.fit_transform(combined['genres'])
vocab = vectorizer.vocabulary_

genre_tfdf = pd.DataFrame(vectors.toarray(), columns = [vectorizer.get_feature_names_out()])

genre_tfdf.reset_index(drop = True, inplace=True)
genre_tfdf.iloc[0]

21st                  0.0
6indie                0.0
abstract              0.0
accordion             0.0
acid                  0.0
                     ... 
zelectropoppoppost    0.0
zgen                  0.0
zindie                0.0
zmodern               0.0
zuliana               0.0
Name: 0, Length: 5000, dtype: float64

In [14]:
combined.reset_index(inplace = True, drop = True)
temp = combined['trackName']
combined.drop(columns = ['genres', 'trackName'], inplace = True)

In [15]:
#Step 3 Combine numeric features and genre TF-IDF columns into a single frame
df = pd.concat([combined, genre_tfdf], axis = 1)
df.index = temp

In [17]:
combined.head()

Unnamed: 0,trackName,genres,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...","[classical, post-romantic era, russian romanti...",0.000293,0.0002106536,0.279,0.211,0.004447,-0.004352,1.0,0.0366,0.982,0.878,0.665,0.0594,0.001869
1,Clancy Lowered the Boom,[],0.000366,2.070743e-07,0.819,0.341,0.003113,-0.002694,1.0,0.415,0.732,0.0,0.16,0.963,0.001407
2,Gati Bali,[javanese gamelan],0.000366,1.646072e-07,0.328,0.166,0.001334,-0.003216,1.0,0.0339,0.961,0.913,0.101,0.0394,0.002548
3,Danny Boy,[],0.00022,0.0,0.275,0.309,0.002223,-0.002017,1.0,0.0354,0.967,2.8e-05,0.381,0.165,0.002311
4,When Irish Eyes Are Smiling,[],0.000146,9.771203e-09,0.418,0.193,0.001334,-0.002186,1.0,0.038,0.957,2e-06,0.229,0.253,0.002347


In [17]:
df.head()

Unnamed: 0_level_0,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,...,"(zdance,)","(zelectropopgauze,)","(zelectropopindie,)","(zelectropopla,)","(zelectropoppop,)","(zelectropoppoppost,)","(zgen,)","(zindie,)","(zmodern,)","(zuliana,)"
trackName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",0.000293,0.0002106536,0.279,0.211,0.004447,-0.004352,1.0,0.0366,0.982,0.878,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Clancy Lowered the Boom,0.000366,2.070743e-07,0.819,0.341,0.003113,-0.002694,1.0,0.415,0.732,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Gati Bali,0.000366,1.646072e-07,0.328,0.166,0.001334,-0.003216,1.0,0.0339,0.961,0.913,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Danny Boy,0.00022,0.0,0.275,0.309,0.002223,-0.002017,1.0,0.0354,0.967,2.8e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
When Irish Eyes Are Smiling,0.000146,9.771203e-09,0.418,0.193,0.001334,-0.002186,1.0,0.038,0.957,2e-06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Recommendation System -- No Genres --

In [197]:
#Create a DataFrame without genres
temp = combined.drop(columns = ['genres', 'Song'])
temp.index = combined['Song']
temp.head()

Unnamed: 0_level_0,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
Song,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve-['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']",4.0,560524.0,0.279,0.211,10.0,-20.096,1.0,0.0366,0.982,0.878,0.665,0.0594,80.954
Clancy Lowered the Boom-['Dennis Day'],5.0,551.0,0.819,0.341,7.0,-12.441,1.0,0.415,0.732,0.0,0.16,0.963,60.936
Gati Bali-['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat'],5.0,438.0,0.328,0.166,3.0,-14.85,1.0,0.0339,0.961,0.913,0.101,0.0394,110.339
Danny Boy-['Frank Parker'],3.0,0.0,0.275,0.309,5.0,-9.316,1.0,0.0354,0.967,2.8e-05,0.381,0.165,100.109
When Irish Eyes Are Smiling-['Phil Regan'],2.0,26.0,0.418,0.193,3.0,-10.096,1.0,0.038,0.957,2e-06,0.229,0.253,101.665


### Get Recommendations

In [93]:
temp.dropna(inplace = True)
temp.shape

(164279, 13)

In [195]:
temp.index[-1]

'2 Be Loved (Am I Ready)-Lizzo'

In [198]:
#calculate similarity scores
similarity = cosine_similarity(temp, temp.iloc[[-1]])

# sim = pd.DataFrame(similarity, index = temp.index, columns = temp.iloc[[0]].index)
sim = pd.DataFrame(similarity, index = temp.index, columns = [temp.index[-1]])

In [199]:
sim.sort_values(by = temp.index[-1], ascending = False).head(-20)

Unnamed: 0_level_0,2 Be Loved (Am I Ready)-Lizzo
Song,Unnamed: 1_level_1
2 Be Loved (Am I Ready)-Lizzo,1.000000
Sex on Fire-['Kings of Leon'],1.000000
Truth Hurts-['Lizzo'],1.000000
Unforgettable-['Thomas Rhett'],1.000000
Truth Hurts-Lizzo,1.000000
...,...
Mago Amar Ei Bhabna-['Narayan Chandra Mukherjee'],0.000029
Kapitel 18 - Oktoberfrühling-['Kurt Heynicke'],0.000029
"Maran Shiyare Daladali Kare, Pt. 1-['Prodyut Narayan']",0.000029
Bighore Behare Charinu Ek-['Balaidas Sil'],0.000029


In [203]:
#Recommendation Function- Update this to only return the Song not the similarity score
def get_recommendations(library, user_input):
    #calculate similarity between every library song and the input song
    similarity = cosine_similarity(library, user_input)
    
    #create a similarity DataFrame with one column named after the input song
    sim = pd.DataFrame(similarity, index = library.index, columns = [user_input.index[0]])
    
    #filter to top 5, skipping the first row (assumed to be the input song itself)
    recs = sim.sort_values(by = user_input.index[0], ascending = False)[1:6]
    
    return recs

In [201]:
get_recommendations(temp, test)

Unnamed: 0_level_0,Have You Ever Loved A Woman
Song,Unnamed: 1_level_1
What Is Love? - Extended Mix-['Howard Jones'],1.0
Ice Ice Baby-['Vanilla Ice'],1.0
Honey White-['Morphine'],1.0
September In The Rain-['Dinah Washington'],1.0
Suburban Home-['Descendents'],1.0
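
Following the note in the function's comment, one way to return only song names is sketched below. It also drops the query song by label rather than by position, so the best match is not skipped when the input song is absent from the library. `recommend_songs` and the toy data are hypothetical, not the notebook's code:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_songs(library, user_input, n=5):
    """Return the names of the n library songs most similar to user_input."""
    # One similarity score per library row against the single input row
    scores = cosine_similarity(library, user_input)[:, 0]
    ranked = pd.Series(scores, index=library.index).sort_values(ascending=False)
    # Remove the query itself by label, if it is in the library at all
    ranked = ranked.drop(labels=user_input.index[0], errors='ignore')
    return ranked.head(n).index.tolist()

# Toy library with two features per song
toy_library = pd.DataFrame(
    {'danceability': [0.9, 0.1, 0.8], 'energy': [0.8, 0.2, 0.9]},
    index=['A', 'B', 'C'],
)
query = toy_library.loc[['A']]
top_two = recommend_songs(toy_library, query, n=2)
print(top_two)
```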


In [159]:
with open('../models/get_recommendation.pkl', 'wb') as f:
    pickle.dump(get_recommendations, f)

In [202]:
with open('../models/library.pkl', 'wb') as f:
    pickle.dump(temp, f)

In [None]:
1. Ask user to input artist and track name- Function Built
2. Lookup features needed to calculate similarity - Function Built
3. Cosine Similarity between user input song and library
4. Return top 5 recommendations

In [105]:
%store -r test

#### Recommendations are not great. Some things to experiment with:
<br>Add genre
<br>Normalize features. This will have to be done on the input song as well, likely manually with a function
<br>Check recommendations against the library list and only surface recommendations in the same genre
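
One likely reason the scores above all come out near 1.0: `temp` holds raw values, and `followers` (in the hundreds of thousands to millions) dwarfs every other feature, so the cosine is driven almost entirely by that one column. A small sketch using three rows from the tables above, comparing raw features with min-max scaled ones (the scaling choice is an assumption, not something the notebook does):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Columns: followers, danceability, energy, valence, tempo
songs = np.array([
    [3551511.0, 0.498, 0.292, 0.348, 68.963],   # Transform - Daniel Caesar
    [2822455.0, 0.849, 0.707, 0.602, 104.027],  # Essence - Wizkid
    [560524.0, 0.279, 0.211, 0.0594, 80.954],   # Piano Concerto No. 3
])

# Raw features: the followers column dominates every vector
raw_sim = cosine_similarity(songs)

# Each column rescaled to [0, 1] before comparing
scaled_sim = cosine_similarity(MinMaxScaler().fit_transform(songs))

print(raw_sim.round(6))
print(scaled_sim.round(3))
```

With only three rows the scaled scores are not meaningful as recommendations; the point is only that they spread out once no single column dominates.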

In [160]:
def test_func(number1,number2):
    return number1 + number2
    

In [161]:
with open('../models/test.pkl', 'wb') as f:
    pickle.dump(test_func, f)

In [166]:
with open('../get_users_track.pkl','rb') as io:
    pickle.load(io)

AttributeError: Can't get attribute 'get_users_track' on <module '__main__'>
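
The `AttributeError` is consistent with how `pickle` handles functions: it stores only a reference (module name plus function name), not the function body, so the loading session must already have a function of that name defined or importable. A small sketch using the built-in `len` to show what actually gets stored:

```python
import pickle

# Pickling a function records where to find it, not its code
payload = pickle.dumps(len)
print(payload)

# The payload contains just the module name and the function name
stores_name_only = b"builtins" in payload and b"len" in payload

# Loading works here only because `builtins.len` exists in this session
restored = pickle.loads(payload)
print(restored is len)
```

A common workaround is to keep functions such as `get_recommendations` in an importable module (for example a hypothetical `recommender.py`) and pickle only data such as the `temp` DataFrame.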