### Recommendation Engine

Overview: The idea behind a song recommendation system is to identify songs that a user may enjoy based on their listening preferences. To accomplish this, I will pull in my listening history and compare song features from that history with songs that are not in it, then recommend the songs that are most like the ones I listen to.

Step 1: Combine my listening history with a random library of songs 
<br>Step 2: Preprocess the data so that each desired feature is considered
<br>Step 3: Create a cosine similarity matrix
<br>Step 4: Create a function to make recommendations given a selected song
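
As a rough illustration of Steps 3 and 4, the sketch below builds a cosine similarity matrix for three made-up songs and ranks the others against the first one. The feature values and the three-feature layout are hypothetical and are not taken from the data used in this notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature vectors: [danceability, energy, valence]
songs = np.array([
    [0.8, 0.7, 0.6],    # song A
    [0.4, 0.35, 0.3],   # song B: same direction as A, half the magnitude
    [0.1, 0.9, 0.2],    # song C
])

# Pairwise cosine similarity: shape (3, 3), symmetric, ones on the diagonal
sim = cosine_similarity(songs)

# Rank the other songs by similarity to song A (row 0), skipping A itself
order = np.argsort(-sim[0])
ranked = [int(i) for i in order if i != 0]
print(ranked)
```

Cosine similarity compares the direction of two vectors and ignores their length, which is why song B scores as a perfect match for song A even though every value is halved.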

### Import Libraries & Data

In [1]:
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity, cosine_distances
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

In [2]:
%store -r history
%store -r kaggle

In [33]:
history.drop_duplicates(subset = 'trackID', inplace = True)
history.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1426 entries, 0 to 3796
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   endTime            1426 non-null   object 
 1   artistName         1426 non-null   object 
 2   trackName          1426 non-null   object 
 3   msPlayed           1426 non-null   int64  
 4   count              1426 non-null   int64  
 5   artist_id          1426 non-null   object 
 6   genres             1426 non-null   object 
 7   artist_popularity  1426 non-null   float64
 8   followers          1426 non-null   float64
 9   trackID            1426 non-null   object 
 10  danceability       1426 non-null   float64
 11  energy             1426 non-null   float64
 12  key                1426 non-null   float64
 13  loudness           1426 non-null   float64
 14  mode               1426 non-null   float64
 15  speechiness        1426 non-null   float64
 16  acousticness       1426 

In [4]:
history.head(1)

Unnamed: 0,endTime,artistName,trackName,msPlayed,count,artist_id,genres,artist_popularity,followers,trackID,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,2021-10-26 23:02,Daniel Caesar,Transform (feat. Charlotte Day Wilson),277984,1,20wkVLutqVOYrc0kxFs7rA,"[canadian contemporary r&b, pop, r&b]",76.0,3551511.0,1jQfgl9WRle7D8a3GXLwaD,...,-10.656,1.0,0.031,0.511,1.9e-05,0.256,0.348,68.963,280587.0,4.0


In [32]:
kaggle.dropna(inplace = True)
kaggle.drop_duplicates(subset = 'id', inplace = True)
kaggle.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 162853 entries, 0 to 170652
Data columns (total 23 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   valence           162853 non-null  float64
 1   year              162853 non-null  int64  
 2   acousticness      162853 non-null  float64
 3   artists           162853 non-null  object 
 4   danceability      162853 non-null  float64
 5   duration_ms       162853 non-null  int64  
 6   energy            162853 non-null  float64
 7   explicit          162853 non-null  int64  
 8   id                162853 non-null  object 
 9   instrumentalness  162853 non-null  float64
 10  key               162853 non-null  int64  
 11  liveness          162853 non-null  float64
 12  loudness          162853 non-null  float64
 13  mode              162853 non-null  int64  
 14  name              162853 non-null  object 
 15  popularity        162853 non-null  int64  
 16  release_date      16

In [34]:
historytemp = history.drop(columns = ['endTime', 'artistName', 'msPlayed', 'count', 'artist_id', 'duration_ms', 'time_signature', 'trackID'])
historytemp.head()

Unnamed: 0,trackName,genres,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,Transform (feat. Charlotte Day Wilson),"[canadian contemporary r&b, pop, r&b]",76.0,3551511.0,0.498,0.292,5.0,-10.656,1.0,0.031,0.511,1.9e-05,0.256,0.348,68.963
1,Coconut Water,"[alternative r&b, chill r&b, indie r&b]",42.0,18048.0,0.724,0.432,11.0,-9.945,1.0,0.203,0.0876,3.2e-05,0.0828,0.696,84.992
2,Positions,"[alternative r&b, chill r&b, indie r&b]",42.0,18048.0,0.639,0.535,1.0,-7.714,1.0,0.0596,0.489,0.00126,0.109,0.321,84.975
3,Essence (feat. Tems),"[afro dancehall, afropop, azonto, nigerian hip...",76.0,2822455.0,0.849,0.707,0.0,-6.002,1.0,0.113,0.0266,9e-06,0.618,0.602,104.027
4,Finesse,"[alternative r&b, indie r&b, pop r&b, r&b]",48.0,30018.0,0.795,0.486,5.0,-8.364,1.0,0.0475,0.134,2.6e-05,0.26,0.351,132.045


In [35]:
library = kaggle.drop(columns = ['year', 'artists', 'duration_ms', 'explicit', 'release_date', 'artistName', 'artist_id', 'id'])
library.head()

Unnamed: 0,valence,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,mode,name,popularity,speechiness,tempo,genre,followers
0,0.0594,0.982,0.279,0.211,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...",4,0.0366,80.954,"[classical, post-romantic era, russian romanti...",560524.0
1,0.963,0.732,0.819,0.341,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,0.415,60.936,[],551.0
2,0.0394,0.961,0.328,0.166,0.913,3,0.101,-14.85,1,Gati Bali,5,0.0339,110.339,[javanese gamelan],438.0
3,0.165,0.967,0.275,0.309,2.8e-05,5,0.381,-9.316,1,Danny Boy,3,0.0354,100.109,[],0.0
4,0.253,0.957,0.418,0.193,2e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,0.038,101.665,[],26.0


In [36]:
print(library.shape, historytemp.shape)

(162853, 15) (1426, 15)


In [37]:
library = library[['name', 'genre', 'popularity', 'followers', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
library.rename(columns = {'name':'trackName', 'popularity': 'artist_popularity', 'genre':'genres'}, inplace = True)
library.head()

Unnamed: 0,trackName,genres,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...","[classical, post-romantic era, russian romanti...",4,560524.0,0.279,0.211,10,-20.096,1,0.0366,0.982,0.878,0.665,0.0594,80.954
1,Clancy Lowered the Boom,[],5,551.0,0.819,0.341,7,-12.441,1,0.415,0.732,0.0,0.16,0.963,60.936
2,Gati Bali,[javanese gamelan],5,438.0,0.328,0.166,3,-14.85,1,0.0339,0.961,0.913,0.101,0.0394,110.339
3,Danny Boy,[],3,0.0,0.275,0.309,5,-9.316,1,0.0354,0.967,2.8e-05,0.381,0.165,100.109
4,When Irish Eyes Are Smiling,[],2,26.0,0.418,0.193,3,-10.096,1,0.038,0.957,2e-06,0.229,0.253,101.665


In [39]:
combined = pd.concat([library, historytemp], axis=0)
combined.shape

(164279, 15)

In [41]:
combined.isnull().sum()

trackName            0
genres               0
artist_popularity    0
followers            0
danceability         0
energy               0
key                  0
loudness             0
mode                 0
speechiness          0
acousticness         0
instrumentalness     0
liveness             0
valence              0
tempo                0
dtype: int64

#### Feature Engineering 

Features
<br> danceability, energy, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, genre, popularity, followers

1) Normalize features whose values fall outside the 0 to 1 range
2) Apply get_dummies or TF-IDF to the genre column
3) Combine everything into a single feature vector per song
4) Calculate cosine similarity
5) Generate recommendations
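
One possible reading of the first step is min-max scaling, which maps each column onto the 0 to 1 range. The sketch below uses `MinMaxScaler` on a toy frame. This is a swapped-in technique shown for comparison, not the call used in this notebook: `preprocessing.normalize` rescales a whole column to unit L2 length instead, so individual values end up much smaller than 1. The toy values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with hypothetical values; column names mirror the notebook's
toy = pd.DataFrame({
    "tempo": [60.0, 90.0, 120.0],
    "loudness": [-20.0, -10.0, 0.0],
    "danceability": [0.3, 0.5, 0.8],  # already within [0, 1], left untouched
})

cols_to_scale = ["tempo", "loudness"]
scaled = toy.copy()
# fit_transform returns a NumPy array, so the assignment is positional
# and does not depend on the frame's index
scaled[cols_to_scale] = MinMaxScaler().fit_transform(toy[cols_to_scale])
print(scaled)
```

With min-max scaling, the smallest value in each column becomes 0 and the largest becomes 1, so the scaled columns sit on the same footing as features such as danceability.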

In [11]:
# Column names as they appear in `combined` after the rename
features = ['danceability', 'energy', 'loudness', 'mode', 'speechiness', 'acousticness', 
            'instrumentalness', 'liveness', 'valence', 'artist_popularity', 'genres', 'followers', 'tempo', 'key']

In [15]:
# Step 1: rescale the features whose raw values fall outside [0, 1].
# preprocessing.normalize scales each column to unit L2 norm, so individual values become very small.
for col in ['artist_popularity', 'loudness', 'followers', 'tempo', 'key']:
    # [0] takes the 1-D result, so the assignment is positional rather than index-aligned
    combined[col] = preprocessing.normalize([combined[col]])[0]

In [13]:
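# Step 2: TF-IDF encode the genres
# Note: ''.join concatenates the genres with no separator, so adjacent genre names fuse into a single token
combined['genres'] = [''.join(x) for x in combined['genres']]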

vectorizer = TfidfVectorizer(max_features = 5000)
vectors = vectorizer.fit_transform(combined['genres'])
vocab = vectorizer.vocabulary_

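# Wrapping the feature names in a list makes the columns a one-level MultiIndex (displayed as tuples)
genre_tfdf = pd.DataFrame(vectors.toarray(), columns = [vectorizer.get_feature_names_out()])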

genre_tfdf.reset_index(drop = True, inplace=True)
genre_tfdf.iloc[0]

21st                  0.0
6indie                0.0
abstract              0.0
accordion             0.0
acid                  0.0
                     ... 
zelectropoppoppost    0.0
zgen                  0.0
zindie                0.0
zmodern               0.0
zuliana               0.0
Name: 0, Length: 5000, dtype: float64

In [14]:
combined.reset_index(inplace = True, drop = True)
temp = combined['trackName']
combined.drop(columns = ['genres', 'trackName'], inplace = True)

In [15]:
# Step 3: combine the numeric features and the genre TF-IDF columns into one frame, indexed by track name
df = pd.concat([combined, genre_tfdf], axis = 1)
df.index = temp

In [17]:
combined.head()

Unnamed: 0,trackName,genres,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...","[classical, post-romantic era, russian romanti...",0.000293,0.0002106536,0.279,0.211,0.004447,-0.004352,1.0,0.0366,0.982,0.878,0.665,0.0594,0.001869
1,Clancy Lowered the Boom,[],0.000366,2.070743e-07,0.819,0.341,0.003113,-0.002694,1.0,0.415,0.732,0.0,0.16,0.963,0.001407
2,Gati Bali,[javanese gamelan],0.000366,1.646072e-07,0.328,0.166,0.001334,-0.003216,1.0,0.0339,0.961,0.913,0.101,0.0394,0.002548
3,Danny Boy,[],0.00022,0.0,0.275,0.309,0.002223,-0.002017,1.0,0.0354,0.967,2.8e-05,0.381,0.165,0.002311
4,When Irish Eyes Are Smiling,[],0.000146,9.771203e-09,0.418,0.193,0.001334,-0.002186,1.0,0.038,0.957,2e-06,0.229,0.253,0.002347


In [17]:
df.head()

trackName,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,...,"(zdance,)","(zelectropopgauze,)","(zelectropopindie,)","(zelectropopla,)","(zelectropoppop,)","(zelectropoppoppost,)","(zgen,)","(zindie,)","(zmodern,)","(zuliana,)"
"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",0.000293,0.0002106536,0.279,0.211,0.004447,-0.004352,1.0,0.0366,0.982,0.878,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Clancy Lowered the Boom,0.000366,2.070743e-07,0.819,0.341,0.003113,-0.002694,1.0,0.415,0.732,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Gati Bali,0.000366,1.646072e-07,0.328,0.166,0.001334,-0.003216,1.0,0.0339,0.961,0.913,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Danny Boy,0.00022,0.0,0.275,0.309,0.002223,-0.002017,1.0,0.0354,0.967,2.8e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
When Irish Eyes Are Smiling,0.000146,9.771203e-09,0.418,0.193,0.001334,-0.002186,1.0,0.038,0.957,2e-06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Without Genres

In [16]:
combined.head()

Unnamed: 0,trackName,genres,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,"Piano Concerto No. 3 in D Minor, Op. 30: III. ...","[classical, post-romantic era, russian romanti...",0.000293,0.0002106536,0.279,0.211,0.004447,-0.004352,1.0,0.0366,0.982,0.878,0.665,0.0594,0.001869
1,Clancy Lowered the Boom,[],0.000366,2.070743e-07,0.819,0.341,0.003113,-0.002694,1.0,0.415,0.732,0.0,0.16,0.963,0.001407
2,Gati Bali,[javanese gamelan],0.000366,1.646072e-07,0.328,0.166,0.001334,-0.003216,1.0,0.0339,0.961,0.913,0.101,0.0394,0.002548
3,Danny Boy,[],0.00022,0.0,0.275,0.309,0.002223,-0.002017,1.0,0.0354,0.967,2.8e-05,0.381,0.165,0.002311
4,When Irish Eyes Are Smiling,[],0.000146,9.771203e-09,0.418,0.193,0.001334,-0.002186,1.0,0.038,0.957,2e-06,0.229,0.253,0.002347


In [42]:
# Build a separate frame indexed by track name, leaving `combined` itself unchanged
temp = combined.set_index('trackName').drop(columns = ['genres'])
temp.head()

trackName,artist_popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",4.0,560524.0,0.279,0.211,10.0,-20.096,1.0,0.0366,0.982,0.878,0.665,0.0594,80.954
Clancy Lowered the Boom,5.0,551.0,0.819,0.341,7.0,-12.441,1.0,0.415,0.732,0.0,0.16,0.963,60.936
Gati Bali,5.0,438.0,0.328,0.166,3.0,-14.85,1.0,0.0339,0.961,0.913,0.101,0.0394,110.339
Danny Boy,3.0,0.0,0.275,0.309,5.0,-9.316,1.0,0.0354,0.967,2.8e-05,0.381,0.165,100.109
When Irish Eyes Are Smiling,2.0,26.0,0.418,0.193,3.0,-10.096,1.0,0.038,0.957,2e-06,0.229,0.253,101.665


### Get Recommendations

In [46]:
similarity = cosine_similarity(temp[0:10_000])

In [47]:
sim = pd.DataFrame(similarity, index = temp[0:10_000].index, columns = temp[0:10_000].index)

In [28]:
sim.head()

trackName,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",Clancy Lowered the Boom,Gati Bali,Danny Boy,When Irish Eyes Are Smiling,Gati Mardika,The Wearing of the Green,"Morceaux de fantaisie, Op. 3: No. 2, Prélude in C-Sharp Minor. Lento",La Mañanita - Remasterizado,Il Etait Syndiqué,...,Master Blaster (Jammin'),Play for Today - 2006 Remaster,El Noa Noa,Every Woman in the World,"Yeh, Yeh",Kill the Poor,Sometimes a Fantasy,Don't Let Him Go,Skateaway,More Than I Can Say - Remastered
"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",1.0,0.57054,0.717702,0.752065,0.728752,0.577175,0.612009,0.846847,0.399564,0.591482,...,0.156113,0.151415,0.465207,0.483041,0.451963,0.458305,0.431955,0.428636,0.407876,0.457912
Clancy Lowered the Boom,0.57054,1.0,0.571433,0.810696,0.864174,0.727525,0.766786,0.411438,0.612019,0.851258,...,0.591055,0.548424,0.750526,0.69282,0.774961,0.649779,0.751227,0.731085,0.717001,0.773072
Gati Bali,0.717702,0.571433,1.0,0.709496,0.716964,0.877794,0.593903,0.511755,0.398055,0.549621,...,0.134602,0.135675,0.47283,0.480163,0.455522,0.415414,0.368995,0.421003,0.409576,0.429664
Danny Boy,0.752065,0.810696,0.709496,1.0,0.985696,0.739299,0.80952,0.515718,0.513346,0.795755,...,0.248416,0.257127,0.581803,0.674042,0.652626,0.643336,0.598527,0.630935,0.598999,0.639084
When Irish Eyes Are Smiling,0.728752,0.864174,0.716964,0.985696,1.0,0.764605,0.836732,0.514139,0.55395,0.817638,...,0.30253,0.282239,0.608102,0.691015,0.682636,0.615639,0.599819,0.634707,0.61598,0.664341


In [20]:
# combined.loc[['BREAK MY SOUL']]

In [21]:
# combined.loc[['BREAK MY SOUL']].index

In [32]:
similarity_bms = cosine_similarity(df[0:10_000])

MemoryError: Unable to allocate 4.79 GiB for an array with shape (5013, 128295) and data type float64
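
One way to sidestep this kind of memory error is to keep the genre columns sparse and to score a single query song against all tracks, rather than building a full dense matrix. The sketch below uses random stand-in data. The array sizes, the `query_row` variable and the choice of five results are all assumptions made for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 1,000 tracks, 13 numeric features, 5,000 sparse genre columns
numeric = rng.random((1000, 13))
genre_tfidf = sparse.random(1000, 5000, density=0.001, format="csr", random_state=0)

# Stack into one sparse matrix instead of a dense DataFrame
features = sparse.hstack([sparse.csr_matrix(numeric), genre_tfidf], format="csr")

# Compare one query track against all tracks: the result is (n_tracks, 1), not (n, n)
query_row = 0
scores = cosine_similarity(features, features[query_row]).ravel()

# Indices of the five most similar tracks, excluding the query itself
top = [int(i) for i in np.argsort(-scores) if i != query_row][:5]
print(top)
```

The output of `TfidfVectorizer.fit_transform` is already a sparse matrix, so skipping `.toarray()` would let it be stacked with the numeric features in the same way.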

In [23]:
# sim.sort_values(by = 'BREAK MY SOUL', ascending = False)

In [None]:
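def get_recommendation(song, n = 10):
    # Similarity of every track to the selected song (a single column, not a full matrix)
    similarity = cosine_similarity(df, df.loc[[song]])
    # Assumed completion: rank all tracks by that score and return the top n, excluding the song itself
    scores = pd.Series(similarity[:, 0], index = df.index)
    return scores.drop(index = song).nlargest(n)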