### Recommendation Engine

Overview: A song recommendation system identifies songs that a user may enjoy based on their listening preferences. To do this, I pull in my listening history and a song library, compare them against a user-selected song, and generate recommendations.

Step 1: Combine my listening history with a random library of songs 
<br>Step 2: Create a function to preprocess the data, calculate similarity and generate recommendations 
<br>Step 3: Evaluate the recommendations

### Import Libraries & Data

In [1]:
import pandas as pd
import numpy as np
import pickle

from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity, cosine_distances
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
#Load in the data
%store -r kaggle
%store -r extended
%store -r track1
%store -r track2

#### Combine Extended Listening History and Song Library

In [3]:
#Dropping unnecessary columns and combining listening history and library
history = extended.drop(columns = ['listen_count', 'year', 'ts'])
library = kaggle.drop(columns = ['Unnamed: 0','year', 'duration_ms', 'explicit', 'release_date', 'artistName'])

library = library[['id', 'artist_id','name', 'artists', 'genre', 'popularity', 'followers', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
library.rename(columns = {'name':'trackName','artists':'artistName'}, inplace = True)
history = history[['trackID', 'artist_id','master_metadata_track_name', 'master_metadata_album_artist_name', 'genre', 'popularity', 'followers', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
history.rename(columns = {'trackID': 'id','master_metadata_track_name':'trackName', 'master_metadata_album_artist_name': 'artistName'}, inplace = True)

#Join the library and listening history into one dataframe
combined = pd.concat([library, history], axis=0)
combined.drop_duplicates(subset = 'id', inplace = True)

#Combine track and artist to easily identify songs
combined['song'] = combined['trackName'] + '-' + combined['artistName'] 
combined.drop(columns = ['trackName', 'id', 'artistName'], inplace = True)
combined.reset_index(inplace = True, drop = True)
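
In the outputs further down, library tracks display as `Scared-['Russ']` while history tracks display as `The Unraveling-The Roots`, which suggests the library's `artistName` values are stringified lists. Below is a minimal sketch of one way to make the `song` label consistent, assuming that format. `clean_artists` is a hypothetical helper, not part of the original notebook, and the two toy rows only mimic the formats seen in the outputs.

```python
import ast

import pandas as pd


def clean_artists(value):
    """Turn "['SZA', 'Kendrick Lamar']" into "SZA, Kendrick Lamar"; pass plain names through."""
    if isinstance(value, str) and value.startswith('['):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # leave anything unparseable untouched
        if isinstance(parsed, list):
            return ', '.join(str(name) for name in parsed)
    return value


# Toy rows mimicking the two artistName formats (assumed, for illustration)
toy = pd.DataFrame({
    'trackName': ['Scared', 'The Unraveling'],
    'artistName': ["['Russ']", 'The Roots'],
})
toy['song'] = toy['trackName'] + '-' + toy['artistName'].map(clean_artists)
print(toy['song'].tolist())
```

Applied before building `song`, a step like this would give every label the same `track-artist` shape regardless of which source the row came from.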

### Recommendation System 1 -- No Genres --
For the first pass I build a recommendation system that does not consider the artist's genre. I append the user input song to the combined library, rescale popularity, loudness, followers, tempo and key with scikit-learn's `normalize` (each column is scaled to unit L2 length, so the values are small but not strictly between 0 and 1; loudness stays negative), and calculate the cosine similarity between every track and the input. I also exclude tracks by the same artist as the input song and restrict the recommendations to the 5 highest similarity scores.

In [4]:
#Recommendation Function 1
def recommender1(library, user_input):
    combined = pd.concat([library, user_input])
    df = combined.drop(columns = ['artist_id', 'genre', 'song'])
    df.reset_index(inplace = True, drop = True)
    
    #normalize features (each column is treated as one vector and rescaled to unit L2 norm)
    df['popularity'] = pd.DataFrame(preprocessing.normalize([df['popularity']]).T)
    df['loudness'] = pd.DataFrame(preprocessing.normalize([df['loudness']]).T)
    df['followers'] = pd.DataFrame(preprocessing.normalize([df['followers']]).T)
    df['tempo'] = pd.DataFrame(preprocessing.normalize([df['tempo']]).T)
    df['key'] = pd.DataFrame(preprocessing.normalize([df['key']]).T)
    
    #calculate similarity
    similarity = cosine_similarity(df, df.iloc[[-1]])
    
    #create a similarity DataFrame
    sim = pd.DataFrame(similarity, index = df.index, columns = [user_input.index[0]])
    
    #filter out songs where the artist is the same as the input artist
    combined['score'] = sim.iloc[:,0].reset_index(drop= True)
    combined = combined[combined['artist_id'] != combined['artist_id'].iloc[-1]]
    
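    #filter to top 5
    #note: the input row was already dropped by the artist filter above, so [1:6]
    #skips the single highest-scoring remaining track; .head(5) would keep it
    recs = combined.sort_values(by = 'score', ascending = False)[1:6]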
    
    return recs
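
To make the scaling step concrete, here is a small sketch on made-up numbers (the three toy tracks are assumptions, not rows from the library). It shows that `preprocessing.normalize([column])` rescales the whole column to unit L2 length, so negative values such as loudness stay negative, and then ranks rows by cosine similarity to the last row, the same pattern `recommender1` follows.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature values for illustration only
toy = pd.DataFrame({
    'danceability': [0.70, 0.20, 0.68],
    'loudness': [-11.0, -4.0, -12.0],
    'tempo': [85.0, 170.0, 83.0],
}, index=['track_a', 'track_b', 'query'])

scaled = toy.copy()
for col in ['loudness', 'tempo']:
    # normalize() treats the column as one vector and divides by its L2 norm
    scaled[col] = preprocessing.normalize([toy[col]])[0]

# Each scaled column now has unit length; loudness values remain negative
print(np.linalg.norm(scaled['tempo']), (scaled['loudness'] < 0).all())

# Similarity of every row to the query (last row), highest first
scores = cosine_similarity(scaled, scaled.iloc[[-1]]).ravel()
ranking = pd.Series(scores, index=scaled.index).drop('query').sort_values(ascending=False)
print(ranking)
```

If features strictly in the 0 to 1 range are wanted instead, scikit-learn's `MinMaxScaler` is the usual tool; that would be a change to the approach rather than a description of what the notebook does.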

In [5]:
track1

Unnamed: 0,artist_id,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
CPR - Sped Up,57LYzLEk2LcFghVwuWbcuS,"[pop, r&b, rap]",78,4250014,0.693,0.437,10,-11.564,0,0.222,0.701,0.0102,0.123,0.282,83.173


In [6]:
recommender1(combined, track1)

Unnamed: 0,artist_id,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,song,score
103083,12TcsyqAd3lJUcZFJVogXZ,['lo-fi beats'],63.0,37110.0,0.662,0.349,5.0,-14.022,0.0,0.214,0.637,0.00213,0.124,0.234,141.951,I Fall in Love Too Easliy-['Caleb Belkin'],0.998054
118446,1z7b1Pr1rSlvWRzsW3HOrS,"['hawaiian hip hop', 'rap']",60.0,4579258.0,0.597,0.418,5.0,-10.03,0.0,0.177,0.652,0.0,0.12,0.313,129.948,Scared-['Russ'],0.99742
36275,4V8LLVI7PbaPR0K2TGSxFF,"['hip hop', 'rap']",68.0,9825320.0,0.686,0.429,8.0,-10.402,0.0,0.15,0.71,3e-06,0.166,0.27,130.09,Where This Flower Blooms (feat. Frank Ocean)-[...,0.997157
111758,3rxIQc9kWT6Ueg4BhnOwRK,"['disco', 'funk', 'quiet storm', 'soul']",32.0,283806.0,0.438,0.27,2.0,-19.46,0.0,0.129,0.513,0.0129,0.0738,0.169,174.306,Velas-['Quincy Jones'],0.996587
167119,78xUyw6FkVZrRAtziFdtdu,"['alternative hip hop', 'conscious hip hop', '...",58.0,1228638.0,0.654,0.34,5.0,-16.453,0.0,0.157,0.737,0.0111,0.103,0.266,74.971,The Unraveling-The Roots,0.994188


In [7]:
track2

Unnamed: 0,artist_id,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
Lose Control,0xRXCcSX89eobfrshSVdyu,"[dance pop, edm, electro house, pop, pop dance...",75,567423,0.598,0.526,10,-8.659,0,0.0415,0.129,0,0.14,0.529,123.935


In [8]:
recommender1(combined, track2)

Unnamed: 0,artist_id,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,song,score
126435,6IRouO5mvvfcyxtPDKMYFN,"['album rock', 'classic rock', 'hard rock', 'm...",24.0,2586503.0,0.603,0.529,11.0,-7.393,0.0,0.0447,0.137,1.9e-05,0.116,0.537,118.52,Spellbinder-['Foreigner'],0.999631
47173,3eqjTLE0HfPfh78zjh6TqT,"['classic rock', 'heartland rock', 'mellow gol...",44.0,5124548.0,0.479,0.423,11.0,-13.126,0.0,0.0357,0.12,3.2e-05,0.104,0.407,116.06,Point Blank-['Bruce Springsteen'],0.999509
36081,1Xyo4u8uXC1ZmMpatF05PJ,"['canadian contemporary r&b', 'canadian pop', ...",67.0,51962229.0,0.586,0.525,1.0,-7.163,0.0,0.0615,0.111,0.0,0.134,0.508,133.629,Die For You-['The Weeknd'],0.999384
18035,1Xyo4u8uXC1ZmMpatF05PJ,"['canadian contemporary r&b', 'canadian pop', ...",77.0,51962229.0,0.582,0.525,1.0,-7.163,0.0,0.0627,0.111,0.0,0.134,0.51,133.249,Die For You-['The Weeknd'],0.99937
10645,6IRouO5mvvfcyxtPDKMYFN,"['album rock', 'classic rock', 'hard rock', 'm...",68.0,2586503.0,0.516,0.486,3.0,-11.81,0.0,0.0272,0.0987,1e-06,0.143,0.471,132.542,Cold as Ice-['Foreigner'],0.999151


#### Recommendation System 2 -- Genre --
Similar to Model 1, but here I also use genre when calculating similarity by running a count vectorizer on the genre field. In my initial runs most recommendations came from the same artist as the input track, so I added a filter to only recommend songs from a different artist.

In [9]:
def recommender2(library, user_input):    
    #Join Library and User_Input (work on a copy so the caller's DataFrame is not modified)
    user_input = user_input.copy()
    user_input['genre'] = str(user_input['genre'].iloc[0])
    user_input['song'] = user_input.index
    combined = pd.concat([library, user_input])
    df = combined.drop(columns = ['artist_id'])
    df.reset_index(inplace = True, drop = True)
    
    #normalize features (each column is treated as one vector and rescaled to unit L2 norm)
    df['popularity'] = pd.DataFrame(preprocessing.normalize([df['popularity']]).T)
    df['loudness'] = pd.DataFrame(preprocessing.normalize([df['loudness']]).T)
    df['followers'] = pd.DataFrame(preprocessing.normalize([df['followers']]).T)
    df['tempo'] = pd.DataFrame(preprocessing.normalize([df['tempo']]).T)
    df['key'] = pd.DataFrame(preprocessing.normalize([df['key']]).T)
    
    #Count Vectorize library and input    
    vectorizer = CountVectorizer(max_features = 200)
    genre = vectorizer.fit_transform(df['genre'])

    #Create genre dataframe
    genre_df = pd.DataFrame(genre.toarray(), columns =[vectorizer.get_feature_names_out()])

    #Add genre count to extended df
    combined_genre = pd.concat([df, genre_df], axis = 1)
    combined_genre.index = combined_genre['song']
    #combined_genre.reset_index(inplace = True, drop = True)
    combined_genre.drop(columns = ['genre', 'song'], inplace = True)
    
    #calculate similarity
    similarity = cosine_similarity(combined_genre, combined_genre.iloc[[-1]])
    
    #create a similarity DataFrame
    sim = pd.DataFrame(similarity, index = combined_genre.index, columns = [user_input.index[0]])
    
    #filter out songs where the artist is the same as the input artist
    combined['score'] = sim.iloc[:,0].reset_index(drop= True)
    combined = combined[combined['artist_id'] != combined['artist_id'].iloc[-1]]
    
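    #filter to top 5
    #note: the input row was already dropped by the artist filter above, so [1:6]
    #skips the single highest-scoring remaining track; .head(5) would keep it
    recs = combined.sort_values(by = 'score', ascending = False)[1:6]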
    recs.reset_index(inplace = True)
    
    return recs

#### Recommendation Evaluation
In order to evaluate the recommendations I will pull a few songs using my get_users_track function from the data collection notebook, store the results and get the recommendations here. I will then listen to the input song and each recommendation to evaluate how reasonable the recommendations are. 
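
Listening remains the main evaluation here, but a rough numeric companion can help when comparing the two models. The sketch below computes the Jaccard overlap between the input track's genre list and a recommendation's genre list. `genre_overlap` is a hypothetical helper, and it assumes library genres are stored as stringified lists, as the outputs suggest.

```python
import ast


def genre_overlap(input_genres, rec_genres):
    """Jaccard overlap between two genre lists (or their string representations)."""
    def to_set(value):
        if isinstance(value, str):
            value = ast.literal_eval(value)
        return set(value)

    a, b = to_set(input_genres), to_set(rec_genres)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Track 1's genres against genre values of two recommended tracks from this notebook
same_genres = genre_overlap(['pop', 'r&b', 'rap'], "['pop', 'r&b', 'rap']")
partial = genre_overlap(['pop', 'r&b', 'rap'], "['hawaiian hip hop', 'rap']")
print(same_genres, partial)
```

A score like this only measures exact label matches, so it would not credit closely related genres; it is a sanity check, not a replacement for listening.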

##### Track 1
Track 1 is an R&B song by Summer Walker called CPR. It is a slow, smooth, groovy track. On Track 1 both models perform fairly well. While Model 1 surfaces some songs that aren't necessarily a direct genre match, every song feels like a reasonable recommendation. I would, however, give the edge to Model 2, which provided some very accurate recommendations.

If I were to rank them, I would move Model 2's top recommendation to the 4th spot, as all the SZA songs had a very similar sound and feel to the input track, CPR by Summer Walker. Although that first track is Korean R&B, I quite enjoyed it: the language is different, but the instrumentals and overall sound are quite similar to the input.

In [10]:
track1

Unnamed: 0,artist_id,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
CPR - Sped Up,57LYzLEk2LcFghVwuWbcuS,"[pop, r&b, rap]",78,4250014,0.693,0.437,10,-11.564,0,0.222,0.701,0.0102,0.123,0.282,83.173


In [11]:
recommender1(combined, track1)

Unnamed: 0,artist_id,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,song,score
103083,12TcsyqAd3lJUcZFJVogXZ,['lo-fi beats'],63.0,37110.0,0.662,0.349,5.0,-14.022,0.0,0.214,0.637,0.00213,0.124,0.234,141.951,I Fall in Love Too Easliy-['Caleb Belkin'],0.998054
118446,1z7b1Pr1rSlvWRzsW3HOrS,"['hawaiian hip hop', 'rap']",60.0,4579258.0,0.597,0.418,5.0,-10.03,0.0,0.177,0.652,0.0,0.12,0.313,129.948,Scared-['Russ'],0.99742
36275,4V8LLVI7PbaPR0K2TGSxFF,"['hip hop', 'rap']",68.0,9825320.0,0.686,0.429,8.0,-10.402,0.0,0.15,0.71,3e-06,0.166,0.27,130.09,Where This Flower Blooms (feat. Frank Ocean)-[...,0.997157
111758,3rxIQc9kWT6Ueg4BhnOwRK,"['disco', 'funk', 'quiet storm', 'soul']",32.0,283806.0,0.438,0.27,2.0,-19.46,0.0,0.129,0.513,0.0129,0.0738,0.169,174.306,Velas-['Quincy Jones'],0.996587
167119,78xUyw6FkVZrRAtziFdtdu,"['alternative hip hop', 'conscious hip hop', '...",58.0,1228638.0,0.654,0.34,5.0,-16.453,0.0,0.157,0.737,0.0111,0.103,0.266,74.971,The Unraveling-The Roots,0.994188


In [12]:
recommender2(combined, track1)

Unnamed: 0,index,artist_id,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,song,score
0,133231,5HenzRvMtSrgtvU16XAoby,"['k-pop', 'k-rap', 'korean r&b']",56.0,542000.0,0.538,0.436,3.0,-8.73,0.0,0.0504,0.726,0.0,0.151,0.37,87.111,Yanghwa BRDG-['Zion.T'],0.990536
1,164583,7tYKF4w9nC0nq9CsPZTHyP,"['pop', 'r&b', 'rap']",83.0,8668117.0,0.535,0.388,7.0,-9.805,0.0,0.155,0.507,0.0,0.11,0.412,106.703,I Hate U-SZA,0.988163
2,165088,7tYKF4w9nC0nq9CsPZTHyP,"['pop', 'r&b', 'rap']",83.0,8668117.0,0.577,0.523,0.0,-5.538,0.0,0.0429,0.491,0.0,0.147,0.29,134.93,Drew Barrymore-SZA,0.985938
3,117877,7tYKF4w9nC0nq9CsPZTHyP,"['pop', 'r&b', 'rap']",52.0,8662532.0,0.51,0.71,5.0,-6.891,0.0,0.113,0.629,0.0,0.0607,0.367,131.891,"Babylon (feat. Kendrick Lamar)-['SZA', 'Kendri...",0.979404
4,148017,2auC28zjQyVTsiZKNgPRGs,"['k-rap', 'pop']",65.0,11359089.0,0.447,0.263,5.0,-13.754,0.0,0.0342,0.828,0.0049,0.118,0.0721,147.161,tokyo-['RM'],0.971697


##### Track 2

Track 2 is Lose Control by Meduza. It's an EDM track with a somewhat faster pace, and although the data does not reflect it (its danceability is 0.598), I believe it is very danceable. This is where Model 1 starts to struggle. I don't think its recommendations are particularly good; the only one in the same realm is perhaps Die For You by The Weeknd. Model 2 performs much better here. All of its recommendations again appear reasonable, and the first three were exceptionally good: not only in the same genre, but with a very similar sound and vibe to the input.

In [13]:
track2

Unnamed: 0,artist_id,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
Lose Control,0xRXCcSX89eobfrshSVdyu,"[dance pop, edm, electro house, pop, pop dance...",75,567423,0.598,0.526,10,-8.659,0,0.0415,0.129,0,0.14,0.529,123.935


In [14]:
recommender1(combined, track2)

Unnamed: 0,artist_id,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,song,score
126435,6IRouO5mvvfcyxtPDKMYFN,"['album rock', 'classic rock', 'hard rock', 'm...",24.0,2586503.0,0.603,0.529,11.0,-7.393,0.0,0.0447,0.137,1.9e-05,0.116,0.537,118.52,Spellbinder-['Foreigner'],0.999631
47173,3eqjTLE0HfPfh78zjh6TqT,"['classic rock', 'heartland rock', 'mellow gol...",44.0,5124548.0,0.479,0.423,11.0,-13.126,0.0,0.0357,0.12,3.2e-05,0.104,0.407,116.06,Point Blank-['Bruce Springsteen'],0.999509
36081,1Xyo4u8uXC1ZmMpatF05PJ,"['canadian contemporary r&b', 'canadian pop', ...",67.0,51962229.0,0.586,0.525,1.0,-7.163,0.0,0.0615,0.111,0.0,0.134,0.508,133.629,Die For You-['The Weeknd'],0.999384
18035,1Xyo4u8uXC1ZmMpatF05PJ,"['canadian contemporary r&b', 'canadian pop', ...",77.0,51962229.0,0.582,0.525,1.0,-7.163,0.0,0.0627,0.111,0.0,0.134,0.51,133.249,Die For You-['The Weeknd'],0.99937
10645,6IRouO5mvvfcyxtPDKMYFN,"['album rock', 'classic rock', 'hard rock', 'm...",68.0,2586503.0,0.516,0.486,3.0,-11.81,0.0,0.0272,0.0987,1e-06,0.143,0.471,132.542,Cold as Ice-['Foreigner'],0.999151


In [15]:
recommender2(combined, track2)

Unnamed: 0,index,artist_id,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,song,score
0,36195,2XnBwblw31dfGnspMIwgWz,"['dance pop', 'edm', 'electro house', 'pop', '...",79.0,2404390.0,0.646,0.741,5.0,-4.973,0.0,0.034,0.0275,0.0,0.312,0.571,123.077,More Than You Know-['Axwell /\\ Ingrosso'],0.98406
1,18631,4ofCBoyEiGSePFAG500xev,"['dance pop', 'edm', 'electro house', 'pop', '...",87.0,197750.0,0.88,0.751,7.0,-4.258,0.0,0.0874,0.177,6.4e-05,0.106,0.884,117.948,Ride It-['Regard'],0.983788
2,101719,2XnBwblw31dfGnspMIwgWz,"['dance pop', 'edm', 'electro house', 'pop', '...",58.0,2404623.0,0.525,0.849,10.0,-7.727,0.0,0.0484,0.00232,0.00196,0.322,0.277,128.055,"Leave The World Behind - Radio Edit-['Axwell',...",0.982462
3,53805,4sTQVOfp9vEMCemLw50sbu,"['dance pop', 'edm', 'electro house', 'pop', '...",65.0,3282138.0,0.708,0.945,5.0,-3.247,0.0,0.234,0.00104,0.0803,0.227,0.545,127.96,Peanut Butter Jelly-['Galantis'],0.979322
4,118565,4sTQVOfp9vEMCemLw50sbu,"['dance pop', 'edm', 'electro house', 'pop', '...",70.0,3282138.0,0.674,0.915,6.0,-3.999,0.0,0.0411,0.0282,0.00337,0.273,0.78,126.038,No Money-['Galantis'],0.978813


#### Conclusion & Next Steps
In conclusion, I was able to build a model and Streamlit app that takes a user-supplied artist and track, compares it to the combined track library, and serves five recommendations from artists other than the input artist. While Model 1 struggles with certain genres and styles of music, Model 2, which considers genre, appears to perform reasonably well. Given additional time, I would like to make the following improvements to serve better recommendations.

Next Steps:
<br>1. Experiment with KNN models
<br>2. Pull in more data from the Spotify API to have a more robust library to choose from
<br>3. Incorporate audio data
<br>4. Find a way to capture the relationship between similar genres
<br>5. Deploy and receive user feedback on where the recommendation engine struggles
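
For the first item, a starting point could look like the sketch below, which uses scikit-learn's `NearestNeighbors` with a cosine metric. The feature matrix is random stand-in data (an assumption), not the real library, and the query is simply a rescaled copy of row 0 so the expected nearest neighbour is known.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random stand-in for an already-scaled library feature matrix (assumption)
rng = np.random.default_rng(0)
library_features = rng.random((50, 6))

# Same direction as row 0 but a different magnitude, so its cosine distance to row 0 is ~0
query = library_features[[0]] * 2.0

# Cosine distance = 1 - cosine similarity
knn = NearestNeighbors(n_neighbors=5, metric='cosine')
knn.fit(library_features)
distances, indices = knn.kneighbors(query)

print(indices[0])        # positions of the five closest library rows
print(1 - distances[0])  # the corresponding cosine similarities
```

Compared with scoring the whole library on every request, fitting the index once and querying it per input would avoid recomputing the scaling and concatenation each time; whether it improves the recommendations themselves would still need to be evaluated.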

In [16]:
#Pickled Data for streamlit app
with open('../models/library2.pkl', 'wb') as f:
    pickle.dump(combined, f)