### Recommendation Engine

Overview: A song recommendation system identifies songs that a user may enjoy based on their listening preferences. To do this, I pull in my listening history and compare the features of those songs with the features of songs that are not in my listening history, then recommend the songs most similar to the ones I listen to.

Step 1: Combine my listening history with a random library of songs 
<br>Step 2: Preprocess the data so that each desired feature is considered
<br>Step 3: Create a cosine similarity matrix
<br>Step 4: Create a function to make recommendations given a selected song
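
To make Steps 3 and 4 concrete, here is a minimal sketch on a toy table. The song names and feature values are made up for illustration and are not from my data; only `cosine_similarity` from scikit-learn is the same call used later in this notebook.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy feature table (made-up values): rows are songs, columns are audio features
songs = pd.DataFrame(
    {
        "danceability": [0.9, 0.8, 0.1],
        "energy": [0.8, 0.9, 0.2],
        "acousticness": [0.1, 0.2, 0.9],
    },
    index=["song_a", "song_b", "song_c"],
)

# Compare every song against the selected one (a one-row DataFrame)
selected = songs.loc[["song_a"]]
scores = pd.Series(cosine_similarity(songs, selected).ravel(), index=songs.index)

# Drop the selected song itself and rank the rest, most similar first
ranked = scores.drop("song_a").sort_values(ascending=False)
print(ranked)
```

Here `song_b` should rank above `song_c`, since its feature vector points in nearly the same direction as `song_a`.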

### Import Libraries & Data

In [1]:
import pandas as pd
import numpy as np
import pickle

from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity, cosine_distances
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
#Load in the data
kaggle = pd.read_csv('../data/kaggle.csv')
%store -r extended

In [3]:
#Remove Duplicates 
extended.drop_duplicates(subset = 'trackID', inplace = True)
kaggle.dropna(inplace = True)
kaggle.drop_duplicates(subset = 'id', inplace = True)

In [4]:
#Dropping unnecessary columns and combining listening history and library
historytemp = extended.drop(columns = ['listen_count', 'year', 'ts', 'artist_id', 'trackID'])
library = kaggle.drop(columns = ['Unnamed: 0','year', 'duration_ms', 'explicit', 'release_date', 'artistName', 'artist_id', 'id'])

library = library[['name', 'artists', 'genre', 'popularity', 'followers', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
library.rename(columns = {'name':'trackName','artists':'artistName'}, inplace = True)
historytemp = historytemp[['master_metadata_track_name', 'master_metadata_album_artist_name', 'genre', 'popularity', 'followers', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']]
historytemp.rename(columns = {'master_metadata_track_name':'trackName', 'master_metadata_album_artist_name': 'artistName'}, inplace = True)

#Join the library and listening history into one dataframe
combined = pd.concat([library, historytemp], axis=0)

#Combine track and artist to easily identify songs
combined['Song'] = combined['trackName'] + '-' + combined['artistName'] 
combined.drop(columns = ['trackName', 'artistName'], inplace = True)
combined.reset_index(inplace = True)

### Recommendation System -- No Genres --

In [5]:
#Create a dataframe without genres
no_genre = combined.drop(columns = ['genre', 'Song'])
no_genre.index = combined['Song']
no_genre.drop(columns = ['index'], inplace = True)
no_genre.dropna(inplace = True)

In [6]:
#calculate similarity scores
similarity = cosine_similarity(no_genre, no_genre.iloc[[-1]])

# sim = pd.DataFrame(similarity, index = temp.index, columns = temp.iloc[[0]].index)
sim = pd.DataFrame(similarity, index = no_genre.index, columns = [no_genre.index[-1]])

#return ten songs most similar to input song
sim.sort_values(by = no_genre.index[-1], ascending = False).head(10)

Song,All The Things (Your Man Won't Do)-Joe
All The Things (Your Man Won't Do)-Joe,1.0
Fallin-Macklemore,1.0
Notion-['Tash Sultana'],1.0
parents-['YUNGBLUD'],1.0
Free Fallin'-['Tom Petty'],1.0
Life Goes On-Oliver Tree,1.0
Spice Girl-['Aminé'],1.0
L-O-V-E-['Nat King Cole'],1.0
Single Saturday Night-['Cole Swindell'],1.0
She Used To Be Mine-['Sara Bareilles'],1.0


In [25]:
#Recommendation Function- Update this to only return the Song not the similarity score
def recommendations_nogenre(library, user_input):
    #calculate similarity
    similarity = cosine_similarity(library, user_input)
    
    #create a similarity DataFrame
    sim = pd.DataFrame(similarity, index = library.index, columns = [user_input.index[0]])
    
    #skip the top-ranked row and keep the next 5
    recs = sim.sort_values(by = user_input.index[0], ascending = False)[1:6]
    
    return recs

In [22]:
test.head(1)

Unnamed: 0,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
Have You Ever Loved A Woman,51,351424,0.678,0.388,1,-11.033,1,0.0273,0.675,3.4e-05,0.131,0.622,95.705


In [26]:
recommendations_nogenre(no_genre, test)

Song,Have You Ever Loved A Woman
Ice Ice Baby-['Vanilla Ice'],1.0
Honey White-['Morphine'],1.0
Jettin' - Remastered-Digable Planets,1.0
September In The Rain-['Dinah Washington'],1.0
Suburban Home-['Descendents'],1.0
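
The comment on the function above notes that it should return only the song, not the score. Below is one possible sketch of that change. The function name `recommend_song_names` and the `n` parameter are hypothetical, and the toy data is made up. It also excludes the input song by name instead of slicing with `[1:6]`, because that slice drops the top-ranked row whether or not it is the input song.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend_song_names(library, user_input, n=5):
    """Return the names of the n library songs most similar to user_input."""
    # Align the input's columns with the library's so the features line up
    user_input = user_input[library.columns]
    scores = pd.Series(
        cosine_similarity(library, user_input).ravel(), index=library.index
    )
    # Exclude the input song by name in case it is also present in the library
    scores = scores[scores.index != user_input.index[0]]
    return scores.sort_values(ascending=False).head(n).index.tolist()

# Made-up values for illustration
toy_library = pd.DataFrame(
    {"danceability": [0.9, 0.1, 0.6], "energy": [0.8, 0.2, 0.9]},
    index=["a", "b", "c"],
)
toy_input = pd.DataFrame({"danceability": [0.85], "energy": [0.85]}, index=["q"])
names = recommend_song_names(toy_library, toy_input, n=2)
print(names)
```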


In [None]:
# with open('../models/get_recommendation.pkl', 'wb') as f:
#     pickle.dump(get_recommendations, f)

In [None]:
# with open('../models/library.pkl', 'wb') as f:
#     pickle.dump(temp, f)

In [9]:
%store -r test

#### Recommendations are not great; some things to experiment with:
<br>Add genre
<br>Normalize features. This will have to be done on the input song as well, likely manually with a function
<br>Check recommendations against the library list and only surface recommendations in the same genre
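
A likely reason every score above displays as 1.0 is that `followers` is several orders of magnitude larger than the 0-1 audio features, so it dominates each vector. The sketch below illustrates the effect on made-up values. It uses `MinMaxScaler`, which is a swapped-in alternative and not the `preprocessing.normalize` call used in the next section.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Made-up values: 'followers' is far larger than the 0-1 audio features
toy = pd.DataFrame(
    {
        "followers": [350000.0, 900000.0, 400000.0],
        "danceability": [0.68, 0.10, 0.70],
        "energy": [0.39, 0.95, 0.40],
    },
    index=["input_song", "very_different", "very_similar"],
)

# Raw features: every score is close to 1 because 'followers' dominates
raw_scores = pd.Series(
    cosine_similarity(toy, toy.loc[["input_song"]]).ravel(), index=toy.index
)

# Rescale each column to [0, 1] so that no single feature dominates
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), index=toy.index, columns=toy.columns
)
scaled_scores = pd.Series(
    cosine_similarity(scaled, scaled.loc[["input_song"]]).ravel(), index=scaled.index
)
print(raw_scores)
print(scaled_scores)
```

After scaling, the two comparison songs should separate clearly, which is the behavior I am hoping normalization will give on the real data.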

#### Normalizing Features

In [None]:
#Step 1: Normalize the features whose values fall outside the 0-1 range
#Work on a copy so that combined itself is left unchanged
normalized = combined.copy()
normalized['popularity'] = pd.DataFrame(preprocessing.normalize([normalized['popularity']]).T)
normalized['loudness'] = pd.DataFrame(preprocessing.normalize([normalized['loudness']]).T)
normalized['followers'] = pd.DataFrame(preprocessing.normalize([normalized['followers']]).T)
normalized['tempo'] = pd.DataFrame(preprocessing.normalize([normalized['tempo']]).T)
normalized['key'] = pd.DataFrame(preprocessing.normalize([normalized['key']]).T)

In [None]:
normalized.index = normalized['Song']
normalized.drop(columns = ['index', 'Song', 'genre'], inplace = True)
normalized.head(1)

In [None]:
normalized.shape

#### Recommendation with Genre
I am not seeing much difference with genre included. I am trying to calculate the scores first and then surface only recommendations within the same genre. The problem is that the genre column holds a long list for each song, so an exact match on genres would likely only surface songs from the same artist. Is there a way to write a function that identifies songs with at least one matching genre in the list?
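
One possible answer to that question is sketched below. It assumes each genre value is a string that looks like a Python list (for example `"['classical', 'post-romantic era']"`, as in the `combined.head(1)` output). The helper names `parse_genres` and `shares_genre` are hypothetical and the rows are made up.

```python
import ast
import pandas as pd

def parse_genres(raw):
    """Parse a string such as "['pop', 'indie pop']" into a set of genre names."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Missing values or strings that are not list literals count as "no genres"
        return set()
    if not isinstance(parsed, (list, tuple, set)):
        return set()
    return {str(g) for g in parsed}

def shares_genre(raw_a, raw_b):
    """True when the two genre strings have at least one genre in common."""
    return bool(parse_genres(raw_a) & parse_genres(raw_b))

# Made-up rows for illustration
toy = pd.DataFrame(
    {"genre": ["['pop', 'art pop']", "['classical']", "['indie pop', 'pop']"]},
    index=["song_a", "song_b", "song_c"],
)
input_genres = "['pop', 'electropop']"

# Keep only the songs that share at least one genre with the input
same_genre = toy[toy["genre"].apply(lambda raw: shares_genre(raw, input_genres))]
print(same_genre)
```

A filter like this could be applied to the ranked similarity scores so that only songs sharing a genre with the input are surfaced.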

In [11]:
#Count Vectorize the Genre Column
vectorizer = CountVectorizer(max_features = 200)
genre = vectorizer.fit_transform(combined['genre'])

#Create genre dataframe
genre_df = pd.DataFrame(genre.toarray(), columns = vectorizer.get_feature_names_out())

#Add genre counts to the combined df
combined_genre = pd.concat([combined, genre_df], axis = 1)
combined_genre.index = combined_genre['Song']
#combined_genre.reset_index(inplace = True, drop = True)
combined_genre.drop(columns = ['index', 'genre', 'Song'], inplace = True)

In [12]:
#Calculate the similarity between vectors
similarity = cosine_similarity(combined_genre, combined_genre.iloc[[-1]])

#create a dataframe with similarity scores
sim = pd.DataFrame(similarity, index = combined_genre.index, columns = [combined_genre.index[-1]])

#return 10 songs with highest similarity to input song
sim.sort_values(by = combined_genre.index[-1], ascending = False).head(10)

Song,All The Things (Your Man Won't Do)-Joe
All The Things (Your Man Won't Do)-Joe,1.0
Fallin-Macklemore,1.0
Notion-['Tash Sultana'],1.0
Do It-['Chloe x Halle'],1.0
Life Goes On-Oliver Tree,1.0
Spice Girl-['Aminé'],1.0
She Used To Be Mine-['Sara Bareilles'],1.0
skeletons-keshi,1.0
What If I Told You That I Love You-['Ali Gatie'],1.0
parents-['YUNGBLUD'],1.0


In [42]:
#Recommendation Function- returns only the Song, not the similarity score
def recommendations_genre(library, user_input):
    #Join Library and User_Input (copy the input so the caller's DataFrame is not modified)
    user_input = user_input.copy()
    user_input['genre'] = str(user_input['genre'].iloc[0])
    user_input['Song'] = user_input.index
    df = pd.concat([library, user_input])
    df.drop(columns = ['index'], inplace = True)
    df.reset_index(inplace = True, drop = True)
    
    #normalize features
    df['popularity'] = pd.DataFrame(preprocessing.normalize([df['popularity']]).T)
    df['loudness'] = pd.DataFrame(preprocessing.normalize([df['loudness']]).T)
    df['followers'] = pd.DataFrame(preprocessing.normalize([df['followers']]).T)
    df['tempo'] = pd.DataFrame(preprocessing.normalize([df['tempo']]).T)
    df['key'] = pd.DataFrame(preprocessing.normalize([df['key']]).T)
    
    #Count Vectorize library and input    
    vectorizer = CountVectorizer(max_features = 200)
    genre = vectorizer.fit_transform(df['genre'])

    #Create genre dataframe
    genre_df = pd.DataFrame(genre.toarray(), columns = vectorizer.get_feature_names_out())

    #Add genre counts to the combined df
    combined_genre = pd.concat([df, genre_df], axis = 1)
    combined_genre.index = combined_genre['Song']
    #combined_genre.reset_index(inplace = True, drop = True)
    combined_genre.drop(columns = ['genre', 'Song'], inplace = True)
    
    #calculate similarity
    similarity = cosine_similarity(combined_genre, combined_genre.iloc[[-1]])
    
    #create a similarity DataFrame
    sim = pd.DataFrame(similarity, index = combined_genre.index, columns = [user_input.index[0]])
    
    #keep rows 50-54 of the ranking; note this skips the 50 closest matches instead of returning the top 5
    recs = sim.sort_values(by = user_input.index[0], ascending = False)[50:55]
    recs.reset_index(inplace = True)
    
    return recs[['Song']]

In [None]:
%store -r sample

In [44]:
temp = recommendations_genre(combined, sample)

In [45]:
temp

Unnamed: 0,Song
0,Halley's Comet-Billie Eilish
1,Tilted-['Christine and the Queens']
2,Sad Girl-['Lana Del Rey']
3,American-['Lana Del Rey']
4,Burning Desire-['Lana Del Rey']


In [18]:
with open('../models/library2.pkl', 'wb') as f:
    pickle.dump(combined, f)

In [30]:
combined.head(1)

Unnamed: 0,index,genre,popularity,followers,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,Song
0,0,"['classical', 'post-romantic era', 'russian ro...",4.0,560524.0,0.279,0.211,10.0,-20.096,1.0,0.0366,0.982,0.878,0.665,0.0594,80.954,"Piano Concerto No. 3 in D Minor, Op. 30: III. ..."
