## The Song Recommender

    The song recommender script uses the Million Song Dataset. This notebook works with a subset of the data containing 10,000 songs. Two files are provided:
    1. Triplets file: contains user_id, song_id, and played_count.
    2. Songs metadata file: contains song_id, title, release, artist_name, and year.
    
    Sources for the files are: 
    1. https://static.turi.com/datasets/millionsong/10000.txt
    2. https://static.turi.com/datasets/millionsong/song_data.csv
    
    

In [1]:
import pandas as pd
import numpy as np

    The data processing pipeline merges the triplets file and the songs metadata into a single pandas DataFrame, which we then explore and use to build the recommendation system.

In [2]:
triplets_file = pd.read_csv('/Users/agni/Documents/datascience/kaggle/song_recommender/triplet.txt',delimiter='\t',
                           names = ["user_id", "song_id", "played_count" ])

In [3]:
songs_metadata = pd.read_csv('/Users/agni/Documents/datascience/kaggle/song_recommender/song_data.csv')

In [4]:
song_df = pd.merge(triplets_file, songs_metadata.drop_duplicates(['song_id']), on="song_id", how="left")
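
To see why the metadata is de-duplicated before the left merge, here is a minimal sketch on made-up rows. The ids and titles are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy triplets (user_id, song_id, played_count); values are illustrative only
triplets = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "song_id": ["s1", "s2", "s1"],
    "played_count": [3, 1, 5],
})

# Toy metadata in which song s1 appears twice
metadata = pd.DataFrame({
    "song_id": ["s1", "s1", "s2"],
    "title": ["Song A", "Song A (dup)", "Song B"],
})

# Without de-duplication, every s1 triplet matches two metadata rows
naive = pd.merge(triplets, metadata, on="song_id", how="left")

# De-duplicating first keeps exactly one output row per triplet
merged = pd.merge(triplets, metadata.drop_duplicates(["song_id"]),
                  on="song_id", how="left")

print(len(naive), len(merged))  # prints: 5 3
```

With `how="left"`, every triplet row is kept even if its song_id has no metadata; the metadata columns are then filled with NaN.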

In [34]:
song_df = song_df.sort_values(by=['played_count'], ascending=False)
print("\n**** Showing the 10 rows with the highest play counts by a single user ****\n")
song_df[:10]


**** Showing the 10 rows with the highest play counts by a single user ****



Unnamed: 0,user_id,song_id,played_count,title,release,artist_name,year
1228366,d13609d62db6df876d3cc388225478618bb7b912,SOFCGSE12AF72A674F,2213,Starshine,Gorillaz,Gorillaz,2000
1048310,50996bbabb6f7857bf0c8019435b5246a0e45cfd,SOUAGPQ12A8AE47B3A,920,Crack Under Pressure,Stress related / Live and learn,Righteous Pigs,1998
1586780,5ea608df0357ec4fda191cb9316fe8e6e65e3777,SOKOSPK12A8C13C088,879,Call It Off (Album Version),The Con,Tegan And Sara,2007
31179,bb85bb79612e5373ac714fcd4469cabeb5ed94e1,SOZQSVB12A8C13C271,796,Paradise & Dreams,Skydivin',Darren Styles,0
1875121,c012ec364329bb08cbe3e62fe76db31f8c5d8ec3,SOBONKR12A58A7A7E0,683,You're The One,If There Was A Way,Dwight Yoakam,1990
1644909,70caceccaa745b6f7bc2898a154538eb1ada4d5a,SOPREHY12AB01815F9,676,I'm On A Boat,Incredibad,The Lonely Island / T-Pain,2009
1731945,972cce803aa7beceaa7d0039e4c7c0ff097e4d55,SOJRFWQ12AB0183582,664,Dance_ Dance,Dance_ Dance,Fall Out Boy,0
1374693,d2232ac7a1ec17b283b5dff243161902b2cb706c,SOLGIWB12A58A77A05,649,Reelin' In The Years,The Definitive Collection,Steely Dan,1972
1819571,f5363481018dc87e8b06f9451e99804610a594fa,SOVRIPE12A6D4FEA19,605,Can't Help But Wait (Album Version),Kiss Presents The Mixtape,Trey Songz,0
515442,f1bdbb9fb7399b402a09fa124210dedf78e76034,SOZPMJT12AAF3B40D1,585,The Quest,A Taste Of Extreme Divinity,HYPOCRISY,2009


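   **Songs grouped by title. Because the aggregation is a count, played_count here is the number of listening records (user-song rows) per title, and percentage is that title's share of all records.**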

In [38]:
song_grouped = song_df.groupby(['title']).agg({'played_count': 'count'}).reset_index()
grouped_sum = song_grouped['played_count'].sum()
song_grouped['percentage']  = song_grouped['played_count'].div(grouped_sum)*100
song_grouped=song_grouped.sort_values(['played_count', 'title'], ascending = [0,1])
print("\n***** Showing the 10 song titles with the most listening records *****\n")
song_grouped[:10]


***** Showing the 10 song titles with the most listening records *****



Unnamed: 0,title,played_count,percentage
6836,Sehr kosmisch,8277,0.41385
8725,Undo,7032,0.3516
1964,Dog Days Are Over (Radio Edit),6949,0.34745
9496,You're The One,6729,0.33645
6498,Revelry,6145,0.30725
6825,Secrets,5841,0.29205
3437,Horn Concerto No. 4 in E flat K495: II. Romanc...,5385,0.26925
2595,Fireflies,4795,0.23975
3322,Hey_ Soul Sister,4758,0.2379
8494,Tive Sim,4548,0.2274
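
The `'count'` aggregation counts rows, one per user-song pair, rather than adding up plays. A small sketch on invented data shows how the two can differ. The column names `listeners` and `total_plays` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Invented listening records
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1"],
    "title": ["Song A", "Song A", "Song B", "Song B"],
    "played_count": [1, 2, 50, 10],
})

# Named aggregation: count the rows and sum the plays for each title
summary = toy.groupby("title").agg(
    listeners=("played_count", "count"),
    total_plays=("played_count", "sum"),
).reset_index()

# Share of all listening records, as in the cell above
summary["percentage"] = summary["listeners"] / summary["listeners"].sum() * 100
print(summary)
```

Both toy songs have two listeners, yet Song B has 60 plays against 3 for Song A, so ranking by count and ranking by sum can give different orders.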


In [7]:
users = song_df['user_id'].unique()

In [8]:
songs = song_df['title'].unique()

In [9]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(song_df, test_size = 0.20, random_state=0)

## Declares the recommender class

    The following recommender class was created by dvysardana.
    Link to her GitHub repo: https://github.com/dvysardana/RecommenderSystems_PyData_2016

In [10]:
#Class for Popularity based Recommender System model
class popularity_recommender_py():
    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.popularity_recommendations = None

    #Create the popularity based recommender system model
    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

        #Get a count of user_ids for each unique song as recommendation score
        train_data_grouped = train_data.groupby([self.item_id]).agg({self.user_id: 'count'}).reset_index()
        #Rename the count column, using the configured user column name
        train_data_grouped.rename(columns={self.user_id: 'score'}, inplace=True)
        #Sort the songs based upon recommendation score
        train_data_sort = train_data_grouped.sort_values(['score', self.item_id], ascending=[False, True])
        #Generate a recommendation rank based upon score
        train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=False, method='first')
        #Get the top 10 recommendations
        self.popularity_recommendations = train_data_sort.head(10)

    #Use the popularity based recommender system model to
    #make recommendations
    def recommend(self, user_id):
        #Work on a copy so the stored recommendations are not modified
        user_recommendations = self.popularity_recommendations.copy()
        #Add user_id column for which the recommendations are being generated
        user_recommendations['user_id'] = user_id
        #Bring user_id column to the front
        cols = user_recommendations.columns.tolist()
        cols = cols[-1:] + cols[:-1]
        user_recommendations = user_recommendations[cols]
        return user_recommendations

In [11]:
pm = popularity_recommender_py()
pm.create(train_data, 'user_id', 'title')
#Use the popularity model to make some predictions
user_id = users[5]
pm.recommend(user_id)

Unnamed: 0,user_id,title,score,Rank
6836,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Sehr kosmisch,6630,1.0
8725,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Undo,5639,2.0
1964,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Dog Days Are Over (Radio Edit),5592,3.0
9496,4bd88bfb25263a75bbdd467e74018f4ae570e5df,You're The One,5396,4.0
6498,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Revelry,4938,5.0
6825,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Secrets,4627,6.0
3437,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Horn Concerto No. 4 in E flat K495: II. Romanc...,4368,7.0
2595,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Fireflies,3835,8.0
3322,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Hey_ Soul Sister,3819,9.0
8494,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Tive Sim,3707,10.0
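
The scoring behind the popularity model is a group-by count followed by a rank. A minimal sketch on invented data, assuming the same `user_id` and `title` column names as the notebook:

```python
import pandas as pd

# Invented listening records: one row per (user, song) pair
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "title": ["Song A", "Song A", "Song A", "Song B", "Song B", "Song C"],
})

# Score each title by its number of listening records
scores = toy.groupby("title").agg(score=("user_id", "count")).reset_index()

# Highest score first, ties broken alphabetically by title
scores = scores.sort_values(["score", "title"], ascending=[False, True])

# Rank 1 is the most popular title
scores["Rank"] = scores["score"].rank(ascending=False, method="first")
print(scores)
```

The ranking does not depend on who is asking, so every user receives the same list. That matches the output above, where the most frequent titles overall are recommended to the selected user.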


## Item similarity based collaborative filtering model.

In [12]:
#Class for Item similarity based Recommender System model
import pandas as pd
class item_similarity_recommender_py():
    def __init__(self):
        self.train_data = None
        self.user_id = None
        self.item_id = None
        self.cooccurence_matrix = None
        self.songs_dict = None
        self.rev_songs_dict = None
        self.item_similarity_recommendations = None
        
    #Get unique items (songs) corresponding to a given user
    def get_user_items(self, user):
        user_data = self.train_data[self.train_data[self.user_id] == user]
        user_items = list(user_data[self.item_id].unique())
        
        return user_items
        
    #Get unique users for a given item (song)
    def get_item_users(self, item):
        item_data = self.train_data[self.train_data[self.item_id] == item]
        item_users = set(item_data[self.user_id].unique())
            
        return item_users
        
    #Get unique items (songs) in the training data
    def get_all_items_train_data(self):
        all_items = list(self.train_data[self.item_id].unique())
            
        return all_items
        
    #Construct cooccurence matrix
    def construct_cooccurence_matrix(self, user_songs, all_songs):
            
        ####################################
        #Get users for all songs in user_songs.
        ####################################
        user_songs_users = []        
        for i in range(0, len(user_songs)):
            user_songs_users.append(self.get_item_users(user_songs[i]))
            
        ###############################################
        #Initialize the item cooccurence matrix of size 
        #len(user_songs) X len(songs)
        ###############################################
        cooccurence_matrix = np.matrix(np.zeros(shape=(len(user_songs), len(all_songs))), float)
           
        #############################################################
        #Calculate similarity between user songs and all unique songs
        #in the training data
        #############################################################
        for i in range(0,len(all_songs)):
            #Calculate unique listeners (users) of song (item) i
            songs_i_data = self.train_data[self.train_data[self.item_id] == all_songs[i]]
            users_i = set(songs_i_data[self.user_id].unique())
            
            for j in range(0,len(user_songs)):       
                    
                #Get unique listeners (users) of song (item) j
                users_j = user_songs_users[j]
                    
                #Calculate intersection of listeners of songs i and j
                users_intersection = users_i.intersection(users_j)
                
                #Calculate cooccurence_matrix[j,i] as Jaccard Index
                if len(users_intersection) != 0:
                    #Calculate union of listeners of songs i and j
                    users_union = users_i.union(users_j)
                    
                    cooccurence_matrix[j,i] = float(len(users_intersection))/float(len(users_union))
                else:
                    cooccurence_matrix[j,i] = 0
                    
        
        return cooccurence_matrix

    
    #Use the cooccurence matrix to make top recommendations
    def generate_top_recommendations(self, user, cooccurence_matrix, all_songs, user_songs):
        print("Non zero values in cooccurence_matrix :%d" % np.count_nonzero(cooccurence_matrix))
        
        #Calculate the average of the scores in cooccurence matrix over all user songs (column-wise mean).
        user_sim_scores = cooccurence_matrix.sum(axis=0)/float(cooccurence_matrix.shape[0])
        user_sim_scores = np.array(user_sim_scores)[0].tolist()
 
        #Sort the indices of user_sim_scores based upon their value
        #Also maintain the corresponding score
        sort_index = sorted(((e,i) for i,e in enumerate(list(user_sim_scores))), reverse=True)
    
        #Create a dataframe from the following
        columns = ['user_id', 'title', 'score', 'rank']
        #index = np.arange(1) # array of numbers for the number of samples
        df = pd.DataFrame(columns=columns)
         
        #Fill the dataframe with top 10 item based recommendations
        rank = 1 
        for i in range(0,len(sort_index)):
            if ~np.isnan(sort_index[i][0]) and all_songs[sort_index[i][1]] not in user_songs and rank <= 10:
                df.loc[len(df)]=[user,all_songs[sort_index[i][1]],sort_index[i][0],rank]
                rank = rank+1
        
        #Handle the case where there are no recommendations
        if df.shape[0] == 0:
            print("The current user has no songs for training the item similarity based recommendation model.")
            return -1
        else:
            return df
 
    #Create the item similarity based recommender system model
    def create(self, train_data, user_id, item_id):
        self.train_data = train_data
        self.user_id = user_id
        self.item_id = item_id

    #Use the item similarity based recommender system model to
    #make recommendations
    def recommend(self, user):
        
        ########################################
        #A. Get all unique songs for this user
        ########################################
        user_songs = self.get_user_items(user)    
            
        print("No. of unique songs for the user: %d" % len(user_songs))
        
        ######################################################
        #B. Get all unique items (songs) in the training data
        ######################################################
        all_songs = self.get_all_items_train_data()
        
        print("no. of unique songs in the training set: %d" % len(all_songs))
         
        ###############################################
        #C. Construct item cooccurence matrix of size 
        #len(user_songs) X len(songs)
        ###############################################
        cooccurence_matrix = self.construct_cooccurence_matrix(user_songs, all_songs)
        
        #######################################################
        #D. Use the cooccurence matrix to make recommendations
        #######################################################
        df_recommendations = self.generate_top_recommendations(user, cooccurence_matrix, all_songs, user_songs)
                
        return df_recommendations
    
    #Get similar items to given items
    def get_similar_items(self, item_list):
        
        user_songs = item_list
        
        ######################################################
        #B. Get all unique items (songs) in the training data
        ######################################################
        all_songs = self.get_all_items_train_data()
        
        print("no. of unique songs in the training set: %d" % len(all_songs))
         
        ###############################################
        #C. Construct item cooccurence matrix of size 
        #len(user_songs) X len(songs)
        ###############################################
        cooccurence_matrix = self.construct_cooccurence_matrix(user_songs, all_songs)
        
        #######################################################
        #D. Use the cooccurence matrix to make recommendations
        #######################################################
        user = ""
        df_recommendations = self.generate_top_recommendations(user, cooccurence_matrix, all_songs, user_songs)
         
        return df_recommendations

In [13]:
is_model = item_similarity_recommender_py()
is_model.create(train_data, 'user_id', 'title')

In [14]:
user_id = users[5]
user_items = is_model.get_user_items(user_id)

In [15]:
print("------------------------------------------------------------------------------------")
print("Training data songs for the user userid: %s:" % user_id)
print("------------------------------------------------------------------------------------")

for user_item in user_items:
    print(user_item)

print("----------------------------------------------------------------------")
print("Recommendation process going on:")
print("----------------------------------------------------------------------")

#Recommend songs for the user using personalized model
is_model.recommend(user_id)

------------------------------------------------------------------------------------
Training data songs for the user userid: 4bd88bfb25263a75bbdd467e74018f4ae570e5df:
------------------------------------------------------------------------------------
The Real Slim Shady
Forgive Me
Say My Name
Speechless
Ghosts 'n' Stuff (Original Instrumental Mix)
Missing You
Without Me
Somebody To Love
Just Lose It
----------------------------------------------------------------------
Recommendation process going on:
----------------------------------------------------------------------
No. of unique songs for the user: 9
no. of unique songs in the training set: 9567
Non zero values in cooccurence_matrix :60155


Unnamed: 0,user_id,title,score,rank
0,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Mockingbird,0.057687,1
1,4bd88bfb25263a75bbdd467e74018f4ae570e5df,My Name Is,0.056503,2
2,4bd88bfb25263a75bbdd467e74018f4ae570e5df,U Smile,0.044817,3
3,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Terre Promise,0.044756,4
4,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Eenie Meenie,0.043378,5
5,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Superman,0.042695,6
6,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Hailie's Song,0.041082,7
7,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Drop The World,0.04093,8
8,4bd88bfb25263a75bbdd467e74018f4ae570e5df,Love Me,0.040303,9
9,4bd88bfb25263a75bbdd467e74018f4ae570e5df,OMG,0.040012,10


In [16]:
is_model.get_similar_items(['The Real Slim Shady'])

no. of unique songs in the training set: 9567
Non zero values in cooccurence_matrix :8011


Unnamed: 0,user_id,title,score,rank
0,,My Name Is,0.194948,1
1,,Mockingbird,0.158178,2
2,,Without Me,0.14561,3
3,,Superman,0.113343,4
4,,Terre Promise,0.112377,5
5,,Hailie's Song,0.10038,6
6,,'Till I Collapse,0.092408,7
7,,Just Lose It,0.085859,8
8,,My Dad's Gone Crazy,0.085714,9
9,,In The End (Album Version),0.071725,10
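
Each entry of the co-occurrence matrix is a Jaccard index between the listener sets of two songs: the size of their intersection divided by the size of their union. A candidate song's final score is the mean of its Jaccard index against every song the user has listened to. Here is a minimal pure-Python sketch with invented listener sets; the helper name `jaccard` is hypothetical and not part of the class above.

```python
def jaccard(users_a, users_b):
    """Jaccard index of two sets; 0.0 when they share no members."""
    intersection = users_a & users_b
    if not intersection:
        return 0.0
    return len(intersection) / len(users_a | users_b)

# Invented listener sets for three songs
listeners = {
    "Song A": {"u1", "u2", "u3"},
    "Song B": {"u2", "u3", "u4"},
    "Song C": {"u5"},
}

print(jaccard(listeners["Song A"], listeners["Song B"]))  # 2 shared / 4 in total = 0.5
print(jaccard(listeners["Song A"], listeners["Song C"]))  # no shared listeners = 0.0

# Score of candidate "Song B" for a user who has listened to Song A and Song C:
# the mean of the pairwise Jaccard indices
user_songs = ["Song A", "Song C"]
score_b = sum(jaccard(listeners[s], listeners["Song B"]) for s in user_songs) / len(user_songs)
print(score_b)  # (0.5 + 0.0) / 2 = 0.25
```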


In [17]:
#Constants defining the dimensions of our User Rating Matrix (URM)
MAX_PID = 4
MAX_UID = 5
#Compute SVD of the user ratings matrix
import math as mt
from scipy.sparse import csc_matrix
from sparsesvd import sparsesvd
def computeSVD(urm, K):
    U, s, Vt = sparsesvd(urm, K)
    dim = (len(s), len(s))
    S = np.zeros(dim, dtype=np.float32)
    #Place the square root of every singular value on the diagonal
    for i in range(0, len(s)):
        S[i,i] = mt.sqrt(s[i])
    #Convert the factors to sparse matrices only after the loop has finished
    U = csc_matrix(np.transpose(U), dtype=np.float32)
    S = csc_matrix(S, dtype=np.float32)
    Vt = csc_matrix(Vt, dtype=np.float32)
    return U, S, Vt

In [18]:
#Compute estimated rating for the test user
def computeEstimatedRatings(urm, U, S, Vt, uTest, K, test):
    rightTerm = S*Vt
    estimatedRatings = np.zeros(shape=(MAX_UID, MAX_PID), dtype=np.float16)
    for userTest in uTest:
        prod = U[userTest, :]*rightTerm
        #Convert the vector to dense format in order to get the indices
        #of the items with the best estimated ratings
        estimatedRatings[userTest, :] = prod.todense()
        recom = (-estimatedRatings[userTest, :]).argsort()[:250]
    return recom

#Used in SVD calculation (number of latent factors)
K=2
#Initialize a sample user rating matrix
urm = np.array([[3, 1, 2, 3],[4, 3, 4, 3],[3, 2, 1, 5], [1, 6, 5, 2], [5, 0,0 , 0]])
urm = csc_matrix(urm, dtype=np.float32)
#Compute SVD of the input user ratings matrix
U, S, Vt = computeSVD(urm, K)
#Test user set as user_id 4 with ratings [5, 0, 0, 0]
uTest = [4]
print("User id for whom recommendations are needed: %d" % uTest[0])
#Get the item indices for the test user, ordered by estimated rating
print("Item indices ranked by predicted rating:")
uTest_recommended_items = computeEstimatedRatings(urm, U, S, Vt, uTest, K, True)
print(uTest_recommended_items)

User id for whom recommendations are needed: 4
Item indices ranked by predicted rating:
[0 3 2 1]
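
For comparison, the idea of a truncated SVD can also be sketched with `numpy.linalg.svd` in place of `sparsesvd`. This is a stand-in technique, not the notebook's code: it uses the plain rank-K reconstruction `U_K · diag(s_K) · Vt_K` rather than square-rooted singular values, and the function name `rank_k_estimates` is hypothetical. It runs on the same sample rating matrix.

```python
import numpy as np

def rank_k_estimates(ratings, k):
    """Return the rank-k approximation of a dense ratings matrix."""
    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    # Keep only the k largest singular values and their vectors
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Same sample user rating matrix as above (rows are users, columns are items)
urm_dense = np.array([[3, 1, 2, 3],
                      [4, 3, 4, 3],
                      [3, 2, 1, 5],
                      [1, 6, 5, 2],
                      [5, 0, 0, 0]], dtype=float)

estimates = rank_k_estimates(urm_dense, k=2)

# Item indices for the test user, highest estimated rating first
test_user = 4
ranking = np.argsort(-estimates[test_user])
print(ranking)
```

Keeping more latent factors brings the reconstruction closer to the original matrix, and with all four factors it is reproduced exactly. A small K forces the estimate to generalise from the dominant patterns instead.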
