**Collaborative filtering**

First we import the modules and libraries needed for this project.

In [1]:
%matplotlib inline

import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import time
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
import Recommenders as Recommenders
import Evaluation as Evaluation

In the following cells we load the data we will work with: the user listening data and the track metadata.

In [2]:
user_data = pd.read_csv('greater.csv')
user_data.head()

Unnamed: 0,userid,trackid,qty_greater30
0,133575735,6318967,1
1,133575735,6318968,1
2,133575735,6318969,1
3,133575735,6318970,1
4,133575735,6318971,1


In [3]:
tracks = pd.read_csv('tracks.csv')
tracks.head()

Unnamed: 0,trackid,itemartist
0,2447658,Styx
1,2447654,Styx
2,2447648,Styx
3,2447650,Styx
4,2447642,Styx


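In the next cell we merge the two datasets on `trackid` into a single DataFrame, so that each listening record also carries the artist name. The tracks table is de-duplicated first so that the merge does not multiply rows.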

In [4]:
song_df = pd.merge(user_data, tracks.drop_duplicates(['trackid']), on="trackid", how="left")

In [5]:
song_df.head()

Unnamed: 0,userid,trackid,qty_greater30,itemartist
0,133575735,6318967,1,P!nk
1,133575735,6318968,1,P!nk
2,133575735,6318969,1,P!nk
3,133575735,6318970,1,P!nk
4,133575735,6318971,1,P!nk
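
To see why the tracks table is de-duplicated before the merge, here is a minimal sketch on toy data. The frames `plays` and `track_info` and their values are hypothetical stand-ins for `greater.csv` and `tracks.csv`, chosen only to illustrate the effect.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (illustrative values only).
plays = pd.DataFrame({
    "userid": [1, 1, 2],
    "trackid": [10, 11, 10],
    "qty_greater30": [1, 1, 1],
})
track_info = pd.DataFrame({
    "trackid": [10, 10, 11],              # trackid 10 is listed twice
    "itemartist": ["Styx", "Styx", "Abba"],
})

# Without de-duplication, every play of trackid 10 matches two track rows.
naive = pd.merge(plays, track_info, on="trackid", how="left")

# Keeping one row per trackid preserves the number of play records.
deduped = pd.merge(
    plays, track_info.drop_duplicates(["trackid"]), on="trackid", how="left"
)

print(len(plays), len(naive), len(deduped))
```

With `how="left"` every row of the left frame is kept, so any extra rows in the result can only come from duplicate keys on the right side.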


In [6]:
#Use only the first 100,000 rows: the recommender is expensive to run on a laptop
#and would otherwise take a long time to show results.
#.copy() avoids pandas' SettingWithCopyWarning when new columns are added below.
song_df = song_df.head(100000).copy()
#Combine the track id and the artist name into a single 'song' label
song_df['song'] = song_df['trackid'].map(str) + " - " + song_df['itemartist']
song_df['listen_count'] = song_df['qty_greater30'].map(str)
song_df['user_id'] = song_df['userid'].map(str)

Here we count how many listening records each song has and express that count as a percentage of all records, which gives every song a popularity score.

In [7]:
song_grouped = song_df.groupby(['song']).agg({'listen_count': 'count'}).reset_index()
grouped_sum = song_grouped['listen_count'].sum()
song_grouped['percentage'] = song_grouped['listen_count'].div(grouped_sum) * 100
song_grouped.sort_values(['listen_count', 'song'], ascending=[False, True]).head()

Unnamed: 0,song,listen_count,percentage
51703,4983487 - Michael Jackson,51,0.051
72657,71432093 - George Ezra,38,0.038
23994,2853689 - Abba,33,0.033
7758,14180071 - Queen,30,0.03
11518,16267164 - Wham!,29,0.029
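
The percentage calculation can be checked by hand on a tiny example. This is a sketch on hypothetical data (`toy` and its four rows are made up); it repeats the same `groupby` and `div` steps as the cell above.

```python
import pandas as pd

# Four listening records: song "a" appears three times, song "b" once.
toy = pd.DataFrame({
    "song": ["a", "a", "a", "b"],
    "listen_count": ["1", "1", "1", "1"],
})

# 'count' counts rows per song, regardless of the values in the column.
grouped = toy.groupby(["song"]).agg({"listen_count": "count"}).reset_index()

# Share of all records that belong to each song, in percent.
total = grouped["listen_count"].sum()
grouped["percentage"] = grouped["listen_count"].div(total) * 100

print(grouped)
```

Because the aggregation is `count`, the score reflects the number of records per song, not the sum of the `qty_greater30` values.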


In [8]:
song_grouped.head()

Unnamed: 0,song,listen_count,percentage
0,10000311 - Charles Aznavour,1,0.001
1,10004079 - Tony S.,1,0.001
2,10004082 - Tony S.,1,0.001
3,10004083 - Tony S.,1,0.001
4,10004084 - Tony S.,1,0.001


In [9]:
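users = song_df['user_id'].unique()
len(users)
songs = song_df['song'].unique()
#Only the last expression of a cell is displayed, so the output below is the number of unique songs
len(songs)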

78931

Here we split the data into a training set (70%) and a test set (30%), so that we can fit the recommenders on one part and evaluate them on the other.

In [10]:
train_data, test_data = train_test_split(song_df, test_size = 0.30, random_state=0)
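
As a quick illustration of what this call does, the sketch below applies the same `train_test_split` arguments to a small hypothetical frame (`toy` and its contents are made up for the example).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy listening records (illustrative values only).
toy = pd.DataFrame({
    "user_id": [str(i % 4) for i in range(10)],
    "song": [f"song{i}" for i in range(10)],
})

# Same arguments as in the notebook: 30% of the rows go to the test set,
# and random_state makes the shuffle reproducible.
toy_train, toy_test = train_test_split(toy, test_size=0.30, random_state=0)

print(len(toy_train), len(toy_test))
# Rows are split at random, so a user's records can land in both sets.
print(set(toy_train["user_id"]) & set(toy_test["user_id"]))
```

The split is made per row, not per user, which is what lets the evaluation later compare a user's test songs with recommendations built from that user's training songs.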

Next we build a popularity-based recommender to serve as a baseline for the personalized model. We then request recommendations for two different users; as the outputs show, the baseline returns the same ranked list for both.

In [11]:
pm = Recommenders.popularity_recommender_py()
pm.create(train_data, 'user_id', 'song')
#Use the popularity model to make some predictions
user_id = users[5]
pm.recommend(user_id)

Unnamed: 0,user_id,song,score,Rank
37873,133582061,4983487 - Michael Jackson,40,1.0
17570,133582061,2853689 - Abba,24,2.0
53193,133582061,71432093 - George Ezra,23,3.0
53194,133582061,71432094 - George Ezra,22,4.0
54865,133582061,76109932 - Calvin Harris,21,5.0
5705,133582061,14180071 - Queen,20,6.0
8494,133582061,16267164 - Wham!,20,7.0
19710,133582061,30958489 - Chic,20,8.0
20139,133582061,3123595 - Whitney Houston,20,9.0
4235,133582061,1321662 - Fleetwood Mac,19,10.0


In [12]:
user_id = users[8]
pm.recommend(user_id)

Unnamed: 0,user_id,song,score,Rank
37873,133596268,4983487 - Michael Jackson,40,1.0
17570,133596268,2853689 - Abba,24,2.0
53193,133596268,71432093 - George Ezra,23,3.0
53194,133596268,71432094 - George Ezra,22,4.0
54865,133596268,76109932 - Calvin Harris,21,5.0
5705,133596268,14180071 - Queen,20,6.0
8494,133596268,16267164 - Wham!,20,7.0
19710,133596268,30958489 - Chic,20,8.0
20139,133596268,3123595 - Whitney Houston,20,9.0
4235,133596268,1321662 - Fleetwood Mac,19,10.0
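
A popularity baseline like the one above can be sketched in a few lines of pandas. This is a minimal sketch and not the code in `Recommenders.popularity_recommender_py`: the function `popularity_top_n` and the toy data are hypothetical, and scoring a song by its number of distinct listeners is an assumption about how such a baseline typically works.

```python
import pandas as pd

def popularity_top_n(train, user_col, item_col, n=10):
    """Rank items by the number of distinct users who listened to them."""
    scores = (
        train.groupby(item_col)[user_col]
        .nunique()
        .reset_index(name="score")
        # Highest score first; ties are broken alphabetically by item.
        .sort_values(["score", item_col], ascending=[False, True])
        .head(n)
        .reset_index(drop=True)
    )
    scores["rank"] = range(1, len(scores) + 1)
    return scores

# Toy training data (illustrative values only).
toy_train = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song":    ["a",  "a",  "a",  "b",  "b",  "c"],
})

top = popularity_top_n(toy_train, "user_id", "song", n=2)
print(top)
```

The ranking does not depend on who asks for it, which is why the two users above receive identical lists.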


Here we create the item-similarity recommender, the model that produces personalized recommendations.

In [13]:
is_model = Recommenders.item_similarity_recommender_py()
is_model.create(train_data, 'user_id', 'song')

Then we apply it to the songs of the user at index 5 (`users[5]`) and see what it recommends.

In [14]:
#Print the songs for the user in training data
user_id = users[5]
user_items = is_model.get_user_items(user_id)
#
print("------------------------------------------------------------------------------------")
print("Training data songs for the user userid: %s:" % user_id)
print("------------------------------------------------------------------------------------")

for user_item in user_items:
    print(user_item)

print("----------------------------------------------------------------------")
print("Recommendation process going on:")
print("----------------------------------------------------------------------")

#Recommend songs for the user using personalized model
is_model.recommend(user_id)

------------------------------------------------------------------------------------
Training data songs for the user userid: 133582061:
------------------------------------------------------------------------------------
66978170 - Roy Orbison
11811825 - Adele
66978162 - Roy Orbison
61580882 - Black Sabbath
89407 - Madonna
45374103 - Top Hit Music Charts
11811824 - Adele
61580868 - Black Sabbath
66978166 - Roy Orbison
45365096 - Top Hit Music Charts
66978171 - Roy Orbison
61580881 - Black Sabbath
66978177 - Roy Orbison
61580880 - Black Sabbath
66978169 - Roy Orbison
45374101 - Top Hit Music Charts
61580875 - Black Sabbath
66978172 - Roy Orbison
45374091 - Top Hit Music Charts
66978173 - Roy Orbison
66978165 - Roy Orbison
61580877 - Black Sabbath
66978163 - Roy Orbison
66978176 - Roy Orbison
61580878 - Black Sabbath
11811820 - Adele
31422488 - Crystal Waters
66978167 - Roy Orbison
66978164 - Roy Orbison
66978168 - Roy Orbison
------------------------------------------------------------

Unnamed: 0,user_id,song,score,rank
0,133582061,18128003 - Roy Orbison,0.133333,1
1,133582061,18128000 - Roy Orbison,0.133333,2
2,133582061,18127998 - Roy Orbison,0.133333,3
3,133582061,18128010 - Roy Orbison,0.133333,4
4,133582061,18128008 - Roy Orbison,0.133333,5
5,133582061,10562977 - Roy Orbison,0.068452,6
6,133582061,6298143 - Roy Orbison,0.068452,7
7,133582061,3472495 - Roy Orbison,0.058968,8
8,133582061,3472506 - Roy Orbison,0.058968,9
9,133582061,22959945 - Roy Orbison,0.055,10
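
One common way to score candidates in an item-similarity recommender is the Jaccard index between the sets of users who listened to two songs. The sketch below illustrates that idea under stated assumptions: `jaccard_recommend` and the toy data are hypothetical, and the internals of `Recommenders.item_similarity_recommender_py` may differ.

```python
import pandas as pd

def jaccard_recommend(train, user_id, user_col="user_id", item_col="song", n=10):
    """Score unseen songs by their mean Jaccard similarity to the user's songs."""
    # Map each song to the set of users who listened to it.
    listeners = {}
    for user, song in zip(train[user_col], train[item_col]):
        listeners.setdefault(song, set()).add(user)

    user_songs = set(train.loc[train[user_col] == user_id, item_col])
    scores = {}
    for candidate, cand_users in listeners.items():
        if candidate in user_songs:
            continue  # never recommend a song the user already has
        total = 0.0
        for owned in user_songs:
            own_users = listeners[owned]
            union = len(cand_users | own_users)
            total += len(cand_users & own_users) / union if union else 0.0
        scores[candidate] = total / len(user_songs) if user_songs else 0.0

    # Highest score first; ties are broken alphabetically by song.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Toy training data (illustrative values only).
toy_train = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u4"],
    "song":    ["a",  "b",  "a",  "b",  "c",  "c",  "d",  "a"],
})

recs = jaccard_recommend(toy_train, "u4")
print(recs)
```

Songs that share many listeners with the user's own songs score highest, so the result depends on the user, unlike the popularity baseline.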


We repeat the test for the user at index 7 (`users[7]`) to check that the model also works for another user.

In [None]:
user_id = users[7]
#Print the songs for the user in training data
user_items = is_model.get_user_items(user_id)
print("------------------------------------------------------------------------------------")
print("Training data songs for the user userid: %s:" % user_id)
print("------------------------------------------------------------------------------------")

for user_item in user_items:
    print(user_item)

print("----------------------------------------------------------------------")
print("Recommendation process going on:")
print("----------------------------------------------------------------------")

#Recommend songs for the user using personalized model
is_model.recommend(user_id)

------------------------------------------------------------------------------------
Training data songs for the user userid: 133596029:
------------------------------------------------------------------------------------
13868331 - Eddie Cantor
47466121 - The Fureys
13868328 - Eddie Cantor
47450469 - The Fureys
47466136 - The Fureys
13868320 - Eddie Cantor
47466129 - The Fureys
13868325 - Eddie Cantor
47466134 - The Fureys
47466108 - The Fureys
47466112 - The Fureys
13868326 - Eddie Cantor
13868322 - Eddie Cantor
47466115 - The Fureys
----------------------------------------------------------------------
Recommendation process going on:
----------------------------------------------------------------------
No. of unique songs for the user: 14
no. of unique songs in the training set: 57768


Next we look up items similar to a single song rather than to a user's whole listening history.

In [None]:
is_model.get_similar_items(['31422488 - Crystal Waters'])

In [None]:
song = '31422488 - Crystal Waters'

is_model.get_similar_items([song])

In [None]:
start = time.time()

#Define what fraction of users to sample for the precision-recall calculation (0.05 = 5%)
user_sample = 0.05

#Instantiate the precision_recall_calculator class
pr = Evaluation.precision_recall_calculator(test_data, train_data, pm, is_model)

#Call method to calculate precision and recall values
(pm_avg_precision_list, pm_avg_recall_list, ism_avg_precision_list, ism_avg_recall_list) = pr.calculate_measures(user_sample)

end = time.time()
print(end - start)
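
For reference, precision and recall for a single user can be computed as in the sketch below. It assumes the usual top-k definitions; the function `precision_recall_at_k` and the example lists are hypothetical, and `Evaluation.precision_recall_calculator` may aggregate its measures differently.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations for one user.

    recommended: ranked list of recommended songs
    relevant:    collection of songs the user has in the test set
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    # Precision: share of the k recommendations that are relevant.
    precision = hits / k if k else 0.0
    # Recall: share of the relevant songs that were recommended.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example (illustrative values only).
recommended = ["a", "b", "c", "d"]
relevant = {"b", "d", "e"}

precision, recall = precision_recall_at_k(recommended, relevant, k=2)
print(precision, recall)
```

Averaging these values over the sampled users for several values of k gives lists comparable in shape to the ones returned by `calculate_measures`.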