#Building a song recommender


#Fire up GraphLab Create

In [1]:
import graphlab

#Load music data

In [2]:
song_data = graphlab.SFrame('song_data.gl/')

#Explore data

Music data shows how many times a user listened to a song, as well as the details of the song.

In [3]:
song_data.head()

user_id,song_id,listen_count,title,artist,song
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1,The Cove,Jack Johnson,The Cove - Jack Johnson
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Paco De Lucia,Entre Dos Aguas - Paco De Lucia ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1,Stronger,Kanye West,Stronger - Kanye West
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1,Constellations,Jack Johnson,Constellations - Jack Johnson ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...,SODACBL12A8C13C273,1,Learn To Fly,Foo Fighters,Learn To Fly - Foo Fighters ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...,SODDNQT12A6D4F5F7E,5,Apuesta Por El Rock 'N' Roll ...,Héroes del Silencio,Apuesta Por El Rock 'N' Roll - Héroes del ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...,SODXRTY12AB0180F3B,1,Paper Gangsta,Lady GaGa,Paper Gangsta - Lady GaGa
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...,SOFGUAY12AB017B0A8,1,Stacked Actors,Foo Fighters,Stacked Actors - Foo Fighters ...
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...,SOFRQTD12A81C233C0,1,Sehr kosmisch,Harmonia,Sehr kosmisch - Harmonia
b80344d063b5ccb3212f76538f3d9e43d87dca9e ...,SOHQWYZ12A6D4FA701,1,Heaven's gonna burn your eyes ...,Thievery Corporation feat. Emiliana Torrini ...,Heaven's gonna burn your eyes - Thievery ...


##Showing the most popular songs in the dataset

In [4]:
graphlab.canvas.set_target('ipynb')

In [None]:
song_data['song'].show()

In [None]:
len(song_data)

##Count number of unique users in the dataset

In [None]:
users = song_data['user_id'].unique()

In [None]:
len(users)

#Create a song recommender

In [None]:
train_data,test_data = song_data.random_split(.8,seed=0)
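
To make the split above more concrete, here is a minimal pure-Python sketch of a seeded random split. It is not GraphLab's implementation: the helper `random_split_rows` and the integer rows are hypothetical, and we assume each row is sent to the first part independently with the given probability, so the 80/20 proportion is only approximate.

```python
import random

def random_split_rows(rows, fraction, seed=0):
    """Send each row to the first list with probability `fraction`, else to the second."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in rows, not taken from song_data
train_rows, test_rows = random_split_rows(rows, 0.8, seed=0)
print(len(train_rows) + len(test_rows))  # every row lands in exactly one part
```

Reusing the same seed gives the same split, which is why the notebook fixes `seed=0`.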

##Simple popularity-based recommender

In [None]:
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                         user_id='user_id',
                                                         item_id='song')

###Use the popularity model to make some predictions

A popularity model makes the same prediction for all users, so it provides no personalization.

In [None]:
popularity_model.recommend(users=[users[0]])

In [None]:
popularity_model.recommend(users=[users[1]])
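
To see why a popularity model cannot personalize, here is a minimal pure-Python sketch. It is not GraphLab's implementation: the toy `listens` rows and the helper `popularity_recommend` are made up for illustration, and ranking songs by their number of distinct listeners is an assumption.

```python
from collections import Counter

# Toy (user_id, song) rows; hypothetical data, not taken from song_data.
listens = [
    ("u1", "Song A"), ("u1", "Song B"),
    ("u2", "Song A"), ("u2", "Song C"),
    ("u3", "Song A"), ("u3", "Song B"),
    ("u4", "Song D"),
]

def popularity_recommend(rows, user, k=2):
    """Return the k songs with the most distinct listeners.

    `user` is deliberately unused: the ranking is global, so every user
    gets the same list.
    """
    counts = Counter(song for _, song in set(rows))
    # Sort by listener count (descending), breaking ties by song name.
    ranked = sorted(counts, key=lambda song: (-counts[song], song))
    return ranked[:k]

print(popularity_recommend(listens, "u1"))  # ['Song A', 'Song B']
print(popularity_recommend(listens, "u4"))  # ['Song A', 'Song B']
```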

##Build a song recommender with personalization

We now create a model that allows us to make personalized recommendations to each user. 

In [None]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')

###Applying the personalized model to make song recommendations

As you can see, different users get different recommendations now.

In [None]:
personalized_model.recommend(users=[users[0]])

In [None]:
personalized_model.recommend(users=[users[1]])

###We can also apply the model to find similar songs to any song in the dataset

In [None]:
personalized_model.get_similar_items(['With Or Without You - U2'])

In [None]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])
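
The idea behind an item-similarity recommender can be sketched in plain Python. This is only an illustration under stated assumptions, not GraphLab's implementation: we assume Jaccard similarity between the sets of users who listened to each song, and we score a candidate song by its average similarity to the songs the user already heard. The toy data and the helpers `build_listeners`, `jaccard`, `similar_items` and `recommend` are hypothetical.

```python
from collections import defaultdict

# Toy (user_id, song) rows; hypothetical data, not taken from song_data.
listens = [
    ("u1", "Song A"), ("u1", "Song B"),
    ("u2", "Song A"), ("u2", "Song B"), ("u2", "Song C"),
    ("u3", "Song C"), ("u3", "Song D"),
    ("u4", "Song D"), ("u4", "Song E"),
]

def build_listeners(rows):
    """Map each song to the set of users who listened to it."""
    listeners = defaultdict(set)
    for user, song in rows:
        listeners[song].add(user)
    return dict(listeners)

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def similar_items(listeners, song, k=2):
    """Return the k songs whose listener sets are most similar to `song`'s."""
    target = listeners.get(song, set())
    scores = [(other, jaccard(target, users))
              for other, users in listeners.items() if other != song]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:k]

def recommend(rows, user, k=1):
    """Rank unheard songs by average similarity to the songs the user heard."""
    listeners = build_listeners(rows)
    heard = {song for u, song in rows if u == user}
    if not heard:
        return []  # no history, so nothing to compare against
    scores = {}
    for candidate, users in listeners.items():
        if candidate not in heard:
            total = sum(jaccard(users, listeners[h]) for h in heard)
            scores[candidate] = total / float(len(heard))
    return sorted(scores, key=lambda song: (-scores[song], song))[:k]

print(recommend(listens, "u1"))  # ['Song C']
print(recommend(listens, "u3"))  # ['Song E']
print(similar_items(build_listeners(listens), "Song A", k=1))  # [('Song B', 1.0)]
```

Because the scores depend on each user's own listening history, `u1` and `u3` receive different recommendations.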

#Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves. 

In [None]:
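#Compare (major, minor) as integers; comparing strings would rank "1.10" below "1.6"
if tuple(int(part) for part in graphlab.version.split('.')[:2]) >= (1, 6):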
    model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
    graphlab.show_comparison(model_performance,[popularity_model, personalized_model])
else:
    %matplotlib inline
    model_performance = graphlab.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=.05)

The precision-recall curve shows that the personalized model performs much better than the popularity model.
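
For reference, precision and recall at a cutoff k can be computed for a single user as in the sketch below. The ranked list, the held-out set and the helper `precision_recall_at_k` are hypothetical; GraphLab's comparison tools average such values over a sample of users, and that step is not reproduced here.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    precision = hits / k, recall = hits / number of relevant items,
    where a hit is a recommended item the user actually listened to.
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k) if k else 0.0
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

recommended = ["Song C", "Song A", "Song E", "Song B"]  # hypothetical ranked list
relevant = {"Song A", "Song B", "Song D"}               # hypothetical held-out listens

# One (precision, recall) point per cutoff traces out the curve.
curve = [precision_recall_at_k(recommended, relevant, k) for k in range(1, 5)]
for k, (precision, recall) in enumerate(curve, start=1):
    print(str(k) + " " + str(precision) + " " + str(recall))
```

A model whose curve lies above another's achieves higher precision at the same recall.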

#Assignment

In [7]:
'''Counting unique users: The method .unique() can be used to select the unique elements in a column of data.
In this question, you will compute the number of unique users who have listened to songs by various artists.
For example, to find out the number of unique users who listened to songs by 'Kanye West', 
all you need to do is select the rows of the song data where the artist is 'Kanye West', and then 
count the number of unique entries in the 'user_id' column. 
Compute the number of unique users for each of these artists: 
'Kanye West', 'Foo Fighters', 'Taylor Swift' and 'Lady GaGa'. '''
artists=['Kanye West', 'Foo Fighters', 'Taylor Swift', 'Lady GaGa']
for artist in artists:
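    #Select the rows for this artist, then count the distinct users in them.
    #Separate names avoid overwriting the `users` array created earlier.
    artist_rows = song_data[song_data['artist']==artist]
    artist_users = artist_rows['user_id'].unique()
    print artist + ' ' + str(len(artist_users))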

Kanye West 2522
Foo Fighters 2055
Taylor Swift 3246
Lady GaGa 2928
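
The filter-then-count-unique pattern used above can be sketched without GraphLab. The rows, artist names and the helper `count_unique_listeners` are hypothetical stand-ins, not values from song_data.

```python
# Toy rows; each dict plays the role of one row of song_data.
rows = [
    {"user_id": "u1", "artist": "Artist X"},
    {"user_id": "u1", "artist": "Artist X"},  # same user, second song
    {"user_id": "u2", "artist": "Artist X"},
    {"user_id": "u2", "artist": "Artist Y"},
    {"user_id": "u3", "artist": "Artist Y"},
    {"user_id": "u4", "artist": "Artist Y"},
]

def count_unique_listeners(rows, artist):
    """Keep the rows for `artist`, then count the distinct user ids."""
    return len({row["user_id"] for row in rows if row["artist"] == artist})

for artist in ["Artist X", "Artist Y"]:
    print(artist + " " + str(count_unique_listeners(rows, artist)))
# Artist X 2
# Artist Y 3
```

A set drops repeated user ids, which is the role `.unique()` plays in the notebook.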


In [12]:
'''Using groupby-aggregate to find the most popular and least popular artist: 
each row of song_data contains the number of times a user listened to a particular song by a particular artist.
If we would like to know how many times any song by 'Kanye West' was listened to, we need to 
select all the rows where 'artist'=='Kanye West' and sum the 'listen_count' column. 
If we would like to find the most popular artist, we would need to follow this procedure for each artist, 
which would be very slow. Instead, we use the .groupby() method to compute the sum for every artist at once:'''
newSFrame = song_data.groupby(key_columns='artist', operations={'total_count': graphlab.aggregate.SUM('listen_count')})
newSFrame = newSFrame.sort('total_count', ascending=False) 
print newSFrame[0] #Most popular: Kings Of Leon 	43218
print newSFrame[-1] #Least popular: William Tabbert 	14

{'total_count': 43218, 'artist': 'Kings Of Leon'}
{'total_count': 14, 'artist': 'William Tabbert'}
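
Conceptually, a groupby with a SUM aggregate makes a single pass over the rows and keeps one running total per key. A minimal pure-Python sketch follows; the rows and the helper `total_listens_by_artist` are hypothetical, and this is not how GraphLab implements the operation.

```python
from collections import defaultdict

# Toy rows; hypothetical data, not taken from song_data.
rows = [
    {"artist": "Artist X", "listen_count": 3},
    {"artist": "Artist Y", "listen_count": 1},
    {"artist": "Artist X", "listen_count": 2},
    {"artist": "Artist Z", "listen_count": 7},
]

def total_listens_by_artist(rows):
    """Sum listen_count per artist, then sort by total in descending order."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["artist"]] += row["listen_count"]
    return sorted(totals.items(), key=lambda pair: (-pair[1], pair[0]))

ranked = total_listens_by_artist(rows)
print(ranked[0])   # most popular: ('Artist Z', 7)
print(ranked[-1])  # least popular: ('Artist Y', 1)
```

Taking the first and last rows gives the extremes only because the totals were sorted first.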


In [13]:
'''Using groupby-aggregate to find the most recommended songs: Now that we learned how to use .groupby() to 
compute aggregates for each value in a column, let's use it to find the song that is most recommended by the 
personalized_model model we learned in the IPython notebook above. Follow these steps to find the most 
recommended song:
    Split the data into 80% training, 20% testing, using seed=0, as was done in the IPython notebook above.
    Train an item_similarity_recommender, as done in the IPython notebook, using the training data.
    Next, we are going to make recommendations for the users in the test data, but 
    there are over 200,000 rows (58,628 unique users) in the test set. 
    Computing recommendations for this many users can be slow on some computers. 
    Thus, we will use only the first 10,000 users in this question. '''
train_data,test_data = song_data.random_split(.8,seed=0)
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')
subset_test_users = test_data['user_id'].unique()[0:10000]

In [20]:
recommendations = personalized_model.recommend(subset_test_users,k=1)
#recommendations.head()
newRecommendations = recommendations.groupby(key_columns='song', operations={'count': graphlab.aggregate.COUNT()})

In [21]:
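#NOTE: the grouped result is not sorted by 'count', so these are only its first and last rows.
#To get the most recommended song, sort first: newRecommendations.sort('count', ascending=False)[0]
print newRecommendations[0] #First row: {'count': 3, 'song': 'The Climb - Miley Cyrus'}
print newRecommendations[-1] #Last row: {'count': 1, 'song': 'Dark Matter - Andrew Bird'}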

{'count': 3, 'song': 'The Climb - Miley Cyrus'}
{'count': 1, 'song': 'Dark Matter - Andrew Bird'}
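
Finding the most recommended song amounts to counting how often each song appears as a top-1 recommendation and then ordering by that count. Here is a minimal pure-Python sketch under stated assumptions: the `top1` mapping from user to recommended song is made-up data, not output of `personalized_model`.

```python
from collections import Counter

# Hypothetical top-1 recommendation per user.
top1 = {
    "u1": "Song C",
    "u2": "Song E",
    "u3": "Song C",
    "u4": "Song A",
    "u5": "Song C",
}

# Count how many users were recommended each song (the COUNT aggregate).
counts = Counter(top1.values())

# Sort by count in descending order, breaking ties by song name.
ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
print(ranked[0])   # most recommended: ('Song C', 3)
print(ranked[-1])  # least recommended: ('Song E', 1)
```

Without the sorting step, the first and last entries of the grouped counts carry no ranking information.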
