# Build a song recommender system

In [1]:
import turicreate as tc

## Load some music data

In [3]:
song_data = tc.SFrame('../data/song_data.sframe/')

## Explore our data

In [4]:
song_data

user_id,song_id,listen_count,title,artist,song
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOAKIMP12A8C130995,1,The Cove,Jack Johnson,The Cove - Jack Johnson
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Paco De Lucia,Entre Dos Aguas - Paco De Lucia ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBXHDL12A81C204C0,1,Stronger,Kanye West,Stronger - Kanye West
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOBYHAJ12A6701BF1D,1,Constellations,Jack Johnson,Constellations - Jack Johnson ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODACBL12A8C13C273,1,Learn To Fly,Foo Fighters,Learn To Fly - Foo Fighters ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODDNQT12A6D4F5F7E,5,Apuesta Por El Rock 'N' Roll ...,Héroes del Silencio,Apuesta Por El Rock 'N' Roll - Héroes del ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SODXRTY12AB0180F3B,1,Paper Gangsta,Lady GaGa,Paper Gangsta - Lady GaGa
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFGUAY12AB017B0A8,1,Stacked Actors,Foo Fighters,Stacked Actors - Foo Fighters ...
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOFRQTD12A81C233C0,1,Sehr kosmisch,Harmonia,Sehr kosmisch - Harmonia
b80344d063b5ccb3212f76538 f3d9e43d87dca9e ...,SOHQWYZ12A6D4FA701,1,Heaven's gonna burn your eyes ...,Thievery Corporation feat. Emiliana Torrini ...,Heaven's gonna burn your eyes - Thievery ...


## Show the most popular songs in the dataset
That is, the songs with the most rows in the dataset (`COUNT()` counts rows; it does not sum `listen_count`).

In [9]:
song_data.groupby('song',
                  {'count': tc.aggregate.COUNT()}).sort('count', ascending=False)

song,count
Sehr kosmisch - Harmonia,5970
Undo - Björk,5281
You're The One - Dwight Yoakam ...,4806
Dog Days Are Over (Radio Edit) - Florence + The ...,4536
Revelry - Kings Of Leon,4339
Horn Concerto No. 4 in E flat K495: II. Romance ...,3949
Secrets - OneRepublic,3916
Tive Sim - Cartola,3185
Fireflies - Charttraxx Karaoke ...,3171
Hey_ Soul Sister - Train,3132
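
The groupby-and-count step can be mirrored in plain Python to make clear what is being counted. This is a minimal sketch on hypothetical toy rows (the `rows` list and the `count_rows_per_song` helper are illustrative names, not part of turicreate): each row is one (user, song) record, and a song's count is simply the number of rows that mention it.

```python
from collections import Counter

# Hypothetical toy rows shaped like (user_id, song); not taken from song_data.
rows = [
    ("u1", "Song A"),
    ("u2", "Song A"),
    ("u3", "Song B"),
    ("u1", "Song C"),
    ("u2", "Song C"),
    ("u3", "Song C"),
]

def count_rows_per_song(rows):
    """Return (song, count) pairs, highest count first, ties broken by name."""
    counts = Counter(song for _, song in rows)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

song_counts = count_rows_per_song(rows)
print(song_counts)
```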


In [10]:
users = song_data['user_id'].unique()

In [11]:
len(users)

66346

# Build a simple song recommender

In [12]:
train_data,test_data = song_data.random_split(.8,seed=0)
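
A seeded random split can be sketched in plain Python as follows. This is an illustration of the idea only, under the assumption that each row is assigned independently; it is not turicreate's implementation, and the helper name `random_split` here is just borrowed for readability. The seed makes the split reproducible, which is why the notebook fixes `seed=0`.

```python
import random

def random_split(rows, fraction, seed=0):
    """Send each row to the first split with probability `fraction`."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # hypothetical stand-in for the dataset rows
train, test = random_split(rows, 0.8, seed=0)
print(len(train), len(test))
```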

### Simple popularity-based recommender

In [13]:
# turicreate has a built-in popularity recommender model.
# Besides the training data, it takes the names of the user ID and item ID columns.
popularity_model = tc.popularity_recommender.create(train_data,
                                                           user_id = 'user_id',
                                                           item_id = 'song')

### Use the popularity model to make some predictions

In [14]:
# What do you recommend for the first user of our users array?
popularity_model.recommend(users=[users[0]])

user_id,song,score,rank
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Sehr kosmisch - Harmonia,4754.0,1
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Undo - Björk,4227.0,2
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,You're The One - Dwight Yoakam ...,3781.0,3
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Dog Days Are Over (Radio Edit) - Florence + The ...,3633.0,4
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Revelry - Kings Of Leon,3527.0,5
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Horn Concerto No. 4 in E flat K495: II. Romance ...,3161.0,6
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Secrets - OneRepublic,3148.0,7
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Hey_ Soul Sister - Train,2538.0,8
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Fireflies - Charttraxx Karaoke ...,2532.0,9
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Tive Sim - Cartola,2521.0,10


In [15]:
# It's based on popularity, so every user is recommended the same songs!
popularity_model.recommend(users=[users[1]])

user_id,song,score,rank
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Sehr kosmisch - Harmonia,4754.0,1
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Undo - Björk,4227.0,2
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,You're The One - Dwight Yoakam ...,3781.0,3
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Dog Days Are Over (Radio Edit) - Florence + The ...,3633.0,4
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Revelry - Kings Of Leon,3527.0,5
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Horn Concerto No. 4 in E flat K495: II. Romance ...,3161.0,6
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Secrets - OneRepublic,3148.0,7
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Hey_ Soul Sister - Train,2538.0,8
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Fireflies - Charttraxx Karaoke ...,2532.0,9
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Tive Sim - Cartola,2521.0,10
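
The behaviour seen above can be sketched in a few lines of plain Python. The assumption here is that a song's score is the number of training rows that mention it, which is consistent with the scores being roughly 80% of the full-dataset counts; `train_popularity` and `recommend_popular` are hypothetical helpers, not turicreate functions. The optional `heard` argument shows how a user's own history could be filtered out of an otherwise identical ranking.

```python
from collections import Counter

def train_popularity(train_rows):
    """Count how many training rows mention each song."""
    return Counter(song for _, song in train_rows)

def recommend_popular(counts, k=10, heard=frozenset()):
    """Return the k most frequent songs, skipping any in `heard`."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(song, score) for song, score in ranked if song not in heard][:k]

# Hypothetical toy training rows shaped like (user_id, song).
train_rows = [("u1", "A"), ("u2", "A"), ("u3", "A"),
              ("u1", "B"), ("u2", "B"), ("u3", "C")]
counts = train_popularity(train_rows)
print(recommend_popular(counts, k=2))
print(recommend_popular(counts, k=2, heard={"A"}))
```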


# Build a recommender with personalization

In [17]:
# The built-in model is item_similarity_recommender.
# Like the popularity model, it takes the user_id and item_id column names.
personalized_model = tc.item_similarity_recommender.create(train_data,
                                                                  user_id = 'user_id',
                                                                  item_id = 'song')

## Make personalized song recommendations

Question: does it recommend only songs that the given user has NOT listened to yet?

In [18]:
personalized_model.recommend(users=[users[0]])

user_id,song,score,rank
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Riot In Cell Block Number Nine - Dr Feelgood ...,0.0374999940395355,1
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Sei Lá Mangueira - Elizeth Cardoso ...,0.0331632643938064,2
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,The Stallion - Ween,0.0322580635547637,3
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Rain - Subhumans,0.0314159244298934,4
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,West One (Shine On Me) - The Ruts ...,0.0306771993637084,5
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Back Against The Wall - Cage The Elephant ...,0.0301204770803451,6
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Life Less Frightening - Rise Against ...,0.0284431129693985,7
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,A Beggar On A Beach Of Gold - Mike And The ...,0.023002490401268,8
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Audience Of One - Rise Against ...,0.0193938463926315,9
279292bb36dbfc7f505e36ebf 038c81eb1d1d63e ...,Blame It On The Boogie - The Jacksons ...,0.0189873427152633,10


In [20]:
personalized_model.recommend(users=[users[1]])

user_id,song,score,rank
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Grind With Me (Explicit Version) - Pretty Ricky ...,0.0459424376487731,1
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,There Goes My Baby - Usher ...,0.0331920742988586,2
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Panty Droppa [Intro] (Album Version) - Trey ...,0.031856620311737,3
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Nobody (Featuring Athena Cage) (LP Version) - ...,0.0278467655181884,4
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Youth Against Fascism - Sonic Youth ...,0.0262914180755615,5
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Nice & Slow - Usher,0.0239639401435852,6
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Making Love (Into The Night) - Usher ...,0.0238176941871643,7
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Naked - Marques Houston,0.0228925704956054,8
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,I.nner Indulgence - DESTRUCTION ...,0.0220767498016357,9
c067c22072a17d33310d7223d 7b79f819e48cf42 ...,Love Lost (Album Version) - Trey Songz ...,0.0204497694969177,10
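
One way to check the question above is to compare a user's recommendations with the songs that user already has in the training data. The helper below is a hypothetical pure-Python sketch, not part of turicreate. The commented lines suggest how its inputs might be built from the notebook's objects; they are not executed here. turicreate's `recommend` also documents an `exclude_known` flag, whose default is worth verifying in the library reference.

```python
def overlap_with_history(recommended_songs, history_songs):
    """Return the recommended songs that the user has already listened to."""
    return sorted(set(recommended_songs) & set(history_songs))

# With the notebook's objects, the inputs could be built like this (not run here):
#   recs = personalized_model.recommend(users=[users[0]])['song']
#   history = train_data[train_data['user_id'] == users[0]]['song']

# Hypothetical toy values standing in for those two columns.
recs = ["Song X", "Song Y", "Song Z"]
history = ["Song A", "Song B"]
print(overlap_with_history(recs, history))  # an empty list means nothing was repeated
```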


# Apply model to find similar songs in the data set

In [22]:
# Which songs are similar to a given item?
# i.e. if you listen to one song, which other songs might you like?
personalized_model.get_similar_items(['With Or Without You - U2'])

song,similar,score,rank
With Or Without You - U2,I Still Haven't Found What I'm Looking For ...,0.0428571701049804,1
With Or Without You - U2,Hold Me_ Thrill Me_ Kiss Me_ Kill Me - U2 ...,0.033734917640686,2
With Or Without You - U2,Window In The Skies - U2,0.032835841178894,3
With Or Without You - U2,Vertigo - U2,0.030075192451477,4
With Or Without You - U2,Sunday Bloody Sunday - U2,0.0271317958831787,5
With Or Without You - U2,Bad - U2,0.0251798629760742,6
With Or Without You - U2,A Day Without Me - U2,0.0237154364585876,7
With Or Without You - U2,Another Time Another Place - U2 ...,0.0203251838684082,8
With Or Without You - U2,Walk On - U2,0.0202020406723022,9
With Or Without You - U2,Get On Your Boots - U2,0.0196850299835205,10


In [23]:
# Notice the item name passed is within brackets []
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])

song,similar,score,rank
Chan Chan (Live) - Buena Vista Social Club ...,Murmullo - Buena Vista Social Club ...,0.1881188154220581,1
Chan Chan (Live) - Buena Vista Social Club ...,La Bayamesa - Buena Vista Social Club ...,0.1871921420097351,2
Chan Chan (Live) - Buena Vista Social Club ...,Amor de Loca Juventud - Buena Vista Social Club ...,0.1848341226577758,3
Chan Chan (Live) - Buena Vista Social Club ...,Diferente - Gotan Project,0.0214592218399047,4
Chan Chan (Live) - Buena Vista Social Club ...,Mistica - Orishas,0.0205761194229125,5
Chan Chan (Live) - Buena Vista Social Club ...,Hotel California - Gipsy Kings ...,0.0193049907684326,6
Chan Chan (Live) - Buena Vista Social Club ...,Nací Orishas - Orishas,0.0191571116447448,7
Chan Chan (Live) - Buena Vista Social Club ...,Le Moulin - Yann Tiersen,0.0187969803810119,8
Chan Chan (Live) - Buena Vista Social Club ...,Gitana - Willie Colon,0.0187969803810119,9
Chan Chan (Live) - Buena Vista Social Club ...,Criminal - Gotan Project,0.0187793374061584,10
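
To give an intuition for where such similarity scores can come from, here is a minimal sketch that ranks songs by Jaccard similarity between their listener sets. Jaccard is an assumption made for illustration (the notebook does not state which similarity the model uses), and `similar_items` is a hypothetical helper, not the library's `get_similar_items`.

```python
from collections import defaultdict

def listeners_by_song(rows):
    """Map each song to the set of users who listened to it."""
    listeners = defaultdict(set)
    for user, song in rows:
        listeners[song].add(user)
    return listeners

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_items(rows, song, k=10):
    """Rank other songs by listener overlap with `song`, dropping zero scores."""
    listeners = listeners_by_song(rows)
    target = listeners.get(song, set())
    scores = [(other, jaccard(target, users))
              for other, users in listeners.items() if other != song]
    scores = [(other, s) for other, s in scores if s > 0]
    return sorted(scores, key=lambda kv: (-kv[1], kv[0]))[:k]

# Hypothetical toy rows shaped like (user_id, song).
rows = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"),
        ("u2", "C"), ("u3", "B"), ("u3", "D")]
similar_to_a = similar_items(rows, "A")
print(similar_to_a)
```

In the toy data, songs A and B share two of three listeners, so B ranks first; D shares none with A and is dropped.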


# Compare the models' performance quantitatively
We now formally compare the `popularity_model` and the `personalized_model` using precision-recall curves. 

In [25]:
# turicreate has a convenient utility method, compare_models.
# It takes the test data and a list of models to evaluate.
model_performance = tc.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=.05)
# To keep this command quick, we evaluate on a 5% sample of users.

compare_models: using 2931 users to estimate model performance
PROGRESS: Evaluate model M0





Precision and recall summary statistics by cutoff
+--------+----------------------+-----------------------+
| cutoff |    mean_precision    |      mean_recall      |
+--------+----------------------+-----------------------+
|   1    | 0.01740020470829069  | 0.0045129365497840425 |
|   2    | 0.018253155919481394 |  0.009582867566490901 |
|   3    | 0.019447287615148457 |  0.015981710109141015 |
|   4    | 0.018850221767314903 |  0.020531632494273238 |
|   5    | 0.01733196861139543  |  0.023330395579627943 |
|   6    | 0.01643352666894117  |  0.02631523249486402  |
|   7    | 0.015694302285909276 |  0.02941240542315262  |
|   8    | 0.014713408393039926 |  0.03143619288706291  |
|   9    | 0.014557034004321632 |  0.03432850130649514  |
|   10   | 0.014124872057318355 |  0.03743526603045539  |
+--------+----------------------+-----------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1





Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.024906175366769024 | 0.007065963718983164 |
|   2    | 0.022517911975434995 | 0.011740140143415474 |
|   3    | 0.020470829068577272 | 0.015828069538407315 |
|   4    | 0.018509041282838617 |  0.0182586027263612  |
|   5    | 0.01726373251450019  | 0.022174101818421162 |
|   6    | 0.015694302285909255 | 0.02419560849704146  |
|   7    | 0.014573280694058593 | 0.025916403718348448 |
|   8    | 0.01356192425793245  | 0.027567350481884783 |
|   9    | 0.012964858410098942 | 0.030030631732269393 |
|   10   | 0.012487205731832146 | 0.03219131838630305  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]



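The tables show that **personalization** helps most at the top of the list: the personalized model (M1, the second model passed) has higher mean precision at cutoffs 1 to 3 and higher mean recall at cutoffs 1 and 2. From cutoff 4 onward, the popularity model (M0) scores slightly higher on both metrics in this 5% user sample.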
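
For reference, precision and recall at a cutoff k can be computed for a single user as sketched below; the tables above report the mean of these values over the sampled users. The function name and the toy lists are hypothetical, and the exact evaluation details inside `compare_models` may differ.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: ranked list of items, best first
    relevant:    set of items the user actually has in the test data
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical toy example for one user.
recommended = ["A", "B", "C", "D"]
relevant = {"B", "D", "E"}
for k in (1, 2, 4):
    print(k, precision_recall_at_k(recommended, relevant, k))
```

Precision divides the hits by the number of recommendations made, while recall divides them by the number of relevant items, so recall can only grow as the cutoff increases.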

# Programming Assignment - Week 5

We will do 3 tasks:
1. **Count the unique users**. Compute the number of unique users for each of these artists:  'Kanye West', 'Foo Fighters', 'Taylor Swift' and 'Lady GaGa'.

In [29]:
song_data.column_names()

['user_id', 'song_id', 'listen_count', 'title', 'artist', 'song']

In [32]:
kanye_users = song_data[song_data['artist'] == 'Kanye West']['user_id'].unique()
len(kanye_users)

2522

In [33]:
foof_users = song_data[song_data['artist'] == 'Foo Fighters']['user_id'].unique()
len(foof_users)

2055

In [34]:
tswift_users = song_data[song_data['artist'] == 'Taylor Swift']['user_id'].unique()
len(tswift_users)

3246

In [35]:
gaga_users = song_data[song_data['artist'] == 'Lady GaGa']['user_id'].unique()
len(gaga_users)

2928
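
The four cells above repeat the same filter-then-unique pattern. As a sketch of the underlying logic, the plain-Python helper below counts distinct users per artist in one pass; `unique_users_per_artist` and the toy rows are hypothetical and stand in for the SFrame columns.

```python
from collections import defaultdict

def unique_users_per_artist(rows, artists):
    """Count distinct users for each requested artist."""
    seen = defaultdict(set)
    for user, artist in rows:
        if artist in artists:
            seen[artist].add(user)  # a set ignores repeated (user, artist) rows
    return {artist: len(seen[artist]) for artist in artists}

# Hypothetical toy rows shaped like (user_id, artist).
rows = [("u1", "Kanye West"), ("u1", "Kanye West"), ("u2", "Kanye West"),
        ("u2", "Foo Fighters"), ("u3", "Harmonia")]
artists = ["Kanye West", "Foo Fighters", "Taylor Swift"]
user_counts = unique_users_per_artist(rows, artists)
print(user_counts)
```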

2. **Use groupby to find the most and least popular artists**. Use the listen_count attribute of each song to determine the artists with the most and the fewest listens. Use aggregation to sum these values for every artist and sort them.

In [36]:
artist_listen_counts = song_data.groupby('artist',
                                         {'count': tc.aggregate.SUM('listen_count')}).sort('count', ascending=False)

In [39]:
print("Most popular:",artist_listen_counts[0])
print("Least popular:",artist_listen_counts[-1])

Most popular: {'artist': 'Kings Of Leon', 'count': 43218}
Least popular: {'artist': 'William Tabbert', 'count': 14}
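
Unlike the earlier row count, this aggregation adds up `listen_count` values. A plain-Python sketch of the same sum-per-group logic, using a hypothetical helper and toy rows, looks like this:

```python
from collections import defaultdict

def total_listens_by_artist(rows):
    """Sum listen counts per artist and sort from most to fewest listens."""
    totals = defaultdict(int)
    for artist, listen_count in rows:
        totals[artist] += listen_count
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy rows shaped like (artist, listen_count).
rows = [("X", 3), ("Y", 1), ("X", 2), ("Z", 10)]
totals = total_listens_by_artist(rows)
print("Most popular:", totals[0])
print("Least popular:", totals[-1])
```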


3. Use groupby again to find the most recommended songs.
    1. Split the data into .80 training, .20 testing, seed=0
    2. Train an item_similarity_recommender, as done in the Jupyter notebook, using the training data.
    3. Next, make recommendations for users in the test data. Use only the first 10K users.
    4. Make a single song recommendation for each of these users. Store these in an SFrame.
    5. Use aggregate count to find how many times each song appears in the recommendations.

In [40]:
subset_users = test_data['user_id'].unique()[0:10000]

In [42]:
recommendations = personalized_model.recommend(subset_users, k=1)  # recommend k songs to each user in the array

In [46]:
songs_by_count = recommendations.groupby('song', 
                        operations={'count': tc.aggregate.COUNT()}).sort('count')

In [49]:
songs_by_count[0]    # least recommended

{'song': 'Where I Stood (Album Version) - Missy Higgins', 'count': 1}

In [48]:
songs_by_count[-1]   # most recommended

{'song': 'Undo - Björk', 'count': 439}
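
The counting step for task 3 can be sketched in plain Python as follows. The dictionary of top-1 recommendations and the helper name are hypothetical. Note that many songs may tie at a count of 1, so the "least recommended" entry is just one of possibly many; the sketch breaks ties alphabetically to make that choice explicit.

```python
from collections import Counter

def least_and_most_recommended(top_recommendation_by_user):
    """Count how often each song is a user's top recommendation."""
    counts = Counter(top_recommendation_by_user.values())
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))  # ascending count
    return ranked[0], ranked[-1]

# Hypothetical toy mapping of user_id -> single recommended song.
top_recs = {"u1": "A", "u2": "A", "u3": "B", "u4": "A", "u5": "C"}
least, most = least_and_most_recommended(top_recs)
print("Least recommended:", least)
print("Most recommended:", most)
```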