# Lab 4: Collaborative Filtering on the Last.fm Dataset

In this lab, we use the Last.fm (https://www.last.fm/) 360K Users dataset (http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html).\
The dataset contains <user, artist, plays> tuples for roughly 360,000 users.\
Each record in the raw data has the format: <em> user-mboxsha1 \t musicbrainz-artist-id \t artist-name \t plays </em>
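As a rough illustration of that four-column, tab-separated layout, the sketch below parses two placeholder records with the standard library. The sample rows, the `FIELDS` names, and the `parse_records` helper are all made up for this example and are not taken from the dataset.

```python
import csv
import io

# Two made-up placeholder records in the four-column layout described above
# (hashes, IDs, names and counts are illustrative, not real dataset rows).
sample = (
    "user_hash_a\tartist_mbid_1\texample artist one\t120\n"
    "user_hash_a\tartist_mbid_2\texample artist two\t7\n"
)

# Hypothetical column names chosen for this sketch.
FIELDS = ["user_mboxsha1", "musicbrainz_artist_id", "artist_name", "plays"]


def parse_records(text):
    """Parse tab-separated lines into dicts, converting the play count to int."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        record = dict(zip(FIELDS, row))
        record["plays"] = int(record["plays"])
        records.append(record)
    return records


records = parse_records(sample)
print(records[0]["artist_name"], records[0]["plays"])
```

In the lab itself this parsing is not needed, since `get_lastfm()` returns the data already converted to a sparse matrix.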

We use the implicit.datasets module to download the Last.fm data locally.






In [95]:
import pandas as pd
import numpy as np
from implicit.datasets.lastfm import get_lastfm

# artists and users are string arrays that label the rows and columns of the artist_user_plays matrix.

# artist_user_plays is a scipy sparse matrix holding the number of times each user played each artist:
# each row is an artist and each column is a user.

artists, users, artist_user_plays = get_lastfm()
print(artist_user_plays)



  (0, 73470)	32.0
  (0, 97856)	24.0
  (0, 235382)	1339.0
  (0, 266072)	211.0
  (1, 171865)	23.0
  (2, 180892)	70.0
  (3, 285031)	23.0
  (4, 15103)	9.0
  (5, 81700)	16.0
  (6, 284057)	56.0
  (7, 335320)	24.0
  (8, 182831)	113.0
  (9, 12461)	3.0
  (10, 78717)	2.0
  (10, 149431)	2.0
  (10, 220512)	2.0
  (10, 261830)	2.0
  (10, 280610)	2.0
  (10, 297146)	2.0
  (11, 296825)	118.0
  (12, 332435)	202.0
  (13, 41075)	30.0
  (14, 298571)	138.0
  (15, 295693)	219.0
  (16, 185703)	35.0
  :	:
  (292364, 4775)	5.0
  (292365, 147943)	1329.0
  (292366, 95230)	157.0
  (292367, 56086)	5.0
  (292367, 137277)	70.0
  (292367, 287297)	87.0
  (292368, 294859)	68.0
  (292369, 308202)	125.0
  (292370, 42122)	315.0
  (292370, 263053)	729.0
  (292370, 301225)	7.0
  (292371, 229732)	3.0
  (292372, 355627)	384.0
  (292373, 337693)	486.0
  (292374, 212855)	48.0
  (292375, 231253)	127.0
  (292376, 147738)	7.0
  (292377, 220443)	125.0
  (292378, 219957)	1.0
  (292379, 14949)	66.0
  (292380, 218794)	25.0
  (292381, 2
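To make the printed `(row, col) value` triples easier to read, here is a small sketch with a toy matrix, assuming `scipy` is available. The play counts and the names `toy_artist_user` and `listeners_of_artist_0` are invented for illustration; only the row = artist, column = user layout comes from the cell above.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy play counts: 3 artists (rows) x 4 users (columns); the values are made up.
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 3, 0, 3])
plays = np.array([32.0, 24.0, 23.0, 70.0])
toy_artist_user = coo_matrix((plays, (rows, cols)), shape=(3, 4)).tocsr()

# Printing a sparse matrix lists only the stored (non-zero) entries.
print(toy_artist_user)

# The dense view makes the layout explicit: row = artist, column = user.
print(toy_artist_user.toarray())

# The users who played artist 0 are the column indices stored in row 0.
listeners_of_artist_0 = toy_artist_user[0].indices
print(listeners_of_artist_0)
```

Only the non-zero play counts are stored, which is why a matrix of this size fits in memory at all.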

Weight the matrix before training the model. BM25 weighting serves two purposes:
- Reducing the impact of users who have played the same artist thousands of times.
- Reducing the weight given to popular items.
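The two effects listed above can be sketched with the general BM25 form. This is a simplified illustration, not the library's code: length normalisation is fixed to 1, and the helper names `saturate` and `idf` are hypothetical. The exact computation inside `bm25_weight` may differ.

```python
import numpy as np


def saturate(plays, K1=100.0):
    """Saturating BM25 term with length normalisation fixed to 1 (a simplification)."""
    plays = np.asarray(plays, dtype=float)
    # Grows almost linearly for small counts and flattens out below K1 + 1.
    return plays * (K1 + 1.0) / (K1 + plays)


def idf(n_total, n_nonzero):
    """IDF-style factor: entries that co-occur with many others get a smaller weight."""
    return np.log(n_total) - np.log1p(n_nonzero)


raw = np.array([2.0, 32.0, 1339.0])
weighted = saturate(raw)

# The gap between a heavy listener and a moderate one shrinks after weighting.
print("raw ratio:     ", raw[-1] / raw[1])
print("weighted ratio:", weighted[-1] / weighted[1])

# A frequently occurring entry gets a smaller multiplier than a rare one.
print("rare:", idf(1000, 10), "frequent:", idf(1000, 500))
```

With `K1=100`, a play count in the thousands can never be worth more than about 101 units, so a single obsessive listener no longer dominates the model.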


In [52]:
from implicit.nearest_neighbours import bm25_weight

artist_user = bm25_weight(artist_user_plays, K1=100, B=0.8)
print(artist_user)

  (0, 73470)	464.12640081352487
  (0, 97856)	395.4254916528028
  (0, 235382)	917.7576795317125
  (0, 266072)	801.254668853217
  (1, 171865)	479.0537259553822
  (2, 180892)	701.6462524574976
  (3, 285031)	469.2366708878609
  (4, 15103)	274.6530366072618
  (5, 81700)	392.0203057167537
  (6, 284057)	624.2906299439671
  (7, 335320)	482.02241184218633
  (8, 182831)	763.8439378564416
  (9, 12461)	115.27320299060298
  (10, 78717)	79.93913331578632
  (10, 149431)	82.3790196718816
  (10, 220512)	80.7530809045117
  (10, 261830)	81.8842571288024
  (10, 280610)	80.1359339749864
  (10, 297146)	80.33706026790095
  (11, 296825)	764.3099590503471
  (12, 332435)	833.4437619557374
  (13, 41075)	510.69738868947763
  (14, 298571)	759.3034882234267
  (15, 295693)	791.1280662266632
  (16, 185703)	550.3715435013044
  :	:
  (292364, 4775)	178.0367505676167
  (292365, 147943)	970.515999738615
  (292366, 95230)	778.3288671797941
  (292367, 56086)	174.82443977973026
  (292367, 137277)	693.1595624862529
  (292367

Train an alternating least squares (ALS) model using implicit.

In [53]:
from implicit.als import AlternatingLeastSquares

model = AlternatingLeastSquares(factors=64, regularization=0.05, alpha=2.0)
# implicit expects a user-item (here user-artist) matrix, so transpose the artist-user matrix
user_artist = artist_user.T.tocsr()

model.fit(user_artist)

100%|██████████| 15/15 [09:04<00:00, 36.28s/it]
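For intuition about what the training loop alternates between, here is a tiny dense sketch of ALS for implicit feedback on a made-up 4x4 matrix. It is not how the library implements training: `als_implicit` is a hypothetical helper, it solves each row exactly with `np.linalg.solve`, and it uses `1 + value` as the confidence weight, which is an assumption made for this example.

```python
import numpy as np


def als_implicit(C, factors=2, regularization=0.05, iterations=15, seed=0):
    """Tiny dense ALS for implicit feedback; C holds confidences (0 = unobserved)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = C.shape
    P = (C > 0).astype(float)  # binary preference: played or not
    X = rng.normal(scale=0.1, size=(n_users, factors))  # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))  # item factors
    eye = np.eye(factors)

    def solve(fixed, conf, pref):
        # For each row r: (F^T W_r F + reg * I) x_r = F^T W_r p_r,
        # with W_r = diag(1 + conf[r]) and F the factors held fixed.
        out = np.empty((conf.shape[0], fixed.shape[1]))
        for r in range(conf.shape[0]):
            weighted = fixed * (1.0 + conf[r])[:, None]
            A = weighted.T @ fixed + regularization * eye
            b = weighted.T @ pref[r]
            out[r] = np.linalg.solve(A, b)
        return out

    for _ in range(iterations):
        X = solve(Y, C, P)      # update users with items fixed
        Y = solve(X, C.T, P.T)  # update items with users fixed
    return X, Y


# Made-up user-item confidences with two loose taste groups.
C = np.array([
    [5.0, 3.0, 0.0, 0.0],
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 5.0],
    [0.0, 0.0, 3.0, 4.0],
])
X, Y = als_implicit(C)
scores = X @ Y.T
print(np.round(scores, 2))
```

Each half-step is an ordinary regularised least-squares problem, which is what makes the alternating scheme tractable.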


The results: the top 10 recommendations for one user, followed by the artists most similar to a given artist.

In [96]:
userid = 12345
ids, scores = model.recommend(userid, user_artist[userid], N=10, filter_already_liked_items=False)
# Flag the recommended artists that this user has already played
pd.DataFrame({"artist": artists[ids], "score": scores, "already_liked": np.isin(ids, user_artist[userid].indices)})

| | artist | score | already_liked |
|---|---|---|---|
| 0 | spiritual front | 1.023052 | False |
| 1 | d-a-d | 1.012068 | True |
| 2 | von thronstahl | 1.011382 | True |
| 3 | storm | 0.983674 | False |
| 4 | blood axis | 0.975657 | False |
| 5 | arditi | 0.9699 | True |
| 6 | puissance | 0.968462 | True |
| 7 | the coffinshakers | 0.965118 | True |
| 8 | type o negative | 0.961425 | True |
| 9 | triarii | 0.960693 | True |
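Conceptually, a score in this kind of table is the dot product between one user's latent vector and each artist's latent vector, with the largest values returned first. The sketch below shows that idea on random factors. `recommend_top_n` is a hypothetical helper written for this example, not the library's `recommend` method.

```python
import numpy as np


def recommend_top_n(user_vector, item_factors, n=3, exclude=()):
    """Score every item by dot product and return the n best (ids, scores)."""
    scores = item_factors @ user_vector
    if len(exclude):
        # Mimic filtering of already-played items by pushing them to the bottom.
        scores = scores.copy()
        scores[list(exclude)] = -np.inf
    n = min(n, len(scores))
    top = np.argpartition(-scores, n - 1)[:n]  # unordered top n
    top = top[np.argsort(-scores[top])]        # sort those n by score
    return top, scores[top]


rng = np.random.default_rng(0)
toy_item_factors = rng.normal(size=(8, 4))  # 8 made-up artists, 4 latent factors
toy_user_vector = rng.normal(size=4)

ids, top_scores = recommend_top_n(toy_user_vector, toy_item_factors, n=3)
print(ids, top_scores)
```

Passing `exclude` plays the role of `filter_already_liked_items=True`. In the cell above that filter is off, which is why most rows show `already_liked` as True.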


In [97]:
itemid = list(artists).index("maroon 5")
print(f"Artist ID {itemid} : {artists[itemid]}")
ids, scores = model.similar_items(itemid)

# display the results using pandas for nicer formatting
pd.DataFrame({"artist": artists[ids], "score": scores})

Artist ID 181675 : maroon 5


| | artist | score |
|---|---|---|
| 0 | maroon 5 | 1.0 |
| 1 | jason mraz | 0.989672 |
| 2 | james blunt | 0.989037 |
| 3 | the fray | 0.988397 |
| 4 | onerepublic | 0.988026 |
| 5 | black eyed peas | 0.986592 |
| 6 | justin timberlake | 0.986103 |
| 7 | keane | 0.985836 |
| 8 | mika | 0.985701 |
| 9 | coldplay | 0.985569 |
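The query artist scoring exactly 1.0 against itself is what one would expect if similarity is the cosine between latent item vectors. The sketch below assumes that definition and uses made-up 2-dimensional factors. `similar_items_cosine` is a hypothetical helper, not the library's method.

```python
import numpy as np


def similar_items_cosine(item_factors, itemid, n=3):
    """Rank items by cosine similarity to the item at index itemid."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1e-10  # guard against division by zero
    scores = item_factors @ item_factors[itemid] / (norms * norms[itemid])
    order = np.argsort(-scores)[:n]
    return order, scores[order]


# Made-up latent vectors for four artists.
toy_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
ids, sims = similar_items_cosine(toy_factors, itemid=0)
print(ids, sims)
```

Because cosine similarity ignores vector length, an artist with few plays can still rank as highly similar if its latent direction matches the query.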
