# Lab 2: Collaborative Filtering on the Last.fm Dataset

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/biodatlab/xlab-recommendation/blob/main/solution_notebooks/02_collaborative_filtering_lastfm.ipynb) 

In this lab, we use the Last.fm (https://www.last.fm/) 360K Users dataset (http://ocelma.net/MusicRecommendationDataset/lastfm-360K.html). \
The dataset contains <user, artist, plays> tuples for 360,000 users. \
Each line of the raw data file has the format: <em> user-mboxsha1 \t musicbrainz-artist-id \t artist-name \t plays </em>
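
As a minimal sketch of how lines in this tab-separated format could be read with pandas, the snippet below parses two made-up rows. The sample values and the column names (`user`, `artist_mbid`, `artist_name`, `plays`) are assumptions for illustration and do not come from the dataset; the lab itself loads a preprocessed copy through `implicit` instead.

```python
import io

import pandas as pd

# Two made-up rows in the documented column order (hypothetical values)
sample = (
    "user_a\tmbid-1\tartist one\t120\n"
    "user_b\tmbid-2\tartist two\t7\n"
)

# Column names are illustrative; the raw file has no header row
columns = ["user", "artist_mbid", "artist_name", "plays"]
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=columns)

# Keep only the <user, artist, plays> triple used for collaborative filtering
triples = df[["user", "artist_name", "plays"]]
print(triples)
```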

Install the required libraries

In [None]:
# !pip install pandas
# !pip install numpy
!pip install implicit
!pip install gradio

Use the `implicit.datasets` module to download the Last.fm dataset locally

In [3]:
import pandas as pd
import numpy as np
from implicit.datasets.lastfm import get_lastfm

# `artists` and `users` are string arrays labeling the rows and columns of the artist_user_plays matrix.
# artist_user_plays is a scipy sparse matrix of play counts:
# each row is an artist, each column is a user, and each value is the number of plays.

artists, users, artist_user_plays = get_lastfm()
print(artist_user_plays)


  (0, 73470)	32.0
  (0, 97856)	24.0
  (0, 235382)	1339.0
  (0, 266072)	211.0
  (1, 171865)	23.0
  (2, 180892)	70.0
  (3, 285031)	23.0
  (4, 15103)	9.0
  (5, 81700)	16.0
  (6, 284057)	56.0
  (7, 335320)	24.0
  (8, 182831)	113.0
  (9, 12461)	3.0
  (10, 78717)	2.0
  (10, 149431)	2.0
  (10, 220512)	2.0
  (10, 261830)	2.0
  (10, 280610)	2.0
  (10, 297146)	2.0
  (11, 296825)	118.0
  (12, 332435)	202.0
  (13, 41075)	30.0
  (14, 298571)	138.0
  (15, 295693)	219.0
  (16, 185703)	35.0
  :	:
  (292364, 4775)	5.0
  (292365, 147943)	1329.0
  (292366, 95230)	157.0
  (292367, 56086)	5.0
  (292367, 137277)	70.0
  (292367, 287297)	87.0
  (292368, 294859)	68.0
  (292369, 308202)	125.0
  (292370, 42122)	315.0
  (292370, 263053)	729.0
  (292370, 301225)	7.0
  (292371, 229732)	3.0
  (292372, 355627)	384.0
  (292373, 337693)	486.0
  (292374, 212855)	48.0
  (292375, 231253)	127.0
  (292376, 147738)	7.0
  (292377, 220443)	125.0
  (292378, 219957)	1.0
  (292379, 14949)	66.0
  (292380, 218794)	25.0
  (292381, 2

Weight the matrix before training the model. BM25 weighting is applied for two reasons:
- Reducing the impact of users who have played the same artist thousands of times.
- Reducing the weight given to popular items.
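
To make the two effects listed above concrete, here is a rough sketch of BM25-style weighting written with NumPy and SciPy. The function name `bm25_weight_sketch` and the toy matrix are hypothetical, and the code follows the general BM25 formula; `implicit`'s own `bm25_weight` may differ in detail.

```python
import numpy as np
from scipy.sparse import coo_matrix


def bm25_weight_sketch(plays, K1=100, B=0.8):
    """BM25-style weighting of an item-by-user count matrix (illustrative only)."""
    X = coo_matrix(plays, dtype=np.float64)
    n_rows = float(X.shape[0])
    # IDF term: columns with many nonzero entries receive a smaller weight
    idf = np.log(n_rows) - np.log1p(np.bincount(X.col, minlength=X.shape[1]))
    # Length normalization: rows with a large total count are scaled down
    row_sums = np.ravel(X.sum(axis=1))
    length_norm = (1.0 - B) + B * row_sums / row_sums.mean()
    # Saturating term: very large counts grow far more slowly than linearly
    data = X.data * (K1 + 1.0) / (K1 * length_norm[X.row] + X.data) * idf[X.col]
    return coo_matrix((data, (X.row, X.col)), shape=X.shape)


# Toy artist-by-user matrix (made-up counts); one entry is an extreme outlier
plays = np.array([
    [5000, 2, 0],
    [0, 3, 0],
    [0, 0, 4],
    [1, 0, 0],
])
weighted = bm25_weight_sketch(plays).toarray()
print(weighted.round(2))
```

In the raw counts the outlier entry is 5000 times larger than the single play in the same column; after weighting, the gap between those two entries is far smaller.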


In [None]:
from implicit.nearest_neighbours import bm25_weight

artist_user = bm25_weight(artist_user_plays, K1=100, B=0.8)
print(artist_user)

Train an ALS model using implicit

In [None]:
from implicit.als import AlternatingLeastSquares

model = AlternatingLeastSquares(factors=64, regularization=0.05, alpha=2.0)
# implicit expects a user-item (user-artist) matrix in CSR format, so transpose the weighted matrix
user_artist = artist_user.T.tocsr()

model.fit(user_artist)
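
For intuition about what "alternating least squares" means, the sketch below runs a plain regularized ALS on a tiny dense matrix. It is a simplification: every entry, including zeros, is treated as observed with equal weight, whereas `implicit`'s model uses a confidence-weighted formulation (controlled by `alpha`). The function `als_sketch` and the toy matrix `R` are hypothetical.

```python
import numpy as np


def als_sketch(R, factors=2, regularization=0.05, iterations=20, seed=0):
    """Plain regularized ALS on a small dense user-item matrix (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, factors))  # user factors
    V = rng.normal(scale=0.1, size=(n_items, factors))  # item factors
    reg = regularization * np.eye(factors)
    for _ in range(iterations):
        # Hold item factors fixed and solve a ridge regression for every user
        U = np.linalg.solve(V.T @ V + reg, V.T @ R.T).T
        # Hold user factors fixed and solve a ridge regression for every item
        V = np.linalg.solve(U.T @ U + reg, U.T @ R).T
    return U, V


# Toy user-item matrix with two taste groups (made-up values)
R = np.array([
    [5, 5, 0, 0],
    [4, 4, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
], dtype=float)

U, V = als_sketch(R)
scores = U @ V.T  # predicted affinity of each user for each item
print(scores.round(2))
```

Each score is the dot product of a user factor vector and an item factor vector, so ranking a row of `scores` gives that user's recommendations.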

Inspect the results: top-10 artist recommendations for a given user

In [None]:
# userid = 12345

# ids, scores = model.recommend(userid, user_artist[userid], N=10, filter_already_liked_items=False)
# # print(ids)
# df = pd.DataFrame({"artist": artists[ids], "score": scores, "already_liked": np.isin(ids, user_artist[userid].indices)})

In [None]:
import gradio as gr

def music_recommend(userid):
    userid = int(userid)  # Gradio passes the text input as a string
    ids, scores = model.recommend(userid, user_artist[userid], N=10, filter_already_liked_items=False)
    # "already_liked" marks recommended artists the user has already played
    df = pd.DataFrame({
        "artist": artists[ids],
        "score": scores,
        "already_liked": np.isin(ids, user_artist[userid].indices),
    })
    return df


demo = gr.Interface(
    fn=music_recommend,
    inputs="text",
    outputs="dataframe",
)
demo.launch()


In [None]:
# itemid = list(artists).index("maroon 5")
# print(f"Artist ID {itemid} : {artists[itemid]}")
# ids, scores = model.similar_items(itemid)

# # display the results using pandas for nicer formatting
# pd.DataFrame({"artist": artists[ids], "score": scores})

In [None]:
import gradio as gr

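def music_similarity(artist_name):
    # Look up the artist's index; the name must match an entry in `artists` exactly
    itemid = list(artists).index(artist_name)
    print(f"Artist ID {itemid}: {artists[itemid]}")
    # Find the artists whose learned factors are most similar to this one
    ids, scores = model.similar_items(itemid)
    df = pd.DataFrame({"artist": artists[ids], "score": scores})
    return df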


demo = gr.Interface(
    fn=music_similarity,
    inputs="text",
    outputs="dataframe",
)
demo.launch()