# LastFM Project  


## Description  

Mr. Pontier has contacted you for help building a recommender system. He has a database of (anonymized) user data containing the artists each user listens to on his platform, along with the number of plays. Mr. Pontier wants to recommend artists that his users have not yet listened to, based on their musical preferences.

Mr. Pontier wants to use the LightFM library, for which he already has a driver that loads his data and which he provides to you, a real bonus. He has seen that the documentation describes several models; he wants to evaluate them on a train/test split and use the best one.

## Background research  

* Which recommender system will you put in place?
* What is LightFM?
* What is an "implicit feedback" recommender system? And an "explicit feedback" one?

## Teaching arrangements

Group BLUE ['Xavier', 'Hachem', 'Jean-Pierre', 'Fatima', 'Olivier']  
Group DISTANCIEL [Bassem, Dan, Hachem, Ines, Jean-Pierre, Myriam, Nidhal, Olivier, Joshua]

## Deliverable  

Notebook & code for the backend (if a backend is built)  

## Resources  

Brief
* https://github.com/dtrckd/simplon_datai_2020/blob/master/brief_7/brief.md  

Last.fm datasets
* https://grouplens.org/datasets/hetrec-2011/  

LightFM GitHub
* https://github.com/lyst/lightfm  

LightFM doc
* https://making.lyst.com/lightfm/docs/index.html  

The world is large and we know just a small part of it; don't forget the big picture
* https://github.com/jihoo-kim/awesome-RecSys

ROC AUC  
* https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc?hl=fr  

Framapad = HOW TO for FLASK app  
* https://mensuel.framapad.org/p/vl6g3xporc-9lgd?lang=en  

Flask  
* https://www.tutorialspoint.com/flask/flask_http_methods.htm  

Flask POST  
* https://stackoverflow.com/questions/22947905/flask-example-with-post  



## Articles and tutorials on using LightFM

1. [Learning to Rank Sketchfab Models with LightFM](http://blog.ethanrosenthal.com/2016/11/07/implicit-mf-part-2/)
2. [Metadata Embeddings for User and Item Cold-start Recommendations](http://building-babylon.net/2016/01/26/metadata-embeddings-for-user-and-item-cold-start-recommendations/)
3. [Recommendation Systems - Learn Python for Data Science](https://www.youtube.com/watch?v=9gBC9R-msAk)
4. [Using LightFM to Recommend Projects to Consultants](https://medium.com/product-at-catalant-technologies/using-lightfm-to-recommend-projects-to-consultants-44084df7321c#.gu887ky51)
5. [Towards Data Science](https://towardsdatascience.com/how-to-build-a-movie-recommender-system-in-python-using-lightfm-8fa49d7cbe3b)

In [1]:
import pandas as pd
import numpy as np

In [2]:
plays = pd.read_csv('../data/hetrec2011-lastfm-2k/user_artists.dat', sep='\t')
artists = pd.read_csv('../data/hetrec2011-lastfm-2k/artists.dat', sep='\t', usecols=['id','name'])

# Merge artist and user pref data
ap = pd.merge(artists, plays, how="inner", left_on="id", right_on="artistID")
ap = ap.rename(columns={"weight": "playCount"})

# Group artist by name
artist_rank = ap.groupby(['name']) \
    .agg({'userID' : 'count', 'playCount' : 'sum'}) \
    .rename(columns={"userID" : 'totalUsers', "playCount" : "totalPlays"}) \
    .sort_values(['totalPlays'], ascending=False)

artist_rank['avgPlays'] = (artist_rank['totalPlays'] / artist_rank['totalUsers']).round(0)
print(artist_rank)

                    totalUsers  totalPlays  avgPlays
name                                                
Britney Spears             522     2393140    4585.0
Depeche Mode               282     1301308    4615.0
Lady Gaga                  611     1291387    2114.0
Christina Aguilera         407     1058405    2601.0
Paramore                   399      963449    2415.0
...                        ...         ...       ...
Morris                       1           1       1.0
Eddie Kendricks              1           1       1.0
Excess Pressure              1           1       1.0
My Mine                      1           1       1.0
A.M. Architect               1           1       1.0

[17632 rows x 3 columns]


In [3]:
# Merge into ap matrix
ap = ap.join(artist_rank, on="name", how="inner") \
    .sort_values(['playCount'], ascending=False)

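# Preprocessing: min-max scale the play counts to [0, 1]
# Note: rows with the minimum play count scale to exactly 0, so they become
# indistinguishable from "never listened" in the matrices built below
pc = ap.playCount
play_count_scaled = (pc - pc.min()) / (pc.max() - pc.min())
ap = ap.assign(playCountScaled=play_count_scaled)
#print(ap)

# Build a user-artist rating matrix 
ratings_df = ap.pivot(index='userID', columns='artistID', values='playCountScaled')
ratings = ratings_df.fillna(0).values

# Show sparsity (the value printed is the percentage of non-zero entries, i.e. the density)
sparsity = float(len(ratings.nonzero()[0])) / (ratings.shape[0] * ratings.shape[1]) * 100
print("sparsity: %.2f" % sparsity)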


sparsity: 0.28
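
The min-max scaling in the cell above maps the smallest play count to exactly 0. A minimal sketch of that effect on a small sample follows; the sample values are made up (only the maximum, 352698, is taken from the data shown here), and the log scaling is one possible alternative given as an assumption, not the notebook's choice.

```python
import numpy as np

# Made-up sample of play counts (the maximum matches the dataset's top value)
play_counts = np.array([1, 1, 5, 40, 352698], dtype=np.float64)

# Min-max scaling, as in the cell above: the minimum becomes exactly 0
min_max = (play_counts - play_counts.min()) / (play_counts.max() - play_counts.min())
print(np.count_nonzero(min_max), "of", len(play_counts), "values stay non-zero")

# Possible alternative (assumption): log scaling keeps every observed play non-zero
log_scaled = np.log1p(play_counts) / np.log1p(play_counts.max())
print(np.count_nonzero(log_scaled), "of", len(play_counts), "values stay non-zero")
```

With min-max scaling, the two users who played an artist once end up with a stored value of 0, which a sparse matrix treats the same as no interaction at all.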


In [4]:
ap

|  | id | name | userID | artistID | playCount | totalUsers | totalPlays | avgPlays | playCountScaled |
|---|---|---|---|---|---|---|---|---|---|
| 2800 | 72 | Depeche Mode | 1642 | 72 | 352698 | 282 | 1301308 | 4615.0 | 1.000000 |
| 35843 | 792 | Thalía | 2071 | 792 | 324663 | 26 | 350035 | 13463.0 | 0.920513 |
| 27302 | 511 | U2 | 1094 | 511 | 320725 | 185 | 493024 | 2665.0 | 0.909347 |
| 8152 | 203 | Blur | 1905 | 203 | 257978 | 114 | 318221 | 2791.0 | 0.731441 |
| 26670 | 498 | Paramore | 1664 | 498 | 227829 | 399 | 963449 | 2415.0 | 0.645960 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38688 | 913 | Destiny's Child | 1810 | 913 | 1 | 83 | 34746 | 419.0 | 0.000000 |
| 32955 | 697 | Sia | 1290 | 697 | 1 | 56 | 27597 | 493.0 | 0.000000 |
| 71811 | 4988 | Chris Spheeris | 510 | 4988 | 1 | 5 | 3106 | 621.0 | 0.000000 |
| 91319 | 17080 | Haylie Duff | 1851 | 17080 | 1 | 1 | 1 | 1.0 | 0.000000 |


In [5]:
from scipy.sparse import csr_matrix

# Build a sparse matrix
X = csr_matrix(ratings)

n_users, n_items = ratings_df.shape
print("ratings matrix shape", ratings_df.shape)

user_ids = ratings_df.index.values
artist_names = ap.sort_values("artistID")["name"].unique()

ratings matrix shape (1892, 17632)
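
The dense pivot is convenient, but the same user-artist matrix can be built directly in sparse form from the long table. The sketch below assumes a table with the same column names as `ap`; the toy values and the names `toy`, `user_index`, `item_index`, and `X_toy` are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format table (made-up values) with the same columns as `ap`
toy = pd.DataFrame({
    "userID":    [2, 2, 3, 5, 5],
    "artistID":  [51, 72, 72, 51, 90],
    "playCount": [10, 3, 7, 1, 4],
})

# Map raw IDs to contiguous row/column positions, sorted like a pivot's index/columns
user_codes, user_index = pd.factorize(toy["userID"], sort=True)
item_codes, item_index = pd.factorize(toy["artistID"], sort=True)

# Build the sparse matrix from (value, (row, column)) triples
X_toy = csr_matrix(
    (toy["playCount"].to_numpy(dtype=np.float64), (user_codes, item_codes)),
    shape=(len(user_index), len(item_index)),
)

# Density: percentage of cells that hold an interaction
density = 100.0 * X_toy.nnz / (X_toy.shape[0] * X_toy.shape[1])
print(X_toy.shape, X_toy.nnz, "density: %.2f%%" % density)
```

`user_index` and `item_index` keep the mapping from matrix positions back to the raw IDs, which is needed later to turn model output into artist names.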


In [6]:
from lightfm import LightFM
from lightfm.evaluation import auc_score, precision_at_k, recall_at_k
from lightfm.cross_validation import random_train_test_split
from lightfm.data import Dataset

# Build data references + train test
Xcoo = X.tocoo()
data = Dataset()
data.fit(np.arange(n_users), np.arange(n_items))
interactions, weights = data.build_interactions(zip(Xcoo.row, Xcoo.col, Xcoo.data)) 
train, test = random_train_test_split(interactions)

# Ignore this block: the split is done on the binary `interactions` matrix,
# so the play-count `weights` are not carried into train/test
#train = train_.tocsr()
#test = test_.tocsr()
#train[train==1] = X[train==1]
#test[test==1] = X[test==1]

# To be completed...

For the evaluation, he wants to compare AUC, precision, and recall (see the LightFM documentation). These should be presented in a table giving the values for both the train and test sets, and then compared.  

Warning: the train and test sets have a slightly different form from what we are used to seeing, so look at their shape and investigate what they are and what they represent.

In [7]:
train

<1892x17632 sparse matrix of type '<class 'numpy.int32'>'
	with 73758 stored elements in COOrdinate format>

In [8]:
test

<1892x17632 sparse matrix of type '<class 'numpy.int32'>'
	with 18440 stored elements in COOrdinate format>
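
Both outputs have the full (users x artists) shape: the split divides the stored interactions, not the rows. A minimal sketch of how to look inside such a COO matrix follows, using a made-up 3 x 4 stand-in (`toy_train`) rather than the real `train`.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy stand-in for `train`: 3 users x 4 artists, 1 = "user listened to artist"
rows = np.array([0, 0, 1, 2], dtype=np.int32)
cols = np.array([1, 3, 0, 3], dtype=np.int32)
vals = np.ones(4, dtype=np.int32)
toy_train = coo_matrix((vals, (rows, cols)), shape=(3, 4))

print(toy_train.shape)  # (n_users, n_items), identical for train and test
print(toy_train.nnz)    # number of stored interactions

# Each stored element is one (user index, artist index, value) triple
for u, i, v in zip(toy_train.row, toy_train.col, toy_train.data):
    print(u, i, v)

# Interactions per user: the per-user metrics are computed over these
per_user = np.bincount(toy_train.row, minlength=toy_train.shape[0])
print(per_user)
```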

### LightFM documentation  

The **random_train_test_split** function takes an interaction set and splits it into two disjoint sets, a training set and a test set.  
Note that no effort is made to make sure that all items and users with interactions in the test set also have interactions in the training set.  
This may lead to a partial cold-start problem in the test set.

    Parameters
    ----------

    interactions : a scipy sparse matrix containing interactions
        The interactions to split.
    test_percentage : float, optional
        The fraction of interactions to place in the test set.
    random_state : int or numpy.random.RandomState, optional
        Random seed used to initialize the numpy.random.RandomState number generator.
        Accepts an instance of numpy.random.RandomState for backwards compatibility.

    Returns
    -------

    (train, test) : (scipy.sparse.COOMatrix, scipy.sparse.COOMatrix)
         A tuple of (train data, test data)
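
To make the behaviour described above concrete, here is a rough sketch of such a split written with NumPy and SciPy only. It is not LightFM's implementation; `split_interactions` is a hypothetical helper and the toy matrix is made up.

```python
import numpy as np
from scipy.sparse import coo_matrix

def split_interactions(interactions, test_percentage=0.2, seed=0):
    """Hypothetical helper: shuffle the stored entries and cut them into two COO matrices."""
    interactions = interactions.tocoo()
    rng = np.random.RandomState(seed)
    order = rng.permutation(interactions.nnz)
    n_test = int(round(test_percentage * interactions.nnz))
    test_idx, train_idx = order[:n_test], order[n_test:]

    def subset(idx):
        # Same shape as the input; only the selected entries are stored
        return coo_matrix(
            (interactions.data[idx], (interactions.row[idx], interactions.col[idx])),
            shape=interactions.shape,
        )

    return subset(train_idx), subset(test_idx)

# Toy binary interactions: 4 users x 5 artists, 10 stored entries
rows = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
cols = np.array([0, 1, 2, 1, 3, 0, 2, 4, 3, 4])
toy = coo_matrix((np.ones(10, dtype=np.int32), (rows, cols)), shape=(4, 5))

toy_train, toy_test = split_interactions(toy, test_percentage=0.2, seed=42)
print(toy_train.shape, toy_train.nnz, toy_test.shape, toy_test.nnz)

# Disjoint sets: no (user, artist) pair is stored in both matrices
overlap = toy_train.tocsr().multiply(toy_test.tocsr())
print(overlap.nnz)

# Users present in test but absent from train (the partial cold-start case)
cold_users = np.setdiff1d(toy_test.row, toy_train.row)
print(cold_users)
```

Nothing in the shuffle prevents all of a user's interactions from landing in the test set, which is where the cold-start caveat comes from.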

In [9]:
# Train
model = LightFM(learning_rate=0.05, loss='warp')
model.fit(train, epochs=10, num_threads=2)

<lightfm.lightfm.LightFM at 0x7f718dc7a190>

In [10]:
# Evaluate
train_precision = precision_at_k(model, train, k=10).mean()
test_precision = precision_at_k(model, test, k=10, train_interactions=train).mean()

train_recall = recall_at_k(model, train, k=10).mean()
test_recall = recall_at_k(model, test, k=10, train_interactions=train).mean()

train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test, train_interactions=train).mean()

print('PRECISION\tTRAIN\t{:.2%}\t\tTEST\t{:.2%}'.format(train_precision, test_precision))
print('RECALL\t\tTRAIN\t{:.2%}\t\tTEST\t{:.2%}'.format(train_recall, test_recall))
print('AUC\t\tTRAIN\t{:.2%}\t\tTEST\t{:.2%}'.format(train_auc, test_auc))

PRECISION	TRAIN	36.45%		TEST	11.61%
RECALL		TRAIN	9.41%		TEST	12.02%
AUC		TRAIN	96.43%		TEST	85.18%


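### Recall is better on the test set!

This is not necessarily a sign of a problem: recall@k divides a user's hits by that user's number of positives in the evaluated set, and the test set holds far fewer interactions (18,440) than the train set (73,758), so the denominator is much smaller.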
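
The three metrics can be computed by hand for a single user to see how they react to the number of positives. The sketch below uses made-up scores and positives; the helper names are hypothetical, and unlike LightFM's test evaluation with `train_interactions`, it does not remove training items from the ranking.

```python
import numpy as np

def precision_recall_at_k(scores, positives, k):
    """Precision@k and recall@k for one user, given one score per item."""
    top_k = np.argsort(-scores)[:k]
    hits = len(set(top_k.tolist()) & set(positives))
    return hits / k, hits / len(positives)

def auc(scores, positives):
    """Probability that a random positive item is scored above a random negative one."""
    is_pos = np.zeros(len(scores), dtype=bool)
    is_pos[list(positives)] = True
    pos, neg = scores[is_pos], scores[~is_pos]
    return float((pos[:, None] > neg[None, :]).mean())

# One hypothetical user, 10 items, made-up scores (item 0 is ranked first)
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])

train_positives = [0, 1, 3, 5, 6, 7, 8, 9]  # many positives, like the train set
test_positives = [2, 4]                     # few positives, like the test set

train_p, train_r = precision_recall_at_k(scores, train_positives, k=3)
test_p, test_r = precision_recall_at_k(scores, test_positives, k=3)
print("train precision %.3f recall %.3f" % (train_p, train_r))
print("test  precision %.3f recall %.3f" % (test_p, test_r))
print("test AUC %.4f" % auc(scores, test_positives))
```

In this toy case precision drops from train to test while recall rises, the same pattern as in the table above: one hit out of two positives counts for more than two hits out of eight.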

In [11]:
import time
from datetime import datetime

def LastFM_eval():
    # Current date and time
    now = datetime.now()

    learning_rate = [0.05, 0.08, 0.10]
    # 'logistic' and 'warp-kos' are left out for this kind of dataset
    loss = ['bpr', 'warp']
    # k is the cutoff for precision@k and recall@k; AUC does not depend on it
    k = [5, 10]
    results = []
    
    for x in learning_rate:
        for y in loss:
            for z in k:
                model = LightFM(learning_rate=x, loss=y)
                
                # Train
                t1 = time.process_time()
                model.fit(train, epochs=10, num_threads=2)
                t2 = time.process_time()
                
                # Time execution measurement
                fit_time = t2 - t1
                
                # Evaluate
                train_precision = precision_at_k(model, train, k=z).mean().round(4)
                test_precision = precision_at_k(model, test, k=z, train_interactions=train).mean().round(4)
                
                train_recall = recall_at_k(model, train, k=z).mean().round(4)
                test_recall = recall_at_k(model, test, k=z, train_interactions=train).mean().round(4)
                
                train_AUC = auc_score(model, train).mean().round(4)
                test_AUC = auc_score(model, test, train_interactions=train).mean().round(4)
                
                # Record this run in the results list
                dict_temp = {
                    'Time': fit_time,
                    'K': z,
                    'Name': y,
                    'Learning Rate': x,
                    'Train PRECISION': train_precision,
                    'Train RECALL': train_recall,
                    'Train AUC': train_AUC,
                    'Test PRECISION': test_precision,
                    'Test RECALL': test_recall,
                    'Test AUC': test_AUC
                }
                
                results.append(dict_temp)
    
    # Record in results dataframe
    results = pd.DataFrame(results)
    results.to_csv('../records/{} - LastFM - Evaluation.csv'.format(now), encoding='utf-8')
    return results


In [12]:
LastFM_eval()

|  | Time | K | Name | Learning Rate | Train PRECISION | Train RECALL | Train AUC | Test PRECISION | Test RECALL | Test AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.855713 | 5 | bpr | 0.05 | 0.4355 | 0.0564 | 0.8506 | 0.159 | 0.082 | 0.7762 |
| 1 | 0.829543 | 10 | bpr | 0.05 | 0.3683 | 0.0948 | 0.8518 | 0.1174 | 0.1201 | 0.7809 |
| 2 | 0.867976 | 5 | warp | 0.05 | 0.4423 | 0.0571 | 0.9666 | 0.1688 | 0.0881 | 0.8536 |
| 3 | 0.87835 | 10 | warp | 0.05 | 0.3919 | 0.1008 | 0.9671 | 0.1276 | 0.1314 | 0.8564 |
| 4 | 1.007056 | 5 | bpr | 0.08 | 0.4879 | 0.0629 | 0.9109 | 0.1766 | 0.0923 | 0.8058 |
| 5 | 0.803412 | 10 | bpr | 0.08 | 0.4152 | 0.1072 | 0.9044 | 0.1279 | 0.1322 | 0.804 |
| 6 | 0.88958 | 5 | warp | 0.08 | 0.4188 | 0.0542 | 0.9805 | 0.151 | 0.0784 | 0.8461 |
| 7 | 1.075852 | 10 | warp | 0.08 | 0.3938 | 0.1023 | 0.9822 | 0.1231 | 0.1275 | 0.8503 |
| 8 | 0.810177 | 5 | bpr | 0.1 | 0.4867 | 0.0631 | 0.9166 | 0.1645 | 0.0861 | 0.8103 |
| 9 | 0.81069 | 10 | bpr | 0.1 | 0.4016 | 0.1036 | 0.9174 | 0.1166 | 0.1204 | 0.8079 |


### Best test AUC ≈ 86% (0.8564) is reached in ≈ 900 ms with:
* k = 10 (evaluation cutoff only; AUC does not depend on k)  
* Loss = WARP  
* Learning rate = 0.05
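
Rather than reading the best configuration off the table by eye, it can be selected programmatically. A small sketch, using four rows copied from the results table above (the k = 10 runs at learning rates 0.05 and 0.08):

```python
import pandas as pd

# Four rows copied from the results table above
results = pd.DataFrame({
    "K": [10, 10, 10, 10],
    "Name": ["bpr", "warp", "bpr", "warp"],
    "Learning Rate": [0.05, 0.05, 0.08, 0.08],
    "Test AUC": [0.7809, 0.8564, 0.8040, 0.8503],
})

# Row with the highest test AUC
best = results.loc[results["Test AUC"].idxmax()]
print(best["Name"], best["Learning Rate"], best["Test AUC"])

# Mean test AUC per loss, to compare the losses across learning rates
print(results.groupby("Name")["Test AUC"].mean())
```

The same two lines would work on the full DataFrame returned by `LastFM_eval()`.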

## Part 1

Now that we have the results table, here are two additional subtasks that will help us evaluate and interpret our model:

* get_recommendation, which takes a USER as input and returns the recommended ARTISTS (from best to worst by recommendation score)  
* get_ground_truth, which returns the artists a user has listened to, in decreasing order of playCountScaled  

This will let us evaluate qualitatively the results the model returns and compare them with the ground truth.

In [13]:
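# BEST model
# Note: LightFM's `k` is a k-OS training parameter, not the precision@k cutoff;
# it has no effect with loss='warp'
model = LightFM(learning_rate=0.05, loss='warp', k=10)
model.fit(train, epochs=10, num_threads=2)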

<lightfm.lightfm.LightFM at 0x7f718774dd10>

In [14]:
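def get_recommendation(user_id):
    # user_id is the model's internal user index (row position in ratings_df),
    # not the raw Last.fm userID
    pred = model.predict(user_id, np.arange(n_items))
    # Sort artists from highest to lowest predicted score
    recommendation = artist_names[np.argsort(-pred)]
    return recommendation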

In [15]:
get_recommendation(2)

array(['Muse', 'The Beatles', 'The Killers', ..., 'Off the Sky',
       'Anthony B', 'Jason Edward Dudley'], dtype=object)
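
`get_recommendation` passes its argument straight to `model.predict`, which expects the model's internal user index. Because the `Dataset` was fit on `np.arange(n_users)`, that index is the row position in `ratings_df`, whereas a filter on the `userID` column uses the raw Last.fm ID. The sketch below shows one way to convert between the two; `user_id_to_row` is a hypothetical helper, the IDs are made up, and it assumes `user_ids` is sorted (as a pivot index normally is).

```python
import numpy as np

# Toy stand-in for `user_ids` (sorted raw IDs; made-up values)
toy_user_ids = np.array([2, 3, 4, 7, 9])

def user_id_to_row(raw_user_id, user_ids):
    """Hypothetical helper: position of a raw userID in the sorted `user_ids` array."""
    row = np.searchsorted(user_ids, raw_user_id)
    if row >= len(user_ids) or user_ids[row] != raw_user_id:
        raise KeyError("unknown userID: %r" % raw_user_id)
    return int(row)

print(user_id_to_row(2, toy_user_ids))  # raw ID 2 sits at row 0, not row 2
print(user_id_to_row(7, toy_user_ids))
```

With such a mapping, recommendations and ground truth can be requested for the same person, for example by calling `get_recommendation(user_id_to_row(2, user_ids))`.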

In [16]:
def get_ground_truth(user_id):
    # user_id is the raw Last.fm userID (value of the userID column)
    ground_truth = ap[ap['userID']==user_id].sort_values(by="playCountScaled", ascending=False)
    ground_truth = ground_truth['name']
    return ground_truth

In [17]:
get_ground_truth(2)

542              Duran Duran
653                Morcheeba
676                      Air
751             Hooverphonic
769            Kylie Minogue
1067               Daft Punk
1218    Thievery Corporation
1237               Goldfrapp
1319               New Order
1426             Matt Bianco
1428               Talk Talk
1453           Prefab Sprout
1456                  Enigma
1497                Röyksopp
1568                Coldplay
1937               Faithless
1956                 Madonna
2385                Icehouse
2402                    Sade
2431                    Moby
2516                    Dido
2573            Depeche Mode
2855            Café Del Mar
2862                   Basia
2863              Camouflage
2895              Electronic
2910          George Michael
2953          The Adventures
2955         Fiction Factory
2956           Groove Armada
2970              Portishead
3077             Marc Almond
3086              Cock Robin
3087                Cut Copy
3128          

## Part 2  

Compare the AUC of the best LightFM model with that of a PCA (TruncatedSVD).
Training should be as fast as possible while still achieving the best results, so you are asked to find the number of iterations needed to reach 95% of the maximum AUC on the test set.

* hyperparameter optimization (k, n, learning_schedule, learning_rate)
* clustering of artists with the embeddings: plot the artist clusters based on the item-embedding matrix.  
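
For the 95% criterion, one possible approach is to record the test AUC after every epoch (with LightFM this could be done by training one epoch at a time with `fit_partial` and calling `auc_score` after each) and then look for the first epoch that reaches the target. The sketch below covers only that last step; `epochs_to_reach` is a hypothetical helper and the curve is made up.

```python
import numpy as np

def epochs_to_reach(auc_per_epoch, fraction=0.95):
    """First epoch (1-based) whose AUC is at least `fraction` of the best AUC in the curve."""
    auc_per_epoch = np.asarray(auc_per_epoch, dtype=float)
    target = fraction * auc_per_epoch.max()
    # argmax on a boolean array returns the position of the first True
    return int(np.argmax(auc_per_epoch >= target)) + 1

# Made-up test-AUC curve, one value per training epoch
toy_curve = [0.60, 0.72, 0.79, 0.83, 0.85, 0.86, 0.86]
print(epochs_to_reach(toy_curve))
```

The maximum is taken over the recorded curve, so the curve has to be long enough for the AUC to plateau.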

In [18]:
import time
from datetime import datetime

def LightFM_best_param():
    # Current date and time
    now = datetime.now()
    results = []

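    # Note: k and n are only used by the 'warp-kos' loss; with loss='warp' they do
    # not change the model, so differences across k/n reflect run-to-run randomness
    #k = [5, 10]
    k = np.arange(1, 11)
    #n = [5, 10]
    n = np.arange(1, 11)
    #learning_schedule=['adagrad', 'adadelta']
    #learning_rate = [0.05, 0.1]
    learning_rate = np.arange(0.01, 0.11, 0.01)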
    
    for x in learning_rate:
        for y in n:
            for z in k:
                clf = LightFM(loss='warp', learning_rate=x, n=y, k=z)
                
                # Train
                t1 = time.process_time()
                clf.fit(train, epochs=10, num_threads=2)
                t2 = time.process_time()
                
                # Time execution measurement
                fit_time = t2 - t1
                
                # Evaluate
                train_AUC = auc_score(clf, train).mean().round(4)
                test_AUC = auc_score(clf, test, train_interactions=train).mean().round(4)
                
                # Record this run in the results list
                dict_temp = {
                    'Time': fit_time,
                    'K': z,
                    'N': y,
                    'Learning Rate': x,
                    'Train AUC': train_AUC,
                    'Test AUC': test_AUC
                }
                
                results.append(dict_temp)
    
    # Record in results dataframe
    results = pd.DataFrame(results)
    results.to_csv('../records/{} - LightFM - Evaluation.csv'.format(now), encoding='utf-8')
    results = results.sort_values(by="Test AUC", ascending=False)
    return results


In [19]:
LightFM_best_param()

|  | Time | K | N | Learning Rate | Train AUC | Test AUC |
|---|---|---|---|---|---|---|
| 537 | 1.390897 | 8 | 4 | 0.06 | 0.9747 | 0.8583 |
| 683 | 1.255717 | 4 | 9 | 0.07 | 0.9802 | 0.8576 |
| 592 | 0.899300 | 3 | 10 | 0.06 | 0.9738 | 0.8575 |
| 577 | 0.977884 | 8 | 8 | 0.06 | 0.9739 | 0.8575 |
| 593 | 1.307713 | 4 | 10 | 0.06 | 0.9751 | 0.8574 |
| ... | ... | ... | ... | ... | ... | ... |
| 30 | 0.842468 | 1 | 4 | 0.01 | 0.8841 | 0.8031 |
| 74 | 0.936122 | 5 | 8 | 0.01 | 0.8860 | 0.8028 |
| 68 | 0.831800 | 9 | 7 | 0.01 | 0.8848 | 0.8025 |
| 19 | 0.930456 | 10 | 2 | 0.01 | 0.8848 | 0.8024 |
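
For the comparison with TruncatedSVD requested in Part 2, a baseline could be sketched as follows. This is an illustration on made-up toy data, not a result for the Last.fm set: `mean_auc` is a hypothetical helper that only roughly mirrors a per-user AUC computed with training items excluded, and scikit-learn's `TruncatedSVD` does not center the data as a PCA would.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

def mean_auc(scores, test, train):
    """Mean per-user AUC on `test`, ignoring items already seen in `train`."""
    aucs = []
    for u in range(test.shape[0]):
        candidates = train[u].toarray().ravel() == 0
        is_pos = test[u].toarray().ravel() > 0
        pos = scores[u][candidates & is_pos]
        neg = scores[u][candidates & ~is_pos]
        if len(pos) and len(neg):
            aucs.append((pos[:, None] > neg[None, :]).mean())
    return float(np.mean(aucs))

# Toy binary data: 6 users x 8 artists, two taste groups (made-up values)
full = np.zeros((6, 8), dtype=np.float64)
full[:3, :4] = 1.0
full[3:, 4:] = 1.0

# Hold out one listened-to artist per user as that user's test interaction
held_out = np.array([0, 1, 2, 4, 5, 6])
test_dense = np.zeros_like(full)
test_dense[np.arange(6), held_out] = 1.0
toy_train = csr_matrix(full - test_dense)
toy_test = csr_matrix(test_dense)

# Low-rank factorization of the training matrix
svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(toy_train)   # shape (n_users, 2)
svd_scores = user_factors @ svd.components_   # shape (n_users, n_items)

svd_auc = mean_auc(svd_scores, toy_test, toy_train)
print("TruncatedSVD test AUC: %.3f" % svd_auc)
```

On the real data, `toy_train` and `toy_test` would be replaced by the CSR versions of `train` and `test`, and the resulting AUC compared with the LightFM values in the table above.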


## Part 3 (to be covered on Wednesday, to be completed)  

Build a client-server application for querying the model:

* 1 page to look up/request a user's preferences
* 1 page comparing the 10 most-listened-to artists with the 10 most-recommended artists  
* Also display the distribution of recommended artists, taking only the top 5 artists per user  
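
As a starting point for the comparison page, here is a minimal Flask sketch. The route, the function name, and the two in-memory dictionaries are hypothetical stand-ins; in a real backend they would be filled from `get_recommendation` and `get_ground_truth`.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory stand-ins for the model output and the listening history
TOP_RECOMMENDED = {2: ["Muse", "The Beatles", "The Killers"]}
TOP_LISTENED = {2: ["Duran Duran", "Morcheeba", "Air"]}

@app.route("/users/<int:user_id>/comparison")
def comparison(user_id):
    """Return a user's most-listened-to and most-recommended artists side by side."""
    if user_id not in TOP_LISTENED:
        return jsonify({"error": "unknown user"}), 404
    return jsonify({
        "user_id": user_id,
        "listened": TOP_LISTENED[user_id][:10],
        "recommended": TOP_RECOMMENDED.get(user_id, [])[:10],
    })
```

The app can be started locally with `app.run()` and queried with a GET request on `/users/2/comparison`; an HTML page or a plot of the recommendation distribution could be layered on top of the same JSON.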