# Goal of the Notebook

In this notebook, we create a Collaborative Recommender based on Artists. 

Since we now use the **implicit** Python library, we have to adapt the code we used before: the library works with SciPy CSR matrices instead of our previous pandas DataFrames.
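As an illustration of this change of data structure, the sketch below turns a small "long" pandas DataFrame into a SciPy CSR matrix. The DataFrame, its column names (`user`, `artist`, `plays`) and the variable `toy_plays` are hypothetical and only serve the example; the artist x user orientation mirrors the one used in the rest of this notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy play counts, one row per (user, artist) pair.
plays_df = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u3", "u3"],
    "artist": ["daft punk", "coldplay", "coldplay", "daft punk", "moby"],
    "plays":  [12, 3, 7, 1, 5],
})

# Map each label to a contiguous integer code (order of first appearance).
artist_codes, artist_labels = pd.factorize(plays_df["artist"])
user_codes, user_labels = pd.factorize(plays_df["user"])

# Rows are artists and columns are users.
toy_plays = csr_matrix(
    (plays_df["plays"].to_numpy(dtype=np.float32), (artist_codes, user_codes)),
    shape=(len(artist_labels), len(user_labels)),
)

print(toy_plays.toarray())
```

Keeping `artist_labels` and `user_labels` around makes it possible to map matrix positions back to names later on.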

We did not have time to fully integrate this new recommender into the pipeline, but we still have results we would like to share.

In [1]:
import h5py
import numpy as np
import pandas as pd
import tqdm
from scipy.sparse import coo_matrix, csr_matrix


from implicit.als import AlternatingLeastSquares
from implicit.approximate_als import (
    AnnoyAlternatingLeastSquares,
    FaissAlternatingLeastSquares,
    NMSLibAlternatingLeastSquares,
)
from implicit.bpr import BayesianPersonalizedRanking
from implicit.datasets.lastfm import get_lastfm
from implicit.lmf import LogisticMatrixFactorization
from implicit.nearest_neighbours import (
    BM25Recommender,
    CosineRecommender,
    TFIDFRecommender,
    bm25_weight,
)

## Loading Data from our HDF5 file

In [2]:
with h5py.File('data/360k.hdf5', "r") as f:
    m = f.get("artist_user_plays")
    plays = csr_matrix((m.get("data"), m.get("indices"), m.get("indptr")))
    artists, users, plays = np.array(f["artist"]), np.array(f["user"]), plays

## Train Test split

The test matrix is the full dataset, binarized so that every observed (artist, user) pair equals 1.

To build the training matrix, we copy the full dataset, set a percentage of its nonzero entries to zero, and keep the list of users from whom entries were removed.

We train on the matrix with removed entries, and at test time we compute the metrics only on those users.

In [3]:
import random

# from https://jessesw.com/Rec-System/
def train_test_split(ratings, pct_test=0.2, seed=1):
    # Seed the random generator for reproducibility
    random.seed(seed)

    # The test set is the full matrix, binarized (1 wherever an interaction exists)
    test_set = ratings.copy()
    test_set[test_set != 0] = 1

    # Find the (row, column) = (artist, user) pairs that have an interaction
    training_set = ratings.copy()
    nonzero_inds = training_set.nonzero()
    nonzero_pairs = list(zip(nonzero_inds[0], nonzero_inds[1]))

    # Sample a fraction pct_test of these pairs, without replacement
    num_samples = int(np.ceil(pct_test * len(nonzero_pairs)))
    samples = random.sample(nonzero_pairs, num_samples)
    artist_inds = [index[0] for index in samples]
    user_inds = [index[1] for index in samples]

    # Remove the sampled interactions from the training set
    training_set[artist_inds, user_inds] = 0
    training_set.eliminate_zeros()

    # Output the unique list of user columns that were altered
    return training_set, test_set, list(set(user_inds))

plays = bm25_weight(plays, K1=100, B=0.8).tocsr()
train, test, indices = train_test_split(plays)

## Fitting our model

The implicit library features many kinds of models, but one of the best-performing ones on the Last.fm 360K dataset is [Alternating Least Squares](http://yifanhu.net/PUB/cf.pdf) (the method we previously referred to as Matrix Factorization) applied to [BM25-weighted play counts](https://en.wikipedia.org/wiki/Okapi_BM25).
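To give an intuition of what BM25 weighting does to raw play counts, here is a minimal sketch written from the general Okapi BM25 formula. The function `bm25_like_weight` and the toy matrix are hypothetical, and the exact formula used by `implicit`'s `bm25_weight` may differ (in particular in how the inverse document frequency is computed).

```python
import numpy as np
from scipy.sparse import coo_matrix

def bm25_like_weight(counts, k1=100.0, b=0.8):
    """BM25-style reweighting of a sparse count matrix (illustrative sketch only)."""
    x = coo_matrix(counts, dtype=np.float64)
    n_rows = x.shape[0]

    # Inverse document frequency per column: columns with few nonzero rows weigh more.
    col_freq = np.bincount(x.col, minlength=x.shape[1])
    idf = np.log1p(n_rows / np.maximum(col_freq, 1))

    # Length normalisation per row: rows with a large total count are damped.
    row_sums = np.asarray(x.sum(axis=1)).ravel()
    length_norm = (1.0 - b) + b * row_sums / row_sums.mean()

    # Saturating count factor multiplied by the column IDF.
    x.data = x.data * (k1 + 1.0) / (k1 * length_norm[x.row] + x.data) * idf[x.col]
    return x.tocsr()

toy_counts = np.array([[10, 0, 1],
                       [2, 3, 0],
                       [0, 0, 4]])
toy_weighted = bm25_like_weight(toy_counts)
print(toy_weighted.toarray().round(3))
```

The sparsity pattern is unchanged; only the magnitudes are rescaled so that very large play counts saturate instead of dominating the factorization.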


In [4]:
model = AlternatingLeastSquares(factors=32, dtype=np.float32)
model.approximate_similar_items = True
model.fit(train)




# Evaluating our model

We use two methods to check the predictions of our model: the standard approach, based on rank-based metrics, and a more informal one, in which we sample a few users and compare the artists ranked highest by the ALS model with the artists each user listened to most.

## Evaluating using Metrics

In order to evaluate our model, we simply compute the recall and precision at k using the previous train-test split.
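For reference, here is a minimal sketch of how precision and recall at k can be computed on dense toy arrays. The helper `precision_recall_at_k` and the toy `scores` and `relevant` arrays are hypothetical and independent of the model; they only illustrate the definitions (precision is the share of the k recommended items that are relevant, recall is the share of relevant items that appear in the top k).

```python
import numpy as np

def precision_recall_at_k(scores, relevant, k):
    """scores and relevant have shape (n_users, n_items); relevant holds 0/1 flags."""
    # Indices of the k highest-scoring items per user (order inside the top k is irrelevant).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # Number of recommended items that are relevant, per user.
    hits = np.take_along_axis(relevant, top_k, axis=1).sum(axis=1)
    precision = hits / k
    # Guard against users without any relevant item.
    recall = hits / np.maximum(relevant.sum(axis=1), 1)
    return precision.mean(), recall.mean()

scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.1, 0.6]])
relevant = np.array([[1, 0, 0, 1],
                     [0, 1, 0, 1]])

p, r = precision_recall_at_k(scores, relevant, k=2)
print(p, r)
```

In this toy case the first user gets one of two relevant items in the top 2 and the second user gets both, so both metrics average to 0.75.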

In [5]:
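def recall_precision_at_k(model, test, k=10, size=500):
    # Sample the user positions to evaluate on
    selected = np.random.choice(len(indices), size=size)

    # Predicted scores, shape (size, n_artists)
    ratings = model.user_factors[selected].dot(model.item_factors.T)

    # Indices of the k best-scored artists per user (unordered within the top k)
    user_best = np.argpartition(-ratings, k, axis=1)[:, :k]

    # 1 where a recommended artist is in the user's test set, 0 otherwise
    intersect = np.take_along_axis(test[:, selected].toarray().T, user_best, axis=1)

    # Number of test artists per selected user, as a flat array of shape (size,)
    n_relevant = np.asarray(test[:, selected].sum(axis=0)).ravel()

    precision = intersect.mean(axis=1).mean()
    recall    = (intersect.sum(axis=1) / n_relevant).mean()

    return precision, recall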

recall_precision_at_k(model, test)

(0.375, 0.08308330047806904)

In [6]:
selected = [0,1]
model.user_factors[selected].dot(model.item_factors.T).shape

(2, 159321)

A higher-dimensional model (here, 32 latent factors) seems to reach a good precision at k, although the recall remains low.

## Manual Verification: Comparing the Predicted and Target Best

In [7]:
selected = np.random.choice(len(indices), size=10)

ratings = model.user_factors[selected].dot(model.item_factors.T)

predicted = pd.DataFrame(ratings).apply(lambda x : x.argsort()[::-1][:5], axis=1).applymap(lambda x : artists[x].decode())
user_fav  = pd.DataFrame([[artists[x].decode() for x in np.array(train[:,usr].toarray())[:,0].argsort()[::-1][:5]] for usr in selected])

print("PREDICTED BEST")
display(predicted)

print("ACTUAL MOST LISTENED")
display(user_fav)

PREDICTED BEST


| | 159320 | 159319 | 159318 | 159317 | 159316 |
|---|---|---|---|---|---|
| 0 | die apokalyptischen reiter | equilibrium | ensiferum | alestorm | grave digger |
| 1 | peaches | fischerspooner | ladytron | the knife | cansei de ser sexy |
| 2 | soda stereo | café tacuba | gustavo cerati | babasónicos | fobia |
| 3 | girugamesh | ancafe | nightmare | déspairsray | ガゼット |
| 4 | cheap trick | van halen | boston | foreigner | lynyrd skynyrd |
| 5 | roxette | era | enya | enigma | gregorian |
| 6 | the white stripes | radiohead | beck | pixies | beastie boys |
| 7 | love is all | the pains of being pure at heart | chairlift | los campesinos! | the xx |
| 8 | hercules and love affair | midnight juggernauts | neon neon | new order | sébastien tellier |
| 9 | high school musical 2 | katy perry | miley cyrus | leona lewis | kelly clarkson |


ACTUAL MOST LISTENED


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | breschdleng | behind the scenery | chinchilla | puhdys | heathen |
| 1 | mediengruppe telekommander | chicks on speed | miss kittin | squirrel nut zippers | new young pony club |
| 2 | plastiko | freddie mercury & montserrat caballé | la lupita | lost acapulco | plastilina mosh |
| 3 | isabelle | blast | unsraw | deathgaze | 176biz |
| 4 | the jim carroll band | axe | badfinger | uncle kracker | fight |
| 5 | ocarina | edvin marton | haddaway | safri duo | ks choice |
| 6 | fairport convention | skream | noisia | !!! | the tragically hip |
| 7 | scarlet's well | the argument | it hugs back | the transisters | dealership |
| 8 | monosurround | david carretta | go-kart mozart | modern english | minitel rose |
| 9 | katrina and the waves | renee sandstrom | emily osment | miranda cosgrove | chanel |


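Inspecting the results by hand, the predicted artists do not coincide with each user's most-listened artists in this sample, but they fall in the same genres (for example, user 9 mostly listens to teen pop and is recommended teen pop), so the collaborative recommender seems to model user preferences fairly well.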

## A Final Verification : Appending Synthetic Users in the Dataset

To check whether the model performs well in practice, we add new synthetic users to the dataset and look at the artists the model recommends to them. This is inspired by [Last.fm New Artist Recommendation System](https://github.com/timjaya/lastfm/blob/master/Final%20Report.pdf).

In [8]:
import scipy.sparse

def new_user_playlist(playlist):
    
    decoded_artists = [x.decode() for x in artists]
    
    # Row positions of the playlist artists (names absent from the dataset are silently skipped)
    artist_idx = [i for i, x in enumerate(decoded_artists) if x in playlist]
    data       = [1 for _ in artist_idx]
    zero_ind   = [0 for _ in artist_idx]
    
    # Build the new user as a 1 x n_artists row, then append it as the last column of the artist x user matrix
    new_user  = scipy.sparse.csr_matrix((data, (zero_ind, artist_idx)), shape=(1, len(artists)))
    new_train = scipy.sparse.hstack([train, new_user.T]).tocsr()
    
    # Refit a model from scratch on the augmented matrix
    model = AlternatingLeastSquares(factors=32, dtype=np.float32)
    model.approximate_similar_items = True
    model.fit(new_train)

    # Scores of every artist for the new user (the last user)
    ratings = model.user_factors[[-1]].dot(model.item_factors.T)

    predicted = pd.DataFrame(ratings).apply(lambda x : x.argsort()[::-1][:20], axis=1).applymap(lambda x : artists[x].decode())
    user_fav  = pd.DataFrame([artists[x].decode() for x in np.array(new_train[:,-1].toarray())[:,0].argsort()[::-1][:20]]).T

    print("PREDICTED BEST")
    display(predicted)

    print("ACTUAL MOST LISTENED")
    display(user_fav)

## Rappers

In [9]:
rappers = ['kanye west', '2pac', 'lil wayne','eminem', 'young jeezy', 'jay-z','drake', 'dr. dre', 'notorious b.i.g']
new_user_playlist(rappers)


PREDICTED BEST


Unnamed: 0,159320,159319,159318,159317,159316,159315,159314,159313,159312,159311,159310,159309,159308,159307,159306,159305,159304,159303,159302,159301
0,kanye west,jay-z,lil wayne,t.i.,dr. dre,ludacris,lupe fiasco,eminem,notorious b.i.g.,50 cent,2pac,snoop dogg,the game,outkast,nas,busta rhymes,kid cudi,bone thugs-n-harmony,rick ross,dmx


ACTUAL MOST LISTENED


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,kanye west,young jeezy,2pac,lil wayne,drake,jay-z,eminem,dr. dre,gina v. dorio,gin,gimp,gimmik,gimmick,gina x performance,gina young,gimmel,gimma,ginesa ortega,gilvan de oliveira,giluz


## French pop

In [10]:
french_pop = ['serge gainsbourg', 'renaud','charles aznavour','garou','daniel balavoine','patrick bruel','céline dion']
new_user_playlist(french_pop)


PREDICTED BEST


Unnamed: 0,159320,159319,159318,159317,159316,159315,159314,159313,159312,159311,159310,159309,159308,159307,159306,159305,159304,159303,159302,159301
0,charles aznavour,francis cabrel,vanessa paradis,jacques brel,calogero,bénabar,renaud,jean-jacques goldman,joe dassin,alain souchon,alain bashung,patrick bruel,michel polnareff,zazie,cali,adriano celentano,renan luce,katerine,olivia ruiz,louise attaque


ACTUAL MOST LISTENED


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,céline dion,renaud,serge gainsbourg,charles aznavour,daniel balavoine,garou,patrick bruel,ｃａｐｓｕｌｅ,gin,gimp,gimmik,gimmick,gimmel,gilvan de oliveira,gimma,gin n juice,giluz,gilt trip,gillman,gilliard


## Merging the two together

In [11]:
merging = rappers[:3] + french_pop[:3]
new_user_playlist(merging)


PREDICTED BEST


Unnamed: 0,159320,159319,159318,159317,159316,159315,159314,159313,159312,159311,159310,159309,159308,159307,159306,159305,159304,159303,159302,159301
0,kanye west,dr. dre,jay-z,snoop dogg,outkast,notorious b.i.g.,2pac,50 cent,lil wayne,eminem,fugees,n*e*r*d,lupe fiasco,lauryn hill,nas,gnarls barkley,t.i.,mc solaar,busta rhymes,john legend


ACTUAL MOST LISTENED


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,renaud,serge gainsbourg,charles aznavour,2pac,lil wayne,kanye west,gilfema,gina,gin palace,gin n juice,gin blossoms,gin,gimp,gimmik,gimmick,gimmel,gimma,gilvan de oliveira,giluz,gilt trip


Rap + French = MC Solaar !

## Some more random artists

In [12]:
playlist = ['daft punk', 'deadmau5', 'john mayer', 'hans zimmer', 'coldplay', 'david guetta', 'kid cudi']
new_user_playlist(playlist)


PREDICTED BEST


Unnamed: 0,159320,159319,159318,159317,159316,159315,159314,159313,159312,159311,159310,159309,159308,159307,159306,159305,159304,159303,159302,159301
0,coldplay,daft punk,kanye west,justin timberlake,timbaland,the killers,black eyed peas,benny benassi,justice,red hot chili peppers,gorillaz,david guetta,jack johnson,mgmt,moby,eminem,gnarls barkley,snow patrol,ministry of sound,maroon 5


ACTUAL MOST LISTENED


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,coldplay,deadmau5,kid cudi,david guetta,daft punk,john mayer,hans zimmer,ｃａｐｓｕｌｅ,gimmel,gin blossoms,gin,gimp,gimmik,gimmick,gilvan de oliveira,gimma,gin palace,giluz,gilt trip,gillman


# Application in a Group Recommender

We can now use the same aggregation methods as before to create an artist list that fits a group!

In [13]:
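def _disagreement_variance(predicts_df):
    # init value (a Series, so that the result keeps the artist index)
    values = pd.Series(0.0, index=predicts_df.index)
    n_users = predicts_df.shape[1]

    # iterate over all ordered pairs of distinct users (each unordered pair is visited twice)
    for col1 in predicts_df.columns:
        for col2 in predicts_df.columns:
            if col1 != col2:
                # add absolute difference
                values = values + np.abs(predicts_df[col1] - predicts_df[col2])

    # average over the n_users * (n_users - 1) ordered pairs
    return values / (n_users * (n_users - 1))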

def _group_ratings(predicts_df, relevance_coeff = 0.5, max_rating=10):

    # compute relevance
    average_relevance = predicts_df.mean(axis=1).to_frame('relevance') / max_rating

    # compute variance
    variance = _disagreement_variance(predicts_df).to_frame('variance')

    # join back variance and relevance in a single rating
    group_ratings = average_relevance.join(variance)
    group_ratings['rating'] = ((relevance_coeff*group_ratings['relevance'])
                            + (1-relevance_coeff)*(1-group_ratings['variance']))
    return group_ratings

def compute_playlist(predicts_df,n,relevance_coeff=0.5):

    # compute group ratings
    group_ratings = _group_ratings(predicts_df, relevance_coeff=relevance_coeff)
    
    # compute top ratings
    topn_ratings = group_ratings.sort_values(by='rating', ascending=False).head(n)

    return topn_ratings
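To make the aggregation concrete, here is a small self-contained sketch on toy data. The member names, artist labels and ratings are hypothetical, and, as a simplifying assumption of this sketch, both the relevance and the disagreement are computed on ratings scaled to [0, 1]. The group score combines the average rating with one minus the average pairwise disagreement.

```python
from itertools import combinations

import pandas as pd

# Hypothetical predicted ratings on a 0-10 scale: rows are artists, columns are group members.
toy_predictions = pd.DataFrame(
    {"alice": [9.0, 6.0, 2.0], "bob": [1.0, 6.0, 3.0]},
    index=["artist_a", "artist_b", "artist_c"],
)
max_rating = 10.0
relevance_coeff = 0.5

scaled = toy_predictions / max_rating

# Relevance: average predicted rating over the group.
relevance = scaled.mean(axis=1)

# Disagreement: mean absolute difference over all unordered pairs of members.
pairs = list(combinations(scaled.columns, 2))
disagreement = sum((scaled[a] - scaled[b]).abs() for a, b in pairs) / len(pairs)

# Consensus score: high average rating and low disagreement are both rewarded.
consensus = relevance_coeff * relevance + (1 - relevance_coeff) * (1 - disagreement)
ranking = consensus.sort_values(ascending=False)
print(ranking)
```

Here `artist_a` has the single highest individual rating but ends up last, because the two members disagree strongly about it, while `artist_b`, which both members rate moderately, comes first.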

In [14]:
selected = np.random.choice(len(users), size=3)

ratings = model.user_factors[selected].dot(model.item_factors.T)

predicted = pd.DataFrame(ratings).apply(lambda x : x.argsort()[::-1][:5], axis=1)
user_fav  = pd.DataFrame([[artists[x].decode() for x in np.array(train[:,usr].toarray())[:,0].argsort()[::-1][:10]] for usr in selected])

selected_art = compute_playlist(pd.DataFrame(ratings).T, 10, relevance_coeff=1).index

print("Predicted")
display(pd.DataFrame([x.decode() for x in artists[selected_art]]).T)

print("Per User favorites")
user_fav

Predicted


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | iron & wine | sufjan stevens | modest mouse | bon iver | girl talk | elliott smith | radiohead | explosions in the sky | death cab for cutie | band of horses |


Per User favorites


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | jesse | ganja kru | vexd | caspa & rusko | sub focus | julma-henri & syrjäytyneet | freestylers | ltj bukem | she wants revenge | crystal castles |
| 1 | carolyn dawn johnson | deana carter | lorrie morgan | craig morgan | gretchen wilson | patty loveless | jo dee messina | lee ann womack | kellie pickler | miranda lambert |
| 2 | ocs | thee oh sees | mugison | elliott | starflyer 59 | jana hunter | joan of arc | the promise ring | grouper | pedro the lion |
