# LastFM Recommendation System
June [x] 2018

Today we are going to walk through building a basic recommender system that, given a music artist, recommends similar artists. The dataset, obtained from LastFM in 2011, contains the play counts of 17,632 artists by 1,892 users. It is available at [GroupLens](https://grouplens.org/datasets/hetrec-2011/) on behalf of [Lab41](https://github.com/Lab41/hermes/wiki/Datasets).

In [43]:
# First we import our packages needed for this analysis
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix # compressed Sparse Row matrix
from sklearn.neighbors import NearestNeighbors # use K-Nearest Neighbors to find cosine distance amongst our artists
from fuzzywuzzy import fuzz # fuzzy string matching so we allow for slight misspellings of artist names

In [44]:
# sets output to three decimals
pd.set_option('display.float_format',lambda x: '%.3f' %x)

In [45]:
# Six tab-separated .dat files were provided, but we will only need two for our basic recommender system

# using artists and play count per user
artists = pd.read_csv('artists.dat', sep='\t',usecols=['id','name'])
plays = pd.read_csv('user_artists.dat', sep='\t')

# additional user data, for later analysis
tags = pd.read_csv('tags.dat', sep='\t',encoding='latin-1')
friends = pd.read_csv('user_friends.dat', sep='\t')
utat = pd.read_csv('user_taggedartists-timestamps.dat', sep="\t")
uta = pd.read_csv('user_taggedartists.dat', sep='\t')

In [51]:
# Explore specs of each file provided.  We will only use artists and plays in the basic system.
csv_list = [artists, plays, tags, friends, utat,uta]
csv_names = ['artists', 'plays', 'tags', 'friends', 'user_taggedartists-timestamps','user_taggedartists']
for name, df in zip(csv_names, csv_list):
    print("{}\n{}\n{}\n{}\n".format(name, df.shape, df.columns, df.nunique()))

artists
(17632, 2)
Index(['id', 'name'], dtype='object')
id      17632
name    17632
dtype: int64

plays
(92834, 3)
Index(['userID', 'artistID', 'weight'], dtype='object')
userID       1892
artistID    17632
weight       5436
dtype: int64

tags
(11946, 2)
Index(['tagID', 'tagValue'], dtype='object')
tagID       11946
tagValue    11946
dtype: int64

friends
(25434, 2)
Index(['userID', 'friendID'], dtype='object')
userID      1892
friendID    1892
dtype: int64

user_taggedartists-timestamps
(186479, 4)
Index(['userID', 'artistID', 'tagID', 'timestamp'], dtype='object')
userID        1892
artistID     12523
tagID         9749
timestamp     3549
dtype: int64

user_taggedartists
(186479, 6)
Index(['userID', 'artistID', 'tagID', 'day', 'month', 'year'], dtype='object')
userID       1892
artistID    12523
tagID        9749
day             4
month          12
year           10
dtype: int64



In [5]:
# merge artist and play files
artist_plays = pd.merge(artists, plays,how='left',left_on='id',right_on='artistID')

In [6]:
artist_plays.head()

   id             name  userID  artistID  weight
0   1     MALICE MIZER      34         1     212
1   1     MALICE MIZER     274         1     483
2   1     MALICE MIZER     785         1      76
3   2  Diary of Dreams     135         2    1021
4   2  Diary of Dreams     257         2     152


In [52]:
# Obtain total plays by artist
ap2 = (artist_plays.groupby(['name'])['weight'].sum().reset_index().
                rename(columns={'weight':'total_artist_plays','name':'artist_name'})
               [['artist_name','total_artist_plays']])

In [57]:
print(ap2[ap2['artist_name']=='Bon Jovi'])
ap2.head()

     artist_name  total_artist_plays
2151    Bon Jovi               43252


    artist_name  total_artist_plays
0           !!!                2826
1      !DISTAIN                1257
2      !deladap                  65
3         #####                3707
4  #2 Orchestra                 144


In [8]:
# merge plays by artist with plays by user
user_data_with_artist_plays = artist_plays.merge(ap2, left_on='name',right_on='artist_name',how='left')[['userID','artist_name','weight','total_artist_plays']]

In [9]:
# The median artist has 350 total plays; the mean (about 3,924) is pulled up by a few very popular artists
ap2['total_artist_plays'].describe()

count     17632.000
mean       3923.774
std       34099.342
min           1.000
25%         113.000
50%         350.000
75%        1234.250
max     2393140.000
Name: total_artist_plays, dtype: float64

In [58]:
# The maximum number of plays belongs to Britney Spears
ap2[ap2['total_artist_plays']==max(ap2['total_artist_plays'])]

         artist_name  total_artist_plays
2336  Britney Spears             2393140


In [10]:
# artists at the 95th percentile have roughly 10,700 total plays
ap2['total_artist_plays'].quantile(np.arange(.9,1.,.01))

0.900    4645.400
0.910    5350.680
0.920    6193.000
0.930    7320.640
0.940    8685.280
0.950   10693.400
0.960   14257.800
0.970   18969.740
0.980   30137.940
0.990   60096.010
Name: total_artist_plays, dtype: float64
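
The quantile table above can feed a popularity cutoff directly instead of a hard-coded number. The following is a minimal sketch under stated assumptions: `totals` is a hypothetical toy stand-in for `ap2`, and `filter_popular` is an illustrative helper that is not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy totals; in the notebook this role is played by ap2
totals = pd.DataFrame({
    'artist_name': list('ABCDEFGHIJ'),
    'total_artist_plays': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
})

def filter_popular(df, q):
    """Keep rows whose total_artist_plays is strictly above the q-th quantile."""
    threshold = df['total_artist_plays'].quantile(q)
    return df[df['total_artist_plays'] > threshold], threshold

# Keep only artists above the 80th percentile of total plays
popular, threshold = filter_popular(totals, 0.8)
print(threshold)
print(popular['artist_name'].tolist())
```

Tying the cutoff to a quantile keeps the share of retained artists stable if the data changes, at the cost of a less memorable number.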

In [68]:
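# We set a popularity threshold to keep only the most popular artists; this reduces noise from rarely played artists.
# Note: a threshold of 0 keeps every artist. The shapes printed below (943 artists) imply the run used a
# higher cutoff, somewhere between the 94th and 95th percentiles (roughly 8,700 to 10,700 total plays).
popularity_threshold = 0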
udpa = user_data_with_artist_plays[user_data_with_artist_plays['total_artist_plays']>popularity_threshold]
udpa2 = udpa.sort_values(['userID','weight'],ascending=False)

In [69]:
udpa2.head()

       userID   artist_name  weight  total_artist_plays
44041    2100  Yann Tiersen    1333               43972
36701    2100     Eluveitie     762               11244
36813    2100        Slayer     553               62107
55016    2099  Flying Lotus     410               13178
60187    2099        Bonobo     397               14601


In [76]:
udpa2.shape

(53861, 4)

In [77]:
udpa2.nunique()

userID                1871
artist_name            943
weight                5083
total_artist_plays     936
dtype: int64

In [71]:
# confirm no duplicate rows
assert udpa[udpa.duplicated(['userID','artist_name'])].empty 

In [79]:
# fit data into a sparse matrix of artist name (rows) vs user (columns) in terms of number of plays
wide_artist_data = udpa2.pivot(index='artist_name',columns='userID',values='weight').fillna(0)
wide_artist_data_sparse = csr_matrix(wide_artist_data.values)
wide_artist_data_sparse.shape

(943, 1871)
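
With 53,861 play records spread over a 943 x 1,871 grid, only about 3% of the cells are non-zero, which is why a sparse representation is a good fit. The pivot-then-compress step can be sketched on a toy table; the `toy` frame below is hypothetical and only mirrors the column names used above.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical long-format play data: one row per (user, artist) pair
toy = pd.DataFrame({
    'userID': [1, 1, 2, 3],
    'artist_name': ['A', 'B', 'A', 'C'],
    'weight': [10, 3, 7, 2],
})

# Artists become rows, users become columns; missing pairs are filled with 0 plays
wide = toy.pivot(index='artist_name', columns='userID', values='weight').fillna(0)

# CSR stores only the non-zero entries
sparse = csr_matrix(wide.values)
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])

print(wide)
print(sparse.shape, sparse.nnz, density)
```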

In [80]:
# We fit an unsupervised nearest-neighbors model that ranks artists by the cosine distance between their play-count rows
model_knn = NearestNeighbors(metric='cosine',algorithm='auto')
model_knn.fit(wide_artist_data_sparse)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)
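
Cosine distance is one minus the cosine of the angle between two artists' play-count vectors, so it compares listening patterns rather than raw volume. Here is a small worked sketch in NumPy; the `cosine_distance` helper and the toy matrix are illustrative assumptions, not code from the notebook.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical play counts: rows are artists, columns are users
toy_plays = np.array([
    [10, 0, 5],   # artist A
    [20, 0, 10],  # artist B: same listeners as A, in the same proportions
    [0, 7, 0],    # artist C: no listeners in common with A
])

d_ab = cosine_distance(toy_plays[0], toy_plays[1])  # close to 0: same direction
d_ac = cosine_distance(toy_plays[0], toy_plays[2])  # 1: no shared listeners

print(d_ab, d_ac)
```

Artist B has twice the plays of artist A but the distance is still near zero, because only the direction of the vector matters.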

In [83]:
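# Pick a random artist and request 6 neighbors: normally the query artist itself plus its 5 closest artists
query_index = np.random.choice(wide_artist_data.shape[0])
distances, indices = model_knn.kneighbors(wide_artist_data.iloc[query_index,:].values.reshape(1,-1),n_neighbors=6)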

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(wide_artist_data.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}'.format(i, wide_artist_data.index[indices.flatten()[i]],distances.flatten()[i]))

Recommendations for M83:

1: The American Dollar, with distance of 0.5247161232479393
2: Ben Folds, with distance of 0.5404039168093057
3: Electric President, with distance of 0.542619536025968
4: John Frusciante, with distance of 0.6126376111752875
5: Cocteau Twins, with distance of 0.6314801586385921


In [84]:
# Convert play counts to binary: 1 if the user played the artist at all, 0 otherwise
wide_artist_data_binary = wide_artist_data.apply(np.sign)
wide_artist_data_binary_sparse = csr_matrix(wide_artist_data_binary.values)
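
To see why binarizing can change the neighbors, consider two artists with identical audiences where one user plays one of them far more heavily. The sketch below uses hypothetical numbers and an illustrative `cosine_distance` helper.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical play counts: rows are artists, columns are users
weighted = np.array([
    [1000, 1, 1],  # artist A: one very heavy listener
    [1, 1, 1],     # artist B: the same three listeners, light plays
])

# np.sign maps every positive count to 1 and leaves zeros as 0
binary = np.sign(weighted)

d_weighted = cosine_distance(weighted[0], weighted[1])
d_binary = cosine_distance(binary[0], binary[1])

print(d_weighted, d_binary)
```

On raw counts the single heavy listener dominates and the two artists look fairly far apart (distance about 0.42). On binary data they share exactly the same audience, so the distance is effectively zero.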

In [85]:
model_nn_binary = NearestNeighbors(metric='cosine',algorithm='auto')
model_nn_binary.fit(wide_artist_data_binary_sparse)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

In [86]:
distances, indices = model_nn_binary.kneighbors(wide_artist_data_binary.iloc[query_index,:].values.reshape(1,-1),n_neighbors=6)

for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(wide_artist_data.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}'.format(i, wide_artist_data.index[indices.flatten()[i]],distances.flatten()[i]))

Recommendations for M83:

1: Sigur Rós, with distance of 0.6997177975740639
2: Explosions in the Sky, with distance of 0.7161716377717764
3: múm, with distance of 0.7380952380952381
4: The Radio Dept., with distance of 0.7506081254445802
5: My Bloody Valentine, with distance of 0.7545840884607619


In [87]:
def print_artist_recommendation(query_artist, artist_plays_matrix,knn_model,k):
    """Fuzzy-match query_artist against the matrix index and print its k nearest artists."""
    query_index = None
    ratio_tuples = []
    
    # collect every artist whose name scores at least 75 against the query
    for current_query_index, i in enumerate(artist_plays_matrix.index):
        ratio = fuzz.ratio(i.lower(),query_artist.lower())
        if ratio >= 75:
            ratio_tuples.append((i,ratio,current_query_index))
            
    print('Possible matches: {0}\n'.format([(x[0],x[1]) for x in ratio_tuples]))
    
    # use the best-scoring match; max() raises ValueError when nothing matched
    try:
        query_index = max(ratio_tuples,key=lambda x: x[1])[2]
    except ValueError:
        print('Your artist did not match any artists in the data. Try again')
        return None
    
    # request k+1 neighbors because the closest one is normally the query artist itself
    distances, indices = knn_model.kneighbors(artist_plays_matrix.iloc[query_index,:].values.reshape(1,-1),n_neighbors=k+1)
    
    for i in range(0,len(distances.flatten())):
        if i ==0:
            print('Recommendations for {0}:\n'.format(artist_plays_matrix.index[query_index]))
        else:
            print('{0}: {1}, with distance of {2}:'.format(i,artist_plays_matrix.index[indices.flatten()[i]],distances.flatten()[i]))
            
    return None
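
The name-matching step in the function above can be tried in isolation. This sketch swaps in the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, which is an assumption made to keep the example dependency-free; the `ratio` and `best_match` helpers and the short `names` list are hypothetical.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity score from 0 to 100 (a stand-in for fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def best_match(query, names, cutoff=75):
    """Return (name, score) for the best candidate at or above cutoff, else None."""
    scored = [(name, ratio(name.lower(), query.lower())) for name in names]
    candidates = [pair for pair in scored if pair[1] >= cutoff]
    return max(candidates, key=lambda pair: pair[1]) if candidates else None

# Hypothetical subset of artist names
names = ['Akon', 'Korn', 'The Beatles']

print(best_match('korn', names))     # ('Korn', 100)
print(best_match('beatles', names))  # ('The Beatles', 78)
print(best_match('zzzz', names))     # None
```

With a cutoff of 75, 'Akon' scores exactly 75 against 'korn' and so counts as a possible match, but the highest score still selects 'Korn'. Raising the cutoff would drop such near misses at the risk of rejecting partial names like 'beatles'.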

In [88]:
print_artist_recommendation('red hot chili peppers',wide_artist_data_binary,model_nn_binary,k=10)

Possible matches: [('Red Hot Chili Peppers', 100)]

Recommendations for Red Hot Chili Peppers:

1: Nirvana, with distance of 0.6531898876933021:
2: Oasis, with distance of 0.6804322417333812:
3: The Beatles, with distance of 0.6875337856101311:
4: Foo Fighters, with distance of 0.6907899504265811:
5: Muse, with distance of 0.7147586765340397:
6: U2, with distance of 0.7187370650649147:
7: Radiohead, with distance of 0.7190006405179143:
8: Green Day, with distance of 0.7247622108315265:
9: Aerosmith, with distance of 0.7256105408555654:
10: Pearl Jam, with distance of 0.7287778314914192:


In [89]:
print_artist_recommendation('korn',wide_artist_data_binary,model_nn_binary,k=10)

Possible matches: [('Akon', 75), ('Korn', 100)]

Recommendations for Korn:

1: Limp Bizkit, with distance of 0.6461634268298138:
2: System of a Down, with distance of 0.6510341508052175:
3: Slipknot, with distance of 0.6624614219579241:
4: Deftones, with distance of 0.7007471991677101:
5: Godsmack, with distance of 0.7020602142344381:
6: Metallica, with distance of 0.7408662553447589:
7: Stone Sour, with distance of 0.7441800621927404:
8: Pantera, with distance of 0.7474092572295388:
9: Rammstein, with distance of 0.7498512347383965:
10: Marilyn Manson, with distance of 0.7522026861083237:


In [90]:
print_artist_recommendation('beatles',wide_artist_data_binary,model_nn_binary,k=10)

Possible matches: [('The Beatles', 78)]

Recommendations for The Beatles:

1: Radiohead, with distance of 0.5556344164535882:
2: Pink Floyd, with distance of 0.5567036793441906:
3: Led Zeppelin, with distance of 0.5924683351896096:
4: Arctic Monkeys, with distance of 0.594228466209854:
5: The Strokes, with distance of 0.613038007895694:
6: The Rolling Stones, with distance of 0.6171620636663815:
7: Oasis, with distance of 0.6220355269907726:
8: Muse, with distance of 0.6302872736840136:
9: David Bowie, with distance of 0.6334567471239492:
10: Bob Dylan, with distance of 0.6486358155368461:


In [91]:
print_artist_recommendation('bon jovi',wide_artist_data_binary,model_nn_binary,k=10)

Possible matches: [('Bon Jovi', 100)]

Recommendations for Bon Jovi:

1: Guns N' Roses, with distance of 0.6388424407426927:
2: Aerosmith, with distance of 0.64543051657445:
3: Mötley Crüe, with distance of 0.6485989094725191:
4: Skid Row, with distance of 0.6637036454327252:
5: Poison, with distance of 0.6913933000758161:
6: Scorpions, with distance of 0.6996242954069447:
7: Alice Cooper, with distance of 0.7249904508915366:
8: Queen, with distance of 0.7352129544573991:
9: Def Leppard, with distance of 0.7412254152466171:
10: AC/DC, with distance of 0.7478924811549623:
