# LastFM Recommendation Engine
August 2018

Today we are going to walk through building a basic recommender system which, when given a music artist, will recommend similar artists. The dataset, obtained from LastFM in 2011, contains the play counts of 17,632 artists by 1,892 users. It is available at [GroupLens](https://grouplens.org/datasets/hetrec-2011/), as listed by [Lab41](https://github.com/Lab41/hermes/wiki/Datasets).

There are two main types of recommender system: **content-based** and **collaborative**. A content-based system recommends items whose attributes resemble those of items a user has already engaged with, whereas a collaborative system draws on the behavior of similar users to provide recommendations.

Our basic recommender will be a collaborative recommender system. We build a sparse matrix of play counts with artists as rows and users as columns, then fit a K-nearest neighbors model that uses cosine distance between the artist rows to determine which artists are most similar. For instance, a user who plays the Beatles is also likely to play the Rolling Stones.
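To make the cosine measure concrete, here is a minimal sketch on a toy artist-by-user matrix. The artist names and play counts are invented for illustration and are not taken from the LastFM data; `cosine_distance` is a helper written for this example only.

```python
import numpy as np

# Toy play counts per artist across four users (values invented).
plays = {
    "Artist A": np.array([10.0, 0.0, 5.0, 0.0]),
    "Artist B": np.array([20.0, 0.0, 10.0, 0.0]),  # same listeners as A, twice the plays
    "Artist C": np.array([0.0, 7.0, 0.0, 3.0]),    # no listeners in common with A
}

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

d_ab = cosine_distance(plays["Artist A"], plays["Artist B"])
d_ac = cosine_distance(plays["Artist A"], plays["Artist C"])
print(round(d_ab, 3), round(d_ac, 3))
```

Artists A and B point in the same direction, so their distance is essentially zero even though B has twice the plays. A and C share no listeners, so their distance is 1.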

### Import files and packages

In [1]:
# check kernel environment
import sys
sys.executable

'/Users/brianmcmahon/anaconda3/envs/recommender/bin/python'

In [2]:
# Import packages
import pandas as pd
import numpy as np

# Compressed Sparse Row ("CSR") matrix
from scipy.sparse import csr_matrix 

# use K-Nearest Neighbors to find cosine distance amongst artists
from sklearn.neighbors import NearestNeighbors

# fuzzy string matching to allow for differing spelling of artist names
from fuzzywuzzy import fuzz

In [3]:
# set output to three decimals
pd.set_option('display.float_format',lambda x: '%.3f' %x)

In [4]:
# set the random seed for reproducible randomness
seed = np.random.RandomState(seed=42)

Six tab-separated `.dat` files are provided in this dataset (download from GroupLens [here](http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip)), but we will only need two for our basic collaborative recommendation engine. The dataset also includes timestamped artist tags by user and each user's friends network. Tags could be applied to a content-based recommendation engine in a separate analysis.

In [5]:
artists = pd.read_csv('../data/artists.dat',sep='\t',usecols=['id','name'])
plays = pd.read_csv('../data/user_artists.dat',sep='\t')

ap = pd.merge(artists, 
                plays, 
                how='inner',
                left_on='id',
                right_on='artistID')

ap = ap.rename(columns={"weight":"userPlays"})

In [6]:
ap.head()

| | id | name | userID | artistID | userPlays |
|---|---|---|---|---|---|
| 0 | 1 | MALICE MIZER | 34 | 1 | 212 |
| 1 | 1 | MALICE MIZER | 274 | 1 | 483 |
| 2 | 1 | MALICE MIZER | 785 | 1 | 76 |
| 3 | 2 | Diary of Dreams | 135 | 2 | 1021 |
| 4 | 2 | Diary of Dreams | 257 | 2 | 152 |
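The inner merge above assumes each artist `id` appears only once in the artists table. As a hedged sketch on toy stand-in frames (`artists_toy` and `plays_toy` are invented here, not the real files), pandas can check that assumption during the merge through the `validate` argument:

```python
import pandas as pd

# Toy stand-ins for artists.dat and user_artists.dat (values invented).
artists_toy = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
plays_toy = pd.DataFrame({
    "userID": [10, 11, 10],
    "artistID": [1, 1, 2],
    "weight": [5, 7, 3],
})

merged = pd.merge(
    artists_toy, plays_toy,
    how="inner", left_on="id", right_on="artistID",
    validate="one_to_many",  # raises MergeError if an artist id repeats on the left
).rename(columns={"weight": "userPlays"})

# Artist 3 has no plays, so the inner join drops it.
print(merged)
```

With an inner join, artists with no recorded plays are dropped from the result.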


### Exploratory Data Analysis (EDA)

Key features in our collaborative engine include:
- userID
- artistID, artistName
- userPlays (plays by user of artist)

Other features in the dataset (not used in this engine) include:
- tag of artist by user with timestamp
- friends network

_The tags can be used in a content-based recommendation engine in a separate analysis._

In [7]:
df_list = [artists, plays, ap]
df_name = ['**Artists**','**Plays**','**Combined**']

assert len(df_list) == len(df_name)

for i in range(len(df_list)):
    print(df_name[i],'\n')
    print("Shape: {}\n".format(df_list[i].shape))
    print("Info:")
    print(df_list[i].info(),'\n')
    print("Unique:\n{}\n".format(df_list[i].nunique()))
    


**Artists** 

Shape: (17632, 2)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17632 entries, 0 to 17631
Data columns (total 2 columns):
id      17632 non-null int64
name    17632 non-null object
dtypes: int64(1), object(1)
memory usage: 275.6+ KB
None 

Unique:
id      17632
name    17632
dtype: int64

**Plays** 

Shape: (92834, 3)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92834 entries, 0 to 92833
Data columns (total 3 columns):
userID      92834 non-null int64
artistID    92834 non-null int64
weight      92834 non-null int64
dtypes: int64(3)
memory usage: 2.1 MB
None 

Unique:
userID       1892
artistID    17632
weight       5436
dtype: int64

**Combined** 

Shape: (92834, 5)

Info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 92834 entries, 0 to 92833
Data columns (total 5 columns):
id           92834 non-null int64
name         92834 non-null object
userID       92834 non-null int64
artistID     92834 non-null int64
userPlays    92834 non-null int64
dtypes: int64(4), object(1)

Key statistics include:
    
- 17,632 unique artist IDs and names
- 1,892 unique user IDs
- 92,834 user-artist play records (one user's history can contain multiple artists)

In [8]:
# 69 million total plays
print("{:,}".format(ap['userPlays'].sum()))

69,183,975


In [9]:
# Determine total plays by artist
ap2 = (ap.groupby(['name'])['userPlays'].sum().
       sort_values(ascending=False).
       reset_index().
       rename(columns={"userPlays":"totalArtistPlays","name":"artist"})
       [['artist','totalArtistPlays']])

In [10]:
ap2.head()

| | artist | totalArtistPlays |
|---|---|---|
| 0 | Britney Spears | 2393140 |
| 1 | Depeche Mode | 1301308 |
| 2 | Lady Gaga | 1291387 |
| 3 | Christina Aguilera | 1058405 |
| 4 | Paramore | 963449 |


In [11]:
# Britney Spears is the most played at 2.4 million plays
# idxmax returns an index label, so look the row up with .loc rather than .iloc
print(ap2.loc[ap2['totalArtistPlays'].idxmax()])

artist              Britney Spears
totalArtistPlays           2393140
Name: 0, dtype: object


In [12]:
print(ap2[ap2['artist']=='Lady Gaga'])

      artist  totalArtistPlays
2  Lady Gaga           1291387


In [13]:
print(ap2[ap2['artist']=='Bon Jovi'])

       artist  totalArtistPlays
250  Bon Jovi             43252


In [14]:
ap2['totalArtistPlays'].describe()

count     17632.000
mean       3923.774
std       34099.342
min           1.000
25%         113.000
50%         350.000
75%        1234.250
max     2393140.000
Name: totalArtistPlays, dtype: float64

In [15]:
# artists at the 95th percentile are played ~10,000 times
# we will use this information to set the popularity threshold below
# we may want to focus on the most popular artists to reduce noise
# and improve accuracy
ap2['totalArtistPlays'].quantile(np.arange(.9,1.,.01))

0.900    4645.400
0.910    5350.680
0.920    6193.000
0.930    7320.640
0.940    8685.280
0.950   10693.400
0.960   14257.800
0.970   18969.740
0.980   30137.940
0.990   60096.010
Name: totalArtistPlays, dtype: float64
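The popularity threshold can also be derived from the percentile directly instead of being read off the table by eye. The sketch below uses synthetic heavy-tailed totals (generated here, not the LastFM play counts) to show that a cutoff at the 95th percentile keeps roughly the top 5% of artists.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic, heavy-tailed "total plays per artist" (not the LastFM data).
total_plays = rng.lognormal(mean=6.0, sigma=2.0, size=1000)

# Cutoff at the 95th percentile of total plays.
threshold = np.percentile(total_plays, 95)

# Boolean mask of artists above the cutoff.
kept = total_plays > threshold
print(threshold, kept.mean())
```

Rounding the cutoff to a convenient number, as the notebook does with 10,000, shifts the retained share slightly away from exactly 5%.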

### Prepare Sparse Matrix

In [16]:
# merge plays by artist with plays by user
user_artist_plays = ap.merge(ap2, 
                             left_on='name',
                             right_on='artist',
                             how='left')[['userID',
                                          'artist',
                                          'userPlays',
                                          'totalArtistPlays']]

# confirm no duplicate rows
assert (user_artist_plays[user_artist_plays.
                         duplicated(['userID','artist'])].
                         empty)

In [17]:
popularity_threshold = 10000 # 95th percentile at ~10,000 plays
uap_top = (user_artist_plays[user_artist_plays['totalArtistPlays']>
                             popularity_threshold].
                             sort_values(['userID','userPlays'],
                             ascending=False))

In [18]:
print(uap_top.shape)
uap_top.head()

(53861, 4)


| | userID | artist | userPlays | totalArtistPlays |
|---|---|---|---|---|
| 44041 | 2100 | Yann Tiersen | 1333 | 43972 |
| 36701 | 2100 | Eluveitie | 762 | 11244 |
| 36813 | 2100 | Slayer | 553 | 62107 |
| 55016 | 2099 | Flying Lotus | 410 | 13178 |
| 60187 | 2099 | Bonobo | 397 | 14601 |


In [19]:
uap_top[['userID','artist']].nunique()

userID    1871
artist     943
dtype: int64

In [20]:
# with threshold at 95th percentile this would show the 5% of artists
# we are including in our engine
print("{:.2f}%".format(100*(uap_top.artist.nunique()/ap.name.nunique())))

5.35%


In [21]:
# Our revised analysis still contains almost all original users
print("{:.2f}%".format(100*(uap_top.userID.nunique()/ap.userID.nunique())))

98.89%


In [22]:
# fit data into a sparse matrix of artist name (row) vs user (column)
# in terms of number of plays by artist/user
pivot_uapt = uap_top.pivot(index='artist',columns='userID',values='userPlays').fillna(0)
sparse_uapt = csr_matrix(pivot_uapt.values)
sparse_uapt.shape

(943, 1871)
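The pivot above materializes every zero before converting to CSR. For larger data, one possible alternative (a sketch on an invented toy table, not the approach used in this notebook) is to build the CSR matrix straight from the long-format rows using categorical codes as row and column positions.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-format plays table (values invented for illustration).
toy = pd.DataFrame({
    "artist": ["A", "A", "B", "C"],
    "userID": [10, 11, 10, 12],
    "userPlays": [5, 7, 3, 9],
})

# Encode artists and users as consecutive integer codes (sorted order).
artist_cat = toy["artist"].astype("category")
user_cat = toy["userID"].astype("category")

sparse_toy = csr_matrix(
    (toy["userPlays"].to_numpy(dtype=float),
     (artist_cat.cat.codes.to_numpy(), user_cat.cat.codes.to_numpy())),
    shape=(len(artist_cat.cat.categories), len(user_cat.cat.categories)),
)

# Dense pivot for comparison only; it has the same layout as sparse_toy.
dense_toy = toy.pivot(index="artist", columns="userID", values="userPlays").fillna(0)
density = sparse_toy.nnz / (sparse_toy.shape[0] * sparse_toy.shape[1])
print(sparse_toy.shape, density)
```

The `density` value is the share of non-zero cells, a quick way to see how sparse the artist-by-user matrix is.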

In [23]:
# Use K Nearest Neighbors to determine cosine distance amongst artists
knn = NearestNeighbors(metric='cosine')
knn.fit(sparse_uapt)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)
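As a small illustration of what the fitted model returns, the sketch below fits the same kind of estimator on a toy matrix. The names and values are invented, and `algorithm="brute"` is set explicitly here as an assumption to keep the example unambiguous. When a row of the fitted matrix is used as the query, that row itself normally comes back first at distance zero, so we ask for `k + 1` neighbors and skip the first.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy artist-by-user matrix (rows: artists); values are invented.
names = ["A", "B", "C", "D"]
X = csr_matrix(np.array([
    [10.0, 0.0, 5.0, 0.0],  # A
    [8.0, 1.0, 6.0, 0.0],   # B: listeners overlap heavily with A
    [0.0, 7.0, 0.0, 3.0],   # C
    [0.0, 6.0, 1.0, 4.0],   # D: listeners overlap heavily with C
]))

model = NearestNeighbors(metric="cosine", algorithm="brute").fit(X)

# Ask for k + 1 neighbors because the query row itself is returned first.
k = 1
distances, indices = model.kneighbors(X[0], n_neighbors=k + 1)
neighbors = [names[i] for i in indices.flatten()[1:]]
print(neighbors, distances.flatten()[1:])
```

If two artists had identical rows, the query might not be the first result, so skipping position zero is a convenience rather than a guarantee.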

### Query by Index Position

In [24]:
pivot_uapt.head()

| artist \ userID | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | ... | 2090 | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2099 | 2100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *NSYNC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1567.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2NE1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 290.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2Pac | 0.0 | 0.0 | 0.0 | 0.0 | 14.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 Doors Down | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 514.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 30 Seconds to Mars | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [25]:
def idx_recommend(data,idx,model,k):
    distances, indices = (model.kneighbors(data.
                                     iloc[idx,:].
                                     values.reshape(1,-1),
                                     n_neighbors=k+1))

    for i in range(0,len(distances.flatten())):
        if i == 0:
            print(('Recommendations for {} (index {}):\n'.
                  format(data.index[idx],idx)))
        else:
            print(('{}: {} ({:.3f})'.
                  format(i,
                         data.index[indices.flatten()[i]],
                         distances.flatten()[i])))
    print('\nNote: cosine distance in parentheses.')

In [26]:
query_index = seed.choice(pivot_uapt.shape[0])
idx_recommend(pivot_uapt,query_index,knn,6)

Recommendations for Belinda (index 102):

1: Nelly Furtado (0.603)
2: t.A.T.u. (0.693)
3: Hilary Duff (0.706)
4: Delta Goodrem (0.775)
5: Lindsay Lohan (0.782)
6: Blutengel (0.822)

Note: cosine distance in parentheses.


In [27]:
query_index = 2
idx_recommend(pivot_uapt,query_index,knn,6)

Recommendations for 2Pac (index 2):

1: Mobb Deep (0.543)
2: Lloyd Banks (0.558)
3: G-Unit (0.592)
4: 50 Cent (0.593)
5: Nas (0.606)
6: Wu-Tang Clan (0.661)

Note: cosine distance in parentheses.


### Query by Artist (using Fuzzy Matching)

In [44]:
def artist_recommend(query_artist, data, model, k):
    # collect (artist, ratio, row position) for every name scoring >= 75
    ratio_tuples = []

    for idx, artist in enumerate(data.index):
        ratio = fuzz.ratio(artist.lower(), query_artist.lower())
        if ratio >= 75:
            ratio_tuples.append((artist, ratio, idx))

    print('Fuzzy matches: {}\n'.format([(x[0], x[1]) for x in ratio_tuples]))

    if not ratio_tuples:
        print('Your artist did not match any artists.')
        return None

    # query with the best-scoring match, not the last match found in the loop
    query_idx = max(ratio_tuples, key=lambda x: x[1])[2]
    idx_recommend(data, query_idx, model, k)
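The matching step can be tried in isolation. The sketch below is a stand-in that uses the standard library's `difflib.SequenceMatcher` rather than `fuzzywuzzy`, so its scores may differ from `fuzz.ratio`. The helper `best_match` and the three-name `catalog` are hypothetical and exist only for this example.

```python
from difflib import SequenceMatcher

def best_match(query, candidates, min_score=75):
    """Return (name, score) for the closest candidate, or None below min_score."""
    scored = [
        (name, round(100 * SequenceMatcher(None, query.lower(), name.lower()).ratio()))
        for name in candidates
    ]
    # default covers an empty candidate list
    name, score = max(scored, key=lambda pair: pair[1], default=(None, 0))
    return (name, score) if name is not None and score >= min_score else None

catalog = ["Red Hot Chili Peppers", "Korn", "2NE1"]  # illustrative subset
hit = best_match("red hot chilli peppers", catalog)  # misspelled on purpose
miss = best_match("zzzz", catalog)
print(hit, miss)
```

Returning `None` below the threshold keeps a poor match from silently producing recommendations for the wrong artist.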

In [45]:
artist_recommend('red hot chili peppers',pivot_uapt,knn,10)

Fuzzy matches: [('Red Hot Chili Peppers', 100)]

Recommendations for Red Hot Chili Peppers (index 663):

1: The Offspring (0.568)
2: Kreator (0.630)
3: Mercyful Fate (0.692)
4: John Frusciante (0.692)
5: Neuro Dubel (0.693)
6: Ennio Morricone (0.694)
7: Рубль (0.695)
8: In Extremo (0.696)
9: Riverside (0.697)
10: Katie Melua (0.701)

Note: cosine distance in parentheses.


### Convert weighted matrix to binary
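The binary matrix records only whether a user has played an artist (1) or not (0), ignoring how many times.

Note that the two queries below were executed (In [32] and In [33]) before the final definition of `artist_recommend` (In [44]). Their fuzzy-match scores are well below the threshold of 75, and both return neighbors of 2NE1 rather than of the queried artist, so they should be re-run against the current function before being read as results for the binary matrix.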

In [30]:
buapt = pivot_uapt.apply(np.sign)
sparse_buapt = csr_matrix(buapt.values)

In [31]:
bknn = NearestNeighbors(metric='cosine')
bknn.fit(sparse_buapt)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)
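To see why binarizing can change the neighbors, consider a toy pair of artists with the same three listeners but very different play counts. This is a sketch with invented values. It uses the sparse matrix's own `sign()` method as an alternative to `apply(np.sign)` on the dense pivot, and `cosine_distance` is a helper written for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy artist-by-user play counts (values invented for illustration).
weighted = csr_matrix(np.array([
    [1000.0, 1.0, 1.0],  # artist 0: one heavy listener, two casual ones
    [1.0, 50.0, 40.0],   # artist 1: same three users, different emphasis
]))

# sign() maps every stored positive count to 1 and keeps the matrix sparse.
binary = weighted.sign()

def cosine_distance(m, i, j):
    """Cosine distance between rows i and j of a sparse matrix."""
    a, b = m[i].toarray().ravel(), m[j].toarray().ravel()
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

d_weighted = cosine_distance(weighted, 0, 1)
d_binary = cosine_distance(binary, 0, 1)
print(round(d_weighted, 3), round(d_binary, 3))
```

With play counts, a single heavy listener dominates and the two artists look far apart. Once binarized, they share exactly the same audience and the distance collapses to zero.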

In [32]:
artist_recommend('red hot chili peppers',buapt,bknn,10)

Fuzzy matches: [('*NSYNC', 7), ('2NE1', 8)]

Recommendations for 2NE1 (index 1):

1: 4minute (0.489)
2: 소녀시대 (0.532)
3: BIG BANG (0.610)
4: SHINee (0.610)
5: BoA (0.632)
6: 倖田來未 (0.768)
7: Rihanna (0.810)
8: Ke$ha (0.825)
9: Britney Spears (0.827)
10: Katy Perry (0.827)

Note: cosine distance in parentheses.


In [33]:
artist_recommend('korn',buapt,bknn,10)

Fuzzy matches: [('*NSYNC', 20), ('2NE1', 25)]

Recommendations for 2NE1 (index 1):

1: 4minute (0.489)
2: 소녀시대 (0.532)
3: BIG BANG (0.610)
4: SHINee (0.610)
5: BoA (0.632)
6: 倖田來未 (0.768)
7: Rihanna (0.810)
8: Ke$ha (0.825)
9: Britney Spears (0.827)
10: Katy Perry (0.827)

Note: cosine distance in parentheses.
