# Collaborative item-based filtering for music artists
In this notebook I build a recommender for music artists using a collaborative item-based filtering approach. As a basis, I used the Last.fm dataset, which can be found [here](http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/index.html). The content of the notebook is based on [this](https://beckernick.github.io/music_recommender/) tutorial.

The recommender has been deployed in a sample web-application and can be found [here](https://denmei.github.io/): ![artist_recommender](img/artist_recommender.png)

## Theoretical background
To provide a better understanding of the approach, here's a small introduction (from Wikipedia, I have to admit).

### Collaborative Filtering
* Collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.
* There exist different approaches for collaborative filtering:
    * **User-based**: Use ratings from like-minded users to give a recommendation for the active user
    * **Item-based**: Build an item-item matrix that captures the similarity between pairs of items &rarr; Use the tastes of the current user to find similar items

### Item-based collaborative filtering
**How do we get the similarities?**
* Look at users who rated both items 
* The similarity of two items depends on the ratings given by the users who have rated both of them
* There are many different metrics to measure this similarity, e.g. cosine similarity, Euclidean distance...
![Item-based CF](img/icollaborative_filtering.png)

**How do we use the similarities?**
* We use the most similar items to the ones the user already rated to generate a list of recommendations
* *people who rate item X highly, like you, also tend to rate item Y highly, and you haven't rated item Y yet, so you should try it*
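
To make the two steps above concrete, here is a minimal sketch on a made-up ratings matrix. The item names, the ratings and the scoring rule (a similarity-weighted sum of the user's own ratings) are assumptions for illustration only; the notebook itself works with play counts and a nearest-neighbor model further down.

```python
import numpy as np

# Hypothetical toy ratings: rows are items, columns are users, 0 = not rated.
items = ["item_x", "item_y", "item_z", "item_w"]
ratings = np.array([
    [5.0, 4.0, 0.0, 5.0],  # item_x
    [4.0, 5.0, 0.0, 0.0],  # item_y
    [0.0, 0.0, 5.0, 1.0],  # item_z
    [0.0, 1.0, 5.0, 0.0],  # item_w
])

# Step 1: cosine similarity between every pair of item rows.
unit_rows = ratings / np.linalg.norm(ratings, axis=1, keepdims=True)
item_similarity = unit_rows @ unit_rows.T

# Step 2: score the items for the active user (last column).
active_user = ratings[:, 3]
scores = item_similarity @ active_user   # similarity-weighted sum of the user's ratings
scores[active_user > 0] = -np.inf        # never recommend items the user already rated

best = items[int(np.argmax(scores))]
print(best)
```

The active user rated `item_x` highly and `item_z` poorly, so the unrated item that is most similar to `item_x` wins.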

In [2]:
import os
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder
from scipy.sparse import csr_matrix
import pickle
pd.options.display.float_format = "{:.3f}".format

## Import data + Data Exploration

### Load user profiles
The user profiles dataset contains data about 359.347 different users, including their id, gender, age, country and the date when they signed up for the service.
![profiles](img/profiles.png)

In [7]:
user_profiles = pd.read_table("lastfm-dataset-360K/profiles.tsv", 
                         header=None, names=['user_id', 'gender', 'age', 'country', 'signup'])
print(user_profiles.head(5))
print("Number of rows: " + str(len(user_profiles)))
print("Number of unique users: " + str(len(set(user_profiles['user_id'].values))))

                                    user_id gender    age        country  \
0  00000c289a1829a808ac09c00daf10bc3c4e223b      f 22.000        Germany   
1  00001411dc427966b17297bf4d69e7e193135d89      f    nan         Canada   
2  00004d2ac9316e22dc007ab2243d6fcb239e707d    NaN    nan        Germany   
3  000063d3fe1cf2ba248b9e3c3f0334845a27a6bf      m 19.000         Mexico   
4  00007a47085b9aab8af55f52ec8846ac479ac4fe      m 28.000  United States   

         signup  
0   Feb 1, 2007  
1   Dec 4, 2007  
2   Sep 1, 2006  
3  Apr 28, 2008  
4  Jan 27, 2006  
Number of rows: 359347
Number of unique users: 359347


### Load user data
The user dataset contains the Last.fm artist listening information (play counts per artist) of 358.868 individual users.

![plays](img/plays.png)

In [8]:
user_data = pd.read_table("lastfm-dataset-360K/plays.tsv", 
                         header=None, names=['user_id', 'artist_id', 'artist_name', 'plays'])
print(user_data.head(5))
print("\nNumber of rows: " + str(len(user_data)))
print("\nNumber of users: " + str(len(set(user_data['user_id'].values))))
user_data.drop('artist_id', axis=1, inplace=True)

# drop all rows where we do not have an artist name
if user_data['artist_name'].isnull().sum() > 0:
    user_data = user_data.dropna(axis = 0, subset = ['artist_name'])

                                    user_id  \
0  00000c289a1829a808ac09c00daf10bc3c4e223b   
1  00000c289a1829a808ac09c00daf10bc3c4e223b   
2  00000c289a1829a808ac09c00daf10bc3c4e223b   
3  00000c289a1829a808ac09c00daf10bc3c4e223b   
4  00000c289a1829a808ac09c00daf10bc3c4e223b   

                              artist_id           artist_name  plays  
0  3bd73256-3905-4f3a-97e2-8b341527f805       betty blowtorch   2137  
1  f2fb0ff0-5679-42ec-a55c-15109ce6e320             die Ärzte   1099  
2  b3ae82c2-e60b-4551-a76d-6620f1b456aa     melissa etheridge    897  
3  3d6bbeb7-f90e-4d10-b440-e153c0d10b53             elvenking    717  
4  bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8  juliette & the licks    706  

Number of rows: 17535655

Number of users: 358868


## Data Preparation

### Reduce number of users
* To reduce the size of the data, only German users will be considered
* Join user-data and user-profiles 

In [10]:
german_profiles = user_profiles[user_profiles['country'] == 'Germany']
user_data_ger_profiles = german_profiles.merge(user_data, on="user_id", how='left')
# Note: this counts the merged rows (one per user-artist pair), not distinct profiles.
print("Number of German profiles: " + str(len(user_data_ger_profiles)))

Number of German profiles: 1555720


### Reduce number of artists
Lesser-known artists have fewer plays from fewer users, which makes their data noisier. This can distort the recommender, because it becomes highly sensitive to cases where a single user **loves** a lesser-known artist.

To reduce this influence, we will **keep only the most popular artists**.

Another advantage is that the **file size will be reduced**, which improves the runtime performance of the model.

First, we create a table containing the total plays for each artist.

In [11]:
artist_plays = user_data_ger_profiles.groupby('artist_name')['plays'].sum().reset_index()
artist_plays.columns = ['artist_name', 'artist_total_plays']
print(artist_plays.head(5))
print("Number of artists:" + str(len(artist_plays['artist_name'])))

     artist_name  artist_total_plays
0            !!!           14362.000
1  !action pact!              85.000
2          !cube              40.000
3       !deladap            1148.000
4       !distain             379.000
Number of artists:82816


With almost 83.000 artists, the probability that some artists have been played only a few times is high.

Let's find a threshold to define how many plays are needed to be a popular artist in the dataset by looking at the descriptives:

* The median artist has only about 145 plays.
* The most popular artist has more than 2.9 million plays.
* Only about 1% of the artists have roughly 70.000 plays or more.

To keep the dataset small, **I will choose a threshold of 90.000 total plays to define whether an artist is popular or not. This reduces the number of artists to 645**.

In [12]:
print(artist_plays['artist_total_plays'].describe())
print("")
print(artist_plays['artist_total_plays'].quantile(np.arange(.98, 1, .002)))
threshold = 90000
popular_artist_plays = artist_plays[artist_plays['artist_total_plays'] > threshold]

print(len(artist_plays['artist_name']))
print(len(popular_artist_plays['artist_name']))

count     82816.000
mean       3649.069
std       32286.877
min           1.000
25%          36.000
50%         145.000
75%         618.000
max     2955844.000
Name: artist_total_plays, dtype: float64

0.980     29364.400
0.982     33647.000
0.984     39316.520
0.986     46431.380
0.988     56943.960
0.990     69368.700
0.992     87238.000
0.994    114542.490
0.996    165063.340
0.998    276339.520
1.000   2955844.000
Name: artist_total_plays, dtype: float64
82816
645


Let's bring the datasets together into one DataFrame.

In [13]:
user_with_artist_plays = user_data_ger_profiles.merge(popular_artist_plays, on='artist_name', how='inner')
user_with_artist_plays = user_with_artist_plays.sort_values('artist_total_plays', ascending=False)
print(user_with_artist_plays.head(5))

                                       user_id gender    age  country  \
0     00000c289a1829a808ac09c00daf10bc3c4e223b      f 22.000  Germany   
4507  aba29c45c5067cba15e191da456a130ed84bcb14      f 26.000  Germany   
4479  aadcd8781ea372f3164b726ff10011d4ac73b9cc      m 32.000  Germany   
4478  aad472ed3b7ca0df1e4efc7c9b2436f52e221519      m 18.000  Germany   
4477  aad0cddae6587e92c7069b22d202adb99d53624e      f 15.000  Germany   

            signup artist_name    plays  artist_total_plays  
0      Feb 1, 2007   die Ärzte 1099.000         2955844.000  
4507   Sep 6, 2007   die Ärzte 2245.000         2955844.000  
4479  Nov 27, 2006   die Ärzte  802.000         2955844.000  
4478  Nov 24, 2006   die Ärzte  498.000         2955844.000  
4477  Oct 25, 2007   die Ärzte  648.000         2955844.000  


### Correct format
The k-nearest neighbor algorithm will be used for the recommender. As a prerequisite, the data must be in an *m×n*-shaped matrix, where *m* is the number of artists (rows) and *n* is the number of users (columns).

The format we need here is: ![collabfiltering](img/icollaborative_filtering.png)
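
As a small illustration of this reshaping step, the sketch below pivots a made-up long table (one row per user-artist pair) into the wide artist-by-user layout and converts it to a sparse matrix. The user ids, artist names and play counts are invented for the example.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical long-format data: one row per (user, artist) pair.
long_plays = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "artist_name": ["seeed", "patrice", "seeed", "patrice"],
    "plays": [10, 3, 7, 20],
})

# Artists become rows, users become columns; missing pairs are filled with 0.
wide = long_plays.pivot(index="artist_name", columns="user_id", values="plays").fillna(0)

# Only the non-zero cells are stored in the sparse representation.
sparse = csr_matrix(wide.values)
print(wide)
print(sparse.shape, sparse.nnz)
```

`pivot` raises an error if a (user, artist) pair occurs more than once, which is why duplicates have to be dropped before pivoting.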

In [45]:
user_with_artist_plays = user_with_artist_plays.drop_duplicates(['user_id', 'artist_name'])
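# Pivot to create a DataFrame with artists as rows and users as columns. Fill empty values with 0.
# Note: the cells hold 'artist_total_plays', which is constant per artist, so under the cosine
# metric this behaves like a binary listened/not-listened matrix. Use values='plays' instead
# to weight each cell by the number of plays per user and artist.
wide_artist_data = user_with_artist_plays.pivot(index = 'artist_name', columns = 'user_id', values = 'artist_total_plays').fillna(0)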
# Transform to sparse matrix for more efficiency.
wide_artist_data_sparse = csr_matrix(wide_artist_data.values)

## Train model

I will use a k-nearest-neighbor model for the recommender. 
The algorithm computes the distance between a specific artist and the remaining artists in the dataset. 

The smaller the distance between two instances, the more similar they are. For each artist, **the k best matches will be returned** (the ones with the smallest distance).

![knn](img/knearestneighbor.jpeg)

There exists a large number of metrics you can choose to calculate this distance: Euclidean distance, Pearson correlation, cosine, etc. I will **use the cosine to calculate the distance between two items**.
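
For reference, this is the cosine distance as I understand scikit-learn computes it for `metric='cosine'` (one minus the cosine similarity). The symbols are my own notation: $a$ and $b$ are the rows of two artists and $u$ runs over the users.

$$
d(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
        = 1 - \frac{\sum_u a_u b_u}{\sqrt{\sum_u a_u^2} \, \sqrt{\sum_u b_u^2}}
$$

Since all entries are non-negative here, the distance lies between 0 (same listeners in the same proportions) and 1 (no listener in common).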

Let's fit a `NearestNeighbors` model on the dataset. 

In [15]:
model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'auto')
model_knn.fit(wide_artist_data_sparse)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

## Make recommendations

We can use the model to make some recommendations.

Sklearn's `NearestNeighbors` model provides a method called *kneighbors*, which we will use here:
* Input: [Query points X (the row of our artist), number of neighbors k]
* Output: [Distances between each of the k results and our query, indices of the k nearest points in the data matrix]
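
To show what the input and output of *kneighbors* look like, here is a minimal sketch on a made-up 4x3 matrix. The numbers are invented, and `algorithm="brute"` is chosen explicitly here because it works with the cosine metric on sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy matrix: 4 artists (rows) x 3 users (columns).
toy = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]))

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy)

# Query with the row of artist 0 and ask for its 3 nearest neighbors.
distances, indices = toy_knn.kneighbors(toy[0], n_neighbors=3)
print(distances.shape, indices.shape)  # one row per query point, one column per neighbor
print(indices[0])                      # row numbers in the fitted matrix, nearest first
```

The query artist is part of the fitted data, so it comes back as its own nearest neighbor. That is why the cells below ask for 6 neighbors to get 5 recommendations.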

In [52]:
query_index = np.random.choice(wide_artist_data.shape[0])
distances, indices = model_knn.kneighbors(wide_artist_data.iloc[query_index, :].values.reshape(1, -1), n_neighbors = 6)

print(distances.flatten())
print("")
# The nearest neighbor (position 0) is the query artist itself, so start at 1.
print("Artist: %s" % wide_artist_data.index[query_index])
for i in range(1, len(distances.flatten())):
    print("Recommendation %s: %s - %s " % (i, wide_artist_data.index[indices[0][i]], distances.flatten()[i]))

429
[[     0.      0.      0. ...      0. 157035.      0.]]
[0.         0.70549543 0.760319   0.7680288  0.80323174 0.82062726]

Artist: ohrbooten
Recommendation 1: mono & nikitaman - 0.7054954325066647 
Recommendation 2: culcha candela - 0.760319003182583 
Recommendation 3: seeed - 0.7680287997564453 
Recommendation 4: patrice - 0.8032317368403745 
Recommendation 5: gentleman - 0.8206272568131207 


Since users will not be interested in recommendations for some random artist, we look up the row of the artist they are interested in by name:

In [41]:
artist_name = "die toten hosen"
artists = user_with_artist_plays['artist_name'].unique()
# .ix was removed in pandas 1.0; .loc does the label-based row lookup
artist_index = wide_artist_data.loc[artist_name].values.reshape(1, -1)
distances, indices = model_knn.kneighbors(artist_index, n_neighbors = 6)

# The nearest neighbor (position 0) is the query artist itself, so start at 1.
print("Artist: %s" % artist_name)
for i in range(1, len(distances.flatten())):
    print("Recommendation %s: %s - %s " % (i, wide_artist_data.index[indices[0][i]], distances.flatten()[i]))

Artist: die toten hosen
Recommendation 1: die Ärzte - 0.517372892477576 
Recommendation 2: farin urlaub - 0.6454411358570306 
Recommendation 3: rammstein - 0.6924119777968816 
Recommendation 4: böhse onkelz - 0.6971024940724783 
Recommendation 5: the offspring - 0.6972890602949392 
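
The lookup by name can be wrapped in a small function. The sketch below is a hypothetical helper (`recommend` is not part of the notebook or of scikit-learn) that also handles artist names that are not in the matrix. It runs on a made-up three-artist table instead of the real data.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def recommend(artist_name, wide_data, model, k=5):
    """Return up to k (artist, distance) pairs closest to artist_name (hypothetical helper)."""
    if artist_name not in wide_data.index:
        return []  # unknown artist: nothing to recommend
    query = wide_data.loc[artist_name].values.reshape(1, -1)
    # Ask for one extra neighbor, because the query artist is returned as well.
    n_neighbors = min(k + 1, len(wide_data))
    distances, indices = model.kneighbors(query, n_neighbors=n_neighbors)
    names = wide_data.index[indices[0]]
    pairs = [(name, float(dist)) for name, dist in zip(names, distances[0])]
    return [pair for pair in pairs if pair[0] != artist_name][:k]


# Toy stand-in for wide_artist_data: artists as rows, users as columns.
toy_wide = pd.DataFrame(
    [[5.0, 0.0, 3.0], [4.0, 0.0, 4.0], [0.0, 6.0, 0.0]],
    index=["artist_a", "artist_b", "artist_c"],
    columns=["u1", "u2", "u3"],
)
toy_model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_wide.values)

result = recommend("artist_a", toy_wide, toy_model, k=2)
unknown = recommend("not in the data", toy_wide, toy_model)
print(result)
print(unknown)
```

With the real data, the call would presumably look like `recommend("die toten hosen", wide_artist_data, model_knn)`.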


## Save model
Since I want to deploy the model online, I will serialize and save it locally.

In [64]:
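# Use a context manager so the file is closed (and flushed) after writing
with open('nn_recommender.sav', 'wb') as model_file:
    pickle.dump(model_knn, model_file)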
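
On the server side the model has to be loaded again. The following sketch shows the round trip on a tiny made-up model written to a temporary file; the file name and the data are placeholders, not the ones used in my deployment.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy model standing in for model_knn.
toy = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]])
model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy)

# Write the model to a temporary file and read it back.
path = os.path.join(tempfile.mkdtemp(), "toy_recommender.sav")
with open(path, "wb") as model_file:
    pickle.dump(model, model_file)
with open(path, "rb") as model_file:
    restored = pickle.load(model_file)

# The restored model should answer queries exactly like the original one.
before = model.kneighbors(toy[:1], n_neighbors=2)
after = restored.kneighbors(toy[:1], n_neighbors=2)
same_indices = np.array_equal(before[1], after[1])
print(same_indices)
```

Two things to keep in mind: a pickled model should be loaded with the same scikit-learn version it was saved with, and pickle files should only be loaded from trusted sources.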

## Save available artist-names
For my online artist-recommender, I also need the artist-name list to query for artists.

In [43]:
artists = pd.DataFrame(artists)
artists.to_csv("artists.csv")

## Deployment
I deployed the recommender in a simple **django-application on [Heroku](https://www.heroku.com/)**. 
The recommender can be used via a REST-API ([this website](https://denmei.github.io/) uses the API). 

Give it a try with [this](https://www.codepunker.com/tools/http-requests) HTTP request service by making a POST request (should be GET, I know):

* **URL**: https://ml-server-dm.herokuapp.com/music_recommender/api/artist_recommendation
* **Parameters**: 
    * artist: the name of the artist you are interested in
    * number: the number of recommendations you want to retrieve
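
The same request can be built from Python with the standard library. This is only a sketch: I assume here that the two parameters are sent form-encoded in the POST body, and `build_request` is a hypothetical helper. The request is only constructed, not sent, so the snippet runs without network access.

```python
import urllib.parse
import urllib.request

API_URL = "https://ml-server-dm.herokuapp.com/music_recommender/api/artist_recommendation"


def build_request(artist, number):
    """Build (but do not send) the POST request; hypothetical helper."""
    # Assumption: the API reads 'artist' and 'number' from a form-encoded body.
    body = urllib.parse.urlencode({"artist": artist, "number": number}).encode("utf-8")
    return urllib.request.Request(API_URL, data=body, method="POST")


request = build_request("die toten hosen", 5)
print(request.get_method(), request.full_url)
print(request.data)

# To actually send it (needs network access and a running server):
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.read().decode("utf-8"))
```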
