# Prediction of music listening (Part II): Recommendation with Alternating Least Squares

In this notebook, we want to reorganize the data provided in order to obtain objects usable by the Machine Learning models with will work with. The second part will deal with a basic implementation for the prediction of music listening for a particular user. Thirdly, an implementation of the Random Forest algorithm is presented.

You should note that this notebook and all the following were implemented on Google Colab, which is quicker than Jupyter in our case. If you want to open a Google Colab session, [here is the link](https://colab.research.google.com/notebooks/welcome.ipynb#recent=true).


### Package installation


*   **Implicit** is a package used to provide fast Python implementations of several different popular recommendation algorithms for implicit feedback datasets. It enables us to implement Alternating Least Squares.
*   **Kaggle** is used to import data from Kaggle (our dataset is from a Kaggle challenge)
*   **Pandas v.0.21** is a package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

To import a Kaggle dataset, you have to generate a token. In order to do so,you need to have a Kaggle account for accessing Kaggle API. You can find the token on your Kaggle account page. Once you download the token to your local machine, you can copy the token in the notebook. [Here is the link to access the Kaggle page of the challenge.](https://www.kaggle.com/c/kkbox-music-recommendation-challenge/overview)




In [0]:
!pip install --user implicit
!pip install --user -q kaggle
!pip install --user pandas==0.21



After that, we want to load the dataset, create a directory and copy the data from Kaggle. In the provided dataset, we will use train.csv (and unzip it). We don't want to use the test set proposed because it doesn't contain the targets, so using it would mean being unable to evaluate our results.

In [0]:
from google.colab import files
files.upload()

MessageError: ignored

In [0]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
#!chmod 600 /root/.kaggle/kaggle.json # the owner have the authorization to read and write in this directory
!kaggle competitions download -c kkbox-music-recommendation-challenge

!7za x train.csv.7z # unzip the file

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python2.7/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python2.7/dist-packages/kaggle/api/kaggle_api_extended.py", line 146, in authenticate
    self.config_file, self.config_dir))
IOError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         
ERROR: No more files
train.csv.7z



System ERROR:
Unknown error -2147024872


## Rearanging data
In order to be able to evaluate our results, we must compare datasets which are "similar". In the test dataset, we only want to have songs that appear in the training dataset. If not, we couldn't predict the target.

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [0]:
df = pd.read_csv('train.csv') # read the data
df_train, df_test = train_test_split(df, test_size = 0.2) #split into training and test dataset

In [0]:
#Clear out tuples with either song or user that didn't figure in the training dataset.

df_test = df_test[df_test['song_id'].isin(df_train['song_id'])]
df_test = df_test[df_test['msno'].isin(df_train['msno'])]

print('df_test shape before cleaning: ', df_test.shape[0])
print('df_test shape after cleaning: ', df_test.shape[0])

NameError: ignored

In [0]:
# How many null values do we have in each column ?

percent_missing = df_test.isnull().sum() * 100 / len(df_test)
missing_value_df = pd.DataFrame({'column_name': df_test.columns,
                                 'percent_missing': percent_missing})
print ('Number of missing values per column of the test set')
missing_value_df

NameError: ignored

Another problem we are facing is the usability of the data provided. Initialy, UserID are strings, as SongID. We rename the colums and transform it into numerical attributes, that are easier to manipulate.


In a second time, we need to have the same indexes in both the test dataset and the training dataset, to be able to compare the two.

In [0]:
df_train['SongID'] = df_train.groupby(['song_id']).ngroup()
df_train['UserID'] = df_train.groupby(['msno']).ngroup()

print('We have ', df_train['UserID'].nunique(), 'unique users.')
print('We have ', df_train['SongID'].nunique(), 'unique songs.')

NameError: ignored

In [0]:
# The indexes must be the same in training and testing datasets

df_train_ids = df_train.groupby('song_id').first().reset_index()
df_train_ids = df_train_ids[['msno', 'song_id', 'SongID', 'UserID']]

df_test = df_test[['target', 'song_id', 'msno']]

df_test_ids = pd.merge(df_test, df_train_ids, left_on = 'msno', right_on = 'msno').drop(['msno'], axis=1)
df_test_ids = pd.merge(df_test, df_train_ids, left_on = 'song_id', right_on = 'song_id').drop(['song_id'], axis=1)

df_train = df_train[['UserID', 'SongID', 'target']]

df_test = df_test_ids[['UserID', 'SongID', 'target']]

print('df test shape = ',df_test.shape)

## Alternating Least Squares & latent factors
In order to use Alternating Least Squares more efficiently, we will transform the training dataset into a CSR matrix. 


CSR stands for Compressed Sparse Row matrix. Sparse matrices can be used in efficient arithmetic operations: they support addition, subtraction, multiplication, division, and matrix power.

In this section, we will use latent-factor models. They try to explain observed interactions between large numbers of users and songs through a relatively small number of observations.

First, we formulate the learning problem as a matrix completion problem. Then, we will use a type of matrix factorization model to "fill in" the blanks. We are given implicit ratings that users have given certain items (if they listened a song again or not) and our goal is to predict their ratings for the rest of the items. Formally, if there are $n$ users and $m$ items, we are given an $n \times m$ matrix $R$ in which the generic entry $(u, i)$ represents the rating for item $i$ by user $u$. Matrix $R$ has many missing entries indicating unobserved ratings, and our task is to estimate these unobserved ratings.

A popular approach to the matrix completion problem is matrix factorization, where we want to "summarize" users and items with their latent factors. For that, we approximate the initial matrix $R$ by the product of two smaller matrices $X$ and $Y$.

The challenge is to calculate $X$ and $Y$. We do this iteratively: knowing $Y$, we can calculate the best value of $X$, and vice versa. It means from the initial values of $X$ and $Y$ in the beginning, we calculate the best $X$ according to $Y$, and then calculate the best $Y$ according to the new $X$. This process is repeated until the distance from $XY$ to $R$ is small.

The values composing $X$ and $Y$ are called latent factors.

To use Alternating Least Squares, we first need to intialize the model, then train it on the sparse matrix we just created.

In [0]:
from scipy.sparse import csr_matrix
import implicit
import multiprocessing

We are using the library **multiprocessing**.  This package offers both local and remote concurrency, solving Global Interpreter problems by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.

In [0]:
user_item_data = csr_matrix((df_train['target'], (df_train['UserID'], df_train['SongID'])), \
                           shape = (df_train['UserID'].nunique(), df_train['SongID'].nunique()))

user_item_data.shape

NameError: ignored

In [0]:
# Initialize 
model = implicit.als.AlternatingLeastSquares(factors = 50)

# Train
model.fit(user_item_data)



HBox(children=(IntProgress(value=0, max=15), HTML(value='')))




[(19534, 7.77816e-12),
 (25397, 4.594952e-12),
 (27808, 4.449786e-12),
 (8254, 3.2202025e-12),
 (1510, 3.1958776e-12)]

Now we want to do user listenings predictions. This greedy technique is long. Moreover, the top 10,000 recommended songs is a list which is too short to give significant results. Don't forget that we are comparing a list of 10,000 songs with a dataset of 320,000 songs !

In [0]:
## This function return a boolean to know if the current song is recommended for the current user
## It takes in input a song, a user, and the number of recommendations to genrate
def predict(user_id, song_id, N = 10000):
    r = model.recommend(user_id, user_item_data, N = N)
    recommendations = [i[0] for i in r] # recommended songID without the score associated

    return (song_id in recommendations)

In [0]:
L = []
i = 0

for UserID, SongID, target in df_test.values:
  i += 1
  L.append(predict(UserID, SongID))
  if (i % 10000) == 0:
    print(i)
  if (i == 100000):
    break

In [0]:
item_latent = pd.DataFrame(model.item_factors)
user_latent = pd.DataFrame(model.user_factors)
user_latent.columns = ['user_latent' + str(x) for i, x in enumerate(user_latent.columns, 1)]
item_latent.columns = ['item_latent' + str(x) for i, x in enumerate(item_latent.columns, 1)]
item_latent['ID'] = item_latent.index
user_latent['ID'] = user_latent.index

user_latent.head()

In [0]:
df_mini = df_test.sample(frac = 1)
print('df mini shape = ', df_mini.shape)
print('df mini size = ', df_mini.count())

df_latent = pd.merge(df_mini, item_latent, left_on = 'UserID', right_on = 'ID').drop(['ID'], axis = 1)
df_latent = pd.merge(df_latent, user_latent, left_on = 'SongID', right_on = 'ID').drop(['ID'], axis = 1)
print('df latent shape after second merge: ', df_latent.shape)

In [0]:
#multiprocessing
try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 2   # default
print('cpus = ', cpus)

pool = multiprocessing.Pool(processes = cpus) # start multiple worker processes

print(pool.map(predict, df_test[['UserID','SongID']].values))

cpus= 4


TypeError: ignored

## Random forest
Random forest consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.

The fundamental concept behind random forest is the wisdom of crowds: A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.

The low correlation between models is the key. Uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions. The trees protect each other from their individual errors. While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction.

In order to implement our random forest, we need to retrieve the latent factors from the previous section.

In [0]:
from sklearn import ensemble
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
import warnings
from sklearn.metrics import accuracy_score

In [0]:
dmtr = df_latent.drop(['UserID', 'SongID'], axis = 1)
model = ensemble.RandomForestClassifier(n_estimators = 250, max_depth = 25)
model.fit(dmtr[dmtr.columns[dmtr.columns != 'target']], dmtr.target)

In [0]:
df_plot = pd.DataFrame({'features': dmtr.columns[dmtr.columns != 'target'],
                        'importances': model.feature_importances_})
df_plot = df_plot.sort_values('importances', ascending = False).head(25)

plt.figure(figsize = [11,5])
sns.barplot(x = df_plot.importances, y = df_plot.features)
plt.title('Importances of Features Plot')
plt.show()

In [0]:
df_latent = pd.merge(df_test, item_latent, left_on = 'UserID', right_on = 'ID')
df_latent = df_latent.drop(['ID'], axis = 1)

df_latent = pd.merge(df_latent, user_latent, left_on = 'SongID', right_on = 'ID').drop(['ID'], axis = 1)
print('df latent shape after second merge: ', df_latent.shape)
print('df test shape: ', df_test.shape)

In [0]:
# trains on 10% of the dataset
df = df_latent.drop(['UserID', 'SongID'], axis = 1).sample(frac = 0.1)
ypred = model.predict(df[df.columns[df.columns != 'target']])

accuracy_score(df['target'], ypred)

In [0]:
#trains on the whole dataset
ypred = model.predict(dmtr[dmtr.columns[dmtr.columns != 'target']])

accuracy_score(dmtr['target'], ypred)

0.676412905259038