# Prediction of music listening (Part II): Recommendation with Alternating Least Squares

In this notebook, we aim to make predictions based on the Weighted Alternating Least Squares model. The outputs of the WALS can be used as a latent representation for both users and items. 

We feed these latent representation to various ML models in notebook III.


You should note that this notebook and all the following were implemented on Google Colab, which is quicker than Jupyter in our case. If you want to open a Google Colab session, [here is the link](https://colab.research.google.com/notebooks/welcome.ipynb#recent=true).


### Package installation


*   **Implicit** is a package used to provide fast Python implementations of several different popular recommendation algorithms for implicit feedback datasets. It enables us to implement Alternating Least Squares.
*   **Kaggle** is used to import data from Kaggle (our dataset is from a Kaggle challenge)
*   **Pandas v.0.21** is a package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

To import a Kaggle dataset, you have to generate a token. In order to do so,you need to have a Kaggle account for accessing Kaggle API. You can find the token on your Kaggle account page. Once you download the token to your local machine, you can copy the token in the notebook. [Here is the link to access the Kaggle page of the challenge.](https://www.kaggle.com/c/kkbox-music-recommendation-challenge/overview)




In [1]:
!pip install --user implicit
!pip install --user -q kaggle
!pip install --user pandas==0.21 ## This version is necessary for implicit package to work



After that, we want to load the dataset, create a directory and copy the data from Kaggle. In the provided dataset, we will use train.csv (and unzip it).

The test set is mainly for competition submissions and thus doesn't contain targets. 

Therefore we can't use the test set for local model evaluation.

We split the train set in order to do our local evaluation.



In [6]:
## You need to download a kaggle token from your personal computer in order to download dataset from kaggle
## HELP ==>
#https://adityashrm21.github.io/Setting-Up-Kaggle/

from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"mkallel","key":"5a891df11b5d66a4cf5a0ca518cff9d4"}'}

In [7]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!kaggle competitions download -c kkbox-music-recommendation-challenge
!7za x train.csv.7z # unzip the file

Downloading train.csv.7z to /content
 99% 100M/101M [00:00<00:00, 116MB/s]  
100% 101M/101M [00:00<00:00, 149MB/s]
Downloading test.csv.7z to /content
 74% 31.0M/41.9M [00:00<00:00, 46.0MB/s]
100% 41.9M/41.9M [00:00<00:00, 94.5MB/s]
Downloading song_extra_info.csv.7z to /content
 88% 87.0M/98.8M [00:00<00:00, 77.3MB/s]
100% 98.8M/98.8M [00:01<00:00, 102MB/s] 
Downloading members.csv.7z to /content
  0% 0.00/1.29M [00:00<?, ?B/s]
100% 1.29M/1.29M [00:00<00:00, 86.4MB/s]
Downloading sample_submission.csv.7z to /content
  0% 0.00/453k [00:00<?, ?B/s]
100% 453k/453k [00:00<00:00, 134MB/s]
Downloading songs.csv.7z to /content
 85% 86.0M/101M [00:00<00:00, 98.9MB/s]
100% 101M/101M [00:00<00:00, 159MB/s]  

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Xeon(R) CPU @ 2.30GHz (306F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 106420688 by

## Rearanging data
In order to be able to evaluate our results, we must compare datasets which are "similar". In the test dataset, we only want to have songs that appear in the training dataset. If not, we couldn't predict the target.

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [0]:
df = pd.read_csv('train.csv') # read the data
df_train, df_test = train_test_split(df, test_size = 0.2) #split into training and test dataset

In [10]:
#Clear out tuples with either song or user that didn't figure in the training dataset.

df_test = df_test[df_test['song_id'].isin(df_train['song_id'])]
df_test = df_test[df_test['msno'].isin(df_train['msno'])]

print('df_test shape before cleaning: ', df_test.shape[0])
print('df_test shape after cleaning: ', df_test.shape[0])

df_test shape before cleaning:  1437138
df_test shape after cleaning:  1437138


In [11]:
# How many null values do we have in each column ?

percent_missing = df_test.isnull().sum() * 100 / len(df_test)
missing_value_df = pd.DataFrame({'column_name': df_test.columns,
                                 'percent_missing': percent_missing})
print ('Number of missing values per column of the test set')
missing_value_df

Number of missing values per column of the test set


Unnamed: 0,column_name,percent_missing
msno,msno,0.0
song_id,song_id,0.0
source_system_tab,source_system_tab,0.329753
source_screen_name,source_screen_name,5.578518
source_type,source_type,0.281393
target,target,0.0


Another problem we are facing is the usability of the data provided. Initialy, UserID are strings, as SongID. We rename the colums and transform it into numerical attributes, that are easier to manipulate.


In a second time, we need to have the same indexes in both the test dataset and the training dataset, to be able to compare the two.

In [12]:
df_train['SongID'] = df_train.groupby(['song_id']).ngroup()
df_train['UserID'] = df_train.groupby(['msno']).ngroup()

print('We have ', df_train['UserID'].nunique(), 'unique users.')
print('We have ', df_train['SongID'].nunique(), 'unique songs.')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


We have  30535 unique users.
We have  324326 unique songs.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


**Note:** One problem we stumbled upon when using the IMPLICIT package is that User_ids and song ids must be continuous in values. This is why we have to get set IDs for training first then merge the Test set with training on those ids.


In [13]:
# The indexes must be the same in training and testing datasets

df_train_ids = df_train.groupby('song_id').first().reset_index()
df_train_ids = df_train_ids[['msno', 'song_id', 'SongID', 'UserID']]

df_test = df_test[['target', 'song_id', 'msno']]

df_test_ids = pd.merge(df_test, df_train_ids, left_on = 'msno', right_on = 'msno').drop(['msno'], axis=1)
df_test_ids = pd.merge(df_test, df_train_ids, left_on = 'song_id', right_on = 'song_id').drop(['song_id'], axis=1)

df_train = df_train[['UserID', 'SongID', 'target']]

df_test = df_test_ids[['UserID', 'SongID', 'target']]

print('df test shape = ',df_test.shape)

df test shape =  (1437138, 3)


## Alternating Least Squares & latent factors
In order to use Alternating Least Squares more efficiently, we will transform the training dataset into a CSR matrix. 


CSR stands for Compressed Sparse Row matrix. Sparse matrices can be used in efficient arithmetic operations: they support addition, subtraction, multiplication, division, and matrix power.

In this section, we will use latent-factor models. They try to explain observed interactions between large numbers of users and songs through a relatively small number of observations.

First, we formulate the learning problem as a matrix completion problem. Then, we will use a type of matrix factorization model to "fill in" the blanks. We are given implicit ratings that users have given certain items (if they listened a song again or not) and our goal is to predict their ratings for the rest of the items. Formally, if there are $n$ users and $m$ items, we are given an $n \times m$ matrix $R$ in which the generic entry $(u, i)$ represents the rating for item $i$ by user $u$. Matrix $R$ has many missing entries indicating unobserved ratings, and our task is to estimate these unobserved ratings.

A popular approach to the matrix completion problem is matrix factorization, where we want to "summarize" users and items with their latent factors. For that, we approximate the initial matrix $R$ by the product of two smaller matrices $X$ and $Y$.

The challenge is to calculate $X$ and $Y$. We do this iteratively: knowing $Y$, we can calculate the best value of $X$, and vice versa. It means from the initial values of $X$ and $Y$ in the beginning, we calculate the best $X$ according to $Y$, and then calculate the best $Y$ according to the new $X$. This process is repeated until the distance from $XY$ to $R$ is small.

The values composing $X$ and $Y$ are called latent factors.

To use Alternating Least Squares, we first need to intialize the model, then train it on the sparse matrix we just created.

In [0]:
from scipy.sparse import csr_matrix
import implicit
import multiprocessing

We are using the library **multiprocessing**.  This package offers both local and remote concurrency, solving Global Interpreter problems by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine.

In [15]:
user_item_data = csr_matrix((df_train['target'], (df_train['UserID'], df_train['SongID'])), \
                           shape = (df_train['UserID'].nunique(), df_train['SongID'].nunique()))

user_item_data.shape

(30535, 324326)

In [16]:
# Initialize 
model = implicit.als.AlternatingLeastSquares(factors = 64)

# Train
model.fit(user_item_data)



HBox(children=(IntProgress(value=0, max=15), HTML(value='')))




In [43]:
#model.recommend_all(user_item_data,10)  This function would have saved us a lot of trouble but it doesnt work

HBox(children=(IntProgress(value=0, max=30535), HTML(value='')))




ValueError: ignored

Now we want to do user re-listenings predictions. 

To do so, we check if the song figures in the N top recommended songs for the user, and if so, we suppose it's target is 1.

Unfortunately this approach was inconclusive as our model will always predict false.


In [58]:
## This function return a boolean to know if the current song is recommended for the current user

## It takes in input a song, a user, and the number of recommendations to genrate


def predict(user_id, song_id,filter_already_liked_items=False, N = 1000000):
    r = model.recommend(user_id, user_item_data, N = N)
    recommendations = [i[0] for i in r] #  get recommended songID without the score associated
    return (song_id in recommendations)


predict (5951,260925) ## This is a sample from the test dataset (it should be True)


False

In [59]:

args=[[UserID,SongID] for UserID, SongID, target in df_test.sample(frac=0.01).head(1000).values]

#multiprocessing
try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 2   # default
print('cpus = ', cpus)

pool = multiprocessing.Pool(processes = cpus) # start multiple worker processes

predictions=pool.starmap(predict, args)

cpus =  4


Process ForkPoolWorker-133:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
Process ForkPoolWorker-131:


KeyboardInterrupt: ignored

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
Process ForkPoolWorker-134:
Process ForkPoolWorker-132:
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "<ipython-input-58-ebc8516fc945>", line 6, in predict
    return (song_id in recommendations)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    

In this part we get the latent item / user representations and merge them with the dataset. The resulting dataframe is used in next notebooks to make predictions.

In [60]:
item_latent = pd.DataFrame(model.item_factors)
user_latent = pd.DataFrame(model.user_factors)
user_latent.columns = ['user_latent' + str(x) for i, x in enumerate(user_latent.columns, 1)]
item_latent.columns = ['item_latent' + str(x) for i, x in enumerate(item_latent.columns, 1)]
item_latent['ID'] = item_latent.index
user_latent['ID'] = user_latent.index

user_latent.head()

Unnamed: 0,user_latent0,user_latent1,user_latent2,user_latent3,user_latent4,user_latent5,user_latent6,user_latent7,user_latent8,user_latent9,...,user_latent55,user_latent56,user_latent57,user_latent58,user_latent59,user_latent60,user_latent61,user_latent62,user_latent63,ID
0,2.185818e-10,2.887835e-10,-9.349999e-11,6.612975e-11,-1.909906e-11,-2.930099e-11,-1.784031e-10,-2.03267e-11,1.232229e-10,9.218788e-11,...,-1.159656e-10,-3.737963e-11,-3.521769e-11,-7.502467e-11,5.143072e-11,-4.690552e-11,-1.300899e-10,-5.750772e-11,-8.912292e-11,0
1,-3.229599e-11,1.396763e-11,9.815937e-12,-2.68745e-11,3.155161e-12,1.408519e-10,-8.172361e-11,-5.152044e-11,4.476085e-11,-1.883276e-11,...,-4.69027e-11,7.468878e-11,-9.194449e-11,7.484484e-11,1.660755e-11,1.987651e-11,9.48546e-11,7.927081e-11,2.609627e-11,1
2,2.444868e-12,1.376773e-11,2.075325e-12,4.263875e-12,1.453393e-14,-4.928964e-12,-9.055751e-12,6.855614e-12,1.385661e-11,9.399668e-12,...,1.252419e-11,-1.524443e-11,5.986541e-12,-1.849516e-12,9.45617e-12,-1.752537e-12,2.579415e-13,1.405179e-12,-3.012579e-12,2
3,2.508143e-11,6.572374e-11,-3.357838e-11,-1.901207e-12,-2.974904e-11,-4.068785e-11,-1.91942e-11,1.093846e-11,5.48589e-11,-3.896883e-13,...,2.029777e-11,7.801227e-12,4.453333e-12,-3.373099e-11,-7.131739e-13,3.363178e-11,-2.711576e-11,-3.353382e-11,-1.03325e-11,3
4,3.268416e-11,2.196176e-12,-4.195175e-11,-6.28452e-12,-1.50704e-11,3.013468e-12,-2.11063e-11,-1.233696e-11,1.020784e-11,-3.790675e-11,...,-3.416188e-12,-6.51084e-12,-7.069612e-12,-1.433112e-11,-7.786692e-13,-4.432369e-12,-2.380146e-11,-8.839674e-12,4.606966e-13,4


In [61]:
df_mini = df_test.sample(frac = 1)
print('df mini shape = ', df_mini.shape)
print('df mini size = ', df_mini.count())

df_latent = pd.merge(df_mini, item_latent, left_on = 'UserID', right_on = 'ID').drop(['ID'], axis = 1)
df_latent = pd.merge(df_latent, user_latent, left_on = 'SongID', right_on = 'ID').drop(['ID'], axis = 1)
print('df latent shape after second merge: ', df_latent.shape)

df mini shape =  (1437138, 3)
df mini size =  UserID    1437138
SongID    1437138
target    1437138
dtype: int64
df latent shape after second merge:  (1437138, 131)
