# Prediction of music listening (Part III): Models

In this notebook, we apply several machine learning models to the pre-processed data. The first step is to retrieve the latent factors computed in the previous notebook (we can provide the files on request).

In the first section, we use TensorFlow to train a neural network. Then, we use a gradient boosting library called XGBoost.

## Retrieving data

In [0]:
# Links this notebook with a Google Drive account on which data will be stored
from google.colab import drive
drive.mount('drive')

Mounted at drive


In [0]:
# Import data from the linked Google Drive account
!cp 'drive/My Drive/train_latent.p' .
!cp 'drive/My Drive/test_latent.p' .

The **pickle** module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file) is converted back into an object.
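
As a quick illustration of the round trip described above, here is a minimal sketch that pickles a small object to bytes and restores it. The `record` dictionary is a made-up example and is not part of the dataset.

```python
import pickle

# Hypothetical object, used only to illustrate the round trip
record = {"user": 42, "latent": [0.1, -0.3, 0.7]}

# "Pickling": Python object -> byte stream
payload = pickle.dumps(record)

# "Unpickling": byte stream -> an equal (but distinct) Python object
restored = pickle.loads(payload)

print(type(payload).__name__)
print(restored == record)
```

Note that unpickling can execute arbitrary code, so only files from a trusted source should be loaded this way.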

In [0]:
import pickle

# Load the latent-factor DataFrames and drop the identifier columns
with open('train_latent.p', 'rb') as f:
    train_latent = pickle.load(f).drop(['SongID', 'UserID'], axis=1)
with open('test_latent.p', 'rb') as f:
    test_latent = pickle.load(f).drop(['SongID', 'UserID'], axis=1)

# Separate the features from the target, then free the full DataFrame
Xtrain = train_latent.drop(columns='target')
Ytrain = train_latent.target.values
del train_latent
print('Shape of Xtrain: ', Xtrain.shape)

Xtest = test_latent.drop(columns='target')
Ytest = test_latent.target.values
del test_latent

## TensorFlow
**TensorFlow** is an end-to-end open-source platform for machine learning, developed by Google.

**Dense** implements the operation $output = activation(input \cdot kernel + bias)$, where $activation$ is the element-wise activation function, $kernel$ is a weights matrix created by the layer, and $bias$ is a bias vector created by the layer. Here, in most cases we chose $activation=tf.nn.relu$, which corresponds to $max(features, 0)$. We also have $activation=tf.nn.softmax$, with $softmax$ defined as $\frac{exp(inputs)}{\sum exp(inputs)}$.
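
To make the formula concrete, here is a minimal NumPy sketch of a dense layer with the two activations mentioned above. It is an illustration only, not how TensorFlow implements the layer. The helper names (`relu`, `softmax`, `dense`), the random inputs and the weight scale are assumptions chosen for the example.

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0)
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtracting the row maximum improves numerical stability
    # without changing the result
    shifted = x - x.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def dense(inputs, kernel, bias, activation):
    # output = activation(inputs . kernel + bias)
    return activation(inputs @ kernel + bias)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 128))           # batch of 4 samples, 128 features
kernel = rng.normal(size=(128, 350)) * 0.05  # illustrative random weights
bias = np.zeros(350)

hidden = dense(inputs, kernel, bias, relu)
print(hidden.shape)  # (4, 350)

probs = softmax(rng.normal(size=(4, 2)))
print(probs.sum(axis=1))  # each row sums to 1 (up to rounding)
```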

**Dropout** randomly sets a fraction of a layer's inputs to zero during training, which helps to prevent overfitting.
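
The idea can be sketched in a few lines of NumPy. This is an illustrative version of "inverted" dropout, in which the kept values are rescaled by `1 / (1 - rate)` so that the expected activation is unchanged. The `dropout` function and its arguments are hypothetical names for this example, not the Keras API.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    # At inference time the layer leaves its input untouched
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    # Keep each value with probability keep_prob, zero it otherwise
    mask = rng.random(x.shape) < keep_prob
    # Rescale the kept values so the expected sum stays the same
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, rate=0.2, rng=rng)

print((dropped == 0).mean())  # fraction of zeroed values, close to the rate
print(dropped.mean())         # close to the original mean of 1
```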

**to_categorical** converts a class vector (integers) into a binary class matrix (one-hot encoding).
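
For intuition, the conversion can be reproduced with NumPy as in the sketch below. The `one_hot` helper is a hypothetical stand-in written for this example. It is not the Keras function itself.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    # Convert integer class labels into a binary class matrix
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[labels]

encoded = one_hot([0, 1, 1, 0])
print(encoded)
```

Each row contains a single 1 in the column of its class, so a binary target becomes a matrix with two columns.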

In [0]:
import tensorflow as tf

with tf.device('/gpu:0'):  # run the code on the GPU
    # Define the network; each input is a vector of 128 features
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(128,)),
        #tf.keras.layers.Dropout(0.20),
        tf.keras.layers.Dense(350, activation=tf.nn.relu),
        #tf.keras.layers.Dropout(0.18),
        tf.keras.layers.Dense(200, activation=tf.nn.relu),
        #tf.keras.layers.Dropout(0.18),
        tf.keras.layers.Dense(130, activation=tf.nn.relu),
        #tf.keras.layers.Dropout(0.12),
    ])

    # Compile the model
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Use the model:
    #Y_test = to_categorical(Y_test)
    #Y_train = to_categorical(Y_train)

    # validation_data is passed as a (features, labels) tuple
    history = model.fit(Xtrain, Ytrain, epochs=70, batch_size=10000,
                        validation_data=(Xtest, Ytest))
    loss, accuracy = model.evaluate(Xtest, Ytest)

Train on 5901934 samples, validate on 1437031 samples
Epoch 1/70
Epoch 2/70
Epoch 3/70
Epoch 4/70
Epoch 5/70

KeyboardInterrupt: ignored

## XGBoost
Considering the results we obtained in the previous section, we now want to try a boosting algorithm such as **XGBoost**. It implements machine learning algorithms under the Gradient Boosting framework and provides parallel tree boosting.

XGBoost reads its data from a `DMatrix`, an internal data structure optimized for both memory efficiency and training speed.

Before running XGBoost, we must set three types of parameters: general parameters, booster parameters and learning task parameters.

*   General parameters relate to which booster we are using, commonly a tree or linear model.

*   Booster parameters depend on which booster we have chosen.

*   Learning task parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.
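
As a sketch of how these three groups fit together, the parameters can be written as separate dictionaries and merged into the single `params` dictionary that XGBoost expects. The grouping follows the XGBoost parameter documentation. The variable names and the values are illustrative assumptions, not tuned settings.

```python
# General parameters: which booster to use and how verbose to be
general_params = {"booster": "gbtree", "verbosity": 2}

# Booster parameters (here for the tree booster): how each tree is grown
booster_params = {
    "max_depth": 8,
    "learning_rate": 0.1,
    "subsample": 0.5,
    "colsample_bytree": 0.3,
    "alpha": 1,
}

# Learning task parameters: the objective and the evaluation metric
task_params = {"objective": "binary:logistic", "eval_metric": "error"}

# XGBoost takes all of them in one flat dictionary
params = {**general_params, **booster_params, **task_params}
print(sorted(params))
```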


In [0]:
import xgboost as xgb
from xgboost import XGBClassifier
import numpy as np

# Wrap the features and labels in XGBoost's DMatrix format
dtrain = xgb.DMatrix(Xtrain, label=Ytrain)
dtest = xgb.DMatrix(Xtest, label=Ytest)

In [0]:
params = {"objective": "binary:hinge",
          "colsample_bytree": 0.3,
          "learning_rate": 0.1,
          "max_depth": 8,
          "alpha": 1,
          "tree_method": "gpu_hist",
          "subsample": 0.5,
          "verbosity": 2}

num_round = 400
progress = dict()
evals = [(dtrain, "train"), (dtest, "test")]

xgb.train(params, dtrain, num_boost_round = num_round, evals = evals, evals_result = progress)

[0]	train-error:0.496486	test-error:0.493062
[1]	train-error:0.496486	test-error:0.493062
[2]	train-error:0.496486	test-error:0.493062
[3]	train-error:0.496486	test-error:0.493062
[4]	train-error:0.496486	test-error:0.493062
[5]	train-error:0.468685	test-error:0.486445
[6]	train-error:0.467436	test-error:0.486121
[7]	train-error:0.466835	test-error:0.485996
[8]	train-error:0.466544	test-error:0.485934
[9]	train-error:0.466447	test-error:0.485912
[10]	train-error:0.465387	test-error:0.485909
[11]	train-error:0.46431	test-error:0.485903
[12]	train-error:0.462653	test-error:0.486672
[13]	train-error:0.461096	test-error:0.486968
[14]	train-error:0.459195	test-error:0.486963
[15]	train-error:0.457136	test-error:0.486995
[16]	train-error:0.454461	test-error:0.486482
[17]	train-error:0.452506	test-error:0.486429
[18]	train-error:0.450572	test-error:0.485891
[19]	train-error:0.448511	test-error:0.485491
[20]	train-error:0.446672	test-error:0.484796
[21]	train-error:0.44423	test-error:0.484031


KeyboardInterrupt: ignored

We can conclude that this method is not effective: the test error stays around 0.485, close to the 0.5 expected from random guessing, and hardly decreases as the number of iterations grows, while the train error keeps falling.
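
`xgb.train` fills the dictionary passed as `evals_result` (named `progress` above) with one list of metric values per evaluation set, which makes this kind of check easy to automate. The sketch below assumes the layout `{dataset: {metric: [values]}}`. `best_round` is a hypothetical helper, and the numbers are the first eight rounds copied from the log above.

```python
def best_round(evals_result, dataset="test", metric="error"):
    # Return (round index, value) for the lowest recorded metric value
    values = evals_result[dataset][metric]
    index = min(range(len(values)), key=values.__getitem__)
    return index, values[index]

# First eight rounds of the log above (rounds 0 to 7)
progress_example = {
    "train": {"error": [0.496486, 0.496486, 0.496486, 0.496486,
                        0.496486, 0.468685, 0.467436, 0.466835]},
    "test": {"error": [0.493062, 0.493062, 0.493062, 0.493062,
                       0.493062, 0.486445, 0.486121, 0.485996]},
}

round_index, test_error = best_round(progress_example)
train_error = progress_example["train"]["error"][round_index]

print(round_index, test_error)
# A widening gap between test and train error is a sign of overfitting
print(test_error - train_error)
```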

Let's now use XGBoost's random forest mode, in which each boosting round grows several trees in parallel (`num_parallel_tree`).

In [0]:
params = {
  "colsample_bynode": 0.8,
  "learning_rate": 1,
  "max_depth": 5,
  "num_parallel_tree": 100,
  "objective": "binary:logistic",
  "subsample": 0.8,
  "tree_method": "gpu_hist",
  "verbosity": 2
}
num_round = 400
progress = dict()
evals = [(dtrain,"train"),(dtest,"test")]

xgb.train(params, dtrain, num_boost_round = num_round, evals = evals, evals_result = progress)

[0]	train-error:0.383622	test-error:0.454578
[1]	train-error:0.366298	test-error:0.454279
[2]	train-error:0.353478	test-error:0.4538
[3]	train-error:0.344892	test-error:0.453893
[4]	train-error:0.337752	test-error:0.454109
[5]	train-error:0.33106	test-error:0.453443
[6]	train-error:0.326056	test-error:0.453495
[7]	train-error:0.321501	test-error:0.453083
[8]	train-error:0.318183	test-error:0.453429
[9]	train-error:0.314604	test-error:0.453642
[10]	train-error:0.311081	test-error:0.453516
[11]	train-error:0.308149	test-error:0.454333
[12]	train-error:0.306125	test-error:0.453643
[13]	train-error:0.303332	test-error:0.454323
[14]	train-error:0.300684	test-error:0.454157
[15]	train-error:0.298673	test-error:0.4543
[16]	train-error:0.296385	test-error:0.454443


KeyboardInterrupt: ignored

In [0]:
params = {
  'booster':'gblinear'
}
num_round = 400
progress = dict()
evals = [(dtrain, "train"), (dtest, "test")]

xgb.train(params, dtrain, num_boost_round = num_round, evals = evals, evals_result = progress)