# Context

## Quick reminder on last course

XX

## Goal of this course

XX


# Concepts

## Estimate the parameters

As we are in the supervised framework, each model will try to relate the target output y (what we want to forecast) to the features X (what we can observe), using more or less complex formulas. These formulas contain parameters that we need to estimate.

For all algorithms we will actually minimize __the loss function__, which quantifies how far the predictions made from X by our model, with a given set of parameters, are from the correct output y. The estimation process can be tricky as we are considering complex relations between our features and output. Fortunately, Python does all the maths for us! If you want to learn more on this: __XX (SGD etc.)__
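To make the idea of minimizing a loss function concrete, here is a minimal sketch using plain gradient descent on a one-feature logistic model. The toy data, the learning rate and the number of iterations are assumptions chosen for illustration; in practice the libraries used in this course use more refined optimisers.

```python
import math

# Toy data (made up for illustration): one feature x, binary target y
x = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
y = [0, 0, 0, 1, 1, 1]

def predict_proba(w, b, xi):
    """Logistic model: probability that y = 1 given the feature value xi."""
    return 1.0 / (1.0 + math.exp(-(w * xi + b)))

def log_loss(w, b):
    """Average log loss: small when predicted probabilities match the targets."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = predict_proba(w, b, xi)
        total += -(yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return total / len(x)

# Plain gradient descent: repeatedly nudge the parameters downhill
w, b = 0.0, 0.0
learning_rate = 0.1
initial_loss = log_loss(w, b)
for _ in range(2000):
    grad_w = sum((predict_proba(w, b, xi) - yi) * xi for xi, yi in zip(x, y)) / len(x)
    grad_b = sum((predict_proba(w, b, xi) - yi) for xi, yi in zip(x, y)) / len(x)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = log_loss(w, b)
print('loss before:', initial_loss, 'loss after:', final_loss)
```

The two parameters `w` and `b` are what gets estimated; the loss after the loop is lower than before because each step moves them in the direction that reduces the loss.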


## Training and test sets
 
When training machine learning models, i.e. estimating the parameters, we want to avoid training the model on all of the data that we have available. Otherwise we risk creating a model that is too specifically attuned to our training data and will later not generalise. This is often called __overfitting__.

So instead we will split our data into __training and test sets__. Standard practice is to do __cross-validation__. Basically we will split the dataset into _n_ (usually 5) chunks and:
- train our model on the features from _(n-1)_ chunks combined (the training set);
- compute the error, i.e. evaluate its predictions against the target y - that we _do_ observe, on the remaining chunk (the test set);
- and reiterate this process for all the possible combinations of _(n-1)_ vs 1 chunks of data (_n_ combinations).

At the end we have computed _n_ errors (see the model evaluation part for more details on the metrics used). We take the average of these errors for model selection purposes.
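The cross-validation loop described above can be sketched as follows. This is only an illustration: the data is synthetic (generated with `make_classification`), and the choice of a shallow decision tree and of `1 - accuracy` as the error are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (not the course dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

n_chunks = 5
kfold = KFold(n_splits=n_chunks, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kfold.split(X):
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    # Train on the (n-1) chunks...
    model.fit(X[train_idx], y[train_idx])
    # ...and measure the error on the remaining chunk
    predictions = model.predict(X[test_idx])
    fold_errors.append(1 - accuracy_score(y[test_idx], predictions))

print('Error per chunk:', np.round(fold_errors, 3))
print('Average error:', np.mean(fold_errors))
```

The helper `cross_val_score` used later in this course wraps the same loop in a single call.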

For many complex problems and datasets the 'bleeding' of knowledge from the evaluation set into the training set (often called leakage) can be a real problem. In that case our model will perform much worse in production than we would have assumed. So it is really important to make sure that the training set contains no information that would not have been available at prediction time.
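One common way to guard against this is to build features only from what was known before a cut-off date and to build the target only from what happened after it. The sketch below assumes a hypothetical viewing log with columns `user_id`, `date` and `minutes`, and a made-up cut-off date; none of these come from the course data.

```python
import pandas as pd

# Hypothetical viewing log: one row per viewing session (made-up values)
events = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'b', 'c'],
    'date': pd.to_datetime(['2017-01-02', '2017-01-20', '2017-01-05',
                            '2017-01-08', '2017-01-18']),
    'minutes': [30.0, 12.0, 45.0, 5.0, 20.0],
})

# Hypothetical cut-off: features come from before it, the target from after it
cutoff = pd.Timestamp('2017-01-15')
past = events[events['date'] < cutoff]
future = events[events['date'] >= cutoff]

# Features only use what was known at the cut-off date
features_demo = past.groupby('user_id')['minutes'].sum().rename('total_watched')

# Target: did the user watch anything after the cut-off?
all_users = events['user_id'].unique()
target_demo = pd.Series(all_users, index=all_users).isin(future['user_id']).astype(int)

# Users with no activity before the cut-off get 0 minutes
features_demo = features_demo.reindex(all_users, fill_value=0.0)
print(features_demo)
print(target_demo)
```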

What kind of modelling?

## Model Selection
 
The first decision you need to make in the model selection process in the supervised framework is whether you plan to use a __classifier__ or a __regressor__. Classifiers assign each datapoint to one of a finite number of classes, while regressors predict a continuous numeric value.
 
Different models work in different ways and are more or less suitable for different problems. Fortunately, understanding these specific differences is not essential to solve your data problems. The Python module `scikit-learn` contains all of the models that you are likely to need, and the format of the data it requires is standardised across models. This makes it very easy to try a myriad of different models and choose the one that performs best on your data.
 
In our current project we use both classifiers and regressors to predict engagement. In the classifier we simply try to predict whether someone has watched any content in the two-week period while in the regressor we attempt to predict the number of minutes watched by the viewer within the two-week period.
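As a small illustration of how the two prediction tasks differ only in their target, the sketch below derives both targets from the same quantity. The minutes are made-up values and the variable names are assumptions for the example.

```python
import numpy as np

# Made-up minutes watched by six viewers over the two-week period
minutes_watched = np.array([0.0, 12.5, 0.0, 240.0, 3.2, 0.0])

# Regression target: the number of minutes itself (a continuous value)
target_regression = minutes_watched

# Classification target: 1 if the viewer watched anything at all, else 0
target_classification = (minutes_watched > 0).astype(int)

print(target_classification)
```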

We won't tackle both prediction tasks here. This course focuses on classification and the next one will focus on regression.

## Model Evaluation

To evaluate our models there are different metrics we can use. For classifiers we can use __the accuracy__ (percentage correct) or __the ROC curve__. The accuracy is the simplest metric but it gives us little insight into the behaviour of the model. ROC curves (and __the area under the curve__ statistic) give us a greater understanding of the separability of the data. For more details see https://en.wikipedia.org/wiki/Receiver_operating_characteristic.
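Both metrics are available in `sklearn.metrics`. The sketch below uses made-up true classes and scores, and assumes a threshold of 0.5 to turn scores into 0/1 predictions for the accuracy.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up true classes and model scores (probability of class 1)
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# Accuracy needs hard 0/1 predictions, so we pick a threshold (0.5 here)
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
accuracy = accuracy_score(y_true, y_pred)

# The AUC works on the scores directly: it is the probability that a random
# class 1 example gets a higher score than a random class 0 example
auc = roc_auc_score(y_true, y_score)

print('Accuracy:', accuracy)
print('AUC:', auc)
```

Here 5 of the 6 predictions are correct, and 8 of the 9 possible (class 1, class 0) pairs are ranked in the right order by the scores. Unlike the accuracy, the AUC does not depend on the choice of threshold.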

We will evaluate our model at different stages of the model selection process. First, when training our model we will compute a test or __out-of-sample error__ (actually _n_ of them using cross-validation). It's called out-of-sample because the predictions are made on data the model has never seen before.

Models usually also have __hyperparameters__: parameters that are not estimated from the data but that we need to tune during the training process. A good way of doing that is to do a grid search, i.e. to evaluate the model with the different combinations of hyperparameter values. We will then retain the combination with the lowest error.
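A grid search combined with cross-validation can be sketched with scikit-learn's `GridSearchCV`. The data below is synthetic and the candidate values in `param_grid` are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the course dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate values are illustrative only
param_grid = {'max_depth': [2, 4, 8], 'min_samples_leaf': [5, 20, 50]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid,
                      scoring='accuracy',
                      cv=5)
search.fit(X, y)

print('Best settings:', search.best_params_)
print('Best cross-validated accuracy:', search.best_score_)
```

Every one of the 3 x 3 combinations is cross-validated, and the one with the highest mean accuracy (i.e. the lowest error) is kept.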

Once we have selected the best set of hyperparameters for a given model, or the best kind of model for the data we have got, we can retrain it on the entire dataset and compute its __in-sample__ error, i.e. evaluate its predictions against the observed targets, keeping in mind that this time we have used this data to build the model.
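The gap between the two kinds of error can be large when a model is flexible enough to memorise its training data. The sketch below illustrates this on synthetic, deliberately noisy data with an unconstrained decision tree; both choices are assumptions made to exaggerate the effect.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy stand-in data (not the course dataset)
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2,
                           random_state=0)

# An unconstrained tree can memorise the data it was trained on
model = DecisionTreeClassifier(random_state=0)

out_of_sample = np.mean(cross_val_score(model, X, y, cv=5, scoring='accuracy'))
in_sample = model.fit(X, y).score(X, y)

print('Out-of-sample accuracy:', out_of_sample)
print('In-sample accuracy:', in_sample)
```

The in-sample figure is the optimistic one, which is why model selection should rely on the out-of-sample estimate.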

Finally, when computing the performance of a model based on a given metric it's important to bear in mind what the performance of a simple model would be. Usually we consider a completely random model as the baseline. Any model you build must be evaluated in terms of its improvement over this performance.

# Classification

The first thing to do is to get our data back.

In [1]:
import pandas as pd
import numpy as np
import itertools

In [3]:
# We put both target arrays (regression and classification) in the same txt file
# As both target arrays have the same size we just need to split it in two
# and get the correct part for the prediction task
target = np.split(np.loadtxt('target.txt'), 2)[1].flatten()
features = pd.read_csv('features.csv')

# User id as index
features = features.set_index('user_id')

In [4]:
features.head()

user_id,tw_lag7_watched,tw_lag6_watched,tw_lag5_watched,tw_lag4_watched,tw_lag3_watched,tw_lag2_watched,tw_lag1_watched,average_completion,total_sessions,total_watched,...,most_weekday_weekday_1,most_weekday_weekday_2,most_weekday_weekday_3,most_weekday_weekday_4,most_weekday_weekday_5,most_weekday_weekday_6,most_timeday_Afternoon,most_timeday_Evening,most_timeday_Morning,most_timeday_Night
0001c6,16.6792,0.0,0.0,0.0,0.0,0.15255,0.0,0.371496,2,16.83175,...,1,0,0,0,0,0,0,1,0,0
000c1a,0.162867,0.147467,107.0984,145.686233,2.286283,100.487767,132.432083,0.233136,28,488.3011,...,0,0,1,0,0,0,0,0,1,0
001c53,1.8663,0.0,0.0,0.0,1.309867,0.0,0.0,0.489419,3,3.176167,...,0,1,0,0,0,0,0,0,1,0
001d44,0.0,0.0,0.0,14.5477,0.0,0.0,0.248017,0.058203,2,14.795717,...,0,0,0,0,0,1,0,0,1,0
002b2e,291.477033,0.0,0.0,0.0,0.0,0.0,0.0,0.228233,17,291.477033,...,0,1,0,0,0,0,0,1,0,0


In [5]:
target[:10]

array([ 1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.])

## Baseline score

When dealing with a 0/1 classification problem we usually talk about __scoring__: the score is the probability of belonging to class 1 (usually our class of interest).

As mentioned before, we need a baseline to compare the performance of our models with. We usually choose as a baseline the score obtained by a random allocation of our users. If we randomly label 100 users as class 1, the share of correct predictions depends on the actual proportion of class 1 in the entire population. If 30% of the population belongs to class 1, we should expect about 30 correct predictions out of our 100 users labelled class 1.

In [6]:
# Let's check our baseline score
sum(target)/len(target)

0.40551224332930713

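So about 40% of our users belong to class 1: labelling users as class 1 at random would be right about 40% of the time, and any classification model must do better than that to add value. Bear in mind that, in terms of overall accuracy, a trivial model that always predicts the majority class (class 0 here) would already be right about 59% of the time, which is a stricter baseline.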
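The accuracy of two simple baselines can be worked out directly from the class share computed above. This is a small sketch; the share is the rounded value from the previous cell and the two baseline strategies are the ones named in the comments.

```python
# Share of class 1 taken from the cell above, rounded
share_class_1 = 0.4055

# Always predicting the majority class (class 0 here)
accuracy_majority = max(share_class_1, 1 - share_class_1)

# Guessing class 1 at random with probability equal to its share:
# we are right when we guess 1 on a true 1, or 0 on a true 0
accuracy_random = share_class_1 ** 2 + (1 - share_class_1) ** 2

print('Always predict the majority class:', round(accuracy_majority, 4))
print('Random guess in class proportions:', round(accuracy_random, 4))
```

These come out at roughly 0.59 and 0.52 respectively, which are useful reference points when reading the cross-validated accuracies in the following sections.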

## Tree based classification model

Scikit documentation: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.score

In [7]:
# Let's build a simple tree based classification model
from sklearn import tree

# Accuracy as our error evaluation
from sklearn.metrics import accuracy_score

# We will use cross validation, so import helper functions for this
from sklearn.model_selection import cross_val_score, cross_val_predict

### Training + hyperparameter tuning

In [8]:
# setup the model and develop a simple grid search against some key parameters
param_max_depth=[2,3,4,6,8,10]
param_min_leaf=[75,90,100,110,125,150]

In [9]:
# Let's keep track of our best parameters
best_score=0
best_param=[0,0]

# We will use the itertools library to try all the possible combinations of parameters
# We could also have used the GridSearchCV capability in scikit-learn
for c in itertools.product(param_max_depth,param_min_leaf):
    treeclass=tree.DecisionTreeClassifier(max_depth=c[0],min_samples_leaf=c[1])
    scores=cross_val_score(treeclass,
                           features,
                           target,
                           scoring='accuracy')
    if np.mean(scores)>best_score:
        best_score=np.mean(scores)
        best_param=c

# print the overall best results
print('Best Settings: Max Depth:',best_param[0], '- Min Sample Leaf:',best_param[1])
print('Score:', best_score)

Best Settings: Max Depth: 4 - Min Sample Leaf: 100
Score: 0.794992688868


### Best tree based model

In [10]:
# Let's keep our best model (hyperparameters tuned)
treeclass=tree.DecisionTreeClassifier(max_depth=best_param[0],
                                      min_samples_leaf=best_param[1])
mod1=treeclass.fit(features,target)

In [11]:
mod1.score(features,target)

0.79795761502141205

In [12]:
feature_imp1=pd.DataFrame(
    {'feature': list(features.columns),
     'importance': list(mod1.feature_importances_)
    })
feature_imp1.sort_values(by='importance', ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,feature,importance
0,tw_lag1_watched,0.720014
1,number_watched,0.11559
2,tw_lag3_watched,0.090706
3,tw_lag2_watched,0.054184
4,tw_lag5_watched,0.012255
5,num_genre,0.00667
6,average_completion,0.000581
7,most_weekday_weekday_4,0.0
8,most_genre_Weather,0.0
9,most_weekday_weekday_0,0.0


## Random Forest

Scikit documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [16]:
# Let's build a random forest
from sklearn.ensemble import RandomForestClassifier

### Training + hyperparameter tuning

In [17]:
# setup the model and develop a simple grid search against some key parameters
param_max_depth=[2,3,4,6,8,10]
param_min_leaf=[75,90,100,110,125,150]

In [18]:
# Let's keep track of our best parameters
best_score=0
best_param=[0,0]

# We will use the itertools library to try all the possible combinations of parameters
# We could also have used the GridSearchCV capability in scikit-learn
for c in itertools.product(param_max_depth,param_min_leaf):
    forrestclass=RandomForestClassifier(n_estimators=200,
                                        max_depth=c[0],min_samples_leaf=c[1])
    scores=cross_val_score(forrestclass,
                           features,
                           target,
                           scoring='accuracy')
    if np.mean(scores)>best_score:
        best_score=np.mean(scores)
        best_param=c

# print the overall best results
print('Best Settings: Max Depth:',best_param[0], '- Min Sample Leaf:',best_param[1])
print('Score:', best_score)

Best Settings: Max Depth: 10 - Min Sample Leaf: 75
Score: 0.795102410105


### Best RF model

In [19]:
# Let's keep our best model (hyperparameters tuned)
forrestclass=RandomForestClassifier(n_estimators=200,
                                    max_depth=best_param[0],
                                    min_samples_leaf=best_param[1])
mod2=forrestclass.fit(features,target)

In [20]:
mod2.score(features,target)

0.80213022949379598

In [21]:
feature_imp2=pd.DataFrame(
    {'feature': list(features.columns),
     'importance': list(mod2.feature_importances_)
    })
feature_imp2.sort_values(by='importance', ascending=False).reset_index(drop=True).head(10)

Unnamed: 0,feature,importance
0,tw_lag1_watched,0.233214
1,tw_lag2_watched,0.183893
2,total_sessions,0.102758
3,number_watched,0.09036
4,tw_lag3_watched,0.084441
5,num_weekday,0.073745
6,num_genre,0.057644
7,total_watched,0.049236
8,tw_lag5_watched,0.03068
9,tw_lag4_watched,0.029978


## Logistic Regression

Scikit documentation: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [22]:
# Let's try something that is not based on decision trees
from sklearn import linear_model

### Training + hyperparameter tuning

In [23]:
# setup the model and develop a simple grid search against some key parameters
param_C=[0.001,0.01,0.1,1.0,10,100,1000]

In [24]:
# Let's keep track of our best parameters
best_score=0
best_param=0

# We will set up a manual grid search, but you can also use the GridSearchCV capability in scikit-learn
for i in param_C:
    logclass=linear_model.LogisticRegression(C=i)
    scores=cross_val_score(logclass,
                           features,
                           target,
                           scoring='accuracy')
    if np.mean(scores)>best_score:
        best_score=np.mean(scores)
        best_param=i

# print the overall best results
print('Best Settings: C:',best_param)
print('Score:', best_score)

Best Settings: C: 0.01
Score: 0.773690489886


### Best logistic model

In [25]:
# Let's keep our best model (hyperparameters tuned)
logclass=linear_model.LogisticRegression(C=best_param)
mod3=logclass.fit(features,target)

In [26]:
mod3.score(features,target)

0.77709454265949274

In [27]:
coef_mod3=pd.DataFrame(
    {'feature': list(features.columns),
     'coef': list(mod3.coef_.flatten())
    })
coef_mod3.sort_values(by='coef',ascending=False).reset_index(drop=True).head()

Unnamed: 0,coef,feature
0,0.254606,num_weekday
1,0.113189,num_genre
2,0.007645,number_watched
3,0.005754,tw_lag1_watched
4,0.001347,tw_lag2_watched


In [28]:
coef_mod3.sort_values(by='coef',ascending=False).reset_index(drop=True).tail()

Unnamed: 0,coef,feature
31,-0.282503,most_genre_Factual
32,-0.283514,most_genre_Drama
33,-0.289396,most_timeday_Afternoon
34,-0.30526,most_timeday_Night
35,-0.394988,most_timeday_Evening


P VALUES ??
AND which threshold ...

## Support Vector Machine (SVM)

Scikit documentation: http://scikit-learn.org/stable/modules/svm.html

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

In [29]:
# Let's try to capture some non-linear patterns
from sklearn import svm

### Training + hyperparameter tuning

In [30]:
# setup the model and develop a simple grid search against some key parameters
param_C=[0.001,0.01,0.1,1.0,10,100,1000]

In [30]:
# Let's keep track of our best parameters
best_score=0
best_param=0

# We will set up a manual grid search, but you can also use the GridSearchCV capability in scikit-learn
for i in param_C:
    svcclass=svm.SVC(C=i)
    scores=cross_val_score(svcclass,
                           features,
                           target,
                           scoring='accuracy')
    if np.mean(scores)>best_score:
        best_score=np.mean(scores)
        best_param=i

# print the overall best results
print('Best Settings: C:',best_param)
print('Score:', best_score)

Best Settings: C: 1.0
Score: 0.692433069322


### Best SVM model

In [31]:
# Let's keep our best model (hyperparameters tuned)
svcclass=svm.SVC(C=best_param)
mod4=svcclass.fit(features,target)

In [32]:
mod4.score(features,target)

0.94729329087515102