# Context

## Quick reminder on last course

XX

## Goal of this course

XX


# Concepts

## Training

As explained in the previous course, we build our model on a training set. Because we are in the supervised framework, this dataset contains both the features and the output variable (X, y). Each model tries to relate the target output (what we want to forecast) to the features (what we can observe), using more or less complex formulas. For every algorithm, training means minimizing __the loss function__, which quantifies how far the predictions made from X by our model and its set of parameters are from the correct output y. The estimation process can be tricky, as we are considering complex relations between our features and output. Fortunately, Python does all the maths for us! If you want to learn more on this: __XX (SGD etc.)__
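To make the idea of a loss function concrete, here is a minimal sketch of one common choice for 0/1 classification, the log loss (cross-entropy). The function name `log_loss` and the toy numbers are illustrative assumptions, not part of the course data.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of the true labels
    under the predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the true labels
bad = [0.4, 0.6, 0.3, 0.7]   # probabilities leaning the wrong way

print('loss of the good predictions:', log_loss(y, good))
print('loss of the bad predictions:', log_loss(y, bad))
```

The better the predicted probabilities match the true labels, the smaller the loss, which is why training searches for the parameters that minimize it.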

To avoid overfitting during training, the standard practice is to use __cross-validation__. Instead of training our model on all the data at once, we split the dataset into _n_ chunks (usually 5), train the model on _n_ - 1 of them (four here) and test it on the remaining one. We repeat this so that each chunk serves as the test chunk once, then average the _n_ scores.

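Cross-validation is also used to tune hyperparameters: each candidate setting is scored by cross-validation and we keep the one with the best average score.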
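The splitting behind cross-validation can be sketched in a few lines of plain Python. The helper `k_fold_indices` is a hypothetical name used only for illustration; in practice scikit-learn's `cross_val_score` handles the splitting for us.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs; each chunk is held out exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, n_folds=5))
for train_idx, test_idx in folds:
    print('train on', train_idx, '- test on', test_idx)
```

Every observation ends up in the test chunk exactly once, so the averaged score uses all the data without ever testing on rows the model was trained on.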

What kind of model?

## Model Selection
 
The first decision you need to make in the model selection process in the supervised framework is whether you plan to use a __classifier__ or a __regressor model__. Classifiers assign each datapoint to one of a finite number of classes, while regressors predict a continuous numerical value.
 
Different models work in different ways and are more or less suitable for different problems. Fortunately, understanding these specific differences is not essential to solve your data problems. The Python module `scikit-learn` contains all of the models that you are likely to need, and the data format it requires is standardised across models. This makes it very easy to try a myriad of different models on your data and choose the one that performs best.
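As a small illustration of this standardised interface, the sketch below fits two different classifiers with exactly the same calls. The data is synthetic (generated with `make_classification`), and the chosen hyperparameters are arbitrary assumptions rather than values from this course.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real feature matrix and 0/1 target
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

models = {
    'tree': DecisionTreeClassifier(max_depth=3, random_state=0),
    'logistic': LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)                   # same call for every model
    scores[name] = model.score(X, y)  # accuracy on the data it was fitted on
    print(name, scores[name])
```

Swapping one model for another only changes the line that creates it; `fit`, `predict` and `score` stay the same.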
 
In our current project we use both classifiers and regressors to predict engagement. In the classifier we simply try to predict whether someone has watched any content in the two-week period while in the regressor we attempt to predict the number of minutes watched by the viewer within the two-week period.
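The difference between the two targets can be shown on a toy example. The numbers below are invented, and the rule "watched any content means more than zero minutes" is an assumption made for this sketch.

```python
# Hypothetical minutes watched by five viewers over the two-week period
minutes_watched = [0.0, 12.5, 0.0, 90.0, 3.0]

# Regression target: the number of minutes itself (continuous)
regression_target = minutes_watched

# Classification target: 1 if the viewer watched anything, else 0
# (assumes "watched any content" means minutes > 0)
classification_target = [int(m > 0) for m in minutes_watched]

print('regression target:', regression_target)
print('classification target:', classification_target)
```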

We won't tackle both prediction tasks here. This course focuses on classification and the next one will focus on regression.

## Model Evaluation

There are various approaches we can use to evaluate our models. As previously mentioned, the final evaluation should ideally use data that the model has never seen before (the test set). The test set is pushed through the trained model to make predictions, and those predictions are compared with the actual targets.

For classifiers we can use __the accuracy__ (percentage correct) or __the ROC curve__. The accuracy is the simplest metric but it gives us little insight into the behaviour of the model. ROC curves (and __the area under the curve__ statistic) give us a greater understanding of the separability of the data. For more details see https://en.wikipedia.org/wiki/Receiver_operating_characteristic.
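Both metrics are available in `sklearn.metrics`. The sketch below computes them on four made-up predictions; the labels and scores are illustrative only.

```python
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]            # predicted probability of class 1
y_pred = [int(s >= 0.5) for s in y_score]  # hard labels at a 0.5 threshold

acc = accuracy_score(y_true, y_pred)   # share of correct hard labels
auc = roc_auc_score(y_true, y_score)   # area under the ROC curve

# Points of the ROC curve, one per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print('accuracy:', acc)
print('AUC:', auc)
```

Note that accuracy needs hard 0/1 labels, whereas the ROC curve and its AUC are computed from the scores themselves, across all possible thresholds.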

When computing the performance of a model on a given metric, it's important to bear in mind what the performance of a simple model would be. Usually we take a completely random model as the baseline. Any model you build must be evaluated in terms of its improvement over this baseline.

# Classification

The first thing to do is to get our data back.

In [1]:
import pandas as pd
import numpy as np
import itertools

In [2]:
# We put both training target arrays (regression and classification) in the same txt file
# As both target arrays have the same size we just need to split it in two
# and get the right part for the prediction task
target_training = np.split(np.loadtxt('target_training'), 2)[1].flatten()
features_training = pd.read_csv('features_training.csv')

# Same for test data
target_test = np.split(np.loadtxt('target_test'), 2)[1].flatten()
features_test = pd.read_csv('features_test.csv')

In [3]:
# User id as index
features_training = features_training.set_index('user_id')
features_test = features_test.set_index('user_id')

## Baseline score

When dealing with a 0/1 classification problem we usually talk about __scoring__: the score is the probability of belonging to class 1 (usually our class of interest).

As mentioned before, we need a baseline to compare the performance of our models with. We usually choose as a baseline the score obtained by allocating users at random. If we randomly label 100 users as class 1, the share we get right depends on the actual proportion of class 1 in the entire population. If 30% of the population is class 1, then we should expect about 30 correct predictions out of our 100 users labelled class 1.

In [4]:
# Let's check our baseline score
sum(target_training)/len(target_training)

0.37152624614027346

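So about 37% of the users in the training set belong to class 1, and labelling users as class 1 at random would be right about 37% of the time. Bear in mind that a model that always predicts the majority class (class 0) already reaches about 63% accuracy, so a classification model only adds value if it beats that figure.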
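To put numbers on these simple baselines, here is a sketch on a toy target with 37% of class 1. The labels are made up to mimic the proportion above; they are not the course data.

```python
labels = [1] * 37 + [0] * 63  # toy target with 37% of class 1 (illustrative)

share_class_1 = sum(labels) / len(labels)

# Baseline A: always predict the most frequent class
majority_class = int(share_class_1 >= 0.5)
majority_accuracy = sum(1 for y in labels if y == majority_class) / len(labels)

# Baseline B: predict class 1 at random with probability p equal to its share;
# the expected accuracy is p^2 + (1 - p)^2
random_accuracy = share_class_1 ** 2 + (1 - share_class_1) ** 2

print('share of class 1:', share_class_1)
print('majority-class accuracy:', majority_accuracy)
print('expected accuracy of random guessing:', random_accuracy)
```

scikit-learn also provides `sklearn.dummy.DummyClassifier` for this kind of baseline, so it can be scored with the same code as a real model.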

## Tree based classification model

Scikit documentation: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.score

In [5]:
# Let's build a simple tree based classification model
from sklearn import tree

# Accuracy as our error evaluation
from sklearn.metrics import accuracy_score

# We will use cross validation, so import helper functions for this
from sklearn.model_selection import cross_val_score, cross_val_predict

### Training + hyperparameter tuning

In [6]:
# setup the model and develop a simple grid search against some key parameters
param_max_depth=[2,3,4,6,8,10]
param_min_leaf=[75,90,100,110,125,150]

In [7]:
# Let's keep track of our best parameters
best_score=0
best_param=[0,0]

# We will use the itertools library to try all the possible combinations of parameters
# We could also have used the GridSearchCV class in scikit-learn
for c in itertools.product(param_max_depth,param_min_leaf):
    treeclass=tree.DecisionTreeClassifier(max_depth=c[0],min_samples_leaf=c[1])
    scores=cross_val_score(treeclass,
                           features_training,
                           target_training,
                           scoring='accuracy')
    if np.mean(scores)>best_score:
        best_score=np.mean(scores)
        best_param=c

# print the overall best results
print('Best Settings: Max Depth:',best_param[0], '- Min Sample Leaf:',best_param[1])
print('Score:', best_score)

Best Settings: Max Depth: 4 - Min Sample Leaf: 90
Score: 0.810101888901
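The comment in the loop above mentions `GridSearchCV` as an alternative. A rough equivalent might look like the sketch below; it runs on synthetic data standing in for `features_training` and `target_training`, so its output will not match the scores printed in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for features_training / target_training
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

# Same grid as the manual search
param_grid = {
    'max_depth': [2, 3, 4, 6, 8, 10],
    'min_samples_leaf': [75, 90, 100, 110, 125, 150],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring='accuracy', cv=5)
search.fit(X, y)

print('Best Settings:', search.best_params_)
print('Score:', search.best_score_)
```

By default `GridSearchCV` also refits the best model on the whole training set, which is then available as `search.best_estimator_`.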


### Testing

In [8]:
# Let's keep our best model (hyperparameters tuned)
treeclass=tree.DecisionTreeClassifier(max_depth=best_param[0],
                                      min_samples_leaf=best_param[1])
mod1=treeclass.fit(features_training,target_training)

In [9]:
mod1.score(features_training,target_training)

0.81451257168063518

In [38]:
feature_imp1=pd.DataFrame(
    {'feature': list(features_training.columns),
     'importance': list(mod1.feature_importances_)
    })
feature_imp1.sort_values(by='importance', ascending=False).reset_index(drop=True).head(10)

| | feature | importance |
|---|---|---|
| 0 | tw_lag1_watched | 0.726536 |
| 1 | number_watched | 0.163234 |
| 2 | tw_lag2_watched | 0.079183 |
| 3 | tw_lag3_watched | 0.021644 |
| 4 | num_weekday | 0.005855 |
| 5 | average_completion | 0.003547 |
| 6 | tw_lag6_watched | 0.0 |
| 7 | most_weekday_weekday_4 | 0.0 |
| 8 | most_genre_Weather | 0.0 |
| 9 | most_weekday_weekday_0 | 0.0 |


In [11]:
pred1=mod1.predict(features_test)

In [12]:
accuracy_score(target_test, pred1)

0.75647941060386792

## Random Forest

Scikit documentation: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [13]:
# Let's build a random forest
from sklearn.ensemble import RandomForestClassifier

### Training + hyperparameter tuning

In [14]:
# setup the model and develop a simple grid search against some key parameters
param_max_depth=[2,3,4,6,8,10]
param_min_leaf=[75,90,100,110,125,150]

In [15]:
# Let's keep track of our best parameters
best_score=0
best_param=[0,0]

# We will use the itertools library to try all the possible combinations of parameters
# We could also have used the GridSearchCV class in scikit-learn
for c in itertools.product(param_max_depth,param_min_leaf):
    forrestclass=RandomForestClassifier(n_estimators=200,
                                        max_depth=c[0],min_samples_leaf=c[1])
    scores=cross_val_score(forrestclass,
                           features_training,
                           target_training,
                           scoring='accuracy')
    if np.mean(scores)>best_score:
        best_score=np.mean(scores)
        best_param=c

# print the overall best results
print('Best Settings: Max Depth:',best_param[0], '- Min Sample Leaf:',best_param[1])
print('Score:', best_score)

Best Settings: Max Depth: 8 - Min Sample Leaf: 75
Score: 0.807455474643


### Testing

In [16]:
# Let's keep our best model (hyperparameters tuned)
forrestclass=RandomForestClassifier(n_estimators=200,
                                    max_depth=best_param[0],
                                    min_samples_leaf=best_param[1])
mod2=forrestclass.fit(features_training,target_training)

In [17]:
mod2.score(features_training,target_training)

0.81451257168063518

In [39]:
feature_imp2=pd.DataFrame(
    {'feature': list(features_training.columns),
     'importance': list(mod2.feature_importances_)
    })
feature_imp2.sort_values(by='importance', ascending=False).reset_index(drop=True).head(10)

| | feature | importance |
|---|---|---|
| 0 | tw_lag1_watched | 0.226734 |
| 1 | number_watched | 0.132636 |
| 2 | tw_lag2_watched | 0.122547 |
| 3 | total_sessions | 0.108742 |
| 4 | num_weekday | 0.09068 |
| 5 | total_watched | 0.084173 |
| 6 | tw_lag3_watched | 0.082247 |
| 7 | tw_lag4_watched | 0.040228 |
| 8 | num_genre | 0.039364 |
| 9 | num_timeday | 0.028694 |


In [19]:
pred2=mod2.predict(features_test)

In [20]:
accuracy_score(target_test, pred2)

0.7556900407841074

## Logistic Regression

Scikit documentation: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [21]:
# Let's try something that is not based on decision trees
from sklearn import linear_model

### Training + hyperparameter tuning

In [22]:
# setup the model and develop a simple grid search against some key parameters
param_C=[0.001,0.01,0.1,1.0,10,100,1000]

In [23]:
# Let's keep track of our best parameters
best_score=0
best_param=0

# We will set up a manual grid search, but you can also use the GridSearchCV class in scikit-learn
for i in param_C:
    logclass=linear_model.LogisticRegression(C=i)
    scores=cross_val_score(logclass,
                           features_training,
                           target_training,
                           scoring='accuracy')
    if np.mean(scores)>best_score:
        best_score=np.mean(scores)
        best_param=i

# print the overall best results
print('Best Settings: C:',best_param)
print('Score:', best_score)

Best Settings: C: 10
Score: 0.793009093676


### Testing

In [24]:
# Let's keep our best model (hyperparameters tuned)
logclass=linear_model.LogisticRegression(C=best_param)
mod3=logclass.fit(features_training,target_training)

In [25]:
mod3.score(features_training,target_training)

0.79378032642258489

In [40]:
coef_mod3=pd.DataFrame(
    {'feature': list(features_training.columns),
     'coef': list(mod3.coef_.flatten())
    })
coef_mod3.sort_values(by='coef',ascending=False).reset_index(drop=True).head()

| | coef | feature |
|---|---|---|
| 0 | 0.312772 | num_weekday |
| 1 | 0.147067 | most_genre_Sport |
| 2 | 0.135839 | num_genre |
| 3 | 0.12833 | num_timeday |
| 4 | 0.009653 | number_watched |


In [41]:
coef_mod3.sort_values(by='coef',ascending=False).reset_index(drop=True).tail()

| | coef | feature |
|---|---|---|
| 30 | -0.335106 | most_timeday_Morning |
| 31 | -0.343705 | most_genre_Children's |
| 32 | -0.411556 | most_timeday_Afternoon |
| 33 | -0.4388 | most_timeday_Evening |
| 34 | -0.676191 | most_timeday_Night |


P VALUES ??
AND which threshold ...

In [27]:
pred3=mod3.predict(features_test)

In [28]:
accuracy_score(target_test, pred3)

0.73819234311274828
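Logistic regression produces a score (the probability of class 1), and turning that score into a 0/1 label requires a threshold. The sketch below shows how a different threshold could be applied with `predict_proba`; the data is synthetic and the 0.7 cut-off is an arbitrary assumption, not a recommendation from this course.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the course features and 0/1 target
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X)[:, 1]       # score: estimated P(class 1)

default_pred = (proba >= 0.5).astype(int)  # the usual 0.5 cut-off
strict_pred = (proba >= 0.7).astype(int)   # hypothetical stricter cut-off

print('labelled class 1 at 0.5:', int(default_pred.sum()))
print('labelled class 1 at 0.7:', int(strict_pred.sum()))
```

Raising the threshold labels fewer users as class 1; which threshold is appropriate depends on the relative cost of false positives and false negatives.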

## Support Vector Machine (SVM)

Scikit documentation: http://scikit-learn.org/stable/modules/svm.html

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

In [29]:
# Let's try to capture some non-linear patterns
from sklearn import svm

### Training + hyperparameter tuning

In [30]:
# setup the model and develop a simple grid search against some key parameters
param_C=[0.001,0.01,0.1,1.0,10,100,1000]

In [31]:
# Let's keep track of our best parameters
best_score=0
best_param=0

# We will set up a manual grid search, but you can also use the GridSearchCV class in scikit-learn
for i in param_C:
    svcclass=svm.SVC(C=i)
    scores=cross_val_score(svcclass,
                           features_training,
                           target_training,
                           scoring='accuracy')
    if np.mean(scores)>best_score:
        best_score=np.mean(scores)
        best_param=i

# print the overall best results
print('Best Settings: C:',best_param)
print('Score:', best_score)

Best Settings: C: 1.0
Score: 0.716144130385


### Testing

In [32]:
# Let's keep our best model (hyperparameters tuned)
svcclass=svm.SVC(C=best_param)
mod4=svcclass.fit(features_training,target_training)

In [33]:
mod4.score(features_training,target_training)

0.95555800617556241

In [34]:
pred4=mod4.predict(features_test)

In [35]:
accuracy_score(target_test, pred4)

0.69425075647941059
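The four models can be compared side by side using the training and test accuracies printed above, rounded here to four decimals. The dictionary below simply restates those outputs; the gap between training and test accuracy is one rough indicator of overfitting.

```python
# (training accuracy, test accuracy) as printed in the cells above
results = {
    'decision tree': (0.8145, 0.7565),
    'random forest': (0.8145, 0.7557),
    'logistic regression': (0.7938, 0.7382),
    'SVM': (0.9556, 0.6943),
}

gaps = {name: round(train_acc - test_acc, 4)
        for name, (train_acc, test_acc) in results.items()}

for name, (train_acc, test_acc) in results.items():
    print(name, '- train:', train_acc, '- test:', test_acc, '- gap:', gaps[name])

best_on_test = max(results, key=lambda name: results[name][1])
print('Best test accuracy:', best_on_test)
```

On these figures the SVM shows by far the largest gap between training and test accuracy, while the decision tree has the highest test accuracy, narrowly ahead of the random forest.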