# Context

## Quick reminder on last course

XX

## Goal of this course

XX

# Concepts

We need to prepare a dataset that can be ingested by at least one of many statistical models. But first we need to understand which part of Machine Learning our problem belongs to.

## Supervised vs unsupervised learning

The first step here is to determine whether your data problem is suited to “supervised” or “unsupervised” learning models.
 
__Unsupervised learning__ models receive input but no output (or target) variables and, essentially, are a way of discovering latent structure in a set of data (“clustering” is an example of unsupervised learning). Unsupervised models are very useful when working with unlabelled datasets. These models can then be combined with supervised models.
 
__Supervised learning__ models, essentially, learn a mathematical function between an input (explanatory variables) and an output (“target”). These models are used in situations where you know what you want to predict and have explicit input-output pairs on which to train your model.

In our current project we want to forecast the minutes watched on iPlayer (output or target variable) based on past behaviour (input or explanatory variables). We have input-output pairs, so we are in the supervised learning framework.

## Training and test sets
 
When training machine learning models, we avoid fitting the model on all of the data that we have available. A model that is trained and evaluated on the same data can become too specifically attuned to its training data and later fail to generalise to new data - this is often called __overfitting__ - and we would have no way of noticing it.

So instead we will split our data into __a training set and a test set__. We will then train our model on the training set and evaluate its predictions against the target - that we do observe - in the test set.
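To illustrate why a held-out test set matters, here is a minimal sketch on made-up data (not the iPlayer dataset). It assumes a noisy linear relationship and compares a simple and a very flexible polynomial fit; the data, the split rule and the `mse` helper are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy linear relationship
x = np.linspace(-1, 1, 30)
y = 2 * x + rng.normal(scale=0.3, size=x.size)

# Hold out every third point as a test set
test_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

def mse(coefs, x_values, y_values):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coefs, x_values) - y_values) ** 2))

errors = {}
for degree in (1, 8):
    coefs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print('degree', degree, '- train MSE:', errors[degree][0],
          '- test MSE:', errors[degree][1])
```

The more flexible model always fits the training points at least as well as the simple one, which is why the training error alone cannot tell us whether a model generalises. Only the test error gives an indication of that.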

For many complex problems and datasets the 'bleeding' of knowledge from the evaluation set into the training set (often called data leakage) can be a real problem. In that case our model will perform much worse in production than we would have assumed. It is therefore really important to make sure that the training set contains no information that would not have been available at the time of the prediction.


# Data wrangling

## Scope

When computing the distribution of our observations across the `twoweek` variable (course 1 - exploratory data analysis part) we saw that we had roughly the same amount of data in each period, except for `twoweek` 0. We therefore decide to remove this period for our modeling.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load our output dataset from course 1
data = pd.read_csv('iplayer_data_c1.csv')
data.head()

Unnamed: 0,user_id,program_id,series_id,genre,programme_duration,streaming_id,start_date_time,time_viewed,weekday,time_of_day,programme_duration_mins,twoweek,min_watched,enriched_genre,hour,enriched_genre_hour,enriched_duration_mins,percentage_watched
0,cd2006,f6d3d8,a282ca,Factual,00:00:21,1486911129420_1,2017-02-12 14:51:24.544,20920.0,weekday_6,Afternoon,0.35,3,0.348667,Factual,14,Factual,0.35,0.99619
1,cd2006,b8fbf2,e0480e,Comedy,00:01:51,1484864257965_1,2017-01-19 22:17:04.648,111285.0,weekday_3,Evening,1.85,1,1.85475,Comedy,22,Drama,1.85,1.0
2,cd2006,e2f113,933a1b,Factual,00:00:30,1487099603980_1,2017-02-14 19:12:36.667,29945.0,weekday_1,Evening,0.5,3,0.499083,Factual,19,Factual,0.5,0.998167
3,cd2006,0e0916,b68e79,Entertainment,00:01:22,1484773546557_1,2017-01-18 21:05:11.466,82620.0,weekday_2,Evening,1.366667,1,1.377,Entertainment,21,Drama,1.366667,1.0
4,cd2006,ca03b9,5d0813,Sport,00:01:37,1486911176609_1,2017-02-12 14:52:08.965,97444.0,weekday_6,Afternoon,1.616667,3,1.624067,Sport,14,Factual,1.616667,1.0


In [3]:
# Based on the plots in course 1, we will drop week 0
data=data[data['twoweek']>0]

In [4]:
data.twoweek.value_counts().sort_index()

1    67222
2    62112
3    60431
4    51941
5    55668
6    49941
7    51286
8    53410
Name: twoweek, dtype: int64

## Training and test sets

We want to forecast the minutes watched in the next two weeks based on past data. More precisely, at the end of the training we would like to have a model that can forecast, for each user, the minutes watched in `twoweek` 9 based on their behaviour from `twoweek` 1 to 8.

As explained before, we will split our dataset into a training set and a test set. It's important to assess the accuracy of the trained model on unseen data to check that it doesn't stick too closely to the training data and generalises well. Here we decide to keep the last two weeks of observations (`twoweek` 8) as our test set. The remaining data (`twoweek` 1 to 7) will be used to train the different models.

For each model we will build in the next course we will:
- train the model on the training set (`twoweek` 1 to 7)
- use the trained model to forecast the next 2 weeks (`twoweek` 8)
- compare the forecasts with the real data observed that we kept in the test set

Both the training and test errors will be used to choose the best model. Once we decide on the best model we can then retrain it on all the data and forecast for `twoweek` 9.

In this section we will build our training and test sets. For each of them we need to define the target variable and the features. In the training set we have data from `twoweek` 1 to 7. Our target variable will be the minutes watched in `twoweek` 7, and we will use the data from `twoweek` 1 to 6 to build our features. We will thus explain each user's engagement in `twoweek` 7 based on their behaviour over the previous 12 weeks. In the test set, our target variable is the minutes watched in `twoweek` 8. To compute the forecasts we need the same features as the ones used to train the model, shifted one two-week period later, so we will need the data from `twoweek` 2 to 7 to build them.

In [5]:
# Creating the two subsets of data
data_training=data[data['twoweek']<8]
data_test=data[data['twoweek']>=2]

In [6]:
data_training.twoweek.value_counts().sort_index()

1    67222
2    62112
3    60431
4    51941
5    55668
6    49941
7    51286
Name: twoweek, dtype: int64

In [7]:
data_test.twoweek.value_counts().sort_index()

2    62112
3    60431
4    51941
5    55668
6    49941
7    51286
8    53410
Name: twoweek, dtype: int64
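The windowing logic described in this section can be summarised in a few lines. This is only a sketch: `feature_and_target_periods` and `n_lags` are hypothetical names introduced for illustration, and the final checks simply encode the rule that features must come from periods strictly before the target period.

```python
def feature_and_target_periods(target_period, n_lags=6):
    """Return the periods used to build the features and the target period."""
    feature_periods = list(range(target_period - n_lags, target_period))
    return feature_periods, target_period

train_features, train_target = feature_and_target_periods(7)
test_features, test_target = feature_and_target_periods(8)
print(train_features, train_target)  # [1, 2, 3, 4, 5, 6] 7
print(test_features, test_target)    # [2, 3, 4, 5, 6, 7] 8

# Leakage check: every feature period must precede the target period
assert max(train_features) < train_target
assert max(test_features) < test_target
```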

## Explanatory variables
### User granularity and feature engineering
We want to forecast what _individual_ users will do. We therefore need to pivot our datasets from an events view to a __user view__. We can use this new granularity to define other features that we think will be important.
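As a small illustration of the move from an events view to a user view, here is a sketch on a made-up event log. The `events` table and its values are hypothetical; only the column names mirror the ones used in this course.

```python
import pandas as pd

# Hypothetical event log: one row per viewing event
events = pd.DataFrame({
    'user_id': ['a', 'a', 'a', 'b'],
    'twoweek': [1, 1, 2, 2],
    'min_watched': [10.0, 5.0, 3.0, 7.0],
})

# User view: one row per user, one column per two-week period
user_view = pd.pivot_table(events, values='min_watched', index='user_id',
                           columns='twoweek', aggfunc='sum').fillna(0)
print(user_view)
```

User `a` has two events in period 1, which are summed into a single cell, and user `b` has no event in period 1, which becomes a 0 after `fillna`.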

As mentioned in the first course, in most situations the process of feature engineering is an iterative one until you get a feature set that neither “underfits” nor “overfits” the data. A feature set that does not contain sufficient information regarding the output variable will often result in the model underfitting (this can usually be identified by a high training error). The solution here is often to add more features. If the feature set contains features that are sensitive to spurious and random elements of the dataset (and not the underlying population it should be an approximation of), overfitting occurs. Overfitting is characterised by a low training error and a high test error. It can be tackled by reducing the complexity of your model (often by removing features) or by using regularisation techniques (https://www.quora.com/What-is-regularization-in-machine-learning). A larger and more diverse training set also helps to reduce overfitting.
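To make the idea of regularisation more concrete, here is a minimal sketch, assuming a plain ridge (L2) penalty on a linear regression solved in closed form with NumPy. The data, the `ridge_fit` helper and the `alpha` value are illustrative assumptions and not part of the course pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 20 observations, 10 features, only the first one matters
X = rng.normal(size=(20, 10))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=20)

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression without intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

coefs_ols = ridge_fit(X, y, alpha=0.0)     # no penalty: ordinary least squares
coefs_ridge = ridge_fit(X, y, alpha=10.0)  # penalised fit

print('Norm of the coefficients without penalty:', np.linalg.norm(coefs_ols))
print('Norm of the coefficients with penalty:', np.linalg.norm(coefs_ridge))
```

The penalty shrinks the coefficients towards zero, which limits how strongly the model can react to noise in individual features. The right value of `alpha` would normally be chosen with cross-validation.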

There are various feature selection tools that can be used together with cross-validation to optimise your feature set (e.g. stepwise regression - https://en.wikipedia.org/wiki/Stepwise_regression).  
 
In our project we choose a selection of features that describe the viewing habits of a particular user (e.g. “average completion”, “most watched genre”, “time watched”).

In [8]:
# Create a function that pivots the data to one row per user
# and gives us all the features we need
def pivot_data(dataframe):
    # How many minutes did each user watch in each two-week period?
    data=pd.pivot_table(dataframe,values='min_watched',
                        index=['user_id'],columns=['twoweek'], aggfunc='sum')
    # Fill the periods in which they didn't watch anything with 0s
    data.fillna(0,inplace=True)
    # On average, what share of each programme did the user watch?
    data['average_completion']=dataframe.groupby('user_id')['percentage_watched'].mean()
    # How many distinct sessions did the user have with us?
    data['total_sessions']=dataframe.groupby('user_id')['streaming_id'].nunique()
    # How many minutes did the user watch in total over the periods in `dataframe`?
    data['total_watched']=dataframe.groupby('user_id')['min_watched'].sum()
    # How many viewing events (rows) does the user have?
    data['number_watched']=dataframe.groupby('user_id')['streaming_id'].count()
    # Genre most watched by the user (in minutes)
    data['most_genre']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                      columns=['enriched_genre'], aggfunc='sum').idxmax(axis=1)
    # Number of different genres watched
    data['num_genre']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                     columns=['enriched_genre'], aggfunc='sum').count(axis=1)
    # Favourite day of the week to watch (in minutes)
    data['most_weekday']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                        columns=['weekday'], aggfunc='sum').idxmax(axis=1)
    # Number of different days of the week watched
    data['num_weekday']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                       columns=['weekday'], aggfunc='sum').count(axis=1)
    # Favourite time of day to watch (in minutes)
    data['most_timeday']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                        columns=['time_of_day'], aggfunc='sum').idxmax(axis=1)
    # Number of different times of day watched
    data['num_timeday']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                       columns=['time_of_day'], aggfunc='sum').count(axis=1)
    return data

In [9]:
# Apply this function to both the training and test datasets to get our features
# Keep in mind that the last two-week period in each subset is reserved for the target variable
features_training=pivot_data(data_training[data_training['twoweek']<7])
features_training.reset_index().head()

twoweek,user_id,1,2,3,4,5,6,average_completion,total_sessions,total_watched,number_watched,most_genre,num_genre,most_weekday,num_weekday,most_timeday,num_timeday
0,0001c6,16.6792,0.0,0.0,0.0,0.0,0.15255,0.371496,2,16.83175,3,News,1,weekday_1,2,Evening,2
1,000c1a,0.162867,0.147467,107.0984,145.686233,2.286283,100.487767,0.228043,22,355.869017,28,Factual,4,weekday_3,5,Morning,3
2,001c53,1.8663,0.0,0.0,0.0,1.309867,0.0,0.489419,3,3.176167,3,News,2,weekday_2,2,Morning,2
3,001d44,0.0,0.0,0.0,14.5477,0.0,0.0,0.085238,1,14.5477,2,Sport,2,weekday_6,1,Morning,1
4,002b2e,291.477033,0.0,0.0,0.0,0.0,0.0,0.228233,17,291.477033,21,Factual,5,weekday_2,5,Evening,3


So for each user we have:
- the minutes watched per two-week period over the past 12 weeks: `1`, `2`, ..., `6`

And aggregated over this 12-week timeframe:
- the total minutes watched `total_watched`
- the average completion when watching a piece of content `average_completion`
- the number of sessions `total_sessions`
- the number of times a user watched something `number_watched`
- the main genre watched - in terms of minutes and not number of pieces of content - `most_genre`
- the number of different genres watched `num_genre`
- the favourite day of the week to watch - again in minutes watched - `most_weekday`
- the number of different days of the week on which a user watched something - `num_weekday`
- the favourite time of day to watch - again in minutes watched - `most_timeday`
- the number of different times of day at which a user watched something - `num_timeday`

This set of variables constitutes our input variables to train the models. Note that we could imagine lots of other features.

In [10]:
# Same for the test set
features_test=pivot_data(data_test[data_test['twoweek']<8])
features_test.reset_index().head()

twoweek,user_id,2,3,4,5,6,7,average_completion,total_sessions,total_watched,number_watched,most_genre,num_genre,most_weekday,num_weekday,most_timeday,num_timeday
0,0001c6,0.0,0.0,0.0,0.0,0.15255,0.0,0.002543,1,0.15255,1,News,1,weekday_2,1,Afternoon,1
1,000c1a,0.147467,107.0984,145.686233,2.286283,100.487767,132.432083,0.239364,27,488.138233,37,Factual,5,weekday_3,6,Morning,3
2,001c53,0.0,0.0,0.0,1.309867,0.0,0.0,0.507045,1,1.309867,1,Factual,1,weekday_1,1,Morning,1
3,001d44,0.0,0.0,14.5477,0.0,0.0,0.248017,0.058203,2,14.795717,3,Sport,2,weekday_6,1,Morning,2
4,002b43,0.0,0.0,0.0,0.0,14.5897,0.0,0.377768,2,14.5897,3,Factual,2,weekday_2,2,Evening,2


### Dummification
Most models only accept quantitative variables as input. We therefore need to __dummify__ the categorical fields, i.e. we split each categorical variable into _n_ dummy (0/1) variables, where _n_ is the number of distinct values it takes.
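Here is a small sketch of dummification on a made-up table (the `toy` data is hypothetical). It passes `dtype=int` to `pd.get_dummies` because recent pandas versions return boolean columns by default, whereas we want explicit 0/1 values.

```python
import pandas as pd

# Hypothetical user-level table with one categorical column
toy = pd.DataFrame({'total_watched': [16.8, 355.9, 3.2],
                    'most_genre': ['News', 'Factual', 'News']})

# Numeric columns are left untouched, categorical ones are split into 0/1 columns
dummies = pd.get_dummies(toy, dtype=int)
print(dummies.columns.tolist())
# ['total_watched', 'most_genre_Factual', 'most_genre_News']
```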

In [11]:
# Turn our categorical variables into dummy (0/1) columns so that we can run models on them
features_training=pd.get_dummies(features_training).reset_index()
features_training.head()

Unnamed: 0,user_id,1,2,3,4,5,6,average_completion,total_sessions,total_watched,...,most_weekday_weekday_1,most_weekday_weekday_2,most_weekday_weekday_3,most_weekday_weekday_4,most_weekday_weekday_5,most_weekday_weekday_6,most_timeday_Afternoon,most_timeday_Evening,most_timeday_Morning,most_timeday_Night
0,0001c6,16.6792,0.0,0.0,0.0,0.0,0.15255,0.371496,2,16.83175,...,1,0,0,0,0,0,0,1,0,0
1,000c1a,0.162867,0.147467,107.0984,145.686233,2.286283,100.487767,0.228043,22,355.869017,...,0,0,1,0,0,0,0,0,1,0
2,001c53,1.8663,0.0,0.0,0.0,1.309867,0.0,0.489419,3,3.176167,...,0,1,0,0,0,0,0,0,1,0
3,001d44,0.0,0.0,0.0,14.5477,0.0,0.0,0.085238,1,14.5477,...,0,0,0,0,0,1,0,0,1,0
4,002b2e,291.477033,0.0,0.0,0.0,0.0,0.0,0.228233,17,291.477033,...,0,1,0,0,0,0,0,1,0,0


In [12]:
# Same for the test set
features_test=pd.get_dummies(features_test).reset_index()
features_test.head()

Unnamed: 0,user_id,2,3,4,5,6,7,average_completion,total_sessions,total_watched,...,most_weekday_weekday_1,most_weekday_weekday_2,most_weekday_weekday_3,most_weekday_weekday_4,most_weekday_weekday_5,most_weekday_weekday_6,most_timeday_Afternoon,most_timeday_Evening,most_timeday_Morning,most_timeday_Night
0,0001c6,0.0,0.0,0.0,0.0,0.15255,0.0,0.002543,1,0.15255,...,0,1,0,0,0,0,1,0,0,0
1,000c1a,0.147467,107.0984,145.686233,2.286283,100.487767,132.432083,0.239364,27,488.138233,...,0,0,1,0,0,0,0,0,1,0
2,001c53,0.0,0.0,0.0,1.309867,0.0,0.0,0.507045,1,1.309867,...,1,0,0,0,0,0,0,0,1,0
3,001d44,0.0,0.0,14.5477,0.0,0.0,0.248017,0.058203,2,14.795717,...,0,0,0,0,0,1,0,0,1,0
4,002b43,0.0,0.0,0.0,0.0,14.5897,0.0,0.377768,2,14.5897,...,0,1,0,0,0,0,0,1,0,0


Make sure to have __the same columns__ in both feature sets. For example, if a categorical variable `X` has a category `a` that we observe in the training set but not in the test set, then the training feature set will contain the column `X_a` after dummification, but the test feature set will not. If the model relies on `X_a`, we won't be able to compute predictions on the test feature set afterwards. To avoid this kind of issue, just add a column of 0s for `X_a` to the test feature set.
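One possible way to enforce identical columns, sketched here on made-up data, is to reindex the test columns on the training ones with `DataFrame.reindex`. The `train` and `test` tables are hypothetical, and this is only one option among others.

```python
import pandas as pd

train = pd.DataFrame({'most_genre': ['News', 'Sport', 'Factual']})
test = pd.DataFrame({'most_genre': ['News', 'Factual']})  # no 'Sport' here

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# Reuse the training columns: missing ones are added and filled with 0,
# and columns seen only in the test set would be dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

After this step both tables have the same columns in the same order, so a model fitted on the training features can be applied to the test features.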

We also need to change the name of the 2 weeks minutes watched variables to make them generic like `tw_lag6_watched`, `tw_lag5_watched`, ..., `tw_lag1_watched`. 

In [13]:
features_training = features_training.rename(columns={1:'tw_lag6_watched',
                                                      2:'tw_lag5_watched',
                                                      3:'tw_lag4_watched',
                                                      4:'tw_lag3_watched',
                                                      5:'tw_lag2_watched',
                                                      6:'tw_lag1_watched'})
features_training.head()

Unnamed: 0,user_id,tw_lag6_watched,tw_lag5_watched,tw_lag4_watched,tw_lag3_watched,tw_lag2_watched,tw_lag1_watched,average_completion,total_sessions,total_watched,...,most_weekday_weekday_1,most_weekday_weekday_2,most_weekday_weekday_3,most_weekday_weekday_4,most_weekday_weekday_5,most_weekday_weekday_6,most_timeday_Afternoon,most_timeday_Evening,most_timeday_Morning,most_timeday_Night
0,0001c6,16.6792,0.0,0.0,0.0,0.0,0.15255,0.371496,2,16.83175,...,1,0,0,0,0,0,0,1,0,0
1,000c1a,0.162867,0.147467,107.0984,145.686233,2.286283,100.487767,0.228043,22,355.869017,...,0,0,1,0,0,0,0,0,1,0
2,001c53,1.8663,0.0,0.0,0.0,1.309867,0.0,0.489419,3,3.176167,...,0,1,0,0,0,0,0,0,1,0
3,001d44,0.0,0.0,0.0,14.5477,0.0,0.0,0.085238,1,14.5477,...,0,0,0,0,0,1,0,0,1,0
4,002b2e,291.477033,0.0,0.0,0.0,0.0,0.0,0.228233,17,291.477033,...,0,1,0,0,0,0,0,1,0,0


In [14]:
features_test = features_test.rename(columns={2:'tw_lag6_watched',
                                              3:'tw_lag5_watched',
                                              4:'tw_lag4_watched',
                                              5:'tw_lag3_watched',
                                              6:'tw_lag2_watched',
                                              7:'tw_lag1_watched'})
features_test.head()

Unnamed: 0,user_id,tw_lag6_watched,tw_lag5_watched,tw_lag4_watched,tw_lag3_watched,tw_lag2_watched,tw_lag1_watched,average_completion,total_sessions,total_watched,...,most_weekday_weekday_1,most_weekday_weekday_2,most_weekday_weekday_3,most_weekday_weekday_4,most_weekday_weekday_5,most_weekday_weekday_6,most_timeday_Afternoon,most_timeday_Evening,most_timeday_Morning,most_timeday_Night
0,0001c6,0.0,0.0,0.0,0.0,0.15255,0.0,0.002543,1,0.15255,...,0,1,0,0,0,0,1,0,0,0
1,000c1a,0.147467,107.0984,145.686233,2.286283,100.487767,132.432083,0.239364,27,488.138233,...,0,0,1,0,0,0,0,0,1,0
2,001c53,0.0,0.0,0.0,1.309867,0.0,0.0,0.507045,1,1.309867,...,1,0,0,0,0,0,0,0,1,0
3,001d44,0.0,0.0,14.5477,0.0,0.0,0.248017,0.058203,2,14.795717,...,0,0,0,0,0,1,0,0,1,0
4,002b43,0.0,0.0,0.0,0.0,14.5897,0.0,0.377768,2,14.5897,...,0,1,0,0,0,0,0,1,0,0
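The two renaming cells above follow the same rule: the most recent period becomes lag 1. As a sketch, the mapping could be generated rather than typed by hand. `lag_column_names` is a hypothetical helper, and it assumes that every period in the window appears as a column of the pivoted table.

```python
def lag_column_names(period_columns):
    """Map period labels to lag names; the most recent period becomes lag 1."""
    ordered = sorted(period_columns)
    n_periods = len(ordered)
    return {period: f'tw_lag{n_periods - i}_watched'
            for i, period in enumerate(ordered)}

training_mapping = lag_column_names(range(1, 7))  # twoweek 1 to 6
test_mapping = lag_column_names(range(2, 8))      # twoweek 2 to 7
print(training_mapping)
print(test_mapping)
```

The result can be passed to `rename(columns=...)` exactly like the dictionaries written out above.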


## Target variables

### Regression and classification
We will actually build two different kinds of models:
- one in which our target is the minutes watched in the next two weeks - for this first kind of modeling we are in the __regression__ framework
- another one in which our target is a dummy variable (0/1) and we will forecast whether a user will watch iPlayer or not (total minutes watched > 0) within the next two weeks - here we are in the __classification__ framework

We will come back to these concepts in more detail during the training (next course).

#### Regression

In [15]:
# Reminder: for the training set our target variable is the minutes watched on twoweek 7
target_training_reg=pd.pivot_table(data_training,
                                   values='min_watched', 
                                   index=['user_id'],
                                   columns=['twoweek'], 
                                   aggfunc=sum)[7].reset_index()
target_training_reg=target_training_reg.fillna(0)
target_training_reg.head()

Unnamed: 0,user_id,7
0,0001c6,0.0
1,000c1a,132.432083
2,001c53,0.0
3,001d44,0.248017
4,002b2e,0.0


In [16]:
# Same for the test set but on twoweek 8
target_test_reg=pd.pivot_table(data_test,
                               values='min_watched', 
                               index=['user_id'],
                               columns=['twoweek'], 
                               aggfunc=sum)[8].reset_index()
target_test_reg=target_test_reg.fillna(0)
target_test_reg.head()

Unnamed: 0,user_id,8
0,0001c6,0.144833
1,000c1a,318.047633
2,001c53,1.98035
3,001d44,10.059067
4,002b43,4.792617


#### Classification

In [17]:
# We build the dummy variable based on the minutes watched on this twoweek
target_training_class=target_training_reg.copy()
target_training_class[7]=np.where(target_training_class[7]>0,1,0)
target_training_class.head()

Unnamed: 0,user_id,7
0,0001c6,0
1,000c1a,1
2,001c53,0
3,001d44,1
4,002b2e,0


In [18]:
# Same for the test set
target_test_class=target_test_reg.copy()
target_test_class[8]=np.where(target_test_class[8]>0,1,0)
target_test_class.head()

Unnamed: 0,user_id,8
0,0001c6,1
1,000c1a,1
2,001c53,1
3,001d44,1
4,002b43,1


### Make sure to have feature-target pairs
The last thing we need to do is to make sure that, for both the training and test sets, we have a pair of features X and target variable Y for each user. If the features or the target value are missing for a given user, we drop that user.

We also need to make sure that these observations are in the right order.
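As an illustration of both requirements (same users, same order), here is a sketch on made-up data that aligns a target on a feature table through the `user_id` index. The `features` and `target` objects are hypothetical and this is an alternative to, not a description of, the approach used in the cells of this section.

```python
import pandas as pd

# Hypothetical feature and target tables, both indexed by user_id
features = pd.DataFrame({'total_watched': [16.8, 355.9]},
                        index=pd.Index(['u1', 'u2'], name='user_id'))
target = pd.Series([4.0, 0.0, 132.4],
                   index=pd.Index(['u3', 'u1', 'u2'], name='user_id'))

# Keep only the users present in both, in the order of the feature table
common_users = features.index[features.index.isin(target.index)]
X = features.loc[common_users]
y = target.loc[common_users]

print(y.tolist())  # [0.0, 132.4]
```

User `u3` has a target but no features and is dropped, and the remaining target values are reordered to match the rows of `X`.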

In [19]:
# Find the unique users in both the features and the target
users_target_training=target_training_reg['user_id'].unique()
users_features_training=features_training['user_id'].unique()

# Find those users that are in the target but not in the features
target_not_feature_training=np.setdiff1d(users_target_training,users_features_training)

# Find those users that are in the features but not in the target
feature_not_target_training=np.setdiff1d(users_features_training,users_target_training)

# Print the size of the two sets
print('In target but not feature:',len(target_not_feature_training),
      '- In feature but not target:' ,len(feature_not_target_training))

In target but not feature: 39 - In feature but not target: 0


Remark: it's expected that no one is missing in the second case, because we build the target variables from the entire population of the training data. It's more of a sanity check here.

We thus need to remove the users who don't have any past behaviour before `twoweek` 7. We could set all of their explanatory variables to 0, but they are probably new users and the models won't perform well for such profiles. Speaking of which, the seniority of the user could be a great feature to consider, but it needs some "business" rules to avoid what we call "left-censoring" issues: for users already active in the first period of our data, we cannot tell how long they had been active before it.
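A seniority feature is not built in this course, but as a rough sketch it could be derived from the first period in which each user appears. Everything here is an assumption made for illustration: the `events` table, the `target_period` value and the rule used to flag left-censored users.

```python
import pandas as pd

# Hypothetical event log restricted to the two columns we need
events = pd.DataFrame({'user_id': ['a', 'a', 'b', 'c'],
                       'twoweek': [1, 4, 3, 6]})
target_period = 7

# First period in which each user appears in the data
first_seen = events.groupby('user_id')['twoweek'].min()
# Number of two-week periods between the first appearance and the target period
seniority = target_period - first_seen

# Users first seen in the earliest period may be older than the data shows
left_censored = first_seen == events['twoweek'].min()
print(pd.DataFrame({'seniority': seniority, 'left_censored': left_censored}))
```

For left-censored users the computed seniority is only a lower bound, which is where a business rule (for example capping the value) would be needed.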

In [20]:
# We will set the index to the user_id as this will make it easier to drop rows
# Then we drop the rows and then turn the remaining column into an array
target_training_reg=target_training_reg.set_index(['user_id'])
target_training_reg.drop(target_not_feature_training,inplace=True)
target_training_reg.reset_index(inplace=True)
target_training_reg=target_training_reg[7].values

# Same for the classification
target_training_class=target_training_class.set_index(['user_id'])
target_training_class.drop(target_not_feature_training,inplace=True)
target_training_class.reset_index(inplace=True)
target_training_class=target_training_class[7].values
        
# Check to make sure the outcome makes sense
print(target_training_reg[:10])
print(target_training_class[:10])

[   0.          132.43208333    0.            0.24801667    0.            0.
    0.            0.            0.            0.        ]
[0 1 0 1 0 0 0 0 0 0]


In [21]:
# Let's check the size of our datasets
print('Number of samples in the training feature set:',len(features_training))
print('Number of samples in the training target set (classification):',
      len(target_training_class))
print('Number of samples in the training target set (regression):',
      len(target_training_reg))

Number of samples in the training feature set: 9068
Number of samples in the training target set (classification): 9068
Number of samples in the training target set (regression): 9068


In [22]:
# Same for the test set
# Find the unique users in both the features and the target
users_target_test=target_test_reg['user_id'].unique()
users_features_test=features_test['user_id'].unique()

# Find those users that are in the target but not in the features
target_not_feature_test=np.setdiff1d(users_target_test,users_features_test)

# Find those users that are in the features but not in the target
feature_not_target_test=np.setdiff1d(users_features_test,users_target_test)

# Print the size of the two sets
print('In target but not feature:',len(target_not_feature_test),
      '- In feature but not target:' ,len(feature_not_target_test))
print('\n')

# We will set the index to the user_id as this will make it easier to drop rows
# Then we drop the rows and then turn the remaining column into an array
target_test_reg=target_test_reg.set_index(['user_id'])
target_test_reg.drop(target_not_feature_test,inplace=True)
target_test_reg.reset_index(inplace=True)
target_test_reg=target_test_reg[8].values

# Same for the classification
target_test_class=target_test_class.set_index(['user_id'])
target_test_class.drop(target_not_feature_test,inplace=True)
target_test_class.reset_index(inplace=True)
target_test_class=target_test_class[8].values
        
# Check to make sure the outcome makes sense
print(target_test_reg[:10])
print(target_test_class[:10])
print('\n')

# Let's check the size of our datasets
print('Number of samples in the test feature set:',len(features_test))
print('Number of samples in the test target set (classification):',
      len(target_test_class))
print('Number of samples in the test target set (regression):',
      len(target_test_reg))

In target but not feature: 157 - In feature but not target: 0


[  1.44833333e-01   3.18047633e+02   1.98035000e+00   1.00590667e+01
   4.79261667e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   2.24259583e+02   0.00000000e+00]
[1 1 1 1 1 0 0 0 1 0]


Number of samples in the test feature set: 7601
Number of samples in the test target set (classification): 7601
Number of samples in the test target set (regression): 7601


### Missing values

Finally, we need to handle any remaining missing values and set aside the `user_id` field, as it is not a feature for the modeling part.

Note that we need to keep it somewhere so that we can easily identify our users when forecasting and turn our insights into actions. That's why we use it as the DataFrame index.
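Before filling anything, it can be useful to see where the missing values actually are. The sketch below uses a made-up table (`toy` is hypothetical) and simply counts the missing values per column before applying the same fill-with-0 choice as in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table indexed by user_id, with one missing value
toy = pd.DataFrame({'tw_lag1_watched': [0.0, 12.5, 3.1],
                    'average_completion': [0.4, np.nan, 0.9]},
                   index=pd.Index(['u1', 'u2', 'u3'], name='user_id'))

# Count missing values per column before deciding how to treat them
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Filling with 0 is the choice made in this notebook; it is one option among others
filled = toy.fillna(0)
```

Whether 0 is a sensible replacement depends on the feature: it is natural for minutes watched, but more debatable for an average such as `average_completion`.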

In [23]:
# We will fill remaining missing values with 0s as we don't know any better
features_training=features_training.set_index(['user_id'])
features_training.fillna(0,inplace=True)
features_training.head()

user_id,tw_lag6_watched,tw_lag5_watched,tw_lag4_watched,tw_lag3_watched,tw_lag2_watched,tw_lag1_watched,average_completion,total_sessions,total_watched,number_watched,...,most_weekday_weekday_1,most_weekday_weekday_2,most_weekday_weekday_3,most_weekday_weekday_4,most_weekday_weekday_5,most_weekday_weekday_6,most_timeday_Afternoon,most_timeday_Evening,most_timeday_Morning,most_timeday_Night
0001c6,16.6792,0.0,0.0,0.0,0.0,0.15255,0.371496,2,16.83175,3,...,1,0,0,0,0,0,0,1,0,0
000c1a,0.162867,0.147467,107.0984,145.686233,2.286283,100.487767,0.228043,22,355.869017,28,...,0,0,1,0,0,0,0,0,1,0
001c53,1.8663,0.0,0.0,0.0,1.309867,0.0,0.489419,3,3.176167,3,...,0,1,0,0,0,0,0,0,1,0
001d44,0.0,0.0,0.0,14.5477,0.0,0.0,0.085238,1,14.5477,2,...,0,0,0,0,0,1,0,0,1,0
002b2e,291.477033,0.0,0.0,0.0,0.0,0.0,0.228233,17,291.477033,21,...,0,1,0,0,0,0,0,1,0,0


In [24]:
# We will fill remaining missing values with 0s as we don't know any better
features_test=features_test.set_index(['user_id'])
features_test.fillna(0,inplace=True)
features_test.head()

user_id,tw_lag6_watched,tw_lag5_watched,tw_lag4_watched,tw_lag3_watched,tw_lag2_watched,tw_lag1_watched,average_completion,total_sessions,total_watched,number_watched,...,most_weekday_weekday_1,most_weekday_weekday_2,most_weekday_weekday_3,most_weekday_weekday_4,most_weekday_weekday_5,most_weekday_weekday_6,most_timeday_Afternoon,most_timeday_Evening,most_timeday_Morning,most_timeday_Night
0001c6,0.0,0.0,0.0,0.0,0.15255,0.0,0.002543,1,0.15255,1,...,0,1,0,0,0,0,1,0,0,0
000c1a,0.147467,107.0984,145.686233,2.286283,100.487767,132.432083,0.239364,27,488.138233,37,...,0,0,1,0,0,0,0,0,1,0
001c53,0.0,0.0,0.0,1.309867,0.0,0.0,0.507045,1,1.309867,1,...,1,0,0,0,0,0,0,0,1,0
001d44,0.0,0.0,14.5477,0.0,0.0,0.248017,0.058203,2,14.795717,3,...,0,0,0,0,0,1,0,0,1,0
002b43,0.0,0.0,0.0,0.0,14.5897,0.0,0.377768,2,14.5897,3,...,0,1,0,0,0,0,0,1,0,0


__TO DO:__
- COMMENTS
- xxx

__Save new datasets - or create a class to call for the other notebook ?__