# Context

## Quick reminder on last course

XX

## Goal of this course

XX

# Concepts

We need to prepare a dataset that can be fed into one or more statistical models. First, though, we need to understand which branch of machine learning our problem belongs to.

## Supervised vs unsupervised learning

The first step is to determine whether your data problem is suited to “supervised” or “unsupervised” learning models.
 
__Unsupervised learning__ models receive inputs but no output (or target) variable; they are essentially a way of discovering latent structure in a dataset (“clustering” is one example of unsupervised learning). They are very useful when working with unlabelled datasets, and they can be combined with supervised models.
 
__Supervised learning__ models learn a mathematical function that maps an input (explanatory variables) to an output (the “target”). They are used in situations where you know what you want to predict and have explicit input-output pairs on which to train your model.

In our current project we want to forecast the minutes watched on iPlayer (the output, or target variable) based on past behavior (the input, or explanatory variables). We have input-output pairs, so we are in the supervised learning framework.
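
As a minimal sketch of what "learning a function from input-output pairs" can look like, the snippet below fits a straight line to a handful of made-up pairs using NumPy's least-squares polynomial fit. The numbers and variable names are purely illustrative assumptions and are not taken from the iPlayer data.

```python
import numpy as np

# Hypothetical input-output pairs: minutes watched in one period (input)
# and minutes watched in the following period (output / target).
past_minutes = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
next_minutes = 0.5 * past_minutes + 1.0  # toy target with a known linear relation

# Supervised learning: estimate the function that maps input to output.
slope, intercept = np.polyfit(past_minutes, next_minutes, deg=1)

# Use the learned function to predict the target for a new, unseen input.
new_input = 25.0
prediction = slope * new_input + intercept
print(round(float(prediction), 2))
```

An unsupervised model would receive only `past_minutes`, with no `next_minutes` to learn from.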

## Training and test sets
 
When training machine learning models, we want to avoid training the model on all of the data that we have available. Holding some data back lets us detect a model that is too specifically attuned to our training data and will not generalise later - this is often called __overfitting__.

So instead we will split our data into __a training set and a validation set__. We will then train our model on the training set and evaluate its predictions against the target - which we do observe - in the validation set.

For many complex problems and datasets, the 'bleeding' of knowledge from the validation set into the training set (often called leakage) can be a real problem: the model will then perform much worse in production than we would have assumed. It is therefore really important to make sure that the training set contains no information that we could not have had at prediction time.
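
One way to guard against this 'bleeding' of information when the data is ordered in time is to split on the time period rather than at random. The sketch below uses a toy event log; the helper name `temporal_split` and its parameters are hypothetical, introduced only for illustration.

```python
import pandas as pd

# Toy event log; `twoweek` is the two-week period in which the event happened.
events = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "c", "c"],
    "twoweek": [1, 3, 2, 3, 1, 2],
    "min_watched": [10.0, 5.0, 7.5, 2.0, 1.0, 4.0],
})

def temporal_split(df, period_col, first_val_period):
    """Rows before `first_val_period` go to training, the rest to validation."""
    train = df[df[period_col] < first_val_period]
    val = df[df[period_col] >= first_val_period]
    # Leakage guard: every training row must pre-date every validation row.
    assert train[period_col].max() < val[period_col].min()
    return train, val

train, val = temporal_split(events, "twoweek", first_val_period=3)
print(len(train), len(val))
```

A random split of the same rows could place period-3 events in the training set, which is exactly the information we would not have had at prediction time.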


# Data wrangling

## Scope

When we computed the distribution of our observations across the `twoweek` variable (course 1 - exploratory data analysis part), we saw that the number of observations was quite similar from one period to the next, except for week 0 (`twoweek` equal to 0). We therefore remove these observations before modeling.

In [2]:
import pandas as pd

In [9]:
# Load our output dataset from course 1
data = pd.read_csv('iplayer_data_c1.csv')
data.head()

Unnamed: 0,user_id,program_id,series_id,genre,programme_duration,streaming_id,start_date_time,time_viewed,weekday,time_of_day,programme_duration_mins,twoweek,min_watched,enriched_genre,hour,enriched_genre_hour,enriched_duration_mins,percentage_watched
0,cd2006,f6d3d8,a282ca,Factual,00:00:21,1486911129420_1,2017-02-12 14:51:24.544,20920.0,weekday_6,Afternoon,0.35,3,0.348667,Factual,14,Factual,0.35,0.99619
1,cd2006,b8fbf2,e0480e,Comedy,00:01:51,1484864257965_1,2017-01-19 22:17:04.648,111285.0,weekday_3,Evening,1.85,1,1.85475,Comedy,22,Drama,1.85,1.0
2,cd2006,e2f113,933a1b,Factual,00:00:30,1487099603980_1,2017-02-14 19:12:36.667,29945.0,weekday_1,Evening,0.5,3,0.499083,Factual,19,Factual,0.5,0.998167
3,cd2006,0e0916,b68e79,Entertainment,00:01:22,1484773546557_1,2017-01-18 21:05:11.466,82620.0,weekday_2,Evening,1.366667,1,1.377,Entertainment,21,Drama,1.366667,1.0
4,cd2006,ca03b9,5d0813,Sport,00:01:37,1486911176609_1,2017-02-12 14:52:08.965,97444.0,weekday_6,Afternoon,1.616667,3,1.624067,Sport,14,Factual,1.616667,1.0


In [10]:
# Based on the plots in course 1, we will drop week 0
data=data[data['twoweek']>0]

In [11]:
data.twoweek.value_counts().sort_index()

1    67222
2    62112
3    60431
4    51941
5    55668
6    49941
7    51286
8    53410
Name: twoweek, dtype: int64

## Training and validation sets

We want to forecast the minutes watched in the next two weeks. We therefore keep the last two-week period of observations (`twoweek` 8) as our validation set, and the remaining data will be used to train our model.

In [12]:
# Ensuring that we only train on the data we should have
data_training=data[data['twoweek']<8]
data_val=data[data['twoweek']>=8]

In [13]:
data_training.twoweek.value_counts().sort_index()

1    67222
2    62112
3    60431
4    51941
5    55668
6    49941
7    51286
Name: twoweek, dtype: int64

In [14]:
data_val.twoweek.value_counts().sort_index()

8    53410
Name: twoweek, dtype: int64

## User granularity and feature engineering
We want to forecast what individual users will do, so we need to pivot our dataset from an event-level view to a user-level view. We can use this step to do some more feature engineering and define a number of features that we think will be important.

As mentioned in the first course, in most situations feature engineering is an iterative process that continues until you reach a feature set that neither “underfits” nor “overfits” the data. A feature set that does not contain sufficient information about the output variable will often result in the model underfitting (this can usually be identified by a high training error). The solution here is often to add more features. If the feature set contains features that are sensitive to spurious and random elements of the dataset (rather than to the underlying population it is meant to approximate), overfitting occurs. Overfitting is characterised by low training error and high test error. It can be tackled by reducing the complexity of your model (often by removing features) or by using regularisation techniques (https://www.quora.com/What-is-regularization-in-machine-learning). A larger and more diverse training set also helps to reduce overfitting.
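
To illustrate what regularisation does, here is a minimal sketch of ridge regression (one common regularisation technique, chosen here as an example and not necessarily the one used later in this course). It uses the closed-form solution on randomly generated toy data; the helper `ridge_coefficients` and the penalty values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 observations, 5 features, target driven by the first feature only.
X = rng.normal(size=(20, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=20)

def ridge_coefficients(X, y, penalty):
    """Closed-form ridge solution: (X'X + penalty * I)^-1 X'y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

ols = ridge_coefficients(X, y, penalty=0.0)     # ordinary least squares
ridge = ridge_coefficients(X, y, penalty=10.0)  # coefficients shrunk towards 0

print(np.linalg.norm(ols), np.linalg.norm(ridge))
```

The penalised coefficients have a smaller norm than the unpenalised ones, which is how the penalty limits the model's ability to chase noise in the training data.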

There are various feature selection tools that can be used together with cross-validation to optimise your feature set (e.g. stepwise regression - https://en.wikipedia.org/wiki/Stepwise_regression).  
 
In our project we choose a selection of features that describe the viewing habits of a particular user (e.g. “average completion”, “most watched genre”, “time watched”).

In [17]:
# Create a function that pivots the data by user and gives us all the features we need
def pivot_data(dataframe):
    # How many minutes did each user watch in each two-week period?
    # (the string 'sum' avoids the deprecation warning raised for the builtin in recent pandas)
    data=pd.pivot_table(dataframe,values='min_watched',
                        index=['user_id'],columns=['twoweek'], aggfunc='sum')
    # Fill the periods in which they didn't watch anything with 0s
    data.fillna(0,inplace=True)
    # On average, what share of each piece of content did the viewer watch?
    data['average_completion']=dataframe.groupby('user_id')['percentage_watched'].mean()
    # How many distinct sessions did the viewer have with us?
    data['total_sessions']=dataframe.groupby('user_id')['streaming_id'].nunique()
    # How many minutes did the viewer watch in total?
    data['total_watched']=dataframe.groupby('user_id')['min_watched'].sum()
    # How many times has the viewer watched something (rows, not distinct sessions)?
    data['number_watched']=dataframe.groupby('user_id')['streaming_id'].count()
    # Genre most watched by the viewer (in minutes)
    data['most_genre']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                      columns=['enriched_genre'], aggfunc='sum').idxmax(axis=1)
    # Number of genres watched
    data['num_genre']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                     columns=['enriched_genre'], aggfunc='sum').count(axis=1)
    # Favourite day of the week to watch (in minutes)
    data['most_weekday']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                        columns=['weekday'], aggfunc='sum').idxmax(axis=1)
    # Number of different weekdays watched
    data['num_weekday']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                       columns=['weekday'], aggfunc='sum').count(axis=1)
    # Favourite time of day to watch (in minutes)
    data['most_timeday']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                        columns=['time_of_day'], aggfunc='sum').idxmax(axis=1)
    # Number of different times of day watched
    data['num_timeday']=pd.pivot_table(dataframe,values='min_watched', index=['user_id'],
                                       columns=['time_of_day'], aggfunc='sum').count(axis=1)
    return data

In [18]:
data_viewer=pivot_data(data_training)
data_viewer.reset_index().head()

twoweek,user_id,1,2,3,4,5,6,7,average_completion,total_sessions,total_watched,number_watched,most_genre,num_genre,most_weekday,num_weekday,most_timeday,num_timeday
0,0001c6,16.6792,0.0,0.0,0.0,0.0,0.15255,0.0,0.371496,2,16.83175,3,News,1,weekday_1,2,Evening,2
1,000c1a,0.162867,0.147467,107.0984,145.686233,2.286283,100.487767,132.432083,0.233136,28,488.3011,38,Factual,5,weekday_3,6,Morning,3
2,001c53,1.8663,0.0,0.0,0.0,1.309867,0.0,0.0,0.489419,3,3.176167,3,News,2,weekday_2,2,Morning,2
3,001d44,0.0,0.0,0.0,14.5477,0.0,0.0,0.248017,0.058203,2,14.795717,3,Sport,2,weekday_6,1,Morning,2
4,002b2e,291.477033,0.0,0.0,0.0,0.0,0.0,0.0,0.228233,17,291.477033,21,Factual,5,weekday_2,5,Evening,3


So for each user we have, over _a 14-week timeframe_:
- the minutes watched in each two-week period: `1`, `2`, ..., `7`
- the total minutes watched: `total_watched`
- the average completion when watching a piece of content: `average_completion`
- the number of sessions: `total_sessions`
- the number of times a user watched something: `number_watched`
- the main genre watched, in terms of minutes and not number of pieces of content: `most_genre`
- the number of different genres watched: `num_genre`
- the favourite day of the week to watch, again in minutes watched: `most_weekday`
- the number of different days of the week on which a user watched something: `num_weekday`
- the favourite time of the day to watch, again in minutes watched: `most_timeday`
- the number of different times of the day at which a user watched something: `num_timeday`

This set of variables constitutes our input variables. Note that we could imagine lots of other features, but for this training we will only consider these ones.
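
To see how the move from an events view to a user view works, the sketch below applies the same `pivot_table` pattern to a tiny, made-up event log that can be checked by hand. Only the column names match the course dataset; the values and the `events`, `per_period`, `per_genre` and `features` names are illustrative assumptions.

```python
import pandas as pd

# Toy event log with the same column names as the course dataset (values made up).
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "twoweek": [1, 1, 2, 2, 2],
    "min_watched": [10.0, 5.0, 20.0, 3.0, 4.0],
    "enriched_genre": ["Drama", "News", "Drama", "Sport", "Sport"],
})

# One row per user, one column per two-week period, minutes summed per cell.
per_period = pd.pivot_table(events, values="min_watched", index="user_id",
                            columns="twoweek", aggfunc="sum").fillna(0)

# Minutes per user and genre; NaN marks a genre the user never watched.
per_genre = pd.pivot_table(events, values="min_watched", index="user_id",
                           columns="enriched_genre", aggfunc="sum")

features = per_period.copy()
features["most_genre"] = per_genre.idxmax(axis=1)  # genre with the most minutes
features["num_genre"] = per_genre.count(axis=1)    # number of genres actually watched
print(features)
```

Because `count` ignores missing values, the genre pivot must not be filled with 0s before counting, otherwise every user would appear to have watched every genre.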

## Dummification
Most models only accept quantitative variables as input. We therefore need to __dummify__ the categorical fields, i.e. split each categorical variable into _n_ dummy (0/1) variables, where _n_ is the number of different values it takes.
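
As a small illustration of this definition, the sketch below dummifies one categorical column of a made-up user table. The `users` table and its values are assumptions; `columns` and `dtype` are optional parameters of `pd.get_dummies`.

```python
import pandas as pd

# Toy user table with one categorical feature taking n = 3 distinct values.
users = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "total_watched": [12.5, 3.0, 40.0, 7.25],
    "most_timeday": ["Evening", "Morning", "Evening", "Night"],
}).set_index("user_id")

# `columns` restricts the encoding to the listed fields; `dtype=int` forces 0/1
# output (recent pandas versions return booleans by default).
dummies = pd.get_dummies(users, columns=["most_timeday"], dtype=int)
print(dummies.columns.tolist())
```

The single `most_timeday` column is replaced by three 0/1 columns, one per distinct value, and each user has exactly one of them set to 1.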

In [19]:
# Turn our categorical variables into dummy (0/1) columns so that we can run models on this
data_viewer=pd.get_dummies(data_viewer).reset_index()
data_viewer.head()

Unnamed: 0,user_id,1,2,3,4,5,6,7,average_completion,total_sessions,...,most_weekday_weekday_1,most_weekday_weekday_2,most_weekday_weekday_3,most_weekday_weekday_4,most_weekday_weekday_5,most_weekday_weekday_6,most_timeday_Afternoon,most_timeday_Evening,most_timeday_Morning,most_timeday_Night
0,0001c6,16.6792,0.0,0.0,0.0,0.0,0.15255,0.0,0.371496,2,...,1,0,0,0,0,0,0,1,0,0
1,000c1a,0.162867,0.147467,107.0984,145.686233,2.286283,100.487767,132.432083,0.233136,28,...,0,0,1,0,0,0,0,0,1,0
2,001c53,1.8663,0.0,0.0,0.0,1.309867,0.0,0.0,0.489419,3,...,0,1,0,0,0,0,0,0,1,0
3,001d44,0.0,0.0,0.0,14.5477,0.0,0.0,0.248017,0.058203,2,...,0,0,0,0,0,1,0,0,1,0
4,002b2e,291.477033,0.0,0.0,0.0,0.0,0.0,0.0,0.228233,17,...,0,1,0,0,0,0,0,1,0,0


Our training dataset with its set of features is ready for use. 

## Target variable 