In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
import seaborn as sns
sns.set_style("whitegrid")
# Bigger font
sns.set_context("poster")
# Figure size
plt.rcParams['figure.figsize'] = (10, 3)
np.random.seed(123)

## Prediction

Until now, the only approach I knew for prediction on time series was the one explained in this other course: https://www.coursera.org/learn/practical-time-series-analysis/home/week/1

Here, the time series is not fed to the model directly. Instead, its timestamp variable is first converted into categorical and numerical features.

Example: Predict the number of apples a shop will sell each day next week.

In [2]:
date = pd.date_range('2018-01-01', periods=30, freq='2D')
# Roughly linear sales: a trend of +4 per row plus uniform noise in [-2, 2]
apples = [np.round(np.random.uniform(i - 2, i + 2)) for i in range(5, 125, 4)]
df = pd.DataFrame({'date': date, 'apples': apples})
df.head(10)

| | apples | date |
|---|---|---|
| 0 | 6.0 | 2018-01-01 |
| 1 | 8.0 | 2018-01-03 |
| 2 | 12.0 | 2018-01-05 |
| 3 | 17.0 | 2018-01-07 |
| 4 | 22.0 | 2018-01-09 |
| 5 | 25.0 | 2018-01-11 |
| 6 | 31.0 | 2018-01-13 |
| 7 | 34.0 | 2018-01-15 |
| 8 | 37.0 | 2018-01-17 |
| 9 | 41.0 | 2018-01-19 |


The data can be transformed so that it can be used to train a linear model.

- Add the day number and the week number.
- **The first days will be for training** and **the last days (from week 6 onward) for testing**.

So you might split your dataset into train, test and, if you want, validation sets.

![](images/train_test.png)

Then, we can predict the number of apples in the 6th week.

The linear dependency can be easily seen by doing the following.

In [3]:
# Series.dt.week was removed in pandas 2.0; isocalendar().week is its replacement
df_1 = pd.DataFrame({'day': df.index.values + 1, 'week': df['date'].dt.isocalendar().week, 'apples': df.apples})
df_1.head(10)

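| | apples | day | week |
|---|---|---|---|
| 0 | 6.0 | 1 | 1 |
| 1 | 8.0 | 2 | 1 |
| 2 | 12.0 | 3 | 1 |
| 3 | 17.0 | 4 | 1 |
| 4 | 22.0 | 5 | 2 |
| 5 | 25.0 | 6 | 2 |
| 6 | 31.0 | 7 | 2 |
| 7 | 34.0 | 8 | 3 |
| 8 | 37.0 | 9 | 3 |
| 9 | 41.0 | 10 | 3 |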
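
A minimal sketch of the time-based split followed by a linear fit, assuming scikit-learn's `LinearRegression` (the notes do not say which linear model to use). The data is regenerated synthetically with a different random generator, so the values will not match the table above, and the names `data`, `train` and `test` are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data in the spirit of the apples example (illustrative values).
rng = np.random.default_rng(123)
dates = pd.date_range("2018-01-01", periods=30, freq="2D")
data = pd.DataFrame({
    "day": np.arange(1, 31),
    "week": dates.isocalendar().week.astype("int64").to_numpy(),
    "apples": np.round(5 + 4 * np.arange(30) + rng.uniform(-2, 2, size=30)),
})

# Time-based split: earlier weeks for training, week 6 onward for testing.
train = data[data["week"] < 6]
test = data[data["week"] >= 6]

model = LinearRegression().fit(train[["day"]], train["apples"])
pred = model.predict(test[["day"]])

# Mean absolute error on the held-out weeks.
mae = np.mean(np.abs(pred - test["apples"].to_numpy()))
print(f"slope={model.coef_[0]:.2f}, test MAE={mae:.2f}")
```

Because the split follows time order rather than a random shuffle, the test error reflects how well the model extrapolates to future days.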


### Time data for a non-linear model like GBDT

We can add the **mean number of apples for each week**.

In [4]:
# Series.dt.week was removed in pandas 2.0; isocalendar().week is its replacement
df_tmp = df.assign(week=df['date'].dt.isocalendar().week)
df_2 = pd.DataFrame(df_tmp.groupby('week')['apples'].mean())
df_2

| week | apples |
|---|---|
| 1 | 10.75 |
| 2 | 26.0 |
| 3 | 39.0 |
| 4 | 52.666667 |
| 5 | 66.75 |
| 6 | 81.333333 |
| 7 | 95.75 |
| 8 | 108.0 |
| 9 | 119.0 |


In [5]:
from sklearn.ensemble import GradientBoostingRegressor
# loss='ls' was renamed to 'squared_error' in scikit-learn 1.0 and later removed
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                  max_depth=1, random_state=0, loss='squared_error')
X = df_2.index.values.reshape(-1, 1)[:5]
y = df_2.apples.values[:5]
model.fit(X, y)

model.predict(df_2.index.values.reshape(-1, 1))

array([10.8022367 , 26.01170861, 39.00334467, 52.65566279, 66.69371389,
       66.69371389, 66.69371389, 66.69371389, 66.69371389])

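The model predicts the same value for every week after the training set. Decision trees split on thresholds seen during training, so a tree-based model cannot extrapolate a trend beyond the range of its training data.

This illustrates what is said in the video https://www.coursera.org/learn/competitive-data-science/lecture/1Nh5Q/overview, so in general, **feature generation depends on the model type!**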
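
To make the contrast concrete, here is a small sketch that fits both a GBDT and a linear model on the first 5 weekly means printed above and compares their predictions for weeks 6 to 9. The linear model (`LinearRegression`) is my own addition for comparison, and the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Weekly means copied from the output above.
weeks = np.arange(1, 10).reshape(-1, 1)
mean_apples = np.array([10.75, 26.0, 39.0, 52.666667, 66.75,
                        81.333333, 95.75, 108.0, 119.0])

# Train on weeks 1-5 only, as in the cell above.
X_train, y_train = weeks[:5], mean_apples[:5]

tree_model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0
).fit(X_train, y_train)
linear_model = LinearRegression().fit(X_train, y_train)

# Predict the unseen weeks 6-9.
tree_pred = tree_model.predict(weeks[5:])
linear_pred = linear_model.predict(weeks[5:])

print("GBDT:  ", np.round(tree_pred, 2))    # flat: stuck at the last training level
print("Linear:", np.round(linear_pred, 2))  # keeps following the trend
```

The raw week number is a good feature for the linear model but a poor one for the GBDT, which is why the same timestamp may need different generated features depending on the model.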

## Feature generation

There are essentially 3 types of features one can get from a timestamp variable.

- You can generate **categorical** (e.g. day of the week) or **numerical** (e.g. time passed since a date) features.
- Because of that, **those generated features need to be treated accordingly with other pre-processing methods**.
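
As a rough illustration of treating each generated feature according to its type, the sketch below one-hot encodes a categorical feature and min-max scales a numerical one. The choice of these two pre-processing methods is an assumption made for the example; the column names are made up.

```python
import pandas as pd

ts = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2018-01-01", "2018-01-03", "2018-01-06"])}
)

# Categorical feature generated from the timestamp.
ts["weekday"] = ts["timestamp"].dt.day_name()
# Numerical feature generated from the timestamp.
ts["days_since_start"] = (ts["timestamp"] - ts["timestamp"].min()).dt.days

# Categorical -> one-hot encoding (one possible treatment).
weekday_dummies = pd.get_dummies(ts["weekday"], prefix="weekday")

# Numerical -> min-max scaling to [0, 1] (one possible treatment).
lo, hi = ts["days_since_start"].min(), ts["days_since_start"].max()
ts["days_since_start_scaled"] = (ts["days_since_start"] - lo) / (hi - lo)

features = pd.concat([ts[["days_since_start_scaled"]], weekday_dummies], axis=1)
print(features)
```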

### 1. Periodicity

- Useful to capture repetitive patterns in the data

One can create features such as:
- Day number in the week (week day)
- Day number in a month
- Month
- Season
- Year
- Second
- Minute
- Hour
- is_holiday? (binary feature)
- **Non-common periods** that may influence the data
    - For example: when predicting the efficacy of a medication that patients take once every 3 days, the 3-day cycle is a period worth encoding.
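
Most of the features listed above can be read off the pandas `.dt` accessor. The sketch below is one way to do it. The holiday list, the season convention (meteorological, northern hemisphere) and the `treatment_start` date are assumptions made up for the example.

```python
import pandas as pd

stamps = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2018-01-01 08:30:15", "2018-07-14 23:05:00"])}
)
t = stamps["timestamp"].dt

stamps["weekday"] = t.dayofweek        # Monday=0 ... Sunday=6
stamps["day_of_month"] = t.day
stamps["month"] = t.month
stamps["year"] = t.year
stamps["hour"] = t.hour
stamps["minute"] = t.minute
stamps["second"] = t.second

# Season (assumed convention): 0=winter, 1=spring, 2=summer, 3=autumn.
stamps["season"] = stamps["month"] % 12 // 3

# is_holiday: membership in a hypothetical holiday calendar.
holidays = pd.to_datetime(["2018-01-01", "2018-12-25"])
stamps["is_holiday"] = t.normalize().isin(holidays)

# Non-common period: position within a 3-day cycle,
# counted from a hypothetical treatment start date.
treatment_start = pd.Timestamp("2018-01-01")
stamps["cycle_day"] = (t.normalize() - treatment_start).dt.days % 3

print(stamps.T)
```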

### 2. Time since

Time passed since a particular moment or event. The moment can be of 2 types.

#### a. Row-independent moment

For example: Time passed since 00:00:00 UTC, 1 January 1970
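
A small sketch of this feature, assuming the timestamps are naive and should be read as UTC. Every row is measured against the same fixed reference moment.

```python
import pandas as pd

events = pd.DataFrame(
    {"timestamp": pd.to_datetime(["1970-01-02 00:00:00", "2018-01-01 00:00:00"])}
)

# The same reference moment is used for every row.
epoch = pd.Timestamp("1970-01-01")
elapsed = events["timestamp"] - epoch

events["seconds_since_epoch"] = elapsed.dt.total_seconds()
events["days_since_epoch"] = elapsed.dt.days
print(events)
```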

#### b. Row-dependent moment

For example:
- Number of days left **until the next holiday/weekend/...**.
- Time passed **after the last holiday/weekend/...**.

Example:
- Rossmann Store Sales, Kaggle --> A feature can be "time passed since the last sales campaign".

![](images/timesince.png)
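
One possible way to compute such row-dependent features is `pd.merge_asof`, which matches each row with the nearest earlier (or later) event. The campaign dates and column names below are invented for the sketch and are not taken from the Rossmann dataset.

```python
import pandas as pd

sales = pd.DataFrame(
    {"date": pd.to_datetime(["2018-01-03", "2018-01-10", "2018-01-20"])}
)
campaigns = pd.DataFrame(
    {"campaign_date": pd.to_datetime(["2018-01-01", "2018-01-15"])}
)

# merge_asof requires both frames to be sorted by their key columns.
prev = pd.merge_asof(sales, campaigns, left_on="date",
                     right_on="campaign_date", direction="backward")
nxt = pd.merge_asof(sales, campaigns, left_on="date",
                    right_on="campaign_date", direction="forward")

# Each row gets its own reference moment: the closest campaign before/after it.
sales["days_since_last_campaign"] = (prev["date"] - prev["campaign_date"]).dt.days
sales["days_until_next_campaign"] = (nxt["campaign_date"] - nxt["date"]).dt.days
print(sales)
```

Rows with no later campaign get a missing value in `days_until_next_campaign`, which then has to be handled like any other missing value.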

### 3. Difference between dates

If there are several datetime columns, their differences can be used as features.

- Subtract 2 datetimes
- Subtract the generated features

Example:
- Churn prediction --> Estimation of the likelihood that a customer will churn.
- **date_diff = last_purchase_date - last_call_date** (date of the call to the customer service)

![](images/churnpred.png)
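
A short sketch of both variants, using made-up customer dates. `date_diff` subtracts the two datetimes directly, while `month_diff` (a hypothetical name) subtracts a feature generated from each of them.

```python
import pandas as pd

customers = pd.DataFrame({
    "last_purchase_date": pd.to_datetime(["2018-03-10", "2018-02-01"]),
    "last_call_date": pd.to_datetime(["2018-03-01", "2018-03-05"]),
})

# Subtract 2 datetimes: difference in days (negative if the call came later).
customers["date_diff"] = (
    customers["last_purchase_date"] - customers["last_call_date"]
).dt.days

# Subtract generated features: here, the month of each date.
customers["month_diff"] = (
    customers["last_purchase_date"].dt.month - customers["last_call_date"].dt.month
)
print(customers)
```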