# Working with Time-Series Data

1. Cross-validation on time-series with `TimeSeriesSplit()`
2. Time shift (lag) prediction technique with pandas `shift()`


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import mglearn

## 1. Loading Citi-bike dataset

In [None]:
citibike = mglearn.datasets.load_citibike()


In [None]:
print("Citi Bike data:\n{}".format(citibike.head()))

## 2. Cross-validation with time-series

For k-fold cross-validation, a time series is divided into k+1 consecutive blocks, and the training data always precedes the validation data in time.
- In the first iteration, block0 is training, block1 is validation.
- In the second iteration, block0+block1 are training, block2 is validation, etc.



See https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split for more information.
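The expanding-window scheme described above can be made concrete on a toy array. This is a minimal sketch: `X_toy` and its 12 time steps are made up for illustration and are not part of the Citi Bike data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 12 time steps (hypothetical data, used only to show the indices).
X_toy = np.arange(12).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X_toy))
for fold, (train_idx, test_idx) in enumerate(splits):
    # Training indices always come before the validation indices in time.
    print(f"fold {fold}: train={train_idx.tolist()} validation={test_idx.tolist()}")
```

With 12 samples and 3 splits, each block has 3 samples, and the training window grows by one block per fold.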

In [None]:
X_hour_week = pd.DataFrame(np.hstack([citibike.index.dayofweek.values.reshape(-1, 1),
                                     citibike.index.hour.values.reshape(-1, 1)]))

y = citibike.values
y.shape

In [None]:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=3)

for train_index, test_index in tscv.split(X_hour_week):
    print("train size:", train_index.shape, "test size:", test_index.shape)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rf_regressor = RandomForestRegressor(n_estimators=100, random_state=0)

scores = cross_val_score(rf_regressor, X_hour_week, y, cv=tscv)

print(scores)
print(f"mean= {scores.mean():.3f}")

In [None]:
from sklearn.linear_model import Ridge

ridge = Ridge()

scores = cross_val_score(ridge, X_hour_week, y, cv=tscv)
print(scores)
print(f"mean= {scores.mean():.3f}")

## 3. Shift (lag) method
In the absence of (or in addition to) a date-time index, features can be generated by shifting the target vector *forward*: the value at time `t-1` in the original vector is aligned with the original value at time `t`.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html

In [None]:
df = pd.DataFrame()
df['original'] = list(range(10))
df['lag_1'] = df['original'].shift(1)
print(df)

### On our dataset
In the Citi-bike dataset, 1 time step is 3h; 8 time steps = 1 day.

A shift of 8 therefore creates a feature that uses the previous day's value, at the same time of day, to predict today's value.
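The step arithmetic can be checked on a small synthetic index. This is a sketch under assumptions: the two-day `toy` frame and its start date are hypothetical, and only the 3-hour sampling interval is taken from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-hourly index covering two days (16 steps).
step = pd.Timedelta(hours=3)
index = pd.date_range("2015-08-01", periods=16, freq=step)
steps_per_day = pd.Timedelta(days=1) // step  # number of 3h steps in 24h

toy = pd.DataFrame({"original": np.arange(16)}, index=index)
toy["lag_1_day"] = toy["original"].shift(steps_per_day)

# Each row of lag_1_day holds the value observed exactly 24 hours earlier.
row = toy.loc[pd.Timestamp("2015-08-02 09:00")]
print(steps_per_day, row["original"], row["lag_1_day"])
```

Deriving the shift from the sampling interval, rather than hard-coding 8, keeps the lag correct if the data is resampled to a different frequency.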

In [None]:
shift = 8
col_name = f'lag_{shift}'
X_single_shift = pd.DataFrame(y, columns=['original'], index=citibike.index)
X_single_shift[col_name] = X_single_shift['original'].shift(shift)

X_single_shift.head(n=10)

Shifting inserts `nan` values at the beginning of the shifted column. It is customary to drop these leading rows, which shrinks the dataset by as many rows as the shift.

## 4. Try out shift features on Citi-bike


### Import utility functions
Utility functions are now in a module so that all notebooks can use them.

In [None]:
from timeseries_utils import eval_on_features

### Analyse single shift feature vector

In [None]:
shift = 8
col_name = f'lag_{shift}'
X_single_shift = pd.DataFrame(y, columns=['original'], index=citibike.index)
X_single_shift[col_name] = X_single_shift['original'].shift(shift)

X_single_shift = X_single_shift.dropna()
X_single_shift.head()

**Random Forest Regressor**

In [None]:
eval_on_features(X_single_shift.drop(columns=['original']), y[shift:], rf_regressor)

In [None]:
scores = cross_val_score(rf_regressor, 
                         X_single_shift.drop(columns=['original']), 
                         y[shift:], 
                         cv=tscv)
print(scores)
print(f"mean= {scores.mean():.3f}")

**Ridge regression**

In [None]:
eval_on_features(X_single_shift.drop(columns=['original']), y[shift:], ridge)

In [None]:
scores = cross_val_score(ridge, X_single_shift.drop(columns=['original']), y[shift:], cv=tscv)
print(scores)
print(f"mean= {scores.mean():.3f}")

## 5. Multiple lag features

We can use this shifting technique to engineer multiple feature columns, each with a different lag.

In [None]:
y_df = pd.DataFrame(y, index=citibike.index)

X_shift = pd.DataFrame(y, columns=['original'], index=citibike.index)

# One lag column per day: 8, 16, ..., 56 time steps (1 to 7 days back)
for shift in range(8, 57, 8):
    col_name = f'lag_{shift}'
    X_shift[col_name] = y_df.shift(shift)

X_shift = X_shift.dropna()

In [None]:
X_shift

Because the maximum lag is 56, our dataset is reduced from 248 to 192 rows after dropping the rows with `nan` values.
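The row count can be verified without the real data. A minimal sketch, assuming a synthetic stand-in series (`toy`) whose only relevant property is its length of 248:

```python
import numpy as np
import pandas as pd

n_rows = 248                      # row count stated above
lags = list(range(8, 57, 8))      # 8, 16, ..., 56

# Synthetic stand-in for the target; only the length matters here.
toy = pd.DataFrame({"original": np.arange(n_rows, dtype=float)})
for lag in lags:
    toy[f"lag_{lag}"] = toy["original"].shift(lag)

complete = toy.dropna()
# The number of rows lost equals the largest lag, not the sum of the lags.
print(len(toy), "->", len(complete))
```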

In [None]:
eval_on_features(X_shift.drop(columns=['original']), X_shift['original'], rf_regressor)

In [None]:
scores = cross_val_score(rf_regressor, 
                         X_shift.drop(columns=['original']),
                         X_shift['original'],
                         cv=tscv)
print(scores)
print(f"mean= {scores.mean():.3f}")

In [None]:
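# cross_val_score fits clones, so this plot relies on eval_on_features
# having fitted rf_regressor in place.
n_features = X_shift.drop(columns=['original']).shape[1]
plt.barh(np.arange(n_features), rf_regressor.feature_importances_, align='center')
plt.yticks(np.arange(n_features), X_shift.drop(columns=['original']).columns)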
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.ylim(-1, n_features);

In [None]:
eval_on_features(X_shift.drop(columns=['original']), X_shift['original'], ridge)

In [None]:
scores = cross_val_score(ridge, 
                         X_shift.drop(columns=['original']),
                         X_shift['original'],
                         cv=tscv)
print(scores)
print(f"mean= {scores.mean():.3f}")

In [None]:
plt.figure(figsize=(6, 2))
plt.plot(ridge.coef_,'o')
plt.xticks(np.arange(len(ridge.coef_)), X_shift.drop(columns=['original']).columns, rotation=90)
plt.xlabel("Feature name")
plt.ylabel("Feature magnitude")
plt.grid();

## 6. Grid search
### Random forest

In [None]:
from sklearn.model_selection import GridSearchCV

tscv = TimeSeriesSplit(n_splits=3)

param_grid = {'max_depth': [3, 5, 7, 9],
             'min_samples_leaf': [1, 3, 5],
             'max_features': [0.3, 0.5, 0.7]}

grid = GridSearchCV(rf_regressor, param_grid, cv=tscv, n_jobs=-1)
grid.fit(X_shift.drop(columns=['original']),
         X_shift['original'])
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

In [None]:
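# How much do the mean CV scores vary across the parameter grid?
# This indicates how sensitive the model is to the hyperparameters;
# fold-to-fold variation is stored in grid.cv_results_['std_test_score'].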
grid.cv_results_['mean_test_score'].std()
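Note that `mean_test_score` holds one value per parameter combination, so its standard deviation measures spread across the grid rather than across folds. The sketch below shows how both kinds of variation could be read from `cv_results_`. It uses synthetic data and a small `Ridge` grid purely for speed; `X_toy`, `y_toy` and `search` are illustrative names, not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic regression data (an assumption, not the Citi Bike data).
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(120, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                      cv=TimeSeriesSplit(n_splits=3))
search.fit(X_toy, y_toy)

results = search.cv_results_
best = search.best_index_
# Spread of the mean score across parameter combinations.
across_params = results["mean_test_score"].std()
# Spread across the three folds for the best parameter combination.
across_folds = results["std_test_score"][best]
fold_scores = [results[f"split{i}_test_score"][best] for i in range(3)]
print(across_params, across_folds, fold_scores)
```

Because each fold of `TimeSeriesSplit` trains on a larger window, the per-fold scores are the ones that reflect sensitivity to training-set size.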

In [None]:
eval_on_features(X_shift.drop(columns=['original']), 
                 X_shift['original'], 
                 grid.best_estimator_)

### Ridge regression

In [None]:
from sklearn.model_selection import GridSearchCV

tscv = TimeSeriesSplit(n_splits=3)

param_grid = {'alpha': [1.0, 10.0, 100.0, 500.0]}

grid = GridSearchCV(ridge, param_grid, cv=tscv, n_jobs=-1)
grid.fit(X_shift.drop(columns=['original']),
         X_shift['original'])
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

In [None]:
grid.cv_results_['mean_test_score'].std()

In [None]:
eval_on_features(X_shift.drop(columns=['original']), 
                 X_shift['original'], 
                 grid.best_estimator_)

## Summary

### General
`TimeSeriesSplit` provides a way to split time series for cross-validation and can be passed as the `cv` argument to a grid search.

If we have a time series without a date-time index, we can derive useful features by shifting existing features or the target variable with pandas `shift()`.

The downside is that `nan` values are introduced and rows need to be dropped (or filled in). Dropping rows reduces the dataset.

Furthermore, making a prediction now requires as many past samples as the largest lag.
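To illustrate the last point, here is a minimal sketch of assembling the feature row for the next time step from past observations. The helper `next_step_features` and the `history` array are hypothetical and do not come from `timeseries_utils`.

```python
import numpy as np

def next_step_features(history, lags):
    """Build the lag feature row for the step right after `history`."""
    history = np.asarray(history, dtype=float)
    if len(history) < max(lags):
        raise ValueError(f"need at least {max(lags)} past samples, got {len(history)}")
    # A lag of k refers to the value k steps before the step being predicted.
    return np.array([history[-lag] for lag in lags])

lags = list(range(8, 57, 8))
history = np.arange(60)           # hypothetical past observations 0..59
features = next_step_features(history, lags)
print(features)
```

With lags up to 56, at least 56 past samples (7 days of 3-hour steps) must be available before the first prediction can be made.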

### Citibike 
- We chose multiples of 8 as lags, where a lag of 8 = 1 day
- The strongest feature is lag 56 = 7 days, i.e. the same time one week earlier

| model | parameters | train rms | valid rms | train r2 | valid r2 | cv valid r2 |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Random forest | {'max_depth': 5, 'max_features': 0.7, 'min_samples_leaf': 1} | 3.70 | 6.29 | 0.93 | 0.77 | 0.75 |
| Ridge regression | {'alpha': 500.0} | 6.65 | 6.42 | 0.76 | 0.76 | 0.72 |


Date-derived features in the Time-Series Data notebook gave better results, with valid r2 > 0.8 and valid rms < 6.
