<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Linear Regression for Time Series Data

_Authors: Kevin Markham (Washington, D.C.), Ed Podojil (New York City)_

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()
plt.style.use('fivethirtyeight')

In [None]:
%matplotlib inline

In [None]:
bikes = pd.read_csv('../assets/data/bikeshare_modified.csv',
                    index_col='datetime',
                    parse_dates=True
                   )

## Hang on a Second...

There's something fishy about our model. We are trying to "predict" the number of riders at a given time from the values that other variables have *at that same time*. That's...not prediction.

Traditional time series modeling techniques that have been developed in statistics and econometrics would predict the value of `num_total_users` at a given time from values of that same variable at previous times. In this notebook, we will instead adapt linear regression to predict `num_total_users` from other variables.

In machine learning competitions, this general approach of applying standard supervised learning models to time-series data often gives better results than traditional time series modeling.

## Using a Standard Supervised Learning Model with Time-Series Data

### Time-Shifting the Feature Variables

If we want to be able to predict ridership two hours in advance, for instance, then we need to do it using only information that is available two hours in advance.

One way to do that is to take the variables that we cannot know in advance and shift them forward in time, so that each row pairs the target with feature values from two hours earlier. For instance, at 6pm we cannot use the temperature at 8pm to predict ridership at 8pm, but we can use the temperature at 6pm, which is after all strongly correlated with the temperature at 8pm.
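
As a minimal sketch of this idea on toy data, `Series.shift` can move each observation two rows later, so that every row holds the value observed two hours earlier. The column name `temp_celsius` and the toy values are assumptions for illustration and may not match the bikeshare file.

```python
import pandas as pd

# Toy hourly data; `temp_celsius` is a made-up stand-in for a weather column.
idx = pd.date_range('2011-01-01 00:00', periods=5, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    {'temp_celsius': [10.0, 11.0, 12.0, 13.0, 14.0],
     'num_total_users': [5, 8, 13, 21, 34]},
    index=idx,
)

# shift(2) moves each value two rows down, so the 02:00 row now holds
# the temperature that was observed at 00:00.
toy['temp_celsius_shifted'] = toy['temp_celsius'].shift(2)

# The first two rows have no earlier observation to borrow from (NaN),
# so they are dropped before modeling.
shifted = toy.dropna(subset=['temp_celsius_shifted'])
```

A row-based shift like this assumes the rows are exactly one hour apart. If the real index has gaps, the shifted values would no longer be exactly two hours old.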

In [None]:
# Shift variables that cannot be known in advance forward in time


In [None]:
bikes.head()

In [None]:
# Delete the rows for which we do not have shifted weather information


### Taking the Test Set from the End

We should also take the test set from the *end* of the time period covered by the data, because in practice when we deploy the model it will run on data that comes from later in time than the data it was trained on.
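
One way to take the test set from the end is to slice by position instead of sampling rows at random. The sketch below uses toy data and an arbitrary hold-out of the last three rows; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy features and target on an hourly index (values are arbitrary).
idx = pd.date_range('2011-01-01', periods=10, freq=pd.Timedelta(hours=1))
X = pd.DataFrame({'feature': np.arange(10.0)}, index=idx)
y = pd.Series(np.arange(10.0) * 2, index=idx, name='num_total_users')

# Hypothetical choice: hold out the final 3 rows as the test set.
test_size = 3

# Rows are already in time order, so positional slicing keeps every
# training row earlier than every test row.
X_train, X_test = X.iloc[:-test_size], X.iloc[-test_size:]
y_train, y_test = y.iloc[:-test_size], y.iloc[-test_size:]
```

This only works if the rows are sorted by time. scikit-learn's `train_test_split` shuffles by default, so it would need `shuffle=False` to give a comparable split.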

Let's see how our model does with these modifications.

In [None]:
# 1. Split your feature columns from your target column


In [None]:
# 2. Refine your feature columns.


The cell above uses a handy "list comprehension", which builds a list from a for-loop in a single expression. It is equivalent to the following more verbose code:

```python
feature_cols = []
for var in time_sensitive_features:
    feature_cols.append(var)
```
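
For comparison, a comprehension that matches the loop exactly as written might look like the sketch below. The contents of `time_sensitive_features` are placeholder names, not the actual columns of the dataset.

```python
# Placeholder column names, assumed for illustration only.
time_sensitive_features = ['temp_celsius', 'humidity_percent']

# List comprehension: one expression that loops and collects.
feature_cols = [var for var in time_sensitive_features]

# The same result built with an explicit loop.
loop_cols = []
for var in time_sensitive_features:
    loop_cols.append(var)
```

The expression before `for` can transform each item, and an optional trailing `if` can filter items, which is where a comprehension becomes more useful than a plain copy.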

In [None]:
# 3. Split your rows into a training set and a test set.


In [None]:
# Confirm that the split worked as expected


In [None]:
# 1. Import the LinearRegression class.


In [None]:
# 2. Make an instance of the LinearRegression class.


In [None]:
# 3. Train the model instance on the training set.


In [None]:
# 4. Score the model on the test set.


Rats -- our model is a lot worse when we use it for actual prediction.
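
For reference, the import, instantiate, fit, and score steps can be sketched on synthetic data as follows. The data, coefficients, and split point are invented for illustration, so the score here says nothing about the bikeshare model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on two features plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Keep row order and hold out the last 50 rows, mimicking a split by time.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

lr = LinearRegression()          # make an instance
lr.fit(X_train, y_train)         # train on the earlier rows
r2 = lr.score(X_test, y_test)    # R^2 on the held-out rows
```

`score` returns the coefficient of determination R^2 for the data it is given. It is at most 1, and on held-out data it can be negative when the model does worse than always predicting the mean.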

### Incorporating Time-Based Features

Ridership depends heavily on time of day and day of week:

In [None]:
# Plot ridership vs. time for the first week of data


It also depends on the month of the year and on time since the start of the program:

In [None]:
# Plot total monthly ridership vs. time
bikes.loc[:, 'num_total_users'].resample('M').sum().plot(figsize=(20, 8), linewidth=.5);

Why not create features for these variables?
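
A minimal sketch of deriving such features from a `DatetimeIndex` is shown below on toy data. The new column names (`hour`, `month`, `hours_from_start`) and the use of `drop_first=True` are assumptions for illustration.

```python
import pandas as pd

# Toy hourly data that crosses a month boundary.
idx = pd.date_range('2011-01-31 22:00', periods=4, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({'num_total_users': [3, 1, 0, 2]}, index=idx)

# Hour of day and month of year come straight from the index.
toy['hour'] = toy.index.hour
toy['month'] = toy.index.month

# Elapsed hours since the first row, as a numeric trend feature.
toy['hours_from_start'] = (toy.index - toy.index[0]) / pd.Timedelta(hours=1)

# Dummy-code the categorical time features. drop_first=True drops one
# level per feature so the dummies are not collinear with the intercept.
dummies = pd.get_dummies(toy, columns=['hour', 'month'], drop_first=True)
```

Dummy coding lets a linear model learn a separate offset for each hour and month, instead of forcing ridership to rise or fall linearly with the hour number.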

In [None]:
# Create a feature for hour of day


We already have an "is_workingday" variable that should capture the "day of week" effect to some extent.

In [None]:
# Create a feature for hours from start of dataset


In [None]:
# Create a feature for month of year


In [None]:
# Dummy-code hour of day and month of year


In [None]:
bikes.head()

In [None]:
# 1. Split your feature columns from your target column
# We still have `y` from before.


In [None]:
# Create list of feature variable names


In [None]:
# 2. Refine your feature columns.


The cell above uses a handy "list comprehension", which builds a list from a for-loop in a single expression. It is equivalent to the following more verbose code:

```python
feature_cols = []
for var in time_sensitive_features:
    feature_cols.append(var)
```

In [None]:
# 3. Split your rows into a training set and a test set.


In [None]:
# 2. Make an instance of the LinearRegression class.


In [None]:
# 3. Train the model instance on the training set.


In [None]:
# 4. Score the model on the test set.


$\blacksquare$

## Summary

- We can use standard supervised learning techniques to predict one variable in a multivariate time series from the other variables in that series, but only from values that are available at the time the prediction is made. One way to enforce this is to shift the feature variables whose values cannot be known in advance forward in time relative to the target variable.
- With time series data we should also make sure that all of the training data comes from earlier in time than all of the test data.
- Using various aspects of time itself as features (e.g. hour of day, day of week, month of year) can be very powerful for data sets that show strong temporal patterns (called "seasonality").