# Module Tutorial: Create and Encode Time Series Features

## Overview

This module tutorial covers three core categories of time series features: date time features, lag features, and window statistics. These features will all be computed using the pandas library.

### Learning objectives

By the end of this tutorial, you will be able to:
- Access date and time information from a datetime object and encode it.
- Create lag features.
- Compare and contrast rolling and expanding windows.

### Estimated Time

This lab will take approximately **30 minutes** to complete.

### Prerequisites

- Familiarity with pandas, numpy, and lambdas.
- Knowledge of linear and logistic regression.

### Lab Setup

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Introduction

In this tutorial, we use the Metro Interstate Traffic Volume dataset. It contains hourly measurements, recorded between 2012 and 2018, of weather features and holidays, along with the traffic volume on westbound I-94 between Minneapolis and St. Paul, Minnesota, so that their impact on traffic can be assessed.

Run the following cell to load the data for this tutorial. We will truncate the dataset and only take the last 6 months of data recorded.

In [None]:
df = pd.read_csv('data/Metro_Interstate_Traffic_Volume.csv')
df = df[df['date_time'] >= '2018-04-10']
df

## Lesson 1: Encode Time-based Features

Let's begin with encoding time-based features. In the dataset, the date and time of each measurement are recorded in the `date_time` column.

In [None]:
display(df['date_time'])

# Try accessing one of the values
df['date_time'].values[0]

Notice the data type of the column and the cell output. The values are currently stored as `str`. To access the date and time information in these values, we need to convert them to the `datetime64` data type.

Pandas provides a convenient API, `to_datetime`, to convert these values into the `datetime64` data type. Luckily, the date strings are already in [ISO 8601 format](https://www.iso.org/iso-8601-date-and-time-format.html), so we don't have to specify a particular format. Run the following cell to convert the data type.

In [None]:
print(f"Data type before conversion: {type(df['date_time'].values[0])}")

# Convert the data to datetime64 and replace previous values.
df['date_time'] = pd.to_datetime(df['date_time'])

print(f"Data type after conversion: {type(df['date_time'].values[0])}")
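When timestamps are not in ISO 8601 format, `pd.to_datetime` accepts a `format` argument built from `strftime`-style codes. The following is a minimal sketch; the day-first sample strings are made up for illustration and do not come from the traffic dataset.

```python
import pandas as pd

# Hypothetical day-first timestamps (not from the traffic dataset)
raw = pd.Series(['10/04/2018 00:00', '10/04/2018 01:00', '11/04/2018 13:30'])

# Spell out the layout so '10/04' is read as 10 April rather than 4 October
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')

print(parsed[0])  # 2018-04-10 00:00:00
```

Without an explicit format, an ambiguous string such as `10/04/2018` may be interpreted month-first, so stating the layout is the safer choice for non-ISO data.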

Now that we have the right data type, we have access to [many properties](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components) of the datetime object. For example, it might be helpful to indicate to a model which day of the week the measurement was recorded.

We can access this information using `weekday()` on the elements in the column. This returns an integer of the day of the week, from 0 (Monday) to 6 (Sunday).

In [None]:
df['weekday'] = df['date_time'].apply(lambda x: x.weekday())
df['weekday']

In [None]:
plt.figure(figsize=(12, 7));
plt.plot(df['date_time'], df['weekday']);

We can then create a new column `is_weekend` by checking if the day is Saturday (5) or Sunday (6).

In [None]:
df['is_weekend'] = df['weekday'].apply(lambda x: 1 if x >= 5 else 0)

plt.figure(figsize=(10, 7));
plt.plot(df['date_time'], df['is_weekend']);
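As an alternative to `apply` with a lambda, pandas exposes the same date components through the vectorized `.dt` accessor on datetime columns. The sketch below uses a small made-up frame (the name `demo` is hypothetical) rather than the tutorial's `df`.

```python
import pandas as pd

# Small illustrative frame: Tuesday 2018-04-10 through Sunday 2018-04-15
demo = pd.DataFrame({'date_time': pd.date_range('2018-04-10', periods=6, freq='D')})

# 0 = Monday ... 6 = Sunday, the same convention as weekday()
demo['weekday'] = demo['date_time'].dt.weekday

# The comparison yields booleans; cast to int to get a 0/1 flag
demo['is_weekend'] = (demo['weekday'] >= 5).astype(int)

print(demo)
```

Both approaches should give the same values; the `.dt` version avoids a Python-level function call per row, which tends to matter on larger datasets.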

What if we decide to use a scale-dependent model? Such a model treats an integer-encoded feature as a numeric quantity, so the magnitude of each value matters. For example, a linear regression model will treat Sunday (6) as a larger quantity than Monday (0) and assume that the target changes by the same amount for each step through the week.

We can resolve this with **one-hot encoding**! But how do we retrieve the name of the day of the week? We can use the `strftime` ("string from time") method, which pandas timestamps share with the objects in Python's `datetime` module. Given a format code, it returns a string representation of the date or time.

To get the name of the weekday, we can call `strftime('%A')` on the datetime object, then use `pd.get_dummies` to encode the column.

In [None]:
# Extract the weekday name from the `date_time` column
df['weekday_name'] = df['date_time'].apply(lambda x: x.strftime('%A'))
display(df['weekday_name'])

# One-hot encode the `weekday_name` column
pd.get_dummies(df, columns=['weekday_name'], dtype=int)
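To make the result of one-hot encoding concrete, here is a small sketch on made-up data. The `sample` frame is hypothetical and stands in for the `weekday_name` column created above.

```python
import pandas as pd

sample = pd.DataFrame({'weekday_name': ['Monday', 'Saturday', 'Sunday', 'Monday']})

# Each distinct weekday name becomes its own 0/1 column
encoded = pd.get_dummies(sample, columns=['weekday_name'], dtype=int)

print(encoded.columns.tolist())
# ['weekday_name_Monday', 'weekday_name_Saturday', 'weekday_name_Sunday']

# Exactly one column is "hot" in every row
print(encoded.sum(axis=1).tolist())  # [1, 1, 1, 1]
```

Note that `pd.get_dummies` only creates columns for values that actually occur in the data, and that it returns a new DataFrame rather than modifying the original in place.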

Feel free to use the following cell to see what kinds of information you can extract from the datetime object using `strftime`. Use [this cheatsheet](https://strftime.org/) to get an idea of what the different codes are.

In [None]:
df['date_time'].apply(lambda x: x.strftime(...))
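As a starting point for experimenting, the sketch below applies a few common format codes to a single hand-picked timestamp (the variable `ts` is made up for illustration). Codes that produce names, such as `%A` and `%B`, depend on the system locale, so their output is only indicated in a comment.

```python
import pandas as pd

ts = pd.Timestamp('2018-04-10 15:30:00')

print(ts.strftime('%Y-%m-%d'))   # 2018-04-10
print(ts.strftime('%H:%M'))      # 15:30
print(ts.strftime('%j'))         # 100 (day of the year)
print(ts.strftime('%A, %B %d'))  # locale-dependent, e.g. Tuesday, April 10
```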

### Key Points

- Convert date strings to datetime with `pd.to_datetime`. You may need to specify the format if the strings are not ISO 8601 compliant.
- Some time-based features, like hour or day of week, should not be passed as raw integers to scale-dependent models such as linear regression. One way to circumvent this is one-hot encoding.
- Use `strftime()` to format dates and times as strings.

## Lesson 2: Lag Features

Lag features are useful when previous measurements may influence the current measurement. For example, the temperature at 6 PM is often influenced by the temperature earlier in the day at 3 PM. If weather conditions are stable, yesterday's high might influence today's high temperature. Similarly, the demand for a particular item today might be influenced by the demand recorded 7 days ago.

Pandas provides a convenient API, `shift` ([API documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html)), to create these lag features.

In [None]:
lag_hours = 1

df['temp_lag'] = df['temp'].shift(periods=lag_hours)
df[['date_time', 'temp', 'temp_lag']]

<div class="alert alert-info">
    Note the frequency of your data. If we wanted to lag the temperature data by one day, we would need to specify <code>periods=24</code> since we are working with hourly measurements.
</div>
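The note above can be sketched on synthetic data. The frame `demo` and the column names `temp_lag_1h` and `temp_lag_24h` are hypothetical, and the values are simply 0 to 47 so that the shift is easy to read off.

```python
import pandas as pd

# Synthetic hourly series covering two days (values are just 0..47)
hours = pd.date_range('2018-04-10', periods=48, freq='h')
demo = pd.DataFrame({'date_time': hours, 'temp': range(48)})

# One row back = one hour back; 24 rows back = one day back
demo['temp_lag_1h'] = demo['temp'].shift(periods=1)
demo['temp_lag_24h'] = demo['temp'].shift(periods=24)

# The earliest rows have no history to draw from, so they are NaN
print(demo.iloc[[0, 1, 24, 25]])
```

Because `shift` moves values by a number of rows, this only corresponds to a fixed time lag when the rows are evenly spaced, with no missing or duplicated timestamps.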


### Key Points
- Use lag features when past values impact future values.
- Be mindful of the frequency of your time series data.

## Lesson 3: Windows

### Rolling Windows
One way to smooth out noise and highlight trends in time series data is by using rolling windows. A rolling window computes a value over a range of time, then "slides" over by a specified interval, and repeats the computation on the next window until the end of the series. The following animation illustrates how a rolling mean is computed.

<img src="assets/rollingwindow.gif" alt="Explanation of rolling window" width="800"/>

For this example, we will work with a subset of the data. Run the following cell to view the raw data.

In [None]:
df3 = df[df['date_time'] <= '2018-06-01']
plt.figure(figsize=(10, 7));

# raw measurements
plt.plot(df3['date_time'], df3['temp']);
plt.xticks(rotation=45);

We can compute a rolling window statistic using the pandas [`rolling` API](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html). It works similarly to `groupby`: you call an additional method to apply an operation to each window.

You define the size of the window using the `window` parameter. You may also specify the minimum number of observations required to compute a value with the `min_periods` parameter.
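To see what `min_periods` changes, here is a minimal sketch on a five-value toy series (the names `strict` and `lenient` are made up for illustration).

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default for an integer window: min_periods equals the window size,
# so windows that are not yet full produce NaN
strict = s.rolling(window=3).mean()

# min_periods=1: partial windows at the start are averaged over what is available
lenient = s.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({'raw': s, 'strict': strict, 'lenient': lenient}))
```

The two results differ only in the first two rows: `strict` is NaN there, while `lenient` reports 1.0 and 1.5.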

Choosing the right window size may take some experimentation. The next module covers feature selection techniques, which may help in selecting the most helpful window sizes for your model.

In the following cell, we compute the rolling average (or moving average) of the temperature data using a window size of 24. In other words, each point is the mean of the most recent 24 rows, which is roughly one day of hourly measurements. Notice how the spikes in the plot are smoothed out.

In [None]:
plt.figure(figsize=(10, 7));

plt.plot(df3['date_time'], df3['temp'].rolling(window=24, min_periods=1).mean(), color=(1,.49,.13), label='window=24')
plt.plot(df3['date_time'], df3['temp'], label='raw', alpha=0.5);
plt.legend();
plt.xticks(rotation=45);

Feel free to experiment with different size windows and operations on different columns in the following cell! Refer to the [list of the window operations](https://pandas.pydata.org/docs/reference/window.html#api-functions-rolling) in the pandas documentation.

In [None]:
plt.figure(figsize=(10, 7));

# Try out different parameters and operations!
rolling_stat = df3['temp'].rolling(window=48, min_periods=1).max()

plt.plot(df3['date_time'], rolling_stat, label='rolling', color=(1, 0, 0));
plt.plot(df3['date_time'], df3['temp'], label='raw', alpha=0.5);
plt.legend();
plt.xticks(rotation=45);
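Beyond plotting, rolling statistics are typically stored as new columns so a model can use them as features. The following is one possible sketch on synthetic values; the frame `demo` and the `temp_roll_*` column names are assumptions, not part of the dataset.

```python
import pandas as pd

# Synthetic stand-in for an hourly temperature column
demo = pd.DataFrame({'temp': [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]})

# Define the window once, then derive several statistics from it
window = demo['temp'].rolling(window=3, min_periods=1)

# Each statistic over the same window becomes its own feature column
demo['temp_roll_mean'] = window.mean()
demo['temp_roll_max'] = window.max()
demo['temp_roll_min'] = window.min()

print(demo)
```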

### Expanding Windows

In comparison, an expanding window includes all previous data. While a rolling window has a fixed window size, an expanding window has a fixed starting point. Continuing the example of taking the average, an expanding window will compute what the average value is *so far*. Expanding windows are useful when you need to consider all historical data up to the current observation. 

Pandas also conveniently provides an `expanding` API, with a [set of operations](https://pandas.pydata.org/docs/reference/window.html#expanding-window-functions) you can use just like the rolling API. Run the following cell and inspect the expanding average across the dataset.

In [None]:
plt.figure(figsize=(10, 7));

# plotting raw data
plt.plot(df['date_time'], df['temp'], label='raw', alpha=0.6)

# expanding window
y = df['temp'].expanding().mean()
plt.plot(df['date_time'], y, label='expanding mean')

# plot full sample mean
plt.hlines(y=df['temp'].mean(), xmin=df['date_time'].min(), xmax=df['date_time'].max(), colors=['g'], linestyles=['--'], label='full sample mean');
plt.legend();

Notice how the values at the beginning are unstable. As more data points are included in the computation, the curve begins to smooth out and converge to the full sample average.
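The contrast between the two window types, and the convergence described above, can be checked on a tiny made-up series.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])

expanding_mean = s.expanding().mean()      # mean of everything seen so far
rolling_mean = s.rolling(window=2).mean()  # mean of the last 2 values only

print(pd.DataFrame({'raw': s, 'expanding': expanding_mean, 'rolling_2': rolling_mean}))

# The final expanding value uses every observation, so it matches the full mean
print(abs(expanding_mean.iloc[-1] - s.mean()) < 1e-9)  # True
```

Here the expanding mean is 2, 3, 4, 5, while the rolling mean over two values is NaN, 3, 5, 7: the rolling window forgets older values, whereas the expanding window keeps them all.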

### Key Points

- Rolling windows apply a function over a fixed-size subset of data. This helps smooth out noise.
    - Window sizes can be configured and may require some experimentation to find the right size for your model.
- Expanding windows have a fixed starting point and apply a function over all historical data. Use them when you need to keep all prior information.

-------

## Resources

- Python `strftime` cheat sheet: https://strftime.org/
- Rolling window functions: https://pandas.pydata.org/docs/reference/window.html#api-functions-rolling 
- Expanding window functions: https://pandas.pydata.org/docs/reference/window.html#expanding-window-functions

## References

[1] J. Hogue. "Metro Interstate Traffic Volume," UCI Machine Learning Repository, 2019. [Online]. Available: https://doi.org/10.24432/C5X60B.

 