[Time series] Forecasting
=========

Forecasting involves predicting the behavior of a variable that typically has some **stochastic** element.

A key distinction relates to whether we are forecasting based on **extrapolating a single variable**, or using a **multivariate** approach to improve the forecasting model.

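A key concept is **periodicity** or **seasonality**: periodic, repeating patterns in the data that must be accounted for. A commonly quoted approach is the additive decomposition into seasonal, trend, and residual components, as produced for example by **STL** (Seasonal and Trend decomposition using Loess):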

$y(t) = S(t) + T(t) + R(t)$
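
To make the additive decomposition concrete, here is a minimal sketch that builds a synthetic monthly series from the three components. All numbers (trend slope, seasonal amplitude, noise level, start date) are made-up assumptions for illustration and are not taken from any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical parameters, chosen only for illustration
rng = np.random.default_rng(0)
t = np.arange(48)                        # 4 years of monthly steps

T = 100 + 0.5 * t                        # trend: slow linear increase
S = 10 * np.sin(2 * np.pi * t / 12)      # seasonality: repeats every 12 steps
R = rng.normal(scale=2.0, size=t.size)   # residual: random noise

# y(t) = S(t) + T(t) + R(t)
y = pd.Series(S + T + R,
              index=pd.date_range('2000-01-01', periods=t.size, freq='MS'),
              name='y')
print(y.head())
```

A decomposition method tries to go the other way: given only `y`, recover estimates of `S`, `T` and `R`.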


Terminology:
-----------
 - **Trend**: the long-term (non-seasonal) increase or decrease in the data
 - **Seasonality**: the periodic variation in the data with a fixed, known period (e.g. weekly, monthly, or yearly patterns)
 - **Cyclic**: Patterns that are not of a fixed frequency. An example is the ups and downs of the "business cycle".

Forecasting Tools:
-------
 - Decomposition models
 - Smoothing models
 - **(S)ARIMA models ((Seasonal) Auto-Regressive Integrated Moving Average)**
 - **TBATS**: Trigonometric, Box-Cox transform, ARMA errors, Trend and Seasonal Components. Able to account for multiple seasonalities (e.g. weekly and yearly simultaneously)
 - **NNETAR (Neural Network AutoRegression)**
 - **LSTMs**: typically not used for forecasting a single series, but they work well when a lot of input data is available.
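
As a small illustration of the "smoothing models" entry above, the following sketch implements simple exponential smoothing with plain pandas. The helper name `simple_exp_smoothing_forecast` and the value of `alpha` are assumptions made for this example. It is only the most basic member of the smoothing family, with no trend or seasonal terms.

```python
import pandas as pd

def simple_exp_smoothing_forecast(series, alpha):
    """One-step-ahead forecast from simple exponential smoothing.

    Hypothetical helper for illustration. With adjust=False pandas uses
    the recursion s_0 = x_0, s_t = (1 - alpha) * s_(t-1) + alpha * x_t,
    and the forecast for the next step is the last smoothed value.
    """
    smoothed = series.ewm(alpha=alpha, adjust=False).mean()
    return smoothed.iloc[-1]

demo = pd.Series([10.0, 12.0, 14.0])
forecast = simple_exp_smoothing_forecast(demo, alpha=0.5)
print(forecast)
```

A larger `alpha` puts more weight on recent observations, so the forecast reacts faster but is also noisier.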

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 15})
import urllib.request  # urlretrieve lives in the urllib.request submodule
import os

Plotting Time Series
=====
 - See `plotting.ipynb`

Datetime
======


In [None]:
from datetime import datetime, timedelta

In [None]:
string_date = '2016-05-01 12:00'

# strptime: "string-parse time"
datetime_obj = datetime.strptime(string_date,'%Y-%m-%d %H:%M')

# strftime: "string-format time"
string_date_1 = datetime.strftime(datetime_obj,'%Y-%m-%d %H:%M')

assert string_date == string_date_1, 'Something went wrong.'
string_date_1

In [None]:
tm = datetime.now()

# Translate "now" into "the beginning of today"
start_of_day = tm - timedelta(hours=tm.hour,
                              minutes=tm.minute,
                              seconds=tm.second,
                              microseconds=tm.microsecond)

# This will give you a pandas DatetimeIndex
# with 15 minute time intervals
datelist = pd.date_range(start_of_day, periods=5, freq='15min')

# This will initiate a pandas dataframe with the dates as the index.
df = datelist.to_frame(name='dateTime')
df['dateTime_str'] = df.index.strftime('%Y-%m-%d %H:%M:%S')

# This will rather clumsily add both as columns, with normal integer indices
df2 = pd.DataFrame()
df2['dateTime_str'] = datelist.strftime('%Y-%m-%d %H:%M:%S')
df2['dateTime_dt'] = pd.to_datetime(df2['dateTime_str'])

df2

Dealing with Time zones
--------

We use `pytz` to look up and specify time zones. The cell below lists the European ones.

In [None]:
import pytz
for tz in pytz.all_timezones:
    if 'Europe' not in tz :
        continue
    print(tz)

String data with no timezone info
----
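Say you have a time series whose timestamps carry no timezone info. If you know where the data originates, you can attach the timezone with `tz_localize`. The resulting timezone-aware datetimes are continuous in time, even across a daylight-saving transition where the local clock repeats an hour, as in the example below (the clock in Paris went back from 03:00 to 02:00 on 30 October 2016). Passing `ambiguous='infer'` tells pandas to resolve the repeated hour from the order of the timestamps.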

In [None]:
TIMEZONE='Europe/Paris'
tz_object = pytz.timezone(TIMEZONE)

datelist = ['30.10.2016 01:45',
            '30.10.2016 02:00',
            '30.10.2016 02:15',
            '30.10.2016 02:30',
            '30.10.2016 02:45',
            '30.10.2016 02:00',
            '30.10.2016 02:15',
            '30.10.2016 02:30',
            '30.10.2016 02:45',
            '30.10.2016 03:00',
            '30.10.2016 03:15',
           ]

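# The strings are day-first, so state the format explicitly
datelist_dt = pd.to_datetime(datelist, format='%d.%m.%Y %H:%M')
# The repeated 02:00-02:45 hour marks the end of DST; 'infer' resolves it from the order
datelist_dt.tz_localize(TIMEZONE,ambiguous='infer')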

In [None]:
datelist_dt[0]
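
To check that localizing really makes the timestamps continuous, one can convert them to UTC and look at the step sizes. The sketch below reuses the same eleven timestamps. The variable names are only illustrative, and the step check via `to_series().diff()` is one possible way to do it.

```python
import pandas as pd

local_strings = ['30.10.2016 01:45',
                 '30.10.2016 02:00', '30.10.2016 02:15',
                 '30.10.2016 02:30', '30.10.2016 02:45',
                 '30.10.2016 02:00', '30.10.2016 02:15',
                 '30.10.2016 02:30', '30.10.2016 02:45',
                 '30.10.2016 03:00', '30.10.2016 03:15']

# Naive timestamps: the local clock jumps back by one hour
naive = pd.to_datetime(local_strings, format='%d.%m.%Y %H:%M')

# Attach the timezone, then express everything in UTC
aware = naive.tz_localize('Europe/Paris', ambiguous='infer')
utc = aware.tz_convert('UTC')

# Step between consecutive timestamps, measured in UTC
steps = utc.to_series().diff().dropna()

print('naive monotonic:', naive.is_monotonic_increasing)
print('UTC monotonic:  ', utc.is_monotonic_increasing)
print('distinct steps: ', steps.unique())
```

The naive index goes backwards at the repeated hour, whereas in UTC every step should be the same 15 minutes.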

Rolling Average
------

In [None]:
filename = 'https://raw.githubusercontent.com/antoinecarme/TimeSeriesData/master/fpp2/elecequip.csv'
urllib.request.urlretrieve(filename, os.path.split(filename)[-1])

df = pd.read_csv('elecequip.csv')

fig,ax = plt.subplots(figsize=(16, 6))
ax.plot(df['Index'],df['x'],label='data')
ax.plot(df['Index'],df['x'].rolling(window=12,center=True,min_periods=1).mean(),label='rolling average',color='red')
ax.plot(df['Index'],df['x'].diff().fillna(0),label='first difference')
ax.legend()  # without this call the labels are never shown

ax.set_xlabel('year')
ax.set_ylabel('new orders index')
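
To see what the `rolling` and `diff` calls above do, here is a toy example on five made-up values, using a window of 3 instead of 12 so that the edge behaviour is easy to follow.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# center=True: the window is centered on each point;
# min_periods=1: at the edges, average whatever values are available
centered = s.rolling(window=3, center=True, min_periods=1).mean()

# Default (center=False): the window covers the point and the ones before it
trailing = s.rolling(window=3, min_periods=1).mean()

# First difference; the first entry has no predecessor, so it is filled with 0
steps = s.diff().fillna(0)

print(pd.DataFrame({'data': s, 'centered': centered,
                    'trailing': trailing, 'diff': steps}))
```

The centered version does not lag behind the data, which is why it is the natural choice for a trend estimate. A trailing window only uses past values, so it is the one to use when the average feeds a forecast.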

In [None]:
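# Average the data by calendar month (0 = January), assuming 'Index' is a fractional year.
# Round to integers so floating-point noise does not split one month into several groups.
df['Month'] = ((df['Index'] % 1) * 12).round().astype(int) % 12
df_byMonth = df.groupby('Month').mean()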

fig,ax = plt.subplots(figsize=(8, 6))
ax.plot(df_byMonth.index,df_byMonth['x'])
ax.set_xlabel('Month')
ax.set_ylabel('mean new orders index')
df_byMonth

Seasonal Decompose ("statsmodels" package)
=======

`seasonal_decompose` applies a centered moving average to the data in order to extract the long-term trend, and then (for the additive model) subtracts this trend from the data. The seasonal component is the mean of the detrended data at each position in the period (here, each calendar month). Finally, the residual is what remains after both the trend and the seasonal component are subtracted.
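
The steps described above can be sketched by hand with pandas. This is a simplified illustration on synthetic data, not the statsmodels implementation: the function name `naive_additive_decompose` is made up, and the handling of the series edges and of the even-length window is an assumption that differs from what statsmodels does.

```python
import numpy as np
import pandas as pd

def naive_additive_decompose(series, period):
    """Hand-rolled additive decomposition (illustrative sketch only)."""
    # 1. Trend: centered moving average over one full period
    trend = series.rolling(window=period, center=True, min_periods=1).mean()
    detrended = series - trend

    # 2. Seasonal: mean of the detrended data at each position in the period,
    #    shifted so that the seasonal component averages to zero
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.loc[position].to_numpy(),
                         index=series.index)

    # 3. Residual: whatever is left
    resid = series - trend - seasonal
    return trend, seasonal, resid

# Synthetic test series with made-up parameters: line + sine + noise
rng = np.random.default_rng(1)
t = np.arange(120)
true_seasonal = 10 * np.sin(2 * np.pi * t / 12)
y = pd.Series(0.5 * t + true_seasonal + rng.normal(scale=1.0, size=t.size))

trend, seasonal, resid = naive_additive_decompose(y, period=12)
print(seasonal[:12].round(2))
```

By construction the three parts add back up to the original series, and the seasonal part repeats exactly every `period` steps.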

In [None]:
def runStatsModels() :
    from statsmodels.tsa.seasonal import seasonal_decompose

    result = seasonal_decompose(df['x'], model='additive', period=12)

    fig, (ax1,ax2,ax3,ax4) = plt.subplots(4,1, figsize=(10,20))
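    ax1.plot(result.observed); ax1.set_xlabel('observed')
    # Compare the extracted trend with the plain centered rolling average
    ax2.plot(result.trend); ax2.set_xlabel('trend')
    ax2.plot(df['x'].rolling(window=12,center=True,min_periods=1).mean())
    ax3.plot(result.resid); ax3.set_xlabel('resid')
    # Show two periods of the seasonal component next to the monthly means;
    # 95 is a hand-picked offset that shifts the monthly means towards zero
    ax4.plot(result.seasonal[:24]); ax4.set_xlabel('seasonal')
    ax4.plot(df_byMonth['x']-95)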

#runStatsModels()

Autocorrelation
=========

See also the Hida Challenge description.

In [None]:
fig,ax = plt.subplots(figsize=(16, 6))
pd.plotting.autocorrelation_plot(df['x'],ax=ax,label='autocorrelation of data')
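
The plot above shows the autocorrelation as a function of the lag. For a single lag, pandas offers `Series.autocorr`, which computes the Pearson correlation between the series and a shifted copy of itself. The sketch below uses a synthetic series with period 12 (an assumption for illustration, not the notebook's data) to show what a seasonal signal looks like in these numbers.

```python
import numpy as np
import pandas as pd

# Synthetic, noise-free series that repeats every 12 steps
t = np.arange(120)
seasonal_series = pd.Series(np.sin(2 * np.pi * t / 12))

# One full period later the series looks the same ...
lag12 = seasonal_series.autocorr(lag=12)
# ... half a period later it is mirrored
lag6 = seasonal_series.autocorr(lag=6)

print('lag 12:', lag12)
print('lag  6:', lag6)
```

On real data, peaks in the autocorrelation at multiples of some lag are a hint that the series has a seasonality with that period.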

Bibliography
-------

 - Hyndman & Athanasopoulos, **Forecasting: Principles and Practice**, https://otexts.com/fpp2/
 - Davide Burba, https://towardsdatascience.com/an-overview-of-time-series-forecasting-models-a2fa7a358fcb