# __Time series forecasting__

## References

[Hyndman and Athanasopoulos, Forecasting: Principles and Practice](https://otexts.com/fpp2/)

[Stationarity in time series analysis](https://towardsdatascience.com/stationarity-in-time-series-analysis-90c94f27322)

[Trend, Seasonality, Moving Average, Auto Regressive Model : My Journey to Time Series Data with Interactive Code](https://towardsdatascience.com/trend-seasonality-moving-average-auto-regressive-model-my-journey-to-time-series-data-with-edc4c0c8284b)

[4 Common Machine Learning Data Transforms for Time Series Forecasting](https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/)

## __What is a time series?__

Time series data is observed sequentially over time, with each observation associated with a timestamp.  Most commonly these timestamps are evenly spaced.

There is a connection (and hopefully correlation!) between samples, which means that their order matters.

A good time series forecast captures genuine patterns & relationships in the data, without replicating one-off past events that will not recur.
- the aim is to estimate how the sequence of observations will continue into the future

A good forecasting method identifies which components of the time series are changing, and often assumes they will keep changing in the same way over the forecast horizon.

## __Applications__

Given the temporal dimension of our universe, there are numerous applications of time series forecasting. 

A few examples are given below:

- finance - stock prices
- weather & climate - rainfall
- accounting - annual sales, monthly costs
- business - customer churn
- energy - demand, generation, price

The applicability of time series forecasting depends on how predictable the time series is.

## __Is my time series predictable?__

The predictability of future values in a time series depends on:
- how well we understand the variables that contribute
- availability of data
- if the forecast itself affects future values (think of a forecast of economic indicators such as GDP)

A key question in time series is how predictable the time series is, and if the patterns we see are anything more than random noise!

## Characteristics and components of a time series

The figure below shows the quarterly Australian beer production from 1992 to the second quarter of 2010. The blue lines show forecasts for the following two years. The dark shaded region is the 80 % prediction interval and the light shaded region is the 95 % prediction interval.

Take a look at the time series plot below.  What words would you use to describe it?

<img src="../images/australian_beer_production.png" width=700> <br/>
__Australian quarterly beer production with 2 years of forecasts.__ ([Hyndman and Athanasopoulos, Forecasting: Principles and Practice](https://otexts.com/fpp2/))

In time series we use the following to characterize a time series:

- stationarity - how the statistics of a process change over time
- trend - a non-repeating change (inflation, population growth)
- seasonality - a repeating change (weather seasons)
- noise - non systematic, unpredictable

Question to class - is a time series with a trend stationary or non-stationary?

## Exogenous variables

Imagine we are forecasting sales at a store
- it is likely that knowing the sales on the previous day is useful
- it is also likely that knowing other variables such as which store and weather are useful

The last point is an example of using exogenous predictors.  

## Specifying models

For the example of hourly electricity demand $ED$, an exogenous model would look like the following:

$ED = f(\text{current temperature}, \text{strength of economy}, \text{population}, \text{time of day}, \text{day of week}, \text{error})$

A model using only previous values of the target looks like:

$ED_{t+1} = f(ED_t, ED_{t-1}, ED_{t-2}, ED_{t-3}, \ldots, \text{error})$

It is also possible to combine these into a mixed model:

$ED_{t+1} = f(ED_t, \text{current temperature}, \text{time of day}, \text{day of week}, \text{error})$
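As a rough sketch of how the mixed model's inputs could be assembled in pandas, the cell below builds lagged copies of the target with `shift` and adds two calendar features. The demand values and the column names are invented for illustration, and the lowercase `h` frequency alias is used because recent pandas versions prefer it over `H`.

```python
import pandas as pd

# Hypothetical hourly demand values, purely for illustration
demand = pd.Series(
    [30.0, 32.0, 35.0, 33.0, 31.0, 36.0],
    index=pd.date_range("2020-01-01", periods=6, freq="h"),
    name="ED",
)

# Lagged values of the target: ED_t, ED_{t-1}, ED_{t-2}
features = pd.DataFrame({
    "ED_t": demand,
    "ED_t-1": demand.shift(1),
    "ED_t-2": demand.shift(2),
})

# Calendar features derived from the timestamp index
features["hour_of_day"] = features.index.hour
features["day_of_week"] = features.index.dayofweek

# The value one step ahead is what we want to predict
target = demand.shift(-1).rename("ED_t+1")

# Rows with a missing lag or a missing target cannot be used for training
data = features.join(target).dropna()
data
```

Each remaining row pairs what was known at time $t$ with the value observed at $t+1$, which is the shape most regression libraries expect.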

## What data do I have at test time?

When using any features (exogenous or not) it is crucial that this data is always available at the time you make the prediction!

For example, you may not be able to use the previous day's sales if that data only becomes available a week later.

## __Some simple forecasting methods__

The simplest approaches to time series are especially important to understand as a **simple approach should always be used as a baseline**.

Quite often a simple approach will outperform a more complex approach.

Here, we will look at some of the simplest forecasting methods:
- __Drift:__ draw a line between the first and the last value and extrapolate.
- __Mean:__ uses an unconditional, sample mean as a forecast.
- __Naive:__ takes the target value of the last timestamp.
- __Seasonal naive:__ takes the target value of the same time point of the __LAST__ season and uses it as a forecast value.
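The four baselines above can be sketched in a few lines of NumPy. The series `y`, the horizon `h` and the season length `m` are made-up values chosen for illustration, not data from this course.

```python
import numpy as np

# Hypothetical quarterly history covering two full years
y = np.array([10.0, 20.0, 30.0, 40.0, 14.0, 24.0, 34.0, 44.0])
h = 4  # number of steps to forecast
m = 4  # season length (4 quarters per year)

steps = np.arange(1, h + 1)

# Mean: every forecast equals the sample mean of the history
mean_fc = np.full(h, y.mean())

# Naive: every forecast equals the last observed value
naive_fc = np.full(h, y[-1])

# Seasonal naive: repeat the values of the last observed season
seasonal_naive_fc = y[-m:][(steps - 1) % m]

# Drift: extend the line through the first and last observation
slope = (y[-1] - y[0]) / (len(y) - 1)
drift_fc = y[-1] + steps * slope

print(mean_fc, naive_fc, seasonal_naive_fc, drift_fc, sep="\n")
```

Any more complex model should beat these numbers on held-out data before it is worth its extra complexity.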

In [1]:
from IPython.display import HTML, display
display(HTML("<table><tr><td><img src='../images/drift.png' width='600'></td><td><img src='../images/seasonal_naive.png' width='600'></td></tr></table>"))

[Hyndman and Athanasopoulos, Forecasting: Principles and Practice](https://otexts.com/fpp2/)

## __Basic time series functionality in Python__

Below we will look at the objects commonly used in Python for dealing with dates & times.

In [2]:
import pandas as pd
import numpy as np

### __Dates and Times__

### ISO 8601

An international standard for datetimes - [Wikipedia](https://en.wikipedia.org/wiki/ISO_8601).  

A notable feature of the standard is the use of `T` to separate the date & time.

The timestamps below all represent the same moment in UTC (Z = Zulu time = UTC, which coincides with GMT).

```
2019-11-28T09:18:53+00:00

2019-11-28T09:18:53Z

20191128T091853Z
```

We can also represent the same moment in time in another time zone (say Central European Time, which is one hour ahead of UTC):

```
2019-11-28T10:18:53+01:00
```
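A quick way to convince yourself that these strings describe the same instant is to parse them with `pd.Timestamp` and compare the results. This is only a small check of the examples above. The variable names are arbitrary.

```python
import pandas as pd

# Two spellings of the same UTC instant
utc_offset = pd.Timestamp("2019-11-28T09:18:53+00:00")
utc_zulu = pd.Timestamp("2019-11-28T09:18:53Z")

# The same instant written with a +01:00 offset
cet = pd.Timestamp("2019-11-28T10:18:53+01:00")

# Timezone-aware timestamps compare by instant, not by wall-clock time
same_instant = (utc_offset == utc_zulu) and (utc_zulu == cet)
print(same_instant)

# Convert to UTC and write the timestamp back out as an ISO 8601 string
iso_string = cet.tz_convert("UTC").isoformat()
print(iso_string)
```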

*Standard times* are not affected by daylight saving time. Be thankful if your data is stamped in a standard time, or if you can work in one.

Working in local times is the worst case scenario for a programmer.

#### __Timestamps__

In [3]:
# create a date range
rng = pd.date_range('2016 Jul 1', periods = 10, freq = 'D')
rng

DatetimeIndex(['2016-07-01', '2016-07-02', '2016-07-03', '2016-07-04', '2016-07-05', '2016-07-06', '2016-07-07', '2016-07-08', '2016-07-09', '2016-07-10'], dtype='datetime64[ns]', freq='D')

In [4]:
pd.Timestamp('2016-07-10')

Timestamp('2016-07-10 00:00:00')

In [5]:
# you can add more details
pd.Timestamp('2016-07-10 10')

Timestamp('2016-07-10 10:00:00')

In [6]:
# and more
pd.Timestamp('2016-07-10 10:15')

Timestamp('2016-07-10 10:15:00')

In [7]:
# creation of timestamp object variable
t = pd.Timestamp('2016-07-10 10:15')

#### __Time spans__

A period of time, with an associated frequency

In [8]:
pd.Period('2016-01')

Period('2016-01', 'M')

In [9]:
pd.Period('2016-01-01')

Period('2016-01-01', 'D')

In [10]:
pd.Period('2016-01-01 10')

Period('2016-01-01 10:00', 'H')

In [11]:
pd.Period('2016-01-01 10:10')

Period('2016-01-01 10:10', 'T')

In [12]:
pd.Period('2016-01-01 10:10:10')

Period('2016-01-01 10:10:10', 'S')

#### __Time offsets__

In [13]:
pd.Timedelta('1 day')

Timedelta('1 days 00:00:00')

In [14]:
pd.Period('2016-01-01 10:10') + pd.Timedelta('1 day')

Period('2016-01-02 10:10', 'T')

In [15]:
pd.Timestamp('2016-01-01 10:10') + pd.Timedelta('1 day')

Timestamp('2016-01-02 10:10:00')

In [16]:
pd.Timestamp('2016-01-01 10:10') + pd.Timedelta('15 ns')

Timestamp('2016-01-01 10:10:00.000000015')

#### __Frequency settings__

In [17]:
# only business days:
pd.period_range('2016-01-01 10:10', freq = 'B', periods = 10)

PeriodIndex(['2016-01-01', '2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08', '2016-01-11', '2016-01-12', '2016-01-13', '2016-01-14'], dtype='period[B]', freq='B')

In [18]:
# it is possible to combine frequencies. What if you want to advance by 25 hours at each step? What are the 2 ways to do it?
p1 = pd.period_range('2016-01-01 10:10', freq = '25H', periods = 10)

In [19]:
p2 = pd.period_range('2016-01-01 10:10', freq = '1D1H', periods = 10)

In [20]:
p1

PeriodIndex(['2016-01-01 10:00', '2016-01-02 11:00', '2016-01-03 12:00', '2016-01-04 13:00', '2016-01-05 14:00', '2016-01-06 15:00', '2016-01-07 16:00', '2016-01-08 17:00', '2016-01-09 18:00', '2016-01-10 19:00'], dtype='period[25H]', freq='25H')

In [21]:
p2

PeriodIndex(['2016-01-01 10:00', '2016-01-02 11:00', '2016-01-03 12:00', '2016-01-04 13:00', '2016-01-05 14:00', '2016-01-06 15:00', '2016-01-07 16:00', '2016-01-08 17:00', '2016-01-09 18:00', '2016-01-10 19:00'], dtype='period[25H]', freq='25H')

In [22]:
# timestamped data can be converted to period indices with to_period and vice versa with to_timestamp
ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods = 10, freq = 'H'))
ts

2016-07-10 08:00:00    0
2016-07-10 09:00:00    1
2016-07-10 10:00:00    2
2016-07-10 11:00:00    3
2016-07-10 12:00:00    4
2016-07-10 13:00:00    5
2016-07-10 14:00:00    6
2016-07-10 15:00:00    7
2016-07-10 16:00:00    8
2016-07-10 17:00:00    9
Freq: H, dtype: int64

In [23]:
ts_period = ts.to_period()
ts_period

2016-07-10 08:00    0
2016-07-10 09:00    1
2016-07-10 10:00    2
2016-07-10 11:00    3
2016-07-10 12:00    4
2016-07-10 13:00    5
2016-07-10 14:00    6
2016-07-10 15:00    7
2016-07-10 16:00    8
2016-07-10 17:00    9
Freq: H, dtype: int64

In [24]:
ts_period['2016-07-10 08:30':'2016-07-10 11:45'] # we have the concept of overlap with time periods

2016-07-10 08:00    0
2016-07-10 09:00    1
2016-07-10 10:00    2
2016-07-10 11:00    3
Freq: H, dtype: int64

In [25]:
ts['2016-07-10 08:30':'2016-07-10 11:45'] # we have the concept of include with timestamps

2016-07-10 09:00:00    1
2016-07-10 10:00:00    2
2016-07-10 11:00:00    3
Freq: H, dtype: int64

### __Time zone handling__

A frustrating challenge for all programmers:

![](../images/tz.jpg)


In [26]:
rng = pd.date_range('3/6/2012 00:00', periods=15, freq='D')
rng.tz

In [27]:
rng_tz = pd.date_range('3/6/2012 00:00', periods=15, freq='D', tz='Europe/London')
rng_tz.tz

<DstTzInfo 'Europe/London' LMT-1 day, 23:59:00 STD>

In [28]:
from pytz import common_timezones, all_timezones
print(len(common_timezones))
print(len(all_timezones))
print(set(all_timezones) - set(common_timezones))

440
592
{'Mexico/BajaNorte', 'Africa/Asmera', 'W-SU', 'Asia/Dacca', 'US/Michigan', 'GB', 'Asia/Chungking', 'America/Virgin', 'Etc/Zulu', 'Pacific/Yap', 'Etc/GMT-6', 'America/Atka', 'UCT', 'Europe/Belfast', 'Etc/GMT+5', 'America/Mendoza', 'America/Jujuy', 'Asia/Ashkhabad', 'Antarctica/South_Pole', 'Japan', 'Etc/GMT-4', 'Pacific/Truk', 'Australia/LHI', 'Etc/GMT+6', 'Etc/GMT+12', 'Pacific/Samoa', 'America/Fort_Wayne', 'Jamaica', 'Australia/ACT', 'Etc/GMT+3', 'GMT+0', 'Australia/NSW', 'MET', 'Europe/Tiraspol', 'MST7MDT', 'Israel', 'Etc/GMT+9', 'Australia/Victoria', 'GMT-0', 'Etc/GMT-13', 'Australia/Tasmania', 'Etc/GMT-2', 'NZ', 'Hongkong', 'Etc/UCT', 'Canada/Saskatchewan', 'Egypt', 'America/Montreal', 'America/Rosario', 'Pacific/Johnston', 'Atlantic/Jan_Mayen', 'America/Porto_Acre', 'Asia/Calcutta', 'Chile/EasterIsland', 'Asia/Istanbul', 'ROK', 'Etc/GMT', 'GMT0', 'Cuba', 'Kwajalein', 'Iceland', 'Australia/South', 'Etc/GMT-7', 'Etc/GMT+11', 'WET', 'America/Ensenada', 'EST', 'EET', 'CST6CDT'

In [29]:
#localisation of naive timestamp
t_naive = pd.Timestamp('2016-07-10 08:50')
t_naive

Timestamp('2016-07-10 08:50:00')

In [30]:
t = t_naive.tz_localize(tz = 'US/Central')
t

Timestamp('2016-07-10 08:50:00-0500', tz='US/Central')

In [31]:
t.tz_convert('Asia/Tokyo')

Timestamp('2016-07-10 22:50:00+0900', tz='Asia/Tokyo')

In [32]:
#handling of daylight saving
rng = pd.date_range('2016-03-10', periods=10, tz='US/Central')
ts = pd.Series(range(10), index=rng)
ts

2016-03-10 00:00:00-06:00    0
2016-03-11 00:00:00-06:00    1
2016-03-12 00:00:00-06:00    2
2016-03-13 00:00:00-06:00    3
2016-03-14 00:00:00-05:00    4
2016-03-15 00:00:00-05:00    5
2016-03-16 00:00:00-05:00    6
2016-03-17 00:00:00-05:00    7
2016-03-18 00:00:00-05:00    8
2016-03-19 00:00:00-05:00    9
Freq: D, dtype: int64

In [33]:
rng = pd.date_range('2016-03-10', periods=10, tz='utc')
ts = pd.Series(range(10), index=rng)
ts

2016-03-10 00:00:00+00:00    0
2016-03-11 00:00:00+00:00    1
2016-03-12 00:00:00+00:00    2
2016-03-13 00:00:00+00:00    3
2016-03-14 00:00:00+00:00    4
2016-03-15 00:00:00+00:00    5
2016-03-16 00:00:00+00:00    6
2016-03-17 00:00:00+00:00    7
2016-03-18 00:00:00+00:00    8
2016-03-19 00:00:00+00:00    9
Freq: D, dtype: int64

In [34]:
ts.tz_convert('US/Central')

2016-03-09 18:00:00-06:00    0
2016-03-10 18:00:00-06:00    1
2016-03-11 18:00:00-06:00    2
2016-03-12 18:00:00-06:00    3
2016-03-13 19:00:00-05:00    4
2016-03-14 19:00:00-05:00    5
2016-03-15 19:00:00-05:00    6
2016-03-16 19:00:00-05:00    7
2016-03-17 19:00:00-05:00    8
2016-03-18 19:00:00-05:00    9
Freq: D, dtype: int64

In [35]:
#what happens if an hour does not exist due to daylight saving time?
pd.Timestamp('2016-03-13 02:00:00', tz = 'US/Central')

NonExistentTimeError: 2016-03-13 02:00:00
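One way to avoid this error is to create the naive timestamp first and then localise it with the `nonexistent` argument of `tz_localize`, which tells pandas what to do with wall-clock times that were skipped. The sketch below shows two of the available options. Which one is appropriate depends on your data.

```python
import pandas as pd

# 02:00 on this date was skipped in US/Central (clocks jumped from 02:00 to 03:00)
naive = pd.Timestamp("2016-03-13 02:00:00")

# Option 1: move the timestamp forward to the first valid time
shifted = naive.tz_localize("US/Central", nonexistent="shift_forward")
print(shifted)

# Option 2: mark the timestamp as missing
missing = naive.tz_localize("US/Central", nonexistent="NaT")
print(missing)
```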

### __Datetimes in DataFrames__

In [None]:
#Parsing datetime columns
df = pd.DataFrame({'year': [2015, 2016],'month': [2, 3],'day': [4, 5],'hour': [2, 3]})
df

In [None]:
pd.to_datetime(df)

In [None]:
pd.to_datetime(df[['year', 'month', 'day']])

In [None]:
#truncate convenience function
ts = pd.Series(range(10), index=pd.date_range('7/31/2015', freq='M', periods=10))
ts.truncate(before='2015-10-31', after='2015-12-31')

In [None]:
#select by position (.ix has been removed from pandas, use .iloc instead)
ts.iloc[[0, 2, 6]].index

In [None]:
ts.iloc[0:10:2].index

### __Resampling__

In [None]:
rng = pd.date_range('1/1/2011', periods=72, freq='H')
ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [None]:
ts

In [None]:
converted = ts.asfreq('45Min', method='pad')

In [None]:
converted1 = ts.asfreq('45Min', method='backfill')

In [None]:
converted

In [None]:
converted1

In [None]:
converted = ts.asfreq('90Min', method = 'bfill')

In [None]:
ts.resample('D').sum()
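The difference between `asfreq` and `resample` is easier to see on a series with known values than on random data. In the sketch below, `asfreq` simply picks the value found at each new timestamp, while `resample` groups all values in each bin and aggregates them. The hourly counter series is invented for illustration, and the lowercase `h` alias is used because recent pandas versions prefer it over `H`.

```python
import pandas as pd

# Two days of hourly data holding the values 0..47
idx = pd.date_range("2011-01-01", periods=48, freq="h")
hourly = pd.Series(range(48), index=idx)

# asfreq: keep only the observation at each daily timestamp (midnight)
picked = hourly.asfreq("D")

# resample: aggregate all 24 observations that fall in each day
summed = hourly.resample("D").sum()
averaged = hourly.resample("D").mean()

print(picked, summed, averaged, sep="\n\n")
```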

### __Moving window functions__

In [None]:
%matplotlib inline 
import matplotlib.pylab
import numpy as np
import pandas as pd

In [None]:
df = pd.DataFrame(np.random.randn(600, 3), index = pd.date_range('7/1/2016', freq = 'S', periods = 600), columns = ['A', 'B', 'C'])

In [None]:
# the old pd.rolling_mean(df, window = 2) function has been removed from pandas; use df.rolling(...) instead
r = df.rolling(window = 10)
# r.agg, r.apply, r.count, r.exclusions, r.max, r.median, r.name, r.quantile, r.kurt, r.cov, r.corr, r.aggregate, r.std, r.skew, r.sum, r.var
df.plot(style = 'k--')
r.mean().plot(style = 'k')

In [None]:
#Plotting rolling averages per columns
df = pd.DataFrame(np.random.randn(1000, 4), index = pd.date_range('6/6/16', periods = 1000), columns = ['A', 'B', 'C', 'D'])

In [None]:
df.head()

In [None]:
df = df.cumsum()
df.rolling(window = 50).sum().plot(subplots=True)

In [None]:
#apply a custom function to your data using the .apply() method
df.rolling(window = 10).apply(lambda x: np.fabs(x - x.mean()).mean())

In [None]:
#yields the value of the statistic with all the data available up to that point in time
df.expanding(min_periods = 1).mean()[1:5]
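To make the difference between the two window types concrete, here is a minimal comparison on a tiny hand-made series. A rolling window has a fixed width and slides along the data. An expanding window always starts at the first observation and grows.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling: mean of the last 3 values (NaN until 3 values are available)
rolling_mean = s.rolling(window=3).mean()

# Expanding: mean of all values seen so far
expanding_mean = s.expanding(min_periods=1).mean()

print(pd.DataFrame({"value": s, "rolling": rolling_mean, "expanding": expanding_mean}))
```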

## Data transformations

Commonly in data science we apply transformations to data (e.g. standardization or normalization).  To report forecasts in the original units, each transformation needs an inverse.

In time series, transformations are often performed to remove components such as trend or seasonality.

Below we look at two common time series transformations.

#### __Log transformation__

The log transform belongs to the more general family of power transformations (square root, cube root, log, etc).  Power transforms attempt to make data more Gaussian.  In time series the effect is often to remove a change in variance over time.

The log transform is common in finance, as it turns the exponential accumulation of returns into a linear one.

A log transform can bring a process closer to stationarity by turning exponential trends into linear trends and multiplicative seasonality into additive seasonality.

In [None]:
import matplotlib.pyplot as plt

In [None]:
#the airline passengers dataset is a famous one to demonstrate time series forecasts
#we will see it at other occasions throughout this course
airline_passengers = pd.read_csv('../data/airline_passengers.csv')

In [None]:
plt.style.use('ggplot')
airline_passengers.plot(color='blue');

- the trend of this dataset is an exponential increase
- the periodic amplitude also increases over time

In [None]:
airline_passengers_log = np.log10(airline_passengers['Thousands of Passengers'])

In [None]:
plt.style.use('ggplot')
airline_passengers_log.plot(color='blue');

After the log transform
- the trend became almost linear
- the periodic amplitude is now roughly constant
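The effect is easiest to verify on synthetic data. The sketch below uses a made-up series that grows by 5 % per step, checks that its log increases by a constant amount each step, and then inverts the transform to get back to the original units.

```python
import numpy as np
import pandas as pd

# Hypothetical series with 5 % growth per time step
t = np.arange(24)
y = pd.Series(100.0 * 1.05 ** t)

# After the log transform the series is a straight line
y_log = np.log10(y)

# A straight line has constant step-to-step differences
step = y_log.diff().dropna()
print(step.min(), step.max())

# Inverse transform: raise 10 to the power of the logged values
y_back = 10 ** y_log
print(np.allclose(y_back, y))
```

A model fitted on `y_log` produces forecasts on the log scale, so the same inverse has to be applied to its output.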

#### Difference transform

Also known as differencing.  There are various types of differencing:
- first order - subtracting the previous value (t-1) from each value.  This removes a linear trend
- seasonal differencing - subtracting the value from the same point in the previous season, such as yesterday, last week or last year

First order differencing can be repeated to remove higher order trends (for example, differencing twice removes a quadratic trend).

Below we apply a first order difference to our passengers dataset:

In [None]:
diff = airline_passengers.loc[:, 'Thousands of Passengers'].diff(1)

In [None]:
diff.plot(color='blue')

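The differencing transform has removed the growth trend, although the seasonal swings still grow over time because we differenced the raw series rather than the log-transformed one.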
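Like the log transform, differencing can be undone. The sketch below uses small invented series to show first order, second order and seasonal differencing, and recovers the original values from the first differences with a cumulative sum.

```python
import pandas as pd

# Hypothetical series with a quadratic trend
y = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])

# First order difference: y_t - y_{t-1}
d1 = y.diff(1)

# Differencing twice leaves a constant for a quadratic trend
d2 = d1.diff(1)

# Inverse: put the first observation back, then accumulate the differences
restored = d1.fillna(y.iloc[0]).cumsum()

# Seasonal difference with season length 4: y_t - y_{t-4}
seasonal = pd.Series([1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0])
ds = seasonal.diff(4)

print(d1.tolist(), d2.tolist(), restored.tolist(), ds.tolist(), sep="\n")
```

Note that the inverse needs the first observation (or the first full season for seasonal differencing), so keep it around when transforming.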

## Exercise

- load the `co2_mm_mlo.csv` dataset
- create a time stamp column
- is the time series stationary?
- does the time series have a trend?
- does the time series exhibit seasonality?

Based on your visual examination, apply either or both of the transforms we discussed above (log / differencing).

Finally, implement a naive & seasonal naive forecast for this dataset.

In [None]:
data = pd.read_csv('../data/co2_mm_mlo.csv')

In [None]:
data.loc[:, 'average'].plot()