# Getting started

1. Create a new GitHub repository, call it something like `time_series_basic` or `time_series_forecasting_basic` etc
2. Clone the repository to a local folder on your workstation
3. Change into the folder
4. Activate the `greyatom` environment: `source activate greyatom`
5. Start jupyter notebook: `jupyter notebook`
6. Create two notebooks
    - `scratchpad.ipynb`
    - `daily_temperature_prediction.ipynb`
7. Add, commit and push the changes to the remote repo

> git add .

> git commit -m 'adds notebooks'

> git push

# What is Time Series Data

* A time series is a series of data points indexed (or listed or graphed) in time order
* Most commonly, a time series is a sequence taken at successive equally spaced points in time
* Time series forecasting is the use of a model to predict future values based on previously observed values
* Time Series Forecasting vs Supervised Machine Learning

An important feature of most time series is that observations close together in time tend to be correlated.

## Some Nomenclature

* Current time is $t$
* Times in the past are negative (relative to the current time): $t_{-1}$, $t_{-2}$
* In order: $\ldots, t_{-2}, t_{-1}, t, t_{1}, t_{2}, \ldots$

### Univariate vs Multivariate

* **Univariate Time Series**: These are datasets where only a single variable is observed at each time, such as temperature each hour. The example in the previous section is a univariate time series dataset.
* **Multivariate Time Series**: These are datasets where two or more variables are observed at each time.

Multivariate data is often more difficult to work with: it is harder to model, and many of the classical methods do not perform well on it.

### Descriptive vs Predictive Analysis

* In **time series analysis**, a time series is modeled to determine its components in terms of
    - seasonal patterns
    - trends
    - relation to external factors
    - baseline
    - noise
    - etc
* In contrast, **time series forecasting** uses the information in a time series (perhaps with additional information) to forecast future values of that series

### Examples of Time Series Forecasting

* Forecasting whether an EEG trace in seconds indicates a patient is having a seizure or not.
* Forecasting the closing price of a stock each day.
* Forecasting the birth rate at all hospitals in a city each year.
* Forecasting product sales in units sold each day for a store.
* Forecasting the number of passengers through a train station each day.

## Typical Pre-processing for Time Series Data

* **Missing**: Perhaps there are gaps or missing data that need to be interpolated or imputed.
* **Outliers**: Perhaps there are corrupt or extreme outlier values that need to be identified and handled.
* **Frequency**: Perhaps data is provided at a frequency that is too high to model or is unevenly spaced through time requiring resampling for use in some models.
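
The three steps above can be sketched with pandas. This is a minimal illustration on made-up daily readings, not a recipe for any particular dataset: the values, the plausible range of 0 to 50, and the 2-day resampling frequency are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with one gap (NaN) and one corrupt spike (500.0)
index = pd.date_range("2020-01-01", periods=8, freq="D")
raw = pd.Series([10.0, 11.0, np.nan, 13.0, 500.0, 15.0, 16.0, 17.0], index=index)

# Missing: fill the gap by linear interpolation between its neighbours
filled = raw.interpolate(method="linear")

# Outliers: blank out values outside an assumed plausible range, then interpolate
cleaned = filled.where(filled.between(0, 50)).interpolate(method="linear")

# Frequency: downsample to one value every 2 days by averaging
resampled = cleaned.resample("2D").mean()

print(resampled)
```

Interpolation is only one way to impute; whether it is appropriate depends on how long the gaps are and on what the series measures.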
  

## Components of Time Series

Time series analysis provides a body of techniques to better understand a dataset. Perhaps the most useful of these is the decomposition of a time series into 4 constituent parts:

* **Level**: The baseline value for the series if it were a straight line.
* **Trend**: The optional and often linear increasing or decreasing behavior of the series over time.
* **Seasonality**: The optional repeating patterns or cycles of behavior over time.
* **Noise**: The optional variability in the observations that cannot be explained by the model.
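
To make the four parts concrete, the sketch below builds a synthetic series by adding them together. It assumes an additive combination and uses made-up numbers (a level of 50, a slope of 0.5 per step, a 7-step cycle); real series may combine their components differently, for example multiplicatively.

```python
import numpy as np

n = 28
t = np.arange(n)

level = 50.0                                   # baseline value
trend = 0.5 * t                                # linear increase over time
seasonality = 5.0 * np.sin(2 * np.pi * t / 7)  # pattern repeating every 7 steps
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, n)                # unexplained variability

# Assumed additive model: the observed series is the sum of its components
series = level + trend + seasonality + noise

# Removing the known components leaves only the noise
residual = series - level - trend - seasonality
print(np.allclose(residual, noise))
```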

---

# Time Series as a Supervised Machine Learning problem

* The Task: given a sequence of numbers for a time series dataset, we have to reformulate the data to look like a supervised machine learning problem
* How: by using previous time steps as input variables, and using the next time step as the output variable.

## Univariate example

| time | measure |
|------|---------|
| 1 |  100 |
| 2 |  110 |
| 3 |  108 |
| 4 |  115 |
| 5 |  120 |

becomes

| X | y |
|------|---------|
| ? |  100 |
| 100 |  110 |
| 110 |  108 |
| 108 |  115 |
| 115 |  120 |
| 120 |  ? |

* We can see that the previous time step is the input (X) and the next time step is the output (y) in our supervised learning problem.
* We can see that the order between the observations is preserved, and must continue to be preserved when using this dataset to train a supervised model.
* We can see that we have no previous value that we can use to predict the first value in the sequence. We will delete this row as we cannot use it.
* We can also see that we do not have a known next value to predict for the last value in the sequence. We may want to delete this value while training our supervised model also.

The use of prior time steps to predict the next time step is called the **sliding window** or **window** method. In statistics and time series analysis, this is called a **lag** or **lag method**. The number of previous time steps is called the window width or size of the lag. 
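
The window method can be sketched as a small helper built on `shift()`. The function name `to_supervised` and the column names `lag_1`, `lag_2`, ... are hypothetical, chosen only for this illustration. The helper drops the rows whose inputs are unknown, and it does not generate the final row with an unknown target.

```python
import pandas as pd

def to_supervised(values, width=1):
    """Frame a univariate sequence as lagged inputs plus a next-step target."""
    series = pd.Series(values, dtype="float64")
    columns = {}
    # Oldest lag first, so columns read left to right in time order
    for lag in range(width, 0, -1):
        columns["lag_%d" % lag] = series.shift(lag)
    columns["y"] = series
    frame = pd.DataFrame(columns)
    # Rows with a missing lag cannot be used for training
    return frame.dropna().reset_index(drop=True)

framed = to_supervised([100, 110, 108, 115, 120], width=1)
print(framed)
```

Calling `to_supervised(values, width=2)` on the same data would use the two previous time steps as inputs, leaving three usable rows.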

## Multivariate example

In the following example, we want to predict `measure2`:

| time | measure1 | measure2 |
|------|----------|----------|
| 1 | 0.2 | 88 |
| 2 | 0.5 | 89 |
| 3 | 0.7 | 87 |
| 4 | 0.4 | 88 |
| 5 | 1.0 | 90 |

becomes

| $X_1$ | $X_2$ | $X_3$ | y |
|-------|-------|-------|---|
| ?   | ?  | 0.2 | 88 |
| 0.2 | 88 | 0.5 | 89 |
| 0.5 | 89 | 0.7 | 87 |
| 0.7 | 87 | 0.4 | 88 |
| 0.4 | 88 | 1.0 | 90 |
| 1.0 | 90 | ?   | ?  |

* What is the value of lag?
* What is the value of window width?
* How was $X_1$ constructed?
* How was $X_2$ constructed?
* How was $X_3$ constructed?

## One-step vs Multi-step

* One-Step Forecast: This is where the next time step (t+1) is predicted.
* Multi-Step Forecast: This is where two or more future time steps are to be predicted.

| time | measure |
|------|---------|
| 1 |  100 |
| 2 |  110 |
| 3 |  108 |
| 4 |  115 |
| 5 |  120 |
| 6 |   ?  |
| 7 |   ?  |

becomes

| X    | $y_1$   | $y_2$   |
|------|---------|---------|
|  ?  |  100 |  110 |
| 100 |  110 |  108 |
| 110 |  108 |  115 |
| 108 |  115 |  120 |
| 115 |  120 |   ?  |
| 120 |   ?  |   ?  |
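
The same reframing can be sketched with `shift()`: a positive shift supplies the input and a negative shift supplies the second target. This is a minimal illustration on the five known values from the table; only rows with no unknown entries are kept.

```python
import pandas as pd

measure = pd.Series([100, 110, 108, 115, 120], dtype="float64")

frame = pd.DataFrame({
    "X": measure.shift(1),    # value at the previous time step
    "y1": measure,            # first step to predict
    "y2": measure.shift(-1),  # second step to predict
})

# Keep only the rows where the input and both targets are known
complete = frame.dropna().reset_index(drop=True)
print(complete)
```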

# Exploring Time Series Data

## The Dataset

* We will use the Daily Female Births Dataset as an example. This dataset describes the number of daily female births in California in 1959.
    - You can find it inside the `data` subfolder
    - You can download it from https://datamarket.com/data/set/235k/daily-total-female-births-in-california-1959#!ds=235k&display=line

![](./images/ts01.png)

## Load Data


In [5]:
from pandas import Series

series = Series.from_csv('./data/female_births.csv', header=0)
print(type(series))
series.head()

<class 'pandas.core.series.Series'>


Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: Daily total female births in California, 1959, dtype: int64

* You can see that each row has an associated date.
    - This is in fact not a column, but a time index for the values.
    - As an index, there can be multiple values for one time, and values may be spaced evenly or unevenly across times.
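* `Series.from_csv()` was deprecated in pandas 0.21 and removed in pandas 1.0, so the cell above only runs on older versions. The main function for loading CSV data in pandas is `read_csv()`. We can use it to load the time series as a Series object, instead of a DataFrame, as follows: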

In [16]:
from pprint import pprint
from pandas import read_csv
# squeeze=True was removed in pandas 2.0; .squeeze('columns') returns the same Series
series = read_csv('./data/female_births.csv',
                  header=0, parse_dates=[0], index_col=0).squeeze('columns')
print(type(series))
print(series.head())

<class 'pandas.core.series.Series'>
Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: Daily total female births in California, 1959, dtype: int64


* **header=0**: We must specify the header information at row 0.
* **parse_dates=[0]**: We give the function a hint that data in the first column contains dates that need to be parsed. This argument takes a list, so we provide it a list of one element, which is the index of the first column.
* **index_col=0**: We give a hint that the first column contains the index information for the time series.
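* **squeeze=True**: We give a hint that we only have one data column and that we are interested in a Series and not a DataFrame. This argument was removed in pandas 2.0; calling `.squeeze('columns')` on the DataFrame returned by `read_csv()` gives the same Series.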
* In this example, the date format has been inferred, and this works in most cases.
    - In those few cases where it does not, specify your own date parsing function and use the `date_parser` argument. pandas 2.0 deprecated `date_parser` in favor of the `date_format` argument.
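
When the date format cannot be inferred, one alternative is to load the column as text and parse it afterwards with an explicit format. The sketch below is an assumption-laden illustration rather than the approach used in this notebook: the two-row CSV and its day/month/year format are made up, and `pd.to_datetime` is used in place of a parser passed to `read_csv()`.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV whose dates are written as day/month/year
csv_text = "Date,Births\n01/02/1959,35\n02/02/1959,32\n"

frame = pd.read_csv(StringIO(csv_text), header=0)

# Parse with an explicit format so that 01/02/1959 is read as 1 February
frame["Date"] = pd.to_datetime(frame["Date"], format="%d/%m/%Y")

series = frame.set_index("Date")["Births"]
print(series)
```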
    
## Explore Data

In [2]:
from pprint import pprint
from pandas import read_csv
# squeeze=True was removed in pandas 2.0; .squeeze('columns') returns the same Series
series = read_csv('../data/female_births.csv',
                  header=0, parse_dates=[0], index_col=0).squeeze('columns')

series.head(10)

Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
1959-01-06    29
1959-01-07    45
1959-01-08    43
1959-01-09    38
1959-01-10    27
Name: Daily total female births in California, 1959, dtype: int64

In [103]:
series.tail()

Date
1959-12-27    37
1959-12-28    52
1959-12-29    48
1959-12-30    55
1959-12-31    50
Name: Daily total female births in California, 1959, dtype: int64

In [104]:
series.shape

(365,)

In [105]:
series.size

365

### Descriptive Statistics

In [106]:
series.describe()

count    365.000000
mean      41.980822
std        7.348257
min       23.000000
25%       37.000000
50%       42.000000
75%       46.000000
max       73.000000
Name: Daily total female births in California, 1959, dtype: float64

In [25]:
series.skew()

18.977587954880505

# Basic Feature Engineering

Time Series data must be re-framed as a supervised learning dataset before we can start using machine learning algorithms. We will look at three classes of features that we can create from our time series dataset:

* **Date Time Features**: these are components of the time step itself for each observation
* **Lag Features**: these are values at prior time steps
* **Window Features**: these are a summary of values over a fixed window of prior time steps
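
As a minimal sketch of a window feature, the snippet below computes the mean of the three previous observations for each time step. The five values are the first temperatures of the dataset introduced below, and the window size of 3 is an arbitrary choice for illustration. Shifting before rolling ensures the window only contains values from earlier time steps.

```python
import pandas as pd

temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])

# Shift by one so the window never includes the value being predicted
shifted = temps.shift(1)

# Mean over the three previous observations; NaN until the window is full
window_mean = shifted.rolling(window=3).mean()

frame = pd.DataFrame({"mean(t-2,t-1,t)": window_mean, "t+1": temps})
print(frame)
```

Other summaries such as `min()`, `max()` or `std()` can be computed over the same rolling window.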

## The Dataset

* This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia. The units are degrees Celsius and there are 3650 observations.
    - You can find it inside the `data` subfolder
    - You can download it from https://datamarket.com/data/set/2324/daily-minimum-temperatures-in-melbourne-australia-1981-1990#!ds=2324&display=line

![](./images/ts03.png)

### Load Data

**Switch to your time-series-forecasting notebook**

In [34]:
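from pandas import read_csv
# squeeze=True was removed in pandas 2.0; .squeeze('columns') returns the same Series
series = read_csv('../data/daily_temp.csv',
                  header=0, parse_dates=[0], index_col=0).squeeze('columns')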
series.head()

Date
1981-01-01    20.7
1981-01-02    17.9
1981-01-03    18.8
1981-01-04    14.6
1981-01-05    15.8
Name: Daily minimum temperatures in Melbourne, Australia, 1981-1990, dtype: object

In [35]:
series = read_csv('../data/daily_temp.csv', 
                  header=0, parse_dates=[0], index_col=0)
series.head()

| Date | Daily minimum temperatures in Melbourne, Australia, 1981-1990 |
|------|------|
| 1981-01-01 | 20.7 |
| 1981-01-02 | 17.9 |
| 1981-01-03 | 18.8 |
| 1981-01-04 | 14.6 |
| 1981-01-05 | 15.8 |


## Adding Date Time Features

Add day, month and year

In [108]:
from pandas import DataFrame
from pandas import read_csv
# Series.from_csv() was removed in pandas 1.0; read_csv() plus .squeeze('columns') returns the same Series
series = read_csv('../data/daily_temp.csv',
                  header=0, parse_dates=[0], index_col=0).squeeze('columns')

dataframe = DataFrame()

# The datetime index exposes year, month and day directly
dataframe['year'] = series.index.year - 1981
dataframe['month'] = series.index.month
dataframe['day'] = series.index.day
dataframe['temperature'] = series.values

print(dataframe.head(5))

   year  month  day temperature
0     0      1    1        20.7
1     0      1    2        17.9
2     0      1    3        18.8
3     0      1    4        14.6
4     0      1    5        15.8


Some more examples (which of the following are going to be useful?):

* Minutes elapsed for the day
* Hour of day
* Business hours or not
* Weekend or not
* Season of the year
* Business quarter of the year
* Daylight savings or not
* Public holiday or not
* Leap year or not
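
A few of these can be derived directly from a pandas datetime index. The sketch below uses a made-up five-day index starting on 1981-01-01 (a Thursday) rather than the real dataset, and the column names are hypothetical.

```python
import pandas as pd

index = pd.date_range("1981-01-01", periods=5, freq="D")
features = pd.DataFrame(index=index)

# dayofweek runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend
features["weekend"] = index.dayofweek >= 5
features["quarter"] = index.quarter
features["leap_year"] = index.is_leap_year

print(features)
```

Features such as public holidays or business hours are not available from the index alone and need an external calendar.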

## Adding Lag features

* The Pandas library provides the `shift()` function to help create these shifted or lag features from a time series dataset.
* Shifting the dataset by 1 creates the `t` column, adding a NaN (unknown) value for the first row.
* The time series dataset without a shift represents the `t+1` column.

In [37]:
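from pandas import DataFrame
from pandas import concat
from pandas import read_csv
# Series.from_csv() was removed in pandas 1.0; read_csv() plus .squeeze('columns') returns the same Series
series = read_csv('../data/daily_temp.csv',
                  header=0, parse_dates=[0], index_col=0).squeeze('columns')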

temps = DataFrame(series.values)
dataframe = concat([temps.shift(1), temps], axis=1)
dataframe.columns = ['t', 't+1']
print(dataframe.head(5))

      t   t+1
0   NaN  20.7
1  20.7  17.9
2  17.9  18.8
3  18.8  14.6
4  14.6  15.8


We can expand the window width and include more lagged features. Let us include the last 3 observed values to predict the value at the next time step.

In [38]:
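from pandas import DataFrame
from pandas import concat
from pandas import read_csv
# Series.from_csv() was removed in pandas 1.0; read_csv() plus .squeeze('columns') returns the same Series
series = read_csv('../data/daily_temp.csv',
                  header=0, parse_dates=[0], index_col=0).squeeze('columns')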

temps = DataFrame(series.values)
dataframe = concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)
dataframe.columns = ['t-2', 't-1', 't', 't+1']
print(dataframe.head(5))

    t-2   t-1     t   t+1
0   NaN   NaN   NaN  20.7
1   NaN   NaN  20.7  17.9
2   NaN  20.7  17.9  18.8
3  20.7  17.9  18.8  14.6
4  17.9  18.8  14.6  15.8


You can include lag values from the past week, month or year as well.
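
A minimal sketch of such longer lags, assuming daily data: shifting by 7 rows gives the value from one week earlier and shifting by 365 rows gives roughly the same day one year earlier (this ignores leap days). The series here is synthetic and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: 400 days of increasing values
index = pd.date_range("1981-01-01", periods=400, freq="D")
temps = pd.Series(np.arange(400, dtype="float64"), index=index)

frame = pd.DataFrame({
    "last_week": temps.shift(7),    # value 7 days earlier
    "last_year": temps.shift(365),  # value 365 days earlier
    "target": temps,
})

# Only rows where every lag is known can be used for training
usable = frame.dropna()
print(len(usable))
```

The longer the lag, the more rows at the start of the dataset are lost to missing values.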

# Task

1. For the entire dataset, add the following features
    - Day of the month
    - Month of the year
    - Year - 1981
    - Day of the year
        * write a custom function which computes day of the year from day of the month and month of year
        * apply the function in list comprehension
    - Add $lag_{1}$, $lag_{2}$, $lag_{3}$, $lag_{4}$, $lag_{5}$ features
2. Split the dataset into two parts
    - $1^{st}$ 9 years (training set)
    - the last (tenth) year (test set)
3. Write a function to fit a model to your training set (return model as an output)
4. Write a function that makes predictions on the test set and measures the model's performance