# Getting started

1. Create a new GitHub repository and name it something like `time_series_basic` or `time_series_forecasting_basic`
2. Clone the repository to a local folder on your workstation
3. Change into the folder
4. Activate the `greyatom` environment: `source activate greyatom`
5. Start jupyter notebook: `jupyter notebook`
6. Create two notebooks
    - `scratchpad.ipynb`
    - `daily_temperature_prediction.ipynb`
7. Add, commit and push the changes to the remote repo

> git add .

> git commit -m 'adds notebooks'

> git push

In [5]:
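from pandas import Series

# Note: Series.from_csv() was deprecated in pandas 0.21 and removed in 1.0;
# on newer versions use read_csv() as shown in the next cell
series = Series.from_csv('./data/female_births.csv', header=0)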
print(type(series))
series.head()

<class 'pandas.core.series.Series'>


Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: Daily total female births in California, 1959, dtype: int64

* Each row has an associated date.
    - This is not a column but a time index for the values.
    - Because it is an index, several values can share one timestamp, and values may be spaced evenly or unevenly in time.
* The main function for loading CSV data in Pandas is `read_csv()`. We can use it to load the time series as a Series object instead of a DataFrame, as follows:

In [16]:
from pandas import read_csv
series = read_csv('./data/female_births.csv',
                  header=0, parse_dates=[0], index_col=0, squeeze=True)
print(type(series))
print(series.head())

<class 'pandas.core.series.Series'>
Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: Daily total female births in California, 1959, dtype: int64


* **header=0**: The header information is in row 0.
* **parse_dates=[0]**: Tells the function that the first column contains dates that need to be parsed. The argument takes a list, so we pass a one-element list holding the index of the first column.
* **index_col=0**: Tells the function that the first column holds the index of the time series.
* **squeeze=True**: Tells the function that there is only one data column and that we want a Series, not a DataFrame.
* In this example the date format is inferred, which works in most cases.
    - In the few cases where it does not, write your own date parsing function and pass it through the `date_parser` argument.
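
On recent pandas releases the call above may need adjusting: `squeeze` was removed from `read_csv()` in pandas 2.0, and `date_parser` was deprecated in the same release. The sketch below is one possible equivalent, not the only one. It reads an inline copy of the first rows shown above (a stand-in for `./data/female_births.csv`), and the `%Y-%m-%d` format string is an assumption about how the dates are written in the file.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for ./data/female_births.csv (first rows from the output above)
csv_text = """Date,"Daily total female births in California, 1959"
1959-01-01,35
1959-01-02,32
1959-01-03,30
1959-01-04,31
1959-01-05,44
"""

# Load as a one-column DataFrame indexed by the first column
frame = pd.read_csv(StringIO(csv_text), header=0, index_col=0)

# Parse the index explicitly; the format string is an assumption about the file
frame.index = pd.to_datetime(frame.index, format="%Y-%m-%d")

# DataFrame.squeeze("columns") turns a one-column frame into a Series
births = frame.squeeze("columns")

print(type(births))
print(births.head())
```

Parsing the index with an explicit format fails loudly when a date does not match, which is often preferable to silently keeping the dates as strings.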
    
## Explore Data

In [102]:
from pandas import read_csv
series = read_csv('./data/female_births.csv',
                  header=0, parse_dates=[0], index_col=0, squeeze=True)

series.head(10)

Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
1959-01-06    29
1959-01-07    45
1959-01-08    43
1959-01-09    38
1959-01-10    27
Name: Daily total female births in California, 1959, dtype: int64

In [103]:
series.tail()

Date
1959-12-27    37
1959-12-28    52
1959-12-29    48
1959-12-30    55
1959-12-31    50
Name: Daily total female births in California, 1959, dtype: int64

In [104]:
series.shape

(365,)

In [105]:
series.size

365

### Descriptive Statistics

In [106]:
series.describe()

count    365.000000
mean      41.980822
std        7.348257
min       23.000000
25%       37.000000
50%       42.000000
75%       46.000000
max       73.000000
Name: Daily total female births in California, 1959, dtype: float64

In [25]:
series.skew()

18.977587954880505

### Load Data

**Switch to your time series forecasting notebook (`daily_temperature_prediction.ipynb`)**

In [34]:
series = read_csv('./data/daily_temp.csv', 
                  header=0, parse_dates=[0], index_col=0, squeeze=True)
series.head()

Date
1981-01-01    20.7
1981-01-02    17.9
1981-01-03    18.8
1981-01-04    14.6
1981-01-05    15.8
Name: Daily minimum temperatures in Melbourne, Australia, 1981-1990, dtype: object

In [35]:
series = read_csv('./data/daily_temp.csv', 
                  header=0, parse_dates=[0], index_col=0)
series.head()

| Date | Daily minimum temperatures in Melbourne, Australia, 1981-1990 |
| --- | --- |
| 1981-01-01 | 20.7 |
| 1981-01-02 | 17.9 |
| 1981-01-03 | 18.8 |
| 1981-01-04 | 14.6 |
| 1981-01-05 | 15.8 |
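
The `dtype: object` in the Series output above suggests that pandas did not parse every temperature as a number. Assuming the cause is a handful of malformed entries, a minimal sketch of converting the column with `pandas.to_numeric()` follows. The inline sample, the column name `Temp` and the bad value `?0.2` are hypothetical and only illustrate the idea.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: the third value is deliberately malformed
csv_text = """Date,Temp
1981-01-01,20.7
1981-01-02,17.9
1981-01-03,?0.2
1981-01-04,14.6
"""

raw = pd.read_csv(StringIO(csv_text), header=0, index_col=0, parse_dates=[0])
temps = raw["Temp"]
print(temps.dtype)  # not a numeric dtype, because one entry is not a number

# errors="coerce" replaces entries that cannot be parsed with NaN
clean = pd.to_numeric(temps, errors="coerce")
print(clean.dtype)

# List the rows that failed to parse so they can be inspected or fixed
print(clean[clean.isna()])
```

Whether to drop, correct or interpolate the flagged rows depends on the data, so the sketch only reports them.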


## Adding Date Time Features

Add day, month and year

In [108]:
from pandas import DataFrame
from pandas import read_csv
# Series.from_csv() was removed in pandas 1.0; read_csv() loads the same Series
series = read_csv('./data/daily_temp.csv',
                  header=0, parse_dates=[0], index_col=0, squeeze=True)

dataframe = DataFrame()

# The DatetimeIndex exposes year, month and day directly
dataframe['year'] = series.index.year - 1981  # years elapsed since 1981
dataframe['month'] = series.index.month
dataframe['day'] = series.index.day
dataframe['temperature'] = series.values

print(dataframe.head(5))

   year  month  day temperature
0     0      1    1        20.7
1     0      1    2        17.9
2     0      1    3        18.8
3     0      1    4        14.6
4     0      1    5        15.8


Some more examples of date-time features (which of these would be useful for this dataset?):

* Minutes elapsed for the day
* Hour of day
* Business hours or not
* Weekend or not
* Season of the year
* Business quarter of the year
* Daylight savings or not
* Public holiday or not
* Leap year or not
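
A few of the features listed above can be derived directly from a `DatetimeIndex`. The sketch below uses a hypothetical five-day index rather than the temperature file, and the season mapping assumes southern-hemisphere meteorological seasons, which seems a reasonable fit for Melbourne. Features such as public holidays or business hours need external information and are left out.

```python
import pandas as pd

# Hypothetical daily index; 1981-01-01 was a Thursday
index = pd.date_range("1981-01-01", periods=5, freq="D")

features = pd.DataFrame(index=index)
features["dayofweek"] = index.dayofweek        # Monday=0 ... Sunday=6
features["is_weekend"] = index.dayofweek >= 5  # Saturday or Sunday
features["quarter"] = index.quarter            # calendar quarter, 1 to 4
features["is_leap_year"] = index.is_leap_year

# Assumption: southern-hemisphere meteorological seasons
season_by_month = {
    12: "summer", 1: "summer", 2: "summer",
    3: "autumn", 4: "autumn", 5: "autumn",
    6: "winter", 7: "winter", 8: "winter",
    9: "spring", 10: "spring", 11: "spring",
}
features["season"] = index.month.map(season_by_month)

print(features)
```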

## Adding Lag features

* Pandas provides the `shift()` function to create shifted (lag) features from a time series dataset.
* Shifting the dataset by 1 creates the `t` column and puts a NaN (unknown) value in the first row.
* The unshifted dataset represents the `t+1` column, the value to predict.

In [37]:
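from pandas import DataFrame
from pandas import concat
from pandas import read_csv
# Series.from_csv() was removed in pandas 1.0; read_csv() loads the same Series
series = read_csv('./data/daily_temp.csv',
                  header=0, parse_dates=[0], index_col=0, squeeze=True)

temps = DataFrame(series.values)
# shift(1) moves every value down one row, so column t holds the previous observation
dataframe = concat([temps.shift(1), temps], axis=1)
dataframe.columns = ['t', 't+1']
print(dataframe.head(5))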

      t   t+1
0   NaN  20.7
1  20.7  17.9
2  17.9  18.8
3  18.8  14.6
4  14.6  15.8


We can expand the window width and include more lagged features. Let us include the last 3 observed values to predict the value at the next time step.

In [38]:
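from pandas import DataFrame
from pandas import concat
from pandas import read_csv
# Series.from_csv() was removed in pandas 1.0; read_csv() loads the same Series
series = read_csv('./data/daily_temp.csv',
                  header=0, parse_dates=[0], index_col=0, squeeze=True)

temps = DataFrame(series.values)
# Larger shifts reach further back: shift(3) -> t-2, shift(2) -> t-1, shift(1) -> t
dataframe = concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)
dataframe.columns = ['t-2', 't-1', 't', 't+1']
print(dataframe.head(5))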

    t-2   t-1     t   t+1
0   NaN   NaN   NaN  20.7
1   NaN   NaN  20.7  17.9
2   NaN  20.7  17.9  18.8
3  20.7  17.9  18.8  14.6
4  17.9  18.8  14.6  15.8


You can also include lag values from the past week, month or year.
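
Longer lags follow the same `shift()` pattern. The sketch below wraps it in a small helper, `add_lag_features`, which is a hypothetical name and not part of pandas. It runs on a made-up two-week series so that it does not depend on the data file. The column names `lag_1` and `lag_7` are also illustrative, and `t+1` is kept as the target column to match the tables above.

```python
import pandas as pd


def add_lag_features(series, lags):
    """Return a DataFrame with one column per lag plus the target column 't+1'."""
    columns = {f"lag_{lag}": series.shift(lag) for lag in lags}
    columns["t+1"] = series  # unshifted values are the prediction target
    return pd.DataFrame(columns)


# Hypothetical two-week daily series with values 0.0 to 13.0
index = pd.date_range("1981-01-01", periods=14, freq="D")
temps = pd.Series(range(14), index=index, dtype="float64")

# Lag of one day and of one week; 30 or 365 work the same way on longer series
lagged = add_lag_features(temps, lags=[1, 7])

# Rows without a full history contain NaN and are usually dropped before modelling
lagged = lagged.dropna()
print(lagged.head())
```

Note that the largest lag decides how many leading rows are lost: a 365-day lag removes the first year of observations.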