# Introduction to Time Series

Broadly speaking, time series data are data points gathered over time. The time order is meaningful, and there is typically only one observation per unit of time, so the time uniquely identifies each record. Often, the data points are evenly spaced in time.

Examples of time series data include stock market closing prices, levels of CO2 in the atmosphere, and unemployment rates. pandas has good functionality for manipulating dates, aggregating over different time periods, sampling different periods of time, and more. Let's begin by reading in 20 years of stock market data, putting the 'date' column in the index.

In [None]:
import pandas as pd
stocks = pd.read_csv('../data/stocks/stocks10.csv', parse_dates=['date'], 
                     index_col='date')
stocks.head(3)

## Set the datetime column as the index

If you have time series data where the values of one datetime column uniquely identify each row, then it's best to use this column as the index. pandas provides extra functionality for DataFrames that have a datetime index.
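
As a minimal sketch of the two usual ways to get a datetime index, the block below reads a small inline CSV. The data, the column names `AAA` and `BBB`, and the variable names are made up for illustration and are not part of the stocks dataset.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a file on disk.
csv_text = """date,AAA,BBB
2017-01-03,10.0,20.0
2017-01-04,10.5,19.5
2017-01-05,11.0,19.0
"""

# Option 1: parse the dates and set the index while reading.
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'],
                     index_col='date')

# Option 2: read first, then convert the column and set it as the index.
raw = pd.read_csv(io.StringIO(csv_text))
raw['date'] = pd.to_datetime(raw['date'])
prices_alt = raw.set_index('date')

print(type(prices.index).__name__)
print(type(prices_alt.index).__name__)
```

Either route should leave you with a datetime index; the first is shorter when you already know which column holds the dates.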

### DatetimeIndex

Setting a datetime column as the index creates a `DatetimeIndex`. You can access datetime attributes and methods on it directly, just as you would through the `dt` accessor on a Series. Let's extract it and examine the first five values.

In [None]:
idx = stocks.index
idx[:5]

Let's verify the type of index we have.

In [None]:
type(idx)

Now, let's get the year, month and weekday name directly from this index object. The first five values for each attribute are returned.

In [None]:
idx.year[:5]

In [None]:
idx.month[:5]

In [None]:
idx.day_name()[:5]
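
To make the comparison with the `dt` accessor concrete, here is a small sketch on three made-up dates. The names `demo_idx` and `demo_ser` are hypothetical; the same values are stored once as an index and once as a Series.

```python
import pandas as pd

# Made-up dates used only for illustration.
dates = pd.to_datetime(['2017-01-03', '2017-01-04', '2017-01-05'])

demo_idx = pd.DatetimeIndex(dates)   # attributes are available directly
demo_ser = pd.Series(dates)          # attributes go through the .dt accessor

print(demo_idx.year.tolist())
print(demo_ser.dt.year.tolist())
print(demo_idx.day_name().tolist())
print(demo_ser.dt.day_name().tolist())
```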

## Easy subset selection with a DatetimeIndex

One big advantage of a `DatetimeIndex` is the ability to select subsets of data without using boolean indexing. We can use strings to represent specific datetimes and pass those strings to the `loc` indexer. Here, we select the data for January 5th, 2017.

In [None]:
stocks.loc['2017-1-5']

### Partial string matching to select entire months or years

You can select entire years or months (or other spans of time) by using a string with less precision. In the following example, we select the entire month of February 2017.

In [None]:
stocks.loc['2017-2'].head(3)

Below, we select the entire year 2016.

In [None]:
stocks.loc['2016'].head(3)
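
The following sketch, built on a made-up daily DataFrame (`demo` and the `close` column are hypothetical), suggests how partial string matching relates to the boolean indexing it replaces.

```python
import pandas as pd

# Made-up daily data spanning the end of 2016 and the start of 2017.
idx = pd.date_range('2016-12-30', '2017-03-02', freq='D')
demo = pd.DataFrame({'close': range(len(idx))}, index=idx)

# Partial string: every row in February 2017.
feb_partial = demo.loc['2017-02']

# The equivalent boolean mask, written out by hand.
feb_mask = demo[(demo.index.year == 2017) & (demo.index.month == 2)]

# Partial string: every row in 2016.
year_2016 = demo.loc['2016']

print(len(feb_partial), len(feb_mask), len(year_2016))
```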

### Slicing with partial string matching

Use slice notation to select a specific date range. Below, we select from March 28, 2017 through April 3, 2017. Note that, unlike ordinary Python slices, the stop value is inclusive.

In [None]:
stocks.loc['2017-3-28':'2017-4-3']
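
A small sketch on made-up business-day data (the `demo` DataFrame is hypothetical) shows the inclusive stop value in action.

```python
import pandas as pd

# Made-up business-day data from Monday, March 27 to Wednesday, April 5, 2017.
idx = pd.bdate_range('2017-03-27', '2017-04-05')
demo = pd.DataFrame({'close': range(len(idx))}, index=idx)

# Both endpoints are included in the result.
window = demo.loc['2017-03-28':'2017-04-03']

print(window.index.min(), window.index.max(), len(window))
```

The weekend of April 1 and 2 is absent from the data, so the slice contains five rows.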

## Sampling specific times

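Let's say you are interested in selecting the closing prices for the last day of every year in the dataset. pandas provides the `asfreq` method to do so. You must pass it an **offset alias** as a string. An offset alias determines the frequency at which you would like to sample the time series data. The table below shows the most common offset aliases. For the complete list, see the [offset aliases section of the official documentation][1]. Note that the spelling of some aliases depends on the pandas version: starting with pandas 2.2, `Y`, `Q`, and `M` are deprecated in favor of `YE`, `QE`, and `ME`, and the sub-daily aliases are written in lowercase (`h`, `min`, `s`, `ms`, `us`, `ns`).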

| Alias | Description   | Alias         | Description  |
|:------|:--------------|:--------------|:-------------|
| `Y`   | year end      | `D`           | day          |
| `YS`  | year start    | `H`           | hourly       |
| `Q`   | quarter end   | `T` or `min`  | minutes      |
| `QS`  | quarter start | `S`           | seconds      |
| `M`   | month end     | `L` or `ms`   | milliseconds |
| `MS`  | month start   | `U` or `us`   | microseconds |
| `W`   | weekly        | `N`           | nanoseconds  |

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

In our example, we need the offset alias `'Y'` for the year end frequency. We pass this as a string to the `asfreq` method to return the very last day of each year. Note that `asfreq` only works for DataFrames with a `DatetimeIndex`.

In [None]:
stocks.asfreq('Y').head(8)

### Business offset aliases

This isn't quite what we want because the stock market is open only during the week, and in some years December 31st falls on a weekend. The `asfreq` method returns one row for each date at the given frequency, regardless of whether there is data for that date. Any date that does not appear in the DataFrame is filled with missing values.

Many of the offset aliases above can be prefixed with the character `'B'` to signify a business offset alias. Business offset aliases only consider the weekdays Monday through Friday. Let's change the offset alias to `'BY'` to signify business year end frequency. Using this, we correctly select the last trading day of each year.

In [None]:
stocks.asfreq('BY').head(8)
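
The difference between the two frequencies can be sketched on made-up business-day data. This example passes the offset objects `pd.offsets.YearEnd()` and `pd.offsets.BYearEnd()` instead of alias strings, which is an assumption made here to sidestep the alias spellings that differ between pandas versions; the `demo` DataFrame is hypothetical.

```python
import pandas as pd

# Made-up data with one row per weekday from 2016 through 2018.
idx = pd.bdate_range('2016-01-01', '2018-12-31')
demo = pd.DataFrame({'close': range(len(idx))}, index=idx)

# Calendar year end: December 31, whether or not that date is in the data.
year_end = demo.asfreq(pd.offsets.YearEnd())

# Business year end: the last weekday of each year.
biz_year_end = demo.asfreq(pd.offsets.BYearEnd())

print(year_end)
print(biz_year_end)
```

December 31 fell on a Saturday in 2016 and on a Sunday in 2017, so the calendar year end rows for those years hold missing values, while every business year end row has data.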

### Anchored offset aliases

Let's say we would like to select every Thursday. We'll need to use a slightly different string called an **anchored offset alias**. You can anchor years and quarters to a month, and weeks to a day of the week, by appending a dash and the abbreviation of the anchor to the offset alias. For example, `BY-APR` signifies a business year frequency ending in April. Below, we anchor weeks to Thursday. The default anchor for weeks is Sunday.

In [None]:
stocks.asfreq('W-THU').head()
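
As a quick check of what an anchored weekly alias returns, the sketch below applies `'W-THU'` to made-up business-day data (the `demo` DataFrame is hypothetical) and inspects the weekday names.

```python
import pandas as pd

# Made-up business-day data for January and February 2017.
idx = pd.bdate_range('2017-01-02', '2017-02-28')
demo = pd.DataFrame({'close': range(len(idx))}, index=idx)

# Weekly frequency anchored to Thursday.
thursdays = demo.asfreq('W-THU')

print(thursdays.index.day_name().unique().tolist())
print(len(thursdays))
```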

## Upsampling - Increasing the number of rows

The selections above each return a subset of the original rows, which is called **downsampling**. Instead, we may choose to **upsample** and increase the number of rows, which produces rows consisting entirely of missing values. Both upsampling and downsampling ensure that the resulting rows are evenly spaced in time. Let's return a DataFrame with a single row for each calendar day. Currently, only the trading days are in the dataset.

In [None]:
stocks.asfreq('D').head(7)
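
Here is a minimal sketch of upsampling on two weeks of made-up business-day data (the `demo` DataFrame is hypothetical), showing where the missing values appear.

```python
import pandas as pd

# Made-up data: ten weekdays from January 2 to January 13, 2017.
idx = pd.bdate_range('2017-01-02', '2017-01-13')
demo = pd.DataFrame({'close': range(len(idx))}, index=idx)

# Upsample to one row per calendar day.
daily = demo.asfreq('D')

# The new rows are the ones with missing values.
new_rows = daily[daily['close'].isna()]

print(len(demo), len(daily))
print(new_rows.index.day_name().tolist())
```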

### Use integers in the offset alias

You can provide more precise offsets by placing an integer in front of the offset alias. The integer represents a multiple of the offset alias. For example, '3M' stands for 3 months and '15s' for 15 seconds. To select every 6th Wednesday, we do the following:

In [None]:
stocks.asfreq('6W-WED').head()
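
To see what the integer multiple does, the sketch below applies `'6W-WED'` to a year of made-up daily data (the `demo` DataFrame is hypothetical) and measures the gaps between the selected rows.

```python
import pandas as pd

# Made-up daily data covering 2017.
idx = pd.date_range('2017-01-01', '2017-12-31', freq='D')
demo = pd.DataFrame({'close': range(len(idx))}, index=idx)

# Every 6th Wednesday.
every_6th_wed = demo.asfreq('6W-WED')

# Gaps between consecutive selected dates.
gaps = every_6th_wed.index.to_series().diff().dropna()

print(every_6th_wed.index.day_name().unique().tolist())
print(gaps.unique())
```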

You can also upsample by smaller units than what is present in the index. For instance, '4H' will make a new row for every 4 hours.

In [None]:
stocks.asfreq('4H').head(8)

You can fill in the missing values with the previous or next known values using the `method` parameter, which can be set to either 'ffill' or 'bfill'. Here, we fill the missing values using the previously known value in the column.

In [None]:
stocks.asfreq('4H', method='ffill').head(8)
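
The effect of `method='ffill'` can be sketched on three days of made-up data. This example passes the offset object `pd.offsets.Hour(4)` rather than an alias string, an assumption made here because the hourly alias is spelled differently across pandas versions; the `demo` DataFrame is hypothetical.

```python
import pandas as pd

# Made-up data: one value per day for three days.
idx = pd.date_range('2017-01-02', periods=3, freq='D')
demo = pd.DataFrame({'close': [10.0, 11.0, 12.0]}, index=idx)

four_hours = pd.offsets.Hour(4)

# Without a fill method, the new rows hold missing values.
sparse = demo.asfreq(four_hours)

# With forward fill, each new row takes the last known value.
filled = demo.asfreq(four_hours, method='ffill')

print(sparse.head(7))
print(filled.head(7))
```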

### No duplicates are allowed and dates must be ordered

Upsampling and downsampling only work properly when there are no duplicate dates and when the data is ordered by date. Let's take the employee dataset, which has a datetime column but is definitely not time series data.

In [None]:
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp = emp.set_index('hire_date')
emp.head(3)

If we try to sample it by year (which is meaningless in this dataset), we get an empty DataFrame.

In [None]:
emp.asfreq('Y')

Even if we try to make it more like a time series by sorting the index, the operation will only be successful if there are no duplicate dates. The error tells us that at least one hire date is not unique.

In [None]:
emp = emp.sort_index()
emp.head(3)

In [None]:
emp.asfreq('W')
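
One way to anticipate this failure is to inspect the index before resampling. The sketch below uses a tiny made-up DataFrame (`emp_demo` and the `salary` column are hypothetical, not the employee dataset) and the index attributes `is_monotonic_increasing` and `is_unique`.

```python
import pandas as pd

# Made-up hire dates with one duplicate, sorted to mimic the step above.
hire_dates = pd.to_datetime(['2012-01-05', '2012-01-02', '2012-01-02'])
emp_demo = pd.DataFrame({'salary': [50, 60, 70]}, index=hire_dates).sort_index()

print(emp_demo.index.is_monotonic_increasing)  # ordered after sorting
print(emp_demo.index.is_unique)                # duplicates are still present

# asfreq reindexes onto a regular date range, which fails on duplicates.
try:
    emp_demo.asfreq('D')
    raised = False
except ValueError as err:
    raised = True
    print('asfreq failed:', err)
```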

Selection with partial strings still works.

In [None]:
emp.loc['2012-1':'2012-2'].head()

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the weather time series dataset and place the date column in the index.</span>

### Exercise 2
<span  style="color:green; font-size:16px">What was the temperature on June 11, 2011?</span>

### Exercise 3
<span  style="color:green; font-size:16px">How many days did it rain during the last three months of 2011?</span>

### Exercise 4
<span  style="color:green; font-size:16px">Which year had more snow days, 2007 or 2012?</span>

### Exercise 5
<span  style="color:green; font-size:16px">Select every other Thursday.</span>

### Exercise 6
<span  style="color:green; font-size:16px">Select the first day of each month.</span>