# 2. Intro to Time Series

### Objectives
* Make a web request and retrieve JSON data from the IEX trading API
* Create a DatetimeIndex and use it for easier subset selection
* Learn how to create offset alias strings and pass them to the `asfreq` method to do upsampling/downsampling

## Introduction
Broadly speaking, time series data are simply points of data gathered over time. The time order is meaningful and typically there is only one observation per unit of time. The time will uniquely identify each record. Often, the time is evenly spaced between each data point. 

Examples of time series data include stock market closing prices, levels of CO2 in the atmosphere, and unemployment rates. Pandas has good functionality for manipulating dates, aggregating over different time periods, sampling different periods of time, and more.

# Stock Market Data
There are many tools available to get stock market data. We will use the [IEX developer platform][1], which has an excellent and easy-to-use API to retrieve market data for free (up to 100 requests per second).

### Using the IEX API
The IEX API is fairly straightforward to use and there are several examples that you can view to understand how it works. The base URL of the API is `https://api.iextrading.com/1.0` which can be [seen here in the docs][2]. If you scroll down from the last link, you will see how the API is used. Each **endpoint** is documented. Let's use the [chart endpoint][3].

To retrieve historical stock price data, we append **`/stock/{symbol}/chart/{range}`** to the base URL, replacing `{symbol}` and `{range}` (curly braces included) with the stock symbol and the range of data we want. Go to the docs to view the available ranges.

Let's create our URL:

[1]: https://iextrading.com/developer/
[2]: https://iextrading.com/developer/docs/#endpoints
[3]: https://iextrading.com/developer/docs/#chart

In [None]:
import pandas as pd

In [None]:
url = 'https://api.iextrading.com/1.0/stock/AMZN/chart/5y'
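
The same URL can be assembled from its parts, which helps when you want several symbols or ranges. This is only a sketch: the helper name `chart_url` and the variable `url_example` are made up for illustration, and no request is sent.

```python
BASE_URL = 'https://api.iextrading.com/1.0'

def chart_url(symbol, date_range):
    """Build a chart-endpoint URL for one stock symbol and one range string."""
    return f'{BASE_URL}/stock/{symbol}/chart/{date_range}'

# Same symbol and range as the hand-written URL above
url_example = chart_url('AMZN', '5y')
print(url_example)
```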

### Reading JSON objects
Most APIs will respond with **JSON** data, a standardized format of data that is very similar to a Python dictionary with key-value pairs. This particular JSON data is returned as a list of dictionaries. We can usually read in an API response with the **`read_json`** pandas function by passing it the URL directly.

In [None]:
amzn = pd.read_json(url)
amzn.head()
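
If the API is unreachable, the "list of dictionaries" layout can still be tried out offline. The sketch below feeds `read_json` a small hand-written JSON string. The records and column values are invented for illustration and are not real IEX output.

```python
import io
import pandas as pd

# Made-up records in the list-of-dictionaries layout described above
json_text = '''[
    {"date": "2017-01-03", "close": 10.5, "volume": 1200},
    {"date": "2017-01-04", "close": 10.9, "volume": 1500}
]'''

# Wrapping the text in StringIO lets read_json treat it like a file
sample = pd.read_json(io.StringIO(json_text))
print(sample)
```

Each dictionary becomes one row, and each key becomes a column.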

### Verify data types
The **`read_json`** function helps choose the correct data types for us. It's a good idea to verify that Pandas chose the correct data types with the **`dtypes`** attribute. A common occurrence is for a column that looks like it contains numeric data to be actually kept as a string.

Looking below, the data types seem to all be correct, save for **`label`**, which appears to be just a duplicate of the date column. We are good to continue.

In [None]:
amzn.dtypes
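
When a column does come back as text, it can be converted explicitly. The following sketch uses an invented two-row DataFrame (not the Amazon data) whose `close` column holds numbers stored as strings.

```python
import pandas as pd

# Hypothetical data where both columns were read in as text
raw = pd.DataFrame({
    'date': ['2017-01-03', '2017-01-04'],
    'close': ['10.5', '10.9'],
})
print(raw.dtypes)

# Convert each column to the type it should have had
fixed = raw.assign(
    date=pd.to_datetime(raw['date']),
    close=pd.to_numeric(raw['close']),
)
print(fixed.dtypes)
```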

### Drop some columns
Let's drop the **`label`**, **`unadjustedVolume`**, and **`vwap`** columns to get a smaller DataFrame.

In [None]:
amzn = amzn.drop(columns=['label', 'unadjustedVolume', 'vwap'])
amzn.head()

## Reviewing the `dt` accessor
The Series `dt` accessor gives us extra attributes and methods only available to datetime columns. Let's take a look at some of those again.

In [None]:
date = amzn['date']

In [None]:
date.dt.day_name().head()

In [None]:
date.dt.month.head()

In [None]:
date.dt.is_month_start.head()
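
For a version that runs without the downloaded data, here is the same idea on three hand-picked dates. The Series `dates` is an assumption made up for this sketch.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2017-01-01', '2017-02-01', '2017-03-15']))

names = dates.dt.day_name().tolist()          # weekday name of each date
months = dates.dt.month.tolist()              # month number of each date
starts = dates.dt.is_month_start.tolist()     # True on the first of a month

print(names)
print(months)
print(starts)
```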

# Set the Datetime column in the index
If you have time series data where the values of one datetime column uniquely identify each row, setting that column as the index makes data manipulation easier. Let's do this now.

In [None]:
amzn = amzn.set_index('date')
amzn.head()

## DatetimeIndex
Setting a datetime column as the index creates a `DatetimeIndex`. You can call datetime attributes and methods on it directly, much as you can through the `dt` accessor on a Series. Let's extract it and see a few examples.

In [None]:
idx = amzn.index
idx

In [None]:
type(idx)

In [None]:
idx.year[:5]

In [None]:
idx.month[:5]

In [None]:
idx.day_name()[:5]
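
The sketch below repeats these steps on a tiny invented DataFrame, so the index attributes can be checked by hand. The names `frame` and `sample_idx` are placeholders for this example only.

```python
import pandas as pd

frame = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-03', '2017-01-04', '2017-01-05']),
    'close': [10.5, 10.9, 10.7],
}).set_index('date')

sample_idx = frame.index
print(type(sample_idx).__name__)

# No .dt accessor is needed on the index itself
print(sample_idx.year.tolist())
print(sample_idx.day_name().tolist())
```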

# Easy subset selection with a DatetimeIndex
One big advantage of a `DatetimeIndex` is the ability to select subsets of data without using boolean indexing. We can use strings to represent specific datetimes and pass those strings to the `loc` indexer. Let's see an example of selecting some rows.

In [None]:
# Select January 5th, 2017
amzn.loc['2017-1-5']

### Partial string matching to select entire months or years
You can select entire years or months (or other spans of time) by using a string with less precision than the timestamps in the index.

In [None]:
# select all of January 2017
amzn.loc['2017-1']

In [None]:
# select all of the year 2017
amzn.loc['2017'].head()

## Slicing with partial string matching
Use slice notation to select a specific date range. Note that the stop value is inclusive.

In [None]:
amzn.loc['2017-1-5':'2017-1-17']

In [None]:
# select all of January and February of 2017
amzn.loc['2017-1':'2017-2']
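
Row counts make it easy to confirm what each selection returns. This sketch assumes a made-up Series with one value per calendar day from December 2016 through March 2017, so every count can be verified against a calendar.

```python
import pandas as pd

days = pd.date_range('2016-12-01', '2017-03-31', freq='D')
values = pd.Series(range(len(days)), index=days)

january = values.loc['2017-01']                    # one whole month
year_2017 = values.loc['2017']                     # every 2017 row in the data
window = values.loc['2017-01-05':'2017-01-17']     # both endpoints included
two_months = values.loc['2017-01':'2017-02']       # through the end of February

print(len(january), len(year_2017), len(window), len(two_months))
```

The slice ending in `'2017-02'` runs through the last day of February, because the stop value is inclusive.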

# Sampling Specific Times
Let's say you are interested in selecting the closing prices for the last day of every year in the dataset. Pandas provides the `asfreq` method to do so. You must pass it an **offset alias** as a string. An offset alias determines the frequency at which you would like to sample the time series. The available aliases are listed in the Pandas documentation, which is embedded below as an iframe (an HTML document displayed inside another HTML document).

In [None]:
from IPython.display import IFrame
IFrame('http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases', width=800, height=500)

### Offset aliases
For instance, in our case we might want to use the offset alias 'A' (or equivalently 'Y'), as the table above tells us it is the year end frequency. We pass this as a string to the `asfreq` method to return the very last day of each year.

Note that `asfreq` only works for DataFrames with a `DatetimeIndex`.

In [None]:
amzn.asfreq('A')

This isn't quite what we want because the stock market is open only during the week, and some years end on a weekend. Pandas returns rows filled with missing values for dates that we have no data for.

Let's use the offset alias 'BY' (equivalently 'BA') instead to signify business year end frequency. Now we select the last business day of each year.

In [None]:
amzn.asfreq('BY')
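
The difference between the two frequencies shows up clearly on synthetic data. The sketch below builds a made-up Series with one value per weekday from 2015 through 2017. It passes offset objects (`YearEnd`, `BYearEnd`) instead of alias strings, which is an assumption of this example: the alias spellings for year end have changed between pandas versions, while the objects have not.

```python
import pandas as pd

# One value per weekday; weekends are absent, as in stock data
days = pd.bdate_range('2015-01-01', '2017-12-31')
prices = pd.Series(range(len(days)), index=days, dtype='float64')

# Calendar year end: December 31, whether or not it is a weekday
year_end = prices.asfreq(pd.offsets.YearEnd())

# Business year end: the last weekday of each year
business_year_end = prices.asfreq(pd.offsets.BYearEnd())

print(year_end)
print(business_year_end)
```

December 31, 2016 was a Saturday, so the calendar version has a missing value there, while the business version lands on Friday, December 30.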

### Anchored offset aliases
Let's say we would like to select every Friday. We'll need to use a slightly different string called an **anchored offset alias**. The table for these is right below the offset aliases from above, so just scroll down a bit to see it. The documentation alerts us that, by default, weeks are anchored to Sunday. We change the anchor to Friday with the following.

In [None]:
amzn.asfreq('W-FRI').head()
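
To see why the anchor matters, compare the default weekly frequency with the Friday-anchored one. The Series below is invented for this sketch and holds one value per weekday in January 2017.

```python
import pandas as pd

days = pd.bdate_range('2017-01-02', '2017-01-31')
prices = pd.Series(range(len(days)), index=days, dtype='float64')

sundays = prices.asfreq('W')        # default anchor is Sunday
fridays = prices.asfreq('W-FRI')    # anchor moved to Friday

print(sundays)
print(fridays)
```

Every Sunday row is missing because the data has no weekend values, while every Friday row is filled.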

## Upsampling - Increasing the number of rows
The selections above each return a subset of the original rows. In time series terminology, this is called **downsampling**.

Instead, we may choose to **upsample** and increase the number of rows, which produces new rows filled entirely with missing values. Both upsampling and downsampling ensure that the rows are spaced at even units of time.

Let's return a DataFrame with a single row for each day of the year. Currently, only the trading days are in the dataset.

In [None]:
amzn.asfreq('D').head(14)
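
A small example makes the new rows easy to count. This sketch uses six made-up weekday values, Monday through the following Monday, and upsamples them to calendar days.

```python
import pandas as pd

week = pd.bdate_range('2017-01-02', periods=6)   # Mon Jan 2 through Mon Jan 9
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=week)

daily = prices.asfreq('D')
print(daily)
```

The result has eight rows. The two added rows, Saturday and Sunday, hold missing values.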

### Use integers in the offset alias
You can upsample/downsample in multiples of a unit by placing an integer in front of the offset alias. The integer is the number of units between rows. For example, '3M' stands for 3 months and '15s' for 15 seconds.

To select every 6th Wednesday, we could do the following:

In [None]:
amzn.asfreq('6W-WED').head()
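
The spacing can be checked on invented daily data. The sketch below selects every second Wednesday from a made-up Series covering January and February 2017. The multiple of 2 is chosen only to keep the output short.

```python
import pandas as pd

days = pd.date_range('2017-01-01', '2017-02-28', freq='D')
values = pd.Series(range(len(days)), index=days)

every_other_wed = values.asfreq('2W-WED')
print(every_other_wed)

# Gap between consecutive rows, in days
gaps = every_other_wed.index.to_series().diff().dropna().dt.days.tolist()
print(gaps)
```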

You can also upsample by smaller units than what is present in the index. For instance, '4H' will make a new row for every 4 hours.

In [None]:
amzn.asfreq('4H').head(20)

You can fill in the missing values with the previous or next known values using the `method` parameter which can be set to either 'ffill' or 'bfill'. Here we fill the missing values using the previously known value in the column.

In [None]:
amzn.asfreq('4H', method='ffill').head(20)
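
The two fill methods are easiest to compare side by side. This sketch uses two made-up daily values and upsamples them to 12-hour steps. It passes `pd.offsets.Hour(12)` rather than an alias string, an assumption made here because the spelling of the hourly alias differs between pandas versions.

```python
import pandas as pd

daily = pd.Series([1.0, 2.0], index=pd.to_datetime(['2017-01-02', '2017-01-03']))
half_day = pd.offsets.Hour(12)

gaps = daily.asfreq(half_day)                        # new noon row is missing
forward = daily.asfreq(half_day, method='ffill')     # copy the previous value
backward = daily.asfreq(half_day, method='bfill')    # copy the next value

print(gaps.tolist())
print(forward.tolist())
print(backward.tolist())
```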

## No duplicates are allowed and dates must be ordered
Upsampling/downsampling only works properly when there are no duplicate dates and when the data is ordered. Let's take the employee dataset which has a datetime column, but is definitely not time series data.

In [None]:
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp = emp.set_index('hire_date')
emp.head()

If we try to sample it by week (which is meaningless in this dataset), we get an empty DataFrame.

In [None]:
emp.asfreq('W')

In [None]:
emp = emp.sort_index()
emp.head()

Even if we try and make it more like a time series by sorting the index, the operation will only be successful if there are no duplicate dates. The error tells us that at least one hire date is not unique.

In [None]:
emp.asfreq('W')

Selection with partial string still works.

In [None]:
emp.loc['2012-1':'2012-2']
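
Before calling `asfreq`, you can check both requirements directly on the index. The sketch below uses a small invented Series with one repeated, out-of-order date. Keeping the first row per date is just one possible way to remove duplicates, chosen here for illustration.

```python
import pandas as pd

hires = pd.Series(
    [3, 1, 2, 4],
    index=pd.to_datetime(['2012-02-10', '2012-01-05', '2012-01-05', '2012-03-01']),
)
print(hires.index.is_monotonic_increasing, hires.index.is_unique)

# Sort first, then keep one row per date
ordered = hires.sort_index()
deduped = ordered[~ordered.index.duplicated(keep='first')]

# With a sorted, unique index, asfreq behaves as usual
daily = deduped.asfreq('D')
print(len(daily))

# Partial string selection only needs the sorted index
print(ordered.loc['2012-01'])
```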

# Exercises

## Problem 1
<span  style="color:green; font-size:16px">Read in the weather time series dataset and place the date column in the index.</span>

## Problem 2
<span  style="color:green; font-size:16px">What was the temperature on June 11, 2011?</span>

## Problem 3
<span  style="color:green; font-size:16px">How many days did it rain during the last three months of 2011?</span>

## Problem 4
<span  style="color:green; font-size:16px">Which year had more snow days, 2007 or 2012?</span>

## Problem 5
<span  style="color:green; font-size:16px">Select every other Thursday.</span>

## Problem 6
<span  style="color:green; font-size:16px">Select the first day of each month.</span>

# Extra

### Custom offsets

Pandas has special objects called **offsets** that can be used in place of offset alias strings. Below we create our own custom offset object.

The stock market is actually closed on some Fridays due to holidays so it wouldn't make sense to select every single Friday, but instead only the Fridays that were actual trading days. We have to dig a bit deeper into Pandas and create a custom business day that is aware of the US Federal Holiday Calendar.

In [None]:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

custom_bday = CustomBusinessDay(calendar=USFederalHolidayCalendar(), weekmask='Fri')

In [None]:
type(custom_bday)

The original selection was the following:

In [None]:
orig = amzn.asfreq('W-FRI')
orig.shape

With our custom business day, we removed 6 Fridays.

In [None]:
new = amzn.asfreq(custom_bday)
new.shape
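
The effect of the holiday calendar can be seen without the stock data by generating dates directly. This sketch covers a short window around July 3, 2015, the Friday on which the federal Independence Day holiday was observed that year. The date window and variable names are chosen only for illustration.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# Fridays only, skipping US federal holidays
trading_fridays = CustomBusinessDay(calendar=USFederalHolidayCalendar(), weekmask='Fri')

all_fridays = pd.date_range('2015-06-26', '2015-07-17', freq='W-FRI')
open_fridays = pd.date_range('2015-06-26', '2015-07-17', freq=trading_fridays)

print(all_fridays)
print(open_fridays)
```

The custom offset drops July 3 and keeps the other three Fridays.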

### Creating date ranges
It is possible to create your own equally spaced interval of time with the `date_range` function. It returns a `DatetimeIndex`, which you can set as the index of your own DataFrame or Series.

In [None]:
# create 10 values beginning with January 1, 2012, every 20 seconds
idx = pd.date_range(start='1/1/2012', periods=10, freq='20S')
idx

In [None]:
# make 8 equally spaced periods between two dates
idx = pd.date_range(start='1/1/2012', end='10/1/2012', periods=8)
idx

In [None]:
# Choose the frequency between two dates - here 10 days and 15 seconds
idx = pd.date_range(start='1/1/2012', end='10/1/2012', freq='10D 15s')
idx
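
A generated range becomes useful once it is attached to data. The sketch below puts a 20-second range on a made-up Series named `readings` and then selects a single minute with a partial string. The frequency is passed as `pd.offsets.Second(20)`, an assumption made here to sidestep differences in alias spelling between pandas versions.

```python
import pandas as pd

ticks = pd.date_range(start='2012-01-01', periods=10, freq=pd.offsets.Second(20))
readings = pd.Series(range(10), index=ticks)

# Partial string selection works here too: every row within 00:01
one_minute = readings.loc['2012-01-01 00:01']
print(one_minute)
```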