# 3. Grouping by Time

### Objectives
* Group by time with **`resample`**
* Use offset aliases to determine amount of time
* Use the **`rolling`** method to calculate moving window statistics

## Introduction
In previous notebooks, we learned how to downsample/upsample time series data. In this notebook, we will group spans of time together to get a result. For instance, we can find out the number of up or down days for a stock within each trading month, or calculate the number of flights per day for an airline.

# Grouping by time
Pandas gives you the ability to group by a period of time. A concrete example using the Amazon closing stock data makes this clearer. Note that the date is set as the index.

In [1]:
import pandas as pd

url = 'https://api.iextrading.com/1.0/stock/AMZN/chart/5y'
amzn = pd.read_json(url)
amzn = amzn.set_index('date')
amzn.head()

date,change,changeOverTime,changePercent,close,high,label,low,open,unadjustedVolume,volume,vwap
2013-12-23,0.72,0.0,0.179,402.92,405.0,"Dec 23, 13",399.2,403.69,2661823,2661823,402.2857
2013-12-24,-3.72,-0.009233,-0.923,399.2,403.7249,"Dec 24, 13",396.37,402.52,1380373,1380373,399.4538
2013-12-26,5.19,0.003648,1.3,404.39,404.52,"Dec 26, 13",396.81,401.79,1871590,1871590,401.8815
2013-12-27,-6.31,-0.012012,-1.56,398.08,405.63,"Dec 27, 13",396.25,404.65,1987280,1987280,399.9122
2013-12-30,-4.71,-0.023702,-1.183,393.37,399.92,"Dec 30, 13",392.45,399.41,2487812,2487812,394.7291


### Find the average closing price of Amazon for every month
If we are interested in finding the average closing price of Amazon for every month, then we need to group by month and aggregate the closing price with the mean function.

### Grouping column, aggregating column, and aggregating method
This procedure is very similar to how we grouped and aggregated columns in previous notebooks. The only difference is that our **grouping column** will now be a datetime column with an additional specification for the amount of time.

### Use the `resample` method
Instead of the **`groupby`** method, we use a special method for grouping time together called **`resample`**. We must pass the **`resample`** method an offset alias string. The rest of the process is exactly the same as with the **`groupby`** method. We call the **`agg`** method and pass it a dictionary mapping the **aggregating columns** to the **aggregating functions**.
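
To make the parallel with **`groupby`** concrete, the sketch below groups a small synthetic DataFrame by week in two ways: with **`resample`** and with **`groupby`** plus `pd.Grouper`. The `prices` name and its values are made up for illustration and are not the Amazon data. Under these assumptions the two results should agree.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices; the values are invented for illustration only.
dates = pd.date_range('2014-01-01', periods=28, freq='D')
prices = pd.DataFrame({'close': np.arange(28, dtype='float64')}, index=dates)

# Time-based grouping with resample.
by_resample = prices.resample('W').agg({'close': 'mean'})

# The same grouping spelled out with groupby and a time-based Grouper.
by_groupby = prices.groupby(pd.Grouper(freq='W')).agg({'close': 'mean'})

# Compare the labels and the aggregated values of the two results.
same_labels = list(by_resample.index) == list(by_groupby.index)
same_values = np.allclose(by_resample['close'].values, by_groupby['close'].values)
print(same_labels, same_values)
```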

### `resample` syntax
The first parameter we pass to **`resample`** is the offset alias. Here, we choose to group by month.

In [2]:
amzn.resample('M').agg({'close': 'mean'}).head(10)

date,close
2013-12-31,399.458333
2014-01-31,394.863333
2014-02-28,354.336842
2014-03-31,362.630238
2014-04-30,321.64119
2014-05-31,302.953571
2014-06-30,324.573333
2014-07-31,339.919364
2014-08-31,327.331905
2014-09-30,330.311905
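
The `'M'` alias above reflects the pandas version this notebook was written with. Newer releases appear to prefer `'ME'` for month-end grouping and warn about `'M'`, so the sketch below picks whichever alias the installed version accepts. The helper `month_end_alias` and the `toy` DataFrame are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

def month_end_alias():
    """Return 'ME' if this pandas version understands it, otherwise 'M'."""
    try:
        to_offset('ME')
        return 'ME'
    except ValueError:
        # Older versions do not know 'ME' and raise ValueError.
        return 'M'

# Synthetic data: the values 1..59 on the days of January and February 2014.
dates = pd.date_range('2014-01-01', '2014-02-28', freq='D')
toy = pd.DataFrame({'close': np.arange(1, len(dates) + 1, dtype='float64')},
                   index=dates)

# Group by month and take the mean closing value of each month.
monthly = toy.resample(month_end_alias()).agg({'close': 'mean'})
print(monthly)
```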


### Use any number of aggregation functions
Map the aggregating column to a list of aggregating functions.

In [3]:
amzn.resample('M').agg({'close': ['size', 'min', 'mean', 'max']}).head(10)

,close,close,close,close
date,size,min,mean,max
2013-12-31,6,393.37,399.458333,404.39
2014-01-31,21,358.69,394.863333,407.05
2014-02-28,19,346.15,354.336842,362.1
2014-03-31,21,336.365,362.630238,378.77
2014-04-30,21,296.58,321.64119,342.99
2014-05-31,21,288.32,302.953571,313.78
2014-06-30,21,306.78,324.573333,335.2
2014-07-31,22,312.99,339.919364,360.84
2014-08-31,21,307.06,327.331905,343.18
2014-09-30,21,321.82,330.311905,346.38


## Offset Aliases iframe
The offset aliases are again embedded in the notebook as an iframe.

In [4]:
from IPython.display import IFrame
IFrame('http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases', width=800, height=500)

### Group by Quarter

In [5]:
amzn.resample('Q').agg({'close': ['size', 'min', 'mean', 'max']}).head()

,close,close,close,close
date,size,min,mean,max
2013-12-31,6,393.37,399.458333,404.39
2014-03-31,61,336.365,371.143689,407.05
2014-06-30,63,288.32,316.389365,342.99
2014-09-30,64,307.06,332.636656,360.84
2014-12-31,64,287.06,311.590703,338.64


### Label as the entire Period
Notice how the end date of each month or quarter is used as the returned index label for the time period. We can change the index labels so that they show just the time period we are aggregating over by setting the `kind` parameter to `'period'`.

In [6]:
amzn_period = amzn.resample('Q', kind='period').agg({'close': ['size', 'min', 'mean', 'max']})
amzn_period

,close,close,close,close
date,size,min,mean,max
2013Q4,6,393.37,399.458333,404.39
2014Q1,61,336.365,371.143689,407.05
2014Q2,63,288.32,316.389365,342.99
2014Q3,64,307.06,332.636656,360.84
2014Q4,64,287.06,311.590703,338.64
2015Q1,61,286.95,351.658361,387.83
2015Q2,63,370.255,418.003254,445.99
2015Q3,64,429.7,505.616094,548.39
2015Q4,64,520.72,630.406719,693.97
2016Q1,61,482.07,567.619672,636.99


## The PeriodIndex
We no longer have a DatetimeIndex. Pandas has a completely separate type of object for this called the **PeriodIndex**. The index label '2016Q1' refers to the entire period of the first quarter of 2016. Let's inspect the index to see the new type.

In [7]:
amzn_period.index

PeriodIndex(['2013Q4', '2014Q1', '2014Q2', '2014Q3', '2014Q4', '2015Q1',
             '2015Q2', '2015Q3', '2015Q4', '2016Q1', '2016Q2', '2016Q3',
             '2016Q4', '2017Q1', '2017Q2', '2017Q3', '2017Q4', '2018Q1',
             '2018Q2', '2018Q3', '2018Q4'],
            dtype='period[Q-DEC]', name='date', freq='Q-DEC')
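
A PeriodIndex can also be produced without the `kind` parameter, which newer pandas releases appear to discourage. The sketch below is one possible alternative on synthetic data (the `toy` DataFrame is made up for illustration): it resamples by quarter start with `'QS'` and then converts the resulting labels with `to_period`.

```python
import numpy as np
import pandas as pd

# Synthetic daily data covering the first two quarters of 2014.
dates = pd.date_range('2014-01-01', '2014-06-30', freq='D')
toy = pd.DataFrame({'close': np.arange(len(dates), dtype='float64')}, index=dates)

# Group by quarter; 'QS' labels each group with the first day of the quarter.
quarterly = toy.resample('QS').agg({'close': ['count', 'mean']})

# Swap the timestamp labels for quarterly Period labels.
quarterly.index = quarterly.index.to_period('Q')
print(quarterly.index)
```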

## The Period data type
Pandas also has a completely separate data type called a **Period** to represent **columns** of data in a DataFrame that are specific **periods of time**. This is directly analogous to the PeriodIndex, but for DataFrame columns. Examples of a Period are the entire month of June 2014, or the entire 15-minute period from June 12, 2014 5:15 to June 12, 2014 5:30.
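
The two examples just mentioned can be built directly as individual Period objects. This is a minimal sketch; the variable names and the `moment` timestamp are chosen only for illustration.

```python
import pandas as pd

# The entire month of June 2014.
june = pd.Period('2014-06', freq='M')
print(june.start_time, june.end_time)

# The 15-minute span beginning at 5:15 on June 12, 2014.
quarter_hour = pd.Period('2014-06-12 05:15', freq='15min')

# A timestamp falls inside a Period when it lies between its start and end.
moment = pd.Timestamp('2014-06-12 05:20')
inside = quarter_hour.start_time <= moment <= quarter_hour.end_time
print(inside)
```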

### Convert a datetime column to a Period
We can use the `to_period` method, available through the `dt` accessor, to convert datetimes to Period data types. You must pass it an offset alias to denote the length of the time period. Let's convert the `date` column in the weather dataset to a monthly Period column.

In [8]:
weather = pd.read_csv('../data/weather.csv', parse_dates=['date'])
weather.head()

,date,rain,snow,temperature
0,2007-01-01,Yes,No,68.0
1,2007-01-02,No,No,55.9
2,2007-01-03,No,No,62.1
3,2007-01-04,No,No,69.1
4,2007-01-05,Yes,No,72.0


Let's make the conversion from datetime to period and assign the result as a new column in the DataFrame.

In [9]:
weather['date_period'] = weather['date'].dt.to_period('M')
weather.head()

,date,rain,snow,temperature,date_period
0,2007-01-01,Yes,No,68.0,2007-01
1,2007-01-02,No,No,55.9,2007-01
2,2007-01-03,No,No,62.1,2007-01
3,2007-01-04,No,No,69.1,2007-01
4,2007-01-05,Yes,No,72.0,2007-01


### Why is the data type "object"?
Unfortunately, the version of Pandas used here doesn't explicitly label the Period object as such when outputting the data types (newer releases report a dedicated `period[M]` data type instead). But if we inspect an individual element, we see that it is indeed a Period object.

In [10]:
weather.dtypes

date           datetime64[ns]
rain                   object
snow                   object
temperature           float64
date_period            object
dtype: object

Inspecting each individual element.

In [11]:
weather.loc[0, 'date_period']

Period('2007-01', 'M')
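
Because the reported data type depends on the pandas version, a more portable check is to test the elements themselves. The sketch below does this on a small made-up Series rather than the weather data.

```python
import pandas as pd

# A few synthetic dates converted to monthly Periods.
dates = pd.Series(pd.to_datetime(['2007-01-01', '2007-01-02', '2007-02-15']))
periods = dates.dt.to_period('M')

# Check every element rather than relying on the printed dtype.
all_periods = all(isinstance(value, pd.Period) for value in periods)
print(periods.dtype, all_periods)
```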

### The `dt` accessor works for Period columns
Even though it is technically labeled as object, Pandas still has attributes and methods specific to periods.

In [12]:
weather['date_period'].dt.month.head()

0    1
1    1
2    1
3    1
4    1
Name: date_period, dtype: int64

In [14]:
# Return the span of time
weather['date_period'].dt.freq

<MonthEnd>
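
A Period column can also serve as an ordinary grouping column with **`groupby`**, which gives another way to group by time. The sketch below assumes a tiny made-up DataFrame, `toy_weather`, shaped like the weather data; its values are invented.

```python
import pandas as pd

# Synthetic stand-in for the weather data.
toy_weather = pd.DataFrame({
    'date': pd.to_datetime(['2007-01-01', '2007-01-02',
                            '2007-02-01', '2007-02-02']),
    'temperature': [68.0, 56.0, 60.0, 70.0],
})

# Create the monthly Period column, then group by it.
toy_weather['date_period'] = toy_weather['date'].dt.to_period('M')
monthly_temp = toy_weather.groupby('date_period')['temperature'].mean()
print(monthly_temp)
```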

# Anchored offsets
By default, when grouping by week, Pandas chooses to end the week on Sunday. Let's verify this by grouping by week and taking the resulting index label and determining its weekday name.

In [15]:
week_mean = amzn.resample('W').agg({'close': ['size', 'min', 'mean', 'max']})
week_mean.head()

,close,close,close,close
date,size,min,mean,max
2013-12-29,4,398.08,401.1475,404.39
2014-01-05,4,393.37,396.6425,398.79
2014-01-12,5,393.63,398.45,401.92
2014-01-19,5,390.98,395.96,399.61
2014-01-26,4,387.6,399.765,407.05


In [16]:
week_mean.index[0].day_name()

'Sunday'

### Anchor by a different day
You can anchor the week to any day you choose by appending a dash and then the first three letters of the day of the week. Let's anchor the week to Wednesday.

In [17]:
amzn.resample('W-WED').agg({'close': ['size', 'min', 'mean', 'max']}).head()

,close,close,close,close
date,size,min,mean,max
2013-12-25,2,399.2,401.06,402.92
2014-01-01,4,393.37,398.6575,404.39
2014-01-08,5,393.63,397.598,401.92
2014-01-15,5,390.98,396.612,401.01
2014-01-22,4,395.8,401.75,407.05
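
The same weekday check used for the default anchor can be repeated for `'W-WED'`. The sketch below uses a made-up Series of ones over 14 consecutive days starting on Wednesday, January 1, 2014, so that summing each group counts the days that fall in it.

```python
import pandas as pd

# Fourteen consecutive days of synthetic data, each with the value 1.
dates = pd.date_range('2014-01-01', periods=14, freq='D')
ones = pd.Series(1, index=dates)

# Weeks now end on Wednesday instead of Sunday.
weekly = ones.resample('W-WED').sum()

# Every index label should fall on a Wednesday.
label_days = [label.day_name() for label in weekly.index]
print(label_days)
print(weekly.tolist())
```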


### Longer intervals of time with numbers appended to offset aliases
We can actually add more details to our offset aliases by using a number to specify an amount of that particular offset alias. For instance, **`5M`** will group in 5 month intervals.

In [18]:
amzn.resample('5M').agg({'close': ['size', 'min', 'mean', 'max']}).head()

,close,close,close,close
date,size,min,mean,max
2013-12-31,6,393.37,399.458333,404.39
2014-05-31,103,288.32,347.148107,407.05
2014-10-31,108,287.06,325.9095,360.84
2015-03-31,102,286.95,336.269853,387.83
2015-08-31,106,370.255,450.50533,537.01


Group by every 22 weeks anchored to Thursday.

In [19]:
amzn.resample('22W-THU').agg({'close': ['size', 'min', 'mean', 'max']}).head()

,close,close,close,close
date,size,min,mean,max
2013-12-26,3,399.2,402.17,404.39
2014-05-29,105,288.32,348.894714,407.05
2014-10-30,108,287.06,325.975148,360.84
2015-04-02,105,286.95,336.642762,387.83
2015-09-03,107,374.41,453.484766,537.01


# Calling `resample` on a datetime column
The `resample` method can still work without a DatetimeIndex. If there is a column that is of the datetime data type, you can use the `on` parameter to specify that column. Let's reset the index and then call `resample` on that DataFrame.

In [20]:
amzn_reset = amzn.reset_index()
amzn_reset.head()

,date,change,changeOverTime,changePercent,close,high,label,low,open,unadjustedVolume,volume,vwap
0,2013-12-23,0.72,0.0,0.179,402.92,405.0,"Dec 23, 13",399.2,403.69,2661823,2661823,402.2857
1,2013-12-24,-3.72,-0.009233,-0.923,399.2,403.7249,"Dec 24, 13",396.37,402.52,1380373,1380373,399.4538
2,2013-12-26,5.19,0.003648,1.3,404.39,404.52,"Dec 26, 13",396.81,401.79,1871590,1871590,401.8815
3,2013-12-27,-6.31,-0.012012,-1.56,398.08,405.63,"Dec 27, 13",396.25,404.65,1987280,1987280,399.9122
4,2013-12-30,-4.71,-0.023702,-1.183,393.37,399.92,"Dec 30, 13",392.45,399.41,2487812,2487812,394.7291


The only difference is that we specify the grouping column with the `on` parameter. The result is exactly the same.

In [21]:
amzn_reset.resample('W-WED', on='date').agg({'close': ['size', 'min', 'mean', 'max']}).head()

,close,close,close,close
date,size,min,mean,max
2013-12-25,2,399.2,401.06,402.92
2014-01-01,4,393.37,398.6575,404.39
2014-01-08,5,393.63,397.598,401.92
2014-01-15,5,390.98,396.612,401.01
2014-01-22,4,395.8,401.75,407.05


# Exercises

## Problem 1
<span  style="color:green; font-size:16px">Read in stock data for Apple (AAPL) for the last 5 years. Set the date as the index and keep just the closing price and the volume columns.</span>

In [34]:
# This cell reuses the `amzn` DataFrame loaded earlier; AAPL data is not read in here
cols = ['close', 'volume']
df = amzn[cols]
df.head()

date,close,volume
2013-12-23,402.92,2661823
2013-12-24,399.2,1380373
2013-12-26,404.39,1871590
2013-12-27,398.08,1987280
2013-12-30,393.37,2487812


## Problem 2
<span  style="color:green; font-size:16px">In which week did AAPL have the greatest number of its shares traded?</span>

In [39]:
vol = df.resample('W', kind='period').agg({'volume':'sum'})
vol.max()

volume    52444626
dtype: int64

In [40]:
vol.idxmax()

volume   2018-02-05/2018-02-11
dtype: object

In [44]:
# another way to do it, in a single expression
df.resample('W', kind='period').agg({'volume':'sum'}).idxmax()

volume   2018-02-05/2018-02-11
dtype: object

## Problem 3
<span  style="color:green; font-size:16px">With help from the `diff` method, find the quarter containing the greatest number of up days.</span>

In [50]:
up_days = df['close'].diff() > 0

In [51]:
up_days.head()

date
2013-12-23    False
2013-12-24    False
2013-12-26     True
2013-12-27    False
2013-12-30    False
Name: close, dtype: bool

In [64]:
up_days.resample('Q', kind='period').sum().head()

date
2013Q4     2.0
2014Q1    27.0
2014Q2    32.0
2014Q3    37.0
2014Q4    35.0
Freq: Q-DEC, Name: close, dtype: float64

In [65]:
up_days.resample('Q', kind='period').sum().idxmax()

Period('2016Q3', 'Q-DEC')

## Problem 4
<span  style="color:green; font-size:16px">Find the mean price per year along with the minimum and maximum volume. Have the label for each row be the first day of the year.</span>

In [80]:
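# kind='period' labels each row with the year itself; resampling with 'YS' (year start) would label rows by the first day of the year
df.resample('Y', kind='period').agg({'close':'mean','volume':['min','max']})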

,close,volume,volume
date,mean,min,max
2013,399.458333,1380373,2661823
2014,332.550976,1518107,19805911
2015,478.138194,1092970,23856060
2016,699.523135,1458834,14677550
2017,968.167012,1585054,16565021
2018,1645.596667,2115639,14963783


## Problem 5
<span  style="color:green; font-size:16px">Execute the cell below exactly as it is to read in the employee dataset. Then use `to_datetime` to convert the hire date column into a datetime.</span>

In [84]:
# execute this as is
emp = pd.read_csv('../data/employee.csv')
emp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1653 entries, 0 to 1652
Data columns (total 6 columns):
title        1653 non-null object
dept         1653 non-null object
salary       1551 non-null float64
race         1633 non-null object
gender       1653 non-null object
hire_date    1653 non-null object
dtypes: float64(1), object(5)
memory usage: 77.6+ KB


In [85]:
emp['hire_date'] = pd.to_datetime(emp['hire_date'])
emp['hire_date'].head()

0   2015-02-03
1   1982-02-08
2   1984-11-26
3   2012-03-26
4   2013-11-04
Name: hire_date, dtype: datetime64[ns]

In [86]:
emp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1653 entries, 0 to 1652
Data columns (total 6 columns):
title        1653 non-null object
dept         1653 non-null object
salary       1551 non-null float64
race         1633 non-null object
gender       1653 non-null object
hire_date    1653 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 77.6+ KB


## Problem 6
<span  style="color:green; font-size:16px">Without putting `hire_date` into the index, find the mean salary based on `hire_date` over 5 year periods. Also return the number of salaries used in the mean calculation for each period.</span>

In [103]:
cols = ['salary', 'hire_date']
emp1 = emp[cols]
emp1.resample('5Y', on='hire_date').agg({'salary':['mean','count']})

,salary,salary
hire_date,mean,count
1958-12-31,81239.0,1
1963-12-31,,0
1968-12-31,89590.0,1
1973-12-31,66614.0,1
1978-12-31,88503.166667,6
1983-12-31,69074.571429,63
1988-12-31,68358.8625,80
1993-12-31,63372.480198,202
1998-12-31,63408.519774,177
2003-12-31,59921.842857,210
