In [None]:
%matplotlib inline
import pandas as pd

df = pd.read_csv('data/Consumo_cerveja.csv', 
                 decimal=',', 
                 thousands='.', 
                 header=0, 
                 names=['date','median_temp','min_temp','max_temp','precip','weekend','consumption'], 
                 parse_dates=['date'], 
                 nrows=365)

Let's explore one of the more powerful features of pandas - the Split-apply-combine paradigm!

![split_apply_combine](images/split_apply_combine_example.jpg)

When exploring data, we often want to split it into groups to discover how the groups differ - groupby operations are a concise and elegant way to do this!

# Groupby

The split step is often called a groupby and should be familiar to anyone used to working in SQL - all we do in this step is define our groups. 

![Fun Fact](images/fun_fact.resized.jpeg) In pandas, we usually define the groups by naming one or several columns to group by, but you can use anything as a grouper, as long as the length of the grouper is the same as the length of the data!
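To make the "anything can be a grouper" point concrete, here is a minimal sketch. The `toy` frame, its values and the `labels` array are made up for illustration and are not part of the beer dataset; the only requirement shown is that each grouper has one entry per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    'date': pd.date_range('2015-01-01', periods=6, freq='D'),
    'consumption': [25.0, 28.0, 30.0, 22.0, 29.0, 31.0],
})

# Grouper 1: a Series derived from a column (Saturday/Sunday vs. the rest).
day_type = (
    toy['date'].dt.dayofweek.ge(5)
    .map({True: 'weekend', False: 'weekday'})
    .rename('day_type')
)
by_day_type = toy.groupby(day_type)['consumption'].mean()

# Grouper 2: a plain array of labels, one label per row.
labels = np.array(['a', 'a', 'b', 'b', 'c', 'c'])
by_label = toy.groupby(labels)['consumption'].sum()

print(by_day_type)
print(by_label)
```

Neither grouper is a column of `toy`; pandas only needs one label per row to decide which group the row belongs to.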

In [None]:
grouper = df.groupby('weekend')

In [None]:
grouper

`.groupby` returns a DataFrameGroupBy object, which "knows" how to split the data - we haven't done any actual calculations yet!

We can examine the grouper to see what it is going to do. In this example, there are two values (or levels) in the weekend column - 0 and 1. Our grouper has saved the index labels for each group, which we can see by inspecting the `.groups` property.

In [None]:
grouper.groups

We can also pull out a single group by calling `.get_group` with one of the keys in the `.groups` dictionary.

In [None]:
grouper.get_group(1)

The grouper is also iterable, yielding a `(key, data)` pair for each group, so we can use it in a for loop

In [None]:
for group, data in grouper:
    print(group)
    print('-'*30)
    print(data)
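Because iteration yields `(key, data)` pairs, the grouper also works in comprehensions. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not the real data) to collect the shape of each group and to check that `.get_group` returns the same rows as the loop.

```python
import pandas as pd

# Hypothetical toy data: three weekday rows and two weekend rows.
toy = pd.DataFrame({
    'weekend': [0, 0, 1, 1, 0],
    'consumption': [25.0, 28.0, 30.0, 32.0, 27.0],
})
grouped = toy.groupby('weekend')

# Each iteration step gives the group key and that group's sub-frame.
shapes = {key: frame.shape for key, frame in grouped}
print(shapes)

# .get_group returns the same sub-frame the loop would hand us for that key.
weekend_rows = grouped.get_group(1)
print(weekend_rows)
```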

Most of the time, though, we use the grouper to apply some function to each group. Pandas has a lot of methods available out of the box, and we can always specify our own.

For example, how many rows are there in each group?

In [None]:
grouper.size()

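What are the mean and standard deviation for each group?

In [None]:
grouper.mean(numeric_only=True)

In [None]:
grouper.std(numeric_only=True)

Note that `numeric_only=True` restricts the aggregation to the numerical columns, so the `date` column is left out! Versions of pandas before 2.0 dropped non-numerical columns automatically; pandas 2.0 and later only do so when asked.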

What if we are only interested in aggregating a single column?

In [None]:
grouper.consumption.sum()

Sometimes we want to use several aggregations - `.agg` lets us specify any number of aggregators, including custom functions.

Note that for convenience, pandas lets us specify the most common functions by name, as a string!

In [None]:
def silly_function(x):
    # Sum of squares of the values in each group
    return (x ** 2).sum()

In [None]:
grouper.agg(['std', 'mean', silly_function])

`.agg` also lets us specify different aggregation functions per column, using a dictionary that maps column names to aggregators

In [None]:
grouper.agg({'median_temp': ['std', 'mean'], 
             'consumption': 'sum'})
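Here is a self-contained sketch of how string aggregators and a custom function can be mixed in one per-column `.agg` call. The `toy` frame and the `sum_of_squares` helper are made up for illustration; the column names simply mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy data with two rows per group.
toy = pd.DataFrame({
    'weekend': [0, 0, 1, 1],
    'median_temp': [20.0, 22.0, 25.0, 27.0],
    'consumption': [2.0, 3.0, 4.0, 5.0],
})

def sum_of_squares(x):
    # x is the Series of values belonging to one group.
    return (x ** 2).sum()

result = toy.groupby('weekend').agg({
    'median_temp': ['mean', 'std'],               # built-ins, named by string
    'consumption': ['sum', sum_of_squares],       # a string and a callable
})
print(result)
```

When a column gets a list of aggregators, the result has two-level column labels such as `('consumption', 'sum_of_squares')`, with the custom function labelled by its name.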

It even lets us do plotting directly on the groups

In [None]:
grouper.boxplot(rot=90, column=['median_temp', 'min_temp', 'max_temp']);

# DateTime & Resample
As mentioned when we were looking at datatypes, pandas was built by a finance quant, so the library handles datetimes very well. Let's look at some of the more interesting possibilities when working with time series data! To get the most out of this functionality, we need to set a datetime index (a `DatetimeIndex`).

![Fun Fact](images/fun_fact.resized.jpeg) As of pandas 0.19.0, you no longer have to set a date as index - you can use the `on` parameter when resampling. However, it's often a good idea to have set an index, so we are still going to do it!
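As a quick illustration of the `on` parameter, the sketch below resamples a made-up frame (`toy` and its values are assumptions) both ways and checks that the results agree. It uses a `'7D'` frequency only to keep the numbers easy to follow.

```python
import pandas as pd

# Hypothetical toy data: 14 consecutive days with the date as a regular column.
toy = pd.DataFrame({
    'date': pd.date_range('2015-01-01', periods=14, freq='D'),
    'consumption': list(range(14)),
})

# Option 1: leave the date as a column and point resample at it.
weekly_on = toy.resample('7D', on='date')['consumption'].sum()

# Option 2: set the date as index first, then resample.
weekly_index = toy.set_index('date')['consumption'].resample('7D').sum()

print(weekly_on)
```

Both routes give the same two 7-day bins here, so which one to use is mostly a matter of what else you plan to do with the frame.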


In [None]:
df = df.set_index('date')

In [None]:
df.head()

Having a date as index lets us do some special case indexing - Pandas will recognize dates and slice accordingly

In [None]:
df.loc['2015-02']
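The same string-based indexing works for ranges. A minimal sketch on made-up data (`toy` is an assumption, not the real dataset):

```python
import pandas as pd

# Hypothetical toy data: one row per day for January through March 2015.
toy = pd.DataFrame(
    {'consumption': list(range(90))},
    index=pd.date_range('2015-01-01', periods=90, freq='D'),
)

# A year-month string selects every row in that month.
feb = toy.loc['2015-02']

# Date strings also work as slice bounds, and both ends are included.
span = toy.loc['2015-01-30':'2015-02-02']

print(len(feb), len(span))
```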

## Resampling

Resampling is a special case of grouping where the groups are time intervals - it lets you downsample (aggregate to a coarser frequency) or upsample (expand to a finer frequency) very easily. The API is very similar to groupby, but instead of specifying a column, you specify a frequency

In [None]:
resampler = df.resample('M')

Now we can do aggregations, much like in groupby. 

For example, the mean per month:

In [None]:
resampler.mean()
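To see why resampling is "just" a groupby over time intervals, the sketch below compares a monthly resample with a manual groupby on the month number. The `toy` frame is made up, and it uses the `'MS'` (month start) alias rather than the month-end alias because the spelling of the month-end alias differs between pandas versions.

```python
import pandas as pd

# Hypothetical toy data: one row per day for January and February 2015.
toy = pd.DataFrame(
    {'consumption': list(range(59))},
    index=pd.date_range('2015-01-01', periods=59, freq='D'),
)

# Downsample to one row per month, labelled with the first day of the month.
monthly = toy.resample('MS').mean()

# The same numbers from an explicit groupby on the month of each row.
by_month = toy.groupby(toy.index.month)['consumption'].mean()

print(monthly)
```

The groupby version only works this simply because the toy data stays within one year; resample keeps track of the year for you and returns proper timestamps as labels.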

Pandas has a wide range of offsets you can use. Check the documentation for the full list of [Offset Aliases](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases).

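I commonly use
- "W": Weekly
- "M" or "MS": Month end or Month start
- "Y" or "YS": Year end or Year start

Note that pandas 2.2 and later spell the month-end and year-end aliases "ME" and "YE" and deprecate the shorter "M" and "Y".

Pandas has up to nanosecond resolution, so you should be covered for most use cases!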

In [None]:
df.resample('W').mean()

You can also put a number in front to specify every X frequencies

In [None]:
df.resample('3M').mean()

In [None]:
df.resample('4W').mean()

Pandas also lets us upsample data

For example, let's say we have the monthly mean:

In [None]:
monthly_df = df.resample('M').mean()

We now want to upsample to a daily resolution. Upsampling creates new rows with no data, so we have to say how to fill them - here we forward fill from the last known value

In [None]:
monthly_df.resample('D').ffill()
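A tiny sketch of what upsampling does, on a made-up two-row series (`monthly` and its values are assumptions): `.asfreq()` shows the gaps that upsampling creates, and `.ffill()` fills them with the last known value.

```python
import pandas as pd

# Hypothetical monthly values, labelled with the first day of each month.
monthly = pd.Series(
    [10.0, 20.0],
    index=pd.to_datetime(['2015-01-01', '2015-02-01']),
)

# Upsample to daily rows without filling: the new rows are NaN.
daily_gaps = monthly.resample('D').asfreq()

# Upsample and carry the last known value forward.
daily = monthly.resample('D').ffill()

print(daily_gaps.isna().sum(), len(daily))
```

Note that the upsampled range only runs from the first to the last timestamp in the input, so the final value appears on a single row.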

# Rolling

Rolling has a similar API to groupby and resample, but works by aggregating over a rolling window. It's often used to smooth out jagged time series to see larger trends.

It works in the same way as we've seen before, but takes a window parameter instead. An integer window counts rows, and since our data has one row per day, a window of 7 gives us a 7-day rolling mean

In [None]:
rolling = df.rolling(7)

In [None]:
rolling.mean()

Note that the first 6 rows will be NaN, as a window of 7 needs 7 rows before it can compute the rolling mean.
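The leading NaNs are easiest to see on a short series. The sketch below uses a made-up series and a window of 3; it also shows the `min_periods` argument of `.rolling`, which is one way to get a value from partially filled windows if the NaNs are unwanted.

```python
import pandas as pd

# Hypothetical toy series.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Window of 3 rows: the first 2 results are NaN because the window is not full.
strict = s.rolling(3).mean()

# min_periods=1 lets a partially filled window produce a value.
relaxed = s.rolling(3, min_periods=1).mean()

print(strict)
print(relaxed)
```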

We can also combine resampling and rolling to get a rolling 6-month mean of the monthly means

In [None]:
rolling_month = df.resample('M').mean().rolling(6)

In [None]:
rolling_month.mean()

Or combine groupby and resample to get mean monthly results for each group

In [None]:
df.groupby('weekend').resample('M').mean()
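To show what the combined result looks like, here is a minimal sketch on a made-up frame (`toy` and its values are assumptions). It selects a single column before resampling and uses the `'MS'` alias, both to keep the example small and independent of the pandas version.

```python
import pandas as pd

# Hypothetical toy data: a Friday, Saturday, Sunday and Monday spanning two months.
toy = pd.DataFrame(
    {
        'weekend': [0, 1, 1, 0],
        'consumption': [20.0, 30.0, 34.0, 22.0],
    },
    index=pd.to_datetime(['2015-01-30', '2015-01-31', '2015-02-01', '2015-02-02']),
)

# Split by weekend flag first, then resample each group to monthly means.
result = toy.groupby('weekend')['consumption'].resample('MS').mean()

print(result)
```

The result is indexed by both the group key and the month, so each (weekend, month) combination gets its own row.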