<center> 
# R406: Using Python for data analysis and modelling

<br> <br> 

## <center> Pandas — data aggregation, the split-apply-combine paradigm and pivot tables

<br>

<center> **Andrey Vassilev**

<br> 


 

# Outline

1. Aggregation
2. An overview of the split-apply-combine concept
3. Pivot tables

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display

# Aggregation operations for `Series`

These practically mirror the respective operations for arrays:

In [None]:
rng = np.random.RandomState(5)
s = pd.Series(rng.rand(5))
s

In [None]:
s.sum()

In [None]:
s.mean()

Many other operations are available. Here are a few examples to give you ideas:

In [None]:
s.min()

In [None]:
s.prod()

In [None]:
s.cumsum()

In [None]:
s.cumprod()
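One place where a `Series` does not quite mirror an array is the treatment of missing values. The sketch below uses made-up numbers (not the lecture's data) to show that NumPy propagates `NaN` while pandas skips it unless told otherwise:

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the lecture): one value is missing
values = np.array([1.0, 2.0, np.nan, 4.0])
s_nan = pd.Series(values)

array_sum = values.sum()              # NumPy propagates the NaN
series_sum = s_nan.sum()              # pandas skips NaN by default (skipna=True)
strict_sum = s_nan.sum(skipna=False)  # opt in to the NumPy behaviour

print(array_sum, series_sum, strict_sum)
```

The same `skipna` argument is accepted by `mean()`, `min()`, `prod()` and the cumulative methods.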

# Aggregation operations for `DataFrames`

These are similar in spirit to the operations for `Series`. The novelty here is the option, controlled by the `axis` argument, to perform an operation row-wise or column-wise.

In [None]:
wght = pd.DataFrame({'Bob':[90,91,89,88,86],
                     'Jane':[68,62,61,59,59], 
                     'Joe':[75,76,77,79,80]},
                 index=pd.date_range(start="20160601",
                                     periods=5,freq='M'))
wght

In [None]:
wght.mean()

In [None]:
wght.mean(axis=1)

In [None]:
# same result as axis=1: "columns" is the string alias for that axis
wght.mean(axis="columns")
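The meaning of `axis` is easy to get backwards: it names the axis that is collapsed, not the one that survives. A minimal sketch on a made-up frame (the names `df`, `col_means` and `row_means` are only illustrative):

```python
import pandas as pd

# Small illustrative frame (values are made up)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]},
                  index=['r1', 'r2', 'r3'])

col_means = df.mean()                      # default axis=0: one value per column
row_means = df.mean(axis=1)                # collapse the columns: one value per row
row_means_named = df.mean(axis='columns')  # string alias for axis=1

print(col_means)
print(row_means)
```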

In [None]:
wage = pd.DataFrame({'Bill':[1000,1100,1050,1000,1200],'Jill':[2000,2000,2000,3000,2000], 'Jane':[500,550,550,600,500]},
                 index=pd.date_range(start="20160101",periods=5,freq='M'))
wage

In [None]:
wage.sum()

In [None]:
wage.sum(axis=1)

In [None]:
otherincome = pd.DataFrame({'Bill':[2000,2100,2050,2000,2200],'Jill':[3000,3000,3000,4000,3000], 'Jane':[1500,1550,1550,1600,1500]},
                 index=pd.date_range(start="20160101",periods=5,freq='M'))
display(otherincome)

In many respects a `DataFrame` behaves just like a NumPy array. For instance, arithmetic is elementwise, with the operands matched on their row and column labels:

In [None]:
wage + otherincome

In [None]:
# same as above
wage.add(otherincome)

In [None]:
wage*3
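The frames above share the same index and columns, so the label matching goes unnoticed. The following sketch, with hypothetical frames `left` and `right` whose labels only partly overlap, shows what happens otherwise and how the `fill_value` argument of `add()` changes the outcome:

```python
import pandas as pd

# Hypothetical frames whose labels only partly overlap
left = pd.DataFrame({'Bill': [1, 2], 'Jane': [3, 4]}, index=['Jan', 'Feb'])
right = pd.DataFrame({'Jane': [10, 20], 'Bill': [30, 40]}, index=['Feb', 'Mar'])

aligned = left + right                   # labels are matched; gaps become NaN
filled = left.add(right, fill_value=0)   # treat an entry missing on one side as 0

print(aligned)
print(filled)
```

Note that the column order of `right` does not matter: values are paired by label, not by position.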

# Split-apply-combine operations

- A common need that arises in data analysis is to divide a dataset into several subsets according to some criterion, process and analyse these subsets separately and put the results back together.
- This workflow is known as **split-apply-combine**.
- Pandas supports this approach via the `groupby` operation.
- The next slide contains a nice illustration of the main idea (courtesy of Jake VanderPlas's *Python Data Science Handbook*).

![Split-apply-combine illustrated](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/figures/03.08-split-apply-combine.png)
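To make the three steps concrete, here is a sketch that performs them by hand on a small made-up frame and then compares the result with a single `groupby` call. The manual version is only for illustration; it is not how pandas implements the operation.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [1, 2, 3, 4, 5, 6]})

# Split: one sub-frame per distinct key
pieces = {k: df[df['key'] == k] for k in df['key'].unique()}
# Apply: reduce each piece to a single number
sums = {k: piece['data'].sum() for k, piece in pieces.items()}
# Combine: put the per-group results back into one object
manual = pd.Series(sums, name='data').rename_axis('key')

# The same computation in one line
with_groupby = df.groupby('key')['data'].sum()

print(manual)
print(with_groupby)
```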

### The diamonds dataset

Source: R's `ggplot2` package

Description taken from http://docs.ggplot2.org/current/diamonds.html

**Variables:**  
- price. price in US dollars (\$326--\$18,823)
- carat. weight of the diamond (0.2--5.01)
- cut. quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color. diamond colour, from J (worst) to D (best)
- clarity. a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x. length in mm (0--10.74)
- y. width in mm (0--58.9)
- z. depth in mm (0--31.8)
- depth. total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- table. width of top of diamond relative to widest point (43--95)

In [None]:
dia = pd.read_csv("diamonds.csv")

In [None]:
dia.head()

The `unique()` method of a `Series` returns its distinct values.

In [None]:
dia['color'].unique()

In [None]:
dia.describe()

In [None]:
# dia.groupby('cut') is a GroupBy object; describe() turns it into a DataFrame
dia.groupby('cut').describe().head(15)

We can refer to a column of a `GroupBy` object and invoke a method on it:

In [None]:
dia.groupby('cut')['carat'].mean()

Or we can iterate over the groups. This produces tuples, each holding a group name and the dataframe with the rows of that group.

In [None]:
for group in dia.groupby('cut'):
    print(group[0])
    print(type(group[1]))

# Or, almost equivalently
# for gr, fr in dia.groupby('cut'):
#     print(gr,type(fr),sep=": ")

We can get a particular group with `get_group()`:

In [None]:
dia.groupby('cut').get_group('Good').head()
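Since `diamonds.csv` may not be at hand, here is a self-contained sketch of the same two ideas on a made-up frame called `toy` (the values are invented): iterating yields `(name, DataFrame)` pairs, and `get_group()` returns the sub-frame for one name.

```python
import pandas as pd

# Made-up stand-in for the diamonds data
toy = pd.DataFrame({'cut': ['Good', 'Ideal', 'Good', 'Fair'],
                    'price': [400, 900, 450, 300]})
grouped = toy.groupby('cut')

# Iterating gives (group name, sub-frame) pairs
sizes = {name: len(frame) for name, frame in grouped}
# get_group() pulls out a single sub-frame, keeping the original row labels
good = grouped.get_group('Good')

print(sizes)
print(good)
```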

## Operations on groups

### Aggregation

In [None]:
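# String names select the pandas implementations; recent pandas versions
# warn when the Python builtins sum/min/max are passed instead
dia.groupby('cut').agg(['sum','min','max'])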
# Note the results for string variables

We can specify the operations to be column-specific.

In [None]:
dia.groupby('cut').agg({'carat':'mean', 'price':'max'})

In [None]:
dia.groupby('cut').agg({'carat':'mean', 'price':['min','max']})
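Passing a list of functions produces hierarchical column labels. If flat, self-explanatory column names are preferred, pandas also offers "named aggregation", where each keyword is an output column and each value a `(column, function)` pair. A sketch on a made-up frame (the output names `mean_carat`, `min_price` and `max_price` are arbitrary choices):

```python
import pandas as pd

# Made-up stand-in for the diamonds data
toy = pd.DataFrame({'cut': ['Good', 'Ideal', 'Good', 'Ideal'],
                    'carat': [0.3, 0.5, 0.5, 0.7],
                    'price': [400, 900, 600, 1500]})

# keyword = name of the output column, value = (input column, function)
summary = toy.groupby('cut').agg(
    mean_carat=('carat', 'mean'),
    min_price=('price', 'min'),
    max_price=('price', 'max'),
)

print(summary)
```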

### Filtering

We may want to keep only groups that satisfy certain conditions. Let's say we want to keep only those groups which have more than 30 diamonds with a price above \$18500. We can do it with the `filter()` method.

In [None]:
def SelectManyExpensive(x):
    # True if the group has more than 30 diamonds priced above 18500
    return (x['price'] > 18500).sum() > 30

dia.groupby('cut').filter(SelectManyExpensive).head()

In [None]:
dia.groupby('cut').filter(SelectManyExpensive).shape

In [None]:
dia.groupby('cut').filter(SelectManyExpensive)['cut'].unique()
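It is worth stressing that `filter()` keeps or drops whole groups, not individual rows. In the made-up example below (the threshold of 700 and the function name `has_two_expensive` are invented for illustration), the cheap 'Good' diamond survives because its group as a whole passes the test:

```python
import pandas as pd

# Made-up stand-in for the diamonds data
toy = pd.DataFrame({'cut': ['Fair', 'Fair', 'Good', 'Good', 'Good'],
                    'price': [100, 900, 800, 950, 200]})

def has_two_expensive(group):
    # True when the group holds at least two rows priced above 700
    return (group['price'] > 700).sum() >= 2

# 'Fair' has one expensive row and is dropped; 'Good' has two and is kept whole
kept = toy.groupby('cut').filter(has_two_expensive)

print(kept)
```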

### Transforming

The `transform()` method applies a transformation to each group and returns a result with the same index as the input. Here is a group-specific standardization transformation:

In [None]:
dia.loc[:,['cut','table','price']].groupby(
    'cut').transform(lambda x: (x - x.mean()) / x.std()).head()
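A quick way to convince yourself that the standardization is done group by group is to check that the transformed values average to zero within each group. A sketch on made-up data (the column name `price_z` is an arbitrary choice):

```python
import pandas as pd

# Made-up stand-in for the diamonds data
toy = pd.DataFrame({'cut': ['Fair', 'Good', 'Fair', 'Good', 'Fair'],
                    'price': [100.0, 800.0, 300.0, 1000.0, 200.0]})

# Standardize price within each cut; the result is aligned with toy's rows
z = toy.groupby('cut')['price'].transform(lambda x: (x - x.mean()) / x.std())
toy['price_z'] = z

# Each group's standardized values should have mean (numerically) zero
group_means = toy.groupby('cut')['price_z'].mean()

print(toy)
print(group_means)
```

Because the output has one value per input row, it can be assigned straight back as a new column, which is the typical use of `transform()`.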

### Apply operations

The `apply()` method is similar to `transform()`, but it is more general: the function passed to it receives each group as a dataframe, performs some calculation and returns a Pandas object or a scalar. The result need not have the same shape as the input.

In [None]:
from scipy.stats import linregress

def ReturnSlope(x):
    # linregress(x, y): slope from regressing carat (y) on price (x)
    return linregress(x['price'],x['carat']).slope

dia.groupby('cut').apply(ReturnSlope)
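When the function returns a `Series` rather than a scalar, `apply()` assembles the pieces into a dataframe with one row per group. The sketch below is a variation on the example above under a few assumptions: it uses made-up data, swaps in `np.polyfit` for `linregress` to avoid the SciPy dependency, and regresses price on carat (the other way round from `ReturnSlope`). The helper name `fit_line` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: price is an exact linear function of carat within each cut
toy = pd.DataFrame({'cut': ['Fair'] * 3 + ['Good'] * 3,
                    'carat': [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
                    'price': [100.0, 200.0, 300.0, 150.0, 350.0, 550.0]})

def fit_line(group):
    # Least-squares line: price = slope * carat + intercept
    slope, intercept = np.polyfit(group['carat'].to_numpy(),
                                  group['price'].to_numpy(), deg=1)
    return pd.Series({'slope': slope, 'intercept': intercept})

# Selecting the columns first keeps the grouping column out of the sub-frames
fits = toy.groupby('cut')[['carat', 'price']].apply(fit_line)

print(fits)
```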

# Pivot tables

We already know the reshaping operation `pivot`. Pivot tables carry this idea further by providing data aggregation functionality. The basic syntax is `pivot_table(values, index, columns)` with aggregation performed using the `mean` function by default.

In [None]:
dia.pivot_table(values='price',index='cut',columns='color')
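A pivot table can be thought of as a `groupby` on two keys followed by a reshaping step. The sketch below checks this on a made-up frame by comparing `pivot_table` with `groupby(...).mean().unstack()`:

```python
import pandas as pd

# Made-up stand-in for the diamonds data
toy = pd.DataFrame({'cut': ['Fair', 'Fair', 'Good', 'Good', 'Good'],
                    'color': ['D', 'E', 'D', 'D', 'E'],
                    'price': [100, 300, 200, 400, 500]})

# Mean price for every (cut, color) combination, two ways
pivoted = toy.pivot_table(values='price', index='cut', columns='color')
grouped = toy.groupby(['cut', 'color'])['price'].mean().unstack()

print(pivoted)
print(grouped)
```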

In [None]:
dia.pivot_table(values='price',index='cut',columns='color',aggfunc='sum')

We can pass several aggregating functions.

In [None]:
dia.pivot_table(values='price',index='cut',columns='color',aggfunc=['min','max'])

We can also work with hierarchical indexes or columns:

In [None]:
dia.pivot_table(values='price',index=['cut','clarity'],columns='color').head(15)

In [None]:
dia.pivot_table(values='price',index='color',columns=['cut','clarity']).head(15)

We can add margins:

In [None]:
dia.pivot_table(values='price',index='color',columns='clarity',margins=True,aggfunc='sum').head(15)

And choose a fill value for NAs:

In [None]:
dia.pivot_table(values='price',index='color',columns=['cut','clarity'],fill_value=-1).head(15)
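Margins and fill values are easier to read on a table small enough to verify by hand. In the made-up frame below there is no 'Good' diamond of colour 'E', so that cell would be missing without `fill_value`; the `All` row and column hold the totals computed with the same aggregation function:

```python
import pandas as pd

# Made-up stand-in for the diamonds data; the (Good, E) combination is absent
toy = pd.DataFrame({'cut': ['Fair', 'Fair', 'Good'],
                    'color': ['D', 'E', 'D'],
                    'price': [100, 300, 200]})

table = toy.pivot_table(values='price', index='cut', columns='color',
                        aggfunc='sum', margins=True, fill_value=0)

print(table)
```

The label of the margin row and column can be changed with the `margins_name` argument.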