# Week 05: Aggregating data

## Objectives

By the end of this tutorial, you should be able to partition a data set and create groupings according to different variables. Specifically you will:

- understand the structure and properties of a `GroupBy` object
- partition a data set using the `groupby` method
- access summary information from a `GroupBy` mapping
- extract a subset of data
- specify hierarchical groupings
- compute a cross tabulation, using both frequencies and proportions
- specify multiple aggregation methods

## Getting started


As before, import `pandas` and read the `tips` data set from GitHub, then add a column for the tip rate (`tip`/`total_bill`), expressed as a percentage.

In [1]:
import pandas
url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv'
tips = pandas.read_csv(url)

tips = tips.assign( rate = tips['tip'] / tips['total_bill'] * 100.0 )
print tips.head()

   total_bill   tip     sex smoker  day    time  size       rate
0       16.99  1.01  Female     No  Sun  Dinner     2   5.944673
1       10.34  1.66    Male     No  Sun  Dinner     3  16.054159
2       21.01  3.50    Male     No  Sun  Dinner     3  16.658734
3       23.68  3.31    Male     No  Sun  Dinner     2  13.978041
4       24.59  3.61  Female     No  Sun  Dinner     4  14.680765


## Partitioning a data set

To group rows into subsets, each defined by the values of a specific variable, use the `groupby()` method and pass the name of the desired variable.

As an example, suppose we wanted to investigate how the bill payer's sex affects tipping behavior. We would want to group observations by the variable `sex`, which has two values in its domain: `Male` and `Female`. The command `tips.groupby('sex')` does this, and produces a complex data structure called a `GroupBy` object. 

In [2]:
# group by sex and find the size of each group

by_sex = tips.groupby('sex')
print by_sex.size()

sex
Female     87
Male      157
dtype: int64


### *Exercise: group by day of the week*

In [3]:
# E: find the number of observations for each day of the week
#    (create a GroupBy object called by_day)


## GroupBy object

A `GroupBy` object is **opaque**: you cannot read it directly using `print` or slicing operations. Instead, access its information through [its approved methods](http://pandas.pydata.org/pandas-docs/stable/api.html#groupby). For example, calling `size()` on the `GroupBy` object `by_sex` above produced a small Series listing the size of each group. In most cases, you will use an aggregation function, such as `sum` or `mean`, which pandas applies within each group to every variable for which the function is defined. The output is typically a DataFrame.
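
As a minimal sketch of what a `GroupBy` object holds, the example below inspects one built from a small made-up table (`toy` is a hypothetical stand-in for `tips`, and its values are invented for illustration). It uses the `ngroups` and `groups` attributes, iteration, and `size()`.

```python
import pandas as pd

# Toy data standing in for the tips table (made-up values)
toy = pd.DataFrame({
    'sex': ['Female', 'Male', 'Male', 'Female', 'Male'],
    'tip': [1.0, 2.0, 3.0, 4.0, 5.0],
})

grouped = toy.groupby('sex')

# number of distinct groups
n_groups = grouped.ngroups
# .groups maps each group key to the row labels that belong to it
row_labels = {key: list(labels) for key, labels in grouped.groups.items()}
# iterating yields (key, sub-DataFrame) pairs
shapes = {key: frame.shape for key, frame in grouped}
# size() returns a Series indexed by group key
sizes = grouped.size()

print(n_groups)
print(row_labels)
print(sizes)
```

Nothing is computed until a method such as `size()` or `mean()` is called; the object itself only records which rows belong to which group.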

In [4]:
# compute average of variables for each group
# ('smoker' and other non-numerical variables omitted, because mean is undefined for them)

print by_sex.mean()

        total_bill       tip      size       rate
sex                                              
Female   18.056897  2.833448  2.459770  16.649074
Male     20.744076  3.089618  2.630573  15.765055


In [5]:
# for comparison, mean of total bill for entire data set
# (not equal to averages for Male and Female groups)

print tips['total_bill'].mean()

19.785942623
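
The overall mean differs from both group means because it is the average of the group means weighted by group size. A short check, using only the group sizes and means printed in the outputs above:

```python
# Group sizes and group means of total_bill, copied from the outputs above
n_female, n_male = 87, 157
mean_female, mean_male = 18.056897, 20.744076

# The overall mean weights each group mean by its number of observations
overall = (n_female * mean_female + n_male * mean_male) / (n_female + n_male)

# The plain average of the two group means ignores the group sizes
unweighted = (mean_female + mean_male) / 2

print(round(overall, 4))
print(round(unweighted, 4))
```

The weighted value reproduces the mean of the whole data set up to rounding, while the unweighted one does not, because there are almost twice as many male bill payers as female ones.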


Since the output of `mean()` is a DataFrame, you can select specific columns, rows, and cells in the usual ways...

In [6]:
# average of tips from female and male bill payers
female_tip = by_sex.mean()['tip']['Female']
male_tip = by_sex.mean().loc['Male','tip']

print "Average tip from female bill payers: $%.2f" %female_tip
print "Average tip from male bill payers: $%.2f" %male_tip

Average tip from female bill payers: $2.83
Average tip from male bill payers: $3.09


... and remember you can apply filters to the DataFrame before invoking `groupby()`.

In [7]:
# mean of each numerical variable by sex, restricted to bills with a tip rate below 12%
f_lowtip = tips['rate'] < 12
print tips[f_lowtip].groupby('sex').mean()

        total_bill       tip      size      rate
sex                                             
Female   26.722000  2.315000  2.800000  8.583141
Male     26.979459  2.471622  2.837838  9.321600


### *Exercises: group by party size*

In [8]:
# E: create a GroupBy object named by_size, grouped by size of the party
#    and for each group compute the average tip rate, and average bill

by_size = tips.groupby('size')
print by_size.mean()[['rate','total_bill']]

           rate  total_bill
size                       
1     21.729202    7.242500
2     16.571919   16.448013
3     15.215685   23.277632
4     14.594901   28.613514
5     14.149549   30.068000
6     15.622920   34.830000


In [9]:
# E: compute the number of bill payers who tip lower than 15%, aggregated by party size

f_lowtip = tips['rate'] < 15.0
print tips[f_lowtip].groupby('size').size()

size
1     1
2    64
3    15
4    23
5     3
6     2
dtype: int64


In [10]:
# E: compute the *percentage* of bill payers who tip lower than 15% for each group

print tips[f_lowtip].groupby('size').size()/tips.groupby('size').size() * 100

size
1    25.000000
2    41.025641
3    39.473684
4    62.162162
5    60.000000
6    50.000000
dtype: float64
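
The division above works because pandas aligns the two Series on their index (the party size) before dividing. The sketch below, on a made-up table (`toy` and its values are hypothetical), shows that alignment and an alternative that takes the mean of a boolean column.

```python
import pandas as pd

# Toy data (made-up values): party size and tip rate in percent
toy = pd.DataFrame({
    'size': [2, 2, 2, 3, 3, 4],
    'rate': [10.0, 16.0, 12.0, 20.0, 14.0, 18.0],
})

low = toy['rate'] < 15.0

# Count low tippers per group and divide by the group totals.
# Division aligns on the index, so a size with no low tippers becomes NaN.
by_division = toy[low].groupby('size').size() / toy.groupby('size').size() * 100

# Alternative: the mean of a boolean column is the fraction of True values,
# so a group without low tippers shows up as 0 instead of NaN.
by_mean = low.groupby(toy['size']).mean() * 100

print(by_division)
print(by_mean)
```

In the `tips` data every party size has at least one low tipper, so both approaches would give the same figures there.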


## Extracting groups

Sometimes it's useful to extract a group so you can do some experimental processing with it, or export it for someone else to work on. Use `get_group()` to extract the group into a new DataFrame.

In [11]:
print by_sex.get_group('Female').head()

    total_bill   tip     sex smoker  day    time  size       rate
0        16.99  1.01  Female     No  Sun  Dinner     2   5.944673
4        24.59  3.61  Female     No  Sun  Dinner     4  14.680765
11       35.26  5.00  Female     No  Sun  Dinner     4  14.180374
14       14.83  3.02  Female     No  Sun  Dinner     2  20.364127
16       10.33  1.67  Female     No  Sun  Dinner     3  16.166505
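
Note that the extracted rows keep their original index labels. Besides `get_group()`, iterating over a `GroupBy` object yields every group in turn; the sketch below shows both on a made-up table (`toy` is hypothetical).

```python
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    'day': ['Sun', 'Fri', 'Sun', 'Thur'],
    'tip': [1.0, 2.0, 3.0, 4.0],
})

by_day = toy.groupby('day')

# get_group returns the rows of one group, with their original index labels
sunday = by_day.get_group('Sun')

# Iterating over a GroupBy yields (key, DataFrame) pairs, so a dictionary
# holding every group can be built in one step
groups = {key: frame for key, frame in by_day}

print(sunday)
print(sorted(groups))
```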


### *Exercise: extract by day of the week*

In [12]:
# E: Create a dictionary of DataFrames, where the keys are the days of the week
#    and each DataFrame holds the observations for that day.
#    Verify that your code worked by printing the first five rows of each group.

by_day = tips.groupby('day')
dct = {}
for name in tips['day'].unique():
    dct[name] = by_day.get_group(name)

for key in dct:
    print dct[key].head()
    print 

   total_bill   tip     sex smoker  day    time  size       rate
0       16.99  1.01  Female     No  Sun  Dinner     2   5.944673
1       10.34  1.66    Male     No  Sun  Dinner     3  16.054159
2       21.01  3.50    Male     No  Sun  Dinner     3  16.658734
3       23.68  3.31    Male     No  Sun  Dinner     2  13.978041
4       24.59  3.61  Female     No  Sun  Dinner     4  14.680765

    total_bill   tip     sex smoker  day    time  size       rate
90       28.97  3.00    Male    Yes  Fri  Dinner     2  10.355540
91       22.49  3.50    Male     No  Fri  Dinner     2  15.562472
92        5.75  1.00  Female    Yes  Fri  Dinner     2  17.391304
93       16.32  4.30  Female    Yes  Fri  Dinner     2  26.348039
94       22.75  3.25  Female     No  Fri  Dinner     2  14.285714

    total_bill   tip   sex smoker   day   time  size       rate
77       27.20  4.00  Male     No  Thur  Lunch     4  14.705882
78       22.76  3.00  Male     No  Thur  Lunch     2  13.181019
79       17.29  2.71

## Hierarchical grouping

You can pass more than one variable to `groupby()`, and it will create a hierarchical grouping. For example, suppose we want to see how male and female bill payers are distributed by meal time. We could pass in both `time` and `sex` as a list to `groupby()`, and then use `size()` to count the number of observations in each category.

In [13]:
# group by time of meal and sex, then count

print tips.groupby(['time','sex']).size()

# At Lunch time, bill payers are equally likely to be male or female, 
# but at Dinner time, bill payers are far more likely to be male.

time    sex   
Dinner  Female     52
        Male      124
Lunch   Female     35
        Male       33
dtype: int64
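
The result of a hierarchical `size()` is a Series with a two-level index. As a sketch on a made-up table (`toy` is hypothetical), the example below selects values from such a Series and uses `unstack()` to move the inner level into columns.

```python
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Dinner'],
    'sex': ['Female', 'Male', 'Female', 'Male', 'Male'],
})

counts = toy.groupby(['time', 'sex']).size()

# select a single cell with a tuple of keys ...
dinner_male = counts.loc[('Dinner', 'Male')]
# ... or everything under one outer key
dinner = counts.loc['Dinner']

# unstack() moves the inner level ('sex') into the columns
table = counts.unstack()

print(dinner_male)
print(dinner)
print(table)
```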


### *Exercise: how consistent are different groups?*

In [14]:
# E: Among the numerical variables, is there more or less variation
#    among groups of male and female smokers and nonsmokers?



## Cross tabulation

If you need to compare counts of observations between multiple groupings, a **cross tabulation** can help you organize these comparisons. In the hierarchical-grouping example above, we can view the `time` groupings vertically (as rows) and the `sex` groupings horizontally (as columns). Pandas' [`crosstab`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html#pandas.crosstab) takes at least two arguments: the first specifies the row grouping, the second the column grouping.

In [15]:
# create a cross tabulation with times as rows, and sex as columns

ct = pandas.crosstab(tips['time'],tips['sex'])
print ct

sex     Female  Male
time                
Dinner      52   124
Lunch       35    33


If the groups contain different numbers of observations, use percentages of each column to make the comparisons more meaningful.

In [16]:
# for each sex, shows the distribution between dinner and lunch meals

print ct/ct.sum()*100.0

sex        Female       Male
time                        
Dinner  59.770115  78.980892
Lunch   40.229885  21.019108
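
The expression `ct/ct.sum()` works because `ct.sum()` returns one total per column, and the division matches those totals to the columns by label. The sketch below spells this out on a made-up table (`toy` is hypothetical) and compares it with the `normalize` argument of `crosstab`, assuming a pandas version recent enough to support it.

```python
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Dinner'],
    'sex': ['Female', 'Male', 'Female', 'Male', 'Male'],
})

ct = pd.crosstab(toy['time'], toy['sex'])

# ct.sum() adds up each column; dividing aligns the totals with the column labels
col_pct = ct / ct.sum() * 100.0

# crosstab can apply the same normalisation itself
# (assumption: the installed pandas supports the normalize argument)
col_pct_builtin = pd.crosstab(toy['time'], toy['sex'], normalize='columns') * 100.0

# row percentages instead: divide each row by its own total
row_pct = ct.div(ct.sum(axis=1), axis=0) * 100.0

print(col_pct)
print(row_pct)
```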


The axes of a cross tabulation can be "flipped" (transposed) by swapping the two arguments, allowing you to compare between sexes within each meal time.

In [17]:
# for each meal time, shows the distribution between female and male bill payers

ct = pandas.crosstab(tips['sex'],tips['time'])
print ct/ct.sum()*100.0

# shows clearly that the male/female distribution changes between
# lunch (parity) and dinner (70% male payers)

time       Dinner      Lunch
sex                         
Female  29.545455  51.470588
Male    70.454545  48.529412
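
Swapping the arguments of `crosstab` gives the same counts as transposing the original table with `.T`. A small check on a made-up table (`toy` is hypothetical):

```python
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Dinner'],
    'sex': ['Female', 'Male', 'Female', 'Male', 'Male'],
})

ct = pd.crosstab(toy['time'], toy['sex'])
flipped = pd.crosstab(toy['sex'], toy['time'])

# The flipped table holds the same counts and labels as the transpose of ct
same_counts = bool((ct.T.values == flipped.values).all())
same_labels = list(ct.T.index) == list(flipped.index)

# Column percentages of the flipped table: share of each sex within a meal time
pct = flipped / flipped.sum() * 100.0

print(same_counts and same_labels)
print(pct)
```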


### *Exercise: When do smokers congregate?*

In [18]:
# E: On what days did the waiter tend to see the highest proportion of smokers?

ct = pandas.crosstab(tips['day'],tips['smoker'])
print ct/ct.sum()*100.0

smoker         No        Yes
day                         
Fri      2.649007  16.129032
Sat     29.801325  45.161290
Sun     37.748344  20.430108
Thur    29.801325  18.279570


## Specifying aggregation methods

Sometimes it is useful to summarize different variables in different ways. For example, we typically want to *count* categorical data but *average* numerical data. The `aggregate()` method lets you specify which function applies to which variable: pass a dictionary that maps each column name to the function that should summarize it. The output is a `DataFrame` that you can use in further calculations.

In [19]:
# Group by meal time and sex, and get count of sex and average of tip

import numpy as np
stats = { 
    'sex' : np.size, 
    'tip' : np.mean 
}
print tips.groupby(['time', 'sex']).aggregate(stats)

                    tip  sex
time   sex                  
Dinner Female  3.002115   52
       Male    3.144839  124
Lunch  Female  2.582857   35
       Male    2.882121   33


### *Exercise: where is the money?*

In [20]:
# E: Make a DataFrame showing the average tip rate, total tips collected
#    and number of tables waited, for different days and meal times.

by_time_day = tips.groupby(['time','day']).aggregate({
        'tip': np.sum,
        'rate' : np.mean,
        'size' : np.size
    })
print by_time_day

# Although Friday lunch has the best average tip rate, it has low volume
# and thus the waiter doesn't collect much from it overall

                  rate     tip  size
time   day                          
Dinner Fri   15.891611   35.28    12
       Sat   15.315172  260.40    87
       Sun   16.689729  247.39    76
       Thur  15.974441    3.00     1
Lunch  Fri   18.876489   16.68     7
       Thur  16.130074  168.83    61


### *Exercise: low tippers*

Earlier we tabulated the proportion of sub-15% tippers, grouped by party size. Replicate this analysis using a cross tabulation.

In [21]:
# E: compute the *percentage* of bill payers who tip lower than 15% for each group

tips = tips.assign(low = tips['rate'] < 15.0)
ct = pandas.crosstab(tips['size'], tips['low'])

ct = ct.assign(total = ct.sum(axis=1))
ct = ct.assign(low = ct[True]/ct['total'] * 100.0)
print ct

low   False  True  total        low
size                               
1         3     1      4  25.000000
2        92    64    156  41.025641
3        23    15     38  39.473684
4        14    23     37  62.162162
5         2     3      5  60.000000
6         2     2      4  50.000000


In [22]:
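# Alternative: put party size in the columns, so that the column
# percentages give the share of low tippers for each size directly
tips = tips.assign(low = tips['rate'] < 15.0)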
ct = pandas.crosstab(tips['low'],tips['size'])

print ct/ct.sum()*100

size    1          2          3          4   5   6
low                                               
False  75  58.974359  60.526316  37.837838  40  50
True   25  41.025641  39.473684  62.162162  60  50
