# Chapter 10 Data Aggregation and Group Operations

- Split a data frame into pieces using one or more keys.
- Calculate group summary statistics such as count, mean, standard deviation, or a user-defined function.
- Apply within-group transformations such as normalization.
- Compute pivot tables and cross-tabulations.
- Perform statistical group analysis.

## I. GroupBy Mechanics

Many data processing tasks follow a **split-apply-combine** pattern. For example, when analyzing a sales dataset you may want to answer the following questions:
1. What is the total revenue for each day?
2. What are the total sales of each product?
3. How much has each client purchased in total?

Each of these questions requires you to split the data into groups, apply a calculation to each group, and finally combine the results into a new table. In pandas this is mostly done with the `groupby()` method.
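
As a rough sketch of the three sales questions, the snippet below builds a tiny made-up table and answers each one with a split (group by a key), an apply (`sum()`), and a combine (the resulting Series). The `sales` frame, its column names, and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sales records; column names and values are made up.
sales = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-02"],
    "product": ["pen", "ink", "pen", "pen", "ink"],
    "client": ["Ann", "Ben", "Ann", "Cat", "Ben"],
    "revenue": [10.0, 25.0, 5.0, 15.0, 20.0],
})

# Split revenue by a key, apply sum() to each group, combine into one Series.
revenue_per_day = sales["revenue"].groupby(sales["date"]).sum()
revenue_per_product = sales["revenue"].groupby(sales["product"]).sum()
revenue_per_client = sales["revenue"].groupby(sales["client"]).sum()

print(revenue_per_day)
print(revenue_per_product)
print(revenue_per_client)
```

Each result is indexed by the distinct values of the grouping key, so one row per day, product, or client.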

In [1]:
import numpy as np
import pandas as pd

In [5]:
# An example:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,-1.830947,-0.591686
1,a,two,-0.546644,1.876575
2,b,one,0.444086,-0.465818
3,b,two,-1.1634,0.094892
4,a,one,-0.477514,-1.85953


In [6]:
# Split data1 values according to key1:
groups = df['data1'].groupby(df['key1'])
groups

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002425D328390>

In [7]:
# Apply mean() function to find the average value for each group
means = groups.mean()
means

key1
a   -0.951702
b   -0.359657
Name: data1, dtype: float64

In [8]:
# Convert it to a data frame
df_means = means.to_frame(name='data1_mean')
df_means

Unnamed: 0_level_0,data1_mean
key1,Unnamed: 1_level_1
a,-0.951702
b,-0.359657


In [9]:
# Put all operations in one statement
df_means = df['data1'].groupby(df['key1']).mean().to_frame(name='data1_mean')
df_means

Unnamed: 0_level_0,data1_mean
key1,Unnamed: 1_level_1
a,-0.951702
b,-0.359657


In [11]:
# Exercise: split data2 according to key2, and calculate the sum.
df_sums = df['data2'].groupby(df['key2']).sum().to_frame(name = "data2 sum")
df_sums

Unnamed: 0_level_0,data2 sum
key2,Unnamed: 1_level_1
one,-2.917035
two,1.971467


We can use more than one column as keys.

In [12]:
# Split the data according to both key1 and key2
groups = df['data1'].groupby([df['key1'], df['key2']])

In [13]:
# Calculate the mean
means = groups.mean()
means

key1  key2
a     one    -1.154231
      two    -0.546644
b     one     0.444086
      two    -1.163400
Name: data1, dtype: float64

We obtain a pandas Series with **hierarchical indexing**. It can be converted to a data frame using `unstack()`.

In [14]:
# Convert it to a data frame
means.unstack()

key2,one,two
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,-1.154231,-0.546644
b,0.444086,-1.1634


In [16]:
# Put all operations in one statement
df["data1"].groupby([df["key1"], df["key2"]]).mean().unstack()


key2,one,two
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,-1.154231,-0.546644
b,0.444086,-1.1634


In [17]:
# Split the entire data frame
df.groupby([df['key1'], df['key2']]).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,-1.154231,-1.225608
a,two,-0.546644,1.876575
b,one,0.444086,-0.465818
b,two,-1.1634,0.094892


In [25]:
# Frequently the grouping information is found in the same data frame as the data 
# you want to work on. In that case, simply put column names as the keys:
df.groupby(['key1', 'key2']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,-1.154231,-1.225608
a,two,-0.546644,1.876575
b,one,0.444086,-0.465818
b,two,-1.1634,0.094892


In [18]:
# Find the number of instances in each subgroup
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

**Iterating Over Groups**

The GroupBy object supports iteration, yielding a sequence of 2-tuples that each contain the group name and the corresponding chunk of data.

In [21]:
# Show the content of each group.
groups = df.groupby(['key1', 'key2'])
for name, group in groups:
    print("Name:", name)
    print(group)

Name: ('a', 'one')
  key1 key2     data1     data2
0    a  one -1.830947 -0.591686
4    a  one -0.477514 -1.859530
Name: ('a', 'two')
  key1 key2     data1     data2
1    a  two -0.546644  1.876575
Name: ('b', 'one')
  key1 key2     data1     data2
2    b  one  0.444086 -0.465818
Name: ('b', 'two')
  key1 key2   data1     data2
3    b  two -1.1634  0.094892
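
Because iteration yields (name, group) pairs, the pieces can also be collected into a dictionary. The sketch below assumes a small frame with fixed, made-up values rather than the random data used in this chapter, and also shows `get_group()` for fetching one group directly.

```python
import pandas as pd

# Small illustrative frame with made-up values.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Iterating yields (name, sub-frame) pairs, so a dict can collect them.
pieces = {name: group for name, group in df.groupby("key1")}
print(pieces["b"])

# get_group() fetches a single group without iterating.
group_a = df.groupby("key1").get_group("a")
print(group_a)
```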


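**Syntactic sugar**: selecting columns for groupby()

Indexing a GroupBy object with a column name returns a grouped Series, and indexing it with a list of names returns a grouped data frame. Each of the following pairs of statements produces the same result.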

In [22]:
df.groupby('key1')['data1'].min()

key1
a   -1.830947
b   -1.163400
Name: data1, dtype: float64

In [23]:
df['data1'].groupby(df['key1']).min()

key1
a   -1.830947
b   -1.163400
Name: data1, dtype: float64

In [24]:
df.groupby('key1')[['data2']].min()

Unnamed: 0_level_0,data2
key1,Unnamed: 1_level_1
a,-1.85953
b,-0.465818


In [25]:
df[['data2']].groupby(df['key1']).min()

Unnamed: 0_level_0,data2
key1,Unnamed: 1_level_1
a,-1.85953
b,-0.465818


**Grouping with a dictionary**

In [36]:
values = np.array([
    [100, 80, 95],
    [55, 60, 45],
    [70, 75, 90],
    [75, 70, 60],
    [60, 73, 75],
    [72, 63, 70]
])
data = pd.DataFrame(values,
                   columns=['Midterm', 'Project', 'Final'],
                   index=['Alics', 'Bob', 'Chris', 'Doug', 'Eva', "Frank"])
data

Unnamed: 0,Midterm,Project,Final
Alics,100,80,95
Bob,55,60,45
Chris,70,75,90
Doug,75,70,60
Eva,60,73,75
Frank,72,63,70


In [27]:
gender = {
    'Alics': 'F',
    'Bob': 'M',
    'Chris': 'M',
    'Doug': 'M',
    'Eva': 'F',
    'Frank': 'M'
}

In [28]:
# split the rows according to gender
data.groupby(gender).size()

F    2
M    4
dtype: int64

In [29]:
# same as grouping with lists
data.groupby(["F", "M", "M", "M", "F", "M"]).size() # not recommended

F    2
M    4
dtype: int64

In [44]:
data.groupby(gender).mean()

Unnamed: 0,Midterm,Project,Final
F,80.0,76.5,85.0
M,68.0,67.0,66.25


**Grouping with functions**

Any function passed as a group key will be called once per index value, with the returned values being used as the group names.

In [33]:
def get_initial(name):
    return name[0]

In [38]:
data.groupby(get_initial).mean()

Unnamed: 0,Midterm,Project,Final
A,100,80,95
B,55,60,45
C,70,75,90
D,75,70,60
E,60,73,75
F,72,63,70


In [37]:
data.groupby(lambda x: x[0]).mean()

Unnamed: 0,Midterm,Project,Final
A,100,80,95
B,55,60,45
C,70,75,90
D,75,70,60
E,60,73,75
F,72,63,70
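
Any callable that accepts an index label can serve as the key, including built-ins. As a small sketch, the example below groups made-up scores by the length of each student's name using `len`; the `scores` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up scores, indexed by student name.
scores = pd.DataFrame(
    {"Midterm": [100, 55, 70, 75], "Final": [95, 45, 90, 60]},
    index=["Alice", "Bob", "Chris", "Doug"],
)

# len is called on each index label; names of equal length share a group.
by_length = scores.groupby(len).mean()
print(by_length)
```

Here "Alice" and "Chris" both have five letters, so they end up in the same group and their scores are averaged together.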


**Example: Filling Missing Values with Group-Specific Values**

In [39]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.DataFrame(np.random.randn(8), index=states, columns=['Value'])
data.loc[['Vermont', 'Nevada', 'Idaho']] = np.nan
data['group_key'] = group_key
data

Unnamed: 0,Value,group_key
Ohio,1.803283,East
New York,0.218327,East
Vermont,,East
Florida,0.740863,East
Oregon,-1.16223,West
Nevada,,West
California,0.731731,West
Idaho,,West


In [40]:
# Fill the missing values with the overall mean
# (numeric_only=True skips the text column group_key)
data.fillna(data.mean(numeric_only=True))


Unnamed: 0,Value,group_key
Ohio,1.803283,East
New York,0.218327,East
Vermont,0.466395,East
Florida,0.740863,East
Oregon,-1.16223,West
Nevada,0.466395,West
California,0.731731,West
Idaho,0.466395,West


In [47]:
# Find the average value of eastern states and western states
means = data.groupby("group_key").mean()

# Fill missing values with the group-specific average
# (numeric_only=True skips the text column group_key)
#data.groupby("group_key").apply(lambda x: x.fillna(x.mean(numeric_only=True)))

def fill_group(group):
    return group.fillna(group.mean(numeric_only=True))
data.groupby("group_key").apply(fill_group)

Unnamed: 0_level_0,Unnamed: 1_level_0,Value,group_key
group_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
East,Ohio,1.803283,East
East,New York,0.218327,East
East,Vermont,0.920825,East
East,Florida,0.740863,East
West,Oregon,-1.16223,West
West,Nevada,-0.21525,West
West,California,0.731731,West
West,Idaho,-0.21525,West


In [52]:
# Fill missing values with the following rule:
# East: 0.5
# West: -0.5
values = {"East": 0.5,
          "West": -0.5}
#data.groupby("group_key").apply(lambda x: x.fillna(values[x.name]))

def fill_group2(group):
    value = values[group.name]
    return group.fillna(value)
data.groupby("group_key").apply(fill_group2)

Unnamed: 0,Value,group_key
Ohio,1.803283,East
New York,0.218327,East
Vermont,0.5,East
Florida,0.740863,East
Oregon,-1.16223,West
Nevada,-0.5,West
California,0.731731,West
Idaho,-0.5,West
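
One possible alternative to `apply()` for group-specific filling is `transform()`, which returns a result aligned with the original index. The sketch below uses made-up values (not the random numbers shown in this chapter) so that the group means are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Made-up values with gaps; the labels mirror the East/West split.
data = pd.DataFrame(
    {"Value": [1.0, 3.0, np.nan, 2.0, np.nan, 6.0],
     "group_key": ["East", "East", "East", "West", "West", "West"]},
    index=["Ohio", "New York", "Vermont", "Oregon", "Nevada", "California"],
)

# transform("mean") gives every row the mean of its own group,
# aligned with the original index.
group_means = data.groupby("group_key")["Value"].transform("mean")

# fillna() with a Series fills each gap from the matching index label.
filled = data["Value"].fillna(group_means)
print(filled)
```

The East mean is 2.0 and the West mean is 4.0, so Vermont and Nevada receive those values while the row order and index stay unchanged.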


**Example: Random Sampling and Permutation**

In [54]:
# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'Q', 'K']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)
deck

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
QH     10
KH     10
AS      1
2S      2
3S      3
4S      4
5S      5
6S      6
7S      7
8S      8
9S      9
10S    10
JS     10
QS     10
KS     10
AC      1
2C      2
3C      3
4C      4
5C      5
6C      6
7C      7
8C      8
9C      9
10C    10
JC     10
QC     10
KC     10
AD      1
2D      2
3D      3
4D      4
5D      5
6D      6
7D      7
8D      8
9D      9
10D    10
JD     10
QD     10
KD     10
dtype: int64

In [70]:
# Randomly sample 5 rows
deck.sample(5)


6S      6
2S      2
10S    10
10D    10
KD     10
dtype: int64

In [75]:
# Randomly sample 2 cards from each suit
groups = deck.groupby(lambda x: x[-1]) # takes the last character of the index 

#for name, group in groups:
#    print(name)
#    print(group)

groups.apply(lambda x: x.sample(2))

C  8C      8
   10C    10
D  KD     10
   9D      9
H  JH     10
   AH      1
S  7S      7
   9S      9
dtype: int64
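
Assuming pandas 1.1 or later, GroupBy objects also offer a `sample()` method, which may be a shorter way to draw the same number of rows from every group. The sketch below uses a tiny made-up deck and a fixed `random_state` so the draw is repeatable.

```python
import pandas as pd

# A tiny made-up deck: three cards in each of two suits.
deck = pd.Series([1, 2, 3, 1, 2, 3],
                 index=["AH", "2H", "3H", "AS", "2S", "3S"])

# Group by the last character of each label (the suit), then draw 2 per suit.
# random_state fixes the draw so repeated runs pick the same cards.
hand = deck.groupby(lambda label: label[-1]).sample(n=2, random_state=0)
print(hand)
```

Unlike the `apply()` version, the result keeps the original flat index instead of adding the suit as an outer index level.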

**Example: Analyzing Cell Phone History**

In [76]:
# Load data
# https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
url = "https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2015/06/phone_data.csv"
# data = pd.read_csv(url, delimiter=",")
data = pd.read_csv(url, delimiter=",", index_col='index')
print(data.shape)
data.head(3)

(830, 6)


Unnamed: 0_level_0,date,duration,item,month,network,network_type
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,15/10/14 06:58,34.429,data,2014-11,data,data
1,15/10/14 06:58,13.0,call,2014-11,Vodafone,mobile
2,15/10/14 14:46,23.0,call,2014-11,Meteor,mobile


1. **date**: The date and time of the entry
2. **duration**: The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry.
3. **item**: A description of the event occurring – can be one of call, sms, or data.
4. **month**: The billing month that each entry belongs to – of form ‘YYYY-MM’.
5. **network**: The mobile network that was called/texted for each entry.
6. **network_type**: Whether the number being called was a mobile, international (‘world’), voicemail, landline, or other (‘special’) number.

In [31]:
# Convert date column from string to datetime objects
from dateutil.parser import parse
data['date'] = data['date'].apply(parse, dayfirst=True)
data.head(3)

Unnamed: 0_level_0,date,duration,item,month,network,network_type
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,2014-10-15 06:58:00,34.429,data,2014-11,data,data
1,2014-10-15 06:58:00,13.0,call,2014-11,Vodafone,mobile
2,2014-10-15 14:46:00,23.0,call,2014-11,Meteor,mobile


In [87]:
# Check data types
data.dtypes


date            datetime64[ns]
duration               float64
item                    object
month                   object
network                 object
network_type            object
dtype: object

In [78]:
# Check missing values
data.isnull().sum()


date            0
duration        0
item            0
month           0
network         0
network_type    0
dtype: int64

**Apply GroupBy actions**

In [88]:
# Which months are covered in this data set?
data.groupby(['month']).groups.keys()

dict_keys(['2014-11', '2014-12', '2015-01', '2015-02', '2015-03'])

In [89]:
# Find the first entry for each month
data.groupby(['month']).first()

Unnamed: 0_level_0,date,duration,item,network,network_type
month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2014-11,2014-10-15 06:58:00,34.429,data,data,data
2014-12,2014-11-13 06:58:00,34.429,data,data,data
2015-01,2014-12-13 06:58:00,34.429,data,data,data
2015-02,2015-01-13 06:58:00,34.429,data,data,data
2015-03,2015-02-12 20:15:00,69.0,call,landline,landline


In [90]:
# Get the number of instances in each month
data["month"].value_counts()


2014-11    230
2015-01    205
2014-12    157
2015-02    137
2015-03    101
Name: month, dtype: int64

In [91]:
# What is the sum of the duration column for each month (calls, data, and sms combined)?
data.groupby("month")[["duration"]].sum()

Unnamed: 0_level_0,duration
month,Unnamed: 1_level_1
2014-11,26639.441
2014-12,14641.87
2015-01,18223.299
2015-02,15522.299
2015-03,22750.441


**Group by more than one variable**

In [92]:
# How many calls, messages, and data entries are there in each month?
data.groupby(["month", "item"]).count()


Unnamed: 0_level_0,Unnamed: 1_level_0,date,duration,network,network_type
month,item,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2014-11,call,107,107,107,107
2014-11,data,29,29,29,29
2014-11,sms,94,94,94,94
2014-12,call,79,79,79,79
2014-12,data,30,30,30,30
2014-12,sms,48,48,48,48
2015-01,call,88,88,88,88
2015-01,data,31,31,31,31
2015-01,sms,86,86,86,86
2015-02,call,67,67,67,67


In [93]:
# How many instances are there per month, split by network_type?
data.groupby(["month", "network_type"]).count()


Unnamed: 0_level_0,Unnamed: 1_level_0,date,duration,item,network
month,network_type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2014-11,data,29,29,29,29
2014-11,landline,5,5,5,5
2014-11,mobile,189,189,189,189
2014-11,special,1,1,1,1
2014-11,voicemail,6,6,6,6
2014-12,data,30,30,30,30
2014-12,landline,7,7,7,7
2014-12,mobile,108,108,108,108
2014-12,voicemail,8,8,8,8
2014-12,world,4,4,4,4
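
`count()` repeats the same number in every column when no values are missing. A more compact view, sketched below on a made-up log, is to take `size()` (one count per group) and `unstack()` the second key into columns; the `log` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up usage log with one row per event.
log = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-11", "2014-12", "2014-12"],
    "item": ["call", "sms", "call", "call", "data"],
})

# size() counts rows per (month, item) pair; unstack() moves item to columns.
# fill_value=0 marks combinations that never occur.
counts = log.groupby(["month", "item"]).size().unstack(fill_value=0)
print(counts)
```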


## II. Data Aggregation
Aggregation refers to any data transformation that produces scalar values from arrays. The preceding examples have used several aggregations, including `mean()`, `count()`, `first()`, `min()`, and `sum()`. User-defined functions can also be applied to create a custom summary.

In [19]:
# Define function get_range() that returns (max - min)
def get_range(array):
    return array.max() - array.min()

In [20]:
# Apply agg() to find the range of each type of cell phone use.
data.groupby(['item'])['duration'].agg(get_range)

item
call    10527.0
data        0.0
sms         0.0
Name: duration, dtype: float64

In [22]:
# Apply multiple aggregation functions
data.groupby(['item'])['duration'].agg([get_range, np.max, np.min])

Unnamed: 0_level_0,get_range,amax,amin
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
call,10527.0,10528.0,1.0
data,0.0,34.429,34.429
sms,0.0,1.0,1.0


In [23]:
# Declare column names
data.groupby(['item'])['duration'].agg([('range', get_range),
                                        ('maximum', np.max),
                                        ('minimum', np.min)])

Unnamed: 0_level_0,range,maximum,minimum
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
call,10527.0,10528.0,1.0
data,0.0,34.429,34.429
sms,0.0,1.0,1.0


In [28]:
# Apply a different function to each column
functions = {
    'duration': sum,
    'network_type': 'count',
    'date': 'first'
}
data.groupby(['month', 'item']).agg(functions)

Unnamed: 0_level_0,Unnamed: 1_level_0,duration,network_type,date
month,item,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2014-11,call,25547.0,107,15/10/14 06:58
2014-11,data,998.441,29,15/10/14 06:58
2014-11,sms,94.0,94,16/10/14 22:18
2014-12,call,13561.0,79,14/11/14 17:24
2014-12,data,1032.87,30,13/11/14 06:58
2014-12,sms,48.0,48,14/11/14 17:28
2015-01,call,17070.0,88,15/12/14 20:03
2015-01,data,1067.299,31,13/12/14 06:58
2015-01,sms,86.0,86,15/12/14 19:56
2015-02,call,14416.0,67,15/01/15 10:36


In [33]:
# Named aggregations: output_column=(input_column, function)
data[data['item'] == 'call'].groupby('month').agg(
    # Get max of the duration column for each group
    max_duration=('duration', max),
    # Get min of the duration column for each group
    min_duration=('duration', min),
    # Get sum of the duration column for each group
    total_duration=('duration', sum),
    # Apply a lambda to date column
    num_days=("date", lambda x: (max(x) - min(x)).days)    
)

Unnamed: 0_level_0,max_duration,min_duration,total_duration,num_days
month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2014-11,1940.0,1.0,25547.0,28
2014-12,2120.0,2.0,13561.0,30
2015-01,1859.0,2.0,17070.0,30
2015-02,1863.0,1.0,14416.0,25
2015-03,10528.0,2.0,21727.0,19
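
Since the phone data has to be downloaded, here is a self-contained sketch of named aggregation on a made-up call log whose results can be checked by hand. The `calls` frame and the output column names are assumptions for illustration; `get_range` is redefined so the block runs on its own.

```python
import pandas as pd

# Made-up call log; durations are in seconds.
calls = pd.DataFrame({
    "month": ["2014-11", "2014-11", "2014-12", "2014-12", "2014-12"],
    "duration": [30.0, 90.0, 10.0, 20.0, 60.0],
})

def get_range(values):
    """Spread between the largest and smallest value in a group."""
    return values.max() - values.min()

# Each keyword names an output column; the tuple is (input column, function).
summary = calls.groupby("month").agg(
    total=("duration", "sum"),
    longest=("duration", "max"),
    spread=("duration", get_range),
)
print(summary)
```

Built-in aggregations can be given as strings such as `"sum"`, while a user-defined function is passed as the function object itself.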


## Homework
Use the cell phone usage data for these exercises.
1. Find the network names that belong to network_type "mobile".
2. How many messages were sent to each mobile network every month?
3. What is the total call duration to each mobile network every month?

In [34]:
# Count entries for each (network_type, network) pair to see which networks are mobile
data.groupby(['network_type', 'network']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,date,duration,item,month
network_type,network,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
data,data,150,150,150,150
landline,landline,42,42,42,42
mobile,Meteor,87,87,87,87
mobile,Tesco,84,84,84,84
mobile,Three,215,215,215,215
mobile,Vodafone,215,215,215,215
special,special,3,3,3,3
voicemail,voicemail,27,27,27,27
world,world,7,7,7,7


In [35]:
# Keep only the mobile entries and list the networks they belong to
subdata = data[data['network_type'] == 'mobile']
subdata['network'].value_counts()

Three       215
Vodafone    215
Meteor       87
Tesco        84
Name: network, dtype: int64

In [36]:
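# Entries (calls and messages together) per month and mobile network;
# add a filter on item == 'sms' to count messages only
subdata.groupby(['month', 'network']).count()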

Unnamed: 0_level_0,Unnamed: 1_level_0,date,duration,item,network_type
month,network,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2014-11,Meteor,23,23,23,23
2014-11,Tesco,23,23,23,23
2014-11,Three,64,64,64,64
2014-11,Vodafone,79,79,79,79
2014-12,Meteor,24,24,24,24
2014-12,Tesco,13,13,13,13
2014-12,Three,43,43,43,43
2014-12,Vodafone,28,28,28,28
2015-01,Meteor,31,31,31,31
2015-01,Tesco,15,15,15,15


In [38]:
# Total call duration per month and mobile network
subdata[subdata['item'] == 'call'].groupby(['month', 'network'])['duration'].sum()

month    network 
2014-11  Meteor       1521.0
         Tesco        4045.0
         Three       12458.0
         Vodafone     4316.0
2014-12  Meteor       2010.0
         Tesco        1819.0
         Three        6316.0
         Vodafone     1302.0
2015-01  Meteor       2207.0
         Tesco        2904.0
         Three        6445.0
         Vodafone     3626.0
2015-02  Meteor       1188.0
         Tesco        4087.0
         Three        6279.0
         Vodafone     1864.0
2015-03  Meteor        274.0
         Tesco         973.0
         Three        4966.0
         Vodafone     3513.0
Name: duration, dtype: float64