# Chapter 10 Data Aggregation and Group Operations

- Split a data frame into pieces using one or more keys.
- Calculate group summary statistics such as count, mean, standard deviation, or a user-defined function.
- Apply within-group transformations such as normalization.
- Compute pivot tables and cross-tabulations.
- Perform statistical group analysis.

## I. GroupBy Mechanics

Many data processing tasks follow a **split-apply-combine** pattern. For example, when analyzing a sales dataset you may want to answer questions such as:
1. What is the total revenue for each day?
2. What are the total sales of each product?
3. How much has each client purchased in total?

These operations all require that you split the data into groups, apply a calculation to each group, and then combine the results into a new table. In pandas this is mostly done with the `groupby()` method.
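
As a minimal sketch of the three questions above, the following uses a small made-up sales table. The column names (`date`, `product`, `client`, `revenue`) and all values are assumptions chosen for illustration, not data from this chapter.

```python
import pandas as pd

# Hypothetical sales records; names and numbers are invented for illustration.
sales = pd.DataFrame({
    'date':    ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-02'],
    'product': ['pen', 'ink', 'pen', 'pen', 'ink'],
    'client':  ['Ann', 'Ben', 'Ann', 'Cat', 'Ben'],
    'revenue': [10.0, 25.0, 5.0, 15.0, 50.0],
})

# Split the rows by a key column, apply sum() to each group,
# and combine the per-group results into one Series.
revenue_per_day = sales.groupby('date')['revenue'].sum()
revenue_per_product = sales.groupby('product')['revenue'].sum()
revenue_per_client = sales.groupby('client')['revenue'].sum()

print(revenue_per_day)
print(revenue_per_product)
print(revenue_per_client)
```

Each result is indexed by the grouping key, so a single group's total can be looked up with, for example, `revenue_per_client['Ben']`.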

In [2]:
import numpy as np
import pandas as pd

In [3]:
# An example:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

  key1 key2     data1     data2
0    a  one  0.395494  0.929851
1    a  two -0.471687  0.130195
2    b  one -0.803664  0.769901
3    b  two  1.163361  0.459188
4    a  one -0.993663  0.746376


In [4]:
# Split data1 values according to key1:
groups = df['data1'].groupby(df['key1'])
groups

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f755009f780>

In [7]:
# Apply mean() function to find the average value for each group
means = groups.mean()
means

key1
a   -0.356619
b    0.179848
Name: data1, dtype: float64

In [11]:
# Convert it to a data frame
df_means = means.to_frame(name='data1_mean')
df_means

      data1_mean
key1
a      -0.356619
b       0.179848


In [12]:
# Put all operations in one statement
df_means = df['data1'].groupby(df['key1']).mean().to_frame(name='data1_mean')
df_means

      data1_mean
key1
a      -0.356619
b       0.179848


In [13]:
# Exercise: split data2 according to key2, and calculate the sum.
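
One possible solution is sketched below. To keep the result reproducible, `data2` is fixed to the values displayed earlier in this section; the original cell draws them at random, so your own numbers will differ.

```python
import pandas as pd

# Same keys as the example data frame; data2 is fixed to the values shown
# in the earlier output (an assumption made for reproducibility).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data2': [0.929851, 0.130195, 0.769901, 0.459188, 0.746376]})

# Split data2 according to key2, then sum the values in each group.
sums = df['data2'].groupby(df['key2']).sum()
print(sums)
```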



We can use more than one column as keys.

In [14]:
# Split the data according to both key1 and key2
groups = df['data1'].groupby([df['key1'], df['key2']])

In [16]:
# Calculate the mean
means = groups.mean()
means

key1  key2
a     one    -0.299084
      two    -0.471687
b     one    -0.803664
      two     1.163361
Name: data1, dtype: float64

We obtain a pandas Series with **hierarchical indexing**. It can be converted to a data frame using `unstack()`.

In [19]:
# Convert it to a data frame
means.unstack()

key2       one       two
key1
a    -0.299084 -0.471687
b    -0.803664  1.163361


In [20]:
# Put all operations in one statement
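
A possible one-statement version chains the split, the mean, and the reshaping. This is a sketch: `data1` is fixed to the values displayed earlier in this section rather than drawn at random.

```python
import pandas as pd

# data1 is fixed to the values shown in the earlier output
# (an assumption made for reproducibility).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [0.395494, -0.471687, -0.803664, 1.163361, -0.993663]})

# Split by both keys, average each group, and pivot key2 into columns.
table = df['data1'].groupby([df['key1'], df['key2']]).mean().unstack()
print(table)
```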



In [23]:
# Split the entire data frame
df.groupby([df['key1'], df['key2']]).mean()

              data1     data2
key1 key2
a    one  -0.299084  0.838113
     two  -0.471687  0.130195
b    one  -0.803664  0.769901
     two   1.163361  0.459188


In [25]:
# Frequently the grouping information is found in the same data frame as the data 
# you want to work on. In that case, simply put column names as the keys:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one  -0.299084  0.838113
     two  -0.471687  0.130195
b    one  -0.803664  0.769901
     two   1.163361  0.459188


In [28]:
# Find the number of instances in each subgroup
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

**Iterating Over Groups**

The GroupBy object supports iteration, yielding a sequence of 2-tuples that each contain the group name and the corresponding chunk of data.

In [29]:
# Show the content of each group.
groups = df.groupby(['key1', 'key2'])
for name, group in groups:
    print("Name:", name)
    print(group)

Name: ('a', 'one')
  key1 key2     data1     data2
0    a  one  0.395494  0.929851
4    a  one -0.993663  0.746376
Name: ('a', 'two')
  key1 key2     data1     data2
1    a  two -0.471687  0.130195
Name: ('b', 'one')
  key1 key2     data1     data2
2    b  one -0.803664  0.769901
Name: ('b', 'two')
  key1 key2     data1     data2
3    b  two  1.163361  0.459188
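
Because iteration yields `(name, group)` pairs, the groups can also be collected into a dictionary for later lookup. The sketch below assumes a single grouping key and uses fixed values in place of the random data.

```python
import pandas as pd

# Fixed values stand in for the random data used in this section.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [0.395494, -0.471687, -0.803664, 1.163361, -0.993663]})

# Build a {group name: sub-frame} mapping from the (name, group) pairs.
pieces = {name: group for name, group in df.groupby('key1')}
print(pieces['b'])
```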


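**Syntactic sugar**: selecting columns for `groupby()`

Indexing a GroupBy object with a column name (or a list of column names) is shorthand for grouping just those columns. Each pair of consecutive statements below gives the same result.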

In [34]:
df.groupby('key1')['data1'].min()

key1
a   -0.993663
b   -0.803664
Name: data1, dtype: float64

In [36]:
df['data1'].groupby(df['key1']).min()

key1
a   -0.993663
b   -0.803664
Name: data1, dtype: float64

In [35]:
df.groupby('key1')[['data2']].min()

         data2
key1
a     0.130195
b     0.459188


In [37]:
df[['data2']].groupby(df['key1']).min()

         data2
key1
a     0.130195
b     0.459188


**Grouping with dictionary**

In [41]:
values = np.array([
    [100, 80, 95],
    [55, 60, 45],
    [70, 75, 90],
    [75, 70, 60],
    [60, 73, 75],
    [72, 63, 70]
])
data = pd.DataFrame(values,
                   columns=['Midterm', 'Project', 'Final'],
                   index=['Alics', 'Bob', 'Chris', 'Doug', 'Eva', "Frank"])
data

       Midterm  Project  Final
Alics      100       80     95
Bob         55       60     45
Chris       70       75     90
Doug        75       70     60
Eva         60       73     75
Frank       72       63     70


In [42]:
gender = {
    'Alics': 'F',
    'Bob': 'M',
    'Chris': 'M',
    'Doug': 'M',
    'Eva': 'F',
    'Frank': 'M'
}

In [43]:
# split the rows according to gender
data.groupby(gender).size()

F    2
M    4
dtype: int64

In [44]:
data.groupby(gender).mean()

   Midterm  Project  Final
F     80.0     76.5  85.00
M     68.0     67.0  66.25


**Grouping with functions**

Any function passed as a group key will be called once per index value, with the returned values being used as the group names.

In [47]:
data.groupby(lambda x: x[0]).mean()

   Midterm  Project  Final
A      100       80     95
B       55       60     45
C       70       75     90
D       75       70     60
E       60       73     75
F       72       63     70
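
In the example above every name starts with a different letter, so each group holds a single row. As a sketch of a function key that produces larger groups, the built-in `len` can be passed directly, which groups the students by the length of their names (same scores table as above).

```python
import numpy as np
import pandas as pd

values = np.array([
    [100, 80, 95],
    [55, 60, 45],
    [70, 75, 90],
    [75, 70, 60],
    [60, 73, 75],
    [72, 63, 70],
])
data = pd.DataFrame(values,
                    columns=['Midterm', 'Project', 'Final'],
                    index=['Alics', 'Bob', 'Chris', 'Doug', 'Eva', 'Frank'])

# len is called once per index label; names of equal length share a group.
by_len = data.groupby(len).mean()
print(by_len)
```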


**Example: Filling Missing Values with Group-Specific Values**

In [53]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.DataFrame(np.random.randn(8), index=states, columns=['Value'])
data.loc[['Vermont', 'Nevada', 'Idaho']] = np.nan
data['group_key'] = group_key
data

               Value group_key
Ohio        0.746745      East
New York   -0.506052      East
Vermont          NaN      East
Florida     0.489283      East
Oregon      0.998212      West
Nevada           NaN      West
California  0.816119      West
Idaho            NaN      West


In [54]:
# Fill the missing values with mean value
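
One way to do this is `fillna()` with the overall mean of the column. The sketch below fixes `Value` to the numbers displayed above instead of drawing them at random.

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
# Values copied from the output above (an assumption for reproducibility).
data = pd.DataFrame({'Value': [0.746745, -0.506052, np.nan, 0.489283,
                               0.998212, np.nan, 0.816119, np.nan],
                     'group_key': ['East'] * 4 + ['West'] * 4},
                    index=states)

# mean() skips NaN by default, so this is the mean of the five known values.
overall_mean = data['Value'].mean()
filled = data['Value'].fillna(overall_mean)
print(filled)
```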



In [55]:
# Find the average value of eastern states and western states


# Fill missing values with group specific average
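
A possible approach is to compute the group means with `groupby()`, then use `transform('mean')` to broadcast each group's mean back to the original rows before filling. The sketch again fixes `Value` to the numbers displayed above.

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
# Values copied from the output above (an assumption for reproducibility).
data = pd.DataFrame({'Value': [0.746745, -0.506052, np.nan, 0.489283,
                               0.998212, np.nan, 0.816119, np.nan],
                     'group_key': ['East'] * 4 + ['West'] * 4},
                    index=states)

# Average value of the eastern and western states.
group_means = data.groupby('group_key')['Value'].mean()
print(group_means)

# transform('mean') returns a Series aligned with data, holding each row's
# group mean; fillna() then uses it only where Value is missing.
row_means = data.groupby('group_key')['Value'].transform('mean')
filled = data['Value'].fillna(row_means)
print(filled)
```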



In [56]:
# Fill missing values with the following rule:
# East: 0.5
# West: -0.5
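
One way to apply a fixed value per group is to map each row's `group_key` to its fill value and pass the resulting Series to `fillna()`. In this sketch the dictionary name `fill_values` is an illustrative choice, and `Value` is fixed to the numbers displayed above.

```python
import numpy as np
import pandas as pd

states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
# Values copied from the output above (an assumption for reproducibility).
data = pd.DataFrame({'Value': [0.746745, -0.506052, np.nan, 0.489283,
                               0.998212, np.nan, 0.816119, np.nan],
                     'group_key': ['East'] * 4 + ['West'] * 4},
                    index=states)

fill_values = {'East': 0.5, 'West': -0.5}

# map() yields one fill value per row; fillna() aligns on the index and
# only touches the rows where Value is missing.
filled = data['Value'].fillna(data['group_key'].map(fill_values))
print(filled)
```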



**Example: Random Sampling and Permutation**

In [58]:
# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)
deck

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
AS      1
2S      2
3S      3
4S      4
5S      5
6S      6
7S      7
8S      8
9S      9
10S    10
JS     10
KS     10
QS     10
AC      1
2C      2
3C      3
4C      4
5C      5
6C      6
7C      7
8C      8
9C      9
10C    10
JC     10
KC     10
QC     10
AD      1
2D      2
3D      3
4D      4
5D      5
6D      6
7D      7
8D      8
9D      9
10D    10
JD     10
KD     10
QD     10
dtype: int64

In [59]:
# Randomly sample 5 rows
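
A possible solution uses `Series.sample()`. The deck is rebuilt here so the sketch runs on its own, and `random_state=0` is an arbitrary choice that only makes the draw repeatable.

```python
import pandas as pd

suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = [str(name) + suit for suit in suits for name in base_names]
deck = pd.Series(card_val, index=cards)

# Draw 5 distinct cards (sampling is without replacement by default).
hand = deck.sample(5, random_state=0)
print(hand)
```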



In [60]:
# Randomly sample 2 cards for each suit
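
One way to sample within groups is to group the deck by suit (the last character of each card name) and sample inside each group. The helper name `get_suit` is an illustrative choice, and the deck is rebuilt so the sketch runs on its own.

```python
import pandas as pd

suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = [str(name) + suit for suit in suits for name in base_names]
deck = pd.Series(card_val, index=cards)

def get_suit(card):
    # The suit is the last letter of the card name, e.g. '10H' -> 'H'.
    return card[-1]

# group_keys=False keeps the card names as the index of the result
# instead of adding the suit as an extra index level.
two_per_suit = deck.groupby(get_suit, group_keys=False).apply(lambda g: g.sample(2))
print(two_per_suit)
```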

