# Week 13
# GroupBy Mechanics

Many data processing tasks follow a **split-apply-combine** pattern. For example, to analyze a sales dataset you may want to answer questions such as:
1. What is the total revenue for each day?
2. What are the total sales of each product?
3. How much has each client purchased in total?

Each of these questions requires splitting the data into groups, applying a calculation to each group, and combining the results into a new table. In Pandas this is mostly done with the `groupby()` method.
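
To make the three steps concrete, here is a minimal sketch that performs them by hand and then with `groupby()`. The `sales` table and its column names are made up for illustration; they are not part of the course data.

```python
import pandas as pd

# Hypothetical sales records (made-up values and column names).
sales = pd.DataFrame({
    "day": ["Mon", "Mon", "Tue", "Tue"],
    "product": ["pen", "ink", "pen", "pen"],
    "revenue": [10.0, 25.0, 12.0, 8.0],
})

# Manual version of split-apply-combine.
manual = {}
for day in sales["day"].unique():
    part = sales[sales["day"] == day]       # split: rows belonging to one day
    manual[day] = part["revenue"].sum()     # apply: total revenue of that day
manual = pd.Series(manual, name="revenue")  # combine: one value per day

# The same computation expressed with groupby().
by_day = sales.groupby("day")["revenue"].sum()

print(manual)
print(by_day)
```

Both versions should produce the same totals per day; the `groupby()` form simply hides the loop.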

**Readings**
- Textbook, Chapter 10

In [1]:
import numpy as np
import pandas as pd

In [2]:
# An example:
df = pd.DataFrame({'Name' : ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
                   'Course' : ['Programming', 'Programming', 'Programming','Data Structure', 'Data Structure','Data Structure',],
                   'Semester': ['Spring 2019', 'Fall 2019', 'Fall 2019', 'Spring 2019', 'Fall 2019', 'Spring 2019'],
                   'Homework' : np.random.randint(60, 100, size=6),
                   'Exam' : np.random.randint(60, 100, size=6)})
df

,Name,Course,Semester,Homework,Exam
0,Alice,Programming,Spring 2019,84,63
1,Bob,Programming,Fall 2019,78,72
2,Charlie,Programming,Fall 2019,69,84
3,Alice,Data Structure,Spring 2019,69,89
4,Bob,Data Structure,Fall 2019,74,62
5,Charlie,Data Structure,Spring 2019,65,99


In [3]:
# Split exam scores according to name
groups = df['Exam'].groupby(df['Name'])

groups

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000197614B4E20>

In [4]:
# Apply mean() function to find the average value for each group
means = groups.mean()

means

Name
Alice      76.0
Bob        67.0
Charlie    91.5
Name: Exam, dtype: float64

The result is a **data series** indexed by the group keys. It can be converted to a data frame with the `to_frame()` method.

In [6]:
# A common practice is to convert the results to a data frame
df_means = means.to_frame(name='Average Exam Score')

df_means

Name,Average Exam Score
Alice,76.0
Bob,67.0
Charlie,91.5


In [7]:
# Put all operations in one statement
df_means = df['Exam'].groupby(df['Name']).mean().to_frame(name="Average Exam Score")

df_means

Name,Average Exam Score
Alice,76.0
Bob,67.0
Charlie,91.5


In [12]:
# Exercise: Find the average homework score and exam score for each course

# 1. Find the average homework scores for each course
groups = df['Homework'].groupby(df['Course'])
means = groups.mean()
df_means = means.to_frame(name="Average Homework Score")

df_means

# 2. Find the average exam scores for each course
df_means_exam = df['Exam'].groupby(df['Course']).mean().to_frame(name="Average Exam Score")

df_means_exam

# 3. merge df_means with df_means_exam
# pd.merge(df_means, df_means_exam, on='Course') # Merge on the Course attribute
pd.merge(df_means, df_means_exam, left_index=True, right_index=True) # Merge on the index

Course,Average Homework Score,Average Exam Score
Data Structure,69.333333,83.333333
Programming,77.0,73.0
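
As a possible shortcut for this exercise, both averages can be computed in one pass by grouping the data frame and selecting the two score columns together, which avoids the merge. This is a sketch rather than the expected solution; the scores are hard-coded from the table shown earlier so that the result is reproducible.

```python
import pandas as pd

# Scores copied from the example table above (the original uses random values).
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob", "Charlie"],
    "Course": ["Programming"] * 3 + ["Data Structure"] * 3,
    "Homework": [84, 78, 69, 69, 74, 65],
    "Exam": [63, 72, 84, 89, 62, 99],
})

# Group once by course, then average both score columns at the same time.
course_means = (
    df.groupby("Course")[["Homework", "Exam"]]
    .mean()
    .rename(columns={"Homework": "Average Homework Score",
                     "Exam": "Average Exam Score"})
)

print(course_means)
```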


## Split Data with Multiple Columns

We can use more than one column as keys to split data into groups.

In [13]:
# Split the exam scores according to both course name and semester.
groups = df['Exam'].groupby([df['Course'], df['Semester']])

In [14]:
# Calculate the average score
means = groups.mean()

means

Course          Semester   
Data Structure  Fall 2019      62
                Spring 2019    94
Programming     Fall 2019      78
                Spring 2019    63
Name: Exam, dtype: int32

In [15]:
groups2 = df['Exam'].groupby([df['Semester'], df['Course']])
means2 = groups2.mean()
means2

Semester     Course        
Fall 2019    Data Structure    62
             Programming       78
Spring 2019  Data Structure    94
             Programming       63
Name: Exam, dtype: int32

In [16]:
# Convert the result to a data frame
df_means = means.to_frame(name='Average Exam Score')

df_means

Course,Semester,Average Exam Score
Data Structure,Fall 2019,62
Data Structure,Spring 2019,94
Programming,Fall 2019,78
Programming,Spring 2019,63


`means` is a data series with **hierarchical indexing**. It can also be reshaped into a two-dimensional data frame using `unstack()`, which moves one index level into the columns.

In [17]:
means.index

MultiIndex([('Data Structure',   'Fall 2019'),
            ('Data Structure', 'Spring 2019'),
            (   'Programming',   'Fall 2019'),
            (   'Programming', 'Spring 2019')],
           names=['Course', 'Semester'])

In [18]:
means.unstack()  # unstack() moves the innermost level of a multi-level index into the columns

Semester,Fall 2019,Spring 2019
Course,,
Data Structure,62,94
Programming,78,63


We can specify which index level to unstack with the `level` argument.

In [19]:
means.unstack(level=0)

Course,Data Structure,Programming
Semester,,
Fall 2019,62,78
Spring 2019,94,63


In [20]:
means.unstack(level=1)

Semester,Fall 2019,Spring 2019
Course,,
Data Structure,62,94
Programming,78,63
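
The level can also be given by name instead of by position, which tends to be easier to read. The sketch below checks that both forms agree and that `stack()` reverses `unstack()`. The exam scores are hard-coded from the example table so the numbers are reproducible.

```python
import pandas as pd

# Exam scores copied from the example table above.
df = pd.DataFrame({
    "Course": ["Programming"] * 3 + ["Data Structure"] * 3,
    "Semester": ["Spring 2019", "Fall 2019", "Fall 2019",
                 "Spring 2019", "Fall 2019", "Spring 2019"],
    "Exam": [63, 72, 84, 89, 62, 99],
})

means = df.groupby(["Course", "Semester"])["Exam"].mean()

# Unstack the 'Course' level by name and by position (level 0).
by_name = means.unstack(level="Course")
by_position = means.unstack(level=0)
print(by_name.equals(by_position))

# stack() moves the column labels back into the index.
round_trip = means.unstack().stack()
print(round_trip)
```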


In [21]:
# Exercise:
# Using one statement, create the above data frame directly from df.

df['Exam'].groupby([df['Course'], df['Semester']]).mean().unstack()

Semester,Fall 2019,Spring 2019
Course,,
Data Structure,62,94
Programming,78,63


We can also group the entire data frame instead of a single column. The aggregation is then applied to every applicable column.

In [22]:
# numeric_only=True skips text columns such as 'Name', which cannot be averaged
df.groupby([df['Course'], df['Semester']]).mean(numeric_only=True)

Course,Semester,Homework,Exam
Data Structure,Fall 2019,74.0,62.0
Data Structure,Spring 2019,67.0,94.0
Programming,Fall 2019,73.5,78.0
Programming,Spring 2019,84.0,63.0


In [23]:
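# Frequently the grouping information is found in the same data frame as the data
# you want to work on. In that case, simply pass the column names as the keys.
# numeric_only=True skips text columns such as 'Name', which cannot be averaged.
df.groupby(['Course', 'Semester']).mean(numeric_only=True)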

Course,Semester,Homework,Exam
Data Structure,Fall 2019,74.0,62.0
Data Structure,Spring 2019,67.0,94.0
Programming,Fall 2019,73.5,78.0
Programming,Spring 2019,84.0,63.0


In [25]:
# Exercise:
# Use `size()` method to find the number of students for each course 
# in each semester

df.groupby(['Course', 'Semester']).size().to_frame(name='Number of Students')

Course,Semester,Number of Students
Data Structure,Fall 2019,1
Data Structure,Spring 2019,2
Programming,Fall 2019,2
Programming,Spring 2019,1


## Iterating Over Groups

The GroupBy object supports iteration, yielding a sequence of 2-tuples that each contain the group name and the data of that group.

In [26]:
# Show the content of each group.
groups = df.groupby('Name')

for name, group in groups:
    print("Name:", name)
    print(group)

Name: Alice
    Name          Course     Semester  Homework  Exam
0  Alice     Programming  Spring 2019        84    63
3  Alice  Data Structure  Spring 2019        69    89
Name: Bob
  Name          Course   Semester  Homework  Exam
1  Bob     Programming  Fall 2019        78    72
4  Bob  Data Structure  Fall 2019        74    62
Name: Charlie
      Name          Course     Semester  Homework  Exam
2  Charlie     Programming    Fall 2019        69    84
5  Charlie  Data Structure  Spring 2019        65    99
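
Besides looping, a single group can be pulled out with `get_group()`, or all groups can be collected into a dictionary keyed by group name. The following is a small sketch of both, with the example scores hard-coded so it runs on its own.

```python
import pandas as pd

# Data copied from the example table above.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob", "Charlie"],
    "Course": ["Programming"] * 3 + ["Data Structure"] * 3,
    "Semester": ["Spring 2019", "Fall 2019", "Fall 2019",
                 "Spring 2019", "Fall 2019", "Spring 2019"],
    "Homework": [84, 78, 69, 69, 74, 65],
    "Exam": [63, 72, 84, 89, 62, 99],
})

groups = df.groupby("Name")

# Retrieve the rows of one group directly.
bob = groups.get_group("Bob")
print(bob)

# Collect every group into a dictionary: {group name: data frame}.
pieces = {name: group for name, group in groups}
print(pieces["Alice"])
```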


**Syntactic sugar**: It is simpler to select columns by indexing the GroupBy object directly, as in the following statement.

In [27]:
df.groupby('Name')['Exam'].mean()

Name
Alice      76.0
Bob        67.0
Charlie    91.5
Name: Exam, dtype: float64

In [28]:
# The equivalent long form
df['Exam'].groupby(df['Name']).mean()

Name
Alice      76.0
Bob        67.0
Charlie    91.5
Name: Exam, dtype: float64

In [30]:
# The following statement does not work because 
# 'Name' is not a column in df['Exam']

# df['Exam'].groupby('Name').mean()

In [32]:
# Exercise:
# Use the simplified groupby expression to find the average homework scores of each semester.

df.groupby('Semester')['Homework'].mean().to_frame(name="Average Homework Score")

Semester,Average Homework Score
Fall 2019,73.666667
Spring 2019,72.666667


## Grouping with a dictionary

We can use a separate dictionary to define the groups. The dictionary maps each index label to the name of its group.

In [33]:
df

,Name,Course,Semester,Homework,Exam
0,Alice,Programming,Spring 2019,84,63
1,Bob,Programming,Fall 2019,78,72
2,Charlie,Programming,Fall 2019,69,84
3,Alice,Data Structure,Spring 2019,69,89
4,Bob,Data Structure,Fall 2019,74,62
5,Charlie,Data Structure,Spring 2019,65,99


In [36]:
genders = {
    "Alice": "Female",
    "Bob": "Male",
    "Charlie": "Male"
}

In [37]:
data = df.set_index("Name")

data

Name,Course,Semester,Homework,Exam
Alice,Programming,Spring 2019,84,63
Bob,Programming,Fall 2019,78,72
Charlie,Programming,Fall 2019,69,84
Alice,Data Structure,Spring 2019,69,89
Bob,Data Structure,Fall 2019,74,62
Charlie,Data Structure,Spring 2019,65,99


In [38]:
# Split the data according to gender (the dictionary is looked up by index label, i.e. by name)
data.groupby(genders).size()

Female    2
Male      4
dtype: int64

In [39]:
for gender, group in data.groupby(genders):
    print("Gender:", gender)
    print(group)

Gender: Female
               Course     Semester  Homework  Exam
Name                                              
Alice     Programming  Spring 2019        84    63
Alice  Data Structure  Spring 2019        69    89
Gender: Male
                 Course     Semester  Homework  Exam
Name                                                
Bob         Programming    Fall 2019        78    72
Charlie     Programming    Fall 2019        69    84
Bob      Data Structure    Fall 2019        74    62
Charlie  Data Structure  Spring 2019        65    99


In [40]:
# A list with one label per row can also supply the grouping information.
# This is not recommended because the mapping from rows to groups is hard to read.
genders = ['F', 'M', 'M', 'F', 'M', 'M']

data.groupby(genders).size()

F    2
M    4
dtype: int64

In [41]:
# Exercise: Calculate the average scores for each gender
# (numeric_only=True skips the text columns 'Course' and 'Semester')
data.groupby(genders).mean(numeric_only=True)

,Homework,Exam
F,76.5,76.0
M,71.5,79.25
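
If we prefer not to move `Name` into the index, one alternative is to translate the column through the dictionary with `map()` and group by the resulting series. This is a sketch of that variation, not the approach used above; the label `Gender` is a name chosen here for the result index.

```python
import pandas as pd

# Data copied from the example table above.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob", "Charlie"],
    "Homework": [84, 78, 69, 69, 74, 65],
    "Exam": [63, 72, 84, 89, 62, 99],
})

genders = {"Alice": "Female", "Bob": "Male", "Charlie": "Male"}

# Translate each name into its group label; the result aligns with df row by row.
gender_key = df["Name"].map(genders).rename("Gender")

sizes = df.groupby(gender_key).size()
means = df.groupby(gender_key)[["Homework", "Exam"]].mean()

print(sizes)
print(means)
```

Unlike a hand-written list of labels, the dictionary stays correct if the rows are reordered.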


## Grouping with functions

Any function passed as a group key will be called once per index value, with the returned values being used as the group names.

In [42]:
def get_initial(name):
    return name[0]

In [43]:
get_initial("Liang")

'L'

In [44]:
data.groupby(get_initial).mean(numeric_only=True)

,Homework,Exam
A,76.5,76.0
B,76.0,67.0
C,67.0,91.5


In [45]:
# The function can also be defined with a lambda expression
data.groupby(lambda x: x[0]).mean(numeric_only=True)

,Homework,Exam
A,76.5,76.0
B,76.0,67.0
C,67.0,91.5
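
Built-in functions work as group keys too. As a small illustration, the sketch below groups the students by the length of their names using `len`, which is called on each index label. The data is hard-coded from the example table.

```python
import pandas as pd

# Data copied from the example table above, indexed by student name.
data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob", "Charlie"],
    "Homework": [84, 78, 69, 69, 74, 65],
    "Exam": [63, 72, 84, 89, 62, 99],
}).set_index("Name")

# len() is applied to every index label: 'Bob' -> 3, 'Alice' -> 5, 'Charlie' -> 7.
exam_by_length = data.groupby(len)["Exam"].mean()

print(exam_by_length)
```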


### Exercise:
Last week's homework asked us to create a data frame with the number of airports in each country. Let's think about how this can be achieved with the groupby mechanism.

In [46]:
airports = pd.read_csv("https://ourairports.com/data/airports.csv", sep=",")
airports.head()

,id,ident,type,name,latitude_deg,longitude_deg,elevation_ft,continent,iso_country,iso_region,municipality,scheduled_service,gps_code,iata_code,local_code,home_link,wikipedia_link,keywords
0,6523,00A,heliport,Total Rf Heliport,40.070801,-74.933601,11.0,,US,US-PA,Bensalem,no,00A,,00A,,,
1,323361,00AA,small_airport,Aero B Ranch Airport,38.704022,-101.473911,3435.0,,US,US-KS,Leoti,no,00AA,,00AA,,,
2,6524,00AK,small_airport,Lowell Field,59.9492,-151.695999,450.0,,US,US-AK,Anchor Point,no,00AK,,00AK,,,
3,6525,00AL,small_airport,Epps Airpark,34.864799,-86.770302,820.0,,US,US-AL,Harvest,no,00AL,,00AL,,,
4,6526,00AR,closed,Newport Hospital & Clinic Heliport,35.6087,-91.254898,237.0,,US,US-AR,Newport,no,,,,,,00AR


In [47]:
countries = pd.read_csv("https://ourairports.com/data/countries.csv", sep=',')
countries.head()

,id,code,name,continent,wikipedia_link,keywords
0,302672,AD,Andorra,EU,https://en.wikipedia.org/wiki/Andorra,
1,302618,AE,United Arab Emirates,AS,https://en.wikipedia.org/wiki/United_Arab_Emir...,"UAE,مطارات في الإمارات العربية المتحدة"
2,302619,AF,Afghanistan,AS,https://en.wikipedia.org/wiki/Afghanistan,
3,302722,AG,Antigua and Barbuda,,https://en.wikipedia.org/wiki/Antigua_and_Barbuda,
4,302723,AI,Anguilla,,https://en.wikipedia.org/wiki/Anguilla,


In [54]:
# Create a data frame showing the number of airports for each country.

groups = airports.groupby("iso_country")  # which column should be used as keys?
results = groups.size()  # which method should be applied to the groups?
# results
df = results.to_frame(name="Number of Airports")  # how to convert the previously obtained results into a data frame?
df.head()

iso_country,Number of Airports
AD,2
AE,53
AF,65
AG,4
AI,2


In [60]:
# Add a column of country names
df = pd.merge(df, countries[['code', 'name']], left_index=True, right_on='code')

In [64]:
df = df.drop('code', axis=1)
df.head()

,Number of Airports,name
0,2,Andorra
1,53,United Arab Emirates
2,65,Afghanistan
3,4,Antigua and Barbuda
4,2,Anguilla
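
The whole exercise can also be written as a short chain. The sketch below uses two tiny stand-in tables with made-up rows, so it runs without downloading the real files; only the column names `iso_country`, `code`, and `name` are taken from the data shown above.

```python
import pandas as pd

# Tiny stand-in tables (made-up rows) using the real column names.
airports = pd.DataFrame({
    "ident": ["A1", "A2", "A3", "A4"],
    "iso_country": ["AD", "AE", "AE", "AE"],
})
countries = pd.DataFrame({
    "code": ["AD", "AE", "AF"],
    "name": ["Andorra", "United Arab Emirates", "Afghanistan"],
})

# Count the airports per country code and turn the result into a two-column frame.
counts = (
    airports.groupby("iso_country")
    .size()
    .rename("Number of Airports")
    .reset_index()
)

# Attach the country names, then drop the code columns that are no longer needed.
result = (
    counts.merge(countries, left_on="iso_country", right_on="code", how="left")
    .drop(columns=["iso_country", "code"])
)

print(result)
```

Using `how="left"` keeps every country code that appears in `airports`, even if it has no match in `countries`.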
