# Week 13
# GroupBy Mechanics

Many data processing tasks follow a **split-apply-combine** pattern. For example, you may want to answer the following questions when analyzing a sales dataset:
1. What is the total revenue of each day?
2. What are the total sales of each product?
3. How much has each client purchased in total?

All of these questions require you to split the data into groups, apply a calculation to each group, and finally combine the results into a new table. In pandas this is mostly done with the `groupby()` method.
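
As a rough illustration of the three questions above, here is a minimal sketch on a tiny sales table. The table, its column names (`date`, `product`, `client`, `revenue`) and its values are made up for this example and are not part of the course data.

```python
import pandas as pd

# Hypothetical sales records; names and numbers are invented for illustration.
sales = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"],
    "product": ["pen", "ink", "pen", "pen"],
    "client": ["Ann", "Ben", "Ann", "Cal"],
    "revenue": [10.0, 25.0, 5.0, 15.0],
})

# Split the revenue by date, apply sum() to each group, combine into one Series.
revenue_per_day = sales["revenue"].groupby(sales["date"]).sum()

# The same pattern answers the other two questions with a different key.
revenue_per_product = sales["revenue"].groupby(sales["product"]).sum()
revenue_per_client = sales["revenue"].groupby(sales["client"]).sum()

print(revenue_per_day)
```

Only the grouping key changes between the three questions; the split, apply and combine steps stay the same.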

**Readings**
- Textbook, Chapter 8

In [None]:
import numpy as np
import pandas as pd

In [None]:
# An example:
df = pd.DataFrame({'Name' : ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
                   'Course' : ['Programming', 'Programming', 'Programming','Data Structure', 'Data Structure','Data Structure',],
                   'Semester': ['Spring 2019', 'Fall 2019', 'Fall 2019', 'Spring 2019', 'Fall 2019', 'Spring 2019'],
                   'Homework' : np.random.randint(60, 100, size=6),
                   'Exam' : np.random.randint(60, 100, size=6)})
df

In [None]:
# Split exam scores according to name
groups = df['Exam'].groupby(df['Name'])

groups

In [None]:
# Apply mean() function to find the average value for each group
means = groups.mean()

means

The result is a **Series** indexed by the group names. It can be converted to a data frame with the `to_frame()` method.

In [None]:
# A common practice is to convert the results to a data frame
df_means = means.to_frame(name='Average Exam Score')

df_means
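
Because the scores in `df` are random, the output above changes on every run. The sketch below repeats the same steps on a small table with fixed, made-up scores (the name `scores` is an assumption of this example), so the result can be checked by hand.

```python
import pandas as pd

# Fixed, invented scores so the result is reproducible.
scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "Exam": [80, 70, 90, 60],
})

# A Series indexed by Name: Alice -> 85.0, Bob -> 65.0
means = scores["Exam"].groupby(scores["Name"]).mean()

# A one-column data frame with the same index
df_means = means.to_frame(name="Average Exam Score")

print(type(means).__name__, type(df_means).__name__)
print(df_means)
```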

In [None]:
# Put all operations in one statement
df_means = df['Exam'].groupby(df['Name']).mean().to_frame(name='Average Exam Score')

df_means

In [None]:
# Exercise: Find the average homework score and exam score for each course




## Split Data with Multiple Columns

We can use more than one column as keys to split data into groups.

In [None]:
# Split the exam score according to both course name and semester.
groups = df['Exam'].groupby([df['Course'], df['Semester']])

In [None]:
# Calculate the average score
means = groups.mean()

means

In [None]:
# Convert the result to a data frame
df_means = means.to_frame(name='Average Exam Score')

df_means

`means` is a Series with **hierarchical indexing** (a `MultiIndex`). It can be reshaped into a data frame with `unstack()`, which moves one index level to the columns.

In [None]:
means.index

In [None]:
means.unstack()

We can specify which index level to unstack.

In [None]:
means.unstack(level=0)

In [None]:
means.unstack(level=1)
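
To make the effect of `level` easier to see, here is a small sketch with fixed, made-up scores (the table `scores` is an assumption of this example). It also shows that a level can be referred to by its name instead of its position.

```python
import pandas as pd

# Fixed, invented scores so the result is reproducible.
scores = pd.DataFrame({
    "Course": ["Programming"] * 3 + ["Data Structure"] * 3,
    "Semester": ["Spring 2019", "Fall 2019", "Fall 2019",
                 "Spring 2019", "Fall 2019", "Spring 2019"],
    "Exam": [80, 70, 90, 60, 75, 85],
})

means = scores["Exam"].groupby([scores["Course"], scores["Semester"]]).mean()

# Default: the innermost level (Semester) becomes the columns.
by_semester = means.unstack()

# level=0, or equivalently its name "Course", moves the outer level instead.
by_course = means.unstack(level="Course")

print(by_semester)
print(by_course)
```

The two results hold the same numbers; only the roles of rows and columns are swapped.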

In [None]:
# Exercise:
# Using one statement, create the above data frame directly from df.



We can also split the entire data frame instead of a single column.

In [None]:
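# numeric_only=True restricts the mean to the numeric columns (Homework, Exam);
# pandas 2.0+ raises a TypeError when asked to average text columns such as 'Name'
df.groupby([df['Course'], df['Semester']]).mean(numeric_only=True)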

In [None]:
# Frequently the grouping information is found in the same data frame as the data
# you want to work on. In that case, simply pass the column names as the keys
# (numeric_only=True skips text columns such as 'Name'):
df.groupby(['Course', 'Semester']).mean(numeric_only=True)
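
The sketch below checks, on fixed made-up scores, that passing the key columns as Series and passing their names give the same averages. The table `scores` and its values are assumptions of this example.

```python
import pandas as pd

# Fixed, invented scores so the result is reproducible.
scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice", "Bob", "Charlie"],
    "Course": ["Programming"] * 3 + ["Data Structure"] * 3,
    "Semester": ["Spring 2019", "Fall 2019", "Fall 2019",
                 "Spring 2019", "Fall 2019", "Spring 2019"],
    "Homework": [90, 80, 70, 60, 85, 75],
    "Exam": [80, 70, 90, 60, 75, 85],
})

# Keys given as Series taken from the same data frame
by_series = scores.groupby([scores["Course"], scores["Semester"]]).mean(numeric_only=True)

# Keys given as column names
by_label = scores.groupby(["Course", "Semester"]).mean(numeric_only=True)

print(by_label)
print((by_series.values == by_label.values).all())
```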

In [None]:
# Exercise:
# Use the `size()` method to find the number of students in each course
# in each semester



## Iterating Over Groups

The GroupBy object supports iteration: each step yields a 2-tuple containing the group name and the rows that belong to that group.

In [None]:
# Show the content of each group.
groups = df.groupby('Name')
for name, group in groups:
    print("Name:", name)
    print(group)
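
Iteration is also handy when each group needs custom handling. As a minimal sketch on fixed, made-up scores (the table `scores` is an assumption of this example), the loop below collects the best exam score of each student into a dictionary.

```python
import pandas as pd

# Fixed, invented scores so the result is reproducible.
scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "Exam": [80, 70, 90, 60],
})

# Each iteration yields (group name, data frame holding that group's rows).
best = {}
for name, group in scores.groupby("Name"):
    best[name] = int(group["Exam"].max())

print(best)
```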

**Syntactic sugar**: To aggregate only some columns, it is simpler to group the data frame first and then select the columns from the GroupBy object:

In [None]:
df.groupby('Name')['Exam'].mean()

In [None]:
# The standard statement
df['Exam'].groupby(df['Name']).mean()

In [None]:
# The following statement does not work because 
# 'Name' is not a column in df['Exam']

# df['Exam'].groupby('Name').mean()
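
The following sketch, on fixed made-up scores (the table `scores` is an assumption of this example), checks that the short form and the standard form return the same Series. It also shows that selecting a list of columns yields a data frame.

```python
import pandas as pd

# Fixed, invented scores so the result is reproducible.
scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "Homework": [90, 80, 70, 60],
    "Exam": [80, 70, 90, 60],
})

short = scores.groupby("Name")["Exam"].mean()
standard = scores["Exam"].groupby(scores["Name"]).mean()
print(short.equals(standard))

# A list of column names selects several columns and returns a data frame.
both = scores.groupby("Name")[["Homework", "Exam"]].mean()
print(both)
```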

## Grouping with a Dictionary
We can use a separate dictionary to decide the groups. Each index label is looked up in the dictionary, and the value found becomes its group.

In [None]:
genders = {
    "Alice": "Female",
    "Bob": "Male",
    "Charlie": "Male"
}

In [None]:
data = df.set_index("Name")

data

In [None]:
# Split the data according to gender
data.groupby(genders).size()
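
To show what the dictionary does, here is a sketch on a small table with fixed, made-up scores (the table is an assumption of this example). It builds the group keys explicitly by mapping the index through the same dictionary.

```python
import pandas as pd

# Fixed, invented scores; the names are used as the index.
data = pd.DataFrame(
    {"Exam": [80, 70, 90, 60]},
    index=pd.Index(["Alice", "Bob", "Charlie", "Alice"], name="Name"),
)
genders = {"Alice": "Female", "Bob": "Male", "Charlie": "Male"}

# Each index label is looked up in the dictionary to find its group.
counts = data.groupby(genders).size()

# The same keys, built explicitly from the index
explicit_keys = data.index.map(genders)

print(counts)
print(list(explicit_keys))
```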

In [None]:
# A list with one entry per row can also supply the grouping keys.
# Not recommended: the mapping from rows to groups is hard to read.
genders = ['F', 'M', 'M', 'F', 'M', 'M']

data.groupby(genders).size()

In [None]:
# Exercise: Calculate the average scores for each gender
# (numeric_only=True skips the text columns Course and Semester)
data.groupby(genders).mean(numeric_only=True)

## Grouping with Functions

Any function passed as a group key will be called once per index value, with the returned values being used as the group names.

In [None]:
def get_initial(name):
    return name[0]

In [None]:
# numeric_only=True skips the text columns (Course, Semester)
data.groupby(get_initial).mean(numeric_only=True)
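
Any callable that accepts an index label can serve as the key. As a minimal sketch on fixed, made-up scores (the table is an assumption of this example), the built-in `len` groups students by the length of their names.

```python
import pandas as pd

# Fixed, invented scores; the names are used as the index.
data = pd.DataFrame(
    {"Exam": [80, 70, 90, 60]},
    index=["Alice", "Bob", "Charlie", "Alice"],
)

# len is called on each index label: "Alice" -> 5, "Bob" -> 3, "Charlie" -> 7
mean_by_length = data.groupby(len)["Exam"].mean()

print(mean_by_length)
```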

In [None]:
# The function can also be defined with a lambda expression
data.groupby(lambda x: x[0]).mean(numeric_only=True)

### Exercise:
Last week's homework asked us to create a data frame with the number of airports in each country. Let's think about how this can be achieved with the groupby mechanism.
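
One possible approach, sketched on a tiny stand-in table: the homework data is not shown here, so the data frame `airports` and its column names (`code`, `country`) are assumptions of this example.

```python
import pandas as pd

# Stand-in for the homework data; the column names are assumed.
airports = pd.DataFrame({
    "code": ["JFK", "LAX", "LHR", "NRT"],
    "country": ["US", "US", "GB", "JP"],
})

# Split by country, count the rows in each group, combine into a data frame.
airport_counts = airports.groupby("country").size().to_frame(name="airport_count")

print(airport_counts)
```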