# Week 10
# GroupBy Mechanics

Many data processing tasks follow a **split-apply-combine** pattern. For example, when analyzing a sales dataset you may want to answer the following questions:
1. What is the total revenue of each day?
2. What is the total sales of each product?
3. How much has each client purchased in total?

These operations all require you to split the data into groups, apply a calculation to each group, and finally combine the results into a new table. In pandas this is mostly done with the `groupby()` method.
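As a rough sketch of what split-apply-combine means, the snippet below computes the revenue of each day twice: once with an explicit loop and once with `groupby()`. The `sales` table and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sales records (invented for this illustration).
sales = pd.DataFrame({
    'day': ['Mon', 'Mon', 'Tue', 'Tue', 'Tue'],
    'product': ['pen', 'ink', 'pen', 'pen', 'ink'],
    'revenue': [10, 25, 10, 10, 25],
})

# Manual split-apply-combine.
pieces = {}
for day in sales['day'].unique():            # split: one subset per day
    subset = sales[sales['day'] == day]
    pieces[day] = subset['revenue'].sum()    # apply: total revenue of the subset
manual = pd.Series(pieces, name='revenue')   # combine: collect the results

# The same computation expressed with groupby().
grouped = sales.groupby('day')['revenue'].sum()

print(manual)
print(grouped)
```

Both versions should give the same totals; `groupby()` simply hides the loop.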

**Readings**
- Textbook, Chapter 10

In [2]:
import numpy as np
import pandas as pd

In [23]:
# An example:
df = pd.DataFrame({'Name' : ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
                   'Course' : ['Programming', 'Programming', 'Programming','Data Structure', 'Data Structure','Data Structure',],
                   'Semester': ['Spring 2019', 'Fall 2019', 'Fall 2019', 'Spring 2019', 'Fall 2019', 'Spring 2019'],
                   'Homework' : np.random.randint(60, 100, size=6),
                   'Exam' : np.random.randint(60, 100, size=6)})
df

Unnamed: 0,Name,Course,Semester,Homework,Exam
0,Alice,Programming,Spring 2019,69,92
1,Bob,Programming,Fall 2019,74,79
2,Charlie,Programming,Fall 2019,65,78
3,Alice,Data Structure,Spring 2019,75,62
4,Bob,Data Structure,Fall 2019,97,92
5,Charlie,Data Structure,Spring 2019,98,78


In [24]:
# Split exam scores according to name
groups = df['Exam'].groupby(df['Name'])

groups

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000024ABF36C460>

In [25]:
# Apply mean() function to find the average value for each group
means = groups.mean()

means

Name
Alice      77.0
Bob        85.5
Charlie    78.0
Name: Exam, dtype: float64

The result is a **data series**. It can be converted to a data frame with the `to_frame()` method.

In [26]:
# A common practice is to convert the results to a data frame
df_means = means.to_frame(name='Average Exam Score')

df_means

Unnamed: 0_level_0,Average Exam Score
Name,Unnamed: 1_level_1
Alice,77.0
Bob,85.5
Charlie,78.0
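If the group labels are more convenient as an ordinary column than as the index, `reset_index()` is one possible alternative to `to_frame()`. The sketch below uses a fixed copy of the example table, with the exam scores hard-coded from the output shown above because the original data is random.

```python
import pandas as pd

# Fixed copy of the example table; only the columns needed here.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Exam': [92, 79, 78, 62, 92, 78],
})

means = df['Exam'].groupby(df['Name']).mean()

# to_frame() keeps Name as the index; reset_index() turns it into a column.
flat = means.reset_index(name='Average Exam Score')
print(flat)
```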


In [27]:
# Put all operations in one statement
df_means = df['Exam'].groupby(df['Name']).mean().to_frame(name="Average Exam Score")

df_means

Unnamed: 0_level_0,Average Exam Score
Name,Unnamed: 1_level_1
Alice,77.0
Bob,85.5
Charlie,78.0


In [28]:
# Exercise: Find the average homework score for each course
groups = (df['Homework']).groupby(df['Course'])
df_hw = groups.mean().to_frame(name="Average Homework Score")
df_hw

Unnamed: 0_level_0,Average Homework Score
Course,Unnamed: 1_level_1
Data Structure,90.0
Programming,69.333333


## Split Data with Multiple Columns

We can use more than one column as keys to split data into groups.

In [29]:
# Split the exam scores according to both course name and semester.
groups = df['Exam'].groupby([df['Course'], df['Semester']])

In [30]:
# Calculate the average score
means = groups.mean()

means

Course          Semester   
Data Structure  Fall 2019      92.0
                Spring 2019    70.0
Programming     Fall 2019      78.5
                Spring 2019    92.0
Name: Exam, dtype: float64

In [11]:
# Reversing the order of the keys changes the index hierarchy, not the group means
groups2 = df['Exam'].groupby([df['Semester'], df['Course']])
means2 = groups2.mean()
means2

Semester     Course        
Fall 2019    Data Structure    92.0
             Programming       78.5
Spring 2019  Data Structure    70.0
             Programming       92.0
Name: Exam, dtype: float64

In [31]:
# Convert the result to a data frame
df_means = means.to_frame(name='Average Exam Score')

df_means

Unnamed: 0_level_0,Unnamed: 1_level_0,Average Exam Score
Course,Semester,Unnamed: 2_level_1
Data Structure,Fall 2019,92.0
Data Structure,Spring 2019,70.0
Programming,Fall 2019,78.5
Programming,Spring 2019,92.0


`means` is a data series with **hierarchical indexing** (a `MultiIndex`). It can be reshaped into a data frame using `unstack()`.

In [13]:
means.index

MultiIndex([('Data Structure',   'Fall 2019'),
            ('Data Structure', 'Spring 2019'),
            (   'Programming',   'Fall 2019'),
            (   'Programming', 'Spring 2019')],
           names=['Course', 'Semester'])

In [14]:
means.unstack()  # unstack() moves the innermost index level (Semester) into the columns, giving a data frame

Semester,Fall 2019,Spring 2019
Course,Unnamed: 1_level_1,Unnamed: 2_level_1
Data Structure,92.0,70.0
Programming,78.5,92.0


We can specify which index level to unstack.

In [33]:
means.unstack(level=0)

Course,Data Structure,Programming
Semester,Unnamed: 1_level_1,Unnamed: 2_level_1
Fall 2019,92.0,78.5
Spring 2019,70.0,92.0


In [16]:
means.unstack(level=1)

Semester,Fall 2019,Spring 2019
Course,Unnamed: 1_level_1,Unnamed: 2_level_1
Data Structure,92.0,70.0
Programming,78.5,92.0
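The level to unstack can also be given by name rather than by position, which may be easier to read. A minimal sketch, again using a fixed copy of the example table (exam scores hard-coded from the output shown earlier):

```python
import pandas as pd

# Fixed copy of the example table; only the columns needed here.
df = pd.DataFrame({
    'Course': ['Programming', 'Programming', 'Programming',
               'Data Structure', 'Data Structure', 'Data Structure'],
    'Semester': ['Spring 2019', 'Fall 2019', 'Fall 2019',
                 'Spring 2019', 'Fall 2019', 'Spring 2019'],
    'Exam': [92, 79, 78, 62, 92, 78],
})

means = df.groupby(['Course', 'Semester'])['Exam'].mean()

by_position = means.unstack(level=0)        # level given by position
by_name = means.unstack(level='Course')     # same level given by name
print(by_name)
```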


In [17]:
# Exercise:
# Using one statement, create the above data frame directly from df.

df['Exam'].groupby([df['Course'], df['Semester']]).mean().unstack()

Semester,Fall 2019,Spring 2019
Course,Unnamed: 1_level_1,Unnamed: 2_level_1
Data Structure,92.0,70.0
Programming,78.5,92.0


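We can also group the entire data frame instead of a single column.

In [35]:
# numeric_only=True skips the text column (Name); recent pandas versions
# raise an error when asked for the mean of a text column
df.groupby([df['Course'], df['Semester']]).mean(numeric_only=True)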

Unnamed: 0_level_0,Unnamed: 1_level_0,Homework,Exam
Course,Semester,Unnamed: 2_level_1,Unnamed: 3_level_1
Data Structure,Fall 2019,97.0,92.0
Data Structure,Spring 2019,86.5,70.0
Programming,Fall 2019,69.5,78.5
Programming,Spring 2019,69.0,92.0


In [39]:
# Frequently the grouping information is found in the same data frame as the data 
# you want to work on. In that case, simply put column names as the keys:
df.groupby(['Course', 'Semester'])[['Homework', 'Exam']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Homework,Exam
Course,Semester,Unnamed: 2_level_1,Unnamed: 3_level_1
Data Structure,Fall 2019,97.0,92.0
Data Structure,Spring 2019,86.5,70.0
Programming,Fall 2019,69.5,78.5
Programming,Spring 2019,69.0,92.0


In [40]:
# Exercise:
# Use `size()` method to find the number of students for each course 
# in each semester

# df.groupby(['Course', 'Semester'])['Exam'].count().to_frame(name='Number of Students')
df.groupby(['Course', 'Semester']).size().to_frame(name='Number of Students')

Unnamed: 0_level_0,Unnamed: 1_level_0,Number of Students
Course,Semester,Unnamed: 2_level_1
Data Structure,Fall 2019,1
Data Structure,Spring 2019,2
Programming,Fall 2019,2
Programming,Spring 2019,1


The pandas documentation lists the methods available on GroupBy objects: [GroupBy reference](https://pandas.pydata.org/docs/reference/groupby.html)
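One method from that list that is often useful is `agg()`, which can apply several functions to each group at once. The following is a small sketch on a fixed copy of the example table (exam scores hard-coded from the output shown earlier).

```python
import pandas as pd

# Fixed copy of the example table; only the columns needed here.
df = pd.DataFrame({
    'Course': ['Programming', 'Programming', 'Programming',
               'Data Structure', 'Data Structure', 'Data Structure'],
    'Exam': [92, 79, 78, 62, 92, 78],
})

# One row per course, one column per aggregation function.
summary = df.groupby('Course')['Exam'].agg(['mean', 'min', 'max'])
print(summary)
```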

## Example: Revisit Movie Ratings

In [41]:
ratings = pd.read_csv("mydata/ml-latest-small/ratings.csv")
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [44]:
# The number of ratings for each movie
num_ratings = ratings.groupby('movieId').size().to_frame(name="# of Ratings")
num_ratings.sort_values("# of Ratings", ascending=False).head(10)

Unnamed: 0_level_0,# of Ratings
movieId,Unnamed: 1_level_1
356,329
318,317
296,307
593,279
2571,278
260,251
480,238
110,237
589,224
527,220


In [43]:
# Calculate the average rating for each movie
groups = ratings.groupby('movieId')
avg_ratings = groups['rating'].mean().to_frame(name="Average Rating")
avg_ratings.sort_values('Average Rating', ascending=False).head(10)

Unnamed: 0_level_0,Average Rating
movieId,Unnamed: 1_level_1
88448,5.0
100556,5.0
143031,5.0
143511,5.0
143559,5.0
6201,5.0
102217,5.0
102084,5.0
6192,5.0
145994,5.0


In [48]:
threshold = 50
filter1 = (num_ratings['# of Ratings'] > threshold)
movies_with_many_ratings = num_ratings[filter1].index.values
movies_with_many_ratings

array([     1,      2,      3,      6,      7,     10,     11,     16,
           17,     19,     21,     25,     32,     34,     36,     39,
           47,     48,     50,     62,     70,     95,    104,    110,
          111,    141,    145,    150,    153,    158,    160,    161,
          163,    165,    168,    172,    173,    185,    208,    223,
          225,    231,    235,    253,    260,    266,    288,    292,
          293,    296,    300,    316,    317,    318,    329,    337,
          339,    344,    349,    350,    353,    356,    357,    364,
          367,    368,    370,    377,    380,    410,    420,    432,
          434,    435,    440,    442,    454,    457,    466,    474,
          480,    485,    500,    508,    509,    520,    527,    539,
          541,    551,    552,    553,    555,    586,    587,    588,
          589,    590,    592,    593,    594,    595,    596,    597,
          608,    648,    653,    673,    708,    733,    736,    750,
      

In [53]:
# Find the top ten movies with the highest average rating among
# those with more than 50 ratings
results = avg_ratings.loc[movies_with_many_ratings, :]
results.sort_values('Average Rating', ascending=False).head(10)

Unnamed: 0_level_0,Average Rating
movieId,Unnamed: 1_level_1
318,4.429022
858,4.289062
2959,4.272936
1276,4.27193
750,4.268041
904,4.261905
1221,4.25969
48516,4.252336
1213,4.25
912,4.24
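The count and the average can also be computed in a single `agg()` call and filtered afterwards. The sketch below is one possible variant, not the approach used above; `toy` is a tiny made-up table with the same column names as `ratings.csv`, and `min_ratings` is a hypothetical threshold.

```python
import pandas as pd

# Tiny made-up ratings table (not the real MovieLens data).
toy = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 1],
    'movieId': [10, 10, 10, 20, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 5.0, 4.0, 5.0],
})

# Named aggregation: each keyword becomes an output column.
stats = toy.groupby('movieId')['rating'].agg(num_ratings='count',
                                             avg_rating='mean')

min_ratings = 2  # hypothetical threshold for this toy data
popular = stats[stats['num_ratings'] >= min_ratings]
top = popular.sort_values('avg_rating', ascending=False)
print(top)
```

Movie 30 has the highest average in the toy data but only one rating, so the filter removes it before sorting.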


## Iterating Over Groups

The GroupBy object supports iteration, providing a sequence of 2-tuples containing the group name along with the data.

In [57]:
# Show the content of each group.
groups = df.groupby('Name')

for name, group in groups:
    print("Name:", name)
    print(group)
group  # Each group is a data frame; after the loop, group holds the last one (Charlie)

Name: Alice
    Name          Course     Semester  Homework  Exam
0  Alice     Programming  Spring 2019        69    92
3  Alice  Data Structure  Spring 2019        75    62
Name: Bob
  Name          Course   Semester  Homework  Exam
1  Bob     Programming  Fall 2019        74    79
4  Bob  Data Structure  Fall 2019        97    92
Name: Charlie
      Name          Course     Semester  Homework  Exam
2  Charlie     Programming    Fall 2019        65    78
5  Charlie  Data Structure  Spring 2019        98    78


Unnamed: 0,Name,Course,Semester,Homework,Exam
2,Charlie,Programming,Fall 2019,65,78
5,Charlie,Data Structure,Spring 2019,98,78
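When only one group is needed, or when the groups should be kept for later use, a loop is not the only option. The sketch below shows `get_group()` and a dictionary built from the iteration, using a fixed copy of the example table (scores hard-coded from the output shown earlier).

```python
import pandas as pd

# Fixed copy of the example table.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Course': ['Programming', 'Programming', 'Programming',
               'Data Structure', 'Data Structure', 'Data Structure'],
    'Homework': [69, 74, 65, 75, 97, 98],
    'Exam': [92, 79, 78, 62, 92, 78],
})

groups = df.groupby('Name')

# Fetch a single group by its name.
bob = groups.get_group('Bob')

# Collect every group into a dictionary keyed by group name.
pieces = {name: group for name, group in groups}

print(bob)
print(sorted(pieces))
```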


**Syntactic sugar**: When the key and the data are columns of the same data frame, it is simpler to group the data frame first and then select the column:

In [58]:
df.groupby('Name')['Exam'].mean()

Name
Alice      77.0
Bob        85.5
Charlie    78.0
Name: Exam, dtype: float64

In [59]:
# The standard statement
df['Exam'].groupby(df['Name']).mean()

Name
Alice      77.0
Bob        85.5
Charlie    78.0
Name: Exam, dtype: float64

In [None]:
# The following statement raises a KeyError: df['Exam'] is a data series,
# so it has no column (or index level) named 'Name'

df['Exam'].groupby('Name').mean()
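The failure can be observed safely by catching the exception. This sketch also checks that the two working forms agree; it uses a fixed copy of the example table with only the columns needed.

```python
import pandas as pd

# Fixed copy of the example table; only the columns needed here.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Exam': [92, 79, 78, 62, 92, 78],
})

raised = False
try:
    # The series df['Exam'] knows nothing about the label 'Name'.
    df['Exam'].groupby('Name').mean()
except KeyError:
    raised = True
print("KeyError raised:", raised)

# Two forms that do work and give the same result.
standard = df['Exam'].groupby(df['Name']).mean()
shortcut = df.groupby('Name')['Exam'].mean()
print(standard.equals(shortcut))
```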

In [62]:
# Exercise:
# Use the simplified groupby expression to find the average homework scores of each semester.
df.groupby('Semester')['Homework'].mean()

Semester
Fall 2019      78.666667
Spring 2019    80.666667
Name: Homework, dtype: float64

## Grouping with a Dictionary
We can use a separate dictionary to decide the groups. The dictionary maps index values to group names.

In [63]:
df

Unnamed: 0,Name,Course,Semester,Homework,Exam
0,Alice,Programming,Spring 2019,69,92
1,Bob,Programming,Fall 2019,74,79
2,Charlie,Programming,Fall 2019,65,78
3,Alice,Data Structure,Spring 2019,75,62
4,Bob,Data Structure,Fall 2019,97,92
5,Charlie,Data Structure,Spring 2019,98,78


In [64]:
genders = {
    "Alice": "Female",
    "Bob": "Male",
    "Charlie": "Male"
}

In [65]:
data = df.set_index("Name")

data

Unnamed: 0_level_0,Course,Semester,Homework,Exam
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Alice,Programming,Spring 2019,69,92
Bob,Programming,Fall 2019,74,79
Charlie,Programming,Fall 2019,65,78
Alice,Data Structure,Spring 2019,75,62
Bob,Data Structure,Fall 2019,97,92
Charlie,Data Structure,Spring 2019,98,78


In [66]:
# Split the data according to gender
data.groupby(genders).size()

Name
Female    2
Male      4
dtype: int64

In [67]:
for gender, group in data.groupby(genders):
    print("Gender:", gender)
    print(group)

Gender: Female
               Course     Semester  Homework  Exam
Name                                              
Alice     Programming  Spring 2019        69    92
Alice  Data Structure  Spring 2019        75    62
Gender: Male
                 Course     Semester  Homework  Exam
Name                                                
Bob         Programming    Fall 2019        74    79
Charlie     Programming    Fall 2019        65    78
Bob      Data Structure    Fall 2019        97    92
Charlie  Data Structure  Spring 2019        98    78
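If changing the index is not desirable, one alternative is to translate the `Name` column through the dictionary with `map()` and group by the resulting series. This is a sketch of that variant on a fixed copy of the example table (exam scores hard-coded from the output shown earlier).

```python
import pandas as pd

# Fixed copy of the example table; only the columns needed here.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Exam': [92, 79, 78, 62, 92, 78],
})

genders = {"Alice": "Female", "Bob": "Male", "Charlie": "Male"}

# map() looks up each name in the dictionary; the result is used as the key.
by_gender = df.groupby(df['Name'].map(genders))['Exam'].mean()
print(by_gender)
```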


In [70]:
# A list with one label per row (in row order) can also serve as the grouping key.
# Not recommended: the link between rows and labels is hard to read.
genders = ['F', 'M', 'M', 'F', 'M', 'M']

data.groupby(genders).size()

F    2
M    4
dtype: int64

In [71]:
# Exercise: Calculate the average scores for each gender
# numeric_only=True skips the text columns (Course, Semester)
data.groupby(genders).mean(numeric_only=True)

Unnamed: 0,Homework,Exam
F,72.0,77.0
M,83.5,81.75


## Grouping with Functions

Any function passed as a group key will be called once per index value, with the returned values being used as the group names.

In [72]:
def get_initial(name):
    return name[0]

In [73]:
get_initial("Liang")

'L'

In [74]:
data.groupby(get_initial).mean(numeric_only=True)  # numeric_only=True skips the text columns

Unnamed: 0_level_0,Homework,Exam
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
A,72.0,77.0
B,85.5,85.5
C,81.5,78.0


In [75]:
# The function can also be defined with a lambda expression
data.groupby(lambda x: x[0]).mean(numeric_only=True)

Unnamed: 0_level_0,Homework,Exam
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
A,72.0,77.0
B,85.5,85.5
C,81.5,78.0
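Built-in functions work as group keys too. As a small sketch, the snippet below groups students by the length of their name using `len`; it relies on a fixed copy of the example table (scores hard-coded from the output shown earlier) with `Name` as the index.

```python
import pandas as pd

# Fixed copy of the example table, indexed by Name.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Homework': [69, 74, 65, 75, 97, 98],
    'Exam': [92, 79, 78, 62, 92, 78],
})
data = df.set_index('Name')

# len is applied to the index values, so the group names are 3, 5 and 7.
by_length = data.groupby(len)[['Homework', 'Exam']].mean()
print(by_length)
```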
