<a href="https://colab.research.google.com/github/chonginbilly/Moringa_DS/blob/Moringa_python/pandasgroupby.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="green">*To start working on this notebook, or any other notebook that we will use in this course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

---

# Pandas Groupby

## Introduction

The pandas `.groupby()` method splits a DataFrame into groups based on the values in one or more columns. We can then apply a function to each group and combine the results, which makes it an efficient way to aggregate data.

## Objectives

By the end of this lesson, you will be able to:

* Use the `.groupby()` method to aggregate different groups in a DataFrame

## Loading libraries

In [None]:
import pandas as pd
import numpy as np

import os

## Using `.groupby()`

Let's consider an example using the [Titanic dataset](https://drive.google.com/file/d/1Dwrqt3r80WbeukyQ9cuTr7RxC9q8ZBzF/view?usp=sharing).

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data_path = '/content/drive/MyDrive/Product/Naivas Big Data /Data'

os.chdir(data_path)

In [None]:
# import the data
df = pd.read_csv('titanic.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


During exploratory data analysis, we often divide a dataset into subgroups to compare them and identify trends. For example, we might group passengers by gender (the `Sex` column) or by passenger class (the `Pclass` column). pandas DataFrames provide the built-in `.groupby()` method for this. To group passengers by gender, call it as follows:

In [None]:
df.groupby("Sex")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7891ee057850>

Note that this call alone only returns a `DataFrameGroupBy` object. Although you have split the dataset into groups, there is nothing meaningful to display until you chain an ***aggregation function*** onto the groupby, which lets you compute summary statistics for each group.
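To see what the split step produces before any aggregation, here is a minimal sketch. It uses a small made-up DataFrame (`toy` is hypothetical and only stands in for the Titanic data), so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

groups = toy.groupby("Sex")

# Number of rows in each group
sizes = groups.size()
print(sizes)

# Iterating yields one (key, sub-DataFrame) pair per group
for key, subset in groups:
    print(key, subset["Age"].tolist())

# Pull out a single group as an ordinary DataFrame
females = groups.get_group("female")
print(females)
```

Each group is just a slice of the original rows, which is why an aggregation is needed to reduce it to a summary.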

You can quickly use an aggregation function by chaining the call to the end of the `.groupby()` method.

In [None]:
# numeric_only=True restricts the sum to numeric columns (text columns such as Name are skipped)
df.groupby("Sex").sum(numeric_only=True)


| Sex | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| female | 135343 | 233 | 678 | 7286.0 | 218 | 204 | 13966.6628 |
| male | 262043 | 109 | 1379 | 13919.17 | 248 | 136 | 14727.2865 |


Aggregation functions help you quickly compare subsets of the data. For example, the `Survived` column in the output above shows that there were more female survivors (233) than male survivors (109).
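Raw totals depend on how many passengers are in each group, so a proportion can make the comparison fairer. The sketch below uses a small made-up DataFrame (`toy` is hypothetical, not the Titanic file) and relies on the fact that the mean of a 0/1 column is the share of ones.

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = did not survive
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "male"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Number of survivors per group
survivors = toy.groupby("Sex")["Survived"].sum()

# Share of each group that survived
rate = toy.groupby("Sex")["Survived"].mean()

print(survivors)
print(rate)
```

On the real data, the same two calls on `df` would give the survivor counts and survival rates by `Sex`.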

## Aggregation Functions

There are many built-in aggregate methods provided for you in the pandas package, and you can even write and apply your own. Some of the most common aggregate methods you may want to use are:

* `.min()`: returns the minimum value for each column by group
* `.max()`: returns the maximum value for each column by group
* `.mean()`: returns the average value for each column by group
* `.median()`: returns the median value for each column by group
* `.count()`: returns the number of non-null values in each column by group
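As a sketch of writing your own aggregate, a function that takes a group's values and returns a single number can be passed to `.agg()`. The helper name `fare_range` and the `toy` data below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [71.28, 53.10, 13.00, 26.00, 7.25, 8.05],
})

def fare_range(values):
    """Hypothetical helper: gap between the highest and lowest value in a group."""
    return values.max() - values.min()

# .agg() calls fare_range once per group and collects the results
spread = toy.groupby("Pclass")["Fare"].agg(fare_range)
print(spread)
```

Built-in methods are usually faster than custom functions, so a custom aggregate is best kept for statistics pandas does not already provide.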

In [None]:
# example 1
df.groupby("Pclass")["Fare"].mean()

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

In [None]:
# example 2
df.groupby("Sex")["Age"].max()

Sex
female    63.0
male      80.0
Name: Age, dtype: float64

In [None]:
# example 3
df.groupby("Sex")["Fare"].sum()

Sex
female    13966.6628
male      14727.2865
Name: Fare, dtype: float64

## Multiple groups

You can also split data into multiple levels of groups by passing in a list containing the name of every column you want to group by. For instance, you can group by every combination of `Sex` and `Pclass`.

In [None]:
df.groupby(['Sex', 'Pclass'])["Age"].mean()

Sex     Pclass
female  1         34.611765
        2         28.722973
        3         21.750000
male    1         41.281386
        2         30.740707
        3         26.507589
Name: Age, dtype: float64

In [None]:
df.groupby(['Sex', 'Pclass'])["Age"].agg(['mean', 'max', 'min'])


| Sex | Pclass | mean | max | min |
|---|---|---|---|---|
| female | 1 | 34.611765 | 63.0 | 2.0 |
| female | 2 | 28.722973 | 57.0 | 2.0 |
| female | 3 | 21.75 | 63.0 | 0.75 |
| male | 1 | 41.281386 | 80.0 | 0.92 |
| male | 2 | 30.740707 | 70.0 | 0.67 |
| male | 3 | 26.507589 | 74.0 | 0.42 |
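When several statistics are computed at once, named aggregation can give the output columns more descriptive labels. The sketch below assumes a small made-up DataFrame (`toy`), and the column names `mean_age`, `oldest` and `youngest` are illustrative choices, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Pclass": [1, 1, 1, 3, 3],
    "Age": [38.0, 35.0, 54.0, 22.0, 20.0],
})

summary = (
    toy.groupby(["Sex", "Pclass"])
       # new_column_name=(source_column, aggregation)
       .agg(mean_age=("Age", "mean"), oldest=("Age", "max"), youngest=("Age", "min"))
       # turn the Sex/Pclass index levels back into regular columns
       .reset_index()
)
print(summary)
```

Calling `.reset_index()` is optional; it simply flattens the two-level index into ordinary columns, which can be easier to plot or export.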


## Selecting information from grouped objects

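The example below demonstrates the syntax for returning the `count` of values in the `Survived` column for every combination of `Sex` and `Pclass`. Because `.count()` counts non-null entries, this gives the number of passengers with a recorded `Survived` value in each group, not the number of survivors: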

In [None]:
df.groupby(['Sex', 'Pclass'])['Survived'].count()

Sex     Pclass
female  1          94
        2          76
        3         144
male    1         122
        2         108
        3         347
Name: Survived, dtype: int64

The `.groupby()` result above was narrowed down by column (`['Survived']`), but you can also select from it by index value. Take a look:

In [None]:
grouped = df.groupby(['Sex', 'Pclass'])['Survived'].count()

print(grouped['female'])

Pclass
1     94
2     76
3    144
Name: Survived, dtype: int64


In [None]:
print(grouped['female'][1])

94


Note that in the first example you only need to provide the value `female` as the index, and you get back all the groups where the passenger is female, regardless of the `Pclass` value. The second example returns the result for female passengers with a 1st-class ticket.
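As an alternative sketch, both index levels can be given in one step with `.loc` and a tuple, and `.xs()` can select on the inner level (`Pclass`) instead of the outer one. The `toy` DataFrame below is hypothetical and stands in for the Titanic data.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male"],
    "Pclass": [1, 1, 3, 1, 3],
    "Survived": [1, 1, 0, 0, 1],
})

counts = toy.groupby(["Sex", "Pclass"])["Survived"].count()

# One lookup with both index levels: female passengers in 1st class
female_first = counts.loc[("female", 1)]
print(female_first)

# Cross-section on the inner level: 1st class for every Sex
first_class = counts.xs(1, level="Pclass")
print(first_class)
```

Using a single `.loc` lookup avoids chaining two separate selections, which tends to be easier to read when the index has more than one level.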

## Summary

We have explored the process of splitting a DataFrame into subgroups using the `.groupby()` method. We also applied built-in methods to a groupby object, enabling us to generate aggregate views of these groups.