---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.19 (Pandas-11)</h1>

## _Aggregating and Grouping Dataframes.ipynb_

## Learning agenda of this notebook
1. Common Aggregation Functions
2. Grouping Data and Applying Aggregate Functions
3. Use of `series.agg()` method

## 1. Common Aggregation Functions
- An aggregation function takes multiple individual values and returns a single summary value. The most common examples are the average and the sum of a set of values.
- The most common built-in aggregation functions are basic math functions, including:
   - **sum** 
   - **mean** 
   - **median** 
   - **min**, **max** 
   - **standard deviation** 
   - **variance** 
   - **product**
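
As a quick illustration of the functions listed above, here is a minimal sketch on a small, made-up Series (the values are hypothetical and are not taken from `group-marks.csv`). Note that `std()` and `var()` in pandas compute the sample versions (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical marks, made up for illustration only
marks = pd.Series([72, 69, 47, 76], name="math")

summary = {
    "sum": marks.sum(),        # total of all values
    "mean": marks.mean(),      # arithmetic average
    "median": marks.median(),  # middle value of the sorted data
    "min": marks.min(),
    "max": marks.max(),
    "std": marks.std(),        # sample standard deviation (ddof=1)
    "var": marks.var(),        # sample variance (ddof=1)
    "prod": marks.prod(),      # product of all values
}
print(summary)
```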


In [60]:
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()

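| | rollno | name | gender | group | session | age | scholarship | math | english | urdu |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MS01 | SAADIA | female | group B | MORNING | 28 | 2562 | 72.0 | 72 | 74 |
| 1 | MS02 | JUMAIMA | female | group C | AFTERNOON | 33 | 2800 | 69.0 | 90 | 88 |
| 2 | MS03 | ARIFA | female | NaN | EVENING | 21 | 3500 | NaN | 95 | 93 |
| 3 | MS04 | SAADIA | male | group A | MOR | 44 | 2000 | 47.0 | 57 | 44 |
| 4 | MS05 | DANISH | male | group C | AFTERNOON | 54 | 2100 | 76.0 | 78 | 55 |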


### a. Find Maximum/Minimum/Sum Values
- The `df.max()`, `df.min()`, `df.sum()`, and `df.cumsum()` methods find the max/min/sum/cumulative sum of values over the requested axis (the default is `axis=0`, i.e., column-wise).
- By default, these methods are applied to every column, regardless of its data type.
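
To make the role of the axis concrete, the following sketch uses a small hypothetical DataFrame (the roll numbers and marks are illustrative only). It contrasts the default column-wise behaviour with `axis=1`, and shows `cumsum()` on a single column.

```python
import pandas as pd

# Hypothetical marks for three students (illustrative only)
scores = pd.DataFrame(
    {"math": [72, 69, 47], "english": [72, 90, 57], "urdu": [74, 88, 44]},
    index=["MS01", "MS02", "MS04"],
)

col_max = scores.max()              # axis=0 (default): one value per column
row_sum = scores.sum(axis=1)        # axis=1: one value per row (total per student)
running = scores["math"].cumsum()   # running total down the math column

print(col_max)
print(row_sum)
print(running)
```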

In [61]:
# min() returns the minimum of the values over the requested axis (default axis is 0)
df.min()


rollno             MS01
name           ABDULLAH
gender           female
session             AFT
age                  19
scholarship        2000
math               18.0
english              32
urdu                 28
dtype: object

**To make these work only on numeric columns**

In [62]:
# Restrict min() to numeric columns only
df.min(numeric_only=True)

age              19.0
scholarship    2000.0
math             18.0
english          32.0
urdu             28.0
dtype: float64

In [63]:
df.max(numeric_only=True)

age              54.0
scholarship    4000.0
math             97.0
english          95.0
urdu             93.0
dtype: float64

In [64]:
df.sum(numeric_only=True)

age              1705.0
scholarship    143777.0
math             2846.0
english          3391.0
urdu             3264.0
dtype: float64

#### Suppose we want to apply an aggregate function to a specific column only

**To find maximum marks in a specific subject only**

In [65]:
df['math'].max()

97.0

In [72]:
df.math.max()

97.0

#### Similarly, we can apply other aggregate functions to any column of this dataframe

### b. Descriptive Statistical Measures
- The `df.mean()` method returns the mean of the values over the requested axis.
- The `df.median()` method returns the median of the values over the requested axis.
- The `df.mode()` method returns the mode(s) of each element along the selected axis. The mode of a set of values is the value that appears most often; there can be more than one.
- The `df.count()` method returns the count of non-NA values over the requested axis.
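
The sketch below illustrates how these four methods treat missing values, using a hypothetical Series with one `NaN`. By default `mean()`, `median()`, and `mode()` skip missing values, and `count()` reports only the non-NA entries, so it can be smaller than `len()`.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with one missing value (illustrative only)
s = pd.Series([50.0, 60.0, 60.0, 70.0, 70.0, np.nan])

avg = s.mean()       # NaN is skipped: (50 + 60 + 60 + 70 + 70) / 5
mid = s.median()     # middle of the five non-missing values
modes = s.mode()     # 60.0 and 70.0 both appear twice, so both are returned
n_valid = s.count()  # number of non-NA values
n_total = len(s)     # number of entries, including NaN

print(avg, mid, modes.tolist(), n_valid, n_total)
```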

In [76]:
df.math.mean()

61.869565217391305

In [77]:
df.math.median()

64.5

In [78]:
df.math.mode()

0    69.0
dtype: float64

In [71]:
# Count the number of non-NA values in each column
df.count()

rollno         50
name           50
gender         50
group          47
session        50
age            50
scholarship    50
math           46
english        50
urdu           50
dtype: int64

### c. How to compute the average marks in the subject of math for Group A students only

In [122]:
df[df.group == 'group A']

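| | rollno | name | gender | group | session | age | scholarship | math | english | urdu |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | MS04 | SAADIA | male | group A | MOR | 44 | 2000 | 47.0 | 57 | 44 |
| 13 | MS14 | USAMA | male | group A | AFTERNOON | 26 | 2654 | 78.0 | 72 | 70 |
| 14 | MS15 | NAVAIRA | female | group A | AFT | 25 | 2137 | 50.0 | 53 | 58 |
| 25 | MS26 | IBRAR | male | group A | EVENING | 39 | 3500 | 73.0 | 74 | 72 |
| 46 | MS47 | SABA | female | group A | EVE | 36 | 2500 | 55.0 | 65 | 62 |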


In [123]:
df[df.group == 'group A'].math

3     47.0
13    78.0
14    50.0
25    73.0
46    55.0
Name: math, dtype: float64

In [124]:
df[df.group == 'group A'].math.mean()

60.6

**What if I want to do the same for all the groups in the dataset?**

**Use `df.groupby()` method**

In [125]:
df.groupby('group').math.mean()

group
group A    60.600000
group B    57.250000
group C    64.769231
group D    59.583333
group E    89.000000
Name: math, dtype: float64

## 2. Grouping Data and Applying Aggregation Functions

<img align="right" width="400" height="600"  src="images/group-by.png"  >

- The `df.groupby()` method is used to split the data into groups based on some criteria. Doing this manually would require masking, aggregation, and merging commands. The power of `groupby()` is that it abstracts away these steps: the user need not think about how the computation is done under the hood, but can instead think about the operation as a whole.
- It does not return a `DataFrame` but a `DataFrameGroupBy` object, which can be thought of as a special view of the `DataFrame` that is poised to dig into the groups but does no actual computation until an aggregation is applied.
- To produce a result, we apply an aggregate function to this `DataFrameGroupBy` object, which performs the appropriate apply/combine steps.


- It follows a split-apply-combine strategy, which is a 3-step process:
   - **Split Step:** splits the data into groups by creating a `DataFrameGroupBy` object from the original DataFrame.
   - **Apply Step:** applies an aggregate function to each individual group.
   - **Combine Step:** merges the per-group results into a new Series or DataFrame.
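
To see what `groupby()` abstracts away, the three steps can be sketched by hand with boolean masks. This is only an illustration on a hypothetical DataFrame, not a description of how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical data (illustrative only)
data = pd.DataFrame({
    "group": ["group A", "group B", "group A", "group B"],
    "math": [40.0, 70.0, 60.0, 90.0],
})

# Split: build one sub-DataFrame per distinct group value using boolean masks
pieces = {key: data[data["group"] == key] for key in data["group"].unique()}

# Apply: aggregate each piece independently
means = {key: piece["math"].mean() for key, piece in pieces.items()}

# Combine: merge the per-group results into a single Series
manual = pd.Series(means, name="math").sort_index()

# groupby() expresses the same three steps in one line
grouped = data.groupby("group")["math"].mean()

print(manual)
print(grouped)
```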

### a. Split the Data into Groups
- You can group by 'group', 'session', 'gender', or a combination of columns
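
Grouping on a combination of columns is done by passing a list of column names. The sketch below uses a small hypothetical DataFrame; the result is indexed by a `MultiIndex` with one level per grouping column.

```python
import pandas as pd

# Hypothetical data (illustrative only)
df_combo = pd.DataFrame({
    "group": ["group A", "group A", "group B", "group B"],
    "gender": ["male", "female", "male", "male"],
    "math": [47.0, 50.0, 72.0, 60.0],
})

# One result row per (group, gender) combination that occurs in the data
by_two = df_combo.groupby(["group", "gender"])["math"].mean()
print(by_two)
```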

In [93]:
g1 = df.groupby('group')
g1

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe92a8cd910>

#### To display records of a specific group, use the `get_group()` method on the `DataFrameGroupBy` object

In [94]:
# Display DataFrame of a specific group from groupby object by providing the specific group value
df.groupby('group').get_group('group A')

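| | rollno | name | gender | group | session | age | scholarship | math | english | urdu |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | MS04 | SAADIA | male | group A | MOR | 44 | 2000 | 47.0 | 57 | 44 |
| 13 | MS14 | USAMA | male | group A | AFTERNOON | 26 | 2654 | 78.0 | 72 | 70 |
| 14 | MS15 | NAVAIRA | female | group A | AFT | 25 | 2137 | 50.0 | 53 | 58 |
| 25 | MS26 | IBRAR | male | group A | EVENING | 39 | 3500 | 73.0 | 74 | 72 |
| 46 | MS47 | SABA | female | group A | EVE | 36 | 2500 | 55.0 | 65 | 62 |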


In [95]:
# Display DataFrame of a specific group from groupby object by providing the specific group value
df.groupby('group').get_group('group E')

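| | rollno | name | gender | group | session | age | scholarship | math | english | urdu |
|---|---|---|---|---|---|---|---|---|---|---|
| 34 | MS35 | ABDULLAH | male | group E | MORNING | 45 | 2500 | 97.0 | 87 | 82 |
| 35 | MS36 | OSAMA | male | group E | AFTERNOON | 31 | 3500 | 81.0 | 81 | 79 |
| 44 | MS45 | ZAINAB | female | group E | MOR | 28 | 3500 | NaN | 56 | 54 |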


#### Similarly, you can use the `groupby()` method on the `session` column of the dataframe as well

In [120]:
df.groupby('session').get_group('MOR')

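| | rollno | name | gender | group | session | age | scholarship | math | english | urdu |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | MS04 | SAADIA | male | group A | MOR | 44 | 2000 | 47.0 | 57 | 44 |
| 10 | MS11 | MUSTJAB | male | group C | MOR | 46 | 3000 | 58.0 | 54 | 52 |
| 12 | MS13 | MAHOOR | female | NaN | MOR | 25 | 2345 | 65.0 | 81 | 73 |
| 16 | MS17 | NOFIL | male | group C | MOR | 22 | 3500 | 88.0 | 89 | 86 |
| 22 | MS23 | RAUF | male | group D | MOR | 31 | 2500 | 44.0 | 54 | 53 |
| 29 | MS30 | ROSHAN | female | group D | MOR | 42 | 3500 | 62.0 | 70 | 75 |
| 37 | MS38 | NASEEM | female | group D | MOR | 26 | 2500 | 50.0 | 64 | 59 |
| 44 | MS45 | ZAINAB | female | group E | MOR | 28 | 3500 | NaN | 56 | 54 |
| 48 | MS49 | FATIMA | female | group D | MOR | 40 | 2500 | 57.0 | 74 | 76 |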


#### To display the row indices of every group in the dataframe, use the `groups` attribute of the `DataFrameGroupBy` object

In [96]:
# Display the groups, i.e., each group name mapped to the row indices that belong to it
df.groupby('group').groups

{'group A': [3, 13, 14, 25, 46], 'group B': [0, 5, 6, 7, 9, 17, 21, 26, 31, 39, 42, 43, 45], 'group C': [1, 4, 10, 15, 16, 18, 19, 23, 27, 28, 40, 41, 47, 49], 'group D': [8, 11, 20, 22, 24, 29, 30, 33, 36, 37, 38, 48], 'group E': [34, 35, 44]}

#### To find the size of each group, use the `size()` method

In [97]:
# Display the size of each group
df.groupby('group').size()

group
group A     5
group B    13
group C    14
group D    12
group E     3
dtype: int64
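
The group sizes above add up to 47, which matches the 47 non-NA values reported for the `group` column by `df.count()`: rows whose grouping key is missing are left out by default. The sketch below illustrates this on a hypothetical DataFrame and shows the `dropna=False` option, which assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing group label (illustrative only)
students = pd.DataFrame({
    "name": ["S1", "S2", "S3", "S4"],
    "group": ["group A", np.nan, "group B", "group A"],
})

# Default behaviour: the row with a missing key belongs to no group
sizes = students.groupby("group").size()

# dropna=False keeps missing keys as a group of their own
sizes_with_nan = students.groupby("group", dropna=False).size()

print(sizes)
print(sizes_with_nan)
```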

### b. Apply Aggregate Function (Apply + Combine)
- To produce a result, we can apply an aggregate function which will perform the appropriate apply + combine steps to produce the desired result

**Let us first apply an aggregate function to a specific column of the `DataFrameGroupBy` object, which gives a `SeriesGroupBy` object**

In [98]:
# compute mean of marks in math for each group
df.groupby('group').math.mean()

group
group A    60.600000
group B    57.250000
group C    64.769231
group D    59.583333
group E    89.000000
Name: math, dtype: float64

In [104]:
# compute max of specific column of each group
df.groupby('group').math.max()

group
group A    78.0
group B    88.0
group C    88.0
group D    75.0
group E    97.0
Name: math, dtype: float64

In [110]:
# compute median of specific column of each group
df.groupby('group').scholarship.median()

group
group A    2500.0
group B    2800.0
group C    3000.0
group D    2750.0
group E    3500.0
Name: scholarship, dtype: float64

**Until now, we have applied aggregation functions to one column only, i.e., to a `SeriesGroupBy` object.**

**Let us now apply an aggregate function to the entire `DataFrameGroupBy` object**

In [133]:
df.groupby('group').sum()

| group | age | scholarship | math | english | urdu |
|---|---|---|---|---|---|
| group A | 170 | 12791 | 303.0 | 321 | 306 |
| group B | 421 | 35662 | 687.0 | 812 | 792 |
| group C | 510 | 41667 | 842.0 | 981 | 942 |
| group D | 425 | 34812 | 715.0 | 805 | 778 |
| group E | 104 | 9500 | 178.0 | 224 | 215 |


In [103]:
df.groupby('group').mean()

| group | age | scholarship | math | english | urdu |
|---|---|---|---|---|---|
| group A | 34.0 | 2558.2 | 60.6 | 64.2 | 61.2 |
| group B | 32.384615 | 2743.230769 | 57.25 | 62.461538 | 60.923077 |
| group C | 36.428571 | 2976.214286 | 64.769231 | 70.071429 | 67.285714 |
| group D | 35.416667 | 2901.0 | 59.583333 | 67.083333 | 64.833333 |
| group E | 34.666667 | 3166.666667 | 89.0 | 74.666667 | 71.666667 |


In [101]:
df.groupby('group').max()

| group | rollno | name | gender | session | age | scholarship | math | english | urdu |
|---|---|---|---|---|---|---|---|---|---|
| group A | MS47 | USAMA | male | MOR | 44 | 3500 | 78.0 | 74 | 72 |
| group B | MS46 | USAMA | male | MORNING | 47 | 3800 | 88.0 | 95 | 92 |
| group C | MS50 | ZAIN | male | MORNING | 54 | 4000 | 88.0 | 90 | 88 |
| group D | MS49 | UNAIZA | male | MORNING | 53 | 3500 | 75.0 | 90 | 88 |
| group E | MS45 | ZAINAB | male | MORNING | 45 | 3500 | 97.0 | 87 | 82 |


By default, the `max()` and `min()` aggregate methods apply to numerical as well as categorical (string) columns; string values are compared alphabetically.
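
If only the numeric columns are wanted, the grouped aggregation methods accept `numeric_only=True`. The sketch below uses a hypothetical DataFrame. Note that the outputs in this notebook come from an older pandas release; to my knowledge, from pandas 2.0 onwards a grouped `mean()` no longer silently drops string columns, so passing `numeric_only=True` explicitly is the safer habit.

```python
import pandas as pd

# Hypothetical data (illustrative only)
df_demo = pd.DataFrame({
    "group": ["group A", "group A", "group B"],
    "name": ["SAADIA", "USAMA", "DANISH"],
    "math": [47.0, 78.0, 76.0],
})

# max() on all columns: names are compared alphabetically within each group
all_cols = df_demo.groupby("group").max()

# Restrict the aggregation to numeric columns only
numeric_max = df_demo.groupby("group").max(numeric_only=True)
numeric_mean = df_demo.groupby("group").mean(numeric_only=True)

print(all_cols)
print(numeric_max)
print(numeric_mean)
```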

## 3. Use of `series.agg()` method
- The `series.agg()` method lets us specify one function, or a list of functions, to be applied to a pandas Series or a `SeriesGroupBy` object all at once.
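
As a minimal sketch of the three cases just described (a single function, a list of functions, and a `SeriesGroupBy` object), consider the following hypothetical data. A single function name returns a scalar, a list returns a Series indexed by function name, and on a grouped Series a list produces one column per function.

```python
import pandas as pd

# Hypothetical data (illustrative only)
df_agg = pd.DataFrame({
    "group": ["group A", "group A", "group B", "group B"],
    "math": [47.0, 78.0, 72.0, 60.0],
})

single = df_agg["math"].agg("mean")            # one function -> scalar
several = df_agg["math"].agg(["min", "max"])   # list -> Series indexed by function name

# On a SeriesGroupBy object: one row per group, one column per function
per_group = df_agg.groupby("group")["math"].agg(["count", "sum", "mean"])

print(single)
print(several)
print(per_group)
```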

In [132]:
df.groupby('group').math

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fe92cbb1070>

In [130]:
df.groupby('group').math.agg(['count', 'sum', 'mean'])

| group | count | sum | mean |
|---|---|---|---|
| group A | 5 | 303.0 | 60.6 |
| group B | 12 | 687.0 | 57.25 |
| group C | 13 | 842.0 | 64.769231 |
| group D | 12 | 715.0 | 59.583333 |
| group E | 2 | 178.0 | 89.0 |


In [131]:
df.groupby('group').agg(['count', 'sum', 'mean'])

| group | age count | age sum | age mean | scholarship count | scholarship sum | scholarship mean | math count | math sum | math mean | english count | english sum | english mean | urdu count | urdu sum | urdu mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| group A | 5 | 170 | 34.0 | 5 | 12791 | 2558.2 | 5 | 303.0 | 60.6 | 5 | 321 | 64.2 | 5 | 306 | 61.2 |
| group B | 13 | 421 | 32.384615 | 13 | 35662 | 2743.230769 | 12 | 687.0 | 57.25 | 13 | 812 | 62.461538 | 13 | 792 | 60.923077 |
| group C | 14 | 510 | 36.428571 | 14 | 41667 | 2976.214286 | 13 | 842.0 | 64.769231 | 14 | 981 | 70.071429 | 14 | 942 | 67.285714 |
| group D | 12 | 425 | 35.416667 | 12 | 34812 | 2901.0 | 12 | 715.0 | 59.583333 | 12 | 805 | 67.083333 | 12 | 778 | 64.833333 |
| group E | 3 | 104 | 34.666667 | 3 | 9500 | 3166.666667 | 2 | 178.0 | 89.0 | 3 | 224 | 74.666667 | 3 | 215 | 71.666667 |
