#  <font color="green">Aggregation and Group By</font>

----------------

Data aggregation is a process in which data is gathered and represented in a summary form, for purposes including statistical analysis.

Some aggregation methods were encountered in the descriptive statistics lesson. However, in this lesson, they will be applied to a groupby object in order to aggregate by groups. This means that we can perform aggregation operations, such as sum, mean, count, etc., on specific groups within our dataset. Grouping the data allows us to analyze subsets of the data independently and gain insights into how different groups behave or differ from each other.

## Pandas Aggregation Syntax Options:

Pandas provides several options for performing aggregation operations on data. Some common syntax options include:

- Using the `groupby()` function: This function allows us to group the data based on one or more variables and then apply aggregation functions to each group.

- Using the `agg()` function: This function is used in combination with `groupby()` to specify the aggregation functions to apply to each group. It accepts a single function name, a list of functions, or a dictionary whose keys are column names and whose values are the aggregation functions to apply.

- Using named aggregation with `agg()`: This syntax allows us to specify aggregations for different columns using keyword arguments. Each keyword names an output and maps to a `(column, aggregation)` tuple.

- Using method chaining: Pandas also supports method chaining, where multiple operations can be applied in sequence using dot notation. For example, `groupby().agg().reset_index()`.
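The four options listed above can be sketched on a small hand-made table. The DataFrame `df` and its columns `team`, `points` and `assists` are hypothetical and only serve as an illustration; they are not part of the penguins dataset used later.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [10, 20, 5, 15, 25],
    "assists": [1, 3, 2, 4, 6],
})

# groupby() followed by a single aggregation method
points_sum = df.groupby("team")["points"].sum()

# groupby() + agg() with a dictionary {column: aggregation(s)}
by_dict = df.groupby("team").agg({"points": ["min", "max"], "assists": "mean"})

# groupby() + named aggregation: output_name=(source_column, aggregation)
named = df.groupby("team").agg(
    total_points=("points", "sum"),
    mean_assists=("assists", "mean"),
)

# method chaining: reset_index() turns the group labels back into a regular column
chained = df.groupby("team").agg(total_points=("points", "sum")).reset_index()

print(by_dict)
print(named)
print(chained)
```

Which option to pick is mostly a matter of readability: the dictionary form is compact, while named aggregation gives direct control over the output column names.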

## Pandas Grouping + Descriptive Statistics:

Grouping data in Pandas allows us to compute descriptive statistics on subsets of the data. After grouping the data using `groupby()`, we can apply descriptive statistics functions such as `mean()`, `median()`, `sum()`, `min()`, `max()`, `std()`, `var()`, etc., to calculate summary statistics for each group. This allows us to analyze the distribution and characteristics of the data within each group separately.
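As a minimal sketch of this idea, the snippet below computes a few statistics per group on a hypothetical toy DataFrame (the column names `group` and `value` are assumptions made for the example).

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
df = pd.DataFrame({
    "group": ["x", "x", "x", "y", "y"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0],
})

# select the column to summarise after grouping
grouped = df.groupby("group")["value"]

group_mean = grouped.mean()      # one mean per group
group_median = grouped.median()  # one median per group
group_std = grouped.std()        # sample standard deviation (ddof=1) per group

# several statistics at once: one row per group, one column per statistic
summary = grouped.agg(["mean", "median", "min", "max"])
print(summary)
```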

## Pandas Grouping + Counting:

Counting the occurrences of values within groups is a common operation in data analysis. After grouping the data using `groupby()`, we can use the `size()` function to count the number of records in each group. Alternatively, we can use the `count()` function to count non-null values within each group for specific columns. This allows us to understand the frequency or prevalence of certain categories or values within different groups of the dataset.
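The difference between `size()` and `count()` only shows up when a group contains missing values. The sketch below uses a hypothetical toy DataFrame with one missing `sex` entry to make that visible.

```python
import pandas as pd

# Hypothetical toy data with one missing value (an assumption for illustration only)
df = pd.DataFrame({
    "species": ["A", "A", "A", "B", "B"],
    "sex": ["Male", None, "Female", "Male", "Male"],
})

# size() counts every row in the group, missing values included
n_rows = df.groupby("species").size()

# count() counts only the non-null entries of the selected column
n_known = df.groupby("species")["sex"].count()

# the difference is the number of missing values per group
n_missing = n_rows - n_known
print(n_missing)
```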


## Warm-up

- In an empty notebook, open the penguins dataset that is built into the seaborn library
- Calculate the average bill length in the dataset
- Find out which sex of penguins occurs most often in the dataset

In [1]:
import pandas as pd
import seaborn as sns

In [26]:
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

#### 1. In an empty notebook, open the penguins dataset built into the seaborn library using the following commands:

In [3]:
penguins = sns.load_dataset('penguins')

In [4]:
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


#### 2. Calculate the average bill length in the dataset

In [29]:
penguins['bill_length_mm'].mean()

np.float64(43.9219298245614)

#### 3. Find out which sex of penguins occurs most often in the dataset

In [6]:
penguins['sex'].mode()

0    Male
Name: sex, dtype: object

In [32]:
penguins['sex'].value_counts()

sex
Male      168
Female    165
Name: count, dtype: int64

In [33]:
# we can also calculate the corresponding percentages

penguins['sex'].value_counts(normalize=True)

sex
Male      0.504505
Female    0.495495
Name: proportion, dtype: float64

### 1. Pandas aggregation syntax options

#### Option 1: using a list

In [38]:
# let's calculate the mean, median and standard deviation of the penguins' bill length

# every aggregation in the list is applied to the selected column

penguins['bill_length_mm'].agg(['mean', 'median', 'std'])

mean      43.921930
median    44.450000
std        5.459584
Name: bill_length_mm, dtype: float64

In [39]:
# we can also apply the same list of aggregations to several columns at once

penguins[['bill_length_mm', 'bill_depth_mm']].agg(['mean', 'median', 'std'])

Unnamed: 0,bill_length_mm,bill_depth_mm
mean,43.92193,17.15117
median,44.45,17.3
std,5.459584,1.974793


#### Option 2: using a dictionary

In [46]:
# with this option we can specify which aggregations to apply on which columns

agg_dict = {'island': ['count'],
            'body_mass_g': ['min', 'max', 'mean']}

penguins.agg(agg_dict)

Unnamed: 0,island,body_mass_g
count,344.0,
min,,2700.0
max,,6300.0
mean,,4201.754386


#### Option 3: using a tuple

In [19]:
# here we pass one or more keyword arguments of the form name=(column name, aggregation)
# each keyword becomes the row label of the corresponding result

penguins.agg(
    sex_count=('sex', 'count'),
    bill_length_max=('bill_length_mm', 'max'),
    body_mass_mean=('body_mass_g', 'mean'),
)

Unnamed: 0,sex,bill_length_mm,body_mass_g
sex_count,333.0,,
bill_length_max,,59.6,
body_mass_mean,,,4201.754386


### 2.  Pandas grouping + descriptive statistics

![groupby](groupby_steps.png)

#### Q1: What is the average weight of penguins for each species in the dataset?

In [20]:
# groupby() creates a groupby object
# we need to group the data by species first
# then do a mean on each group for the body_mass_g column
# we can save the object in a variable and examine it
df_group = penguins.groupby('species')
df_group

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001367E7DFB30>

In [48]:
# to see which observations belong to which group
df_group.groups

{'Adelie': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], 'Chinstrap': [152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219], 'Gentoo': [220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 2

In [22]:
# to see the observations of one group
df_group.get_group('Adelie')

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
...,...,...,...,...,...,...,...
147,Adelie,Dream,36.6,18.4,184.0,3475.0,Female
148,Adelie,Dream,36.0,17.8,195.0,3450.0,Female
149,Adelie,Dream,37.8,18.1,193.0,3750.0,Male
150,Adelie,Dream,36.0,17.1,187.0,3700.0,Female


In [23]:
# Let's go back to the exercise
penguins.groupby('species')['body_mass_g']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001367E7D8710>

In [53]:
penguins.groupby('species')['body_mass_g'].agg('mean')

species
Adelie       3700.662252
Chinstrap    3733.088235
Gentoo       5076.016260
Name: body_mass_g, dtype: float64

In [54]:
# another option would be to use the dictionary syntax

agg_dict = {'body_mass_g':['mean', 'min', 'max']}

penguins.groupby('species').agg(agg_dict)

Unnamed: 0_level_0,body_mass_g,body_mass_g,body_mass_g
Unnamed: 0_level_1,mean,min,max
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Adelie,3700.662252,2850.0,4775.0
Chinstrap,3733.088235,2700.0,4800.0
Gentoo,5076.01626,3950.0,6300.0
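The dictionary syntax produces two-level column labels (column name on top, aggregation below). If flat column names are preferred, one possible approach is sketched here on a hypothetical toy DataFrame (`species` and `mass` are illustrative names): either join the two levels after the fact, or use named aggregation from the start.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
df = pd.DataFrame({
    "species": ["A", "A", "B", "B"],
    "mass": [3000.0, 4000.0, 5000.0, 6000.0],
})

# dictionary syntax: the result has two-level (MultiIndex) columns
multi = df.groupby("species").agg({"mass": ["mean", "min", "max"]})

# variant 1: join the two levels into a single string per column
flat = multi.copy()
flat.columns = ["_".join(col) for col in flat.columns]

# variant 2: named aggregation yields flat column names directly
named = df.groupby("species").agg(
    mass_mean=("mass", "mean"),
    mass_min=("mass", "min"),
    mass_max=("mass", "max"),
)

print(flat)
print(named)
```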


In [59]:
# we can also call the describe method to calculate all default descriptive stats!

agg_dict = {'body_mass_g':['describe']}

penguins.groupby('species').agg(agg_dict)

Unnamed: 0_level_0,body_mass_g,body_mass_g,body_mass_g,body_mass_g,body_mass_g,body_mass_g,body_mass_g,body_mass_g
Unnamed: 0_level_1,describe,describe,describe,describe,describe,describe,describe,describe
Unnamed: 0_level_2,count,mean,std,min,25%,50%,75%,max
species,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
Adelie,151.0,3700.662252,458.566126,2850.0,3350.0,3700.0,4000.0,4775.0
Chinstrap,68.0,3733.088235,384.335081,2700.0,3487.5,3700.0,3950.0,4800.0
Gentoo,123.0,5076.01626,504.116237,3950.0,4700.0,5000.0,5500.0,6300.0


In [62]:
penguins.groupby('species')['body_mass_g'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Adelie,151.0,3700.662252,458.566126,2850.0,3350.0,3700.0,4000.0,4775.0
Chinstrap,68.0,3733.088235,384.335081,2700.0,3487.5,3700.0,3950.0,4800.0
Gentoo,123.0,5076.01626,504.116237,3950.0,4700.0,5000.0,5500.0,6300.0


### 3. Pandas grouping + counting

#### Q2: Where does each species of penguin live? Are there species that live on more than one island?

In [63]:
agg_dict = { 'island': ['count', 'nunique', pd.Series.mode]}

penguins.groupby('species').agg(agg_dict)

Unnamed: 0_level_0,island,island,island
Unnamed: 0_level_1,count,nunique,mode
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Adelie,152,3,Dream
Chinstrap,68,1,Dream
Gentoo,124,1,Biscoe


In [71]:
penguins.groupby('species')['island'].agg(['count', 'nunique', pd.Series.mode])

Unnamed: 0_level_0,count,nunique,mode
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Adelie,152,3,Dream
Chinstrap,68,1,Dream
Gentoo,124,1,Biscoe
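The `nunique` column says how many islands a species lives on, but not which ones. A possible way to list them, sketched on a hypothetical toy DataFrame, is to collect the unique values per group or to build a species-by-island frequency table with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
df = pd.DataFrame({
    "species": ["A", "A", "A", "B", "B"],
    "island": ["Dream", "Biscoe", "Dream", "Dream", "Dream"],
})

# unique() returns the distinct island values found in each group
islands = df.groupby("species")["island"].unique()
islands_per_species = {name: sorted(vals) for name, vals in islands.items()}

# crosstab counts the rows for every (species, island) combination
table = pd.crosstab(df["species"], df["island"])

print(islands_per_species)
print(table)
```

A zero in the crosstab means the species was never observed on that island.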


#### Q3: What is the number of male and female penguins for each species in the dataset? Are there any missing values?

In [77]:
# size here in aggregation gives us the number of rows including null values

# count gives us the number of non-null values

agg_dict = {'sex': ['count', 'size', 'nunique']}

penguins.groupby('species').agg(agg_dict)

Unnamed: 0_level_0,sex,sex,sex
Unnamed: 0_level_1,count,size,nunique
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Adelie,146,152,2
Chinstrap,68,68,2
Gentoo,119,124,2


In [74]:
# grouping by both species and sex splits the counts by sex within each species

# rows with a missing sex are dropped from the groups by default, so count and size agree here

agg_dict = {'sex': ['count', 'size', 'nunique']}

penguins.groupby(['species','sex']).agg(agg_dict)

Unnamed: 0_level_0,Unnamed: 1_level_0,sex,sex,sex
Unnamed: 0_level_1,Unnamed: 1_level_1,count,size,nunique
species,sex,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Adelie,Female,73,73,1
Adelie,Male,73,73,1
Chinstrap,Female,34,34,1
Chinstrap,Male,34,34,1
Gentoo,Female,58,58,1
Gentoo,Male,61,61,1


In [75]:
# swapping the order of the grouping columns only changes the order of the index levels

# the counts themselves stay the same

agg_dict = {'sex': ['count', 'size', 'nunique']}

penguins.groupby(['sex','species']).agg(agg_dict)

Unnamed: 0_level_0,Unnamed: 1_level_0,sex,sex,sex
Unnamed: 0_level_1,Unnamed: 1_level_1,count,size,nunique
sex,species,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Female,Adelie,73,73,1
Female,Chinstrap,34,34,1
Female,Gentoo,58,58,1
Male,Adelie,73,73,1
Male,Chinstrap,34,34,1
Male,Gentoo,61,61,1


In [None]:
# to count the rows in each group while keeping missing values as their own group, pass dropna=False

# notice that we can group by multiple columns!

penguins.groupby(['species', 'sex'], dropna=False).size()
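Since the cell above has no recorded output, here is a small sketch of the effect of `dropna=False` on a hypothetical toy DataFrame. It assumes pandas 1.1 or later, where the `dropna` argument of `groupby()` is available.

```python
import pandas as pd

# Hypothetical toy data with one missing sex value (an assumption for illustration only)
df = pd.DataFrame({
    "species": ["A", "A", "A", "B"],
    "sex": ["Male", None, "Female", "Male"],
})

# default behaviour: rows whose group key is missing are left out
without_na = df.groupby(["species", "sex"]).size()

# dropna=False keeps the missing key as a group of its own
with_na = df.groupby(["species", "sex"], dropna=False).size()

print(without_na)
print(with_na)
```

With `dropna=False` the group sizes add up to the total number of rows, which makes it easy to spot how many records lack a value.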

### More examples

#### Q4: What is the weight of the heaviest and lightest penguin on each island?

In [78]:
agg_dict = {'body_mass_g': ['min', 'max']}

penguins.groupby('island').agg(agg_dict)

Unnamed: 0_level_0,body_mass_g,body_mass_g
Unnamed: 0_level_1,min,max
island,Unnamed: 1_level_2,Unnamed: 2_level_2
Biscoe,2850.0,6300.0
Dream,2700.0,4800.0
Torgersen,2900.0,4700.0


### Extra stuff

In [79]:
penguins[['species','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].groupby('species').max()

Unnamed: 0_level_0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adelie,46.0,21.5,210.0,4775.0
Chinstrap,58.0,20.8,212.0,4800.0
Gentoo,59.6,17.3,231.0,6300.0


In [80]:
# if we group it by two columns
penguins[['species','island','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].groupby(['species', 'island']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
species,island,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Adelie,Biscoe,38.975,18.370455,188.795455,3709.659091
Adelie,Dream,38.501786,18.251786,189.732143,3688.392857
Adelie,Torgersen,38.95098,18.429412,191.196078,3706.372549
Chinstrap,Dream,48.833824,18.420588,195.823529,3733.088235
Gentoo,Biscoe,47.504878,14.982114,217.186992,5076.01626


In [82]:
penguins[['species','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].groupby('species').var()

Unnamed: 0_level_0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adelie,7.093725,1.480237,42.764503,210282.891832
Chinstrap,11.15063,1.289122,50.863916,147713.454785
Gentoo,9.497845,0.962792,42.054911,254133.180061


In [None]:
penguins[['species','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].groupby('species').corr()
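The cell above has no recorded output. As a rough illustration of what a grouped `corr()` returns, the sketch below uses a hypothetical toy DataFrame in which `x` and `y` are perfectly positively correlated in one group and perfectly negatively correlated in the other.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
df = pd.DataFrame({
    "species": ["A", "A", "A", "B", "B", "B"],
    "x": [1, 2, 3, 1, 2, 3],
    "y": [2, 4, 6, 3, 2, 1],
})

# one correlation matrix per group, stacked under a (species, variable) row index
corr = df.groupby("species")[["x", "y"]].corr()
print(corr)

# pick a single coefficient: correlation of x and y within each species
corr_a = corr.loc[("A", "x"), "y"]
corr_b = corr.loc[("B", "x"), "y"]
```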

#### Transform method

In [87]:
# transform('sum') computes the sum of each group and repeats it on every row of that group,
# so the result has the same number of rows as the original DataFrame
penguins_transformed = penguins.groupby('species')[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm','body_mass_g']].transform('sum')
penguins_transformed

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,5857.5,2770.3,28683.0,558800.0
1,5857.5,2770.3,28683.0,558800.0
2,5857.5,2770.3,28683.0,558800.0
3,5857.5,2770.3,28683.0,558800.0
4,5857.5,2770.3,28683.0,558800.0
...,...,...,...,...
339,5843.1,1842.8,26714.0,624350.0
340,5843.1,1842.8,26714.0,624350.0
341,5843.1,1842.8,26714.0,624350.0
342,5843.1,1842.8,26714.0,624350.0


In [88]:
# append the per-species sums as new columns next to the original data
result_df = pd.concat([penguins, penguins_transformed.add_suffix("_sum")], axis=1)

result_df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,bill_length_mm_sum,bill_depth_mm_sum,flipper_length_mm_sum,body_mass_g_sum
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,5857.5,2770.3,28683.0,558800.0
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,5857.5,2770.3,28683.0,558800.0
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,5857.5,2770.3,28683.0,558800.0
3,Adelie,Torgersen,,,,,,5857.5,2770.3,28683.0,558800.0
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,5857.5,2770.3,28683.0,558800.0
...,...,...,...,...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,,5843.1,1842.8,26714.0,624350.0
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female,5843.1,1842.8,26714.0,624350.0
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male,5843.1,1842.8,26714.0,624350.0
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female,5843.1,1842.8,26714.0,624350.0
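Because `transform()` returns one value per original row, its result can be assigned straight back to the DataFrame. A common use, sketched here on a hypothetical toy DataFrame (`species` and `mass` are illustrative names), is to compare each row with the mean of its own group.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
df = pd.DataFrame({
    "species": ["A", "A", "B", "B"],
    "mass": [3000.0, 4000.0, 5000.0, 7000.0],
})

# the group mean is broadcast back to every row of the group
df["mass_group_mean"] = df.groupby("species")["mass"].transform("mean")

# deviation of each individual from the mean of its own species
df["mass_diff"] = df["mass"] - df["mass_group_mean"]

print(df)
```

Unlike `agg()`, which returns one row per group, `transform()` keeps the original index, so no merge or concat step is needed.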
