#  <font color="green">Aggregation and Group By</font>

----------------

Data aggregation is a process in which data is gathered and represented in a summary form, for purposes including statistical analysis.

Some aggregation methods were encountered in the descriptive statistics lesson. However, in this lesson, they will be applied to a groupby object in order to aggregate by groups. This means that we can perform aggregation operations, such as sum, mean, count, etc., on specific groups within our dataset. Grouping the data allows us to analyze subsets of the data independently and gain insights into how different groups behave or differ from each other.

## Pandas Aggregation Syntax Options:

Pandas provides several options for performing aggregation operations on data. Some common syntax options include:

- Using the `groupby()` function: This function allows us to group the data based on one or more variables and then apply aggregation functions to each group.

- Using the `agg()` function: This function is used in combination with `groupby()` to specify the aggregation functions to apply to each group. It accepts a single function name, a list of function names, or a dictionary where the keys are column names and the values are the aggregation functions to apply.

- Using named aggregation with `agg()`: This syntax allows us to specify aggregation functions for different columns using keyword arguments of the form `new_name=(column, function)`, so that we choose the name of each result.

- Using method chaining: Pandas also supports method chaining, where multiple operations can be applied in sequence using dot notation. For example, `groupby().agg().reset_index()`.
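The options listed above can be sketched side by side as follows. This is a minimal illustration on a small hand-made DataFrame (the data and the variable names are hypothetical and not part of the penguins dataset), so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "island": ["Dream", "Biscoe", "Biscoe", "Biscoe", "Dream"],
    "body_mass_g": [3700.0, 3800.0, 5000.0, 5200.0, 3600.0],
})

# groupby() followed by a single aggregation on one column
mean_mass = df.groupby("species")["body_mass_g"].mean()

# agg() with a dictionary: column name -> list of aggregations
dict_result = df.groupby("species").agg({"body_mass_g": ["min", "max"]})

# Named aggregation: result column name = (input column, aggregation)
named_result = df.groupby("species").agg(
    mass_mean=("body_mass_g", "mean"),
    island_count=("island", "nunique"),
)

# Method chaining: reset_index() turns the group labels back into a regular column
chained = (
    df.groupby("species")
      .agg(mass_mean=("body_mass_g", "mean"))
      .reset_index()
)
```

The dictionary form produces a two-level column index (column name, aggregation), while named aggregation gives flat column names, which is often easier to work with afterwards.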

## Pandas Grouping + Descriptive Statistics:

Grouping data in Pandas allows us to compute descriptive statistics on subsets of the data. After grouping the data using `groupby()`, we can apply descriptive statistics functions such as `mean()`, `median()`, `sum()`, `min()`, `max()`, `std()`, `var()`, etc., to calculate summary statistics for each group. This allows us to analyze the distribution and characteristics of the data within each group separately.
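As a small sketch of this idea, the example below computes a few statistics per group on made-up numbers. The column names `group` and `value` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: two groups with a handful of values each
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1.0, 2.0, 6.0, 10.0, 20.0],
})

# Select the column of interest from the groupby object
grouped = df.groupby("group")["value"]

group_mean = grouped.mean()      # a: 3.0, b: 15.0
group_median = grouped.median()  # a: 2.0, b: 15.0

# Several statistics at once: one row per group, one column per statistic
summary = grouped.agg(["mean", "median", "min", "max", "std"])
```

Comparing the mean and the median of group `a` (3.0 versus 2.0) already hints that one large value pulls the mean upwards, which is the kind of per-group insight these statistics are meant to give.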

## Pandas Grouping + Counting:

Counting the occurrences of values within groups is a common operation in data analysis. After grouping the data using `groupby()`, we can use the `size()` function to count the number of records in each group. Alternatively, we can use the `count()` function to count non-null values within each group for specific columns. This allows us to understand the frequency or prevalence of certain categories or values within different groups of the dataset.
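The difference between `size()` and `count()` only shows up when values are missing. The sketch below uses a hypothetical toy DataFrame with one missing `sex` value to make that visible.

```python
import pandas as pd

# Hypothetical toy data with one missing value in the 'sex' column
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["Male", "Female", None, "Female", "Female"],
})

# size() counts rows per group, missing values included
rows_per_group = df.groupby("species").size()

# count() counts only the non-null values of the selected column
non_null_sex = df.groupby("species")["sex"].count()

# The difference is the number of missing values per group
missing_sex = rows_per_group - non_null_sex
```

Subtracting the two results is a quick way to see how many values are missing in each group.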


## Warm-up

- In an empty notebook open a dataset about penguins ‘built-in’ to the seaborn library
- Calculate the average bill length in the dataset
- Find out which gender of penguins occurs the most in the dataset

In [None]:
import pandas as pd
import seaborn as sns

#### 1. In an empty notebook open a dataset about penguins ‘built-in’ to the seaborn library using the following commands:

In [None]:
penguins = sns.load_dataset('penguins')

In [None]:
penguins.head()

#### 2. Calculate the average bill length in the dataset

In [None]:
penguins['bill_length_mm'].mean()

#### 3. Find out which gender of penguins occurs the most in the dataset

In [None]:
penguins['sex'].mode()

In [None]:
penguins['sex'].value_counts()

In [None]:
# we can also calculate the corresponding percentages

penguins['sex'].value_counts(normalize=True)

### 1. Pandas aggregation syntax options

#### Option 1: using a list

In [None]:
# let's calculate the mean, median and standard deviation of the penguins' bill length

# all aggregations in the list will be applied to the column we specified

penguins['bill_length_mm'].agg(['mean', 'median', 'std'])

In [None]:
# we can also apply the same aggregations to a list of columns

penguins[['bill_length_mm', 'bill_depth_mm']].agg(['mean', 'median', 'std'])

#### Option 2: using a dictionary

In [None]:
# with this option we can specify which aggregations to apply on which columns

agg_dict = {'island':['count'],
             'body_mass_g': ['min', 'max', 'mean']
           }

penguins.agg(agg_dict)

#### Option 3: using tuples (named aggregation)

In [None]:
# here we pass one or more keyword arguments of the form name=(column name, aggregation)
# the keyword becomes the row label of the result, so we get to name each result

penguins.agg(
             sex_count = ('sex', 'count'),
             bill_length_max = ('bill_length_mm', 'max'),
             body_mass_mean = ('body_mass_g', 'mean')
            )

### 2.  Pandas grouping + descriptive statistics

![groupby](groupby_steps.png)

#### Q1: What is the average weight of penguins for each of the species in the dataset?

In [None]:
# groupby() creates a groupby object
# we need to group the data by species first
# then do a mean on each group for the body_mass_g column
# we can save the object in a variable and examine it
df_group = penguins.groupby('species')
df_group

In [None]:
# to see which observations belong to which group
df_group.groups

In [None]:
# to see the observations of one group
df_group.get_group('Adelie')

In [None]:
# Let's go back to the exercise
penguins.groupby('species')['body_mass_g']

In [None]:
penguins.groupby('species')['body_mass_g'].agg('mean')

In [None]:
# another option would be to use the dictionary syntax

agg_dict = {'body_mass_g':['mean', 'min', 'max']}

penguins.groupby('species').agg(agg_dict)

In [None]:
# we can also call the describe method to calculate all default descriptive stats!!

agg_dict = {'body_mass_g':['describe']}

penguins.groupby('species').agg(agg_dict)

### 3. Pandas grouping + counting

#### Q2: Where does each species of penguins live? Are there species that live on more than one island?

In [None]:
agg_dict = { 'island': ['count', 'nunique', pd.Series.mode]}

penguins.groupby('species').agg(agg_dict)

#### Q3: What is the number of male and female penguins for each species in the dataset? Are there any missing values?

In [None]:
# size here in aggregation gives us the number of rows including null values

# count gives us the number of non-null values

agg_dict = {'sex': ['count', 'size', 'nunique']}

penguins.groupby('species').agg(agg_dict)

In [None]:
# if we want the number of rows in each group including the number of null values

# notice that we can group by multiple columns!!!

penguins.groupby(['species', 'sex'], dropna=False).size()
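To see what `dropna=False` changes, the sketch below groups a hypothetical toy DataFrame with and without it. By default, `groupby()` leaves out rows whose group key is missing, so those rows do not appear in the counts at all.

```python
import pandas as pd

# Hypothetical toy data: two of the four rows have no recorded sex
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["Male", None, "Female", None],
})

# Default behaviour: rows with a missing group key are dropped
default_counts = df.groupby(["species", "sex"]).size()

# dropna=False keeps the missing key as its own group
with_missing = df.groupby(["species", "sex"], dropna=False).size()

rows_default = default_counts.sum()   # only rows with a known sex
rows_with_nan = with_missing.sum()    # all rows of the DataFrame
```

If the two totals differ, some rows have a missing value in at least one of the grouping columns.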

### More examples

#### Q4: What is the weight of the heaviest and lightest penguin on each island?

In [None]:
agg_dict = {'body_mass_g': ['min', 'max']}

penguins.groupby('island').agg(agg_dict)

### Extra stuff

In [None]:
penguins[['species','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].groupby('species').max()

In [None]:
# if we group it by two columns
penguins[['species','island','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].groupby(['species', 'island']).mean()

In [None]:
penguins[['species','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].groupby('species').var()

In [None]:
penguins[['species','bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']].groupby('species').corr()

#### Transform method

In [None]:
# transform('sum') calculates the sum of each group and repeats it on every row of that group,
# so the result has the same number of rows as the original DataFrame
penguins_transformed = penguins.groupby('species')[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm','body_mass_g']].transform('sum')
penguins_transformed
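The sketch below contrasts an aggregation with `transform()` on a hypothetical toy DataFrame. It assumes one typical use of `transform()`: relating each row to a statistic of its own group. The column name `share_of_group` is made up for this example.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3000.0, 4000.0, 5000.0, 5000.0],
})

# An aggregation returns one value per group
per_group = df.groupby("species")["body_mass_g"].sum()

# transform() returns one value per original row: each row holds its group's sum
group_total = df.groupby("species")["body_mass_g"].transform("sum")

# Because the result is aligned with the original rows, it can be used directly
# in row-wise calculations, e.g. each penguin's share of its species' total mass
df["share_of_group"] = df["body_mass_g"] / group_total
```

Passing the aggregation as the string `"sum"` rather than the built-in `sum` lets pandas use its own group-wise implementation.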