# `groupby()`

In this section we will go over the `groupby()` method and the split-apply-combine strategy, using the Palmer penguins data.

In [5]:
# import pandas library with standard abbreviations
import pandas as pd

# read in Palmer penguins data
penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv")

# print number of rows
print(len(penguins)) 

# view first 5 rows of data frame
penguins.head() 


344


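|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |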


# Summary statistics

In pandas it is easy to get summary statistics for each column of a data frame by using methods such as:

- `sum()`: sum values in each column,
- `count()`: count non-NA values in each column,
- `min()` and `max()`: get the minimum and maximum value in each column,
- `mean()` and `median()`: get the mean and median value in each column,
- `std()` and `var()`: get the standard deviation and variance in each column.


**Example**

In [7]:
# get the number of non-NA values in each column
penguins.count()

species              344
island               344
bill_length_mm       342
bill_depth_mm        342
flipper_length_mm    342
body_mass_g          342
sex                  333
year                 344
dtype: int64

In [10]:
# get the minimum value in each numeric column
# select_dtypes('number') keeps only the columns that contain numbers
penguins.select_dtypes('number').min()

bill_length_mm         32.1
bill_depth_mm          13.1
flipper_length_mm     172.0
body_mass_g          2700.0
year                 2007.0
dtype: float64
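
The other methods in the list are called the same way. As a minimal sketch, the cell below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not the penguins data) to show `count()`, `mean()` and `std()`, and to show that these methods skip missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one row of missing measurements
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, np.nan, 36.7],
    "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0],
})

counts = toy.count()  # non-NA values per column
means = toy.mean()    # NA values are skipped by default
stds = toy.std()      # sample standard deviation per column

print(counts)
print(means)
print(stds)
```

Each call returns a series with one entry per column, just like the `count()` and `min()` outputs above.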

# Grouping

Our penguins data can be naturally split into groups: there are three species, two sexes, and three islands. In other words, we can group the observations by island, sex, species, or year.

Often, we want to calculate a certain statistic for each group. For example, suppose we want to calculate the average flipper length per species. How would we do this “by hand”?

0. We start with our data and notice there are multiple species in the `species` column.

1. We *split* our original table to group all observations from the same species together.

2. We *apply* a calculation to each of the groups we formed: here, the average flipper length.

3. Then we *combine* the values for average flipper length per species into a single table.


This is known as the **Split-Apply-Combine strategy**. This strategy follows the three steps we explained above:

1. **Split**: Split the data into logical groups (e.g. species, sex, island, etc.)

2. **Apply**: Calculate some summary statistic on each group (e.g. average flipper length by species, number of individuals per island, body mass by sex, etc.)

3. **Combine**: Combine the statistic calculated on each group back together.
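
To make the three steps concrete, here is a rough sketch that carries them out by hand with plain indexing. The data frame `toy` and its values are made up for illustration, so the numbers are not the real penguin averages.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins table
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "flipper_length_mm": [180.0, 190.0, 210.0, 220.0, 195.0],
})

# Split: build one sub-table per species
groups = {name: toy[toy["species"] == name] for name in toy["species"].unique()}

# Apply: compute the mean flipper length within each sub-table
means = {name: sub["flipper_length_mm"].mean() for name, sub in groups.items()}

# Combine: collect the per-group results into a single series
manual_result = pd.Series(means, name="flipper_length_mm").sort_index()
manual_result.index.name = "species"

print(manual_result)
```

Writing the steps out this way works, but it is verbose and easy to get wrong, which is why a dedicated grouping tool is useful.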

Applying split-apply-combine to calculate the mean flipper length:

- split the data into groups with `groupby()`,
- apply a function or method (like `.mean()`) to each group,
- combine the results into one final table of summary statistics, which shows each group and its associated statistic.




In pandas we can use the `groupby()` method to split (i.e. group) the data into different categories.

The general syntax for `groupby()` is

```
df.groupby(columns_to_group_by)
```

where `columns_to_group_by` is most often a single column name (a string) or a list of column names. The unique values of the column (or the unique combinations of values across the columns) are used as the groups of the data frame.
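
As a small sketch of both forms, the cell below groups a made-up data frame (`toy` is an assumption for illustration) first by one column and then by a list of two columns. The `ngroups` attribute and the `size()` method are used only to inspect how many groups were formed and how many rows each contains.

```python
import pandas as pd

# Hypothetical toy data: two species, two sexes
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["male", "female", "male", "female"],
    "body_mass_g": [3800.0, 3400.0, 5500.0, 4700.0],
})

# Group by a single column name (string)
by_species = toy.groupby("species")
print(by_species.ngroups)

# Group by a list of column names: one group per (species, sex) pair
by_species_sex = toy.groupby(["species", "sex"])
print(by_species_sex.ngroups)
print(by_species_sex.size())
```

Note that `groupby()` on its own only defines the groups. No statistic is computed until a method such as `mean()` or `size()` is applied.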

**Example**

First, if we don’t use the `groupby()` method, we obtain the average for the whole flipper length column:

In [12]:
penguins.flipper_length_mm.mean()

200.91520467836258

To get the mean flipper length by species we first group our dataset by the species column’s data:

In [14]:
# average flipper length per species
penguins.groupby("species").flipper_length_mm.mean()

species
Adelie       189.953642
Chinstrap    195.823529
Gentoo       217.186992
Name: flipper_length_mm, dtype: float64

There’s a lot going on there, so let’s break it down (remember that the `.` can be read as “and then…”):

- start with the `penguins` data frame, and then…
- use `groupby()` to group the data frame by `species` values, and then…
- select the `flipper_length_mm` column, and then…
- calculate the `mean()`
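
The same chain can also be written one step at a time to see what each “and then…” produces. This is only a sketch on a made-up data frame (`toy` and the intermediate names `grouped`, `flippers` and `avg` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins table
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "flipper_length_mm": [180.0, 190.0, 210.0, 220.0],
})

grouped = toy.groupby("species")        # a grouped data frame, nothing computed yet
flippers = grouped.flipper_length_mm    # the selected column, still grouped
avg = flippers.mean()                   # a series with one mean per species

print(type(grouped).__name__)
print(type(flippers).__name__)
print(avg)
```

Chaining the calls in one line gives the same result as `avg`; the intermediate variables are only there to make each step visible.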

We can store the resulting series as `avg_flipper` and then graph it as a bar plot: