# Video: Grouping Data for Exploratory Analysis

This video shows another style of exploratory analysis based on grouping data and summary tables instead of charting.

In [None]:
import pandas as pd

In [None]:
penguins_adelie = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx601-examples/refs/heads/main/data/palmer-penguins-adelie.csv", index_col="Sample Number")
penguins_gentoo = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx601-examples/refs/heads/main/data/palmer-penguins-gentoo.csv", index_col="Sample Number")
penguins_chinstrap = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx601-examples/refs/heads/main/data/palmer-penguins-chinstrap.csv", index_col="Sample Number")
penguins = pd.concat([penguins_adelie, penguins_gentoo, penguins_chinstrap])
penguins

Sample Number,studyName,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
1,PAL0708,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
2,PAL0708,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
3,PAL0708,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
4,PAL0708,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,,,,Adult not sampled.
5,PAL0708,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,PAL0910,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N98A2,Yes,2009-11-19,55.8,19.8,207.0,4000.0,MALE,9.70465,-24.53494,
65,PAL0910,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N99A1,No,2009-11-21,43.5,18.1,202.0,3400.0,FEMALE,9.37608,-24.40753,Nest never observed with full clutch.
66,PAL0910,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N99A2,No,2009-11-21,49.6,18.2,193.0,3775.0,MALE,9.46180,-24.70615,Nest never observed with full clutch.
67,PAL0910,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N100A1,Yes,2009-11-21,50.8,19.0,210.0,4100.0,MALE,9.98044,-24.68741,


Script:
* The last kind of exploratory analysis that I'd like to show you this week is simply based on grouping and aggregating data.
* It's not as flashy as charting, but it works better with string columns, and is easier to tweak and manipulate.
* It's not the best fit for the smaller penguin and abalone datasets that we have been looking at this week, but that will give us some rough edges to talk about.
* The most basic form is to group on one or more interesting columns, and look at statistics of interest.
* I'll start by doing this for penguin species and body mass.

In [None]:
penguins.groupby("Species")["Body Mass (g)"].agg(["count", "mean"])

Species,count,mean
Adelie Penguin (Pygoscelis adeliae),151,3700.662252
Chinstrap penguin (Pygoscelis antarctica),68,3733.088235
Gentoo penguin (Pygoscelis papua),123,5076.01626


Script:
* Generally when you do an analysis like this, you will want to see the number of samples, and the average.
* And as you dig in, you will alternately sort by the number of samples, or your target metric.
* Let's sort the per-species summary by the count.

In [None]:
penguins.groupby("Species")["Body Mass (g)"].agg(["count", "mean"]).sort_values("count", ascending=False)

Species,count,mean
Adelie Penguin (Pygoscelis adeliae),151,3700.662252
Gentoo penguin (Pygoscelis papua),123,5076.01626
Chinstrap penguin (Pygoscelis antarctica),68,3733.088235


Script:
* So looking at this, you can see that the most common species is the Adelie penguin, and they weigh about 3700 grams on average.
* And the next most common species is the Gentoo penguin, and they average more than 5000 grams.
* So perhaps a lot of the variation in the average penguin body mass is from species difference.
* I won't pretend that's a deep insight, but a lot of business observations come from such simple reports.
* I broke down the data by X and saw a big spread in Y.
* In particular, A and B are big and very different in this view.
* While I was in industry, we had a lot of important insights come out of such simple reports.
* Usually we had a lot of investigation to really understand what was going on, but having a simple report showing a clear difference helped focus those investigations.
* The trick is to figure out the right report for that clarity.
* Let's look at a couple columns now.

In [None]:
penguins.groupby(["Species", "Sex"])["Body Mass (g)"].agg(["count", "mean"])

Species,Sex,count,mean
Adelie Penguin (Pygoscelis adeliae),FEMALE,73,3368.835616
Adelie Penguin (Pygoscelis adeliae),MALE,73,4043.493151
Chinstrap penguin (Pygoscelis antarctica),FEMALE,34,3527.205882
Chinstrap penguin (Pygoscelis antarctica),MALE,34,3938.970588
Gentoo penguin (Pygoscelis papua),.,1,4875.0
Gentoo penguin (Pygoscelis papua),FEMALE,58,4679.741379
Gentoo penguin (Pygoscelis papua),MALE,61,5484.836066


Script:
* The mystery period is there again.
* But there is just one, so we can filter on the count to get bigger, more important segments of the data.

In [None]:
penguins.groupby(["Species", "Sex"])["Body Mass (g)"].agg(["count", "mean"]).query("count > 10")

Species,Sex,count,mean
Adelie Penguin (Pygoscelis adeliae),FEMALE,73,3368.835616
Adelie Penguin (Pygoscelis adeliae),MALE,73,4043.493151
Chinstrap penguin (Pygoscelis antarctica),FEMALE,34,3527.205882
Chinstrap penguin (Pygoscelis antarctica),MALE,34,3938.970588
Gentoo penguin (Pygoscelis papua),FEMALE,58,4679.741379
Gentoo penguin (Pygoscelis papua),MALE,61,5484.836066


Script:
* Sometimes, it is helpful to sort by your target, or the average target of a group to see what the extreme sets are.

In [None]:
penguins.groupby(["Species", "Sex"])["Body Mass (g)"].agg(["count", "mean"]).query("count > 10").sort_values("mean", ascending=False)

Species,Sex,count,mean
Gentoo penguin (Pygoscelis papua),MALE,61,5484.836066
Gentoo penguin (Pygoscelis papua),FEMALE,58,4679.741379
Adelie Penguin (Pygoscelis adeliae),MALE,73,4043.493151
Chinstrap penguin (Pygoscelis antarctica),MALE,34,3938.970588
Chinstrap penguin (Pygoscelis antarctica),FEMALE,34,3527.205882
Adelie Penguin (Pygoscelis adeliae),FEMALE,73,3368.835616


Script:
* In this case, the Gentoo penguins of both sexes sorted to the top.
* Meanwhile, the Chinstrap averages for both sexes fall within the range spanned by the Adelie averages.
* Sorting by target value is useful for finding top or high performers, or for finding bottom or worst performers.
* However, it is important to have some kind of significance filter, or rows with little data will sort to either end of the range erratically, especially if they came from just one sample like that period row.
* Related to that concern, these manual grouping queries tend to work best with a medium number of distinct values.
* Just a few values, like the three species here, do not leave much room for interesting actions.
* But thousands of groups summarizing one or two rows each is not very helpful either.
* You can check the number of values with the `nunique` method.

In [None]:
penguins.nunique()

studyName                3
Species                  3
Region                   1
Island                   3
Stage                    1
Individual ID          190
Clutch Completion        2
Date Egg                50
Culmen Length (mm)     164
Culmen Depth (mm)       80
Flipper Length (mm)     55
Body Mass (g)           94
Sex                      3
Delta 15 N (o/oo)      330
Delta 13 C (o/oo)      331
Comments                10
dtype: int64

Script:
* This dataset only has 344 rows, so it's hard for a column to have many values that each match a lot of rows.
* Generally, I'd be hoping for columns where multiple values had hundreds of rows, but this dataset does not have many of those besides species and sex, which we already looked at.
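
One way to screen for such columns is to count, for each column, how many distinct values cover at least some minimum number of rows. The following is a minimal sketch on a small made-up DataFrame rather than the real penguin data; the helper name `values_with_enough_rows`, the toy values, and the threshold of 3 are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the penguins table (values are made up).
toy = pd.DataFrame({
    "Species": ["Adelie"] * 5 + ["Gentoo"] * 4 + ["Chinstrap"] * 1,
    "Island": ["Torgersen", "Biscoe", "Dream", "Biscoe", "Dream",
               "Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream"],
    "Individual ID": [f"N{i}" for i in range(10)],
})

def values_with_enough_rows(df, min_rows):
    """For each column, count the distinct values that cover at least min_rows rows."""
    return pd.Series(
        {col: int((df[col].value_counts() >= min_rows).sum()) for col in df.columns},
        name="values_with_enough_rows",
    )

# Hypothetical threshold; on a real dataset this might be in the hundreds.
summary = values_with_enough_rows(toy, min_rows=3)
print(summary)
```

Columns that come back with a handful of well-populated values are the more promising grouping candidates. A column like `Individual ID`, where nearly every value is unique, scores zero.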


## Why Simple?

* Because it is easy to explain?
* Because it is easier to persuade?
* Because it is more likely to be correct?

Script:
* Why does simple reporting like this work?
* Simple reports can be very persuasive when they show significant volume and significant differences.
* The catch is that you need to find a dimension or column where those significance conditions hold.
* They don't always exist, but they are clear when you find them.
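
The group, filter, and sort pattern used throughout this video can be wrapped in a small helper. This is a sketch under assumptions: the function name `grouped_report`, its `min_count` parameter, and the toy data are all made up for illustration. It shows how a group with a single row can sort to the top until a volume filter is applied.

```python
import pandas as pd

# Toy data (made up) with one tiny group "c" that has an extreme value.
toy = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4 + ["c"],
    "value": [10.0, 12.0, 11.0, 11.0, 20.0, 22.0, 21.0, 21.0, 50.0],
})

def grouped_report(df, by, target, min_count=1):
    """Count and mean of target per group, keeping groups with at least min_count rows."""
    report = df.groupby(by)[target].agg(["count", "mean"])
    report = report[report["count"] >= min_count]
    # Sort by the target average so the extreme groups appear first.
    return report.sort_values("mean", ascending=False)

unfiltered = grouped_report(toy, "group", "value")
filtered = grouped_report(toy, "group", "value", min_count=3)
print(unfiltered)
print(filtered)
```

Without the filter, the single-row group `c` leads the report. With `min_count=3`, only the groups with enough volume remain, and the difference between `a` and `b` is the one worth investigating.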