# Class 6: Statistical Distribution Exploration

## Theme: Uncovering Hidden Patterns in Hawk Populations

**Learning Goals:**
By the end of this class, you will be able to:
- Use pandas groupby operations to test hypotheses about distributions
- Create and interpret boxplots to visualize statistical summaries
- Apply faceting techniques to compare multiple groups
- Build violin plots to reveal distribution shapes
- Create scatter plots with regression lines to explore relationships


---

## Setup and Data Loading

In [None]:
import pandas as pd
import altair as alt
import numpy as np

alt.renderers.enable("default")

# FILEPATH: keep exactly one of the two assignments below active
# If on PL, use this one (and comment out the URL version below)
# filepath = 'data/hawks.csv'

# If running locally on your machine, use this one
filepath = 'https://raw.githubusercontent.com/kemiolamudzengi/dsci-320-datasets/main/hawks.csv'

# Load the hawks dataset
hawks = pd.read_csv(filepath)

cols_to_drop = [
    'Unnamed: 0', 'ReleaseTime', 'StandardTail', 'Tarsus', 'Sex',
    'KeelFat', 'Crop', 'BandNumber', 'CaptureTime', 'WingPitFat', 'Day'
]
print("The size of the raw dataset is ", hawks.shape)

hawks = (
    hawks
    .drop(columns=cols_to_drop, errors='ignore')
    .dropna(subset=['Weight', 'Tail'])  # Drop rows where 'Weight' or 'Tail' is missing
    .rename(columns=lambda x: x.strip().lower())  # Lower-case the column names (e.g. 'Weight' -> 'weight')
).copy()

print("The size of the cleaned dataset is ", hawks.shape)
print("Sample of the dataset\n")
print(hawks.sample(10))

### Hawks Dataset Data Card

| Column | Description | Attribute Type |
|--------|-------------|----------------|
| month | Month of capture | Temporal |
| year | Year of capture | Temporal |
| species | Hawk species | Nominal |
| age | Age group: A = Adult or I = Immature | Ordinal |
| wing | Length (in mm) of the primary wing feather, from the tip to the wrist it attaches to | Quantitative |
| weight | Body weight (in g) | Quantitative |
| culmen | Length (in mm) of the upper bill, from the tip to where it meets the fleshy part of the bird | Quantitative |
| hallux | Length (in mm) of the killing talon | Quantitative |
| tail | Measurement (in mm) related to the length of the tail | Quantitative |



## Part 1: Histogram Deep Dive - The Power of Binning
### *When Visualization Choices Change Your Story*

We've created histograms earlier, but have you considered how your binning choices affect the insights you discover?

### **The Binning Experiment**

In [None]:
# Same data, different bin sizes
bin_sizes = [5, 10, 20, 30, 50]

charts = []
for bins in bin_sizes:
    chart = alt.Chart(hawks).mark_bar().encode(
        x=alt.X('weight:Q', 
               bin=alt.BinParams(maxbins=bins), 
               title=f'Weight (g) - {bins} bins'),
        y=alt.Y('count()', title='Count'),
        color=alt.value('steelblue')
    ).properties(
        width=200,
        height=150,
        title=f"{bins} Bins"
    )
    charts.append(chart)

# Arrange in a grid
binning_comparison = alt.vconcat(
    alt.hconcat(charts[0], charts[1], charts[2]),
    alt.hconcat(charts[3], charts[4])
).properties(
    title="How Binning Affects Pattern Detection"
)

binning_comparison

### Observation
What do you notice about the different binning sizes?



<div style="border-left: 5px solid #007BFF; padding: 15px; background-color: #F0F8FF; border-radius: 8px;">

<h2>Weight Distribution by Species</h2>

<p><strong>Exploratory Question:</strong>  
<em>Do different hawk species require different binning strategies when examining weight distributions?</em></p>

<p><strong>VIZ TASK:</strong> Create small multiple histograms to compare the weight distributions across species.</p>

<h3>Chart Specification:</h3>
<ul>
<li>Use <code>mark_bar</code> to create histograms of hawk weights.</li>
<li>Encode <code>weight</code> on the <strong>x channel</strong> as <code>quantitative</code>, applying binning with <code>maxbins=15</code>. Title the axis <em>Weight (g)</em>.</li>
<li>Encode <code>count()</code> on the <strong>y channel</strong> as <code>quantitative</code>, with the axis title <em>Count</em>.</li>
<li>Encode <code>species</code> on the <strong>color channel</strong> as <code>nominal</code>, with the legend title <em>Species</em>.</li>
<li>Use the <strong>column channel</strong> for <code>species</code> so that each species appears as a separate small multiple histogram, with the title <em>Species</em>.</li>
<li>Set the overall chart title to <em>Weight Distribution by Species</em>.</li>
</ul>

<h3>Styling Specifications:</h3>
<ul>
<li>Set the <strong>width</strong> of each facet to 200 pixels and <strong>height</strong> to 150 pixels.</li>
<li>Apply a distinct color for each species category (Altair’s default categorical palette).</li>
</ul>

</div>


In [None]:
# Do different species need different binning strategies?
species_histograms = ...

species_histograms

### Observation
Change the number of bins (i.e., the `maxbins` value) and look at each species in turn. Which species benefit from a different bin size?

**Binning Guidelines:**
- **Too few bins (< 10)**: May hide important patterns
- **Too many bins (> 50)**: May create noise, hard to interpret
- **Sweet spot**: Usually 15-30 bins for exploration
- **Consider your data size**: More data can handle more bins
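
The effect of bin count can also be checked numerically. The sketch below uses synthetic weights (two invented clusters, not the real hawks data) and `np.histogram`, which takes an exact bin count, whereas Altair's `maxbins` is only an upper bound. With very few bins the gap between the clusters disappears; with more bins it shows up as empty bins.

```python
import numpy as np

# Synthetic stand-in for hawk weights (NOT the real dataset):
# a light cluster and a heavy cluster, loosely mimicking two species.
rng = np.random.default_rng(0)
weights = np.concatenate([
    rng.normal(150, 30, size=300),    # hypothetical small species
    rng.normal(1100, 100, size=300),  # hypothetical large species
])

summary = {}
for n_bins in [3, 10, 30]:
    counts, edges = np.histogram(weights, bins=n_bins)
    summary[n_bins] = {
        "bin_width": edges[1] - edges[0],          # width of each bin in grams
        "empty_bins": int((counts == 0).sum()),    # bins with no observations
        "tallest_bar": int(counts.max()),          # height of the tallest bar
    }
    print(n_bins, summary[n_bins])
```

Empty bins between the two clusters are the numerical trace of the gap you would see in the histogram; a coarse binning fills that gap in.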

**Critical Thinking:**
1. What patterns appear with 5 bins vs 30 bins?
2. Do you see different shapes in the wing length distribution with different binning?
3. How might this affect conclusions about hawk populations?

---


## The Statistical Detective Workflow

When we encounter strange patterns in data, we follow a systematic approach:

```
1. Understand your data (missing values, data quality)
2. Form hypotheses (what might explain the variation?)
3. Test statistically (groupby, describe)
4. Visualize to confirm (boxplots, violins)
5. Explore relationships (correlations, scatter plots)
```
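
As a rough illustration of steps 1 to 3, here is a minimal sketch on a tiny made-up table (the values are invented, not taken from the hawks dataset). It counts missing values, then summarizes weight within each species to see whether species could explain the spread.

```python
import numpy as np
import pandas as pd

# Tiny made-up table (NOT the real hawks data) with one missing weight.
toy = pd.DataFrame({
    "species": ["RT", "RT", "RT", "SS", "SS", "SS"],
    "weight":  [1050.0, 1120.0, np.nan, 140.0, 155.0, 170.0],
    "tail":    [220.0, 225.0, 230.0, 150.0, 152.0, 160.0],
})

# 1. Understand your data: count missing values per column.
missing = toy.isna().sum()
print(missing)

# 2. Hypothesis: species explains most of the spread in weight.
# 3. Test statistically: summarize weight within each species.
clean = toy.dropna(subset=["weight"])
by_species = clean.groupby("species")["weight"].describe()
print(by_species[["count", "mean", "min", "max"]])
```

If the group means sit far apart relative to the spread inside each group, the hypothesis is worth confirming visually in steps 4 and 5.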

---

## Part 2: Boxplots - Seeing the Statistics

### Anatomy of a Boxplot

Let's create the simplest possible boxplot and understand every part of it.



<div style="border-left: 5px solid #007BFF; padding: 1em; background-color: #F0F8FF;">

<h3><b>Viz Task: Boxplot of Hawk Weights</b></h3>

<ul>
<li>Use the <code>boxplot</code> mark to show the distribution of hawk weights.</li>
<li>Mark options:
<ul>
<li><code>median={"color": "red"}</code>: highlights the median in red.</li>
<li><code>extent=1.5</code>: lets the whiskers reach at most 1.5 × IQR beyond the box edges.</li>
<li><code>ticks=True</code>: draws tick marks (end caps) at the ends of the whiskers.</li>
</ul>
</li>
<li>Encode:
<ul>
<li><code>weight</code> on the <b>x channel</b> as quantitative.</li>
</ul>
</li>
</ul>
</div>



In [None]:
# Simplest boxplot - just one variable
...

#### Observations

1. **The line in the middle** = MEDIAN (50% of hawks are lighter, 50% are heavier)
2. **The box** = Middle 50% of data (Interquartile Range, IQR)
   - Left edge = 25th percentile (Q1)
   - Right edge = 75th percentile (Q3)
3. **The whiskers** = Extend to the most extreme data points that lie within 1.5 × IQR of the box edges
4. **The dots** = OUTLIERS (unusual hawks that are much heavier or lighter)
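
The same quantities can be computed by hand. This is a minimal sketch on made-up numbers, using `np.percentile` with its default linear interpolation; the quartile method Altair uses internally may differ slightly, so treat the values as illustrative.

```python
import numpy as np

# Made-up measurements with one unusually large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

# Box: quartiles and median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach (extent = 1.5).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an outlier dot.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Note that the whisker ends at 9, the last real data point inside the fence, not at the fence value itself.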


### Comparing Groups - The Y Channel

Now let's compare species using the y-axis for categories.

In [None]:
# 3. Group-wise statistical analysis
# Hypothesis 1: Does species explain the distribution?
species_stats = hawks.groupby('species')['weight'].describe()
print("Weight statistics by species:")
print(species_stats)

<div style="border-left: 5px solid #007BFF; padding: 15px; background-color: #F0F8FF; border-radius: 8px;">

<h2>Hawk Weight Distribution by Species</h2>

<p><strong>Exploratory Question:</strong>  
<em>How do weight distributions differ across hawk species, and are there any outliers or median shifts between species?</em></p>

<p><strong>VIZ TASK:</strong> Create a boxplot to compare the distribution of hawk weights across different species.</p>

<h3>Chart Specification:</h3>
<ul>
<li>Use <code>mark_boxplot</code> to visualize the spread and median of weights for each species.</li>
<li>Encode <code>species</code> on the <strong>y channel</strong> as <code>nominal</code>, titled <em>Species</em>.</li>
<li>Encode <code>weight</code> on the <strong>x channel</strong> as <code>quantitative</code>, titled <em>Weight (g)</em>.</li>
<li>Display median lines in <strong>red</strong> with a stroke width of 3 pixels.</li>
<li>Show outliers as <strong>orange points</strong> with a size of 50 and black fill.</li>
<li>Use <code>extent=1.5</code> for whiskers (standard IQR rule).</li>
<li>Include ticks on each box to show distribution symmetry.</li>
</ul>

<h3>Styling Specifications:</h3>
<ul>
<li>Set <strong>tick size</strong> to 30 to make whiskers more prominent.</li>
<li>Set <strong>box width</strong> (<code>size</code>) to 40 for clear visibility.</li>
<li>Chart dimensions: <strong>width=500</strong>, <strong>height=300</strong>.</li>
<li>Chart title: <em>Hawk Weight Distribution by Species</em>.</li>
<li>Enable <code>.interactive()</code> for hover zooming or panning if desired.</li>
</ul>

</div>

---


In [None]:
# Compare species using boxplots
...

---

### When One Chart Isn't Enough - Faceting (5 min)

**The Challenge:** "What if we want to see species AND age together?"

**Option 1 - Try using the color channel:**

In [None]:
# Attempt 1: Use color for age, with species on the y-axis
alt.Chart(hawks).mark_boxplot().encode(
    color=alt.Color('age:N', title='Age'),  # Treated as nominal (N) here just to show the difference
    x=alt.X('weight:Q', title='Weight (g)'),
    y='species:N'
).properties(
    width=500,
    height=200,
    title='Weight by Age and Species'
)

**The Problem:** "This is getting messy and hard to read..."
- Too many boxes overlapping
- Hard to compare specific groups
- Visually overwhelming

**Option 2 - Faceting (The Solution):**

<div style="border-left: 5px solid #007BFF; padding: 15px; background-color: #F0F8FF; border-radius: 8px;">

<h2>Weight Distribution by Species, Faceted by Age</h2>

<p><strong>Exploratory Question:</strong>  
<em>How does the weight distribution of hawks vary with age across different species?</em></p>

<p><strong>VIZ TASK:</strong> Create faceted boxplots to compare how hawk weight distributions differ by age within each species.</p>

<h3>Chart Specification:</h3>
<ul>
<li>Use <code>mark_boxplot</code> to show the distribution of hawk weights for each age group.</li>
<li>Encode <code>age</code> on the <strong>y channel</strong> as <code>nominal</code>, titled <em>Age</em>.</li>
<li>Encode <code>weight</code> on the <strong>x channel</strong> as <code>quantitative</code>, titled <em>Weight (g)</em>.</li>
<li>Encode <code>age</code> on the <strong>color channel</strong> as <code>nominal</code> to distinguish groups, but remove the legend (<code>legend=None</code>).</li>
<li>Use <code>facet</code> with <code>row='species:N'</code> so that each species appears as a separate horizontal panel.</li>
<li>Display median lines in <strong>black</strong> with a stroke width of 3 pixels.</li>
</ul>

<h3>Styling Specifications:</h3>
<ul>
<li>Set chart <strong>width</strong> to 400 pixels and <strong>height</strong> to 100 pixels for each facet.</li>
<li>Apply distinct colors per age group (Altair’s default categorical palette).</li>
<li>Chart title: <em>Weight Distribution by Species, Faceted by Age</em>.</li>
<li>Each facet should clearly separate species for easy comparison across age categories.</li>
</ul>

</div>



In [None]:
# Faceted boxplots - much clearer!


**Discussion: When to Facet vs. When to Use Color?**

**Use FACETING when:**
- Too many categories would crowd a single plot
- You want clear, separated comparisons
- Emphasizing differences between groups
- Each facet tells its own story

**Use COLOR when:**
- You want to see overlap between groups
- Fewer categories (2-4)
- Prefer a single, unified view
- Comparing proportions or densities

---

## Part 4: Violin Plots - Adding Distribution Shape

### 4.1: The Limitation of Boxplots

Boxplots are great, but they hide something important: **the shape of the distribution**.

Let's compare a histogram and a boxplot side by side:

In [None]:
# Histogram shows the shape
histogram = alt.Chart(hawks).mark_bar().encode(
    x=alt.X('tail:Q', bin=True, title='Tail Length (mm)'),
    y=alt.Y('count()', title='Count')
).properties(
    width=400,
    height=200,
    title='Histogram: Shows Distribution Shape'
)

# Boxplot shows statistics
boxplot = alt.Chart(hawks).mark_boxplot().encode(
    x=alt.X('tail:Q', scale=alt.Scale(zero=False, padding=5),title='Tail Length (mm)')
).properties(
    width=400,
    height=100,
    title='Boxplot: Shows Statistics Only'
)

# Stack them vertically
histogram & boxplot

**The Problem with Boxplots:**
- Can't see if distribution is symmetric or skewed
- Can't see if there are multiple peaks (bimodal)
- Can't see the actual density of data points

**The Solution: Violin Plots!**
- Combine statistical summary (like boxplot)
- WITH distribution shape (like histogram)
- Best of both worlds!

---

### 4.2: Building a Violin Plot 

**Step 1: Understand transform_density**

In [None]:
# Step 1: Create a density plot (one side of the violin)
alt.Chart(hawks).transform_density(
    'tail',                           # Variable to analyze
    as_=['tail', 'density']          # Output column names
).mark_area(
    orient='horizontal',              # Make it horizontal
    opacity=0.7
).encode(
    y=alt.Y('tail:Q', title='Tail Length (mm)'),
    x=alt.X('density:Q', title='Density')
).properties(
    width=200,
    height=300,
    title='Step 1: Density Curve (One Side)'
)

**Explanation:**
- `transform_density('tail', ...)` estimates a smooth density curve (a kernel density estimate) showing where tail lengths are concentrated
- The result is a smooth curve instead of histogram bars
- `as_=['tail', 'density']` names the output columns
- `orient='horizontal'` makes it go sideways
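
To see what a density transform does conceptually, here is a minimal hand-rolled Gaussian kernel density estimate on synthetic tail lengths (not the real data). The bandwidth uses a common normal-reference rule of thumb, which is an assumption; the default that Altair/Vega-Lite picks may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
tail = rng.normal(200, 15, size=200)   # synthetic tail lengths in mm (NOT real data)

# Rule-of-thumb bandwidth (an assumption; Vega-Lite chooses its own default).
bandwidth = 1.06 * tail.std(ddof=1) * len(tail) ** (-1 / 5)

# Evaluate the density on an evenly spaced grid that extends past the data.
grid = np.linspace(tail.min() - 4 * bandwidth, tail.max() + 4 * bandwidth, 400)

# One Gaussian bump per observation, averaged at every grid position.
z = (grid[:, None] - tail[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(tail) * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 (simple Riemann sum here).
step = grid[1] - grid[0]
area = density.sum() * step
print(f"bandwidth = {bandwidth:.2f} mm, area under curve = {area:.3f}")
```

The `grid` and `density` arrays play the same role as the two output columns named in `as_`: one holds the positions, the other the estimated density at each position.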

**Step 2: Make it symmetrical (the violin shape!)**

In [None]:
# Step 2: Add stack='center' to create the violin shape
alt.Chart(hawks).transform_density(
    'tail',
    as_=['tail', 'density']
).mark_area(
    orient='horizontal',
    opacity=0.7
).encode(
    y=alt.Y('tail:Q', title='Tail Length (mm)'),
    x=alt.X('density:Q',
            stack='center')           # This creates symmetry!
            # Optional: also pass axis=None to alt.X for a cleaner look
).properties(
    width=150,
    height=300,
    title='Step 2: Violin Plot (Symmetrical)'
)

**Key Point:**
- `stack='center'` mirrors the density on both sides
- That's what creates the "violin" shape!
- Width at any point = density of data at that value

---



### 4.3: Comparing Groups with Violins

Now let's compare species using violin plots:

In [None]:
# Compare species with violin plots
alt.Chart(hawks).transform_density(
    'tail',
    as_=['tail', 'density'],
    groupby=['species']               # Separate density for each species!
).mark_area(
    orient='horizontal',
    opacity=0.7
).encode(
    y=alt.Y('tail:Q', title='Tail Length (mm)'),
    x=alt.X('density:Q', stack='center', axis=None),
    color=alt.Color('species:N', legend=alt.Legend(title='Species')),
    column=alt.Column('species:N', title='Species')  # Each species in its own panel
).properties(
    width=120,
    height=250,
    title='Tail Length Distribution by Species'
)

**Discussion: What patterns do you see that a boxplot would hide?**
- Are any distributions skewed?
- Are there multiple peaks in any species?
- Where is most of the data concentrated?

**When to use Violin Plots:**
- When distribution SHAPE matters for your analysis
- When you want to see if data is normally distributed
- When comparing distributions across groups
- When you have enough data points (50+ per group ideally)

---

## Part 5: Exploring Relationships (10 min)

### 5.1: Do Measurements Correlate? (5 min)

**The Question:** "If a hawk has a long tail, does it also have a heavy weight?"

Let's investigate with a scatter plot and regression line:

<div style="border-left: 5px solid #007BFF; padding: 15px; background-color: #F0F8FF; border-radius: 8px;">

<h2>Relationship Between Tail Length and Weight</h2>

<p><strong>Exploratory Question:</strong>  
<em>Is there a relationship between a hawk’s tail length and its weight?</em></p>

<p><strong>VIZ TASK:</strong> Create a scatter plot showing the relationship between tail length and weight for each hawk, and add a regression line to capture the overall trend.</p>

<h3>Chart Specification:</h3>
<ul>
<li>Use <code>mark_circle</code> for individual data points, representing each hawk.</li>
<li>Encode <code>tail</code> on the <strong>x channel</strong> as <code>quantitative</code> and title it <em>Tail Length (mm)</em>.</li>
<li>Encode <code>weight</code> on the <strong>y channel</strong> as <code>quantitative</code> and title it <em>Weight (g)</em>.</li>
<li>Encode <code>species</code> on the <strong>color channel</strong> as <code>nominal</code> with a legend titled <em>Species</em>.</li>
<li>Add a regression line using <code>transform_regression('tail', 'weight')</code> to model the trend across all data.</li>
<li>Use <code>mark_line</code> for the regression, styled in red to make it visually distinct.</li>
<li>Layer the scatter plot and regression line using <code>(scatter + regression)</code>.</li>
</ul>

<h3>Styling Specifications:</h3>
<ul>
<li>Set point <strong>size</strong> to 40 and <strong>opacity</strong> to 0.6 for subtle overlap.</li>
<li>Set regression line <strong>color</strong> to red and <strong>width</strong> to 3 pixels.</li>
<li>Set overall chart <strong>width</strong> to 500 pixels and <strong>height</strong> to 400 pixels.</li>
<li>Title: <em>Do Tail Length and Weight Correlate?</em></li>
</ul>

</div>



In [None]:
# Scatter plot
scatter = ...

# Trend line for all data
regression = ...

# Layer them together
(scatter + regression).properties(
    width=500,
    height=400,
    title='Do Tail Length and Weight Correlate?'
)

In [None]:
correlation_coef = hawks['tail'].corr(hawks['weight'])
print(f"Correlation coefficient: {correlation_coef:.3f}")


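**Understanding `transform_regression`:**
- By default, this fits a straight line through the data
- Slope = how much weight changes for each extra mm of tail
- Positive slope = both increase together
- The strength of the relationship is how tightly the points follow the line (the correlation), not how steep the line is
- But wait... is it the SAME relationship for all species?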

Let's check by creating separate regression lines for each species:

In [None]:
# Scatter plot (same as before)
scatter = alt.Chart(hawks).mark_circle(
    size=40,
    opacity=0.6
).encode(
    x='tail:Q',
    y='weight:Q',
    color='species:N'
)

# Separate regression line for EACH species
regression_by_species = alt.Chart(hawks).mark_line(
    size=3
).transform_regression(
    'tail', 'weight',
    groupby=['species']               #  Separate line for each species!
).encode(
    x='tail:Q',
    y='weight:Q',
    color='species:N'
)

# Combine
(scatter + regression_by_species).properties(
    width=500,
    height=400,
    title='Tail vs. Weight: Different Relationships for Each Species!'
)

**Key Insight:**
- Different slopes for different species!
- The relationship varies by species
- This is why grouping/faceting matters
- Always ask: "Is this pattern the same across groups?"
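
Fitting one line per group can also be sketched outside Altair. The example below uses invented data for two hypothetical groups, "A" and "B", with deliberately different slopes, and fits an ordinary least-squares line per group with `np.polyfit`. It illustrates the idea of a grouped fit, not Altair's internal implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
frames = []
# Hypothetical (slope, intercept) per group; invented for illustration.
for name, slope, intercept in [("A", 1.0, 0.0), ("B", 5.0, 100.0)]:
    x = rng.uniform(100, 200, size=100)
    y = intercept + slope * x + rng.normal(0, 5, size=100)
    frames.append(pd.DataFrame({"species": name, "tail": x, "weight": y}))
toy = pd.concat(frames, ignore_index=True)

# One least-squares line per group (the analogue of groupby=['species']).
fits = {}
for name, group in toy.groupby("species"):
    slope, intercept = np.polyfit(group["tail"], group["weight"], deg=1)
    fits[name] = slope
    print(f"{name}: weight ~ {slope:.2f} * tail + {intercept:.1f}")

# A single pooled line ignores the grouping entirely.
pooled_slope, _ = np.polyfit(toy["tail"], toy["weight"], deg=1)
print(f"pooled slope: {pooled_slope:.2f}")
```

Comparing the per-group slopes with the pooled slope shows how much a single line can hide when groups behave differently.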

---


In [None]:
# Calculate and display correlation for each species
for species in hawks['species'].unique():
    species_data = hawks[hawks['species'] == species]
    corr = species_data['weight'].corr(species_data['tail'])
    print(f"{species}: {corr:.3f}")

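This pattern is closely related to **Simpson's Paradox**: a trend in the pooled data can differ in strength (or even direction) from the trend within each group.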

## The Explanation

The **overall correlation (0.875)** is driven largely by **differences between species**, not within species:

- **RT (Red-tailed hawks)** are likely the largest species (heaviest weight, longest tails)
- **SS (Sharp-shinned hawks)** are likely the smallest  
- **CH (Cooper's hawks)** are medium-sized

When you pool all the data together, you're capturing the strong relationship: "bigger species → heavier weight AND longer tails." This between-group variation creates the high 0.875 correlation.

But **within each species**, the relationship is weaker (0.26, 0.56, 0.76) because you're only looking at individual variation within birds of similar size. A heavy Red-tailed hawk doesn't necessarily have a proportionally longer tail than a lighter Red-tailed hawk.

## Visual Analogy

Think of three clusters of points:
- **Cluster 1 (RT)**: Top-right of the plot (large birds, weak internal correlation)
- **Cluster 2 (CH)**: Middle (medium birds, moderate internal correlation)  
- **Cluster 3 (SS)**: Bottom-left (small birds, stronger internal correlation)

The three clusters together form a tight diagonal line (correlation ≈ 0.88), but within each cluster, the points are more scattered.

## Lesson

When analyzing relationships, **always check for grouping variables** (like species). The aggregate correlation can be misleading if there are distinct subgroups with different characteristics!
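
The effect is easy to reproduce with synthetic data. In this minimal sketch, three invented groups ("small", "medium", "large") sit far apart, while inside each group `y` depends only weakly on `x`. The group names, centers, and noise levels are assumptions chosen for illustration and are not derived from the hawks dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
frames = []
# Made-up group centers; within each group, y depends only weakly on x.
for name, center in [("small", 0.0), ("medium", 10.0), ("large", 20.0)]:
    x = center + rng.normal(0, 1, size=200)
    y = center + 0.3 * (x - center) + rng.normal(0, 1, size=200)
    frames.append(pd.DataFrame({"group": name, "x": x, "y": y}))
toy = pd.concat(frames, ignore_index=True)

# Correlation over all rows, ignoring the grouping.
pooled_r = toy["x"].corr(toy["y"])

# Correlation computed separately inside each group.
within_r = {name: g["x"].corr(g["y"]) for name, g in toy.groupby("group")}

print(f"pooled: {pooled_r:.3f}")
for name, r in within_r.items():
    print(f"{name}: {r:.3f}")
```

The pooled correlation is dominated by the distance between group centers, which is why it comes out much higher than any of the within-group values.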

---

## Get Stepping

**To prepare for PCP6:**

1. **Redo this entire class notebook** - You learn by doing!
2. **Try different variables** - Use `wing` instead of `tail`, `culmen` instead of `weight`, etc.
3. **Experiment with styling** - Change colors, sizes, try different color schemes
4. **Review Tutorial 6** - It has additional examples and explanations
5. **Practice the workflow** - Data quality → Statistics → Visualization → Interpretation

**Remember:** The best way to prepare is to TYPE the code yourself, not just read it!
