# Class 5: Statistical vs Visual EDA with Hawks Dataset

**Prerequisites:** Have actively worked through Tutorials 1 - 5.

---

## Learning Objectives

By the end of this class, students will be able to:
- Conduct systematic dataset exploration using pandas fundamentals
- Distinguish when to use statistical vs visual approaches in EDA workflows
- Investigate distributional questions methodically using both approaches
- Integrate pandas and Altair effectively for comprehensive analysis
- Apply decision frameworks for choosing analytical tools based on goals
- Discuss how bin size affects a histogram's fidelity to the underlying data

---

## Environment Setup

In [None]:
import pandas as pd
import altair as alt
import numpy as np

alt.renderers.enable("default")

---

## Part 1: Meet the Hawks Dataset


In [None]:
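# FILEPATH: keep exactly one of the two assignments below active.
# As written, the second assignment overrides the first.

# If on PL use this one
filepath = 'data/hawks.csv'

# If running locally on your machine use this one
filepath = 'https://raw.githubusercontent.com/kemiolamudzengi/dsci-320-datasets/main/hawks.csv'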

# Load the hawks dataset
hawks = pd.read_csv(filepath)


### Basic Exploration Questions
Before we can get started, we need to get an overall sense of our dataset.

When presented with a new dataset, how do you start your exploration? 
What questions should you ask?
1. How big is the dataset (i.e., how many rows and columns)?
2. How many attributes are there?
3. What are the data types of each attribute?
4. What is the missing data situation?
5. Which attributes seem like the best starting point for exploration?


Let's start by systematically exploring our new dataset using **pandas fundamentals**:

In [None]:

# Basic dataset metadata
print(f"Dataset shape: {hawks.shape}\n")


# Data types and detailed overview
hawks.info()

In [None]:
# First few rows to understand the data structure
# (display() is needed because a notebook cell only shows its last expression)
display(hawks.head())
# Or use sample() so you don't only see what is at the top
hawks.sample(10)

In [None]:
# Missing data landscape
missing_summary = hawks.isnull().sum()
print("Missing values per column:")
print(missing_summary[missing_summary > 0])

print(f"\nTotal missing values: {missing_summary.sum()}")
print(f"Percentage of complete rows: {(hawks.dropna().shape[0] / hawks.shape[0] * 100):.1f}%")

#### Missing Data Issue

Only 3.9% of our rows have values for every attribute.
YIKES, this is not a great start.
Next week we will dig deeper into missing data analysis. For now we will take the avoidance approach and simply exclude the worst-offending columns (i.e. Sex, Tarsus, WingPitFat, KeelFat, Crop, StandardTail). I'm also going to remove some other columns so that we can have a targeted exploration.
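
One way to decide which columns count as the worst offenders is to rank them by their share of missing values. The sketch below uses a small made-up table (not the hawks data), and the 50% cutoff is an arbitrary assumption chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names and values are invented for this sketch
toy = pd.DataFrame({
    "wing": [385.0, 376.0, 381.0, 265.0],
    "weight": [920.0, np.nan, 990.0, 470.0],
    "tarsus": [np.nan, np.nan, np.nan, 45.0],
})

# isnull() gives True/False per cell; the column mean is the fraction missing
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Assumed cutoff: flag columns with more than half of their values missing
worst_columns = missing_share[missing_share > 0.5].index.tolist()
print(worst_columns)
```

The same two lines work on `hawks` directly; the cutoff is a judgment call that depends on how much data you can afford to lose.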

In [None]:
cols_to_drop = [
    'Unnamed: 0', 'ReleaseTime', 'StandardTail', 'Tarsus', 'Sex',
    'KeelFat', 'Crop', 'BandNumber', 'CaptureTime', 'WingPitFat', 'Day'
]

hawks_clean = (
    hawks
    .drop(columns=cols_to_drop, errors='ignore')
    .rename(columns=lambda x: x.strip().lower()) # we are used to lower-case attribute names, so drop the title case
).copy()

print(f"Cleaned dataset shape: {hawks_clean.shape}")


In [None]:
# Let's do a missing data analysis again

# Missing data landscape
missing_summary = hawks_clean.isnull().sum()
print("Missing values per column:")
print(missing_summary[missing_summary > 0])

print(f"\nTotal missing values: {missing_summary.sum()}")
print(f"Percentage of complete rows: {(hawks_clean.dropna().shape[0] / hawks_clean.shape[0] * 100):.1f}%")


Okay, so we went from 3.9% to 98% complete rows. WE ARE AWESOME!!!
That said, in industry you can't just ignore every attribute that has missing values, so we will revisit this later.


Here are the attributes that we decided to focus on today:

### Hawks Dataset Data Card

| Column | Description | Attribute Type |
|--------|-------------|----------------|
| month | Month of capture | Temporal |
| year | Year of capture | Temporal |
| species | Hawk species | Nominal |
| age | Age group: A=Adult or I=Immature | Ordinal |
| wing | Length (in mm) of the primary wing feather, from the tip to the wrist it attaches to | Quantitative |
| weight | Body weight (in g) | Quantitative |
| culmen | Length (in mm) of the upper bill, from the tip to where it bumps into the fleshy part of the bird | Quantitative |
| hallux | Length (in mm) of the killing talon | Quantitative |
| tail | Measurement (in mm) related to the length of the tail | Quantitative |

### Dataset Context & Discussion Points

**🔍 What type of data collection is this?**
- Research study of captured and measured hawks
- Multiple numeric measurements (weight, wing length, etc.)
- Categorical classifications (species, sex, age)

**🤔 What variables might be related?**
- Size measurements probably correlate
- Species likely affects size significantly  
- Sex differences within species?

**⚠️ Data quality considerations:**
- Missing data patterns (detailed analysis in Class 6)
- Potential outliers in measurements
- Measurement precision and errors

---


## Part 2: The Distributional Question 

### Setting Up the Core EDA Question

> **In typical EDA workflows, we ask: "How does a quantitative variable distribute across our observed data?"**

This is one of the most fundamental questions in exploratory data analysis. Let's investigate the distribution of hawk weights.

### The Standard Approach: Histogram

<div style="border-left: 5px solid #007BFF; padding: 15px; background-color: #F0F8FF; border-radius: 8px;">

<h2>Hawk Weight Histogram</h2>

<p><strong>Exploratory Question:</strong>  
<em>What is the distribution of hawk weights in the dataset?</em></p>

<p><strong>VIZ TASK:</strong> Create a histogram showing the distribution of hawk weights.</p>

<h3>Chart Specification:</h3>
<ul>
<li>Use <code>mark_bar</code> with <code>binSpacing=0</code> to ensure no gaps between bins.</li>
<li>Encode <code>weight</code> on the <strong>x channel</strong> as <code>quantitative</code> with a binning specification (<code>maxbins=25</code>).</li>
<li>Encode <code>count()</code> on the <strong>y channel</strong> as <code>quantitative</code>, titled <em>Number of Hawks</em>.</li>
<li>Set the chart <strong>width</strong> to 400 pixels and <strong>height</strong> to 300 pixels.</li>
<li>Give the chart the title <em>Hawk Weight Distribution</em>.</li>
</ul>

</div>


In [None]:
# Basic distribution exploration using histogram
hawk_histogram = ...


# Show the chart
hawk_histogram

### Observation
What do you notice about the distribution?
...
How would you characterize this distribution?
...

## Part 3: The Statistical Approach 

### Fundamental Pandas EDA Techniques

Let's systematically investigate using **computational summaries**:

In [None]:
# 1. Overall distribution characteristics
print("Overall weight distribution:")
print(hawks_clean['weight'].describe())

In [None]:
# 2. Investigate potential grouping variables
print("Available categorical variables:")
categorical_cols = hawks_clean.select_dtypes(include=['object']).columns

for col in categorical_cols:
    print(f"\n{col}: {hawks_clean[col].nunique()} unique values")
    print(hawks_clean[col].value_counts())

In [None]:
# 3. Group-wise statistical analysis
# Hypothesis 1: Does species explain the distribution?
species_stats = hawks_clean.groupby('species')['weight'].describe()
print("Weight statistics by species:")
print(species_stats)

In [None]:
# 3. Group-wise statistical analysis
# Hypothesis 2: Does age explain the distribution?
age_stats = hawks_clean.groupby('age')['weight'].describe()
print("Weight statistics by age:")
print(age_stats)
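
If a grouping variable explains a multimodal distribution, the spread within each group should be much smaller than the spread of the pooled data. Here is a minimal sketch of that comparison with made-up weights; the `group` labels and all values are hypothetical and are not taken from the hawks data:

```python
import pandas as pd

# Hypothetical toy data: two well-separated groups
toy = pd.DataFrame({
    "group": ["small"] * 4 + ["large"] * 4,
    "weight": [140.0, 150.0, 160.0, 170.0, 1000.0, 1050.0, 1100.0, 1150.0],
})

# Spread of all weights pooled together
pooled_std = toy["weight"].std()

# Spread of weights inside each group
within_std = toy.groupby("group")["weight"].std()

print(f"Pooled std: {pooled_std:.1f}")
print(within_std.round(1))
```

When the pooled standard deviation dwarfs every within-group value, the grouping variable accounts for most of the variation. Comparing the `std` columns of `species_stats` and `age_stats` against the overall `describe()` output applies the same check to the real data.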

### 🔍 Observation: Statistical Detective Findings

**From the statistical analysis, what do you notice?**





---



## Part 4: The Visual Approach

### Visual Investigation Using Faceting

Now let's investigate the **same question** using visual exploration.
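
A density plot is a smoothed relative of the histogram: conceptually, a kernel density estimate places a small bump on every observation and averages the bumps. The NumPy sketch below illustrates that idea with a Gaussian kernel on made-up weights. The helper `gaussian_kde_sketch`, the bandwidth of 50, and the data are all assumptions for illustration, and this is not necessarily how Altair's `transform_density` picks its defaults.

```python
import numpy as np

def gaussian_kde_sketch(values, grid, bandwidth):
    """Average one Gaussian bump per observation, evaluated on a grid."""
    values = np.asarray(values, dtype=float)
    z = (grid[:, None] - values[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical toy weights for two groups
small = np.array([140.0, 150.0, 160.0, 170.0, 180.0])
large = np.array([950.0, 1000.0, 1050.0, 1100.0, 1150.0])

grid = np.linspace(0, 1500, 301)  # grid step of 5
bandwidth = 50.0                  # assumed smoothing width

pooled_density = gaussian_kde_sketch(np.concatenate([small, large]), grid, bandwidth)
small_density = gaussian_kde_sketch(small, grid, bandwidth)
large_density = gaussian_kde_sketch(large, grid, bandwidth)

# Each group has a single peak; the pooled curve has a valley between them
small_peak = grid[small_density.argmax()]
large_peak = grid[large_density.argmax()]
print(small_peak, large_peak)
```

Grouping before estimating the density is what lets each group's curve show its own single peak instead of contributing one hump to a pooled, multimodal curve.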

<br>

<div style="border-left: 5px solid #007BFF; padding: 15px; background-color: #F0F8FF; border-radius: 8px;">

<h2>Density of Hawk Weights by Species</h2>

<p><strong>Exploratory Question:</strong>  
<em>Does species explain the multimodal distribution of hawk weights?</em></p>

<p><strong>VIZ TASK:</strong> Create a density plot of hawk weights, grouped by species, to compare distributions.</p>

<h3>Chart Specification:</h3>
<ul>
<li>Transform the data using <code>transform_density</code> on <code>weight</code>, grouping by <code>species</code>.</li>
<li>Use <code>mark_area</code> for smooth density curves.</li>
<li>Encode <code>weight</code> on the <strong>x channel</strong> as <code>quantitative</code>, with the title <em>Weight (g)</em>.</li>
<li>Encode <code>density</code> on the <strong>y channel</strong> as <code>quantitative</code>, with the title <em>Density</em>.</li>
<li>Encode <code>species</code> on the <strong>color channel</strong> as <code>nominal</code>, with the legend title <em>Species</em>.</li>
<li>Set the chart <strong>width</strong> to 400 pixels and <strong>height</strong> to 200 pixels.</li>
<li>Give the chart the title <em>Density of Hawk Weights by Species</em>.</li>
</ul>

<h3>Styling Specifications:</h3>
<ul>
<li>Set <code>opacity</code> of the density areas to <code>0.7</code> for overlap visibility.</li>
</ul>

</div>


In [None]:
# Visual hypothesis testing: Does species explain the multimodal pattern?
density_by_species = ...


# Show the chart

density_by_species

### Observation
What do you notice from the layered density plot?

...


<div style="border-left: 5px solid #007BFF; padding: 15px; background-color: #F0F8FF; border-radius: 8px;">

<h2>Density of Hawk Weights by Age</h2>

<p><strong>Exploratory Question:</strong>  
<em>Does age explain the multimodal distribution of hawk weights?</em></p>

<p><strong>VIZ TASK:</strong> Create a density plot of hawk weights, grouped by age, to compare distributions across age categories.</p>

<h3>Chart Specification:</h3>
<ul>
<li>Transform the data using <code>transform_density</code> on <code>weight</code>, grouping by <code>age</code>.</li>
<li>Use <code>mark_area</code> to display smooth density curves.</li>
<li>Encode <code>weight</code> on the <strong>x channel</strong> as <code>quantitative</code>, with the axis title <em>Weight (g)</em>.</li>
<li>Encode <code>density</code> on the <strong>y channel</strong> as <code>quantitative</code>, with the axis title <em>Density</em>.</li>
<li>Encode <code>age</code> on the <strong>color channel</strong> as <code>ordinal</code>, with the legend title <em>Age</em>.</li>
<li>Set the chart <strong>width</strong> to 400 pixels and <strong>height</strong> to 200 pixels.</li>
<li>Give the chart the title <em>Density of Hawk Weights by Age</em>.</li>
</ul>

<h3>Styling Specifications:</h3>
<ul>
<li>Set <code>opacity</code> of the density areas to <code>0.7</code> for overlap visibility.</li>
</ul>

</div>


In [None]:
# Visual hypothesis testing: Does age explain the multimodal pattern?
density_by_age = ...

density_by_age

### Observation
What do you notice from the layered density plot?

...


### 🎨 Observation: Visual Discovery Process

**What do you immediately notice from this visualization?**

...

---


## Part 5: Integration - The Best of Both Worlds 

### Combining Statistical Precision with Visual Insight

The real power comes from using **both approaches together** for deeper understanding:

In [None]:
# Start with visual exploration → Species clearly explains multimodal distribution
# Now use statistics to quantify the visual findings

species_summary = hawks_clean.groupby('species')['weight'].agg([
    'count', 'mean', 'std', 'min', 'max'
]).round(1)

print("Quantified species differences:")
print(species_summary)

In [None]:
# The visual analysis suggested we should investigate age differences too
# Let's get statistical confirmation:

detailed_stats = hawks_clean.groupby(['species', 'age'])['weight'].agg([
    'count', 'mean', 'std'
]).round(1)

print("Detailed breakdown by species AND age:")
print(detailed_stats)
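
A two-level grouped table can be hard to scan. One option is to pivot the inner level into columns with `unstack`, so the age groups sit side by side for each species. A minimal sketch on made-up data; the species labels and weights are hypothetical:

```python
import pandas as pd

# Hypothetical toy data: two invented species, two age groups each
toy = pd.DataFrame({
    "species": ["sp_small"] * 4 + ["sp_large"] * 4,
    "age": ["A", "A", "I", "I", "A", "A", "I", "I"],
    "weight": [400.0, 440.0, 380.0, 400.0, 1100.0, 1140.0, 1000.0, 1080.0],
})

# Mean weight per (species, age), with age pivoted into columns
mean_table = toy.groupby(["species", "age"])["weight"].mean().unstack("age")

# Difference between adult and immature means within each species
mean_table["adult_minus_immature"] = mean_table["A"] - mean_table["I"]
print(mean_table)
```

Applied to `hawks_clean`, the same pattern puts the age effect within each species on a single row, which is the number the layered density plots can only hint at.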



## Part 6: Histogram Deep Dive - The Power of Binning
### *When Visualization Choices Change Your Story*

We've created histograms earlier, but have you considered how your binning choices affect the insights you discover?
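
To see why binning matters before touching a chart, here is a small NumPy sketch on made-up weights that form three clearly separated clusters. It counts how many separated groups of non-empty bins survive at each bin count. The data and the helper `count_separated_groups` are assumptions for illustration. Note also that `np.histogram` uses exactly the requested number of bins, whereas Altair's `maxbins` is only an upper bound.

```python
import numpy as np

# Hypothetical toy weights: three tight clusters with large gaps between them
weights = np.array([
    100, 110, 120, 130, 140,
    400, 410, 420, 430, 440,
    1000, 1010, 1020, 1030, 1040,
], dtype=float)

def count_separated_groups(counts):
    """Count runs of consecutive non-empty bins."""
    nonempty = counts > 0
    previous = np.concatenate(([False], nonempty[:-1]))
    # A run starts at a non-empty bin whose left neighbour is empty
    return int((nonempty & ~previous).sum())

groups_seen = {}
for n_bins in [2, 3, 10]:
    counts, edges = np.histogram(weights, bins=n_bins)
    groups_seen[n_bins] = count_separated_groups(counts)
    print(n_bins, counts.tolist(), groups_seen[n_bins])
```

With 2 or 3 bins the clusters merge into one unbroken block of bars; with 10 bins the empty bins between clusters appear and all three groups become visible.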

### **The Binning Experiment**

In [None]:
# Same data, different maxbins settings.
# Note: maxbins is an upper bound, so Altair may draw fewer bins than requested.
max_bins_options = [5, 10, 20, 30, 50]

charts = []
for max_bins in max_bins_options:
    chart = alt.Chart(hawks_clean).mark_bar().encode(
        x=alt.X('weight:Q',
               bin=alt.BinParams(maxbins=max_bins),
               title=f'Weight (g), maxbins={max_bins}'),
        y=alt.Y('count()', title='Count'),
        color=alt.value('steelblue')
    ).properties(
        width=200,
        height=150,
        title=f"maxbins = {max_bins}"
    )
    charts.append(chart)

# Arrange in a grid
binning_comparison = alt.vconcat(
    alt.hconcat(charts[0], charts[1], charts[2]),
    alt.hconcat(charts[3], charts[4])
).properties(
    title="How Binning Affects Pattern Detection"
)

binning_comparison

### Observation
What do you notice as the number of bins changes?


### **Binning by Groups**

<div style="border-left: 5px solid #007BFF; padding: 15px; background-color: #F0F8FF; border-radius: 8px;">

<h2>Weight Distribution by Species</h2>

<p><strong>Exploratory Question:</strong>  
<em>Do different hawk species require different binning strategies when examining weight distributions?</em></p>

<p><strong>VIZ TASK:</strong> Create small multiple histograms to compare the weight distributions across species.</p>

<h3>Chart Specification:</h3>
<ul>
<li>Use <code>mark_bar</code> to create histograms of hawk weights.</li>
<li>Encode <code>weight</code> on the <strong>x channel</strong> as <code>quantitative</code>, applying binning with <code>maxbins=15</code>. Title the axis <em>Weight (g)</em>.</li>
<li>Encode <code>count()</code> on the <strong>y channel</strong> as <code>quantitative</code>, with the axis title <em>Count</em>.</li>
<li>Encode <code>species</code> on the <strong>color channel</strong> as <code>nominal</code>, with the legend title <em>Species</em>.</li>
<li>Use the <strong>column channel</strong> for <code>species</code> so that each species appears as a separate small multiple histogram, with the title <em>Species</em>.</li>
<li>Set the overall chart title to <em>Weight Distribution by Species</em>.</li>
</ul>

<h3>Styling Specifications:</h3>
<ul>
<li>Set the <strong>width</strong> of each facet to 200 pixels and <strong>height</strong> to 150 pixels.</li>
<li>Apply a distinct color for each species category (Altair's default categorical palette).</li>
</ul>

</div>


In [None]:
# Do different species need different binning strategies?
species_histograms = ...

# Show the plot
species_histograms

### Observation
Change the number of bins (i.e., the `maxbins` value). Looking at each species in turn, which ones benefit from a specific bin size?





## Reflection 

### Key Takeaways Discussion

**🤔 Reflection Questions:**

1. **Complementary approaches:** How did statistics and visualization reveal different aspects of the same data?

2. **Discovery process:** Which approach led you to the "aha moment" about species differences?

3. **Efficiency trade-offs:** When might a 30-second `.describe()` be better than a 5-minute faceted chart?

4. **Integration benefits:** How did combining both approaches give deeper insights?


### Quick Practice Exercise

**Before we wrap up:**

> *"If you wanted to investigate whether wing length varies by species, would you start with pandas or Altair? Why?"*

**Think about:**
- Your analytical goal
- Time constraints  
- Expected complexity
- Audience for results

---

## Summary

Today we explored the **fundamental tension** in exploratory data analysis:

**🔢 Statistical EDA:** Fast, precise, hypothesis-driven  
**🎨 Visual EDA:** Intuitive, exploratory, communication-focused

**🎯 The key insight:** They're not competing approaches - they're complementary tools that together provide comprehensive understanding of your data.
