# Chapter 8: Data Visualization - Turning Numbers Into Stories

---

## The CRAWL → WALK → RUN Framework

This textbook uses a structured approach to learning Python while developing effective AI collaboration skills. Each chapter follows three distinct phases:

| Mode | Icon | AI Policy | Purpose |
|------|------|-----------|--------|
| **CRAWL** | 🐛 | No AI assistance | Build foundational skills you can demonstrate independently |
| **WALK** | 🚶 | AI for understanding only | Use AI to explain concepts and errors, but write your own code |
| **RUN** | 🚀 | Full AI collaboration | Partner with AI on complex tasks while documenting your process |

**Why This Matters:** Your final exam will test CRAWL and WALK material with no AI assistance. If you skip the foundational work and rely entirely on AI, you won't pass. The progression ensures you build genuine competence before leveraging AI as a professional tool.

## 📊 Case Study Continues: From Analysis to Communication

In Chapter 7, you learned to analyze the Lehigh student dataset with pandas. You calculated statistics, grouped data, and identified patterns. But here's a question: **if you showed your manager a table of mean GPAs by college, would they understand what it means?**

Probably. But would they *remember* it? Would they *act* on it?

Data visualization is the bridge between analysis and action. A well-designed chart can:

- Reveal patterns that tables hide
- Communicate findings in seconds instead of minutes
- Persuade decision-makers to act
- Expose data quality issues you missed

**The Reality Check:**

Bad visualizations are everywhere. Misleading axes, cluttered legends, 3D pie charts that distort proportions. Learning to create *good* visualizations means learning what makes them effective, not just how to generate them.

**The Tools:**

| Library | Purpose | Relationship |
|---------|---------|-------------|
| **Matplotlib** | Low-level plotting, full control | The foundation |
| **Seaborn** | Statistical visualizations, beautiful defaults | Built on matplotlib |
| **Pandas plotting** | Quick exploration from DataFrames | Uses matplotlib internally |

Think of it this way: matplotlib is like having all the individual LEGO bricks. Seaborn gives you pre-assembled kits. Pandas plotting is the quick-start guide. You need to understand matplotlib to customize anything serious, but seaborn will handle most everyday plots with less code.
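To make the layering concrete, here is a minimal sketch showing that pandas plotting hands back an ordinary matplotlib object. The small `toy` DataFrame is made up for illustration and is not the Lehigh dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.axes import Axes

# Hypothetical stand-in data (not the Lehigh dataset)
toy = pd.DataFrame({"GPA": [2.9, 3.1, 3.4, 3.0, 3.7, 2.6]})

# pandas draws with matplotlib and returns the Axes it drew on
ax = toy["GPA"].plot(kind="hist", bins=5, edgecolor="black")
returns_axes = isinstance(ax, Axes)
print(returns_axes)

# Because it is a regular Axes, matplotlib methods work on it
ax.set_xlabel("GPA")
ax.set_title("Quick look from pandas, labeled with matplotlib")
```

This is why matplotlib knowledge carries over: whichever library draws the plot, the finishing touches usually go through the same Axes methods.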

**Continuing with Our Dataset:**

We'll visualize the Lehigh student dataset (600 students, 7 variables) throughout this chapter. By the end, you'll be able to create publication-quality visualizations that tell a clear story about student performance.

## Learning Objectives

By the end of this chapter, you will:

- 🐛 Create basic plots with `plt.plot()`, `plt.bar()`, `plt.scatter()`, and `plt.hist()`
- 🐛 Add titles, axis labels, and legends to plots
- 🐛 Understand the matplotlib figure/axes architecture
- 🐛 Save figures to files with `plt.savefig()`
- 🐛 Create histograms and box plots to explore distributions
- 🚶 Use seaborn for statistical visualizations with less code
- 🚶 Create multi-panel figures with subplots
- 🚶 Customize colors, styles, and aesthetics
- 🚶 Choose the right chart type for your data and question
- 🚀 Build a complete visualization dashboard for the student dataset
- 🚀 Critique and improve AI-generated visualizations

---

# 🐛 CRAWL: Matplotlib Fundamentals

**Rules for this section:**
- Close all AI tools (ChatGPT, Claude, Copilot, etc.)
- Work through examples by typing them yourself
- Use only this notebook, Python documentation, or your instructor for help
- This material will appear on the final exam without AI assistance

---

## 📚 DataCamp Resources for Chapter 8

**[Introduction to Data Visualization with Matplotlib](https://www.datacamp.com/courses/introduction-to-data-visualization-with-matplotlib)** - Complete these:

| Chapter | Topics Covered | Alignment |
|---------|---------------|------------|
| Chapter 1: Introduction to Matplotlib | Basic plotting, customization | Sections 8.1-8.3 |
| Chapter 2: Plotting Time-Series | Line plots, annotations | Section 8.4 |
| Chapter 3: Quantitative Comparisons | Bar charts, histograms | Sections 8.5-8.6 |
| Chapter 4: Sharing Visualizations | Saving, styles, best practices | Section 8.7 |

**[Introduction to Data Visualization with Seaborn](https://www.datacamp.com/courses/introduction-to-data-visualization-with-seaborn)** - Complete these:

| Chapter | Topics Covered | Alignment |
|---------|---------------|------------|
| Chapter 1: Introduction to Seaborn | Scatter plots, count plots | Section 8.8 |
| Chapter 2: Visualizing Two Quantitative Variables | Relational plots | Sections 8.8-8.9 |
| Chapter 3: Visualizing a Categorical and a Quantitative Variable | Box plots, bar plots | Section 8.9 |
| Chapter 4: Customizing Seaborn Plots | Styles, palettes, faceting | Section 8.10 |

**Estimated time:** 6-8 hours total

---

## 8.1 Getting Started with Matplotlib

Let's start by importing our libraries and loading our data.

In [None]:
# Standard imports for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Load the Lehigh student dataset
df = pd.read_csv('lehigh_students_clean.csv')

# Quick inspection
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Configure matplotlib to display plots inline (Jupyter specific)
# This line is often optional in modern Jupyter, but good practice
%matplotlib inline

# Set a reasonable default figure size
plt.rcParams['figure.figsize'] = [8, 5]
plt.rcParams['figure.dpi'] = 100

## 8.2 Your First Plot: The Histogram

A histogram shows the distribution of a single numerical variable. It's often the first plot you should make when exploring data.

In [None]:
# Basic histogram of GPA
plt.hist(df['GPA'])
plt.show()

That works, but it's not very informative. Let's add context:

In [None]:
# Better histogram with labels and customization
plt.hist(df['GPA'], bins=20, edgecolor='black', alpha=0.7)
plt.xlabel('GPA')
plt.ylabel('Number of Students')
plt.title('Distribution of Student GPAs at Lehigh University')
plt.show()

**What we added:**
- `bins=20`: More granularity (default is 10)
- `edgecolor='black'`: Outlines each bar so they're distinguishable
- `alpha=0.7`: Slight transparency (useful when overlapping)
- `xlabel()`, `ylabel()`, `title()`: Essential context
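To see what `bins` actually changes, the sketch below counts the same values with 10 and with 20 bins using `np.histogram`, which bins data the same way `plt.hist()` does. The `gpa` array is randomly generated stand-in data, not the Lehigh dataset.

```python
import numpy as np

# Hypothetical GPA-like values (not the Lehigh dataset)
rng = np.random.default_rng(0)
gpa = np.clip(rng.normal(3.0, 0.5, 600), 0.0, 4.0)

# Same data, two different bin counts
counts10, edges10 = np.histogram(gpa, bins=10)
counts20, edges20 = np.histogram(gpa, bins=20)

# n bins always need n + 1 edges
print(len(counts10), len(edges10))
print(len(counts20), len(edges20))

# Every value lands in exactly one bin, so the totals match
print(counts10.sum(), counts20.sum())
```

More bins never add or remove data. They only split the same 600 values into narrower ranges, which shows finer detail but also makes the bar heights noisier.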

**What the plot tells us:**
- GPAs are roughly normally distributed
- The center is around 3.0
- Very few students below 2.0 or at 4.0

In [None]:
# Add a vertical line for the mean
plt.hist(df['GPA'], bins=20, edgecolor='black', alpha=0.7)
plt.axvline(df['GPA'].mean(), color='red', linestyle='--', linewidth=2, label=f"Mean: {df['GPA'].mean():.2f}")
plt.xlabel('GPA')
plt.ylabel('Number of Students')
plt.title('Distribution of Student GPAs at Lehigh University')
plt.legend()
plt.show()

### ✏️ Practice 8.1: Basic Histograms

In [None]:
# 1. Create a histogram of Credits_Earned
# Include: title, axis labels, 15 bins, black edges
# Your code:


In [None]:
# 2. Create a histogram of Credits_Attempted
# Add vertical lines for both the mean (red) and median (blue)
# Include a legend
# Your code:


## 8.3 Bar Charts: Comparing Categories

Bar charts compare values across categories. Use them when you have categorical data on one axis and numerical data on the other.

In [None]:
# Count students by college
college_counts = df['College'].value_counts()
print(college_counts)

In [None]:
# Basic bar chart
plt.bar(college_counts.index, college_counts.values)
plt.show()

The labels are overlapping. This is a common problem with categorical data. Let's fix it:

In [None]:
# Better bar chart with rotated labels
plt.figure(figsize=(10, 6))  # Wider figure
plt.bar(college_counts.index, college_counts.values, color='steelblue', edgecolor='black')
plt.xlabel('College')
plt.ylabel('Number of Students')
plt.title('Student Enrollment by College')
plt.xticks(rotation=45, ha='right')  # Rotate labels, align right
plt.tight_layout()  # Prevent labels from being cut off
plt.show()

In [None]:
# Horizontal bar chart (often better for long category names)
plt.figure(figsize=(10, 6))
plt.barh(college_counts.index, college_counts.values, color='steelblue', edgecolor='black')
plt.xlabel('Number of Students')
plt.ylabel('College')
plt.title('Student Enrollment by College')
plt.tight_layout()
plt.show()

In [None]:
# Mean GPA by college (requires calculation first)
gpa_by_college = df.groupby('College')['GPA'].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(gpa_by_college.index, gpa_by_college.values, color='forestgreen', edgecolor='black')
plt.xlabel('Average GPA')
plt.ylabel('College')
plt.title('Average GPA by College')
plt.xlim(2.5, 3.5)  # Zoom in on the relevant range
plt.tight_layout()
plt.show()

**Important design choice:** The x-axis starts at 2.5, not 0. This makes differences visible but could be misleading. Always consider whether truncating axes is appropriate for your audience. For internal exploration, it's fine. For public communication, start at 0 or clearly indicate the break.
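A quick calculation shows how much a truncated axis exaggerates a difference. The two averages and the helper `drawn_ratio` are made up for illustration; they are not taken from the dataset.

```python
def drawn_ratio(a, b, baseline):
    """Ratio of the drawn bar lengths when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

# Hypothetical average GPAs for two colleges
high, low = 3.1, 2.9

print(round(drawn_ratio(high, low, 0.0), 2))  # 1.07
print(round(drawn_ratio(high, low, 2.5), 2))  # 1.5
```

With the axis starting at 0, the longer bar is about 7% longer, which matches the real difference. Starting at 2.5, the same bar is drawn 50% longer, so a reader who compares bar lengths will overestimate the gap.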

### ✏️ Practice 8.2: Bar Charts

In [None]:
# 1. Create a bar chart showing the number of students in each Class_Year
# Order should be: Freshman, Sophomore, Junior, Senior
# Your code:


In [None]:
# 2. Create a horizontal bar chart showing average Credits_Earned by College
# Sort from highest to lowest
# Your code:


## 8.4 Scatter Plots: Relationships Between Variables

Scatter plots show the relationship between two numerical variables. Each point represents one observation.

In [None]:
# Relationship between Credits_Attempted and Credits_Earned
plt.scatter(df['Credits_Attempted'], df['Credits_Earned'])
plt.xlabel('Credits Attempted')
plt.ylabel('Credits Earned')
plt.title('Credits Earned vs. Credits Attempted')
plt.show()

Strong positive relationship (as expected): students who attempt more credits generally earn more. Points that fall below the y = x diagonal (where credits earned equal credits attempted) represent credits that were attempted but not completed.
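The gap between a point and the diagonal can also be computed directly. This sketch uses five invented students, and the column `Credits_Not_Completed` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical students (not the Lehigh dataset)
toy = pd.DataFrame({
    "Credits_Attempted": [30, 45, 60, 90, 120],
    "Credits_Earned":    [30, 42, 60, 84, 117],
})

# Vertical distance below the y = x diagonal
toy["Credits_Not_Completed"] = toy["Credits_Attempted"] - toy["Credits_Earned"]

# True for every point that sits below the diagonal
below_diagonal = toy["Credits_Earned"] < toy["Credits_Attempted"]

print(toy)
print(f"Students below the diagonal: {below_diagonal.sum()} of {len(toy)}")
```

Students who completed everything sit exactly on the diagonal, so their gap is 0.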

In [None]:
# Add a reference line (y=x, meaning 100% completion)
plt.figure(figsize=(8, 8))  # Square figure
plt.scatter(df['Credits_Attempted'], df['Credits_Earned'], alpha=0.5)

# Add y=x line
max_credits = max(df['Credits_Attempted'].max(), df['Credits_Earned'].max())
plt.plot([0, max_credits], [0, max_credits], 'r--', linewidth=2, label='100% Completion')

plt.xlabel('Credits Attempted')
plt.ylabel('Credits Earned')
plt.title('Credits Earned vs. Credits Attempted')
plt.legend()
plt.show()

In [None]:
# Relationship between Credits_Earned and GPA
plt.scatter(df['Credits_Earned'], df['GPA'], alpha=0.5)
plt.xlabel('Credits Earned')
plt.ylabel('GPA')
plt.title('GPA vs. Credits Earned')
plt.show()

Interesting. No obvious relationship between credits earned and GPA. Students with more credits (likely seniors) don't systematically have higher or lower GPAs.

In [None]:
# Color points by Class_Year to see if patterns emerge
colors = {'Freshman': 'blue', 'Sophomore': 'green', 'Junior': 'orange', 'Senior': 'red'}

plt.figure(figsize=(10, 6))
for year in ['Freshman', 'Sophomore', 'Junior', 'Senior']:
    subset = df[df['Class_Year'] == year]
    plt.scatter(subset['Credits_Earned'], subset['GPA'], 
                alpha=0.5, label=year, c=colors[year])

plt.xlabel('Credits Earned')
plt.ylabel('GPA')
plt.title('GPA vs. Credits Earned by Class Year')
plt.legend()
plt.show()

Now we see the story: class year naturally clusters by credits earned (freshmen have fewer credits), but within each cluster, GPA varies widely.

### ✏️ Practice 8.3: Scatter Plots

In [None]:
# 1. Create a scatter plot of Credits_Attempted (x) vs GPA (y)
# Include axis labels and title
# Your code:


In [None]:
# 2. Create the same scatter plot but color points by College
# Hint: You'll need to loop through each college like the Class_Year example
# Your code:


## 8.5 Box Plots: Distributions by Category

Box plots (box-and-whisker plots) show the distribution of a numerical variable across categories. They display:
- The median (middle line)
- The interquartile range (IQR, the box)
- Whiskers extending to the most extreme data points that lie within 1.5 * IQR of the box edges
- Outliers as individual points
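The pieces of a box plot can be computed by hand, which helps when reading one. This is a minimal sketch using matplotlib's default rule (1.5 * IQR) on a small invented sample; the variable names are illustrative.

```python
import numpy as np

# Hypothetical GPA values with one unusually low score
values = np.array([0.9, 2.2, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.9])

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences mark how far the whiskers are allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Box: {q1:.2f} to {q3:.2f}, median {median:.2f}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outliers}")
```

Note that the whiskers end at real data values, not at the fences themselves, which is why whiskers on the same plot can have different lengths.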

In [None]:
# Box plot of GPA
plt.boxplot(df['GPA'])
plt.ylabel('GPA')
plt.title('Distribution of GPA')
plt.show()

In [None]:
# Box plot of GPA by Class_Year
# We need to prepare data as a list of arrays, one per group
class_order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
data_by_year = [df[df['Class_Year'] == year]['GPA'] for year in class_order]

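# Note: Matplotlib 3.9 deprecates labels= in favor of tick_labels=; use tick_labels= on newer versions
plt.boxplot(data_by_year, labels=class_order)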
plt.xlabel('Class Year')
plt.ylabel('GPA')
plt.title('GPA Distribution by Class Year')
plt.show()

This is where matplotlib becomes tedious. We had to manually prepare the data. Seaborn makes this much easier (coming in the WALK section).

In [None]:
# Box plot of GPA by College
colleges = df['College'].unique()
data_by_college = [df[df['College'] == college]['GPA'] for college in colleges]

plt.figure(figsize=(12, 6))
plt.boxplot(data_by_college, labels=colleges)
plt.xlabel('College')
plt.ylabel('GPA')
plt.title('GPA Distribution by College')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

**Reading the box plot:**
- The boxes show where the middle 50% of students fall
- The line in the box is the median
- Dots outside the whiskers are outliers
- Taller boxes (a larger IQR) suggest more variability; the width of a box carries no meaning

## 8.6 The Figure and Axes Model

Matplotlib has two interfaces:
1. **pyplot interface** (what we've been using): Quick and simple, but limited
2. **Object-oriented interface**: More control, essential for complex figures

The key concepts:
- **Figure**: The entire image/window
- **Axes**: A single plot within the figure (can have multiple)
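The relationship between the two objects can be checked directly. This short sketch creates one Figure with one Axes and confirms how they refer to each other; the boolean variable names are only for illustration.

```python
import matplotlib.pyplot as plt

# One Figure containing one Axes
fig, ax = plt.subplots(figsize=(6, 4))

# The Axes knows which Figure it belongs to
ax_belongs_to_fig = ax.figure is fig

# The Figure keeps a list of all the Axes it contains
fig_lists_ax = fig.axes == [ax]

# pyplot functions such as plt.xlabel() act on the "current" Axes
current_is_ax = plt.gca() is ax

print(ax_belongs_to_fig, fig_lists_ax, current_is_ax)
```

The last check explains the pyplot interface: calls like `plt.title()` quietly look up the current Axes and forward the request to it.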

In [None]:
# Create a figure with explicit axes
fig, ax = plt.subplots(figsize=(8, 5))

# Now use ax methods instead of plt functions
ax.hist(df['GPA'], bins=20, edgecolor='black', alpha=0.7)
ax.set_xlabel('GPA')
ax.set_ylabel('Number of Students')
ax.set_title('Distribution of Student GPAs')

plt.show()

Notice the difference:
- `plt.xlabel()` becomes `ax.set_xlabel()`
- `plt.title()` becomes `ax.set_title()`

Why bother? Because with the object-oriented approach, we can create multiple subplots:

In [None]:
# Create a 2x2 grid of plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Top-left: GPA histogram
axes[0, 0].hist(df['GPA'], bins=20, edgecolor='black', alpha=0.7, color='steelblue')
axes[0, 0].set_xlabel('GPA')
axes[0, 0].set_ylabel('Count')
axes[0, 0].set_title('GPA Distribution')

# Top-right: Credits_Earned histogram
axes[0, 1].hist(df['Credits_Earned'], bins=20, edgecolor='black', alpha=0.7, color='forestgreen')
axes[0, 1].set_xlabel('Credits Earned')
axes[0, 1].set_ylabel('Count')
axes[0, 1].set_title('Credits Earned Distribution')

# Bottom-left: Students by College
college_counts = df['College'].value_counts()
axes[1, 0].barh(college_counts.index, college_counts.values, color='coral', edgecolor='black')
axes[1, 0].set_xlabel('Number of Students')
axes[1, 0].set_title('Students by College')

# Bottom-right: Students by Class Year
year_order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
year_counts = df['Class_Year'].value_counts().reindex(year_order)
axes[1, 1].bar(year_counts.index, year_counts.values, color='mediumpurple', edgecolor='black')
axes[1, 1].set_xlabel('Class Year')
axes[1, 1].set_ylabel('Number of Students')
axes[1, 1].set_title('Students by Class Year')

plt.tight_layout()
plt.show()

## 8.7 Saving Figures

Use `plt.savefig()` or `fig.savefig()` to save plots to files.

In [None]:
# Create a figure worth saving
fig, ax = plt.subplots(figsize=(10, 6))
gpa_by_college = df.groupby('College')['GPA'].mean().sort_values(ascending=True)
ax.barh(gpa_by_college.index, gpa_by_college.values, color='steelblue', edgecolor='black')
ax.set_xlabel('Average GPA', fontsize=12)
ax.set_title('Average GPA by College at Lehigh University', fontsize=14)
ax.set_xlim(2.8, 3.2)

# Save in different formats
fig.savefig('gpa_by_college.png', dpi=150, bbox_inches='tight')
fig.savefig('gpa_by_college.pdf', bbox_inches='tight')
fig.savefig('gpa_by_college.svg', bbox_inches='tight')

print("Saved as PNG, PDF, and SVG")
plt.show()

**Format recommendations:**
- **PNG**: Raster format, good for web/presentations, use `dpi=150` or higher
- **PDF**: Vector format, perfect for papers and reports (scales infinitely)
- **SVG**: Vector format, editable in design software

`bbox_inches='tight'` removes excess whitespace around the figure.
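For raster formats, the pixel size of the saved file is the figure size in inches times the `dpi`. The sketch below saves a small made-up figure to a temporary folder and reads the PNG back to check; the file names are placeholders. It leaves out `bbox_inches='tight'` on purpose, because cropping would change the pixel dimensions.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

# A small placeholder figure: 4 inches wide, 3 inches tall
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(["A", "B", "C"], [3, 5, 2])
ax.set_title("Format comparison")

# Throwaway folder so the example does not clutter your project
out_dir = Path(tempfile.mkdtemp())
png_path = out_dir / "demo.png"
pdf_path = out_dir / "demo.pdf"

fig.savefig(png_path, dpi=150)  # raster: size depends on dpi
fig.savefig(pdf_path)           # vector: no fixed pixel size

# Read the PNG back to inspect its dimensions
height_px, width_px = plt.imread(str(png_path)).shape[:2]
print(width_px, height_px)  # 600 450
```

Doubling `dpi` to 300 would give a 1200 x 900 image from the same figure, which is the usual way to get sharper output for print.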

### ✏️ Practice 8.4: Figure and Axes

In [None]:
# 1. Create a 1x2 subplot (one row, two columns)
# Left: Histogram of Credits_Attempted
# Right: Histogram of Credits_Earned
# Use the object-oriented interface (fig, axes = plt.subplots(...))
# Your code:


In [None]:
# 2. Create a single plot showing the distribution of GPA
# Save it as 'my_gpa_histogram.png' with dpi=200
# Your code:


---

## CRAWL Checkpoint: Key Concepts So Far

Before moving on, make sure you can do these **without AI assistance**:

1. Create a histogram with `plt.hist()` and customize bins, colors, edges
2. Create bar charts (vertical and horizontal) with `plt.bar()` and `plt.barh()`
3. Create scatter plots with `plt.scatter()`
4. Add titles, axis labels, and legends
5. Use `plt.figure(figsize=(w, h))` to control figure size
6. Use `plt.tight_layout()` to prevent overlapping elements
7. Create multi-panel figures with `plt.subplots(rows, cols)`
8. Save figures with `plt.savefig()` or `fig.savefig()`

---

## CRAWL Exercises

Complete these exercises without AI assistance. They test the material covered so far.

### Problem 8.1: Code Prediction

Predict what each code snippet will produce **before** running it.

In [None]:
# What will this plot look like? How many bars?
plt.bar(['A', 'B', 'C'], [10, 20, 15])
plt.show()

# Your prediction before running:

In [None]:
# What's wrong with this code? What error will you get?
fig, axes = plt.subplots(1, 2)
axes[0].hist(df['GPA'])
axes[1, 0].hist(df['Credits_Earned'])  # Bug is here
plt.show()

# Your prediction:

### Problem 8.2: Debug These Plots

In [None]:
# Error 1: The labels are cut off. Fix it.
college_counts = df['College'].value_counts()
plt.bar(college_counts.index, college_counts.values)
plt.xlabel('College')
plt.ylabel('Count')
plt.title('Students by College')
plt.show()

In [None]:
# Error 2: The legend doesn't appear. Fix it.
plt.scatter(df['Credits_Earned'], df['GPA'], alpha=0.5)
plt.axhline(df['GPA'].mean(), color='red', linestyle='--')
plt.legend()
plt.show()

### Problem 8.3: From Description to Code

Write the code to create each visualization described.

In [None]:
# a) A histogram showing the distribution of Credits_Attempted
#    - Use 25 bins
#    - Add a vertical line at the median
#    - Include a title: "Distribution of Credits Attempted"
#    - Label both axes appropriately
# Your code:


In [None]:
# b) A horizontal bar chart showing the number of students in each Major
#    - Only show the top 10 majors by student count
#    - Sort from most students (top) to fewest (bottom)
#    - Include a title and axis labels
# Your code:


In [None]:
# c) A scatter plot showing Credits_Attempted (x) vs Credits_Earned (y)
#    - Add a diagonal line showing y=x (perfect completion)
#    - Make points semi-transparent (alpha=0.4)
#    - Include title, axis labels, and legend for the line
# Your code:


---

# 🚶 WALK: Seaborn and Advanced Customization

**Rules for this section:**
- You may use AI tools to **explain** concepts and errors
- You must **write all code yourself**
- Good prompts: "Explain how seaborn's hue parameter works" or "What's the difference between catplot and barplot?"
- Bad prompts: "Write code to create a box plot of GPA by college"

---

## 8.8 Introduction to Seaborn

Seaborn is built on matplotlib but provides:
- Better default aesthetics
- Direct integration with pandas DataFrames
- Statistical visualizations in one line
- Automatic handling of categorical data

In [None]:
# Compare: matplotlib box plot (tedious)
class_order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
data_by_year = [df[df['Class_Year'] == year]['GPA'] for year in class_order]
plt.boxplot(data_by_year, labels=class_order)
plt.title('Matplotlib: Manual data prep required')
plt.show()

In [None]:
# Compare: seaborn box plot (simple)
sns.boxplot(data=df, x='Class_Year', y='GPA', order=['Freshman', 'Sophomore', 'Junior', 'Senior'])
plt.title('Seaborn: Direct DataFrame integration')
plt.show()

One line vs. three. And the seaborn version looks better by default.
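Because seaborn is built on matplotlib, its plotting functions return a matplotlib Axes that you can keep customizing. The following is a small sketch on invented data (not the Lehigh dataset) that passes an existing Axes to `sns.boxplot()` and then adjusts it with matplotlib methods.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.axes import Axes

# Hypothetical stand-in data
toy = pd.DataFrame({
    "Class_Year": ["Freshman", "Freshman", "Freshman", "Senior", "Senior", "Senior"],
    "GPA": [2.8, 3.2, 3.0, 3.5, 3.1, 3.6],
})

# Create the Axes with matplotlib, then let seaborn draw on it
fig, ax = plt.subplots(figsize=(5, 4))
returned = sns.boxplot(data=toy, x="Class_Year", y="GPA",
                       order=["Freshman", "Senior"], ax=ax)

# seaborn hands back the same Axes it drew on
same_axes = returned is ax
print(same_axes)

# From here on, everything from the CRAWL section still applies
ax.set_title("Seaborn plot, matplotlib finishing touches")
ax.set_ylim(0, 4)
```

This pattern is also how seaborn plots fit into a multi-panel figure: create the grid with `plt.subplots()` and pass each Axes through the `ax` argument.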

In [None]:
# Set seaborn style for all subsequent plots
sns.set_style('whitegrid')  # Options: white, dark, whitegrid, darkgrid, ticks
sns.set_palette('Set2')  # Color palette

## 8.9 Seaborn Plot Types

Seaborn organizes plots by the type of data relationship they show:

| Function | Purpose | When to Use |
|----------|---------|-------------|
| `histplot()` | Distribution of one variable | Exploring a single numeric column |
| `boxplot()` | Distribution by category | Compare distributions across groups |
| `violinplot()` | Distribution by category (detailed) | See full distribution shape |
| `barplot()` | Mean (or other stat) by category | Compare averages across groups |
| `countplot()` | Count by category | Frequency of categorical values |
| `scatterplot()` | Relationship between two numerics | Correlation, clusters |
| `lineplot()` | Trend over continuous variable | Time series, sequences |
| `heatmap()` | Matrix of values | Correlation matrices, pivot tables |

In [None]:
# histplot: Better histogram
sns.histplot(data=df, x='GPA', bins=20, kde=True)  # kde adds a smooth density curve
plt.title('GPA Distribution with Density Curve')
plt.show()

In [None]:
# countplot: Bar chart for categorical data
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='College')
plt.xticks(rotation=45, ha='right')
plt.title('Number of Students by College')
plt.tight_layout()
plt.show()

In [None]:
# barplot: Shows mean with confidence interval
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='College', y='GPA')
plt.xticks(rotation=45, ha='right')
plt.title('Average GPA by College (with 95% CI)')
plt.tight_layout()
plt.show()

Notice the error bars. By default they show a 95% confidence interval for each group's mean, which seaborn estimates by bootstrapping, giving you a sense of statistical uncertainty.
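A bootstrap confidence interval can be approximated by hand, which shows what an error bar of this kind represents. This is a simplified sketch of the idea on randomly generated values, not seaborn's exact implementation and not the Lehigh data.

```python
import numpy as np

# Hypothetical GPAs for one college (80 students)
rng = np.random.default_rng(42)
gpa = rng.normal(3.0, 0.5, 80)

# Resample the students with replacement many times,
# recording the mean of each resample
boot_means = np.array([
    rng.choice(gpa, size=gpa.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of those means is the confidence interval
low, high = np.percentile(boot_means, [2.5, 97.5])

print(f"Sample mean: {gpa.mean():.2f}")
print(f"95% CI: {low:.2f} to {high:.2f}")
```

Groups with fewer students produce wider intervals, so a short bar with a long error bar is a signal to be cautious about that group's average.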

In [None]:
# boxplot: Distribution by category
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='College', y='GPA')
plt.xticks(rotation=45, ha='right')
plt.title('GPA Distribution by College')
plt.tight_layout()
plt.show()

In [None]:
# violinplot: Like boxplot but shows full distribution
plt.figure(figsize=(12, 6))
sns.violinplot(data=df, x='College', y='GPA')
plt.xticks(rotation=45, ha='right')
plt.title('GPA Distribution by College (Violin Plot)')
plt.tight_layout()
plt.show()

In [None]:
# scatterplot with hue: Color by a third variable
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Credits_Earned', y='GPA', hue='Class_Year', 
                hue_order=['Freshman', 'Sophomore', 'Junior', 'Senior'], alpha=0.6)
plt.title('GPA vs Credits Earned by Class Year')
plt.show()

The `hue` parameter is powerful. It adds a dimension to your plot by coloring points based on a categorical variable.

In [None]:
# boxplot with hue: Compare across two categorical variables
plt.figure(figsize=(14, 6))
sns.boxplot(data=df, x='College', y='GPA', hue='Class_Year',
            hue_order=['Freshman', 'Sophomore', 'Junior', 'Senior'])
plt.xticks(rotation=45, ha='right')
plt.title('GPA by College and Class Year')
plt.legend(title='Class Year', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()

### ✏️ Practice 8.5: Seaborn Basics

In [None]:
# 1. Create a seaborn histplot of Credits_Attempted with a density curve (kde=True)
# Your code:


In [None]:
# 2. Create a barplot showing average Credits_Earned by Class_Year
# Order the years logically
# Your code:


In [None]:
# 3. Create a boxplot of GPA by Class_Year
# Add hue='College' to see breakdown by both
# Your code:


## 8.10 Heatmaps and Correlation

Heatmaps are excellent for showing relationships in matrices, like correlation matrices.

In [None]:
# Calculate correlation matrix for numeric columns
numeric_cols = ['GPA', 'Credits_Attempted', 'Credits_Earned']
correlation_matrix = df[numeric_cols].corr()
print(correlation_matrix)

In [None]:
# Visualize as heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            vmin=-1, vmax=1, square=True, fmt='.2f')
plt.title('Correlation Matrix: Numeric Variables')
plt.show()

**Reading the heatmap:**
- Values range from -1 (perfect negative correlation) to +1 (perfect positive)
- Credits_Attempted and Credits_Earned are strongly correlated (0.98)
- GPA has weak correlation with credits (~0.05)
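Two structural facts make correlation heatmaps easier to read: the diagonal is always 1, and the matrix is symmetric. The sketch below checks both on randomly generated stand-in data, so its correlation values will not match the ones quoted above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: earned credits track attempted credits closely,
# while GPA is generated independently of both
rng = np.random.default_rng(1)
attempted = rng.integers(12, 130, 200)
toy = pd.DataFrame({
    "GPA": rng.normal(3.0, 0.5, 200),
    "Credits_Attempted": attempted,
    "Credits_Earned": attempted - rng.integers(0, 6, 200),
})

corr = toy.corr()
print(corr.round(2))

# Every variable correlates perfectly with itself
diagonal_is_one = np.allclose(np.diag(corr), 1.0)

# corr(X, Y) equals corr(Y, X), so the two triangles mirror each other
is_symmetric = np.allclose(corr, corr.T)

print(diagonal_is_one, is_symmetric)
```

In practice this means only one triangle of the heatmap carries information, so you need to read just the cells above or below the diagonal.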

In [None]:
# Heatmap of mean GPA by College and Class Year (pivot table)
pivot = df.pivot_table(values='GPA', index='College', columns='Class_Year', aggfunc='mean')
pivot = pivot[['Freshman', 'Sophomore', 'Junior', 'Senior']]  # Reorder columns

plt.figure(figsize=(10, 8))
sns.heatmap(pivot, annot=True, cmap='YlGnBu', fmt='.2f')
plt.title('Average GPA by College and Class Year')
plt.show()

## 8.11 FacetGrid: Small Multiples

Small multiples show the same plot repeated for different subsets of the data. They're powerful for comparing patterns across groups.
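The same idea can be built with plain matplotlib, which shows what a facet grid automates. This is a minimal sketch on randomly generated data (not the Lehigh dataset): one panel per group, with shared axes so the panels are directly comparable.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data
rng = np.random.default_rng(3)
years = ["Freshman", "Sophomore", "Junior", "Senior"]
toy = pd.DataFrame({
    "Class_Year": rng.choice(years, 200),
    "GPA": rng.normal(3.0, 0.5, 200).clip(0, 4),
})

# One row of panels; sharex/sharey force identical scales
fig, axes = plt.subplots(1, len(years), figsize=(12, 3),
                         sharex=True, sharey=True)

# Draw the same kind of plot on each panel, one subset at a time
for ax, year in zip(axes, years):
    subset = toy.loc[toy["Class_Year"] == year, "GPA"]
    ax.hist(subset, bins=10, edgecolor="black", alpha=0.7)
    ax.set_title(year)
    ax.set_xlabel("GPA")

axes[0].set_ylabel("Count")
fig.tight_layout()
```

Shared axes are the important design choice here. If each panel picked its own scale, bars of the same height would mean different counts and the comparison would be misleading.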

In [None]:
# Histogram of GPA for each College
g = sns.FacetGrid(df, col='College', col_wrap=3, height=4)
g.map(plt.hist, 'GPA', bins=15, edgecolor='black', alpha=0.7)
g.set_axis_labels('GPA', 'Count')
g.set_titles('{col_name}')
plt.tight_layout()
plt.show()

In [None]:
# Scatter plot for each Class Year
g = sns.FacetGrid(df, col='Class_Year', col_order=['Freshman', 'Sophomore', 'Junior', 'Senior'], 
                  height=4, aspect=1)
g.map(plt.scatter, 'Credits_Attempted', 'Credits_Earned', alpha=0.5)
g.set_axis_labels('Credits Attempted', 'Credits Earned')
g.set_titles('{col_name}')
plt.tight_layout()
plt.show()

Now you can see the clear progression: Freshmen cluster at low credits, Seniors at high credits.

## 8.12 Choosing the Right Chart

The most important skill in visualization isn't knowing the syntax. It's knowing *what* to visualize.

| Question Type | Recommended Plot |
|--------------|------------------|
| What's the distribution of X? | Histogram, density plot |
| How does X compare across groups? | Box plot, bar plot |
| Is there a relationship between X and Y? | Scatter plot |
| What are the counts in each category? | Bar plot, count plot |
| How do two categorical variables interact? | Heatmap, grouped bar, stacked bar |
| How does X change over time/sequence? | Line plot |
| What are the correlations in my data? | Correlation heatmap |

**Common mistakes to avoid:**
- Pie charts for more than 3-4 categories (use bar charts instead)
- 3D charts in almost any situation (they distort proportions)
- Too many colors or categories in one plot
- Missing axis labels or titles
- Truncated axes without clear indication

### ✏️ Practice 8.6: Choosing Charts

In [None]:
# For each question, create the most appropriate visualization:

# 1. "What is the distribution of completion rates (Credits_Earned / Credits_Attempted)?"
# First calculate completion rate, then visualize
# Your code:


In [None]:
# 2. "Which colleges have the highest and lowest average completion rates?"
# Your code:


In [None]:
# 3. "Is there a relationship between completion rate and GPA?"
# Your code:


In [None]:
# 4. "How does the GPA distribution differ between Freshmen and Seniors?"
# Your code:


---

# 🚀 RUN: Visualization Dashboard Project

**Rules for this section:**
- AI collaboration is encouraged
- Document how you use AI and what you learn
- Focus on understanding the final result, not just generating code
- Be critical: AI-generated visualizations often need refinement

---

## Project: Lehigh University Student Success Dashboard

Your task is to create a comprehensive visualization dashboard that tells the story of student performance at Lehigh University. This is the kind of deliverable you might present to university administrators.

**Requirements:**

The dashboard should answer these questions with appropriate visualizations:

1. **Overall Distribution:** What does the GPA distribution look like across all students?
2. **College Comparison:** How do GPAs vary by college? Is the variation statistically meaningful?
3. **Class Year Progression:** Do GPAs change from freshman to senior year?
4. **Credit Completion:** What is the relationship between credits attempted and earned? Are there concerning patterns?
5. **At-Risk Identification:** Which students might need academic support? (GPA < 2.0, low completion rate)
6. **Top Performers:** What characterizes high-achieving students?

**Deliverables:**
1. A multi-panel figure (at least 6 plots) that tells a coherent story
2. Individual publication-quality figures for at least 3 key insights
3. Written interpretation of what the visualizations reveal

### Part 1: Data Preparation

In [None]:
# Create derived columns that will be useful for visualization
# - Completion_Rate
# - GPA_Category (e.g., "Below 2.0", "2.0-2.5", "2.5-3.0", "3.0-3.5", "3.5-4.0")
# - At_Risk flag (GPA < 2.0 or Completion_Rate < 0.85)
# - Quality_Points (GPA * Credits_Earned)

# Your implementation:


### Part 2: Multi-Panel Dashboard

In [None]:
# Create a multi-panel figure (suggested: 3 rows x 2 columns = 6 plots)
# Each panel should address one of the questions from the requirements
# Use appropriate plot types for each question

# Your implementation:


### Part 3: Key Insight Visualizations

In [None]:
# Create a publication-quality visualization for your most important insight
# This should be polished enough to present to university administrators
# Include:
# - Clear title that states the insight (not just describes the data)
# - Properly labeled axes with units if applicable
# - Legend if needed
# - Annotation highlighting key findings
# Save as both PNG (high DPI) and PDF

# Your implementation:


In [None]:
# Create a second key insight visualization

# Your implementation:


In [None]:
# Create a third key insight visualization

# Your implementation:


### Part 4: At-Risk Student Analysis

In [None]:
# Visualize the at-risk student population
# Consider:
# - What percentage of students are at-risk?
# - Which colleges have the highest proportion of at-risk students?
# - Are at-risk students concentrated in certain class years?

# Your implementation:


### Part 5: Top Performer Profile

In [None]:
# Visualize characteristics of top performers (GPA >= 3.5)
# Consider:
# - What majors/colleges are most represented?
# - How do their credit completion rates compare?
# - Is there a relationship with class year?

# Your implementation:


### Part 6: Written Analysis

Write a brief report (3-5 paragraphs) summarizing your findings:

1. What are the most important patterns in student performance?
2. Which visualizations were most effective for communicating these patterns?
3. What recommendations would you make to university administrators?
4. What additional data would help deepen this analysis?

*Your analysis here:*



### Project Reflection

1. Which visualization was hardest to create? Why?
2. Did any visualizations reveal something you didn't expect from the data?
3. How did you use AI during this project? What worked well? What didn't?
4. If you had more time, what would you add or improve?

*Your reflection here:*



---

# Accountability Check

## 🐛 CRAWL (Must do without AI)
- [ ] Create histograms with `plt.hist()` including bins, colors, edges
- [ ] Create bar charts (vertical and horizontal) with `plt.bar()` and `plt.barh()`
- [ ] Create scatter plots with `plt.scatter()`
- [ ] Add titles, axis labels, and legends to any plot
- [ ] Use `plt.figure(figsize=(...))` to control figure size
- [ ] Create multi-panel figures with `plt.subplots()`
- [ ] Save figures with `plt.savefig()` in PNG and PDF formats
- [ ] Add vertical/horizontal reference lines with `axvline()` and `axhline()`

## 🚶 WALK (AI to learn, write code yourself)
- [ ] Use seaborn for statistical plots: `boxplot()`, `violinplot()`, `barplot()`
- [ ] Use the `hue` parameter to add a categorical dimension
- [ ] Create heatmaps with `sns.heatmap()`
- [ ] Use FacetGrid for small multiples
- [ ] Choose appropriate chart types for different questions
- [ ] Customize colors, styles, and palettes

## 🚀 RUN (AI-assisted, must understand)
- [ ] Build multi-panel dashboards that tell a coherent story
- [ ] Create publication-quality figures with proper annotations
- [ ] Write interpretations that connect visualizations to insights
- [ ] Critique and improve AI-generated visualizations

**Review CRAWL material if you can't do it from memory. This will be on the final exam.**

---

## What's Next?

You've completed the core Python for Data Analysis curriculum:

- **Chapters 1-5:** Python fundamentals, data structures, file I/O
- **Chapter 6:** NumPy for numerical computing
- **Chapter 7:** Pandas for data manipulation
- **Chapter 8:** Visualization for communication

In **Chapter 9**, you'll bring everything together for the **Final Project**, where you'll:

- Work with a new dataset (or choose your own)
- Apply the complete data analysis workflow
- Create a professional report with visualizations
- Document your process and AI collaboration

The skills you've built in this course map directly to professional data analysis work. The CRAWL-WALK-RUN framework has prepared you to use AI as a productivity tool while maintaining the foundational understanding needed to verify, debug, and improve AI-generated solutions.

**Final Exam Reminder:** The final exam will test CRAWL and WALK material without AI assistance. Review the Accountability Checklists from all chapters.

---