# CE49X: Introduction to Computational Thinking and Data Science for Civil Engineers
## Week 4 — Theory: Visualization Thinking

**Instructor:** Dr. Eyuphan Koc  
**Department of Civil Engineering, Bogazici University**  
**Semester:** Spring 2026

*Companion notebook to Week 4 lecture — designed for self-study and in-class discussion*

---

## Table of Contents

1. [Why Visualization Matters](#1.-Why-Visualization-Matters)
2. [Visual Encoding Channels](#2.-Visual-Encoding-Channels)
3. [Choosing the Right Chart](#3.-Choosing-the-Right-Chart)
4. [Communicating Uncertainty](#4.-Communicating-Uncertainty)
5. [From Data to Story](#5.-From-Data-to-Story)
6. [Common Mistakes](#6.-Common-Mistakes)
7. [Reflection](#7.-Reflection)

---
## 1. Why Visualization Matters

Summary statistics are essential, but they can hide critical patterns. Before you trust a number, **look at the data**.

The most famous demonstration of this principle is **Anscombe's Quartet** (1973) — four datasets with nearly identical descriptive statistics but wildly different distributions.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Anscombe's Quartet
datasets = {
    'I':   {'x': [10,8,13,9,11,14,6,4,12,7,5], 'y': [8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68]},
    'II':  {'x': [10,8,13,9,11,14,6,4,12,7,5], 'y': [9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74]},
    'III': {'x': [10,8,13,9,11,14,6,4,12,7,5], 'y': [7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73]},
    'IV':  {'x': [8,8,8,8,8,8,8,19,8,8,8],     'y': [6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89]}
}

# Show that the summary statistics are nearly identical across all four datasets
print("Dataset  | mean(x) | mean(y) |  std(x) |  std(y) | corr(x,y)")
print("-" * 65)
for name, d in datasets.items():
    x, y = np.array(d['x']), np.array(d['y'])
    print(f"  {name:5s} | {x.mean():7.2f} | {y.mean():7.2f} | {x.std():7.2f} | {y.std():7.2f} | {np.corrcoef(x, y)[0,1]:8.3f}")

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for ax, (name, d) in zip(axes.flat, datasets.items()):
    x, y = np.array(d['x']), np.array(d['y'])
    ax.scatter(x, y, s=60, edgecolors='k', alpha=0.8, color='steelblue')
    m, b = np.polyfit(x, y, 1)
    x_line = np.linspace(3, 20, 100)
    ax.plot(x_line, m * x_line + b, 'r--', linewidth=2)
    ax.set_title(f'Dataset {name}', fontsize=12)
    ax.set_xlim(3, 20)
    ax.set_ylim(2, 14)
    ax.grid(True, alpha=0.3)

plt.suptitle("Anscombe's Quartet: Identical Statistics, Completely Different Stories", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

> **Key Insight: Summary Statistics Can Lie. Visualization Reveals Truth.**
> All four datasets above have virtually identical means, standard deviations, and correlations.
> Yet the data tell completely different stories. **Never skip the plot.**
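
The same point can be checked numerically for the fitted line. The sketch below refits a least-squares line to datasets I and II only (values copied from the cell above); the names `y_by_name` and `fits` are introduced here purely for illustration.

```python
import numpy as np

# Datasets I and II of Anscombe's Quartet (same values as listed earlier)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y_by_name = {
    "I":  np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
}

fits = {}
for name, y in y_by_name.items():
    slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
    fits[name] = (slope, intercept)
    print(f"Dataset {name}: y = {intercept:.2f} + {slope:.3f} x")
```

Both datasets give essentially the same line, even though dataset II is clearly curved when plotted. The regression coefficients alone cannot tell you that a straight line is the wrong model.
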

> **Example: Civil Engineering**
> Imagine reporting "average beam deflection = 2.1 mm" when the data actually shows a bimodal distribution — two different failure modes. The average hides the danger.
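
A minimal sketch of that situation, using made-up numbers: two hypothetical groups of beams (centered at 1.2 mm and 3.0 mm, both values assumed for illustration) average out to roughly 2.1 mm, yet almost no individual beam deflects by anything close to that.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical deflections (mm): two groups of beams behaving differently
group_low = rng.normal(loc=1.2, scale=0.1, size=500)
group_high = rng.normal(loc=3.0, scale=0.1, size=500)
deflection = np.concatenate([group_low, group_high])

mean_defl = deflection.mean()
# Fraction of specimens that actually sit within 0.5 mm of the reported average
near_mean = np.mean(np.abs(deflection - mean_defl) < 0.5)

print(f"mean deflection = {mean_defl:.2f} mm")
print(f"share of beams within 0.5 mm of the mean = {near_mean:.1%}")
```

A histogram of `deflection` would show the two peaks immediately; the single average describes neither group.
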

---
## 2. Visual Encoding Channels

Every visualization maps data to **visual properties** (position, color, size, shape, etc.). The choice of encoding determines what patterns your audience can see — and what they miss.

Let's explore this with a Civil Engineering dataset: beam load vs. deflection across different materials.

In [None]:
np.random.seed(42)
n = 30
materials = np.random.choice(['Steel', 'Concrete', 'Timber'], n)
load = np.random.uniform(50, 300, n)
deflection = load / np.where(materials == 'Steel', 100, np.where(materials == 'Concrete', 60, 30)) + np.random.normal(0, 0.3, n)

# Encoding 1: Position only (basic scatter)
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(load, deflection, s=60, edgecolors='k', alpha=0.8, color='steelblue')
ax.set_xlabel('Load (kN)', fontsize=12)
ax.set_ylabel('Deflection (mm)', fontsize=12)
ax.set_title('Encoding 1: Position Only', fontsize=13)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**[DISCUSS]** What information can you read from this? What is hard to read?

You can see there is a general positive relationship, but can you tell which material each point represents? No.

In [None]:
# Encoding 2: Position + Color (colored by material)
color_map = {'Steel': 'steelblue', 'Concrete': 'gray', 'Timber': 'peru'}
colors = [color_map[m] for m in materials]

fig, ax = plt.subplots(figsize=(8, 5))
for mat in ['Steel', 'Concrete', 'Timber']:
    mask = materials == mat
    ax.scatter(load[mask], deflection[mask], s=60, edgecolors='k', alpha=0.8,
               color=color_map[mat], label=mat)
ax.set_xlabel('Load (kN)', fontsize=12)
ax.set_ylabel('Deflection (mm)', fontsize=12)
ax.set_title('Encoding 2: Position + Color', fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**[DISCUSS]** Now what can you see? The three materials form distinct clusters with different load-deflection relationships. Color reveals the **category**.

In [None]:
# Encoding 3: Position + Size (bubble chart — size encodes load magnitude)
fig, ax = plt.subplots(figsize=(8, 5))
sizes = (load / load.max()) * 300  # scale to reasonable bubble sizes
ax.scatter(load, deflection, s=sizes, edgecolors='k', alpha=0.5, color='steelblue')
ax.set_xlabel('Load (kN)', fontsize=12)
ax.set_ylabel('Deflection (mm)', fontsize=12)
ax.set_title('Encoding 3: Position + Size (size = load magnitude)', fontsize=13)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**[DISCUSS]** Size is redundant here because load is already on the x-axis. But in other contexts, size can encode a third variable effectively (e.g., bubble size = number of test specimens).
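
When size does carry a third variable, keep in mind that Matplotlib's `scatter` interprets `s` as marker **area** (in points squared), not diameter. The sketch below uses a hypothetical helper, `bubble_areas`, and made-up specimen counts to show what linear scaling of `s` implies for the marker radius.

```python
import numpy as np

def bubble_areas(values, max_area=300.0):
    """Map non-negative values to marker areas (points^2) proportional to the value."""
    values = np.asarray(values, dtype=float)
    return max_area * values / values.max()

specimens = np.array([5, 10, 20])   # hypothetical third variable
areas = bubble_areas(specimens)     # would be passed as s=areas to ax.scatter
radii = np.sqrt(areas / np.pi)      # marker radius implied by each area

print("area ratio  :", areas / areas[0])
print("radius ratio:", radii / radii[0])
```

A value four times larger gets four times the area but only twice the radius. Since area sits low in the perceptual ranking, readers tend to underestimate such differences, so reserve size for rough comparisons.
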

In [None]:
# Encoding 4: Position + Color + Shape (full encoding)
marker_map = {'Steel': 'o', 'Concrete': 's', 'Timber': '^'}

fig, ax = plt.subplots(figsize=(8, 5))
for mat in ['Steel', 'Concrete', 'Timber']:
    mask = materials == mat
    ax.scatter(load[mask], deflection[mask], s=80, edgecolors='k', alpha=0.8,
               color=color_map[mat], marker=marker_map[mat], label=mat)
ax.set_xlabel('Load (kN)', fontsize=12)
ax.set_ylabel('Deflection (mm)', fontsize=12)
ax.set_title('Encoding 4: Position + Color + Shape', fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

**[QUICK]** Shape is useful for colorblind accessibility and for printed (grayscale) versions. Double-encoding with both color and shape is often the safest choice.
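
A rough way to see why color alone is risky in print: convert the three named colors used above to gray levels. The sketch assumes the standard CSS RGB values for `steelblue`, `gray`, and `peru`, and uses the Rec. 601 luma weights as an approximation of a grayscale conversion; `gray_level` is a helper defined here for illustration.

```python
# RGB values (0-255) of the CSS named colors used in the scatter plots
colors_rgb = {
    "Steel (steelblue)": (70, 130, 180),
    "Concrete (gray)": (128, 128, 128),
    "Timber (peru)": (205, 133, 63),
}

def gray_level(rgb):
    """Approximate grayscale value using the Rec. 601 luma weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

gray = {name: gray_level(rgb) for name, rgb in colors_rgb.items()}
for name, value in gray.items():
    print(f"{name:20s} gray level = {value:5.1f} / 255")

spread = max(gray.values()) - min(gray.values())
print(f"spread between lightest and darkest: {spread:.1f}")
```

All three colors land within a narrow band of mid-gray, so a grayscale printout would make the materials hard to tell apart without the different marker shapes.
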

> **Key Insight: The Encoding Hierarchy**
> Humans decode visual properties with different accuracy:
>
> **Position** > **Length** > **Angle** > **Area** > **Color intensity**
>
> This is why bar charts and scatter plots (position and length) support more accurate value comparisons than pie charts (angle and area).

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

materials_unique = ['Steel', 'Concrete', 'Timber']
mean_loads = [np.mean(load[materials == m]) for m in materials_unique]
mean_defl = [np.mean(deflection[materials == m]) for m in materials_unique]

# Bad: 3D-style bar chart (simulated with offset and gradient)
x_pos = np.arange(len(materials_unique))
bars = axes[0].bar(x_pos, mean_loads, color=['#4472C4', '#ED7D31', '#A5A5A5'], width=0.6, edgecolor='k')
axes[0].bar(x_pos + 0.1, [v * 0.95 for v in mean_loads], color=['#5B9BD5', '#F4B183', '#C0C0C0'], width=0.6, alpha=0.5)
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(materials_unique)
axes[0].set_ylabel('Mean Load (kN)')
axes[0].set_title('Pseudo-3D Bar Chart\n(harder to read)', fontsize=12)
axes[0].grid(True, alpha=0.3, axis='y')

# Good: clean grouped bar chart
x_pos2 = np.arange(len(materials_unique))
width = 0.35
axes[1].bar(x_pos2 - width/2, mean_loads, width, label='Mean Load (kN)', color='steelblue', edgecolor='k')
axes[1].bar(x_pos2 + width/2, [d * 100 for d in mean_defl], width, label='Mean Deflection \u00d7 100 (mm)', color='indianred', edgecolor='k')
axes[1].set_xticks(x_pos2)
axes[1].set_xticklabels(materials_unique)
axes[1].set_title('Clean Grouped Bar Chart\n(easier to compare)', fontsize=12)
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

> **Key Insight: Just Because You CAN Make a 3D Rotating Pie Chart Doesn't Mean You SHOULD.**
>
> Pseudo-3D effects, shadows, and gradients add visual complexity without adding information. They make it harder to compare values accurately. Stick to clean, 2D representations.

---
## 3. Choosing the Right Chart

The chart type you choose should be driven by the **question** you are trying to answer, not by what looks impressive.

### Decision Framework

| Question You're Answering | Best Chart Type |
|---|---|
| Comparing quantities across categories | Bar chart |
| Showing trends over time | Line chart |
| Showing relationships between variables | Scatter plot |
| Showing distributions | Histogram / Box plot |
| Showing composition (parts of whole) | Stacked bar (not pie!) |
| Showing spatial patterns | Heatmap / Contour |

Let's test this with three Civil Engineering scenarios.

In [None]:
# Scenario 1: Concrete strength development over 28 days
np.random.seed(42)
days = [1, 3, 7, 14, 21, 28]
strength = [8, 15, 22, 28, 31, 33]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].bar(range(len(days)), strength, color='steelblue', edgecolor='k')
axes[0].set_xticks(range(len(days)))
axes[0].set_xticklabels(days)
axes[0].set_xlabel('Age (days)')
axes[0].set_ylabel('Strength (MPa)')
axes[0].set_title('Option A: Bar Chart', fontsize=12)
axes[0].grid(True, alpha=0.3, axis='y')

axes[1].plot(days, strength, 'o-', color='steelblue', linewidth=2, markersize=8)
axes[1].set_xlabel('Age (days)')
axes[1].set_ylabel('Strength (MPa)')
axes[1].set_title('Option B: Line Chart', fontsize=12)
axes[1].grid(True, alpha=0.3)

plt.suptitle('[DISCUSS] Concrete Strength Development \u2014 Which is Better?', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

**[DISCUSS]** The line chart (Option B) is better here because:
- Time is **continuous** (not categorical) — the line emphasizes this
- The **rate of strength gain** is visible from the slope
- You can see the curve flattening — approaching an asymptote

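The bar chart treats each day as an independent category and hides the temporal progression. Because its bars are evenly spaced, it also distorts the uneven gaps between test ages (2 days between the first two tests, 7 days between the last two).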
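
The slope that the line chart makes visible can also be computed directly. This is a small sketch using the same `days` and `strength` values as the cell above; `rate` is a name introduced here.

```python
import numpy as np

days = np.array([1, 3, 7, 14, 21, 28])
strength = np.array([8, 15, 22, 28, 31, 33])  # MPa, same values as the plot above

# Average rate of strength gain between consecutive test ages (MPa per day)
rate = np.diff(strength) / np.diff(days)

for d0, d1, r in zip(days[:-1], days[1:], rate):
    print(f"day {d0:2d} to {d1:2d}: {r:.2f} MPa/day")
```

The rate drops in every interval, which is the flattening curve you read off the line chart.
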

In [None]:
# Scenario 2: Strength comparison across 5 mix designs
np.random.seed(42)
mix_names = ['Mix A', 'Mix B', 'Mix C', 'Mix D', 'Mix E']
mix_data = [np.random.normal(loc, scale, 20) for loc, scale in
            [(30, 2), (35, 5), (28, 1.5), (33, 3), (31, 4)]]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Option A: Plain scatter (jittered)
for i, (name, data) in enumerate(zip(mix_names, mix_data)):
    axes[0].scatter(np.full_like(data, i) + np.random.uniform(-0.1, 0.1, len(data)),
                    data, alpha=0.5, edgecolors='k', linewidth=0.5, color='steelblue')
axes[0].set_xticks(range(len(mix_names)))
axes[0].set_xticklabels(mix_names)
axes[0].set_ylabel('Compressive Strength (MPa)')
axes[0].set_title('Option A: Strip Plot', fontsize=12)
axes[0].grid(True, alpha=0.3, axis='y')

# Option B: Box plot
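bp = axes[1].boxplot(mix_data, patch_artist=True,
                     boxprops=dict(facecolor='steelblue', alpha=0.7),
                     medianprops=dict(color='black', linewidth=2))
# Label the ticks explicitly; the `labels=` keyword is deprecated in recent Matplotlib
axes[1].set_xticklabels(mix_names)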
axes[1].set_ylabel('Compressive Strength (MPa)')
axes[1].set_title('Option B: Box Plot', fontsize=12)
axes[1].grid(True, alpha=0.3, axis='y')

plt.suptitle('[DISCUSS] Comparing Mix Designs \u2014 Which is Better?', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

**[DISCUSS]** The box plot (Option B) is more informative because:
- You can immediately see the **median**, **spread**, and **outliers**
- Mix B has a high median but also high variability: its box is tall
- Mix C is the most **consistent**: its box is short

The strip plot shows the raw data points but makes comparison harder.
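
To make explicit what a box plot draws, the sketch below computes the same quantities by hand for one hypothetical sample of strengths. It assumes Matplotlib's default whisker rule (`whis=1.5`), where points beyond 1.5 times the interquartile range from the box are drawn as outliers.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical compressive strengths (MPa) for one mix design
strengths = rng.normal(loc=35, scale=5, size=20)

q1, median, q3 = np.percentile(strengths, [25, 50, 75])
iqr = q3 - q1                    # height of the box
lower_fence = q1 - 1.5 * iqr     # points beyond the fences are
upper_fence = q3 + 1.5 * iqr     # drawn as outliers (fliers)
outliers = strengths[(strengths < lower_fence) | (strengths > upper_fence)]

print(f"median = {median:.1f} MPa, IQR = {iqr:.1f} MPa")
print(f"fences = [{lower_fence:.1f}, {upper_fence:.1f}] MPa, outliers: {outliers.size}")
```

The box spans `q1` to `q3`, the line inside it is the median, and the whiskers extend to the most extreme data points that still lie inside the fences.
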

In [None]:
# Scenario 3: Stress distribution across structural members
np.random.seed(42)
stress_matrix = np.random.uniform(10, 80, (6, 8))
member_labels = [f'Member {i+1}' for i in range(8)]
load_cases = [f'LC {i+1}' for i in range(6)]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Option A: Bar chart (cluttered)
x = np.arange(len(member_labels))
bar_width = 0.12
colors_bar = plt.cm.viridis(np.linspace(0.2, 0.8, len(load_cases)))
for i, (lc, color) in enumerate(zip(load_cases, colors_bar)):
    axes[0].bar(x + i * bar_width, stress_matrix[i], bar_width, label=lc, color=color, edgecolor='k', linewidth=0.5)
axes[0].set_xticks(x + bar_width * 2.5)
axes[0].set_xticklabels(member_labels, rotation=45, ha='right', fontsize=8)
axes[0].set_ylabel('Stress (MPa)')
axes[0].set_title('Option A: Grouped Bar Chart', fontsize=12)
axes[0].legend(fontsize=7)
axes[0].grid(True, alpha=0.3, axis='y')

# Option B: Heatmap
im = axes[1].imshow(stress_matrix, cmap='YlOrRd', aspect='auto')
axes[1].set_xticks(range(len(member_labels)))
axes[1].set_xticklabels(member_labels, rotation=45, ha='right', fontsize=9)
axes[1].set_yticks(range(len(load_cases)))
axes[1].set_yticklabels(load_cases)
axes[1].set_title('Option B: Heatmap', fontsize=12)
plt.colorbar(im, ax=axes[1], label='Stress (MPa)')

plt.suptitle('[DISCUSS] Stress Across Members & Load Cases \u2014 Which is Better?', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

**[DISCUSS]** The heatmap (Option B) wins here because:
- It shows the **full matrix** at a glance
- You can spot **hot spots** (high-stress regions) instantly
- The grouped bar chart is too cluttered, with six bars (one per load case) for every member
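
The hot spots a heatmap reveals visually can be located in code as well. A minimal sketch, assuming a hypothetical 6 x 8 stress matrix laid out like the one above (rows are load cases, columns are members):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stresses (MPa): rows are load cases, columns are members
stress = rng.uniform(10, 80, size=(6, 8))

# Position of the single highest stress in the matrix
lc_idx, member_idx = np.unravel_index(np.argmax(stress), stress.shape)
print(f"Peak stress {stress[lc_idx, member_idx]:.1f} MPa "
      f"at LC {lc_idx + 1}, Member {member_idx + 1}")

# Governing (worst) load case for each member
governing_lc = stress.argmax(axis=0) + 1
print("Governing load case per member:", governing_lc)
```

The numbers confirm what the eye picks out; the heatmap remains the faster way to spot where the problem areas cluster.
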

> **Key Insight: The Chart Type Should Match the Question**
> - Line chart for trends (shows continuity, growth rate visible)
> - Box plot for comparisons (shows distribution, not just mean)
> - Heatmap for spatial/matrix patterns (shows density and location)

---
## 4. Communicating Uncertainty

Engineering measurements always contain uncertainty. A prediction without a confidence range is incomplete. Your visualizations should make uncertainty **visible**, not hide it.

> **Definition: Uncertainty in Engineering**
> Uncertainty comes from measurement error, material variability, model approximations, and environmental factors. Building codes account for this through **safety factors** — your visualizations should too.

In [None]:
np.random.seed(42)
x = np.linspace(0, 10, 50)
y_true = 2 * x + 5
y_noisy = y_true + np.random.normal(0, 2, len(x))

from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x.reshape(-1, 1), y_noisy)
y_pred = model.predict(x.reshape(-1, 1))
residual_std = np.std(y_noisy - y_pred)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Without uncertainty
axes[0].scatter(x, y_noisy, alpha=0.5, edgecolors='k', linewidth=0.5, color='steelblue')
axes[0].plot(x, y_pred, 'r-', linewidth=2, label='Prediction')
axes[0].set_title('Without Uncertainty Bands', fontsize=12)
axes[0].set_xlabel('7-day Strength (MPa)')
axes[0].set_ylabel('28-day Strength (MPa)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# With uncertainty
axes[1].scatter(x, y_noisy, alpha=0.5, edgecolors='k', linewidth=0.5, color='steelblue')
axes[1].plot(x, y_pred, 'r-', linewidth=2, label='Prediction')
axes[1].fill_between(x, y_pred - 2*residual_std, y_pred + 2*residual_std, color='red', alpha=0.15, label='Approx. 95% prediction interval (\u00b12\u03c3)')
axes[1].fill_between(x, y_pred - residual_std, y_pred + residual_std, color='red', alpha=0.25, label='Approx. 68% prediction interval (\u00b11\u03c3)')
axes[1].set_title('With Uncertainty Bands', fontsize=12)
axes[1].set_xlabel('7-day Strength (MPa)')
axes[1].set_ylabel('28-day Strength (MPa)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle('[DISCUSS] Which Plot Would You Trust More?', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

**[TOGETHER]** The right plot communicates **honesty** — it shows the prediction is not exact. A client or reviewer who sees uncertainty bands will trust the analysis more, not less.

In [None]:
np.random.seed(42)
beams = ['Beam A', 'Beam B', 'Beam C', 'Beam D', 'Beam E']
mean_strength = [35, 42, 28, 39, 31]
std_strength = [2, 8, 1.5, 5, 3]

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(beams, mean_strength, yerr=std_strength, capsize=8, color='steelblue',
              edgecolor='k', alpha=0.8, error_kw={'linewidth': 2})
ax.axhline(y=30, color='red', linestyle='--', linewidth=2, label='Minimum required strength (30 MPa)')
ax.set_ylabel('Compressive Strength (MPa)', fontsize=12)
ax.set_title('Beam Strength Test Results with Error Bars', fontsize=13)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

> **[DISCUSS] Look at Beam B — it has the highest mean but also the widest error bars.**
> Would you trust Beam B more than Beam C (lower mean but very consistent)?
>
> Wide error bars = high variability = more uncertainty.
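
One way to put a number on that discussion is to estimate how likely each beam is to fall below the 30 MPa requirement. The sketch below assumes each beam's strength is normally distributed with the mean and standard deviation shown in the chart; that distributional assumption is not part of the original data.

```python
from statistics import NormalDist

beams = ["Beam A", "Beam B", "Beam C", "Beam D", "Beam E"]
mean_strength = [35, 42, 28, 39, 31]  # MPa, values from the bar chart above
std_strength = [2, 8, 1.5, 5, 3]      # MPa
required = 30                          # MPa, minimum required strength

# Assumption: each beam's strength follows a normal distribution
p_below = {
    beam: NormalDist(mu, sigma).cdf(required)
    for beam, mu, sigma in zip(beams, mean_strength, std_strength)
}
for beam, p in p_below.items():
    print(f"{beam}: P(strength < {required} MPa) = {p:.1%}")
```

Under this assumption, Beam A (lower mean, tight spread) is less likely to fall short than Beam B (highest mean, wide spread), while Beam C is consistent but consistently below the requirement. The mean and the spread have to be read together.
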

> **Key Insight: Engineering Data Always Has Uncertainty**
> Building codes require safety factors precisely because measurements are uncertain. Your visualizations should make uncertainty **visible**, not hide it.

In [None]:
np.random.seed(42)
n = 100
x = np.random.uniform(2, 12, n)
y = 2 * x + 5 + np.random.normal(0, 2, n)

from sklearn.linear_model import LinearRegression
model2 = LinearRegression().fit(x.reshape(-1,1), y)
x_plot = np.linspace(2, 12, 200)
y_pred2 = model2.predict(x_plot.reshape(-1,1))

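# Residual standard error (n - 2 degrees of freedom: slope and intercept were estimated)
resid = y - model2.predict(x.reshape(-1,1))
s = np.sqrt(np.sum(resid**2) / (n - 2))
# Leverage term: both bands widen as x moves away from the mean of x
leverage = 1/n + (x_plot - x.mean())**2 / np.sum((x - x.mean())**2)
# Half-widths at roughly 95% (the factor 2 approximates the t-quantile for n = 100)
# Confidence interval (uncertainty about the mean)
se_mean = 2 * s * np.sqrt(leverage)
# Prediction interval (uncertainty about individual predictions)
se_pred = 2 * s * np.sqrt(1 + leverage)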

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x, y, alpha=0.3, color='steelblue', edgecolors='k', linewidth=0.3)
ax.plot(x_plot, y_pred2, 'r-', linewidth=2, label='Regression line')
ax.fill_between(x_plot, y_pred2 - se_mean, y_pred2 + se_mean, color='red', alpha=0.3, label='Confidence interval (mean)')
ax.fill_between(x_plot, y_pred2 - se_pred, y_pred2 + se_pred, color='blue', alpha=0.1, label='Prediction interval (individual)')
ax.set_xlabel('Feature', fontsize=12)
ax.set_ylabel('Target', fontsize=12)
ax.set_title('Confidence Interval vs Prediction Interval', fontsize=13)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Confidence Interval vs Prediction Interval

These two bands answer different questions:

- **Confidence interval (narrow, red):** Where does the **mean** response (the regression line itself) probably lie? This reflects uncertainty about the model parameters.
- **Prediction interval (wide, blue):** Where will a **new individual data point** probably fall? This includes both model uncertainty and natural data variability.

> **Example: Civil Engineering**
> If you're predicting the *average* 28-day strength for a concrete batch, use the confidence interval. If you're predicting the strength of a *single* test cylinder, use the prediction interval.
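
For reference, these are the standard textbook expressions for simple linear regression with $n$ data points, evaluated at a new input $x_0$. The symbols are introduced here and are not used elsewhere in this notebook: $s$ is the residual standard error, $\bar{x}$ the mean of the inputs, and $t_{1-\alpha/2,\,n-2}$ the Student-t quantile.

$$
s^2 = \frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
S_{xx} = \sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
$$

$$
\text{Confidence interval (mean response):}\quad
\hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\; s\,\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}
$$

$$
\text{Prediction interval (new observation):}\quad
\hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}
$$

The only difference is the extra $1$ under the square root, which accounts for the scatter of an individual observation around the mean. That term does not shrink as $n$ grows, so the prediction interval stays wide even with a lot of data, while the confidence interval narrows.
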

---
## 5. From Data to Story

A common mistake in data science is to produce a "data dump" — throwing all the data on a chart and expecting the reader to find the insight. Good visualization **guides** the reader to the conclusion.

In [None]:
# The "data dump" chart — too many lines, no labels, no title, confusing
np.random.seed(42)
fig, ax = plt.subplots(figsize=(10, 6))
for i in range(8):
    x = np.linspace(0, 10, 100)
    y = np.random.uniform(10, 50) + np.random.uniform(-5, 5) * x + np.random.normal(0, 3, 100)
    ax.plot(x, y, linewidth=1)
ax.set_title('Data', fontsize=10)
plt.show()

> **[DISCUSS] What is the chart above telling you?**
> Probably nothing. There's no title explaining the point, no legend, no axis labels, and too many overlapping lines.

> **Key Insight: A Visualization Should Answer a Specific Question**
> Don't just display data — **tell a story**. Every plot should have a clear message.

In [None]:
# Transform: clear, purposeful chart with highlighted story
np.random.seed(42)
fig, ax = plt.subplots(figsize=(10, 6))

x = np.linspace(0, 10, 100)
# Background series (de-emphasized)
for i in range(6):
    y = np.random.uniform(10, 50) + np.random.uniform(-5, 5) * x + np.random.normal(0, 3, 100)
    ax.plot(x, y, color='gray', alpha=0.2, linewidth=1)

# Highlighted series
y_best = 15 + 3 * x + np.random.normal(0, 2, 100)
y_worst = 45 - 2 * x + np.random.normal(0, 2, 100)
ax.plot(x, y_best, color='steelblue', linewidth=2.5, label='Mix Design A (improving)')
ax.plot(x, y_worst, color='indianred', linewidth=2.5, label='Mix Design G (deteriorating)')

ax.set_xlabel('Curing Time (days)', fontsize=12)
ax.set_ylabel('Compressive Strength (MPa)', fontsize=12)
ax.set_title('Mix Design A Shows Strongest Strength Gain Over Time', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

> **Key Insight: Tufte's Data-Ink Ratio**
> *"Every pixel should earn its place."* Remove chart junk and maximize the data-ink ratio: the share of a graphic's ink that presents actual data.
>
> The transformation above:
> 1. Added a title that **states the conclusion**
> 2. Highlighted the key finding (2 series), pushed the rest to background
> 3. Added axis labels with units
> 4. Removed unnecessary visual clutter

**[PRACTICE]** Take a chart from one of your own projects or homework assignments. Apply the following transformations:
1. Add a title that **states the conclusion**, not just the topic
2. Remove unnecessary grid lines, legends for single-series plots, or decorative elements
3. Highlight the most important data and de-emphasize the rest
4. Add axis labels with proper units

---
## 6. Common Mistakes

Even experienced engineers make visualization mistakes. Here are four of the most common — and how to fix them.

In [None]:
# Gallery of bad visualizations
np.random.seed(42)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Truncated y-axis
strengths = [29.1, 30.2, 31.0, 31.8]
labels = ['Mix A', 'Mix B', 'Mix C', 'Mix D']
axes[0, 0].bar(labels, strengths, color='steelblue', edgecolor='k')
axes[0, 0].set_ylim(28, 32.5)
axes[0, 0].set_title('\u26a0 Truncated Y-axis\n(Exaggerates differences)', fontsize=11, color='red')
axes[0, 0].set_ylabel('Strength (MPa)')

# 2. Rainbow colormap on sequential data
data_2d = np.random.rand(8, 8) * 100
im = axes[0, 1].imshow(data_2d, cmap='hsv')
plt.colorbar(im, ax=axes[0, 1])
axes[0, 1].set_title('\u26a0 Rainbow Colormap\n(No perceptual ordering)', fontsize=11, color='red')

# 3. Pie chart with too many slices
sizes = [15, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 7]
pie_labels = [f'Item {i+1}' for i in range(len(sizes))]
axes[1, 0].pie(sizes, labels=pie_labels, textprops={'fontsize': 7})
axes[1, 0].set_title('\u26a0 Pie Chart with 14 Slices\n(Impossible to compare)', fontsize=11, color='red')

# 4. Dual y-axes
x_dual = np.arange(10)
y1 = np.random.uniform(20, 40, 10)
y2 = np.random.uniform(100, 500, 10)
axes[1, 1].plot(x_dual, y1, 'b-o', label='Strength (MPa)')
ax2_twin = axes[1, 1].twinx()
ax2_twin.plot(x_dual, y2, 'r-s', label='Cost ($)')
axes[1, 1].set_title('\u26a0 Dual Y-axes\n(Implies false correlation)', fontsize=11, color='red')
axes[1, 1].set_ylabel('Strength (MPa)', color='blue')
ax2_twin.set_ylabel('Cost ($)', color='red')

plt.suptitle('Common Visualization Mistakes', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Corrected versions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Fixed: Y-axis from 0
axes[0, 0].bar(labels, strengths, color='steelblue', edgecolor='k')
axes[0, 0].set_ylim(0, 35)
axes[0, 0].set_title('\u2713 Y-axis Starting at Zero\n(Honest comparison)', fontsize=11, color='green')
axes[0, 0].set_ylabel('Strength (MPa)')

# 2. Fixed: Sequential colormap
im2 = axes[0, 1].imshow(data_2d, cmap='viridis')
plt.colorbar(im2, ax=axes[0, 1])
axes[0, 1].set_title('\u2713 Sequential Colormap (viridis)\n(Perceptually uniform)', fontsize=11, color='green')

# 3. Fixed: Horizontal bar chart
sorted_idx = np.argsort(sizes)
axes[1, 0].barh(range(len(sizes)), np.array(sizes)[sorted_idx], color='steelblue', edgecolor='k')
axes[1, 0].set_yticks(range(len(sizes)))
axes[1, 0].set_yticklabels(np.array(pie_labels)[sorted_idx])
axes[1, 0].set_xlabel('Value')
axes[1, 0].set_title('\u2713 Sorted Bar Chart\n(Easy to compare)', fontsize=11, color='green')

# 4. Fixed: one variable per panel (only strength is shown here; cost would get its own subplot)
axes[1, 1].plot(x_dual, y1, 'b-o', linewidth=2, markersize=6)
axes[1, 1].set_xlabel('Sample')
axes[1, 1].set_ylabel('Strength (MPa)')
axes[1, 1].set_title('\u2713 Separate Subplot\n(No false correlation)', fontsize=11, color='green')
axes[1, 1].grid(True, alpha=0.3)

plt.suptitle('Corrected Versions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

> **Key Insight: Visualization Can Mislead — Intentionally or Accidentally**
> - **Truncated axes** exaggerate small differences
> - **Rainbow colormaps** have no perceptual ordering — use viridis, plasma, or inferno
> - **Pie charts** make comparison nearly impossible for more than 3-4 categories
> - **Dual y-axes** imply relationships that may not exist
>
> **Be honest with your plots.**
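
The distortion from a truncated axis can be quantified with Tufte's "lie factor": the size of the effect shown in the graphic divided by the size of the effect in the data. A small sketch using the lowest and highest bars from the truncated example (Mix A and Mix D, with the y-axis starting at 28 MPa); the variable names are chosen here for illustration.

```python
strengths = {"Mix A": 29.1, "Mix D": 31.8}  # MPa, lowest and highest bars above
axis_floor = 28.0                            # lower y-limit of the truncated chart

low, high = strengths["Mix A"], strengths["Mix D"]

# Relative difference in the data
effect_data = (high - low) / low
# Relative difference in drawn bar height when the axis starts at axis_floor
effect_drawn = ((high - axis_floor) - (low - axis_floor)) / (low - axis_floor)

lie_factor = effect_drawn / effect_data
print(f"difference in data : {effect_data:.1%}")
print(f"difference as drawn: {effect_drawn:.1%}")
print(f"lie factor         : {lie_factor:.1f}")
```

A difference of under 10% in the data is drawn as a bar more than three times as tall. With the axis starting at zero, the drawn and actual differences match and the lie factor is 1.
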

---
## 7. Reflection

### Before Making Any Plot, Ask Three Questions:

1. **What question am I answering?** — Every visualization should have a purpose
2. **Who is my audience?** — Fellow engineers? Clients? The public?
3. **What should they take away?** — State your conclusion in the title

### Connection to the Course

Week 4's lecture teaches you **HOW** to make plots with Matplotlib and Seaborn. This notebook teaches you **WHICH** plots to make and **WHY**.

> **Key Insight: The Best Visualization is the One That Answers the Question**
> Technical skill (knowing the API) is necessary but not sufficient. Design thinking (knowing what to show) is what turns data into insight.

---

### Questions?

**Dr. Eyuphan Koc**  
eyuphan.koc@bogazici.edu.tr