# Social Pressure, Voter Turnout, and Logistic Regression

**Persuasion at Scale** | Week 4, Lecture 8 | 2026-02-16

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chrishwiggins/social-pressure-voting/blob/main/notebooks/social-pressure-logistic-regression.ipynb)

---

This notebook walks through the data from two assigned readings:

1. **Gerber, Green, and Larimer (2008)** "Social Pressure and Voter Turnout" (APSR)
2. **Coppock, Hill, and Vavreck (2020/2024)** on campaign persuasion effects

We will:
- Load the **actual experimental data** from GGL (2008): 344,084 individuals in five experimental groups (control plus four mailers)
- Reproduce the paper's main results
- Introduce **logistic regression** as the right tool for binary outcomes
- Build intuition visually before touching any math

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from scipy import stats

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 13

np.random.seed(42)

---

## Part 1: The Experiment

In August 2006, Gerber, Green, and Larimer ran a field experiment with **344,084 registered voters** in Michigan. Households were randomly assigned to one of five groups:

| Group | What they received | N |
|-------|-------------------|---|
| **Control** | Nothing | 191,243 |
| **Civic Duty** | "DO YOUR CIVIC DUTY - VOTE!" | 38,218 |
| **Hawthorne** | "YOU ARE BEING STUDIED" (told researchers would check if they voted) | 38,204 |
| **Self** | Showed the recipient's own past voting record | 38,218 |
| **Neighbors** | Showed the recipient AND their neighbors' past voting records | 38,201 |

The key outcome: **did they vote in the August 2006 primary?** (binary: yes/no)

Let's load the data.

In [None]:
# Download the actual replication data from ISPS Yale
url = 'http://hdl.handle.net/10079/d3669799-4537-411e-b175-d9e837324c35'
df = pd.read_csv(url)
print(f'Loaded {len(df):,} individual records')
print(f'Columns: {list(df.columns)}')
df.head()

In [None]:
# Clean up
df['treatment'] = df['treatment'].str.strip()
df['voted_binary'] = (df['voted'] == 'Yes').astype(int)
df['age'] = 2006 - df['yob']  # approximate age at time of election

# Past voting: convert yes/no to 1/0 for covariates
for col in ['g2000', 'g2002', 'g2004', 'p2000', 'p2002', 'p2004']:
    df[col + '_bin'] = (df[col].str.strip().str.lower() == 'yes').astype(int)

print(f'Age range: {df["age"].min()} to {df["age"].max()}')
print(f'Overall turnout: {df["voted_binary"].mean()*100:.1f}%')

### What does the data look like?

Each row is one registered voter. The key columns:
- `treatment`: which experimental group they were assigned to
- `voted`: did they vote in the August 2006 primary? (our **outcome**)
- `g2000`, `g2002`, `g2004`: voted in general elections in 2000, 2002, 2004?
- `p2000`, `p2002`, `p2004`: voted in primary elections in 2000, 2002, 2004?
- `yob`: year of birth
- `hh_id`, `hh_size`: household ID and size

In [None]:
# Treatment group sizes
group_order = ['Control', 'Civic Duty', 'Hawthorne', 'Self', 'Neighbors']
colors = ['#95a5a6', '#3498db', '#2ecc71', '#e67e22', '#e74c3c']

fig, ax = plt.subplots(figsize=(10, 5))
counts = [len(df[df['treatment'] == t]) for t in group_order]
bars = ax.bar(group_order, counts, color=colors, edgecolor='white', linewidth=1.5)

for bar, count in zip(bars, counts):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2000,
            f'{count:,}', ha='center', fontsize=12, fontweight='bold')

ax.set_ylabel('Number of Individuals')
ax.set_title('Sample Sizes by Treatment Group')
ax.set_ylim(0, 220000)
plt.tight_layout()
plt.show()

Notice the control group is about 5x larger than each treatment group. This is a deliberate design choice: it's cheap to not send mail, so the researchers allocated more people to control.

---

## Part 2: The Main Result (Difference in Means)

The simplest analysis of an RCT: compare the average outcome across groups.

Because treatment was **randomly assigned**, any difference in turnout between groups is a valid estimate of the **causal effect** of the mailer.

This is the **intent-to-treat (ITT) effect**: the effect of being *assigned* the treatment (not necessarily reading the mailer).
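
The difference in means also comes with a standard error. As a rough sketch, the block below applies the textbook formula for a difference of two independent proportions to the Control and Neighbors groups, using the group sizes from the table above and turnout rates of 29.7% and 37.8% (the control baseline plus the 8.1 pp effect quoted in this notebook). It treats individuals as independent, which ignores the household-level randomization, so the interval is an illustration and likely somewhat too narrow. The function name `diff_in_proportions` is made up for this example.

```python
import math

def diff_in_proportions(p_treat, n_treat, p_ctrl, n_ctrl, z=1.96):
    """Difference of two proportions with a normal-approximation 95% CI.

    Assumes independent observations (no clustering adjustment).
    """
    diff = p_treat - p_ctrl
    se = math.sqrt(p_treat * (1 - p_treat) / n_treat
                   + p_ctrl * (1 - p_ctrl) / n_ctrl)
    return diff, se, (diff - z * se, diff + z * se)

# Neighbors vs Control, rates taken from the figures quoted in this notebook
diff, se, ci = diff_in_proportions(0.378, 38201, 0.297, 191243)
print(f'ITT estimate: {diff * 100:+.1f} pp')
print(f'Std. error:   {se * 100:.2f} pp (ignores household clustering)')
print(f'95% CI:       [{ci[0] * 100:.1f}, {ci[1] * 100:.1f}] pp')
```

With samples this large the interval is narrow compared with the effect itself, which is why the group comparison is so clear-cut.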

In [None]:
# Reproduce Table 2 from the paper
control_rate = df[df['treatment'] == 'Control']['voted_binary'].mean()

print('=' * 60)
print('Replicating GGL (2008), Table 2')
print('=' * 60)
print(f'{"Group":15s} {"Turnout":>10s} {"ITT Effect":>12s} {"N":>10s}')
print('-' * 60)

for t in group_order:
    sub = df[df['treatment'] == t]
    rate = sub['voted_binary'].mean()
    effect = (rate - control_rate) * 100 if t != 'Control' else 0
    effect_str = f'{effect:+.1f} pp' if t != 'Control' else '(baseline)'
    print(f'{t:15s} {rate*100:9.1f}% {effect_str:>12s} {len(sub):>10,}')

print('-' * 60)
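# Compute the headline numbers from the data instead of hard-coding them
neighbors_rate = df[df['treatment'] == 'Neighbors']['voted_binary'].mean()
neighbors_effect = (neighbors_rate - control_rate) * 100
print(f'\nNeighbors effect: {neighbors_effect:+.1f} percentage points over control')
print(f'That is a {(neighbors_rate - control_rate) / control_rate * 100:.0f}% relative increase in turnout!')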

In [None]:
# Visualize: turnout by treatment group
fig, ax = plt.subplots(figsize=(10, 6))

rates = [df[df['treatment'] == t]['voted_binary'].mean() * 100 for t in group_order]
bars = ax.bar(group_order, rates, color=colors, edgecolor='white', linewidth=1.5)

# Add rate labels
for bar, rate in zip(bars, rates):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
            f'{rate:.1f}%', ha='center', fontsize=13, fontweight='bold')

# Add horizontal line at control rate
ax.axhline(y=rates[0], color='gray', linestyle='--', alpha=0.7, label=f'Control baseline ({rates[0]:.1f}%)')

ax.set_ylabel('Turnout Rate (%)')
ax.set_title('Voter Turnout by Treatment Group\nGerber, Green, and Larimer (2008)', fontsize=14)
ax.set_ylim(25, 42)
ax.legend(loc='upper left')
plt.tight_layout()
plt.show()

The "Neighbors" treatment, which showed recipients their neighbors' past voting records and promised to send an updated version after the election, produced the largest effect: **+8.1 percentage points** over the control group.

The treatments are ordered by social pressure intensity:
- Civic Duty (weakest): generic civic appeal
- Hawthorne: you're being watched
- Self: we know YOUR voting history
- Neighbors (strongest): we know your neighbors' history too, and we'll tell them yours

### Checking randomization

One advantage of the real data: we can verify that randomization "worked." If treatment was truly random, pre-treatment covariates should be balanced across groups.
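
One common way to put a number on "balanced" is the standardized mean difference (SMD): the gap between two group means divided by a pooled standard deviation. The sketch below uses simulated ages rather than the GGL data, and the helper name `standardized_mean_difference` is made up for this example. A frequently used rule of thumb treats an absolute SMD below about 0.1 as good balance.

```python
import math
import random
import statistics

def standardized_mean_difference(treated, control):
    """(mean_treated - mean_control) / pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

rng = random.Random(0)

# Simulated ages: both groups drawn from the same distribution,
# which is what random assignment produces in expectation
control_ages = [rng.gauss(50, 15) for _ in range(5000)]
treated_ages = [rng.gauss(50, 15) for _ in range(5000)]
smd_balanced = standardized_mean_difference(treated_ages, control_ages)

# For contrast: shift the treated group by 5 years to mimic imbalance
shifted_ages = [a + 5 for a in treated_ages]
smd_shifted = standardized_mean_difference(shifted_ages, control_ages)

print(f'SMD with random assignment: {smd_balanced:+.3f}')
print(f'SMD with a 5-year shift:    {smd_shifted:+.3f}')
```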

In [None]:
# Balance check: past voting history by treatment group
balance_vars = ['g2004_bin', 'g2002_bin', 'g2000_bin', 'p2004_bin', 'age']
balance_labels = ['Voted 2004 Gen', 'Voted 2002 Gen', 'Voted 2000 Gen',
                  'Voted 2004 Primary', 'Age']

print(f'{"Variable":20s}', end='')
for t in group_order:
    print(f'{t:>12s}', end='')
print()
print('-' * 80)

for var, label in zip(balance_vars, balance_labels):
    print(f'{label:20s}', end='')
    for t in group_order:
        val = df[df['treatment'] == t][var].mean()
        if var == 'age':
            print(f'{val:12.1f}', end='')
        else:
            print(f'{val*100:11.1f}%', end='')
    print()

print('\nVery similar across groups = randomization worked!')

In [None]:
# Visual balance check: age distribution by group
fig, axes = plt.subplots(1, 5, figsize=(18, 4), sharey=True)

for ax, t, c in zip(axes, group_order, colors):
    sub = df[df['treatment'] == t]
    ax.hist(sub['age'], bins=range(18, 100, 2), color=c, alpha=0.8, density=True)
    ax.set_title(t, fontsize=11)
    ax.set_xlabel('Age')
    ax.axvline(sub['age'].mean(), color='black', linestyle='--', linewidth=1)

axes[0].set_ylabel('Density')
fig.suptitle('Age Distributions Look Identical Across Groups (Randomization Worked)', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

---

## Part 3: Why Not Just Stop Here?

The difference-in-means analysis is perfectly valid for an RCT. So why bother with regression at all?

Three reasons:
1. **Precision**: Adding pre-treatment covariates can reduce noise and give tighter confidence intervals
2. **Subgroup effects**: We might want to know if the treatment works differently for different people
3. **The outcome is binary**: Turnout is either 0 or 1. Regular linear regression can predict values outside [0, 1]. Logistic regression respects the binary nature of the outcome.

Let's see why #3 matters.

### The problem with the linear probability model

A **linear probability model (LPM)** just runs ordinary regression on a binary outcome:

$$\text{voted}_i = \beta_0 + \beta_1 \cdot \text{treatment}_i + \varepsilon_i$$

This gives us an easy-to-interpret coefficient: $\beta_1$ is the change in probability of voting.

But it has a conceptual problem...

In [None]:
# Create a simple example to illustrate the problem
# Turnout vs age: older people vote more

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Take a random sample for plotting (full dataset is too dense)
sample = df.sample(3000, random_state=42)

# Left: scatter of actual data with linear fit
ax = axes[0]
jitter = np.random.uniform(-0.05, 0.05, len(sample))
ax.scatter(sample['age'], sample['voted_binary'] + jitter,
           alpha=0.1, s=10, color='steelblue')

# Linear fit
from numpy.polynomial.polynomial import polyfit
age_range = np.linspace(18, 95, 200)
b, m = polyfit(sample['age'], sample['voted_binary'], 1)
ax.plot(age_range, b + m * age_range, 'r-', linewidth=2.5, label='Linear fit')

ax.axhline(0, color='gray', linestyle=':', alpha=0.5)
ax.axhline(1, color='gray', linestyle=':', alpha=0.5)
ax.fill_between(age_range, -0.3, 0, alpha=0.1, color='red')
ax.fill_between(age_range, 1, 1.3, alpha=0.1, color='red')
ax.text(80, -0.15, 'Impossible!\n(probability < 0)', color='red', fontsize=10, ha='center')
ax.text(80, 1.1, 'Impossible!\n(probability > 1)', color='red', fontsize=10, ha='center')
ax.set_xlabel('Age')
ax.set_ylabel('Voted (0/1)')
ax.set_title('Linear Probability Model\n(can predict outside [0, 1])')
ax.set_ylim(-0.3, 1.3)
ax.legend()

# Right: the logistic (sigmoid) function
ax = axes[1]
z = np.linspace(-6, 6, 200)
sigmoid = 1 / (1 + np.exp(-z))

ax.plot(z, sigmoid, 'b-', linewidth=3)
ax.axhline(0, color='gray', linestyle=':', alpha=0.5)
ax.axhline(1, color='gray', linestyle=':', alpha=0.5)
ax.axhline(0.5, color='gray', linestyle='--', alpha=0.3)
ax.fill_between(z, 0, 1, alpha=0.05, color='green')
ax.text(0, 0.5, '  always between 0 and 1', color='green', fontsize=11, fontweight='bold')
ax.set_xlabel('Linear predictor (z)')
ax.set_ylabel('Predicted probability')
# Raw string for the LaTeX part so '\s' is not read as an (invalid) escape sequence
ax.set_title('Logistic (Sigmoid) Function\n' r'$\sigma(z) = 1 / (1 + e^{-z})$')
ax.set_ylim(-0.1, 1.1)

plt.tight_layout()
plt.show()

The left plot shows the problem: nothing constrains a straight line fit to binary (0/1) data, so extended far enough it predicts probabilities below 0 or above 1.

The right plot shows the solution: the **logistic function** (also called the sigmoid) squashes any input into the range (0, 1). It's an S-shaped curve that:
- Approaches 0 as the input goes to $-\infty$
- Approaches 1 as the input goes to $+\infty$
- Equals exactly 0.5 when the input is 0
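
These three properties are easy to check numerically. The sketch below is one possible implementation, not the one used by any particular library. It branches on the sign of the input so that `math.exp` is never called with a large positive argument, which would overflow.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) is below 1 and cannot overflow
    return ez / (1.0 + ez)

for z in (-1000, -6, -2, 0, 2, 6, 1000):
    print(f'sigmoid({z:>5}) = {sigmoid(z):.6f}')

# Symmetry: sigmoid(-z) = 1 - sigmoid(z)
print(sigmoid(-2.0) + sigmoid(2.0))
```

For extreme inputs the result rounds to exactly 0.0 or 1.0 in floating point, even though the mathematical function never reaches either value.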

---

## Part 4: What Logistic Regression Actually Does

Instead of modeling probability directly:

$$P(\text{voted}=1) = \beta_0 + \beta_1 x \quad \text{(linear, can go outside [0,1])}$$

Logistic regression models the **log-odds**:

$$\log\left(\frac{P(\text{voted}=1)}{1 - P(\text{voted}=1)}\right) = \beta_0 + \beta_1 x$$

Equivalently, the predicted probability is:

$$P(\text{voted}=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
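
The step from the log-odds form to the probability form is a few lines of algebra. Writing $p = P(\text{voted}=1)$ and $z = \beta_0 + \beta_1 x$ as shorthand (abbreviations introduced here for convenience):

$$\log\left(\frac{p}{1-p}\right) = z \;\Rightarrow\; \frac{p}{1-p} = e^{z} \;\Rightarrow\; p\,(1 + e^{z}) = e^{z} \;\Rightarrow\; p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}$$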

Let's build intuition for what "odds" and "log-odds" mean.

In [None]:
# Intuition: probability -> odds -> log-odds
probs = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
odds = probs / (1 - probs)
log_odds = np.log(odds)

fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))

# Probability scale
ax = axes[0]
ax.barh(range(len(probs)), probs, color='steelblue', height=0.6)
ax.set_yticks(range(len(probs)))
ax.set_yticklabels([f'{p:.0%}' for p in probs])
ax.set_xlabel('Probability')
ax.set_title('Probability\n(bounded 0 to 1)')
ax.set_xlim(0, 1)

# Odds scale
ax = axes[1]
ax.barh(range(len(probs)), odds, color='#e67e22', height=0.6)
ax.set_yticks(range(len(probs)))
ax.set_yticklabels([f'{o:.2f}' for o in odds])
ax.set_xlabel('Odds')
ax.set_title('Odds = p/(1-p)\n' r'(ranges from 0 to $\infty$)')

# Log-odds scale
ax = axes[2]
bar_colors = ['#e74c3c' if lo < 0 else '#2ecc71' for lo in log_odds]
ax.barh(range(len(probs)), log_odds, color=bar_colors, height=0.6)
ax.set_yticks(range(len(probs)))
ax.set_yticklabels([f'{lo:+.2f}' for lo in log_odds])
ax.set_xlabel('Log-odds')
ax.set_title('Log-odds = log(p/(1-p))\n' r'(unbounded: $-\infty$ to $+\infty$)')
ax.axvline(0, color='black', linewidth=1)

plt.tight_layout()
plt.show()

print('Key insight: log-odds is the scale where logistic regression is linear.')
print('Positive log-odds = more likely than not (p > 50%)')
print('Negative log-odds = less likely than not (p < 50%)')

**Why log-odds?** It's the transformation that maps probabilities (bounded between 0 and 1) to an unbounded scale ($-\infty$ to $+\infty$). On the log-odds scale, it makes sense to fit a linear model.

Think of it this way:
- A probability of 50% = odds of 1:1 = log-odds of 0
- A probability of 90% = odds of 9:1 = log-odds of +2.2
- A probability of 10% = odds of 1:9 = log-odds of -2.2

The symmetry is nice: 90% and 10% are equally far from 50%, just in opposite directions.
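
The conversions in this list take only a few lines of code. This is a small sketch with helper names chosen for this example (`prob_to_odds`, `prob_to_log_odds`, `log_odds_to_prob`). It assumes probabilities strictly between 0 and 1, since the odds are undefined at exactly 1 and the log-odds at exactly 0.

```python
import math

def prob_to_odds(p):
    """Odds = p / (1 - p), for 0 <= p < 1."""
    return p / (1 - p)

def prob_to_log_odds(p):
    """Log-odds (logit) = log(p / (1 - p)), for 0 < p < 1."""
    return math.log(p / (1 - p))

def log_odds_to_prob(z):
    """Inverse of the logit: maps any moderate real z back into (0, 1)."""
    return 1 / (1 + math.exp(-z))

for p in (0.1, 0.5, 0.9):
    print(f'p = {p:.0%}  odds = {prob_to_odds(p):.2f}  '
          f'log-odds = {prob_to_log_odds(p):+.2f}')

# Round trip: converting to log-odds and back recovers the probability
print(log_odds_to_prob(prob_to_log_odds(0.3)))
```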

---

## Part 5: Logistic Regression on the GGL Data

Now let's fit logistic regression to the real experimental data.

In [None]:
import statsmodels.formula.api as smf

# First: the simple linear probability model (OLS) for comparison
lpm = smf.ols('voted_binary ~ C(treatment, Treatment(reference="Control"))', data=df).fit()

print('=' * 65)
print('LINEAR PROBABILITY MODEL (OLS)')
print('Coefficients are changes in probability of voting')
print('=' * 65)
for name, coef in lpm.params.items():
    if 'Intercept' in name:
        print(f'  Control baseline:  {coef:.4f} ({coef*100:.1f}%)')
    else:
        # Extract treatment name
        tname = name.split('[T.')[1].rstrip(']')
        print(f'  {tname:15s}:  {coef:+.4f} ({coef*100:+.1f} pp)')

In [None]:
# Now: logistic regression
logit = smf.logit('voted_binary ~ C(treatment, Treatment(reference="Control"))', data=df).fit(disp=0)

print('=' * 65)
print('LOGISTIC REGRESSION')
print('Coefficients are changes in LOG-ODDS of voting')
print('=' * 65)
for name, coef in logit.params.items():
    if 'Intercept' in name:
        p = np.exp(coef) / (1 + np.exp(coef))
        print(f'  Control (intercept): {coef:.4f} (log-odds)  ->  {p*100:.1f}% (probability)')
    else:
        tname = name.split('[T.')[1].rstrip(']')
        or_val = np.exp(coef)
        print(f'  {tname:15s}: {coef:+.4f} (log-odds)  odds ratio: {or_val:.3f}')

print()
print('An odds ratio > 1 means the treatment INCREASES the odds of voting.')
print('For example, OR=1.45 means 45% higher odds of voting than control.')

In [None]:
# Compare LPM and Logistic side by side
treatment_names = ['Civic Duty', 'Hawthorne', 'Self', 'Neighbors']

# Get marginal effects from logistic (average predicted probability differences)
logit_margins = logit.get_margeff()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: LPM coefficients (= difference in means)
ax = axes[0]
lpm_effects = [lpm.params[f'C(treatment, Treatment(reference="Control"))[T.{t}]'] * 100
               for t in treatment_names]
lpm_ci = [lpm.conf_int().loc[f'C(treatment, Treatment(reference="Control"))[T.{t}]'] * 100
          for t in treatment_names]
lpm_errors = [(e[1] - e[0])/2 for e in lpm_ci]

ax.barh(treatment_names, lpm_effects, xerr=lpm_errors,
        color=colors[1:], edgecolor='white', linewidth=1.5, capsize=5)
ax.set_xlabel('Effect on Turnout (percentage points)')
ax.set_title('Linear Probability Model\n(coefficient = pp change)')
ax.axvline(0, color='black', linewidth=0.5)

# Right: Logistic regression (odds ratios)
ax = axes[1]
odds_ratios = [np.exp(logit.params[f'C(treatment, Treatment(reference="Control"))[T.{t}]'])
               for t in treatment_names]
logit_ci_vals = logit.conf_int()
or_lower = [np.exp(logit_ci_vals.loc[f'C(treatment, Treatment(reference="Control"))[T.{t}]', 0])
            for t in treatment_names]
or_upper = [np.exp(logit_ci_vals.loc[f'C(treatment, Treatment(reference="Control"))[T.{t}]', 1])
            for t in treatment_names]
or_errors = [[o - l for o, l in zip(odds_ratios, or_lower)],
             [u - o for o, u in zip(odds_ratios, or_upper)]]

ax.barh(treatment_names, odds_ratios, xerr=or_errors,
        color=colors[1:], edgecolor='white', linewidth=1.5, capsize=5)
ax.set_xlabel('Odds Ratio (vs Control)')
ax.set_title('Logistic Regression\n(odds ratio: >1 means higher turnout)')
ax.axvline(1, color='black', linewidth=0.5)

plt.tight_layout()
plt.show()

print('Both models tell the same story: more social pressure -> more turnout.')
print('The ranking is identical; only the scale of the coefficients differs.')

### Why are the numbers so similar?

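With treatment indicators as the only predictors, both models reproduce each group's observed turnout rate, so they imply the same percentage-point effects; only the coefficient scale differs (probability versus log-odds). More generally, when the baseline probability is near 30% and the effects are modest (a few percentage points), the linear probability model and logistic regression give very similar answers.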

The logistic model matters more when:
- Baseline rates are very high or very low (near 0% or 100%)
- You have continuous covariates (like age) where the linear model might predict outside [0, 1]
- You want predicted probabilities for new individuals
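
To see why the baseline matters, the sketch below applies one fixed odds ratio at several baseline turnout rates and converts the result back to percentage points. The odds ratio of 1.45 is the illustrative value used earlier in this notebook, the 5% and 95% baselines are hypothetical, and `apply_odds_ratio` is a helper name made up for this example.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability obtained after multiplying the baseline odds by odds_ratio."""
    baseline_odds = baseline_prob / (1 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)

# Same odds ratio, three different baselines
for p0 in (0.05, 0.297, 0.95):
    p1 = apply_odds_ratio(p0, 1.45)
    print(f'baseline {p0:6.1%} -> {p1:6.1%}  (change {100 * (p1 - p0):+.1f} pp)')
```

The same odds ratio corresponds to a change of several percentage points near a 30% baseline but a much smaller change near 5% or 95%. An effect that is constant on the log-odds scale is therefore not constant on the probability scale.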

---

## Part 6: Adding Covariates

In an RCT, adding pre-treatment covariates doesn't change the *bias* of our estimate (randomization already handles that), but it can improve **precision**.

Past voting behavior is the strongest predictor of future voting. Let's add it.

In [None]:
# Logistic regression with covariates
logit_full = smf.logit(
    'voted_binary ~ C(treatment, Treatment(reference="Control"))'
    ' + g2000_bin + g2002_bin + g2004_bin'
    ' + p2000_bin + p2002_bin + p2004_bin'
    ' + age + hh_size',
    data=df
).fit(disp=0)

print('=' * 65)
print('LOGISTIC REGRESSION WITH COVARIATES')
print('=' * 65)
print(logit_full.summary().tables[1])

In [None]:
# Compare confidence intervals: with and without covariates
fig, ax = plt.subplots(figsize=(10, 5))

y_positions = np.arange(len(treatment_names))
offset = 0.15

for i, t in enumerate(treatment_names):
    key = f'C(treatment, Treatment(reference="Control"))[T.{t}]'

    # Without covariates
    coef1 = logit.params[key]
    ci1 = logit.conf_int().loc[key]
    ax.plot([ci1[0], ci1[1]], [i + offset, i + offset], 'o-',
            color='steelblue', linewidth=2.5, markersize=6)

    # With covariates
    coef2 = logit_full.params[key]
    ci2 = logit_full.conf_int().loc[key]
    ax.plot([ci2[0], ci2[1]], [i - offset, i - offset], 's-',
            color='#e74c3c', linewidth=2.5, markersize=6)

ax.set_yticks(y_positions)
ax.set_yticklabels(treatment_names)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_xlabel('Log-odds Coefficient (with 95% CI)')
ax.set_title('Treatment Effects: With vs Without Covariates')

blue_patch = mpatches.Patch(color='steelblue', label='Treatment only')
red_patch = mpatches.Patch(color='#e74c3c', label='+ voting history, age, hh_size')
ax.legend(handles=[blue_patch, red_patch], loc='lower right')

plt.tight_layout()
plt.show()

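# Report the interval widths instead of assuming which set is narrower
ci_plain = logit.conf_int()
ci_full = logit_full.conf_int()
for t in treatment_names:
    key = f'C(treatment, Treatment(reference="Control"))[T.{t}]'
    w_plain = ci_plain.loc[key, 1] - ci_plain.loc[key, 0]
    w_full = ci_full.loc[key, 1] - ci_full.loc[key, 0]
    print(f'{t:12s} CI width: treatment only {w_plain:.4f} | with covariates {w_full:.4f}')

print('In a linear model, covariates that predict the outcome absorb noise and narrow the intervals.')
print('On the log-odds scale the adjusted coefficients are conditional on the covariates, so both the')
print('estimates and the interval widths can shift somewhat; randomization still rules out bias.')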

---

## Part 7: Predicted Probabilities

One of the nicest features of logistic regression: we can compute predicted probabilities for specific types of voters.

In [None]:
# Predicted turnout by age and treatment group
ages = np.arange(20, 90)

fig, ax = plt.subplots(figsize=(12, 6))

for t, c in zip(group_order, colors):
    # Create prediction data: median voter characteristics, varying age
    pred_df = pd.DataFrame({
        'treatment': t,
        'g2000_bin': 1,  # typical voter: voted in past generals
        'g2002_bin': 1,
        'g2004_bin': 1,
        'p2000_bin': 0,  # but not in past primaries
        'p2002_bin': 0,
        'p2004_bin': 0,
        'age': ages,
        'hh_size': 2
    })
    pred_probs = logit_full.predict(pred_df)
    lw = 3 if t in ['Control', 'Neighbors'] else 1.5
    ls = '-' if t in ['Control', 'Neighbors'] else '--'
    ax.plot(ages, pred_probs * 100, color=c, linewidth=lw, linestyle=ls, label=t)

ax.set_xlabel('Age', fontsize=13)
ax.set_ylabel('Predicted Turnout Probability (%)', fontsize=13)
ax.set_title('Predicted Turnout by Age and Treatment\n(for a voter who voted in past generals but not primaries)',
             fontsize=13)
ax.legend(loc='lower right', fontsize=11)
ax.set_ylim(0, 70)
plt.tight_layout()
plt.show()

print('These curves are segments of the S-shaped logistic function; over this age range they may look nearly straight.')
print('The treatment gap (Control vs Neighbors) is roughly constant across ages.')

---

## Part 8: The Bigger Picture (Coppock, Hill, Vavreck)

GGL found that social pressure mailers can increase turnout by up to 8.1 percentage points. That's a huge effect by the standards of political science.

How does this compare to **persuasion** effects in campaigns?

Coppock, Hill, and Vavreck (2020) analyzed **59 real-time randomized experiments** on political advertising. Their finding: the average persuasion effect is **tiny**, around 0.5 percentage points.

In their 2024 paper (with Hewitt et al.), they expanded this to **146 experiments from 51 campaigns**, finding similar results.

Let's visualize this comparison.

In [None]:
# Comparison: GGL turnout effects vs typical campaign persuasion effects
fig, ax = plt.subplots(figsize=(12, 6))

# GGL effects (from our data)
ggl_effects = {
    'GGL: Civic Duty': 1.8,
    'GGL: Hawthorne': 2.6,
    'GGL: Self': 4.9,
    'GGL: Neighbors': 8.1
}

# Coppock et al. typical effects (from their papers)
persuasion_effects = {
    'Typical TV ad\n(Coppock+ 2020)': 0.5,
    'Typical digital ad\n(Hewitt+ 2024)': 0.4,
    'Best campaign ad\n(95th percentile)': 1.5,
}

all_effects = {**persuasion_effects, **ggl_effects}
labels = list(all_effects.keys())
values = list(all_effects.values())
bar_colors = ['#9b59b6'] * len(persuasion_effects) + colors[1:]

bars = ax.barh(labels, values, color=bar_colors, edgecolor='white', linewidth=1.5, height=0.6)

for bar, val in zip(bars, values):
    ax.text(bar.get_width() + 0.15, bar.get_y() + bar.get_height()/2,
            f'{val:.1f} pp', va='center', fontsize=11, fontweight='bold')

ax.set_xlabel('Effect Size (percentage points)', fontsize=13)
ax.set_title('Social Pressure (GGL) vs Campaign Persuasion (Coppock+)\nEffect sizes from randomized experiments',
             fontsize=13)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_xlim(0, 10)

# Add dividing annotation
ax.axhline(2.5, color='gray', linestyle=':', alpha=0.5)
ax.text(9.5, 4.5, 'Social pressure\n(GGL 2008)', ha='right', fontsize=10,
        fontstyle='italic', color='gray')
ax.text(9.5, 0.8, 'Campaign persuasion\n(Coppock+ 2020, 2024)', ha='right', fontsize=10,
        fontstyle='italic', color='gray')

plt.tight_layout()
plt.show()

print('Key takeaway from Coppock, Hill, and Vavreck:')
print('Campaign persuasion effects are real but VERY small (about 0.5 pp).')
print('GGL\'s social pressure effects are roughly 4x to 16x larger than that 0.5 pp average.')
print('This suggests mobilization (getting people to show up) is easier than persuasion (changing minds).')

### Why the difference?

**Mobilization** (GGL): Getting people who *already agree with you* to actually show up and vote. Social pressure is a powerful lever because it exploits accountability to neighbors.

**Persuasion** (Coppock+): Changing people's minds about *which candidate to support*. Much harder. Most ads "preach to the choir" or are ignored.

This distinction, between mobilization and persuasion, is central to modern campaign strategy.

---

## Part 9: Logistic Regression Under the Hood

Let's build more intuition about what the logistic model is doing, using a continuous predictor.
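
As a sketch of what a fitting routine does internally, the block below estimates a one-predictor logistic regression by Newton's method, one standard way to maximize the likelihood. It is an illustration on simulated data, not a description of any library's internals. The "true" coefficients (-0.5 and 0.3), the age rescaling, and the function name `fit_logistic_1d` are all assumptions made for this example.

```python
import math
import random

def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic_1d(x, y, n_iter=25):
    """Fit P(y=1) = sigmoid(b0 + b1 * x) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of X'WX (negative Hessian)
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            resid = yi - p
            w = p * (1 - p)
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) * step = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated data: age in decades relative to 50, with hypothetical true coefficients
rng = random.Random(1)
ages = [rng.uniform(18, 90) for _ in range(5000)]
x = [(a - 50) / 10 for a in ages]
y = [1 if rng.random() < sigmoid(-0.5 + 0.3 * xi) else 0 for xi in x]

b0_hat, b1_hat = fit_logistic_1d(x, y)
print(f'Estimated intercept: {b0_hat:.3f} (simulated truth: -0.5)')
print(f'Estimated slope:     {b1_hat:.3f} (simulated truth:  0.3)')
```

Each iteration is a weighted least-squares-style update, with weights $p(1-p)$ that are largest where the model is most uncertain.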

In [None]:
# Logistic regression with age as a continuous predictor
logit_age = smf.logit('voted_binary ~ age', data=df).fit(disp=0)

print('Logistic regression: voted ~ age')
print(f'  Intercept: {logit_age.params["Intercept"]:.4f}')
print(f'  Age coef:  {logit_age.params["age"]:.4f}')
print(f'\nInterpretation: each additional year of age increases the log-odds')
print(f'of voting by {logit_age.params["age"]:.4f},')
print(f'or multiplies the odds by {np.exp(logit_age.params["age"]):.4f}')

In [None]:
# Visualize: empirical turnout rates by age vs logistic fit
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bin ages and compute empirical turnout rates
df['age_bin'] = pd.cut(df['age'], bins=range(18, 92, 2))
age_rates = df.groupby('age_bin', observed=False)['voted_binary'].agg(['mean', 'count'])
age_rates['midpoint'] = [interval.mid for interval in age_rates.index]
age_rates = age_rates[age_rates['count'] > 100]  # drop tiny bins

# Left: probability scale
ax = axes[0]
ax.scatter(age_rates['midpoint'], age_rates['mean'] * 100,
           s=age_rates['count']/50, alpha=0.6, color='steelblue',
           label='Empirical rate (size = N)')

# Logistic fit
age_pred = np.linspace(20, 90, 200)
pred_df_age = pd.DataFrame({'age': age_pred})
pred_probs_age = logit_age.predict(pred_df_age)
ax.plot(age_pred, pred_probs_age * 100, 'r-', linewidth=2.5, label='Logistic fit')

ax.set_xlabel('Age')
ax.set_ylabel('Turnout Rate (%)')
ax.set_title('Probability Scale')
ax.legend()

# Right: log-odds scale
ax = axes[1]
empirical_log_odds = np.log(age_rates['mean'] / (1 - age_rates['mean']))
ax.scatter(age_rates['midpoint'], empirical_log_odds,
           s=age_rates['count']/50, alpha=0.6, color='steelblue',
           label='Empirical log-odds')

# On log-odds scale, the fit is a straight line!
log_odds_pred = logit_age.params['Intercept'] + logit_age.params['age'] * age_pred
ax.plot(age_pred, log_odds_pred, 'r-', linewidth=2.5, label='Logistic fit (linear!)')

ax.set_xlabel('Age')
ax.set_ylabel('Log-odds of Voting')
ax.set_title('Log-odds Scale')
ax.legend()

plt.tight_layout()
plt.show()

print('Left: On the probability scale, the logistic fit is an S-curve.')
print('Right: On the log-odds scale, it is a straight line.')
print('This is the key insight: logistic regression is a linear model for the log-odds, fit by maximum likelihood.')

---

## Part 10: Practice Exercises

Try these on your own!

### Exercise 1: Past Primary Voting

People who voted in the 2004 primary (`p2004_bin`) are probably more likely to vote in the 2006 primary.

1. Compute the turnout rate for people who did vs didn't vote in the 2004 primary
2. Fit a logistic regression with `p2004_bin` as the only predictor
3. What is the odds ratio? Interpret it in plain language.

In [None]:
# YOUR CODE HERE

# Hint: 
# df.groupby('p2004_bin')['voted_binary'].mean()
# smf.logit('voted_binary ~ p2004_bin', data=df).fit(disp=0)


### Exercise 2: Interaction Effects

Does the Neighbors treatment work differently for people who voted in the 2004 general election vs those who didn't?

1. Create a subset of just the Control and Neighbors groups
2. Fit a logistic regression with an interaction: `voted_binary ~ neighbors * g2004_bin`
3. Is the interaction significant? What does it mean?

In [None]:
# YOUR CODE HERE

# Hint:
# subset = df[df['treatment'].isin(['Control', 'Neighbors'])].copy()
# subset['neighbors'] = (subset['treatment'] == 'Neighbors').astype(int)
# smf.logit('voted_binary ~ neighbors * g2004_bin', data=subset).fit(disp=0).summary()


### Exercise 3: Household Clustering

The GGL experiment randomized at the **household** level, not the individual level. People in the same household got the same treatment.

Why might this matter for our standard errors? (Think about whether two people in the same household are likely to have correlated voting behavior.)

For those who want to try: `statsmodels` supports clustered standard errors via the `cov_type='cluster'` option.
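
For a rough sense of how much clustering can matter, the sketch below uses the Kish design effect, $1 + (m - 1)\rho$, where $m$ is the cluster size and $\rho$ is the intra-cluster correlation of the outcome. It assumes equal-sized households and a single common correlation, and every number in it is hypothetical rather than estimated from the GGL data. The function names are made up for this example.

```python
import math

def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc

def clustered_se(naive_se, cluster_size, icc):
    """Inflate a naive standard error by the square root of the design effect."""
    return naive_se * math.sqrt(design_effect(cluster_size, icc))

naive_se = 0.0027  # hypothetical naive SE of a turnout difference
for icc in (0.0, 0.3, 0.6):
    adjusted = clustered_se(naive_se, cluster_size=2, icc=icc)
    print(f'rho = {icc:.1f}: design effect = {design_effect(2, icc):.2f}, '
          f'SE = {adjusted:.4f}')
```

When household members vote independently ($\rho = 0$) nothing changes. The more correlated their behavior, the more the naive standard error understates the uncertainty.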

In [None]:
# YOUR CODE HERE (optional / advanced)

# Hint for clustered SEs with OLS:
# lpm.get_robustcov_results(cov_type='cluster', groups=df['hh_id']).summary()


---

## Solutions

In [None]:
# Exercise 1 Solution
print('Turnout by 2004 primary voting:')
print(df.groupby('p2004_bin')['voted_binary'].mean())
print()

logit_p04 = smf.logit('voted_binary ~ p2004_bin', data=df).fit(disp=0)
print(f'Log-odds coefficient: {logit_p04.params["p2004_bin"]:.4f}')
print(f'Odds ratio: {np.exp(logit_p04.params["p2004_bin"]):.2f}')
print()
print('Interpretation: People who voted in the 2004 primary had')
print(f'{np.exp(logit_p04.params["p2004_bin"]):.1f}x the odds of voting in the 2006 primary.')

In [None]:
# Exercise 2 Solution
subset = df[df['treatment'].isin(['Control', 'Neighbors'])].copy()
subset['neighbors'] = (subset['treatment'] == 'Neighbors').astype(int)

logit_interact = smf.logit('voted_binary ~ neighbors * g2004_bin', data=subset).fit(disp=0)
print(logit_interact.summary().tables[1])
print()
print('The interaction term tells us whether the Neighbors effect differs')
print('for 2004 general election voters vs non-voters.')
print('A negative interaction would mean the treatment is LESS effective for past voters')
print('(ceiling effect: they were already likely to vote).')

---

## Summary

**What we learned:**

1. **GGL (2008)** ran a massive field experiment showing social pressure increases voter turnout, with the Neighbors treatment producing an 8.1 pp effect

2. **Simple difference-in-means** is the right starting point for RCT analysis

3. **Logistic regression** is a tool for binary outcomes that:
   - Keeps predictions in [0, 1]
   - Models log-odds as a linear function
   - Reports odds ratios (how much the odds multiply)
   - Produces S-shaped predicted probability curves

4. **Adding pre-treatment covariates** to an RCT is about precision, not bias: randomization already handles bias

5. **Coppock, Hill, and Vavreck (2020/2024)** show that campaign *persuasion* effects are much smaller (~0.5 pp) than GGL's *mobilization* effects

**Connection to the course:** The methods in this notebook (randomized experiments, regression, logistic regression) are the foundations of measuring whether persuasion works. The GGL experiment is a clean example of how randomization lets us draw causal conclusions.