# Westwood et al. (2022) Replication - Part 1: Data Loading

This notebook loads the survey data from Google Drive for the replication of:

> Westwood, S. J., Grimmer, J., Tyler, M., & Nall, C. (2022).
> "Current research overstates American support for political violence."
> *PNAS*, 119(12), e2116870119.

**Learning Objectives:**
1. Understand how to load data from Google Drive in Colab
2. Explore survey data structure
3. Understand the experimental design (vignettes + engagement checks)

**Data Sources:**
- Harvard Dataverse: doi:10.7910/DVN/ZEHO8E
- Original data collected via Qualtrics and YouGov surveys in 2021

## Step 1: Install and Import Packages

In [None]:
# Install gdown for Google Drive downloads (preinstalled in Colab, so this is usually a no-op)
!pip install -q gdown

import pandas as pd
import numpy as np
import gdown

## Step 2: Define Google Drive URLs

These URLs point to the publicly shared data files on Google Drive.
Each one follows the gdown-compatible format: `https://drive.google.com/uc?id=FILE_ID`
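
If you start from an ordinary Drive share link rather than a `uc?id=` URL, the file ID has to be pulled out first. The sketch below shows one way to do that. `to_gdown_url` is a hypothetical helper that is not part of the original notebook, and it assumes the two common link shapes (`/file/d/FILE_ID/...` and `?id=FILE_ID`).

```python
import re
from typing import Optional

# Hypothetical helper (not part of the original notebook): extract the file ID
# from common Google Drive link shapes and build a uc?id= URL from it.
_ID_PATTERNS = [
    re.compile(r"/file/d/([A-Za-z0-9_-]+)"),  # .../file/d/FILE_ID/view
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),   # ...?id=FILE_ID
]


def to_gdown_url(link: str) -> Optional[str]:
    """Return a gdown-compatible URL, or None if no file ID is found."""
    for pattern in _ID_PATTERNS:
        match = pattern.search(link)
        if match:
            return f"https://drive.google.com/uc?id={match.group(1)}"
    return None


# Placeholder ID, for illustration only
print(to_gdown_url("https://drive.google.com/file/d/FILE_ID_123/view?usp=sharing"))
```

Links that are already in `uc?id=` form pass through unchanged, so the helper is safe to apply to every entry.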

In [None]:
DATA_URLS = {
    # Studies 1 & 4: Qualtrics panel, January 2021
    # Study 1: Car-ramming vignette (violence against protesters)
    # Study 4: Sentencing task (proposed prison sentences)
    "study14": "https://drive.google.com/uc?id=1gKIY11FaM5RmhhXTKx3wVcwGkMoTyTUM",

    # Studies 2 & 5: Qualtrics panel, April 2021
    # Study 2: Shooting vignette (violence at political rally)
    # Study 5: Incentive study (in appendix)
    "study25": "https://drive.google.com/uc?id=1VfZM3hSDzIIIVp2AUGC-RwOy-Fk2t_Fm",

    # Study 3: YouGov nationally representative sample, November 2021
    # Same shooting vignette as Study 2, but with survey weights
    "study3": "https://drive.google.com/uc?id=1OYlDc-TgzqNa9iFRgcUa1XRLH-uQHomO",

    # Prior estimates: Kalmoe-Mason derived percentages from media coverage
    "priorestimates": "https://drive.google.com/uc?id=1__z-IhvnRPgRqkyfG7rZlss7cIcR_kyn",

    # News coverage data (2016-2021)
    "newsCoverage": "https://drive.google.com/uc?id=1N4PSN87o687kwIcRBz80MBMJEgOz3yun",
}

## Step 3: Helper Function to Load Data

In [None]:
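from typing import Optional


def load_from_gdrive(name: str, url: str) -> Optional[pd.DataFrame]:
    """
    Download a CSV file from Google Drive and load it as a DataFrame.

    Parameters:
    -----------
    name : str
        A descriptive name for the dataset (used in messages and the temp filename)
    url : str
        The gdown-compatible Google Drive URL

    Returns:
    --------
    pd.DataFrame or None
        The loaded dataset, or None if the download or parsing failed
    """
    print(f"Loading {name}...")
    output_file = f"/tmp/{name}.csv"

    try:
        # Some gdown versions return None instead of raising when a download
        # fails; treat that as an error so a stale /tmp file is never read.
        downloaded = gdown.download(url, output_file, quiet=True)
        if downloaded is None:
            raise RuntimeError("download failed (check the file's sharing settings)")
        df = pd.read_csv(output_file)
        print(f"  Loaded {len(df):,} rows, {len(df.columns)} columns")
        return df
    except Exception as e:
        # Report the failure and return None so the remaining datasets still load
        print(f"  ERROR: Could not load {name}: {e}")
        return None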

## Step 4: Load All Datasets

In [None]:
def load_all_data():
    """Load all datasets needed for the replication."""
    data = {}
    for name, url in DATA_URLS.items():
        data[name] = load_from_gdrive(name, url)
    return data

# Load the data
data = load_all_data()
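
Because `load_from_gdrive` returns `None` on failure rather than raising, it can help to check which datasets are missing before moving on. A minimal sketch follows. `missing_datasets` is a hypothetical helper that is not part of the original notebook, and the example dictionary uses stand-in values instead of real DataFrames.

```python
from typing import Dict, List, Optional


def missing_datasets(data: Dict[str, Optional[object]]) -> List[str]:
    """Return the names of datasets whose load returned None."""
    return [name for name, df in data.items() if df is None]


# Stand-in values for illustration; in the notebook these would be DataFrames
example = {"study14": object(), "study3": None}
print(missing_datasets(example))
```

In the notebook you would call `missing_datasets(data)` and re-run the load for any names it reports.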

## Step 5: Explore Data Structure

In [None]:
def explore_dataset(df: pd.DataFrame, name: str):
    """Print summary statistics for a dataset."""
    print(f"\n{'='*60}")
    print(f"DATASET: {name}")
    print(f"{'='*60}")
    print(f"\nShape: {df.shape[0]:,} rows x {df.shape[1]} columns")
    cols = list(df.columns)
    suffix = "..." if len(cols) > 10 else ""  # only signal truncation when it happens
    print(f"\nColumn names (first 10): {cols[:10]}{suffix}")
    print("\nFirst 3 rows:")
    display(df.head(3))

# Explore each dataset
for name, df in data.items():
    if df is not None:
        explore_dataset(df, name)
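
For a side-by-side view of all datasets, the per-dataset printout can be condensed into a single overview table. The sketch below assumes the same `name -> DataFrame or None` dictionary used above. `summarize_datasets` is a hypothetical helper, and the toy frames contain made-up values only to show the output shape.

```python
import pandas as pd


def summarize_datasets(data):
    """Build a one-row-per-dataset overview: rows, columns, missing cells."""
    records = []
    for name, df in data.items():
        if df is None:  # skip datasets that failed to load
            continue
        records.append({
            "dataset": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "missing_cells": int(df.isna().sum().sum()),
        })
    return pd.DataFrame(
        records, columns=["dataset", "rows", "columns", "missing_cells"]
    )


# Toy frames with made-up values, only to illustrate the output shape
toy = {
    "a": pd.DataFrame({"x": [1, 2, None], "y": ["p", "q", "r"]}),
    "b": None,
}
print(summarize_datasets(toy))
```

Survey exports often contain many empty cells from skipped or branched questions, so a high `missing_cells` count is not by itself a sign of a failed load.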

## Summary

We loaded 5 datasets:
- **study14**: Studies 1 & 4 (Qualtrics, n=3,361 raw)
- **study25**: Studies 2 & 5 (Qualtrics, n=4,585 raw)
- **study3**: Study 3 (YouGov, n=1,863)
- **priorestimates**: Kalmoe-Mason derived percentages from media coverage
- **newsCoverage**: News coverage data

**Next:** Run notebook 02 for the core analysis (the engagement effect).