# LLMs for Synthetic Data I: Simulating Survey Respondents

**Learning objectives:**
- Understand silicon sampling and its application to survey research
- Implement demographic persona construction following Argyle et al. (2023)
- Generate synthetic survey responses and compare to real data
- Measure accuracy, invariance, and stereotyping
- Validate synthetic data against ground truth surveys
- Understand boundary conditions for when silicon sampling works (Bisbee et al. 2024)

**How to run this notebook:**
- **Google Colab** (recommended): Works for all parts
- **OpenAI API key needed**: For generating synthetic responses

---

## What is Silicon Sampling?

**Silicon sampling** is the use of large language models (LLMs) to simulate survey respondents by conditioning the model on demographic characteristics.

**The basic idea:**
1. Create a demographic "persona" (age, gender, education, etc.)
2. Prompt an LLM to respond as that persona would
3. Ask survey questions and collect responses
4. Aggregate across many personas to estimate population distributions

**Potential applications:**
- Rapid prototyping of survey instruments
- Exploring counterfactual scenarios
- Augmenting small samples
- Pre-testing research designs

**Key challenges:**
- **Algorithmic fidelity**: Do synthetic distributions match real ones?
- **Invariance**: Do personas with the same demographics give near-identical answers, understating real within-group variation?
- **Stereotyping**: Are between-group differences exaggerated?

---

## Setup

In [None]:
# Install packages
!pip install -q openai pandas numpy scipy scikit-learn matplotlib seaborn requests

In [None]:
import os
import json
import re
import getpass
import time
from datetime import datetime
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import cosine
from scipy.stats import entropy, spearmanr
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from openai import OpenAI

# Set API key
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

client = OpenAI()

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("✓ Setup complete!")

**What this code does:**

Sets up the environment for silicon sampling experiments:

**Key libraries:**
- **`openai`**: API access to GPT models
- **`pandas`**: Data manipulation and analysis
- **`scipy`**: Statistical measures (cosine distance, KL divergence via `entropy`, Spearman correlation)
- **`sklearn`**: Validation metrics (Cohen's kappa, confusion matrix)
- **`matplotlib/seaborn`**: Visualization

**Why these specific tools:**
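- **Cosine similarity**: Measure algorithmic fidelity (how similar are distributions?); note that `scipy.spatial.distance.cosine` returns the distance, so similarity is 1 minus that value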
- **KL divergence**: Another distance metric for distributions
- **Cohen's kappa**: Inter-rater reliability between synthetic and real
- **Spearman correlation**: Ordinal association between rankings
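
As a rough illustration of the first two metrics, the sketch below compares two response distributions on a 1-5 scale using only the standard library. The response lists are made up, and the helper names (`to_distribution`, `cosine_similarity`, `kl_divergence`) are hypothetical, not part of the notebook; the small smoothing constant is an assumption added so that KL divergence stays finite when a category is empty.

```python
import math
from collections import Counter

def to_distribution(responses, scale=(1, 5), smoothing=1e-9):
    """Turn raw Likert responses into a probability vector over the scale."""
    counts = Counter(responses)
    raw = [counts.get(v, 0) + smoothing for v in range(scale[0], scale[1] + 1)]
    total = sum(raw)
    return [x / total for x in raw]

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

def kl_divergence(p, q):
    """KL(p || q) in nats; 0 when the two distributions are identical."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Made-up responses, for illustration only
real_responses = [3, 4, 4, 5, 5, 5, 4, 3, 5, 4]
synthetic_responses = [4, 4, 4, 5, 4, 4, 5, 4, 4, 4]

p_real = to_distribution(real_responses)
p_syn = to_distribution(synthetic_responses)

print(f"Cosine similarity: {cosine_similarity(p_real, p_syn):.3f}")
print(f"KL(real || synthetic): {kl_divergence(p_real, p_syn):.3f}")
```

KL divergence is asymmetric and grows quickly when the synthetic sample never produces a response that real respondents give, which is one reason to report it alongside a bounded measure such as cosine similarity.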

**Security reminder:** Uses `getpass` for API keys - never hardcode them.

---

## Part 1: Creating Demographic Personas (Argyle et al. 2023)

The foundation of silicon sampling is constructing realistic demographic personas. Based on Argyle et al. (2023), personas should include:

- Age
- Gender
- Race/ethnicity
- Education level
- Income bracket
- Geographic region
- Political party (for political questions)

**What this code does:**

Implements the **persona construction** approach from Argyle et al. (2023):

**The `create_persona` function:**
- Takes a dictionary of demographic attributes
- Formats them into a natural language persona description
- Uses second person ("You are...") to prime the model

**Key demographic variables:**
- **Age**: Specific number (not range) for precision
- **Race/ethnicity**: Following U.S. Census categories
- **Gender**: Binary in original study (limitations noted)
- **Education**: Categorical levels (high school, some college, college degree, graduate)
- **Income**: Specific dollar amount or range
- **Region**: Geographic area (affects policy preferences)
- **Party**: Political affiliation (optional, task-dependent)

**Why this format:**
- Clear, unambiguous demographic information
- Mimics how humans think about identity
- Tested extensively in Argyle et al. (2023)

**Limitations:**
- Simplified categories (e.g., binary gender)
- May activate stereotypes in the model
- Assumes demographics determine opinions (not always true)
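
The cell that defines `create_persona` and `example_personas` is not shown in this section, so the following is only a minimal sketch consistent with the description above. The sentence wording is an assumption, not the prompt text from Argyle et al. (2023), and apart from the first persona (Democrat, 45, white male, as referenced later in the notebook) the example attribute values are hypothetical.

```python
def create_persona(demographics):
    """Format a dict of demographic attributes as a second-person persona."""
    parts = [
        f"You are a {demographics['age']}-year-old {demographics['race']} "
        f"{demographics['gender']} living in the {demographics['region']} "
        "region of the United States.",
        f"Your highest level of education: {demographics['education']}.",
        f"Your annual household income is about ${demographics['income']}.",
    ]
    # Party is optional because it only matters for political questions
    if demographics.get('party'):
        parts.append(f"Your political affiliation: {demographics['party']}.")
    return " ".join(parts)

# Hypothetical example personas (values chosen for illustration)
example_personas = [
    {'age': 45, 'race': 'white', 'gender': 'male', 'education': 'college degree',
     'income': '65,000', 'region': 'Northeast', 'party': 'Democrat'},
    {'age': 62, 'race': 'white', 'gender': 'female', 'education': 'high school',
     'income': '35,000', 'region': 'South', 'party': 'Republican'},
    {'age': 29, 'race': 'Hispanic', 'gender': 'female', 'education': 'some college',
     'income': '95,000', 'region': 'West', 'party': 'Independent'},
]

print(create_persona(example_personas[0]))
```

Keeping the persona as plain sentences makes it easy to drop or add attributes per task, for example omitting `party` when the survey question is itself about party identification.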

---

## Part 2: Generating Synthetic Survey Responses

Now we'll implement the core silicon sampling function to generate survey responses.

In [None]:
def silicon_sample_likert(demographics, question, scale=(1, 5), model="gpt-3.5-turbo", temperature=1.0):
    """
    Generate synthetic survey response for Likert scale question
    
    Args:
        demographics: dict of demographic attributes
        question: survey question text
        scale: tuple of (min, max) for Likert scale
        model: which OpenAI model to use
        temperature: sampling temperature (1.0 matches Argyle et al.)
    
    Returns:
        int or None: numeric response on scale, or None if parsing fails
    """
    persona = create_persona(demographics)
    
    prompt = f"""{question}

Please respond with only a number from {scale[0]} to {scale[1]}, where:
{scale[0]} = Strongly Disagree
{scale[1]} = Strongly Agree

Your response (number only):"""
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": persona},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature
        )
        
        # Extract numeric response
        content = response.choices[0].message.content.strip()
        # Extract the first integer in the reply (`re` is imported at the top of the notebook)
        numbers = re.findall(r'\d+', content)
        if numbers:
            value = int(numbers[0])
            # Validate in range
            if scale[0] <= value <= scale[1]:
                return value
        
        return None
        
    except Exception as e:
        print(f"Error: {e}")
        return None

# Test with example question
question = "The government should provide universal healthcare for all citizens."

print(f"Question: {question}\n")
print("=" * 70)
print("\nSynthetic responses:\n")

for i, demo in enumerate(example_personas, 1):
    response = silicon_sample_likert(demo, question)
    print(f"Persona {i} ({demo['party']}): {response}")
    time.sleep(0.3)  # Rate limiting

**What this code does:**

Implements the core **silicon sampling function** for Likert-scale questions:

**The `silicon_sample_likert` function workflow:**
1. Create persona from demographics
2. Format question with clear scale instructions
3. Send to LLM with persona as system message
4. Extract numeric response with error handling
5. Validate response is in valid range

**Key parameters:**
- **`temperature=1.0`**: Matches Argyle et al. (2023) default
  - Higher than annotation tasks (0.1-0.3)
  - Allows for within-group diversity
  - Still not as diverse as real humans
  - **Why 1.0 instead of lower values**: We want to simulate human variability, not find a single "correct" answer. Lower temperatures would reduce response diversity, making personas with similar demographics give more similar responses. Temperature=1.0 introduces sampling variability to better approximate real human diversity.
- **`model="gpt-3.5-turbo"`**: Argyle et al. (2023) used GPT-3; `gpt-3.5-turbo` is a related but later chat model, so results need not match theirs
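
To make the temperature argument concrete, here is a toy sketch of how temperature rescales a distribution over the five answer options. The logits are invented for illustration and are not outputs of any OpenAI model; the function names are hypothetical. It only shows the standard softmax-with-temperature effect: lower temperature concentrates probability on the top answer.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert scores to probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(probs):
    """Entropy in nats; higher means more diverse sampled responses."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical scores for Likert answers 1..5
toy_logits = [0.5, 1.0, 2.0, 3.5, 3.0]

for temperature in (0.2, 1.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    rounded = [round(p, 3) for p in probs]
    print(f"T={temperature}: probs={rounded}, entropy={shannon_entropy(probs):.3f}")
```

In this toy case the most likely answer is the same at both temperatures, but at the lower temperature almost all sampled responses would land on it, which is the invariance pattern the notebook tests for later.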

**Robust parsing:**
- Uses regex to extract first number from response
- Handles cases where model adds explanation
- Validates number is in valid range
- Returns `None` if parsing fails
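
The parsing rule can be isolated and tested without any API calls. The sketch below mirrors the logic in `silicon_sample_likert` (first integer in the reply, then a range check); the helper name `parse_likert` is hypothetical.

```python
import re

def parse_likert(text, scale=(1, 5)):
    """Return the first integer in `text` if it lies on the scale, else None."""
    match = re.search(r'\d+', text)
    if match is None:
        return None
    value = int(match.group())
    return value if scale[0] <= value <= scale[1] else None

# Replies of the kind a model might plausibly produce
examples = [
    "4",
    "I would say 4, because healthcare matters to me.",
    "Strongly agree",
    "10",
    "As a 45-year-old, I'd say 4",
]
for reply in examples:
    print(f"{reply!r} -> {parse_likert(reply)}")
```

The last example shows a limitation of the first-number rule: when the reply restates the persona's age before the answer, the parser reads the age, finds it out of range, and returns `None`. Such cases are counted as failed parses rather than silently misread, but they argue for keeping the "number only" instruction in the prompt.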

**Why system vs user message:**
- **System message**: Sets persistent context (persona)
- **User message**: Contains the specific question
- This separation helps model stay "in character"

**Cost consideration:** Each call costs ~$0.0005-0.001 with GPT-3.5-turbo, so 1000 responses ≈ $0.50-1.00

### Generating responses for multiple questions

In [None]:
# Define survey questions (simplified ANES-style)
questions = [
    "The government should provide universal healthcare for all citizens.",
    "We should increase spending on defense and military.",
    "Climate change is one of the most serious problems facing the country.",
    "Immigration levels should be decreased.",
    "The government should do more to regulate big corporations."
]

def collect_responses(demographics, questions, model="gpt-3.5-turbo"):
    """
    Collect responses for multiple questions from a single persona
    """
    responses = {}
    
    for i, question in enumerate(questions, 1):
        response = silicon_sample_likert(demographics, question, model=model)
        responses[f"Q{i}"] = response
        time.sleep(0.2)  # Rate limiting
    
    return responses

# Collect responses from example personas
results = []

for demo in example_personas:
    print(f"Collecting responses for: {demo['party']}, {demo['age']}, {demo['race']}...")
    responses = collect_responses(demo, questions)
    
    result = {**demo, **responses}
    results.append(result)

df_synthetic = pd.DataFrame(results)

print("\n✓ Synthetic responses collected\n")
print(df_synthetic[['party', 'age', 'race', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5']])

**What this code does:**

Implements **multi-question survey collection** from synthetic personas:

**The `collect_responses` function:**
- Takes single demographic profile and list of questions
- Collects response for each question sequentially
- Returns dictionary of question IDs → responses
- Includes rate limiting to avoid API throttling

**Question design:**
- Based on typical ANES (American National Election Studies) format
- Cover different policy domains (healthcare, defense, environment, immigration, economy)
- Clear, unambiguous phrasing
- Scaled as agreement (1-5)

**Why collect multiple questions:**
- Test consistency within persona
- Enable correlation analysis (do issues cluster as expected?)
- Compare to real survey patterns
- Detect stereotyping across domains

**Output format:**
- Pandas DataFrame with demographics + responses
- Easy to analyze, visualize, export
- Can merge with real survey data for comparison

**Note on None values:**
- If the LLM fails to return a valid numeric response, `silicon_sample_likert` returns `None`
- This can happen if the model provides explanation instead of just a number, or if parsing fails
- These None values will appear in the DataFrame and should be handled in analysis (e.g., filtered out or imputed)
- Always check for and report the rate of failed parses in your validation
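
One way to report the failed-parse rate is sketched below. It works on a list of result dictionaries shaped like the `results` list built above; the function name `failed_parse_rates` and the toy records are assumptions for illustration.

```python
def failed_parse_rates(records, question_keys):
    """Share of None responses per question across a list of result dicts."""
    if not records:
        return {}
    rates = {}
    for key in question_keys:
        n_failed = sum(rec.get(key) is None for rec in records)
        rates[key] = n_failed / len(records)
    return rates

# Toy records standing in for real output
toy_results = [
    {'party': 'Democrat', 'Q1': 5, 'Q2': 2},
    {'party': 'Republican', 'Q1': None, 'Q2': 4},
    {'party': 'Independent', 'Q1': 3, 'Q2': None},
    {'party': 'Democrat', 'Q1': 4, 'Q2': 1},
]

for question, rate in failed_parse_rates(toy_results, ['Q1', 'Q2']).items():
    print(f"{question}: {rate:.0%} failed parses")
```

With the DataFrame version of the data, `df_synthetic[['Q1', 'Q2']].isna().mean()` should give the same per-question shares.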

**Next steps:** Scale up to many personas and compare to real distributions

---

## Part 3: Validating Synthetic Data Against Real Surveys

**Validation** measures how well synthetic responses match real survey data.

Key metrics:
- **Proportion matching**: What percentage of synthetic responses match the modal real response?
- **Response variance**: Do synthetic responses show similar spread to real responses?
- **Distribution plots**: Visual comparison of response distributions

**⚠️ IMPORTANT: The "real" survey data used in this section is simulated for demonstration purposes only. In actual research, you MUST use authentic survey data from sources like ANES, GSS, Pew, or your own validated surveys.**

In [None]:
# Simulate real survey data for comparison
# (In practice, you'd use actual survey data like ANES or GSS)
np.random.seed(42)

# Create "real" data with realistic distributions
# Democrats favor Q1 (healthcare), Republicans favor Q2 (defense)
real_data = []

for party in ['Democrat', 'Republican', 'Independent']:
    n = 100
    
    if party == 'Democrat':
        q1 = np.random.choice([3, 4, 5], n, p=[0.2, 0.4, 0.4])  # Pro healthcare
        q2 = np.random.choice([1, 2, 3], n, p=[0.4, 0.4, 0.2])  # Anti defense spending
    elif party == 'Republican':
        q1 = np.random.choice([1, 2, 3], n, p=[0.4, 0.4, 0.2])  # Anti healthcare
        q2 = np.random.choice([3, 4, 5], n, p=[0.2, 0.4, 0.4])  # Pro defense
    else:  # Independent
        q1 = np.random.choice([2, 3, 4], n, p=[0.3, 0.4, 0.3])  # Moderate
        q2 = np.random.choice([2, 3, 4], n, p=[0.3, 0.4, 0.3])  # Moderate
    
    for i in range(n):
        real_data.append({
            'party': party,
            'Q1': q1[i],
            'Q2': q2[i]
        })

df_real = pd.DataFrame(real_data)

print("'Real' survey data (simulated):")
print(df_real.groupby('party')[['Q1', 'Q2']].mean().round(2))

In [None]:
# Calculate proportion matching and variance comparison

print("Validation Metrics\n")
print("=" * 70)

# For each party and question, compare synthetic to real
for party in ['Democrat', 'Republican']:
    print(f"\n{party} - Q1 (Healthcare):")
    print("-" * 70)
    
    # Get real responses
    real_responses = df_real[df_real['party'] == party]['Q1']
    real_mean = real_responses.mean()
    real_std = real_responses.std()
    real_mode = real_responses.mode()[0]
    
    print(f"  Real data:")
    print(f"    Mean: {real_mean:.2f}")
    print(f"    Std:  {real_std:.2f}")
    print(f"    Modal response: {real_mode}")
    
    # For synthetic (in practice, you'd have many synthetic responses per party)
    # Here we just show the structure
    print(f"\n  Synthetic data (need larger sample for comparison):")
    print(f"    [Generate 30+ responses per party]")
    print(f"    Then calculate:")
    print(f"      - Proportion matching modal response")
    print(f"      - Mean and std of synthetic responses")
    print(f"      - Compare std to real std")

print("\n" + "=" * 70)
print("\n**Interpretation:**")
print("- **Proportion matching**: Higher is better (> 50% good)")
print("- **Variance comparison**: Synthetic std should be close to real std")
print("  - If synthetic std << real std: Underestimating diversity (invariance)")
print("  - If synthetic std >> real std: Overestimating diversity")
print("- **Visual inspection**: Plot distributions side-by-side")

**What this code does:**

**Proportion matching:**
- Find the most common (modal) response in real data
- Calculate what % of synthetic responses match that mode
- **Example**: If real Democrats most commonly say "5" (strongly agree) to healthcare, what % of synthetic Democrats also say "5"?
- Higher matching indicates better alignment with real data

**Variance comparison:**
- Real humans have variance in responses even with same demographics
- Compare std deviation of synthetic vs real
- **Example**: 
  - Real Democrats on healthcare: mean=4.2, std=1.1
  - Synthetic Democrats: mean=4.3, std=0.4
  - **Problem**: Synthetic std too low → invariance issue
- Ideally, synthetic variance should be similar to real variance

**Why these metrics (rather than cosine similarity or KL divergence):**
- **More intuitive**: Easier to explain to non-technical audiences
- **Direct**: Measures actual agreement, not abstract distance
- **Actionable**: Clear when synthetic data fails validation
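
Since the validation cell above only prints the structure of the comparison, here is a minimal sketch of the two calculations themselves. The response lists are made up, and the helper names are hypothetical. `statistics.stdev` is the sample standard deviation, which matches the default of pandas' `.std()` used for the real data.

```python
import statistics
from collections import Counter

def modal_match_proportion(real, synthetic):
    """Share of synthetic responses equal to the most common real response."""
    # On ties, Counter keeps the value seen first; pandas' mode()[0] keeps the smallest
    mode = Counter(real).most_common(1)[0][0]
    return sum(s == mode for s in synthetic) / len(synthetic)

def std_ratio(real, synthetic):
    """Synthetic sample std divided by real sample std (1.0 = same spread)."""
    return statistics.stdev(synthetic) / statistics.stdev(real)

# Made-up responses for one party on one question
real = [5, 5, 5, 4, 4, 3, 2, 5, 4, 5]
synthetic = [5, 5, 4, 5, 5, 5, 4, 5, 5, 5]

print(f"Proportion matching modal response: {modal_match_proportion(real, synthetic):.2f}")
print(f"Std ratio (synthetic / real): {std_ratio(real, synthetic):.2f}")
```

A high matching proportion combined with a std ratio well below 1 is the pattern described above: the synthetic sample agrees with the typical respondent but understates how much real respondents differ from one another.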

**Visual comparison (next cell):**
- Plot distributions side-by-side
- See where synthetic over/under-represents responses
- Identify systematic biases

In [None]:
# Visualize variance comparison
# This would use actual synthetic data once generated

# Example visualization code:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Distribution comparison
ax = axes[0]
parties = ['Democrat', 'Republican']
x = np.arange(len(parties))
width = 0.35

# Real data means (from df_real)
real_means = [df_real[df_real['party'] == p]['Q1'].mean() for p in parties]
real_stds = [df_real[df_real['party'] == p]['Q1'].std() for p in parties]

# Synthetic would go here (placeholder)
synthetic_means = [4.1, 2.3]  # Placeholder - replace with actual synthetic data
synthetic_stds = [0.6, 0.5]   # Placeholder

ax.bar(x - width/2, real_means, width, label='Real', yerr=real_stds, capsize=5, alpha=0.7)
ax.bar(x + width/2, synthetic_means, width, label='Synthetic (placeholder)', yerr=synthetic_stds, capsize=5, alpha=0.7)
ax.set_ylabel('Mean Response (1-5)')
ax.set_title('Q1: Healthcare - Mean ± Std')
ax.set_xticks(x)
ax.set_xticklabels(parties)
# Draw the reference line before building the legend so its label is included
ax.axhline(y=3, color='gray', linestyle='--', alpha=0.3, label='Neutral')
ax.set_ylim(0, 6)
ax.legend()

# Plot 2: Variance comparison
ax = axes[1]
variance_data = {
    'Democrat': {'Real': real_stds[0], 'Synthetic': synthetic_stds[0]},
    'Republican': {'Real': real_stds[1], 'Synthetic': synthetic_stds[1]}
}

x = np.arange(len(parties))
real_vars = [variance_data[p]['Real'] for p in parties]
synth_vars = [variance_data[p]['Synthetic'] for p in parties]

ax.bar(x - width/2, real_vars, width, label='Real', alpha=0.7)
ax.bar(x + width/2, synth_vars, width, label='Synthetic (placeholder)', alpha=0.7)
ax.set_ylabel('Standard Deviation')
ax.set_title('Response Variance Comparison')
ax.set_xticks(x)
ax.set_xticklabels(parties)
# Draw the reference line before building the legend so its label is included
ax.axhline(y=1.0, color='red', linestyle='--', alpha=0.3, label='Target (human-level)')
ax.legend()

plt.tight_layout()
plt.show()

print("\n✓ Visualization shows:")
print("  Left: Mean responses with error bars (std)")
print("  Right: Direct variance comparison")
print("  ⚠ Replace placeholder synthetic data with actual generated responses")

### Generating larger synthetic sample for proper validation

In [None]:
# Generate larger synthetic sample (this will take a few minutes and cost ~$0.09-$0.50)
# Uncomment to run - skipping by default to save API costs

"""
# Create diverse demographic profiles by sampling each attribute at random

# Define demographic space
ages = [25, 35, 45, 55, 65]
races = ['white', 'Black', 'Hispanic']
genders = ['male', 'female']
educations = ['high school', 'college degree']
incomes = ['35,000', '65,000', '95,000']
regions = ['Northeast', 'South', 'Midwest', 'West']
parties = ['Democrat', 'Republican', 'Independent']

# Generate personas: the full grid has 5 x 3 x 2 x 2 x 3 x 4 x 3 = 2,160 combinations,
# so for demo purposes we draw a random subset instead
np.random.seed(42)

personas = []
for party in parties:
    for i in range(30):  # 30 per party = 90 total
        persona = {
            'age': np.random.choice(ages),
            'race': np.random.choice(races),
            'gender': np.random.choice(genders),
            'education': np.random.choice(educations),
            'income': np.random.choice(incomes),
            'region': np.random.choice(regions),
            'party': party
        }
        personas.append(persona)

# Collect responses
synthetic_results = []

for i, persona in enumerate(personas, 1):
    if i % 10 == 0:
        print(f"Progress: {i}/{len(personas)}")
    
    responses = collect_responses(persona, questions[:2])  # Just Q1, Q2 for speed
    result = {**persona, **responses}
    synthetic_results.append(result)

df_synthetic_large = pd.DataFrame(synthetic_results)
df_synthetic_large.to_csv('synthetic_survey_data.csv', index=False)

print("\n✓ Large synthetic sample collected and saved")
"""

print("[Skipped to save API costs - uncomment to run]")
print("This would generate 90 synthetic respondents")
print("Cost estimate: ~$0.09 for completions, up to $0.50 with prompt tokens depending on length")

**What this code does:**

Demonstrates how to generate a **large-scale synthetic sample** for validation:

**The demographic space:**
- **Full factorial**: All combinations of the attribute levels (5 × 3 × 2 × 2 × 3 × 4 × 3 = 2,160 personas)
- **Sampling strategy**: Draw each attribute at random instead of enumerating the full grid (far fewer API calls)
- **Stratification**: Equal samples per party (can weight later)
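
The difference between the full grid and the sampled subset can be checked directly. The sketch below uses the same attribute levels as the commented-out cell, with the standard library's `random` in place of NumPy; the function name `sample_personas` is hypothetical.

```python
import random
from itertools import product

demographic_space = {
    'age': [25, 35, 45, 55, 65],
    'race': ['white', 'Black', 'Hispanic'],
    'gender': ['male', 'female'],
    'education': ['high school', 'college degree'],
    'income': ['35,000', '65,000', '95,000'],
    'region': ['Northeast', 'South', 'Midwest', 'West'],
}
parties = ['Democrat', 'Republican', 'Independent']

# Full factorial grid: every combination of every attribute level
full_grid = list(product(*demographic_space.values(), parties))
print(f"Full grid size: {len(full_grid)}")

def sample_personas(space, parties, n_per_party, seed=42):
    """Draw n_per_party random personas for each party (stratified by party)."""
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    personas = []
    for party in parties:
        for _ in range(n_per_party):
            persona = {name: rng.choice(levels) for name, levels in space.items()}
            persona['party'] = party
            personas.append(persona)
    return personas

personas = sample_personas(demographic_space, parties, n_per_party=30)
print(f"Sampled personas: {len(personas)}")
```

Drawing attributes independently and uniformly does not reproduce the joint distribution of a real population, so estimates from such a sample would still need weighting before being compared with a representative survey.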

**Sample size considerations:**
- **Minimum**: 30-50 per group for distribution comparison
- **Good**: 100+ per group
- **Excellent**: 200+ per group (like Argyle et al.)

**Cost calculation:**
- 90 personas × 2 questions = 180 API calls
- ~$0.0005 per call with GPT-3.5-turbo
- Total: ~$0.09 at that rate, up to ~$0.50 depending on prompt length
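
The arithmetic can be written out as a small helper. The per-call figures are the rough estimates quoted earlier in this notebook, not current prices, and `estimate_cost` is a hypothetical name.

```python
def estimate_cost(n_personas, n_questions, cost_per_call):
    """Return (number of API calls, estimated total cost in dollars)."""
    n_calls = n_personas * n_questions
    return n_calls, n_calls * cost_per_call

# Rough per-call range quoted earlier in the notebook (assumed, not current pricing)
n_calls, low = estimate_cost(90, 2, 0.0005)
_, high = estimate_cost(90, 2, 0.001)

print(f"{n_calls} calls, roughly ${low:.2f} to ${high:.2f}")
```

Actual cost depends on the number of prompt and completion tokens per call and on the provider's current pricing, so it is worth checking the usage dashboard after a small pilot run.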

**Why commented out:**
- Saves API costs for students running the notebook
- Takes 5-10 minutes to run
- You can uncomment when ready to do real validation

**Best practices:**
- Save results to CSV after generation
- Don't regenerate unnecessarily
- Include random seed for reproducibility
- Log all parameters (model, temperature, timestamp)
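
A minimal sketch of the last practice: collect the run parameters in one dictionary and serialize it as JSON so it can be stored next to the CSV. The function name, the field names, and the output file name are all assumptions for illustration.

```python
import json
from datetime import datetime, timezone

def build_run_metadata(model, temperature, seed, n_personas, questions):
    """Collect the settings needed to describe or repeat a generation run."""
    return {
        'model': model,
        'temperature': temperature,
        'seed': seed,
        'n_personas': n_personas,
        'questions': list(questions),
        'timestamp_utc': datetime.now(timezone.utc).isoformat(),
    }

metadata = build_run_metadata(
    model="gpt-3.5-turbo",
    temperature=1.0,
    seed=42,
    n_personas=90,
    questions=[
        "The government should provide universal healthcare for all citizens.",
        "We should increase spending on defense and military.",
    ],
)
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)

# To persist it next to the CSV (hypothetical file name):
# with open('synthetic_survey_metadata.json', 'w') as f:
#     f.write(metadata_json)
```

Note that the random seed only fixes which personas are drawn; the model's sampled responses at temperature 1.0 will still differ between runs, which is another reason to save the generated data rather than regenerate it.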

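---

## Part 4: Diagnosing Invariance

**Invariance** occurs when personas with the same demographics give nearly identical answers, so synthetic data understates real within-group variation.

In [None]:
# Test invariance: multiple responses from same persona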
test_persona = example_personas[0]  # Democrat, 45, white male
test_question = questions[0]  # Healthcare

print(f"Testing invariance for persona: {test_persona['party']}, {test_persona['age']}, {test_persona['race']}")
print(f"Question: {test_question}\n")
print("=" * 70)

# Collect 10 responses from same persona
responses = []
for i in range(10):
    response = silicon_sample_likert(test_persona, test_question)
    responses.append(response)
    print(f"Response {i+1}: {response}")
    time.sleep(0.3)

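# Calculate variance after dropping failed parses
# (ddof=1 gives the sample std, matching pandas' .std() used for the real data)
responses = [r for r in responses if r is not None]
mean_response = np.mean(responses)
variance = np.var(responses, ddof=1)
std = np.std(responses, ddof=1)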

print(f"\nWithin-persona statistics:")
print(f"  Mean: {mean_response:.2f}")
print(f"  Std: {std:.2f}")
print(f"  Variance: {variance:.2f}")

# Compare to expected human variance
print(f"\nComparison:")
print(f"  LLM within-persona std: {std:.2f}")
print(f"  Typical human within-group std: ~1.0-1.5")
print(f"\nInterpretation: Lower std indicates less diversity than humans")

**What this code does:**

Tests for **invariance** by repeatedly querying the same persona:

**The invariance problem:**
- Real humans: Same demographics ≠ identical opinions
- LLMs: May produce very similar responses for identical personas
- This **underestimates real heterogeneity**

**Why invariance happens:**
- Temperature > 0 adds randomness, but not enough
- Models average over training data
- Stereotypical "typical" response for each demographic
- Missing individual-level factors (personality, experiences, etc.)

**What we measure:**
- **Mean**: Average response for persona
- **Std (standard deviation)**: Spread of responses
- **Variance**: Squared std

**Interpretation:**
- **Std < 0.5**: High invariance (responses very similar)
- **Std 0.5-1.0**: Moderate diversity (less than humans)
- **Std > 1.0**: Good diversity (approaching human levels)

**Note on thresholds with temperature=1.0:**
- With temperature=1.0, the LLM's response spread is typically larger than with temperature=0
- The thresholds above (< 0.5, 0.5-1.0, > 1.0) still apply because they are calibrated against human within-group std
- Even with temperature=1.0, LLMs often show a std around 0.3-0.8, still below the typical human std of 1.0-1.5
- The higher temperature helps but doesn't fully solve the invariance problem

**Typical observations:**
- LLMs: Std ~ 0.3-0.8 (varies by question and model)
- Humans: Std ~ 1.0-1.5 on same demographic
- **Implication**: LLMs underestimate within-group diversity
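
The interpretation bands above can be applied mechanically. The sketch below is one hedged way to do so; `classify_invariance` is a hypothetical helper, the response lists are made up, and the cut-offs are simply the notebook's own rule-of-thumb thresholds.

```python
import statistics

def classify_invariance(responses):
    """Return (sample std, label) using the notebook's rule-of-thumb bands."""
    valid = [r for r in responses if r is not None]  # drop failed parses
    if len(valid) < 2:
        return None, "not enough valid responses"
    std = statistics.stdev(valid)
    if std < 0.5:
        label = "high invariance"
    elif std <= 1.0:
        label = "moderate diversity"
    else:
        label = "good diversity"
    return std, label

# Made-up repeated responses from a single persona
samples = {
    'nearly constant': [4, 4, 4, 4, 5, 4, 4, 4, 4, 4],
    'some spread': [3, 4, 4, 5, 4, 3, 4, 5, 4, 4],
    'wide spread': [1, 3, 5, 2, 4, 5, 1, 3],
}
for name, responses in samples.items():
    std, label = classify_invariance(responses)
    print(f"{name}: std={std:.2f} -> {label}")
```

With only ten draws per persona the std estimate is itself noisy, so a label near a threshold should be read as indicative rather than conclusive.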

**Bisbee et al. (2024) finding:**
- Invariance worse for identity-salient questions
- Better for non-political factual questions
- Larger models (GPT-4) show slightly better diversity than GPT-3.5

**Solutions:**
- Higher temperature (but may reduce accuracy)
- Add personality traits to personas
- Fine-tune on diverse real responses
- Acknowledge limitation in reporting

---

## Part 5: Diagnosing Stereotyping

**Stereotyping** occurs when LLMs exaggerate between-group differences compared to real data.

**The problem:**
- LLMs trained on text that often contains stereotypes
- May amplify partisan/demographic differences
- Creates artificial polarization

**How to test:**
- Compare effect sizes for demographics in synthetic vs real data
- Look for exaggerated differences between groups
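
The analysis cell that follows reads from `df_stereotyping`, whose construction is not shown in this section. A minimal sketch of how such data could be collected is given here, assuming one row per synthetic response with `party` and `response` fields. The function name `collect_group_responses` is hypothetical, and a stand-in sampler replaces the API call so the example runs offline.

```python
def collect_group_responses(personas, question, sampler, n_repeats=5):
    """Query each persona n_repeats times and keep one record per valid response."""
    records = []
    for persona in personas:
        for _ in range(n_repeats):
            response = sampler(persona, question)
            if response is not None:  # skip failed parses
                records.append({'party': persona['party'], 'response': response})
    return records

def fake_sampler(persona, question):
    """Stand-in for an LLM call; returns a fixed answer per party."""
    return 5 if persona['party'] == 'Democrat' else 2

toy_personas = [{'party': 'Democrat'}, {'party': 'Republican'}]
question = "The government should provide universal healthcare for all citizens."

records = collect_group_responses(toy_personas, question, fake_sampler, n_repeats=3)
print(records)
```

In the notebook, passing `silicon_sample_likert` as the sampler and wrapping the result in `pd.DataFrame(records)` would presumably yield a frame with the columns the next cell expects.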

In [None]:
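# Calculate means and differences
# (assumes `df_stereotyping` has one row per synthetic response, with `party` and `response` columns)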
dem_mean = df_stereotyping[df_stereotyping['party'] == 'Democrat']['response'].mean()
rep_mean = df_stereotyping[df_stereotyping['party'] == 'Republican']['response'].mean()

print("\nBetween-group analysis (Healthcare question):")
print("=" * 70)
print(f"\nDemocrat mean: {dem_mean:.2f}")
print(f"Republican mean: {rep_mean:.2f}")
print(f"Difference: {abs(dem_mean - rep_mean):.2f}")

print(f"\nNote: Compare this difference to real survey data")
print(f"If synthetic difference >> real difference, this suggests stereotyping")

**What this code does:**

Tests for **stereotyping** by measuring between-group differences:

**What is stereotyping in silicon sampling:**
- LLMs may exaggerate differences between demographic groups
- E.g., making Democrats MORE pro-healthcare than real Democrats
- Or Republicans MORE anti-healthcare than real Republicans
- Creates artificial polarization

**Measuring group differences:**
- Calculate mean response for each group
- Calculate absolute difference
- Compare to real data differences

**How to detect stereotyping:**
1. Calculate group difference for synthetic data
2. Calculate group difference for real data
3. Compare: If synthetic difference >> real difference, stereotyping present

**Example:**
- Real data: Democrat vs Republican on healthcare, difference = 1.2
- Synthetic: difference = 2.4
- **Interpretation**: LLM is doubling the partisan divide
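
The comparison in this example can be expressed as a single ratio of group gaps. The sketch below uses made-up responses chosen to reproduce the gaps of 1.2 and 2.4; the helper names are hypothetical.

```python
import statistics

def mean_difference(responses_a, responses_b):
    """Absolute difference between two group means."""
    return abs(statistics.mean(responses_a) - statistics.mean(responses_b))

def stereotyping_ratio(real_a, real_b, synth_a, synth_b):
    """Synthetic group gap divided by real group gap (about 1.0 = no exaggeration)."""
    real_gap = mean_difference(real_a, real_b)
    if real_gap == 0:
        return None  # ratio is undefined when real groups do not differ
    return mean_difference(synth_a, synth_b) / real_gap

# Made-up responses: real gap 1.2, synthetic gap 2.4
real_dem, real_rep = [4, 4, 4, 4, 3], [3, 3, 3, 2, 2]
synth_dem, synth_rep = [5, 5, 5, 4, 4], [2, 2, 2, 2, 3]

ratio = stereotyping_ratio(real_dem, real_rep, synth_dem, synth_rep)
print(f"Real gap: {mean_difference(real_dem, real_rep):.1f}")
print(f"Synthetic gap: {mean_difference(synth_dem, synth_rep):.1f}")
print(f"Ratio: {ratio:.1f}")
```

A ratio well above 1 indicates exaggerated between-group differences, while a ratio below 1 would mean the synthetic data understates them.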

**Why stereotyping happens:**
- Training data contains exaggerated partisan rhetoric
- News articles emphasize differences
- Social media polarization in training data
- Models learn "prototypical" Democrat/Republican

**Bisbee et al. (2024) findings:**
- Stereotyping worse for:
  - Morally charged issues (abortion, guns)
  - Identity-salient topics (race, gender)
  - Minority subgroups (underrepresented in training)
- Better for:
  - Non-political topics
  - Aggregate estimates (averaging reduces bias)

**Solutions:**
- Compare differences to real data (essential validation step)
- Avoid using for identity-salient questions
- Consider weighting/calibrating to match real distributions

---

## Part 6: When Does Silicon Sampling Work? (Bisbee et al. 2024)

Based on Bisbee et al. (2024), silicon sampling has clear **boundary conditions**.

### When it works:
- ✓ High-consensus topics (e.g., basic civic knowledge)
- ✓ Non-identity-salient issues
- ✓ Aggregate-level estimates
- ✓ Larger models (GPT-4 > GPT-3.5)

### When it fails:
- ✗ Morally charged issues (abortion, immigration)
- ✗ Identity-salient topics (racial attitudes, gender policies)
- ✗ Minority subgroups (underrepresented in training)
- ✗ Individual-level predictions

### Appropriate use cases:
1. **Question development**: Test survey instruments before fielding
2. **Exploratory research**: Generate hypotheses to test with real data
3. **Education**: Teach survey methodology
4. **Augmentation**: Supplement small real samples (with caution)

### Inappropriate use cases:
1. ✗ Replacing representative surveys
2. ✗ Studying marginalized populations
3. ✗ Making substantive claims about "public opinion"
4. ✗ High-stakes decisions

---

## Summary

**What we learned:**
1. ✓ How to construct **demographic personas** following Argyle et al. (2023)
2. ✓ How to generate **synthetic survey responses** with LLMs
3. ✓ How to validate using **proportion matching** and **variance comparison**
4. ✓ How to diagnose **invariance** (within-group diversity)
5. ✓ How to diagnose **stereotyping** (exaggerated between-group differences)
6. ✓ **Boundary conditions** for when silicon sampling works (Bisbee et al. 2024)

**Key insights:**
- Silicon sampling can **approximate** population distributions for some questions
- But faces serious challenges: **invariance** and **stereotyping**
- Works best for **non-controversial, aggregate-level** estimates
- Fails for **identity-salient, morally charged** topics
- **Validation** against real data is essential

**Historical context:**
- Before LLMs: agent-based models (ABMs), synthetic populations, multilevel regression and poststratification (MRP), multiple imputation
- LLMs add: Flexible generation, natural language, few-shot learning
- But: Less principled uncertainty quantification than traditional methods

**Validation metrics:**

| Metric | What it measures | Target |
|--------|------------------|--------|
| Proportion matching | Agreement with modal response | > 50% |
| Variance ratio | Synthetic std / Real std | 0.7-1.3 |
| Within-group std | Invariance | > 1.0 |
| Mean difference ratio | Stereotyping | ≈ 1.0 |

**Best practices:**
1. Always validate against real survey data
2. Report all three challenges (accuracy, invariance, stereotyping)
3. Use for exploration, not substitution
4. Be transparent about limitations
5. Consider ethical implications

**Ethical considerations:**
- Risk of reproducing harmful stereotypes
- Collapsing diversity within demographic groups
- Claiming to represent voices not consulted
- Must disclose use of synthetic data

**Next steps:**
- **Week 8**: Interactive simulations and behavioral experiments with LLMs
- **Week 9**: Generative agents and multi-agent systems
- **Week 10**: Using LLMs to predict experimental outcomes

---
