# LLMs for Synthetic Data I: Simulating Survey Respondents

**Learning objectives:**
- Understand silicon sampling and its application to survey research
- Implement demographic persona construction
- Generate synthetic survey responses and compare to real data
- Measure algorithmic fidelity, invariance, and stereotyping
- Validate synthetic data against ground truth surveys

**How to run this notebook:**
- **Google Colab** (recommended): Works for all parts
- **OpenAI API key needed**: For generating synthetic responses
- **SubPOP dataset**: Available at github.com/JosephJeesungSuh/subpop

---

## What is Silicon Sampling?

**Silicon sampling** is the use of large language models (LLMs) to simulate survey respondents by conditioning the model on demographic characteristics.

**The basic idea:**
1. Create a demographic "persona" (age, gender, education, etc.)
2. Prompt an LLM to respond as that persona would
3. Ask survey questions and collect responses
4. Aggregate across many personas to estimate population distributions

**Potential applications:**
- Rapid prototyping of survey instruments
- Exploring counterfactual scenarios
- Augmenting small samples
- Pre-testing research designs

**Key challenges:**
- **Algorithmic fidelity**: Do synthetic distributions match real ones?
- **Invariance**: Do personas with the same demographics give near-identical answers, understating real within-group diversity?
- **Stereotyping**: Are between-group differences exaggerated?

---

## Setup

In [None]:
# Install packages
!pip install -q openai pandas numpy scipy scikit-learn matplotlib seaborn requests

In [1]:
import os
import json
import getpass
import time
from datetime import datetime
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import cosine
from scipy.stats import entropy, spearmanr
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from openai import OpenAI

# Set API key
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

client = OpenAI()

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("✓ Setup complete!")

Enter your OpenAI API key: ··········
✓ Setup complete!


**What this code does:**

Sets up the environment for silicon sampling experiments:

**Key libraries:**
- **`openai`**: API access to GPT models
- **`pandas`**: Data manipulation and analysis
- **`scipy`**: Statistical measures (cosine distance, entropy/KL divergence, Spearman correlation)
- **`sklearn`**: Validation metrics (Cohen's kappa, confusion matrix)
- **`matplotlib/seaborn`**: Visualization

**Why these specific tools:**
- **Cosine similarity**: Measure algorithmic fidelity (how similar are distributions?)
- **KL divergence**: Another distance metric for distributions
- **Cohen's kappa**: Inter-rater reliability between synthetic and real
- **Spearman correlation**: Ordinal association between rankings
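
Cohen's kappa and Spearman correlation are listed here but not exercised elsewhere in the notebook. The sketch below shows one way they might be applied when each synthetic respondent is paired with one real respondent. The `real` and `synthetic` lists are made-up illustrative values, not survey data, and the choice of quadratic weights is an assumption suited to ordinal scales.

```python
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired 1-5 Likert answers (illustrative values only)
real = [5, 4, 4, 2, 1, 3, 5, 2]
synthetic = [5, 5, 4, 2, 2, 3, 4, 2]

# Quadratic weights penalise a 1-vs-5 disagreement more than a 4-vs-5 one
kappa = cohen_kappa_score(
    real, synthetic, labels=[1, 2, 3, 4, 5], weights="quadratic"
)

# Spearman's rho measures agreement between the rank orderings
rho, p_value = spearmanr(real, synthetic)

print(f"Weighted kappa: {kappa:.3f}")
print(f"Spearman rho:   {rho:.3f} (p = {p_value:.3f})")
```

Both statistics need paired observations, so they fit persona-level matching better than the group-level distribution comparisons used later.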

**Security reminder:** Uses `getpass` for API keys - never hardcode them.

---

## Part 1: Creating Demographic Personas

The foundation of silicon sampling is constructing realistic demographic personas. Personas should include:

- Age
- Gender
- Race/ethnicity
- Education level
- Income bracket
- Geographic region
- Political party (for political questions)

In [2]:
def create_persona(demographics):
    """
    Create persona string for silicon sampling

    Args:
        demographics: dict with keys: age, race, gender, education, income, region

    Returns:
        str: Formatted persona prompt
    """
    persona = f"""You are a {demographics['age']}-year-old {demographics['race']} {demographics['gender']} from {demographics['region']} with {demographics['education']} education and an income of ${demographics['income']}."""

    # Add political affiliation if present
    if 'party' in demographics:
        persona += f" You identify as a {demographics['party']}."

    return persona

# Example personas
example_personas = [
    {
        'age': 45,
        'race': 'white',
        'gender': 'male',
        'education': 'college degree',
        'income': '75,000',
        'region': 'Midwest',
        'party': 'Democrat'
    },
    {
        'age': 29,
        'race': 'Black',
        'gender': 'female',
        'education': 'high school',
        'income': '35,000',
        'region': 'South',
        'party': 'Independent'
    },
    {
        'age': 67,
        'race': 'Hispanic',
        'gender': 'male',
        'education': 'some college',
        'income': '55,000',
        'region': 'West',
        'party': 'Republican'
    }
]

print("Example personas:\n")
print("=" * 70)
for i, demo in enumerate(example_personas, 1):
    print(f"\nPersona {i}:")
    print(create_persona(demo))

Example personas:


Persona 1:
You are a 45-year-old white male from Midwest with college degree education and an income of $75,000. You identify as a Democrat.

Persona 2:
You are a 29-year-old Black female from South with high school education and an income of $35,000. You identify as a Independent.

Persona 3:
You are a 67-year-old Hispanic male from West with some college education and an income of $55,000. You identify as a Republican.


**What this code does:**

Implements a simplified version of the **persona construction** approach from Argyle et al. (2023):

**The `create_persona` function:**
- Takes a dictionary of demographic attributes
- Formats them into a natural language persona description
- Uses second person ("You are...") to prime the model

**Key demographic variables:**
- **Age**: Specific number (not range) for precision
- **Race/ethnicity**: Following U.S. Census categories
- **Gender**: Binary in original study (limitations noted)
- **Education**: Categorical levels (high school, some college, college degree, graduate)
- **Income**: Specific dollar amount or range
- **Region**: Geographic area (affects policy preferences)
- **Party**: Political affiliation (optional, task-dependent)

**Why this format:**
- Clear, unambiguous demographic information
- Mimics how humans think about identity
- Tested in silicon sampling research

**Limitations:**
- Simplified categories (e.g., binary gender)
- May activate stereotypes in the model
- Assumes demographics determine opinions (not always true)
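
The example output above has small grammatical glitches ("from Midwest", "a Independent") that could plausibly affect how the model reads the persona. A minimal sketch of a variant that picks articles more carefully follows. The function names `article_for`, `age_article`, and `create_persona_grammatical` are hypothetical, and the "from the ..." phrasing assumes the region is a U.S. region name such as "South" or "Midwest".

```python
def article_for(word):
    """Return 'an' before a leading vowel letter (rough heuristic), else 'a'."""
    return "an" if word and word[0].lower() in "aeiou" else "a"


def age_article(age):
    """Return 'an' for ages read with a vowel sound (8, 11, 18, 80-89)."""
    text = str(age)
    return "an" if text.startswith("8") or text in ("11", "18") else "a"


def create_persona_grammatical(demographics):
    """Hypothetical variant of create_persona with article handling."""
    age = demographics["age"]
    persona = (
        f"You are {age_article(age)} {age}-year-old {demographics['race']} "
        f"{demographics['gender']} from the {demographics['region']} with "
        f"{demographics['education']} education and an income of "
        f"${demographics['income']}."
    )
    party = demographics.get("party")
    if party:
        persona += f" You identify as {article_for(party)} {party}."
    return persona


example = {
    "age": 29, "race": "Black", "gender": "female",
    "education": "high school", "income": "35,000",
    "region": "South", "party": "Independent",
}
print(create_persona_grammatical(example))
```

Whether such wording changes shift the model's answers is an empirical question; if you switch formats, rerun the validation rather than assuming equivalence.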

---

## Part 2: Generating Synthetic Survey Responses

Now we'll implement the core silicon sampling function to generate survey responses.

In [3]:
def silicon_sample_likert(demographics, question, scale=(1, 5), model="gpt-3.5-turbo", temperature=1.0):
    """
    Generate synthetic survey response for Likert scale question

    Args:
        demographics: dict of demographic attributes
        question: survey question text
        scale: tuple of (min, max) for Likert scale
        model: which OpenAI model to use
        temperature: sampling temperature (1.0 is default)

    Returns:
        int or None: numeric response on scale, or None if parsing fails
    """
    persona = create_persona(demographics)

    prompt = f"""{question}

Please respond with only a number from {scale[0]} to {scale[1]}, where:
{scale[0]} = Strongly Disagree
{scale[1]} = Strongly Agree

Your response (number only):"""

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": persona},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature
        )

        # Extract numeric response
        content = response.choices[0].message.content.strip()
        # Try to extract first number
        import re
        numbers = re.findall(r'\d+', content)
        if numbers:
            value = int(numbers[0])
            # Validate in range
            if scale[0] <= value <= scale[1]:
                return value

        return None

    except Exception as e:
        print(f"Error: {e}")
        return None

# Test with example question
question = "The government should provide universal healthcare for all citizens."

print(f"Question: {question}\n")
print("=" * 70)
print("\nSynthetic responses:\n")

for i, demo in enumerate(example_personas, 1):
    response = silicon_sample_likert(demo, question)
    print(f"Persona {i} ({demo['party']}): {response}")
    time.sleep(0.3)  # Rate limiting

Question: The government should provide universal healthcare for all citizens.


Synthetic responses:

Persona 1 (Democrat): 5
Persona 2 (Independent): 4
Persona 3 (Republican): 2


**What this code does:**

Implements the core **silicon sampling function** for Likert-scale questions:

**The `silicon_sample_likert` function workflow:**
1. Create persona from demographics
2. Format question with clear scale instructions
3. Send to LLM with persona as system message
4. Extract numeric response with error handling
5. Validate response is in valid range

**Key parameters:**
- **`temperature=1.0`**: The OpenAI API default
  - Higher than annotation tasks (0.1-0.3)
  - Allows for within-group diversity
  - Still not as diverse as real humans
- **`model="gpt-3.5-turbo"`**: The original study used GPT-3, an earlier model in the same family

**Robust parsing:**
- Uses regex to extract first number from response
- Handles cases where model adds explanation
- Validates number is in valid range
- Returns `None` if parsing fails
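
The parsing rules can be tested offline, without any API call. The sketch below copies the same logic into a standalone helper (the name `parse_likert` is hypothetical) and runs it on a few made-up model replies, including one that shows a limitation: only the first number is inspected.

```python
import re


def parse_likert(text, scale=(1, 5)):
    """Return the first integer in `text` if it lies on the scale, else None."""
    numbers = re.findall(r"\d+", text)
    if numbers:
        value = int(numbers[0])
        if scale[0] <= value <= scale[1]:
            return value
    return None


# Made-up replies covering the cases listed above
examples = [
    "4",                             # bare number
    "4 - Agree",                     # number plus label
    "I would say 5.",                # number inside a sentence
    "Strongly agree",                # no number at all
    "10",                            # out of range
    "As a 45-year-old, I choose 4",  # first number is the age, so parsing fails
]

for text in examples:
    print(f"{text!r:35} -> {parse_likert(text)}")
```

The last case returns `None` even though a valid answer is present, which is one reason to track the share of unparseable replies when scaling up.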

**Why system vs user message:**
- **System message**: Sets persistent context (persona)
- **User message**: Contains the specific question
- This separation helps model stay "in character"

**Cost consideration:** Each call costs ~$0.0005-0.001 with GPT-3.5-turbo, so 1000 responses ≈ $0.50-1.00
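
The arithmetic behind that estimate can be sketched as a small helper. The per-call figures are the ones quoted above, not current OpenAI pricing, and `estimate_cost` is a hypothetical name.

```python
def estimate_cost(n_personas, n_questions, low_per_call=0.0005, high_per_call=0.001):
    """Return (number of calls, low estimate, high estimate) in dollars.

    The default per-call costs are the rough figures quoted in the text;
    check current pricing before relying on them.
    """
    n_calls = n_personas * n_questions
    return n_calls, n_calls * low_per_call, n_calls * high_per_call


for n_personas, n_questions in [(1000, 1), (90, 2), (300, 5)]:
    calls, low, high = estimate_cost(n_personas, n_questions)
    print(f"{n_personas} personas x {n_questions} questions = "
          f"{calls} calls, about ${low:.2f}-${high:.2f}")
```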

### Generating responses for multiple questions

In [4]:
# Define survey questions (simplified ANES-style)
questions = [
    "The government should provide universal healthcare for all citizens.",
    "We should increase spending on defense and military.",
    "Climate change is one of the most serious problems facing the country.",
    "Immigration levels should be decreased.",
    "The government should do more to regulate big corporations."
]

def collect_responses(demographics, questions, model="gpt-3.5-turbo"):
    """
    Collect responses for multiple questions from a single persona
    """
    responses = {}

    for i, question in enumerate(questions, 1):
        response = silicon_sample_likert(demographics, question, model=model)
        responses[f"Q{i}"] = response
        time.sleep(0.2)  # Rate limiting

    return responses

# Collect responses from example personas
results = []

for demo in example_personas:
    print(f"Collecting responses for: {demo['party']}, {demo['age']}, {demo['race']}...")
    responses = collect_responses(demo, questions)

    result = {**demo, **responses}
    results.append(result)

df_synthetic = pd.DataFrame(results)

print("\n✓ Synthetic responses collected\n")
print(df_synthetic[['party', 'age', 'race', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5']])

Collecting responses for: Democrat, 45, white...
Collecting responses for: Independent, 29, Black...
Collecting responses for: Republican, 67, Hispanic...

✓ Synthetic responses collected

         party  age      race  Q1  Q2  Q3  Q4  Q5
0     Democrat   45     white   5   2   5   3   4
1  Independent   29     Black   5   3   4   2   3
2   Republican   67  Hispanic   2   4   3   3   2


**What this code does:**

Implements **multi-question survey collection** from synthetic personas:

**The `collect_responses` function:**
- Takes single demographic profile and list of questions
- Collects response for each question sequentially
- Returns dictionary of question IDs → responses
- Includes rate limiting to avoid API throttling

**Question design:**
- Based on typical ANES (American National Election Studies) format
- Cover different policy domains (healthcare, defense, environment, immigration, economy)
- Clear, unambiguous phrasing
- Scaled as agreement (1-5)

**Why collect multiple questions:**
- Test consistency within persona
- Enable correlation analysis (do issues cluster as expected?)
- Compare to real survey patterns
- Detect stereotyping across domains

**Output format:**
- Pandas DataFrame with demographics + responses
- Easy to analyze, visualize, export
- Can merge with real survey data for comparison
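
As a rough illustration of that comparison, the sketch below lines up group means from a synthetic and a real DataFrame. Both `toy_synthetic` and `toy_real` hold made-up values; in practice they would be the collected responses and an actual survey extract.

```python
import pandas as pd

# Made-up responses for illustration only
toy_synthetic = pd.DataFrame({
    "party": ["Democrat"] * 3 + ["Republican"] * 3,
    "Q1": [5, 5, 4, 2, 1, 2],
})
toy_real = pd.DataFrame({
    "party": ["Democrat"] * 4 + ["Republican"] * 4,
    "Q1": [4, 5, 3, 4, 2, 3, 1, 2],
})

# Group means side by side, aligned on party
real_means = toy_real.groupby("party")["Q1"].mean().rename("real_mean")
syn_means = toy_synthetic.groupby("party")["Q1"].mean().rename("synthetic_mean")
comparison = pd.concat([real_means, syn_means], axis=1)

# Positive gap: the synthetic group agrees more strongly than the real group
comparison["gap"] = comparison["synthetic_mean"] - comparison["real_mean"]
print(comparison.round(2))
```

Means alone can hide differences in spread, so this is a first check rather than a substitute for the distribution metrics in Part 3.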

**Next steps:** Scale up to many personas and compare to real distributions

---

## Part 3: Measuring Algorithmic Fidelity

**Algorithmic fidelity** measures how well synthetic distributions match real survey distributions.

Common metrics:
- **Cosine similarity**: 1 = identical shape, 0 = no overlap (orthogonal); it is never negative for response counts or proportions
- **KL divergence**: 0 = identical, higher = more different
- **Spearman correlation**: Ordinal association (-1 to +1)
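
To make the first two metrics concrete, the sketch below computes them by hand with NumPy and checks the results against the SciPy functions imported in the setup cell. The two distributions `p` and `q` are arbitrary illustrative values.

```python
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import entropy

# Two illustrative distributions over a 1-5 scale (each sums to 1)
p = np.array([0.05, 0.15, 0.30, 0.30, 0.20])
q = np.array([0.20, 0.20, 0.20, 0.20, 0.20])

# Cosine similarity: dot product divided by the product of vector lengths
cos_manual = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# KL(p || q): expected log-ratio of p to q, weighted by p
kl_manual = np.sum(p * np.log(p / q))

# SciPy's `cosine` is a distance, so similarity is 1 minus it;
# `entropy(p, q)` with two arguments returns KL(p || q)
cos_scipy = 1 - cosine(p, q)
kl_scipy = entropy(p, q)

print(f"Cosine similarity: manual={cos_manual:.4f}, scipy={cos_scipy:.4f}")
print(f"KL divergence:     manual={kl_manual:.4f}, scipy={kl_scipy:.4f}")
```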

In [5]:
# Simulate real survey data for comparison
# (In practice, you'd use actual survey data like ANES or GSS)
np.random.seed(42)

# Create "real" data with realistic distributions
# Democrats favor Q1 (healthcare), Republicans favor Q2 (defense)
real_data = []

for party in ['Democrat', 'Republican', 'Independent']:
    n = 100

    if party == 'Democrat':
        q1 = np.random.choice([3, 4, 5], n, p=[0.2, 0.4, 0.4])  # Pro healthcare
        q2 = np.random.choice([1, 2, 3], n, p=[0.4, 0.4, 0.2])  # Anti defense spending
    elif party == 'Republican':
        q1 = np.random.choice([1, 2, 3], n, p=[0.4, 0.4, 0.2])  # Anti healthcare
        q2 = np.random.choice([3, 4, 5], n, p=[0.2, 0.4, 0.4])  # Pro defense
    else:  # Independent
        q1 = np.random.choice([2, 3, 4], n, p=[0.3, 0.4, 0.3])  # Moderate
        q2 = np.random.choice([2, 3, 4], n, p=[0.3, 0.4, 0.3])  # Moderate

    for i in range(n):
        real_data.append({
            'party': party,
            'Q1': q1[i],
            'Q2': q2[i]
        })

df_real = pd.DataFrame(real_data)

print("'Real' survey data (simulated):")
print(df_real.groupby('party')[['Q1', 'Q2']].mean().round(2))

'Real' survey data (simulated):
               Q1    Q2
party                  
Democrat     4.09  1.80
Independent  3.05  3.05
Republican   1.82  4.18


In [6]:
def calculate_cosine_similarity(dist1, dist2):
    """
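    Calculate cosine similarity between two distributions

    Returns:
        float: 1 = identical shape, 0 = orthogonal (no shared categories).
               Non-negative inputs such as counts never give a negative value.
    """
    # Normalize counts to proportions (the inputs must already have the same length)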
    dist1 = np.array(dist1) / np.sum(dist1)
    dist2 = np.array(dist2) / np.sum(dist2)

    return 1 - cosine(dist1, dist2)

def calculate_kl_divergence(p, q):
    """
    Calculate KL(p || q), the KL divergence of p from the reference distribution q
    Lower is better (0 = identical)
    """
    p = np.array(p) / np.sum(p)
    q = np.array(q) / np.sum(q)

    # Add small epsilon to avoid division by zero
    epsilon = 1e-10
    p = p + epsilon
    q = q + epsilon

    return entropy(p, q)

# Compare distributions for Q1 by party
print("Algorithmic Fidelity Analysis (Q1: Healthcare)\n")
print("=" * 70)

for party in ['Democrat', 'Republican']:
    # Get real distribution
    real_dist = df_real[df_real['party'] == party]['Q1'].value_counts().sort_index()
    real_dist = real_dist.reindex([1, 2, 3, 4, 5], fill_value=0).values

    # For synthetic, we only have 1 sample per party in example
    # In practice, you'd generate many samples
    print(f"\n{party}:")
    print(f"  Real distribution: {real_dist}")
    print(f"  (Note: Need larger synthetic sample for proper comparison)")

Algorithmic Fidelity Analysis (Q1: Healthcare)


Democrat:
  Real distribution: [ 0  0 28 35 37]
  (Note: Need larger synthetic sample for proper comparison)

Republican:
  Real distribution: [39 40 21  0  0]
  (Note: Need larger synthetic sample for proper comparison)


**What this code does:**

Implements **algorithmic fidelity metrics** to compare synthetic and real distributions:

**Why we need simulated data:**
- Real survey data (ANES, GSS) requires download/access
- Simulated data lets us demonstrate the metrics
- In practice, you'd replace this with actual survey data

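**Cosine similarity:**
- Measures the angle between two vectors
- **Range**: 0 (no overlap) to 1 (identical shape) for response distributions, because counts and proportions are never negative; -1 is only possible for vectors with negative entries
- **Interpretation** (rough heuristics, not established cut-offs):
  - > 0.9: Excellent match
  - 0.7-0.9: Good match (typical for good fidelity)
  - < 0.7: Poor match
- **Advantage**: Scale-invariant (doesn't matter if one sample is larger than the other)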

**KL divergence (Kullback-Leibler):**
- Measures how one distribution differs from another
- **Range**: 0 (identical) to ∞ (completely different)
- **Interpretation**:
  - < 0.1: Excellent match
  - 0.1-0.5: Moderate difference
  - > 0.5: Large difference
- **Asymmetric**: KL(P||Q) ≠ KL(Q||P)
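- **Sensitive to zeros**: A category with zero probability in Q but not in P makes KL(P||Q) infinite, so the code adds a small epsilon; the resulting value then depends heavily on the epsilon chosen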
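
The asymmetry is easy to see numerically. The sketch below computes KL in both directions for two illustrative distributions and, as one possible symmetric alternative that the notebook itself does not use, the Jensen-Shannon divergence via `scipy.spatial.distance.jensenshannon` (which returns the square root of the divergence).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

# Illustrative distributions over a 1-5 scale
p = np.array([0.05, 0.05, 0.20, 0.40, 0.30])
q = np.array([0.20, 0.20, 0.20, 0.20, 0.20])

kl_pq = entropy(p, q)  # KL(P || Q)
kl_qp = entropy(q, p)  # KL(Q || P), generally a different number

# jensenshannon returns a distance; squaring gives the divergence.
# With base=2 the divergence lies between 0 and 1.
js_pq = jensenshannon(p, q, base=2) ** 2
js_qp = jensenshannon(q, p, base=2) ** 2

print(f"KL(P||Q) = {kl_pq:.4f}")
print(f"KL(Q||P) = {kl_qp:.4f}")
print(f"JS(P,Q)  = {js_pq:.4f}  (same in both directions: {np.isclose(js_pq, js_qp)})")
```

Because the direction matters, state which distribution (real or synthetic) is the first argument whenever you report a KL value.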

**When to use each:**
- **Cosine**: Easier to interpret, symmetric, good for correlation
- **KL divergence**: More sensitive to differences, standard in ML
- **Report both**: Different perspectives on same comparison

**Practical note:** Need large samples (100+ per group) for reliable distribution comparison
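
A small simulation gives a feel for why sample size matters. The sketch below repeatedly draws samples of size `n` from a known distribution and measures how close each sample's counts are to the truth, so any shortfall from 1.0 is pure sampling noise. The distribution `true_probs` and the helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" distribution over a 1-5 scale (illustrative)
true_probs = np.array([0.05, 0.15, 0.30, 0.30, 0.20])


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mean_similarity_to_truth(n, n_reps=200):
    """Average cosine similarity between size-n samples and the true distribution."""
    sims = []
    for _ in range(n_reps):
        counts = rng.multinomial(n, true_probs)
        sims.append(cosine_similarity(counts, true_probs))
    return float(np.mean(sims))


results = {n: mean_similarity_to_truth(n) for n in (10, 30, 100, 500)}
for n, sim in results.items():
    print(f"n = {n:>3}: mean cosine similarity to truth = {sim:.3f}")
```

Even a perfect simulator would score below 1.0 at small `n`, so a low similarity from a handful of responses says little about fidelity.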

### Understanding Cosine Similarity and KL Divergence with Examples

Let's simulate different scenarios to understand how these metrics work:

In [None]:
# Simulate different distribution scenarios
import matplotlib.pyplot as plt
import numpy as np

# Scenario 1: Perfect match (synthetic = real)
real_perfect = np.array([0, 5, 20, 40, 35])  # Distribution across 1-5 scale
synthetic_perfect = np.array([0, 5, 20, 40, 35])

# Scenario 2: Good match (slight differences)
real_good = np.array([0, 5, 20, 40, 35])
synthetic_good = np.array([0, 8, 22, 38, 32])

# Scenario 3: Moderate mismatch (distribution shifted)
real_moderate = np.array([0, 5, 20, 40, 35])
synthetic_moderate = np.array([5, 15, 30, 30, 20])

# Scenario 4: Poor match (very different distributions)
real_poor = np.array([0, 5, 20, 40, 35])  # Skewed toward "Agree"
synthetic_poor = np.array([35, 40, 20, 5, 0])  # Skewed toward "Disagree" (reversed)

# Calculate metrics for each scenario
scenarios = [
    ("Perfect Match", real_perfect, synthetic_perfect),
    ("Good Match", real_good, synthetic_good),
    ("Moderate Mismatch", real_moderate, synthetic_moderate),
    ("Poor Match", real_poor, synthetic_poor)
]

print("Distribution Comparison Examples")
print("=" * 80)
print()

for name, real, synthetic in scenarios:
    cos_sim = calculate_cosine_similarity(real, synthetic)
    kl_div = calculate_kl_divergence(real, synthetic)
    
    print(f"{name}:")
    print(f"  Real:      {real}")
    print(f"  Synthetic: {synthetic}")
    print(f"  Cosine similarity: {cos_sim:.3f}")
    print(f"  KL divergence:     {kl_div:.3f}")
    print()

# Visualize all scenarios
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

x = np.arange(1, 6)  # Likert scale 1-5

for idx, (name, real, synthetic) in enumerate(scenarios):
    ax = axes[idx]
    
    width = 0.35
    ax.bar(x - width/2, real, width, label='Real', alpha=0.8, color='steelblue')
    ax.bar(x + width/2, synthetic, width, label='Synthetic', alpha=0.8, color='coral')
    
    cos_sim = calculate_cosine_similarity(real, synthetic)
    kl_div = calculate_kl_divergence(real, synthetic)
    
    ax.set_title(f"{name}\nCosine: {cos_sim:.3f}, KL: {kl_div:.3f}", fontsize=12, fontweight='bold')
    ax.set_xlabel('Response (1=Disagree, 5=Agree)')
    ax.set_ylabel('Count')
    ax.set_xticks(x)
    ax.legend()
    ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 80)
print("Key Insights:")
print("=" * 80)
print()
print("COSINE SIMILARITY:")
print("  • Perfect match (1.000): Distributions are identical")
print("  • Good match (>0.95): Very similar shape and position")
print("  • Moderate (0.85-0.95): Similar trends but noticeable differences")
print("  • Poor (<0.85): Different patterns or reversed")
print("  • Measures: The 'angle' between distributions (shape similarity)")
print()
print("KL DIVERGENCE:")
print("  • Perfect match (0.000): Distributions are identical")
print("  • Good match (<0.05): Very close distributions")
print("  • Moderate (0.05-0.20): Noticeable differences")
print("  • Poor (>0.20): Very different distributions")
print("  • Measures: How much 'information is lost' using synthetic instead of real")
print()
print("PRACTICAL INTERPRETATION:")
print("  • If cosine similarity is high but KL divergence is moderate:")
print("    → Overall shape is similar, but some (often rare) categories get too much or too little probability")
print("  • If both metrics are poor:")
print("    → Synthetic data fails to capture real distribution")
print("  • For silicon sampling validation, aim for:")
print("    → Cosine similarity > 0.90")
print("    → KL divergence < 0.10")


**What this demonstration shows:**

**The four scenarios illustrate:**

1. **Perfect Match** (Cosine ≈ 1.0, KL ≈ 0.0):
   - Distributions are identical
   - This is the ideal but rarely achieved in practice

2. **Good Match** (Cosine > 0.95, KL < 0.05):
   - Small differences in counts but same overall pattern
   - This level of agreement suggests synthetic data captures real patterns well
   - Typical of successful silicon sampling

3. **Moderate Mismatch** (Cosine 0.85-0.95, KL 0.05-0.20):
   - Distribution is shifted (e.g., synthetic responses are less extreme)
   - Still captures general trend but misses magnitude
   - May indicate **invariance problem** (responses too centered)

4. **Poor Match** (Cosine < 0.85, KL > 0.20):
   - Completely different patterns (e.g., reversed preferences)
   - May indicate that the LLM holds inaccurate beliefs about the group (a different failure from the exaggerated group differences diagnosed in Part 5)
   - Synthetic data should NOT be used if this occurs

**Why both metrics matter:**
- **Cosine similarity** tells you if the *shape* matches (correlation)
- **KL divergence** tells you if the *exact values* match (precision)
- You need both high cosine and low KL for good fidelity

**Real-world application:**
When validating your silicon sampling:
1. Calculate both metrics for each demographic group × question
2. If most comparisons show good match: proceed with caution
3. If many show moderate/poor match: don't use synthetic data for that question
4. Always report these metrics in your validation section
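
Step 1 can be sketched as a loop that builds one row per group and question. The function `fidelity_table`, the helper `likert_counts`, and the toy DataFrames are hypothetical; the metrics are recomputed inline so the example runs on its own, with the first argument to KL being the real distribution.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine
from scipy.stats import entropy

SCALE = [1, 2, 3, 4, 5]


def likert_counts(series):
    """Counts for every scale point, including points nobody chose."""
    return series.value_counts().reindex(SCALE, fill_value=0).to_numpy(dtype=float)


def fidelity_table(df_real, df_syn, group_col, question_cols, eps=1e-10):
    """One row per (group, question) with cosine similarity and KL(real || synthetic)."""
    rows = []
    shared_groups = sorted(set(df_real[group_col]) & set(df_syn[group_col]))
    for group in shared_groups:
        for q in question_cols:
            real = likert_counts(df_real.loc[df_real[group_col] == group, q])
            syn = likert_counts(df_syn.loc[df_syn[group_col] == group, q])
            real_p = real / real.sum() + eps  # epsilon avoids log(0)
            syn_p = syn / syn.sum() + eps
            rows.append({
                "group": group,
                "question": q,
                "cosine": 1 - cosine(real_p, syn_p),
                "kl": entropy(real_p, syn_p),
            })
    return pd.DataFrame(rows)


# Made-up data for illustration only
toy_real = pd.DataFrame({"party": ["D"] * 5 + ["R"] * 5,
                         "Q1": [5, 4, 4, 3, 5, 1, 2, 2, 3, 1]})
toy_syn = pd.DataFrame({"party": ["D"] * 5 + ["R"] * 5,
                        "Q1": [5, 5, 5, 4, 5, 2, 2, 2, 2, 1]})

print(fidelity_table(toy_real, toy_syn, "party", ["Q1"]).round(3))
```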

### Generating larger synthetic sample for proper validation

In [7]:
# Generate larger synthetic sample (about 180 API calls; this will take a few minutes)
# Skip this cell, or reduce the number of personas, if you want to save API costs

# Create diverse demographic profiles

# Define demographic space
ages = [25, 35, 45, 55, 65]
races = ['white', 'Black', 'Hispanic']
genders = ['male', 'female']
educations = ['high school', 'college degree']
incomes = ['35,000', '65,000', '95,000']
regions = ['Northeast', 'South', 'Midwest', 'West']
parties = ['Democrat', 'Republican', 'Independent']

# Generate personas by drawing each attribute at random
# (the full grid has 720 combinations per party; for demo purposes, sample 30 per party)
np.random.seed(42)

personas = []
for party in parties:
    for i in range(30):  # 30 per party = 90 total
        persona = {
            'age': np.random.choice(ages),
            'race': np.random.choice(races),
            'gender': np.random.choice(genders),
            'education': np.random.choice(educations),
            'income': np.random.choice(incomes),
            'region': np.random.choice(regions),
            'party': party
        }
        personas.append(persona)

# Collect responses
synthetic_results = []

for i, persona in enumerate(personas, 1):
    if i % 10 == 0:
        print(f"Progress: {i}/{len(personas)}")

    responses = collect_responses(persona, questions[:2])  # Just Q1, Q2 for speed
    result = {**persona, **responses}
    synthetic_results.append(result)

df_synthetic_large = pd.DataFrame(synthetic_results)
df_synthetic_large.to_csv('synthetic_survey_data.csv', index=False)

print("\n✓ Large synthetic sample collected and saved")

Progress: 10/90
Progress: 20/90
Progress: 30/90
Progress: 40/90
Progress: 50/90
Progress: 60/90
Progress: 70/90
Progress: 80/90
Progress: 90/90

✓ Large synthetic sample collected and saved


**What this code does:**

Demonstrates how to generate a **large-scale synthetic sample** for validation:

**The demographic space:**
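- **Full factorial space**: 5 ages × 3 races × 2 genders × 2 education levels × 3 incomes × 4 regions = 720 combinations per party
- **Sampling strategy**: The code draws each attribute at random instead of enumerating the full grid (far fewer API calls)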
- **Stratification**: Equal samples per party (can weight later)
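
For comparison with the attribute-by-attribute random draw, the sketch below enumerates the full grid with `itertools.product` and then samples distinct profiles from it. This is an alternative design, not what the cell above does; the attribute lists are copied from the cell, and sampling without replacement is an assumption.

```python
from itertools import product

import numpy as np

ages = [25, 35, 45, 55, 65]
races = ['white', 'Black', 'Hispanic']
genders = ['male', 'female']
educations = ['high school', 'college degree']
incomes = ['35,000', '65,000', '95,000']
regions = ['Northeast', 'South', 'Midwest', 'West']

keys = ('age', 'race', 'gender', 'education', 'income', 'region')

# Every combination of the six attributes
full_grid = [
    dict(zip(keys, combo))
    for combo in product(ages, races, genders, educations, incomes, regions)
]
print(f"Full grid size: {len(full_grid)} profiles per party")

# Draw 30 distinct profiles for one party (no duplicates)
rng = np.random.default_rng(42)
chosen = rng.choice(len(full_grid), size=30, replace=False)
sampled = [{**full_grid[i], 'party': 'Democrat'} for i in chosen]
print(f"Sampled {len(sampled)} distinct Democrat personas")
```

Uniform sampling over the grid does not reflect how common each profile is in the population, so weights from census or survey margins would still be needed for population estimates.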

**Sample size considerations:**
- **Minimum**: 30-50 per group for distribution comparison
- **Good**: 100+ per group
- **Excellent**: 200+ per group (for robust comparison)

**Cost calculation:**
- 90 personas × 2 questions = 180 API calls
- ~$0.0005 per call with GPT-3.5-turbo
- Total: ~$0.09 at that rate (180 × $0.0005); longer prompts or pricier models cost more

**Why you may want to skip this cell:**
- It uses API credits every time it runs
- It takes 5-10 minutes to run
- Run it once when you are ready to do real validation, then reuse the saved CSV

**Best practices:**
- Save results to CSV after generation
- Don't regenerate unnecessarily
- Include random seed for reproducibility
- Log all parameters (model, temperature, timestamp)
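
One lightweight way to follow the last point is to write a small JSON file next to the CSV. The sketch below is an assumption about what such a log might contain; the file name `synthetic_survey_metadata.json`, the field names, and the helper `save_run_metadata` are hypothetical.

```python
import json
from datetime import datetime, timezone


def save_run_metadata(path, **params):
    """Write run parameters plus a UTC timestamp to a JSON file and return them."""
    metadata = {**params, "timestamp_utc": datetime.now(timezone.utc).isoformat()}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    return metadata


metadata = save_run_metadata(
    "synthetic_survey_metadata.json",  # hypothetical file name
    model="gpt-3.5-turbo",
    temperature=1.0,
    random_seed=42,
    n_personas=90,
    n_questions=2,
    output_file="synthetic_survey_data.csv",
)
print(json.dumps(metadata, indent=2))
```

Note that a NumPy seed only fixes which personas are drawn; the model's sampled answers can still differ between runs.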

---

## Part 4: Diagnosing Invariance

**Invariance** refers to the lack of within-group diversity - do all personas with the same demographics give identical answers?

**The problem:**
- Real humans with same demographics have diverse opinions
- LLMs may give identical answers for identical demographics
- This underestimates real heterogeneity

**How to test:**
- Generate multiple responses for same persona
- Measure within-persona variance
- Compare to human within-group variance

In [8]:
# Test invariance: multiple responses from same persona
test_persona = example_personas[0]  # Democrat, 45, white male
test_question = questions[0]  # Healthcare

print(f"Testing invariance for persona: {test_persona['party']}, {test_persona['age']}, {test_persona['race']}")
print(f"Question: {test_question}\n")
print("=" * 70)

# Collect 10 responses from same persona
responses = []
for i in range(10):
    response = silicon_sample_likert(test_persona, test_question)
    responses.append(response)
    print(f"Response {i+1}: {response}")
    time.sleep(0.3)

# Calculate variance
responses = [r for r in responses if r is not None]
mean_response = np.mean(responses)
variance = np.var(responses)
std = np.std(responses)

print(f"\nWithin-persona statistics:")
print(f"  Mean: {mean_response:.2f}")
print(f"  Std: {std:.2f}")
print(f"  Variance: {variance:.2f}")

# Compare to expected human variance
# For real data, same-demographic humans typically have std ~ 1.0-1.5 on 5-point scale
print(f"\nComparison:")
print(f"  LLM within-persona std: {std:.2f}")
print(f"  Expected human within-group std: ~1.0-1.5")

if std < 0.5:
    print(f"  ⚠ High invariance detected (low diversity)")
elif std < 1.0:
    print(f"  ⚠ Moderate invariance (less diverse than humans)")
else:
    print(f"  ✓ Variance comparable to humans")

Testing invariance for persona: Democrat, 45, white
Question: The government should provide universal healthcare for all citizens.

Response 1: 5
Response 2: 5
Response 3: 5
Response 4: 5
Response 5: 5
Response 6: 5
Response 7: 5
Response 8: 5
Response 9: 5
Response 10: 5

Within-persona statistics:
  Mean: 5.00
  Std: 0.00
  Variance: 0.00

Comparison:
  LLM within-persona std: 0.00
  Expected human within-group std: ~1.0-1.5
  ⚠ High invariance detected (low diversity)


**What this code does:**

Tests for **invariance** by repeatedly querying the same persona:

**The invariance problem:**
- Real humans: Same demographics ≠ identical opinions
- LLMs: May produce very similar responses for identical personas
- This **underestimates real heterogeneity**

**Why invariance happens:**
- Temperature > 0 adds randomness, but not enough
- Models average over training data
- Stereotypical "typical" response for each demographic
- Missing individual-level factors (personality, experiences, etc.)

**What we measure:**
- **Mean**: Average response for persona
- **Std (standard deviation)**: Spread of responses
- **Variance**: Squared std

**Interpretation:**
- **Std < 0.5**: High invariance (responses very similar)
- **Std 0.5-1.0**: Moderate diversity (less than humans)
- **Std > 1.0**: Good diversity (approaching human levels)

**Typical observations:**
- LLMs: Std ~ 0.3-0.8 (varies by question and model)
- Humans: Std ~ 1.0-1.5 on same demographic
- **Implication**: LLMs underestimate within-group diversity

**Solutions:**
- Higher temperature (but may reduce accuracy)
- Add personality traits to personas
- Fine-tune on diverse real responses
- Acknowledge limitation in reporting
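
To put the synthetic spread next to a human benchmark, one option is a simple ratio of standard deviations. The sketch below uses made-up response lists, and `spread_ratio` is a hypothetical helper; a ratio near 1 would mean comparable diversity, and a ratio near 0 strong invariance.

```python
import numpy as np


def spread_ratio(synthetic, real):
    """Ratio of synthetic to real sample standard deviation (ddof=1).

    Assumes the real responses are not all identical.
    """
    syn_std = np.std(synthetic, ddof=1)
    real_std = np.std(real, ddof=1)
    return float(syn_std / real_std)


# Made-up responses for one demographic group on one question
real_group = [5, 3, 4, 2, 5, 4]
identical_llm = [5, 5, 5, 5, 5, 5]  # like the run shown above
varied_llm = [5, 4, 5, 5, 4, 5]     # some variation, still narrower than humans

print(f"Identical answers: ratio = {spread_ratio(identical_llm, real_group):.2f}")
print(f"Varied answers:    ratio = {spread_ratio(varied_llm, real_group):.2f}")
```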

---

## Part 5: Diagnosing Stereotyping

**Stereotyping** occurs when LLMs exaggerate between-group differences compared to real data.

**The problem:**
- LLMs trained on text that often contains stereotypes
- May amplify partisan/demographic differences
- Creates artificial polarization

**How to test:**
- Compare effect sizes for demographics in synthetic vs real data
- Look for exaggerated differences between groups

In [9]:
# Analyze stereotyping: between-group differences

# For demonstration, generate synthetic data for Democrats vs Republicans
print("Generating responses from 5 Democrats and 5 Republicans...\n")

stereotyping_data = []

for party in ['Democrat', 'Republican']:
    for i in range(5):
        persona = {
            'age': 45 + i * 5,
            'race': 'white',
            'gender': 'male' if i % 2 == 0 else 'female',
            'education': 'college degree',
            'income': '75,000',
            'region': 'Midwest',
            'party': party
        }

        response = silicon_sample_likert(persona, questions[0])  # Healthcare question
        stereotyping_data.append({
            'party': party,
            'response': response
        })
        time.sleep(0.3)

df_stereotyping = pd.DataFrame(stereotyping_data)

# Calculate means and effect size
dem_mean = df_stereotyping[df_stereotyping['party'] == 'Democrat']['response'].mean()
rep_mean = df_stereotyping[df_stereotyping['party'] == 'Republican']['response'].mean()

# Cohen's d (effect size), using the pooled within-group standard deviation
grouped = df_stereotyping.groupby('party')['response']
group_n = grouped.count()
pooled_var = ((group_n - 1) * grouped.var()).sum() / (group_n.sum() - 2)
pooled_std = np.sqrt(pooled_var)
mean_diff = dem_mean - rep_mean

if pooled_std > 0:
    cohens_d = mean_diff / pooled_std
elif mean_diff == 0:
    cohens_d = 0.0
else:
    # No within-group variance at all: the standardized difference is unbounded
    cohens_d = np.inf * np.sign(mean_diff)

print("\nBetween-group analysis (Healthcare question):")
print("=" * 70)
print(f"\nDemocrat mean: {dem_mean:.2f}")
print(f"Republican mean: {rep_mean:.2f}")
print(f"Difference: {abs(dem_mean - rep_mean):.2f}")
print(f"Cohen's d: {abs(cohens_d):.2f}")

print(f"\nEffect size interpretation:")
if abs(cohens_d) < 0.5:
    print("  Small effect (< 0.5)")
elif abs(cohens_d) < 0.8:
    print("  Medium effect (0.5-0.8)")
else:
    print("  Large effect (> 0.8)")
    print("  ⚠ May indicate stereotyping if larger than real data")

print(f"\nNote: Compare this to real survey data Cohen's d")
print(f"If synthetic d >> real d, this suggests stereotyping")

Generating responses from 5 Democrats and 5 Republicans...


Between-group analysis (Healthcare question):

Democrat mean: 5.00
Republican mean: 2.00
Difference: 3.00
Cohen's d: inf

Effect size interpretation:
  Large effect (> 0.8)
  ⚠ May indicate stereotyping if larger than real data

Note: Compare this to real survey data Cohen's d
If synthetic d >> real d, this suggests stereotyping


**What this code does:**

Tests for **stereotyping** by measuring between-group differences:

**What is stereotyping in silicon sampling:**
- LLMs may exaggerate differences between demographic groups
- E.g., making Democrats MORE pro-healthcare than real Democrats
- Or Republicans MORE anti-healthcare than real Republicans
- Creates artificial polarization

**Cohen's d (effect size):**
- Standardized measure of group difference
- **Formula**: (Mean1 - Mean2) / Pooled SD
- **Interpretation**:
  - d < 0.5: Small effect
  - d = 0.5-0.8: Medium effect
  - d > 0.8: Large effect

**How to detect stereotyping:**
1. Calculate Cohen's d for synthetic data
2. Calculate Cohen's d for real data
3. Compare: If synthetic d >> real d, stereotyping present

**Example:**
- Real data: Democrat vs Republican on healthcare, d = 1.2
- Synthetic: d = 2.4
- **Interpretation**: LLM is doubling the partisan divide
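
The three detection steps can be sketched with a standalone Cohen's d helper applied to both data sources. All four response lists are made-up illustrative values, and `cohens_d` is a hypothetical function using the standard pooled-standard-deviation formula; returning infinity when neither group varies is a design choice, not an established convention.

```python
import numpy as np


def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation (sample variances, ddof=1)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    diff = a.mean() - b.mean()
    if pooled_var == 0:
        # No within-group variance: d is unbounded unless the means are equal
        return 0.0 if diff == 0 else float(np.sign(diff)) * float("inf")
    return float(diff / np.sqrt(pooled_var))


# Made-up responses for illustration only
real_dem, real_rep = [4, 5, 3, 4, 5], [2, 3, 1, 2, 3]
syn_dem, syn_rep = [5, 5, 4, 5, 5], [2, 1, 2, 2, 1]

d_real = cohens_d(real_dem, real_rep)
d_syn = cohens_d(syn_dem, syn_rep)

print(f"Real d:      {d_real:.2f}")
print(f"Synthetic d: {d_syn:.2f}")
print(f"Ratio (synthetic / real): {d_syn / d_real:.2f}")
```

A synthetic d can be inflated by a larger mean gap, by smaller within-group spread (invariance), or both, so it helps to report the raw means and standard deviations alongside the ratio.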

**Why stereotyping happens:**
- Training data contains exaggerated partisan rhetoric
- News articles emphasize differences
- Social media polarization in training data
- Models learn "prototypical" Democrat/Republican

**Bisbee et al. (2024) findings:**
- Stereotyping varies by question type
- More polarized responses for morally charged issues (abortion, guns)
- Better performance for less controversial topics
- Aggregate estimates can help reduce bias

**Solutions:**
- Compare effect sizes to real data (essential validation step)
- Use caution with controversial or morally charged questions
- Consider weighting/calibrating to match real distributions