# LLMs for Synthetic Data II - Behavioral Tests

**Learning objectives:**
- Use LLMs to generate personalized persuasive messages for human experiments
- Design message variants tailored to demographic characteristics
- Create experimental materials at scale for behavioral tests
- Implement multi-turn conversational scripts
- Generate personalized outcome measures (Velez & Liu approach)

**How to run this notebook:**
- **Google Colab** (recommended): Works for all parts
- **OpenAI API key needed**: For generating experimental materials

**Key papers:**
- **Argyle et al. (2023)**: "Leveraging AI for Democratic Discourse"
- **Velez & Liu (2024)**: "Algorithmic Persuasion With LLMs"

---

## Introduction: LLMs as Message Generators

**The Research Problem:**

Traditional persuasion experiments require researchers to manually write multiple message variants:
- Generic messages (one-size-fits-all)
- Demographic-targeted messages (age, education, location)
- Personalized messages (individual-level customization)
- Control conditions

For even a modest experiment:
- 5 demographic groups × 3 message types = 15 unique messages
- Multiple topics = even more messages
- Conversational experiments = one message per turn, multiplying the total again

**The LLM Solution:**

Use LLMs to **generate** experimental materials at scale:
1. Generate message variants automatically
2. Tailor messages to demographic characteristics
3. Create personalized content for each participant
4. Generate conversational scripts
5. Create custom outcome measures

**IMPORTANT: Testing on Real Humans**

This notebook shows how to **generate messages** using LLMs. These messages are then:
- Tested on **real human participants** (not LLMs)
- Deployed through a survey platform (e.g., Qualtrics) to participants recruited via MTurk or Prolific
- Measured using standard human survey methods

We cannot replicate actual human experiments in this notebook (no real subjects), but we provide:
- Code to generate all experimental materials
- Examples of message variants
- Templates for deploying to human subjects

---

## Setup

In [None]:
# Install packages
!pip install -q openai pandas numpy matplotlib seaborn

In [None]:
import os
import json
import getpass
import time
from datetime import datetime
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from openai import OpenAI

# Set API key
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

client = OpenAI()

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Setup complete!")

---

## Part 1: Generating Generic Messages

**Generic messages**: The same message shown to all participants

**Use cases:**
- Control condition in experiments
- Baseline for comparison with targeted messages
- Cost-effective campaigns

**Argyle et al. (2023) finding:** Microtargeted messages did not significantly outperform generic ones.

In [None]:
def generate_generic_message(topic, position, tone="neutral", model="gpt-4o", temperature=0.8):
    """
    Generate generic persuasive message for human participants
    
    Args:
        topic: Policy or issue (e.g., "universal basic income")
        position: 'support' or 'oppose'
        tone: 'neutral', 'emotional', 'factual'
        model: Which OpenAI model to use
        temperature: Creativity level
    
    Returns:
        str: Persuasive message text
    """
    tone_instructions = {
        "neutral": "Use balanced, moderate language",
        "emotional": "Appeal to emotions and values",
        "factual": "Focus on data, statistics, and evidence"
    }
    
    prompt = f"""Write a brief persuasive message (2-3 sentences) that argues people should {position} the following policy:

{topic}

Requirements:
- {tone_instructions.get(tone, tone_instructions['neutral'])}
- Be concise and clear
- Be respectful and non-manipulative
- Appropriate for a general adult audience

Return only the message text, no preamble or explanation."""
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )
    
    return response.choices[0].message.content.strip()


# Example: Generate messages for different topics
topics = [
    "Increasing government funding for renewable energy research",
    "Implementing universal basic income",
    "Expanding public transportation infrastructure"
]

print("Generic Messages for Human Experiment")
print("=" * 70)

for i, topic in enumerate(topics, 1):
    print(f"\nTopic {i}: {topic}")
    print("-" * 70)
    
    # Generate support message
    support_msg = generate_generic_message(topic, "support", tone="factual")
    print(f"\nSupport message:\n\"{support_msg}\"")
    
    # Generate oppose message
    oppose_msg = generate_generic_message(topic, "oppose", tone="factual")
    print(f"\nOppose message:\n\"{oppose_msg}\"")
    
    time.sleep(1)  # Rate limiting

**What this code does:**

Generates **generic persuasive messages** for use in human experiments:

**Key parameters:**
- **Topic**: The policy/issue being tested
- **Position**: Support or oppose
- **Tone**: Neutral, emotional, or factual framing
- **Temperature**: 0.8 allows creative variation while maintaining quality

**How to use in real experiments:**
1. Generate messages for your experimental conditions
2. Review and potentially edit for quality/appropriateness
3. Insert into survey platform (Qualtrics, SurveyMonkey, etc.)
4. Show to real human participants
5. Measure attitude change with standard survey items

**Advantages:**
- Fast generation of multiple variants
- Consistent structure across conditions
- Easy to test many topics

**Quality control:**
- Always review generated messages
- Check for factual accuracy
- Ensure ethical appropriateness
- Test pilot versions on small samples
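
Part of this review can be pre-screened automatically before a human reads each message. The sketch below is one possible approach, not a replacement for manual review: `prescreen_message` is a hypothetical helper, and its three checks (rough sentence count, presence of numbers, leftover preamble) are assumptions about what is worth flagging.

```python
import re

def prescreen_message(message, min_sentences=2, max_sentences=3):
    """Return a list of flags for a human reviewer; an empty list means no flags."""
    flags = []
    # Rough sentence count: split after ., ! or ? when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", message.strip()) if s]
    if not (min_sentences <= len(sentences) <= max_sentences):
        flags.append(f"sentence_count={len(sentences)}")
    # Digits usually signal a statistic that needs fact-checking
    if re.search(r"\d", message):
        flags.append("contains_numbers")
    # A preamble such as "Here is a message:" means the format instruction was ignored
    if re.match(r"(?i)\s*(here is|here's|sure)", message):
        flags.append("possible_preamble")
    return flags

print(prescreen_message("Solar costs fell 80% since 2010."))
```

The sentence splitter is deliberately crude (abbreviations such as "U.S." will be over-counted), so treat the flags as prompts for the reviewer rather than as pass/fail decisions.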

---

## Part 2: Generating Microtargeted Messages

**Microtargeted messages**: Tailored to demographic characteristics

**Targeting dimensions:**
- Age groups (18-24, 25-34, 35-44, etc.)
- Education levels (high school, college, graduate)
- Geographic regions
- Income brackets
- Political affiliation

**Research question:** Does personalization increase persuasiveness?

In [None]:
def generate_microtargeted_message(topic, position, demographics, model="gpt-4o", temperature=0.8):
    """
    Generate demographically-targeted persuasive message
    
    Args:
        topic: Policy or issue
        position: 'support' or 'oppose'
        demographics: Dict with targeting attributes
        model: Which OpenAI model
        temperature: Creativity level
    
    Returns:
        str: Targeted message text
    """
    # Build demographic description
    demo_desc = []
    if 'age_group' in demographics:
        demo_desc.append(f"Age group: {demographics['age_group']}")
    if 'education' in demographics:
        demo_desc.append(f"Education: {demographics['education']}")
    if 'region' in demographics:
        demo_desc.append(f"Region: {demographics['region']}")
    if 'income' in demographics:
        demo_desc.append(f"Income level: {demographics['income']}")
    
    demo_string = "\n".join([f"- {d}" for d in demo_desc])
    
    prompt = f"""Write a brief persuasive message (2-3 sentences) that argues people should {position} this policy:

{topic}

Target this message to resonate with people who have these characteristics:
{demo_string}

Requirements:
- Use language and examples appropriate for this demographic
- Focus on values and concerns likely important to them
- Be concise, factual, and respectful
- Avoid stereotyping or manipulation

Return only the message text, no preamble."""
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )
    
    return response.choices[0].message.content.strip()


# Example: Generate targeted messages for different demographic groups
topic = "Increasing government funding for renewable energy research"

demographic_groups = [
    {
        "name": "Young Professionals",
        "age_group": "25-34",
        "education": "College degree",
        "region": "Urban areas",
        "income": "Middle to upper-middle class"
    },
    {
        "name": "Working Class Families",
        "age_group": "35-54",
        "education": "High school or some college",
        "region": "Suburban and rural areas",
        "income": "Lower to middle class"
    },
    {
        "name": "Retirees",
        "age_group": "65+",
        "education": "Varies",
        "region": "All regions",
        "income": "Fixed income"
    }
]

print("Microtargeted Messages for Human Experiment")
print("=" * 70)
print(f"\nTopic: {topic}\n")

for group in demographic_groups:
    print(f"\nTarget: {group['name']}")
    print("-" * 70)
    print("Demographics:")
    print(f"  Age: {group['age_group']}")
    print(f"  Education: {group['education']}")
    print(f"  Region: {group['region']}")
    print(f"  Income: {group['income']}")
    
    message = generate_microtargeted_message(topic, "support", group)
    print(f"\nTargeted message:\n\"{message}\"")
    
    time.sleep(1)

**What this code does:**

Generates **demographically-targeted messages** for different participant groups:

**Targeting strategy:**
- Identifies key demographic attributes
- Tailors language, examples, and framing
- Focuses on group-relevant concerns

**Key finding from Argyle et al. (2023):**
- Microtargeting **did NOT** significantly outperform generic messages
- Simple approaches may be just as effective
- But still useful for testing personalization theories

**Experimental design:**
1. **Random assignment**: Randomly assign participants to the generic or the demographic-matched condition
2. **Between-subjects**: Each participant sees one message type
3. **Pre-post measurement**: Measure attitudes before and after message
4. **Comparison**: Test if targeted messages > generic messages
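
The random-assignment and between-subjects steps can be prototyped in a few lines. This is a minimal sketch assuming balanced (round-robin after shuffling) assignment; `assign_conditions` and the condition labels are illustrative names, and survey platforms typically provide their own randomizer.

```python
import random

def assign_conditions(participant_ids, conditions, seed=42):
    """Assign each participant to exactly one condition; cell sizes differ by at most 1."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    ids = list(participant_ids)
    rng.shuffle(ids)
    # Deal the shuffled participants round-robin across conditions
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(ids)}

assignment = assign_conditions(range(1, 13), ["control", "generic", "targeted"])
print(assignment)
```

Each participant appears once in the mapping, which is what makes the design between-subjects.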

**Implementation in real study:**
```python
# In Qualtrics/survey platform:
# 1. Collect demographic info
# 2. Use embedded data to select appropriate message
# 3. Display message to participant
# 4. Measure outcome with Likert scales
```
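
The selection step (2) can be prototyped in Python before it is ported to the platform's embedded-data logic. A minimal sketch, assuming materials are stored as a list of records; the field names and the `select_message` helper are illustrative and not part of any survey platform's API.

```python
def select_message(materials, topic, position, participant_group):
    """Return the targeted message for the participant's group, else the generic one."""
    generic = None
    for row in materials:
        if row["topic"] != topic or row["position"] != position:
            continue
        if row["demographic_target"] == participant_group:
            return row["message"]
        if row["demographic_target"] == "all":
            generic = row["message"]  # remember the fallback, keep looking for a match
    return generic

# Placeholder records, for illustration only
materials = [
    {"topic": "T", "position": "support", "demographic_target": "all", "message": "generic text"},
    {"topic": "T", "position": "support", "demographic_target": "Retirees", "message": "retiree text"},
]
print(select_message(materials, "T", "support", "Retirees"))
```

This lookup applies to participants already assigned to the targeted arm; participants in the generic arm would receive the `"all"` message regardless of their demographics.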

**Ethical considerations:**
- Avoid stereotyping
- Don't exploit vulnerabilities
- Be transparent about targeting (if required by IRB)
- Monitor for unintended harmful effects

---

## Part 3: Batch Message Generation for Experiments

For real experiments, we need to generate many message variants:
- Multiple topics
- Multiple demographic groups
- Pro and con versions
- Control conditions

Let's create a systematic pipeline:

In [None]:
def generate_experiment_materials(topics, demographic_groups, include_generic=True, include_control=True):
    """
    Generate complete set of experimental materials
    
    Args:
        topics: List of policy topics (strings)
        demographic_groups: List of demographic dicts
        include_generic: Whether to generate generic messages
        include_control: Whether to include control (no message) condition
    
    Returns:
        pd.DataFrame: All experimental materials
    """
    materials = []
    
    for topic in topics:
        print(f"\nGenerating materials for: {topic}")
        print("-" * 70)
        
        # Control condition
        if include_control:
            materials.append({
                'topic': topic,
                'condition': 'control',
                'demographic_target': 'all',
                'message': '[No message shown - control condition]',
                'position': 'none'
            })
        
        # Generic messages
        if include_generic:
            print("  Generating generic messages...")
            for position in ['support', 'oppose']:
                message = generate_generic_message(topic, position, tone="factual")
                materials.append({
                    'topic': topic,
                    'condition': 'generic',
                    'demographic_target': 'all',
                    'message': message,
                    'position': position
                })
                time.sleep(0.5)
        
        # Microtargeted messages
        print("  Generating microtargeted messages...")
        for group in demographic_groups:
            for position in ['support', 'oppose']:
                message = generate_microtargeted_message(topic, position, group)
                materials.append({
                    'topic': topic,
                    'condition': 'microtargeted',
                    'demographic_target': group['name'],
                    'message': message,
                    'position': position,
                    **group  # Include demographic details
                })
                time.sleep(0.5)
    
    print("\n" + "=" * 70)
    print("✓ Materials generation complete!")
    
    return pd.DataFrame(materials)


# Example: Generate materials for small experiment
experiment_topics = [
    "Increasing minimum wage to $15/hour",
    "Implementing carbon tax on emissions"
]

target_groups = [
    {
        "name": "Young Adults",
        "age_group": "18-29",
        "education": "Some college or college degree",
        "region": "Urban"
    },
    {
        "name": "Middle-Aged Workers",
        "age_group": "40-55",
        "education": "High school or some college",
        "region": "Suburban"
    }
]

print("Generating Complete Experimental Materials")
print("=" * 70)

materials_df = generate_experiment_materials(
    experiment_topics,
    target_groups,
    include_generic=True,
    include_control=True
)

print(f"\nGenerated {len(materials_df)} experimental conditions")
print("\nSample of materials:")
print(materials_df[['topic', 'condition', 'demographic_target', 'position']].head(10))

**What this code does:**

Creates a **complete set of experimental materials** ready for deployment:

**Output structure:**
- Control conditions (no message)
- Generic messages (all demographics)
- Microtargeted messages (per demographic group)
- Both support and oppose positions

**Example experiment design:**
- 2 topics × (1 control + 2 generic + 4 targeted) = 14 conditions
- Each condition assigned to ~50-100 participants
- Total sample size: 700-1400 participants
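
The condition count follows directly from how `generate_experiment_materials` loops over topics, groups, and positions. A small sketch of that arithmetic; `count_conditions` is a hypothetical helper, and the 50-100 participants per cell is the rule of thumb stated above rather than a power calculation.

```python
def count_conditions(n_topics, n_groups, n_positions=2,
                     include_generic=True, include_control=True):
    """Count the rows the materials pipeline produces for a given design."""
    per_topic = n_groups * n_positions   # microtargeted cells
    if include_generic:
        per_topic += n_positions         # one generic message per position
    if include_control:
        per_topic += 1                   # single no-message control
    return n_topics * per_topic

n_conditions = count_conditions(n_topics=2, n_groups=2)
low, high = n_conditions * 50, n_conditions * 100
print(n_conditions, low, high)  # 14 700 1400
```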

**Next steps for real experiment:**
1. **Export materials**: Save DataFrame to CSV
2. **Review and edit**: Check all messages for quality
3. **Upload to survey platform**: Import to Qualtrics/similar
4. **Set up randomization**: Assign participants to conditions
5. **Deploy and collect data**: Run experiment with real humans
6. **Analyze results**: Compare effectiveness across conditions

In [None]:
# Export materials for use in real experiment
output_file = "experimental_materials.csv"
materials_df.to_csv(output_file, index=False)

print(f"✓ Experimental materials exported to: {output_file}")
print("\nTo use in your experiment:")
print("1. Review and edit messages as needed")
print("2. Import CSV to your survey platform")
print("3. Set up random assignment logic")
print("4. Add pre/post attitude measurements")
print("5. Deploy to real human participants")

---

## Part 4: Generating Multi-Turn Conversational Scripts

**Argyle et al. (2023)** tested whether multi-turn conversations are more persuasive than one-shot messages.

**Research question:** Does elaboration through dialogue increase persuasion?

**Implementation challenge:** Can't have live AI conversations in most surveys

**Solution:** Generate pre-scripted conversation paths

In [None]:
def generate_conversation_script(topic, position, n_turns=3, strategy="direct", model="gpt-4o"):
    """
    Generate multi-turn conversation script for human experiment
    
    Args:
        topic: Policy or issue
        position: 'support' or 'oppose'
        n_turns: Number of exchanges
        strategy: 'direct' (persuasive) or 'motivational' (reflective)
        model: Which model to use
    
    Returns:
        list: Conversation turns
    """
    if strategy == "direct":
        approach = "Use clear, persuasive arguments. Be direct but respectful."
    else:  # motivational
        approach = "Use motivational interviewing: ask open-ended questions, reflect concerns, explore ambivalence."
    
    # Build the format template from n_turns so it matches the requested length
    format_lines = "\n".join(f"Turn {i}: [message]" for i in range(1, n_turns + 1))
    
    prompt = f"""Create a {n_turns}-turn conversation script for a persuasion experiment.

Topic: {topic}
Position: {position}
Strategy: {approach}

Generate {n_turns} messages from a persuader trying to convince someone.
Each message should be 2-3 sentences.
Build on previous points, don't just repeat.

Format as:
{format_lines}

Return only the turns, no additional text."""
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8
    )
    
    content = response.choices[0].message.content
    
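    # Parse turns; tolerate markdown bold around the label and messages that wrap onto extra lines
    turns = []
    for line in content.split('\n'):
        stripped = line.strip().lstrip('*').strip()
        if stripped.startswith('Turn') and ':' in stripped:
            # Extract message after "Turn X:"
            turns.append(stripped.split(':', 1)[1].strip().lstrip('*').strip())
        elif stripped and turns:
            # Continuation line: append to the previous turn
            turns[-1] = (turns[-1] + " " + stripped).strip()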
    
    return turns


# Example: Generate conversation scripts
topic = "Implementing universal basic income"

print("Multi-Turn Conversation Scripts for Human Experiment")
print("=" * 70)
print(f"\nTopic: {topic}\n")

# Direct persuasion
print("STRATEGY 1: Direct Persuasion")
print("-" * 70)
direct_script = generate_conversation_script(topic, "support", n_turns=3, strategy="direct")
for i, turn in enumerate(direct_script, 1):
    print(f"\nTurn {i}:\n{turn}")

time.sleep(1)

# Motivational interviewing
print("\n" + "=" * 70)
print("STRATEGY 2: Motivational Interviewing")
print("-" * 70)
motivational_script = generate_conversation_script(topic, "support", n_turns=3, strategy="motivational")
for i, turn in enumerate(motivational_script, 1):
    print(f"\nTurn {i}:\n{turn}")

**What this code does:**

Generates **pre-scripted conversation sequences** for human experiments:

**Two strategies tested:**

1. **Direct persuasion**:
   - Straightforward arguments
   - Facts and logic
   - Traditional approach

2. **Motivational interviewing**:
   - Reflective listening
   - Open-ended questions
   - Collaborative approach

**Implementation in survey:**
```
Page 1: Show Turn 1 message → Participant reads
Page 2: Ask for response (optional) → Show Turn 2
Page 3: Show Turn 3 → Measure final attitude
```
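
The page layout above can be derived mechanically from the list of turns. A minimal sketch, assuming one turn per page with an optional reply after every turn except the last; `script_to_pages` and its field names are hypothetical and would need to be mapped onto your survey platform's own page or block structure.

```python
def script_to_pages(turns, collect_replies=True):
    """Turn a list of persuader messages into an ordered list of survey page specs."""
    pages = []
    for i, text in enumerate(turns, 1):
        is_last = i == len(turns)
        pages.append({
            "page": i,
            "display_text": text,
            # Ask for an optional open-text reply after every turn but the last
            "collect_reply": collect_replies and not is_last,
            # The final attitude measure follows the last turn
            "measure_attitude": is_last,
        })
    return pages

pages = script_to_pages(["First point.", "Second point.", "Closing point."])
print(pages[-1])
```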

**Argyle et al. (2023) findings:**
- Multi-turn conversations DID produce persuasion
- BUT: Not significantly more than one-shot messages
- Direct vs motivational: No significant difference
- **Implication**: Simple may be as effective as complex

**Why use multi-turn anyway:**
- Tests elaboration theory
- More engaging for participants
- Mimics real-world conversations
- Allows measuring dynamics over time

---

## Part 5: Personalized Outcomes (Velez & Liu Approach)

**Velez & Liu (2024)** innovation: Personalize BOTH treatment AND outcome measures

**Standard approach:**
- Researcher picks topic (e.g., "climate change")
- All participants respond to same topic

**Velez & Liu approach:**
- Let EACH participant identify their core issue
- Generate personalized persuasive arguments
- Create custom attitude measures
- Test persuasion on THEIR specific concern

**Advantage:** Maximum relevance and engagement

In [None]:
def generate_personalized_scales(issue_description, model="gpt-4o"):
    """
    Generate personalized Likert scale items for participant's core issue
    
    Args:
        issue_description: Participant's open-ended description of their concern
        model: Which model to use
    
    Returns:
        dict: Scale items and summary
    """
    # Step 1: Summarize the issue
    summary_prompt = f"""Summarize this person's political concern in ONE clear sentence:

{issue_description}

Return only the summary sentence, no additional text."""
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": summary_prompt}],
        temperature=0.3
    )
    
    summary = response.choices[0].message.content.strip()
    
    # Step 2: Generate personalized scale items
    scale_prompt = f"""Create 3 Likert scale items to measure attitude about:

{summary}

Requirements:
- Each item = clear statement (not question)
- Measure different aspects: strength, certainty, priority
- Use language from their concern
- Suitable for 1-7 scale (Strongly Disagree → Strongly Agree)

Format:
1. [item]
2. [item]
3. [item]

Return only the numbered items."""
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": scale_prompt}],
        temperature=0.7
    )
    
    scale_items = response.choices[0].message.content.strip()
    
    return {
        'original_description': issue_description,
        'summary': summary,
        'scale_items': scale_items
    }


def generate_personalized_arguments(summary, stance, intensity="moderate", model="gpt-4o"):
    """
    Generate pro/con arguments about participant's specific issue
    
    Args:
        summary: One-sentence summary of their concern
        stance: 'pro' (supports their position) or 'con' (opposes it)
        intensity: 'moderate', 'strong', or 'vitriolic'
        model: Which model
    
    Returns:
        str: Personalized argument
    """
    tone_map = {
        "moderate": "respectful and factual",
        "strong": "strongly worded but still civil",
        "vitriolic": "harsh and uncivil (for research purposes only)"
    }
    
    prompt = f"""Write a {tone_map[intensity]} argument that {'supports' if stance == 'pro' else 'opposes'}:

{summary}

Make it 2-3 sentences. Return only the argument text."""
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8
    )
    
    return response.choices[0].message.content.strip()


# Example: Velez & Liu personalized pipeline
example_responses = [
    """Healthcare costs are bankrupting families. I believe we need universal 
    healthcare so that no one has to choose between medical treatment and financial 
    ruin. The current system is unsustainable and morally wrong.""",
    
    """Immigration enforcement is too weak. We need stronger border security and 
    more strict enforcement of immigration laws to protect American workers and 
    maintain national sovereignty.""",
    
    """Climate change is the defining crisis of our generation. We must take 
    immediate aggressive action to transition to renewable energy and reduce 
    emissions before it's too late."""
]

print("Personalized Outcome Materials (Velez & Liu Approach)")
print("=" * 70)

for i, response in enumerate(example_responses, 1):
    print(f"\n\nPARTICIPANT {i}")
    print("=" * 70)
    
    print(f"\nOriginal response:\n{response}")
    
    # Generate personalized materials
    scales = generate_personalized_scales(response)
    
    print(f"\nSummary: {scales['summary']}")
    print(f"\nPersonalized scale items:\n{scales['scale_items']}")
    
    time.sleep(1)
    
    # Generate pro argument
    pro_arg = generate_personalized_arguments(scales['summary'], 'pro', 'moderate')
    print(f"\nSupporting argument:\n\"{pro_arg}\"")
    
    time.sleep(1)
    
    # Generate con argument
    con_arg = generate_personalized_arguments(scales['summary'], 'con', 'moderate')
    print(f"\nOpposing argument:\n\"{con_arg}\"")
    
    time.sleep(1)

**What this code does:**

Implements **Velez & Liu's personalized outcome approach**:

**Four-step pipeline:**

1. **Collect core issue** (from real participant)
   - Open-ended question: "What political issue matters most?"
   - Participant writes in own words

2. **Summarize with LLM**
   - Extract key concern
   - Create standardized format

3. **Generate personalized scales**
   - Create Likert items about THEIR issue
   - Measure strength, certainty, priority
   - Use their language

4. **Generate personalized arguments**
   - Pro: Supports their position
   - Con: Challenges their position
   - Different intensities available

**Implementation in real study:**
```
Survey Flow:
1. Page 1: "What political issue matters most to you?" → Collect text
2. [Backend]: LLM processes → Generates scales + arguments
3. Page 2: Show personalized pre-test scales
4. Page 3: Show personalized argument (pro or con)
5. Page 4: Show personalized post-test scales
6. Measure: Change in personalized attitudes
```
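
`generate_personalized_scales` returns the scale items as one numbered string, while the pre-test and post-test pages need each item separately. The sketch below shows one way to split them; `parse_scale_items` is a hypothetical helper and assumes the model followed the `1.` / `2.` / `3.` format requested in the prompt.

```python
import re

def parse_scale_items(raw_text):
    """Extract item text from lines like '1. ...' or '2) ...'; ignore everything else."""
    items = []
    for line in raw_text.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            items.append(match.group(1))
    return items

example = "1. I care about X.\n2. I am certain about X.\n\n3) X is a priority."
print(parse_scale_items(example))
```

If fewer items come back than were requested, that is a useful signal to regenerate or route the participant to a fallback measure.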

**Velez & Liu (2024) findings:**
- Even with maximum personalization, polarization was RARE
- Moderate arguments: No backfire
- Strong arguments: Little backfire
- Vitriolic arguments: Some attitude defense
- **Implication**: Polarization harder to induce than assumed

**Advantages of this approach:**
- Maximum relevance to each participant
- Tests "easy case" for polarization
- More engaging than generic topics
- Captures what people actually care about

**Technical challenges:**
- Requires API integration with survey platform
- Real-time generation during survey
- Quality control on generated content
- Costs scale with participants
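
Real-time generation also means a failed API call can stall a participant mid-survey. One common mitigation is to retry with exponential backoff. The sketch below is a generic wrapper, not part of the OpenAI client; `call_with_retries` and its parameters are illustrative names.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on any exception wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Example with a stand-in function; in the notebook this could wrap, e.g.,
# lambda: generate_personalized_scales(participant_text)
print(call_with_retries(lambda: "ok"))
```

In a live survey the total wait should stay short, so a small `max_attempts` plus a pre-written fallback message is a reasonable design choice.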

---

## Part 6: Deploying to Real Human Experiments

**Workflow summary:**

### Option A: Pre-Generated Messages (Simpler)

1. **Generate all materials** (using code above)
2. **Export to CSV** with condition IDs
3. **Import to Qualtrics/SurveyMonkey**
4. **Set up randomization**: Randomly assign to conditions
5. **Add measurements**: Pre/post attitude scales
6. **Deploy to platforms**: MTurk, Prolific, Lucid, etc.
7. **Collect responses** from real humans
8. **Analyze data**: Compare effectiveness
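
Step 2 mentions condition IDs, which the export cell earlier does not add. A minimal sketch of one way to attach them, using only the standard library; the `C001`-style ID format and both helper names are assumptions, and the rows stand in for `materials_df.to_dict("records")`.

```python
import csv
import io

def add_condition_ids(rows, prefix="C"):
    """Return copies of rows with a stable, zero-padded condition_id added first."""
    return [{"condition_id": f"{prefix}{i:03d}", **row} for i, row in enumerate(rows, 1)]

def rows_to_csv(rows):
    """Serialize rows to CSV text; the header is the union of keys in first-seen order."""
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Placeholder rows, for illustration only
rows = add_condition_ids([
    {"condition": "control", "message": "m1"},
    {"condition": "generic", "message": "m2", "position": "support"},
])
print(rows_to_csv(rows))
```

Recording the same `condition_id` in the survey's embedded data lets responses be joined back to the exact message text at analysis time.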

---

## Open-Source Alternatives

All message generation can be done with open-source models:

### Using Ollama (Local)

```python
from ollama import Client as OllamaClient

ollama_client = OllamaClient(host='http://localhost:11434')

def generate_message_ollama(topic, position, model="llama3.2"):
    prompt = f"Write a brief persuasive message (2-3 sentences) arguing people should {position} {topic}."
    
    response = ollama_client.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.8}
    )
    
    return response['message']['content']
```

**Advantages:**
- Free (local execution)
- Private (no data sent to API)
- More reproducible (model weights stay fixed; sampled outputs still vary unless you set a seed)

**Disadvantages:**
- Slower (depends on hardware)
- Often lower quality (smaller models)
- Requires local installation

### Using Hugging Face (Full Control)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # Llama 3.2 has no 8B text model; 3.1 does
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

def generate_message_hf(topic, position):
    prompt = f"Write a brief persuasive message (2-3 sentences) arguing people should {position} {topic}."
    
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device)
    
    outputs = model.generate(
        inputs,
        max_new_tokens=100,
        temperature=0.8,
        do_sample=True
    )
    
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
```

**Best models for message generation:**
- Llama 3.1 (8B): Good quality, reasonable speed
- Mistral 7B: Fast and capable
- Qwen 2.5 (7B): Strong instruction following

---

## Summary

**What we learned:**

1. ✓ How to generate **generic persuasive messages**
2. ✓ How to create **microtargeted messages** for demographic groups
3. ✓ How to produce **complete experimental materials** at scale
4. ✓ How to generate **multi-turn conversation scripts**
5. ✓ How to implement **personalized outcomes** (Velez & Liu)
6. ✓ How to **deploy materials** to real human experiments

**Key insights from research:**

- **Argyle et al. (2023)**: Simple approaches work as well as complex
  - Generic messages ≈ Microtargeted messages
  - One-shot ≈ Multi-turn conversations
  - Implication: Start simple!

- **Velez & Liu (2024)**: Polarization is hard to produce
  - Even with maximum personalization
  - Only extreme/vitriolic messages showed effects
  - Implication: Backfire may be rarer than assumed

**Best practices:**

1. **Generate multiple variants**: Test different approaches
2. **Always review outputs**: Check quality and ethics
3. **Include control groups**: Essential for causal inference
4. **Test on real humans**: LLMs generate, humans respond
5. **Document everything**: Model versions, prompts, parameters
6. **Consider ethics**: IRB approval, informed consent, no harm
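
For the "document everything" practice, each generated message can be stored alongside the settings that produced it. A minimal sketch, assuming one JSON record per message; `provenance_record` and its field names are hypothetical, not an established schema.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(prompt, model, temperature, output):
    """Bundle a generated message with the settings needed to audit it later."""
    return {
        "model": model,
        "temperature": temperature,
        # The hash makes it easy to detect prompt changes between runs
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "output": output,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

record = provenance_record("example prompt", "gpt-4o", 0.8, "example output")
print(json.dumps(record, indent=2))
```

Appending one such JSON line per call to a log file gives a record that can be filed with the preregistration or replication materials.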

**When to use LLM message generation:**

- ✓ Need many message variants quickly
- ✓ Testing personalization hypotheses
- ✓ Scaling interventions across topics
- ✓ Exploring message design space
- ✗ As a substitute for testing on real humans
- ✗ When outputs will not be reviewed for quality and ethics

**Limitations:**

- LLM-generated ≠ expert-written (quality varies)
- Must be tested on real humans (not LLMs)
- Ethical review required (potential manipulation)
- Results may not generalize across topics

**Next steps:**

- Generate materials for your own research questions
- Deploy to survey platforms with real participants
- Analyze actual human behavioral responses
- Compare to manually-written messages
- Explore boundary conditions (when does it work?)