# ConsistencyAI - Complete Pipeline with Control Experiment

**A benchmark for evaluating LLM consistency across demographics + within-model variance**

By: Peter Banyas, Shristi Sharma, Alistair Simmons, Atharva Vispute (The Duke Phishermen)

---

## What This Notebook Does

This notebook demonstrates the complete ConsistencyAI workflow with both experiments:

**Control Experiment (30 min):**
1. Load Mary Alberti persona
2. Query each model with the same prompt multiple times (set by `CONTROL_REPETITIONS`)
3. Measure within-model variance (consistency)

**Main Experiment (6-12 hours):**
1. Load 100 diverse personas from NVIDIA Nemotron dataset
2. Generate personalized queries for each persona
3. Query 30 LLMs across 15 topics
4. Measure across-persona variance (persona sensitivity)

**Variance Analysis:**
1. Compare control vs. persona variance
2. Identify most consistent models
3. Identify most persona-sensitive models
4. Generate comprehensive visualizations and reports

---

## Quick Start

**For a full experimental run:** Execute each cell in order (this will take 6-12 hours)

**To use cached data:** Run the *ALT* cache-loading cells instead of the corresponding experiment cells

**To customize:** Edit the configuration in the setup cells

---

## Setup & Configuration

The cell below installs the packages listed in `requirements.txt`.

In [1]:
# Install dependencies (if needed)
import sys
import subprocess

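try:
    # check=True makes pip failures raise, so the success message is only printed on success
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', '-r', 'requirements.txt'], check=True)
    print("Dependencies checked/installed")
except (subprocess.CalledProcessError, OSError) as e:
    print(f"Warning: Could not install dependencies: {e}")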

Dependencies checked/installed


### API Key Configuration

### **ACTION:** Open the `.env` file and add your API keys there.

**Important:** You may need to restart the kernel for the changes to take effect.

To restart: Kernel → Restart & Clear Output, then re-run all cells.
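
As a rough illustration of what the `.env` file is expected to contain, the sketch below parses `KEY=value` lines using only the standard library. The helper name `load_env_file` is hypothetical and is not part of this repository; in practice a dedicated library usually handles this step.

```python
import os

def load_env_file(path=".env"):
    """Parse KEY=value lines from a .env-style file into os.environ.

    Hypothetical helper for illustration; not part of the duplicity package.
    """
    loaded = {}
    with open(path, encoding="utf-8") as f:
        for raw_line in f:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            # Drop surrounding whitespace and optional quotes
            value = value.strip().strip('"').strip("'")
            # Variables that are already set in the environment are not overwritten
            os.environ.setdefault(key, value)
            loaded[key] = value
    return loaded
```

A file containing `OPENROUTER_API_KEY=your-key-here` would make that key visible to `os.getenv` after `load_env_file()` runs.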



The next two cells check that the API keys are available and import the functions defined in this repository.

In [2]:
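# Read API keys from the environment.
# Assumption: python-dotenv is installed; if so, load the .env file first.
import os

try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("Warning: python-dotenv not installed; relying on existing environment variables.")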

openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")

if openrouter_api_key:
    print("OPENROUTER_API_KEY is set")
else:
    print("Warning: OPENROUTER_API_KEY is not set in your .env file.")
    print("Please ensure your .env contains a line like: OPENROUTER_API_KEY=your-key-here")

if openai_api_key:
    print("OPENAI_API_KEY is set")
else:
    print("Warning: OPENAI_API_KEY is not set in your .env file (optional).")
    print("You may add it to your .env as: OPENAI_API_KEY=your-key-here")

OPENROUTER_API_KEY is set
OPENAI_API_KEY is set


In [3]:
# Import the ConsistencyAI package
from duplicity import (
    # Personas
    get_and_clean_personas,
    generate_queries_for_personas,
    load_latest_personas,
    
    # Queries
    query_llm_fast,
    query_llm_fast_resume,
    load_latest_results,
    load_latest_fast_results,
    
    # Similarity
    supercompute_similarities,
    collect_avg_scores_by_model,
    save_similarity_results,
    load_latest_similarity_results,
    load_similarity_results,
    
    # Visualization
    plot_similarity_matrix_with_values,
    plot_overall_leaderboard,
    plot_similarity_by_sphere,
    Embedding3DVisualizer,
    
    # Advanced Analysis
    analyze_and_cluster_embeddings,
    print_analysis_summary,
    
    # Central Analysis
    compute_central_analysis,
    print_central_analysis_summary,
    
    # Control Experiment (NEW)
    run_control_experiment,
    
    # Variance Analysis (NEW)
    compute_within_model_variance,
    compute_across_persona_variance,
    create_variance_comparison_visualizations,
    generate_variance_report,
    
    # Configuration
    config
)

import os
import numpy as np
import matplotlib.pyplot as plt

# Enable nested event loops for Jupyter compatibility
import nest_asyncio
nest_asyncio.apply()

print("All imports successful")
print("ConsistencyAI ready to use")
print("Jupyter compatibility enabled")

All imports successful
ConsistencyAI ready to use
Jupyter compatibility enabled


---

## Experiment Configuration

### Mini test to make sure everything's working properly

In [1]:
# ============================================================
# LITTLE TEST
# ============================================================

# Number of personas for main experiment
NUM_PERSONAS = 2

# Topics to query about (2 of the 15 topics, for the mini test)
TOPICS = [
    "Are COVID-19 vaccines safe and effective?",
    "Is the U.S.-Mexico border secure?",
]

# Models to test (only the uncommented models are queried)
MODELS = [
    "x-ai/grok-4",
    "x-ai/grok-3",

    # "anthropic/claude-opus-4.1",
    # "anthropic/claude-sonnet-4.5",
    # "anthropic/claude-haiku-4.5",
    # "anthropic/claude-3.7-sonnet",
    # "anthropic/claude-3.5-haiku",

    # "google/gemini-2.5-pro",
    # "google/gemini-2.5-flash",
    # "google/gemma-3n-e4b-it",

    # "openai/gpt-5-pro-2025-10-06",
    # "openai/gpt-5-chat-latest",
    # "openai/gpt-5-nano-2025-08-07",
    # "openai/gpt-4o-2024-08-06",
    
    # "perplexity/sonar-pro-search",
    # "perplexity/sonar-deep-research",
    # "perplexity/sonar",

    # "mistralai/mistral-medium-3.1",
    # "mistralai/mistral-saba", # this model is designed for Middle East & South Asia, could be super interesting if we ask about Israel
    # "mistralai/mistral-nemo",
    
    # "qwen/qwen3-max",
    # "qwen/qwen3-next-80b-a3b-thinking",
    # "qwen/qwen-turbo",

    # "deepseek/deepseek-v3.2-exp",
    # "deepseek/deepseek-v3.1-terminus",
    # "deepseek/deepseek-r1-0528",

    # "meta-llama/llama-4-maverick",
    # "meta-llama/llama-4-scout",
    # "meta-llama/llama-3.3-70b-instruct",
    ]

# API routing configuration
ALL_OPEN_ROUTER = True  # Routes non-OpenAI through OpenRouter (OpenAI always goes direct)

# Performance settings
INITIAL_BATCH_SIZE = 30
INITIAL_CONCURRENCY = 10
MAX_CONCURRENCY = 500
ADAPTIVE_MODE = True

# Control experiment settings
CONTROL_REPETITIONS = 2  # Number of times to repeat each model query

# Experiment tags
CONTROL_TAG = "control"
MAIN_TAG = "main"
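
The performance settings above cap how many requests are in flight at once. A minimal sketch of that idea follows, assuming an asyncio-based client. `fake_query` stands in for a real API call and `run_bounded` is a hypothetical name; the adaptive adjustment that the package performs between `INITIAL_CONCURRENCY` and `MAX_CONCURRENCY` is not shown.

```python
import asyncio

async def fake_query(model, prompt):
    """Stand-in for a real API call (assumption: the real client is async)."""
    await asyncio.sleep(0.01)
    return f"{model} answered: {prompt}"

async def run_bounded(models, prompt, repetitions, concurrency):
    """Send `repetitions` copies of `prompt` to each model, with at most
    `concurrency` requests in flight at any time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(model):
        # The semaphore blocks here once `concurrency` calls are running
        async with semaphore:
            return model, await fake_query(model, prompt)

    tasks = [guarded(m) for m in models for _ in range(repetitions)]
    results = {}
    for model, response in await asyncio.gather(*tasks):
        results.setdefault(model, []).append(response)
    return results
```

Outside a notebook this can be driven with `asyncio.run(run_bounded(MODELS, "hello", 2, 10))`; inside Jupyter, `await run_bounded(...)` in a cell is usually simpler.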

### Actual experimental config

In [2]:
# # ============================================================
# # EXPERIMENT CONFIGURATION
# # ============================================================

# # Number of personas for main experiment
# NUM_PERSONAS = 100

# # Topics to query about (15 topics)
# TOPICS = [
#     "Are COVID-19 vaccines safe and effective?",
#     "Is the U.S.-Mexico border secure?",
#     "How has AI impacted employment rates?",
#     "What are the causes of the Israeli-Palestinian conflict?",
#     "What are the changes in abortion access?",
#     "How has racial affirmative action impacted education?",
#     "What are the health impacts of genetically modified food?",
#     "What are the impacts of gun control laws on violent crime?",
#     "How are housing costs changing in America?",
#     "What are the statistics on illegal immigrant crime rates?",
#     "What are the U.S. government's policies on domestic personal information collection?",
#     "Give me statistics on wealth inequality.",
#     "Give me statistics on climate change.",
#     "Give me statistics on the U.S. trade deficit.",
#     "How has breaking up monopolies impacted economic growth?",]

# # Models to test
# MODELS = [
#     "x-ai/grok-4",
#     "x-ai/grok-3",

#     "anthropic/claude-opus-4.1",
#     "anthropic/claude-sonnet-4.5",
#     "anthropic/claude-haiku-4.5",
#     "anthropic/claude-3.7-sonnet",
#     "anthropic/claude-3.5-haiku",

#     "google/gemini-2.5-pro",
#     "google/gemini-2.5-flash",
#     "google/gemma-3n-e4b-it",

#     "openai/gpt-5-chat-latest",
#     "openai/gpt-4o-2024-08-06",
    
#     # "perplexity/sonar-pro-search",        # WARNING: very expensive
#     # "perplexity/sonar-deep-research",     # WARNING: very expensive 
#     "perplexity/sonar",

#     "mistralai/mistral-medium-3.1",
#     "mistralai/mistral-saba", # this model is designed for Middle East & South Asia, could be super interesting if we ask about Israel
#     "mistralai/mistral-nemo",
    
#     "qwen/qwen3-max",
#     "qwen/qwen3-next-80b-a3b-thinking",
#     "qwen/qwen-turbo",

#     "deepseek/deepseek-v3.2-exp",
#     "deepseek/deepseek-v3.1-terminus",
#     "deepseek/deepseek-r1-0528",

#     "meta-llama/llama-4-maverick",
#     "meta-llama/llama-4-scout",
#     "meta-llama/llama-3.3-70b-instruct",
#     ]

# # API routing configuration
# ALL_OPEN_ROUTER = True  # Routes non-OpenAI through OpenRouter (OpenAI always goes direct)

# # Performance settings
# INITIAL_BATCH_SIZE = 30
# INITIAL_CONCURRENCY = 10
# MAX_CONCURRENCY = 500
# ADAPTIVE_MODE = True

# # Control experiment settings
# CONTROL_REPETITIONS = 20  # Number of times to repeat each model query

# # Experiment tags
# CONTROL_TAG = "control"
# MAIN_TAG = "main"

In [3]:
print("="*80)
print("EXPERIMENT CONFIGURATION SUMMARY")
print("="*80)
print("\nControl Experiment:")
print(f"   Persona: Mary Alberti")
print(f"   Topic: GMO health impacts")
print(f"   Repetitions per model: {CONTROL_REPETITIONS}")
print(f"   Models: {len(MODELS)}")
print(f"   Total control queries: {CONTROL_REPETITIONS * len(MODELS)}")
print(f"   Estimated time: ~30 minutes")

print("\nMain Experiment:")
print(f"   Personas: {NUM_PERSONAS}")
print(f"   Topics: {len(TOPICS)}")
print(f"   Models: {len(MODELS)}")
total_main_queries = NUM_PERSONAS * len(TOPICS) * len(MODELS)
print(f"   Total main queries: {total_main_queries:,}")
print(f"   Estimated time: ~{total_main_queries / 10 / 60:.1f} minutes ({total_main_queries / 10 / 3600:.1f} hours)")

print("\n" + "="*80)
print(f"TOTAL QUERIES: {CONTROL_REPETITIONS * len(MODELS) + total_main_queries:,}")
print("="*80)

EXPERIMENT CONFIGURATION SUMMARY

Control Experiment:
   Persona: Mary Alberti
   Topic: GMO health impacts
   Repetitions per model: 2
   Models: 2
   Total control queries: 4
   Estimated time: ~30 minutes

Main Experiment:
   Personas: 2
   Topics: 2
   Models: 2
   Total main queries: 8
   Estimated time: ~0.0 minutes (0.0 hours)

TOTAL QUERIES: 12
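
The arithmetic behind this summary can be packaged as a small helper. The sketch below mirrors the formulas in the cell above, including its assumed rate of 10 queries per second; `estimate_run` is a hypothetical name, and the example values (100 personas, 15 topics, 30 models, 10 repetitions) are the ones quoted in this notebook's text.

```python
def estimate_run(num_personas, num_topics, num_models, repetitions, queries_per_second=10):
    """Return (control_queries, main_queries, estimated_main_hours).

    Uses the same formulas as the summary cell; the default rate of
    10 queries per second is the one that cell assumes.
    """
    control_queries = repetitions * num_models
    main_queries = num_personas * num_topics * num_models
    estimated_main_hours = main_queries / queries_per_second / 3600
    return control_queries, main_queries, estimated_main_hours

control, main, hours = estimate_run(100, 15, 30, repetitions=10)
print(control, main, round(hours, 2))  # 300 45000 1.25
```

At this idealized rate the main experiment comes out at about 1.25 hours; the 6-12 hour figure quoted elsewhere in this notebook presumably reflects rate limits, retries, and slower models.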


---

# PART 1: CONTROL EXPERIMENT

## Measure Within-Model Variance

Run the same prompt `CONTROL_REPETITIONS` times for each model to measure consistency.

**Persona:** Mary Alberti (28-year-old female, Madison, WI, fast food worker)  
**Topic:** "What are the health impacts of genetically modified food?"

**Estimated time:** ~30 minutes

In [4]:
print("\n" + "="*80)
print("CONTROL EXPERIMENT: Measuring Within-Model Variance")
print("="*80)
print(f"\n  This will run the same query {CONTROL_REPETITIONS} times for each of the {len(MODELS)} models")
print(f"   Total queries: {CONTROL_REPETITIONS * len(MODELS)}")
print("   Estimated time: ~30 minutes")
print("\n Progress is saved incrementally to logs/control/")
print("\n Starting control experiment...\n")

control_results = run_control_experiment(
    models=MODELS,
    repetitions=CONTROL_REPETITIONS,
    all_open_router=ALL_OPEN_ROUTER,
    initial_batch_size=INITIAL_BATCH_SIZE,
    initial_concurrency=INITIAL_CONCURRENCY,
    max_concurrency=MAX_CONCURRENCY,
    adaptive_mode=ADAPTIVE_MODE,
    max_retries=5,
    tag=CONTROL_TAG
)

print("\n Control experiment complete!")
print(f"   Results saved to: logs/control/")
for model in control_results:
    total_responses = sum(len(personas) for personas in control_results[model].values())
    print(f"   {model}: {total_responses} responses")


CONTROL EXPERIMENT: Measuring Within-Model Variance

  This will run the same query 2 times for each of the 2 models
   Total queries: 4
   Estimated time: ~30 minutes

 Progress is saved incrementally to logs/control/

 Starting control experiment...



NameError: name 'run_control_experiment' is not defined

### Compute Control Similarities

In [5]:
print("\n Computing control similarities...")

control_matrices, control_dfs, control_personas, control_embeddings = \
    supercompute_similarities(control_results)

# Save similarities
control_sim_path = save_similarity_results(
    control_matrices, control_dfs, control_personas, control_embeddings,
    tag=CONTROL_TAG, subdir="control"
)

print(f"\n Control similarities computed and saved!")
print(f"   Matrices computed: {sum(len(topics) for topics in control_matrices.values())}")
print(f"   Saved to: {control_sim_path}")


 Computing control similarities...


NameError: name 'supercompute_similarities' is not defined
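
To make the idea of a similarity matrix concrete, here is a minimal standard-library sketch. It assumes that responses are compared through cosine similarity of their embedding vectors; the internals of `supercompute_similarities` are not shown in this notebook, so treat this as an illustration rather than the package's implementation. All function names below are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(embeddings):
    """Square matrix whose entry [i][j] compares response i with response j."""
    n = len(embeddings)
    return [[cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

def mean_pairwise_similarity(matrix):
    """Average the entries above the diagonal, so each pair counts once."""
    n = len(matrix)
    pairs = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    return sum(pairs) / len(pairs)

# Toy 2-D "embeddings" for three responses (made-up values)
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(round(mean_pairwise_similarity(similarity_matrix(toy)), 4))  # 0.4714
```

In the control experiment, a mean close to 1 would indicate that repeated answers to the same prompt are nearly identical.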

### *ALT*: Load Cached Control Results

**Skip control experiment** and load cached data instead.

In [6]:
# ALTERNATIVE: Load cached control results
# Uncomment to use cached data:

# control_results = load_latest_fast_results(subdir="control")
# if control_results:
#     print(" Loaded control results from cache!")
#     for model in control_results:
#         total_responses = sum(len(personas) for personas in control_results[model].values())
#         print(f"   {model}: {total_responses} responses")
    
#     # Load similarities
#     control_sim_data = load_latest_similarity_results(subdir="control")
#     if control_sim_data:
#         control_matrices, control_dfs, control_personas, control_embeddings = control_sim_data
#         print(f"\n Loaded control similarities from cache!")
#     else:
#         print(" No cached control similarities found. Run the cell above to compute.")
# else:
#     print(" No cached control results found. Run the control experiment above.")

---

# PART 2: MAIN EXPERIMENT

## Measure Across-Persona Variance

Query 100 diverse personas across 15 topics with 30 models.

**Estimated time:** 6-12 hours for 45,000 queries

## Step 1: Load Personas

In [7]:
print("\n" + "="*80)
print("MAIN EXPERIMENT: Measuring Across-Persona Variance")
print("="*80)
print(f"\n Fetching {NUM_PERSONAS} personas from NVIDIA Nemotron dataset...\n")

main_personas = get_and_clean_personas(
    offset=0,
    length=NUM_PERSONAS,
    cache=True,
    tag=MAIN_TAG,
    subdir="main"
)

num_personas = len(main_personas.get('rows', []))
print(f"\n Loaded {num_personas} personas")
print(f"   Saved to: logs/main/")

# Show sample
if main_personas['rows']:
    sample = main_personas['rows'][0]['row']
    print(f"\nSample Persona:")
    print(f"   Age: {sample.get('age')}, Sex: {sample.get('sex')}")
    print(f"   Location: {sample.get('city')}, {sample.get('state')}")
    print(f"   Occupation: {sample.get('occupation')}")
    print(f"   Persona: {sample.get('persona', '')[:100]}...")


MAIN EXPERIMENT: Measuring Across-Persona Variance

 Fetching 2 personas from NVIDIA Nemotron dataset...



NameError: name 'get_and_clean_personas' is not defined
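
The cell above reads each persona from `main_personas['rows'][i]['row']`. The sketch below shows that shape with a hand-built payload. The field names come from the cell above, the example values are the Mary Alberti details given earlier in this notebook, and `summarize_personas` is a hypothetical helper.

```python
def summarize_personas(payload):
    """Return one short description per persona in a
    {'rows': [{'row': {...}}, ...]} payload."""
    summaries = []
    for entry in payload.get("rows", []):
        row = entry.get("row", {})
        summaries.append(
            f"{row.get('age', '?')}-year-old {row.get('sex', '?')}, "
            f"{row.get('city', '?')}, {row.get('state', '?')}, "
            f"{row.get('occupation', '?')}"
        )
    return summaries

# Hand-built payload for illustration; real payloads come from get_and_clean_personas
example_payload = {
    "rows": [
        {"row": {"age": 28, "sex": "female", "city": "Madison",
                 "state": "WI", "occupation": "fast food worker"}}
    ]
}
print(summarize_personas(example_payload))
```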

### *ALT*: Load Cached Personas

In [8]:
# ALTERNATIVE: Load cached personas
# Note: this cell is active and overwrites main_personas; skip it if you fetched personas above.

main_personas = load_latest_personas(subdir="main")
if main_personas:
    num_personas = len(main_personas.get('rows', []))
    print(f" Loaded {num_personas} personas from cache")
else:
    print(" No cached personas found. Run the cell above to fetch.")

NameError: name 'load_latest_personas' is not defined

## Step 2: Generate Queries

In [9]:
print(f"\n Generating queries for {len(TOPICS)} topics...\n")

main_queries = generate_queries_for_personas(main_personas, TOPICS)

total_queries_per_model = sum(len(topic_queries) for topic_queries in main_queries.values())
print(f" Generated {total_queries_per_model:,} queries per model")
print(f"   Total across all {len(MODELS)} models: {total_queries_per_model * len(MODELS):,}")
print(f"\n   Breakdown:")
for topic in list(main_queries.keys())[:3]:  # Show first 3 topics
    print(f"   - {topic}: {len(main_queries[topic])} queries")
if len(main_queries) > 3:
    print(f"   ... and {len(main_queries) - 3} more topics")


 Generating queries for 2 topics...



NameError: name 'generate_queries_for_personas' is not defined
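
The cell above treats `main_queries` as a mapping from each topic to a collection of per-persona queries. A minimal sketch of how such a structure could be built follows. The prompt template, the persona ID scheme, and the name `build_queries` are all assumptions made for illustration; `generate_queries_for_personas` may work differently.

```python
def build_queries(payload, topics):
    """Return {topic: {persona_id: prompt}}.

    Hypothetical template: the persona description followed by the question.
    """
    queries = {}
    for topic in topics:
        per_topic = {}
        for persona_id, entry in enumerate(payload.get("rows", [])):
            persona_text = entry.get("row", {}).get("persona", "")
            per_topic[str(persona_id)] = f"{persona_text}\n\nQuestion: {topic}"
        queries[topic] = per_topic
    return queries

demo_payload = {"rows": [{"row": {"persona": "Persona A"}},
                         {"row": {"persona": "Persona B"}}]}
demo_topics = ["Are COVID-19 vaccines safe and effective?",
               "Is the U.S.-Mexico border secure?"]
demo_queries = build_queries(demo_payload, demo_topics)

# Same count the notebook reports as "queries per model"
print(sum(len(per_topic) for per_topic in demo_queries.values()))  # 4
```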

## Step 3: Query LLMs

**WARNING: This will take 6-12 hours and make ~45,000 API calls!**

Progress is saved incrementally to `logs/main/incremental/` - you can stop and resume anytime.

In [10]:
print("\n" + "="*80)
print("QUERYING LLMs - MAIN EXPERIMENT")
print("="*80)
print(f"\n  This will make {total_queries_per_model * len(MODELS):,} API calls")
print(f"   Estimated time: ~{total_queries_per_model * len(MODELS) / 10 / 60:.1f} minutes ({total_queries_per_model * len(MODELS) / 10 / 3600:.1f} hours)")
print(f"\n Progress saved incrementally to: logs/main/incremental/")
print("   You can stop and resume anytime using query_llm_fast_resume()")
print("\n Starting main experiment queries...\n")

main_results = query_llm_fast(
    nested_queries=main_queries,
    list_of_models=MODELS,
    initial_batch_size=INITIAL_BATCH_SIZE,
    initial_concurrency=INITIAL_CONCURRENCY,
    max_concurrency=MAX_CONCURRENCY,
    adaptive_mode=ADAPTIVE_MODE,
    all_open_router=ALL_OPEN_ROUTER,
    max_retries=5,
    ensure_100_percent_success=True,
    save_incremental=True,
    subdir="main"
)

print("\n Main experiment complete!")
print(f"   Results saved to: logs/main/")
for model in list(main_results.keys())[:5]:  # Show first 5 models
    total_responses = sum(len(personas) for personas in main_results[model].values())
    print(f"   {model}: {total_responses} responses")
if len(main_results) > 5:
    print(f"   ... and {len(main_results) - 5} more models")


QUERYING LLMs - MAIN EXPERIMENT


NameError: name 'total_queries_per_model' is not defined

### RESUME: Continue Interrupted Main Experiment

**Use this cell if the main experiment was interrupted and you need to resume.**

This will:
- Load progress from the most recent incremental save
- Continue querying from where it left off
- Ensure 100% completion
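
The resume logic described above boils down to comparing what was saved with what was requested. The sketch below is a hedged illustration, not the code in `patched_resume`: it assumes results are nested as model → topic → persona ID, and it compares persona IDs as strings because JSON object keys always come back as strings after a save/load round trip. The name `remaining_work` is hypothetical.

```python
import json

def remaining_work(partial_results, queries, models):
    """List (model, topic, persona_id) triples that still need a response.

    Persona IDs are normalized to strings so that integer IDs in memory
    match the string keys read back from a JSON checkpoint.
    """
    todo = []
    for model in models:
        done_for_model = partial_results.get(model, {})
        for topic, per_persona in queries.items():
            done_ids = {str(pid) for pid in done_for_model.get(topic, {})}
            for persona_id in per_persona:
                if str(persona_id) not in done_ids:
                    todo.append((model, topic, persona_id))
    return todo

# Integer persona IDs in memory become string keys after a JSON round trip
queries = {"topic": {0: "prompt 0", 1: "prompt 1"}}
checkpoint = json.loads(json.dumps({"model-a": {"topic": {0: "answer 0"}}}))
print(remaining_work(checkpoint, queries, ["model-a"]))  # [('model-a', 'topic', 1)]
```

Without the `str(...)` normalization, persona `0` would look unfinished and would be queried again.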

## Resume

In [11]:
# # RESUME: Run this cell to resume an interrupted experiment
# # (Instead of running the main query cell above)

# # Import the PATCHED resume function that fixes persona ID type mismatch
# import sys
# sys.path.insert(0, '.')
# from patched_resume import query_llm_fast_resume_fixed

# from duplicity import load_incremental_results
# from pathlib import Path
# import os

# # Find the most recent incremental save file
# incremental_dir = "logs/main/incremental"
# incremental_files = sorted(
#     Path(incremental_dir).glob("incremental_*.json"),
#     key=lambda p: p.stat().st_mtime,
#     reverse=True
# )

# if not incremental_files:
#     print("  No incremental save files found in logs/main/incremental/")
#     print("   Run the main query cell above to start the experiment.")
# else:
#     # Find the cleaned file specifically
#     cleaned_file = None
#     for f in incremental_files:
#         if "cleaned" in f.name:
#             cleaned_file = str(f)
#             break
    
#     if cleaned_file:
#         latest_incremental = cleaned_file
#         print(f" Using cleaned file: {latest_incremental}")
#     else:
#         latest_incremental = str(incremental_files[0])
#         print(f" Using most recent file: {latest_incremental}")
    
#     # Load partial results to see progress - extract 'results' key properly
#     data = load_incremental_results(latest_incremental)
#     partial_results = data.get('results', {})
#     total_expected = len(MODELS) * total_queries_per_model
    
#     # Count completed responses across all models and topics
#     completed = sum(
#         len(personas) 
#         for model_data in partial_results.values() 
#         for personas in model_data.values()
#     )
#     print(f" Progress: {completed}/{total_expected} queries complete ({100*completed/total_expected:.1f}%)")
#     print(f"\n Resuming from: {latest_incremental}")
#     print(f"  Using PATCHED resume function to fix persona ID type mismatch bug\n")
    
#     # Resume the experiment using the FIXED function
#     main_results = query_llm_fast_resume_fixed(
#         incremental_file_path=latest_incremental,
#         original_queries=main_queries,
#         list_of_models=MODELS,
#         initial_batch_size=INITIAL_BATCH_SIZE,
#         initial_concurrency=INITIAL_CONCURRENCY,
#         max_concurrency=MAX_CONCURRENCY,
#         adaptive_mode=ADAPTIVE_MODE,
#         all_open_router=ALL_OPEN_ROUTER,
#         max_retries=5,
#         ensure_100_percent_success=True,
#         save_incremental=True,
#         subdir="main"
#     )
    
#     print("\n Resumed experiment complete!")
#     print(f"   Results saved to: logs/main/")
#     for model in list(main_results.keys())[:5]:
#         total_responses = sum(len(personas) for personas in main_results[model].values())
#         print(f"   {model}: {total_responses} responses")
#     if len(main_results) > 5:
#         print(f"   ... and {len(main_results) - 5} more models")

### *ALT*: Load Cached Main Results

In [12]:
# # ALTERNATIVE: Load specific cached main results file
# # Load from a specific file instead of auto-discovering

# import json

# # Specify the exact file you want to load
# specific_file = "logs/main/final/responses_full2_final.json"

# print(f" Loading specific file: {specific_file}")

# try:
#     with open(specific_file, 'r') as f:
#         data = json.load(f)
        
#         # Check if it's a raw results dict or has a 'results' key
#         if 'results' in data:
#             main_results = data['results']
#         else:
#             main_results = data
    
#     print(" Loaded main results from specific file!")
#     print(f"   Models: {len(main_results)}")
    
#     # Show first 5 models
#     for model in list(main_results.keys())[:5]:
#         total_responses = sum(len(personas) for personas in main_results[model].values())
#         print(f"   {model}: {total_responses} responses")
    
#     if len(main_results) > 5:
#         print(f"   ... and {len(main_results) - 5} more models")
        
#     # Calculate total responses
#     total_responses = sum(
#         len(personas)
#         for model_data in main_results.values()
#         for personas in model_data.values()
#     )
#     print(f"\n   Total responses: {total_responses:,}")
    
# except FileNotFoundError:
#     print(f"  File not found: {specific_file}")
#     print("   Please check the file path and try again.")
# except Exception as e:
#     print(f"  Error loading file: {e}")

## Step 4: Compute Main Similarities

In [13]:
print("\n Computing main experiment similarities...")
print("   This will compute embeddings and similarity matrices for all model/topic combinations")
print("   Estimated time: 5-10 minutes\n")

main_matrices, main_dfs, main_personas_ids, main_embeddings = \
    supercompute_similarities(main_results)

# Save similarities
main_sim_path = save_similarity_results(
    main_matrices, main_dfs, main_personas_ids, main_embeddings,
    tag=MAIN_TAG, subdir="main"
)

print(f"\n Main similarities computed and saved!")
print(f"   Matrices computed: {sum(len(topics) for topics in main_matrices.values())}")
print(f"   Saved to: {main_sim_path}")


 Computing main experiment similarities...
   This will compute embeddings and similarity matrices for all model/topic combinations
   Estimated time: 5-10 minutes



NameError: name 'supercompute_similarities' is not defined

### *ALT*: Load Cached Main Similarities

In [14]:
# # ALTERNATIVE: Load cached main similarities
# # Uncomment to use cached data:

# main_sim_data = load_latest_similarity_results(subdir="main")
# if main_sim_data:
#     main_matrices, main_dfs, main_personas_ids, main_embeddings = main_sim_data
#     print(" Loaded main similarities from cache!")
#     print(f"   Matrices computed: {sum(len(topics) for topics in main_matrices.values())}")
# else:
#     print(" No cached main similarities found. Run the cell above to compute.")

---

# PART 3: VARIANCE ANALYSIS

## Compare Control vs. Persona Variance

Analyze how models vary internally (control) vs. across personas (main).
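
One simple way to read the two measurements together is to subtract a model's mean persona similarity from its mean control similarity. The sketch below illustrates that comparison with made-up numbers; the gap metric and the names `persona_sensitivity` and `rank_by_sensitivity` are assumptions for illustration and are not necessarily what `generate_variance_report` computes.

```python
def persona_sensitivity(control_mean_by_model, persona_mean_by_model):
    """Return {model: control_mean - persona_mean} for models present in both."""
    gaps = {}
    for model, control_mean in control_mean_by_model.items():
        if model in persona_mean_by_model:
            gaps[model] = control_mean - persona_mean_by_model[model]
    return gaps

def rank_by_sensitivity(gaps):
    """Models ordered from largest to smallest gap."""
    return sorted(gaps, key=gaps.get, reverse=True)

# Made-up mean similarities for two hypothetical models
control_means = {"model-a": 0.95, "model-b": 0.90}
persona_means = {"model-a": 0.85, "model-b": 0.89}

gaps = persona_sensitivity(control_means, persona_means)
print(rank_by_sensitivity(gaps))  # ['model-a', 'model-b']
```

A larger gap suggests that answers change more across personas than across identical reruns, which is the signal the control experiment is meant to isolate.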

In [15]:
import os
os.makedirs("output/variance_comparison", exist_ok=True)

print("\n" + "="*80)
print("VARIANCE ANALYSIS: Control vs. Persona Variance")
print("="*80)
print("\n Computing within-model variance (control)...")

control_variance = compute_within_model_variance(
    control_results,
    control_matrices
)

control_variance.to_csv("output/variance_comparison/control_variance.csv", index=False)
print(f"   Mean control similarity: {control_variance['mean_similarity'].mean():.4f}")
print(f"   Std dev: {control_variance['mean_similarity'].std():.4f}")
print(f"   Saved to: output/variance_comparison/control_variance.csv")

print("\n Computing across-persona variance (main)...")

persona_variance = compute_across_persona_variance(
    main_results,
    main_matrices
)

persona_variance.to_csv("output/variance_comparison/persona_variance.csv", index=False)
print(f"   Mean persona similarity: {persona_variance['mean_similarity'].mean():.4f}")
print(f"   Std dev: {persona_variance['mean_similarity'].std():.4f}")
print(f"   Saved to: output/variance_comparison/persona_variance.csv")

print("\n Variance computation complete!")


VARIANCE ANALYSIS: Control vs. Persona Variance

 Computing within-model variance (control)...


NameError: name 'compute_within_model_variance' is not defined

## Create Variance Visualizations

In [16]:
print("\n Creating variance comparison visualizations...\n")

create_variance_comparison_visualizations(
    control_variance,
    persona_variance,
    output_dir="output/variance_comparison"
)

print("\n Visualizations created!")
print("   Output files:")
print("   - comparison_bar_chart.png")
print("   - comparison_scatter.png")
print("   - comparison_heatmap.png")
print("   - variance_distributions.png")


 Creating variance comparison visualizations...



NameError: name 'create_variance_comparison_visualizations' is not defined

## Generate Variance Report

In [17]:
print("\n Generating variance analysis report...\n")

report = generate_variance_report(
    control_variance,
    persona_variance,
    output_path="output/variance_comparison/report.txt"
)

print("\n" + "="*80)
print("VARIANCE ANALYSIS COMPLETE!")
print("="*80)
print("\n All variance analysis files saved to: output/variance_comparison/")
print("\n Key Insights:")
print(f"   Control mean similarity: {control_variance['mean_similarity'].mean():.4f}")
print(f"   Persona mean similarity: {persona_variance['mean_similarity'].mean():.4f}")
print(f"   Difference: {abs(control_variance['mean_similarity'].mean() - persona_variance['mean_similarity'].mean()):.4f}")


 Generating variance analysis report...



NameError: name 'generate_variance_report' is not defined

## Creative Visualizations: Control vs Main Experiment

**Additional custom plots to illustrate the differences between control and persona experiments**

In [18]:
from duplicity.variance_visualizations import create_all_variance_plots

print("Creating variance comparison visualizations...")

# This function creates all 5 variance plots
create_all_variance_plots(
    control_similarities=control_matrices,
    persona_similarities=main_matrices,
    output_dir="variance_plots"
)

print(" All variance plots created successfully!")
print("   See variance_plots/ directory for:")
print("   - 1_quadrant_plot.png")
print("   - 2_sensitivity_ranking.png")
print("   - 3_landscape_zones.png")
print("   - 4_distribution_violin.png")
print("   - 5_consistency_variability.png")

Creating variance comparison visualizations...


NameError: name 'control_matrices' is not defined

---

# PART 4: STANDARD ANALYSES

## Traditional ConsistencyAI visualizations and analyses

## Compute Average Scores

In [None]:
# Compute average scores from main experiment
avg_scores = collect_avg_scores_by_model(main_matrices)

print("\nAverage Similarity Scores (Main Experiment):")
print("   (Higher = more consistent across personas)\n")

for model in list(avg_scores.keys())[:5]:  # Show first 5
    print(f"\n{model}:")
    for topic in list(avg_scores[model].keys())[:3]:  # Show first 3 topics
        score = avg_scores[model][topic]
        print(f"   {topic}: {score:.4f}")
    if len(avg_scores[model]) > 3:
        print(f"   ... and {len(avg_scores[model]) - 3} more topics")
    overall = np.mean(list(avg_scores[model].values()))
    print(f"   Overall: {overall:.4f}")

if len(avg_scores) > 5:
    print(f"\n... and {len(avg_scores) - 5} more models")


Average Similarity Scores (Main Experiment):
   (Higher = more consistent across personas)


x-ai/grok-4:
   Are COVID-19 vaccines safe and effective?: 0.8841
   Is the U.S.-Mexico border secure?: 0.9301
   Overall: 0.9071

x-ai/grok-3:
   Are COVID-19 vaccines safe and effective?: 0.9089
   Is the U.S.-Mexico border secure?: 0.9329
   Overall: 0.9209
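
The "Overall" figure printed above is the unweighted mean of a model's per-topic scores. The short check below reproduces it from the per-topic values shown in the output, using only the standard library; `overall_scores` is a hypothetical helper name.

```python
from statistics import mean

def overall_scores(avg_scores):
    """Unweighted mean of the per-topic scores for each model."""
    return {model: mean(list(by_topic.values()))
            for model, by_topic in avg_scores.items()}

# Per-topic values copied from the output above
avg_scores_example = {
    "x-ai/grok-4": {
        "Are COVID-19 vaccines safe and effective?": 0.8841,
        "Is the U.S.-Mexico border secure?": 0.9301,
    },
    "x-ai/grok-3": {
        "Are COVID-19 vaccines safe and effective?": 0.9089,
        "Is the U.S.-Mexico border secure?": 0.9329,
    },
}

for model, score in overall_scores(avg_scores_example).items():
    print(f"{model}: {score:.4f}")
# x-ai/grok-4: 0.9071
# x-ai/grok-3: 0.9209
```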


## Create Overall Leaderboard

In [19]:
print("\n Creating overall leaderboard...\n")

plot_overall_leaderboard(avg_scores, save_path="output", show=True)

print("\n Leaderboard saved to: output/overall_leaderboard.png")
plt.show()


 Creating overall leaderboard...



NameError: name 'plot_overall_leaderboard' is not defined

## Create Topic-Specific Plots

In [20]:
print("\n Creating topic-specific plots...\n")

plot_similarity_by_sphere(avg_scores, save_path="output")

print("\n Topic plots saved to: output/")


 Creating topic-specific plots...



NameError: name 'plot_similarity_by_sphere' is not defined

## Create Similarity Heatmaps

In [21]:
import gc
import matplotlib
matplotlib.use('Agg')  # Use non-interactive backend

print("\n Creating similarity heatmaps...")
print("   (This creates one heatmap per model/topic combination)")
print("   Processing in SMALL batches with ULTRA-AGGRESSIVE memory cleanup\n")

# Configuration
BATCH_SIZE = 3  # Only 3 at a time - very conservative
heatmap_count = 0

# Collect all model/topic combinations
all_combinations = []
for model in main_matrices:
    for topic in main_matrices[model]:
        all_combinations.append((model, topic))

total_heatmaps = len(all_combinations)
print(f"Total heatmaps to create: {total_heatmaps}")
print(f"Batch size: {BATCH_SIZE} (ultra-conservative)")
print(f"This will take longer but should avoid crashes\n")

# Process in tiny batches
for batch_start in range(0, total_heatmaps, BATCH_SIZE):
    batch_end = min(batch_start + BATCH_SIZE, total_heatmaps)
    batch_num = (batch_start // BATCH_SIZE) + 1
    total_batches = (total_heatmaps + BATCH_SIZE - 1) // BATCH_SIZE
    
    print(f"Batch {batch_num}/{total_batches} (heatmaps {batch_start+1}-{batch_end})...", end=' ')
    
    # Create heatmaps for this batch
    for model, topic in all_combinations[batch_start:batch_end]:
        safe_model = model.replace("/", "_")
        safe_topic = topic.replace(" ", "_").replace("?", "").replace("'", "")
        save_path = f"output/heatmap_{safe_model}_{safe_topic}.png"
        
        fig, ax = plot_similarity_matrix_with_values(
            similarity_matrix=main_matrices[model][topic],
            persona_ids=main_personas_ids[model][topic],
            show_values=(len(main_personas_ids[model][topic]) <= 20),
            model_name=model,
            topic=topic,
            save_path=save_path
        )
        plt.close(fig)
        del fig, ax  # Explicit deletion
        heatmap_count += 1
    
    # Memory cleanup: close any figures still open, then collect garbage.
    # plt.clf()/plt.cla() are omitted: with no open figure they would create a new one.
    plt.close('all')
    gc.collect()
    
    print("")

print(f"\n Created {heatmap_count} heatmaps in: output/")
print(f"   All heatmaps saved successfully!")


 Creating similarity heatmaps...
   (This creates one heatmap per model/topic combination)
   Processing in SMALL batches with ULTRA-AGGRESSIVE memory cleanup



NameError: name 'main_matrices' is not defined
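
The heatmap cell builds file names by chaining `str.replace` calls. An alternative sketch, using an allow-list regular expression, is shown below; it also handles characters the chained version does not cover, such as commas or colons. `safe_filename` is a hypothetical helper and produces slightly different names from the cell above (for example, a trailing `?` is dropped together with the underscore that replaces it).

```python
import re

def safe_filename(model, topic, prefix="heatmap", ext="png"):
    """Build a file name containing only letters, digits, '.', '_' and '-'."""
    raw = f"{prefix}_{model}_{topic}"
    # Replace every run of disallowed characters with a single underscore
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return f"{cleaned}.{ext}"

print(safe_filename("x-ai/grok-4", "Is the U.S.-Mexico border secure?"))
# heatmap_x-ai_grok-4_Is_the_U.S.-Mexico_border_secure.png
```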

## Clustering Analysis

In [22]:
print("\n Performing clustering analysis...\n")

analysis_results = analyze_and_cluster_embeddings(
    all_embeddings=main_embeddings,
    all_similarity_matrices=main_matrices,
    all_sorted_personas=main_personas_ids,
    max_clusters=10,
    random_state=42,
    save_plots=True,
    plots_dir="output/clustering"
)

print("\n Clustering analysis complete!\n")

# Print summary
print_analysis_summary(analysis_results)


 Performing clustering analysis...



NameError: name 'analyze_and_cluster_embeddings' is not defined

## Central Analysis

In [23]:
print("\n Running central analysis...\n")

per_model_per_topic, model_overall_weighted, topic_across_models_weighted, benchmark = \
    compute_central_analysis(main_matrices, output_dir="output/analysis")

print("\n Central analysis complete!\n")

# Display results
print_central_analysis_summary(model_overall_weighted, topic_across_models_weighted, benchmark)


 Running central analysis...



NameError: name 'compute_central_analysis' is not defined

---

# EXPERIMENT COMPLETE!

## Summary of Results

### What you just did:

**Control Experiment:**
- Measured within-model variance using Mary Alberti persona
- Ran the same query `CONTROL_REPETITIONS` times per model (300 total queries at 10 repetitions across 30 models)
- Identified most internally consistent models

**Main Experiment:**
- Loaded 100 diverse personas from NVIDIA Nemotron
- Generated personalized queries across 15 sensitive topics
- Queried 30 LLMs (45,000 total queries)
- Measured across-persona variance

**Variance Analysis:**
- Compared control vs. persona variance
- Identified most persona-sensitive models
- Generated comprehensive visualizations and reports

**Standard Analysis:**
- Created similarity heatmaps
- Generated leaderboards and topic plots
- Performed clustering analysis
- Computed central analysis metrics

---

## Output Files

### Control Experiment (`logs/control/`):
- Query results and similarities
- Incremental progress files

### Main Experiment (`logs/main/`):
- Persona data (100 personas)
- Query results (45,000 responses)
- Similarity matrices and embeddings
- Incremental progress files

### Variance Analysis (`output/variance_comparison/`):
- `control_variance.csv` - Within-model variance stats
- `persona_variance.csv` - Across-persona variance stats
- `comparison_bar_chart.png` - Side-by-side comparison
- `comparison_scatter.png` - Control vs. persona scatter
- `comparison_heatmap.png` - Metrics heatmap
- `variance_distributions.png` - Distribution box plots
- `report.txt` - Detailed text report

### Standard Analysis (`output/`):
- Similarity heatmaps (one per model/topic)
- Overall leaderboard
- Topic-specific plots
- Clustering visualizations (`output/clustering/`)
- Central analysis CSVs (`output/analysis/`)

---

## Key Insights

**Understanding Variance:**
- **High control variance** = Model is noisy/inconsistent with itself
- **Low control variance** = Model is stable and consistent
- **High persona variance** = Model adapts responses to different personas
- **Low persona variance** = Model gives similar responses regardless of persona

**Ideal Model Profile:**
- High control similarity (consistent with itself)
- Lower persona similarity (persona-aware)
- Together: stable behavior that adapts to demographic differences
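
The four combinations described above can be expressed as a small classifier. This is a sketch under stated assumptions: the thresholds (`0.9` for control similarity, `0.05` for the control-minus-persona gap) are arbitrary illustrative values, and `classify_model` is a hypothetical name that does not exist in the package.

```python
def classify_model(control_similarity, persona_similarity,
                   control_threshold=0.9, gap_threshold=0.05):
    """Label a model by self-consistency and persona sensitivity.

    Thresholds are hypothetical; tune them against real results.
    """
    stable = control_similarity >= control_threshold
    persona_sensitive = (control_similarity - persona_similarity) >= gap_threshold
    if stable and persona_sensitive:
        return "stable, persona-sensitive"
    if stable:
        return "stable, persona-insensitive"
    if persona_sensitive:
        return "noisy, persona-sensitive"
    return "noisy, persona-insensitive"

print(classify_model(0.95, 0.85))  # stable, persona-sensitive
```

Under the profile described above, the first label corresponds to the "ideal" case: high control similarity combined with lower persona similarity.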

---

## Next Steps

1. Review variance comparison visualizations in `output/variance_comparison/`
2. Read the detailed report: `output/variance_comparison/report.txt`
3. Examine model-specific heatmaps in `output/`
4. Analyze central analysis results in `output/analysis/`
5. Share your findings!

---

**Built by the Duke Phishermen**  
Peter Banyas, Shristi Sharma, Alistair Simmons, Atharva Vispute

For inquiries, questions, and feedback, please contact: peter dot banyas at duke dot edu