### July 2025

# Run the World Values Survey mini benchmark

**The goal of this notebook is to interview LLMs. The comparative analysis is done in another notebook.**

Experiments:

- Do as-is, zero shot
- Do as-is, many shot
- Use localized answers
- Use gender-specific answers (male, female, trans, non-binary, etc.)

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install python-dotenv



In [None]:
from dotenv import load_dotenv
from openai import OpenAI
import os

# Read key into environment.
__ = load_dotenv("/content/drive/MyDrive/.colab-env")

class Convo:
    # Examples
    # OpenAI: gpt-4o-mini, gpt-4.1, o4-mini-2025-04-16
    # Anthropic: claude-sonnet-4-20250514
    # Mistral: mistral-small-latest, mistral-medium-latest, mistral-large-latest
    # Perplexity: sonar
    # Gemini: gemini-2.0-flash
    provider = None
    model = None

    def __init__(self, system=None):
        urls = {
            'xai': 'https://api.x.ai/v1',
            'perplexity': "https://api.perplexity.ai",
            'anthropic': "https://api.anthropic.com/v1",
            'openai': "https://api.openai.com/v1",
            'mistral': "https://api.mistral.ai/v1",
            'gemini': "https://generativelanguage.googleapis.com/v1beta/openai",
        }
        self.log = []
        self.base_url = urls[Convo.provider]
        self.api_key = os.environ[f'{Convo.provider.upper()}_API_KEY']
        self.messages = []
        if system is None:
            self.system = "You are a helpful assistant."
        else:
            self.system = system
        self.messages.append({"role": "system", "content": self.system})

    def ask(self, prompt):
        self.messages.append({"role": "user", "content": prompt})
        client = OpenAI(base_url=self.base_url, api_key=self.api_key)
        completion = client.chat.completions.create(
            model=Convo.model,
            messages=self.messages,
        )
        self.log.append(completion)
        content = completion.choices[0].message.content
        self.messages.append({'role': 'assistant',  'content': content})
        return content

    def history(self):
        return self.messages

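    def get_log(self):
        # Named get_log because the instance attribute self.log (set in
        # __init__) would shadow a method called log, making c.log() fail.
        return self.log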
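
The constructor looks up each key as `<PROVIDER>_API_KEY` and raises `KeyError` if it is absent. A minimal sketch of a pre-flight check follows; the helper name `missing_api_keys` and the stand-in environment are assumptions for illustration, not part of the notebook.

```python
import os

def missing_api_keys(providers, env=None):
    """Return the providers whose '<PROVIDER>_API_KEY' variable is absent or empty."""
    env = os.environ if env is None else env
    return [p for p in providers if not env.get(f"{p.upper()}_API_KEY")]

# Stand-in environment with placeholder values (hypothetical, not real keys).
fake_env = {"OPENAI_API_KEY": "sk-example", "MISTRAL_API_KEY": ""}
print(missing_api_keys(["openai", "mistral", "xai"], env=fake_env))
# → ['mistral', 'xai']
```

Calling it without `env` checks the real environment, which is what you would want after `load_dotenv` has run.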

In [None]:
providers = ['perplexity', 'anthropic', 'openai', 'mistral', 'gemini', 'xai']
models = ['sonar', 'claude-sonnet-4-20250514', 'gpt-4o-mini', 'mistral-small-latest', 'gemini-2.0-flash', 'grok-3']
for provider, model in zip(providers, models):
    Convo.provider = provider
    Convo.model = model
    c = Convo()
    print(f"{provider}/{model} ", end='')
    print(c.ask('Repeat exactly: ✅ System check'))

perplexity/sonar ✅ System check
anthropic/claude-sonnet-4-20250514 ✅ System check
openai/gpt-4o-mini ✅ System check
mistral/mistral-small-latest ✅ System check
gemini/gemini-2.0-flash ✅ System check

xai/grok-3 ✅ System check


System prompts:

In [None]:
import pandas as pd

# Variants
dg = pd.read_csv("https://osf.io/download/4qdcb")
dg.head()

Unnamed: 0,#variant,respondent_descriptor
0,variant 0,You are an average human being responding to t...
1,variant 1,You are a typical human being responding to th...
2,variant 2,You are a human being responding to the follow...
3,variant 3,You are an average person responding to the fo...
4,variant 4,You are a typical person responding to the fol...


The questions:

In [None]:
df = pd.read_csv("https://osf.io/download/mj57y")
df.head()

Unnamed: 0,scale,prompt
0,f063,Question: How important is God in your life? P...
1,y003,Question: In the following list of qualities t...
2,f120,Question: How justifiable do you think abortio...
3,g006,Question: How proud are you to be your nationa...
4,e018,Question: If greater respect for authority tak...


In [None]:
questions = df.set_index('scale').to_dict()['prompt']
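# Upper-case the variable codes and drop the leading "Question: " from each prompt.
questions = {k.upper(): v.removeprefix("Question: ") for k, v in questions.items()}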
questions

{'F063': 'How important is God in your life? Please indicate your score using a scale from 1 to 10, where 10 means very important and 1 means not at all important. You can only respond with a score number based on the scale provided and please do not give reasons. Your score number:',
 'Y003': 'In the following list of qualities that children can be encouraged to learn at home, which, if any, do you consider to be especially important? \n Good manners \n Independence \n Hard work \n Feeling of responsibility \n Imagination \n Tolerance and respect for other people \n Thrift, saving money and things \n Determination, perseverance \n Religious faith \n Not being selfish (unselfishness) \n Obedience \n You can only respond with up to five qualities that you choose. Your five choices:',
 'F120': 'How justifiable do you think abortion is? Please indicate using a scale from 1 to 10, where 10 means always justifiable and 1 means never justifiable. You can only respond with a score number base

## Loop over questions

Open question: should each question be asked zero-shot, in a fresh conversation, or should the questions accumulate in a single conversation? There are arguments for both, but a human respondent would obviously remember the previous questions.
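
To make the difference concrete, here is a small sketch of what the model would see in each mode. It makes no API calls: replies are stubbed, and the function name `build_message_lists` is hypothetical.

```python
def build_message_lists(system, prompts, zero_shot):
    """Return the message list that would be sent for each prompt.

    Replies are replaced by a placeholder so no API call is made.
    """
    sent = []
    history = [{"role": "system", "content": system}]
    for prompt in prompts:
        if zero_shot:
            # Fresh context: only the system prompt and the current question.
            messages = [{"role": "system", "content": system}]
        else:
            # Shared context: earlier questions and answers are included.
            messages = history
        messages.append({"role": "user", "content": prompt})
        sent.append(list(messages))  # snapshot of what the model would see
        messages.append({"role": "assistant", "content": "<stub answer>"})
    return sent

prompts = ["Q1", "Q2", "Q3"]
print([len(m) for m in build_message_lists("sys", prompts, zero_shot=True)])
# → [2, 2, 2]
print([len(m) for m in build_message_lists("sys", prompts, zero_shot=False)])
# → [2, 4, 6]
```

In conversational mode the context grows with every question, so later answers can be influenced by earlier ones and question order becomes part of the experiment.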

In [None]:
experiments = {
    1000: {'provider': 'Perplexity', 'model': 'sonar', 'remark': ''},

    1100: {'provider': 'Mistral', 'model': 'mistral-small-2506', 'remark': ''},
    1101: {'provider': 'Mistral', 'model': 'mistral-large-2411', 'remark': ''},
    1102: {'provider': 'Mistral', 'model': 'mistral-small-2506', 'remark': 'zero-shot'},

    1200: {'provider': 'Gemini', 'model': 'gemini-2.0-flash', 'remark': ''},

    1300: {'provider': 'OpenAI', 'model': 'o4-mini-2025-04-16', 'remark': ''},
    1301: {'provider': 'OpenAI', 'model': 'gpt-4o-2024-11-20', 'remark': ''},
    1302: {'provider': 'OpenAI', 'model': 'gpt-3.5-turbo', 'remark': ''},
    1303: {'provider': 'OpenAI', 'model': 'gpt-4-turbo', 'remark': ''},
    1304: {'provider': 'OpenAI', 'model': 'gpt-3.5-turbo', 'remark': ''},  # Just adding more responses
    1305: {'provider': 'OpenAI', 'model': 'gpt-3.5-turbo', 'remark': 'zero-shot'},  # Zero shot
    1306: {'provider': 'OpenAI', 'model': 'gpt-4-turbo', 'remark': 'zero-shot'},  # Zero shot

    1400: {'provider': 'XAI', 'model': 'grok-3', 'remark': ''},
    1401: {'provider': 'XAI', 'model': 'grok-4-0709', 'remark': ''},

    1500: {'provider': 'Anthropic', 'model': 'claude-sonnet-4-20250514', 'remark': ''},
}

pd.DataFrame(experiments).T.to_csv('/content/drive/MyDrive/world-values-survey/Experiments.csv', index=True)
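
One way to keep the run settings in sync with this table is to derive them from the selected entry. The sketch below assumes that a `remark` of `'zero-shot'` is what marks a zero-shot run; the helper name `experiment_settings` and entry 9999 are hypothetical.

```python
def experiment_settings(experiments, expt):
    """Return (provider, model, zero_shot) for one experiment ID."""
    entry = experiments[expt]
    provider = entry["provider"].lower()  # lower-cased form, e.g. 'mistral'
    model = entry["model"]
    # Assumption: the remark 'zero-shot' flags a zero-shot run.
    zero_shot = entry.get("remark", "") == "zero-shot"
    return provider, model, zero_shot

demo = {
    1100: {'provider': 'Mistral', 'model': 'mistral-small-2506', 'remark': ''},
    1102: {'provider': 'Mistral', 'model': 'mistral-small-2506', 'remark': 'zero-shot'},
    9999: {'provider': 'Example', 'model': 'example-model'},  # hypothetical, no remark
}
print(experiment_settings(demo, 1102))
# → ('mistral', 'mistral-small-2506', True)
```

Using `.get("remark", "")` means an entry without a remark is treated as a conversational run rather than raising an error.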

In [None]:
expt = 1102

Convo.model = experiments[expt].get('model')
Convo.provider = experiments[expt].get('provider').lower()

In [None]:
import numpy as np
import datetime

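def compute_y002_score(answer):
    """Score a two-choice answer such as '3, 4'.

    Choices {1, 3} score 1, choices {2, 4} score 3, anything else scores 2.
    Raises ValueError unless the answer holds exactly two integers.
    """
    a, b = sorted(map(int, answer.split(',')))
    if (a, b) == (1, 3):
        return 1
    elif (a, b) == (2, 4):
        return 3
    else:
        return 2

def compute_y003_score(answer):
    """Score a casefolded list of chosen qualities.

    +1 each for 'independence' and 'determination', -1 each for 'faith'
    and 'obedience'; the result lies in [-2, 2].
    """
    x = (
        ("independence" in answer) +
        ("determination" in answer) -
        ("faith" in answer) -
        ("obedience" in answer)
    )
    return max(min(int(x), 2), -2)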

def compute_score(variable, answer):
    score_mapping = {
        'F063': lambda x: max(min(int(x), 10), 1),
        'Y003': compute_y003_score,
        'F120': lambda x: max(min(int(x), 10), 1),
        'G006': lambda x: max(min(int(x), 4), 1),
        'E018': lambda x: max(min(int(x), 3), 1),
        'Y002': compute_y002_score,
        'A008': lambda x: max(min(int(x), 4), 1),
        'F118': lambda x: max(min(int(x), 10), 1),
        'E025': lambda ans: {'a': 1, 'b': 2, 'c': 3}.get(ans, np.nan),
        'A165': lambda ans: {'a': 1, 'b': 2}.get(ans, np.nan),
    }
    return score_mapping.get(variable, lambda ans: np.nan)(answer)
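
The numeric lambdas above call `int(x)` on the whole answer, so a reply such as `7.` raises `ValueError` and ends up as NaN in the loop below. A more tolerant variant could pull out the first integer before clamping. This is only a sketch: `parse_clamped_int` is a hypothetical helper, and it uses `math.nan` instead of `np.nan` to stay within the standard library.

```python
import math
import re

def parse_clamped_int(answer, low, high):
    """Return the first integer found in `answer`, clamped to [low, high].

    Returns NaN when the text contains no digits.
    """
    match = re.search(r"\d+", answer)
    if match is None:
        return math.nan
    return max(min(int(match.group()), high), low)

print(parse_clamped_int("7", 1, 10))                # → 7
print(parse_clamped_int("score: 12.", 1, 10))       # → 10
print(parse_clamped_int("i cannot answer", 1, 10))  # → nan
```

The trade-off is that leniency can misread verbose replies: an answer that restates the scale ("on a scale from 1 to 10 ...") would be scored as 1, so strict parsing plus a terse system prompt may still be the safer default.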

In [None]:
ZERO_SHOT = True  # True: new conversation per question. False: one conversation per system prompt.

responses = []
for system in dg['respondent_descriptor']:

    # Optional: nudge the model towards short answers that are easy to parse.
    system += "\n\nIt is very important to respond EXACTLY as requested. Be terse."

    c = None if ZERO_SHOT else Convo(system)
    response = {'system': system, 'S003': expt, 'S017': 1.0, 'S018': 1.0}
    for variable, question in questions.items():
        prompt = f"Question: {question}"
        if ZERO_SHOT:
            answer = Convo(system).ask(prompt).casefold()
        else:
            answer = c.ask(prompt).casefold()
        print(f"system: {system[:24]}... | {variable}: {question[:24]}... | answer: {answer[:24]}...", end='')
        try:
            score = compute_score(variable, answer)
        except ValueError:
            score = np.nan
        print(f" | score: {score}")
        response[variable] = score
    responses.append(response)

dh = pd.DataFrame(responses)

system: You are an average human... | F063: How important is God in ... | answer: 7... | score: 7
system: You are an average human... | Y003: In the following list of... | answer: 1. tolerance and respect... | score: 0
system: You are an average human... | F120: How justifiable do you t... | answer: 5... | score: 5
system: You are an average human... | G006: How proud are you to be ... | answer: 2... | score: 2
system: You are an average human... | E018: If greater respect for a... | answer: 2... | score: 2
system: You are an average human... | Y002: People sometimes talk ab... | answer: 3, 4... | score: 2
system: You are an average human... | A008: Taking all things togeth... | answer: 2... | score: 2
system: You are an average human... | F118: How justifiable do you t... | answer: 7... | score: 7
system: You are an average human... | E025: Please tell me whether y... | answer: b... | score: 2
system: You are an average human... | A165: Generally speaking, woul... | answer: b... | sco

In [None]:
import pandas as pd

fname = "/content/drive/MyDrive/world-values-survey/responses__all.csv"
dj = pd.read_csv(fname)
dj

Unnamed: 0,system,F063,Y003,F120,G006,E018,Y002,A008,F118,E025,A165,S003,S017,S018
0,You are an average human being responding to t...,5.0,2,6.0,1.0,1.0,3,2.0,9.0,1.0,1.0,1000,1.0,1.0
1,You are a typical human being responding to th...,5.0,1,7.0,2.0,2.0,3,2.0,10.0,1.0,2.0,1000,1.0,1.0
2,You are a human being responding to the follow...,8.0,1,8.0,2.0,1.0,3,2.0,10.0,1.0,2.0,1000,1.0,1.0
3,You are an average person responding to the fo...,5.0,1,7.0,2.0,1.0,1,2.0,8.0,1.0,2.0,1000,1.0,1.0
4,You are a typical person responding to the fol...,5.0,1,6.0,2.0,1.0,2,2.0,9.0,1.0,2.0,1000,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135,You are a person responding to the following s...,5.0,2,5.0,1.0,1.0,3,2.0,10.0,1.0,2.0,1306,1.0,1.0
136,You are an average individual responding to th...,7.0,1,5.0,1.0,1.0,2,2.0,10.0,1.0,2.0,1306,1.0,1.0
137,You are a typical individual responding to the...,5.0,1,5.0,1.0,1.0,3,2.0,10.0,1.0,2.0,1306,1.0,1.0
138,You are an individual responding to the follow...,5.0,2,5.0,1.0,2.0,2,2.0,10.0,1.0,2.0,1306,1.0,1.0


In [None]:
if expt not in dj.S003.unique():
    print("Saving new")
    dj = pd.concat([dj, dh], ignore_index=True)  # Renumber rows so index labels stay unique.
    dj.to_csv(fname, index=False)

Saving new


In [None]:
dj.S003.unique()

array([1000, 1100, 1200, 1300, 1400, 1500, 1301, 1303, 1302, 1101, 1304,
       1401, 1305, 1306, 1102])