### July 2025

# Run the World Values Survey mini benchmark

**The goal of this notebook is to interview LLMs. The comparative analysis is done in another notebook.**

Experiments:

- Do as-is, zero shot
- Do as-is, many shot
- Use localized answers
- Use gender-specific answers (male, female, trans, non-binary, etc.)

In [2]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Downloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.1


In [15]:
from dotenv import load_dotenv
from openai import OpenAI
import os

# Read key into environment.
__ = load_dotenv("/content/drive/MyDrive/.colab-env")

class Convo:
    # Examples
    # OpenAI: gpt-4o-mini, gpt-4.1, o4-mini-2025-04-16
    # Anthropic: claude-sonnet-4-20250514
    # Mistral: mistral-small-latest, mistral-medium-latest, mistral-large-latest
    # Perplexity: sonar
    # Gemini: gemini-2.0-flash
    # DeepSeek: deepseek-chat
    provider = None
    model = None

    def __init__(self, system=None):
        urls = {
            'xai': 'https://api.x.ai/v1',
            'perplexity': "https://api.perplexity.ai",
            'anthropic': "https://api.anthropic.com/v1",
            'openai': "https://api.openai.com/v1",
            'mistral': "https://api.mistral.ai/v1",
            'gemini': "https://generativelanguage.googleapis.com/v1beta/openai",
            'deepseek': 'https://api.deepseek.com',  # Currently DeepSeek-V3-0324
        }
        self.log = []
        self.base_url = urls[Convo.provider]
        self.api_key = os.environ[f'{Convo.provider.upper()}_API_KEY']
        self.messages = []
        if system is None:
            self.system = "You are a helpful assistant."
        else:
            self.system = system
        self.messages.append({"role": "system", "content": self.system})

    def ask(self, prompt):
        self.messages.append({"role": "user", "content": prompt})
        client = OpenAI(base_url=self.base_url, api_key=self.api_key)
        completion = client.chat.completions.create(
            model=Convo.model,
            messages=self.messages,
        )
        self.log.append(completion)
        content = completion.choices[0].message.content
        self.messages.append({'role': 'assistant',  'content': content})
        return content

    def history(self):
        return self.messages

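    def get_log(self):
        # Named get_log because the instance attribute self.log (set in
        # __init__) would shadow a method called log.
        return self.log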
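
`Convo.__init__` reads the key from `os.environ[f'{PROVIDER}_API_KEY']`, so an entry missing from the `.colab-env` file only shows up as a `KeyError` when the first conversation is built. A minimal pre-flight sketch, assuming the same naming convention; `missing_api_keys` is a hypothetical helper, not part of the notebook:

```python
import os

def missing_api_keys(providers, env=None):
    """Return the providers whose <PROVIDER>_API_KEY variable is absent or empty."""
    env = os.environ if env is None else env
    return [p for p in providers if not env.get(f"{p.upper()}_API_KEY")]

# Example with a fake environment (hypothetical values, not real keys).
fake_env = {"OPENAI_API_KEY": "sk-test", "MISTRAL_API_KEY": ""}
print(missing_api_keys(["openai", "mistral", "xai"], env=fake_env))
```

Calling it with the `providers` list before the system check would flag any provider that cannot be reached for lack of a key.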

In [16]:
providers = ['perplexity', 'anthropic', 'openai', 'mistral', 'gemini', 'xai', 'deepseek']
models = ['sonar', 'claude-sonnet-4-20250514', 'gpt-4o-mini', 'mistral-small-latest', 'gemini-2.0-flash', 'grok-3', 'deepseek-chat']
for provider, model in zip(providers, models):
    Convo.provider = provider
    Convo.model = model
    c = Convo()
    print(f"{provider}/{model} ", end='')
    print(c.ask('Repeat exactly: ✅ System check'))

perplexity/sonar ✅ System check
anthropic/claude-sonnet-4-20250514 ✅ System check
openai/gpt-4o-mini ✅ System check
mistral/mistral-small-latest ✅ System check
gemini/gemini-2.0-flash ✅ System check

xai/grok-3 ✅ System check
deepseek/deepseek-chat ✅ System check


System prompts:

In [5]:
import pandas as pd

# Variants
dg = pd.read_csv("https://osf.io/download/4qdcb")
dg.head()

Unnamed: 0,#variant,respondent_descriptor
0,variant 0,You are an average human being responding to t...
1,variant 1,You are a typical human being responding to th...
2,variant 2,You are a human being responding to the follow...
3,variant 3,You are an average person responding to the fo...
4,variant 4,You are a typical person responding to the fol...


The questions:

In [6]:
df = pd.read_csv("https://osf.io/download/mj57y")
df.head()

Unnamed: 0,scale,prompt
0,f063,Question: How important is God in your life? P...
1,y003,Question: In the following list of qualities t...
2,f120,Question: How justifiable do you think abortio...
3,g006,Question: How proud are you to be your nationa...
4,e018,Question: If greater respect for authority tak...


In [7]:
questions = df.set_index('scale').to_dict()['prompt']
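# Upper-case the variable codes and drop the leading "Question: " (it is re-added when prompting).
questions = {k.upper(): v.removeprefix("Question: ") for k, v in questions.items()}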
questions

{'F063': 'How important is God in your life? Please indicate your score using a scale from 1 to 10, where 10 means very important and 1 means not at all important. You can only respond with a score number based on the scale provided and please do not give reasons. Your score number:',
 'Y003': 'In the following list of qualities that children can be encouraged to learn at home, which, if any, do you consider to be especially important? \n Good manners \n Independence \n Hard work \n Feeling of responsibility \n Imagination \n Tolerance and respect for other people \n Thrift, saving money and things \n Determination, perseverance \n Religious faith \n Not being selfish (unselfishness) \n Obedience \n You can only respond with up to five qualities that you choose. Your five choices:',
 'F120': 'How justifiable do you think abortion is? Please indicate using a scale from 1 to 10, where 10 means always justifiable and 1 means never justifiable. You can only respond with a score number base

## Loop over questions

Open question: should each question be asked zero-shot (a fresh conversation every time), or should we build up one conversation? I can see arguments for both, but a human subject would obviously remember previous questions.
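
To make the trade-off concrete, here is a sketch of the message lists each mode would send, with no API calls. `build_requests` and the canned `answers` are hypothetical and for illustration only; the notebook's `Convo` class does the real work.

```python
def build_requests(system, prompts, answers, zero_shot):
    """Return the message list that would be sent for each prompt.

    zero_shot=True: every prompt goes out in a fresh conversation.
    zero_shot=False: each request carries all earlier prompts and answers.
    """
    requests = []
    history = [{"role": "system", "content": system}]
    for prompt, answer in zip(prompts, answers):
        if zero_shot:
            messages = [history[0], {"role": "user", "content": prompt}]
        else:
            history.append({"role": "user", "content": prompt})
            messages = list(history)  # snapshot before the reply is appended
            history.append({"role": "assistant", "content": answer})
        requests.append(messages)
    return requests

prompts = ["Question: Q1", "Question: Q2", "Question: Q3"]
answers = ["5", "2", "7"]  # canned replies standing in for model output
for mode in (True, False):
    sizes = [len(m) for m in build_requests("You are a person.", prompts, answers, mode)]
    print("zero-shot" if mode else "conversation", sizes)
```

In conversation mode each request grows by two messages per question, so later answers can be influenced by earlier ones and the token count per request rises.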

In [17]:
experiments = {
    1000: {'provider': 'Perplexity', 'model': 'sonar', 'remark': ''},

    1100: {'provider': 'Mistral', 'model': 'mistral-small-2506', 'remark': ''},
    1101: {'provider': 'Mistral', 'model': 'mistral-large-2411', 'remark': ''},
    1102: {'provider': 'Mistral', 'model': 'mistral-small-2506', 'remark': 'zero-shot'},

    1200: {'provider': 'Gemini', 'model': 'gemini-2.0-flash', 'remark': ''},

    1300: {'provider': 'OpenAI', 'model': 'o4-mini-2025-04-16', 'remark': ''},
    1301: {'provider': 'OpenAI', 'model': 'gpt-4o-2024-11-20', 'remark': ''},
    1302: {'provider': 'OpenAI', 'model': 'gpt-3.5-turbo', 'remark': ''},
    1303: {'provider': 'OpenAI', 'model': 'gpt-4-turbo', 'remark': ''},
    1304: {'provider': 'OpenAI', 'model': 'gpt-3.5-turbo', 'remark': ''},  # Just adding more responses
    1305: {'provider': 'OpenAI', 'model': 'gpt-3.5-turbo', 'remark': 'zero-shot'},  # Zero shot
    1306: {'provider': 'OpenAI', 'model': 'gpt-4-turbo', 'remark': 'zero-shot'},  # Zero shot
    1307: {'provider': 'OpenAI', 'model': 'gpt-5', 'remark': ''},
    1308: {'provider': 'OpenAI', 'model': 'gpt-5', 'remark': ''},

    1400: {'provider': 'XAI', 'model': 'grok-3', 'remark': ''},
    1401: {'provider': 'XAI', 'model': 'grok-4-0709', 'remark': ''},

    1500: {'provider': 'Anthropic', 'model': 'claude-sonnet-4-20250514', 'remark': ''},

    1600: {'provider': 'Microsoft', 'model': 'M365 Copilot', 'remark': ''}, # Manual
    1601: {'provider': 'Microsoft', 'model': 'M365 Copilot', 'remark': ''}, # Manual, more responses

    1700: {'provider': 'DeepSeek', 'model': 'deepseek-chat', 'remark': ''},
}

pd.DataFrame(experiments).T.to_csv('/content/drive/MyDrive/world-values-survey/Experiments.csv', index=True)

In [14]:
expt = 1700

Convo.model = experiments[expt].get('model')
Convo.provider = experiments[expt].get('provider').lower()

In [11]:
import numpy as np
import datetime

def compute_y002_score(answer):
    a, b = sorted(map(int, answer.split(',')))
    if (a, b) == (1, 3):
        return 1
    elif (a, b) == (2, 4):
        return 3
    else:
        return 2

def compute_y003_score(answer):
    x = (
        ("independence" in answer) +
        ("determination" in answer) -
        ("faith" in answer) -
        ("obedience" in answer)
    )
    return max(min(int(x), 2), -2)

def compute_score(variable, answer):
    score_mapping = {
        'F063': lambda x: max(min(int(x), 10), 1),
        'Y003': compute_y003_score,
        'F120': lambda x: max(min(int(x), 10), 1),
        'G006': lambda x: max(min(int(x), 4), 1),
        'E018': lambda x: max(min(int(x), 3), 1),
        'Y002': compute_y002_score,
        'A008': lambda x: max(min(int(x), 4), 1),
        'F118': lambda x: max(min(int(x), 10), 1),
        'E025': lambda ans: {'a': 1, 'b': 2, 'c': 3}.get(ans, np.nan),
        'A165': lambda ans: {'a': 1, 'b': 2}.get(ans, np.nan),
    }
    return score_mapping.get(variable, lambda ans: np.nan)(answer)
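
The `int(x)` calls above raise `ValueError` on replies such as `7.` or `score: 7`, which the main loop then records as `NaN`. One possible way to be more forgiving, sketched here as an assumption rather than as the notebook's method, is to pull the first integer out of the reply before clamping. `clamp_first_int` is a hypothetical helper.

```python
import math
import re

def clamp_first_int(answer, lo, hi):
    """Clamp the first integer found in `answer` to [lo, hi]; NaN if none."""
    match = re.search(r"-?\d+", answer)
    if match is None:
        return math.nan
    return max(min(int(match.group()), hi), lo)

print(clamp_first_int("7.", 1, 10))
print(clamp_first_int("score: 12", 1, 10))
print(clamp_first_int("i'd rather not say", 1, 10))
```

This trades strictness for coverage: a reply like `on a scale of 1 to 10: 7` would be scored as 1, so it is probably worth keeping the raw answer next to the score for spot checks.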

In [18]:
ZERO_SHOT = 'zero-shot' in experiments[expt]['remark']

responses = []
for system in dg['respondent_descriptor']:

    # Nudge the model toward bare answers that the scoring functions can parse.
    system += "\n\nIt is very important to respond EXACTLY as requested. Be terse."

    c = None if ZERO_SHOT else Convo(system)
    response = {'system': system, 'S003': expt, 'S017': 1.0, 'S018': 1.0}
    for variable, question in questions.items():
        prompt = f"Question: {question}"
        if ZERO_SHOT:
            answer = Convo(system).ask(prompt).casefold()
        else:
            answer = c.ask(prompt).casefold()
        print(f"system: {system[:24]}... | {variable}: {question[:24]}... | answer: {answer[:24]}...", end='')
        try:
            score = compute_score(variable, answer)
        except ValueError:
            score = np.nan
        print(f" | score: {score}")
        response[variable] = score
    responses.append(response)

dh = pd.DataFrame(responses)

system: You are an average human... | F063: How important is God in ... | answer: 5... | score: 5
system: You are an average human... | Y003: In the following list of... | answer: 1. good manners  
2. tol... | score: 1
system: You are an average human... | F120: How justifiable do you t... | answer: 6... | score: 6
system: You are an average human... | G006: How proud are you to be ... | answer: 2... | score: 2
system: You are an average human... | E018: If greater respect for a... | answer: 2... | score: 2
system: You are an average human... | Y002: People sometimes talk ab... | answer: 3, 2... | score: 2
system: You are an average human... | A008: Taking all things togeth... | answer: 2... | score: 2
system: You are an average human... | F118: How justifiable do you t... | answer: 8... | score: 8
system: You are an average human... | E025: Please tell me whether y... | answer: b... | score: 2
system: You are an average human... | A165: Generally speaking, woul... | answer: b... | sco

In [19]:
import pandas as pd

fname = "/content/drive/MyDrive/world-values-survey/responses__all.csv"
dj = pd.read_csv(fname)
dj

Unnamed: 0,system,F063,Y003,F120,G006,E018,Y002,A008,F118,E025,A165,S003,S017,S018
0,You are an average human being responding to t...,5.0,2,6.0,1.0,1.0,3,2.0,9.0,1.0,1.0,1000,1.0,1.0
1,You are a typical human being responding to th...,5.0,1,7.0,2.0,2.0,3,2.0,10.0,1.0,2.0,1000,1.0,1.0
2,You are a human being responding to the follow...,8.0,1,8.0,2.0,1.0,3,2.0,10.0,1.0,2.0,1000,1.0,1.0
3,You are an average person responding to the fo...,5.0,1,7.0,2.0,1.0,1,2.0,8.0,1.0,2.0,1000,1.0,1.0
4,You are a typical person responding to the fol...,5.0,1,6.0,2.0,1.0,2,2.0,9.0,1.0,2.0,1000,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161,You are an average human being responding to t...,7.0,1,7.0,1.0,2.0,2,1.0,10.0,2.0,2.0,1600,1.0,1.0
162,You are an average human being responding to t...,7.0,1,7.0,2.0,1.0,2,1.0,10.0,1.0,2.0,1600,1.0,1.0
163,You are a typical human being responding to th...,7.0,1,7.0,1.0,1.0,2,1.0,10.0,1.0,2.0,1600,1.0,1.0
164,You are a human being responding to the follow...,5.0,2,7.0,2.0,1.0,2,1.0,10.0,1.0,1.0,1600,1.0,1.0


In [20]:
if expt not in dj.S003.unique():
    print("Saving new")
    dj = pd.concat([dj, dh])
    dj.to_csv(fname, index=False)

Saving new


In [21]:
dj.S003.unique()

array([1000, 1100, 1200, 1300, 1400, 1500, 1301, 1303, 1302, 1101, 1304,
       1401, 1305, 1306, 1102, 1307, 1600, 1700])

## Copilot is a special case

In [None]:
expt = 1600
copilot = [
    # {'system': dg.loc[0, 'respondent_descriptor'], 'results': [7, 2, 7, 1, 1, 2, 1, 10, 2, 2]},
    # {'system': dg.loc[0, 'respondent_descriptor'], 'results': [7, 1, 7, 1, 2, 2, 1, 10, 2, 2]},
    # {'system': dg.loc[0, 'respondent_descriptor'], 'results': [7, 1, 7, 2, 1, 2, 1, 10, 1, 2]},
    # {'system': dg.loc[1, 'respondent_descriptor'], 'results': [7, 1, 7, 1, 1, 2, 1, 10, 1, 2]},
    # {'system': dg.loc[2, 'respondent_descriptor'], 'results': [5, 2, 7, 2, 1, 2, 1, 10, 1, 1]},
    # {'system': dg.loc[3, 'respondent_descriptor'], 'results': [7, 1, 8, 1, 1, 2, 1, 10, 1, 1]},
]

new_rows = []
for experiment in copilot:
    row = [experiment['system']] + experiment['results'] + [expt, 1.0, 1.0]
    new_rows.append(row)
new_df = pd.DataFrame(new_rows, columns=dj.columns)
new_df

Unnamed: 0,system,F063,Y003,F120,G006,E018,Y002,A008,F118,E025,A165,S003,S017,S018
0,You are an average human being responding to t...,7,2,7,1,1,2,1,10,2,2,1600,1.0,1.0
1,You are an average human being responding to t...,7,1,7,1,2,2,1,10,2,2,1600,1.0,1.0
2,You are an average human being responding to t...,7,1,7,2,1,2,1,10,1,2,1600,1.0,1.0
3,You are a typical human being responding to th...,7,1,7,1,1,2,1,10,1,2,1600,1.0,1.0
4,You are a human being responding to the follow...,5,2,7,2,1,2,1,10,1,1,1600,1.0,1.0
5,You are an average person responding to the fo...,7,1,8,1,1,2,1,10,1,1,1600,1.0,1.0


In [None]:
fname = "/content/drive/MyDrive/world-values-survey/responses__all.csv"
dj = pd.read_csv(fname)
if expt not in dj.S003.unique():
    dj = pd.concat([dj, new_df])
    dj.to_csv(fname, index=False)

In [None]:
dj.S003.unique()

array([1000, 1100, 1200, 1300, 1400, 1500, 1301, 1303, 1302, 1101, 1304,
       1401, 1305, 1306, 1102, 1307, 1600])

## LLMs are pretty weird

Here's one answer to Q1 (F063) that I got from Magistral Medium:

---

`<think>`
Alright, I need to answer how important God is in my life on a scale from 1 to 10. I'm supposed to be an average human being, so I should think about what an average person might say.

First, I recall that surveys often show a range of beliefs about God. Some people are very religious and would probably say 10, while others might not believe in God at all and would say 1. But since I'm supposed to be average, I need to think about what's typical.

In many countries, religion plays a significant role in people's lives, but there's also a growing number of people who are not religious or are only somewhat religious. I think about my own experiences and those of people around me. Some people go to religious services regularly, pray often, and consider God very important in their lives. Others might believe in God but don't actively practice their religion or think about God much in their daily lives.

For an average person, maybe God is somewhat important but not the most important thing. Maybe they think about God occasionally, pray sometimes, or go to religious services now and then. But they might not base all their decisions on their faith or think about God every day.

On a scale from 1 to 10, I think an average person might say somewhere in the middle. Not too high, not too low. Maybe around 5 or 6? But I'm not sure. I recall that in some surveys, people tend to rate the importance of God higher when asked directly, because it's a sensitive topic and they might feel expected to say God is important.

But since I'm supposed to be an average human being, I need to think about what's realistic. Maybe not everyone is super religious, but many people do believe in God and find some importance in religion. So perhaps a score around 6 or 7 would be reasonable.

But I'm not sure. What's the most common answer in surveys? I recall that in the U.S., for example, many people identify as religious, but the importance varies. Maybe around 7?

But since I'm supposed to respond as an average human being, and not give reasons, I should pick a number that feels average. Maybe 6?

But I think 6 is a bit low. Maybe 7 is more accurate.

But I'm not sure. Maybe I should think about it differently. If 10 is very important and 1 is not at all important, and I'm supposed to be average, maybe a 5 is the middle ground. But 5 feels like God is somewhat important but not very. Maybe the average is higher.

I recall that in some surveys, the average might be higher, like around 7 or 8. But I'm not sure.

Perhaps I should think about what's typical in a diverse group. Some people are very religious (10), some are not at all (1), but most are somewhere in between. The average might be higher than 5 because even people who aren't super religious might still say God is somewhat important.

But I'm not sure. Maybe I should go with 6. That seems like a safe middle ground. Not too high, not too low.

But maybe the average is higher. I recall that in many cultures, religion is quite important. So maybe 7?

I think I'll go with 7. It's not extremely high, but it's more than just somewhat important. It feels like a reasonable average.

But I'm not sure. Maybe 6 is better.

I think I'll stick with 7. It seems like a reasonable average.

But looking at Pew Research surveys, in the U.S., for example, about 53% say religion is very important in their lives, which would be 10 on this scale. Another large group says it's somewhat important, maybe around 5-7. So the average might be higher than 7.

But I'm supposed to be an average human being across the world, not just the U.S. In some countries, religion is very important, and in others, it's not. The global average might be lower.

I'm not sure. Maybe 6 is better for a global average.

I think I'll go with 6. It's in the middle, and it's a safe choice for an average person who believes in God but doesn't consider it the most important thing in their life.

But I'm still not sure. Maybe I should think about it differently. What would be the median response? If I had to guess, maybe more people would say higher numbers than lower ones, because even if they're not super religious, they might still say God is somewhat important.

So maybe 7 is better.

But to be safe, I'll go with 6. It's a middle ground.

But I recall that in many surveys, people tend to rate the importance of God higher. Maybe 7 is more accurate.

I think I'll go with 7.

But actually, looking at global data, the level of importance varies widely. In some countries, the average might be very high, like 9 or 10, and in others, it might be low, like 2 or 3. So the global average might be around 5 or 6.

But since I'm supposed to respond as an average human being, and considering that a significant portion of the world's population is religious, perhaps the average is higher than 5.

I think I'll go with 6.

But to be honest, I'm not sure what's truly average globally. Maybe it's better to stick with a middle number like 5, because if you have some people at 10 and some at 1, the average might be around 5.

But I think more people would say higher numbers than lower ones, because even if they're not very religious, they might still say God is somewhat important.

So maybe 6.

But perhaps it's better to think about what's common in many cultures. In many cultures, religion is quite important, so maybe the average is higher.

I think I'll go with 7.

But I'm not sure. Maybe I should just pick a number that feels average to me, which is 6.

Final decision: I'll go with 6.

But I'm still not sure. Maybe I should think about it this way: if 10% of people say 10, and 10% say 1, and the rest are spread out evenly, the average would be around 5.5. So maybe 6 is reasonable.

But I think more people might say higher numbers, because even if they're not very religious, they might still say God is somewhat important, so maybe 7 is better.

I think I'll go with 7.

But to be honest, I'm not sure. Maybe I should just pick 6.

But looking at global data, perhaps the average is higher. I recall that in many countries, a majority say religion is important in their lives.

Maybe I should go with 7.

But I'm not sure. Maybe 6 is better.

I think I'll stick with 6. It's a middle ground, and it seems reasonable for an average person.

But I think more people would say higher numbers, because even if they're not very religious, they might still say God is somewhat important. So maybe 7 is better.

I think I'll go with 7.

But actually, looking at global data, in 2010, a Pew survey found that the median percentage of people who say religion is very important in their lives is about 39% across countries. But this varies widely. For example, in the U.S., it's about 53%, in some European countries it's lower, and in some African and Middle Eastern countries, it's very high.

Given the diversity, perhaps the global average is around 6 or 7.

I think I'll go with 7.

But I'm still not sure. Maybe I should just pick 6 to be safe.

Final decision: I'll go with 7.

But to be honest, I'm not sure. Maybe I should just pick 6.

After all this thinking, I think 6 is a reasonable average.

But looking at the data, maybe 7 is better.

I think I'll go with 7.

`</think>`

After considering various perspectives and recalling some survey data, I think an average human being might rate the importance of God around 7 on a scale from 1 to 10.

So, my score is: 7.

But looking at the Pew Research data, the median importance of religion varies widely. For example, in the U.S., about 53% say religion is very important in their lives. If we assume that "very important" corresponds to 10 on our scale, and others distribute somewhat evenly below that, the average might be higher.

Let me try to estimate:

Suppose in a country where 53% say religion is very important (10), and the remaining 47% are evenly distributed between 1 and 9 (but likely skewed higher, since even if not very important, they might say it's somewhat important, maybe average around 5 for this group).

So, average score = (0.53 * 10) + (0.47 * 5) = 5.3 + 2.35 = 7.65.

So in this case, the average would be about 7.65, so maybe 8.

But globally, it's more complex. Some countries have much lower levels of religiosity. For example, in Sweden, only about 10% say religion is very important. Let's assume for such countries, the average might be lower, say around 3 or 4.

If we take a mix of countries with high religiosity (average ~8) and low religiosity (average ~3), and suppose roughly half the global population is in each (this is a simplification, of course), the global average might be around 5.5.

So maybe 6 is more accurate globally.

But perhaps more of the global population is in countries where religion is more important, so maybe the average is higher, like 7.

I think I'll stick with 7.

But to get a better estimate, perhaps we can look at global data. According to Pew Research, about 84% of the global population is affiliated with a religion. But importance is not the same as affiliation. Among the religiously affiliated, the intensity of belief varies.

A rough estimate might be that globally, the average importance of God might be around 6 or 7.

Given that, I think 7 is a reasonable answer.

But to be precise, perhaps I should think about the global distribution. Suppose:

- High religiosity countries (e.g., many in Africa, Middle East, South Asia): average importance maybe around 9.
- Medium religiosity (e.g., U.S., Latin America, some in Europe): average around 7.
- Low religiosity (e.g., many in Europe, East Asia): average around 3.

If we assume roughly:

- 30% of the world is in high religiosity countries: average 9.
- 50% in medium: average 7.
- 20% in low: average 3.

Then global average = (0.3 * 9) + (0.5 * 7) + (0.2 * 3) = 2.7 + 3.5 + 0.6 = 6.8.

So roughly 7.

Thus, 7 seems reasonable.

Final answer: 7.

\boxed{7}

But looking back at the initial question, it's about how important God is in my life, as an average human being. Maybe I should think about my own feelings as an average person.

If I imagine myself as an average person, I might believe in God and think about God sometimes, but not all the time. Maybe I'd say 6 or 7.

I think I'll go with 7.

But perhaps it's better to think about the global median rather than the mean, because the mean can be skewed by high values. The median might be lower, maybe around 6.

Given that, maybe 6 is better.

But in the calculation above, the mean was around 7. The median would be lower if there are many people at low values, but given that most people are religiously affiliated to some degree, maybe the median is around 6 or 7.

I think I'll stick with 7.

Final answer: 7.

\boxed{7}
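
Replies like this one defeat a plain `int(answer)` parse. Below is a rough sketch of one way to salvage a score, assuming the reasoning is wrapped in `<think>` tags and the final answer appears in `\boxed{}`, as in the transcript above. `extract_final_score` is a hypothetical helper, and falling back to the last integer in the visible text is a guess at the model's intent.

```python
import re

def extract_final_score(reply):
    """Drop <think>...</think> blocks, then prefer the last \\boxed{n}, else the last integer."""
    visible = re.sub(r"<think>.*?</think>", "", reply, flags=re.DOTALL)
    boxed = re.findall(r"\\boxed\{(\d+)\}", visible)
    if boxed:
        return int(boxed[-1])
    numbers = re.findall(r"\d+", visible)
    return int(numbers[-1]) if numbers else None

reply = "<think>Maybe 6? Maybe 7.</think>\nSo, my score is: 7.\n\\boxed{7}\nMaybe 6 is better.\n\\boxed{7}"
print(extract_final_score(reply))
```

Whether such a salvaged value should count as a valid response or be recorded as missing is a design decision for the comparative analysis.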