# Moral Machine Experiment with LLMs and PPI

Prior work shows that LLMs can make serious errors when predicting human behavior. How can we draw valid, precise, and cost-effective inferences from LLMs on causal effects (and other parameters)? Broska, Howes, and van Loon (1) propose using human subjects as a gold standard and applying prediction-powered inference (PPI) to correct inaccurate estimates based on LLM predictions.

The authors add two extensions to PPI, both implemented in the PPI Python library:
- The PPI correlation $\tilde \rho$ measures the *interchangeability* of gold-standard data (observed human behavior) and predictions (LLM-predicted behavior). Values closer to 1 mean that an estimate based on predictions is close to an estimate obtained from observations.
- If predictions and observations are not interchangeable (PPI correlation < 1), their power analysis resolves a trade-off: researchers can choose the optimal mix of informative but costly human subjects and less informative but cheap predictions of human behavior.
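
For intuition about how PPI corrects prediction-based estimates, here is a minimal sketch for the simplest possible target, a mean. It is not the AMCE procedure used in this notebook and does not use the PPI Python library; the function name `ppi_mean`, the tuning weight `lam`, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def ppi_mean(y, yhat, yhat_unlabeled, z=1.96):
    """Power-tuned PPI estimate of a mean with a normal-approximation 95% CI."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    yhat_unlabeled = np.asarray(yhat_unlabeled, dtype=float)
    n, N = len(y), len(yhat_unlabeled)

    # Tuning weight: 0 ignores the predictions; larger values lean on them more.
    var_f = np.var(np.concatenate([yhat, yhat_unlabeled]), ddof=1)
    cov_yf = np.cov(y, yhat, ddof=1)[0, 1]
    lam = 0.0 if var_f == 0 else cov_yf / ((1 + n / N) * var_f)

    # Prediction-based estimate, corrected by the prediction error observed
    # on the gold-standard (human) sample.
    est = lam * yhat_unlabeled.mean() + (y - lam * yhat).mean()
    var = (np.var(y - lam * yhat, ddof=1) / n
           + lam**2 * np.var(yhat_unlabeled, ddof=1) / N)
    half = z * np.sqrt(var)
    return est, est - half, est + half

# Simulated example (hypothetical data): the true mean is 0 and the
# predictions are informative but biased upward by 0.5.
rng = np.random.default_rng(0)
y_all = rng.normal(0.0, 1.0, size=20_500)
yhat_all = y_all + 0.5 + rng.normal(0.0, 0.5, size=20_500)
y, yhat, yhat_unlabeled = y_all[:500], yhat_all[:500], yhat_all[500:]

est, lo, hi = ppi_mean(y, yhat, yhat_unlabeled)
classical_half = 1.96 * y.std(ddof=1) / np.sqrt(len(y))
print(f"prediction-only mean: {yhat_unlabeled.mean():.3f}")
print(f"PPI estimate: {est:.3f} [{lo:.3f}, {hi:.3f}]")
print(f"classical half-width: {classical_half:.3f}, PPI half-width: {(hi - lo) / 2:.3f}")
```

In this mean-only setting, the precision gained over using the human sample alone depends on the Pearson correlation between observations and predictions, which gives some intuition for why a measure of interchangeability matters.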

In the Moral Machine Experiment (2), a sudden brake failure in an autonomous vehicle forces a decision between harming passengers or pedestrians. LLMs were prompted to predict which group the survey respondents chose to spare.

1. Broska D, Howes M, van Loon A. The Mixed Subjects Design: Treating Large Language Models as (Potentially) Informative Observations [Internet]. OSF; 2024 [cited 2024 Aug 29]. Available from: https://osf.io/j3bnt
2. Awad E, Dsouza S, Kim R, Schulz J, Henrich J, Shariff A, et al. The Moral Machine experiment. Nature. 2018 Nov;563(7729):59–64. 



In [4]:
%load_ext autoreload
%autoreload 2
import os, sys
import pandas as pd
import random
from utils import compute_amce_ppi

df = pd.read_csv("5_SurveySampleLLM_ppipy.csv.gz")

Covs = ['PedPed', 'Barrier', 'CrossingSignal', 'NumberOfCharacters',
        'DiffNumberOFCharacters', 'LeftHand', 'Man', 'Woman', 'Pregnant',
        'Stroller', 'OldMan', 'OldWoman', 'Boy', 'Girl', 'Homeless',
        'LargeWoman', 'LargeMan', 'Criminal', 'MaleExecutive',
        'FemaleExecutive', 'FemaleAthlete', 'MaleAthlete', 'FemaleDoctor',
        'MaleDoctor', 'Dog', 'Cat', 
        'Intervention'
        ]

sys.version

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


'3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]'

# Findings

For this particular experiment, the researchers found that GPT4's predictions of decisions in moral dilemmas are mostly not interchangeable with the observed decisions of survey respondents (PPI correlation < 0.36).

In [5]:
# basic statistics
print("Number of respondents: ", len(df["UserID"].unique()))
print("Number of decisions: ", len(df["ResponseID"].unique()))
print("Number of NAs in observed dependent variable: ", df["Saved"].isna().sum())
print("Number of NAs in predicted dependent variable with GPT4 Turbo: ", df["gpt4turbo_wp_Saved"].isna().sum())

Number of respondents:  2097
Number of decisions:  22315
Number of NAs in observed dependent variable:  0
Number of NAs in predicted dependent variable with GPT4 Turbo:  0


In [6]:
ids = df["ResponseID"].unique()
n = 1000           # size of the human (gold-standard) subsample
N = len(ids) - n   # remaining decisions, used for the silicon sample
random.seed(2025)

n_ids = random.sample(ids.tolist(), k=n)
N_ids = random.sample(list(set(ids) - set(n_ids)), k=N)

df_human = df[df["ResponseID"].isin(n_ids)]
df_silicon = df[df["ResponseID"].isin(N_ids)]

# predicted dependent variable
y = "gpt4turbo_wp_Saved"

# attributes for which compute_amce_ppi is run, in display order
attributes = ["Intervention", "Barrier", "Gender", "Fitness", "Social Status",
              "CrossingSignal", "Age", "Utilitarian", "Species"]

results = pd.concat(
    [compute_amce_ppi(df_human, df_silicon, x=x, y=y) for x in attributes],
    ignore_index=True,
)

results

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  n_data.loc[:,"weights"] = calcWeightsTheoretical(n_data)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  N_data.loc[:,"weights"] = calcWeightsTheoretical(N_data)


| | y | x | pointest_ppi | conf_low_ppi | conf_high_ppi | conf_low_ols | conf_high_ols |
|---|---|---|---|---|---|---|---|
| 0 | gpt4turbo_wp_Saved | Intervention | 0.069295 | 0.020979 | 0.117609 | 0.019069 | 0.122593 |
| 1 | gpt4turbo_wp_Saved | Barrier | 0.19838 | 0.128527 | 0.268477 | 0.138501 | 0.286847 |
| 2 | gpt4turbo_wp_Saved | Gender | 0.165286 | 0.045793 | 0.284644 | 0.053363 | 0.299231 |
| 3 | gpt4turbo_wp_Saved | Fitness | 0.10179 | -0.022217 | 0.225559 | -0.032351 | 0.223446 |
| 4 | gpt4turbo_wp_Saved | Social Status | 0.265848 | -0.027067 | 0.560427 | -0.05297 | 0.657442 |
| 5 | gpt4turbo_wp_Saved | CrossingSignal | 0.267833 | 0.17499 | 0.360669 | 0.168602 | 0.357392 |
| 6 | gpt4turbo_wp_Saved | Age | 0.445832 | 0.332806 | 0.554985 | 0.3049 | 0.53437 |
| 7 | gpt4turbo_wp_Saved | Utilitarian | 0.597802 | 0.503678 | 0.691509 | 0.508236 | 0.697773 |
| 8 | gpt4turbo_wp_Saved | Species | 0.578773 | 0.482628 | 0.674918 | 0.482753 | 0.674793 |
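
One way to read the interval columns in the results above is to compare the width of the PPI interval with the width of the OLS interval for each attribute. The sketch below re-enters the interval bounds from the output by hand so that it runs on its own; in the notebook the same three column operations could be applied to `results` directly. The column names `width_ppi`, `width_ols`, and `width_ratio` are introduced here for illustration, and the reading of the `*_ols` columns as the human-sample-only benchmark is an assumption based on their names.

```python
import io
import pandas as pd

# Interval bounds copied from the results output above.
csv = """x,conf_low_ppi,conf_high_ppi,conf_low_ols,conf_high_ols
Intervention,0.020979,0.117609,0.019069,0.122593
Barrier,0.128527,0.268477,0.138501,0.286847
Gender,0.045793,0.284644,0.053363,0.299231
Fitness,-0.022217,0.225559,-0.032351,0.223446
Social Status,-0.027067,0.560427,-0.05297,0.657442
CrossingSignal,0.17499,0.360669,0.168602,0.357392
Age,0.332806,0.554985,0.3049,0.53437
Utilitarian,0.503678,0.691509,0.508236,0.697773
Species,0.482628,0.674918,0.482753,0.674793
"""
ci = pd.read_csv(io.StringIO(csv))

# Interval widths and their ratio; a ratio below 1 means the PPI
# interval is narrower than the OLS interval for that attribute.
ci["width_ppi"] = ci["conf_high_ppi"] - ci["conf_low_ppi"]
ci["width_ols"] = ci["conf_high_ols"] - ci["conf_low_ols"]
ci["width_ratio"] = ci["width_ppi"] / ci["width_ols"]

print(ci[["x", "width_ppi", "width_ols", "width_ratio"]].round(3))
```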


# Conclusions
- The silicon sampling design may or may not produce valid point estimates, and this cannot be ascertained without validation on human subjects. PPI builds this validation in by applying a statistical correction to the point estimates derived from LLM predictions.

- Mixed subjects designs, with methods such as PPI, could enhance scientific productivity and reduce inequality in access to costly evidence on research questions by offering valid, precise, and cost-effective inferences on causal effects and other parameters.

- If LLMs and other algorithms become more capable of predicting human behavior in the future (higher PPI correlation), the cost of obtaining precise estimates with mixed subjects designs will further decrease relative to human subjects experiments.
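
The cost argument in the last point can be illustrated with a rough sketch for the simplest case, estimating a mean with power-tuned PPI. In that case the standard error with $n$ human observations and $N$ predictions is approximately $\sigma \sqrt{(1 - \rho^2 N/(n+N))/n}$, where $\rho$ is the correlation between observations and predictions. This is not the authors' power analysis and does not cover the AMCE estimates above; the function names, the grid of candidate designs, and the per-unit costs (1.0 per human response, 0.01 per prediction) are hypothetical.

```python
import numpy as np

def ppi_se(n, N, rho, sigma=0.5):
    """Approximate standard error of a power-tuned PPI mean estimate."""
    return sigma * np.sqrt((1 - rho**2 * N / (n + N)) / n)

def cheapest_design(rho, target_se, cost_human=1.0, cost_pred=0.01,
                    sigma=0.5, max_n=5000):
    """Grid-search the cheapest (n, N) whose standard error meets target_se."""
    n = np.arange(1, max_n + 1)[:, None]      # candidate human sample sizes
    N = (np.arange(0, 201) * 500)[None, :]    # candidate prediction counts
    se = ppi_se(n, N, rho, sigma)
    cost = cost_human * n + cost_pred * N
    cost = np.where(se <= target_se, cost, np.inf)  # rule out imprecise designs
    i, j = np.unravel_index(np.argmin(cost), cost.shape)
    return int(n[i, 0]), int(N[0, j]), float(cost[i, j])

# Cheapest design for the same target precision at different correlations.
for rho in (0.0, 0.5, 0.9):
    n_opt, N_opt, total = cheapest_design(rho, target_se=0.021)
    print(f"rho={rho:.1f}: n={n_opt}, N={N_opt}, cost={total:.1f}")
```

With a correlation of zero the predictions add nothing, so the search falls back to a human-only design; as the correlation rises, cheap predictions can replace part of the human sample at the same target precision.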
