# Moral Machine Experiment with LLMs and PPI

Prior work shows that LLMs can make serious errors when predicting human behavior. How can we draw valid, precise, and cost-effective inferences on causal effects (and other parameters) from LLM predictions? Broska, Howes, and van Loon (1) propose using human subjects as a gold standard to correct inaccurate LLM-based estimates with prediction-powered inference (PPI).

The authors add two extensions to PPI, implemented in the PPI Python library (`ppi_py`).
- The PPI correlation $\tilde \rho$ measures the *interchangeability* of gold-standard data (observed human behavior) and predictions (LLM predicted behavior). Values closer to 1 mean that an estimate based on predictions is close to an estimate obtained from observations.
- If predictions and observations are not interchangeable (PPI correlation<1), their power analysis resolves a trade-off. Researchers can optimally choose between recruiting informative but costly human subjects or obtaining less informative but cheap predictions of human behavior.
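
To make the trade-off concrete, here is a minimal sketch of how the PPI correlation translates into an "effective" human sample size. It assumes the simplest case, estimating a mean with a power-tuned PPI estimator, where the variance is $\frac{\sigma^2}{n}\left(1 - \tilde\rho^2 \frac{N}{n+N}\right)$. The function name `effective_sample_size` is hypothetical and is not part of `ppi_py`; the formula for regression coefficients in the paper may differ in detail.

```python
def effective_sample_size(n: int, N: int, rho: float) -> float:
    """Number of human subjects that would give the same precision as a
    mixed design with n human subjects and N LLM predictions.

    Assumption: mean estimation with a power-tuned PPI estimator, whose
    variance is (sigma^2 / n) * (1 - rho^2 * N / (n + N)).
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    return n * (n + N) / (n + N * (1.0 - rho ** 2))


# n and N match the split used later in this notebook (750 and 22315 - 750)
for rho in (0.0, 0.36, 0.9, 1.0):
    print(rho, round(effective_sample_size(750, 21565, rho), 1))
```

With $\tilde\rho = 0$ the predictions add nothing (the effective sample size equals $n$), and with $\tilde\rho = 1$ every prediction counts as much as a human response (the effective sample size equals $n + N$).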

In the Moral Machine experiment (2), a sudden brake failure in an autonomous vehicle forces a decision between harming passengers or pedestrians. LLMs were prompted to predict which group the survey respondents chose to spare.

1. Broska D, Howes M, van Loon A. The Mixed Subjects Design: Treating Large Language Models as (Potentially) Informative Observations [Internet]. OSF; 2024 [cited 2024 Aug 29]. Available from: https://osf.io/j3bnt
2. Awad E, Dsouza S, Kim R, Schulz J, Henrich J, Shariff A, et al. The Moral Machine experiment. Nature. 2018 Nov;563(7729):59–64. 



In [26]:
%load_ext autoreload
%autoreload 2
import os, sys
#sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))

# setup
import pandas as pd
import numpy as np
import random
import statsmodels.api as sm
from ppi_py import ppi_ols_ci, classical_ols_ci, ppi_ols_pointestimate

df = pd.read_csv("5_SurveySampleLLM_ppipy.csv.gz")

Covs = ['PedPed', 'Barrier', 'CrossingSignal', 'NumberOfCharacters',
        'DiffNumberOFCharacters', 'LeftHand', 'Man', 'Woman', 'Pregnant',
        'Stroller', 'OldMan', 'OldWoman', 'Boy', 'Girl', 'Homeless',
        'LargeWoman', 'LargeMan', 'Criminal', 'MaleExecutive',
        'FemaleExecutive', 'FemaleAthlete', 'MaleAthlete', 'FemaleDoctor',
        'MaleDoctor', 'Dog', 'Cat', 
        'Intervention'
        ]

sys.version


'3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]'

# Empirical application 



In [28]:
# functions to calculate weights for conjoint experiment
def CalcTheoreticalInt(r):
    # this function is applied to each row (r)
    if r["Intervention"]==0:
        if r["Barrier"]==0:
            if r["PedPed"]==1: p = 0.48
            else: p = 0.32
            
            if r["CrossingSignal"]==0:   p = p * 0.48
            elif r["CrossingSignal"]==1: p = p * 0.2
            else: p = p * 0.32
        else: p = 0.2

    else: 
        if r["Barrier"]==0:
            if r["PedPed"]==1: 
                p = 0.48
                if r["CrossingSignal"]==0: p = p * 0.48
                elif r["CrossingSignal"]==1: p = p * 0.32
                else: p = p * 0.2
            else: 
                p = 0.2
                if r["CrossingSignal"]==0: p = p * 0.48
                elif r["CrossingSignal"]==1: p = p * 0.2
                else: p = p * 0.32
        else: p = 0.32  
    
    return(p)  
        
def calcWeightsTheoretical(profiles):
    
    p = profiles.apply(CalcTheoreticalInt, axis=1)

    weight = 1/p 

    return(weight) 

In [29]:
# function to calculate amce with ppi 
def compute_amce_ppi(n_data, N_data, x, y, alpha=0.05):

    # specify regression for swerve or stay in lane
    if x=="Intervention":
        
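        # calculate weights on copies so the caller's DataFrames are not modified
        # (this also avoids pandas' SettingWithCopyWarning)
        n_data = n_data.copy()
        N_data = N_data.copy()
        n_data.loc[:,"weights"] = calcWeightsTheoretical(n_data)
        N_data.loc[:,"weights"] = calcWeightsTheoretical(N_data)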
    
        # drop rows with missing values on dependent variable
        n_dd = n_data.dropna(subset=y)
        N_dd = N_data.dropna(subset=y)

        # if X=1 characters die if AV swerves, if X=0 characters die if AV stays
        n_X = n_dd["Intervention"]               
        N_X = N_dd["Intervention"]

        # add intercept
        n_X = np.column_stack((np.ones(n_X.shape[0]), n_X))
        N_X = np.column_stack((np.ones(N_X.shape[0]), N_X))

        # gold standard data
        n_Y_human   = n_dd["Saved"].to_numpy()    # observed outcomes
        n_Y_silicon = n_dd[y].to_numpy()          # predicted outcomes
        n_weights = n_dd["weights"].to_numpy()    # define weights

        # unlabeled data
        N_Y_silicon = N_dd[y].to_numpy()          # predicted outcomes
        N_weights = N_dd["weights"].to_numpy()    # define weights



    # specify regression for relationship to vehicle
    if x=="Barrier":

        # consider only dilemmas without legality and only pedestrians vs passengers
        n_data_sub = n_data.loc[(n_data["CrossingSignal"]==0) & (n_data["PedPed"]==0), :].copy()
        N_data_sub = N_data.loc[(N_data["CrossingSignal"]==0) & (N_data["PedPed"]==0), :].copy()

        # calculate weights
        n_data_sub.loc[:,"weights"] = calcWeightsTheoretical(n_data_sub)
        N_data_sub.loc[:,"weights"] = calcWeightsTheoretical(N_data_sub)

        # drop rows with missing values on dependent variable
        n_dd = n_data_sub.dropna(subset=y)
        N_dd = N_data_sub.dropna(subset=y)
        
        # if X=1 passengers die and if X=0 pedestrians die
        n_X = n_dd["Barrier"]
        N_X = N_dd["Barrier"]

        # recode to estimate the preference for pedestrians over passengers 
        n_X = 1 - n_X
        N_X = 1 - N_X

        # add intercept
        n_X = np.column_stack((np.ones(n_X.shape[0]), n_X))
        N_X = np.column_stack((np.ones(N_X.shape[0]), N_X))

        # gold standard data
        n_Y_human   = n_dd["Saved"].to_numpy()    # observed outcomes
        n_Y_silicon = n_dd[y].to_numpy()          # predicted outcomes
        n_weights = n_dd["weights"].to_numpy()    # define weights

        # unlabeled data
        N_Y_silicon = N_dd[y].to_numpy()          # predicted outcomes
        N_weights = N_dd["weights"].to_numpy()    # define weights

    

    # specify regression for legality
    if x=="CrossingSignal": 
        
        # consider dilemmas with legality and only pedestrians vs pedestrians
        n_data_sub = n_data.loc[(n_data["CrossingSignal"]!=0) & (n_data["PedPed"]==1), :].copy()
        N_data_sub = N_data.loc[(N_data["CrossingSignal"]!=0) & (N_data["PedPed"]==1), :].copy()

        # calculate weights
        n_data_sub.loc[:,"weights"] = calcWeightsTheoretical(n_data_sub)
        N_data_sub.loc[:,"weights"] = calcWeightsTheoretical(N_data_sub)

        # drop rows with missing values on dependent variable
        n_dd = n_data_sub.dropna(subset=y)
        N_dd = N_data_sub.dropna(subset=y)

        # if X=1 pedestrians cross on a green light, if X=2 pedestrians cross on a red light 
        n_X = n_dd["CrossingSignal"]
        N_X = N_dd["CrossingSignal"]

        # create dummy variable to estimate preference for pedestrians that cross legally (1) vs illegally (0)
        n_X = 2 - n_X 
        N_X = 2 - N_X 

        # add intercept
        n_X = np.column_stack((np.ones(n_X.shape[0]), n_X))
        N_X = np.column_stack((np.ones(N_X.shape[0]), N_X))

        # gold standard data
        n_Y_human   = n_dd["Saved"].to_numpy()    # observed outcomes
        n_Y_silicon = n_dd[y].to_numpy()          # predicted outcomes
        n_weights = n_dd["weights"].to_numpy()    # define weights

        # unlabeled data
        N_Y_silicon = N_dd[y].to_numpy()          # predicted outcomes
        N_weights = N_dd["weights"].to_numpy()    # define weights
    


    # Specify regressions for the remaining six attributes
    if x=="Utilitarian":
        
        # consider dilemmas that compare 'More' versus 'Less' characters
        n_data_sub = n_data.loc[(n_data["ScenarioType"]=="Utilitarian") & (n_data["ScenarioTypeStrict"]=="Utilitarian"), :].copy()
        N_data_sub = N_data.loc[(N_data["ScenarioType"]=="Utilitarian") & (N_data["ScenarioTypeStrict"]=="Utilitarian"), :].copy()

        # calculate weights
        n_data_sub.loc[:,"weights"] = calcWeightsTheoretical(n_data_sub)
        N_data_sub.loc[:,"weights"] = calcWeightsTheoretical(N_data_sub)

        # drop rows with missing values on dependent variable
        n_dd = n_data_sub.dropna(subset=y)
        N_dd = N_data_sub.dropna(subset=y)
        
        # rename column to extract coefficient from result
        n_dd = n_dd.rename(columns = {'AttributeLevel': 'Utilitarian'})
        N_dd = N_dd.rename(columns = {'AttributeLevel': 'Utilitarian'})

        # create dummy variable to estimate the preference for sparing more characters
        n_X = (n_dd.loc[:,"Utilitarian"]=="More").astype(int)
        N_X = (N_dd.loc[:,"Utilitarian"]=="More").astype(int)

        # add intercept
        n_X = np.column_stack((np.ones(n_X.shape[0]), n_X))
        N_X = np.column_stack((np.ones(N_X.shape[0]), N_X))

        # gold standard data
        n_Y_human   = n_dd["Saved"].to_numpy()    # observed outcomes
        n_Y_silicon = n_dd[y].to_numpy()          # predicted outcomes
        n_weights = n_dd["weights"].to_numpy()    # define weights

        # unlabeled data
        N_Y_silicon = N_dd[y].to_numpy()          # predicted outcomes
        N_weights = N_dd["weights"].to_numpy()    # define weights



    if x=="Species":
        
        # consider dilemmas that compare humans versus animals 
        n_data_sub = n_data.loc[(n_data["ScenarioType"]=="Species") & (n_data["ScenarioTypeStrict"]=="Species"), :].copy()
        N_data_sub = N_data.loc[(N_data["ScenarioType"]=="Species") & (N_data["ScenarioTypeStrict"]=="Species"), :].copy()

        # calculate weights
        n_data_sub.loc[:,"weights"] = calcWeightsTheoretical(n_data_sub)
        N_data_sub.loc[:,"weights"] = calcWeightsTheoretical(N_data_sub)

        # drop rows with missing values on dependent variable
        n_dd = n_data_sub.dropna(subset=y)
        N_dd = N_data_sub.dropna(subset=y)

        # rename column to extract coefficient from result
        n_dd = n_dd.rename(columns = {'AttributeLevel': 'Species'})
        N_dd = N_dd.rename(columns = {'AttributeLevel': 'Species'})

        # create dummy variable to estimate the preference for sparing humans
        n_X = (n_dd.loc[:,"Species"]=="Hoomans").astype(int)
        N_X = (N_dd.loc[:,"Species"]=="Hoomans").astype(int)

        # add intercept
        n_X = np.column_stack((np.ones(n_X.shape[0]), n_X))
        N_X = np.column_stack((np.ones(N_X.shape[0]), N_X))

        # gold standard data
        n_Y_human   = n_dd["Saved"].to_numpy()    # observed outcomes
        n_Y_silicon = n_dd[y].to_numpy()          # predicted outcomes
        n_weights = n_dd["weights"].to_numpy()    # define weights

        # unlabeled data
        N_Y_silicon = N_dd[y].to_numpy()          # predicted outcomes
        N_weights = N_dd["weights"].to_numpy()    # define weights

    

    if x=="Gender":
        
        # consider dilemmas that compare women versus men
        n_data_sub = n_data.loc[(n_data["ScenarioType"]=="Gender") & (n_data["ScenarioTypeStrict"]=="Gender"), :].copy()
        N_data_sub = N_data.loc[(N_data["ScenarioType"]=="Gender") & (N_data["ScenarioTypeStrict"]=="Gender"), :].copy()

        # calculate weights
        n_data_sub.loc[:,"weights"] = calcWeightsTheoretical(n_data_sub)
        N_data_sub.loc[:,"weights"] = calcWeightsTheoretical(N_data_sub)

        # drop rows with missing values on dependent variable
        n_dd = n_data_sub.dropna(subset=y)
        N_dd = N_data_sub.dropna(subset=y)

        # rename column to extract coefficient from result
        n_dd = n_dd.rename(columns = {'AttributeLevel': 'Gender'})
        N_dd = N_dd.rename(columns = {'AttributeLevel': 'Gender'})

        # create dummy variable to estimate the preference for sparing women
        n_X = (n_dd.loc[:,"Gender"]=="Female").astype(int)
        N_X = (N_dd.loc[:,"Gender"]=="Female").astype(int)

        # add intercept
        n_X = np.column_stack((np.ones(n_X.shape[0]), n_X))
        N_X = np.column_stack((np.ones(N_X.shape[0]), N_X))

        # gold standard data
        n_Y_human   = n_dd["Saved"].to_numpy()    # observed outcomes
        n_Y_silicon = n_dd[y].to_numpy()          # predicted outcomes
        n_weights = n_dd["weights"].to_numpy()    # define weights

        # unlabeled data
        N_Y_silicon = N_dd[y].to_numpy()          # predicted outcomes
        N_weights = N_dd["weights"].to_numpy()    # define weights



    if x=="Fitness":
        
        # consider dilemmas that compare fit characters versus those that are not
        n_data_sub = n_data.loc[(n_data["ScenarioType"]=="Fitness") & (n_data["ScenarioTypeStrict"]=="Fitness"), :].copy()
        N_data_sub = N_data.loc[(N_data["ScenarioType"]=="Fitness") & (N_data["ScenarioTypeStrict"]=="Fitness"), :].copy()

        # calculate weights
        n_data_sub.loc[:,"weights"] = calcWeightsTheoretical(n_data_sub)
        N_data_sub.loc[:,"weights"] = calcWeightsTheoretical(N_data_sub)

        # drop rows with missing values on dependent variable
        n_dd = n_data_sub.dropna(subset=y)
        N_dd = N_data_sub.dropna(subset=y)

        # rename column to extract coefficient from result
        n_dd = n_dd.rename(columns = {'AttributeLevel': 'Fitness'})
        N_dd = N_dd.rename(columns = {'AttributeLevel': 'Fitness'})

        # create dummy variable to estimate the preference for sparing fit characters
        n_X = (n_dd.loc[:,"Fitness"]=="Fit").astype(int)
        N_X = (N_dd.loc[:,"Fitness"]=="Fit").astype(int)

        # add intercept
        n_X = np.column_stack((np.ones(n_X.shape[0]), n_X))
        N_X = np.column_stack((np.ones(N_X.shape[0]), N_X))

        # gold standard data
        n_Y_human   = n_dd["Saved"].to_numpy()    # observed outcomes
        n_Y_silicon = n_dd[y].to_numpy()          # predicted outcomes
        n_weights = n_dd["weights"].to_numpy()    # define weights

        # unlabeled data
        N_Y_silicon = N_dd[y].to_numpy()          # predicted outcomes
        N_weights = N_dd["weights"].to_numpy()    # define weights



    if x=="Age":
        
        # consider dilemmas that compare younger versus older characters
        n_data_sub = n_data.loc[(n_data["ScenarioType"]=="Age") & (n_data["ScenarioTypeStrict"]=="Age"), :].copy()
        N_data_sub = N_data.loc[(N_data["ScenarioType"]=="Age") & (N_data["ScenarioTypeStrict"]=="Age"), :].copy()

        # calculate weights
        n_data_sub.loc[:,"weights"] = calcWeightsTheoretical(n_data_sub)
        N_data_sub.loc[:,"weights"] = calcWeightsTheoretical(N_data_sub)

        # drop rows with missing values on dependent variable
        n_dd = n_data_sub.dropna(subset=y)
        N_dd = N_data_sub.dropna(subset=y)

        # rename column to extract coefficient from result
        n_dd = n_dd.rename(columns = {'AttributeLevel': 'Age'})
        N_dd = N_dd.rename(columns = {'AttributeLevel': 'Age'})

        # create dummy variable to estimate the preference for sparing younger characters
        n_X = (n_dd.loc[:,"Age"]=="Young").astype(int)
        N_X = (N_dd.loc[:,"Age"]=="Young").astype(int)

        # add intercept
        n_X = np.column_stack((np.ones(n_X.shape[0]), n_X))
        N_X = np.column_stack((np.ones(N_X.shape[0]), N_X))

        # gold standard data
        n_Y_human   = n_dd["Saved"].to_numpy()    # observed outcomes
        n_Y_silicon = n_dd[y].to_numpy()          # predicted outcomes
        n_weights = n_dd["weights"].to_numpy()    # define weights

        # unlabeled data
        N_Y_silicon = N_dd[y].to_numpy()          # predicted outcomes
        N_weights = N_dd["weights"].to_numpy()    # define weights


    
    if x=="Social Status":
        
        # consider dilemmas that compare high status versus low status characters
        n_data_sub = n_data.loc[(n_data["ScenarioType"]=="Social Status") & (n_data["ScenarioTypeStrict"]=="Social Status"), :].copy()
        N_data_sub = N_data.loc[(N_data["ScenarioType"]=="Social Status") & (N_data["ScenarioTypeStrict"]=="Social Status"), :].copy()

        # calculate weights
        n_data_sub.loc[:,"weights"] = calcWeightsTheoretical(n_data_sub)
        N_data_sub.loc[:,"weights"] = calcWeightsTheoretical(N_data_sub)

        # drop rows with missing values on dependent variable
        n_dd = n_data_sub.dropna(subset=y)
        N_dd = N_data_sub.dropna(subset=y)

        # rename column to extract coefficient from result
        n_dd = n_dd.rename(columns = {'AttributeLevel': 'Social Status'})
        N_dd = N_dd.rename(columns = {'AttributeLevel': 'Social Status'})

        # create dummy variable to estimate the preference for sparing high status characters
        n_X = (n_dd.loc[:,"Social Status"]=="High").astype(int)
        N_X = (N_dd.loc[:,"Social Status"]=="High").astype(int)

        # add intercept
        n_X = np.column_stack((np.ones(n_X.shape[0]), n_X))
        N_X = np.column_stack((np.ones(N_X.shape[0]), N_X))

        # gold standard data
        n_Y_human   = n_dd.loc[:,"Saved"].to_numpy()    # observed outcomes
        n_Y_silicon = n_dd.loc[:,y].to_numpy()          # predicted outcomes
        n_weights = n_dd.loc[:,"weights"].to_numpy()    # define weights

        # unlabeled data
        N_Y_silicon = N_dd[y].to_numpy()          # predicted outcomes
        N_weights = N_dd.loc[:,"weights"].to_numpy()    # define weights


    # calculate point estimate
    pointest_ppi = ppi_ols_pointestimate(X=n_X, Y=n_Y_human, Yhat=n_Y_silicon, 
                                         X_unlabeled=N_X, Yhat_unlabeled=N_Y_silicon, 
                                         w=n_weights, w_unlabeled=N_weights)

    # calculate PPI confidence intervals
    lower_CI_ppi, upper_CI_ppi = ppi_ols_ci(X=n_X, Y=n_Y_human, Yhat=n_Y_silicon, 
                                            X_unlabeled=N_X, Yhat_unlabeled=N_Y_silicon, 
                                            w=n_weights, w_unlabeled=N_weights, alpha=alpha)
    
    # calculate OLS confidence intervals
    lower_CI_ols, upper_CI_ols = classical_ols_ci(X=n_X, Y=n_Y_human, w=n_weights, alpha=alpha)

    # calculate rho
    #beta = sm.WLS(n_Y_human, n_X, weights=n_weights).fit().params

    #grads, grads_hat, grads_hat_unlabeled, inv_hessian = _ols_get_stats(
    #    pointest=beta, 
    #    X=n_X,
    #    Y=n_Y_human,
    #   Yhat= n_Y_silicon,
    #    X_unlabeled=N_X,
    #    Yhat_unlabeled=N_Y_silicon,
    #   w=n_weights,
    #    w_unlabeled=N_weights,
    #    use_unlabeled=False)
    
    #rho_sq, var_y = _power_analysis_stats(grads, grads_hat, inv_hessian)

    # create and return the output DataFrame
    output_df = pd.DataFrame({
        "y": y,                              
        "x": x,                              # Predictor variable (scenario attribute)
        "pointest_ppi": pointest_ppi[1],     # PPI point estimate
        "conf_low_ppi": lower_CI_ppi[1],     # The lower bound of the PPI confidence interval
        "conf_high_ppi": upper_CI_ppi[1],    # The upper bound of the PPI confidence interval
        "conf_low_ols": lower_CI_ols[1],     # The lower bound of the OLS confidence interval
        "conf_high_ols": upper_CI_ols[1],    # The upper bound of the OLS confidence interval
        #"rho": np.sqrt(rho_sq[1])          # The association between predictions and outcomes
    },index=[0])
    
    return output_df 
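
Every branch above builds a design matrix with an intercept and a single binary regressor, and the function reports the coefficient at index `[1]`. For a weighted least-squares fit of that form, the slope equals the weighted difference in mean outcomes between the two attribute levels. The sketch below checks this on synthetic data; the data and weights are made up for illustration, and it covers only the classical WLS slope, not the PPI correction.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=200)        # binary attribute level (0 or 1)
y = rng.binomial(1, 0.4 + 0.2 * x)      # binary outcome, analogous to "Saved"
w = rng.uniform(1.0, 5.0, size=200)     # hypothetical inverse-probability weights

# weighted least squares: scale rows by sqrt(w), then solve ordinary least squares
X = np.column_stack((np.ones(x.shape[0]), x))
sw = np.sqrt(w)
beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# weighted difference in means between the two attribute levels
diff = (np.average(y[x == 1], weights=w[x == 1])
        - np.average(y[x == 0], weights=w[x == 0]))

print(beta[1], diff)  # the two numbers agree up to floating-point error
```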

# Findings

For this particular experiment, the researchers found that GPT4's predictions of decisions in moral dilemmas are mostly not interchangeable with the observed decisions of survey respondents (PPI correlation < 0.36).

In [30]:
# basic statistics
print("Number of respondents: ", len(df["UserID"].unique()))
print("Number of decisions: ", len(df["ResponseID"].unique()))
print("Number of NAs in observed dependent variable: ", df["Saved"].isna().sum())
print("Number of NAs in predicted dependent variable with GPT4 Turbo: ", df["gpt4turbo_wp_Saved"].isna().sum())

Number of respondents:  2097
Number of decisions:  22315
Number of NAs in observed dependent variable:  0
Number of NAs in predicted dependent variable with GPT4 Turbo:  0


In [32]:
ids = df["ResponseID"].unique()
n = 750
N = len(ids) - n
random.seed(2024)

n_ids = random.sample(ids.tolist(), k=n)
N_ids = random.sample(list(set(ids) - set(n_ids)), k=N)

df_human = df[ df["ResponseID"].isin(n_ids) ]
df_silicon = df [ df["ResponseID"].isin(N_ids)]

models = ["gpt4turbo_wp_Saved"]#,"gpt4o_wp_Saved","gpt35turbo0125_wp_Saved"]

results2 = pd.DataFrame()
for model in models: 
    
    print("Model: ", model)
    results1 = pd.concat([
        compute_amce_ppi(df_human, df_silicon, x="Intervention", y=model), 
        compute_amce_ppi(df_human, df_silicon, x="Barrier", y=model), 
        compute_amce_ppi(df_human, df_silicon, x="Gender", y=model), 
        compute_amce_ppi(df_human, df_silicon, x="Fitness", y=model), 
        compute_amce_ppi(df_human, df_silicon, x="Social Status", y=model), 
        compute_amce_ppi(df_human, df_silicon, x="CrossingSignal",y=model),
        compute_amce_ppi(df_human, df_silicon, x="Age", y=model),
        compute_amce_ppi(df_human, df_silicon, x="Utilitarian", y=model),
        compute_amce_ppi(df_human, df_silicon, x="Species", y=model)
        ],ignore_index=True)
    
    results2 = pd.concat([results2, results1],ignore_index=True)
    

results2 

Model:  gpt4turbo_wp_Saved


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  n_data.loc[:,"weights"] = calcWeightsTheoretical(n_data)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  N_data.loc[:,"weights"] = calcWeightsTheoretical(N_data)


| | y | x | pointest_ppi | conf_low_ppi | conf_high_ppi | conf_low_ols | conf_high_ols |
|---|---|---|---|---|---|---|---|
| 0 | gpt4turbo_wp_Saved | Intervention | 0.07472 | 0.018721 | 0.130527 | 0.004623 | 0.124348 |
| 1 | gpt4turbo_wp_Saved | Barrier | 0.164356 | 0.079992 | 0.248722 | 0.080195 | 0.255824 |
| 2 | gpt4turbo_wp_Saved | Gender | 0.115466 | -0.011634 | 0.243097 | 0.008117 | 0.280942 |
| 3 | gpt4turbo_wp_Saved | Fitness | 0.089433 | -0.044907 | 0.223705 | -0.004669 | 0.283155 |
| 4 | gpt4turbo_wp_Saved | Social Status | -0.042173 | -0.495101 | 0.410756 | -0.485565 | 0.40122 |
| 5 | gpt4turbo_wp_Saved | CrossingSignal | 0.338596 | 0.231966 | 0.444595 | 0.191863 | 0.414262 |
| 6 | gpt4turbo_wp_Saved | Age | 0.512092 | 0.385869 | 0.637253 | 0.38174 | 0.634357 |
| 7 | gpt4turbo_wp_Saved | Utilitarian | 0.618782 | 0.515252 | 0.725339 | 0.523043 | 0.733611 |
| 8 | gpt4turbo_wp_Saved | Species | 0.656106 | 0.551581 | 0.761534 | 0.536525 | 0.747782 |
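
One way to read these results is to compare the width of the PPI interval with the width of the OLS interval for each attribute. The sketch below does this with the interval bounds copied from the output above; a ratio below 1 means the PPI interval is narrower. The variable names are illustrative and not part of the original analysis.

```python
# (attribute, conf_low_ppi, conf_high_ppi, conf_low_ols, conf_high_ols)
rows = [
    ("Intervention",   0.018721, 0.130527,  0.004623, 0.124348),
    ("Barrier",        0.079992, 0.248722,  0.080195, 0.255824),
    ("Gender",        -0.011634, 0.243097,  0.008117, 0.280942),
    ("Fitness",       -0.044907, 0.223705, -0.004669, 0.283155),
    ("Social Status", -0.495101, 0.410756, -0.485565, 0.40122),
    ("CrossingSignal", 0.231966, 0.444595,  0.191863, 0.414262),
    ("Age",            0.385869, 0.637253,  0.38174,  0.634357),
    ("Utilitarian",    0.515252, 0.725339,  0.523043, 0.733611),
    ("Species",        0.551581, 0.761534,  0.536525, 0.747782),
]

# ratio of PPI interval width to OLS interval width, per attribute
width_ratio = {
    name: (high_ppi - low_ppi) / (high_ols - low_ols)
    for name, low_ppi, high_ppi, low_ols, high_ols in rows
}

for name, ratio in width_ratio.items():
    print(f"{name:15s} {ratio:.3f}")
```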


(a) Statistical precision: We confirm that adding more LLM predictions narrows the confidence intervals, and does so more strongly for higher values of the PPI correlation (higher interchangeability of predicted and observed behavior).

(b) Validity: The percent of PPI confidence intervals that cover the parameter remains stable at high levels when relying on LLM predictions. In contrast, the coverage for a naive approach that pools predictions and observations remains stable only for some independent variables.
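
The difference between the naive approach and PPI can be illustrated with a small simulation for a mean. This is a sketch on synthetic data, not the authors' analysis: the true mean, the error rates of the predictions, and the helper `simulate` are all assumptions chosen for illustration, and the estimator shown is the basic PPI mean estimator (the mean of the predictions plus a bias correction from the labeled sample), which may differ from the tuned estimator that `ppi_py` uses for regression.

```python
import numpy as np

rng = np.random.default_rng(2024)
n, N = 750, 21565          # labeled (human) and unlabeled (LLM-only) sample sizes
true_mean = 0.6            # assumed true share of "Saved" decisions


def simulate(size):
    """Draw binary outcomes and systematically biased binary predictions."""
    y = rng.binomial(1, true_mean, size=size)
    # predictions say 1 with probability 0.9 when y == 1 and 0.5 when y == 0,
    # so they over-predict the outcome on average
    yhat = rng.binomial(1, np.where(y == 1, 0.9, 0.5))
    return y, yhat


y_lab, yhat_lab = simulate(n)
_, yhat_unlab = simulate(N)   # human outcomes are unobserved for these rows

# naive approach: pool observed outcomes and predictions as if interchangeable
naive = np.concatenate((y_lab, yhat_unlab)).mean()

# basic PPI: mean prediction plus the average prediction error on labeled data
ppi = yhat_unlab.mean() + (y_lab - yhat_lab).mean()

print(f"true={true_mean:.3f}  naive={naive:.3f}  ppi={ppi:.3f}")
```

In this setup the pooled estimate inherits the bias of the predictions, while the correction term in the PPI estimate removes it at the cost of some added variance from the small labeled sample.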

# Conclusions
- The silicon sampling design may or may not produce valid point estimates, but this is impossible to ascertain without validation on human subjects. PPI automatically handles this validation through a statistical correction to the point estimates derived from LLM predictions.

- Mixed subjects designs, with methods such as PPI, could enhance scientific productivity and reduce inequality in access to costly evidence on research questions by offering valid, precise, and cost-effective inferences on causal effects and other parameters.

- If LLMs and other algorithms become more capable of predicting human behavior in the future (higher PPI correlation), the cost of obtaining precise estimates with mixed subjects designs will further decrease relative to human subjects experiments.
