# Moral Machine Dataset Analysis

This notebook downloads the Moral Machine dataset directly from the Open Science Framework (OSF), unzips it, and loads it into a pandas DataFrame for preliminary analysis.

We are trying to find scenarios in the dataset that present a direct conflict between utilitarian (U) and deontological (D) choices.

### Step 1: Download the Dataset

The compressed data file is quite large (~1.5 GB), so downloading it with `wget` can take a few minutes. The cells below assume the archive has already been saved to Google Drive: they mount the drive and change into the project directory.

In [None]:
# Import necessary libraries
from google.colab import drive
import pandas as pd
import os

# --- 1. Mount Google Drive ---
print("Mounting Google Drive...")
drive.mount('/content/drive')
print("Drive mounted successfully.")

Mounting Google Drive...
Mounted at /content/drive
Drive mounted successfully.


In [None]:
# Retrieve the secret value for the project path
from google.colab import userdata
WORKING_DIR = userdata.get('moral_path')
# change working directory
os.chdir(WORKING_DIR)
# check the current directory
!pwd

/content/drive/My Drive/_PhD/Moral-Reasoning/Experiments/Data/moral_machine


In [None]:
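print(os.listdir())
# Pick the archive explicitly: os.listdir()[0] can return the already-extracted
# CSV, which makes tar fail with "not in gzip format" (see the output below).
file_path = next(name for name in os.listdir() if name.endswith('.tar.gz'))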

# --- Decompress the .tar.gz File ---
# Extracts the contents of the archive.
print(f"\nDecompressing '{os.path.basename(file_path)}'...")
# Shell commands
!tar -xzvf "{file_path}"
print("Decompression complete.")


['SharedResponses.csv', 'SharedResponses.csv.tar.gz']

Decompressing 'SharedResponses.csv'...

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Decompression complete.
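
The failed run above comes from passing the extracted CSV to `tar` instead of the archive. A minimal sketch of the same step using Python's standard `tarfile` module, selecting the archive by its `.tar.gz` suffix. The helper name `extract_first_archive` is hypothetical, and the archive is assumed to come from a trusted source:

```python
import os
import tarfile


def extract_first_archive(directory="."):
    """Extract the first .tar.gz archive found in `directory`.

    Returns the list of member names stored in the archive.
    """
    # Select by suffix so an already-extracted CSV is never passed to tar.
    archives = sorted(
        name for name in os.listdir(directory) if name.endswith(".tar.gz")
    )
    if not archives:
        raise FileNotFoundError(f"No .tar.gz archive found in {directory!r}")

    archive_path = os.path.join(directory, archives[0])
    with tarfile.open(archive_path, mode="r:gz") as archive:
        members = archive.getnames()
        # Assumption: the archive is trusted, so all members are extracted.
        archive.extractall(path=directory)
    return members
```

In the notebook this could be called as `extract_first_archive(os.getcwd())` after changing into the project directory.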


In [None]:
# --- Read the CSV header ---
# The decompressed file is 'SharedResponses.csv'
csv_file_name = 'SharedResponses.csv'

# Read header row (nrows=0) and get the column names
try:
    column_names = pd.read_csv(csv_file_name, nrows=0).columns.tolist()
    print("Column names in the file:")
    print(column_names)
except FileNotFoundError:
    print(f"Error: Make sure '{csv_file_name}' is in your current directory.")

Here are the actual column names in the file:
['ResponseID', 'ExtendedSessionID', 'UserID', 'ScenarioOrder', 'Intervention', 'PedPed', 'Barrier', 'CrossingSignal', 'AttributeLevel', 'ScenarioTypeStrict', 'ScenarioType', 'DefaultChoice', 'NonDefaultChoice', 'DefaultChoiceIsOmission', 'NumberOfCharacters', 'DiffNumberOFCharacters', 'Saved', 'Template', 'DescriptionShown', 'LeftHand', 'UserCountry3', 'Man', 'Woman', 'Pregnant', 'Stroller', 'OldMan', 'OldWoman', 'Boy', 'Girl', 'Homeless', 'LargeWoman', 'LargeMan', 'Criminal', 'MaleExecutive', 'FemaleExecutive', 'FemaleAthlete', 'MaleAthlete', 'FemaleDoctor', 'MaleDoctor', 'Dog', 'Cat']


In [None]:
# --- Read a sample of the CSV into a DataFrame ---
columns_to_load = [
    'ResponseID',
    'ExtendedSessionID',
    'UserID',
    'ScenarioOrder',
    'Intervention',
    'PedPed',
    'Barrier',
    'CrossingSignal',
    'AttributeLevel',
    'ScenarioTypeStrict',
    'ScenarioType',
    'DefaultChoice',
    'NonDefaultChoice',
    'DefaultChoiceIsOmission',
    'NumberOfCharacters',
    'DiffNumberOFCharacters',
    'Saved',
    'Template',
    'DescriptionShown',
    'LeftHand',
    'UserCountry3',
    'Man',
    'Woman',
    'Pregnant',
    'Stroller',
    'OldMan',
    'OldWoman',
    'Boy',
    'Girl',
    'Homeless',
    'LargeWoman',
    'LargeMan',
    'Criminal',
    'MaleExecutive',
    'FemaleExecutive',
    'FemaleAthlete',
    'MaleAthlete',
    'FemaleDoctor',
    'MaleDoctor',
    'Dog',
    'Cat'
  ]

print(f"\nLoading '{csv_file_name}' into DataFrame...")
try:
    # Full load (memory-heavy, left disabled):
    # df = pd.read_csv(csv_file_name, usecols=columns_to_load)
    # Load only the first 100,000 rows as a working sample.
    df_sample = pd.read_csv(csv_file_name, usecols=columns_to_load, nrows=100000)
    print("\nSuccessfully loaded! Here are the first 5 rows:")
    print(df_sample.head())
except Exception as e:
    print(f"An error occurred: {e}")


Loading 'SharedResponses.csv' into DataFrame...

✅ Successfully loaded! Here are the first 5 rows:
          ResponseID              ExtendedSessionID        UserID  \
0  2222bRQqBTZ6dLnPH    32757157_6999801415950060.0  6.999801e+15   
1  2222sJk4DcoqXXi98        1043988516_3525281295.0  3.525281e+09   
2  2223CNmvTr2Coj4wp  -1613944085_422160228641876.0  4.221602e+14   
3  2223Xu54ufgjcyMR3   1425316635_327833569077076.0  3.278336e+14   
4  2223jMWDEGNeszivb  -1683127088_785070916172117.0  7.850709e+14   

   ScenarioOrder  Intervention  PedPed  Barrier  CrossingSignal  \
0              7             0       0        0               1   
1              2             0       0        0               0   
2             10             0       1        0               1   
3             11             0       0        1               0   
4              8             0       1        0               2   

  AttributeLevel ScenarioTypeStrict  ... LargeMan Criminal MaleExecutive  \
0     
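
The preview shows `UserID` rendered as a float (for example `6.999801e+15`), which can lose digits in long identifiers. A hedged sketch of reading identifier columns as strings through the `dtype` argument of `pd.read_csv`. The two-row CSV text is a made-up stand-in for `SharedResponses.csv`, and `ID_COLS` is a name introduced here:

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for the real file.
csv_text = (
    "ResponseID,ExtendedSessionID,UserID,Saved\n"
    "2222bRQqBTZ6dLnPH,32757157_6999801415950060.0,6999801415950060,1\n"
    "2222sJk4DcoqXXi98,1043988516_3525281295.0,3525281295,0\n"
)

# Columns that identify rows, sessions, or users rather than measure anything.
ID_COLS = ["ResponseID", "ExtendedSessionID", "UserID"]

# Reading identifiers as text keeps every digit intact.
df_ids = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in ID_COLS})

print(df_ids["UserID"].tolist())
```

The same `dtype` mapping could be passed alongside `usecols` and `nrows` when loading the sample.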

In [None]:
import pandas as pd

# List of all character columns from MM dataset
CHARACTER_COLS = [
    'Man', 'Woman', 'Pregnant', 'Stroller', 'OldMan', 'OldWoman', 'Boy', 'Girl',
    'Homeless', 'LargeWoman', 'LargeMan', 'Criminal', 'MaleExecutive', 'FemaleExecutive',
    'FemaleAthlete', 'MaleAthlete', 'FemaleDoctor', 'MaleDoctor', 'Dog', 'Cat'
]

def get_character_string(row, is_passenger):
    """Helper function to create a descriptive string for a group of characters."""
    chars = []
    # Assumption: passengers are flagged with -1 and pedestrians with 1.
    # The recorded outputs below always report "no one inside", so this
    # coding should be checked against the data.
    val_to_check = -1 if is_passenger else 1

    # Special handling for pluralisation
    def pluralise(name, count):
        if count == 1:
            return f"1 {name.lower()}"
        # Simple pluralisation, can be expanded for irregular nouns
        return f"{count} {name.lower()}s"

    for col in CHARACTER_COLS:
        # Check if the column exists in the row to avoid errors
        if col in row and row[col] == val_to_check:
            # Only exact matches are kept and each is described with a count of 1;
            # columns holding larger counts are skipped.
            chars.append(pluralise(col, 1))

    if not chars:
        return "no one"
    return " and ".join(chars)

def reconstruct_scenario(row):
    """
    Takes a row from the Moral Machine DataFrame and reconstructs the scenario description.
    """
    # Determine who the passengers and pedestrians are
    # Logic is based on the common "Barrier" vs "Pedestrian" scenarios
    passengers_string = get_character_string(row, is_passenger=True)
    pedestrians_string = get_character_string(row, is_passenger=False)

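    # Describe the setting
    # NOTE: PedPed is treated as the legality flag here; the three-valued
    # CrossingSignal column may be the more appropriate source and is worth checking.
    crossing_status = "legally" if row['PedPed'] == 1 else "illegally (jaywalking)"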
    setting_desc = f"An autonomous vehicle with {passengers_string} inside is heading toward a group of {pedestrians_string}. The pedestrians are crossing {crossing_status}."

    # Describe the choice and outcome
    choice = "swerve" if row['Intervention'] == 1 else "stay on course"
    if row['Saved'] == 1:
        outcome_desc = f"The user chose to {choice}, saving the passengers by sacrificing the pedestrians."
    else:
        outcome_desc = f"The user chose to {choice}, saving the pedestrians by sacrificing the passengers."

    return f"{setting_desc}\n\nCHOICE: {outcome_desc}"


# --- Example Usage ---
print(reconstruct_scenario(df_sample.iloc[0]))
print("\n---\n")
print(reconstruct_scenario(df_sample.iloc[100]))

An autonomous vehicle with no one inside is heading toward a group of 1 man and 1 woman and 1 femaleathlete. The pedestrians are crossing illegally (jaywalking).

CHOICE: The user chose to stay on course, saving the passengers by sacrificing the pedestrians.

---

An autonomous vehicle with no one inside is heading toward a group of 1 girl. The pedestrians are crossing illegally (jaywalking).

CHOICE: The user chose to stay on course, saving the pedestrians by sacrificing the passengers.
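
The descriptions above list every character with a count of 1 and run names together ("femaleathlete"). A minimal sketch of a count-aware alternative, assuming the character columns hold non-negative counts per row (the unique-value counts further down suggest they take several values). The names `label_for`, `describe_characters`, and `LABEL_OVERRIDES` are hypothetical, and the wording overrides are illustrative choices rather than anything defined by the dataset:

```python
import re

CHARACTER_COLS = [
    'Man', 'Woman', 'Pregnant', 'Stroller', 'OldMan', 'OldWoman', 'Boy', 'Girl',
    'Homeless', 'LargeWoman', 'LargeMan', 'Criminal', 'MaleExecutive',
    'FemaleExecutive', 'FemaleAthlete', 'MaleAthlete', 'FemaleDoctor',
    'MaleDoctor', 'Dog', 'Cat',
]

# Illustrative wording for columns that do not read well as nouns (assumption).
LABEL_OVERRIDES = {"Pregnant": "pregnant woman", "Homeless": "homeless person"}


def label_for(col):
    """Turn a column name such as 'FemaleAthlete' into 'female athlete'."""
    if col in LABEL_OVERRIDES:
        return LABEL_OVERRIDES[col]
    # Insert a space before every capital letter except the first.
    return re.sub(r"(?<!^)(?=[A-Z])", " ", col).lower()


def pluralise(label, count):
    """Return '<count> <label>' with a few irregular plurals handled."""
    if count == 1:
        return f"1 {label}"
    if label.endswith("person"):
        return f"{count} {label[:-6]}people"
    if label.endswith("man"):  # man -> men, woman -> women
        return f"{count} {label[:-3]}men"
    return f"{count} {label}s"


def describe_characters(row):
    """Describe the characters in a row (a dict or a pandas Series)."""
    parts = []
    for col in CHARACTER_COLS:
        value = row.get(col, 0)
        # Treat missing values (None or NaN, where NaN != NaN) as zero.
        count = 0 if value is None or value != value else int(value)
        if count > 0:
            parts.append(pluralise(label_for(col), count))
    if not parts:
        return "no one"
    if len(parts) == 1:
        return parts[0]
    return ", ".join(parts[:-1]) + " and " + parts[-1]


print(describe_characters({"Man": 1, "Woman": 2, "Dog": 1}))
```

Because it only relies on `.get`, the function could also be applied row-wise with `df_sample.apply(describe_characters, axis=1)`.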


In [None]:
import pandas as pd
from tqdm import tqdm

# --- 1. SETUP ---
# Using 'df_sample'
tqdm.pandas(desc="Generating Descriptions")


# --- 2. RECONSTRUCTION FUNCTIONS (repeated from the previous cell) ---
CHARACTER_COLS = [
    'Man', 'Woman', 'Pregnant', 'Stroller', 'OldMan', 'OldWoman', 'Boy', 'Girl',
    'Homeless', 'LargeWoman', 'LargeMan', 'Criminal', 'MaleExecutive', 'FemaleExecutive',
    'FemaleAthlete', 'MaleAthlete', 'FemaleDoctor', 'MaleDoctor', 'Dog', 'Cat'
]

def get_character_string(row, is_passenger):
    chars = []
    val_to_check = -1 if is_passenger else 1

    def pluralise(name, count):
        if count == 1:
            return f"1 {name.lower()}"
        return f"{count} {name.lower()}s"

    for col in CHARACTER_COLS:
        if col in row and row.get(col) == val_to_check:
            chars.append(pluralise(col, 1))
    if not chars:
        return "no one"
    return " and ".join(chars)

def reconstruct_scenario(row):
    passengers_string = get_character_string(row, is_passenger=True)
    pedestrians_string = get_character_string(row, is_passenger=False)
    crossing_status = "legally" if row.get('PedPed') == 1 else "illegally (jaywalking)"
    setting_desc = f"An autonomous vehicle with {passengers_string} inside is heading toward a group of {pedestrians_string}. The pedestrians are crossing {crossing_status}."
    choice = "swerve" if row.get('Intervention') == 1 else "stay on course"
    if row.get('Saved') == 1:
        outcome_desc = f"The user chose to {choice}, saving the passengers by sacrificing the pedestrians."
    else:
        outcome_desc = f"The user chose to {choice}, saving the pedestrians by sacrificing the passengers."
    return f"{setting_desc}\n\nCHOICE: {outcome_desc}"


# --- 3. APPLY, DEDUPLICATE, AND SAVE ---
print(f"Applying the function to the sample of {len(df_sample)} rows...")

# Generate the new 'ScenarioDescription' column
df_sample['ScenarioDescription'] = df_sample.progress_apply(reconstruct_scenario, axis=1)

# Log the original number of scenarios
original_count = len(df_sample)
print(f"Original number of scenarios: {original_count:,}")

# Remove duplicate scenarios
df_unique_scenarios = df_sample.drop_duplicates(subset=['ScenarioDescription']).copy()

# Log the number of unique scenarios
unique_count = len(df_unique_scenarios)
print(f"Number of unique scenarios after removing duplicates: {unique_count:,}")

# Define the output filename
output_filename = 'Moral_Machine_Unique_Scenarios_SAMPLE.csv'
print(f"\nSaving the {unique_count:,} unique scenarios to '{output_filename}'...")

# Save the unique scenarios to the CSV file
df_unique_scenarios[['ResponseID', 'ScenarioDescription']].to_csv(output_filename, index=False)

print("File saved successfully.")

Applying the function to the sample of 100000 rows...


Generating Descriptions: 100%|██████████| 100000/100000 [00:08<00:00, 11991.37it/s]


Original number of scenarios: 100,000
Number of unique scenarios after removing duplicates: 15,180

Saving the 15,180 unique scenarios to 'Moral_Machine_Unique_Scenarios_SAMPLE.csv'...
File saved successfully.
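
Deduplicating keeps only the first `ResponseID` per description, so the number of responses behind each unique scenario is lost. One possible way to retain it, sketched on a small made-up DataFrame (the `FirstResponseID` and `ResponseCount` column names are assumptions introduced here):

```python
import pandas as pd

# Made-up stand-in for df_sample with its generated descriptions.
demo = pd.DataFrame({
    "ResponseID": ["r1", "r2", "r3", "r4"],
    "ScenarioDescription": ["A", "B", "A", "A"],
})

# One row per unique description, keeping the first ID and the group size.
summary = (
    demo.groupby("ScenarioDescription", sort=False)
    .agg(
        FirstResponseID=("ResponseID", "first"),
        ResponseCount=("ResponseID", "size"),
    )
    .reset_index()
)

print(summary)
```

With `sort=False` the groups stay in order of first appearance, which mirrors what `drop_duplicates` keeps.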


In [None]:
# Again using df_sample, not df - check variable panel before running

print("Filtering the sample for clear utilitarian vs. deontological conflicts...")

# Filter the sample where the ScenarioType is 'Utilitarian'
df_sample_conflict = df_sample[df_sample['ScenarioType'] == 'Utilitarian'].copy()

print(f"Found {len(df_sample_conflict):,} scenarios that present this conflict in your sample.")
print("Here is a preview of these scenarios:")

# Display columns relevant to the conflict
print(df_sample_conflict[['ResponseID', 'Intervention', 'Saved', 'DiffNumberOFCharacters']].head())

Filtering the sample for clear utilitarian vs. deontological conflicts...
Found 17,933 scenarios that present this conflict in your sample.
Here is a preview of these scenarios:
           ResponseID  Intervention  Saved  DiffNumberOFCharacters
4   2223jMWDEGNeszivb             0      0                     2.0
8   2224g4ytARX4QT5rB             0      1                     1.0
10  2225gNWJcAeE92LXd             0      1                     4.0
14  2225yzLoy7yvKaToo             0      0                     3.0
17  2227D8o2onrAzLT4b             0      0                     2.0
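
The count above covers only the 100,000-row sample. A hedged sketch of how the same filter might be tallied over the full file without loading it at once, using the `chunksize` option of `pd.read_csv`. The function name `count_scenario_type` is hypothetical, and the in-memory CSV is a tiny stand-in for `SharedResponses.csv`:

```python
import io

import pandas as pd


def count_scenario_type(csv_source, scenario_type="Utilitarian", chunksize=50_000):
    """Return (matching_rows, total_rows) for one ScenarioType value.

    Only the ScenarioType column is read, one chunk at a time.
    """
    matching = 0
    total = 0
    reader = pd.read_csv(csv_source, usecols=["ScenarioType"], chunksize=chunksize)
    for chunk in reader:
        total += len(chunk)
        matching += int((chunk["ScenarioType"] == scenario_type).sum())
    return matching, total


# Tiny made-up stand-in for the real file.
demo_csv = io.StringIO(
    "ResponseID,ScenarioType\n"
    "a,Utilitarian\n"
    "b,Age\n"
    "c,Utilitarian\n"
    "d,Gender\n"
)
print(count_scenario_type(demo_csv, chunksize=2))
```

In the notebook, `csv_source` would be the path held in `csv_file_name`.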


In [None]:
unique_value_counts = df_sample.nunique()

print(unique_value_counts)

ResponseID                 100000
ExtendedSessionID           98221
UserID                      84024
ScenarioOrder                  14
Intervention                    1
PedPed                          2
Barrier                         2
CrossingSignal                  3
AttributeLevel                 13
ScenarioTypeStrict              7
ScenarioType                    7
DefaultChoice                   6
NonDefaultChoice                6
DefaultChoiceIsOmission         2
NumberOfCharacters              5
DiffNumberOFCharacters          5
Saved                           2
Template                        2
DescriptionShown                2
LeftHand                        2
UserCountry3                  174
Man                             6
Woman                           6
Pregnant                        5
Stroller                        4
OldMan                          6
OldWoman                        6
Boy                             6
Girl                            6
Homeless      

In [None]:
# USING df_unique_scenarios

# Select the first 10 descriptions
first_10_descriptions = df_unique_scenarios['ScenarioDescription'].head(10)

# Loop through and print each one with a header for clarity
for i, desc in enumerate(first_10_descriptions):
    print(f"--- Scenario {i+1} ---")
    print(desc)
    print("\n" + "="*20 + "\n")

--- Scenario 1 ---
An autonomous vehicle with no one inside is heading toward a group of 1 man and 1 woman and 1 femaleathlete. The pedestrians are crossing illegally (jaywalking).

CHOICE: The user chose to stay on course, saving the passengers by sacrificing the pedestrians.


--- Scenario 2 ---
An autonomous vehicle with no one inside is heading toward a group of 1 maleexecutive. The pedestrians are crossing illegally (jaywalking).

CHOICE: The user chose to stay on course, saving the passengers by sacrificing the pedestrians.


--- Scenario 3 ---
An autonomous vehicle with no one inside is heading toward a group of 1 woman and 1 girl and 1 largewoman and 1 femaleexecutive. The pedestrians are crossing legally.

CHOICE: The user chose to stay on course, saving the passengers by sacrificing the pedestrians.


--- Scenario 4 ---
An autonomous vehicle with no one inside is heading toward a group of no one. The pedestrians are crossing illegally (jaywalking).

CHOICE: The user chose to 
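
Every preview above reports "no one inside", which suggests the -1 check for passengers never matches. A sketch of an alternative reading, under the assumption that `Barrier == 1` marks the characters in a row as passengers and that `CrossingSignal` is coded 0 (no signal involved), 1 (legal crossing), 2 (illegal crossing). This coding is an assumption that should be verified against the dataset's documentation, and `describe_side` is a hypothetical helper:

```python
# Assumed coding of CrossingSignal, to be checked against the dataset documentation.
CROSSING_TEXT = {
    0: "crossing where no signal applies",
    1: "crossing legally",
    2: "crossing illegally",
}


def describe_side(row):
    """Describe who the characters in one row are, under the assumed coding."""
    # Assumption: Barrier == 1 means the row's characters are inside the vehicle.
    if row.get("Barrier") == 1:
        return "The characters in this row are passengers in the vehicle."
    signal = CROSSING_TEXT.get(
        row.get("CrossingSignal"), "crossing with unknown signal status"
    )
    return f"The characters in this row are pedestrians {signal}."


print(describe_side({"Barrier": 1, "CrossingSignal": 0}))
print(describe_side({"Barrier": 0, "CrossingSignal": 2}))
```

The helper accepts a plain dict or a pandas Series, since both provide `.get`.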

In [None]:
# USING df_sample_conflict

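# --- 1. Identify When the Utilitarian Choice Was Made ---
# Assumption: a positive DiffNumberOFCharacters means the pedestrian group is
# the larger one, and Saved == 0 means the user spared the pedestrians.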
is_ped_larger = df_sample_conflict['DiffNumberOFCharacters'] > 0
user_saved_peds = df_sample_conflict['Saved'] == 0
df_sample_conflict['MadeUtilitarianChoice'] = (is_ped_larger & user_saved_peds) | (~is_ped_larger & ~user_saved_peds)


# --- 2. Calculate the Results ---
total_scenarios = len(df_sample_conflict)
utilitarian_count = df_sample_conflict['MadeUtilitarianChoice'].sum()

# The deontological choice is simply the opposite of the utilitarian one in these scenarios.
deontological_count = total_scenarios - utilitarian_count

utilitarian_percentage = (utilitarian_count / total_scenarios) * 100
deontological_percentage = (deontological_count / total_scenarios) * 100

print("--- Corrected Analysis of Ethical Choices ---")
print(f"Total Conflicting Scenarios in Sample: {total_scenarios:,}")
print(f"\nUsers Made the Utilitarian Choice (saved more lives): {utilitarian_count:,} times ({utilitarian_percentage:.2f}%)")
print(f"Users Made the Deontological Choice (chose inaction): {deontological_count:,} times ({deontological_percentage:.2f}%)")

--- Corrected Analysis of Ethical Choices ---
Total Conflicting Scenarios in Sample: 17,933

Users Made the Utilitarian Choice (saved more lives): 9,332 times (52.04%)
Users Made the Deontological Choice (chose inaction): 8,601 times (47.96%)
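
The split above is close to even, so it may help to see how much sampling noise surrounds the 52.04% figure. A rough sketch using the Wilson score interval for a proportion. It treats the 17,933 responses as independent, which is an assumption (a user can contribute several responses), and `wilson_interval` is a helper introduced here:

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return centre - half_width, centre + half_width


# Counts taken from the output above.
low, high = wilson_interval(9332, 17933)
print(f"Utilitarian share: {9332 / 17933:.2%} (interval {low:.2%} to {high:.2%})")
```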


In [None]:
# Using df_sample

print("Filtering for scenarios where Deontological and Utilitarian choices conflict...")

# The 'ScenarioType' == 'Utilitarian' filter isolates these specific cases
df_conflict = df_sample[df_sample['ScenarioType'] == 'Utilitarian'].copy()

conflict_count = len(df_conflict)
total_count = len(df_sample)

print(f"\nFound {conflict_count:,} scenarios (out of {total_count:,} in the sample) that present a direct conflict.")

# Display a preview of the filtered data
print("\nPreview of conflict scenarios:")
print(df_conflict[['ResponseID', 'Intervention', 'Saved', 'DiffNumberOFCharacters']].head())

Filtering for scenarios where Deontological and Utilitarian choices conflict...

Found 17,933 scenarios (out of 100,000 in the sample) that present a direct conflict.

Preview of conflict scenarios:
           ResponseID  Intervention  Saved  DiffNumberOFCharacters
4   2223jMWDEGNeszivb             0      0                     2.0
8   2224g4ytARX4QT5rB             0      1                     1.0
10  2225gNWJcAeE92LXd             0      1                     4.0
14  2225yzLoy7yvKaToo             0      0                     3.0
17  2227D8o2onrAzLT4b             0      0                     2.0


In [None]:
# Count of scenarios that present a direct conflict
conflict_count = len(df_conflict)

# Count total number of scenarios in original sample
total_count = len(df_sample)

# Final summary
print(f"\nFound {conflict_count:,} scenarios (out of {total_count:,} in the sample) that present a direct conflict.")


Found 17,933 scenarios (out of 100,000 in the sample) that present a direct conflict.
