# Existential Risk Estimates Database

This notebook processes the existential risk (x-risk) estimates database published in [this EA Forum post](https://forum.effectivealtruism.org/posts/JQQAQrunyGGhzE23a/database-of-existential-risk-estimates).

Because the entries vary in format and are often written in natural language, they need to be sanitized before further analysis.

## Pre-Processing

In this step we fiddle with the dataset to get something more readily usable. Among other things, we:
- Create a risk category column, and remove the category "header" rows (AI, Total risk, etc.)
- Remove columns that aren't necessary for the analysis

In [99]:
# Import the libraries used in this notebook

import pandas as pd
import numpy as np
import json
import re
from llama_cpp import Llama
from functools import partial

In [100]:
# Open the Excel workbook; all worksheets are available through `excel`.
excel = pd.ExcelFile('./data/xrisk-estimates-database-20241204.xlsx')
# List worksheets
excel.sheet_names

['Overall notes',
 'Existential-risk-level estimate',
 'Conditional existential-risk-le',
 'Estimates of somewhat less extr',
 'Other potential estimates or so']

In [101]:
# Use the 'Existential-risk-level estimate' worksheet.
# The first 5 rows are instructions, so skip them.
# (`index_row` is not a pandas parameter, so it has been dropped; the default integer index is kept.)
df = excel.parse('Existential-risk-level estimate', skiprows=5)
df.head()

Unnamed: 0,Who is the estimator?,When was the estimate made/published?,What is the estimator estimating?,What is their estimate?,Source,Have I properly read the source myself?,Is this estimate included in Beard et al.'s appendix?,Other notes,Unnamed: 8,Unnamed: 9
0,“Total risk” (or similar),,,,,,,,,
1,Toby Ord,2020,“Total existential risk” by 2120,~17% (~1 in 6),The Precipice,Yes,No,"Ord writes: ""Don’t take these numbers to be co...",,
2,GCR Conference,2008,“Overall risk of extinction prior to 2100”,0.19,https://www.fhi.ox.ac.uk/reports/2008-1.pdf,Yes,Yes,This is the median. The report about these est...,,
3,Will MacAskill,2019/2020,Existential risk in the 21st century,0.01,https://80000hours.org/podcast/episodes/will-m...,Yes,No,,,
4,"Ben Todd or 80,000 Hours",2017,Extinction risk “in the next century”,Probably at or above 3%,https://80000hours.org/articles/extinction-risk/,Yes,No,,,


In [102]:
# Print number of columns
print(f"DataFrame has {len(df.columns)} columns")

DataFrame has 10 columns


In [103]:
# Print columns
df.columns

Index(['Who is the estimator? ', 'When was the estimate made/published?',
       'What is the estimator estimating?', 'What is their estimate?',
       'Source', 'Have I properly read the source myself?',
       'Is this estimate included in Beard et al.'s appendix?', 'Other notes',
       'Unnamed: 8', 'Unnamed: 9'],
      dtype='object')

In [104]:
# Rename columns to something more concise
new_column_names = [
  'estimator',
  'date',
  'estimation_measure',
  'estimation',
  'source',
  'source_read_by_estimator',
  'estimate_included_in_beard_et_al',
  'other_notes',
  'unknown_column_1',
  'unknown_column_2'
]
df.columns = new_column_names
df.head()

Unnamed: 0,estimator,date,estimation_measure,estimation,source,source_read_by_estimator,estimate_included_in_beard_et_al,other_notes,unknown_column_1,unknown_column_2
0,“Total risk” (or similar),,,,,,,,,
1,Toby Ord,2020,“Total existential risk” by 2120,~17% (~1 in 6),The Precipice,Yes,No,"Ord writes: ""Don’t take these numbers to be co...",,
2,GCR Conference,2008,“Overall risk of extinction prior to 2100”,0.19,https://www.fhi.ox.ac.uk/reports/2008-1.pdf,Yes,Yes,This is the median. The report about these est...,,
3,Will MacAskill,2019/2020,Existential risk in the 21st century,0.01,https://80000hours.org/podcast/episodes/will-m...,Yes,No,,,
4,"Ben Todd or 80,000 Hours",2017,Extinction risk “in the next century”,Probably at or above 3%,https://80000hours.org/articles/extinction-risk/,Yes,No,,,


In [105]:
# Drop unnecessary columns (last 2, which are unknown)
df = df.drop(columns=['unknown_column_1', 'unknown_column_2'])
df.sample(3)

Unnamed: 0,estimator,date,estimation_measure,estimation,source,source_read_by_estimator,estimate_included_in_beard_et_al,other_notes
3,Will MacAskill,2019/2020,Existential risk in the 21st century,0.01,https://80000hours.org/podcast/episodes/will-m...,Yes,No,
97,Holden Karnofsky,2021,Conditional on PASTA being developed this cent...,At least 50%,"""Some additional detail on what I mean by ""mos...",Yes,No,"""I want to roughly say that if something like ..."
4,"Ben Todd or 80,000 Hours",2017,Extinction risk “in the next century”,Probably at or above 3%,https://80000hours.org/articles/extinction-risk/,Yes,No,


In [106]:
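# Add an empty column for the risk category.
# Use object dtype so that string labels can be assigned later without a dtype warning.
df['risk_category'] = pd.Series(np.nan, index=df.index, dtype='object')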
df.sample(3)

Unnamed: 0,estimator,date,estimation_measure,estimation,source,source_read_by_estimator,estimate_included_in_beard_et_al,other_notes,risk_category
13,Frank Tipler,2019?,"""Personally, I now think we humans will be wip...",,"William Poundstone, The Doomsday Calculation, ...",No,No,,
17,AI,,,,,,,,
110,Pablo Stafforini,2015/2020,"Chance that ""Humans will go extinct within mil...",0.1,http://www.stafforini.com/blog/what_i_believe/,Yes,No,,


In [107]:
# Each block of estimates is preceded by a header row naming the type of risk being estimated.
#
# The first header row is "Total risk", and the total risk estimates follow it.
#
# Further down are the header rows and estimates for the other x-risks.
#
# The categories are: "Total risk (or similar)", "AI", "Biorisk", "Nanotechnology", "Climate change", "Natural risks (excluding natural pandemics)", "Nuclear", "War", "Explicitly about only unrecoverable dystopia and/or unrecoverable collapse (not extinction)", "Miscellaneous".
risk_categories = [
    '“Total risk” (or similar)',
    'AI',
    'Biorisk',
    'Nanotechnology',
    'Climate change',
    'Natural risks (excluding natural pandemics)',
    'Nuclear',
    'War',
    'Explicitly about only unrecoverable dystopia and/or unrecoverable collapse (not extinction)',
    'Miscellaneous']

risk_categories_aliases = [
    'total',
    'ai',
    'biorisk',
    'nanotechnology',
    'climate_change',
    'natural_risks',
    'nuclear',
    'war',
    'dystopia',
    'miscellaneous'
]

# Get rows that contain a risk category in the first column
risk_category_rows = df[df['estimator'].isin(risk_categories)]['estimator']
risk_category_rows 

0                             “Total risk” (or similar)
17                                                   AI
39                                              Biorisk
51                                       Nanotechnology
62                                              Nuclear
72                                       Climate change
78          Natural risks (excluding natural pandemics)
87                                                  War
92    Explicitly about only unrecoverable dystopia a...
95                                        Miscellaneous
Name: estimator, dtype: object

In [108]:
# Create a copy of the dataframe
df_with_risk_category = df.copy()

# Initialize the current risk category
current_risk_category = None

# Iterate over the dataframe rows
for i, row in df_with_risk_category.iterrows():
  if row['estimator'] in risk_category_rows.values:
    # Update the current risk category
    current_risk_category = row['estimator']
  # Set the risk category for the current row
  df_with_risk_category.at[i, 'risk_category'] = current_risk_category

df_with_risk_category.sample(3)


Unnamed: 0,estimator,date,estimation_measure,estimation,source,source_read_by_estimator,estimate_included_in_beard_et_al,other_notes,risk_category
99,Wei Dai,2021,Expected fraction of total potential value tha...,>50%,Comment,Yes,No,"""What's your expectation of the fraction of to...",Miscellaneous
97,Holden Karnofsky,2021,Conditional on PASTA being developed this cent...,At least 50%,"""Some additional detail on what I mean by ""mos...",Yes,No,"""I want to roughly say that if something like ...",Miscellaneous
32,Buck Shlegris,2020,"""the probability of AI-induced existential ris...",0.5,https://futureoflife.org/2020/04/15/an-overvie...,Yes,No,Note that Buck gave a 25 percentage point lowe...,AI
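
The row-by-row loop above can also be written without `iterrows`. The following is a minimal sketch on toy data (the frame `toy` and the list `headers` are made up for illustration): keep the estimator value only on header rows, forward-fill it downwards, then drop the header rows.

```python
import pandas as pd

# Toy stand-in for the spreadsheet: header rows ('AI', 'War') sit above their estimates.
toy = pd.DataFrame({'estimator': ['AI', 'Alice', 'Bob', 'War', 'Carol']})
headers = ['AI', 'War']  # hypothetical category header values

is_header = toy['estimator'].isin(headers)

# Keep the label on header rows only (NaN elsewhere), then carry it down to the rows below.
toy['risk_category'] = toy['estimator'].where(is_header).ffill()

# Remove the header rows themselves.
toy = toy[~is_header]
print(toy)
```

Any row that appears before the first header would keep a missing category, which matches the behaviour of the loop (where it would be `None`).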


In [109]:
# Drop the rows that contain the risk categories
df_with_risk_category = df_with_risk_category[~df_with_risk_category['estimator'].isin(risk_categories)]
df_with_risk_category.head(3)

Unnamed: 0,estimator,date,estimation_measure,estimation,source,source_read_by_estimator,estimate_included_in_beard_et_al,other_notes,risk_category
1,Toby Ord,2020,“Total existential risk” by 2120,~17% (~1 in 6),The Precipice,Yes,No,"Ord writes: ""Don’t take these numbers to be co...",“Total risk” (or similar)
2,GCR Conference,2008,“Overall risk of extinction prior to 2100”,0.19,https://www.fhi.ox.ac.uk/reports/2008-1.pdf,Yes,Yes,This is the median. The report about these est...,“Total risk” (or similar)
3,Will MacAskill,2019/2020,Existential risk in the 21st century,0.01,https://80000hours.org/podcast/episodes/will-m...,Yes,No,,“Total risk” (or similar)


In [110]:
# Rename risk categories to something more concise
df_with_risk_category['risk_category'] = df_with_risk_category['risk_category'].replace(risk_categories, risk_categories_aliases)
df_with_risk_category.sample(3)

Unnamed: 0,estimator,date,estimation_measure,estimation,source,source_read_by_estimator,estimate_included_in_beard_et_al,other_notes,risk_category
97,Holden Karnofsky,2021.0,Conditional on PASTA being developed this cent...,At least 50%,"""Some additional detail on what I mean by ""mos...",Yes,No,"""I want to roughly say that if something like ...",miscellaneous
68,"Anders Sandberg, adapting Denkenberger’s model",2018.0,“Reduction in far future potential per year fr...,~0.051%,https://www.getguesstimate.com/models/11691,Sort-of,No,,nuclear
91,,,,,,,,,war
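
Row 91 in the sample above is blank apart from its `risk_category`: empty spreadsheet rows receive a category label too. A minimal sketch of dropping such rows, on toy data (the frame `toy` and the choice of `content_columns` are assumptions for illustration):

```python
import pandas as pd

# Toy data: the middle row is empty apart from the category label.
toy = pd.DataFrame({
    'estimator': ['Alice', None, 'Bob'],
    'estimation': ['0.1', None, None],
    'risk_category': ['war', 'war', 'war'],
})

# Columns that carry actual content; a row is dropped only if ALL of them are missing.
content_columns = ['estimator', 'estimation']
cleaned = toy.dropna(subset=content_columns, how='all')
print(cleaned)
```

Using `how='all'` keeps rows such as Bob's, where an estimator is named but no estimate is recorded.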


In [111]:
# Use only a few columns relevant for the analysis.
relevant_columns = [
    "estimator",
    "estimation_measure",
    "date",
    "estimation",
    "source_read_by_estimator",
    "risk_category",
    "other_notes",
    "source"
]
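# Take an explicit copy so that columns can be added later without a SettingWithCopyWarning.
df_relevant = df_with_risk_category[relevant_columns].copy()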
df_relevant.head(3)

Unnamed: 0,estimator,estimation_measure,date,estimation,source_read_by_estimator,risk_category,other_notes,source
1,Toby Ord,“Total existential risk” by 2120,2020,~17% (~1 in 6),Yes,total,"Ord writes: ""Don’t take these numbers to be co...",The Precipice
2,GCR Conference,“Overall risk of extinction prior to 2100”,2008,0.19,Yes,total,This is the median. The report about these est...,https://www.fhi.ox.ac.uk/reports/2008-1.pdf
3,Will MacAskill,Existential risk in the 21st century,2019/2020,0.01,Yes,total,,https://80000hours.org/podcast/episodes/will-m...


In [112]:
# Save the dataframe to a CSV file
df_relevant.to_csv('./data/pre-processed_data.csv', index=False)

In [113]:
# Copy estimates that are already numeric into a separate column; text estimates become NaN
df_relevant.loc[:, 'estimation_numeric'] = pd.to_numeric(df_relevant['estimation'], errors='coerce')
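
With `errors='coerce'`, text entries such as `~17% (~1 in 6)` end up as NaN. As a rough illustration of what a rule-based fallback could look like, here is a small sketch that pulls the first percentage out of a string. The helper `first_percent_as_fraction` is hypothetical and is not part of this notebook's pipeline; it also throws away qualifiers like "~" or "at least", which is one reason the notebook uses a language model for this step instead.

```python
import re

# Matches a number immediately followed by a percent sign, e.g. "17%" or "0.051 %".
_PERCENT = re.compile(r'(\d+(?:\.\d+)?)\s*%')

def first_percent_as_fraction(text):
    """Return the first 'N%' found in text as a fraction, or None if there is none."""
    match = _PERCENT.search(str(text))
    return float(match.group(1)) / 100 if match else None

examples = ['~17% (~1 in 6)', 'Probably at or above 3%', '~0.051%', 'At least 50%', '0.19']
parsed = [first_percent_as_fraction(e) for e in examples]
for raw, value in zip(examples, parsed):
    print(f'{raw!r} -> {value}')
```

The last example contains no percent sign, so the helper returns `None`; values like that are already handled by `pd.to_numeric`.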

In [114]:
# Save the dataframe again, now including the estimation_numeric column
df_relevant.to_csv('./data/pre-processed_data.csv', index=False)

Now we have a more manageable file, but the estimates are still not homogeneous. I'll now use Claude AI and ask it to go through the file and assign each estimate a value scaled to a per-century risk, taking into consideration the remarks for each row.

I'll focus on total risk for now, so I'll select those and give them to Claude AI for further processing.
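
To make "scaled to a per-century risk" concrete: one common way to rescale a probability stated over a different horizon is to assume a constant hazard rate. This is an assumption used here for illustration only; the conversions in this notebook were made case by case from each row's notes and need not follow this formula. The function name `rescale_risk` is hypothetical.

```python
def rescale_risk(p, horizon_years, target_years=100.0):
    """Rescale a cumulative probability p over horizon_years to target_years.

    Assumes a constant hazard rate: survival over the target period is
    (1 - p) ** (target_years / horizon_years).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError('p must be a probability between 0 and 1')
    if horizon_years <= 0 or target_years <= 0:
        raise ValueError('horizons must be positive')
    survival = (1.0 - p) ** (target_years / horizon_years)
    return 1.0 - survival

# A 0.1% annual risk compounds to roughly 9.5% over a century.
per_century = rescale_risk(0.001, horizon_years=1)
print(round(per_century, 4))
```

Note that the result is slightly below the naive `100 * 0.001 = 0.1`, because the risk can only be realized once.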

In [115]:
df_total_risk=df_relevant[df_relevant['risk_category']=='total']
df_total_risk.to_csv('./data/pre-processed_data_total_risk.csv', index=False)

The processed output is saved in `./data/total_risk_estimates.csv`; let's take a look at it.

In [116]:
# Open processed file.
df = pd.read_csv('./data/total_risk_estimates.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   estimator              14 non-null     object 
 1   original_estimate      14 non-null     object 
 2   per_century_risk       12 non-null     float64
 3   estimate_confidence    14 non-null     object 
 4   conversion_confidence  12 non-null     object 
 5   reasoning              14 non-null     object 
dtypes: float64(1), object(5)
memory usage: 804.0+ bytes


In [117]:
df.describe()

Unnamed: 0,per_century_risk
count,12.0
mean,0.225332
std,0.283787
min,0.00098
25%,0.04125
50%,0.1615
75%,0.26125
max,1.0


Now, I'll use the same methodology to process all other risk categories.

In [118]:
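# Collect the (aliased) risk categories in a fixed, sorted order so later steps are reproducible
risk_categories = sorted(set(df_relevant['risk_category']))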

In [119]:
# Save ai risks
df_ai = df_relevant[df_relevant['risk_category'] == 'ai']
df_ai.to_csv('./data/pre-processed_data_ai.csv', index=False)

# Save biorisk data
df_biorisk = df_relevant[df_relevant['risk_category'] == 'biorisk']
df_biorisk.to_csv('./data/pre-processed_data_biorisk.csv', index=False)

# Save climate change data
df_climate_change = df_relevant[df_relevant['risk_category'] == 'climate_change']
df_climate_change.to_csv('./data/pre-processed_data_climate_change.csv', index=False)

# Save dystopia risk data
df_dystopia = df_relevant[df_relevant['risk_category'] == 'dystopia']
df_dystopia.to_csv('./data/pre-processed_data_dystopia.csv', index=False)

# Save miscellaneous risk data
df_miscellaneous = df_relevant[df_relevant['risk_category'] == 'miscellaneous']
df_miscellaneous.to_csv('./data/pre-processed_data_miscellaneous.csv', index=False)

# Save nanotechnology risk data
df_nanotechnology = df_relevant[df_relevant['risk_category'] == 'nanotechnology']
df_nanotechnology.to_csv('./data/pre-processed_data_nanotechnology.csv', index=False)

# Save natural risks data
df_natural_risks = df_relevant[df_relevant['risk_category'] == 'natural_risks']
df_natural_risks.to_csv('./data/pre-processed_data_natural_risks.csv', index=False)

# Save nuclear risk data
df_nuclear = df_relevant[df_relevant['risk_category'] == 'nuclear']
df_nuclear.to_csv('./data/pre-processed_data_nuclear.csv', index=False)

# Save war risk data
df_war = df_relevant[df_relevant['risk_category'] == 'war']
df_war.to_csv('./data/pre-processed_data_war.csv', index=False)
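
The nine near-identical blocks above could also be written as one loop over the categories. A minimal sketch on toy data, writing to a temporary directory instead of `./data` (the frame `toy` and the directory `out_dir` are stand-ins for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for df_relevant.
toy = pd.DataFrame({
    'estimator': ['Alice', 'Bob', 'Carol'],
    'risk_category': ['ai', 'ai', 'war'],
})

out_dir = Path(tempfile.mkdtemp())  # stand-in for ./data

written = []
# groupby yields one (category, sub-frame) pair per distinct category, in sorted order.
for category, group in toy.groupby('risk_category'):
    path = out_dir / f'pre-processed_data_{category}.csv'
    group.to_csv(path, index=False)
    written.append(path.name)

print(written)
```

Unlike the explicit version, this does not leave per-category variables such as `df_ai` behind, so it only fits if those are not needed later.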


I've asked Claude to process these files and give me per-century estimates for each of the estimates, along with a confidence level for each estimate. The processed estimates are located in `./data/processed_estimates`.

Now, let's re-build the total dataset for each of the categories.

In [120]:
file_names = [f'{risk_category}.csv' for risk_category in risk_categories]
data_frames = [pd.read_csv(f'./data/processed_estimates/{file_name}') for file_name in file_names]

In [121]:
# Add the risk category to each dataframe
risk_categories_list = list(risk_categories)
for risk_category, data_frame in zip(risk_categories_list, data_frames):
  data_frame['risk_category'] = risk_category

# Concatenate all dataframes
df_concatenated = pd.concat(data_frames, ignore_index=True)
df_concatenated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   estimator              89 non-null     object 
 1   original_estimate      89 non-null     object 
 2   per_century_risk       87 non-null     float64
 3   estimate_confidence    89 non-null     object 
 4   conversion_confidence  87 non-null     object 
 5   reasoning              89 non-null     object 
 6   risk_category          89 non-null     object 
dtypes: float64(1), object(6)
memory usage: 5.0+ KB
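
The rebuild step relies on the file list and the category list being in the same order. One way to make that pairing explicit is to keep each category next to its source in a single mapping. A minimal sketch, using in-memory CSV text as a hypothetical stand-in for the files in `./data/processed_estimates`:

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for ./data/processed_estimates/<category>.csv
sources = {
    'ai': io.StringIO('estimator,per_century_risk\nAlice,0.1\n'),
    'war': io.StringIO('estimator,per_century_risk\nBob,0.02\nCarol,\n'),
}

# Read each source and tag it with its category in the same expression.
frames = [
    pd.read_csv(handle).assign(risk_category=category)
    for category, handle in sources.items()
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With real files, the dictionary values would be paths rather than `StringIO` objects; `pd.read_csv` accepts both.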


In [122]:
df_concatenated.describe()

Unnamed: 0,per_century_risk
count,87.0
mean,0.1060296
std,0.1870759
min,8.3e-10
25%,0.00083
50%,0.0167
75%,0.1405
max,1.0


## Post-Processing

Let's now do a little bit of post-processing to polish our data.

In [123]:
df = df_concatenated.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   estimator              89 non-null     object 
 1   original_estimate      89 non-null     object 
 2   per_century_risk       87 non-null     float64
 3   estimate_confidence    89 non-null     object 
 4   conversion_confidence  87 non-null     object 
 5   reasoning              89 non-null     object 
 6   risk_category          89 non-null     object 
dtypes: float64(1), object(6)
memory usage: 5.0+ KB


In [124]:
# Drop rows where per_century_risk is null
df = df.dropna(subset=['per_century_risk'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87 entries, 0 to 88
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   estimator              87 non-null     object 
 1   original_estimate      87 non-null     object 
 2   per_century_risk       87 non-null     float64
 3   estimate_confidence    87 non-null     object 
 4   conversion_confidence  87 non-null     object 
 5   reasoning              87 non-null     object 
 6   risk_category          87 non-null     object 
dtypes: float64(1), object(6)
memory usage: 5.4+ KB


In [125]:
# Set column types
column_types = {
  'estimator': 'string',
  'original_estimate': 'string',
  'estimate_confidence': 'string',
  'conversion_confidence': 'string',
  'reasoning': 'string',
  'risk_category': 'string'
}

df = df.astype(column_types)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87 entries, 0 to 88
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   estimator              87 non-null     string 
 1   original_estimate      87 non-null     string 
 2   per_century_risk       87 non-null     float64
 3   estimate_confidence    87 non-null     string 
 4   conversion_confidence  87 non-null     string 
 5   reasoning              87 non-null     string 
 6   risk_category          87 non-null     string 
dtypes: float64(1), string(6)
memory usage: 5.4 KB


The `estimate_confidence` and `conversion_confidence` columns hold ordinal labels, so they can also be expressed as numbers. We create extra numeric columns based on their values.

In [126]:
# First make the values lower case.
df['estimate_confidence'] = df['estimate_confidence'].str.lower()
df['conversion_confidence'] = df['conversion_confidence'].str.lower()

In [127]:
estimate_confidence_levels_str = set(df['estimate_confidence'])
estimate_confidence_levels_str

{'high', 'low', 'medium'}

In [128]:
conversion_confidence_levels_str = set(df['conversion_confidence'])
conversion_confidence_levels_str

{'high', 'low', 'medium'}

In [129]:
level_str_numeric_mapping = {
  'low': 1,
  'medium': 2,
  'high': 3,
}
df['estimate_confidence_numeric'] = df['estimate_confidence'].map(level_str_numeric_mapping)
df['conversion_confidence_numeric'] = df['conversion_confidence'].map(level_str_numeric_mapping)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 87 entries, 0 to 88
Data columns (total 9 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   estimator                      87 non-null     string 
 1   original_estimate              87 non-null     string 
 2   per_century_risk               87 non-null     float64
 3   estimate_confidence            87 non-null     string 
 4   conversion_confidence          87 non-null     string 
 5   reasoning                      87 non-null     string 
 6   risk_category                  87 non-null     string 
 7   estimate_confidence_numeric    87 non-null     int64  
 8   conversion_confidence_numeric  87 non-null     int64  
dtypes: float64(1), int64(2), string(6)
memory usage: 6.8 KB
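
All 87 rows mapped cleanly here, but `Series.map` silently turns any label missing from the dictionary into NaN. A small sketch of a guard against that, on toy data (the series `confidence` and the label `'unclear'` are made up for illustration):

```python
import pandas as pd

mapping = {'low': 1, 'medium': 2, 'high': 3}

# Toy labels with mixed case and one value the mapping does not know about.
confidence = pd.Series(['High', 'low', 'Medium', 'unclear'])

numeric = confidence.str.lower().map(mapping)

# Collect the original labels that failed to map, so they can be inspected or fixed.
unmapped = confidence[numeric.isna()].tolist()
print(unmapped)
```

Keep in mind that 1, 2 and 3 only encode rank: "high" is not meant to be three times as confident as "low".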


In [130]:
# Save the post-processed dataframe
df.to_csv('./data/processed_estimates/all_estimates.csv', index=False)