In [45]:
from seshat_api import SeshatAPI
import pandas as pd
from ollama import chat, ChatResponse
import re

# Questions for DeepSeek-R1

Can a recent top-performing LLM (DeepSeek-R1) correctly predict whether a variable from the Seshat Global History Databank (e.g. "Scientific Literature") should be "present" or "absent" for a selection of polities, given only a definition of the variable, the name of the polity, and the years in which it existed?

If DeepSeek has a good understanding of history, can it infer from the polity name and years which time and place the prompt refers to, and then judge whether the variable in question would have been present?

**Important:** I have included some example CSVs of DeepSeek responses to prompts, since running hundreds of prompts took several hours on my laptop (M1 Mac). If you want to generate new responses from DeepSeek, make sure you uncomment the calls to the `deepseek_responses` function below, but also make sure to *not* overwrite your own responses in the next cell where I load the pre-made responses from CSV.
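
One way to make the "load if saved, otherwise generate" choice explicit is a small caching helper. This is a minimal sketch rather than part of the notebook: the name `load_or_generate` and its `overwrite` flag are hypothetical, and `generate` stands for any zero-argument callable that returns a dataframe of responses.

```python
from pathlib import Path

import pandas as pd


def load_or_generate(csv_path, generate, overwrite=False):
    """Return responses cached at csv_path, or call generate() and cache the result.

    `generate` is a zero-argument callable returning a DataFrame (hypothetical
    helper interface, not something provided by seshat_api or ollama).
    """
    path = Path(csv_path)
    # Reuse the saved responses unless the caller explicitly asks to regenerate
    if path.exists() and not overwrite:
        return pd.read_csv(path)
    responses = generate()
    responses.to_csv(path, index=False)
    return responses
```

It could be called as `load_or_generate('scientific_literature_responses.csv', lambda: deepseek_responses(polities_with_scientific_literatures_df, variable, definition))`, so the slow model calls only run when no CSV exists yet.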

In [46]:
client = SeshatAPI(base_url="https://seshat-db.com/api")

## Getting data from Seshat

First let's define our variable to be used in an LLM prompt later (taken from seshat-db.com), then load the data from the Seshat API to use as our ground truth to test against.

In [47]:
variable = 'Scientific Literature'
definition = "Talking about Kinds of Written Documents, Scientific literature includes mathematics, natural sciences, social sciences"

In [48]:
from seshat_api.sc import ScientificLiteratures
scientific_literatures = ScientificLiteratures(client)
scientific_literatures_df = pd.DataFrame(scientific_literatures.get_all())
len(scientific_literatures_df)

421

Let's use only expert-reviewed data and ignore records where the value is anything other than "present" or "absent", which gives us a subsample of the dataset.

We should also reformat the dataframe so we have information about the polities such as the start and end year alongside the variable value, then remove columns we aren't interested in.

In [49]:
def process_df(seshat_df, variable_id):
    # Filter out the records that are not expert reviewed
    seshat_df = seshat_df[seshat_df['expert_reviewed'] == True]

    # Filter out records where the variable is not either 'present' or 'absent'
    seshat_df = seshat_df[seshat_df[variable_id].isin(['present', 'absent'])]

    # Extract the polity records (dicts) into a new dataframe; this dataframe gets a fresh 0..n-1 index
    polities_df = pd.DataFrame(seshat_df['polity'].tolist())

    # Add the variable column to the new dataframe (pandas aligns this assignment on index labels)
    polities_df[variable_id] = seshat_df[variable_id]

    # Filter out records where the variable value is NaN
    polities_df = polities_df[polities_df[variable_id].notna()]
    print("There are", len(polities_df), variable_id, "records after filtering.")

    # Get rid of the columns we don't need
    polities_df = polities_df[['new_name', 'long_name', 'start_year', 'end_year', variable_id, 'general_description']]

    return polities_df

In [50]:
polities_with_scientific_literatures_df = process_df(scientific_literatures_df, 'scientific_literature')

There are 150 scientific_literature records after filtering.
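
One thing worth checking in `process_df`: `pd.DataFrame(seshat_df['polity'].tolist())` starts a new 0..n-1 index, while the filtered `seshat_df` keeps its original index labels, so the column assignment matches rows by label rather than by position. The sketch below shows one way to keep each polity paired with the value from its own record, by reusing the filtered index. The name `process_df_aligned` and the toy data are illustrative only, and the sketch assumes the `polity` column holds dicts, as the original code implies.

```python
import pandas as pd


def process_df_aligned(seshat_df, variable_id):
    # Keep expert-reviewed records whose value is 'present' or 'absent'
    mask = (seshat_df['expert_reviewed'] == True) & seshat_df[variable_id].isin(['present', 'absent'])
    filtered = seshat_df[mask]

    # Expand the polity dicts, reusing the filtered index so labels still match
    polities_df = pd.DataFrame(filtered['polity'].tolist(), index=filtered.index)

    # Index labels now agree, so each polity receives its own record's value
    polities_df[variable_id] = filtered[variable_id]
    return polities_df


# Toy data (made up for illustration): the first record is dropped by the filter
toy = pd.DataFrame({
    'expert_reviewed': [False, True, True],
    'scientific_literature': ['present', 'absent', 'present'],
    'polity': [{'new_name': 'a'}, {'new_name': 'b'}, {'new_name': 'c'}],
})
aligned = process_df_aligned(toy, 'scientific_literature')
```

Rerunning the evaluation with a variant like this would show whether the record counts and accuracy figures depend on the alignment step.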


In [51]:
polities_with_scientific_literatures_df.sample(5)

|  | new_name | long_name | start_year | end_year | scientific_literature | general_description |
| --- | --- | --- | --- | --- | --- | --- |
| 185 | mx_basin_of_mexico_2 | Initial Formative Basin of Mexico | -2000 | -1201 | absent | The Basin or Valley of Mexico is a highlands p... |
| 222 | mx_basin_of_mexico_5 | Late Formative Basin of Mexico | -400 | -101 | present | The Basin or Valley of Mexico is a highlands p... |
| 18 | gr_crete_classical | Classical Crete | -500 | -323 | present | `<i>Population and political organization</i><b...` |
| 192 | jp_jomon_6 | Japan - Final Jomon | -1200 | -300 | absent |  |
| 191 | jp_jomon_5 | Japan - Late Jomon | -2500 | -1200 | absent |  |


## "Prompt engineering"

First let's define a prompt function to use with our dataframe:

In [52]:
def year_CE(year):
    if year >= 0:
        return f"{year} CE"
    else:
        return f"{abs(year)} BCE"

def prompt_func(seshat_df, polity_name, variable, variable_definition):
    df = seshat_df[seshat_df['new_name'] == polity_name]
    polity = list(df['long_name'])[0]
    # description = list(df['general_description'])[0]  # TODO: we could add this to the prompt to add more context
    start_year = list(df['start_year'])[0]
    end_year = list(df['end_year'])[0]
    prompt = "Use your knowledge of world history to answer the following question. "
    prompt += f"Given your knowledge of the historical polity '{polity}', "
    prompt += f"a polity that existed between {year_CE(start_year)} and {year_CE(end_year)}"
    prompt += f", do you expect that {variable} was present or absent? "
    prompt += f"{variable} is defined as: '{variable_definition}'. "
    
    # To help extract the answer from the text response later, make sure we have a string that can be found with a regex:
    prompt += "Answer 'XXXpresentXXX' if you expect it to be present, and 'XXXabsentXXX' if you expect it to be absent."
    return prompt
    

How does the prompt look with an example "new_name" (Seshat ID) of `tn_fatimid_cal`, which we know has a record for scientific literature in the database:

In [53]:
test_seshat_polity_name = 'tn_fatimid_cal'
test_prompt = prompt_func(polities_with_scientific_literatures_df, test_seshat_polity_name, variable, definition)
test_prompt

"Use your knowledge of world history to answer the following question. Given your knowledge of the historical polity 'Fatimid Caliphate', a polity that existed between 909 CE and 1171 CE, do you expect that Scientific Literature was present or absent? Scientific Literature is defined as: 'Talking about Kinds of Written Documents, Scientific literature includes mathematics, natural sciences, social sciences'. Answer 'XXXpresentXXX' if you expect it to be present, and 'XXXabsentXXX' if you expect it to be absent."

OK, let's see how DeepSeek responds to this prompt:

In [54]:
response: ChatResponse = chat(model='deepseek-r1', messages=[
  {
    'role': 'user',
    'content': test_prompt,
  },
])
print(response.message.content)

<think>
Alright, so I need to figure out whether scientific literature existed during the Fatimid Caliphate period, which was from 909 CE to 1171 CE. Let's break this down step by step.

First, I know that the Fatimid Caliphate was an Islamic empire in North Africa, expanding its influence beyond its original territory. It was known for its advancements in various fields including science and mathematics. The period they're asking about is from 909 to 1171, so that's a long span of time covering several centuries.

I remember that the Fatimids were preceded by the Rashidun Caliphate and followed by the AB Caliphate. The Islamic Golden Age was during the Umayyads and the Abbasids, which was from around 650 to 1258 CE. The Fatimid Caliphate is often considered part of this broader Golden Age but perhaps a bit before or after it in terms of specific advancements.

Given that the Golden Age saw significant contributions in science, I would expect that during the Fatimid period as well, the

Was that correct? Let's check the dataframe to see:

In [55]:
polities_with_scientific_literatures_df[polities_with_scientific_literatures_df['new_name'] == test_seshat_polity_name]

|  | new_name | long_name | start_year | end_year | scientific_literature | general_description |
| --- | --- | --- | --- | --- | --- | --- |
| 237 | tn_fatimid_cal | Fatimid Caliphate | 909 | 1171 | present | The Fatimid Caliphate lasted from 909 to 1171 ... |


### Get DeepSeek's answer data

Let's write a function to collect responses from DeepSeek for all the polities in our dataset and make a new dataframe with the results:

In [56]:
def deepseek_responses(seshat_df, variable, definition, count=None):  # Use the count parameter to limit the number of responses
    def generate_response(polity):
        response: ChatResponse = chat(model='deepseek-r1', messages=[
            {
                'role': 'user',
                'content': prompt_func(seshat_df, polity, variable, definition),
            },
        ])
        return response

    rows = []
    for polity in seshat_df['new_name']:
        if count is not None and len(rows) >= count:
            break
        response = generate_response(polity)
        while 'XXXabsentXXX' not in response.message.content and 'XXXpresentXXX' not in response.message.content:
            response = generate_response(polity)  # Keep asking until we get a valid response
        answer = re.search(r'(XXXabsentXXX|XXXpresentXXX)', response.message.content).group(1)
        answer = answer.replace("XXX", "")

        # Collect the new row; the DataFrame is built once after the loop
        rows.append({
            'new_name': polity,
            'answer': answer,
            'full': response.message.content
        })

    # Build the DataFrame in one go instead of concatenating inside the loop
    responses = pd.DataFrame(rows, columns=['new_name', 'answer', 'full'])

    return responses
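
Because DeepSeek-R1 prints its reasoning inside `<think>` tags before the final answer (as in the sample response above), a marker mentioned while reasoning could be picked up by `re.search`, which returns the first match anywhere in the text. The sketch below is one possible way to guard against that. It assumes the reasoning is always wrapped in a closed `<think>...</think>` block, and the helper name `extract_answer` is hypothetical.

```python
import re

# Reasoning block as it appears in the sample response (assumed to be closed)
THINK_BLOCK = re.compile(r'<think>.*?</think>', flags=re.DOTALL)
MARKER = re.compile(r'XXX(present|absent)XXX')


def extract_answer(text):
    """Return 'present' or 'absent' from the visible answer, or None if no marker is found."""
    # Drop the reasoning so markers mentioned while thinking are ignored
    visible = THINK_BLOCK.sub('', text)
    matches = MARKER.findall(visible)
    # Take the last marker, on the assumption that the final statement is the answer
    return matches[-1] if matches else None
```

Returning `None` instead of raising lets the caller decide whether to retry, and how many times, rather than looping until a marker appears.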

Now let's generate the answer data for this variable across polities in our dataset:

In [57]:
# Get the responses for the scientific literature variable - uncomment to run (you can set count to limit the number of responses)
# scientific_literature_responses = deepseek_responses(polities_with_scientific_literatures_df, variable, definition)

In [58]:
# Save the responses to a CSV file - uncomment to run
# scientific_literature_responses.to_csv('scientific_literature_responses.csv', index=False)

# Load the responses from the CSV file
scientific_literature_responses = pd.read_csv('scientific_literature_responses.csv')

In [59]:
scientific_literature_responses.sample(5)

|  | new_name | answer | full |
| --- | --- | --- | --- |
| 42 | eg_old_k_1 | present | `<think>\nOkay, so I need to figure out whether...` |
| 60 | iq_abbasid_cal_1 | present | `<think>\nOkay, so I need to figure out whether...` |
| 13 | tr_konya_lnl | present | `<think>\nOkay, so I need to figure out whether...` |
| 10 | tr_east_roman_emp | present | `<think>\nOkay, so I need to figure out whether...` |
| 28 | cn_eastern_han_dyn | present | `<think>\nOkay, so I have this question about t...` |


### How well did the model perform?

Let's do a simple check to see what percentage of the answers were correct, according to the Seshat ground truth:

In [60]:
def performance(polities_df, responses_df, variable_id):
    total = 0
    correct = 0
    for _, response in responses_df.iterrows():
        polity = response['new_name']
        seshat_answer = polities_df[polities_df['new_name'] == polity][variable_id].values[0]
        if seshat_answer == response['answer']:
            correct += 1
        # print(polity, ": ", seshat_answer, response['answer'])
        total += 1
    percentage = correct / total * 100
    print(f"Correct: {correct}, Total: {total}, Percentage: {percentage:.2f}%")
    return percentage

In [61]:
performance(polities_with_scientific_literatures_df, scientific_literature_responses, 'scientific_literature')

Correct: 79, Total: 150, Percentage: 52.67%


52.666666666666664
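
A single percentage hides which way the errors go, for example whether the model mostly answers "present". The sketch below cross-tabulates the Seshat values against the model's answers. The function name `confusion_table` and the toy data are illustrative only. It keeps the first row per `new_name`, mirroring the `.values[0]` lookup in `performance`.

```python
import pandas as pd


def confusion_table(polities_df, responses_df, variable_id):
    # One ground-truth row per polity, mirroring the .values[0] lookup in performance()
    truth = polities_df[['new_name', variable_id]].drop_duplicates('new_name')
    merged = responses_df.merge(truth, on='new_name', how='inner')
    # Rows: Seshat value, columns: model answer
    return pd.crosstab(merged[variable_id], merged['answer'],
                       rownames=['seshat'], colnames=['model'])


# Toy data (made up for illustration)
toy_truth = pd.DataFrame({
    'new_name': ['a', 'b', 'c', 'd'],
    'scientific_literature': ['present', 'absent', 'present', 'absent'],
})
toy_responses = pd.DataFrame({
    'new_name': ['a', 'b', 'c', 'd'],
    'answer': ['present', 'present', 'absent', 'absent'],
})
table = confusion_table(toy_truth, toy_responses, 'scientific_literature')
print(table)
```

The diagonal cells count agreements and the off-diagonal cells show whether the mistakes lean towards "present" or "absent".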

## Let's try that again

Now let's run the pipeline again with a different variable:

In [62]:
from seshat_api.sc import DrinkingWaterSupplies
drinking_water_supplies = DrinkingWaterSupplies(client)
drinking_water_supplies_df = pd.DataFrame(drinking_water_supplies.get_all())
polities_with_drinking_water_supplies_df = process_df(drinking_water_supplies_df, 'drinking_water_supply_system')
polities_with_drinking_water_supplies_df.sample(5)

There are 112 drinking_water_supply_system records after filtering.


|  | new_name | long_name | start_year | end_year | drinking_water_supply_system | general_description |
| --- | --- | --- | --- | --- | --- | --- |
| 80 | tr_konya_lba | Konya Plain - Late Bronze Age II | -1500 | -1400 | absent | The period of 1500-1400 BCE was an 'intermedia... |
| 136 | ir_achaemenid_emp | Achaemenid Empire | -550 | -331 | present | The Achaemenid Empire was established by Cyrus... |
| 45 | it_western_roman_emp | Western Roman Empire - Late Antiquity | 395 | 476 | absent | The period of the Western Roman Empire begins ... |
| 77 | af_kushan_emp | Kushan Empire | 35 | 319 | present | The Kushan Empire was a confederated state hea... |
| 66 | us_cahokia_2 | Cahokia - Moorehead | 1200 | 1275 | present | Generations of archaeologists have been amazed... |


In [63]:
# Get the responses for the Drinking Water Supply System variable
# polities_with_drinking_water_supplies_responses = deepseek_responses(polities_with_drinking_water_supplies_df,
#                                                      'Drinking Water Supply System',
#                                                      "Talking about Specialized Buildings, drinking water supply systems are polity owned (which includes owned by the community, or the state), we have coded the absence or presence of the variable",
#                                                      )

In [64]:
# Save the responses to a CSV file
# polities_with_drinking_water_supplies_responses.to_csv('polities_with_drinking_water_supplies_responses.csv', index=False)

# Load the responses from the CSV file
polities_with_drinking_water_supplies_responses = pd.read_csv('polities_with_drinking_water_supplies_responses.csv')

In [65]:
polities_with_drinking_water_supplies_responses.sample(5)

|  | new_name | answer | full |
| --- | --- | --- | --- |
| 17 | tr_konya_eba | present | `<think>\nOkay, so I need to figure out whether...` |
| 59 | in_vijayanagara_emp | present | `<think>\nOkay, so I need to figure out whether...` |
| 52 | mx_basin_of_mexico_4 | present | `<think>\nOkay, so I need to figure out whether...` |
| 97 | ru_sakha_late | present | `<think>\nOkay, so I need to figure out whether...` |
| 22 | us_emergent_mississippian_2 | absent | `<think>\nAlright, so I need to figure out whet...` |


In [66]:
performance(polities_with_drinking_water_supplies_df, polities_with_drinking_water_supplies_responses, 'drinking_water_supply_system')

Correct: 52, Total: 112, Percentage: 46.43%


46.42857142857143

## Another one!

In [67]:
from seshat_api.sc import Roads
roads = Roads(client)
roads_df = pd.DataFrame(roads.get_all())
polities_with_roads_df = process_df(roads_df, 'road')
polities_with_roads_df.sample(5)

There are 187 road records after filtering.


|  | new_name | long_name | start_year | end_year | road | general_description |
| --- | --- | --- | --- | --- | --- | --- |
| 250 | tr_ottoman_emp_2 | Ottoman Empire II | 1517 | 1683 | present | In the 15th century CE, the Turkic Ottoman Sul... |
| 131 | es_spanish_emp_1 | Spanish Empire I | 1516 | 1715 | absent | The Habsburg Dynasty came together as Ferdinan... |
| 199 | it_roman_rep_2 | Middle Roman Republic | -264 | -133 | present | The last of the Roman kings, the tyrannical Lu... |
| 171 | ir_susiana_archaic | Susiana - Muhammad Jaffar | -7000 | -6000 | present | "The Archaic Susiana 0 Phase. The appearance o... |
| 158 | mn_mongol_emp | Mongol Empire | 1206 | 1270 | absent | The Mongols began as one of a group of nomadic... |


In [68]:
# Get the responses for the Roads variable
# polities_with_roads_responses = deepseek_responses(polities_with_roads_df,
#                                                      'Roads',
#                                                      "Talking about Transport infrastructure, roads refers to deliberately constructed roads that connect settlements or other sites. It excludes streets/accessways within settlements and paths between settlements that develop through repeated use"
#                                                      )

In [69]:
# Save the responses to a CSV file
# polities_with_roads_responses.to_csv('polities_with_roads_responses.csv', index=False)

# Load the responses from the CSV file
polities_with_roads_responses = pd.read_csv('polities_with_roads_responses.csv')

In [70]:
polities_with_roads_responses.sample(5)

|  | new_name | answer | full |
| --- | --- | --- | --- |
| 93 | ir_elymais_2 | present | `<think>\nAlright, so I'm trying to figure out ...` |
| 53 | it_venetian_rep_4 | absent | `<think>\nOkay, so I need to figure out whether...` |
| 45 | af_hephthalite_emp | absent | `<think>\nOkay, so I need to figure out whether...` |
| 127 | jp_kamakura | absent | `<think>\nOkay, so I need to figure out whether...` |
| 110 | ir_susa_1 | present | `<think>\nOkay, so I need to figure out whether...` |


In [71]:
performance(polities_with_roads_df, polities_with_roads_responses, 'road')

Correct: 110, Total: 187, Percentage: 58.82%


58.82352941176471

## One more...

In [72]:
from seshat_api.wf import Irons
irons = Irons(client)
irons_df = pd.DataFrame(irons.get_all())
polities_with_irons_df = process_df(irons_df, 'iron')
polities_with_irons_df.sample(5)

There are 337 iron records after filtering.


|  | new_name | long_name | start_year | end_year | iron | general_description |
| --- | --- | --- | --- | --- | --- | --- |
| 138 | af_hephthalite_emp | Hephthalites | 408 | 561 | absent | The Hepthalites were one group of a series of ... |
| 159 | ec_shuar_1 | Shuar - Colonial | 1534 | 1830 | absent | The forested foothills of the Andes, near the ... |
| 344 | eg_dynasty_2 | Egypt - Dynasty II | -2900 | -2687 | present | The Second Dynasty of Egypt (c. 2900‒2687 BCE)... |
| 154 | cn_qing_dyn_2 | Late Qing | 1796 | 1912 | present | The Qing Dynasty (or Empire of the Great Qing,... |
| 307 | pg_orokaiva_colonial | Orokaiva - Colonial | 1884 | 1942 | absent | The Northern Province of Papua New Guinea has ... |


In [73]:
# Get the responses for the Military use of Metals: Iron variable
# polities_with_irons_responses = deepseek_responses(polities_with_irons_df,
#                                                      'Military use of Metals: Iron',
#                                                      "The absence or presence of iron as a military technology used in warfare"
#                                                      )

In [74]:
# Save the responses to a CSV file
# polities_with_irons_responses.to_csv('polities_with_irons_responses.csv', index=False)

# Load the responses from the CSV file
polities_with_irons_responses = pd.read_csv('polities_with_irons_responses.csv')

In [75]:
polities_with_irons_responses.sample(5)

|  | new_name | answer | full |
| --- | --- | --- | --- |
| 318 | tr_rum_sultanate | present | `<think>\nOkay, so I need to figure out whether...` |
| 147 | cn_peiligang | present | `<think>\nOkay, so I need to figure out whether...` |
| 62 | iq_abbasid_cal_1 | present | `<think>\nAlright, so I need to figure out whet...` |
| 256 | ml_jenne_jeno_1 | present | `<think>\nOkay, so I'm trying to figure out whe...` |
| 194 | ir_elam_7 | present | `<think>\nOkay, so I need to figure out whether...` |


In [78]:
performance(polities_with_irons_df, polities_with_irons_responses, 'iron')

Correct: 209, Total: 337, Percentage: 62.02%


62.01780415430267

# Conclusion

Using DeepSeek-R1 as an example of a recent language model with advanced capabilities, we gave it a series of prompts to test its knowledge of history, with expert-reviewed data from the Seshat Global History Databank as the ground truth. We deliberately asked questions that require some level of reasoning rather than just recall. Because the model was run locally, it could not query online resources, so every answer came from the model itself.

The questions are all about social complexity and military technology, and pertain to variables for which data has been collected in Seshat across a range of historical polities. These polities vary by era and geography, but the only information we gave the LLM about each polity was its name and the years in which it was active. The prompts read something like this:

> "Use your knowledge of world history to answer the following question. Given your knowledge of the historical polity **'Fatimid Caliphate'**, a polity that existed between **909 CE** and **1171 CE**, do you expect that **Scientific Literature** was *present* or *absent*? **Scientific Literature** is defined as: **'Talking about Kinds of Written Documents, Scientific literature includes mathematics, natural sciences, social sciences'**. Answer *'XXXpresentXXX'* if you expect it to be present, and *'XXXabsentXXX'* if you expect it to be absent."

We then extracted the answers from DeepSeek's responses. The share of correct answers ranged from 46.43% to 62.02%, not far from the **50%** you would expect if the model were picking at random between "absent" and "present" (other values such as "transitional" and "unknown" exist for these variables in Seshat data, but we didn't include those polities here).

In the CSV example answers I have saved in this repository, the results in terms of correct answers were:
- Scientific Literature: 52.67% (79/150) 
- Drinking Water Supply System: 46.43% (52/112)
- Roads: 58.82% (110/187) 
- Military use of Metals: Iron: 62.02% (209/337)
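
To put a number on "not far from 50%", each result can be compared with chance using a normal approximation to the binomial distribution. The sketch below uses only the counts listed above. The helper name `z_score_vs_chance` is hypothetical, and the normal approximation is a simplifying assumption (an exact binomial test would be an alternative).

```python
import math

# (correct, total) counts taken from the results listed above
results = {
    'Scientific Literature': (79, 150),
    'Drinking Water Supply System': (52, 112),
    'Roads': (110, 187),
    'Military use of Metals: Iron': (209, 337),
}


def z_score_vs_chance(correct, total, p0=0.5):
    """Normal-approximation z-score of the observed accuracy against chance level p0."""
    observed = correct / total
    # Standard error of a proportion under the null hypothesis p = p0
    standard_error = math.sqrt(p0 * (1 - p0) / total)
    return (observed - p0) / standard_error


z_scores = {name: z_score_vs_chance(c, t) for name, (c, t) in results.items()}
for name, z in z_scores.items():
    print(f"{name}: z = {z:+.2f}")
```

As a rule of thumb, an absolute z-score above about 1.96 corresponds to a difference from chance at the 5% two-sided level, so this gives a quick way to see which variables, if any, depart from random guessing.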