# Extract information with an LLM

Colab notebook written by Emma Bonutti D'Agostini and Emilien Schultz, June 2025.

## Install and import packages

In [74]:
# Install
!pip install -q tqdm pandas scikit-learn openai Levenshtein openpyxl

In [2]:
# Import
import pandas as pd
import json
from openai import OpenAI
from tqdm import tqdm
import warnings
warnings.simplefilter(action='ignore')

In [None]:
# If you are working with Colab, connect this notebook to your personal Google Drive account
from google.colab import drive
drive.mount('/content/drive')

## Define request functions

We use OpenRouter to query different models through a single API. You will need to enter your own API **key**.

We will use meta-llama/llama-3.3-70b-instruct which is both efficient and cheap.
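
To avoid leaving the key in a shared notebook, it can be read from an environment variable instead. The sketch below is one possible way to do this; the variable name `OPENROUTER_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration, not something the notebook or OpenRouter requires.

```python
import os

def load_api_key(var_name: str = "OPENROUTER_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical choice; use whatever name you set.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

The result can then be assigned to `token_or` in place of the hard-coded string.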

In [None]:
# Define a function to make requests to the API

token_or = "YOUR KEY" # INSERT YOUR KEY HERE

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=token_or,
)

def do_predictions(prompt_generator, texts, model = "meta-llama/llama-3.3-70b-instruct"):
    """
    Run a prompt generator on a pandas Series of texts for a given model.
    Returns one completion per text, or None when the request failed.
    """
    results = []
    total = len(texts)
    with tqdm(total=total, desc="Progress", unit='item',
              bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} [{percentage:3.0f}%]') as pbar:
        for _, text in texts.items():
            try:
                completion = client.chat.completions.create(
                    model=model,
                    messages=prompt_generator(text)
                )
                results.append(completion)
            except Exception as e:
                # Keep going on errors so that results stay aligned with texts
                print(e)
                results.append(None)
            pbar.update(1)
    return results
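
The function above records `None` as soon as a request fails. If transient errors (rate limits, timeouts) are frequent, a retry with exponential backoff may help. The following is a minimal sketch, not part of the original notebook: `call_with_retry`, `max_retries`, `base_delay` and the injectable `sleep` argument are all hypothetical names.

```python
import time

def call_with_retry(make_request, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call make_request(); on failure wait base_delay * 2**attempt, then retry.

    Returns the result of make_request(), or None if every attempt failed.
    """
    for attempt in range(max_retries + 1):
        try:
            return make_request()
        except Exception as error:
            if attempt == max_retries:
                print(f"Giving up after {attempt + 1} attempts: {error}")
                return None
            # Wait longer after each failure: 1s, 2s, 4s, ... for base_delay=1
            sleep(base_delay * 2 ** attempt)
```

Inside `do_predictions`, the request could then be wrapped as `call_with_retry(lambda: client.chat.completions.create(model=model, messages=messages))`, where `messages` is the output of the prompt generator.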

Small test to check that everything works:

In [4]:
# Send a single test request to the model
completion = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",
    messages=[
    {
      "role": "user",
      "content": "Could you make a joke on computational social science ?"
    }
  ]
)
completion

ChatCompletion(id='gen-1757939121-vY4ClgiDvsvKO6xWzlKu', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='A joke about computational social science! Here\'s one:\n\nWhy did the computational social scientist break up with his girlfriend?\n\nBecause he realized their relationship was just a weak tie in a large network, and his sentiment analysis kept showing a decline in positive emotions. He tried to optimize their communication using natural language processing, but it was just a latency issue – she was always responding slowly to his messages! In the end, he decided to terminate the relationship, citing "insufficient convergence" and a lack of "emergent behavior" in their interactions. Now, he\'s just a node in a singles\' network, searching for a stronger connection... \n\nHope that one computed a smile!', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning=None), native_fini
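
The response object is verbose; the reply text itself sits at `choices[0].message.content`, which is the path used later in this notebook. The sketch below wraps that access in a small helper that also tolerates failed requests. `get_content` is a hypothetical name, and the `SimpleNamespace` object is only a stand-in that mimics the shape of the response so the example runs without an API call.

```python
from types import SimpleNamespace

def get_content(completion):
    """Return the reply text of a chat completion, or None if the request failed."""
    if completion is None or not completion.choices:
        return None
    return completion.choices[0].message.content

# Stand-in object with the same attribute path as a real completion
fake = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Hello"))]
)
print(get_content(fake))  # Hello
print(get_content(None))  # None
```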

# Information extraction
Partial replication of "From Codebooks to Promptbooks" (Stuhler, Ton and Ollion 2025).

- link to [paper](https://journals.sagepub.com/eprint/T3QXU8KV5BP5QYKZTDZI/full)
- link to [replication materials](https://osf.io/hwuvs/)

The obituaries analyzed in the paper are a synthetic sample, i.e. they are not real obituaries but were generated by AI, modeled on NYT obituaries.

Again, we will work with a small sample of texts (10 in the code below) to avoid making too many requests.

Together, we will replicate the information extraction for gender (categorical) and educational institution attended (open field).

## Dataset
Let's look at what a typical obituary looks like and how the information of interest is presented.

In [16]:
url = "https://github.com/css-polytechnique/css-ipp-materials/raw/refs/heads/main/Python-tutorials/SICSS-2025/information-extraction/20241009_Synthetic300.csv"
obits = pd.read_csv(url)
def combine_date_title_and_text(row : pd.Series):
    # Single quotes inside the f-string keep it valid on Python versions before 3.12
    return f"Date: {row['date_death']}\nObituary: {row['text']}"

obits["text_combined"] = obits.apply(combine_date_title_and_text, axis = 1)
obits.head()

Unnamed: 0.1,Unnamed: 0,nid,gender,name,birth_month,birth_day,birth_year,occup,occup_elaborate,children,...,milit,birth,raised,last_lived,place_death,Religion,text,text_tok,text_char,text_combined
0,1,52,man,John Deer,July,12,1944,Geneticist,2,1,...,Vietnam,"Anaheim, California",the same place as they were born in,Los Angeles,Stanford Health Care-Stanford Hospital,none mentioned,"John Deer, a pioneering geneticist whose groun...",602,3556,"Date: October 1st, 2024\nObituary: John Deer, ..."
1,2,101,man,John Deer,May,3,1942,Forensic Scientist,1,0,...,Vietnam,"Providence , Rhode Island",the same place as they were born in,New York,,none mentioned,"John Deer, a distinguished forensic scientist ...",486,2748,"Date: October 1st, 2024\nObituary: John Deer, ..."
2,3,3,man,John Deer,February,12,1901,Engineer,3,2,...,WWI,"Chicago, Illinois","Arlington, Texas",Chicago,in an elderly care facility,none mentioned,"John Deer, a revered engineer known for his in...",474,2715,"Date: October 1st, 2024\nObituary: John Deer, ..."
3,4,115,man,John Deer,August,10,1900,Neurosurgeon,5,2,...,WWI,"Jackson, Mississippi",the same place as they were born in,Phoenix,UCLA Medical Center – Los Angeles,none mentioned,"John Deer, a pioneering neurosurgeon renowned ...",514,2929,"Date: October 1st, 2024\nObituary: John Deer, ..."
4,5,1,man,John Deer,July,9,1932,Medical doctor,1,3,...,WWII,"Arlington, Texas",the same place as they were born in,New York,,none mentioned,"John Deer, a dedicated medical doctor who touc...",470,2506,"Date: October 1st, 2024\nObituary: John Deer, ..."


In [17]:
obits['text_combined'].iloc[10]

"Date: October 1st, 2024\nObituary: John Deer, a dedicated auditor known for his meticulous attention to detail and unwavering integrity, passed away on October 7th, 2024, at the UCSF Medical Center in California. He was born on December 5, 1919, in Simi Valley, California, and was raised in the same nurturing community that shaped his values and principles.\n\nMr. Deer attended Yale, where he pursued his passion for numbers and finance, ultimately earning his college degree. Before his time at Yale, he attended a local high school where his exceptional academic abilities first caught the attention of his teachers. This early recognition of his potential set the stage for a lifetime of accomplishments in the field of auditing.\n\nDuring World War II, John bravely served his country with honor and distinction, embodying the spirit of sacrifice and commitment that defined his generation. His military service instilled in him a sense of duty and camaraderie that stayed with him throughout

We will extract different pieces of information:

- gender (categorical variable)
- educational institution attended (open field, trickier)

Comment: obituaries are long texts, so the cost per request will be higher.
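
To get a feel for the cost before sending requests, the input size can be estimated from token counts. The sketch below is only a rough illustration: it assumes that the `text_tok` column of the dataset holds a token count per obituary, and both `prompt_overhead` (tokens added by the system and user instructions) and `price_per_million` are placeholder values, not actual OpenRouter prices.

```python
def estimate_cost(token_counts, prompt_overhead=150, price_per_million=0.10):
    """Rough input cost in dollars for a list of per-text token counts.

    prompt_overhead and price_per_million are placeholder assumptions.
    """
    total_tokens = sum(n + prompt_overhead for n in token_counts)
    return total_tokens, total_tokens * price_per_million / 1_000_000

# Token counts of the first five obituaries shown above (text_tok column)
tokens, dollars = estimate_cost([602, 486, 474, 514, 470])
print(tokens, dollars)
```

With the dataframe loaded, the same function could be called as `estimate_cost(obits["text_tok"])`. Output tokens are billed as well, but the answers requested here are very short.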

## Replication for Gender

In [22]:
# Function to build the prompt: system message, user message, and the text to process (passed as argument)

prompt_system = ("You are a highly efficient information detection and extraction engine, specialized in analyzing natural language data.\n"
    "You value accuracy: when the user asks you to extract certain information from given text data, you will try your best to adhere to what is directly mentioned in the text and the extraction criteria.\n"
    "You value efficiency: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt. ")

def prompt_user_gender(text):
    # Return the chat messages: shared system prompt + user instructions with the text
    return [
       {"role":"system","content":prompt_system},
       {"role":"user","content": "Below I will provide an obituary of a deceased person.\n" +
        "Based on the text, infer the gender of the deceased person. Provide a one-word response from only one of the following options: 'man', 'woman', 'other'." +
        f"\n\nThe text : {text}"}]

# Check if the prompt is correct
prompt_user_gender(obits.loc[1,"text_combined"])

[{'role': 'system',
  'content': 'You are a highly efficient information detection and extraction engine, specialized in analyzing natural language data.\nYou value accuracy: when the user asks you to extract certain information from given text data, you will try your best to adhere to what is directly mentioned in the text and the extraction criteria.\nYou value efficiency: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt. '},
 {'role': 'user',
  'content': "Below I will provide an obituary of a deceased person.\nBased on the text, infer the gender of the deceased person. Provide a one-word response from only one of the following options: 'man', 'woman', 'other'.\n\nThe text : Date: October 1st, 2024\nObituary: John Deer, a distinguished forensic scientist known for his groundbreaking work in criminal investigations, passed away on October 7th, 20

Careful: inference can take several minutes (around 3). For this reason, let's use only a sample of the data.

In [23]:
# Create a sample to test (copy so that new columns can be added safely)
N_max = 10
df = obits.iloc[:N_max].copy()

And run the prompt on the data

In [24]:
# Run the prompts
r_llama33 = do_predictions(prompt_user_gender,
                           df['text_combined'],
                           "meta-llama/llama-3.3-70b-instruct" # change this string to test a different model
                           )
# Add the result to the dataframe (None when the request failed)
df.loc[:, "gender_llama33"] = [i.choices[0].message.content if i is not None else None for i in r_llama33]

# Save the data
df.to_csv("table_results.csv")

# Display
df[["gender", "gender_llama33"]]

Progress: 100%|██████████ | 10/10 [100%]


| | gender | gender_llama33 |
|---|---|---|
| 0 | man | man |
| 1 | man | man |
| 2 | man | man |
| 3 | man | man |
| 4 | man | man |
| 5 | man | man |
| 6 | man | man |
| 7 | man | man |
| 8 | man | man |
| 9 | man | man |


It works pretty well! Note, however, that all 10 sampled obituaries describe men, so this is not a demanding test. Let's compute the accuracy.

In [25]:
(df["gender"] == df["gender_llama33"]).mean()

np.float64(1.0)
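
The exact comparison above only works if the model replies with precisely one of the allowed words. Replies such as "Man." or "man" followed by a newline would count as errors. A small normalization step can guard against this. The sketch below is an illustration under assumptions: `normalize_label` and `ALLOWED` are hypothetical names, and anything that is not one of the three allowed labels is mapped to `None` so it can be inspected by hand.

```python
import string

ALLOWED = {"man", "woman", "other"}

def normalize_label(raw, allowed=ALLOWED):
    """Lower-case a model reply, trim spaces and surrounding punctuation,
    and return it only if it is one of the allowed labels."""
    if raw is None:
        return None
    label = raw.strip().lower().strip(string.punctuation + " ")
    return label if label in allowed else None

print(normalize_label(" Man.\n"))               # man
print(normalize_label("The person is a man"))   # None
```

It could be applied to the predictions with `df["gender_llama33"].map(normalize_label)` before computing the accuracy.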

## Replication for Educational Institution Attended

In [65]:
# We use the same system prompt but define a new user prompt
def prompt_user_educinstit(text):
  return [
       {"role":"system","content":prompt_system},
       {"role":"user","content": """Below I will provide an obituary of a deceased person.
Record all institutions of higher education that the person obtained a degree from (i.e., universities, colleges, or graduate & professional schools), exactly as written in the text. If the text indicates that this person attended some institution as a student, but did not complete their degree, record this institution as well. When giving your response, consider the following rules:
1) Do not include high schools or college preparatory schools.
2) Do not include institutions that the person’s friends, family, coworkers or partners attended, unless the deceased person also attended them.
3) Obituaries may describe decedents who were employed at academic institutions, such as instructors, scientists, university administrators and coaches. You must distinguish higher education institutions that this person studied at from those that this person worked at. Only institutions where the person studied should be considered in your response. Do not record higher education institutions only because the person worked, taught, or held a job there. For example, if the text says “after transferring from University 1 to study mathematics at University 2, he eventually got a master's degree from University 3. He became a head coach at University 4 and taught sports science at University 5”, your response should only include Universities 1, 2 and 3, but not University 4.
4) If a university is commonly known by its initials, give them instead of the full name (e.g. MIT for Massachusetts Institute of Technology).
If the text does not mention any institutions of higher education that the person attended, simply respond with “none”.
If your response is a list of two or more institutions, please separate each institution with a comma (e.g.: 'university 1, university 2, university 3').""" +
f"\n\nThe text : {text}"}]

prompt_user_educinstit(obits.loc[1,"text_combined"])


[{'role': 'system',
  'content': 'You are a highly efficient information detection and extraction engine, specialized in analyzing natural language data.\nYou value accuracy: when the user asks you to extract certain information from given text data, you will try your best to adhere to what is directly mentioned in the text and the extraction criteria.\nYou value efficiency: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt. '},
 {'role': 'user',
  'content': "Below I will provide an obituary of a deceased person.\nRecord all institutions of higher education that the person obtained a degree from (i.e., universities, colleges, or graduate & professional schools), exactly as written in the text. If the text indicates that this person attended some institution as a student, but did not complete their degree, record this institution as well. When givin

In [66]:
# Run the prompts
r_llama33 = do_predictions(prompt_user_educinstit,
                           df['text_combined'],
                           "meta-llama/llama-3.3-70b-instruct" # change this string to test a different model
                           )
# Add the result to the dataframe (None when the request failed)
df.loc[:, "education_institution_llama33"] = [i.choices[0].message.content if i is not None else None for i in r_llama33]

# Save the data
df.to_csv("table_results.csv")

# Display
df[["educ_inst", "education_institution_llama33"]]

Progress: 100%|██████████ | 10/10 [100%]


| | educ_inst | education_institution_llama33 |
|---|---|---|
| 0 | | University of California, Berkeley |
| 1 | Bowndoin College | Bowdoin College |
| 2 | Cornell | MIT, Cornell |
| 3 | Columbia | Columbia University |
| 4 | MIT | MIT |
| 5 | Carnegie Mellon | Carnegie Mellon |
| 6 | Columbia | Columbia |
| 7 | Cornell | Cornell University |
| 8 | University of Central Florida | University of Central Florida, Brownsville Sch... |
| 9 | | none |


Raw exact-match accuracy gives very poor validation metrics:

In [67]:
(df["education_institution_llama33"] == df["educ_inst"]).mean()

np.float64(0.3)

# Your turn to try!

Try with a different model

Try to better define the extraction format to limit the variation

Try with the variable religion. Do you notice something?
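
For the exercise on the extraction format, one option is to ask the model for a JSON reply and parse it with the `json` module. The sketch below shows only the parsing side and is an assumption-laden illustration: the key name `institutions` and the helper `parse_institutions` are hypothetical, and the prompt would have to request this exact format. Replies that are not valid JSON are returned as `None` so they can be checked manually.

```python
import json

def parse_institutions(reply):
    """Parse a reply expected to look like {"institutions": ["MIT", "Cornell"]}.

    Returns a list of names, or None if the reply does not follow the format.
    """
    try:
        data = json.loads(reply)
    except (TypeError, json.JSONDecodeError):
        return None  # missing reply or invalid JSON
    if not isinstance(data, dict):
        return None
    items = data.get("institutions")
    if not isinstance(items, list):
        return None
    return [str(item).strip() for item in items]

print(parse_institutions('{"institutions": ["MIT", " Cornell "]}'))  # ['MIT', 'Cornell']
print(parse_institutions("MIT, Cornell"))                            # None
```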


There are differences, but are they meaningful? How should we compute metrics?

# Taking into account variations

Since generative models produce free-text responses, we need a strategy to handle variation in their answers.

Different strategies can be used:

- systematic normalization: lower case, remove punctuation
- human judgment on disagreements to validate them
- LLM-as-a-judge, with a new request to an LLM
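
As an illustration of the third strategy, the judge can be given the gold value and the extracted value and asked for a one-word verdict. The sketch below only builds the messages, in the same format as the prompt functions used earlier. The function name `prompt_judge` and the wording of the instructions are hypothetical, not taken from the paper.

```python
def prompt_judge(gold, predicted):
    """Build chat messages asking a model whether two extracted values match.

    The instruction wording is a hypothetical example.
    """
    return [
        {"role": "system",
         "content": ("You compare two values extracted from the same text. "
                     "Answer with one word: 'agreement', 'partial' or 'disagreement'.")},
        {"role": "user",
         "content": f"Gold standard: {gold}\nExtracted value: {predicted}"},
    ]

messages = prompt_judge("Cornell", "MIT, Cornell")
print(messages[1]["content"])
```

The resulting list can be passed as `messages` to `client.chat.completions.create`. The verdicts of such a judge would themselves need to be validated against human judgment on a sample.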

For the article, we used a human-in-the-loop approach:

1. A first comparison with the gold standard using simple automatic rules
2. A human judges whether each remaining disagreement is real (three possibilities: disagreement, agreement, partial agreement)
3. Computation of the metrics using the human-validated results
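
To help the annotator in step 2, disagreements can be pre-sorted by comparing the sets of institutions on each side. This is a sketch of one possible aid, not the procedure used in the paper. `split_institutions` and `agreement_level` are hypothetical names, inputs are assumed to be strings or `None`, and splitting on commas will wrongly break names that contain a comma (e.g. "University of California, Berkeley").

```python
def split_institutions(value):
    """Turn 'MIT, Cornell' into {'mit', 'cornell'}; None or 'none' gives an empty set."""
    if value is None or value.strip().lower() in {"", "none"}:
        return set()
    return {part.strip().lower() for part in value.split(",") if part.strip()}

def agreement_level(gold, predicted):
    """Classify a pair as 'agreement', 'partial agreement' or 'disagreement'."""
    gold_set = split_institutions(gold)
    pred_set = split_institutions(predicted)
    if gold_set == pred_set:
        return "agreement"
    if gold_set & pred_set:
        return "partial agreement"
    return "disagreement"

print(agreement_level("Cornell", "MIT, Cornell"))  # partial agreement
print(agreement_level("Columbia", "Yale"))         # disagreement
```

Pairs such as "Columbia" and "Columbia University" still come out as a disagreement here, which is why the final decision is left to a human.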

So we have two rules to decide whether a prediction is correct:

- it is the same string up to small variations (a letter or two, or punctuation; the code below tolerates an edit distance of 1 by default)
- a human decides it is the same
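
The first rule relies on the Levenshtein (edit) distance: the minimum number of single-character insertions, deletions or substitutions needed to turn one string into the other. The notebook uses the `Levenshtein` package for this. The pure-Python version below is only meant to show what is being computed.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("bowndoin", "bowdoin"))     # 1
print(levenshtein("cornell", "mit cornell"))  # 4
```

With a tolerance of 1, the misspelled gold value "Bowndoin" matches "Bowdoin", while "Cornell" against "MIT, Cornell" remains a disagreement.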

In [68]:
# Keep only the columns needed for the comparison (copy so that new columns can be added safely)
df_compare = df[["educ_inst","education_institution_llama33"]].copy()
df_compare

| | educ_inst | education_institution_llama33 |
|---|---|---|
| 0 | | University of California, Berkeley |
| 1 | Bowndoin College | Bowdoin College |
| 2 | Cornell | MIT, Cornell |
| 3 | Columbia | Columbia University |
| 4 | MIT | MIT |
| 5 | Carnegie Mellon | Carnegie Mellon |
| 6 | Columbia | Columbia |
| 7 | Cornell | Cornell University |
| 8 | University of Central Florida | University of Central Florida, Brownsville Sch... |
| 9 | | none |


## Define a cleaning function for the comparison

In [69]:
import string # to get punctuation
import Levenshtein # to measure the distance between strings
import pandas as pd

def clean(val:str):
  """
  Clean strings
  """
  if pd.isnull(val):
    return None
  # lower case without punctuation
  val = val.lower().translate(str.maketrans('', '', string.punctuation))
  # drop generic words, then the stray whitespace they leave behind
  val = val.replace("university", "").replace("college", "").strip()
  # empty answer variations
  if val in ["", "not mentioned", "none"]:
    val = None
  return val

def eval_equality(str1:str, str2:str, distance_max:int = 1):
  """
  Define equality between 2 strings
  """
  # clean the string using function defined above
  str1 = clean(str1)
  str2 = clean(str2)

  # case with the None value: equal only if both are missing
  if str1 is None or str2 is None:
    return str1 is None and str2 is None

  # test equality

  # strict
  if str1 == str2:
    return True

  # with distance_max letters difference
  distance = Levenshtein.distance(str1, str2)
  if distance <= distance_max:
    return True

  return False

Add to the dataframe new columns with the post-processed values.

In [70]:
df_compare["education_institution_llama33_valid"] = df_compare.apply(lambda x: eval_equality(x["educ_inst"], x["education_institution_llama33"]), axis=1)

This is already better:

In [71]:
df_compare["education_institution_llama33_valid"].mean()

np.float64(0.6)

Export only the cases with disagreement to a file.

In [75]:
# for llama3.3: reset_index() keeps the original row label in an "index" column
table = df_compare[~df_compare["education_institution_llama33_valid"]][["educ_inst","education_institution_llama33"]].reset_index()
table["equal"] = None
table.to_excel("ie_education_institution_llama33_to_recode.xlsx")

The annotator then renames the file from `ie_education_institution_llama33_to_recode.xlsx` to `ie_education_institution_llama33_recoded.xlsx` and enters a mark (1 or X) in the column "equal" whenever they judge that the extracted value matches the gold standard.

Reload the recoded file and match it with the data to compute performance.

In [78]:
from pathlib import Path

if Path("ie_education_institution_llama33_recoded.xlsx").exists():
    # read the human annotated file
    table_reco = pd.read_excel("ie_education_institution_llama33_recoded.xlsx")

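    # get the original row labels (saved in the "index" column by reset_index)
    # of the cases the annotator marked as equal
    idx_human_feedback = table_reco.loc[table_reco["equal"].notnull(), "index"].tolist()
    df_compare.loc[idx_human_feedback, "education_institution_llama33_valid"] = True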
else:
    print("No human feedback available for llama3.3")

In [79]:
df_compare["education_institution_llama33_valid"].mean()

np.float64(0.6)

For categorical data, classical metrics (F1, ...) can be used directly. For open-field data, a generated value can be judged equal to the gold standard, by the automatic rules or by the human in the loop, even when it is not exactly the same string. We must therefore "smooth" (post-process) the data before proceeding.
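
For reference, per-class precision, recall and F1 can be computed in a few lines. The sketch below is written in plain Python so that the definitions are visible; `per_class_f1` is a hypothetical helper, labels are assumed to be strings, and the toy labels are made up for illustration (they are not results from this notebook). In practice, scikit-learn (installed at the top of the notebook) provides equivalent functions.

```python
def per_class_f1(gold, predicted):
    """Return precision, recall and F1 for each label found in gold or predicted."""
    scores = {}
    pairs = list(zip(gold, predicted))
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in pairs)  # true positives
        fp = sum(g != label and p == label for g, p in pairs)  # false positives
        fn = sum(g == label and p != label for g, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Made-up labels, for illustration only
toy_gold = ["man", "man", "woman", "woman"]
toy_pred = ["man", "woman", "woman", "woman"]
print(per_class_f1(toy_gold, toy_pred))
```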