# Extract information with an LLM

Colab notebook written by Emma Bonutti D'Agostini and Emilien Schultz, June 2025.

## Install and import packages

In [None]:
# Install
# openpyxl is needed later by pandas to write and read .xlsx files
!pip install -q tqdm pandas==2.2.2 scikit-learn==1.6.0 openai Levenshtein openpyxl

In [None]:
# Import
import pandas as pd
import json
from openai import OpenAI
from tqdm import tqdm
import warnings
warnings.simplefilter(action='ignore')

In [None]:
# If you are working with Colab, connect this notebook to your personal Google Drive account
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Define request functions

We use OpenRouter to query different models through a single API. You will need to enter your own API **key**.

We will use `meta-llama/llama-3.3-70b-instruct`, which is both efficient and cheap.

In [None]:
# Define a function to make requests to the API

token_or = "sk-......" # INSERT YOUR KEY HERE

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=token_or,
)

def do_predictions(prompt_generator, texts, model = "meta-llama/llama-3.3-70b-instruct"):
    """
    Send one request per text to the given model.

    prompt_generator: function that takes a text and returns the list of chat messages
    texts: pandas Series of texts to process
    Returns a list with one completion per text (None when the request failed)
    """
    results = []
    total = len(texts)
    with tqdm(total=total, desc="Progress", unit='item',
              bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} [{percentage:3.0f}%]') as pbar:
        for _, text in texts.items():
            try:
                completion = client.chat.completions.create(
                    model=model,
                    messages=prompt_generator(text)
                )
                results.append(completion)
            except Exception as e:
                # Keep going on failure: report the error and store None for this text
                print(e)
                results.append(None)
            pbar.update(1)
    return results
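
The function above stores `None` as soon as a request fails. If transient errors (rate limits, timeouts) are frequent, one option is to retry a few times before giving up. The helper below is a minimal sketch under that assumption; the name `call_with_retries` and its parameters are hypothetical and not part of the original notebook.

```python
import time

def call_with_retries(make_request, max_tries=3, wait_seconds=2.0):
    """Call make_request() up to max_tries times; return None if every attempt fails."""
    for attempt in range(1, max_tries + 1):
        try:
            return make_request()
        except Exception as e:
            print(f"Attempt {attempt}/{max_tries} failed: {e}")
            if attempt < max_tries:
                # wait a bit longer after each failure (linear backoff)
                time.sleep(wait_seconds * attempt)
    return None
```

Inside `do_predictions`, the `client.chat.completions.create(...)` call could be wrapped in a `lambda` and passed to this helper, so that `None` is only stored after several failed attempts.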

Small test to see if everything works.

In [None]:
# Send a single test request
completion = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",
    messages=[
    {
      "role": "user",
      "content": "Could you make a joke on computational social science ?"
    }
  ]
)
completion

ChatCompletion(id='gen-1752648172-1es5QOijH1AphZxXSuKO', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Why did the agent-based model go to therapy?\n\nBecause it was struggling to simulate a sense of self, and its social networks were always in a latent state of crisis! But in the end, it just needed to re-run its algorithms and re-learn its parameters to find a more optimal solution to its emotional instability. Now it's predicting a happier future!", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning=None), native_finish_reason='stop')], created=1752648172, model='meta-llama/llama-3.3-70b-instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=73, prompt_tokens=45, total_tokens=118, completion_tokens_details=None, prompt_tokens_details=None), provider='Kluster')

# Information extraction
Partial replication of "From Codebooks to Promptbooks" (Stuhler, Ton and Ollion 2025).

- link to [paper](https://journals.sagepub.com/eprint/T3QXU8KV5BP5QYKZTDZI/full)
- link to [replication materials](https://osf.io/hwuvs/)

The obituaries analyzed in the paper are a synthetic sample, i.e. they are not real obituaries but were generated by AI on the model of NYT obituaries.

Again, we will work with a sample of 100 texts to avoid making too many requests.

Together, we will replicate the information extraction for gender (categorical) and educational institution attended (open field).

## Dataset
Let's look at what typical obituaries look like and how the information of interest is presented.

In [None]:
url = "https://raw.githubusercontent.com/css-polytechnique/ic2s2-tutorial-llm-2025/refs/heads/main/data/obituaries.csv"
obits = pd.read_csv(url)
obits.head()

Unnamed: 0,Article_ID,Date,text_combined,age_in_years,gender,religion,education_institution,first_name,last_name
0,431928514,2001-11-03,Date: 2001-11-03\nObituary title: rabbi john s...,103,male,jewish,not mentioned,eliezer,schach
1,2038054333,2018-05-13,Date: 2018-05-13\nObituary title: john smith a...,81,male,not mentioned,not mentioned,ernest,medina
2,424015567,1980-11-15,Date: 1980-11-15\nObituary title: john smith a...,74,male,not mentioned,not mentioned,john h.,preston
3,940930401,2012-03-27,Date: 2012-03-27\nObituary title: jane smith 8...,82,female,not mentioned,high school of music and art in manhattan,anita,steckel
4,430642115,1996-08-06,Date: 1996-08-06\nObituary title: jane smith 5...,58,female,not mentioned,"vassar college, fordham university",jean,gerard


In [None]:
obits['text_combined'].iloc[10]

"Date: 1981-08-07\nObituary title: john smith ex-buffalo mayor; served in mid-70's\nObituary: john m. smith a one-time grain-mill worker who became mayor of buffalo, died yesterday in deaconess hospital here following exploratory surgery for a respiratory illness. he was 58 years old. mr. smith a democrat, served as mayor from 1973 to 1977. since his decision in 1977 not to run for re-election, he served on the state industrial board of appeals. he was succeeded as mayor by james d. griffin, with whom he once worked in a grain mill. in a city geared to traditional methods in politics, mr. smith rose through the ranks, first with the american federation of grain millers, a.f.l.-c.i.o., and then in government. he was first elected to the former erie county board of supervisors in 1955. in 1959, he was appointed to the buffalo common council as a councilman at large and later became majority leader. in 1969, he became an aide to the late mayor frank a. sedita. in 1972, mr. smith was named

We will extract different pieces of information:

- gender (categorical variable)
- educational institution attended (open field, trickier)

Note: obituaries are long texts, so each request uses more tokens and costs more.
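
To get an order of magnitude before sending requests, the input size can be estimated from the text length. The sketch below assumes roughly 4 characters per token, which is only a rule of thumb for English text (the real count depends on the model's tokenizer), and uses a placeholder price; `estimate_cost` is a hypothetical helper, and the actual price per million tokens has to be looked up for the chosen model.

```python
def estimate_cost(texts, price_per_million_tokens, chars_per_token=4.0):
    """Rough input cost: characters / chars_per_token, priced per million tokens."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = n_chars / chars_per_token
    return n_tokens, n_tokens * price_per_million_tokens / 1_000_000

# Hypothetical example: two made-up texts and a placeholder price of 0.10 per million tokens
tokens, cost = estimate_cost(["a" * 4000, "b" * 2000], price_per_million_tokens=0.10)
print(f"~{tokens:.0f} input tokens, ~{cost:.6f} in input cost")
```

On the real data, the same function could be called on `obits["text_combined"]`. The estimate is a lower bound, since the system prompt and the instructions are sent with every request and the generated answer is billed as well.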

## Replication for Gender

In [None]:
# Build the prompt: a system message plus a user message that embeds the text to process (passed as argument)

prompt_system = ("You are a highly efficient information detection and extraction engine, specialized in analyzing natural language data.\n"
    "You value accuracy: when the user asks you to extract certain information from given text data, you will try your best to adhere to what is directly mentioned in the text and the extraction criteria.\n"
    "You value efficiency: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt. ")

def prompt_user_gender(text):
    # Return the chat messages: shared system prompt, then the user instruction with the text appended
    return [
       {"role":"system","content":prompt_system},
       {"role":"user","content": "Below I will provide an obituary of a deceased person.\n" +
        "Based on the text, infer the gender of the deceased person. Provide a one-word response from only one of the following options: 'male', 'female', 'other'." +
        f"\n\nThe text : {text}"}]

# Check if the prompt is correct
prompt_user_gender(obits.loc[1,"text_combined"])

[{'role': 'system',
  'content': 'You are a highly efficient information detection and extraction engine, specialized in analyzing natural language data.\nYou value accuracy: when the user asks you to extract certain information from given text data, you will try your best to adhere to what is directly mentioned in the text and the extraction criteria.\nYou value efficiency: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt. '},
 {'role': 'user',
  'content': "Below I will provide an obituary of a deceased person.\nBased on the text, infer the gender of the deceased person. Provide a one-word response from only one of the following options: 'male', 'female', 'other'.\n\nThe text : Date: 2018-05-13\nObituary title: john smith army captain acquitted in my lai massacre, dies at 81\nObituary: john l. smith the army captain who was accused of overall res

Careful: inference can take several minutes (around 3). For this reason, let's use only a sample of the data.

In [None]:
# Create a small sample to test (.copy() avoids pandas' SettingWithCopyWarning when adding columns)
N_max = 10
df = obits[0:N_max].copy()

And run the prompt on the data

In [None]:
# Run the prompts
r_llama33 = do_predictions(prompt_user_gender,
                           df['text_combined'],
                           "meta-llama/llama-3.3-70b-instruct"  # change this identifier to test a different model
                           )
# Add the result to the dataframe (None when the request failed)
df.loc[:, "gender_llama33"] = [i.choices[0].message.content if i is not None else None for i in r_llama33]

# Save the data
df.to_csv("table_results.csv")

# Display
df[["gender", "gender_llama33"]]

Progress: 100%|██████████ | 10/10 [100%]

Unnamed: 0,gender,gender_llama33
0,male,male
1,male,male
2,male,male
3,female,female
4,female,female
5,male,male
6,male,male
7,male,male
8,male,male
9,male,male


It works pretty well! Let's compute the accuracy.

In [None]:
(df["gender"] == df["gender_llama33"]).mean()

np.float64(1.0)
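
Exact string comparison works here because the model answered with exactly the expected labels. Other models may return variants such as `Male.` or add a trailing newline. A minimal normalization sketch, assuming the three labels allowed by the prompt (`normalize_label` is a hypothetical helper, not part of the original notebook):

```python
import string

ALLOWED_GENDERS = {"male", "female", "other"}

def normalize_label(answer, allowed=ALLOWED_GENDERS):
    """Lower-case, strip whitespace and surrounding punctuation; None if not an allowed label."""
    if not isinstance(answer, str):
        # failed requests (None) or missing values (NaN) stay missing
        return None
    cleaned = answer.strip().lower().strip(string.punctuation + " ")
    return cleaned if cleaned in allowed else None

print(normalize_label(" Male.\n"))
```

It could be applied with `df["gender_llama33"].map(normalize_label)` before computing the accuracy. Answers that are not a single allowed label are mapped to `None`, so they count as errors rather than being silently guessed.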

## Replication for Educational Institution Attended

In [None]:
# We use the same system prompt but define a new user prompt
def prompt_user_educinstit(text):
  return [
       {"role":"system","content":prompt_system},
       {"role":"user","content": """Below I will provide an obituary of a deceased person.
Record all institutions of higher education that the person obtained a degree from (i.e., universities, colleges, or graduate & professional schools), exactly as written in the text. If the text indicates that this person attended some institution as a student, but did not complete their degree, record this institution as well. When giving your response, consider the following rules:
1) Do not include high schools or college preparatory schools.
2) Do not include institutions that the person’s friends, family, coworkers or partners attended, unless the deceased person also attended them.
3) Obituaries may describe decedents who were employed at academic institutions, such as instructors, scientists, university administrators and coaches. You must distinguish higher education institutions that this person studied at from those that this person worked at. Only institutions where the person studied should be considered in your response. Do not record higher education institutions only because the person worked, taught, or held a job there. For example, if the text says “after transferring from University 1 to study mathematics at University 2, he eventually got a master's degree from University 3. He became a head coach at University 4 and taught sports science at University 5”, your response should only include Universities 1, 2 and 3, but not University 4.
If the text does not mention any institutions of higher education that the person attended, simply respond with “none”.
If your response is a list of two or more institutions, please separate each institution with a comma (e.g.: 'university 1, university 2, university 3').""" +
f"\n\nThe text : {text}"}]

prompt_user_educinstit(obits.loc[1,"text_combined"])


[{'role': 'system',
  'content': 'You are a highly efficient information detection and extraction engine, specialized in analyzing natural language data.\nYou value accuracy: when the user asks you to extract certain information from given text data, you will try your best to adhere to what is directly mentioned in the text and the extraction criteria.\nYou value efficiency: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt. '},
 {'role': 'user',
  'content': "Below I will provide an obituary of a deceased person.\nRecord all institutions of higher education that the person obtained a degree from (i.e., universities, colleges, or graduate & professional schools), exactly as written in the text. If the text indicates that this person attended some institution as a student, but did not complete their degree, record this institution as well. When givin

In [None]:
# Run the prompts
r_llama33 = do_predictions(prompt_user_educinstit,
                           df['text_combined'],
                           "meta-llama/llama-3.3-70b-instruct"  # change this identifier to test a different model
                           )
# Add the result to the dataframe (None when the request failed)
df.loc[:, "education_institution_llama33"] = [i.choices[0].message.content if i is not None else None for i in r_llama33]

# Save the data
df.to_csv("table_results.csv")

# Display
df[["education_institution", "education_institution_llama33"]]

Progress: 100%|██████████ | 10/10 [100%]


Unnamed: 0,education_institution,education_institution_llama33
0,not mentioned,none
1,not mentioned,none
2,not mentioned,none
3,high school of music and art in manhattan,none
4,"vassar college, fordham university","Vassar College, Fordham University"
5,not mentioned,none
6,"san jose state university, penn state university","Penn State University, San Jose State University"
7,st john's university law school,St. John's University Law School
8,hope college,Hope College
9,not mentioned,none


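Raw exact-match accuracy gives a catastrophic result: differences in capitalization, punctuation, ordering, and the wording of empty answers (`not mentioned` vs. `none`) all count as errors.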

In [None]:
(df["education_institution_llama33"] == df["education_institution"]).mean()

np.float64(0.0)

# Your turn to try!

Try with a different model

Try to better define the extraction format to limit the variation

Try with the variable religion. Do you notice something?


There are differences, but are they meaningful? How should we compute the metrics?
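
For the suggestion about better defining the extraction format, one possible direction is to ask for JSON and parse it with the `json` module. The sketch below is only an illustration: the prompt wording, the `institutions` key, and both function names are assumptions, not the prompt used in the paper.

```python
import json

def prompt_user_educinstit_json(text):
    """Hypothetical variant of the user prompt that asks for a JSON answer."""
    return [
        {"role": "user", "content":
            "Below I will provide an obituary of a deceased person.\n"
            "List the institutions of higher education the person attended.\n"
            'Answer only with JSON of the form {"institutions": ["name 1", "name 2"]}, '
            "using lower case and an empty list if none is mentioned."
            f"\n\nThe text : {text}"}
    ]

def parse_institutions(answer):
    """Return a list of institution names, or None if the answer is not the expected JSON."""
    try:
        data = json.loads(answer)
    except (TypeError, json.JSONDecodeError):
        return None
    if not isinstance(data, dict):
        return None
    values = data.get("institutions")
    if not isinstance(values, list):
        return None
    return [str(v).strip().lower() for v in values]
```

Models sometimes wrap JSON in extra text or code fences, in which case `parse_institutions` returns `None`; those cases would still need a manual check.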

# Taking into account variations

Since generation produces free-text responses, we need a strategy to compare them with the gold standard.

Different strategies can be used:

- systematic normalization: lower-casing, removing punctuation
- human judgment on disagreements, to validate them
- LLM-as-a-judge, with a new request to an LLM
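
The LLM-as-a-judge strategy is not used in the rest of this notebook. As an illustration only, it could look like the sketch below: a prompt that shows the judge both values, and a parser for its reply. The wording and the names `prompt_judge` and `parse_judgment` are hypothetical.

```python
def prompt_judge(gold, predicted):
    """Hypothetical judge prompt: do two extracted values refer to the same institutions?"""
    return [
        {"role": "system", "content": "You compare two extracted values and answer with one word."},
        {"role": "user", "content":
            "Do the following two values refer to the same set of educational institutions?\n"
            f"Value A: {gold}\nValue B: {predicted}\n"
            "Answer only 'yes' or 'no'."},
    ]

def parse_judgment(answer):
    """Map the judge's reply to True/False, or None when it is not a clear yes/no."""
    if not isinstance(answer, str):
        return None
    cleaned = answer.strip().lower().strip(".'\"! ")
    if cleaned == "yes":
        return True
    if cleaned == "no":
        return False
    return None
```

The messages returned by `prompt_judge` can be sent with `client.chat.completions.create`, as in the test request earlier. The judge's decisions are themselves model outputs, so they should be spot-checked by a human on a sample.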

For the article, we used a human-in-the-loop approach:

1. A first comparison with the gold standard using simple automatic rules
2. A human judges whether each remaining disagreement is real (three possibilities: disagreement, agreement, partial agreement)
3. Computation of the metrics using the results of the human review

So we have 2 rules to decide if a prediction is correct:

- it is the same string up to small variations (one or two different letters, or punctuation)
- a human decides it is the same

In [None]:
# Reload the results if needed for the comparison
df = pd.read_csv("table_results.csv")
# .copy() so that columns can be added later without a SettingWithCopyWarning
df_compare = df[["education_institution","education_institution_llama33"]].copy()
df_compare

Unnamed: 0,education_institution,education_institution_llama33
0,not mentioned,none
1,not mentioned,none
2,not mentioned,none
3,high school of music and art in manhattan,none
4,"vassar college, fordham university","Vassar College, Fordham University"
5,not mentioned,none
6,"san jose state university, penn state university","Penn State University, San Jose State University"
7,st john's university law school,St. John's University Law School
8,hope college,Hope College
9,not mentioned,none


### Define a cleaning function to compare



In [None]:
import string # to get punctuation
import Levenshtein # to measure the distance between strings
import pandas as pd

def clean(val:str):
  """
  Normalize a string: lower case, no punctuation.
  Missing values and empty answers ("not mentioned", "none") become None.
  """
  if pd.isnull(val):
    return None
  # lower case without punctuation
  val = val.lower().translate(str.maketrans('', '', string.punctuation))
  # empty answer variations
  if val in ["not mentioned", "none"]:
    val = None
  return val

def eval_equality(str1:str, str2:str, distance_max:int = 1):
  """
  Decide whether 2 strings are equal after cleaning,
  allowing up to distance_max edits (Levenshtein distance)
  """
  # clean the string using function defined above
  str1 = clean(str1)
  str2 = clean(str2)

  # case with None values: equal only if both are missing
  if str1 is None or str2 is None:
    return str1 is None and str2 is None

  # test equality

  # strict
  if str1 == str2:
    return True

  # with distance_max letters difference
  distance = Levenshtein.distance(str1, str2)
  if distance <= distance_max:
    return True

  return False

Add to the dataframe new columns with the post-processed values.

In [None]:
df_compare["education_institution_llama33_valid"] = df_compare.apply(lambda x: eval_equality(x["education_institution"], x["education_institution_llama33"]), axis=1)

The result is already much better.

In [None]:
df_compare["education_institution_llama33_valid"].mean()

np.float64(0.8)
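
One of the remaining disagreements in the table above is a pure ordering difference (the same two universities listed in a different order). A possible extra automatic rule is to compare the answers as sets of institutions. This is a minimal sketch, assuming institutions are separated by commas as requested in the prompt; `to_institution_set` and `eval_set_equality` are hypothetical helpers, not part of the original notebook.

```python
import string

def to_institution_set(val):
    """Split a comma-separated answer into a set of cleaned institution names."""
    if not isinstance(val, str):
        return frozenset()
    items = set()
    for part in val.split(","):
        # same cleaning idea as above: lower case, no punctuation
        name = part.lower().translate(str.maketrans("", "", string.punctuation)).strip()
        if name and name not in ("not mentioned", "none"):
            items.add(name)
    return frozenset(items)

def eval_set_equality(str1, str2):
    """True when both values contain the same institutions, regardless of order."""
    return to_institution_set(str1) == to_institution_set(str2)

print(eval_set_equality(
    "san jose state university, penn state university",
    "Penn State University, San Jose State University",
))
```

This rule has no tolerance for spelling variations and would wrongly split an institution name that itself contains a comma, so it complements rather than replaces the distance-based check and the human review.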

Export only the cases where the two values still disagree to a file.

In [None]:
# for llama3.3: keep only the rows still marked as different
# reset_index() stores the original row label in a column named "index"
table = df_compare[~df_compare["education_institution_llama33_valid"]][["education_institution","education_institution_llama33"]].reset_index()
table["equal"] = None
table.to_excel("ie_education_institution_llama33_to_recode.xlsx")

The annotator then needs to rename the file from `ie_education_institution_llama33_to_recode.xlsx` to `ie_education_institution_llama33_recoded.xlsx` and enter something (1 or X) in the column "equal" whenever they judge that the extracted value matches the gold standard.

Reload the file after these changes and match it with the data to compute performance.

In [None]:
from pathlib import Path

if Path("ie_education_institution_llama33_recoded.xlsx").exists():
    # read the human-annotated file
    table_reco = pd.read_excel("ie_education_institution_llama33_recoded.xlsx")

    # original row labels (column "index" created by reset_index) of the rows validated by the annotator;
    # the position in the Excel file would point to the wrong rows of df_compare
    idx_human_feedback = table_reco.loc[table_reco["equal"].notnull(), "index"]
    df_compare.loc[idx_human_feedback, "education_institution_llama33_valid"] = True
else:
    print("No human feedback for llama3.3")

No human feedback for llama3.3


In [None]:
df_compare["education_institution_llama33_valid"].mean()

np.float64(0.8)

For categorical data, classical metrics (F1, etc.) can be used directly. For open-field data, the automatic rules or the human annotator can judge a generated value equal to the gold standard even when it is not exactly the same string, so we must "smooth" (post-process) the data before proceeding.
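
As an illustration of a classical metric for a categorical variable such as gender, the sketch below computes a per-class F1 score in plain Python. The labels are toy values, not taken from the dataset, and `f1_per_class` is a hypothetical helper; scikit-learn's `sklearn.metrics.f1_score` (for instance with `average="macro"`) provides the same kind of computation.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from two equal-length sequences of labels."""
    scores = {}
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Hypothetical toy labels, not taken from the dataset
gold = ["male", "male", "female", "female"]
pred = ["male", "female", "female", "female"]
scores = f1_per_class(gold, pred)
macro_f1 = sum(scores.values()) / len(scores)
print(scores, macro_f1)
```

Unlike plain accuracy, per-class scores show whether errors concentrate on one category, which matters when the categories are unbalanced, as they are in the sample above.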