Colab notebook written by Emma Bonutti D'Agostini and Emilien Schultz, June 2025.

# Set-up

If you want to run this code on Colab, you need to download this notebook and upload it to a folder in your personal Google Drive account.

**We nonetheless recommend using a code editor on your personal computer, such as VSCode.**

## Packages

In [None]:
# Install (pyyaml, transformers and openpyxl are used later in this notebook; they are usually preinstalled on Colab)
!pip install -q tqdm pandas==2.2.2 scikit-learn==1.6.0 openai Levenshtein pyyaml transformers openpyxl

In [1]:
# Import
import pandas as pd
import json
import yaml
pd.options.mode.chained_assignment = None

## OpenRouter API settings

**What is OpenRouter?**
> OpenRouter is a unified API that allows developers to access and use a wide range of powerful language models—such as OpenAI's GPT, Claude, Mistral, and others—without having to host them locally.
>
> Instead of installing and running large models on your own hardware, OpenRouter provides a simple interface to purchase and perform inference via third-party providers.

**How does it work?**
> OpenRouter acts as an intermediary that routes your model requests (inference) to a variety of model providers, depending on your selection. You simply send API requests through a single endpoint, and OpenRouter handles:
>- Model selection (unless you specify)
>- Authentication
>- Billing
>- Request routing
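
To make the "single endpoint" idea concrete, here is a minimal sketch that builds (without sending) the HTTP request behind one chat completion. The helper name `build_chat_request` is hypothetical; the rest of this notebook uses the `openai` client instead, which prepares an equivalent request for you.

```python
import json
import urllib.request

# OpenRouter chat completions endpoint (same base URL as used with the OpenAI client)
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_chat_request(api_key, model, user_message):
    """Build, but do not send, the HTTP request for one chat completion."""
    payload = {
        "model": model,  # the model is chosen per request
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # authentication with your key
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_chat_request(
    "INSERT_YOUR_KEY_HERE", "meta-llama/llama-3.3-70b-instruct", "Hello"
)
print(request.full_url)
print(json.loads(request.data)["model"])
```

Changing the model only changes the `model` field of the payload; the endpoint and the key stay the same.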

**Why Use OpenRouter Instead of Running Models Locally?**
> Running large generative models (Llama, Mistral, etc.) locally is challenging because of hardware limitations (GPUs), storage and memory requirements, and setup complexity.
>
> Using OpenRouter offers several key advantages:
>
>- No need for expensive local GPUs or cloud infrastructure.
>- Immediate access to multiple top-tier models.
>- Easy integration with just a few lines of code.
>- Pay-as-you-go pricing based on usage.


To use it with Python, we can use the OpenAI Python client together with an API key that has credit:
https://openrouter.ai/docs/quickstart

The OpenRouter key is shared in a `.txt` file. Prepare the config file that is shared on the drive, together with this notebook.

**But be careful:**
The privacy of your data is not guaranteed, since you are transmitting its content to third parties. Make sure that your data is neither sensitive nor copyrighted.
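
Since the key arrives in a text file, one option is to read it from that file rather than pasting it into the notebook, so that it is not saved with your code. The sketch below is only one way to do this: the function `load_api_key`, the environment variable name `OPENROUTER_API_KEY` and the file name in the usage comment are assumptions, not part of the shared material.

```python
import os
from pathlib import Path

def load_api_key(key_file, env_var="OPENROUTER_API_KEY"):
    """Return the key from an environment variable if set, else from a text file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # Otherwise use the first non-empty line of the file
    for line in Path(key_file).read_text(encoding="utf-8").splitlines():
        if line.strip():
            return line.strip()
    raise ValueError(f"No key found in {key_file}")

# Hypothetical usage, assuming the key file sits in the folder defined by my_path:
# token_or = load_api_key(my_path + "openrouter_key.txt")
```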

In [None]:
from openai import OpenAI
from tqdm import tqdm

In [None]:
# If you are working with Colab, connect this notebook to your personal Google Drive account
# This allows you to access, through this notebook, data files etc. stored on your drive
# As well as to create new files to save the output of the data processing pipeline

# ***ATTENTION*** Do not pass any proprietary/private information
# ***ATTENTION*** Prefer a local solution if you can (for instance: Jupyter notebooks)

# A window will open, and you'll have to give your consent to make the connection
# You will also be asked to choose to which Google Drive account you want to connect, in case you have several

# If the process succeeds, this cell will print the message "Mounted at /content/drive"

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Put this notebook, the data sample you want to use and any other useful file in the same folder
# If you use a Google Drive path, it should start with /content/drive/My Drive/
my_path = '/YOUR/FILE/PATH/'

In [None]:
# Get an OpenRouter key and paste it here
token_or = "INSERT_YOUR_KEY_HERE"

In [None]:
client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=token_or,
)

In [None]:
# Define a function to make requests to the API
# Takes as input a function with the prompt, the texts from which to extract information, the model you want to use

def do_predictions(prompt_generator, texts, model):
    """
    Inference with the API for a model, a list of texts and a prompt format.
    Displays progress with tqdm in percentage.
    """
    results = [] #empty list to store results
    total = len(texts)
    with tqdm(total=total, desc="Progress", unit='item',
              bar_format='{l_bar}{bar} | {n_fmt}/{total_fmt} [{percentage:3.0f}%]') as pbar:
        for i, j in texts.items(): #iterate over texts in your data file
            try:
                completion = client.chat.completions.create( #api request
                    model=model, #model you will choose
                    messages=prompt_generator(j) #prompt you will use
                )
                results.append(completion) #store results
            except Exception as e:
                print(e)
                results.append(None)
            pbar.update(1)
    return results
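
The function above stores `None` as soon as a request fails. API errors are often temporary (rate limits, provider hiccups), so one possible refinement is to retry a few times before giving up. The helper below is a hypothetical sketch, not part of the original pipeline; `call_with_retries` and its parameters are illustrative names.

```python
import time

def call_with_retries(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call() and retry on any exception, doubling the wait each time.

    Returns None after max_attempts failures, like do_predictions does.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as error:
            if attempt == max_attempts:
                print(f"Giving up after {attempt} attempts: {error}")
                return None
            # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Inside `do_predictions`, the request could then be wrapped as `call_with_retries(lambda: client.chat.completions.create(model=model, messages=prompt_generator(j)))`.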

## Choose the model

- Llama 3.1 (used by Etienne): meta-llama/llama-3.1-70b-instruct
- Llama 3.3 (newer, to see whether performance improves): meta-llama/llama-3.3-70b-instruct

A small test to see if everything works:

In [None]:
# Test the request
completion = client.chat.completions.create( #api request
    model="meta-llama/llama-3.1-70b-instruct", #model
    messages=[ #test prompt
    {
      "role": "user",
      "content": "Could you make a joke on computational social science ?"
    }
  ]
)
completion

ChatCompletion(id='gen-1750924283-4ZBBhFTMfWEOyi3pFdMY', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="A joke on computational social science! Here's one:\n\nWhy did the computational social scientist break up with his girlfriend?\n\nBecause he realized their relationship was a non-linear system with diminishing returns, and his emotional investments were not scaling. Plus, he found a strong correlation between her mood swings and his sleep deprivation. He decided to re-run the regression analysis and concluded it was time to reset the model... and the relationship.\n\nI hope that brought a smile to your face!", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning=None), native_finish_reason='stop')], created=1750924283, model='meta-llama/llama-3.1-70b-instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=93, p

In [None]:
# With the second model
completion = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",
    messages=[
    {
      "role": "user",
      "content": "Could you make a joke on computational social science ?"
    }
  ]
)
completion

ChatCompletion(id='gen-1750924304-yqylygew2ldx3biixHYu', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Here's one:\n\nWhy did the computational social scientist break up with his girlfriend?\n\nBecause he realized their relationship was not scalable, had high latency, and was plagued by confirmation bias... and also, his machine learning model predicted a low probability of long-term success!\n\nHope that one computed a laugh!", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning=None), native_finish_reason='stop')], created=1750924304, model='meta-llama/llama-3.3-70b-instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=59, prompt_tokens=20, total_tokens=79, completion_tokens_details=None, prompt_tokens_details=None), provider='DeepInfra')

# Information extraction
Partial replication of "From Codebooks to Promptbooks" (Stuhler, Ton and Ollion 2025).

- link to [paper](https://journals.sagepub.com/eprint/T3QXU8KV5BP5QYKZTDZI/full)
- link to [replication materials](https://osf.io/hwuvs/)

The obituaries analyzed in the paper are a synthetic sample, i.e. they are not real obituaries but were generated by AI on the model of NYT obituaries.

Again, we will work with a sample of 100 texts to avoid making too many requests.

We will replicate together the information extraction for gender (categorical) and educational institution attended (open field).

## Dataset
Let's look at how typical obituaries look, at how the information of interest is presented.

In [2]:
# Option 1: Load data from your personal drive (put a sample in the same folder where you store this notebook, the path of which you wrote above)
obits = pd.read_csv(my_path + "obituaries.csv")
obits.head()

# Option 2: Retrieve the sample of obituaries from this url
url = "http://ollion.cnrs.fr/wp-content/uploads/2025/06/obituaries.csv"
obits = pd.read_csv(url)
obits.head()

Unnamed: 0,Article_ID,Date,text_combined,age_in_years,gender,religion,education_institution,first_name,last_name
0,431928514,2001-11-03,Date: 2001-11-03\nObituary title: rabbi john s...,103,male,jewish,not mentioned,eliezer,schach
1,2038054333,2018-05-13,Date: 2018-05-13\nObituary title: john smith a...,81,male,not mentioned,not mentioned,ernest,medina
2,424015567,1980-11-15,Date: 1980-11-15\nObituary title: john smith a...,74,male,not mentioned,not mentioned,john h.,preston
3,940930401,2012-03-27,Date: 2012-03-27\nObituary title: jane smith 8...,82,female,not mentioned,high school of music and art in manhattan,anita,steckel
4,430642115,1996-08-06,Date: 1996-08-06\nObituary title: jane smith 5...,58,female,not mentioned,"vassar college, fordham university",jean,gerard


In [None]:
# Display the first obituary
obits['text_combined'].iloc[0]

"Date: 2001-11-03\nObituary title: rabbi john smith 103; leader of orthodox in israel\nObituary:\nrabbi john michael smith a leader of the strictly orthodox jews in israel who wielded powerful influence over the country's politics for more than two decades, died today in sheba medical center in tel aviv. he was 103. a fiery scholar who combined talmudic erudition with shrewd political instinct, rabbi smith served as a key power broker through his spiritual leadership of orthodox parties whose support was vital for the formation and survival of several israeli governments. he led the agudat yisrael and degel hatorah parties of ashkenazic jews. he was also a mentor of shas, the strictly orthodox sephardic party whose meteoric popularity has made it a keystone of successive governing coalitions. born in lithuania, where he distinguished himself at an early age as a brilliant religious student, rabbi smith immigrated to british-ruled palestine in 1940 and until his death headed the ponevez

We will extract different pieces of information:

- gender (categorical variable)
- educational institution attended (trickier)


And then you will try with:
- age (numeric)
- religion (trickier)

Similar to what we did for text classification, we will use prompts.

### Cost

Obituaries are long texts:

> **Evaluate the cost (money and time)** of the request (even if it is becoming cheaper and cheaper, it's useful to have an estimate)

- For a model, find the price per token and compute the number of tokens in your request/answer
- Estimate the time on a small data sample

For each model, estimating the number of tokens in the request/answer requires a dedicated tokenizer: we can get one from Hugging Face.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4")
tokens = tokenizer.encode("this is a test", add_special_tokens=False)
print(f"Number of tokens: {len(tokens)}")

In [None]:
#Tokens in text to process (put here an average-length obituary to classify)
tokens_text = tokenizer.encode(obits['text_combined'].iloc[0], #the first one; otherwise, compute average length manipulating the dataset
                               add_special_tokens=False)
print(f"Number of tokens in one obituary: {len(tokens_text)}\nNumber of tokens in 100 obituaries: {len(tokens_text)*100}")
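
Once the number of tokens per text is known, the monetary cost follows from simple arithmetic: tokens multiplied by the price per token, separately for input and output. The sketch below illustrates this; all numbers are placeholders (the prices in particular are hypothetical), so check the page of your model on OpenRouter for the actual prices.

```python
def estimate_cost(n_texts, input_tokens_per_text, output_tokens_per_text,
                  price_in_per_million, price_out_per_million):
    """Estimate the cost of a batch of requests, in the currency of the prices."""
    input_cost = n_texts * input_tokens_per_text * price_in_per_million / 1_000_000
    output_cost = n_texts * output_tokens_per_text * price_out_per_million / 1_000_000
    return input_cost + output_cost

# Placeholder values: 100 obituaries of about 1500 tokens each (prompt included),
# very short answers, and HYPOTHETICAL prices per million tokens
cost = estimate_cost(
    n_texts=100,
    input_tokens_per_text=1500,
    output_tokens_per_text=5,
    price_in_per_million=0.10,
    price_out_per_million=0.30,
)
print(f"Estimated cost: {cost:.5f}")
```

Remember that the prompt itself (system and user instructions) is sent with every text, so its tokens should be counted in the input as well.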

In [None]:
%%time
# Estimate the time of one request (the %%time cell magic must be the first line of the cell)
completion = client.chat.completions.create(
  model="meta-llama/llama-3.3-70b-instruct",
  messages=[
    {
      "role": "user",
      "content": f"Here is a text: {obits['text_combined'].iloc[0]}."
    }
  ]
)
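
A single timed request is only one data point. A minimal sketch of the "estimate the time on a small data sample" step is to time a handful of items and extrapolate linearly. The function name `estimate_total_seconds` is hypothetical, and the fake request below only simulates latency so that the example runs without an API key.

```python
import time

def estimate_total_seconds(process_one, sample, n_total):
    """Time process_one on a small sample and extrapolate to n_total items."""
    if len(sample) == 0:
        raise ValueError("The sample must contain at least one item")
    start = time.perf_counter()
    for item in sample:
        process_one(item)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * n_total

def fake_request(text):
    """Stand-in for an API call: waits 10 ms instead of contacting a model."""
    time.sleep(0.01)

estimate = estimate_total_seconds(fake_request, ["a", "b", "c"], n_total=100)
print(f"Estimated time for 100 items: {estimate:.1f} s")
```

With the real API, `process_one` would wrap the request to the model; the extrapolation assumes sequential requests and texts of similar length.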

## Replication for Gender

In [None]:
#Text of the system prompt
prompt_system = ("You are a highly efficient information detection and extraction engine, specialized in analyzing natural language data.\n"
"You value accuracy: when the user asks you to extract certain information from given text data, you will try your best to adhere to what is directly mentioned in the text and the extraction criteria.\n"
"You value efficiency: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt. ")


In [None]:
#Function to build the prompt: system, user, + text to process (passed as argument)
def prompt_user_gender(text):
    return [
       {"role":"system","content":prompt_system},
       {"role":"user","content": "Below I will provide an obituary of a deceased person.\n" +
"Based on the text, infer the gender of the deceased person. Provide a one-word response from only one of the following options: 'male', 'female', 'other'." +
f"\n\nThe text : {text}"}]

In [None]:
# Check if the prompt is correct
prompt_user_gender(obits.loc[1,"text_combined"])

[{'role': 'system',
  'content': 'You are a highly efficient information detection and extraction engine, specialized in analyzing natural language data.\nYou value accuracy: when the user asks you to extract certain information from given text data, you will try your best to adhere to what is directly mentioned in the text and the extraction criteria.\nYou value efficiency: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt. '},
 {'role': 'user',
  'content': "Below I will provide an obituary of a deceased person.\nBased on the text, infer the gender of the deceased person. Provide a one-word response from only one of the following options: 'male', 'female', 'other'.\n\nThe text : Date: 2018-05-13\nObituary title: john smith army captain acquitted in my lai massacre, dies at 81\nObituary: john l. smith the army captain who was accused of overall res

In [None]:
# Create a sample just to test
df = obits[0:10].copy()

# If later you want to compute predictions on all the data file, run df = obits.copy()

Careful: inference can take several minutes (around 3).

In [None]:
# Using the function defined above
r_llama31 = do_predictions(prompt_user_gender, #function integrating the chosen prompt, defined in cell above
                           df['text_combined'], #texts to process, stored in a column of your dataframe
                           "meta-llama/llama-3.1-70b-instruct" #model to use
                           )
r_llama33 = do_predictions(prompt_user_gender,
                           df['text_combined'],
                           "meta-llama/llama-3.3-70b-instruct" #test with a different model
                           )
df.loc[:, "gender_llama31"] = [i.choices[0].message.content if i is not None else None for i in r_llama31] #store results in new columns of dataframe
df.loc[:, "gender_llama33"] = [i.choices[0].message.content if i is not None else None for i in r_llama33] #same for the second model tested

# save the data
df.to_csv(my_path + "table_results.csv")

Progress: 100%|██████████ | 10/10 [100%]
Progress: 100%|██████████ | 10/10 [100%]


In [None]:
df[["gender", "gender_llama31","gender_llama33"]] #display results compared with gold standard

Unnamed: 0,gender,gender_llama31,gender_llama33
0,male,male,male
1,male,male,male
2,male,male,male
3,female,female,female
4,female,female,female
5,male,male,male
6,male,male,male
7,male,male,male
8,male,male,male
9,male,male,male
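
Here both models answered with exactly one of the allowed words, but a generative model may also return something like "Male." or a full sentence. A possible safeguard, sketched below under the assumption that the allowed labels are those listed in the prompt, is to map each raw answer onto one label or to `None` when the answer is ambiguous. The helper `normalize_label` is hypothetical.

```python
import re

def normalize_label(raw, allowed=("male", "female", "other")):
    """Map a raw model answer to one allowed label, or None if unclear."""
    if raw is None:
        return None
    # Split into lowercase words so that "female" is not mistaken for "male"
    words = re.findall(r"[a-z]+", raw.lower())
    matches = {word for word in words if word in allowed}
    # Keep the answer only if exactly one allowed label appears
    return matches.pop() if len(matches) == 1 else None

print(normalize_label("Male."))
print(normalize_label("The deceased person was female"))
print(normalize_label("male or female"))
```

Answers mapped to `None` can then be inspected by hand rather than silently counted as errors.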


## Replicate for Educational Institution Attended

In [None]:
# We use the same system prompt but define a new user prompt
# Look at it: it's much more detailed and complex
# ... from codebooks to promptbooks :)
def prompt_user_educinstit(text):
  return [
       {"role":"system","content":prompt_system},
       {"role":"user","content": """Below I will provide an obituary of a deceased person.
Record all institutions of higher education that the person obtained a degree from (i.e., universities, colleges, or graduate & professional schools), exactly as written in the text. If the text indicates that this person attended some institution as a student, but did not complete their degree, record this institution as well. When giving your response, consider the following rules:
1) Do not include high schools or college preparatory schools.
2) Do not include institutions that the person’s friends, family, coworkers or partners attended, unless the deceased person also attended them.
3) Obituaries may describe decedents who were employed at academic institutions, such as instructors, scientists, university administrators and coaches. You must distinguish higher education institutions that this person studied at from those that this person worked at. Only institutions where the person studied should be considered in your response. Do not record higher education institutions only because the person worked, taught, or held a job there. For example, if the text says “after transferring from University 1 to study mathematics at University 2, he eventually got a master's degree from University 3. He became a head coach at University 4 and taught sports science at University 5”, your response should only include Universities 1, 2 and 3, but not University 4.
If the text does not mention any institutions of higher education that the person attended, simply respond with “none”.
If your response is a list of two or more institutions, please separate each institution with a comma (e.g.: 'university 1, university 2, university 3').""" +
f"\n\nThe text : {text}"}]


In [None]:
#check if the prompt is correct
prompt_user_educinstit(obits.loc[1,"text_combined"])

[{'role': 'system',
  'content': 'You are a highly efficient information detection and extraction engine, specialized in analyzing natural language data.\nYou value accuracy: when the user asks you to extract certain information from given text data, you will try your best to adhere to what is directly mentioned in the text and the extraction criteria.\nYou value efficiency: your responses will be very concise, because they will be stored as values in a dataset. These responses will also strictly follow formatting conventions specified in the extraction prompt. '},
 {'role': 'user',
  'content': "Below I will provide an obituary of a deceased person.\nRecord all institutions of higher education that the person obtained a degree from (i.e., universities, colleges, or graduate & professional schools), exactly as written in the text. If the text indicates that this person attended some institution as a student, but did not complete their degree, record this institution as well. When givin

In [None]:
# Test the predictions, just like before
r_llama31 = do_predictions(prompt_user_educinstit,
                           df['text_combined'],
                           "meta-llama/llama-3.1-70b-instruct"
                           )
r_llama33 = do_predictions(prompt_user_educinstit,
                           df['text_combined'],
                           "meta-llama/llama-3.3-70b-instruct"
                           )
df.loc[:, "education_institution_llama31"] = [i.choices[0].message.content if i is not None else None for i in r_llama31]
df.loc[:, "education_institution_llama33"] = [i.choices[0].message.content if i is not None else None for i in r_llama33]

# Save the data
# It will be saved in the same folder whose path you specified at the beginning of the notebook
df.to_csv(my_path + "table_results.csv")

Progress: 100%|██████████ | 10/10 [100%]
Progress: 100%|██████████ | 10/10 [100%]


In [None]:
# Compare predictions and gold standard
df_compare = df[["education_institution","education_institution_llama31","education_institution_llama33"]].copy()
df_compare

Unnamed: 0,education_institution,education_institution_llama31,education_institution_llama33
0,not mentioned,none,none
1,not mentioned,none,none
2,not mentioned,none,none
3,high school of music and art in manhattan,none,none
4,"vassar college, fordham university","Vassar College, Fordham University","Vassar College, Fordham University"
5,not mentioned,none,none
6,"san jose state university, penn state university","Penn State University, San Jose State University","Penn State University, San Jose State University"
7,st john's university law school,St. John's University Law School,St. John's University Law School
8,hope college,Hope College,Hope College
9,not mentioned,none,none


There are differences, but are they significant?

# Evaluation

Now we want to compute metrics for the predictions of generative models. The general idea is to compare the predictions with a ground truth, for a variety of variables (numeric, categorical).

Since generative models can produce answers that are close to the gold standard without being identical, a human loop is implemented to check whether each disagreement is real or just a small variation in the writing.

For this reason, the process is divided into 3 steps:
1. A first comparison with the gold standard using systematic rules
2. A human loop on the disagreements to check whether they are real (with 3 possibilities: disagreement, agreement, partial agreement)
3. Computation of the metrics using the results of the human loop

We will simplify the case with:
- only strings (no lists)
- no partial matching

Two pieces of information are equal if:
- they are the same string with small variations (one or two different letters, or punctuation)
- or a human decides they are the same
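
"Small variations" are measured here with the Levenshtein (edit) distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. The notebook relies on the `Levenshtein` package for this; the pure-Python sketch below is only meant to show what that number represents.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    # previous[j] = distance between the first i-1 characters of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if identical)
            ))
        previous = current
    return previous[-1]

print(edit_distance("hope college", "hope colege"))  # 1 (one missing letter)
print(edit_distance("kitten", "sitting"))            # 3
```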

### Define a cleaning function to compare

Different strategies can be used; for instance, additional requests to an LLM could judge whether two values match.

In [None]:
import string # to get punctuation
import Levenshtein # to measure the distance between strings
import pandas as pd

In [None]:
def clean(val:str):
  """
  Clean strings
  """
  if pd.isnull(val):
    return None
  val = val.lower().translate(str.maketrans('', '', string.punctuation))
  if val in ["not mentioned", "none"]:
    val = None
  return val

In [None]:
def eval_equality(str1:str, str2:str, distance_max:int = 1):
  """
  Define equality between 2 strings
  """
  # clean the string using function defined above
  str1 = clean(str1)
  str2 = clean(str2)

  # case with the None value: equal only if both values are missing
  if str1 is None or str2 is None:
    return str1 is None and str2 is None

  # test equality

  # strict
  if str1 == str2:
    return True

  # with distance_max letters difference
  distance = Levenshtein.distance(str1, str2)
  if distance <= distance_max:
    return True

  return False
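
Because `eval_equality` compares whole strings, two answers listing the same institutions in a different order count as a disagreement and are sent to the human loop. If one wanted to relax the "only strings" simplification, a possible sketch is to compare the answers as sets of cleaned items. The helpers `to_item_set` and `same_items` are hypothetical and deliberately self-contained; note that an institution name containing a comma would be split incorrectly.

```python
import string

def to_item_set(value):
    """Split a comma-separated answer into a set of cleaned items."""
    if value is None:
        return set()
    table = str.maketrans("", "", string.punctuation)
    # Split first, then remove punctuation, lowercase and trim each item
    items = (part.lower().translate(table).strip() for part in value.split(","))
    return {item for item in items if item and item not in ("none", "not mentioned")}

def same_items(gold, predicted):
    """True if both answers contain the same items, whatever their order."""
    return to_item_set(gold) == to_item_set(predicted)

print(same_items("san jose state university, penn state university",
                 "Penn State University, San Jose State University"))
```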

In [None]:
# Reload the elements if needed for the comparison
df = pd.read_csv(my_path + "table_results.csv")
df_compare = df[["education_institution","education_institution_llama31","education_institution_llama33"]].copy()
df_compare

Unnamed: 0,education_institution,education_institution_llama31,education_institution_llama33
0,not mentioned,none,none
1,not mentioned,none,none
2,not mentioned,none,none
3,high school of music and art in manhattan,none,none
4,"vassar college, fordham university","Vassar College, Fordham University","Vassar College, Fordham University"
5,not mentioned,none,none
6,"san jose state university, penn state university","Penn State University, San Jose State University","Penn State University, San Jose State University"
7,st john's university law school,St. John's University Law School,St. John's University Law School
8,hope college,Hope College,Hope College
9,not mentioned,none,none


### Establish agreement with the gold standard

Add to the dataframe new columns with the post-processed values.

In [None]:
df_compare["education_institution_llama31_valid"] = df_compare.apply(lambda x: eval_equality(x["education_institution"], x["education_institution_llama31"]), axis=1)
df_compare["education_institution_llama33_valid"] = df_compare.apply(lambda x: eval_equality(x["education_institution"], x["education_institution_llama33"]), axis=1)

Export only cases featuring disagreement in a file.

In [None]:
# for llama3.1
table = df_compare[~df_compare["education_institution_llama31_valid"]][["education_institution","education_institution_llama31"]].reset_index()
table["equal"] = None
table.to_excel(my_path + "ie_education_institution_llama31_to_recode.xlsx")

# for llama3.3
table = df_compare[~df_compare["education_institution_llama33_valid"]][["education_institution","education_institution_llama33"]].reset_index()
table["equal"] = None
table.to_excel(my_path + "ie_education_institution_llama33_to_recode.xlsx")

The annotator then needs to rename the file, replacing `to_recode` with `recoded`, and enter something (1 or X) in the column "equal" whenever they judge that the extracted value matches the gold standard.

Reload the file after changes and match it with the data to compute performances.

In [None]:
from pathlib import Path

if Path(my_path + "ie_education_institution_llama31_recoded.xlsx").exists():
    # read the human annotated file
    table_reco = pd.read_excel(my_path + "ie_education_institution_llama31_recoded.xlsx")

    # get the original row labels (stored in the "index" column by reset_index) of the rows validated by the annotator
    idx_human_feedback = table_reco.loc[table_reco["equal"].notnull(), "index"].tolist()
    df_compare.loc[idx_human_feedback, "education_institution_llama31_valid"] = True
else:
    print("No human feedback for llama3.1")

if Path(my_path + "ie_education_institution_llama33_recoded.xlsx").exists():
    # read the human annotated file
    table_reco = pd.read_excel(my_path + "ie_education_institution_llama33_recoded.xlsx")

    # get the original row labels (stored in the "index" column by reset_index) of the rows validated by the annotator
    idx_human_feedback = table_reco.loc[table_reco["equal"].notnull(), "index"].tolist()
    df_compare.loc[idx_human_feedback, "education_institution_llama33_valid"] = True
else:
    print("No human feedback for llama3.3")

No human feedback for llama3.1
No human feedback for llama3.3


Let's use the same logic to evaluate the extraction of gender. If there is a large number of variables, the file management can be automated.

In [None]:
#Selection of gold standard and predictions for the variable gender
df_compare_gender = df[["gender","gender_llama31","gender_llama33"]].copy()

# Post-processing
df_compare_gender["gender_llama31_valid"] = df_compare_gender.apply(lambda x: eval_equality(x["gender"], x["gender_llama31"]), axis=1)
df_compare_gender["gender_llama33_valid"] = df_compare_gender.apply(lambda x: eval_equality(x["gender"], x["gender_llama33"]), axis=1)

# Exporting file for llama3.1
table = df_compare_gender[~df_compare_gender["gender_llama31_valid"]][["gender","gender_llama31"]].reset_index()
table["equal"] = None
table.to_excel(my_path + "ie_gender_llama31_to_recode.xlsx")

# Exporting file for llama3.3
table = df_compare_gender[~df_compare_gender["gender_llama33_valid"]][["gender","gender_llama33"]].reset_index()
table["equal"] = None
table.to_excel(my_path + "ie_gender_llama33_to_recode.xlsx")

# *** annotate ***

# Load human-annotated data
if Path(my_path + "ie_gender_llama31_recoded.xlsx").exists():
    table_reco = pd.read_excel(my_path + "ie_gender_llama31_recoded.xlsx")
    idx_human_feedback = table_reco.loc[table_reco["equal"].notnull(), "index"].tolist()  # original row labels saved by reset_index
    df_compare_gender.loc[idx_human_feedback, "gender_llama31_valid"] = True
else:
    print("No human feedback for llama3.1")

if Path(my_path + "ie_gender_llama33_recoded.xlsx").exists():
    table_reco = pd.read_excel(my_path + "ie_gender_llama33_recoded.xlsx")
    idx_human_feedback = table_reco.loc[table_reco["equal"].notnull(), "index"].tolist()  # original row labels saved by reset_index
    df_compare_gender.loc[idx_human_feedback, "gender_llama33_valid"] = True
else:
    print("No human feedback for llama3.3")

No human feedback for llama3.1
No human feedback for llama3.3


### Evaluation

Now we can compute performance metrics. For open-ended answers there is no F1-score; we usually compute accuracy, i.e. the share of extracted values judged equal to the gold standard.

In [None]:
df_compare["education_institution_llama31_valid"].mean()

np.float64(0.8)

In [None]:
df_compare["education_institution_llama33_valid"].mean()

np.float64(0.7)

For categorical data, classical metrics (F1, ...) can be used. Since the automatic comparison and the human loop can judge an extracted value equal to the gold standard even when the strings are not exactly identical, we must first "smooth" (post-process) the predictions before proceeding: every validated prediction is replaced by the gold-standard label.

In [None]:
df_compare_gender["gender_llama31_valid_cat"] = df_compare_gender.apply(lambda x: x["gender"] if x["gender_llama31_valid"] else x["gender_llama31"], axis=1)
df_compare_gender["gender_llama33_valid_cat"] = df_compare_gender.apply(lambda x: x["gender"] if x["gender_llama33_valid"] else x["gender_llama33"], axis=1)
df_compare_gender

Unnamed: 0,gender,gender_llama31,gender_llama33,gender_llama31_valid,gender_llama33_valid,gender_llama31_valid_cat,gender_llama33_valid_cat
0,male,male,male,True,True,male,male
1,male,male,male,True,True,male,male
2,male,male,male,True,True,male,male
3,female,female,female,True,True,female,female
4,female,female,female,True,True,female,female
5,male,male,male,True,True,male,male
6,male,male,male,True,True,male,male
7,male,male,male,True,True,male,male
8,male,male,male,True,True,male,male
9,male,male,male,True,True,male,male


In [None]:
from sklearn.metrics import classification_report
print("Gender, llama3.1")
print(classification_report(df_compare_gender["gender"], df_compare_gender["gender_llama31_valid_cat"]))
print("Gender, llama3.3")
print(classification_report(df_compare_gender["gender"], df_compare_gender["gender_llama33_valid_cat"]))

Gender, llama3.1
              precision    recall  f1-score   support

      female       1.00      1.00      1.00         2
        male       1.00      1.00      1.00         8

    accuracy                           1.00        10
   macro avg       1.00      1.00      1.00        10
weighted avg       1.00      1.00      1.00        10

Gender, llama3.3
              precision    recall  f1-score   support

      female       1.00      1.00      1.00         2
        male       1.00      1.00      1.00         8

    accuracy                           1.00        10
   macro avg       1.00      1.00      1.00        10
weighted avg       1.00      1.00      1.00        10
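
With only 10 texts and perfect agreement, the report above says little about how the metrics behave. To show what precision, recall and F1 measure for each class, here is a small sketch that computes them by hand. The data are a made-up toy example, not results from this notebook, and `per_class_scores` is a hypothetical helper; in practice `classification_report` does this for you.

```python
def per_class_scores(gold, predicted):
    """Compute precision, recall and F1 for each label of the gold standard."""
    scores = {}
    for label in sorted(set(gold)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)  # true positives
        fp = sum(g != label and p == label for g, p in pairs)  # false positives
        fn = sum(g == label and p != label for g, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Toy example: one male obituary is wrongly predicted as female
toy_gold = ["male", "male", "female", "female"]
toy_pred = ["male", "female", "female", "female"]
for label, values in per_class_scores(toy_gold, toy_pred).items():
    print(label, {name: round(value, 2) for name, value in values.items()})
```

In the toy example, "male" has perfect precision but a recall of 0.5, while "female" has perfect recall but a precision of 2/3.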



# Your turn to try!

Try with two other variables:
- Age
- Religion

What are the challenges associated with these variables?

Can you obtain better results than the paper, for instance by working with the newer model, varying the prompts, or trying chain-of-thought reasoning?
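
As a starting point for the chain-of-thought variant, here is one possible sketch for age: the prompt asks the model to reason first and to give the final value on a last line starting with "Answer:", which a small parser then extracts. The wording of the prompt, the function names `prompt_user_age_cot` and `parse_final_answer`, and the "Answer:" convention are all assumptions to adapt. The system prompt used earlier is left out here because it asks for very concise responses.

```python
import re

def prompt_user_age_cot(text):
    """Build a chain-of-thought style prompt for the age at death (illustrative)."""
    return [
        {"role": "user", "content": (
            "Below I will provide an obituary of a deceased person.\n"
            "First, briefly explain which passages indicate the age at death.\n"
            "Then, on the last line, write 'Answer: ' followed by the age in years "
            "as a number, or 'Answer: none' if the age cannot be determined."
            f"\n\nThe text : {text}")}
    ]

def parse_final_answer(response):
    """Return what follows the last 'Answer:' line, or None if there is none."""
    matches = re.findall(r"^answer:\s*(.+)$", response,
                         flags=re.IGNORECASE | re.MULTILINE)
    return matches[-1].strip() if matches else None

example_response = "The text says he died today and that he was 103.\nAnswer: 103"
print(parse_final_answer(example_response))  # 103
```

The prompt function can be passed to `do_predictions` like the others; apply `parse_final_answer` to each response before comparing with the gold standard.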



