In [2]:
import json
from tqdm import tqdm


import tools_utils, eval_utils, prompt_utils
import pandas as pd
import main_functions

from openai import OpenAI


ModuleNotFoundError: No module named 'tools_utils'

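The cell above imports the modules needed for the evaluation. Some of them (tools_utils, eval_utils, prompt_utils, main_functions) are internal modules already used by the extraction tool, so they must be importable from the working directory. The ModuleNotFoundError shown above is raised when they are not.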

In [18]:
import importlib
importlib.reload(main_functions)
importlib.reload(tools_utils)
importlib.reload(prompt_utils)
importlib.reload(eval_utils)


<module 'eval_utils' from 'c:\\Users\\paolo\\Desktop\\T3.4.1_KeywordsTranslation\\eval_utils.py'>

The following cell reads the Excel file containing the dataset.

In [19]:
records = eval_utils.parse_excel_file("Dset_Eval_KW_Alignment_Eval_24_03_2025.xlsx")

The variable records is a list of dictionaries. Each dictionary gives the original language of the article, metadata such as the title and abstract, and a list of keywords. Each keyword is itself a dictionary with three fields: the label, the wikidata_url, and the match. These fields are present even if the keyword is not in the original language of the article.
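
To make this structure concrete, the sketch below builds one record by hand. The field names are the ones used elsewhere in this notebook. All values are placeholders rather than dataset entries, and treating wikidata_url as a list is an assumption based on how the scoring code iterates over it.

```python
# Hand-written illustration of one record; every value is a placeholder.
example_record = {
    "language": "pl",
    "id": "example-id",
    "title_or": "Original title",
    "title_eng": "English title",
    "abstract_or": "Original abstract",
    "kws": [
        {
            "label": "example keyword",
            # Assumed to be a list, since the scoring code iterates over it.
            "wikidata_url": ["https://www.wikidata.org/wiki/Q42"],
            "match": "e",  # 'e' = exact match, 'r' = related match
        }
    ],
}

# Each keyword carries its own label, match type, and reference URLs.
for kw in example_record["kws"]:
    print(kw["label"], kw["match"], kw["wikidata_url"])
```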

In [6]:
records[5]

{'language': 'pl',
 'id': 'oai:revues.org:td/3478',
 'title_or': 'Ojczyzna i obczyzna: Irlandia, Polska i Inni',
 'title_eng': 'Homeland and garbage: Ireland, Poland and other',
 'abstract_or': 'Mówiąc o migracjach do Irlandii z czasów “celtyckiego tygrysa”, można rozróżnić dwa etapy. Na początku w latach 1996–2002 migranci odczuwali prawdziwą dumę z\xa0rodzimej kultury, w której się wychowali, a jednocześnie wykazywali żywe zainteresowanie irlandzką tradycją. Później jednak, od roku 2002 aż do kryzysu w 2008, nastąpił okres często opisywany jako „bling”, który charakteryzował przepych: nastawienie na materializm oraz\xa0brak zainteresowania kulturą i kulturową wymianą. Co za tym idzie, wbrew oczekiwaniom zapoczątkowanym przez nowe fuzje, nie rozwinęła się na szerszą skalę irlandzka literatura inspirowana spotkaniami z „innym”. Niewątpliwie jest to zadziwiające, biorąc pod\xa0uwagę, że tak wiele pozycji z kanonu literatury irlandzkiej zrodziło się z otwartości na „obcego/innego” – pocz

In what follows, we initialize the LLM client. The evaluation was run with an OpenAI model, so the code below creates the OpenAI client. However, the evaluation function can be used with any OpenAI-compatible LLM API.

The API key must be specified in order to run the evaluation. 

In [7]:
MY_OPENAI_API_KEY = "YOUR_API_KEY"

	

client = OpenAI(api_key=MY_OPENAI_API_KEY)
model_name = "gpt-4o-mini"
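
As an alternative to hard-coding the key, the settings can be read from the environment. The sketch below is only one possible arrangement: OPENAI_API_KEY is the variable the openai library reads by default, while LLM_BASE_URL and both helper names are hypothetical and chosen for this example. Passing base_url is how the client is usually pointed at another OpenAI-compatible server.

```python
import os


def build_client_kwargs(env=os.environ):
    """Collect client settings from environment variables (hypothetical helper)."""
    kwargs = {"api_key": env.get("OPENAI_API_KEY", "YOUR_API_KEY")}
    base_url = env.get("LLM_BASE_URL")  # hypothetical variable name
    if base_url:
        # Only set when targeting a non-default, OpenAI-compatible endpoint.
        kwargs["base_url"] = base_url
    return kwargs


def make_client(env=os.environ):
    # Imported lazily so build_client_kwargs works without the package installed.
    from openai import OpenAI
    return OpenAI(**build_client_kwargs(env))
```

With these helpers, client = make_client() would replace the explicit OpenAI(api_key=...) call above.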


The following cell runs the evaluation. It iterates over the articles parsed from the xlsx file and, for each keyword of each article, calls the function useLLM_back_and_forth (defined in main_functions.py).


The function useLLM_back_and_forth first builds a prompt for candidate entity generation (given a keyword, find the corresponding entities in Wikidata). NUM_NAMES (set by default to 10 in the function, in line with the paper) is the maximum number of candidates this step can generate. The generated candidates are then added to the EntitySelectionPrompt, which asks a model to select the most relevant ones: the number_of_entities best-matching entities are kept. In line with the evaluation shown in the paper, number_of_entities defaults to 1.
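
The two-stage flow can be sketched as follows. This is not the code of useLLM_back_and_forth, whose internals are not shown in this notebook. The function two_stage_linking and the two stand-in callables are hypothetical, and the LLM calls are replaced by fixed, made-up outputs so that the control flow can be run on its own.

```python
from typing import Callable, List

NUM_NAMES = 10  # maximum number of candidate entities, as described above


def two_stage_linking(
    keyword: str,
    generate_candidates: Callable[[str], List[str]],
    select_entities: Callable[[str, List[str]], List[str]],
    number_of_entities: int = 1,
) -> List[str]:
    """Generate candidates for a keyword, then keep the best-ranked ones."""
    # Stage 1: candidate generation, capped at NUM_NAMES.
    candidates = generate_candidates(keyword)[:NUM_NAMES]
    if not candidates:
        return []
    # Stage 2: the selector returns the candidates ordered by relevance.
    ranked = select_entities(keyword, candidates)
    return ranked[:number_of_entities]


# Stand-ins for the two prompts; they return fixed, made-up values.
def fake_generator(keyword):
    return [f"{keyword} (candidate {i})" for i in range(15)]


def fake_selector(keyword, candidates):
    return list(reversed(candidates))


result = two_stage_linking("migration", fake_generator, fake_selector)
print(result)
```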

The code includes progress tracking and a mechanism for backup of results. It generates a JSON file where the information for each article is enriched with the LLM annotation for each keyword. 

In [25]:
import json
from tqdm import tqdm

SAVE_EVERY = 3  # Number of records after which to save
results_file = "C:/Users/paolo/Desktop/T3.4.1_KeywordsTranslation/data_eval_03242025.json"

for record_idx, record in enumerate(tqdm(records, desc="Processing records")):
    for i, kw in enumerate(
        tqdm(record['kws'], 
             total=len(record['kws']), 
             desc=f"Processing keywords for record {record_idx + 1}", 
             leave=False
        )
    ):
        
        try:
            llm_uris = main_functions.useLLM_back_and_forth(
                record['language'], 
                record['title_or'], 
                record['abstract_or'], 
                kw['label'], 
                client, 
                model_name
            )
            record['kws'][i]['llm_uris'] = llm_uris
        except Exception as e:
            print("LLM URIs cannot be computed:", e)
            record['kws'][i]['llm_uris'] = []


    # Periodically save partial results (every SAVE_EVERY records)
    if (record_idx + 1) % SAVE_EVERY == 0:
        print(f"Saving partial results at record {record_idx + 1}")
        with open(results_file, 'w', encoding='utf-8') as f:
            json.dump(records, f, ensure_ascii=False, indent=2)

# Final save after all records are processed
print("All records processed, saving final results.")
with open(results_file, 'w', encoding='utf-8') as f:
    json.dump(records, f, ensure_ascii=False, indent=2)


Processing records:   1%|▏         | 3/202 [01:03<1:19:55, 24.10s/it]

Saving partial results at record 3


Processing records:   3%|▎         | 6/202 [01:53<1:02:35, 19.16s/it]

Saving partial results at record 6


Processing records:   4%|▍         | 9/202 [02:38<49:32, 15.40s/it]  

Saving partial results at record 9


Processing records:   6%|▌         | 12/202 [03:12<34:15, 10.82s/it]  

Saving partial results at record 12


Processing records:   7%|▋         | 15/202 [04:24<53:54, 17.30s/it]  

Saving partial results at record 15


Processing records:   9%|▉         | 18/202 [06:14<1:26:27, 28.19s/it]

Saving partial results at record 18


Processing records:  10%|█         | 21/202 [07:15<1:10:32, 23.38s/it]

Saving partial results at record 21


Processing records:  12%|█▏        | 24/202 [09:29<2:01:09, 40.84s/it]

Saving partial results at record 24


Processing records:  13%|█▎        | 27/202 [10:52<1:36:19, 33.02s/it]

Saving partial results at record 27


Processing records:  15%|█▍        | 30/202 [12:01<1:14:54, 26.13s/it]

Saving partial results at record 30


Processing records:  16%|█▋        | 33/202 [14:18<1:48:22, 38.47s/it]

Saving partial results at record 33


Processing records:  18%|█▊        | 36/202 [15:50<1:34:05, 34.01s/it]

Saving partial results at record 36


Processing records:  19%|█▉        | 39/202 [17:06<1:20:14, 29.54s/it]

Saving partial results at record 39


Processing records:  21%|██        | 42/202 [18:32<1:16:10, 28.57s/it]

Saving partial results at record 42


Processing records:  22%|██▏       | 45/202 [19:57<1:20:07, 30.62s/it]

Saving partial results at record 45


Processing records:  24%|██▍       | 48/202 [21:41<1:36:55, 37.76s/it]

Saving partial results at record 48


Processing records:  25%|██▌       | 51/202 [22:39<1:02:49, 24.96s/it]

Saving partial results at record 51


Processing records:  27%|██▋       | 54/202 [23:51<1:01:03, 24.75s/it]

Saving partial results at record 54


Processing records:  28%|██▊       | 57/202 [25:20<1:07:58, 28.13s/it]

Saving partial results at record 57


Processing records:  30%|██▉       | 60/202 [27:31<1:18:00, 32.96s/it]

Saving partial results at record 60


Processing records:  31%|███       | 63/202 [28:27<54:45, 23.63s/it]  

Saving partial results at record 63


Processing records:  33%|███▎      | 66/202 [29:47<56:56, 25.12s/it]  

Saving partial results at record 66


Processing records:  34%|███▍      | 69/202 [31:32<1:11:19, 32.17s/it]

Saving partial results at record 69


Processing records:  36%|███▌      | 72/202 [32:56<1:04:58, 29.99s/it]

Saving partial results at record 72


Processing records:  37%|███▋      | 75/202 [34:01<49:46, 23.52s/it]  

Saving partial results at record 75


Processing records:  39%|███▊      | 78/202 [35:53<1:12:56, 35.30s/it]

Saving partial results at record 78


Processing records:  40%|████      | 81/202 [37:48<1:12:46, 36.09s/it]

Saving partial results at record 81


Processing records:  42%|████▏     | 84/202 [39:29<1:05:44, 33.43s/it]

Saving partial results at record 84


Processing records:  43%|████▎     | 87/202 [41:04<1:07:51, 35.41s/it]

Saving partial results at record 87


Processing records:  45%|████▍     | 90/202 [42:31<1:03:23, 33.96s/it]

Saving partial results at record 90


Processing records:  46%|████▌     | 93/202 [44:31<1:04:08, 35.31s/it]

Saving partial results at record 93


Processing records:  48%|████▊     | 96/202 [48:18<1:44:48, 59.33s/it]

Saving partial results at record 96


Processing records:  49%|████▉     | 99/202 [50:01<1:10:38, 41.15s/it]

Saving partial results at record 99


Processing records:  50%|█████     | 102/202 [51:48<1:03:13, 37.94s/it]

Saving partial results at record 102


Processing records:  52%|█████▏    | 105/202 [53:23<1:00:20, 37.33s/it]

Saving partial results at record 105


Processing records:  53%|█████▎    | 108/202 [54:39<47:09, 30.11s/it]  

Saving partial results at record 108


Processing records:  55%|█████▍    | 111/202 [55:33<33:09, 21.86s/it]

Saving partial results at record 111


Processing records:  56%|█████▋    | 114/202 [56:34<29:34, 20.17s/it]

Saving partial results at record 114


Processing records:  58%|█████▊    | 117/202 [58:02<32:49, 23.17s/it]

Saving partial results at record 117


Processing records:  59%|█████▉    | 120/202 [1:00:09<45:37, 33.39s/it]

Saving partial results at record 120


Processing records:  61%|██████    | 123/202 [1:01:46<44:38, 33.90s/it]

Saving partial results at record 123


Processing records:  62%|██████▏   | 126/202 [1:03:14<37:22, 29.51s/it]

Saving partial results at record 126


Processing records:  64%|██████▍   | 129/202 [1:05:07<37:38, 30.93s/it]

Saving partial results at record 129


Processing records:  65%|██████▌   | 132/202 [1:06:31<33:21, 28.59s/it]

Saving partial results at record 132


Processing records:  67%|██████▋   | 135/202 [1:07:57<32:51, 29.42s/it]

Saving partial results at record 135


Processing records:  68%|██████▊   | 138/202 [1:09:16<28:27, 26.69s/it]

Saving partial results at record 138


Processing records:  70%|██████▉   | 141/202 [1:10:17<21:11, 20.85s/it]

Saving partial results at record 141


Processing records:  71%|███████▏  | 144/202 [1:12:11<30:53, 31.95s/it]

Saving partial results at record 144


Processing records:  73%|███████▎  | 147/202 [1:13:38<28:17, 30.87s/it]

Saving partial results at record 147


Processing records:  74%|███████▍  | 150/202 [1:15:55<40:23, 46.61s/it]

Saving partial results at record 150


Processing records:  76%|███████▌  | 153/202 [1:17:41<33:44, 41.31s/it]

Saving partial results at record 153


Processing records:  77%|███████▋  | 156/202 [1:19:47<30:33, 39.86s/it]

Saving partial results at record 156


Processing records:  79%|███████▊  | 159/202 [1:21:28<24:54, 34.75s/it]

Saving partial results at record 159


Processing records:  80%|████████  | 162/202 [1:23:57<29:05, 43.63s/it]

Saving partial results at record 162


Processing records:  82%|████████▏ | 165/202 [1:26:13<30:15, 49.06s/it]

Saving partial results at record 165


Processing records:  83%|████████▎ | 168/202 [1:28:16<24:57, 44.04s/it]

Saving partial results at record 168


Processing records:  85%|████████▍ | 171/202 [1:30:40<25:59, 50.31s/it]

Saving partial results at record 171


Processing records:  86%|████████▌ | 174/202 [1:32:02<15:51, 33.98s/it]

Saving partial results at record 174


Processing records:  88%|████████▊ | 177/202 [1:33:03<10:36, 25.45s/it]

Saving partial results at record 177


Processing records:  89%|████████▉ | 180/202 [1:36:15<19:01, 51.87s/it]

Saving partial results at record 180


Processing records:  91%|█████████ | 183/202 [1:38:32<14:52, 46.95s/it]

Saving partial results at record 183


Processing records:  92%|█████████▏| 186/202 [1:39:48<08:25, 31.61s/it]

Saving partial results at record 186


Processing records:  94%|█████████▎| 189/202 [1:40:33<04:31, 20.86s/it]

Saving partial results at record 189


Processing records:  95%|█████████▌| 192/202 [1:42:14<04:40, 28.01s/it]

Saving partial results at record 192


Processing records:  97%|█████████▋| 195/202 [1:44:12<04:23, 37.71s/it]

Saving partial results at record 195


Processing records:  98%|█████████▊| 198/202 [1:46:48<03:27, 51.86s/it]

Saving partial results at record 198


Processing records:  99%|█████████▉| 200/202 [1:47:46<01:19, 39.68s/it]

Saving partial results at record 201


Processing records: 100%|██████████| 202/202 [1:47:46<00:00, 32.01s/it]

All records processed, saving final results.
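
Since a full run takes well over an hour, it can be useful to resume from the backup file after an interruption. The sketch below is a possible extension, not part of the evaluation code: save_atomically and pending_keywords are hypothetical helpers. It relies on the fact that the loop above stores an 'llm_uris' key for every keyword it has processed, including an empty list on failure, so failed keywords would not be retried.

```python
import json
import os
import tempfile


def save_atomically(records, path):
    """Write JSON to a temporary file, then rename it over the target.

    An interruption during the write then leaves the previous backup intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    os.replace(tmp_path, path)


def pending_keywords(records):
    """Yield (record_idx, kw_idx) for keywords that have no 'llm_uris' yet."""
    for record_idx, record in enumerate(records):
        for kw_idx, kw in enumerate(record["kws"]):
            if "llm_uris" not in kw:
                yield record_idx, kw_idx
```

To resume, the backup would be loaded with json.load and only the pairs returned by pending_keywords would be passed to the LLM again.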





The following cell applies some adjustments to the evaluation output that the results analysis requires. These adjustments mainly involve string parsing. This cell must be run before computing the statistics.

In [26]:
import json

results_file = "C:/Users/paolo/Desktop/T3.4.1_KeywordsTranslation/data_eval_03242025.json"


new_records = []

with open(results_file, 'r', encoding='utf-8') as f:
    records = json.load(f)


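for record in records:
    # Note: this is an alias, not a copy, so the record is updated in place.
    new_record = record
    for i, kw in enumerate(record['kws']):
        # If llm_uris holds three pieces with "wikidata" in the middle,
        # join them with dots into a single string.
        if len(kw['llm_uris']) == 3 and kw['llm_uris'][1] == "wikidata":
            new_record['kws'][i]['llm_uris'] = ['.'.join(kw['llm_uris'])]
    new_records.append(new_record)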


new_results_file = "C:/Users/paolo/Desktop/T3.4.1_KeywordsTranslation/data_eval_03242025_nuovo.json"
with open(new_results_file, 'w', encoding='utf-8') as f:
    json.dump(new_records, f, ensure_ascii=False, indent=2)
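
The effect of this adjustment can be illustrated on a single value. The sketch assumes that the three-element lists arise when a Wikidata URI has been split on its dots, which is an inference from the condition in the cell above rather than something the notebook states. The helper repair_split_uri is hypothetical and mirrors that condition.

```python
def repair_split_uri(llm_uris):
    """Rejoin a URI that was split into three pieces around 'wikidata'."""
    if len(llm_uris) == 3 and llm_uris[1] == "wikidata":
        return [".".join(llm_uris)]
    return llm_uris


# Splitting on "." yields exactly three pieces with "wikidata" in the middle.
broken = "http://www.wikidata.org/entity/Q42".split(".")
print(broken)
print(repair_split_uri(broken))
```

Lists that do not match the pattern, including the empty list stored after a failed call, are returned unchanged.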
    

The following two cells compute the statistics. Recall, precision, and F1 scores are reported for all the examples, per language (that is, results are grouped by the original language of the article the keyword belongs to), and per match type (the match type indicates how closely the Wikidata URL matches the actual keyword: 'e' for an exact match, 'r' for a related match).

The second cell prints a report with the results, as in the example output below.

In [27]:
import json

# eval_results generation

new_results_file = "C:/Users/paolo/Desktop/T3.4.1_KeywordsTranslation/data_eval_03242025_nuovo.json"

with open(new_results_file, 'r', encoding='utf-8') as f:
    records = json.load(f)

languages = set(record['language'] for record in records)

scores_llm = {
    "Total": {
        "recall": {"Sum": 0, "Size": 0},
        "precision": {"Sum": 0, "Size": 0},
        "f1": {"Sum": 0, "Size": 0}
    },
    "Per_match_type": {
        "e": {
            "recall": {"Sum": 0, "Size": 0},
            "precision": {"Sum": 0, "Size": 0},
            "f1": {"Sum": 0, "Size": 0}
        },
        "r": {
            "recall": {"Sum": 0, "Size": 0},
            "precision": {"Sum": 0, "Size": 0},
            "f1": {"Sum": 0, "Size": 0}
        }
    },
    "Per_language": {
        language: {
            "recall": {"Sum": 0, "Size": 0},
            "precision": {"Sum": 0, "Size": 0},
            "f1": {"Sum": 0, "Size": 0}
        } for language in languages
    }
}

for record in records:
    for i, kw in enumerate(record['kws']):
        if kw['match'] in ("e", "r"):
            correct_uris = [
                url.replace("https", "http").replace("/wiki/", "/entity/")
                for url in kw['wikidata_url']
            ]
            retrieved_uris_llm = kw['llm_uris']
            recall_llm = eval_utils.compute_recall(correct_uris, retrieved_uris_llm)
            precision_llm = eval_utils.compute_precision(correct_uris, retrieved_uris_llm)
            
            # Compute the F1 score, guarding against division by zero
            if (precision_llm + recall_llm) > 0:
                f1_llm = 2 * precision_llm * recall_llm / (precision_llm + recall_llm)
            else:
                f1_llm = 0

            # Update the total scores
            scores_llm["Total"]["recall"]["Sum"] += recall_llm
            scores_llm["Total"]["recall"]["Size"] += 1
            scores_llm["Total"]["precision"]["Sum"] += precision_llm
            scores_llm["Total"]["precision"]["Size"] += 1
            scores_llm["Total"]["f1"]["Sum"] += f1_llm
            scores_llm["Total"]["f1"]["Size"] += 1

            # Update the scores per match type
            scores_llm["Per_match_type"][kw['match']]["recall"]["Sum"] += recall_llm
            scores_llm["Per_match_type"][kw['match']]["recall"]["Size"] += 1
            scores_llm["Per_match_type"][kw['match']]["precision"]["Sum"] += precision_llm
            scores_llm["Per_match_type"][kw['match']]["precision"]["Size"] += 1
            scores_llm["Per_match_type"][kw['match']]["f1"]["Sum"] += f1_llm
            scores_llm["Per_match_type"][kw['match']]["f1"]["Size"] += 1

            # Update the scores per language
            scores_llm["Per_language"][record['language']]["recall"]["Sum"] += recall_llm
            scores_llm["Per_language"][record['language']]["recall"]["Size"] += 1
            scores_llm["Per_language"][record['language']]["precision"]["Sum"] += precision_llm
            scores_llm["Per_language"][record['language']]["precision"]["Size"] += 1
            scores_llm["Per_language"][record['language']]["f1"]["Sum"] += f1_llm
            scores_llm["Per_language"][record['language']]["f1"]["Size"] += 1
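
The functions eval_utils.compute_recall and eval_utils.compute_precision are not shown in this notebook. The sketch below assumes the conventional set-based definitions, which may differ from the actual implementation, and uses made-up URIs. It also shows that averaging per-keyword F1 scores, as the cell above does, is not the same as taking the harmonic mean of the averaged precision and recall.

```python
def normalize(url):
    # Same normalization as in the cell above.
    return url.replace("https", "http").replace("/wiki/", "/entity/")


def precision_recall_f1(correct, retrieved):
    """Set-based precision, recall, and F1 for one keyword (assumed definitions)."""
    correct_set, retrieved_set = set(correct), set(retrieved)
    hits = len(correct_set & retrieved_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(correct_set) if correct_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1


base = "https://www.wikidata.org/wiki/"
# Two toy keywords as (reference URLs, retrieved URIs); the Q-ids are placeholders.
toy = [
    ([base + "Q1", base + "Q2"], ["http://www.wikidata.org/entity/Q1"]),
    ([base + "Q3"], ["http://www.wikidata.org/entity/Q3", "http://www.wikidata.org/entity/Q4"]),
]
scores = [precision_recall_f1([normalize(u) for u in ref], got) for ref, got in toy]

mean_p = sum(s[0] for s in scores) / len(scores)
mean_r = sum(s[1] for s in scores) / len(scores)
mean_f1 = sum(s[2] for s in scores) / len(scores)   # what the notebook reports
f1_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)
print(mean_p, mean_r, mean_f1, f1_of_means)
```

The same effect is consistent with the report at the end of the notebook, where the Total F1 (0.5634) is slightly lower than the harmonic mean of the Total precision and recall (about 0.567).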



In [29]:
import json

def compute_mean_metrics(stats_dict):
    """
    Given a dictionary with the structure:
    {
      'Total': {
        'recall': {'Sum': x, 'Size': y},
        'precision': {'Sum': x2, 'Size': y2},
        'f1': {'Sum': x3, 'Size': y3}
      },
      'Per_match_type': {
        'e': {
          'recall': {'Sum': x, 'Size': y},
          'precision': {'Sum': x2, 'Size': y2},
          'f1': {'Sum': x3, 'Size': y3}
        },
        ...
      },
      'Per_language': {
        'fr': {
          'recall': {'Sum': x, 'Size': y},
          'precision': {'Sum': x2, 'Size': y2},
          'f1': {'Sum': x3, 'Size': y3}
        },
        ...
      }
    }

    Returns a dictionary with the mean values of recall, precision, and f1.
    """
    mean_dict = {
        'Total': {},
        'Per_match_type': {},
        'Per_language': {}
    }
    
    # --- 1) TOTAL ---
    if 'Total' in stats_dict:
        total_recall_sum = stats_dict['Total']['recall']['Sum']
        total_recall_size = stats_dict['Total']['recall']['Size']
        total_precision_sum = stats_dict['Total']['precision']['Sum']
        total_precision_size = stats_dict['Total']['precision']['Size']
        
        mean_dict['Total']['recall'] = total_recall_sum / total_recall_size if total_recall_size != 0 else 0
        mean_dict['Total']['precision'] = total_precision_sum / total_precision_size if total_precision_size != 0 else 0
        
        if 'f1' in stats_dict['Total']:
            total_f1_sum = stats_dict['Total']['f1']['Sum']
            total_f1_size = stats_dict['Total']['f1']['Size']
            mean_dict['Total']['f1'] = total_f1_sum / total_f1_size if total_f1_size != 0 else 0

    # --- 2) PER MATCH TYPE ---
    if 'Per_match_type' in stats_dict:
        for match_type, metrics in stats_dict['Per_match_type'].items():
            recall_sum = metrics['recall']['Sum']
            recall_size = metrics['recall']['Size']
            precision_sum = metrics['precision']['Sum']
            precision_size = metrics['precision']['Size']
            
            mean_rec = recall_sum / recall_size if recall_size != 0 else 0
            mean_prec = precision_sum / precision_size if precision_size != 0 else 0
            
            mean_dict['Per_match_type'][match_type] = {
                'recall': mean_rec,
                'precision': mean_prec
            }
            if 'f1' in metrics:
                f1_sum = metrics['f1']['Sum']
                f1_size = metrics['f1']['Size']
                mean_dict['Per_match_type'][match_type]['f1'] = f1_sum / f1_size if f1_size != 0 else 0

    # --- 3) PER LANGUAGE ---
    if 'Per_language' in stats_dict:
        for lang, metrics in stats_dict['Per_language'].items():
            recall_sum = metrics['recall']['Sum']
            recall_size = metrics['recall']['Size']
            precision_sum = metrics['precision']['Sum']
            precision_size = metrics['precision']['Size']
            
            mean_rec = recall_sum / recall_size if recall_size != 0 else 0
            mean_prec = precision_sum / precision_size if precision_size != 0 else 0
            
            mean_dict['Per_language'][lang] = {
                'recall': mean_rec,
                'precision': mean_prec
            }
            if 'f1' in metrics:
                f1_sum = metrics['f1']['Sum']
                f1_size = metrics['f1']['Size']
                mean_dict['Per_language'][lang]['f1'] = f1_sum / f1_size if f1_size != 0 else 0
                
    return mean_dict


def print_system_report(system_dict):
    """
    Prints a readable report of the system (LLM) results for Total,
    Per_match_type, and Per_language, including F1.
    """
    system_means = compute_mean_metrics(system_dict)
    
    print("======== SYSTEM METRICS REPORT ========")
    
    # --- 1) TOTAL ---
    print("\n--- TOTAL ---")
    total = system_means.get('Total', {})
    if total:
        print(f"LLM Recall:    {total.get('recall', 0):.4f}")
        print(f"LLM Precision: {total.get('precision', 0):.4f}")
        print(f"LLM F1:        {total.get('f1', 0):.4f}")
    else:
        print("No total data found.")
    
    # --- 2) PER MATCH TYPE ---
    print("\n--- PER MATCH TYPE ---")
    per_match_data = system_means.get('Per_match_type', {})
    for mtype, vals in per_match_data.items():
        print(f"\nMatch Type: {mtype}")
        print(f"  LLM Recall:    {vals.get('recall', 0):.4f}")
        print(f"  LLM Precision: {vals.get('precision', 0):.4f}")
        print(f"  LLM F1:        {vals.get('f1', 0):.4f}")
    
    # --- 3) PER LANGUAGE ---
    print("\n--- PER LANGUAGE ---")
    per_lang_data = system_means.get('Per_language', {})
    # Sort by the keys converted to strings, to avoid errors if they are of mixed types
    for lang, vals in sorted(per_lang_data.items(), key=lambda x: str(x[0])):
        print(f"\nLanguage: {lang}")
        print(f"  LLM Recall:    {vals.get('recall', 0):.4f}")
        print(f"  LLM Precision: {vals.get('precision', 0):.4f}")
        print(f"  LLM F1:        {vals.get('f1', 0):.4f}")


# Example usage:
# Assume 'scores_llm' holds the results of the system (LLM),
# i.e. it was populated earlier in the code.
print_system_report(scores_llm)


--- TOTAL ---
LLM Recall:    0.5576
LLM Precision: 0.5758
LLM F1:        0.5634

--- PER MATCH TYPE ---

Match Type: e
  LLM Recall:    0.6428
  LLM Precision: 0.6577
  LLM F1:        0.6477

Match Type: r
  LLM Recall:    0.1999
  LLM Precision: 0.2321
  LLM F1:        0.2096

--- PER LANGUAGE ---

Language: ar
  LLM Recall:    0.6351
  LLM Precision: 0.6486
  LLM F1:        0.6396

Language: ca
  LLM Recall:    0.5714
  LLM Precision: 0.5714
  LLM F1:        0.5714

Language: da
  LLM Recall:    0.3750
  LLM Precision: 0.3750
  LLM F1:        0.3750

Language: de
  LLM Recall:    0.4886
  LLM Precision: 0.5000
  LLM F1:        0.4924

Language: el
  LLM Recall:    0.6410
  LLM Precision: 0.6496
  LLM F1:        0.6439

Language: en
  LLM Recall:    0.5969
  LLM Precision: 0.6259
  LLM F1:        0.6059

Language: es
  LLM Recall:    0.5068
  LLM Precision: 0.5479
  LLM F1:        0.5205

Language: fi
  LLM Recall:    0.4508
  LLM Precision: 0.4571
  LLM F1:        0.4524

Language: 