# Named Entity Recognition for the decicontas.br dataset

This script applies Named Entity Recognition (NER) to the **decicontas.br** dataset, which consists of decisions from the Rio Grande do Norte State Court of Accounts (TCE/RN) involving fines, obligations, reimbursements, and recommendations. The solution uses large language models (LLMs) deployed through Azure OpenAI, integrated via `langchain` and `pydantic` to generate structured outputs.

The goal is to evaluate the ability of LLMs to transform unstructured legal text into standardized data with consistent labels (e.g., MULTA, OBRIGACAO, RECOMENDACAO, RESSARCIMENTO), supporting downstream analysis and monitoring of audit decisions. The project is inspired by LexCare.BR (focused on health judicialization) and applies *function calling* and *few-shot prompting* to decisions from the TCE/RN.


In [92]:
import pprint
import time

import pandas as pd

from itertools import combinations
from tqdm import tqdm
from langchain_openai import AzureChatOpenAI, ChatOpenAI
from dotenv import load_dotenv

from tools.dataset import get_decicontas_df
from tools.prompt import generate_few_shot_ner_prompts
from tools.schema import (
    NERDecisao
)

load_dotenv()

True

# Loading and setting up

The dataset is loaded with `get_decicontas_df()`, which reads a CSV file of TCE/RN decisions annotated in Label Studio. Next, the chat models are instantiated. The metrics reported at the end of this notebook cover three Azure OpenAI deployments:

- `gpt-4o`
- `gpt-35-turbo`
- `gpt-4-turbo`

The GPT-4.1 family (`gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`) is also set up, first through Azure OpenAI and then through the OpenAI API directly; the second set of definitions overrides the first.

Each model is wrapped with `with_structured_output` (using function calling) so that its answers are parsed into the schema defined in `NERDecisao`. This approach enables automatic validation and makes it easy to convert the model's answers to formats suitable for downstream evaluation pipelines.


In [None]:
df_decicontas = get_decicontas_df()

In [21]:
gpt41_nano = AzureChatOpenAI(
    deployment_name="gpt-4-1-nano",  
    model_name="gpt-4-1-nano"
)

gpt41_mini = AzureChatOpenAI(
    deployment_name="gpt-4-1-mini",  
    model_name="gpt-4-1-mini"
)

gpt41= AzureChatOpenAI(
    deployment_name="gpt-4-1",  
    model_name="gpt-4-1"
)

# Extractors with structured output
extractor_gpt41_nano = gpt41_nano.with_structured_output(NERDecisao, include_raw=False, method="function_calling")
extractor_gpt41_mini = gpt41_mini.with_structured_output(NERDecisao, include_raw=False, method="function_calling")
extractor_gpt41 = gpt41.with_structured_output(NERDecisao, include_raw=False, method="function_calling")


gpt4o = AzureChatOpenAI(
    deployment_name="gpt-4o",  
    model_name="gpt-4o",       
)

gpt35 = AzureChatOpenAI(
    deployment_name="gpt-35",  
    model_name="gpt-35-turbo",
)

gpt4turbo = AzureChatOpenAI(
    deployment_name="gpt-4-turbo",
    model_name="gpt-4",
)

extractor_gpt4o = gpt4o.with_structured_output(NERDecisao, include_raw=False, method="function_calling")
extractor_gpt35 = gpt35.with_structured_output(NERDecisao, include_raw=False, method="function_calling")
extractor_gpt4_turbo = gpt4turbo.with_structured_output(NERDecisao, include_raw=False, method="function_calling")

In [22]:
gpt41 = ChatOpenAI(
    model="gpt-4.1"
)
gpt41_mini = ChatOpenAI(
    model_name="gpt-4.1-mini"
)

gpt41_nano= ChatOpenAI(
    model_name="gpt-4.1-nano"
)

# Extractors with structured output
extractor_gpt41_nano = gpt41_nano.with_structured_output(NERDecisao, include_raw=False, method="function_calling")
extractor_gpt41_mini = gpt41_mini.with_structured_output(NERDecisao, include_raw=False, method="function_calling")
extractor_gpt41 = gpt41.with_structured_output(NERDecisao, include_raw=False, method="function_calling")

In [26]:
df_decicontas["checked"] = False

# Exploratory Data Analysis (EDA)

Before running the NER pipeline, an exploratory data analysis was carried out to inspect the distribution of the labels previously annotated in the dataset. This investigation serves to:

- validate consistency of the manual annotations
- identify classes with low representation
- guide adjustments for class balance during evaluation

The frequency of each label was extracted by scanning the Label Studio annotation results and summarized in a DataFrame.


In [3]:
labels_seen = []
for i,r in df_decicontas.iterrows():
    for a in r['annotations']:
        if 'result' in a.keys():
            for v in a['result']:
                labels_seen.append(v['value']['labels'][0])    
pd.Series(labels_seen).value_counts().sort_index().to_frame('count').reset_index().rename(columns={'index': 'label'}).sort_values(by='count', ascending=False)

| label | count |
|---|---|
| MULTA | 204 |
| OBRIGACAO | 120 |
| RESSARCIMENTO | 63 |
| RECOMENDACAO | 58 |


# Example use

To validate the pipeline, a test query was built with real excerpts from TCE/RN decisions, simulating fines, deadlines, and enforcement clauses. This sample text was passed to the few-shot prompt generation function to check the schema, ensure consistent attribute definitions, and manually verify correctness before applying the pipeline to the full dataset.

The generated prompt was displayed with `pprint` for manual inspection and debugging.


In [4]:
EXAMPLE_TEXT = '''
DECIDEM os Conselheiros do Tribunal de Contas do Estado, à unanimidade, em consonância com a informação do Corpo Técnico e com o parecer do Ministério Público que atua junto a esta Corte de Contas, acolhendo integralmente o voto do Conselheiro Relator, julgar: a) pela DENEGAÇÃO DE REGISTRO ao ato concessivo da aposentadoria e à despesa dele decorrente; b) pela determinação ao IPERN, à vista da Lei Complementar Estadual nº 547/2015, para que, no prazo de 60 (sessenta) dias, após o trânsito em julgado desta decisão, adote as correções necessárias para regularização do ato concessório, do cálculo dos proventos e de sua respectiva implantação; c) no caso de descumprimento da presente decisão, a responsabilização do titular da pasta responsável por seu atendimento, sem prejuízo da multa cominatória desde já fixada no valor de R$ 50,00 (cinquenta reais) por dia que superar o interregno fixado no item `b`, com base no art. 110 da Lei Complementar Estadual nº 464/2012, valor este passível de revisão e limitado ao teto previsto no art. 323, inciso II, alínea `f`, do Regimento Interno, a ser apurado por ocasião de eventual subsistência de mora.
'''
prompt_with_few_shot = generate_few_shot_ner_prompts(EXAMPLE_TEXT)
pprint.pprint(prompt_with_few_shot)


ChatPromptValue(messages=[SystemMessage(content='Você é um especialista em extração de entidades nomeadas com precisão excepcional. Sua tarefa é identificar e extrair informações específicas do texto fornecido, seguindo estas diretrizes:\n\n1. Extraia as informações exatamente como aparecem no texto, sem interpretações ou alterações.\n2. Se uma informação solicitada não estiver presente ou for ambígua, retorne null para esse campo.\n3. Mantenha-se estritamente dentro do escopo das entidades e atributos definidos no esquema fornecido.\n4. Preste atenção especial para manter a mesma ortografia, pontuação e formatação das informações extraídas.\n5. Não infira ou adicione informações que não estejam explicitamente presentes no texto.\n6. Se houver múltiplas menções da mesma entidade, extraia todas as ocorrências relevantes.\n7. Ignore informações irrelevantes ou fora do contexto das entidades solicitadas.\n\n**Orientação adicional para OBRIGACAO**: considere apenas o dispositivo da decisão

# Evaluating the models

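In this stage, the script loops over the entire dataset to apply batch NER inference. Each model listed in `MODELS` receives the few-shot prompts and returns structured predictions; the cell below is configured for the GPT-4.1 family, with the gpt-35, gpt-4o, and gpt-4-turbo entries commented out. The predictions are stored along with the source text and the reference ("golden") annotations to enable metric evaluation later. A `checked` flag per model and instance lets an interrupted run resume without repeating completed calls.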

The intermediate results are saved in JSON so they can be reused without requiring the models to be rerun, which saves time and reduces costs.


In [28]:
MODELS = [
    ('gpt-41', extractor_gpt41),
    ('gpt-41-nano', extractor_gpt41_nano),
    ('gpt-41-mini', extractor_gpt41_mini),
    #('gpt-35', extractor_gpt35),
    #('gpt-4o', extractor_gpt4o),
    #('gpt-4-turbo', extractor_gpt4_turbo)
]
models_names = [x[0] for x in MODELS]


In [58]:
model_index = []
for m in models_names:
    model_index.extend(list(zip([m] * len(df_decicontas.index), df_decicontas.index)))
df_checked = pd.DataFrame(model_index, columns=['model', 'index'])
df_checked["checked"] = False

In [63]:
len(df_decicontas.index), len(df_checked.index), len(models_names)

(1425, 4275, 3)

In [50]:
models_results = []

df_results = pd.DataFrame(columns=['index', 'text', 'pred', 'golden', 'model'])

In [91]:
len(df_checked[df_checked['checked']])

138

In [90]:
df_results

Unnamed: 0,index,text,pred,golden,model
0,0,DECIDEM os Conselheiros do Tribunal de Contas ...,"{'multas': [], 'ressarcimentos': [], 'obrigaco...",[],gpt-41
1,1,DECIDEM os Conselheiros do Tribunal de Contas ...,"{'multas': [], 'ressarcimentos': [], 'obrigaco...",[],gpt-41
2,2,DECIDEM os Conselheiros do Tribunal de Contas ...,"{'multas': [], 'ressarcimentos': [], 'obrigaco...",[],gpt-41
3,3,DECIDEM os Conselheiros do Tribunal de Contas ...,"{'multas': [], 'ressarcimentos': [], 'obrigaco...",[],gpt-41
4,4,DECIDEM os Conselheiros do Tribunal de Contas ...,"{'multas': [], 'ressarcimentos': [], 'obrigaco...",[],gpt-41
...,...,...,...,...,...
133,133,DECIDEM os Conselheiros do Tribunal de Contas ...,"{'multas': [], 'ressarcimentos': [], 'obrigaco...",[],gpt-41
134,134,DECIDEM os Conselheiros do Tribunal de Contas ...,"{'multas': [], 'ressarcimentos': [], 'obrigaco...",[],gpt-41
135,135,DECIDEM os Conselheiros do Tribunal de Contas ...,"{'multas': [], 'ressarcimentos': [], 'obrigaco...",[],gpt-41
136,136,DECIDEM os Conselheiros do Tribunal de Contas ...,"{'multas': [], 'ressarcimentos': [], 'obrigaco...",[],gpt-41


In [None]:
for model_name, model_extractor in MODELS:
    print(f"Extracting {model_name} results")

    rows_to_process = [
        (index, row) for index, row in df_decicontas.iterrows()
        if not df_checked.loc[(df_checked['model'] == model_name) & (df_checked['index'] == index), 'checked'].values[0]
    ]

    for index, row in tqdm(rows_to_process, desc=f"{model_name}", unit="instance"):
        prompt_with_few_shot = generate_few_shot_ner_prompts(row['data']['text'])

        success = False
        while not success:
            try:
                result = model_extractor.invoke(prompt_with_few_shot)
                success = True
            except Exception as e:
                print(f"Error at index {index}: {e}")
                time.sleep(3)

        extracted_result = pd.DataFrame([{
            'index': index,
            'text': row['data']['text'],
            'pred': result.model_dump(),
            'golden': [r['value'] for r in row['annotations'][0]['result']],
            'model': model_name
        }])
        df_results = pd.concat([df_results, extracted_result], ignore_index=True)
        
        df_checked.loc[(df_checked['model'] == model_name) & (df_checked['index'] == index), 'checked'] = True


Extracting gpt-41 results


gpt-41:   0%|          | 2/1287 [00:05<51:06,  2.39s/instance]  
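
The `while not success` loop in the cell above retries forever if a request keeps failing, for example because of an invalid credential or a prompt the service rejects. A bounded variant with exponential backoff is sketched below. The helper name `invoke_with_retry`, its parameters, and the `flaky` stand-in are illustrative and not part of the original pipeline; in the notebook, `fn` would be something like `model_extractor.invoke`.

```python
import time


def invoke_with_retry(fn, arg, max_attempts=5, base_delay=3.0, sleep=time.sleep):
    """Call fn(arg), retrying on any exception up to max_attempts times.

    The delay doubles after each failure (base_delay, 2 * base_delay, ...).
    The last exception is re-raised once the attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(arg)
        except Exception as exc:  # broad on purpose, mirroring the original loop
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Hypothetical stand-in for a model call: fails twice, then succeeds.
calls = {"n": 0}


def flaky(prompt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return {"multas": [], "prompt": prompt}


# sleep is replaced by a no-op so the example runs instantly
result = invoke_with_retry(flaky, "texto da decisão", sleep=lambda s: None)
```

Re-raising after the last attempt leaves the instance unchecked, so a later run can pick it up again.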

In [None]:
df_results.to_json("dataset/labeled_data/models_results_decicontas_mininano.json", orient="records", force_ascii=False, indent=2)

In [None]:
df_results.to_json("dataset/labeled_data/models_results_decicontas.json", orient="records", force_ascii=False, indent=2)

In [None]:
df_results.to_pickle("dataset/labeled_data/models_results_decicontas.pkl")

# Assessing the metrics

To measure the models’ performance, the script uses two main evaluation strategies:

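- **Token-level (seqeval)**: gold and predicted spans are converted to IOB tag sequences, with one tag per character of the decision text, and seqeval reports precision, recall, and F1. Because seqeval counts an entity as correct only when its type and boundaries match exactly, this is the stricter of the two metrics.
- **Span-level IoU (Intersection over Union)**: the overlap between each predicted span and each gold span is computed, and a pair counts as a match when the entity types agree and IoU ≥ 0.5.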

Additionally, the script reports aggregate metrics and also details them by label, showing the number of matching spans between predictions and ground truth. This allows a clearer view of each model’s behavior per entity type (MULTA, OBRIGACAO, RECOMENDACAO, RESSARCIMENTO).
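
As a worked illustration of the span-level criterion, the sketch below computes IoU for character-offset spans, treated as half-open `[start, end)` ranges. The function name `span_iou` and the offsets are made up for the example.

```python
def span_iou(a, b):
    """IoU of two half-open character spans given as (start, end) tuples."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


gold = (100, 200)         # gold span, 100 characters long
close_pred = (110, 200)   # prediction that misses the first 10 characters
far_pred = (180, 300)     # prediction that only touches the tail of the gold span

iou_close = span_iou(gold, close_pred)   # intersection 90, union 100
iou_far = span_iou(gold, far_pred)       # intersection 20, union 200

# with the 0.5 threshold, only the first prediction counts as a match
is_match_close = iou_close >= 0.5
is_match_far = iou_far >= 0.5
```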


In [3]:
df_models = pd.read_json("dataset/labeled_data/models_results_decicontas.json")

In [4]:
len(df_models)

8556

In [21]:
from rapidfuzz import fuzz

DICT_LABELS = {
    "obrigacoes": "OBRIGACAO",
    "recomendacoes": "RECOMENDACAO",
    "ressarcimentos": "RESSARCIMENTO",
    "multas": "MULTA",
}

def convert_pred_to_golden_format(row, window_size=500, step_size=100, min_score=80):
    """Map each predicted description back to character offsets in the source text.

    Slides a window over the text and keeps the best fuzzy match; predictions
    whose best score stays below `min_score` are dropped.
    """
    pred_spans = []
    text = row['text']
    pred = row['pred']

    for label_type, spans in pred.items():
        for span in spans:
            if not isinstance(span, dict):
                continue
            # each entity type stores its text under a different key
            span_text = (
                span.get("descricao_multa")
                or span.get("descricao_obrigacao")
                or span.get("descricao_ressarcimento")
                or span.get("descricao_recomendacao")
            )
            if not span_text:
                continue

            words = span_text.split()
            first_word = words[0] if words else ""
            best_score = 0
            best_pos = -1
            best_substring = ""

            # sliding window search
            for start in range(0, len(text), step_size):
                window = text[start:start+window_size]
                score = fuzz.partial_ratio(span_text, window)
                if score > best_score and score >= min_score:
                    best_score = score
                    # anchor at the first word of the prediction; fall back to the
                    # window start when str.find returns -1 (word not in window)
                    offset = window.find(first_word) if first_word else 0
                    best_pos = start + max(offset, 0)
                    best_substring = span_text

            if best_score >= min_score and best_pos >= 0:
                pred_spans.append({
                    "start": best_pos,
                    "end": best_pos + len(best_substring),
                    "text": best_substring,
                    "labels": [DICT_LABELS[label_type]]
                })

    return pred_spans
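
The function above anchors each span at the first occurrence of the prediction's first word inside the best window, which can misplace the span when that word also appears earlier in the window. One alternative, sketched here with the standard library's `difflib` rather than `rapidfuzz`, is to align on the longest block shared by the text and the prediction. The helper `locate_span`, its `min_ratio` parameter, and the sample strings are hypothetical, and this is not the method used to produce the results in this notebook.

```python
from difflib import SequenceMatcher


def locate_span(text, span_text, min_ratio=0.8):
    """Return (start, end) of span_text in text, or None if no good match.

    Tries an exact match first; otherwise aligns on the longest common block
    and accepts the candidate region only if it is similar enough.
    """
    exact = text.find(span_text)
    if exact >= 0:
        return exact, exact + len(span_text)

    matcher = SequenceMatcher(None, text, span_text, autojunk=False)
    block = matcher.find_longest_match(0, len(text), 0, len(span_text))
    if block.size == 0:
        return None

    # shift back by the block's offset inside the prediction
    start = max(0, block.a - block.b)
    end = min(len(text), start + len(span_text))
    candidate = text[start:end]
    ratio = SequenceMatcher(None, candidate, span_text, autojunk=False).ratio()
    return (start, end) if ratio >= min_ratio else None


text = "julgar pela aplicação de multa no valor de R$ 50,00 ao gestor responsável"
# prediction with a small transcription difference ("50.00" instead of "50,00")
predicted = "multa no valor de R$ 50.00"

span = locate_span(text, predicted)
missing = locate_span(text, "XYZ")  # shares no characters with the text
```

Aligning on the shared block keeps the start offset tied to where the bulk of the prediction actually occurs, at the cost of a character-level comparison over the whole decision text.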


In [None]:
df_models['pred_as_golden'] = df_models.apply(
    lambda row: convert_pred_to_golden_format(row, window_size=500, step_size=100, min_score=80),
    axis=1
)

In [None]:
df_models_4o = df_models[df_models['model'] == 'gpt-4o']
df_models_4turbo = df_models[df_models['model'] == 'gpt-4turbo']
df_models_35 = df_models[df_models['model'] == 'gpt-35']
df_models_41 = df_models[df_models['model'] == 'gpt-41']
df_models_41_mini = df_models[df_models['model'] == 'gpt-41-mini']
df_models_41_nano = df_models[df_models['model'] == 'gpt-41-nano']

In [None]:
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
from collections import defaultdict

def compute_iou_score(span_a, span_b, label_a, label_b, threshold=0.5):
    """
    Computes the IoU agreement score between two labeled spans.

    Returns the IoU when it reaches `threshold` and the labels match, else 0.0.
    """
    s_a, e_a = span_a
    s_b, e_b = span_b

    # no overlap at all
    if e_a < s_b or e_b < s_a:
        return 0.0
    
    intersection = max(0, min(e_a, e_b) - max(s_a, s_b))
    union = max(e_a, e_b) - min(s_a, s_b)
    
    iou = intersection / union if union > 0 else 0.0

    if iou >= threshold:
        delta = 1 if label_a == label_b else 0
        return iou * delta
    else:
        return 0.0

def calculate_metrics(df, iou_threshold=0.5):
    """Print and return seqeval and span-level IoU metrics for one model's rows.

    IOB tags are assigned per character of `text`, so each character acts as a token.
    """

    y_true = []
    y_pred = []
    label_metrics = defaultdict(lambda: {"total_gold": 0, "total_pred": 0, "matched": 0})

    for _, row in df.iterrows():
        text = row['text']
        true_labels = ['O'] * len(text)
        pred_labels = ['O'] * len(text)

        # gold
        for ann in row['golden']:
            start, end, label = ann['start'], ann['end'], ann['labels'][0]
            if start < len(true_labels):
                true_labels[start] = f"B-{label}"
                for i in range(start+1, min(end, len(true_labels))):
                    true_labels[i] = f"I-{label}"

        # pred
        for ann in row['pred_as_golden']:
            start, end, label = ann['start'], ann['end'], ann['labels'][0]
            if start < len(pred_labels):
                pred_labels[start] = f"B-{label}"
                for i in range(start+1, min(end, len(pred_labels))):
                    pred_labels[i] = f"I-{label}"

        y_true.append(true_labels)
        y_pred.append(pred_labels)

        gold_spans = [(ann['start'], ann['end'], ann['labels'][0]) for ann in row['golden']]
        pred_spans = [(ann['start'], ann['end'], ann['labels'][0]) for ann in row['pred_as_golden']]

        for g in gold_spans:
            label_metrics[g[2]]["total_gold"] += 1
        for p in pred_spans:
            label_metrics[p[2]]["total_pred"] += 1

        for p in pred_spans:
            for g in gold_spans:
                score = compute_iou_score(
                    (p[0], p[1]), (g[0], g[1]),
                    p[2], g[2],
                    threshold=iou_threshold
                )
                if score > 0:
                    label_metrics[p[2]]["matched"] += 1

    # seqeval
    token_prec = precision_score(y_true,y_pred)
    token_rec  = recall_score(y_true,y_pred)
    token_f1   = f1_score(y_true,y_pred)

    print("====== SEQEVAL TOKEN-LEVEL ======")
    print(f"Precision: {token_prec:.4f}")
    print(f"Recall:    {token_rec:.4f}")
    print(f"F1:        {token_f1:.4f}")
    print(classification_report(y_true,y_pred))

    total_gold = sum(v["total_gold"] for v in label_metrics.values())
    total_pred = sum(v["total_pred"] for v in label_metrics.values())
    total_matched = sum(v["matched"] for v in label_metrics.values())

    iou_prec = total_matched / total_pred if total_pred > 0 else 0
    iou_rec  = total_matched / total_gold if total_gold > 0 else 0
    iou_f1   = 2*iou_prec*iou_rec/(iou_prec+iou_rec) if (iou_prec+iou_rec) > 0 else 0

    print("====== SPAN-LEVEL IOU>=0.5 AGGREGATED ======")
    print(f"Precision: {iou_prec:.4f}")
    print(f"Recall:    {iou_rec:.4f}")
    print(f"F1:        {iou_f1:.4f}")

    print("====== SPAN-LEVEL IOU PER LABEL ======")
    for label, m in label_metrics.items():
        prec = m["matched"] / m["total_pred"] if m["total_pred"] > 0 else 0
        rec  = m["matched"] / m["total_gold"] if m["total_gold"] > 0 else 0
        f1   = 2*prec*rec/(prec+rec) if (prec+rec)>0 else 0
        print(f"{label}: P={prec:.4f} R={rec:.4f} F1={f1:.4f} ({m['matched']} matched)")

    return {
        "seqeval": {
            "precision": token_prec,
            "recall": token_rec,
            "f1": token_f1
        },
        "iou_agg": {
            "precision": iou_prec,
            "recall": iou_rec,
            "f1": iou_f1
        },
        "iou_per_label": dict(label_metrics)
    }
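
In the matching loop above, every (prediction, gold) pair that clears the threshold increments `matched`, so two predictions that overlap the same gold span are both counted and recall can be overstated. A stricter greedy one-to-one variant is sketched below, assuming spans are `(start, end, label)` tuples. The function names are illustrative, and the figures reported in this notebook come from the original loop, not from this variant.

```python
def iou(a, b):
    """IoU of two half-open (start, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_one_to_one(pred_spans, gold_spans, threshold=0.5):
    """Greedily pair predictions with gold spans, best IoU first.

    Each prediction and each gold span is used at most once.
    Returns a list of (pred_index, gold_index, iou) tuples.
    """
    candidates = []
    for pi, (ps, pe, pl) in enumerate(pred_spans):
        for gi, (gs, ge, gl) in enumerate(gold_spans):
            if pl != gl:
                continue  # labels must agree
            score = iou((ps, pe), (gs, ge))
            if score >= threshold:
                candidates.append((score, pi, gi))

    candidates.sort(reverse=True)  # highest IoU first
    used_pred, used_gold, pairs = set(), set(), []
    for score, pi, gi in candidates:
        if pi in used_pred or gi in used_gold:
            continue
        used_pred.add(pi)
        used_gold.add(gi)
        pairs.append((pi, gi, score))
    return pairs


gold = [(0, 100, "MULTA")]
# two near-duplicate predictions for the same gold span
pred = [(0, 90, "MULTA"), (5, 100, "MULTA")]

pairs = match_one_to_one(pred, gold)
precision = len(pairs) / len(pred)
recall = len(pairs) / len(gold)
```

On this input the pairwise loop would count two matches against a single gold span, whereas the one-to-one variant counts one.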


In [None]:
metrics_35 = calculate_metrics(df_models_35, 0.5)

Precision: 0.1164
Recall:    0.1326
F1:        0.1239
               precision    recall  f1-score   support

        MULTA       0.11      0.12      0.12       204
    OBRIGACAO       0.04      0.05      0.05       120
 RECOMENDACAO       0.31      0.47      0.37        58
RESSARCIMENTO       0.03      0.03      0.03        63

    micro avg       0.12      0.13      0.12       445
    macro avg       0.12      0.17      0.14       445
 weighted avg       0.11      0.13      0.12       445

Precision: 0.7083
Recall:    0.8404
F1:        0.7688
MULTA: P=0.8304 R=0.9118 F1=0.8692 (186 matched)
OBRIGACAO: P=0.5903 R=0.7083 F1=0.6439 (85 matched)
RECOMENDACAO: P=0.5889 R=0.9138 F1=0.7162 (53 matched)
RESSARCIMENTO: P=0.7143 R=0.7937 F1=0.7519 (50 matched)


In [None]:
metrics_4turbo = calculate_metrics(df_models_4turbo, 0.5)

Precision: 0.1233
Recall:    0.1416
F1:        0.1318
               precision    recall  f1-score   support

        MULTA       0.12      0.13      0.12       204
    OBRIGACAO       0.05      0.06      0.05       120
 RECOMENDACAO       0.31      0.47      0.37        58
RESSARCIMENTO       0.04      0.05      0.05        63

    micro avg       0.12      0.14      0.13       445
    macro avg       0.13      0.17      0.15       445
 weighted avg       0.12      0.14      0.13       445

Precision: 0.6966
Recall:    0.8360
F1:        0.7600
MULTA: P=0.8017 R=0.9118 F1=0.8532 (186 matched)
OBRIGACAO: P=0.6187 R=0.7167 F1=0.6641 (86 matched)
RECOMENDACAO: P=0.5843 R=0.8966 F1=0.7075 (52 matched)
RESSARCIMENTO: P=0.6486 R=0.7619 F1=0.7007 (48 matched)


In [None]:
metrics_4o = calculate_metrics(df_models_4o, 0.5)


Precision: 0.1111
Recall:    0.1281
F1:        0.1190
               precision    recall  f1-score   support

        MULTA       0.07      0.07      0.07       204
    OBRIGACAO       0.05      0.06      0.05       120
 RECOMENDACAO       0.35      0.53      0.42        58
RESSARCIMENTO       0.05      0.06      0.06        63

    micro avg       0.11      0.13      0.12       445
    macro avg       0.13      0.18      0.15       445
 weighted avg       0.10      0.13      0.11       445

Precision: 0.6710
Recall:    0.8157
F1:        0.7363
MULTA: P=0.7716 R=0.8775 F1=0.8211 (179 matched)
OBRIGACAO: P=0.5816 R=0.6833 F1=0.6284 (82 matched)
RECOMENDACAO: P=0.6000 R=0.9310 F1=0.7297 (54 matched)
RESSARCIMENTO: P=0.6154 R=0.7619 F1=0.6809 (48 matched)


In [137]:
pprint.pprint(metrics_35)

{'iou_agg': {'f1': 0.7687564234326825,
             'precision': 0.7083333333333334,
             'recall': 0.8404494382022472},
 'iou_per_label': {'MULTA': {'matched': 186,
                             'total_gold': 204,
                             'total_pred': 224},
                   'OBRIGACAO': {'matched': 85,
                                 'total_gold': 120,
                                 'total_pred': 144},
                   'RECOMENDACAO': {'matched': 53,
                                    'total_gold': 58,
                                    'total_pred': 90},
                   'RESSARCIMENTO': {'matched': 50,
                                     'total_gold': 63,
                                     'total_pred': 70}},
 'seqeval': {'f1': 0.12394957983193279,
             'precision': 0.11637080867850098,
             'recall': 0.13258426966292136}}


In [138]:
pprint.pprint(metrics_4turbo)

{'iou_agg': {'f1': 0.759959141981614,
             'precision': 0.6966292134831461,
             'recall': 0.8359550561797753},
 'iou_per_label': {'MULTA': {'matched': 186,
                             'total_gold': 204,
                             'total_pred': 232},
                   'OBRIGACAO': {'matched': 86,
                                 'total_gold': 120,
                                 'total_pred': 139},
                   'RECOMENDACAO': {'matched': 52,
                                    'total_gold': 58,
                                    'total_pred': 89},
                   'RESSARCIMENTO': {'matched': 48,
                                     'total_gold': 63,
                                     'total_pred': 74}},
 'seqeval': {'f1': 0.13179916317991633,
             'precision': 0.1232876712328767,
             'recall': 0.14157303370786517}}


In [139]:
pprint.pprint(metrics_4o)

{'iou_agg': {'f1': 0.7363083164300204,
             'precision': 0.6709796672828097,
             'recall': 0.8157303370786517},
 'iou_per_label': {'MULTA': {'matched': 179,
                             'total_gold': 204,
                             'total_pred': 232},
                   'OBRIGACAO': {'matched': 82,
                                 'total_gold': 120,
                                 'total_pred': 141},
                   'RECOMENDACAO': {'matched': 54,
                                    'total_gold': 58,
                                    'total_pred': 90},
                   'RESSARCIMENTO': {'matched': 48,
                                     'total_gold': 63,
                                     'total_pred': 78}},
 'seqeval': {'f1': 0.11899791231732776,
             'precision': 0.1111111111111111,
             'recall': 0.12808988764044943}}


# Final metrics

The experiment reported the following main results across the three evaluated models:

- **gpt-35-turbo**:
  - Token-level F1: 0.1239
  - Span-level IoU F1: 0.7688
- **gpt-4-turbo**:
  - Token-level F1: 0.1318
  - Span-level IoU F1: 0.7600
- **gpt-4o**:
  - Token-level F1: 0.1190
  - Span-level IoU F1: 0.7363

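Although the token-level F1 scores are low (0.12 to 0.13), the span-level IoU F1 scores are much higher (0.74 to 0.77). The gap is expected: seqeval counts an entity as correct only when its type and boundaries match the gold span exactly, and because the tags here are assigned per character and predicted offsets are recovered by fuzzy matching, a prediction that is off by a single character counts as a miss. The IoU scores show that the models do locate the correct text regions even when the span boundaries are not exact.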
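
The contrast between the two metrics can be reproduced on a single span. The sketch below tags characters the same way `calculate_metrics` does and then compares entities by exact match; seqeval itself is not called, and `to_char_tags` and `extract_entities` are simplified, hypothetical stand-ins written for this illustration.

```python
def to_char_tags(length, spans):
    """One IOB tag per character, for (start, end, label) spans."""
    tags = ["O"] * length
    for start, end, label in spans:
        if start < length:
            tags[start] = f"B-{label}"
            for i in range(start + 1, min(end, length)):
                tags[i] = f"I-{label}"
    return tags


def extract_entities(tags):
    """Recover (start, end, label) entities from an IOB tag sequence."""
    entities, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current entity
        if label is not None:
            entities.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    return entities


# the prediction starts two characters late but covers the same region
gold_entities = set(extract_entities(to_char_tags(60, [(10, 50, "MULTA")])))
pred_entities = set(extract_entities(to_char_tags(60, [(12, 50, "MULTA")])))

exact_matches = len(gold_entities & pred_entities)  # exact-boundary criterion

inter = min(50, 50) - max(10, 12)
union = max(50, 50) - min(10, 12)
iou = inter / union  # overlap criterion
```

The two-character offset gives zero exact matches but an IoU of 38/40, well above the 0.5 threshold.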

These results reinforce the feasibility of using LLMs for pre-labeling audit decisions, in a similar way to how it is done with healthcare judicial data (LexCare.BR), reducing the need for fully manual annotation.
