# LELA60332 Computational Linguistics 2

## Assignment: Named Entity Recognition with Transformers



---



**TASK**

The following is an implementation of a **Named Entity Recognition** (NER) span-labelling task using Transformer models.

The **[dataset](https://github.com/UniversalNER/UNER_English-EWT)** used in this task is split into training, validation, and test subsets. There is an additional out-of-domain test set. The data consists of pretokenised sentences with token-by-token tags. The tagset follows the BIO (beginning, inside, outside) system. There are 3 main types of named entities: locations (‘B-LOC’, ‘I-LOC’), persons (‘B-PER’, ‘I-PER’), and organisations (‘B-ORG’, ‘I-ORG’). The goal is to solve 2 versions of the task: 1) the original one, with 7 different tags, 2) a simplified one, where the tagset consists of 3 tags (‘O’, ‘B’, and ‘I’).
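The relationship between the two tagsets can be illustrated with a small sketch. It assumes a hypothetical helper, `simplify_tags`, that is not part of the notebook and only shows one plausible way to collapse the 7 typed tags into the 3 simplified ones.

```python
def simplify_tags(tags):
    """Collapse typed BIO tags (e.g. 'B-LOC') to bare 'B', 'I', 'O'."""
    # Splitting on the first hyphen leaves 'O' unchanged and maps 'B-LOC' to 'B'
    return [tag.split('-', 1)[0] for tag in tags]


full_tags = ['O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
simple_tags = simplify_tags(full_tags)
print(simple_tags)
```

The entity type is discarded, so the simplified task only asks where entities start and end.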


The following Transformer architectures are explored to solve the task:

*   **Encoder-only**. A pre-trained BERT model (trained on masked language modelling) is fine-tuned for the NER task using the provided dataset. Three different classification heads are implemented and tested (single-layer, two-layer and two-layer with dropout).

*   **Decoder-only**. A pre-trained LLM is prompted with few-shot in-context examples to produce label sequences.

## Imports

In [1]:
## Import libraries and modules

# For data download and prep
from collections import defaultdict
from urllib import request
import json
import pandas as pd
import random

# To shuffle data for training
from random import shuffle
# For batch calculation
from math import ceil
# PyTorch for model definition, deep learning
import torch
import torch.nn as nn
# To access BERT
from transformers import AutoModel, AutoTokenizer
# For progress bars
from tqdm.auto import tqdm

In [2]:
# Store data from the project in Google Drive (model checkpoints, results)
import csv
from google.colab import drive
# mount my Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Create a directory for saving models
!mkdir -p /content/drive/MyDrive/ner_models
# Create a directory for results
!mkdir -p /content/drive/MyDrive/ner_models/combined_results
# Copy model files to Google Drive
!cp *.pt /content/drive/MyDrive/ner_models/

In [4]:
# for loading checkpoints only
import os
model_dir = '/content/drive/MyDrive/ner_models/'
print("Files in model directory:")
files = os.listdir(model_dir)
for file in files:
    print(f"  {file}")

Files in model directory:
  combined_results
  simplified_ner_clf_head.pt
  simplified_ner_encoder.pt
  two_layer_dropout_ner_clf_head.pt
  two_layer_dropout_ner_encoder.pt
  single_layer_ner_clf_head.pt
  single_layer_ner_encoder.pt
  two_layer_ner_clf_head.pt
  bert_results.csv
  two_layer_ner_encoder.pt
  bert_results_may 15.gsheet
  Bert results.png
  all_results.csv
  all_results.gsheet


Setting the random seed for reproducibility (shuffling of training data, initialization of model weights for the classification head).

In [5]:
# Set constant seed value
SEED = 42
# Set the random seed for PyTorch's random number generation in model weights
torch.manual_seed(SEED)
# Sets the random seed for data shuffling
random.seed(SEED)

## Prepare the Data

Define functions to process the dataset for the NER task

In [6]:
## Define a function to parse the data
# Takes a text block as an argument
def parse_conllu_using_pandas(block):
    # initialize an empty list to store processed lines
    records = []
    # split the input block into individual lines and iterate over each line
    for line in block.splitlines():
        # omit lines starting with "#" (comments and metadata)
        # For valid lines
        if not line.startswith('#'):
            # remove whitespaces, split the line by tab and append
            records.append(line.strip().split('\t'))
    # Create a Pandas DataFrame from the records list
    # with columns for token id, token form, token NER tag
    # a row per token
    return pd.DataFrame.from_records(
        records,
        columns=['ID', 'FORM', 'TAG', 'Misc1', 'Misc2'])

In [7]:
## Function to extract tokens and labels
# Take the dataframe created by previous function
def tokens_to_labels(df):
    # Return a tuple containing two lists: tokens from the FORM column and corresponding tags from the TAG column
    return (
        df.FORM.tolist(),
        df.TAG.tolist()
    )
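To make the two steps concrete, here is a pandas-free sketch of the same idea on a made-up two-token block. The block content is hypothetical, and the last two columns hold placeholder values because the functions above do not use them.

```python
# Hypothetical block in the tab-separated layout the parser expects
block = (
    "# text = Visit Iguazu\n"
    "1\tVisit\tO\t-\t-\n"
    "2\tIguazu\tB-LOC\t-\t-"
)

# Keep non-comment lines and split each one into its tab-separated fields
rows = [line.strip().split('\t') for line in block.splitlines()
        if not line.startswith('#')]

# Column 1 holds the token form and column 2 its NER tag
tokens = [row[1] for row in rows]
tags = [row[2] for row in rows]
print((tokens, tags))
```

The result has the same shape as the output of `tokens_to_labels`: a tuple of a token list and a tag list.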

Define the data source

In [8]:
# Define a URL prefix for GitHub files
PREFIX = "https://raw.githubusercontent.com/UniversalNER/"

# Create a nested dictionary mapping dataset names and splits to their file paths
DATA_URLS = {
    "en_ewt": {
        "train": "UNER_English-EWT/master/en_ewt-ud-train.iob2",
        "dev": "UNER_English-EWT/master/en_ewt-ud-dev.iob2",
        "test": "UNER_English-EWT/master/en_ewt-ud-test.iob2"
    },
    "en_pud": {
        "test": "UNER_English-PUD/master/en_pud-ud-test.iob2"
    }
}

There are two datasets:
*   `en_ewt` is the main dataset with the train-dev-test split
*   `en_pud` is the out-of-domain test set


The data will be used as follows:
*   Train on `data_dict['en_ewt']['train']`
*   Validate on `data_dict['en_ewt']['dev']`
*   Test on `data_dict['en_ewt']['test']` and `data_dict['en_pud']['test']`

Download and process the data

In [9]:
# Create a nested dict to store processed data
data_dict = defaultdict(dict)

# Iterate through each dataset (each corpus and split in the DATA_URLS dictionary)
# corpus will be the key ("en_ewt" or "en_pud") - a specific dataset
# split_dict is the dictionary of data splits for that corpus - train/dev/test
for corpus, split_dict in DATA_URLS.items():
    # Iterate through split_dict key-value pairs
    # For each file
    for split, url_suffix in split_dict.items():
        # combine URL prefixes and suffix
        url = PREFIX + url_suffix
        # Open the URL
        with request.urlopen(url) as response:
            # read the content and decode it as UTF-8 text, stored in txt
            txt = response.read().decode('utf-8')
            # Split the text by double newlines to separate sentences
            # apply the parsing function to create dataframes
            data_frames = map(parse_conllu_using_pandas, txt.split('\n\n'))
            # Apply the function extracting tokens and labels to create token-label tuples
            # Each tuple holds a list of tokens and a list of tags
            token_label_alignments = list(map(tokens_to_labels, data_frames))
            # Store token-label pairs in a nested data_dict
            data_dict[corpus][split] = token_label_alignments

Save the data into a json file

In [10]:
# Save the data as a json file (so that you don't have to redownload it each time)
with open('ner_data_dict.json', 'w', encoding='utf-8') as out:
    json.dump(data_dict, out, indent=2, ensure_ascii=False)
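One detail to keep in mind when reloading the file: JSON has no tuple type, so each `(tokens, tags)` tuple comes back as a two-element list. The sketch below uses a toy dictionary and a temporary file (both assumptions for illustration, not the notebook's real data or path) to show the round trip and one way to restore the tuples.

```python
import json
import os
import tempfile

# Toy stand-in for data_dict: one corpus, one split, one (tokens, tags) tuple
toy_dict = {'en_ewt': {'train': [(['Where', 'is', 'Iguazu', '?'],
                                  ['O', 'O', 'B-LOC', 'O'])]}}

# Write to a temporary file rather than the notebook's real output path
path = os.path.join(tempfile.mkdtemp(), 'toy_ner_data.json')
with open(path, 'w', encoding='utf-8') as out:
    json.dump(toy_dict, out, indent=2, ensure_ascii=False)

with open(path, encoding='utf-8') as f:
    loaded = json.load(f)

# Restore the (tokens, tags) tuples that JSON stored as lists
restored = {
    corpus: {split: [tuple(pair) for pair in sentences]
             for split, sentences in splits.items()}
    for corpus, splits in loaded.items()
}
tokens, tags = restored['en_ewt']['train'][0]
print(tokens, tags)
```

Code that unpacks each sentence with `tokens, labels = sentence` works with either lists or tuples, so the conversion matters mainly when tuples are expected explicitly.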

Take a quick look at the data

In [11]:
# Look up the nested dictionary with two datasets and corresponding splits
data_dict['en_ewt'].keys()

dict_keys(['train', 'dev', 'test'])

In [12]:
data_dict['en_pud'].keys()

dict_keys(['test'])

In [13]:
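# Check length of datasets
# The list has 12544 entries: 12543 training sentences plus, most likely, one empty entry
# left by the blank line at the end of the file (empty entries are dropped in prepare_ner_data)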
len(data_dict['en_ewt']['train'])

12544

In [14]:
len(data_dict['en_ewt']['dev'])

2002

In [15]:
len(data_dict['en_ewt']['test'])

2078

In [16]:
len(data_dict['en_pud']['test'])

1001

In [17]:
# Look up the training data
# first access the outer dict (corpus - main/OOD), then inner dict (split type - train/dev/test)
# Each subset of each corpus is a list of tuples where each tuple is a list of tokens with a corresponding list of labels
data_dict['en_ewt']['train'][0]

(['Where', 'in', 'the', 'world', 'is', 'Iguazu', '?'],
 ['O', 'O', 'O', 'O', 'O', 'B-LOC', 'O'])



---



# **Fine-tuning BERT**

## Full Tagset

### Prepare labels

Extract NER tags and prepare them for model training

In [18]:
## Check how many unique NER tags we have in the training data
# Create a set of unique labels
unique_labels = set()
# iterate through training data (token-label tuples)
for item in data_dict['en_ewt']['train']:
    # add all labels from this sentence to the set using update()
    unique_labels.update(item[1])

In [19]:
# Count the number of classes using the set of unique labels
n_classes = len(unique_labels)
n_classes

7

The number of classes will be used in model training as the output dimension of the classification head that will be added on top of the BERT encoder.

Create a label-to-index mapping for model training, since the model requires numerical input. The reverse mapping will be used to convert numerical predictions back to a list of tags.

In [20]:
# Map labels (strings) to numerical indices
label_to_i = {
    label: i
    for i, label in enumerate(sorted(unique_labels))
}
# convert model outputs back to labels
i_to_label = {
    i: label
    for label, i in label_to_i.items()
}

In [21]:
# look up the mapping
label_to_i

{'B-LOC': 0,
 'B-ORG': 1,
 'B-PER': 2,
 'I-LOC': 3,
 'I-ORG': 4,
 'I-PER': 5,
 'O': 6}

In [22]:
i_to_label

{0: 'B-LOC',
 1: 'B-ORG',
 2: 'B-PER',
 3: 'I-LOC',
 4: 'I-ORG',
 5: 'I-PER',
 6: 'O'}
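As a quick check of how the two mappings are used together, the sketch below encodes the tags of the first training sentence and decodes them again. The mappings are rebuilt from a hard-coded label list so that the snippet runs on its own.

```python
# Same mappings as above, rebuilt here so the snippet is self-contained
labels = ['B-LOC', 'B-ORG', 'B-PER', 'I-LOC', 'I-ORG', 'I-PER', 'O']
label_to_i = {label: i for i, label in enumerate(sorted(labels))}
i_to_label = {i: label for label, i in label_to_i.items()}

gold_tags = ['O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']

# Encode tags as class indices (what the loss function consumes)
encoded = [label_to_i[tag] for tag in gold_tags]
# Decode indices back to tags (what span extraction consumes)
decoded = [i_to_label[i] for i in encoded]
print(encoded)
```

Sorting the labels before enumerating keeps the index assignment stable across runs, which matters when checkpoints are reloaded later.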

### Model Setup

#### Define the Classification Head

Single Linear Layer

In [23]:
# Create a single layer classification head class
# Inherits from PyTorch's nn.Module
# Takes BERT's 768-dimensional output and projects it down to 7 dimensions
class ClassificationHead(nn.Module):
    def __init__(self, model_dim=768, n_classes=7):
        # Initialize the parent neural network class
        super().__init__()
        # Create a single linear layer that maps from BERT's hidden dimension to the output
        self.linear = nn.Linear(model_dim, n_classes)

    # Define the forward pass
    # take input tensor x (BERT embeddings)
    def forward(self, x):
        #  Apply linear transformation and return result (a tensor containing logits - raw prediction scores)
        return self.linear(x)

Two-Layer Classification Head

In [24]:
# Create a two-layer classification head class (with 1 hidden layer)
class TwoLayerClassificationHead(nn.Module):
    def __init__(self, model_dim=768, hidden_dim=256, n_classes=7):
        super().__init__()
        # Create a linear layer that transforms BERT's 768-dimensional output to 256 dimensions
        self.layer1 = nn.Linear(model_dim, hidden_dim)
        # Pass the values through ReLU to introduce non-linearity
        self.relu = nn.ReLU()
        # Create a second linear layer that transforms the 256 hidden dimensions to the output (number of classes)
        self.layer2 = nn.Linear(hidden_dim, n_classes)

    # Define the forward pass
    def forward(self, x):
        # Take input x and pass it through the first linear layer
        x = self.layer1(x)
        # Apply ReLU
        x = self.relu(x)
        # Pass values through the second layer and produce the output
        return self.layer2(x)

Two-Layer Classification Head with Dropout

In [25]:
# Create a two-layer classification head class with dropout
class TwoLayerDropoutClassificationHead(nn.Module):
    # add dropout - drop out 30% of neurons during training
    def __init__(self, model_dim=768, hidden_dim=256, dropout_rate=0.3, n_classes=7):
        super().__init__()
        # Create the first linear layer
        self.layer1 = nn.Linear(model_dim, hidden_dim)
        # Apply ReLU activation
        self.relu = nn.ReLU()
        # Add dropout in the hidden layer
        self.dropout = nn.Dropout(dropout_rate)
        # Create the second linear layer that produces outputs
        self.layer2 = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.dropout(x)
        return self.layer2(x)
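Dropout is only applied while a module is in training mode, so a head that contains it should be switched with `.eval()` before validation and back with `.train()` before training. The sketch below uses an `nn.Sequential` stand-in with the same layer sizes as the dropout head (an assumption made to keep the snippet self-contained) and random vectors in place of BERT embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the dropout head above, with the same layer sizes
head = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 7),
)

# Five random vectors play the role of five word embeddings
x = torch.randn(5, 768)

head.eval()  # dropout becomes a no-op, so repeated passes agree
with torch.no_grad():
    eval_a = head(x)
    eval_b = head(x)

head.train()  # dropout is active again, so outputs can vary between passes
with torch.no_grad():
    train_out = head(x)

print(eval_a.shape)
print(torch.equal(eval_a, eval_b))
```

The output has one row per word and one column per class, which is the shape the cross-entropy loss expects for the logits.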

#### Select the encoder

In [26]:
# select the model tag
model_tag = 'google-bert/bert-base-uncased'

# Initialize the tokeniser
tokeniser = AutoTokenizer.from_pretrained(model_tag)

### Process the data

The original data is pre-tokenised into words, but BERT uses the WordPiece subword tokenisation algorithm, so a single word may be split into several subword tokens. Since the NER tags are attached to words, each word needs exactly one embedding; here a word is represented by the embedding of its first subword token.

Check tokenization alignment on a sample sentence from the training data


In [27]:
# Check original tokenisation of the data
sample_sentence = data_dict['en_ewt']['train'][0]
tokens, labels = sample_sentence

# Print original tokens and labels
print("Original tokens:", tokens)
print("Original labels:", labels)

Original tokens: ['Where', 'in', 'the', 'world', 'is', 'Iguazu', '?']
Original labels: ['O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']


In [28]:
# Check how BERT tokenises the same sentence
tokenisation = tokeniser(tokens, is_split_into_words=True)
print("BERT tokens:", tokenisation.tokens())

# Show mapping of BERT tokens to original word_ids
word_ids = tokenisation.word_ids()
print("Word IDs:", word_ids)

BERT tokens: ['[CLS]', 'where', 'in', 'the', 'world', 'is', 'i', '##gua', '##zu', '?', '[SEP]']
Word IDs: [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, None]
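The word IDs printed above are enough to work out which subword positions represent each word. The following sketch selects the first subword position per word from that list; unlike the processing function below, it keeps the `[CLS]` and `[SEP]` positions in the list and skips them through their `None` entries.

```python
# word_ids as printed above: None marks [CLS] and [SEP]
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, None]

seen = set()
first_subword_positions = []
for position, word_id in enumerate(word_ids):
    # Keep a position only the first time its word id appears
    if word_id is not None and word_id not in seen:
        first_subword_positions.append(position)
        seen.add(word_id)

print(first_subword_positions)
```

Positions 7 and 8 (`##gua`, `##zu`) are dropped, leaving seven positions for the seven original words.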


#### Sentence processing function

In [29]:
## Function to process one sentence
def process_sentence(sentence, label_to_i, tokenizer, encoder, clf_head, encoder_device, clf_head_device):

    # Prepare inputs
    # Unpack the sentence tuple into list of words and a list of NER tags
    tokens, labels = sentence

    # Prepare the gold label tensors using the numerical indices mapping onto NER tags and move to clf_head device
    gold_labels = torch.tensor([label_to_i[label] for label in labels]).to(clf_head_device)
    # Process the pretokenised words using the BERT tokeniser and return tensors
    tokenization = tokenizer(tokens, is_split_into_words=True,return_tensors='pt')
    # create a dict with tokeniser outputs to later feed into the BERT encoder
    inputs = {k: v.to(encoder_device) for k, v in tokenization.items()}

    # Create BERT token embedding tensors
    # pass the inputs to the BERT encoder and get the last hidden state (token embeddings), disregard CLS and SEP tokens
    outputs = encoder(**inputs).last_hidden_state[0, 1:-1, :]

    # Tokens
    # Get the mapping between original and BERT tokens (minus CLS and SEP tokens)
    word_ids = tokenization.word_ids()[1:-1]
    # create an empty set to track which original tokens have been processed - to ensure we don't add more than one per word
    processed_words = set()
    # create an empty list to store one embedding per token
    first_subword_embeddings = []

    # Combine subword embeddings into word embeddings by extracting embeddings of first subwords
    # Only take the first subword for each original token
    # Loop through BERT token positions and corresponding original tokens
    for i, word_id in enumerate(word_ids):
        if word_id is not None and word_id not in processed_words:
            # Add the BERT first subword embedding to the list
            first_subword_embeddings.append(outputs[i])
            # Add to the set of processed words
            processed_words.add(word_id)

    # Check we have one token embedding per original token
    assert len(first_subword_embeddings) == len(gold_labels)

    # Stack each token tensor (each row corresponds to one token embedding, column is dimension - 768 in BERT)
    # This way the classification head will output one prediction (NER tag per token)
    clf_head_inputs = torch.vstack(first_subword_embeddings).to(clf_head_device)

    # Pass the embedding through the classification head to get predictions (logits)
    # Return the prediction logits and gold labels - to compute the loss during training
    return clf_head(clf_head_inputs), gold_labels

### Training Functions

#### Batch processing function

In [30]:
## Define batch processing function
def process_ner_batch(batch_sentences, label_to_i, tokeniser, encoder, clf_head, encoder_device, clf_head_device, loss_fn, optimizer):

    # Reset gradients not to accumulate over batches
    optimizer.zero_grad()

    # initialize loss for sentences in this batch
    batch_loss = 0

    # Iterate through each sentence in the batch (each sentence is a tuple of tokens and labels)
    for sentence in batch_sentences:

        ## Forward pass
        # take a sentence and pass through the BERT encoder and the classification head
        # return logits (prediction scores) and gold labels (tensor)
        logits, gold_labels = process_sentence(
            sentence, label_to_i, tokeniser,
            encoder, clf_head, encoder_device,
            clf_head_device)

        # Calculate loss for this sentence
        loss = loss_fn(logits, gold_labels)
        # Accumulate loss across the sentences in the batch
        batch_loss += loss

    # Calculate average loss per sentence
    avg_batch_loss = batch_loss / len(batch_sentences)

    ## Backpropagate
    # compute gradients
    avg_batch_loss.backward()
    # update model parameters
    optimizer.step()

    # Return average loss over the batch (tensor with gradient information)
    return avg_batch_loss

#### Shuffle data function

In [31]:
## Define a function to drop empty entries and shuffle the training data to avoid ordering bias
# Takes the data from data_dict
def prepare_ner_data(data_dict):

    # Extract dataset splits from data_dict, dropping entries with no tokens or no labels
    train_data = [sentence for sentence in data_dict['en_ewt']['train'] if len(sentence[0]) > 0 and len(sentence[1]) > 0]
    dev_data = [sentence for sentence in data_dict['en_ewt']['dev'] if len(sentence[0]) > 0 and len(sentence[1]) > 0]
    test_data = [sentence for sentence in data_dict['en_ewt']['test'] if len(sentence[0]) > 0 and len(sentence[1]) > 0]
    ood_test_data = [sentence for sentence in data_dict['en_pud']['test'] if len(sentence[0]) > 0 and len(sentence[1]) > 0]

    # Shuffle training data (not shuffling others)
    shuffle(train_data)

    # Return a dict with the splits where training data is shuffled
    return {
        'train': train_data,
        'dev': dev_data,
        'test': test_data,
        'ood_test': ood_test_data
    }

#### Train epoch function

In [32]:
## Define a function to train the model for one epoch using batches
def train_epoch(train_data, label_to_i, tokeniser, encoder, clf_head, optimizer, loss_fn,  device_encoder, device_clf_head, batch_size):

    # Set encoder and classification head to training mode (enables dropout)
    encoder.train()
    clf_head.train()

    # Calculate the number of batch steps needed (number of inputs divided by the batch size rounded upwards)
    n_steps = ceil(len(train_data) / batch_size)

    # create a tensor to store losses
    epoch_losses = torch.empty(n_steps)

    # Step over batches with the progress bar (tqdm shows live during training)
    # step_n is current batch number
    for step_n in tqdm(range(n_steps), leave=False, desc='Train'):
        # calculate the lower index for the batch
        # multiplying current batch number by batch_size gives starting index for current batch
        lo = step_n * batch_size
        # calculate the upper index for the batch (where batch should end)
        #  min makes sure that last batches, if smaller, can be handled
        hi = min(lo + batch_size, len(train_data))
        # extract the batch
        batch = train_data[lo:hi]

        # Gradients are reset inside process_ner_batch, so no extra zero_grad() call is needed here

        # Process the batch and get loss
        batch_loss = process_ner_batch(
            batch, label_to_i, tokeniser, encoder, clf_head,
            device_encoder, device_clf_head, loss_fn, optimizer
        )


        # Store the loss
        epoch_losses[step_n] = batch_loss.item()

    # Return average loss for the epoch
    return epoch_losses.mean().item()
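The batch index arithmetic used in the loop can be checked in isolation. This sketch uses ten integers as stand-ins for training sentences and a batch size of 4, both chosen only for illustration.

```python
from math import ceil

items = list(range(10))  # stand-in for ten training sentences
batch_size = 4

# Round up so that a final, smaller batch is still processed
n_steps = ceil(len(items) / batch_size)

batches = []
for step_n in range(n_steps):
    lo = step_n * batch_size
    hi = min(lo + batch_size, len(items))
    batches.append(items[lo:hi])

print([len(batch) for batch in batches])
```

Every item lands in exactly one batch, and only the last batch can be shorter than `batch_size`.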

### Evaluation functions


Measure the performance on two test sets (in- and out-of-domain)

#### Span extraction function

In [33]:
## Define a function to extract entity spans from tags
def extract_spans(tokens, labels, debug=False):
    spans = []
    current_span = None

    # Guard against non-string labels (these previously raised AttributeError): cast each label to str, mapping None to 'O'
    labels = [str(label) if label is not None else 'O' for label in labels]
    if debug:
        print("\nExtract spans debug:")
        print("Tokens:", tokens[:20])  # Show first 20 tokens
        print("Labels:", labels[:20])  # Show first 20 labels

    for i, (token, label) in enumerate(zip(tokens, labels)):
        if label.startswith('B-'):
            if debug:
                print(f"Found B tag at position {i}: {token} - {label}")

            # End any current span
            if current_span is not None:
                spans.append(current_span)
                if debug:
                    print(f"Ending previous span: {current_span}")

            # Start a new span
            entity_type = label[2:]
            current_span = (entity_type, i, i)
            if debug:
                print(f"Starting new span: {current_span}")

        elif label.startswith('I-'):
            if debug:
                print(f"Found I tag at position {i}: {token} - {label}")

            # Continue current span if entity type matches
            if current_span is not None and current_span[0] == label[2:]:
                current_span = (current_span[0], current_span[1], i)
                if debug:
                    print(f"Extended span: {current_span}")
            # Otherwise ignore

        elif label == 'B':  #  simplified tagset
            if debug:
                print(f"Found B tag at position {i}: {token} - {label}")

            if current_span is not None:
                spans.append(current_span)
                if debug:
                    print(f"Ending previous span: {current_span}")

            current_span = ('B', i, i)
            if debug:
                print(f"Starting new span: {current_span}")

        elif label == 'I':  #  simplified tagset
            if debug:
                print(f"Found I tag at position {i}: {token} - {label}")

            if current_span is not None:
                current_span = (current_span[0], current_span[1], i)
                if debug:
                    print(f"Extended span: {current_span}")

        else:  # 'O' tag
            if debug and i < 20:  # Only debug first 20 tokens
                print(f"Found O tag at position {i}: {token} - {label}")

            # End any current span
            if current_span is not None:
                spans.append(current_span)
                if debug:
                    print(f"Ending span at O tag: {current_span}")
                current_span = None

    # Add the last span if there is one
    if current_span is not None:
        spans.append(current_span)
        if debug:
            print(f"Adding final span: {current_span}")

    if debug:
        print(f"Extracted {len(spans)} spans: {spans}")

    return spans
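To show what the function returns, here is a condensed sketch of the same span logic for typed tags only. `bio_spans` is a hypothetical, shortened re-implementation written for this example (no debug output, no simplified tagset), not a replacement for the function above.

```python
def bio_spans(labels):
    """Condensed sketch of BIO span extraction for typed tags only."""
    spans, current = [], None
    for i, label in enumerate(labels):
        if label.startswith('B-'):
            # A new entity starts: close any open span first
            if current is not None:
                spans.append(current)
            current = (label[2:], i, i)
        elif (label.startswith('I-') and current is not None
              and current[0] == label[2:]):
            # Extend the open span when the entity type matches
            current = (current[0], current[1], i)
        elif label == 'O':
            if current is not None:
                spans.append(current)
            current = None
        # A stray or mismatched I- tag is ignored, as in the function above
    if current is not None:
        spans.append(current)
    return spans


example = ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'I-ORG']
print(bio_spans(example))
```

Each span is a tuple of entity type, start index and inclusive end index; the trailing `I-ORG` does not match the open `LOC` span and is therefore not added to it.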

#### Metric calculation function

In [34]:
def calculate_span_metrics(pred_labels, true_labels, tokens):

    # Extract spans (entity_type, start and end indices)
    true_spans = extract_spans(tokens, true_labels)
    pred_spans = extract_spans(tokens, pred_labels)

    # For unlabeled matching, only consider boundaries
    true_spans_unlabeled = [(span[1], span[2]) for span in true_spans]
    pred_spans_unlabeled = [(span[1], span[2]) for span in pred_spans]

    # Count matches
    labeled_matches = sum(1 for span in pred_spans if span in true_spans)
    unlabeled_matches = sum(1 for span in pred_spans_unlabeled if span in true_spans_unlabeled)

    # Calculate span matching scores
    labeled_match_score = labeled_matches / len(true_spans) if true_spans else 0
    unlabeled_match_score = unlabeled_matches / len(true_spans_unlabeled) if true_spans_unlabeled else 0

    # Group spans by entity
    true_spans_by_type = {}
    pred_spans_by_type = {}

    for span in true_spans:
        entity_type = span[0]
        if entity_type not in true_spans_by_type:
            true_spans_by_type[entity_type] = []
        true_spans_by_type[entity_type].append(span)

    for span in pred_spans:
        entity_type = span[0]
        if entity_type not in pred_spans_by_type:
            pred_spans_by_type[entity_type] = []
        pred_spans_by_type[entity_type].append(span)

    # Calculate F1 scores for each entity type
    all_entity_types = set(true_spans_by_type.keys()) | set(pred_spans_by_type.keys())
    f1_scores = {}

    for entity_type in all_entity_types:
        true_type_spans = true_spans_by_type.get(entity_type, [])
        pred_type_spans = pred_spans_by_type.get(entity_type, [])

        # Count matches for this entity type
        matches = sum(1 for span in pred_type_spans if span in true_type_spans)

        # Calculate precision, recall, F1
        precision = matches / len(pred_type_spans) if pred_type_spans else 0
        recall = matches / len(true_type_spans) if true_type_spans else 0
        f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        f1_scores[entity_type] = {
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

    # Calculate macro average F1
    if f1_scores:
        macro_avg = {
            'precision': sum(scores['precision'] for scores in f1_scores.values()) / len(f1_scores),
            'recall': sum(scores['recall'] for scores in f1_scores.values()) / len(f1_scores),
            'f1': sum(scores['f1'] for scores in f1_scores.values()) / len(f1_scores)
        }
        f1_scores['macro_avg'] = macro_avg

    # Span matching scores and span-based P/R/F1 scores
    return {
        'span_metrics': {
            'labeled_match_score': labeled_match_score,
            'unlabeled_match_score': unlabeled_match_score,
            'true_spans_count': len(true_spans),
            'pred_spans_count': len(pred_spans),
            'labeled_matches': labeled_matches,
            'unlabeled_matches': unlabeled_matches
        },
        'f1_scores': f1_scores
    }
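The difference between labeled and unlabeled span matching is easiest to see on a tiny case. The spans below are made up for illustration: the second predicted span has the right boundaries but the wrong entity type.

```python
true_spans = [('PER', 0, 1), ('LOC', 3, 4)]
pred_spans = [('PER', 0, 1), ('ORG', 3, 4)]  # second span: right boundaries, wrong type

# Labeled matching requires entity type and both boundaries to agree
labeled_matches = sum(1 for span in pred_spans if span in true_spans)

# Unlabeled matching compares boundaries only
true_bounds = [(start, end) for _, start, end in true_spans]
pred_bounds = [(start, end) for _, start, end in pred_spans]
unlabeled_matches = sum(1 for bounds in pred_bounds if bounds in true_bounds)

labeled_score = labeled_matches / len(true_spans)
unlabeled_score = unlabeled_matches / len(true_spans)
print(labeled_score, unlabeled_score)
```

Both scores are divided by the number of true spans, so they behave like recall over gold entities; the unlabeled score can never be lower than the labeled one.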

#### Validation function

In [35]:
# Define a function to evaluate NER model performance
# takes dev_data (a list of sentence tuples: tokens, labels)
def validate_ner(dev_data, label_to_i, tokeniser, encoder, clf_head, device_encoder, device_clf_head, batch_size, i_to_label=None, unique_labels=None):

    # Set model to evaluation mode
    encoder.eval()

    # Calculate number of batches
    n_steps = ceil(len(dev_data) / batch_size)

    # For token-level accuracy
    all_token_correct = 0
    all_token_total = 0

    # For aggregating span metrics across all sentences
    all_labeled_matches = 0
    all_unlabeled_matches = 0
    all_true_spans = 0

    # For entity-type specific metrics
    entity_metrics = {}

    # Process each batch
    for step_n in tqdm(range(n_steps), leave=False, desc='Validate'):
        # Calculate batch indices
        lo = step_n * batch_size
        hi = min(lo + batch_size, len(dev_data))
        batch = dev_data[lo:hi]

        # Process each sentence in the batch
        with torch.no_grad():  # Disable gradient calculation during validation
            for sentence in batch:
                # Process sentence through model
                logits, gold_labels = process_sentence(
                    sentence, label_to_i, tokeniser,
                    encoder, clf_head, device_encoder,
                    device_clf_head)

                # Skip sentences that returned empty tensors
                if logits.shape[0] == 0 or gold_labels.shape[0] == 0:
                    print(f"Skipping empty result for sentence (batch {step_n})")
                    continue

                # Get predictions
                predicted_labels = logits.argmax(dim=-1)

                # Calculate token-level accuracy
                correct = (predicted_labels == gold_labels).sum().item()
                total = len(gold_labels)

                all_token_correct += correct
                all_token_total += total

                # For span-based metrics
                if i_to_label is not None:
                    # Convert indices to label strings
                    pred_labels = [i_to_label[idx.item()] for idx in predicted_labels]
                    true_labels = [i_to_label[idx.item()] for idx in gold_labels]

                    # Calculate span metrics
                    span_results = calculate_span_metrics(pred_labels, true_labels, sentence[0])

                    # Aggregate span matching metrics
                    metrics = span_results['span_metrics']
                    all_labeled_matches += metrics['labeled_matches']
                    all_unlabeled_matches += metrics['unlabeled_matches']
                    all_true_spans += metrics['true_spans_count']

                    # Aggregate entity-type metrics
                    for entity_type, scores in span_results['f1_scores'].items():
                        if entity_type != 'macro_avg':
                            if entity_type not in entity_metrics:
                                entity_metrics[entity_type] = {
                                    'true_positives': 0,
                                    'false_positives': 0,
                                    'false_negatives': 0
                                }

                            # We need to calculate TP, FP, FN from the spans directly
                            true_type_spans = set(span for span in extract_spans(sentence[0], true_labels) if span[0] == entity_type)
                            pred_type_spans = set(span for span in extract_spans(sentence[0], pred_labels) if span[0] == entity_type)

                            # Calculate metrics
                            tp = len(true_type_spans & pred_type_spans)  # Intersection
                            fp = len(pred_type_spans - true_type_spans)  # In pred but not in true
                            fn = len(true_type_spans - pred_type_spans)  # In true but not in pred

                            # Accumulate
                            entity_metrics[entity_type]['true_positives'] += tp
                            entity_metrics[entity_type]['false_positives'] += fp
                            entity_metrics[entity_type]['false_negatives'] += fn

    # Calculate overall token accuracy
    token_accuracy = all_token_correct / all_token_total if all_token_total > 0 else 0

    # Calculate final span matching scores
    span_match_scores = {
        'labeled_span_match': all_labeled_matches / all_true_spans if all_true_spans > 0 else 0,
        'unlabeled_span_match': all_unlabeled_matches / all_true_spans if all_true_spans > 0 else 0
    }

    # Calculate final F1 scores by entity type
    f1_scores = {}
    for entity_type, metrics in entity_metrics.items():
        tp = metrics['true_positives']
        fp = metrics['false_positives']
        fn = metrics['false_negatives']

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        f1_scores[entity_type] = {
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

    # Calculate macro averages
    if f1_scores:
        f1_scores['macro_avg'] = {
            'precision': sum(scores['precision'] for scores in f1_scores.values()) / len(f1_scores),
            'recall': sum(scores['recall'] for scores in f1_scores.values()) / len(f1_scores),
            'f1': sum(scores['f1'] for scores in f1_scores.values()) / len(f1_scores)
        }

    # Combine all metrics
    results = {
        'token_accuracy': token_accuracy,
        'span_match_scores': span_match_scores,
        'f1_scores': f1_scores
    }

    return results
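The function pools true positives, false positives and false negatives over all sentences before computing precision, recall and F1 for each entity type, and then averages across types. The sketch below repeats that final step with made-up counts; the helper name `prf` and the numbers are assumptions for illustration only.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from pooled counts, with zero-division guards."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0)
    return precision, recall, f1


# Made-up counts pooled over a whole evaluation set
counts = {
    'PER': {'tp': 8, 'fp': 2, 'fn': 2},
    'LOC': {'tp': 3, 'fp': 1, 'fn': 3},
}

scores = {etype: prf(c['tp'], c['fp'], c['fn']) for etype, c in counts.items()}

# Macro average: every entity type counts equally, regardless of frequency
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
print(scores, macro_f1)
```

Pooling counts first avoids averaging per-sentence scores, which would give short sentences with a single entity as much weight as long ones with many.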

### Model Initialization and Training

Initialize the model and specify parameters



The encoder is the pre-trained BERT base model, [`google-bert/bert-base-uncased`](https://huggingface.co/google-bert/bert-base-uncased), which does not distinguish between upper- and lower-case letters. The matching tokeniser is also loaded from Hugging Face.

#### Define Global Parameters

In [36]:
# Define training parameters (uppercase for constant values)
N_EPOCHS = 8
BATCH_SIZE = 32
LEARNING_RATE = 1e-5  # Learning rate of 0.00001, same as 10**(-5)

# Check CUDA availability and set devices
if torch.cuda.is_available():
    DEVICE_ENCODER = "cuda"
    DEVICE_CLF_HEAD = "cuda"
    print("Using GPU")
else:
    DEVICE_ENCODER = "cpu"
    DEVICE_CLF_HEAD = "cpu"
    print("Using CPU")

Using GPU


In [37]:
# Dictionary to store results
all_results = {
    "single_layer": {"test": None, "ood_test": None},
    "two_layer": {"test": None, "ood_test": None},
    "two_layer_dropout": {"test": None, "ood_test": None},
    "simplified": {"test": None, "ood_test": None},
    "llm": {"test": None, "ood_test": None}  # Add LLM results too
}

# Prepare data splits
data_splits = prepare_ner_data(data_dict)
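
Once the test cells further down have filled `all_results`, the nested dictionary can be flattened into one row per model and split, for example to save it as a CSV file on Drive. The sketch below assumes the result structure returned by the evaluation function above (`token_accuracy`, `span_match_scores`, `f1_scores['macro_avg']`). The helper name `flatten_results` is hypothetical, and the example values are the single-layer in-domain test scores reported later in this notebook, as printed.

```python
import csv
import io

def flatten_results(all_results):
    """Turn {model: {split: results_or_None}} into a list of flat rows."""
    rows = []
    for model_name, splits in all_results.items():
        for split_name, res in splits.items():
            if res is None:  # model not evaluated on this split yet
                continue
            macro = res["f1_scores"]["macro_avg"]
            spans = res["span_match_scores"]
            rows.append({
                "model": model_name,
                "split": split_name,
                "token_accuracy": res["token_accuracy"],
                "labeled_span_match": spans["labeled_span_match"],
                "unlabeled_span_match": spans["unlabeled_span_match"],
                "precision": macro["precision"],
                "recall": macro["recall"],
                "f1": macro["f1"],
            })
    return rows

example_results = {
    "single_layer": {
        "test": {
            "token_accuracy": 0.9825,
            "span_match_scores": {"labeled_span_match": 0.8438, "unlabeled_span_match": 0.8768},
            "f1_scores": {"macro_avg": {"precision": 0.8160, "recall": 0.8301, "f1": 0.8226}},
        },
        "ood_test": None,
    },
}

rows = flatten_results(example_results)
buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Splits that are still `None` are skipped, so the same function can be called at any point while the results are being collected.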

The training loops for the three classification heads follow. Each loop runs for at most `N_EPOCHS` epochs and uses early stopping to prevent overfitting: training stops if the macro F1 score on the dev set has not improved for `early_stopping_threshold` epochs.
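
The early-stopping rule can be sketched independently of the model. The function below is a simplified illustration, not the notebook's training loop: it takes a list of per-epoch dev F1 scores, uses a hypothetical `patience` parameter in place of `early_stopping_threshold`, and counts epochs without improvement directly instead of subtracting epoch indices. The first list contains the single-layer dev F-scores logged further down; the second list is made up to show a stop.

```python
def simulate_early_stopping(dev_f1_per_epoch, patience=3):
    """Return (best_epoch, best_f1, last_epoch_run), with epochs numbered from 1."""
    best_f1 = 0.0
    best_epoch = None
    epochs_without_improvement = 0
    last_epoch_run = 0
    for epoch, f1 in enumerate(dev_f1_per_epoch, start=1):
        last_epoch_run = epoch
        if f1 > best_f1:
            # New best score: remember it and reset the counter
            best_f1, best_epoch = f1, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping
    return best_epoch, best_f1, last_epoch_run

logged = [0.6645, 0.7641, 0.7766, 0.7701, 0.7765, 0.7880, 0.7945, 0.7995]
print(simulate_early_stopping(logged))   # (8, 0.7995, 8)

plateau = [0.50, 0.60, 0.59, 0.58, 0.57, 0.90]
print(simulate_early_stopping(plateau))  # (2, 0.6, 5)
```

With the logged scores the threshold is never reached, which is consistent with all 8 epochs running. In the made-up case, training stops after epoch 5 and never sees the later improvement, which is the usual trade-off of a small patience value.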

#### Training with the Single Layer Classification Head

In [None]:
# Start training with the single layer classification head
print("\n=== Training with Single Layer Classification Head ===\n")

# Initialize the BERT encoder and move to the selected device
encoder = AutoModel.from_pretrained(model_tag).to(DEVICE_ENCODER)

# Initialize the classification head for NER and move to selected device (with n_classes set to the number of NER tags)
clf_head = ClassificationHead(n_classes=len(unique_labels)).to(DEVICE_CLF_HEAD)

# Create optimizer - combining parameters from encoder and classification head
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(clf_head.parameters()),
    lr=LEARNING_RATE
)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Early stopping parameters
best_f1 = 0.0  # Highest dev-set macro F1 seen so far
# Index of the most recent epoch in which F1 improved
last_epoch_with_dev_improvement = 0
# Number of epochs since the last F1 improvement (reset to 0 on improvement)
n_epochs_without_improvement = 0
# Number of epochs without improvement to tolerate before stopping training
early_stopping_threshold = 3

## Training loop
for epoch_n in tqdm(range(N_EPOCHS)):
    # Train for one epoch
    train_loss = train_epoch(
        data_splits['train'], label_to_i, tokeniser,
        encoder, clf_head, optimizer, loss_fn,
        DEVICE_ENCODER, DEVICE_CLF_HEAD, BATCH_SIZE
    )
    print(f'Epoch {epoch_n+1}/{N_EPOCHS} train loss: {train_loss:.4f}')

    # Validate
    results = validate_ner(
        data_splits['dev'], label_to_i, tokeniser,
        encoder, clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
        BATCH_SIZE, i_to_label, unique_labels
    )

    # Print results
    if 'f1_scores' in results and 'macro_avg' in results['f1_scores']:
        macro = results['f1_scores']['macro_avg']
        span_scores = results['span_match_scores']

        print(f"\nModel performance on dev set - Single Layer Classification Head")
        print(f"Epoch: {epoch_n+1}/{N_EPOCHS}")
        print(f"Token Accuracy: {results['token_accuracy']:.4f}")
        print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
        print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
        print(f"Precision: {macro['precision']:.4f}")
        print(f"Recall: {macro['recall']:.4f}")
        print(f"F-score: {macro['f1']:.4f}")

        # Save model if F1 improves
        if macro['f1'] > best_f1:
            # Current F1 is the best so far: update the best value
            best_f1 = macro['f1']
            # Record the current epoch as the most recent epoch with improvement
            last_epoch_with_dev_improvement = epoch_n
            # Reset the counter for epochs without improvement
            n_epochs_without_improvement = 0
            # Save a checkpoint every time F1 improves on the dev set
            print('Saving the model with the new best F1 score')
            torch.save(encoder.state_dict(), 'single_layer_ner_encoder.pt')
            torch.save(clf_head.state_dict(), 'single_layer_ner_clf_head.pt')
        # Current F1 is not better than the previous best
        else:
            # Calculate how many epochs have passed without improvement
            n_epochs_without_improvement = epoch_n - last_epoch_with_dev_improvement
            print(f"No improvement for {n_epochs_without_improvement} epochs. Best F1: {best_f1:.4f}")

            # Early stopping: stop once early_stopping_threshold (3) epochs pass without improvement
            if n_epochs_without_improvement >= early_stopping_threshold:
                print(f"Early stopping after {n_epochs_without_improvement} epochs without improvement")
                break


=== Training with Single Layer Classification Head ===



Epoch 1/8 train loss: 0.2057


Model performance on dev set - Single Layer Classification Head
Epoch: 1/8
Token Accuracy: 0.9770
Labeled Span Match: 0.6998
Unlabeled Span Match: 0.7360
Precision: 0.6693
Recall: 0.6679
F-score: 0.6645
Saving the model with the new best F1 score


Epoch 2/8 train loss: 0.0527


Model performance on dev set - Single Layer Classification Head
Epoch: 2/8
Token Accuracy: 0.9822
Labeled Span Match: 0.8085
Unlabeled Span Match: 0.8468
Precision: 0.7390
Recall: 0.7912
F-score: 0.7641
Saving the model with the new best F1 score


Epoch 3/8 train loss: 0.0323


Model performance on dev set - Single Layer Classification Head
Epoch: 3/8
Token Accuracy: 0.9833
Labeled Span Match: 0.8002
Unlabeled Span Match: 0.8333
Precision: 0.7787
Recall: 0.7754
F-score: 0.7766
Saving the model with the new best F1 score


Epoch 4/8 train loss: 0.0225


Model performance on dev set - Single Layer Classification Head
Epoch: 4/8
Token Accuracy: 0.9829
Labeled Span Match: 0.8002
Unlabeled Span Match: 0.8364
Precision: 0.7734
Recall: 0.7702
F-score: 0.7701
No improvement for 1 epochs. Best F1: 0.7766


Epoch 5/8 train loss: 0.0162


Model performance on dev set - Single Layer Classification Head
Epoch: 5/8
Token Accuracy: 0.9835
Labeled Span Match: 0.8116
Unlabeled Span Match: 0.8489
Precision: 0.7684
Recall: 0.7862
F-score: 0.7765
No improvement for 2 epochs. Best F1: 0.7766


Epoch 6/8 train loss: 0.0115


Model performance on dev set - Single Layer Classification Head
Epoch: 6/8
Token Accuracy: 0.9840
Labeled Span Match: 0.8230
Unlabeled Span Match: 0.8634
Precision: 0.7731
Recall: 0.8044
F-score: 0.7880
Saving the model with the new best F1 score


Epoch 7/8 train loss: 0.0083


Model performance on dev set - Single Layer Classification Head
Epoch: 7/8
Token Accuracy: 0.9843
Labeled Span Match: 0.8375
Unlabeled Span Match: 0.8737
Precision: 0.7724
Recall: 0.8181
F-score: 0.7945
Saving the model with the new best F1 score


Epoch 8/8 train loss: 0.0066


Model performance on dev set - Single Layer Classification Head
Epoch: 8/8
Token Accuracy: 0.9842
Labeled Span Match: 0.8313
Unlabeled Span Match: 0.8706
Precision: 0.7870
Recall: 0.8132
F-score: 0.7995
Saving the model with the new best F1 score


Check that the model checkpoints were saved in the Colab working directory.

In [None]:
!pwd

/content


In [None]:
!ls -la

total 437740
drwxr-xr-x 1 root root      4096 May 15 19:40 .
drwxr-xr-x 1 root root      4096 May 15 16:50 ..
drwxr-xr-x 4 root root      4096 May 14 13:38 .config
drwx------ 7 root root      4096 May 15 19:40 drive
-rw-r--r-- 1 root root  10184254 May 15 18:49 ner_data_dict.json
drwxr-xr-x 1 root root      4096 May 14 13:38 sample_data
-rw-r--r-- 1 root root     23236 May 15 19:39 single_layer_ner_clf_head.pt
-rw-r--r-- 1 root root 438011295 May 15 19:39 single_layer_ner_encoder.pt
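
The listing shows the checkpoints in `/content`, which does not persist once the Colab runtime is recycled, whereas the test cells below load them from `/content/drive/MyDrive/ner_models/`. The copy step is not shown in the notebook; one way to do it is sketched here. The helper `copy_checkpoints` is hypothetical, and the demonstration uses temporary directories with empty placeholder files so that it runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path

def copy_checkpoints(src_dir, dst_dir, pattern="*.pt"):
    """Copy every file matching `pattern` from src_dir to dst_dir; return the copied names."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    copied = []
    for ckpt in sorted(src_dir.glob(pattern)):
        shutil.copy2(ckpt, dst_dir / ckpt.name)
        copied.append(ckpt.name)
    return copied

# Demonstration with temporary directories and empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp) / "content"
    drive_dir = Path(tmp) / "drive" / "ner_models"
    work_dir.mkdir()
    for name in ("single_layer_ner_encoder.pt", "single_layer_ner_clf_head.pt", "ner_data_dict.json"):
        (work_dir / name).touch()
    copied_names = copy_checkpoints(work_dir, drive_dir)
    print(copied_names)

# In Colab this could be called with the paths used in this notebook:
# copy_checkpoints("/content", "/content/drive/MyDrive/ner_models")
```

Only files ending in `.pt` are copied, so the JSON data file in the same directory is left alone.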


#### Training with the Two-Layer Classification Head

In [None]:
# Initialize the BERT encoder and move to the selected device
encoder = AutoModel.from_pretrained(model_tag).to(DEVICE_ENCODER)

# Initialize the classification head for NER and move to selected device (with n_classes set to the number of NER tags)
clf_head = TwoLayerClassificationHead(n_classes=len(unique_labels)).to(DEVICE_CLF_HEAD)

# Create optimizer - combining parameters from encoder and classification head
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(clf_head.parameters()),
    lr=LEARNING_RATE
)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Early stopping parameters
best_f1 = 0.0
last_epoch_with_dev_improvement = 0
n_epochs_without_improvement = 0
early_stopping_threshold = 3

## Training loop
print("Training with Two-Layer Classification Head")

for epoch_n in tqdm(range(N_EPOCHS)):
    # Train for one epoch
    train_loss = train_epoch(
        data_splits['train'], label_to_i, tokeniser,
        encoder, clf_head, optimizer, loss_fn,
        DEVICE_ENCODER, DEVICE_CLF_HEAD, BATCH_SIZE
    )
    print(f'Epoch {epoch_n+1}/{N_EPOCHS} train loss: {train_loss:.4f}')

    # Validate
    results = validate_ner(
        data_splits['dev'], label_to_i, tokeniser,
        encoder, clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
        BATCH_SIZE, i_to_label, unique_labels
    )

    # Print results
    if 'f1_scores' in results and 'macro_avg' in results['f1_scores']:
        macro = results['f1_scores']['macro_avg']
        span_scores = results['span_match_scores']

        print(f"\nModel performance on dev set - Two-Layer Classification Head")
        print(f"Epoch: {epoch_n+1}/{N_EPOCHS}")
        print(f"Token Accuracy: {results['token_accuracy']:.4f}")
        print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
        print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
        print(f"Precision: {macro['precision']:.4f}")
        print(f"Recall: {macro['recall']:.4f}")
        print(f"F-score: {macro['f1']:.4f}")

        # Save model if F1 improves
        if macro['f1'] > best_f1:
            best_f1 = macro['f1']
            last_epoch_with_dev_improvement = epoch_n
            n_epochs_without_improvement = 0
            print('Saving the model, new best F1 score')
            torch.save(encoder.state_dict(), 'two_layer_ner_encoder.pt')
            torch.save(clf_head.state_dict(), 'two_layer_ner_clf_head.pt')
        else:
            n_epochs_without_improvement = epoch_n - last_epoch_with_dev_improvement
            print(f"No improvement for {n_epochs_without_improvement} epochs. Best F1: {best_f1:.4f}")

            # Check if threshold reached
            if n_epochs_without_improvement >= early_stopping_threshold:
                print(f"Early stopping after {n_epochs_without_improvement} epochs without improvement")
                break


=== Training with Two-Layer Classification Head ===



Epoch 1/8 train loss: 0.28


Model performance on dev set - Two-Layer Classification Head
Epoch: 1/8
Token Accuracy: 0.97
Labeled Span Match: 0.57
Unlabeled Span Match: 0.62
Precision: 0.51
Recall: 0.51
F-score: 0.47
Saving the model - new best F1 score!


Epoch 2/8 train loss: 0.08


Model performance on dev set - Two-Layer Classification Head
Epoch: 2/8
Token Accuracy: 0.98
Labeled Span Match: 0.77
Unlabeled Span Match: 0.82
Precision: 0.70
Recall: 0.74
F-score: 0.72
Saving the model - new best F1 score!


Epoch 3/8 train loss: 0.04


Model performance on dev set - Two-Layer Classification Head
Epoch: 3/8
Token Accuracy: 0.98
Labeled Span Match: 0.83
Unlabeled Span Match: 0.87
Precision: 0.73
Recall: 0.81
F-score: 0.77
Saving the model - new best F1 score!


Epoch 4/8 train loss: 0.03


Model performance on dev set - Two-Layer Classification Head
Epoch: 4/8
Token Accuracy: 0.98
Labeled Span Match: 0.83
Unlabeled Span Match: 0.87
Precision: 0.76
Recall: 0.81
F-score: 0.79
Saving the model - new best F1 score!


Epoch 5/8 train loss: 0.02


Model performance on dev set - Two-Layer Classification Head
Epoch: 5/8
Token Accuracy: 0.98
Labeled Span Match: 0.84
Unlabeled Span Match: 0.88
Precision: 0.77
Recall: 0.83
F-score: 0.80
Saving the model - new best F1 score!


Epoch 6/8 train loss: 0.02


Model performance on dev set - Two-Layer Classification Head
Epoch: 6/8
Token Accuracy: 0.98
Labeled Span Match: 0.84
Unlabeled Span Match: 0.88
Precision: 0.77
Recall: 0.83
F-score: 0.80
No improvement for 1 epochs. Best F1: 0.80


Epoch 7/8 train loss: 0.01


Model performance on dev set - Two-Layer Classification Head
Epoch: 7/8
Token Accuracy: 0.98
Labeled Span Match: 0.85
Unlabeled Span Match: 0.90
Precision: 0.75
Recall: 0.84
F-score: 0.79
No improvement for 2 epochs. Best F1: 0.80


Epoch 8/8 train loss: 0.01


Model performance on dev set - Two-Layer Classification Head
Epoch: 8/8
Token Accuracy: 0.98
Labeled Span Match: 0.83
Unlabeled Span Match: 0.87
Precision: 0.80
Recall: 0.81
F-score: 0.80
Saving the model - new best F1 score!


#### Training with the Two-Layer Classification Head with Dropout

In [39]:
# Initialize the BERT encoder and move to the selected device
encoder = AutoModel.from_pretrained(model_tag).to(DEVICE_ENCODER)

# Initialize the classification head for NER and move to selected device (with n_classes set to the number of NER tags)
clf_head = TwoLayerDropoutClassificationHead(n_classes=len(unique_labels)).to(DEVICE_CLF_HEAD)

# Create optimizer - combining parameters from encoder and classification head
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(clf_head.parameters()),
    lr=LEARNING_RATE
)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Early stopping parameters
best_f1 = 0.0
last_epoch_with_dev_improvement = 0
n_epochs_without_improvement = 0
early_stopping_threshold = 3

## Training loop
print("Training with Two-Layer Dropout Classification Head")

for epoch_n in tqdm(range(N_EPOCHS)):
    # Train for one epoch
    train_loss = train_epoch(
        data_splits['train'], label_to_i, tokeniser,
        encoder, clf_head, optimizer, loss_fn,
        DEVICE_ENCODER, DEVICE_CLF_HEAD, BATCH_SIZE
    )
    print(f'Epoch {epoch_n+1}/{N_EPOCHS} train loss: {train_loss:.4f}')

    # Validate
    results = validate_ner(
        data_splits['dev'], label_to_i, tokeniser,
        encoder, clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
        BATCH_SIZE, i_to_label, unique_labels
    )

    # Print results
    if 'f1_scores' in results and 'macro_avg' in results['f1_scores']:
        macro = results['f1_scores']['macro_avg']
        span_scores = results['span_match_scores']

        print(f"\nModel performance on dev set - Two-Layer Dropout Classification Head")
        print(f"Epoch: {epoch_n+1}/{N_EPOCHS}")
        print(f"Token Accuracy: {results['token_accuracy']:.4f}")
        print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
        print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
        print(f"Precision: {macro['precision']:.4f}")
        print(f"Recall: {macro['recall']:.4f}")
        print(f"F-score: {macro['f1']:.4f}")

        # Save model if F1 improves
        if macro['f1'] > best_f1:
            best_f1 = macro['f1']
            last_epoch_with_dev_improvement = epoch_n
            n_epochs_without_improvement = 0
            print('Saving the model - new best F1 score!')
            torch.save(encoder.state_dict(), 'two_layer_dropout_ner_encoder.pt')
            torch.save(clf_head.state_dict(), 'two_layer_dropout_ner_clf_head.pt')
        else:
            n_epochs_without_improvement = epoch_n - last_epoch_with_dev_improvement
            print(f"No improvement for {n_epochs_without_improvement} epochs. Best F1: {best_f1:.4f}")

            # Check if threshold reached
            if n_epochs_without_improvement >= early_stopping_threshold:
                print(f"Early stopping after {n_epochs_without_improvement} epochs without improvement")
                break


=== Training with Two-Layer Dropout Classification Head ===



Epoch 1/8 train loss: 0.36


Model performance on dev set - Two-Layer Dropout Classification Head
Epoch: 1/8
Token Accuracy: 0.97
Labeled Span Match: 0.55
Unlabeled Span Match: 0.66
Precision: 0.54
Recall: 0.48
F-score: 0.37
Saving the model - new best F1 score!


Epoch 2/8 train loss: 0.10


Model performance on dev set - Two-Layer Dropout Classification Head
Epoch: 2/8
Token Accuracy: 0.97
Labeled Span Match: 0.70
Unlabeled Span Match: 0.75
Precision: 0.51
Recall: 0.67
F-score: 0.58
Saving the model - new best F1 score!


Epoch 3/8 train loss: 0.06


Model performance on dev set - Two-Layer Dropout Classification Head
Epoch: 3/8
Token Accuracy: 0.98
Labeled Span Match: 0.80
Unlabeled Span Match: 0.84
Precision: 0.75
Recall: 0.78
F-score: 0.77
Saving the model - new best F1 score!


Epoch 4/8 train loss: 0.04


Model performance on dev set - Two-Layer Dropout Classification Head
Epoch: 4/8
Token Accuracy: 0.98
Labeled Span Match: 0.83
Unlabeled Span Match: 0.87
Precision: 0.75
Recall: 0.82
F-score: 0.79
Saving the model - new best F1 score!


Epoch 5/8 train loss: 0.03


Model performance on dev set - Two-Layer Dropout Classification Head
Epoch: 5/8
Token Accuracy: 0.98
Labeled Span Match: 0.83
Unlabeled Span Match: 0.87
Precision: 0.75
Recall: 0.81
F-score: 0.78
No improvement for 1 epochs. Best F1: 0.79


Epoch 6/8 train loss: 0.02


Model performance on dev set - Two-Layer Dropout Classification Head
Epoch: 6/8
Token Accuracy: 0.98
Labeled Span Match: 0.84
Unlabeled Span Match: 0.87
Precision: 0.77
Recall: 0.83
F-score: 0.80
Saving the model - new best F1 score!


Epoch 7/8 train loss: 0.02


Model performance on dev set - Two-Layer Dropout Classification Head
Epoch: 7/8
Token Accuracy: 0.98
Labeled Span Match: 0.84
Unlabeled Span Match: 0.87
Precision: 0.77
Recall: 0.83
F-score: 0.79
No improvement for 1 epochs. Best F1: 0.80


Epoch 8/8 train loss: 0.01


Model performance on dev set - Two-Layer Dropout Classification Head
Epoch: 8/8
Token Accuracy: 0.98
Labeled Span Match: 0.85
Unlabeled Span Match: 0.88
Precision: 0.78
Recall: 0.83
F-score: 0.80
Saving the model - new best F1 score!


### Testing

Evaluate all three models on the in-domain and out-of-domain test sets.


In [57]:
# Test Single Layer model
print("Single Layer Classification Head")
# Create new model instances
single_layer_encoder = AutoModel.from_pretrained(model_tag).to(DEVICE_ENCODER)
single_layer_clf_head = ClassificationHead(n_classes=len(unique_labels)).to(DEVICE_CLF_HEAD)

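# Load best checkpoints (map_location lets GPU-saved weights load on a CPU-only runtime)
single_layer_encoder.load_state_dict(torch.load('/content/drive/MyDrive/ner_models/single_layer_ner_encoder.pt', map_location=DEVICE_ENCODER))
single_layer_clf_head.load_state_dict(torch.load('/content/drive/MyDrive/ner_models/single_layer_ner_clf_head.pt', map_location=DEVICE_CLF_HEAD))

# Print a few example predictions; switch both modules to evaluation mode first
single_layer_encoder.eval()
single_layer_clf_head.eval()
examples_shown = 0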

# Process each sentence in test data
with torch.no_grad():  # Disable gradient calculation
    for sentence in data_splits['test']:
        if examples_shown >= 10:
            break

        # Process sentence through model
        logits, gold_labels = process_sentence(
            sentence, label_to_i, tokeniser,
            single_layer_encoder, single_layer_clf_head,
            DEVICE_ENCODER, DEVICE_CLF_HEAD
        )

        # Get predictions
        predicted_labels = logits.argmax(dim=-1)

        # Convert indices to label strings
        pred_labels = [i_to_label[idx.item()] for idx in predicted_labels]
        true_labels = [i_to_label[idx.item()] for idx in gold_labels]
        tokens = sentence[0]

        # Extract spans
        true_spans = extract_spans(tokens, true_labels)

        # Only show examples with entities
        if len(true_spans) > 0:
            examples_shown += 1

            print(' '.join(tokens))

            for token, true_label, pred_label in zip(tokens, true_labels, pred_labels):
                print(f"{token} {true_label} {pred_label}")

            print()


# Test on in-domain test set
print("\nEvaluating Single Layer model on in-domain test set:")
single_layer_test_results = validate_ner(
    data_splits['test'], label_to_i, tokeniser,
    single_layer_encoder, single_layer_clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
    BATCH_SIZE, i_to_label, unique_labels
)

# Print test results
if 'f1_scores' in single_layer_test_results and 'macro_avg' in single_layer_test_results['f1_scores']:
    macro = single_layer_test_results['f1_scores']['macro_avg']
    span_scores = single_layer_test_results['span_match_scores']

    print(f"\nSingle Layer model performance on in-domain test set")
    print(f"Token Accuracy: {single_layer_test_results['token_accuracy']:.4f}")
    print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
    print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
    print(f"Precision: {macro['precision']:.4f}")
    print(f"Recall: {macro['recall']:.4f}")
    print(f"F-score: {macro['f1']:.4f}")

    # Store results in the global results dictionary
    all_results["single_layer"]["test"] = single_layer_test_results

# Test on out-of-domain test set
print("\nEvaluating Single Layer model on out-of-domain test set:")
single_layer_ood_results = validate_ner(
    data_splits['ood_test'], label_to_i, tokeniser,
    single_layer_encoder, single_layer_clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
    BATCH_SIZE, i_to_label, unique_labels
)

# Print OOD test results
if 'f1_scores' in single_layer_ood_results and 'macro_avg' in single_layer_ood_results['f1_scores']:
    macro = single_layer_ood_results['f1_scores']['macro_avg']
    span_scores = single_layer_ood_results['span_match_scores']

    print(f"\nSingle Layer model performance on out-of-domain test set")
    print(f"Token Accuracy: {single_layer_ood_results['token_accuracy']:.4f}")
    print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
    print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
    print(f"Precision: {macro['precision']:.4f}")
    print(f"Recall: {macro['recall']:.4f}")
    print(f"F-score: {macro['f1']:.4f}")

    # Store results in the global results dictionary
    all_results["single_layer"]["ood_test"] = single_layer_ood_results


----- Single Layer Classification Head -----
What is this Miramar ?
What O O
is O O
this O O
Miramar B-LOC B-LOC
? O O

It is a place in Argentina lol
It O O
is O O
a O O
place O O
in O O
Argentina B-LOC B-LOC
lol O O

" In Argentina , beef is revered , respected , and praised .
" O O
In O O
Argentina B-LOC B-LOC
, O O
beef O O
is O O
revered O O
, O O
respected O O
, O O
and O O
praised O O
. O O

A taste of Argentina .
A O O
taste O O
of O O
Argentina B-LOC B-LOC
. O O

What language is talked in Iguazu ?
What O O
language O O
is O O
talked O O
in O O
Iguazu B-LOC B-LOC
? O O

Do you think there are any koreans in Miramar ?
Do O O
you O O
think O O
there O O
are O O
any O O
koreans O O
in O O
Miramar B-LOC B-LOC
? O O

Does anyone know any good restaurants in cordoba ?
Does O O
anyone O O
know O O
any O O
good O O
restaurants O O
in O O
cordoba B-LOC B-LOC
? O O

Does anyone know of any good food in iguazu ?
Does O O
anyone O O
know O O
of O O
any O O
good O O
food O O
in O O
iguazu

Single Layer model performance on in-domain test set
Token Accuracy: 0.9825
Labeled Span Match: 0.8438
Unlabeled Span Match: 0.8768
Precision: 0.8160
Recall: 0.8301
F-score: 0.8226

Evaluating Single Layer model on out-of-domain test set:


Single Layer model performance on out-of-domain test set
Token Accuracy: 0.9789
Labeled Span Match: 0.7619
Unlabeled Span Match: 0.8205
Precision: 0.7816
Recall: 0.7142
F-score: 0.7350
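
Token accuracy stays close to 0.98 on both test sets while the span-level scores are clearly lower. A likely reason is that most tokens are tagged `O`, whereas a span only counts when its boundaries (and, for the labeled score, its type) are right. The sketch below illustrates the difference between labeled and unlabeled span matching on a made-up sentence. `bio_spans` and `span_match` are hypothetical helpers written for this illustration; they are not the notebook's `extract_spans`, and the lenient handling of an `I-` tag without a preceding `B-` is an assumption.

```python
def bio_spans(labels):
    """Return the set of (start, end_exclusive, type) spans encoded by a BIO label sequence."""
    spans = []
    start, ent_type = None, None
    for i, label in enumerate(labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and (start is None or label[2:] != ent_type)
        )
        if starts_new:
            if start is not None:  # close the span that was open
                spans.append((start, i, ent_type))
            start, ent_type = i, label[2:]
        elif label == "O":
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = None, None
        # otherwise: an I- tag continuing the open span, nothing to do
    if start is not None:  # span running to the end of the sentence
        spans.append((start, len(labels), ent_type))
    return set(spans)

def span_match(true_labels, pred_labels):
    """Fraction of gold spans recovered with (labeled) and without (unlabeled) the entity type."""
    true_spans = bio_spans(true_labels)
    pred_spans = bio_spans(pred_labels)
    if not true_spans:
        return 0.0, 0.0
    labeled = len(true_spans & pred_spans)
    pred_bounds = {(s, e) for s, e, _ in pred_spans}
    unlabeled = sum((s, e) in pred_bounds for s, e, _ in true_spans)
    return labeled / len(true_spans), unlabeled / len(true_spans)

gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O", "O"]
labeled_score, unlabeled_score = span_match(gold, pred)
print(labeled_score, unlabeled_score)
```

Here the person span is fully correct, the location span has the right boundaries but the wrong type, and the organisation span is missed, so one of three gold spans matches with its label and two of three match on boundaries alone.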


The two-layer models are evaluated on the same two test sets, without printing example predictions.

In [38]:
# Test Two Layer model
print("Two Layer Classification Head")
# Create new model instances
two_layer_encoder = AutoModel.from_pretrained(model_tag).to(DEVICE_ENCODER)
two_layer_clf_head = TwoLayerClassificationHead(n_classes=len(unique_labels)).to(DEVICE_CLF_HEAD)

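# Load best checkpoints (map_location lets GPU-saved weights load on a CPU-only runtime)
two_layer_encoder.load_state_dict(torch.load('/content/drive/MyDrive/ner_models/two_layer_ner_encoder.pt', map_location=DEVICE_ENCODER))
two_layer_clf_head.load_state_dict(torch.load('/content/drive/MyDrive/ner_models/two_layer_ner_clf_head.pt', map_location=DEVICE_CLF_HEAD))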

# Test on in-domain test set
print("\nEvaluating Two Layer model on in-domain test set:")
two_layer_test_results = validate_ner(
    data_splits['test'], label_to_i, tokeniser,
    two_layer_encoder, two_layer_clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
    BATCH_SIZE, i_to_label, unique_labels
)

# Print test results
if 'f1_scores' in two_layer_test_results and 'macro_avg' in two_layer_test_results['f1_scores']:
    macro = two_layer_test_results['f1_scores']['macro_avg']
    span_scores = two_layer_test_results['span_match_scores']

    print(f"\nTwo Layer model performance on in-domain test set")
    print(f"Token Accuracy: {two_layer_test_results['token_accuracy']:.4f}")
    print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
    print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
    print(f"Precision: {macro['precision']:.4f}")
    print(f"Recall: {macro['recall']:.4f}")
    print(f"F-score: {macro['f1']:.4f}")

    # Store results in the global results dict
    all_results["two_layer"]["test"] = two_layer_test_results

# Test on out-of-domain test set
print("\nEvaluating Two Layer model on out-of-domain test set:")
two_layer_ood_results = validate_ner(
    data_splits['ood_test'], label_to_i, tokeniser,
    two_layer_encoder, two_layer_clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
    BATCH_SIZE, i_to_label, unique_labels
)

# Print OOD test results
if 'f1_scores' in two_layer_ood_results and 'macro_avg' in two_layer_ood_results['f1_scores']:
    macro = two_layer_ood_results['f1_scores']['macro_avg']
    span_scores = two_layer_ood_results['span_match_scores']

    print(f"\nTwo Layer model performance on out-of-domain test set")
    print(f"Token Accuracy: {two_layer_ood_results['token_accuracy']:.4f}")
    print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
    print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
    print(f"Precision: {macro['precision']:.4f}")
    print(f"Recall: {macro['recall']:.4f}")
    print(f"F-score: {macro['f1']:.4f}")

    # Store results in the global results dictionary
    all_results["two_layer"]["ood_test"] = two_layer_ood_results




----- Two Layer Classification Head -----

Evaluating Two Layer model on in-domain test set:


Two Layer model performance on in-domain test set
Token Accuracy: 0.9818
Labeled Span Match: 0.8346
Unlabeled Span Match: 0.8621
Precision: 0.8240
Recall: 0.8207
F-score: 0.8203

Evaluating Two Layer model on out-of-domain test set:


Two Layer model performance on out-of-domain test set
Token Accuracy: 0.9787
Labeled Span Match: 0.7637
Unlabeled Span Match: 0.8279
Precision: 0.7886
Recall: 0.7080
F-score: 0.7246


In [39]:
# Test Two Layer Dropout model
print("Two Layer Dropout Classification Head")
# Create new model instances
dropout_encoder = AutoModel.from_pretrained(model_tag).to(DEVICE_ENCODER)
dropout_clf_head = TwoLayerDropoutClassificationHead(n_classes=len(unique_labels)).to(DEVICE_CLF_HEAD)

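# Load best checkpoints (map_location lets GPU-saved weights load on a CPU-only runtime)
dropout_encoder.load_state_dict(torch.load('/content/drive/MyDrive/ner_models/two_layer_dropout_ner_encoder.pt', map_location=DEVICE_ENCODER))
dropout_clf_head.load_state_dict(torch.load('/content/drive/MyDrive/ner_models/two_layer_dropout_ner_clf_head.pt', map_location=DEVICE_CLF_HEAD))

# Switch to evaluation mode so that dropout is disabled during testing
# (harmless if validate_ner already does this)
dropout_encoder.eval()
dropout_clf_head.eval()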

# Test on in-domain test set
print("\nEvaluating Two Layer Dropout model on in-domain test set:")
dropout_test_results = validate_ner(
    data_splits['test'], label_to_i, tokeniser,
    dropout_encoder, dropout_clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
    BATCH_SIZE, i_to_label, unique_labels
)

# Print test results
if 'f1_scores' in dropout_test_results and 'macro_avg' in dropout_test_results['f1_scores']:
    macro = dropout_test_results['f1_scores']['macro_avg']
    span_scores = dropout_test_results['span_match_scores']

    print(f"\nTwo Layer Dropout model performance on in-domain test set")
    print(f"Token Accuracy: {dropout_test_results['token_accuracy']:.4f}")
    print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
    print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
    print(f"Precision: {macro['precision']:.4f}")
    print(f"Recall: {macro['recall']:.4f}")
    print(f"F-score: {macro['f1']:.4f}")

    # Store results in the global results dictionary
    all_results["two_layer_dropout"]["test"] = dropout_test_results

# Test on out-of-domain test set
print("\nEvaluating Two Layer Dropout model on out-of-domain test set:")
dropout_ood_results = validate_ner(
    data_splits['ood_test'], label_to_i, tokeniser,
    dropout_encoder, dropout_clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
    BATCH_SIZE, i_to_label, unique_labels
)

# Print OOD test results
if 'f1_scores' in dropout_ood_results and 'macro_avg' in dropout_ood_results['f1_scores']:
    macro = dropout_ood_results['f1_scores']['macro_avg']
    span_scores = dropout_ood_results['span_match_scores']

    print(f"\nTwo Layer Dropout model performance on out-of-domain test set")
    print(f"Token Accuracy: {dropout_ood_results['token_accuracy']:.4f}")
    print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
    print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
    print(f"Precision: {macro['precision']:.4f}")
    print(f"Recall: {macro['recall']:.4f}")
    print(f"F-score: {macro['f1']:.4f}")

    # Store results in the global results dict
    all_results["two_layer_dropout"]["ood_test"] = dropout_ood_results


----- Two Layer Dropout Classification Head -----

Evaluating Two Layer Dropout model on in-domain test set:


Validate:   0%|          | 0/65 [00:00<?, ?it/s]


Two Layer Dropout model performance on in-domain test set
Token Accuracy: 0.9809
Labeled Span Match: 0.8401
Unlabeled Span Match: 0.8759
Precision: 0.8018
Recall: 0.8284
F-score: 0.8142

Evaluating Two Layer Dropout model on out-of-domain test set:


Validate:   0%|          | 0/32 [00:00<?, ?it/s]


Two Layer Dropout model performance on out-of-domain test set
Token Accuracy: 0.9780
Labeled Span Match: 0.7758
Unlabeled Span Match: 0.8437
Precision: 0.7632
Recall: 0.7283
F-score: 0.7346




## Simplified Tagset

This section trains and tests a single-layer classification head on the simplified tagset (3 tags: 'O', 'B', 'I').
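As a rough illustration of what removing the entity types does to span-level scoring, the sketch below extracts spans from a BIO sequence and compares gold and predicted spans. The helpers `extract_spans`, `span_match` and `simplify` are hypothetical and are not the ones used inside `validate_ner`; span match is assumed here to be the share of gold spans that the prediction reproduces exactly.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence (end exclusive)."""
    spans = set()
    start, span_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the extra 'O' closes a sentence-final span
        prefix, _, label = tag.partition("-")     # 'B-PER' -> ('B', '-', 'PER'); 'B' -> ('B', '', '')
        # An open span ends at 'O', at a new 'B', or at an 'I' of a different type
        if start is not None and (prefix != "I" or label != span_type):
            spans.add((start, i, span_type))
            start, span_type = None, None
        # A span opens at 'B' (a stray 'I' is leniently treated as a span start)
        if prefix == "B" or (prefix == "I" and start is None):
            start, span_type = i, label
    return spans


def span_match(gold_tags, pred_tags, labeled=True):
    """Share of gold spans reproduced exactly (an assumed definition)."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    if not labeled:  # ignore the entity type, keep only the boundaries
        gold = {(s, e) for s, e, _ in gold}
        pred = {(s, e) for s, e, _ in pred}
    return len(gold & pred) / len(gold) if gold else 1.0


def simplify(tags):
    """Keep only the B/I/O prefix, as in the simplified tagset."""
    return [tag.split("-")[0] for tag in tags]


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]  # correct boundaries, one wrong type

full_labeled = span_match(gold, pred, labeled=True)
full_unlabeled = span_match(gold, pred, labeled=False)
simple_labeled = span_match(simplify(gold), simplify(pred), labeled=True)
simple_unlabeled = span_match(simplify(gold), simplify(pred), labeled=False)
print(full_labeled, full_unlabeled, simple_labeled, simple_unlabeled)
```

With typed tags, a wrong entity type costs a labelled match but not an unlabelled one. Once the types are removed the two scores are identical by construction, which is consistent with the equal labelled and unlabelled span match values reported for the simplified model.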

In [67]:
# Convert to simplified tags
def convert_to_simplified_tags(data_dict):
    simplified_dict = defaultdict(dict)

    for corpus in data_dict:
        for split in data_dict[corpus]:
            simplified_dict[corpus][split] = []

            for tokens, tags in data_dict[corpus][split]:
                simplified_tags = []
                for tag in tags:
                    if tag.startswith('B-'):
                        simplified_tags.append('B')
                    elif tag.startswith('I-'):
                        simplified_tags.append('I')
                    else:  # 'O' tag
                        simplified_tags.append('O')

                simplified_dict[corpus][split].append((tokens, simplified_tags))

    return simplified_dict

In [68]:
# Convert full tagset to simplified tagset
simplified_data_dict = convert_to_simplified_tags(data_dict)

# Create simplified labels and prepare data
simplified_unique_labels = set(['O', 'B', 'I'])
print(f"Simplified tagset: {simplified_unique_labels}")

# Create label mappings for simplified tags
simplified_label_to_i = {
    label: i
    for i, label in enumerate(sorted(simplified_unique_labels))
}

simplified_i_to_label = {
    i: label
    for label, i in simplified_label_to_i.items()
}

print("Simplified label mappings:")
print(simplified_label_to_i)
print(simplified_i_to_label)

# Prepare data splits with simplified tags
simplified_data_splits = prepare_ner_data(simplified_data_dict)

Simplified tagset: {'B', 'O', 'I'}
Simplified label mappings:
{'B': 0, 'I': 1, 'O': 2}
{0: 'B', 1: 'I', 2: 'O'}


Training on the Simplified Tagset with the Single-Layer Classification Head


In [70]:
#  Initialize model for simplified task
simplified_encoder = AutoModel.from_pretrained(model_tag).to(DEVICE_ENCODER)

# Initialize a new classification head for the simplified task (with 3 classes)
simplified_clf_head = ClassificationHead(n_classes=len(simplified_unique_labels)).to(DEVICE_CLF_HEAD)

# Create optimizer
simplified_optimizer = torch.optim.AdamW(
    list(simplified_encoder.parameters()) + list(simplified_clf_head.parameters()),
    lr=1e-5
)

# Define loss function
loss_fn = nn.CrossEntropyLoss()

# Early stopping parameters
simplified_best_f1 = 0.0
last_epoch_with_dev_improvement = 0
n_epochs_without_improvement = 0
early_stopping_threshold = 3

# Training loop for simplified model
print("Training Simplified Model (3 tags)")

for epoch_n in tqdm(range(N_EPOCHS)):

    # Train for one epoch on simplified data
    train_loss = train_epoch(
        simplified_data_splits['train'], simplified_label_to_i, tokeniser,
        simplified_encoder, simplified_clf_head, simplified_optimizer, loss_fn,
        DEVICE_ENCODER, DEVICE_CLF_HEAD, BATCH_SIZE
    )
    print(f'Epoch {epoch_n+1}/{N_EPOCHS} train loss: {train_loss:.4f}')

    # Validate on simplified data
    results = validate_ner(
        simplified_data_splits['dev'], simplified_label_to_i, tokeniser,
        simplified_encoder, simplified_clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
        BATCH_SIZE, simplified_i_to_label, simplified_unique_labels
    )

    # Print results
    if 'f1_scores' in results and 'macro_avg' in results['f1_scores']:
        macro = results['f1_scores']['macro_avg']
        span_scores = results['span_match_scores']

        print(f"\nSimplified Model performance on dev set")
        print(f"Epoch: {epoch_n+1}/{N_EPOCHS}")
        print(f"Token Accuracy: {results['token_accuracy']:.4f}")
        print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
        print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
        print(f"Precision: {macro['precision']:.4f}")
        print(f"Recall: {macro['recall']:.4f}")
        print(f"F-score: {macro['f1']:.4f}")

        # Save model if F1 improves
        if macro['f1'] > simplified_best_f1:
            simplified_best_f1 = macro['f1']
            last_epoch_with_dev_improvement = epoch_n
            n_epochs_without_improvement = 0
            print('Saving the model - new best F1 score!')
            torch.save(simplified_encoder.state_dict(), 'simplified_ner_encoder.pt')
            torch.save(simplified_clf_head.state_dict(), 'simplified_ner_clf_head.pt')
        else:
            n_epochs_without_improvement = epoch_n - last_epoch_with_dev_improvement
            print(f"No improvement for {n_epochs_without_improvement} epochs. Best F1: {simplified_best_f1:.4f}")

            # Check for early stopping
            if n_epochs_without_improvement >= early_stopping_threshold:
                print(f"Early stopping after {n_epochs_without_improvement} epochs without improvement")
                break


=== Training Simplified Model (3 tags) ===



  0%|          | 0/8 [00:00<?, ?it/s]

Train:   0%|          | 0/392 [00:00<?, ?it/s]

Epoch 1/8 train loss: 0.1258


Validate:   0%|          | 0/63 [00:00<?, ?it/s]


Simplified Model performance on dev set
Epoch: 1/8
Token Accuracy: 0.9832
Labeled Span Match: 0.8157
Unlabeled Span Match: 0.8157
Precision: 0.7680
Recall: 0.8157
F-score: 0.7912
Saving the model - new best F1 score!


Train:   0%|          | 0/392 [00:00<?, ?it/s]

Epoch 2/8 train loss: 0.0364


Validate:   0%|          | 0/63 [00:00<?, ?it/s]


Simplified Model performance on dev set
Epoch: 2/8
Token Accuracy: 0.9847
Labeled Span Match: 0.8571
Unlabeled Span Match: 0.8571
Precision: 0.8039
Recall: 0.8571
F-score: 0.8297
Saving the model - new best F1 score!


Train:   0%|          | 0/392 [00:00<?, ?it/s]

Epoch 3/8 train loss: 0.0223


Validate:   0%|          | 0/63 [00:00<?, ?it/s]


Simplified Model performance on dev set
Epoch: 3/8
Token Accuracy: 0.9868
Labeled Span Match: 0.8478
Unlabeled Span Match: 0.8478
Precision: 0.8400
Recall: 0.8478
F-score: 0.8439
Saving the model - new best F1 score!


Train:   0%|          | 0/392 [00:00<?, ?it/s]

Epoch 4/8 train loss: 0.0152


Validate:   0%|          | 0/63 [00:00<?, ?it/s]


Simplified Model performance on dev set
Epoch: 4/8
Token Accuracy: 0.9872
Labeled Span Match: 0.8665
Unlabeled Span Match: 0.8665
Precision: 0.8353
Recall: 0.8665
F-score: 0.8506
Saving the model - new best F1 score!


Train:   0%|          | 0/392 [00:00<?, ?it/s]

Epoch 5/8 train loss: 0.0105


Validate:   0%|          | 0/63 [00:00<?, ?it/s]


Simplified Model performance on dev set
Epoch: 5/8
Token Accuracy: 0.9875
Labeled Span Match: 0.8665
Unlabeled Span Match: 0.8665
Precision: 0.8532
Recall: 0.8665
F-score: 0.8598
Saving the model - new best F1 score!


Train:   0%|          | 0/392 [00:00<?, ?it/s]

Epoch 6/8 train loss: 0.0080


Validate:   0%|          | 0/63 [00:00<?, ?it/s]


Simplified Model performance on dev set
Epoch: 6/8
Token Accuracy: 0.9867
Labeled Span Match: 0.8634
Unlabeled Span Match: 0.8634
Precision: 0.8441
Recall: 0.8634
F-score: 0.8536
No improvement for 1 epochs. Best F1: 0.8598


Train:   0%|          | 0/392 [00:00<?, ?it/s]

Epoch 7/8 train loss: 0.0062


Validate:   0%|          | 0/63 [00:00<?, ?it/s]


Simplified Model performance on dev set
Epoch: 7/8
Token Accuracy: 0.9864
Labeled Span Match: 0.8675
Unlabeled Span Match: 0.8675
Precision: 0.8431
Recall: 0.8675
F-score: 0.8551
No improvement for 2 epochs. Best F1: 0.8598


Train:   0%|          | 0/392 [00:00<?, ?it/s]

Epoch 8/8 train loss: 0.0049


Validate:   0%|          | 0/63 [00:00<?, ?it/s]


Simplified Model performance on dev set
Epoch: 8/8
Token Accuracy: 0.9874
Labeled Span Match: 0.8758
Unlabeled Span Match: 0.8758
Precision: 0.8545
Recall: 0.8758
F-score: 0.8650
Saving the model - new best F1 score!


Testing on the Simplified Tagset


In [71]:
# Initialize models for simplified tagset
simplified_encoder = AutoModel.from_pretrained(model_tag).to(DEVICE_ENCODER)
simplified_clf_head = ClassificationHead(n_classes=len(simplified_unique_labels)).to(DEVICE_CLF_HEAD)

# Load the simplified model checkpoints from Google Drive,
# mapping the saved tensors onto the devices the models live on
simplified_encoder.load_state_dict(
    torch.load('/content/drive/MyDrive/ner_models/simplified_ner_encoder.pt', map_location=DEVICE_ENCODER)
)
simplified_clf_head.load_state_dict(
    torch.load('/content/drive/MyDrive/ner_models/simplified_ner_clf_head.pt', map_location=DEVICE_CLF_HEAD)
)

# Test the simplified model
print("Evaluating simplified model on in-domain test set:")
test_results = validate_ner(
    simplified_data_splits['test'], simplified_label_to_i, tokeniser,
    simplified_encoder, simplified_clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
    BATCH_SIZE, simplified_i_to_label, simplified_unique_labels
)

# Print test results
if 'f1_scores' in test_results and 'macro_avg' in test_results['f1_scores']:
    macro = test_results['f1_scores']['macro_avg']
    span_scores = test_results['span_match_scores']

    print(f"\nSimplified Model performance on in-domain test set")
    print(f"Token Accuracy: {test_results['token_accuracy']:.4f}")
    print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
    print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
    print(f"Precision: {macro['precision']:.4f}")
    print(f"Recall: {macro['recall']:.4f}")
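    print(f"F-score: {macro['f1']:.4f}")

    # Store results so the comparison tables can read them
    all_results.setdefault("simplified", {})["test"] = test_results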

print("\nEvaluating simplified model on out-of-domain test set:")
ood_test_results = validate_ner(
    simplified_data_splits['ood_test'], simplified_label_to_i, tokeniser,
    simplified_encoder, simplified_clf_head, DEVICE_ENCODER, DEVICE_CLF_HEAD,
    BATCH_SIZE, simplified_i_to_label, simplified_unique_labels
)

# Print OOD test results
if 'f1_scores' in ood_test_results and 'macro_avg' in ood_test_results['f1_scores']:
    macro = ood_test_results['f1_scores']['macro_avg']
    span_scores = ood_test_results['span_match_scores']

    print(f"\nSimplified Model performance on out-of-domain test set")
    print(f"Token Accuracy: {ood_test_results['token_accuracy']:.4f}")
    print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
    print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")
    print(f"Precision: {macro['precision']:.4f}")
    print(f"Recall: {macro['recall']:.4f}")
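    print(f"F-score: {macro['f1']:.4f}")

    # Store results so the comparison tables can read them
    all_results.setdefault("simplified", {})["ood_test"] = ood_test_results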


Evaluating simplified model on in-domain test set:


Validate:   0%|          | 0/65 [00:00<?, ?it/s]


Simplified Model performance on in-domain test set
Token Accuracy: 0.9839
Labeled Span Match: 0.8649
Unlabeled Span Match: 0.8649
Precision: 0.8524
Recall: 0.8649
F-score: 0.8586

Evaluating simplified model on out-of-domain test set:


Validate:   0%|          | 0/32 [00:00<?, ?it/s]


Simplified Model performance on out-of-domain test set
Token Accuracy: 0.9820
Labeled Span Match: 0.8270
Unlabeled Span Match: 0.8270
Precision: 0.8371
Recall: 0.8270
F-score: 0.8320


Performance comparison of the BERT models on the two test sets
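As an alternative to formatting each table by hand, the comparison could be assembled with pandas. The sketch below is a minimal version under the assumption that results are nested as model, then split, then metrics, with the same keys that `validate_ner` appears to return. The names `example_results`, `DISPLAY_NAMES`, `SPLIT_NAMES` and `results_to_frame` are hypothetical; the scores are the two-layer dropout results printed earlier.

```python
import pandas as pd

# Scores copied from the two-layer dropout evaluation above; the nesting
# (model -> split -> metrics) is an assumption about how `all_results` is organised.
example_results = {
    "two_layer_dropout": {
        "test": {
            "token_accuracy": 0.9809,
            "span_match_scores": {"labeled_span_match": 0.8401, "unlabeled_span_match": 0.8759},
            "f1_scores": {"macro_avg": {"precision": 0.8018, "recall": 0.8284, "f1": 0.8142}},
        },
        "ood_test": {
            "token_accuracy": 0.9780,
            "span_match_scores": {"labeled_span_match": 0.7758, "unlabeled_span_match": 0.8437},
            "f1_scores": {"macro_avg": {"precision": 0.7632, "recall": 0.7283, "f1": 0.7346}},
        },
    },
    "simplified": {"test": None, "ood_test": None},  # splits without results are skipped
}

DISPLAY_NAMES = {"two_layer_dropout": "2-layer with dropout", "simplified": "1-layer (3 tags)"}
SPLIT_NAMES = {"test": "in-domain", "ood_test": "out-of-domain"}


def results_to_frame(results):
    """Flatten the nested results into one row per (model, test set)."""
    rows = []
    for model_name, splits in results.items():
        for split, res in splits.items():
            # Skip splits that were never evaluated or lack a macro average
            if not res or "macro_avg" not in res.get("f1_scores", {}):
                continue
            macro = res["f1_scores"]["macro_avg"]
            spans = res["span_match_scores"]
            rows.append({
                "Model": DISPLAY_NAMES.get(model_name, model_name),
                "Test set": SPLIT_NAMES.get(split, split),
                "Token accuracy": res["token_accuracy"],
                "Labelled span match": spans["labeled_span_match"],
                "Unlabelled span match": spans["unlabeled_span_match"],
                "Precision": macro["precision"],
                "Recall": macro["recall"],
                "F1 score": macro["f1"],
            })
    return pd.DataFrame(rows)


table = results_to_frame(example_results)
print(table.to_string(index=False, float_format="{:.4f}".format))
```

A DataFrame handles column alignment regardless of how long the model names are, and the same frame can be written out with `table.to_csv(...)` instead of building CSV rows separately.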

In [81]:
# PERFORMANCE COMPARISON
# Note: not fully working. Scores for individual labels are not extracted; only the macro average is reported.

# Table 1: Overall model performance on full tagset
print("Table 1. Model performance on the full tagset computed on the two test sets.")
print("-" * 110)
print(f"{'Model':15} | {'Test set':15} | {'Token':10} | {'Labelled span':15} | {'Unlabelled':15} | {'Precision':10} | {'Recall':10} | {'F1 score':10}")
print(f"{'':15} | {'':15} | {'accuracy':10} | {'match':15} | {'span match':15} | {'':10} | {'':10} | {'':10}")
print("-" * 110)

# Display results for each model on both test sets
for model_name in ["single_layer", "two_layer", "two_layer_dropout"]:
    # Readable model names
    model_display_name = {
        "single_layer": "1-layer",
        "two_layer": "2-layer",
        "two_layer_dropout": "2-layer with dropout"
    }[model_name]

    # In-domain test results
    in_domain_results = all_results[model_name]["test"]
    if in_domain_results and 'f1_scores' in in_domain_results and 'macro_avg' in in_domain_results['f1_scores']:
        token_acc = in_domain_results["token_accuracy"]
        labeled_match = in_domain_results["span_match_scores"]["labeled_span_match"]
        unlabeled_match = in_domain_results["span_match_scores"]["unlabeled_span_match"]
        precision = in_domain_results["f1_scores"]["macro_avg"]["precision"]
        recall = in_domain_results["f1_scores"]["macro_avg"]["recall"]
        f1 = in_domain_results["f1_scores"]["macro_avg"]["f1"]

        print(f"{model_display_name:15} | {'in-domain':15} | {token_acc:.4f}     | {labeled_match:.4f}         | {unlabeled_match:.4f}         | {precision:.4f}     | {recall:.4f}     | {f1:.4f}")

    # Out-of-domain test results
    ood_results = all_results[model_name]["ood_test"]
    if ood_results and 'f1_scores' in ood_results and 'macro_avg' in ood_results['f1_scores']:
        token_acc = ood_results["token_accuracy"]
        labeled_match = ood_results["span_match_scores"]["labeled_span_match"]
        unlabeled_match = ood_results["span_match_scores"]["unlabeled_span_match"]
        precision = ood_results["f1_scores"]["macro_avg"]["precision"]
        recall = ood_results["f1_scores"]["macro_avg"]["recall"]
        f1 = ood_results["f1_scores"]["macro_avg"]["f1"]

        print(f"{model_display_name:15} | {'out-of-domain':15} | {token_acc:.4f}     | {labeled_match:.4f}         | {unlabeled_match:.4f}         | {precision:.4f}     | {recall:.4f}     | {f1:.4f}")


# Function to get F1 score or default value if not present
def get_tag_f1(results, tag):
    if not results or 'f1_scores' not in results:
        return "-"

    # Look for the exact tag in f1_scores
    if tag in results['f1_scores']:
        return f"{results['f1_scores'][tag]['f1']:.4f}"
    return "-"  # Tag not found

# Table 2: Entity-specific F1 scores on in-domain test set with all 7 tags
print("\n\nTable 2. Entity-specific F1 scores on the full tagset (7) on in-domain test set.")
print("-" * 100)
print(f"{'Model':15} | {'O':10} | {'B-PER':10} | {'I-PER':10} | {'B-ORG':10} | {'I-ORG':10} | {'B-LOC':10} | {'I-LOC':10} | {'Macro Avg':10}")
print("-" * 100)

# Display results for each model on in-domain test set with all 7 tags
for model_name in ["single_layer", "two_layer", "two_layer_dropout"]:
    # Readable model names
    model_display_name = {
        "single_layer": "1-layer",
        "two_layer": "2-layer",
        "two_layer_dropout": "2-layer with dropout"
    }[model_name]

    results = all_results[model_name]["test"]

    # Get F1 score for each tag (or "-" if not available)
    o_score = get_tag_f1(results, "O")
    b_per_score = get_tag_f1(results, "B-PER")
    i_per_score = get_tag_f1(results, "I-PER")
    b_org_score = get_tag_f1(results, "B-ORG")
    i_org_score = get_tag_f1(results, "I-ORG")
    b_loc_score = get_tag_f1(results, "B-LOC")
    i_loc_score = get_tag_f1(results, "I-LOC")

    # Get macro average
    macro_avg = "-"
    if results and 'f1_scores' in results and 'macro_avg' in results['f1_scores']:
        macro_avg = f"{results['f1_scores']['macro_avg']['f1']:.4f}"

    # Print row
    print(f"{model_display_name:15} | {o_score:10} | {b_per_score:10} | {i_per_score:10} | {b_org_score:10} | {i_org_score:10} | {b_loc_score:10} | {i_loc_score:10} | {macro_avg:10}")

# Table 3: Entity-specific F1 scores on out-of-domain test set with all 7 tags
print("\n\nTable 3. Entity-specific F1 scores on the full tagset (7) on out-of-domain test set.")
print("-" * 100)
print(f"{'Model':15} | {'O':10} | {'B-PER':10} | {'I-PER':10} | {'B-ORG':10} | {'I-ORG':10} | {'B-LOC':10} | {'I-LOC':10} | {'Macro Avg':10}")
print("-" * 100)

# Display results for each model on out-of-domain test set with all 7 tags
for model_name in ["single_layer", "two_layer", "two_layer_dropout"]:
    # Readable model names
    model_display_name = {
        "single_layer": "1-layer",
        "two_layer": "2-layer",
        "two_layer_dropout": "2-layer with dropout"
    }[model_name]

    results = all_results[model_name]["ood_test"]

    # Get F1 score for each tag (or "-" if not available)
    o_score = get_tag_f1(results, "O")
    b_per_score = get_tag_f1(results, "B-PER")
    i_per_score = get_tag_f1(results, "I-PER")
    b_org_score = get_tag_f1(results, "B-ORG")
    i_org_score = get_tag_f1(results, "I-ORG")
    b_loc_score = get_tag_f1(results, "B-LOC")
    i_loc_score = get_tag_f1(results, "I-LOC")

    # Get macro average
    macro_avg = "-"
    if results and 'f1_scores' in results and 'macro_avg' in results['f1_scores']:
        macro_avg = f"{results['f1_scores']['macro_avg']['f1']:.4f}"

    # Print row
    print(f"{model_display_name:15} | {o_score:10} | {b_per_score:10} | {i_per_score:10} | {b_org_score:10} | {i_org_score:10} | {b_loc_score:10} | {i_loc_score:10} | {macro_avg:10}")

# Table 4: Model performance on simplified tagset
print("\n\nTable 4. Model performance on the simplified tagset (3).")
print("-" * 110)
print(f"{'Model':15} | {'Test set':15} | {'Token':10} | {'Labelled span':15} | {'Unlabelled':15} | {'Precision':10} | {'Recall':10} | {'F1 score':10}")
print(f"{'':15} | {'':15} | {'accuracy':10} | {'match':15} | {'span match':15} | {'':10} | {'':10} | {'':10}")
print("-" * 110)

# Display simplified tagset results
simplified_in_domain = all_results["simplified"]["test"]
if simplified_in_domain:
    token_acc = simplified_in_domain["token_accuracy"]
    labeled_match = simplified_in_domain["span_match_scores"]["labeled_span_match"]
    unlabeled_match = simplified_in_domain["span_match_scores"]["unlabeled_span_match"]
    precision = simplified_in_domain["f1_scores"]["macro_avg"]["precision"]
    recall = simplified_in_domain["f1_scores"]["macro_avg"]["recall"]
    f1 = simplified_in_domain["f1_scores"]["macro_avg"]["f1"]

    print(f"{'1-layer':15} | {'in-domain':15} | {token_acc:.4f}     | {labeled_match:.4f}         | {unlabeled_match:.4f}         | {precision:.4f}     | {recall:.4f}     | {f1:.4f}")

simplified_ood = all_results["simplified"]["ood_test"]
if simplified_ood:
    token_acc = simplified_ood["token_accuracy"]
    labeled_match = simplified_ood["span_match_scores"]["labeled_span_match"]
    unlabeled_match = simplified_ood["span_match_scores"]["unlabeled_span_match"]
    precision = simplified_ood["f1_scores"]["macro_avg"]["precision"]
    recall = simplified_ood["f1_scores"]["macro_avg"]["recall"]
    f1 = simplified_ood["f1_scores"]["macro_avg"]["f1"]

    print(f"{'1-layer':15} | {'out-of-domain':15} | {token_acc:.4f}     | {labeled_match:.4f}         | {unlabeled_match:.4f}         | {precision:.4f}     | {recall:.4f}     | {f1:.4f}")

# Table 5: Entity-specific F1 scores on simplified tagset (in-domain)
print("\n\nTable 5. Entity-specific F1 scores on the simplified tagset (3) on in-domain test set.")
print("-" * 60)
print(f"{'Model':15} | {'O':10} | {'B':10} | {'I':10} | {'Macro Avg':10}")
print("-" * 60)

# Create row for simplified in-domain test
simplified_in_domain = all_results["simplified"]["test"]
if simplified_in_domain and 'f1_scores' in simplified_in_domain:
    # Extract F1 scores for each entity type if they exist
    o_score_str = f"{simplified_in_domain['f1_scores'].get('O', {}).get('f1', 0):.4f}" if 'O' in simplified_in_domain['f1_scores'] else "-"
    b_score_str = f"{simplified_in_domain['f1_scores'].get('B', {}).get('f1', 0):.4f}" if 'B' in simplified_in_domain['f1_scores'] else "-"
    i_score_str = f"{simplified_in_domain['f1_scores'].get('I', {}).get('f1', 0):.4f}" if 'I' in simplified_in_domain['f1_scores'] else "-"

    # Get macro average
    macro_avg = simplified_in_domain['f1_scores'].get('macro_avg', {}).get('f1', 0)
    macro_avg_str = f"{macro_avg:.4f}" if isinstance(macro_avg, float) else "-"

    # Print row for in-domain
    print(f"{'1-layer':15} | {o_score_str:10} | {b_score_str:10} | {i_score_str:10} | {macro_avg_str:10}")

    # Print an explanation if some entity types are missing
    missing_entities = []
    if 'O' not in simplified_in_domain['f1_scores']: missing_entities.append('O')
    if 'B' not in simplified_in_domain['f1_scores']: missing_entities.append('B')
    if 'I' not in simplified_in_domain['f1_scores']: missing_entities.append('I')

    if missing_entities:
        print(f"Note: F1 scores for {', '.join(missing_entities)} are not available. Available entity types are: {list(simplified_in_domain['f1_scores'].keys())}")
else:
    print(f"{'1-layer':15} | {'-':10} | {'-':10} | {'-':10} | {'-':10}")
    print("Note: Entity-specific scores are not available in the results.")

# Table 6: Entity-specific F1 scores on simplified tagset (out-of-domain)
print("\n\nTable 6. Entity-specific F1 scores on the simplified tagset (3) on out-of-domain test set.")
print("-" * 60)
print(f"{'Model':15} | {'O':10} | {'B':10} | {'I':10} | {'Macro Avg':10}")
print("-" * 60)

# Create row for simplified out-of-domain test
simplified_ood = all_results["simplified"]["ood_test"]
if simplified_ood and 'f1_scores' in simplified_ood:
    # Extract F1 scores for each entity type if they exist
    o_score_str = f"{simplified_ood['f1_scores'].get('O', {}).get('f1', 0):.4f}" if 'O' in simplified_ood['f1_scores'] else "-"
    b_score_str = f"{simplified_ood['f1_scores'].get('B', {}).get('f1', 0):.4f}" if 'B' in simplified_ood['f1_scores'] else "-"
    i_score_str = f"{simplified_ood['f1_scores'].get('I', {}).get('f1', 0):.4f}" if 'I' in simplified_ood['f1_scores'] else "-"

    # Get macro average
    macro_avg = simplified_ood['f1_scores'].get('macro_avg', {}).get('f1', 0)
    macro_avg_str = f"{macro_avg:.4f}" if isinstance(macro_avg, float) else "-"

    # Print row for out-of-domain
    print(f"{'1-layer':15} | {o_score_str:10} | {b_score_str:10} | {i_score_str:10} | {macro_avg_str:10}")

    # Print an explanation if some entity types are missing
    missing_entities = []
    if 'O' not in simplified_ood['f1_scores']: missing_entities.append('O')
    if 'B' not in simplified_ood['f1_scores']: missing_entities.append('B')
    if 'I' not in simplified_ood['f1_scores']: missing_entities.append('I')

    if missing_entities:
        print(f"Note: F1 scores for {', '.join(missing_entities)} are not available. Available entity types are: {list(simplified_ood['f1_scores'].keys())}")
else:
    print(f"{'1-layer':15} | {'-':10} | {'-':10} | {'-':10} | {'-':10}")
    print("Note: Entity-specific scores are not available in the results.")

# Export results to CSV files for reference
print("\nExporting results to CSV files...")

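# Export data for all tables to a single CSV (create the output folder first if it does not exist)
import os
os.makedirs('/content/drive/MyDrive/ner_models/combined_results', exist_ok=True)
with open('/content/drive/MyDrive/ner_models/combined_results/tables_for_report.csv', 'w', newline='') as csvfile: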
    # Updated fieldnames to include all entity types
    fieldnames = ['Table', 'Model', 'Test_Set', 'Token_Accuracy', 'Labeled_Match', 'Unlabeled_Match',
              'Precision', 'Recall', 'F1_Score',
              'O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC',  # Full 7 tag set
              'B', 'I',  # Simplified entity types
              'Macro_Avg']

    # Use extrasaction='ignore' to ignore any fields not in fieldnames
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()

    # Table 1: Model performance on full tagset
    for model_name in ["single_layer", "two_layer", "two_layer_dropout"]:
        model_display_name = {
            "single_layer": "1-layer",
            "two_layer": "2-layer",
            "two_layer_dropout": "2-layer with dropout"
        }[model_name]

        for dataset_name, dataset_label in [("test", "in-domain"), ("ood_test", "out-of-domain")]:
            results = all_results[model_name][dataset_name]
            if results and 'f1_scores' in results and 'macro_avg' in results['f1_scores']:
                writer.writerow({
                    'Table': 'Table 1',
                    'Model': model_display_name,
                    'Test_Set': dataset_label,
                    'Token_Accuracy': f"{results['token_accuracy']:.4f}",
                    'Labeled_Match': f"{results['span_match_scores']['labeled_span_match']:.4f}",
                    'Unlabeled_Match': f"{results['span_match_scores']['unlabeled_span_match']:.4f}",
                    'Precision': f"{results['f1_scores']['macro_avg']['precision']:.4f}",
                    'Recall': f"{results['f1_scores']['macro_avg']['recall']:.4f}",
                    'F1_Score': f"{results['f1_scores']['macro_avg']['f1']:.4f}",
                    'Macro_Avg': f"{results['f1_scores']['macro_avg']['f1']:.4f}"
                })

    # Tables 2 & 3: Entity-specific F1 scores for full tagset
    for model_name in ["single_layer", "two_layer", "two_layer_dropout"]:
        model_display_name = {
            "single_layer": "1-layer",
            "two_layer": "2-layer",
            "two_layer_dropout": "2-layer with dropout"
        }[model_name]

        # Table 2: In-domain
        results = all_results[model_name]["test"]
        if results and 'f1_scores' in results:
            row_data = {
                'Table': 'Table 2',
                'Model': model_display_name,
                'Test_Set': 'in-domain',
                'O': get_tag_f1(results, "O"),
                'B-PER': get_tag_f1(results, "B-PER"),
                'I-PER': get_tag_f1(results, "I-PER"),
                'B-ORG': get_tag_f1(results, "B-ORG"),
                'I-ORG': get_tag_f1(results, "I-ORG"),
                'B-LOC': get_tag_f1(results, "B-LOC"),
                'I-LOC': get_tag_f1(results, "I-LOC"),
                'Macro_Avg': f"{results['f1_scores']['macro_avg']['f1']:.4f}" if 'macro_avg' in results['f1_scores'] else "-"
            }

            writer.writerow(row_data)

        # Table 3: Out-of-domain
        results = all_results[model_name]["ood_test"]
        if results and 'f1_scores' in results:
            row_data = {
                'Table': 'Table 3',
                'Model': model_display_name,
                'Test_Set': 'out-of-domain',
                'O': get_tag_f1(results, "O"),
                'B-PER': get_tag_f1(results, "B-PER"),
                'I-PER': get_tag_f1(results, "I-PER"),
                'B-ORG': get_tag_f1(results, "B-ORG"),
                'I-ORG': get_tag_f1(results, "I-ORG"),
                'B-LOC': get_tag_f1(results, "B-LOC"),
                'I-LOC': get_tag_f1(results, "I-LOC"),
                'Macro_Avg': f"{results['f1_scores']['macro_avg']['f1']:.4f}" if 'macro_avg' in results['f1_scores'] else "-"
            }

            writer.writerow(row_data)
    # Table 4: Model performance on simplified tagset
    for dataset_name, dataset_label in [("test", "in-domain"), ("ood_test", "out-of-domain")]:
        results = all_results["simplified"][dataset_name]
        if results and 'f1_scores' in results:
            writer.writerow({
                'Table': 'Table 4',
                'Model': '1-layer',
                'Test_Set': dataset_label,
                'Token_Accuracy': f"{results['token_accuracy']:.4f}",
                'Labeled_Match': f"{results['span_match_scores']['labeled_span_match']:.4f}",
                'Unlabeled_Match': f"{results['span_match_scores']['unlabeled_span_match']:.4f}",
                'Precision': f"{results['f1_scores']['macro_avg']['precision']:.4f}",
                'Recall': f"{results['f1_scores']['macro_avg']['recall']:.4f}",
                'F1_Score': f"{results['f1_scores']['macro_avg']['f1']:.4f}",
                'Macro_Avg': f"{results['f1_scores']['macro_avg']['f1']:.4f}"
            })

    # Tables 5 & 6: Entity-specific F1 scores on simplified tagset
    # Table 5: In-domain
    results = all_results["simplified"]["test"]
    if results and 'f1_scores' in results:
        row_data = {
            'Table': 'Table 5',
            'Model': '1-layer',
            'Test_Set': 'in-domain',
            'O': '-', 'B': '-', 'I': '-',  # Default values for simplified entity types
            'Macro_Avg': f"{results['f1_scores']['macro_avg']['f1']:.4f}"
        }

        # Fill in entity-specific scores
        for entity_type, scores in results["f1_scores"].items():
            if entity_type != 'macro_avg':
                row_data[entity_type] = f"{scores['f1']:.4f}"

        writer.writerow(row_data)

    # Table 6: Out-of-domain
    results = all_results["simplified"]["ood_test"]
    if results and 'f1_scores' in results:
        row_data = {
            'Table': 'Table 6',
            'Model': '1-layer',
            'Test_Set': 'out-of-domain',
            'O': '-', 'B': '-', 'I': '-',  # Default values for simplified entity types
            'Macro_Avg': f"{results['f1_scores']['macro_avg']['f1']:.4f}"
        }

        # Fill in entity-specific scores
        for entity_type, scores in results["f1_scores"].items():
            if entity_type != 'macro_avg':
                row_data[entity_type] = f"{scores['f1']:.4f}"

        writer.writerow(row_data)

print("Results exported to Google Drive at /content/drive/MyDrive/ner_models/combined_results/tables_for_report.csv")



=== BERT MODELS PERFORMANCE COMPARISON ===

Table 1. Model performance on the full tagset computed on the two test sets.
--------------------------------------------------------------------------------------------------------------
Model           | Test set        | Token      | Labelled span   | Unlabelled      | Precision  | Recall     | F1 score  
                |                 | accuracy   | match           | span match      |            |            |           
--------------------------------------------------------------------------------------------------------------
1-layer         | in-domain       | 0.9825     | 0.8438         | 0.8768         | 0.8160     | 0.8301     | 0.8226
1-layer         | out-of-domain   | 0.9789     | 0.7619         | 0.8205         | 0.7816     | 0.7142     | 0.7350
2-layer         | in-domain       | 0.9818     | 0.8346         | 0.8621         | 0.8240     | 0.8207     | 0.8203
2-layer         | out-of-domain   | 0.9787     | 0.7637        

In [46]:
# Copy all model files to Google Drive after training
!cp *.pt /content/drive/MyDrive/ner_models/



---



# **In-context learning with LLMs**



Implement the decoder-only architecture for NER using in-context learning with a pre-trained LLM. Prompt the LLM to produce label sequences using few-shot in-context learning. Write appropriate prompt-generation code containing in-context examples, balancing the number of examples that fit in the context window against how well each label is represented among them.

The model used is Llama-3.1-8B-Instruct, a pre-trained decoder-only model:

https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
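One way to balance prompt length against label coverage is to pick short training sentences greedily until every entity type has been seen a minimum number of times. The sketch below is a hypothetical selection strategy, not necessarily the one used in this notebook. `select_balanced_examples`, `n_per_type` and `max_tokens` are illustrative names, and sentence length in pretokenised words is used as a crude stand-in for the LLM token count.

```python
import random


def entity_types(tags):
    """Entity types present in a BIO tag sequence, e.g. {'PER', 'LOC'}."""
    return {tag.split("-", 1)[1] for tag in tags if "-" in tag}


def select_balanced_examples(train_data, n_per_type=2, max_tokens=30, seed=0):
    """Greedily pick (tokens, tags) pairs until each type is covered n_per_type times."""
    rng = random.Random(seed)
    # Drop long sentences so that more examples fit in the context window
    candidates = [ex for ex in train_data if len(ex[0]) <= max_tokens]
    rng.shuffle(candidates)

    counts = {}   # how many chosen sentences contain each entity type
    chosen = []
    for tokens, tags in candidates:
        types = entity_types(tags)
        # Keep a sentence only if it adds a type that is still under-represented
        if any(counts.get(t, 0) < n_per_type for t in types):
            chosen.append((tokens, tags))
            for t in types:
                counts[t] = counts.get(t, 0) + 1
    return chosen


# Toy data for illustration only (not taken from the dataset)
toy_train = [
    (["John", "lives", "in", "Paris"], ["B-PER", "O", "O", "B-LOC"]),
    (["Google", "hired", "Mary"], ["B-ORG", "O", "B-PER"]),
    (["It", "rained"], ["O", "O"]),
    (["Visit", "New", "York"], ["O", "B-LOC", "I-LOC"]),
]

few_shot = select_balanced_examples(toy_train, n_per_type=1)
covered = set().union(*(entity_types(tags) for _, tags in few_shot))
print(len(few_shot), sorted(covered))
```

This strategy never selects sentences without entities, so a prompt built from it would over-represent entity tokens relative to the data. Adding one or two all-'O' sentences by hand may be worth considering.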

## Imports

In [40]:
# Install "bitsandbytes", a library for efficient 8-bit quantization of model weights https://huggingface.co/docs/accelerate/v0.21.0/en/usage_guides/quantization
# Install "accelerate", Hugging Face's library for device placement and distributed execution (needed for device_map="auto")
!pip install -U bitsandbytes accelerate



In [41]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Retrieve the Hugging Face access token stored in the Colab secrets
from google.colab import userdata
access_token = userdata.get('HF_TOKEN')

## Set up the LLM

Load Llama-3.1 and initialize the tokeniser
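To see why 8-bit quantization helps, the sketch below gives a back-of-the-envelope estimate of the memory taken by the weights alone. It assumes roughly 8 billion parameters (read off the model name) and ignores activations, the key-value cache and any quantization overhead, so the real footprint will be somewhat higher. `weight_memory_gib` is a hypothetical helper.

```python
N_PARAMS = 8e9  # assumption: approximate parameter count suggested by the "8B" in the model name


def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Compare full precision, half precision and 8-bit storage
estimates = {bits: weight_memory_gib(N_PARAMS, bits) for bits in (32, 16, 8)}
for bits, gib in estimates.items():
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
```

Under these assumptions, loading in 8-bit halves the weight memory relative to 16-bit weights.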

In [42]:
# Select the model
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Configure for 8-bit quantization to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True
)

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)  # access token is required for the gated Llama 3.1 repository

# Initialize the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    token=access_token
)

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/855 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]
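
As a rough sanity check on why 8-bit loading helps, the shard sizes reported in the download log above can be turned into a back-of-the-envelope memory estimate. The sketch below assumes that the checkpoint stores its weights in a 16-bit format (2 bytes per parameter) and that the progress bars report decimal gigabytes; both are assumptions. The estimate also ignores activations, the KV cache, and any layers that are kept in higher precision.

```python
# Shard sizes in GB, copied from the download log above
shard_sizes_gb = [4.98, 5.00, 4.92, 1.17]

BYTES_PER_PARAM_16BIT = 2  # assumption: checkpoint stored in 16-bit precision
BYTES_PER_PARAM_8BIT = 1   # 8-bit quantised weights

checkpoint_gb = sum(shard_sizes_gb)
# Approximate parameter count in billions (GB divided by bytes per parameter)
approx_params_billion = checkpoint_gb / BYTES_PER_PARAM_16BIT
# Approximate memory needed for the weights after 8-bit quantisation
approx_8bit_gb = approx_params_billion * BYTES_PER_PARAM_8BIT

print(f"Checkpoint size: {checkpoint_gb:.2f} GB")
print(f"Approx. parameters: {approx_params_billion:.2f} billion")
print(f"Approx. 8-bit weight memory: {approx_8bit_gb:.2f} GB")
```

Under these assumptions the weights need roughly half the memory of the 16-bit checkpoint, which is what makes the model practical on a single Colab GPU.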

## ICL prompting

Define a prompt format that includes examples demonstrating the task.

In [43]:
## Define a function to create a prompt with few-shot examples for the NER task
# Takes the tokeniser, the few-shot examples, and the test sentence
def ner_prompt_formatter(tokenizer, examples, test_sentence):
    # System prompt explaining the task to the model
    system_prompt = (
        "You are a named entity recognition system. Given a tokenised sentence, "
        "identify which tokens are named entities using the BIO (Beginning, Inside, Outside) tagging scheme. "
        "Tags are: O (not an entity), B-PER (beginning of person), I-PER (inside of person), "
        "B-ORG (beginning of organization), I-ORG (inside of organization), "
        "B-LOC (beginning of location), I-LOC (inside of location)."
    )

    # Examples for few-shot learning
    # initialize an empty string to store examples
    examples_text = ""
    # Loop through each example (a tuple of tokens and labels), join the elements, and number the examples
    for i, (tokens, labels) in enumerate(examples):
        examples_text += f"Example {i+1}:\nTokens: {' '.join(tokens)}\nLabels: {' '.join(labels)}\n\n"

    # Format the test sentence
    test_text = f"Tokens: {' '.join(test_sentence)}\nLabels:"

    # Combine into a chat template to create a conversation with system and user roles
    return tokenizer.apply_chat_template(
        [
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": examples_text + test_text
            }
        ],
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    )
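
To make the prompt layout concrete, the sketch below rebuilds only the plain-text user message that `ner_prompt_formatter` assembles, without calling the tokenizer or the chat template. The helper name `build_user_message` and the toy sentences are illustrative assumptions, not taken from the dataset.

```python
def build_user_message(examples, test_sentence):
    """Rebuild the user-message text of the few-shot prompt (no tokenizer needed)."""
    examples_text = ""
    for i, (tokens, labels) in enumerate(examples):
        examples_text += (
            f"Example {i+1}:\nTokens: {' '.join(tokens)}\nLabels: {' '.join(labels)}\n\n"
        )
    test_text = f"Tokens: {' '.join(test_sentence)}\nLabels:"
    return examples_text + test_text


# Toy in-context example (hypothetical, for illustration only)
toy_examples = [(["Anna", "lives", "in", "Paris"], ["B-PER", "O", "O", "B-LOC"])]
toy_test = ["Burger", "King"]

message = build_user_message(toy_examples, toy_test)
print(message)
```

The message ends with an open `Labels:` line, so the model is expected to continue with one tag per token of the test sentence.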

Function to run inference to get predictions

In [44]:
## Define a function to make predictions for a test sentence
def predict_ner_with_llm(model, tokenizer, examples, test_sentence):

    # Create the prompt using the formatter
    prompt = ner_prompt_formatter(tokenizer, examples, test_sentence)

    # Move the tokenised inputs to the same device as the model
    # (build a dict with each prompt tensor moved to model.device)
    inputs = {k: v.to(model.device) for k, v in prompt.items()}

    # Generate predictions (no gradient tracking; do_sample=False gives deterministic greedy decoding)
    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

    # Decode only the newly generated token ids (everything after the prompt), skipping special tokens
    generated_text = tokenizer.decode(output[0][len(inputs['input_ids'][0]):], skip_special_tokens=True)

    # Return generated labels (calling parsing function that extracts labels from the output)
    return parse_generated_labels(generated_text, test_sentence)

Parse the predictions to get labels

In [45]:
## Define a function that extracts NER labels from the model's output
# Takes generated text and test sentence for comparison
def parse_generated_labels(generated_text, test_sentence, example_number=None):
    try:

        # Get raw text
        labels_text = generated_text.strip()

        # Extract labels section
        if "Labels:" in labels_text:
            labels_text = labels_text.split("Labels:")[1].strip()

        # Split into words and clean up
        raw_labels = labels_text.split()
        cleaned_labels = []

        for item in raw_labels:
            item = str(item)
            # Skip prompt-format tokens that the model may echo back
            if item in ["Tokens:", "Labels:", "Example", "1:", "2:", "3:", "4:", "5:", "6:", "7:"]:
                continue

            # Check if it's a valid BIO tag
            if item in ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B", "I"]:
                cleaned_labels.append(item)
            else:
                cleaned_labels.append("O")  # Default to O for invalid tags

        # Handle length mismatch
        if len(cleaned_labels) != len(test_sentence):
            if len(cleaned_labels) > len(test_sentence):
                cleaned_labels = cleaned_labels[:len(test_sentence)]
            else:
                cleaned_labels = cleaned_labels + ["O"] * (len(test_sentence) - len(cleaned_labels))
        return cleaned_labels

    except Exception as e:
        print(f"ERROR parsing labels: {e}")
        return ["O"] * len(test_sentence)
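
The clean-up at the end of `parse_generated_labels` can be illustrated in isolation. The sketch below uses a hypothetical helper, `align_labels`, that reproduces only the invalid-tag fallback and the truncate-or-pad behaviour on toy inputs; it is not a replacement for the function above.

```python
VALID_TAGS = {"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B", "I"}


def align_labels(raw_labels, n_tokens):
    """Map invalid tags to 'O', then truncate or pad with 'O' to n_tokens labels."""
    cleaned = [tag if tag in VALID_TAGS else "O" for tag in raw_labels]
    if len(cleaned) > n_tokens:
        return cleaned[:n_tokens]
    return cleaned + ["O"] * (n_tokens - len(cleaned))


# Too few labels: padded with 'O'
short = align_labels("B-LOC".split(), 2)
# Too many labels: truncated to the sentence length
long_ = align_labels("O O B-PER I-PER O".split(), 3)
# Unknown tag: replaced with 'O'
unknown = align_labels("B-MISC O".split(), 2)

print(short, long_, unknown)
```

Padding and truncation keep the label sequence the same length as the sentence, but they cannot repair an offset: if the model drops or adds a label early in the sequence, every later label is compared against the wrong token.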

Function for calculating span metrics (updated version for the LLM)

In [46]:
def calculate_span_metrics(pred_labels, true_labels, tokens):
    """
    Calculate span-based metrics between predicted and true labels
    """
    # Ensure all labels are strings
    pred_labels = [str(label) for label in pred_labels]
    true_labels = [str(label) for label in true_labels]

    # Extract spans
    true_spans = extract_spans(tokens, true_labels)
    pred_spans = extract_spans(tokens, pred_labels)

    # For unlabeled matching, consider only boundaries
    true_spans_unlabeled = [(span[1], span[2]) for span in true_spans]
    pred_spans_unlabeled = [(span[1], span[2]) for span in pred_spans]

    # Count matches
    labeled_matches = sum(1 for span in pred_spans if span in true_spans)
    unlabeled_matches = sum(1 for span in pred_spans_unlabeled if span in true_spans_unlabeled)

    # Calculate match scores
    labeled_match_score = labeled_matches / len(true_spans) if true_spans else 0
    unlabeled_match_score = unlabeled_matches / len(true_spans_unlabeled) if true_spans_unlabeled else 0

    # Group spans by entity type
    true_spans_by_type = {}
    pred_spans_by_type = {}

    for span in true_spans:
        entity_type = span[0]
        if entity_type not in true_spans_by_type:
            true_spans_by_type[entity_type] = []
        true_spans_by_type[entity_type].append(span)

    for span in pred_spans:
        entity_type = span[0]
        if entity_type not in pred_spans_by_type:
            pred_spans_by_type[entity_type] = []
        pred_spans_by_type[entity_type].append(span)

    # Calculate F1 scores
    all_entity_types = set(true_spans_by_type.keys()) | set(pred_spans_by_type.keys())
    f1_scores = {}

    for entity_type in all_entity_types:
        true_type_spans = true_spans_by_type.get(entity_type, [])
        pred_type_spans = pred_spans_by_type.get(entity_type, [])

        # Count matches
        matches = sum(1 for span in pred_type_spans if span in true_type_spans)

        # Calculate precision, recall, F1
        precision = matches / len(pred_type_spans) if pred_type_spans else 0
        recall = matches / len(true_type_spans) if true_type_spans else 0
        f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        f1_scores[entity_type] = {
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

    # Calculate macro average
    if f1_scores:
        macro_avg = {
            'precision': sum(scores['precision'] for scores in f1_scores.values()) / len(f1_scores),
            'recall': sum(scores['recall'] for scores in f1_scores.values()) / len(f1_scores),
            'f1': sum(scores['f1'] for scores in f1_scores.values()) / len(f1_scores)
        }
        f1_scores['macro_avg'] = macro_avg

    return {
        'span_metrics': {
            'labeled_match_score': labeled_match_score,
            'unlabeled_match_score': unlabeled_match_score,
            'true_spans_count': len(true_spans),
            'pred_spans_count': len(pred_spans),
            'labeled_matches': labeled_matches,
            'unlabeled_matches': unlabeled_matches
        },
        'f1_scores': f1_scores
    }
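
The difference between labelled and unlabelled span matching is easiest to see on a tiny case. `extract_spans` is defined earlier in the notebook and is not repeated here, so the sketch below uses a hypothetical stand-in, `extract_spans_sketch`, which assumes spans are `(entity_type, start, end)` tuples with inclusive indices, as in the printed evaluation output. How the stand-in treats stray `I-` tags is also an assumption.

```python
def extract_spans_sketch(labels):
    """Return (entity_type, start, end) tuples with inclusive token indices."""
    spans, current = [], None  # current = [type, start, end] of the open span
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [label[2:], i, i]
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[2] = i  # extend the open span
        else:
            if current:
                spans.append(tuple(current))
            current = None  # 'O' or a stray I- tag closes any open span
    if current:
        spans.append(tuple(current))
    return spans


# Labels from the "Burger King" sentence in the evaluation output
true_spans = extract_spans_sketch(["B-LOC", "I-LOC"])
pred_spans = extract_spans_sketch(["B-ORG", "I-ORG"])

labeled = sum(1 for s in pred_spans if s in true_spans)
true_bounds = [(s[1], s[2]) for s in true_spans]
unlabeled = sum(1 for s in pred_spans if (s[1], s[2]) in true_bounds)

print(true_spans, pred_spans)
print(f"labelled matches: {labeled}, unlabelled matches: {unlabeled}")
```

Here the predicted boundaries are right but the entity type is wrong, so the span counts towards the unlabelled match score only.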

Function to evaluate the predictions

In [47]:
## Define a function to evaluate model predictions
def evaluate_llm_ner(model, tokenizer, few_shot_examples, test_data):

    # Track token-level metrics
    token_correct = 0
    token_total = 0

    # For span metrics
    all_labeled_matches = 0
    all_unlabeled_matches = 0
    all_true_spans = 0

    # For per-entity metrics
    entity_metrics = {}

    # Process each test sentence
    for tokens, true_labels in tqdm(test_data, desc="Evaluating LLM"):
        # Get predictions
        pred_labels = predict_ner_with_llm(model, tokenizer, few_shot_examples, tokens)

        # Ensure all labels are strings
        pred_labels = [str(label) for label in pred_labels]
        true_labels = [str(label) for label in true_labels]


        # Calculate token-level accuracy
        correct = sum(1 for p, t in zip(pred_labels, true_labels) if p == t)
        token_correct += correct
        token_total += len(true_labels)

        # Calculate span metrics for this sentence
        span_results = calculate_span_metrics(pred_labels, true_labels, tokens)


        ##DEBUG
        # Extract spans for debugging
        true_spans = extract_spans(tokens, true_labels)
        pred_spans = extract_spans(tokens, pred_labels)

        # Print detailed output for examples with entities
        if len(true_spans) > 0:  # If there are any gold entities
            print("\n==== COMPARE GOLD TO LLM OUTPUT ====")
            print(f"Sentence: {' '.join(tokens)}")

            # Print aligned tokens and labels
            print("\nToken-by-token comparison:")
            print(f"{'TOKEN':20} | {'TRUE LABEL':10} | {'PRED LABEL':10}")
            print("-" * 44)

            for token, true_label, pred_label in zip(tokens, true_labels, pred_labels):
                match = "✓" if true_label == pred_label else "✗"
                print(f"{token:20} | {true_label:10} | {pred_label:10} {match}")

            # Print spans
            print("\nEntity spans:")
            print(f"TRUE SPANS ({len(true_spans)}): {true_spans}")
            print(f"PRED SPANS ({len(pred_spans)}): {pred_spans}")

            # Calculate correct spans for this example
            correct_spans = [span for span in pred_spans if span in true_spans]
            print(f"CORRECT SPANS ({len(correct_spans)}): {correct_spans}")

        # Aggregate span matching metrics
        metrics = span_results['span_metrics']
        all_labeled_matches += metrics['labeled_matches']
        all_unlabeled_matches += metrics['unlabeled_matches']
        all_true_spans += metrics['true_spans_count']

        # Aggregate entity-type metrics
        for entity_type, scores in span_results['f1_scores'].items():
            if entity_type != 'macro_avg':
                if entity_type not in entity_metrics:
                    entity_metrics[entity_type] = {
                        'true_positives': 0,
                        'false_positives': 0,
                        'false_negatives': 0
                    }

                # Calculate TP, FP, FN from the spans already extracted for this sentence
                true_type_spans = set(span for span in true_spans if span[0] == entity_type)
                pred_type_spans = set(span for span in pred_spans if span[0] == entity_type)

                # Calculate metrics
                tp = len(true_type_spans & pred_type_spans)  # Intersection
                fp = len(pred_type_spans - true_type_spans)  # In pred but not in true
                fn = len(true_type_spans - pred_type_spans)  # In true but not in pred

                # Accumulate
                entity_metrics[entity_type]['true_positives'] += tp
                entity_metrics[entity_type]['false_positives'] += fp
                entity_metrics[entity_type]['false_negatives'] += fn


    # DEBUG: summary statistics, to check whether zero scores are caused by too few entity examples
    print("\n EVALUATION")
    print(f"Total tokens processed: {token_total}")
    if token_total > 0:
        print(f"Correctly labeled tokens: {token_correct} ({token_correct/token_total:.2%})")
    print(f"Total true spans found: {all_true_spans}")
    print(f"Correctly predicted spans: {all_labeled_matches}")

    if all_true_spans > 0:
        print(f"Span recall: {all_labeled_matches/all_true_spans:.2%} of true spans")



    # Calculate overall token accuracy
    token_accuracy = token_correct / token_total if token_total > 0 else 0

    # Calculate final span matching scores
    span_match_scores = {
        'labeled_span_match': all_labeled_matches / all_true_spans if all_true_spans > 0 else 0,
        'unlabeled_span_match': all_unlabeled_matches / all_true_spans if all_true_spans > 0 else 0
    }

    # Calculate final F1 scores by entity type
    f1_scores = {}
    for entity_type, metrics_dict in entity_metrics.items():
        tp = metrics_dict['true_positives']
        fp = metrics_dict['false_positives']
        fn = metrics_dict['false_negatives']

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        f1_scores[entity_type] = {
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

    # Calculate macro average F1
    if f1_scores:
        f1_scores['macro_avg'] = {
            'precision': sum(scores['precision'] for scores in f1_scores.values()) / len(f1_scores),
            'recall': sum(scores['recall'] for scores in f1_scores.values()) / len(f1_scores),
            'f1': sum(scores['f1'] for scores in f1_scores.values()) / len(f1_scores)
        }

    # Combine all metrics
    results = {
        'token_accuracy': token_accuracy,
        'span_match_scores': span_match_scores,
        'f1_scores': f1_scores
    }

    return results
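
`evaluate_llm_ner` sums true positives, false positives, and false negatives over all sentences before computing precision, recall, and F1 per entity type. A minimal sketch with made-up counts (the numbers below are hypothetical, not results from this notebook) shows how the pooled scores are derived; the helper name `prf` is also an assumption.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from counts, returning 0 when a denominator is 0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return precision, recall, f1


# Hypothetical per-sentence (tp, fp, fn) counts for one entity type
per_sentence_counts = [(1, 1, 0), (0, 1, 1), (0, 0, 1)]

# Pool the counts across sentences, as the evaluation loop does
tp = sum(c[0] for c in per_sentence_counts)
fp = sum(c[1] for c in per_sentence_counts)
fn = sum(c[2] for c in per_sentence_counts)

precision, recall, f1 = prf(tp, fp, fn)
print(f"tp={tp} fp={fp} fn={fn}")
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Pooling counts this way means that a sentence with no entities of a given type contributes nothing to that type's score, rather than contributing a zero F1.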

### Function to take examples from the training data as few-shot examples


The selected examples are then used to evaluate performance on 100 examples from the dev and test sets. Evaluating on the full datasets would give more reliable metrics and a fairer comparison with BERT.

In [48]:
# Define constants for NER tags
NER_TAGS = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']

# Define a function to select balanced few shot examples
def select_balanced_examples(train_data, required_tags):

    selected_examples = []
    covered_tags = set()

    # First pass: try to find examples that contain tags not yet covered
    for tokens, labels in train_data:
        tags_in_example = set(labels)
        new_tags = tags_in_example - covered_tags

        if new_tags:
            selected_examples.append((tokens, labels))
            covered_tags.update(tags_in_example)

            # Check if we've covered all required tags
            if all(tag in covered_tags for tag in required_tags):
                break

    # If we still haven't covered all tags, add more examples
    if not all(tag in covered_tags for tag in required_tags):
        for tag in required_tags:
            if tag not in covered_tags:
                # Find an example with this tag
                for tokens, labels in train_data:
                    if tag in labels and (tokens, labels) not in selected_examples:
                        selected_examples.append((tokens, labels))
                        covered_tags.add(tag)
                        break

    return selected_examples

### Run evaluation on a small subset (for testing)

In [49]:
# Call function with the global NER_TAGS
few_shot_examples = select_balanced_examples(data_dict['en_ewt']['train'], NER_TAGS)

# See tag distribution in examples
tag_counts = {}
for _, labels in few_shot_examples:
    for label in labels:
        tag_counts[label] = tag_counts.get(label, 0) + 1

print("Tag distribution:")
for tag, count in sorted(tag_counts.items()):
    print(f"  {tag}: {count}")

Tag distribution:
  B-LOC: 2
  B-ORG: 1
  B-PER: 2
  I-LOC: 1
  I-ORG: 1
  I-PER: 1
  O: 38
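
The printed distribution can be summarised as proportions to see how strongly `O` dominates the in-context examples. The sketch below copies the counts from the output above into a `Counter`; the variable names are illustrative.

```python
from collections import Counter

# Counts copied from the "Tag distribution" output above
printed_counts = Counter({
    "B-LOC": 2, "B-ORG": 1, "B-PER": 2,
    "I-LOC": 1, "I-ORG": 1, "I-PER": 1, "O": 38,
})

total = sum(printed_counts.values())
shares = {tag: count / total for tag, count in printed_counts.items()}

for tag, share in sorted(shares.items()):
    print(f"  {tag}: {share:.1%}")
```

Every tag is covered, but most entity tags appear only once or twice among the 46 labels shown to the model.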


In [58]:
# Evaluate on dev set
dev_results = evaluate_llm_ner(model, tokenizer, few_shot_examples, data_dict['en_ewt']['dev'][:100])  # first 100 dev sentences only

# Print dev results
print(f"\n{'-'*80}")
print(f"Results for LLM with ICL - Development Set")
print(f"{'-'*80}")
print(f"Token Accuracy: {dev_results['token_accuracy']:.4f}")

span_scores = dev_results['span_match_scores']
print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")

macro = dev_results['f1_scores']['macro_avg']
print(f"Precision: {macro['precision']:.4f}")
print(f"Recall: {macro['recall']:.4f}")
print(f"F-score: {macro['f1']:.4f}")


Evaluating LLM:   0%|          | 0/100 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: where can I get morcillas in tampa bay , I will like the argentinian type , but I will to try anothers please ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
where                | O          | O          ✓
can                  | O          | O          ✓
I                    | O          | O          ✓
get                  | O          | O          ✓
morcillas            | O          | O          ✓
in                   | O          | O          ✓
tampa                | B-LOC      | O          ✗
bay                  | I-LOC      | O          ✗
,                    | O          | O          ✓
I                    | O          | I          ✗
will                 | O          | O          ✓
like                 | O          | B-PER      ✗
the                  | O          | O          ✓
argentinian          | O          | O          ✓
type                 | O          




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: I searched all over the internet , but I could not find one place in Tampa Bay that sells morcillas , also known as blood pudding , black pudding and blood sausages .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
I                    | O          | O          ✓
searched             | O          | O          ✓
all                  | O          | O          ✓
over                 | O          | O          ✓
the                  | O          | O          ✓
internet             | O          | O          ✓
,                    | O          | B-LOC      ✗
but                  | O          | I-LOC      ✗
I                    | O          | O          ✓
could                | O          | O          ✓
not                  | O          | O          ✓
find                 | O          | O          ✓
one                  | O          | O          ✓
place                | O    




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: I learned that morcillas are basically impossible to find all across the North American region .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
I                    | O          | O          ✓
learned              | O          | O          ✓
that                 | O          | O          ✓
morcillas            | O          | O          ✓
are                  | O          | O          ✓
basically            | O          | B-LOC      ✗
impossible           | O          | I-LOC      ✗
to                   | O          | O          ✓
find                 | O          | O          ✓
all                  | O          | O          ✓
across               | O          | O          ✓
the                  | O          | O          ✓
North                | B-LOC      | O          ✗
American             | I-LOC      | O          ✗
region               | O          | O          ✓





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Miramar ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Miramar              | B-LOC      | B-LOC      ✓
?                    | O          | I-LOC      ✗

Entity spans:
TRUE SPANS (1): [('LOC', 0, 0)]
PRED SPANS (1): [('LOC', 0, 1)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Well you say Miramar I say Piramar

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Well                 | O          | O          ✓
you                  | O          | O          ✓
say                  | O          | B-LOC      ✗
Miramar              | B-LOC      | I-LOC      ✗
I                    | O          | O          ✓
say                  | O          | B-LOC      ✗
Piramar              | O          | I-LOC      ✗

Entity spans:
TRUE SPANS (1): [('LOC', 3, 3)]
PRED SPANS (2): [('LOC', 2, 3), ('LOC', 5, 6)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: MIRAMAR

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
MIRAMAR              | B-LOC      | O          ✗

Entity spans:
TRUE SPANS (1): [('LOC', 0, 0)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Miramir is for real , but there are a lot that make you wonder .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Miramir              | B-LOC      | O          ✗
is                   | O          | O          ✓
for                  | O          | O          ✓
real                 | O          | O          ✓
,                    | O          | O          ✓
but                  | O          | O          ✓
there                | O          | O          ✓
are                  | O          | O          ✓
a                    | O          | O          ✓
lot                  | O          | O          ✓
that                 | O          | O          ✓
make                 | O          | O          ✓
you                  | O          | O          ✓
wonder               | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: There are way more stranger names in the U.S for areas than Miramar .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
There                | O          | O          ✓
are                  | O          | O          ✓
way                  | O          | O          ✓
more                 | O          | O          ✓
stranger             | O          | O          ✓
names                | O          | B-LOC      ✗
in                   | O          | I-LOC      ✗
the                  | O          | O          ✓
U.S                  | B-LOC      | O          ✗
for                  | O          | O          ✓
areas                | O          | O          ✓
than                 | O          | O          ✓
Miramar              | B-LOC      | O          ✗
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (2): [('LOC', 8, 8), ('LOC', 12, 12)]
PRED SPANS (




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: i think Miramar was a famous goat trainer or something .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
i                    | O          | O          ✓
think                | O          | O          ✓
Miramar              | B-LOC      | O          ✗
was                  | O          | B-LOC      ✗
a                    | O          | I-LOC      ✗
famous               | O          | O          ✓
goat                 | O          | O          ✓
trainer              | O          | O          ✓
or                   | O          | O          ✓
something            | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 2, 2)]
PRED SPANS (1): [('LOC', 3, 4)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: What do you eat in Miramar ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
What                 | O          | O          ✓
do                   | O          | O          ✓
you                  | O          | O          ✓
eat                  | O          | O          ✓
in                   | O          | B-LOC      ✗
Miramar              | B-LOC      | I-LOC      ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 5, 5)]
PRED SPANS (1): [('LOC', 4, 5)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: How about empanadas arabes or other empanadas from that area of Argentina ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
How                  | O          | O          ✓
about                | O          | O          ✓
empanadas            | O          | O          ✓
arabes               | O          | O          ✓
or                   | O          | O          ✓
other                | O          | O          ✓
empanadas            | O          | O          ✓
from                 | O          | O          ✓
that                 | O          | O          ✓
area                 | O          | O          ✓
of                   | O          | O          ✓
Argentina            | B-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 11, 11)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: 1 cup of from that area of Argentina

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
1                    | O          | O          ✓
cup                  | O          | O          ✓
of                   | O          | O          ✓
from                 | O          | O          ✓
that                 | O          | O          ✓
area                 | O          | O          ✓
of                   | O          | O          ✓
Argentina            | B-LOC      | O          ✗

Entity spans:
TRUE SPANS (1): [('LOC', 7, 7)]
PRED SPANS (0): []
CORRECT SPANS (0): []




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Iguazu is a big or a small country ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Iguazu               | B-LOC      | B-LOC      ✓
is                   | O          | O          ✓
a                    | O          | O          ✓
big                  | O          | O          ✓
or                   | O          | O          ✓
a                    | O          | O          ✓
small                | O          | B-LOC      ✗
country              | O          | O          ✓
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 0, 0)]
PRED SPANS (2): [('LOC', 0, 0), ('LOC', 6, 6)]
CORRECT SPANS (1): [('LOC', 0, 0)]





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Iguazu is NOT a country ....

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Iguazu               | B-LOC      | O          ✗
is                   | O          | B-LOC      ✗
NOT                  | O          | O          ✓
a                    | O          | O          ✓
country              | O          | O          ✓
....                 | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 0, 0)]
PRED SPANS (1): [('LOC', 1, 1)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Iguazu is in Argentina :)

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Iguazu               | B-LOC      | O          ✗
is                   | O          | B-LOC      ✗
in                   | O          | O          ✓
Argentina            | B-LOC      | O          ✗
:)                   | O          | O          ✓

Entity spans:
TRUE SPANS (2): [('LOC', 0, 0), ('LOC', 3, 3)]
PRED SPANS (1): [('LOC', 1, 1)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Which of these do you like : McDonald s , Burger King , Taco Bell , Wendy s ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Which                | O          | O          ✓
of                   | O          | O          ✓
these                | O          | O          ✓
do                   | O          | O          ✓
you                  | O          | O          ✓
like                 | O          | O          ✓
:                    | O          | O          ✓
McDonald             | B-LOC      | O          ✗
s                    | I-LOC      | O          ✗
,                    | O          | O          ✓
Burger               | B-LOC      | O          ✗
King                 | I-LOC      | O          ✗
,                    | O          | O          ✓
Taco                 | B-LOC      | O          ✗
Bell                 | I-LOC      | O          ✗
,                  




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Burger King

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Burger               | B-LOC      | B-ORG      ✗
King                 | I-LOC      | I-ORG      ✗

Entity spans:
TRUE SPANS (1): [('LOC', 0, 1)]
PRED SPANS (1): [('ORG', 0, 1)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: it seems like I 'm at a real restaurant like Applebee s , their food is usually that good

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
it                   | O          | O          ✓
seems                | O          | O          ✓
like                 | O          | O          ✓
I                    | O          | O          ✓
'm                   | O          | O          ✓
at                   | O          | B-LOC      ✗
a                    | O          | O          ✓
real                 | O          | O          ✓
restaurant           | O          | O          ✓
like                 | O          | O          ✓
Applebee             | B-LOC      | O          ✗
s                    | I-LOC      | O          ✗
,                    | O          | O          ✓
their                | O          | O          ✓
food                 | O          | O          ✓
is     




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: McDonal s the best for me

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
McDonal              | B-LOC      | B-ORG      ✗
s                    | I-LOC      | I-ORG      ✗
the                  | O          | O          ✓
best                 | O          | O          ✓
for                  | O          | O          ✓
me                   | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 0, 1)]
PRED SPANS (1): [('ORG', 0, 1)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: the one i like the most would be wendy 's .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
the                  | O          | O          ✓
one                  | O          | O          ✓
i                    | O          | O          ✓
like                 | O          | O          ✓
the                  | O          | O          ✓
most                 | O          | B-ORG      ✗
would                | O          | I-ORG      ✗
be                   | O          | O          ✓
wendy                | B-LOC      | O          ✗
's                   | I-LOC      | O          ✗
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 8, 9)]
PRED SPANS (1): [('ORG', 5, 6)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: When was Miramar founded ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
When                 | O          | O          ✓
was                  | O          | O          ✓
Miramar              | B-LOC      | B-LOC      ✓
founded              | O          | I-LOC      ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 2, 2)]
PRED SPANS (1): [('LOC', 2, 3)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Miramar was founded September 20 1888 .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Miramar              | B-LOC      | B-LOC      ✓
was                  | O          | O          ✓
founded              | O          | O          ✓
September            | O          | O          ✓
20                   | O          | O          ✓
1888                 | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 0, 0)]
PRED SPANS (1): [('LOC', 0, 0)]
CORRECT SPANS (1): [('LOC', 0, 0)]





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: What foods do you eat in Miramar ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
What                 | O          | O          ✓
foods                | O          | O          ✓
do                   | O          | O          ✓
you                  | O          | O          ✓
eat                  | O          | B-LOC      ✗
in                   | O          | I-LOC      ✗
Miramar              | B-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 6, 6)]
PRED SPANS (1): [('LOC', 4, 5)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: you mean miramar florida theyy have good seafood there

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
you                  | O          | O          ✓
mean                 | O          | O          ✓
miramar              | B-LOC      | B-LOC      ✓
florida              | B-LOC      | I-LOC      ✗
theyy                | O          | O          ✓
have                 | O          | O          ✓
good                 | O          | O          ✓
seafood              | O          | O          ✓
there                | O          | O          ✓

Entity spans:
TRUE SPANS (2): [('LOC', 2, 2), ('LOC', 3, 3)]
PRED SPANS (1): [('LOC', 2, 3)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Why is the city called Miramar ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Why                  | O          | O          ✓
is                   | O          | O          ✓
the                  | O          | O          ✓
city                 | O          | B-LOC      ✗
called               | O          | I-LOC      ✗
Miramar              | B-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 5, 5)]
PRED SPANS (1): [('LOC', 3, 4)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: I 'm not sure about the origin of the name but they are a lot of different cities with different and unique names like Miramar so it 's just a name .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
I                    | O          | O          ✓
'm                   | O          | O          ✓
not                  | O          | O          ✓
sure                 | O          | O          ✓
about                | O          | O          ✓
the                  | O          | O          ✓
origin               | O          | O          ✓
of                   | O          | O          ✓
the                  | O          | O          ✓
name                 | O          | O          ✓
but                  | O          | O          ✓
they                 | O          | B-LOC      ✗
are                  | O          | I-LOC      ✗
a                    | O          | O        




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: There 's lots of towns called Miramar , it 'd help a lot if you listed a state or some sort of context you 're looking for it in .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
There                | O          | O          ✓
's                   | O          | O          ✓
lots                 | O          | O          ✓
of                   | O          | O          ✓
towns                | O          | O          ✓
called               | O          | O          ✓
Miramar              | B-LOC      | O          ✗
,                    | O          | O          ✓
it                   | O          | O          ✓
'd                   | O          | I          ✗
help                 | O          | O          ✓
a                    | O          | O          ✓
lot                  | O          | O          ✓
if                   | O          | O          ✓
you            




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: There 's a Miramar in Florida , just north of Miami .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
There                | O          | O          ✓
's                   | O          | O          ✓
a                    | O          | B-LOC      ✗
Miramar              | B-LOC      | I-LOC      ✗
in                   | O          | O          ✓
Florida              | B-LOC      | O          ✗
,                    | O          | B-LOC      ✗
just                 | O          | I-LOC      ✗
north                | O          | O          ✓
of                   | O          | O          ✓
Miami                | B-LOC      | O          ✗
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (3): [('LOC', 3, 3), ('LOC', 5, 5), ('LOC', 10, 10)]
PRED SPANS (2): [('LOC', 2, 3), ('LOC', 6, 7)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: There 's also a Miramar in California , the site of a rather large Air Force Base ...

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
There                | O          | O          ✓
's                   | O          | O          ✓
also                 | O          | B-LOC      ✗
a                    | O          | I-LOC      ✗
Miramar              | B-LOC      | O          ✗
in                   | O          | O          ✓
California           | B-LOC      | B-ORG      ✗
,                    | O          | I-ORG      ✗
the                  | O          | O          ✓
site                 | O          | O          ✓
of                   | O          | O          ✓
a                    | O          | O          ✓
rather               | O          | O          ✓
large                | O          | O          ✓
Air                  | B-LOC      | O          ✗
Force      




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Miramar California is a bit north of San Diego .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Miramar              | B-LOC      | B-LOC      ✓
California           | B-LOC      | I-LOC      ✗
is                   | O          | O          ✓
a                    | O          | O          ✓
bit                  | O          | B-LOC      ✗
north                | O          | I-LOC      ✗
of                   | O          | O          ✓
San                  | B-LOC      | O          ✗
Diego                | I-LOC      | O          ✗
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (3): [('LOC', 0, 0), ('LOC', 1, 1), ('LOC', 7, 8)]
PRED SPANS (2): [('LOC', 0, 1), ('LOC', 4, 5)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Can you recommend any restaurants in Buenos Aires ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Can                  | O          | O          ✓
you                  | O          | O          ✓
recommend            | O          | O          ✓
any                  | O          | O          ✓
restaurants          | O          | B-LOC      ✗
in                   | O          | I-LOC      ✗
Buenos               | B-LOC      | O          ✗
Aires                | I-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 6, 7)]
PRED SPANS (1): [('LOC', 4, 5)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: air asia flight attendant ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
air                  | B-ORG      | O          ✗
asia                 | I-ORG      | O          ✗
flight               | O          | O          ✓
attendant            | O          | O          ✓
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('ORG', 0, 1)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Good day , I 'm a foreigner living in malaysia , german citizen , 21 years of age , I was wondering if air asia recruits foreigners ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Good                 | O          | O          ✓
day                  | O          | O          ✓
,                    | O          | O          ✓
I                    | O          | O          ✓
'm                   | O          | O          ✓
a                    | O          | O          ✓
foreigner            | O          | O          ✓
living               | O          | O          ✓
in                   | O          | O          ✓
malaysia             | B-LOC      | O          ✗
,                    | O          | O          ✓
german               | O          | O          ✓
citizen              | O          | O          ✓
,                    | O          | O          ✓
21          




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: What kind of Meal do peopel in Argentina have ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
What                 | O          | O          ✓
kind                 | O          | O          ✓
of                   | O          | O          ✓
Meal                 | O          | O          ✓
do                   | O          | O          ✓
peopel               | O          | O          ✓
in                   | O          | O          ✓
Argentina            | B-LOC      | O          ✗
have                 | O          | O          ✓
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 7, 7)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: I am doing a project and need to know what kind food Argentina people eat for breakfast , lunch , and dinner .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
I                    | O          | O          ✓
am                   | O          | O          ✓
doing                | O          | O          ✓
a                    | O          | B-LOC      ✗
project              | O          | O          ✓
and                  | O          | O          ✓
need                 | O          | O          ✓
to                   | O          | O          ✓
know                 | O          | O          ✓
what                 | O          | O          ✓
kind                 | O          | O          ✓
food                 | O          | O          ✓
Argentina            | B-LOC      | O          ✗
people               | O          | O          ✓
eat                  | O          |




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Like in America dinner is our main meal .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Like                 | O          | O          ✓
in                   | O          | B-LOC      ✗
America              | B-LOC      | O          ✗
dinner               | O          | O          ✓
is                   | O          | O          ✓
our                  | O          | O          ✓
main                 | O          | O          ✓
meal                 | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 2, 2)]
PRED SPANS (1): [('LOC', 1, 1)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: What is the dress code for females at Del Frisco 's Philadelphia ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
What                 | O          | O          ✓
is                   | O          | O          ✓
the                  | O          | O          ✓
dress                | O          | O          ✓
code                 | O          | O          ✓
for                  | O          | O          ✓
females              | O          | O          ✓
at                   | O          | B-ORG      ✗
Del                  | B-LOC      | I-ORG      ✗
Frisco               | I-LOC      | O          ✗
's                   | I-LOC      | I-ORG      ✗
Philadelphia         | B-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (2): [('LOC', 8, 10), ('LOC', 11, 11)]
PRED SPANS (1): [('ORG', 7, 8)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: I will be going to Del Frisco 's in late November for dinner and I was wondering what the dress code is for a female .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
I                    | O          | O          ✓
will                 | O          | B-ORG      ✗
be                   | O          | I-ORG      ✗
going                | O          | O          ✓
to                   | O          | O          ✓
Del                  | B-LOC      | O          ✗
Frisco               | I-LOC      | O          ✗
's                   | I-LOC      | O          ✗
in                   | O          | O          ✓
late                 | O          | O          ✓
November             | O          | O          ✓
for                  | O          | O          ✓
dinner               | O          | O          ✓
and                  | O          | B-PER      ✗
I                    | O   




==== EVALUATION SUMMARY ====
Total tokens processed: 1051
Correctly labeled tokens: 903 (85.92%)
Total true spans found: 52
Correctly predicted spans: 2
Span recall: 3.85% of true spans

--------------------------------------------------------------------------------
Results for LLM with ICL - Development Set
--------------------------------------------------------------------------------
Token Accuracy: 0.8592
Labeled Span Match: 0.0385
Unlabeled Span Match: 0.0769
Precision: 0.0167
Recall: 0.0133
F-score: 0.0148
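
The span tuples and span-match figures printed above can be reproduced in spirit with a short helper. The following is a minimal sketch, not the notebook's `evaluate_llm_ner`: the names `bio_to_spans` and `span_match_scores` are hypothetical, and two conventions are assumed from the printed output only. First, a span opens only at a `B-` tag, so a stray `I-` tag (or a malformed label such as `I`) is ignored. Second, the match scores are fractions of gold spans, with the unlabeled score comparing boundaries and ignoring the entity type.

```python
from typing import List, Tuple

# (entity type, start index, end index), with the end index inclusive
Span = Tuple[str, int, int]


def bio_to_spans(labels: List[str]) -> List[Span]:
    """Collect entity spans from a BIO label sequence.

    Assumption: only a B- tag opens a span; an I- tag continues the open
    span when the types agree, and is otherwise treated like O.
    """
    spans: List[Span] = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            # Close any open span before opening a new one
            if current_type is not None:
                spans.append((current_type, start, i - 1))
            current_type, start = label[2:], i
        elif label.startswith("I-") and current_type == label[2:]:
            continue  # still inside the open span
        else:
            # O, a stray I- tag, or a malformed label closes the open span
            if current_type is not None:
                spans.append((current_type, start, i - 1))
            current_type, start = None, None
    if current_type is not None:
        spans.append((current_type, start, len(labels) - 1))
    return spans


def span_match_scores(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (labeled, unlabeled) match as fractions of the gold spans."""
    gold_spans = bio_to_spans(gold)
    pred_spans = bio_to_spans(pred)
    if not gold_spans:
        return 0.0, 0.0
    pred_set = set(pred_spans)
    pred_bounds = {(s, e) for _, s, e in pred_spans}
    labeled = sum(span in pred_set for span in gold_spans)
    unlabeled = sum((s, e) in pred_bounds for _, s, e in gold_spans)
    return labeled / len(gold_spans), unlabeled / len(gold_spans)


# "Burger King": boundaries agree, entity type does not
print(span_match_scores(["B-LOC", "I-LOC"], ["B-ORG", "I-ORG"]))  # (0.0, 1.0)
```

Under these assumptions, a prediction that finds the right boundaries with the wrong type counts towards the unlabeled score only, which is why the unlabeled figure above is higher than the labeled one.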


In [59]:
# Evaluate on the first 100 sentences of the in-domain test set
test_results = evaluate_llm_ner(model, tokenizer, few_shot_examples, data_dict['en_ewt']['test'][:100])

# Print test results
print(f"\n{'-'*80}")
print("Results for LLM with ICL - In-Domain Test Set")
print(f"{'-'*80}")
print(f"Token Accuracy: {test_results['token_accuracy']:.4f}")

span_scores = test_results['span_match_scores']
print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")

macro = test_results['f1_scores']['macro_avg']
print(f"Precision: {macro['precision']:.4f}")
print(f"Recall: {macro['recall']:.4f}")
print(f"F-score: {macro['f1']:.4f}")

Evaluating LLM:   0%|          | 0/100 [00:00<?, ?it/s]




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: What is this Miramar ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
What                 | O          | O          ✓
is                   | O          | O          ✓
this                 | O          | B-LOC      ✗
Miramar              | B-LOC      | I-LOC      ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 3, 3)]
PRED SPANS (1): [('LOC', 2, 3)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: It is a place in Argentina lol

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
It                   | O          | O          ✓
is                   | O          | O          ✓
a                    | O          | O          ✓
place                | O          | O          ✓
in                   | O          | O          ✓
Argentina            | B-LOC      | O          ✗
lol                  | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 5, 5)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: " In Argentina , beef is revered , respected , and praised .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
"                    | O          | O          ✓
In                   | O          | B-LOC      ✗
Argentina            | B-LOC      | O          ✗
,                    | O          | O          ✓
beef                 | O          | O          ✓
is                   | O          | O          ✓
revered              | O          | O          ✓
,                    | O          | O          ✓
respected            | O          | O          ✓
,                    | O          | O          ✓
and                  | O          | O          ✓
praised              | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 2, 2)]
PRED SPANS (1): [('LOC', 1, 1)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: A taste of Argentina .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
A                    | O          | O          ✓
taste                | O          | B-LOC      ✗
of                   | O          | I-LOC      ✗
Argentina            | B-LOC      | O          ✗
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 3, 3)]
PRED SPANS (1): [('LOC', 1, 2)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: What language is talked in Iguazu ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
What                 | O          | O          ✓
language             | O          | O          ✓
is                   | O          | O          ✓
talked               | O          | B-LOC      ✗
in                   | O          | I-LOC      ✗
Iguazu               | B-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 5, 5)]
PRED SPANS (1): [('LOC', 3, 4)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Do you think there are any koreans in Miramar ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Do                   | O          | O          ✓
you                  | O          | O          ✓
think                | O          | O          ✓
there                | O          | O          ✓
are                  | O          | O          ✓
any                  | O          | B-PER      ✗
koreans              | O          | I-PER      ✗
in                   | O          | O          ✓
Miramar              | B-LOC      | B-LOC      ✓
?                    | O          | I-LOC      ✗

Entity spans:
TRUE SPANS (1): [('LOC', 8, 8)]
PRED SPANS (2): [('PER', 5, 6), ('LOC', 8, 9)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Does anyone know any good restaurants in cordoba ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Does                 | O          | O          ✓
anyone               | O          | O          ✓
know                 | O          | O          ✓
any                  | O          | O          ✓
good                 | O          | O          ✓
restaurants          | O          | B-LOC      ✗
in                   | O          | I-LOC      ✗
cordoba              | B-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 7, 7)]
PRED SPANS (1): [('LOC', 5, 6)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Does anyone know of any good food in iguazu ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Does                 | O          | O          ✓
anyone               | O          | O          ✓
know                 | O          | O          ✓
of                   | O          | O          ✓
any                  | O          | O          ✓
good                 | O          | O          ✓
food                 | O          | B-LOC      ✗
in                   | O          | I-LOC      ✗
iguazu               | B-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 8, 8)]
PRED SPANS (1): [('LOC', 6, 7)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: well , I do n't ask questions here because I have no clue what " Iguazu " is ...

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
well                 | O          | O          ✓
,                    | O          | O          ✓
I                    | O          | O          ✓
do                   | O          | O          ✓
n't                  | O          | O          ✓
ask                  | O          | B-LOC      ✗
questions            | O          | O          ✓
here                 | O          | O          ✓
because              | O          | O          ✓
I                    | O          | O          ✓
have                 | O          | O          ✓
no                   | O          | O          ✓
clue                 | O          | O          ✓
what                 | O          | O          ✓
"                    | O          | O          ✓
Iguazu          




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Besides eating good foods , what else do people do in Miramar ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Besides              | O          | O          ✓
eating               | O          | O          ✓
good                 | O          | O          ✓
foods                | O          | O          ✓
,                    | O          | B-LOC      ✗
what                 | O          | I-LOC      ✗
else                 | O          | O          ✓
do                   | O          | O          ✓
people               | O          | O          ✓
do                   | O          | O          ✓
in                   | O          | O          ✓
Miramar              | B-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 11, 11)]
PRED SPANS (1): [('LOC', 4, 5)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Is Hank Green Awesome ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Is                   | O          | O          ✓
Hank                 | B-PER      | B-PER      ✓
Green                | I-PER      | O          ✗
Awesome              | O          | O          ✓
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('PER', 1, 2)]
PRED SPANS (1): [('PER', 1, 1)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The Hank Green I know is hardly awesome .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
Hank                 | B-PER      | O          ✗
Green                | I-PER      | O          ✗
I                    | O          | O          ✓
know                 | O          | O          ✓
is                   | O          | O          ✓
hardly               | O          | O          ✓
awesome              | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('PER', 1, 2)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: I have a Nacho Libre question .?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
I                    | O          | O          ✓
have                 | O          | O          ✓
a                    | O          | B-PER      ✗
Nacho                | B-PER      | I-PER      ✗
Libre                | I-PER      | O          ✗
question             | O          | O          ✓
.?                   | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('PER', 3, 4)]
PRED SPANS (1): [('PER', 2, 3)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: When nacho is driving to cure the influenza guy does he say to the man with the cow i like your blouse or I like your cow because my friend thinks he says I like your blouse , but that would n't make sense .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
When                 | O          | O          ✓
nacho                | B-PER      | O          ✗
is                   | O          | O          ✓
driving              | O          | O          ✓
to                   | O          | O          ✓
cure                 | O          | O          ✓
the                  | O          | O          ✓
influenza            | O          | B-PER      ✗
guy                  | O          | O          ✓
does                 | O          | O          ✓
he                   | O          | O          ✓
say                  | O          | O          ✓
to                   | O          | 




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Okay , FIRST , you have posted a question about an American movie , set in Mexico , in the Dining Out in Argentina category .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Okay                 | O          | O          ✓
,                    | O          | O          ✓
FIRST                | O          | O          ✓
,                    | O          | B-PER      ✗
you                  | O          | O          ✓
have                 | O          | O          ✓
posted               | O          | B-LOC      ✗
a                    | O          | I-LOC      ✗
question             | O          | O          ✓
about                | O          | B-LOC      ✗
an                   | O          | I-LOC      ✗
American             | O          | O          ✓
movie                | O          | O          ✓
,                    | O          | O          ✓
set                 




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Nacho Libre is suppose to be inspired in Mexicans , not in Argentineans .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Nacho                | B-PER      | O          ✗
Libre                | I-PER      | O          ✗
is                   | O          | O          ✓
suppose              | O          | O          ✓
to                   | O          | B-PER      ✗
be                   | O          | O          ✓
inspired             | O          | O          ✓
in                   | O          | O          ✓
Mexicans             | O          | O          ✓
,                    | O          | B-LOC      ✗
not                  | O          | I-LOC      ✗
in                   | O          | O          ✓
Argentineans         | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('PER', 0, 1)]
PRED SPANS (2): [('PER', 




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Here in Indiana I think it 's like $ 3 or $ 4 for a meal .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Here                 | O          | O          ✓
in                   | O          | O          ✓
Indiana              | B-LOC      | O          ✗
I                    | O          | O          ✓
think                | O          | O          ✓
it                   | O          | O          ✓
's                   | O          | O          ✓
like                 | O          | O          ✓
$                    | O          | O          ✓
3                    | O          | O          ✓
or                   | O          | O          ✓
$                    | O          | O          ✓
4                    | O          | O          ✓
for                  | O          | O          ✓
a                    | O          | O          ✓
meal                 | O          | O 




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Best POS system in Philadelphia ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Best                 | O          | O          ✓
POS                  | O          | O          ✓
system               | O          | O          ✓
in                   | O          | O          ✓
Philadelphia         | B-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 4, 4)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: What is the best pos system I can buy in philadelphia pa ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
What                 | O          | O          ✓
is                   | O          | O          ✓
the                  | O          | O          ✓
best                 | O          | O          ✓
pos                  | O          | O          ✓
system               | O          | O          ✓
I                    | O          | O          ✓
can                  | O          | O          ✓
buy                  | O          | O          ✓
in                   | O          | O          ✓
philadelphia         | B-LOC      | B-LOC      ✓
pa                   | B-LOC      | I-LOC      ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (2): [('LOC', 10, 10), ('LOC', 11, 11)]
PRED SPANS (1): [('LOC', 10, 11)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: mazzoni 's deli best italian food in phila pa ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
mazzoni              | B-LOC      | O          ✗
's                   | I-LOC      | O          ✗
deli                 | I-LOC      | O          ✗
best                 | O          | B-ORG      ✗
italian              | O          | O          ✓
food                 | O          | O          ✓
in                   | O          | O          ✓
phila                | B-LOC      | O          ✗
pa                   | O          | O          ✓
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (2): [('LOC', 0, 2), ('LOC', 7, 7)]
PRED SPANS (1): [('ORG', 3, 3)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: is mazzoni 's deli at 3901 conshocken ave in phila pa really the best italian food in the country ???

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
is                   | O          | O          ✓
mazzoni              | B-LOC      | O          ✗
's                   | I-LOC      | B-ORG      ✗
deli                 | I-LOC      | I-ORG      ✗
at                   | O          | O          ✓
3901                 | B-LOC      | O          ✗
conshocken           | I-LOC      | B-LOC      ✗
ave                  | I-LOC      | I-LOC      ✓
in                   | I-LOC      | O          ✗
phila                | I-LOC      | O          ✗
pa                   | I-LOC      | B-LOC      ✗
really               | O          | I-LOC      ✗
the                  | O          | O          ✓
best                 | O          | O          ✓
italian              | O          | B-LOC   




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: because all of the food blogs I ve read say so and I will travel from maryland if that s true

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
because              | O          | O          ✓
all                  | O          | O          ✓
of                   | O          | O          ✓
the                  | O          | O          ✓
food                 | O          | O          ✓
blogs                | O          | B-ORG      ✗
I                    | O          | I-ORG      ✗
ve                   | O          | O          ✓
read                 | O          | O          ✓
say                  | O          | O          ✓
so                   | O          | O          ✓
and                  | O          | B-LOC      ✗
I                    | O          | I-LOC      ✗
will                 | O          | O          ✓
travel               | O          | O          ✓
fro




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: What is the best place to get discounts for San Francisco restaurants ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
What                 | O          | O          ✓
is                   | O          | O          ✓
the                  | O          | O          ✓
best                 | O          | O          ✓
place                | O          | O          ✓
to                   | O          | O          ✓
get                  | O          | O          ✓
discounts            | O          | O          ✓
for                  | O          | O          ✓
San                  | B-LOC      | O          ✗
Francisco            | I-LOC      | O          ✗
restaurants          | O          | O          ✓
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 9, 10)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: surprise romantic cheap date night san francisco ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
surprise             | O          | O          ✓
romantic             | O          | O          ✓
cheap                | O          | O          ✓
date                 | O          | O          ✓
night                | O          | O          ✓
san                  | B-LOC      | B-LOC      ✓
francisco            | I-LOC      | I-LOC      ✓
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 5, 6)]
PRED SPANS (1): [('LOC', 5, 6)]
CORRECT SPANS (1): [('LOC', 5, 6)]





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Why not put together a bottle of champagne , a picnic and have a date on Treasure Island .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Why                  | O          | O          ✓
not                  | O          | O          ✓
put                  | O          | O          ✓
together             | O          | O          ✓
a                    | O          | O          ✓
bottle               | O          | O          ✓
of                   | O          | B-LOC      ✗
champagne            | O          | I-LOC      ✗
,                    | O          | O          ✓
a                    | O          | O          ✓
picnic               | O          | O          ✓
and                  | O          | O          ✓
have                 | O          | O          ✓
a                    | O          | O          ✓
date                 | O          | O          ✓
on    




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Dinner and dancing in Chicago ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Dinner               | O          | O          ✓
and                  | O          | O          ✓
dancing              | O          | O          ✓
in                   | O          | O          ✓
Chicago              | B-LOC      | O          ✗
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('LOC', 4, 4)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Rumba Room

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Rumba                | B-LOC      | O          ✗
Room                 | I-LOC      | O          ✗

Entity spans:
TRUE SPANS (1): [('LOC', 0, 1)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: is there any good places to get an ice - cream sundae from in Invercargill New Zealand ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
is                   | O          | O          ✓
there                | O          | O          ✓
any                  | O          | O          ✓
good                 | O          | O          ✓
places               | O          | O          ✓
to                   | O          | B-LOC      ✗
get                  | O          | I-LOC      ✗
an                   | O          | O          ✓
ice                  | O          | B-LOC      ✗
-                    | O          | I-LOC      ✗
cream                | O          | O          ✓
sundae               | O          | O          ✓
from                 | O          | O          ✓
in                   | O          | O          ✓
Invercargill         | B-LOC      | O          ✗
New     




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: so i live in Invercargill New Zealand and i want to know if there are any good places to buy an ice - cream sundae from other than mc donald s lol

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
so                   | O          | O          ✓
i                    | O          | O          ✓
live                 | O          | O          ✓
in                   | O          | B-LOC      ✗
Invercargill         | B-LOC      | I-LOC      ✗
New                  | B-LOC      | O          ✗
Zealand              | I-LOC      | O          ✗
and                  | O          | O          ✓
i                    | O          | O          ✓
want                 | O          | O          ✓
to                   | O          | O          ✓
know                 | O          | O          ✓
if                   | O          | B-ORG      ✗
there                | O          | I-ORG      ✗




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: you live in NZ and you eat McDonald s ice cream ?

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
you                  | O          | O          ✓
live                 | O          | O          ✓
in                   | O          | O          ✓
NZ                   | B-LOC      | O          ✗
and                  | O          | O          ✓
you                  | O          | O          ✓
eat                  | O          | O          ✓
McDonald             | B-LOC      | O          ✗
s                    | I-LOC      | O          ✗
ice                  | O          | O          ✓
cream                | O          | O          ✓
?                    | O          | O          ✓

Entity spans:
TRUE SPANS (2): [('LOC', 3, 3), ('LOC', 7, 8)]
PRED SPANS (0): []
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: any Ice Cream In NZ no matter where as New Zealand has the best bl**dy ice cream in the world I was in NZ for a few weeks had some Ice Cream and really enjoyed it I will go back just to eat some Ice Cream

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
any                  | O          | O          ✓
Ice                  | O          | O          ✓
Cream                | O          | O          ✓
In                   | O          | B-LOC      ✗
NZ                   | B-LOC      | I-LOC      ✗
no                   | O          | O          ✓
matter               | O          | O          ✓
where                | O          | B-LOC      ✗
as                   | O          | I-LOC      ✗
New                  | B-LOC      | O          ✗
Zealand              | I-LOC      | O          ✗
has                  | O          | O          ✓
the                  | O          | O  
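
The `TRUE SPANS`, `PRED SPANS` and `CORRECT SPANS` lines in the output above can be reproduced with a small helper. The following is a minimal sketch, not the implementation used inside `evaluate_llm_ner`: the function names `bio_to_spans` and `matching_spans` are hypothetical, and the handling of a stray `I-` tag (ignored when no span of the same type is open) is an assumption. Spans are `(type, start, end)` tuples with inclusive token indices, as in the printed output.

```python
def bio_to_spans(labels):
    """Convert a BIO label sequence into (type, start, end) spans (end inclusive)."""
    spans = []
    start, ent_type = None, None
    for i, label in enumerate(labels):
        # An I- tag continues the open span only if the entity type matches
        if label.startswith("I-") and ent_type == label[2:]:
            continue
        # Any other label closes the span that is currently open
        if ent_type is not None:
            spans.append((ent_type, start, i - 1))
            start, ent_type = None, None
        # A B- tag opens a new span; O and stray I- tags open nothing (assumption)
        if label.startswith("B-"):
            start, ent_type = i, label[2:]
    # Close a span that runs to the end of the sentence
    if ent_type is not None:
        spans.append((ent_type, start, len(labels) - 1))
    return spans


def matching_spans(true_spans, pred_spans, labeled=True):
    """Return spans found in both lists; with labeled=False only boundaries are compared."""
    if labeled:
        return sorted(set(true_spans) & set(pred_spans))
    true_bounds = {(s, e) for _, s, e in true_spans}
    pred_bounds = {(s, e) for _, s, e in pred_spans}
    return sorted(true_bounds & pred_bounds)


# Example taken from the output above: "Is Hank Green Awesome ?"
gold = ["O", "B-PER", "I-PER", "O", "O"]
pred = ["O", "B-PER", "O", "O", "O"]
true_spans = bio_to_spans(gold)   # [('PER', 1, 2)]
pred_spans = bio_to_spans(pred)   # [('PER', 1, 1)]
print(true_spans, pred_spans, matching_spans(true_spans, pred_spans))
```

Because a span only counts as correct when both boundaries (and, in the labeled case, the type) agree, a partially recovered entity such as `Hank` without `Green` contributes nothing to the span-level scores, even though one of its two tokens is tagged correctly.
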

In [60]:
# Evaluate on the out-of-domain (OOD) test set, using only the first 100 sentences
ood_test_results = evaluate_llm_ner(model, tokenizer, few_shot_examples, data_dict['en_pud']['test'][:100])

# Print OOD test results
print(f"\n{'-'*80}")
print("Results for LLM with ICL - Out-of-Domain Test Set")
print(f"{'-'*80}")
print(f"Token Accuracy: {ood_test_results['token_accuracy']:.4f}")

span_scores = ood_test_results['span_match_scores']
print(f"Labeled Span Match: {span_scores['labeled_span_match']:.4f}")
print(f"Unlabeled Span Match: {span_scores['unlabeled_span_match']:.4f}")

macro = ood_test_results['f1_scores']['macro_avg']
print(f"Precision: {macro['precision']:.4f}")
print(f"Recall: {macro['recall']:.4f}")
print(f"F-score: {macro['f1']:.4f}")

Evaluating LLM:   0%|          | 0/100 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: “ While much of the digital transition is unprecedented in the United States , the peaceful transition of power is not , ” Obama special assistant Kori Schulman wrote in a blog post Monday .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
“                    | O          | O          ✓
While                | O          | O          ✓
much                 | O          | O          ✓
of                   | O          | B-LOC      ✗
the                  | O          | O          ✓
digital              | O          | O          ✓
transition           | O          | O          ✓
is                   | O          | O          ✓
unprecedented        | O          | B-PER      ✗
in                   | O          | I-PER      ✗
the                  | O          | O          ✓
United               | B-LOC      | O          ✗
States               | I-LOC      | O          ✗
,   




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: For those who follow social media transitions on Capitol Hill , this will be a little different .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
For                  | O          | O          ✓
those                | O          | O          ✓
who                  | O          | O          ✓
follow               | O          | O          ✓
social               | O          | O          ✓
media                | O          | B-LOC      ✗
transitions          | O          | I-LOC      ✗
on                   | O          | O          ✓
Capitol              | B-LOC      | O          ✗
Hill                 | I-LOC      | O          ✗
,                    | O          | O          ✓
this                 | O          | O          ✓
will                 | O          | O          ✓
be                   | O          | O          ✓
a                    | O          | O          ✓




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: But in a break from his past rhetoric about curtailing immigration , the GOP nominee proclaimed that as president he would allow “ tremendous numbers ” of legal immigrants based on a “ merit system . ”

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
But                  | O          | O          ✓
in                   | O          | O          ✓
a                    | O          | O          ✓
break                | O          | O          ✓
from                 | O          | B-ORG      ✗
his                  | O          | I-ORG      ✗
past                 | O          | O          ✓
rhetoric             | O          | O          ✓
about                | O          | O          ✓
curtailing           | O          | O          ✓
immigration          | O          | O          ✓
,                    | O          | O          ✓
the                  | O          | O     




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: “ So I hate to put a little pressure on you , but the fate of the republic rests on your shoulders , ” he told the crowd gathered on a sports field at the University of North Carolina .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
“                    | O          | O          ✓
So                   | O          | O          ✓
I                    | O          | O          ✓
hate                 | O          | B-PER      ✗
to                   | O          | O          ✓
put                  | O          | O          ✓
a                    | O          | O          ✓
little               | O          | O          ✓
pressure             | O          | O          ✓
on                   | O          | B-LOC      ✗
you                  | O          | I-LOC      ✗
,                    | O          | O          ✓
but                  | O          | O          ✓
the      




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The new spending is fueled by Clinton ’s large bank account .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
new                  | O          | O          ✓
spending             | O          | B-PER      ✗
is                   | O          | I-PER      ✗
fueled               | O          | O          ✓
by                   | O          | O          ✓
Clinton              | B-PER      | O          ✗
’s                   | O          | O          ✓
large                | O          | O          ✓
bank                 | O          | O          ✓
account              | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('PER', 6, 6)]
PRED SPANS (1): [('PER', 2, 3)]
CORRECT SPANS (0): []





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: In early October , the transition team used the same venue to meet with technology lobbyists , inviting representatives from Uber , the Motion Picture Association of America , the Consumer Technology Association and other groups .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
In                   | O          | O          ✓
early                | O          | O          ✓
October              | O          | O          ✓
,                    | O          | O          ✓
the                  | O          | O          ✓
transition           | O          | O          ✓
team                 | O          | O          ✓
used                 | O          | B-ORG      ✗
the                  | O          | I-ORG      ✗
same                 | O          | O          ✓
venue                | O          | B-ORG      ✗
to                   | O          | I-ORG      ✗
meet         



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The gathering was originally slated for Washington ’s private Metropolitan Club on H Street a few blocks away .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
gathering            | O          | O          ✓
was                  | O          | B-LOC      ✗
originally           | O          | O          ✓
slated               | O          | I-LOC      ✗
for                  | O          | O          ✓
Washington           | B-LOC      | O          ✗
’s                   | O          | O          ✓
private              | O          | O          ✓
Metropolitan         | B-LOC      | O          ✗
Club                 | I-LOC      | O          ✗
on                   | O          | O          ✓
H                    | B-LOC      | O          ✗
Street               | I-LOC      | O          ✗
a                    | O          



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: By comparison , it cost $ 103.7 million to build the NoMa infill Metro station , which opened in 2004 .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
By                   | O          | O          ✓
comparison           | O          | O          ✓
,                    | O          | O          ✓
it                   | O          | O          ✓
cost                 | O          | O          ✓
$                    | O          | O          ✓
103.7                | O          | O          ✓
million              | O          | O          ✓
to                   | O          | B-ORG      ✗
build                | O          | I-ORG      ✗
the                  | O          | O          ✓
NoMa                 | B-LOC      | O          ✗
infill               | O          | B-LOC      ✗
Metro                | O          | I-LOC      ✗
station              | O          | O     



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: “ We face a lot of competition , and we think transit can help , ” said Joe Sternlieb , president of the Georgetown BID .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
“                    | O          | O          ✓
We                   | O          | O          ✓
face                 | O          | O          ✓
a                    | O          | O          ✓
lot                  | O          | O          ✓
of                   | O          | O          ✓
competition          | O          | O          ✓
,                    | O          | O          ✓
and                  | O          | O          ✓
we                   | O          | O          ✓
think                | O          | O          ✓
transit              | O          | B-PER      ✗
can                  | O          | O          ✓
help                 | O          | O          ✓
,                    | O



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The feasibility study estimates that it would take passengers about four minutes to cross the Potomac River on the gondola .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
feasibility          | O          | O          ✓
study                | O          | O          ✓
estimates            | O          | O          ✓
that                 | O          | O          ✓
it                   | O          | B-LOC      ✗
would                | O          | I-LOC      ✗
take                 | O          | O          ✓
passengers           | O          | O          ✓
about                | O          | O          ✓
four                 | O          | O          ✓
minutes              | O          | O          ✓
to                   | O          | O          ✓
cross                | O          | O          ✓
the                  



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: That share has been rising steadily over the years — only 11 percent of the total vote was cast before Election Day in 1996 , according to the Census Bureau -- and seems likely to jump again this year .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
That                 | O          | O          ✓
share                | O          | O          ✓
has                  | O          | O          ✓
been                 | O          | O          ✓
rising               | O          | O          ✓
steadily             | O          | O          ✓
over                 | O          | B-PER      ✗
the                  | O          | I-PER      ✗
years                | O          | O          ✓
—                    | O          | O          ✓
only                 | O          | O          ✓
11                   | O          | O          ✓
percent              | O          | B-ORG



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: “ We ’ve requested other nations to help us populate the zoo with different species of animals , including a pig , ” Saqib said .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
“                    | O          | O          ✓
We                   | O          | O          ✓
’ve                  | O          | O          ✓
requested            | O          | O          ✓
other                | O          | O          ✓
nations              | O          | O          ✓
to                   | O          | O          ✓
help                 | O          | O          ✓
us                   | O          | O          ✓
populate             | O          | O          ✓
the                  | O          | B-PER      ✗
zoo                  | O          | I-PER      ✗
with                 | O          | O          ✓
different            | O          | O          ✓
species         



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: There was a time , Mr Panvalkar said , when he felt that they should leave the building .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
There                | O          | O          ✓
was                  | O          | O          ✓
a                    | O          | B-PER      ✗
time                 | O          | I-PER      ✗
,                    | O          | O          ✓
Mr                   | O          | O          ✓
Panvalkar            | B-PER      | O          ✗
said                 | O          | O          ✓
,                    | O          | B-PER      ✗
when                 | O          | I-PER      ✗
he                   | O          | O          ✓
felt                 | O          | O          ✓
that                 | O          | O          ✓
they                 | O          | O          ✓
should               | O          | O          ✓
leave  



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: She killed Andre Price III by pressing his face into an air mattress in her sitting room before trying to do the same to her daughter , Angel , police said .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
She                  | O          | O          ✓
killed               | O          | O          ✓
Andre                | B-PER      | O          ✗
Price                | I-PER      | B-PER      ✗
III                  | I-PER      | O          ✗
by                   | O          | B-PER      ✗
pressing             | O          | O          ✓
his                  | O          | B-PER      ✗
face                 | O          | O          ✓
into                 | O          | B-PER      ✗
an                   | O          | O          ✓
air                  | O          | O          ✓
mattress             | O          | O          ✓
in                   | O          | O



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Mr Osborne signed up with a US speakers agency after being sacked in July .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Mr                   | O          | B-PER      ✗
Osborne              | B-PER      | O          ✗
signed               | O          | O          ✓
up                   | O          | O          ✓
with                 | O          | O          ✓
a                    | O          | B-LOC      ✗
US                   | B-LOC      | I-LOC      ✗
speakers             | O          | O          ✓
agency               | O          | O          ✓
after                | O          | O          ✓
being                | O          | O          ✓
sacked               | O          | O          ✓
in                   | O          | O          ✓
July                 | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE S



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Michael Fallon said the date for cutting the first steel would help secure new investment and safeguard hundreds of skilled jobs until 2035 .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Michael              | B-PER      | O          ✗
Fallon               | I-PER      | O          ✗
said                 | O          | B-PER      ✗
the                  | O          | O          ✓
date                 | O          | O          ✓
for                  | O          | O          ✓
cutting              | O          | O          ✓
the                  | O          | O          ✓
first                | O          | O          ✓
steel                | O          | O          ✓
would                | O          | O          ✓
help                 | O          | O          ✓
secure               | O          | O          ✓
new                  | O          | O          ✓
inve



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The promise of new Royal Navy orders to secure the Clyde shipbuilding industry was made before the Scottish independence referendum in 2014 .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
promise              | O          | O          ✓
of                   | O          | B-ORG      ✗
new                  | O          | I-ORG      ✗
Royal                | B-ORG      | O          ✗
Navy                 | I-ORG      | O          ✗
orders               | O          | B-ORG      ✗
to                   | O          | I-ORG      ✗
secure               | O          | O          ✓
the                  | O          | O          ✓
Clyde                | B-LOC      | B-LOC      ✓
shipbuilding         | O          | O          ✓
industry             | O          | O          ✓
was                  | O          | B-LOC      ✗
made



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: And by 2007 , at the height of its popularity ( and , perhaps , Knightley 's ) it was a top 50 name , with three times more babies named Keira than Kiera .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
And                  | O          | O          ✓
by                   | O          | O          ✓
2007                 | O          | B-LOC      ✗
,                    | O          | O          ✓
at                   | O          | O          ✓
the                  | O          | O          ✓
height               | O          | B-PER      ✗
of                   | O          | I-PER      ✗
its                  | O          | O          ✓
popularity           | O          | O          ✓
(                    | O          | O          ✓
and                  | O          | B-PER      ✗
,                    | O          | I-PER      ✗
perhaps              | O          | O  



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Several analysts have suggested Huawei is best placed to benefit from Samsung 's setback .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Several              | O          | O          ✓
analysts             | O          | O          ✓
have                 | O          | O          ✓
suggested            | O          | B-PER      ✗
Huawei               | B-ORG      | I-PER      ✗
is                   | O          | O          ✓
best                 | O          | O          ✓
placed               | O          | B-ORG      ✗
to                   | O          | I-ORG      ✗
benefit              | O          | O          ✓
from                 | O          | O          ✓
Samsung              | B-ORG      | B-ORG      ✓
's                   | O          | I-ORG      ✗
setback              | O          | O          ✓
.                    | O          | O          ✓

Entit



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The Mate 9 phones lack an artificial intelligence interface , like the Google Assistant or Apple 's Siri .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
Mate                 | O          | O          ✓
9                    | O          | O          ✓
phones               | O          | O          ✓
lack                 | O          | O          ✓
an                   | O          | O          ✓
artificial           | O          | B-ORG      ✗
intelligence         | O          | I-ORG      ✗
interface            | O          | O          ✓
,                    | O          | O          ✓
like                 | O          | B-ORG      ✗
the                  | O          | I-ORG      ✗
Google               | O          | O          ✓
Assistant            | O          | O          ✓
or                   | O          | O  



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: " But it may be that BA and IAG have cracked it and can offer something vaguely reliable . "

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
"                    | O          | O          ✓
But                  | O          | O          ✓
it                   | O          | B-ORG      ✗
may                  | O          | I-ORG      ✗
be                   | O          | O          ✓
that                 | O          | O          ✓
BA                   | B-ORG      | O          ✗
and                  | O          | O          ✓
IAG                  | B-ORG      | O          ✗
have                 | O          | O          ✓
cracked              | O          | O          ✓
it                   | O          | O          ✓
and                  | O          | O          ✓
can                  | O          | O          ✓
offer                | O          | O          ✓
some



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The company told the BBC it would be the responsibility of each airline brand to decide whether to charge passengers an access fee .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
company              | O          | O          ✓
told                 | O          | B-ORG      ✗
the                  | O          | I-ORG      ✗
BBC                  | B-ORG      | O          ✗
it                   | O          | O          ✓
would                | O          | O          ✓
be                   | O          | O          ✓
the                  | O          | O          ✓
responsibility       | O          | O          ✓
of                   | O          | O          ✓
each                 | O          | O          ✓
airline              | O          | O          ✓
brand                | O          | O          ✓
to           



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The 10 - week course has been " certified " by UK spy agency GCHQ .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
10                   | O          | O          ✓
-                    | O          | O          ✓
week                 | O          | O          ✓
course               | O          | B-ORG      ✗
has                  | O          | I-ORG      ✗
been                 | O          | O          ✓
"                    | O          | O          ✓
certified            | O          | O          ✓
"                    | O          | O          ✓
by                   | O          | O          ✓
UK                   | B-LOC      | O          ✗
spy                  | O          | O          ✓
agency               | O          | O          ✓
GCHQ                 | B-ORG      | O          ✗
.                    | O     



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Naturally China will be presenting plenty of other military hardware this week from attack helicopters to seaplanes .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Naturally            | O          | O          ✓
China                | B-ORG      | B-ORG      ✓
will                 | O          | I-ORG      ✗
be                   | O          | O          ✓
presenting           | O          | O          ✓
plenty               | O          | O          ✓
of                   | O          | O          ✓
other                | O          | O          ✓
military             | O          | O          ✓
hardware             | O          | O          ✓
this                 | O          | O          ✓
week                 | O          | O          ✓
from                 | O          | O          ✓
attack               | O          | O          ✓
helicopters          | O    



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: And with China set to become the world 's biggest aviation market in the next decade , the show is an opportunity for Beijing to demonstrate its ambitions in civil aviation as well as defence .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
And                  | O          | O          ✓
with                 | O          | O          ✓
China                | B-LOC      | O          ✗
set                  | O          | O          ✓
to                   | O          | O          ✓
become               | O          | B-ORG      ✗
the                  | O          | I-ORG      ✗
world                | O          | O          ✓
's                   | O          | B-LOC      ✗
biggest              | O          | I-LOC      ✗
aviation             | O          | O          ✓
market               | O          | O          ✓
in                   | O          | O          ✓
t



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: " Unfortunately once again the few ruin it for the many , " wrote Jesse LaBrocca , founder of Hack Forums , in a message explaining why the section was being closed .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
"                    | O          | O          ✓
Unfortunately        | O          | O          ✓
once                 | O          | O          ✓
again                | O          | O          ✓
the                  | O          | O          ✓
few                  | O          | B-PER      ✗
ruin                 | O          | I-PER      ✗
it                   | O          | O          ✓
for                  | O          | B-ORG      ✗
the                  | O          | I-ORG      ✗
many                 | O          | O          ✓
,                    | O          | O          ✓
"                    | O          | O          ✓
wrote                | O    



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The " recent events " are likely to be the attacks of 21 October that briefly took down popular websites such as Reddit , Twitter and Spotify as well as many others .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
"                    | O          | O          ✓
recent               | O          | O          ✓
events               | O          | O          ✓
"                    | O          | O          ✓
are                  | O          | O          ✓
likely               | O          | O          ✓
to                   | O          | O          ✓
be                   | O          | O          ✓
the                  | O          | O          ✓
attacks              | O          | O          ✓
of                   | O          | O          ✓
21                   | O          | O          ✓
October              | O    



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: A UN review of national plans to cut carbon says they are well short of the levels needed to keep the rise in global temperatures under 2C .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
A                    | O          | O          ✓
UN                   | B-ORG      | B-ORG      ✓
review               | O          | I-ORG      ✗
of                   | O          | O          ✓
national             | O          | O          ✓
plans                | O          | O          ✓
to                   | O          | O          ✓
cut                  | O          | O          ✓
carbon               | O          | O          ✓
says                 | O          | O          ✓
they                 | O          | O          ✓
are                  | O          | O          ✓
well                 | O          | O          ✓
short                | O          | O          ✓
of   



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: " We are moving in the right direction : the Paris Agreement will slow climate change , as will the recent Kigali Amendment to reduce HFCs , " said Erik Solheim , head of UN Environment .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
"                    | O          | O          ✓
We                   | O          | O          ✓
are                  | O          | O          ✓
moving               | O          | O          ✓
in                   | O          | B-LOC      ✗
the                  | O          | I-LOC      ✗
right                | O          | O          ✓
direction            | O          | O          ✓
:                    | O          | O          ✓
the                  | O          | B-PER      ✗
Paris                | O          | I-PER      ✗
Agreement            | O          | O          ✓
will                 | O          | B-ORG      ✗
slow   



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: This will put new limits on the nature of the environmental changes that overtook the Earth and sent so many species - not just the dinosaurs - into oblivion .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
This                 | O          | O          ✓
will                 | O          | O          ✓
put                  | O          | O          ✓
new                  | O          | O          ✓
limits               | O          | O          ✓
on                   | O          | O          ✓
the                  | O          | B-LOC      ✗
nature               | O          | I-LOC      ✗
of                   | O          | O          ✓
the                  | O          | O          ✓
environmental        | O          | O          ✓
changes              | O          | O          ✓
that                 | O          | O          ✓
overtook             | O          |



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Just over 70 % of the plants grown from Earth seeds were alive after 17 days - just slightly more than the plants grown from space seeds - just over 66 % .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Just                 | O          | O          ✓
over                 | O          | O          ✓
70                   | O          | B-LOC      ✗
%                    | O          | I-LOC      ✗
of                   | O          | O          ✓
the                  | O          | O          ✓
plants               | O          | O          ✓
grown                | O          | O          ✓
from                 | O          | O          ✓
Earth                | B-LOC      | O          ✗
seeds                | O          | O          ✓
were                 | O          | O          ✓
alive                | O          | O          ✓
after                | O          | O  



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The results from this experiment provides further support that rocket seeds can be flown and stored on the International Space Station for six months without having any significant impacts on their ability to germinate and grow on Earth .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
results              | O          | O          ✓
from                 | O          | O          ✓
this                 | O          | O          ✓
experiment           | O          | B-ORG      ✗
provides             | O          | I-ORG      ✗
further              | O          | O          ✓
support              | O          | O          ✓
that                 | O          | O          ✓
rocket               | O          | O          ✓
seeds                | O          | O          ✓
can                  | O          | O          ✓
be   



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The RHS collected comments sent in by schoolchildren and teachers involved in the experiment .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
RHS                  | B-ORG      | O          ✗
collected            | O          | B-ORG      ✗
comments             | O          | I-ORG      ✗
sent                 | O          | O          ✓
in                   | O          | O          ✓
by                   | O          | B-LOC      ✗
schoolchildren       | O          | I-LOC      ✗
and                  | O          | O          ✓
teachers             | O          | O          ✓
involved             | O          | B-ORG      ✗
in                   | O          | I-ORG      ✗
the                  | O          | O          ✓
experiment           | O          | O          ✓
.                    | O          | O          ✓

E



==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: In China the hair is typically put in a chemical bath to remove the cuticle completely , Tarlo explains .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
In                   | O          | O          ✓
China                | B-LOC      | B-LOC      ✓
the                  | O          | O          ✓
hair                 | O          | O          ✓
is                   | O          | O          ✓
typically            | O          | O          ✓
put                  | O          | B-PER      ✗
in                   | O          | I-PER      ✗
a                    | O          | O          ✓
chemical             | O          | O          ✓
bath                 | O          | O          ✓
to                   | O          | O          ✓
remove               | O          | O          ✓
the                  | O          | O          ✓
cuticle              | O          | O   




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Throughout history , the international hair market has always had a political dimension , says Tarlo .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Throughout           | O          | O          ✓
history              | O          | O          ✓
,                    | O          | O          ✓
the                  | O          | O          ✓
international        | O          | B-ORG      ✗
hair                 | O          | I-ORG      ✗
market               | O          | O          ✓
has                  | O          | O          ✓
always               | O          | O          ✓
had                  | O          | O          ✓
a                    | O          | B-PER      ✗
political            | O          | I-PER      ✗
dimension            | O          | O          ✓
,                    | O          | O          ✓
says                 | O          | O      




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Shenzhen 's traffic police have opted for unconventional penalties before .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Shenzhen             | B-ORG      | B-LOC      ✗
's                   | O          | I-LOC      ✗
traffic              | O          | O          ✓
police               | O          | O          ✓
have                 | O          | O          ✓
opted                | O          | O          ✓
for                  | O          | O          ✓
unconventional       | O          | B-PER      ✗
penalties            | O          | I-PER      ✗
before               | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('ORG', 0, 0)]
PRED SPANS (2): [('LOC', 0, 1), ('PER', 7, 8)]
CORRECT SPANS (0): []
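
The span tuples printed above appear to be `(entity type, start index, end index)` with an inclusive end index over the token positions. The function that produced them is not shown in this output, so the following is only a minimal sketch of how such spans could be derived from BIO tags and compared. The name `bio_to_spans` and the choice to ignore a stray `I-` tag with no preceding `B-` are assumptions, not the notebook's confirmed behaviour.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (type, start, end) spans, end inclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its type matches.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if current is not None:
            spans.append((current, start, i - 1))
            start, current = None, None
        # A B- tag opens a new span; a stray I- tag is ignored (assumption).
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    # Close a span that runs to the end of the sentence.
    if current is not None:
        spans.append((current, start, len(tags) - 1))
    return spans


# Labels copied from the Shenzhen sentence in the comparison table above.
gold = ["B-ORG", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
pred = ["B-LOC", "I-LOC", "O", "O", "O", "O", "O", "B-PER", "I-PER", "O", "O"]

true_spans = bio_to_spans(gold)
pred_spans = bio_to_spans(pred)
# A predicted span counts as correct only on an exact type and boundary match.
correct_spans = [s for s in pred_spans if s in true_spans]

print("TRUE SPANS:", true_spans)
print("PRED SPANS:", pred_spans)
print("CORRECT SPANS:", correct_spans)
```

Under exact-match scoring, the `LOC` prediction for "Shenzhen 's" earns no credit: both its type and its right boundary differ from the gold `ORG` span.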





==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: I asked her afterwards if she understood why people might vote for Trump .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
I                    | O          | O          ✓
asked                | O          | O          ✓
her                  | O          | O          ✓
afterwards           | O          | O          ✓
if                   | O          | B-PER      ✗
she                  | O          | I-PER      ✗
understood           | O          | O          ✓
why                  | O          | O          ✓
people               | O          | O          ✓
might                | O          | O          ✓
vote                 | O          | B-PER      ✗
for                  | O          | I-PER      ✗
Trump                | B-PER      | O          ✗
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (1): [('PER', 12, 12)]
PRED SPANS (2): [('PER




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: " Luckily , someone in Sony Australia was like , ' Hey , by the way , did you guys notice this ? ' " says Pall .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
"                    | O          | O          ✓
Luckily              | O          | O          ✓
,                    | O          | B-ORG      ✗
someone              | O          | I-ORG      ✗
in                   | O          | O          ✓
Sony                 | B-ORG      | O          ✗
Australia            | I-ORG      | O          ✗
was                  | O          | O          ✓
like                 | O          | O          ✓
,                    | O          | O          ✓
'                    | O          | B-PER      ✗
Hey                  | O          | I-PER      ✗
,                    | O          | O          ✓
by                   | O          | O          ✓
the                  | O         




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Seagal made headlines when he described Russia 's actions in Crimea , which it annexed in 2014 , as " very reasonable " .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Seagal               | B-PER      | B-PER      ✓
made                 | O          | O          ✓
headlines            | O          | O          ✓
when                 | O          | B-ORG      ✗
he                   | O          | I-ORG      ✗
described            | O          | O          ✓
Russia               | B-ORG      | B-LOC      ✗
's                   | O          | O          ✓
actions              | O          | O          ✓
in                   | O          | O          ✓
Crimea               | B-LOC      | O          ✗
,                    | O          | O          ✓
which                | O          | O          ✓
it                   | O          | O          ✓
annexed              | O




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Seagal , whose grandmother was from Vladivostok in Russia 's far east , has made frequent trips to Russia in recent years and visited Kamchatka and Sakhalin in September .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Seagal               | B-PER      | B-PER      ✓
,                    | O          | O          ✓
whose                | O          | O          ✓
grandmother          | O          | B-LOC      ✗
was                  | O          | I-LOC      ✗
from                 | O          | O          ✓
Vladivostok          | B-LOC      | B-ORG      ✗
in                   | O          | I-ORG      ✗
Russia               | B-LOC      | O          ✗
's                   | O          | O          ✓
far                  | O          | O          ✓
east                 | O          | O          ✓
,                    | O          | B-LOC      ✗
has                  | 




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Earlier this year Seagal was given Serbian nationality after offering to set up a martial arts school in the capital Belgrade .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Earlier              | O          | O          ✓
this                 | O          | O          ✓
year                 | O          | B-PER      ✗
Seagal               | B-PER      | O          ✗
was                  | O          | O          ✓
given                | O          | O          ✓
Serbian              | O          | B-ORG      ✗
nationality          | O          | I-ORG      ✗
after                | O          | O          ✓
offering             | O          | O          ✓
to                   | O          | B-LOC      ✗
set                  | O          | I-LOC      ✗
up                   | O          | O          ✓
a                    | O          | O          ✓
martial           




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: A police spokesman told the Associated Press there was " an exchange of words " followed by an " altercation " but that no injuries had been reported .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
A                    | O          | O          ✓
police               | O          | O          ✓
spokesman            | O          | B-ORG      ✗
told                 | O          | I-ORG      ✗
the                  | O          | O          ✓
Associated           | B-ORG      | O          ✗
Press                | I-ORG      | O          ✗
there                | O          | O          ✓
was                  | O          | O          ✓
"                    | O          | O          ✓
an                   | O          | O          ✓
exchange             | O          | O          ✓
of                   | O          | O          ✓
words                | O          | O      




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Cuaron , whose last film was the Oscar - winning Gravity , was reportedly not on set at the time of the incident .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Cuaron               | B-PER      | B-PER      ✓
,                    | O          | O          ✓
whose                | O          | O          ✓
last                 | O          | O          ✓
film                 | O          | O          ✓
was                  | O          | B-ORG      ✗
the                  | O          | I-ORG      ✗
Oscar                | O          | O          ✓
-                    | O          | O          ✓
winning              | O          | O          ✓
Gravity              | O          | O          ✓
,                    | O          | B-LOC      ✗
was                  | O          | I-LOC      ✗
reportedly           | O          | O          ✓
not                  | O       




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Ms Pugh has received treatment at Papworth and Addenbrooke 's Hospitals in Cambridgeshire .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Ms                   | O          | B-PER      ✗
Pugh                 | B-PER      | O          ✗
has                  | O          | O          ✓
received             | O          | O          ✓
treatment            | O          | B-ORG      ✗
at                   | O          | I-ORG      ✗
Papworth             | B-ORG      | O          ✗
and                  | O          | B-ORG      ✗
Addenbrooke          | B-ORG      | I-ORG      ✗
's                   | O          | O          ✓
Hospitals            | O          | B-LOC      ✗
in                   | O          | I-LOC      ✗
Cambridgeshire       | B-LOC      | O          ✗
.                    | O          | O          ✓

Entity spans:
TRUE SPANS (4): [('PER', 1, 1), ('ORG',




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: But a scan has shown the tumour in Ms Pugh 's right lung is growing , and she has had to leave the trial .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
But                  | O          | O          ✓
a                    | O          | O          ✓
scan                 | O          | O          ✓
has                  | O          | B-PER      ✗
shown                | O          | I-PER      ✗
the                  | O          | O          ✓
tumour               | O          | O          ✓
in                   | O          | O          ✓
Ms                   | O          | O          ✓
Pugh                 | B-PER      | O          ✗
's                   | O          | O          ✓
right                | O          | O          ✓
lung                 | O          | O          ✓
is                   | O          | O          ✓
growing              | O          | O  




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: " If Donald Trump becomes president , the government here will still have to work with him to advance whatever shared agenda there is , to ensure that Canadian businesses and interests are represented in Washington . "

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
"                    | O          | O          ✓
If                   | O          | B-PER      ✗
Donald               | B-PER      | O          ✗
Trump                | I-PER      | O          ✗
becomes              | O          | O          ✓
president            | O          | O          ✓
,                    | O          | O          ✓
the                  | O          | O          ✓
government           | O          | O          ✓
here                 | O          | O          ✓
will                 | O          | O          ✓
still                | O          | O          ✓
have                 | O 




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Trudeau will extend that invitation to the 45th president of the United States , whoever he or she may be .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Trudeau              | B-PER      | O          ✗
will                 | O          | O          ✓
extend               | O          | O          ✓
that                 | O          | O          ✓
invitation           | O          | B-PER      ✗
to                   | O          | I-PER      ✗
the                  | O          | O          ✓
45th                 | O          | B-ORG      ✗
president            | O          | I-ORG      ✗
of                   | O          | O          ✓
the                  | O          | O          ✓
United               | B-LOC      | O          ✗
States               | I-LOC      | O          ✗
,                    | O          | O          ✓
whoever              | O          | O 




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Despite the release of a photo this morning , police in B.C. say they have more questions than answers about an apparently homeless man charged in the fatal stabbing of a teen girl at her Abbotsford high school .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Despite              | O          | O          ✓
the                  | O          | O          ✓
release              | O          | O          ✓
of                   | O          | B-ORG      ✗
a                    | O          | I-ORG      ✗
photo                | O          | O          ✓
this                 | O          | O          ✓
morning              | O          | O          ✓
,                    | O          | O          ✓
police               | O          | O          ✓
in                   | O          | B-PER      ✗
B.C.                 | B-LOC      | O          ✗
say                  | O       




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Police in B.C. said earlier Klein did not appear to have a criminal history and released vague details about his recent whereabouts .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Police               | O          | B-ORG      ✗
in                   | O          | I-ORG      ✗
B.C.                 | B-LOC      | B-LOC      ✓
said                 | O          | I-LOC      ✗
earlier              | O          | O          ✓
Klein                | B-PER      | O          ✗
did                  | O          | O          ✓
not                  | O          | O          ✓
appear               | O          | O          ✓
to                   | O          | O          ✓
have                 | O          | O          ✓
a                    | O          | O          ✓
criminal             | O          | O          ✓
history              | O          | O          ✓
and         




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: " We do not believe the suspect has ties to this school , or to the two girls , or specifically to the Abbotsford area , " she said .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
"                    | O          | O          ✓
We                   | O          | O          ✓
do                   | O          | O          ✓
not                  | O          | O          ✓
believe              | O          | O          ✓
the                  | O          | O          ✓
suspect              | O          | O          ✓
has                  | O          | O          ✓
ties                 | O          | O          ✓
to                   | O          | O          ✓
this                 | O          | O          ✓
school               | O          | O          ✓
,                    | O          | O          ✓
or                   | O          | O          ✓
to          




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: He also said Klein was uncommunicative , uncooperative and unwilling to walk up from cells under the courthouse to attend his hearing .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
He                   | O          | O          ✓
also                 | O          | O          ✓
said                 | O          | B-PER      ✗
Klein                | B-PER      | O          ✗
was                  | O          | O          ✓
uncommunicative      | O          | O          ✓
,                    | O          | O          ✓
uncooperative        | O          | O          ✓
and                  | O          | O          ✓
unwilling            | O          | O          ✓
to                   | O          | B-LOC      ✗
walk                 | O          | I-LOC      ✗
up                   | O          | O          ✓
from                 | O          | O          ✓
cells     




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: The association that represents real estate agents in Ontario says more needs to be done to protect consumers and punish agents found to have engaged in unethical behaviour .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
The                  | O          | O          ✓
association          | O          | O          ✓
that                 | O          | B-ORG      ✗
represents           | O          | I-ORG      ✗
real                 | O          | O          ✓
estate               | O          | B-LOC      ✗
agents               | O          | I-LOC      ✗
in                   | O          | O          ✓
Ontario              | B-LOC      | O          ✗
says                 | O          | O          ✓
more                 | O          | O          ✓
needs                | O          | O          ✓
to                   | O          | O          ✓
be                  




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: He 'd also like to see greater enforcement and investigative powers for the Real Estate Council of Ontario ( RECO ) , which regulates agents in the province .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
He                   | O          | O          ✓
'd                   | O          | O          ✓
also                 | O          | O          ✓
like                 | O          | O          ✓
to                   | O          | B-ORG      ✗
see                  | O          | I-ORG      ✗
greater              | O          | O          ✓
enforcement          | O          | B-LOC      ✗
and                  | O          | I-LOC      ✗
investigative        | O          | O          ✓
powers               | O          | O          ✓
for                  | O          | O          ✓
the                  | O          | O          ✓
Real                 | B-ORG      | 




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Currently , the maximum fine RECO can levy against an agent is $ 25,000 .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Currently            | O          | O          ✓
,                    | O          | O          ✓
the                  | O          | B-ORG      ✗
maximum              | O          | I-ORG      ✗
fine                 | O          | O          ✓
RECO                 | B-ORG      | O          ✗
can                  | O          | O          ✓
levy                 | O          | O          ✓
against              | O          | O          ✓
an                   | O          | O          ✓
agent                | O          | O          ✓
is                   | O          | O          ✓
$                    | O          | O          ✓
25,000               | O          | O          ✓
.                    | O          | O          ✓

Entity spans:
TRUE SPA




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: But China 's growing middle class has been unusually vocal in its complaints about toxic air in cities like Beijing , which can see day after day of lung - choking smog .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
But                  | O          | O          ✓
China                | B-LOC      | O          ✗
's                   | O          | O          ✓
growing              | O          | O          ✓
middle               | O          | O          ✓
class                | O          | B-ORG      ✗
has                  | O          | I-ORG      ✗
been                 | O          | O          ✓
unusually            | O          | B-LOC      ✗
vocal                | O          | I-LOC      ✗
in                   | O          | O          ✓
its                  | O          | O          ✓
complaints           | O          | O          ✓
about                | O




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: " We are seeing many , many countries and especially large new emitters like Brazil , South Africa , India and China stepping up to the plate in terms of playing a role in reducing emissions , " said Guilbeault .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
"                    | O          | O          ✓
We                   | O          | O          ✓
are                  | O          | O          ✓
seeing               | O          | O          ✓
many                 | O          | B-ORG      ✗
,                    | O          | I-ORG      ✗
many                 | O          | O          ✓
countries            | O          | O          ✓
and                  | O          | O          ✓
especially           | O          | O          ✓
large                | O          | B-ORG      ✗
new                  | O          | I-ORG      ✗
emitters             | O       




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Clinton and allies , meanwhile , are seeking to keep the spotlight on Trump , charging that his disparaging comments about women and minorities , and his temperament make him unfit for office .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Clinton              | B-PER      | O          ✗
and                  | O          | O          ✓
allies               | O          | O          ✓
,                    | O          | B-PER      ✗
meanwhile            | O          | I-PER      ✗
,                    | O          | O          ✓
are                  | O          | O          ✓
seeking              | O          | O          ✓
to                   | O          | O          ✓
keep                 | O          | O          ✓
the                  | O          | B-PER      ✗
spotlight            | O          | I-PER      ✗
on                   | O          | O          ✓
T




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Trump 's wife , Melania Trump , made her first appearance on the trail since the Republican convention in July .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Trump                | B-PER      | O          ✗
's                   | O          | B-PER      ✗
wife                 | O          | I-PER      ✗
,                    | O          | O          ✓
Melania              | B-PER      | B-PER      ✓
Trump                | I-PER      | O          ✗
,                    | O          | O          ✓
made                 | O          | O          ✓
her                  | O          | B-ORG      ✗
first                | O          | I-ORG      ✗
appearance           | O          | O          ✓
on                   | O          | O          ✓
the                  | O          | O          ✓
trail                | O          | O          ✓
since                | O         




==== COMPARE GOLD TO LLM OUTPUT ====
Sentence: Students like Rai have been meeting with counsellors at the school to talk about what happened , but she said the biggest comfort has come from seeing her friends .

Token-by-token comparison:
TOKEN                | TRUE LABEL | PRED LABEL
--------------------------------------------
Students             | O          | O          ✓
like                 | O          | O          ✓
Rai                  | B-PER      | O          ✗
have                 | O          | O          ✓
been                 | O          | B-PER      ✗
meeting              | O          | I-PER      ✗
with                 | O          | O          ✓
counsellors          | O          | O          ✓
at                   | O          | O          ✓
the                  | O          | O          ✓
school               | O          | O          ✓
to                   | O          | O          ✓
talk                 | O          | O          ✓
about                | O      