# Named Entity Recognition with Zero-Shot LLMs: Project Introduction

**Students:** Tiboni Gabriele, Visentin Giacomo  

**Student IDs:** 2102414, 2121345

**Master Program:** Computer Engineering (AI & Robotics)

---

## Introduction

Named-Entity Recognition (NER) is the task of identifying and classifying entities (persons, organizations, locations, etc.) in text. Modern instruction-tuned Large Language Models (LLMs) such as ChatGPT and Gemini can perform NER in zero-shot settings—without task-specific training.

This project evaluates zero-shot NER with LLMs, based on ["Empirical Study of Zero-Shot NER with ChatGPT" (Xie et al., 2023)](https://arxiv.org/abs/2310.10035). We reproduce the baseline, adapt two additional methods from the paper, add one prompt variant of our own, test all of them on a standard dataset, and discuss the results.

---

## Domain and Dataset

We use the **CoNLL-2003** dataset: newswire text annotated with four entity types (Person, Location, Organization, Miscellaneous). The data is split into train, development, and test sets for fair evaluation.

We start by installing and importing the main libraries for this project: Google Generative AI (for Gemini API), pandas, and the HuggingFace datasets library.

In [None]:
# Install required libraries
!pip install -q google-generativeai datasets pandas
!pip install --upgrade datasets fsspec

# Import libraries
import google.generativeai as genai  # For Gemini API
import pandas as pd                  # For dataframes
import time                          # To add delay between API calls
import re                            # For regex parsing
from datasets import load_dataset    # To load standard NER datasets



Here we configure the Gemini API by setting the API key and initializing the model we will use for zero-shot NER.


In [None]:
# Set up the Gemini API key and model; replace the placeholder below with your own API key
MY_API_KEY = "INSERT HERE YOUR API KEY"

genai.configure(api_key=MY_API_KEY)
model = genai.GenerativeModel('gemini-2.5-flash-preview-05-20')

print("Gemini configuration completed")

Gemini configuration completed
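
Hard-coding a key in a shared notebook makes it easy to leak. A minimal sketch of one alternative, assuming the key is stored in an environment variable; the variable name `GEMINI_API_KEY` and the helper `load_api_key` are illustrative choices, not part of the original notebook:

```python
import os

def load_api_key(var_name="GEMINI_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running.")
    return key
```

The returned value could then be passed to `genai.configure(api_key=...)` in place of the hard-coded string.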


Now we load the CoNLL-2003 dataset from HuggingFace and check the size of each split. We also print the possible NER and POS tags.

In [None]:
dataset = load_dataset("conll2003")
train_data = dataset["train"]
val_data = dataset["validation"]
test_data = dataset["test"]

print("CoNLL-2003 dataset loaded")
print(f"Train: {len(train_data)} sentences")
print(f"Validation: {len(val_data)} sentences")
print(f"Test: {len(test_data)} sentences")

# Get list of NER and POS tags
ner_label_names = train_data.features['ner_tags'].feature.names
pos_label_names = train_data.features['pos_tags'].feature.names
print(f"NER labels: {ner_label_names}")
print(f"POS labels: {pos_label_names}")

CoNLL-2003 dataset loaded
Train: 14041 sentences
Validation: 3250 sentences
Test: 3453 sentences
NER labels: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
POS labels: ['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']


We display the test set object to check its structure and contents.


In [None]:
# Show the test dataset structure
test_data

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 3453
})

Here we define a function to extract sentences, NER labels, and POS tags from the HuggingFace dataset, limiting the number of examples for faster experiments. Then we prepare development and test subsets and print an example.

In [None]:
def prepare_data(hf_dataset, max_sentences=50):
    sentences = []
    labels = []
    posTags = []

    for i, example in enumerate(hf_dataset):
        if i >= max_sentences:
            break

        tokens = example['tokens']
        ner_tags = [ner_label_names[tag_id] for tag_id in example['ner_tags']]
        pos_tags = [pos_label_names[tag_id] for tag_id in example['pos_tags']]

        sentences.append(tokens)
        labels.append(ner_tags)
        posTags.append(pos_tags)

    return sentences, labels, posTags

# Prepare development and test subsets (kept small for quick testing and to stay within the usage limits of our API key)
dev_sentences, dev_labels, dev_pos = prepare_data(val_data, 30)
test_sentences, test_labels, test_pos = prepare_data(test_data, 50)

print(f"Development set: {len(dev_sentences)} sentences")
print(f"Test set: {len(test_sentences)} sentences")

# Show example sentence and labels
print("\nExample:")
print(f"Tokens: {' '.join(dev_sentences[1][:10])}")
print(f"Labels: {' '.join(dev_labels[1][:10])}")

Development set: 30 sentences
Test set: 50 sentences

Example:
Tokens: LONDON 1996-08-30
Labels: B-LOC O


The original CoNLL-2003 dataset uses BIO tags (e.g., B-PER, I-LOC) that mark the position of each token within a named entity. For this project we simplify the scheme to the IO format: each token is labeled with its entity type (e.g., PERSON, LOCATION) if it belongs to an entity, or 'O' if it does not.
This keeps the labels simple and is sufficient for testing whether a zero-shot approach finds the right entities.
Because the aim of the project is to explore how changing the prompt affects NER results, the IO scheme is enough to compare the methods and see which one works better.

In [None]:
# Function to convert CoNLL detailed labels to simple entity classes
def convert_labels(conll_labels):
    simple_labels = []

    for label in conll_labels:
        if label == 'O':
            simple_labels.append('O')
        elif 'PER' in label:
            simple_labels.append('PERSON')
        elif 'LOC' in label:
            simple_labels.append('LOCATION')
        elif 'ORG' in label:
            simple_labels.append('ORGANIZATION')
        elif 'MISC' in label:
            simple_labels.append('MISCELLANEOUS')
        else:
            simple_labels.append('O')

    return simple_labels

# Convert dev and test labels to the simple format
dev_labels_simple = [convert_labels(labels) for labels in dev_labels]
test_labels_simple = [convert_labels(labels) for labels in test_labels]

print("Labels converted to simple format")

Labels converted to simple format
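
To make the trade-off of the IO scheme concrete, here is a small self-contained sketch. The mapping `BIO_TO_IO` and the helper `bio_to_io` are hypothetical names used only for illustration. The sketch shows that two adjacent entities of the same type become indistinguishable once the B-/I- prefixes are dropped, which is why the evaluation in this notebook is token-level rather than span-level.

```python
# Hypothetical mapping from CoNLL-2003 entity types to the simplified IO classes.
BIO_TO_IO = {
    "PER": "PERSON",
    "LOC": "LOCATION",
    "ORG": "ORGANIZATION",
    "MISC": "MISCELLANEOUS",
}

def bio_to_io(tag):
    """Drop the B-/I- prefix and map the entity type to its IO class."""
    if tag == "O":
        return "O"
    _, _, entity_type = tag.partition("-")
    return BIO_TO_IO.get(entity_type, "O")

# Two adjacent but distinct LOC entities, followed by one two-token PER entity.
bio = ["B-LOC", "B-LOC", "O", "B-PER", "I-PER"]
io = [bio_to_io(t) for t in bio]
print(io)
# ['LOCATION', 'LOCATION', 'O', 'PERSON', 'PERSON']
```

After conversion, the two separate locations look exactly like one two-token location, so entity boundaries can no longer be recovered from the labels alone.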


We define the set of possible entity labels used for classification and print them.


In [None]:
# Define the set of possible NER labels
label_set = {'O', 'PERSON', 'LOCATION', 'ORGANIZATION', 'MISCELLANEOUS'}
print(f"Available labels: {label_set}")

Available labels: {'LOCATION', 'PERSON', 'MISCELLANEOUS', 'O', 'ORGANIZATION'}


We define four different zero-shot NER prompting methods for the LLM.  

The first method is the baseline from Xie et al. (2023), which asks the model to classify each token directly.


In [None]:
# METHOD 1: Baseline (from Xie et al., 2023)
def get_baseline_prompt(sentence):
    # Basic prompt, asks for token-level NER classification
    text = " ".join(sentence)
    labels_str = ", ".join([f"'{label}'" for label in sorted(label_set)])

    prompt = f"""Given entity label set: {{{labels_str}}}
Based on the given entity label set, please recognize the named entities in the given text.
Text: {text}

Please classify each word in the text with the appropriate entity label. Use 'O' for words that are not named entities.
Format your answer as: word1/LABEL1 word2/LABEL2 word3/O ...

Answer:"""

    return prompt

The second method adopts an approach similar to the decomposed QA strategy described in the paper (method b). The key difference is that, instead of making a separate API call for each label, we request all classifications within a single prompt. This choice was primarily due to API usage limitations. Nonetheless, it's particularly interesting to observe how the LLM performs the classification task after being explicitly instructed on the reasoning process to follow and how to combine the individual classifications into a final decision.

In [None]:
# METHOD 2: Decomposed QA-style prompt
def get_decomposed_qa_prompt(sentence):
    # Asks for entities by type first, then for the full classification (all in one API call)
    text = " ".join(sentence)
    labels_str = ", ".join([f"'{label}'" for label in sorted(label_set)])

    prompt = f"""Given entity label set: {{{labels_str}}}
Based on the given entity label set, please recognize the named entities in the given text.
Text: {text}

Question: What are the named entities labeled as 'PERSON' in the text?
Answer: [List all PERSON entities found]

Question: What are the named entities labeled as 'LOCATION' in the text?
Answer: [List all LOCATION entities found]

Question: What are the named entities labeled as 'ORGANIZATION' in the text?
Answer: [List all ORGANIZATION entities found]

Question: What are the named entities labeled as 'MISCELLANEOUS' in the text?
Answer: [List all MISCELLANEOUS entities found]

Now, based on the above analysis, classify each word:
Format: word1/LABEL1 word2/LABEL2 word3/O ...

Final Answer:"""

    return prompt
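
For comparison, the multi-call variant mentioned above would need one prompt per label. The following is only a sketch of how such prompts could be built: `build_per_label_prompts` and `ENTITY_LABELS` are hypothetical names, the wording is an illustration rather than the paper's exact template, and sending the prompts and merging the answers is left out.

```python
ENTITY_LABELS = ["PERSON", "LOCATION", "ORGANIZATION", "MISCELLANEOUS"]

def build_per_label_prompts(sentence, labels=ENTITY_LABELS):
    """Return a dict with one question prompt per entity label (hypothetical helper)."""
    text = " ".join(sentence)
    label_list = ", ".join(f"'{label}'" for label in labels)
    prompts = {}
    for label in labels:
        prompts[label] = (
            f"Given entity label set: {{{label_list}}}\n"
            f"Text: {text}\n\n"
            f"Question: What are the named entities labeled as '{label}' in the text?\n"
            "Answer:"
        )
    return prompts

prompts = build_per_label_prompts(["LONDON", "1996-08-30"])
print(len(prompts))  # 4
print(prompts["LOCATION"])
```

With four labels this means four API calls per sentence instead of one, which is the cost our single-prompt version avoids.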

The third method is very similar to the previous one and is also inspired by one of the approaches in the paper, in this case method (d). In this version, however, the prompt includes not only the sentence but also its POS tags.

In [None]:
# METHOD 3: Tool Augmentation (POS tags included)
def get_tool_augmentation_prompt(sentence, pos_tags):
    # Adds POS tag info to the prompt (still a single API call)
    text = " ".join(sentence)
    labels_str = ", ".join([f"'{label}'" for label in sorted(label_set)])
    pos_text = " ".join([f"{word}/{tag}" for word, tag in zip(sentence, pos_tags)])

    prompt = f"""Given entity label set: {{{labels_str}}}
Given the text and the corresponding Part-of-Speech tags, please recognize the named entities in the given text.
Text: {text}
Part-of-Speech tags: {pos_text}

Question: What are the named entities labeled as 'PERSON' in the text?
Answer: [Analyze using POS tags to identify PERSON entities]

Question: What are the named entities labeled as 'LOCATION' in the text?
Answer: [Analyze using POS tags to identify LOCATION entities]

Question: What are the named entities labeled as 'ORGANIZATION' in the text?
Answer: [Analyze using POS tags to identify ORGANIZATION entities]

Question: What are the named entities labeled as 'MISCELLANEOUS' in the text?
Answer: [Analyze using POS tags to identify MISCELLANEOUS entities]

Based on the above analysis and POS tags, classify each word:
Format: word1/LABEL1 word2/LABEL2 word3/O ...

Final Classification:"""

    return prompt

Method 4 is a simple variant of our own that gave good results in previous runs. Like the baseline, it asks the model to classify the whole sentence without explaining its reasoning, but it adds a short description of what each label covers. The idea is that explaining what each label represents helps the model choose the right one.

In [None]:
# METHOD 4: Detailed definitions (extra method)
def get_detailed_prompt(sentence):
    # Extra method, uses explicit definitions for each class (not from the paper, just a reasonable variant)
    text = " ".join(sentence)
    prompt = f"""Identify and classify each word in the following text as one of the following categories:

PERSON: Names of people (first names, last names, nicknames).
LOCATION: Geographic locations (cities, countries, continents, landmarks).
ORGANIZATION: Companies, institutions, government agencies, teams, universities.
MISCELLANEOUS: Languages, nationalities, religions, events, awards, products, brands.
O: Not a named entity.

Text: {text}

Provide the classification for each word in the format: word/CATEGORY

Answer:"""
    return prompt

We define a helper function to send prompts to Gemini and get the model's response. If there is a temporary error, it tries again a few times.

In [None]:
# Function to query Gemini with a prompt, with simple retry logic
def query_gemini(prompt, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            time.sleep(1)  # Small pause between calls
            return response.text
        except Exception as e:
            print(f"Error attempt {attempt + 1}: {str(e)}")
            if attempt < max_retries - 1:
                time.sleep(5)  # Wait before retrying
            else:
                return "ERROR"

print("Gemini query function defined")

Gemini query function defined
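
The helper above waits a fixed 5 seconds between retries. When the failures are quota errors, a wait that grows after each failure is a common alternative. The sketch below shows the idea with a generic wrapper; `call_with_backoff` and its parameters are hypothetical and not part of the notebook or of the Gemini client library.

```python
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on any exception, doubling the wait each time.

    Returns fn()'s result, or re-raises the last exception once
    max_retries attempts have failed.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Example with a stand-in function that fails twice before succeeding.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

waits = []
result = call_with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)  # ok [1.0, 2.0]
```

In the notebook, `fn` could be something like `lambda: model.generate_content(prompt).text`. Passing `sleep` as a parameter is only there so the example runs instantly.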


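We define two helper functions that parse Gemini's output into a list of predicted NER labels, one per word. The second function is more robust: it matches labels case-insensitively and tracks which token positions have already been assigned, so a repeated word is mapped to its next unused occurrence instead of always the first. It is used for Method 4, where the model's output might be less regular.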


In [None]:
# Simple parsing function for Gemini output (mainly for methods 1-3)
def parse_response(response_text, original_sentence):
    predicted_labels = ['O'] * len(original_sentence)

    if response_text == "ERROR":
        return predicted_labels

    lines = response_text.strip().split('\n')

    for line in lines:
        matches = re.findall(r'(\S+)/(PERSON|LOCATION|ORGANIZATION|MISCELLANEOUS|O)', line)
        for word, label in matches:
            word_clean = word.strip('.,!?";:')
            for i, orig_word in enumerate(original_sentence):
                if orig_word.lower() == word_clean.lower():
                    predicted_labels[i] = label
                    break

    return predicted_labels


# More robust parser, especially for method 4 (detailed definition)
def parse_gemini_response(response_text, original_sentence):
    predicted_labels = ['O'] * len(original_sentence)
    lines = response_text.strip().split('\n')
    used_indices = set()
    for line in lines:
        matches = re.findall(r'(\S+)/(PERSON|LOCATION|ORGANIZATION|MISCELLANEOUS|O)', line, re.IGNORECASE)
        for word, label in matches:
            word_clean = word.strip('.,!?";:').lower()
            for i, orig_word in enumerate(original_sentence):
                if i not in used_indices and orig_word.lower() == word_clean:
                    predicted_labels[i] = label.upper() if label != 'O' else 'O'
                    used_indices.add(i)
                    break
    return predicted_labels

print("Parsing function(s) defined")


Parsing function(s) defined
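
Repeated words are the main pitfall when aligning `word/LABEL` pairs back to tokens. In `parse_response`, the inner loop always stops at the first token equal to the word, so a second occurrence overwrites the first position and leaves its own position as 'O'. One possible alternative, sketched below under the assumption that the model lists the words in sentence order, is to align left to right with a cursor. The name `parse_in_order` is hypothetical.

```python
import re

LABELS = ("PERSON", "LOCATION", "ORGANIZATION", "MISCELLANEOUS", "O")
PAIR_RE = re.compile(r"(\S+)/(" + "|".join(LABELS) + r")\b")

def parse_in_order(response_text, tokens):
    """Align word/LABEL pairs to tokens left to right (illustrative sketch)."""
    predicted = ["O"] * len(tokens)
    cursor = 0  # first token position not yet matched
    for word, label in PAIR_RE.findall(response_text):
        for i in range(cursor, len(tokens)):
            if tokens[i].lower() == word.lower():
                predicted[i] = label
                cursor = i + 1
                break
    return predicted

tokens = ["Paris", "beat", "Paris", "FC"]
response = "Paris/LOCATION beat/O Paris/ORGANIZATION FC/ORGANIZATION"
print(parse_in_order(response, tokens))
# ['LOCATION', 'O', 'ORGANIZATION', 'ORGANIZATION']
```

The trade-off is that a word the model invents or reorders can push the cursor too far ahead, so this approach relies on the output following the sentence order.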


Since we adopted the IO tagging scheme for NER, we evaluate the model's performance by comparing the predicted labels with the gold labels on a token-by-token basis. For each token, we check whether the predicted entity type matches the true (gold) label. This lets us compute precision, recall, and F1-score at the token level, without requiring span-level matching.

In [None]:
# Compute precision, recall, F1 score for NER predictions
def calculate_f1(true_labels_list, pred_labels_list):
    correct = 0
    total_true = 0
    total_pred = 0

    for true_labels, pred_labels in zip(true_labels_list, pred_labels_list):
        for true_label, pred_label in zip(true_labels, pred_labels):

            if true_label != 'O':
                total_true += 1

            if pred_label != 'O':
                total_pred += 1

            if true_label == pred_label and true_label != 'O':
                correct += 1

    precision = correct / total_pred if total_pred > 0 else 0
    recall = correct / total_true if total_true > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'correct': correct,
        'total_true': total_true,
        'total_pred': total_pred
    }

print("Evaluation functions defined")

Evaluation functions defined
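
As a quick worked example of the token-level metrics, the sketch below computes precision, recall, and F1 from the three counts that `calculate_f1` accumulates. The helper name `prf_from_counts` and the toy counts are illustrative only.

```python
def prf_from_counts(correct, total_pred, total_true):
    """Token-level precision, recall and F1 from raw counts."""
    precision = correct / total_pred if total_pred else 0.0
    recall = correct / total_true if total_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: 4 gold entity tokens, 5 predicted entity tokens, 3 of them correct.
p, r, f1 = prf_from_counts(correct=3, total_pred=5, total_true=4)
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # 0.600 0.750 0.667
```

Tokens whose gold and predicted labels are both 'O' do not enter any of the three counts, so the scores are not inflated by the many non-entity tokens.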


Here we test all four NER prompting methods on the first few examples from the development set. We print the original sentence, the true labels, and the predictions from each method.


In [None]:
print("\n DEVELOPMENT PHASE - Testing on the first few examples")

test_indices = [0, 1, 2]

for i in test_indices:
    sentence = dev_sentences[i]
    true_labels = dev_labels_simple[i]
    pos_tags = dev_pos[i]

    print(f"\n--- Sentence {i+1} ---")
    print(f"Text: {' '.join(sentence)}")
    print(f"True:  {' '.join(true_labels)}")

    # Test Method 1
    prompt1 = get_baseline_prompt(sentence)
    response1 = query_gemini(prompt1)
    pred1 = parse_response(response1, sentence)
    print(f"Met 1: {' '.join(pred1)}")

    # Test Method 2
    prompt2 = get_decomposed_qa_prompt(sentence)
    response2 = query_gemini(prompt2)
    pred2 = parse_response(response2, sentence)
    print(f"Met 2: {' '.join(pred2)}")

    # Test Method 3
    prompt3 = get_tool_augmentation_prompt(sentence, pos_tags)
    response3 = query_gemini(prompt3)
    pred3 = parse_response(response3, sentence)
    print(f"Met 3: {' '.join(pred3)}")

    # Test Method 4 (with detailed definitions)
    prompt4 = get_detailed_prompt(sentence)
    response4 = query_gemini(prompt4)
    pred4 = parse_gemini_response(response4, sentence)
    print(f"Met 4: {' '.join(pred4)}")


 DEVELOPMENT PHASE - Testing on the first few examples

--- Sentence 1 ---
Text: CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY .
True:  O O ORGANIZATION O O O O O O O O
Met 1: O O ORGANIZATION O O O O O O O O
Error attempt 1: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Met 2: O O ORGANIZATION O O O O O O O O
Met 3: MISCELLANEOUS O ORGANIZATION O O O O O O O O
Met 4: O O ORGANIZATION O O O O O O O O

--- Sentence 2 ---
Text: LONDON 1996-08-30
True:  LOCATION O
Met 1: LOCATION O
Error attempt 1: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Met 2: LOCATION O
Met 3: LOCATION O
Met 4: LOCATION O

--- Sentence 3 ---
Text: West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship .
True:  MISCELLANEOUS MISCELLANEOUS O PERSON PERSON O O O O O O O ORGANIZATION O ORGANI

Now we evaluate all four methods on a subset of 50 test sentences. For each method, we generate predictions, calculate metrics, and store the results for later comparison.


In [None]:
print("\n FINAL EVALUATION ON THE TEST SET")
print("=" * 50)

n_test = 50  # Number of test sentences to use
test_subset_sentences = test_sentences[:n_test]
test_subset_labels = test_labels_simple[:n_test]
test_subset_pos = test_pos[:n_test]

results = {}


# METHOD 1: Baseline
print(f"\n Testing Method 1 on {n_test} sentences...")
method1_predictions = []

for i, sentence in enumerate(test_subset_sentences):
    print(f"Sentence {i+1}/{n_test}...", end=" ")
    prompt = get_baseline_prompt(sentence)
    response = query_gemini(prompt)
    predictions = parse_response(response, sentence)
    method1_predictions.append(predictions)
    print("✓")

results['Method 1'] = calculate_f1(test_subset_labels, method1_predictions)



# METHOD 2:
print(f"\n Testing Method 2 on {n_test} sentences...")
method2_predictions = []

for i, sentence in enumerate(test_subset_sentences):
    print(f"Sentence {i+1}/{n_test}...", end=" ")
    prompt = get_decomposed_qa_prompt(sentence)
    response = query_gemini(prompt)
    predictions = parse_response(response, sentence)
    method2_predictions.append(predictions)
    print("✓")

results['Method 2'] = calculate_f1(test_subset_labels, method2_predictions)



# METHOD 3:
print(f"\n Testing Method 3 on {n_test} sentences...")
method3_predictions = []

for i, sentence in enumerate(test_subset_sentences):
    print(f"Sentence {i+1}/{n_test}...", end=" ")
    # Pass the POS tags of the current sentence, not the whole list of tag lists
    prompt = get_tool_augmentation_prompt(sentence, test_subset_pos[i])
    response = query_gemini(prompt)
    predictions = parse_response(response, sentence)
    method3_predictions.append(predictions)
    print("✓")

results['Method 3'] = calculate_f1(test_subset_labels, method3_predictions)



# METHOD 4:
print(f"\n Testing Method 4 on {n_test} sentences...")
method4_predictions = []

for i, sentence in enumerate(test_subset_sentences):
    print(f"Sentence {i+1}/{n_test}...", end=" ")
    prompt = get_detailed_prompt(sentence)
    response = query_gemini(prompt)
    predictions = parse_gemini_response(response, sentence)
    method4_predictions.append(predictions)
    print("✓")

results['Method 4'] = calculate_f1(test_subset_labels, method4_predictions)


 FINAL EVALUATION ON THE TEST SET

 Testing Method 1 on 50 sentences...
Sentence 1/50... ✓
Sentence 2/50... Error attempt 1: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
✓
Sentence 3/50... Error attempt 1: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✓
Sentence 4/50... ✓
Sentence 5/50... ✓
Sentence 6/50... ✓
Sentence 7/50... ✓
Sentence 8/50... ✓
Sentence 9/50... ✓
Sentence 10/50... ✓
Sentence 11/50... ✓
Sentence 12/50... ✓
Sentence 13/50... Error attempt 1: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✓
Sentence 14/50... Error attempt 1: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✓
Sentence 15/50... Error attempt 1: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✓
Sentence 16/50... Error attempt 1: ('Connection aborted.', RemoteDisconnected('Remote end closed con



Error attempt 1: 429 POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-05-20:generateContent?%24alt=json%3Benum-encoding%3Dint: You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.




Error attempt 2: 429 POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-preview-05-20:generateContent?%24alt=json%3Benum-encoding%3Dint: You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.
✓
Sentence 33/50... ✓
Sentence 34/50... ✓
Sentence 35/50... ✓
Sentence 36/50... Error attempt 1: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✓
Sentence 37/50... ✓
Sentence 38/50... ✓
Sentence 39/50... ✓
Sentence 40/50... ✓
Sentence 41/50... Error attempt 1: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✓
Sentence 42/50... ✓
Sentence 43/50... Error attempt 1: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
✓
Sentence 44/50... ✓
Sentence 45/50... ✓
Sentence 46/50... ✓
Sentence 47/50... Error attempt 1: ('Connection abo
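
The four evaluation loops above differ only in how the prompt is built and which parser is used. A possible way to factor them into one helper is sketched below. `run_method` and the stand-in functions are hypothetical, and the stand-ins exist only so the example runs without API access.

```python
def run_method(sentences, build_prompt, query, parse):
    """Run one prompting method over all sentences and collect predictions.

    build_prompt(i, sentence) -> prompt string
    query(prompt)             -> raw model response
    parse(response, sentence) -> list of labels, one per token
    """
    predictions = []
    for i, sentence in enumerate(sentences):
        prompt = build_prompt(i, sentence)
        predictions.append(parse(query(prompt), sentence))
    return predictions

# Stand-in components so the sketch runs without any API access.
def fake_query(prompt):
    return "LONDON/LOCATION 1996-08-30/O"

def fake_parse(response, sentence):
    tagged = dict(pair.rsplit("/", 1) for pair in response.split())
    return [tagged.get(tok, "O") for tok in sentence]

preds = run_method(
    [["LONDON", "1996-08-30"]],
    build_prompt=lambda i, s: " ".join(s),
    query=fake_query,
    parse=fake_parse,
)
print(preds)  # [['LOCATION', 'O']]
```

Passing the sentence index to `build_prompt` lets a method look up per-sentence extras by position, for example the POS tags of the same sentence for Method 3.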

We print a summary table of the final results (precision, recall, F1) for each method. We also show more detailed counts and highlight the best-performing method based on F1 score.


In [None]:
print("\n" + "=" * 60)
print("FINAL RESULTS")
print("=" * 60)

# Print results table
print(f"\n{'Method':<25} {'Precision':<10} {'Recall':<10} {'F1-Score':<10}")
print("-" * 55)

for method_name, metrics in results.items():
    print(f"{method_name:<25} {metrics['precision']:.3f}      {metrics['recall']:.3f}      {metrics['f1']:.3f}")

# Show details for each method
print(f"\n DETAILS:")
for method_name, metrics in results.items():
    print(f"\n {method_name}:")
    print(f"   Precision: {metrics['precision']:.3f}")
    print(f"   Recall: {metrics['recall']:.3f}")
    print(f"   F1-Score: {metrics['f1']:.3f}")
    print(f"   Correct entities: {metrics['correct']}")
    print(f"   True entities: {metrics['total_true']}")
    print(f"   Predicted entities: {metrics['total_pred']}")

# Print the best method (highest F1)
best_method = max(results.items(), key=lambda x: x[1]['f1'])
print(f"\n🏆 BEST METHOD: {best_method[0]} (F1: {best_method[1]['f1']:.3f})")


FINAL RESULTS

Method                    Precision  Recall     F1-Score  
-------------------------------------------------------
Method 1                  0.942      0.913      0.927
Method 2                  0.901      0.905      0.903
Method 3                  0.917      0.909      0.913
Method 4                  0.953      0.974      0.964

 DETAILS:

 Method 1:
   Precision: 0.942
   Recall: 0.913
   F1-Score: 0.927
   Correct entities: 211
   True entities: 231
   Predicted entities: 224

 Method 2:
   Precision: 0.901
   Recall: 0.905
   F1-Score: 0.903
   Correct entities: 209
   True entities: 231
   Predicted entities: 232

 Method 3:
   Precision: 0.917
   Recall: 0.909
   F1-Score: 0.913
   Correct entities: 210
   True entities: 231
   Predicted entities: 229

 Method 4:
   Precision: 0.953
   Recall: 0.974
   F1-Score: 0.964
   Correct entities: 225
   True entities: 231
   Predicted entities: 236

🏆 BEST METHOD: Method 4 (F1: 0.964)


Finally, we show a few randomly chosen examples from the test set for a qualitative analysis. For each, we print the sentence, the true labels, and the predictions from all four methods to see where the models perform well or make mistakes.


In [None]:
import numpy as np

print(f"\n" + "=" * 60)
print("QUALITATIVE ANALYSIS - Examples")
print("=" * 60)

# Pick up to 5 distinct random examples from the test subset
n_examples = min(5, len(test_subset_sentences))
for i in np.random.choice(len(test_subset_sentences), size=n_examples, replace=False):
    sentence = test_subset_sentences[i]
    true_labels = test_subset_labels[i]

    print(f"\n--- Example {i+1} ---")
    print(f"Sentence: {' '.join(sentence)}")
    print(f"True:     {' '.join(true_labels)}")
    print(f"Method 1: {' '.join(method1_predictions[i])}")
    print(f"Method 2: {' '.join(method2_predictions[i])}")
    print(f"Method 3: {' '.join(method3_predictions[i])}")
    print(f"Method 4: {' '.join(method4_predictions[i])}")


QUALITATIVE ANALYSIS - Examples

--- Example 34 ---
Sentence: Squad : Javier Pertile , Paolo Vaccari , Marcello Cuttitta , Ivan Francescato , Leandro Manteri , Diego Dominguez , Francesco Mazzariol , Alessandro Troncon , Orazio Arancio , Andrea Sgorlon , Massimo Giovanelli , Carlo Checchinato , Walter Cristofoletto , Franco Properzi Curti , Carlo Orlandi , Massimo Cuttitta , Giambatista Croci , Gianluca Guidi , Nicola Mazzucato , Alessandro Moscardi , Andrea Castellani .
True:     O O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O
Method 1: O O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON PERSON O PERSON P

## Final Discussion

The results show that all four zero-shot NER methods perform surprisingly well with Gemini: each reaches a token-level F1 score above 0.90 on the 50-sentence test subset.

Method 1 (Baseline) performed very well (F1: 0.927) considering its simplicity, suggesting that Gemini 2.5 has strong inherent NER capabilities even with minimal prompting. Its recall (0.913) is lower than its precision (0.942), and it labeled fewer tokens as entities (224) than the gold annotations contain (231), which suggests it tends to be somewhat conservative in entity identification.

Methods 2 and 3 (Decomposed QA and POS Augmentation) showed similar performance (F1: 0.903 and 0.913 respectively), both below the baseline (0.927) and the detailed-definitions approach (0.964). It's surprising how little difference there is between those two methods, considering that Method 3 also includes POS tags. This might be because modern LLMs already have a strong internal representation of syntax, making explicit POS information less impactful.
It's also surprising that they perform worse than the baseline, despite the additional information. A likely explanation is that asking the LLM to generate all the labels in a single prompt, instead of making separate requests for each one, leads the model to approach the task globally rather than focusing on one label at a time. As a result, the advantage of using separate requests, where the model can dedicate more attention and context to each individual label, is lost.

Method 4 (Detailed Definitions) emerged as the clear winner with an F1 score of 0.964, achieving both the highest precision (0.953) and the highest recall (0.974). Providing clear descriptions of each entity type likely reduces ambiguity and helps capture edge cases that simpler prompts might miss.

Overall, with careful prompt engineering, zero-shot LLMs achieve high token-level scores on this small CoNLL-2003 subset. However, the output format is not always perfectly consistent, and errors tend to happen in more ambiguous cases (like event names or less frequent classes). Using a more robust parsing strategy and experimenting with even more detailed prompts could further improve performance.

In the future, it could be useful to compare the zero-shot results from Gemini 2.5 with results from supervised models that are fine-tuned on the same dataset. This would help us understand the differences between using a powerful general model and a model trained specifically for the task.

Another idea is to test the same prompts on domain-specific NER datasets, such as biomedical or legal texts. These domains use different types of entities and vocabulary, so it would be interesting to see how well the model can adapt.

Finally, it would be interesting to repeat the experiment using a smaller and less powerful language model. In that case, the way the prompt is written might have a bigger impact on the results, because simpler models may need more help to understand the task.
