## Authors

| Student Name | Student ID | Degree                                |
|--------------|------------|----------------------------------------|
| Fabrizio Genilotti    | 2119281       | Master Degree in Computer Engineering |
| Francesco Boscolo Meneguolo    | 2119969        | Master Degree in Computer Engineering |

# Zero-shot NER with Large Language Models

In this project the **Falcon-7B-Instruct** [[1](#)] small-size Large Language Model (LLM) is used to perform the Named Entity Recognition (NER) task on the **Few-NERD** [[2](#)] dataset using one of four possible approaches: Vanilla, Decomposed-QA, Tool Augmentation and Salient Entity Span.
The model performance with each approach is assessed according to two types of evaluation: exact matching and token-level evaluation.
The last section reports the results and discusses them.

The **reference bibliography** is in the last cell of the notebook.

In [None]:
# Choose NER zero-shot strategy (0 = 'Vanilla', 1 = 'Decomposed-QA', 2 = 'Tool Augmentation', 3 = 'Salient Entity Span')
ner_strategy = 2

First, install the required libraries.

In [None]:
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install hanlp
!pip install spacy

## LLM Model

Falcon-7B-Instruct is a 7B-parameter causal decoder-only model built by TII, based on Falcon-7B [[1](#)] and fine-tuned on a mixture of chat/instruct datasets.

*For Falcon-7B-Instruct paper coming soon - June 2025*

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Model name chosen from the Hugging Face library
model_id = "tiiuae/Falcon3-7B-Instruct"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model into memory (use GPU if available, else CPU)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

# Generate using pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

## Dataset
Few-NERD [[2](#)] is a large-scale, fine-grained, manually annotated named entity recognition dataset. It contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens. The coarse-grained classification labels are: O (0), art (1), building (2), event (3), location (4), organization (5), other (6), person (7), product (8).

It uses the **IO** tagging scheme instead of BIO for evaluation.

**N.B.:** label 'O' means **"not an entity"**.
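
To make the difference between the two schemes concrete, the sketch below tags a toy sentence in both conventions. The sentence, the tags and the helper `bio_to_io` are illustrative assumptions and are not taken from Few-NERD.

```python
# Toy example (not from Few-NERD): the same sentence tagged with BIO and IO.
tokens = ["Barack", "Obama", "visited", "New", "York", "City"]
bio_tags = ["B-person", "I-person", "O", "B-location", "I-location", "I-location"]


def bio_to_io(tags):
    """Drop the B-/I- prefixes, keeping only the entity type (or 'O')."""
    return [tag if tag == "O" else tag.split("-", 1)[1] for tag in tags]


io_tags = bio_to_io(bio_tags)
for token, bio, io in zip(tokens, bio_tags, io_tags):
    print(f"{token:<8} {bio:<12} {io}")
```

Since IO has no begin marker, two adjacent entities of the same type cannot be told apart and are read as a single span.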

In [None]:
!pip install -U datasets fsspec

# Dataset loading
from datasets import load_dataset
dataset = load_dataset("DFKI-SLT/few-nerd", "inter")

In [None]:
# Dataset structure
print(dataset)
print("\n")

# Dataset instance
print("Dataset instance:")
print(dataset["train"][1], end="\n\n")

## Named Entity Recognition Strategies

To perform the Named Entity Recognition (NER) task using an instructed Large Language Model (LLM), four prompting strategies are proposed: Vanilla, Decomposed-QA, Tool Augmentation and Salient Entity Span.

- **Vanilla**:

  The first approach provides the model with a single structured prompt containing the entity label set, the text and the output format the model is asked to follow.

- **Decomposed-QA**:

  The second approach [[3](#)] decomposes the task into multiple iterations, each focusing on a single entity type. The initial prompt includes the entity label set, the text and the output format. At each iteration, the prompt is expanded by asking the model to detect the entities belonging to one of the labels in the label set. Once the question for each label has been answered, the conversation ends. This strategy aims to narrow the model's focus during each extraction step.

- **Tool Augmentation**:

  The third approach [[3](#)] first obtains the syntactic information of the input text via the **spaCy** POS tagging tool [[4](#)]. It then feeds the input text together with the syntactic information to the model. Finally, the prompt is iteratively expanded as in Decomposed-QA.

- **Salient Entity Span**:

  The fourth approach breaks the task into two stages. In the first stage, the model is prompted with the input text and is asked to output a list of potential salient entity spans. In the second stage, the model is re-prompted to classify the entity spans given the entity label set, the text, the potential entity spans and the requested output format. This approach separates span detection from classification, potentially improving accuracy by allowing more focused and specialized reasoning in each stage.

The dataset is used to test each of these strategies. Since the test process has a heavy computational cost in this setting, the **test is run over 50 dataset instances**.

**N.B.:** validation instances (used in the debugging phase) are different from test instances.
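
The iterative strategies differ from the single-prompt ones mainly in how the prompt grows across questions. The following is a minimal sketch of that loop, assuming a stand-in model: `fake_generate` and `decomposed_qa` are hypothetical names used only for illustration, and the prompt wording is simplified with respect to the one used in the notebook.

```python
# Sketch of the Decomposed-QA dialogue loop with a stand-in for the LLM.
# `fake_generate` is a hypothetical placeholder: it always answers with an empty list.
def fake_generate(prompt):
    return "[]"


def decomposed_qa(text, labels, generate):
    """Ask one question per label, appending each answer to the running dialogue."""
    dialogue = f"Label set: {labels}. Text: '{text}'."
    answers = {}
    for label in labels:
        dialogue += f" Question: what are the named entities as '{label}' in the text? Answer:"
        answer = generate(dialogue)
        dialogue += f" {answer}"  # the next question sees all previous answers
        answers[label] = answer
    return dialogue, answers


dialogue, answers = decomposed_qa("Paris is in France.", ["location", "person"], fake_generate)
print(dialogue)
```

Because every question is appended to the same dialogue, the prompt length and the generation cost grow with the number of labels.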

In [None]:
# O (0), art (1), building (2), event (3), location (4), organization (5), other(6), person (7), product (8).
# Entity labels of interest, O excluded
entity_labels = ["art", "building", "event", "location", "organization", "other", "person", "product"]

# Store results in list
results = []

# Number of test instances
test_size = 50

### Vanilla

In [None]:
import ast
import json
import re

# Convert into multiple dictionaries, one per each possible entity tag
def to_dialogue_stages(raw_output_str, entity_labels):
    # Remove <|assistant|> and whitespace
    if "<|assistant|>" in raw_output_str:
      cleaned = raw_output_str.split("<|assistant|>", 1)[-1].strip()
    else:
      cleaned = raw_output_str.strip()

    # Fix improperly formatted dictionary-like output (e.g., ['key': 'value'])
    if cleaned.startswith("[") and ":" in cleaned:
        cleaned = cleaned.replace("[", "{").replace("]", "}")

    # String to dictionary
    entity_dict = ast.literal_eval(cleaned)

    label_to_entities = {label: [] for label in entity_labels}

    # Insert entity in correct dictionary
    for entity, label in entity_dict.items():
        if label in label_to_entities:
            label_to_entities[label].append(entity)

    return [{'label': label, 'raw_answer': label_to_entities[label]} for label in entity_labels]

# Vanilla NER zero-shot strategy
if ner_strategy == 0:
  # Loop over train corpus
  for i, sample in enumerate(dataset["train"]):
      # Prepare input text
      text_input = " ".join(sample["tokens"]).replace(" .", ".").replace(" ,", ",").replace("'", " ")
      print(i)

      # Safely quote the text using JSON
      quoted_text_input = json.dumps(text_input)

      # Build a single prompt for all entity types
      prompt = f"""Given the entity label set: {{art, building, event, location, organization, other, person, product}}. Based on the given entity label set, please recognize the named entities in the given text. Text: '{quoted_text_input}'. Answer has to be dictionary like: {{'entity_name': 'label'}}. Answer:"""

      # Generate output from model
      output = pipe(prompt, max_new_tokens=1000, do_sample=False)[0]["generated_text"]

      # Extract answer after the "Answer:"
      entity_dict_text = output.split("Answer:")[-1].strip()

      # Store the result
      result = {
          "text_index": i,
          "text_input": text_input,
          "dialogue_stages": to_dialogue_stages(entity_dict_text, entity_labels)
      }

      results.append(result)

      # Iterate over test_size instances
      if i == (test_size - 1):
        print("NER System: end")
        break

### Decomposed QA

In [None]:
import re
import json
import ast

# Extract the list of entities from a raw model answer
def normalize_raw_answer(raw_answer, current_label):
    # Remove <|assistant|> and whitespace
    cleaned = re.sub(r"<\|assistant\|>\s*", "", raw_answer.strip())

    # Treat empty string as empty answer
    if cleaned in ("", "{}"):
        return {}

    parsed = ast.literal_eval(cleaned)

    # Default when no list of entities can be recovered
    candidates = []

    # Handle the different formats the model may output
    if isinstance(parsed, dict):
        # Return the first list found among the dictionary values
        for v in parsed.values():
            if isinstance(v, list):
                return v
    elif isinstance(parsed, list):
        candidates = parsed

    return candidates

# Decomposed-QA NER zero-shot strategy
if ner_strategy == 1:
  # Loop over train corpus
  for i, sample in enumerate(dataset["train"]):
    # Input text
    text_input = " ".join(sample["tokens"]).replace(" .", ".").replace(" ,", ",")
    print(i)

    # Safely quote the text using JSON
    quoted_text_input = json.dumps(text_input)

    # Initialize dialogue conversation
    dialogue =  f"""Based on the given entity label set {{art, building, event, location, organization, other, person, product}}, please recognize the named entities in the given text. Text: '{quoted_text_input}'. Answer has to be like in JSON format {{'label': []}}."""

    # Store dialogue Q&A of current text input
    result = {
        "text_index": i,
        "text_input": text_input,
        "dialogue_stages": []
    }

    for label in entity_labels:
      # Build prompt
      dialogue = dialogue + f""" Question: what are the named entities as '{label}' in the text?.
      Answer:"""

      # Generate model output
      output = pipe(dialogue, max_new_tokens=200, do_sample=False)[0]["generated_text"]

      # Extract last answer
      last_answer = output.split("Answer:")[-1].strip()

      # Append the answer relative to last question
      dialogue = dialogue + f"{last_answer}"

      # Store result
      result["dialogue_stages"].append({
          "label": label,
          "raw_answer": normalize_raw_answer(last_answer, label)
      })

    results.append(result)

    # Stop after test_size instances
    if i == (test_size - 1):
      print("NER System: end")
      break

### Tool Augmentation

In [None]:
import re
import ast
import json
import spacy

# Tool Augmentation
if ner_strategy == 2:
    # spacy POS tagging model
    spacy_pipeline = spacy.load("en_core_web_sm")

    # Loop over train corpus
    for i, sample in enumerate(dataset["train"]):
      # Input text
      text_input = " ".join(sample["tokens"]).replace(" .", ".").replace(" ,", ",")
      print(f"Instance {i}")

      # Safely quote the text using JSON
      quoted_text_input = json.dumps(text_input)

      # Process the text with spaCy
      doc = spacy_pipeline(quoted_text_input)

      # Extract tokens (including punctuation) and their POS tags
      tokens = [token.text for token in doc]
      pos_tags = [token.pos_ for token in doc]  # Coarse-grained POS

      pos_result = ' '.join(f'{a}/{b}' for a, b in zip(tokens, pos_tags))

      # Initialize dialogue conversation
      dialogue =  f"""Given entity label set: {{art, building, event, location, organization, other, person, product}}. Given the text and the corresponding Part-of-Speech tags, please recognize the named entities in the given text. Text: '{quoted_text_input}'. Part-of-Speech tags: '{pos_result}'. Answer has to be like in JSON format {{'label': []}}."""

      # Store dialogue Q&A of current text input
      result = {
          "text_index": i,
          "text_input": text_input,
          "dialogue_stages": []
      }

      for label in entity_labels:
        # Build prompt
        dialogue = dialogue + f""" Question: what are the named entities as '{label}' in the text?.
        Answer:"""
        print(label)
        # Generate model output
        output = pipe(dialogue, max_new_tokens=200, do_sample=False)[0]["generated_text"]

        # Extract last answer
        last_answer = output.split("Answer:")[-1].strip()

        # Append the answer relative to last question
        dialogue = dialogue + f"{last_answer}"

        # Store result
        result["dialogue_stages"].append({
            "label": label,
            "raw_answer": normalize_raw_answer(last_answer, label)
        })

      results.append(result)

      # Stop after test_size instances
      if i == (test_size - 1):
        print("NER System: end")
        break


### Salient Entity Span

In [None]:
import re
import ast
import json

# Extract entity spans from the list of salient spans
def extract_entities(text_entities):
    match = re.search(r"\[.*\]", text_entities, re.DOTALL)
    if match:
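        try:
            return ast.literal_eval(match.group(0))
        except (ValueError, TypeError, SyntaxError, MemoryError, RecursionError):
            # Malformed list: treat as no spans found
            return []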
    return []

# Convert into multiple dictionaries, one per each possible entity tag
def to_dialogue_stages_spans(raw_output_str, entity_labels):
    # Remove <|assistant|> and whitespace
    if "<|assistant|>" in raw_output_str:
      cleaned = raw_output_str.split("<|assistant|>", 1)[-1].strip()
    else:
      cleaned = raw_output_str.strip()

    # String to dictionary
    entity_dict = ast.literal_eval(cleaned)

    # Normalize extracted entity dict
    normalized_dict = {}
    for entity, label in entity_dict.items():
      if isinstance(label, (set, list)):
          # Handle set or list of labels
          normalized_dict[entity] = next(iter(label), None)
      elif isinstance(label, dict) and 'label' in label:
          normalized_dict[entity] = label['label']
      else:
          normalized_dict[entity] = label

    label_to_entities = {label: [] for label in entity_labels}

    # Insert entity in correct dictionary
    for entity, label in normalized_dict.items():
        if label in label_to_entities:
            label_to_entities[label].append(entity)

    return [{'label': label, 'raw_answer': label_to_entities[label]} for label in entity_labels]


# Salient Entity Span NER zero-shot strategy
if ner_strategy == 3:
  # Loop over train corpus
  for i, sample in enumerate(dataset["train"]):
    # Input text
    text_input = " ".join(sample["tokens"]).replace(" .", ".").replace(" ,", ",")
    print(i)

    # Safely quote the text using JSON
    quoted_text_input = json.dumps(text_input)

    # Salient span highlighting
    prompt = f"""Extract all named entities from the following text. Return them as a JSON list of strings (no labels yet). Only include unique, meaningful names or phrases that refer to people, places, organizations, or other named things. Text: '{quoted_text_input}'. Answer:"""

    # Generate model output
    output_entities = pipe(prompt, max_new_tokens=1000, do_sample=False)[0]["generated_text"]
    entity_list = json.dumps(extract_entities(output_entities))

    prompt = f"""Based on the given entity label set {{art, building, event, location, organization, other, person, product}} and extracted entities, please recognize the named entities in the given text. Text: '{quoted_text_input}'. Entities: {entity_list}. Answer has to be like in JSON format, containing key entity and value label. Answer:"""

    # Generate model output
    output = pipe(prompt, max_new_tokens=1000, do_sample=False)[0]["generated_text"]

    # Extract answer after the "Answer:"
    entity_dict_text = output.split("Answer:")[-1].strip()

    # Store the result
    result = {
        "text_index": i,
        "text_input": text_input,
        "dialogue_stages": to_dialogue_stages_spans(entity_dict_text, entity_labels)
    }

    results.append(result)

    # Stop after test_size instances
    if i == (test_size - 1):
      print("NER System: end")
      break

## Entity Post-processing

Before performing evaluation, the model predictions are post-processed.

For each dataset instance, the model prediction is parsed to extract entity spans in a structured format. Each entity span is defined as a tuple of the form `(text, (start_idx, end_idx), label)`, where:

- `text` : the entity text span, as it appears in the input
- `(start_idx, end_idx)` : the span boundaries as token indices, with `start_idx` inclusive and `end_idx` exclusive
- `label` : the gold/predicted entity type
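
As a small illustration of this tuple format, the sketch below writes out the spans of a toy sentence by hand. The sentence and its labels are assumptions made for the example and do not come from the dataset. In the extraction functions of this notebook `end_idx` is exclusive, so slicing the token list with the span recovers the entity text.

```python
# Toy sentence (illustrative, not from the dataset) and its entity spans.
tokens = ["The", "Eiffel", "Tower", "is", "in", "Paris", "."]
spans = [
    ("Eiffel Tower", (1, 3), "building"),  # tokens[1:3]
    ("Paris", (5, 6), "location"),         # tokens[5:6]
]

for text, (start_idx, end_idx), label in spans:
    # end_idx is exclusive, so a plain slice recovers the entity text
    assert " ".join(tokens[start_idx:end_idx]) == text
    print(f"{label:<9} tokens[{start_idx}:{end_idx}] -> {text}")
```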



In [None]:
entity_labels = ["O", "art", "building", "event", "location", "organization", "other", "person", "product"]

In [None]:
# Extract dataset training instance in IO convention (not BIO)
def extract_io_entities(tokens, tags, label_map):
    """
    Converts an IO-tagged sequence into entity spans.
    Each span is (text, (start_idx, end_idx), label)
    """

    # Store entity tuples
    entities = []

    # Current entity text data
    current_entity = []
    current_label = None
    start_idx = None

    # Process each token with its tag
    for idx, tag in enumerate(tags):
      if tag != 0:
        label = label_map[tag]

        # Check if token is in current entity (else start new entity)
        if current_label == label:
          current_entity.append(tokens[idx])
        else:
          if current_entity:
            span = (start_idx, idx)
            entities.append((" ".join(current_entity), span, current_label))
          current_entity = [tokens[idx]]
          current_label = label
          start_idx = idx
      else:
        # Save entity when next tag is 'O'
        if current_entity:
          span = (start_idx, idx)
          entities.append((" ".join(current_entity), span, current_label))
          current_entity = []
          current_label = None
          start_idx = None

    # Last entity handle
    if current_entity:
        span = (start_idx, len(tokens))
        entities.append((" ".join(current_entity), span, current_label))

    return entities

In [None]:
# Extract predicted LLM entities as spans
def extract_llm_entities(llm_output, tokens):
    """
    Converts the parsed LLM answers into entity spans.
    Each span is (text, (start_idx, end_idx), label)
    """
    predictions = []

    # Process each entity recognized by LLM
    for entry in llm_output:
      label = entry["label"]
      raw_answer = entry["raw_answer"]

      # Skip labels with no predicted entities
      if not raw_answer:
        continue

      for entity_text in raw_answer:
          # Get each token of current entity
          entity_tokens = entity_text.split()

          # Skip empty strings, which would otherwise match a zero-length span
          if not entity_tokens:
            continue

          # Keep the first occurrence of the entity in the sentence
          for i in range(len(tokens)):
            if tokens[i : i + len(entity_tokens)] == entity_tokens:
              span = (i, i + len(entity_tokens))
              predictions.append((" ".join(tokens[i:span[1]]), span, label))
              break

    return predictions

## Evaluation
The evaluation relies on two approaches:
- **Exact evaluation** (coarse-grained)

  This method assesses the model's ability to identify and localize complete entity spans with the correct labels. A predicted entity is considered correct only if it matches the ground truth exactly in both span boundaries and entity type. Partial matches or mislabelings are treated as errors.

- **Token-level evaluation** (fine-grained)

  This method assesses the model's performance at the level of individual tokens by comparing predicted and ground truth labels. Entity spans are converted into token-wise IO labels, excluding the 'O' label representing non-entity tokens. This approach captures how well the model labels tokens even when full entity spans are not perfectly matched.
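
The difference between the two evaluations can be seen on a single sentence. The sketch below is a simplified illustration under assumed toy data: the spans are written by hand and `to_token_labels` is a hypothetical helper, not the function used in the notebook.

```python
# Toy gold and predicted spans (illustrative): (text, (start_idx, end_idx), label)
tokens = ["The", "Eiffel", "Tower", "is", "in", "Paris", "."]
gold = {("Eiffel Tower", (1, 3), "building"), ("Paris", (5, 6), "location")}
pred = {("Tower", (2, 3), "building"), ("Paris", (5, 6), "location")}

# Exact evaluation: a span counts only if text, boundaries and label all match.
tp = len(gold & pred)
fp = len(pred - gold)
fn = len(gold - pred)
print(f"Exact match: TP={tp}, FP={fp}, FN={fn}")


def to_token_labels(n_tokens, spans):
    """Project spans onto per-token IO labels ('O' outside any entity)."""
    labels = ["O"] * n_tokens
    for _, (start, end), label in spans:
        labels[start:end] = [label] * (end - start)
    return labels


gold_labels = to_token_labels(len(tokens), gold)
pred_labels = to_token_labels(len(tokens), pred)
# Token-level evaluation: the partial span still gets credit for "Tower".
matches = sum(g == p != "O" for g, p in zip(gold_labels, pred_labels))
print(f"Token-level: {matches} of {sum(g != 'O' for g in gold_labels)} entity tokens correct")
```

The partial prediction "Tower" counts as one false positive and one false negative under exact evaluation, while at token level it still contributes one correctly labelled token.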

### **Micro, Macro and Weighted average**

For the **Token-level evaluation**, three types of averaging methods are considered for the chosen metrics:

- **Micro average**: it computes metrics by aggregating the contributions of all classes. It treats every instance equally, regardless of class, by summing up the true positives, false positives, and false negatives across all classes before calculating the metric.

  **This averaging method is the main focus of the evaluation.**

- **Macro average**: it computes the metrics independently for each class and then takes the unweighted mean. It treats all classes equally, regardless of their frequency.

- **Weighted average**: it computes metrics independently for each class, and then takes the mean weighted by the number of true instances in each class. This accounts for class imbalance by giving more influence to more frequent classes.
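
As a worked example of the three averages, the sketch below computes precision by hand on a few toy token labels. The labels are invented for illustration, and only precision is shown to keep the example short; recall and F1 are averaged in the same way.

```python
from collections import Counter

# Toy token labels (illustrative); 'O' is excluded from the averaged classes.
y_true = ["location", "location", "location", "person", "O", "O"]
y_pred = ["location", "location", "person", "person", "location", "O"]
classes = ["location", "person"]

tp, fp, support = Counter(), Counter(), Counter()
for t, p in zip(y_true, y_pred):
    if t in classes:
        support[t] += 1
    if p in classes:
        if p == t:
            tp[p] += 1
        else:
            fp[p] += 1

# Per-class precision
precision = {c: tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in classes}

# Micro: pool the counts of all classes, then compute the metric once
micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
# Macro: unweighted mean of the per-class values
macro = sum(precision.values()) / len(classes)
# Weighted: mean of the per-class values weighted by support (true tokens per class)
weighted = sum(precision[c] * support[c] for c in classes) / sum(support.values())

print(f"micro={micro:.3f} macro={macro:.3f} weighted={weighted:.3f}")
```

On this toy input the three averages differ (0.600, 0.583 and 0.625) because `location` has three times the support of `person`.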

In [None]:
# Possible entity tags
entity_labels = ["O", "art", "building", "event", "location", "organization", "other", "person", "product"]

In [None]:
# EXACT EVALUATION

def exact_evaluation(results, dataset, entity_labels, start_index, end_index):
  # Initialize counters
  total_tp = 0
  total_fp = 0
  total_fn = 0

  # Loop over instances for computing exact match metrics
  for i in range(start_index, end_index + 1):

      tokens = dataset["train"][i]["tokens"]

      # Extract true and predicted entities
      true_entities = extract_io_entities(tokens, dataset["train"][i]["ner_tags"], entity_labels)
      pred_entities = extract_llm_entities(results[i]["dialogue_stages"], tokens)

      # Convert to sets for exact match comparison (IO format)
      true_set = set(true_entities)
      pred_set = set(pred_entities)

      # Compute counts
      tp = len(true_set & pred_set)
      fp = len(pred_set - true_set)
      fn = len(true_set - pred_set)

      # Accumulate totals
      total_tp += tp
      total_fp += fp
      total_fn += fn

  # Compute final precision, recall, F1
  precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0.0
  recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0.0
  f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

  # Final metric scores (exact match)
  metric_results = {
      "precision": precision,
      "recall": recall,
      "f1": f1,
      "true_positives": total_tp,
      "false_positives": total_fp,
      "false_negatives": total_fn,
  }

  return metric_results

# Exact match computation
metric_results = exact_evaluation(results, dataset, entity_labels, 0, test_size - 1)

# Print metrics
print("Exact evaluation - Evaluation Results:")
print(f"Precision: {metric_results['precision']:.3f}")
print(f"Recall:    {metric_results['recall']:.3f}")
print(f"F1 Score:  {metric_results['f1']:.3f}")
print(f"TP: {metric_results['true_positives']}, FP: {metric_results['false_positives']}, FN: {metric_results['false_negatives']}")

In [None]:
from sklearn.metrics import classification_report

In [None]:
# TOKEN-LEVEL EVALUATION

# Convert entity span to label sequence over entire sentence
def spans_to_io_labels(tokens, spans):
    labels = ['O'] * len(tokens)
    for _, (start_idx, end_idx), ent_type in spans:
        for i in range(start_idx, end_idx):
            labels[i] = f"{ent_type}"
    return labels

def token_level_evaluation(results, dataset, entity_labels, start_index, end_index):
  # Data structures for metric evaluation
  all_true_labels = []
  all_pred_labels = []

  for i in range(start_index, end_index + 1):
      tokens = dataset["train"][i]["tokens"]

      # Extract entities
      ground_truth = extract_io_entities(tokens, dataset["train"][i]["ner_tags"], entity_labels)
      predicted = extract_llm_entities(results[i]["dialogue_stages"], tokens)

      # Convert spans to token-level labels
      true_labels = spans_to_io_labels(tokens, ground_truth)
      pred_labels = spans_to_io_labels(tokens, predicted)

      # Aggregate
      all_true_labels.extend(true_labels)
      all_pred_labels.extend(pred_labels)

  return all_pred_labels, all_true_labels

# Token-level evaluation
all_pred_labels, all_true_labels = token_level_evaluation(results, dataset, entity_labels, 0, test_size - 1)

# Extract labels present in data (sorted for a stable report order)
labels = sorted(set(all_true_labels + all_pred_labels) - {'O'})

# Classification report
report = classification_report(all_true_labels, all_pred_labels, labels=labels, zero_division=0)
print("Token-level evaluation - Evaluation Results:")
print(report)

## Evaluation results

### **Vanilla**
The method shows limited capability in detecting entity spans, as suggested by its low performance metrics. It achieves a low precision, reflecting a high number of false positives, and a moderately higher recall, indicating a better but still limited ability to identify true entities and reduce false negatives. Overall, the F1 score underscores the need for improvement in both precision and recall.

#### Exact Match
```
Precision: 0.279
Recall:    0.510
F1 Score:  0.361
TP: 53, FP: 137, FN: 51
```

#### Token-level
```
              precision    recall  f1-score   support

    building       0.09      0.17      0.12        23
      person       0.28      0.73      0.40        11
organization       0.68      0.51      0.58        63
     product       0.40      0.86      0.55         7
       other       0.00      0.00      0.00         0
    location       0.52      0.60      0.56        73
       event       0.00      0.00      0.00         4

   micro avg       0.29      0.52      0.37       181 <--
   macro avg       0.28      0.41      0.31       181
weighted avg       0.49      0.52      0.49       181
```

### **Decomposed-QA**
The performance of this method is comparable to that of the Vanilla approach. It exhibits a slightly higher precision and a higher recall. This suggests that the model tends to identify more entities within the sentences, including correct ones. However, it also produces more spurious entities that are not actually present (false positives rise from 137 to 152).

#### Exact Match
```
Precision: 0.283
Recall:    0.577
F1 Score:  0.380
TP: 60, FP: 152, FN: 44
```

#### Token-level
```
              precision    recall  f1-score   support

     product       0.00      0.00      0.00         7
    building       0.00      0.00      0.00        23
       event       0.00      0.00      0.00         4
       other       0.00      0.00      0.00         0
    location       0.39      0.74      0.51        73
         art       0.00      0.00      0.00         0
      person       0.26      0.91      0.41        11
organization       0.46      0.62      0.53        63

   micro avg       0.28      0.57      0.38       181 <--
   macro avg       0.14      0.28      0.18       181
weighted avg       0.33      0.57      0.41       181
```

### **Tool augmentation**
This is the worst approach in terms of results. The POS tags do not help the model: compared with the other approaches, precision, recall and F1 are lower both in exact matching and in the token-level micro average.

#### Exact Match
```
Precision: 0.202
Recall:    0.481
F1 Score:  0.284
TP: 50, FP: 198, FN: 54
```

#### Token-level
```
              precision    recall  f1-score   support

    location       0.36      0.70      0.48        73
    building       0.00      0.00      0.00        23
      person       0.25      0.91      0.39        11
       other       0.00      0.00      0.00         0
organization       0.43      0.38      0.40        63
       event       0.03      0.25      0.05         4
     product       0.21      0.43      0.29         7
         art       0.00      0.00      0.00         0

   micro avg       0.25      0.49      0.33       181 <--
   macro avg       0.16      0.33      0.20       181
weighted avg       0.32      0.49      0.37       181
```

### **Salient entity span**
This method comes close to a moderate ability to correctly identify entities, with a better balance between precision and recall than the other methods. While recall is slightly lower than with the **Decomposed-QA** approach, precision is significantly higher, indicating that the model makes fewer false positive predictions and thus produces more reliable outputs.

#### Exact Match
```
Precision: 0.415
Recall:    0.519
F1 Score:  0.462
TP: 54, FP: 76, FN: 50
```

#### Token-level
```
              precision    recall  f1-score   support

    location       0.45      0.63      0.52        73
    building       0.05      0.09      0.07        23
      person       0.31      0.73      0.43        11
       other       0.00      0.00      0.00         0
       event       0.00      0.00      0.00         4
     product       0.38      0.71      0.50         7
organization       0.72      0.54      0.62        63
         art       0.00      0.00      0.00         0

   micro avg       0.37      0.52      0.44       181 <--
   macro avg       0.24      0.34      0.27       181
weighted avg       0.47      0.52      0.48       181
```

## Conclusions
Overall, the model demonstrates limited capability in solving the Named Entity Recognition (NER) task across most methods. However, it comes close to moderate performance with the Salient Entity Span approach, which stands out as the most effective.

Notably, even though the **Falcon-7B-Instruct** model was not fine-tuned for NER, it still shows promising results when paired with the right method. These results are particularly noteworthy given both the small size of the model and the use of a general-domain dataset, which increases the complexity of the task.
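
The exact-match scores of the four strategies can be re-derived from the TP/FP/FN counts reported in the results section. The sketch below is a simple consistency check; the helper name `prf` is introduced here only for illustration.

```python
# Exact-match counts reported above for each strategy: (TP, FP, FN)
counts = {
    "Vanilla": (53, 137, 51),
    "Decomposed-QA": (60, 152, 44),
    "Tool Augmentation": (50, 198, 54),
    "Salient Entity Span": (54, 76, 50),
}


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


scores = {name: prf(*c) for name, c in counts.items()}
# Print the strategies from best to worst F1
for name, (p, r, f1) in sorted(scores.items(), key=lambda kv: kv[1][2], reverse=True):
    print(f"{name:<20} P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

The number of true positives is similar across strategies (50 to 60), so the ranking is driven mostly by the number of false positives.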

## References

1. **Almazrouei et al.**  <br />
  *The Falcon Series of Open Language Models* <br />
  [https://arxiv.org/abs/2311.16867](https://arxiv.org/abs/2311.16867)<br />
  *For Falcon-7B-Instruct paper coming soon - June 2025* <br />
  **Model:** [https://huggingface.co/tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)

2. **Ning Ding et al.**  
   *Few-NERD: A Few-shot Named Entity Recognition Dataset*.  
   [https://arxiv.org/abs/2105.07464](https://arxiv.org/abs/2105.07464)

3. **Tingyu Xie et al.**  
  *Empirical Study of Zero-Shot NER with ChatGPT*. 2023.  
  [https://arxiv.org/abs/2310.10035](https://arxiv.org/abs/2310.10035)

4. **spaCy**  
  *Part-of-speech tagging*.  
  [https://spacy.io/usage/linguistic-features#pos-tagging](https://spacy.io/usage/linguistic-features#pos-tagging)