# 💫  Explore and analyze NER predictions

In this tutorial, we will learn to log [spaCy](https://spacy.io/) Named Entity Recognition (NER) predictions into Argilla.

This is useful for:

- 🧐Evaluating pre-trained models.
- 🔎Spotting frequent errors both during development and production.
- 📈Annotating records to create a gold-standard evaluation dataset.


Reference: https://docs.argilla.io/en/latest/tutorials/notebooks/labelling-tokenclassification-spacy-pretrained.html


## Introduction

In this tutorial, we will learn how to explore and analyze spaCy NER pipelines in an easy way.

We will load the [*Gutenberg Time*](https://huggingface.co/datasets/gutenberg_time) dataset from the Hugging Face Hub, detect entities in it with a transformer-based spaCy pipeline, and log the detected entities into an Argilla dataset. That dataset can be used to explore the quality of the predictions and to create a new training set by correcting, adding, and validating entities through human annotation.

Let's import the Argilla module for reading and writing data:

In [None]:
import argilla as rg
import spacy
from gliner_spacy.pipeline import GlinerSpacy

## Argilla with Hugging Face Spaces

In [1]:
api_url = ''  # URL of your Argilla Space
api_key = ''  # API key of your Argilla user
client = rg.Argilla(
    api_url=api_url,
    api_key=api_key
)

## Argilla with Docker

In [4]:
# default argilla username is 'argilla'
# default argilla password is '12345678'
# default api_key for argilla on docker is 'argilla.apikey'
client = rg.Argilla(
    api_url="http://localhost:6900",
    api_key="argilla.apikey"
)
client # Test your login! :)

Argilla has been deployed at: http://localhost:6900
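
Both connection cells above hard-code the API URL and key. A minimal sketch of reading them from environment variables instead is shown here. The variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` and the helper `load_argilla_credentials` are assumptions made for this example, and the fallback values are the Docker defaults quoted in the previous cell.

```python
import os


def load_argilla_credentials(env=None):
    """Return (api_url, api_key), preferring environment variables.

    The variable names are assumptions made for this sketch; the fallback
    values are the Docker defaults mentioned in the cell above.
    """
    env = os.environ if env is None else env
    api_url = env.get("ARGILLA_API_URL", "http://localhost:6900")
    api_key = env.get("ARGILLA_API_KEY", "argilla.apikey")
    return api_url, api_key


api_url, api_key = load_argilla_credentials()
# These values can then be passed on: rg.Argilla(api_url=api_url, api_key=api_key)
```

Keeping the key out of the notebook also makes it safer to share or commit the file.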

Finally, let's include the imports we need:

In [5]:
from pathlib import Path

from datasets import load_dataset, load_from_disk
import pandas as pd
import spacy
from tqdm.auto import tqdm

## Our dataset
For this tutorial, our default dataset is the [*Gutenberg Time*](https://huggingface.co/datasets/gutenberg_time) dataset from the Hugging Face Hub. It contains all explicit time references in a dataset of 52,183 novels whose full text is available via Project Gutenberg. Extracts from novels are likely to contain plenty of named entities.

If you are following the full lab, you can also load the dataset you generated in the previous notebook.

In [18]:
# when using this notebook standalone, choose a dataset from the hub
# dataset = load_dataset("gutenberg_time", split="train", streaming=True)

# when using this notebook as part of the full lab, load the dataset you created in the previous step
DATASET_PATH = 'data/sampled_dataset/'
dataset = load_from_disk(DATASET_PATH)

# Let's have a look at the first 5 examples of the train set.
try:
    print(pd.DataFrame(dataset.take(5)))
except AttributeError:
    print(pd.DataFrame(dataset[:5]))

                                               input  \
0      Dress, Shoes, and Scarf provided by ModCloth.   
1      Dress, Shoes, and Scarf provided by ModCloth.   
2                   Clothing for both Men and Women.   
3  Please dress appropriately for varied weather ...   
4  Jackets – warm down jacket, a lightweight jack...   

                                              output  \
0  ['Dress <> Clothing Item <> A garment worn by ...   
1  ['ModCloth <> Clothing Retailer <> A company t...   
2  ['Clothing <> Product <> Items worn on the bod...   
3  ['varied weather conditions <> Weather <> Envi...   
4  ['Jackets <> Clothing Item <> A piece of cloth...   

                                          embeddings  
0  [0.02724652737379074, 0.0037798627745360136, 0...  
1  [0.02724652737379074, 0.0037798627745360136, 0...  
2  [0.03369726613163948, 0.020493067800998688, -0...  
3  [-0.03423319756984711, 0.018985847011208534, -...  
4  [-0.04015563800930977, -0.061232659965753555, ..
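
The `try`/`except` in the cell above handles two cases: a streaming dataset is previewed with `take`, while a dataset loaded from disk is sliced. A sketch of an alternative that only assumes the dataset can be iterated row by row, which is how the logging loop later in this notebook reads it. `first_rows` is a hypothetical helper, and a plain list of dictionaries stands in for the real dataset.

```python
from itertools import islice


def first_rows(rows, n=5):
    """Return the first n rows of any iterable without consuming the rest."""
    return list(islice(rows, n))


# Stand-in rows; the real dataset yields dictionaries with an 'input' key.
sample_rows = [{"input": f"example {i}"} for i in range(10)]
preview = first_rows(sample_rows, n=3)
print(preview)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as in the cell above.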

-----

## Annotating with GLiNER and Logging NER entities into Argilla


Let's instantiate a spaCy `nlp` pipeline and apply it to the examples in our dataset, collecting the *tokens* and *named entities*.

We're going to use a [GLiNER](https://github.com/urchade/GLiNER) model to perform zero-shot NER. This means we can provide any entity labels we like!

In [None]:
# observe how gliner works

# Gliner model options https://huggingface.co/urchade 
gliner_model = "urchade/gliner_largev2"

# Define your domain here: the list of entity types you expect to see
zero_shot_labels = ["person", "organization", "email", "sports team", "business"]

# Configuration for GLiNER integration
custom_spacy_config = {
    "gliner_model": gliner_model,
    "chunk_size": 250,
    "labels": zero_shot_labels,
    "style": "ent"
}

# Initialize a blank English spaCy pipeline and add GLiNER
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

# Example
text = "This is a text about Bill Gates and Microsoft."

# Process the text with the pipeline
doc = nlp(text)

# Output detected entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Bill Gates person
Microsoft organization


## Create Argilla Dataset

In [None]:
# If you restart the kernel and want to continue from here, run this cell to restore the variable.
# The dataset persists on the server, so recreating argilla_dataset from scratch would raise an error.
# The loop variable is named ds so that it does not overwrite the `dataset` loaded earlier.
for ds in client.datasets.list():
    if ds.name == "argilla_dataset":
        argilla_dataset = ds

In [20]:
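# In case you'd like to delete argilla_dataset and start over, just run this cell

dataset_to_delete = client.datasets(name="argilla_dataset")

# the lookup may return None when no dataset with that name exists
if dataset_to_delete is not None:
    dataset_deleted = dataset_to_delete.delete()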

In [21]:
# dataset_name = "gutenberg_spacy_ner"
dataset_name = "argilla_dataset"
# define the labels you will be using for this lab
labels = ['clothing', 'organization', 'address', 'event']

In [22]:
gliner_model = "urchade/gliner_largev2"

# Configuration for GLiNER integration with the labels defined above
import spacy
custom_spacy_config = {
    "gliner_model": gliner_model,
    "chunk_size": 250,
    "labels": labels,
    "style": "ent"
}

# Initialize a blank English spaCy pipeline and add GLiNER
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]



<gliner_spacy.pipeline.GlinerSpacy at 0x3fb43a3e0>

In [23]:
# Create settings for Argilla
settings = rg.Settings(
    guidelines="Classify individual tokens into given labels",
    fields = [
        rg.TextField(
            name='text', # give it a name
            title='Text', # this will be displayed on the UI above the text field
            use_markdown=False # not necessary for this application
        )
    ],
    # In Argilla a question is basically an annotation instance.
    # This is a token classification case and Argilla has a built-in question type for that called the SpanQuestion.
    questions=[
        rg.SpanQuestion( 
            name="span_label",
            field='text',
            labels=labels,
            title="Classify individual tokens into given labels",
            allow_overlapping=False
        )
    ]
)

In [24]:
# create argilla dataset
argilla_dataset = rg.Dataset(
    name=dataset_name,
    settings=settings
)
argilla_dataset.create()



Dataset(id=UUID('28b7237b-23b6-4ad1-9a29-83b2c0e0c2c2') inserted_at=datetime.datetime(2025, 3, 13, 20, 4, 45, 674340) updated_at=datetime.datetime(2025, 3, 13, 20, 4, 45, 710644) name='argilla_dataset' status='ready' guidelines='Classify individual tokens into given labels' allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('cf04c60d-319c-423e-b686-914e4f1a7ace') last_activity_at=datetime.datetime(2025, 3, 13, 20, 4, 45, 710644))

In [25]:
# Observe that your dataset is created
client.datasets

name,id,workspace_id,updated_at
argilla_dataset,28b7237b-23b6-4ad1-9a29-83b2c0e0c2c2,cf04c60d-319c-423e-b686-914e4f1a7ace,2025-03-13T20:04:45.710644


In [26]:
# In argilla, you can log what they call 'suggestions' for each data instance
# Suggestions are basically model predictions on the data instances

# Having these suggestions has two main benefits:
# 1. It can help you quickly label your data
# 2. You can eyeball the performance of your baseline model

# the below function will produce a suggestion output that is acceptable to Argilla from the GLiNER model
def gliner_predict(text):
    doc = nlp(text)
    return [
        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        for ent in doc.ents
    ]

# collect all records here
record_instances = []
for row in dataset:
    text = row['input'] # this is the input sequence
    span_label= gliner_predict(text) # suggestion data
    
    # make the record instance acceptable by Argilla token classification task
    record_instance = {
        'text': text,
        'span_label': span_label
    }
    record_instances.append(record_instance)

argilla_dataset.records.log(record_instances)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Sending records...: 4batch [00:02,  1.39batch/s]                    


DatasetRecords(Dataset(id=UUID('28b7237b-23b6-4ad1-9a29-83b2c0e0c2c2') inserted_at=datetime.datetime(2025, 3, 13, 20, 4, 45, 674340) updated_at=datetime.datetime(2025, 3, 13, 20, 4, 45, 710644) name='argilla_dataset' status='ready' guidelines='Classify individual tokens into given labels' allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('cf04c60d-319c-423e-b686-914e4f1a7ace') last_activity_at=datetime.datetime(2025, 3, 13, 20, 4, 45, 710644)))
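
Before logging, it can help to sanity-check the suggestion spans, since the `SpanQuestion` above is configured with `allow_overlapping=False` and a fixed label list. The following is a sketch only: `validate_spans` is a hypothetical helper, not part of Argilla, and it assumes spans shaped like the output of `gliner_predict` (character offsets with an exclusive `end`).

```python
def validate_spans(text, spans, allowed_labels):
    """Return a list of problems found in character-offset spans.

    Each span is a dict with 'start', 'end' and 'label'.
    An empty list means no problem was found.
    """
    problems = []
    previous_end = 0
    for span in sorted(spans, key=lambda s: (s["start"], s["end"])):
        start, end, label = span["start"], span["end"], span["label"]
        if not 0 <= start < end <= len(text):
            problems.append(f"out of bounds: {span}")
            continue
        if label not in allowed_labels:
            problems.append(f"unknown label: {span}")
        if start < previous_end:
            problems.append(f"overlaps previous span: {span}")
        previous_end = max(previous_end, end)
    return problems


labels = ["clothing", "organization", "address", "event"]
text = "Dress, Shoes, and Scarf provided by ModCloth."
org_start = text.index("ModCloth")
spans = [
    {"start": 0, "end": 5, "label": "clothing"},
    {"start": org_start, "end": org_start + len("ModCloth"), "label": "organization"},
]
print(validate_spans(text, spans, labels))  # prints []
```

Records with a non-empty problem list could be skipped or logged without suggestions; which of the two is preferable depends on the annotation workflow.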

You can now view and annotate your dataset in Argilla.

For Docker: http://localhost:6900/

Or visit your Space on Hugging Face Spaces if you opted to use that.

-----

# Export Annotated Argilla Dataset

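GLiNER models return the character indices of the detected entities, and the annotations stored in Argilla are character offsets as well. For fine-tuning, we need the token indices.

The cell below formats the dataset into the GLiNER training layout `{'tokenized_text': [...], 'ner': [[start_token_i, end_token_i, label], ...], ...}`. Note that it copies `start` and `end` from Argilla unchanged, so the `ner` values it produces are still character offsets and must be converted to token indices before fine-tuning.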

In [46]:
# use GLiNER annotations as training set
train_set = []
# use your gold standard as an evaluation set
eval_set = []
# GLiNER's predictions on the eval set | This will be useful when we start evaluating fine-tuned model against the baseline (GLiNER)
baseline_preds = []
# training data - strings alone
train_set_str = []
# evaluation data - strings alone
eval_set_str = []

for data_point in argilla_dataset.records.to_list(flatten=True):
    text = data_point['text'] # get the text out of the data point
    tokenized_text = [token.text for token in nlp(text)] # tokenize text with spacy nlp pipeline | this matches GLiNER expectations
    if data_point['status'] == 'completed': # this means you have annotated it
        annotations = data_point['span_label.responses'][0] # get the annotations
        ner = [[annotation['start'], annotation['end'], annotation['label']] for annotation in annotations] # convert annotations to GLiNER expectations
        eval_set.append({'tokenized_text': tokenized_text, 'ner': ner}) # append human annotation to the evaluation set
        eval_set_str.append(text)

        gliner_predictions = data_point['span_label.suggestion'] # get the GLiNER suggestion, which serves as the baseline prediction
        gliner_ner = [[annotation['start'], annotation['end'], annotation['label']] for annotation in gliner_predictions]
        baseline_preds.append({'tokenized_text': tokenized_text, 'ner': gliner_ner})
    else: # GLiNER annotations
        annotations = data_point['span_label.suggestion'] # get the annotations
        ner = [[annotation['start'], annotation['end'], annotation['label']] for annotation in annotations] # convert annotations to GLiNER expectations
        train_set.append({'tokenized_text': tokenized_text, 'ner': ner}) # append GLiNER annotation to the training set
        train_set_str.append(text)

In [47]:
idx = 2
print(f'training example: {train_set_str[idx]} \ntraining example: {train_set[idx]} \nevaluation example: {eval_set_str[idx]} \nevaluation example: {eval_set[idx]} \ngliner evaluation example: {baseline_preds[idx]}')

training example: Clothing for both Men and Women. 
training example: {'tokenized_text': ['Clothing', 'for', 'both', 'Men', 'and', 'Women', '.'], 'ner': [[0, 8, 'clothing']]} 
evaluation example: In addition this short sleeve tee has raglan sleeves, cover stitched seams and a comfortable tag free neck. 
evaluation example: {'tokenized_text': ['In', 'addition', 'this', 'short', 'sleeve', 'tee', 'has', 'raglan', 'sleeves', ',', 'cover', 'stitched', 'seams', 'and', 'a', 'comfortable', 'tag', 'free', 'neck', '.'], 'ner': [[17, 33, 'clothing'], [38, 52, 'clothing']]} 
gliner evaluation example: {'tokenized_text': ['In', 'addition', 'this', 'short', 'sleeve', 'tee', 'has', 'raglan', 'sleeves', ',', 'cover', 'stitched', 'seams', 'and', 'a', 'comfortable', 'tag', 'free', 'neck', '.'], 'ner': [[17, 33, 'clothing'], [38, 52, 'clothing']]}
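
The `ner` values printed above, such as `[0, 8, 'clothing']`, are character offsets into the text rather than token indices. One way to convert them is sketched below in plain Python. `token_offsets` and `char_span_to_token_span` are hypothetical helpers, not part of GLiNER or spaCy, and the sketch assumes that every token appears verbatim in the text, in order. The returned end index is inclusive, which is our understanding of the GLiNER training format; this should be checked against the GLiNER version in use. spaCy's `Doc.char_span` is a built-in alternative.

```python
def token_offsets(text, tokens):
    """Locate each token in text and return its (start_char, end_char) pair."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if the token is missing
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token), both inclusive.

    Every token that overlaps the character span is included.
    Returns None when the span covers no token.
    """
    covered = [
        i for i, (tok_start, tok_end) in enumerate(offsets)
        if tok_start < end_char and tok_end > start_char
    ]
    if not covered:
        return None
    return covered[0], covered[-1]


text = "Clothing for both Men and Women."
tokens = ["Clothing", "for", "both", "Men", "and", "Women", "."]
offsets = token_offsets(text, tokens)
token_span = char_span_to_token_span(offsets, 0, 8)
print(token_span)  # (0, 0)
```

Add 1 to the second index when an exclusive end is needed, as with spaCy's `Span(doc, start, end)`.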


In [48]:
# save datasets
import json

OUTPUT_ROOT = Path('data/')
OUTPUT_ROOT.mkdir(exist_ok=True, parents=True)

file_path = OUTPUT_ROOT / f"{dataset_name}_train.jsonl"
with open(file_path, 'w') as file:
    for entry in train_set:
        json.dump(entry, file)
        file.write('\n')

file_path = OUTPUT_ROOT / f"{dataset_name}_eval.jsonl"
with open(file_path, 'w') as file:
    for entry in eval_set:
        json.dump(entry, file)
        file.write('\n')

file_path = OUTPUT_ROOT / f"{dataset_name}_baseline_preds.jsonl"
with open(file_path, 'w') as file:
    for entry in baseline_preds:
        json.dump(entry, file)
        file.write('\n')

file_path = OUTPUT_ROOT / f"{dataset_name}_train_set_str.jsonl"
with open(file_path, 'w') as file:
    for entry in train_set_str:
        json.dump(entry, file)
        file.write('\n')

file_path = OUTPUT_ROOT / f"{dataset_name}_eval_set_str.jsonl"
with open(file_path, 'w') as file:
    for entry in eval_set_str:
        json.dump(entry, file)
        file.write('\n')  
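
The five blocks above repeat the same write loop. A sketch of how the loop could be factored into a helper, together with a reader for loading the files back in a later notebook. `write_jsonl` and `read_jsonl` are hypothetical names introduced here, and the round trip runs in a temporary directory so that it does not touch `data/`.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, entries):
    """Write one JSON document per line."""
    with open(path, "w", encoding="utf-8") as file:
        for entry in entries:
            file.write(json.dumps(entry) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]


# Round trip with a record shaped like the train_set entries above.
records = [{"tokenized_text": ["Clothing", "for", "both"], "ner": [[0, 8, "clothing"]]}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example_train.jsonl"
    write_jsonl(path, records)
    restored = read_jsonl(path)
```

The same reader works for the string-only files, since each of their lines is a JSON string.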

In [78]:
data = {"tokenized_text": ["A", "portable", "bridge", "had", "been", "prepared", "for", "crossing", "the", "canals", "which", "intersected", "the", "causeway", ";", "the", "intention", "being", "that", "it", "should", "be", "laid", "across", "a", "canal", ",", "that", "the", "army", "should", "pass", "over", "it", ",", "and", "that", "it", "should", "then", "be", "carried", "forward", "to", "the", "next", "gap", "in", "the", "causeway", ".", "This", "was", "a", "most", "faulty", "arrangement", ",", "necessitating", "frequent", "and", "long", "delays", ",", "and", "entailing", "almost", "certain", "disaster", ".", "Had", "three", "such", "portable", "bridges", "been", "constructed", ",", "the", "column", "could", "have", "crossed", "the", "causeway", "with", "comparatively", "little", "risk", ";", "and", "there", "was", "no", "reason", "why", "these", "bridges", "should", "not", "have", "been", "constructed", ",", "as", "they", "could", "have", "been", "carried", ",", "without", "difficulty", ",", "by", "the", "Tlascalans", ".", "At", "midnight", "the", "troops", "were", "in", "readiness", "for", "the", "march", ".", "Mass", "was", "performed", "by", "Father", "Olmedo", ";", "and", "at", "one", "o'clock", "on", "July", "1st", ",", "1520", ",", "the", "Spaniards", "sallied", "out", "from", "the", "fortress", "that", "they", "had", "so", "stoutly", "defended", ".", "Silence", "reigned", "in", "the", "city", ".", "As", "noiselessly", "as", "possible", ",", "the", "troops", "made", "their", "way", "down", "the", "broad", "street", ",", "expecting", "every", "moment", "to", "be", "attacked", ";", "but", "even", "the", "tramping", "of", "the", "horses", ",", "and", "the", "rumbling", "of", "the", "baggage", "wagons", "and", "artillery", "did", "not", "awake", "the", "sleeping", "Mexicans", ",", "and", "the", "head", "of", "the", "column", "arrived", "at", "the", "head", "of", "the", "causeway", "before", "they", "were", "discovered", "."], "ner": [[29, 30, 
"organization"], [116, 117, "organization"], [121, 122, "organization"], [133, 135, "person"], [147, 148, "organization"], [210, 211, "person"], [217, 218, "organization"]]}
#data = train_set[10]

import spacy
from spacy.tokens import Span, Doc

# Create a Doc from the tokenized text
doc = Doc(nlp.vocab, words=data["tokenized_text"])
ents = []
for start, end, label in data["ner"]:
    span = Span(doc, start, end, label=label)
    ents.append(span)
doc.ents = ents

# Visualize the NER entities
spacy.displacy.render(doc, style="ent", jupyter=True)
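
`Span(doc, start, end)` treats `end` as exclusive, like a Python slice. The sketch below shows the same convention on a plain token list, which is a quick way to check by eye that the indices point at the intended words. `entity_texts` is a hypothetical helper, and the tokens are a short excerpt with indices chosen for this example rather than the ones from the cell above.

```python
def entity_texts(tokens, ner):
    """Return (text, label) pairs for end-exclusive token spans."""
    return [(" ".join(tokens[start:end]), label) for start, end, label in ner]


tokens = ["Mass", "was", "performed", "by", "Father", "Olmedo", ";"]
ner = [[4, 6, "person"]]
print(entity_texts(tokens, ner))  # [('Father Olmedo', 'person')]
```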

----

# Evaluate Baseline

In [None]:
from nervaluate import Evaluator

# below function converts the data to the format that nervaluate expects
def convert_data_to_nervaluate_format(data):
    formatted_data = []
    for data_point in data:
        formatted_data_point = [{'label': ner_point[2], 'start': ner_point[0], 'end': ner_point[1]} for ner_point in data_point['ner']]
        formatted_data.append(formatted_data_point)
    return formatted_data

true = convert_data_to_nervaluate_format(eval_set) # human annotated data in the format that nervaluate expects
pred = convert_data_to_nervaluate_format(baseline_preds) # GLiNER predictions in the format that nervaluate expects

evaluator = Evaluator(true, pred, tags=labels)
results, results_per_tag, result_indices, result_indices_by_tag = evaluator.evaluate()

print(f"Precision: {results['ent_type']['precision']}\nRecall: {results['ent_type']['recall']}\nF1: {results['ent_type']['f1']}")

Precision: 0.8421052631578947
Recall: 0.7868852459016393
F1: 0.8135593220338982
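
To make these numbers easier to interpret, here is a sketch of how precision, recall, and F1 relate. `exact_match_scores` is a hypothetical helper that counts a prediction as correct only when start, end, and label all match a gold span. As far as we understand, nervaluate's `ent_type` scheme is more lenient about boundaries, so this function will generally not reproduce the figures above; it takes the same `true` and `pred` structures and is meant as a stricter cross-check.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_match_scores(true, pred):
    """Precision, recall and F1 over exactly matching (start, end, label) spans.

    `true` and `pred` are lists with one list of span dicts per document.
    """
    correct = n_pred = n_true = 0
    for true_spans, pred_spans in zip(true, pred):
        gold = {(s["start"], s["end"], s["label"]) for s in true_spans}
        guess = {(s["start"], s["end"], s["label"]) for s in pred_spans}
        correct += len(gold & guess)
        n_pred += len(guess)
        n_true += len(gold)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_true if n_true else 0.0
    return precision, recall, f1_score(precision, recall)


# The F1 printed above is the harmonic mean of the printed precision and recall.
reported_f1 = f1_score(0.8421052631578947, 0.7868852459016393)
```

Calling `exact_match_scores(true, pred)` with the variables from the cell above gives the stricter scores for the same data.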


## Appendix: Log datasets to the Hugging Face Hub

Here is an example of how to push an Argilla dataset (records) to the [Hugging Face Hub](https://huggingface.co/datasets).
This lets you version any of your Argilla datasets.

See how to get your Hugging Face access token here: https://huggingface.co/docs/hub/en/security-tokens

In [None]:
# login to huggingface

from huggingface_hub import login

huggingface_access_token = ''

login(huggingface_access_token)

In [None]:
# push your dataset

hf_account_name = ''
dataset_name = ''
argilla_dataset.to_hub(repo_id=f"{hf_account_name}/{dataset_name}")