# 💫  Explore and analyze NER predictions

In this tutorial, we will learn to log [spaCy](https://spacy.io/) Named Entity Recognition (NER) predictions into Argilla.

This is useful for:

- 🧐Evaluating pre-trained models.
- 🔎Spotting frequent errors both during development and production.
- 📈Annotating records to create a gold-standard evaluation dataset.


Reference: https://docs.argilla.io/en/latest/tutorials/notebooks/labelling-tokenclassification-spacy-pretrained.html


## Introduction

In this tutorial, we will learn how to explore and analyze spaCy NER pipelines in an easy way.

We will load the [*Gutenberg Time*](https://huggingface.co/datasets/gutenberg_time) dataset from the Hugging Face Hub, detect entities in it with a spaCy pipeline (backed by a GLiNER model in this notebook), and log the detected entities into an Argilla dataset. This dataset can then be used to explore the quality of the predictions and to create a new training set by correcting, adding, and validating entities through human annotation.

Firstly, run your Argilla server if you haven't already:

In [1]:
# Run the argilla docker container in your terminal (or use the %%bash magic)
# This may take a couple of minutes to spin up the container
# docker run -d --name quickstart -p 6900:6900 argilla/argilla-quickstart:latest

Let's import the Argilla module for reading and writing data:

In [37]:
import argilla as rg

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the `URL` and `API_KEY`:

In [39]:
# Replace api_url with the URL of your HF Space if using Spaces
# Replace api_key if you configured a custom API key
# default argilla username is 'argilla'
# default argilla password is '12345678'
# default api_key for argilla on docker is 'argilla.apikey'
client = rg.Argilla(
    api_url="http://localhost:6900",                      # If you are using the docker container
    # api_url="https://jackboyla-zero-shot-lab.hf.space", # If you are using HF Spaces
    api_key="argilla.apikey"
)
client # Test your login! :)

Argilla has been deployed at: http://localhost:6900

Finally, let's include the imports we need:

In [41]:
from pathlib import Path

from datasets import load_dataset, load_from_disk
import pandas as pd
import spacy
from tqdm.auto import tqdm

## Our dataset
For this tutorial, our default dataset is the [*Gutenberg Time*](https://huggingface.co/datasets/gutenberg_time) dataset from the Hugging Face Hub. It contains all explicit time references in a dataset of 52,183 novels whose full text is available via Project Gutenberg. Extracts from novels are likely to contain plenty of named entities.

If you are following the full lab, you can also load the dataset you generated in the previous notebook.

In [160]:
# when using this notebook standalone, choose a dataset from the hub
# dataset = load_dataset("gutenberg_time", split="train", streaming=True)

# when using this notebook as part of the full lab, load the dataset you created in the previous step
DATASET_PATH = 'data/sampled_dataset/'
dataset = load_from_disk(DATASET_PATH)

# Let's have a look at the first 5 examples of the train set.
try:
    print(pd.DataFrame(dataset.take(5)))
except AttributeError:
    print(pd.DataFrame(dataset[:5]))

                                               input  \
0  When you would like the property photographed....   
1  I want to go see this property in person! MLS#...   
2  Va. 7 property for Preston Propane before the ...   
3  I was searching for a Property and found this ...   
4  I would like more information about 7362 Horiz...   

                                              output  \
0  ['property <> real estate <> physical entity r...   
1  ['MLS# 234382 <> Real Estate Property <> Uniqu...   
2  ['Va. 7 <> Street Address <> Physical location...   
3  ['Property <> Real estate <> Physical asset wi...   
4  ['7362 Horizon Drive West Palm Beach, FL 33412...   

                                          embeddings  
0  [0.0165400430560112, 0.04142063856124878, -0.0...  
1  [0.003052317537367344, 0.013402799144387245, -...  
2  [0.014098312705755234, 0.024646742269396782, 0...  
3  [0.009898650459945202, 0.0028502296190708876, ...  
4  [0.011095905676484108, 0.03647858276963234, 0...

## Annotating with GLiNER and Logging NER entities into Argilla


Let's instantiate a blank spaCy `nlp` pipeline with a GLiNER component and apply it to the examples in our dataset, collecting the detected *NER entities*.

We're going to use a [GLiNER](https://github.com/urchade/GLiNER) model to perform zero-shot NER. This means we can provide any entity labels we like!

In [184]:
# observe how gliner works :)

import spacy
from gliner_spacy.pipeline import GlinerSpacy

# Gliner model options https://huggingface.co/urchade 
gliner_model = "urchade/gliner_largev2"

# Define your domain here: the list of entity types you expect to see
zero_shot_labels = ["person", "organization", "email", "sports team", "business"]

# Configuration for GLiNER integration
custom_spacy_config = {
    "gliner_model": gliner_model,
    "chunk_size": 250,
    "labels": zero_shot_labels,
    "style": "ent"
}

# Initialize a blank English spaCy pipeline and add GLiNER
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

# Example
text = "This is a text about Bill Gates and Microsoft."

# Process the text with the pipeline
doc = nlp(text)

# Output detected entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Bill Gates person
Microsoft organization


## Create Argilla Dataset

In [None]:
# If you restart the kernel and want to continue from here, retrieve the existing dataset again:
# the dataset persists on the server, so recreating argilla_dataset from scratch would raise an error.
# The loop variable is deliberately not named `dataset`, which already holds our Hugging Face dataset.
for existing_dataset in client.datasets.list():
    if existing_dataset.name == "argilla_dataset":
        argilla_dataset = existing_dataset

In [218]:
# In case you'd like to delete argilla_dataset and start over, just run this cell

dataset_to_delete = client.datasets(name="argilla_dataset")

# Only delete if a dataset with that name was actually found
if dataset_to_delete is not None:
    dataset_deleted = dataset_to_delete.delete()

In [211]:
# dataset_name = "gutenberg_spacy_ner"
dataset_name = "argilla_dataset"
labels = ['address', 'organization', 'person'] # define a list of labels that you are interested in extracting with GLiNER

In [212]:
# Configuration for GLiNER integration with the labels defined above
custom_spacy_config = {
    "gliner_model": gliner_model,
    "chunk_size": 250,
    "labels": labels,
    "style": "ent"
}

# Initialize a blank English spaCy pipeline and add GLiNER
nlp = spacy.blank("en")
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)



<gliner_spacy.pipeline.GlinerSpacy at 0x5bd0c5420>

In [219]:
# Create settings for Argilla
settings = rg.Settings(
    guidelines="Classify individual tokens into given labels",
    fields = [
        rg.TextField(
            name='text', # give it a name
            title='Text', # this will be displayed on the UI above the text field
            use_markdown=False # not necessary for this application
        )
    ],
    # In Argilla a question is basically an annotation instance.
    # This is a token classification case and Argilla has a built-in question type for that called the SpanQuestion.
    questions=[
        rg.SpanQuestion( 
            name="span_label",
            field='text',
            labels=labels,
            title="Classify individual tokens into given labels",
            allow_overlapping=False
        )
    ]
)

In [220]:
# create the dataset
argilla_dataset = rg.Dataset(
    name=dataset_name,
    settings=settings
)
argilla_dataset.create()

Dataset(id=UUID('a338067b-ce33-4c2f-accb-1f7374199351') inserted_at=datetime.datetime(2025, 3, 11, 21, 55, 3, 876417) updated_at=datetime.datetime(2025, 3, 11, 21, 55, 3, 909699) name='argilla_dataset' status='ready' guidelines='Classify individual tokens into given labels' allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('cf04c60d-319c-423e-b686-914e4f1a7ace') last_activity_at=datetime.datetime(2025, 3, 11, 21, 55, 3, 909699))

In [221]:
# Observe that your dataset is created
client.datasets

name,id,workspace_id,updated_at
argilla_dataset,a338067b-ce33-4c2f-accb-1f7374199351,cf04c60d-319c-423e-b686-914e4f1a7ace,2025-03-11T21:55:03.909699


In [222]:
# In argilla, you can log what they call 'suggestions' for each data instance
# Suggestions are basically model predictions on the data instances

# Having these suggestions has two main benefits:
# 1. It can help you quickly label your data
# 2. You can eyeball the performance of your baseline model

# the below function will produce a suggestion output that is acceptable to Argilla from the GLiNER model
def gliner_predict(text):
    doc = nlp(text)
    return [
        {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
        for ent in doc.ents
    ]

# collect all records here
record_instances = []
for row in dataset:
    text = row['input'] # this is the input sequence
    span_label = gliner_predict(text) # suggestion data
    
    # make the record instance acceptable by Argilla token classification task
    record_instance = {
        'text': text,
        'span_label': span_label
    }
    record_instances.append(record_instance)

argilla_dataset.records.log(record_instances)

Sending records...: 4batch [00:02,  1.55batch/s]                    


DatasetRecords(Dataset(id=UUID('a338067b-ce33-4c2f-accb-1f7374199351') inserted_at=datetime.datetime(2025, 3, 11, 21, 55, 3, 876417) updated_at=datetime.datetime(2025, 3, 11, 21, 55, 3, 909699) name='argilla_dataset' status='ready' guidelines='Classify individual tokens into given labels' allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('cf04c60d-319c-423e-b686-914e4f1a7ace') last_activity_at=datetime.datetime(2025, 3, 11, 21, 55, 3, 909699)))
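
Before opening the UI, a quick summary of the suggestions can help with the "eyeball the baseline" step mentioned in the comments above. The following is a minimal sketch, not part of the original notebook: `summarize_suggestions` is a hypothetical helper, and `sample_records` is made-up data in the same shape as `record_instances`.

```python
from collections import Counter

# Made-up records in the same shape as `record_instances` above (illustration only)
sample_records = [
    {
        "text": "Bill Gates founded Microsoft.",
        "span_label": [
            {"start": 0, "end": 10, "label": "person"},
            {"start": 19, "end": 28, "label": "organization"},
        ],
    },
    {"text": "Nothing to tag here.", "span_label": []},
]


def summarize_suggestions(records):
    """Count suggested spans per label and the records with no spans at all."""
    label_counts = Counter(
        span["label"] for record in records for span in record["span_label"]
    )
    empty_records = sum(1 for record in records if not record["span_label"])
    return label_counts, empty_records


label_counts, empty_records = summarize_suggestions(sample_records)
print(label_counts.most_common())
print(f"{empty_records} of {len(sample_records)} records have no suggested spans")
```

In the notebook itself, calling `summarize_suggestions(record_instances)` should give a rough picture of which labels the model finds often and how many texts it leaves untouched.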

You can now view and annotate your dataset in Argilla.

For Docker: http://localhost:6900/

Or visit your Space on Hugging Face Spaces if you opted to use that.

In [234]:
adl = argilla_dataset.records.to_list()

for i in adl:
    if i['responses']:
        print(i)
        print(i['responses'])

{'id': 'aa104c44-438d-46e2-bec0-db25b8951c6e', 'fields': {'text': 'If there is a criminal investigation, they can get a search warrant from a magistrate.'}, 'metadata': {}, 'suggestions': {'span_label': {'value': [], 'score': None, 'agent': None}}, 'responses': {'span_label': [{'value': [], 'user_id': 'ff683590-024e-4424-acef-de69dcd4f186'}]}, 'vectors': {}, 'status': 'completed', '_server_id': 'c1541566-065f-42b8-abaf-402b31f70d62'}
{'span_label': [{'value': [], 'user_id': 'ff683590-024e-4424-acef-de69dcd4f186'}]}
{'id': '1ca9f194-0b04-49ea-b051-f26f03e9befb', 'fields': {'text': 'Sightseeing on the way. Once you reach Thekkady, proceed to the cottage in teh resort.'}, 'metadata': {}, 'suggestions': {'span_label': {'value': [], 'score': None, 'agent': None}}, 'responses': {'span_label': [{'value': [], 'user_id': 'ff683590-024e-4424-acef-de69dcd4f186'}]}, 'vectors': {}, 'status': 'completed', '_server_id': '3852d152-64f5-43f4-86e5-02dbe9279c4c'}
{'span_label': [{'value': [], 'user_id': 
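
Once some records have responses, the same dictionaries can be used to estimate how often the model suggestions agree with the human annotations. This is a rough sketch under a few assumptions: records follow the layout printed above, only the first response per record is considered, and a span counts as correct only on an exact match of start, end, and label. The helper names and the sample records are hypothetical.

```python
def span_set(spans):
    """Represent a list of span dicts as a set of (start, end, label) tuples."""
    return {(span["start"], span["end"], span["label"]) for span in spans}


def suggestion_agreement(records, question="span_label"):
    """Exact-match precision and recall of suggestions against the first response."""
    true_pos = false_pos = false_neg = 0
    for record in records:
        responses = record["responses"].get(question) or []
        if not responses:
            continue  # not reviewed yet, nothing to compare against
        gold = span_set(responses[0]["value"])
        suggestion = record["suggestions"].get(question) or {}
        predicted = span_set(suggestion.get("value") or [])
        true_pos += len(gold & predicted)
        false_pos += len(predicted - gold)
        false_neg += len(gold - predicted)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else None
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else None
    return precision, recall


# Made-up records mirroring the layout printed above (illustration only)
person = {"start": 0, "end": 10, "label": "person"}
organization = {"start": 19, "end": 28, "label": "organization"}
address = {"start": 0, "end": 5, "label": "address"}
sample_records = [
    {"suggestions": {"span_label": {"value": [person, organization]}},
     "responses": {"span_label": [{"value": [person]}]}},
    {"suggestions": {"span_label": {"value": []}},
     "responses": {"span_label": [{"value": [address]}]}},
    {"suggestions": {"span_label": {"value": [person]}}, "responses": {}},
]

precision, recall = suggestion_agreement(sample_records)
print(f"precision={precision}, recall={recall}")
```

Exact matching is strict: a suggestion that is off by one character counts as both a false positive and a false negative, so these numbers are best read as a lower bound.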

-----

## Export Annotated Argilla Dataset

GLiNER models return the character indices of the detected entities, but for fine-tuning we need token indices, so we have to do some data gymnastics below.

In [None]:
def char_to_token_indices(text, tokens, entities):
    """
    Convert character span indices in entities to token indices.
    
    Args:
    - text: The original text as a single string.
    - tokens: A list of tokens.
    - entities: A list of entities with character start and end indices.
    
    Returns:
    - A list of entities with token start and end indices.
    """
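    # Calculate the character start and end index of each token
    token_char_spans = []
    current_char_index = 0
    for token in tokens:
        start_index = text.find(token, current_char_index)
        if start_index == -1:
            # Token not found in the remaining text: keep a placeholder so token
            # positions stay aligned, and leave the search offset unchanged
            token_char_spans.append((-1, -1))
            continue
        end_index = start_index + len(token)
        token_char_spans.append((start_index, end_index))
        current_char_index = end_index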

    # Convert character indices to token indices for each entity
    converted_entities = []
    for entity in entities:
        entity_start, entity_end = entity['start'], entity['end']
        entity_start_token = None
        entity_end_token = None
        
        # Find the tokens that the entity start and end indices fall into
        for i, (start, end) in enumerate(token_char_spans):
            if start <= entity_start < end:
                entity_start_token = i
            if start < entity_end <= end:
                entity_end_token = i + 1
                break  # Stop looking once we've found the end token
        
        if entity_start_token is not None and entity_end_token is not None:
            converted_entities.append([entity_start_token, entity_end_token, entity['label']])
        else:
            print('Error on entity:', entity, 'Tokens:', tokens)
    
    return converted_entities
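
To make the conversion concrete, here is a small self-contained sketch of the same idea applied to the example sentence used earlier. The helper names `token_char_spans` and `char_span_to_token_span` are hypothetical and simplified (one entity at a time, and an error on unmatched tokens); the token list is written out by hand rather than produced by a tokenizer.

```python
def token_char_spans(text, tokens):
    """Return (start, end) character offsets for each token, scanning left to right."""
    spans, cursor = [], 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


def char_span_to_token_span(spans, start_char, end_char):
    """Map a character span to a (start_token, end_token) pair with exclusive end."""
    start_token = next(
        (i for i, (s, e) in enumerate(spans) if s <= start_char < e), None
    )
    end_token = next(
        (i + 1 for i, (s, e) in enumerate(spans) if s < end_char <= e), None
    )
    return start_token, end_token


text = "This is a text about Bill Gates and Microsoft."
tokens = ["This", "is", "a", "text", "about", "Bill", "Gates", "and", "Microsoft", "."]
spans = token_char_spans(text, tokens)

entity_start = text.index("Bill Gates")
entity_end = entity_start + len("Bill Gates")
print(char_span_to_token_span(spans, entity_start, entity_end))  # (5, 7)
```

As in `char_to_token_indices`, the end index is exclusive, so `tokens[5:7]` gives back `["Bill", "Gates"]`.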

In [223]:
import json

OUTPUT_ROOT = Path('data/')
OUTPUT_ROOT.mkdir(exist_ok=True, parents=True)

# rg.load() belongs to the Argilla 1.x API and is not available in the 2.x client used above,
# so we retrieve the reviewed dataset through the client and export its records as dicts
dataset_rg = client.datasets(name=dataset_name)
records = dataset_rg.records.to_list()

# format the dataset to GLiNER training format {'tokenized_text': [...], 'ner': [[start_token_i, end_token_i, label], ...]}

train_set = []
eval_set = []
for record in records:
    text = record['fields']['text']
    # the exported records carry no tokens, so we tokenize with the spaCy pipeline's tokenizer
    tokens = [token.text for token in nlp.make_doc(text)]

    responses = record['responses'].get('span_label') or []
    suggestion = record['suggestions'].get('span_label') or {}

    # prefer the (first) human response, otherwise fall back to the model suggestion
    if responses:
        entities = responses[0]['value']
    else:
        entities = suggestion.get('value') or []

    converted_entities = char_to_token_indices(text, tokens, entities)
    entry = {'tokenized_text': tokens, 'ner': converted_entities}

    # if it's been annotated, it goes to the evaluation set
    if responses:
        eval_set.append(entry)
    # otherwise, it goes to the weakly annotated train set
    else:
        train_set.append(entry)

# write each split as one JSON object per line
for split_name, split in [('train', train_set), ('eval', eval_set)]:
    file_path = OUTPUT_ROOT / f"{dataset_name}_{split_name}.jsonl"
    with open(file_path, 'w') as file:
        for entry in split:
            json.dump(entry, file)
            file.write('\n')
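
After exporting, it can be worth reading a JSONL file back and checking that every entity span fits inside its token list. The sketch below is an illustration only: `read_jsonl` and `spans_are_valid` are hypothetical helpers, and the sample entry and temporary file stand in for the real `*_train.jsonl` and `*_eval.jsonl` files.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path) as file:
        return [json.loads(line) for line in file if line.strip()]


def spans_are_valid(entry):
    """Check that every [start, end, label] span lies within the token list."""
    n_tokens = len(entry["tokenized_text"])
    return all(0 <= start < end <= n_tokens for start, end, _ in entry["ner"])


# Made-up entry in the export format (illustration only)
sample = [
    {
        "tokenized_text": ["Bill", "Gates", "founded", "Microsoft", "."],
        "ner": [[0, 2, "person"], [3, 4, "organization"]],
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample_train.jsonl"
    with open(path, "w") as file:
        for entry in sample:
            json.dump(entry, file)
            file.write("\n")
    loaded = read_jsonl(path)

print(all(spans_are_valid(entry) for entry in loaded))
```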

In [None]:
# data = {"tokenized_text": ["A", "portable", "bridge", "had", "been", "prepared", "for", "crossing", "the", "canals", "which", "intersected", "the", "causeway", ";", "the", "intention", "being", "that", "it", "should", "be", "laid", "across", "a", "canal", ",", "that", "the", "army", "should", "pass", "over", "it", ",", "and", "that", "it", "should", "then", "be", "carried", "forward", "to", "the", "next", "gap", "in", "the", "causeway", ".", "This", "was", "a", "most", "faulty", "arrangement", ",", "necessitating", "frequent", "and", "long", "delays", ",", "and", "entailing", "almost", "certain", "disaster", ".", "Had", "three", "such", "portable", "bridges", "been", "constructed", ",", "the", "column", "could", "have", "crossed", "the", "causeway", "with", "comparatively", "little", "risk", ";", "and", "there", "was", "no", "reason", "why", "these", "bridges", "should", "not", "have", "been", "constructed", ",", "as", "they", "could", "have", "been", "carried", ",", "without", "difficulty", ",", "by", "the", "Tlascalans", ".", "At", "midnight", "the", "troops", "were", "in", "readiness", "for", "the", "march", ".", "Mass", "was", "performed", "by", "Father", "Olmedo", ";", "and", "at", "one", "o'clock", "on", "July", "1st", ",", "1520", ",", "the", "Spaniards", "sallied", "out", "from", "the", "fortress", "that", "they", "had", "so", "stoutly", "defended", ".", "Silence", "reigned", "in", "the", "city", ".", "As", "noiselessly", "as", "possible", ",", "the", "troops", "made", "their", "way", "down", "the", "broad", "street", ",", "expecting", "every", "moment", "to", "be", "attacked", ";", "but", "even", "the", "tramping", "of", "the", "horses", ",", "and", "the", "rumbling", "of", "the", "baggage", "wagons", "and", "artillery", "did", "not", "awake", "the", "sleeping", "Mexicans", ",", "and", "the", "head", "of", "the", "column", "arrived", "at", "the", "head", "of", "the", "causeway", "before", "they", "were", "discovered", "."], "ner": [[29, 30, "organization"], [116, 117, "organization"], [121, 122, "organization"], [133, 135, "person"], [147, 148, "organization"], [210, 211, "person"], [217, 218, "organization"]]}
data = train_set[10]

import spacy
from spacy.tokens import Span, Doc

# Create a Doc from the tokenized text
doc = Doc(nlp.vocab, words=data["tokenized_text"])
ents = []
for start, end, label in data["ner"]:
    span = Span(doc, start, end, label=label)
    ents.append(span)
doc.ents = ents

# Visualize the NER entities
spacy.displacy.render(doc, style="ent", jupyter=True)

## Appendix: Log datasets to the Hugging Face Hub

Here we will show you an example of how you can push an Argilla dataset (records) to the [Hugging Face Hub](https://huggingface.co/datasets).
In this way, you can effectively version any of your Argilla datasets.

In [None]:
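# rg.load() is not available in the Argilla 2.x client used in this notebook; fetch the records via the client instead
# records = client.datasets(name=dataset_name).records
# records.to_datasets().push_to_hub("<name of the dataset on the HF Hub>")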