# 💫  Explore and analyze `spaCy` NER pipelines

## TL;DR

In this tutorial, we will learn to log [spaCy](https://spacy.io/) Named Entity Recognition (NER) predictions.

This is useful for: 

- 🧐Evaluating pre-trained models.
- 🔎Spotting frequent errors both during development and production. 
- 📈Improving your pipelines over time using Rubrix annotation mode.
- 🎮Monitoring your model predictions using the Rubrix integration with Kibana.

Let's get started!


<video width="100%" controls><source src="02-spacy/spacyner.mp4" type="video/mp4"></video>

## Introduction

In this tutorial we will learn how to explore and analyze spaCy NER pipelines in an easy way.

We will load the [*Gutenberg Time*](https://huggingface.co/datasets/gutenberg_time) dataset from the Hugging Face Hub, use a transformer-based spaCy model to detect entities in it, and log the detected entities into a Rubrix dataset. This dataset can be used to explore the quality of the predictions and to create a new training set by correcting, adding, and validating entities.

Then, we will use a smaller spaCy model to detect entities and log them into the same Rubrix dataset, so that we can compare its predictions with those of the previous model. As a bonus, we will use Rubrix and spaCy on a more challenging dataset: IMDB.

## Setup

Rubrix is a free and open-source tool to explore, annotate, and monitor data for NLP projects. 

If you are new to Rubrix, visit and ⭐ star the [GitHub repo](https://github.com/recognai/rubrix) for more materials and detailed docs.

If you have not installed and launched Rubrix yet, check the [Setup and Installation guide](https://docs.rubrix.ml/en/latest/getting_started/setup%26installation.html).

For this tutorial we also need the third-party library `datasets` and, of course, spaCy together with PyTorch. All of them can be installed via pip:

In [None]:
%pip install torch -qqq
%pip install datasets "spacy[transformers]~=3.0" protobuf -qqq 

## Our dataset
For this tutorial, we're going to use the [*Gutenberg Time*](https://huggingface.co/datasets/gutenberg_time) dataset from the Hugging Face Hub. It contains all explicit time references in a dataset of 52,183 novels whose full text is available via Project Gutenberg. Since the records are extracts from novels, we can expect to find plenty of named entities in them.

In [None]:
from datasets import load_dataset

dataset = load_dataset("gutenberg_time", split="train")

Let's create a small test set and have a look at the data! 

In [None]:
train, test = dataset.train_test_split(test_size=0.002, seed=42).values()

In [7]:
test.to_pandas()

Unnamed: 0,guten_id,hour_reference,time_phrase,is_ambiguous,time_pos_start,time_pos_end,tok_context
0,6953,11,half past eleven,True,66,69,`` I was just going up to him to speak about m...
1,13123,5,ten minutes to six,True,65,69,Presently the great machinery which assisted h...
2,9826,0,midnight,True,93,94,"The mate of course obeyed , and the evening sh..."
3,12256,0,Midnight,True,107,108,"`` She is , I presume , by now , the Countess ..."
4,28357,11,eleven o’clock,True,89,91,Three days passed . Will still remained at the...
...,...,...,...,...,...,...,...
237,10066,10,ten\no'clock,True,52,54,He had drawn his chair closer : he had taken h...
238,10446,2,Two o'clock,True,50,52,"He contented himself , therefore , with the ba..."
239,2488,12,noon,True,87,88,It was on this oceanic river that the Nautilus...
240,9155,10,ten o'clock,True,58,60,"It was well the men had gone home , she though..."


## Logging spaCy NER entities into Rubrix

### Using a Transformer-based pipeline

Let's download our RoBERTa-based pretrained pipeline and apply it to one of our dataset records:

In [None]:
!python -m spacy download en_core_web_trf

In [None]:
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp(dataset[0]["tok_context"])
doc

Now let's apply the nlp pipeline to our dataset records, collecting the **tokens** and **NER entities**. 

In [None]:
import rubrix as rb
from tqdm.auto import tqdm

records = []

for record in tqdm(test, total=len(test)):
    # We only need the text of each instance
    text = record["tok_context"]
    
    # spaCy Doc creation
    doc = nlp(text)    
    
    # Entity annotations
    entities = [
        (ent.label_, ent.start_char, ent.end_char)  
        for ent in doc.ents
    ] 

    # Pre-tokenized input text
    tokens = [token.text  for token in doc]
    
    # Rubrix TokenClassificationRecord list
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="en_core_web_trf",
        )
    )

In [None]:
rb.log(records=records, name="gutenberg_spacy_ner")
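
The entity tuples built above use character offsets into `text`, so slicing the text with those offsets recovers the surface form of each entity. Below is a minimal sketch with a hand-written sentence and made-up spans (they are illustrative, not real model output); the `mentions` helper is a hypothetical name, not part of Rubrix or spaCy.

```python
# Hand-written example: these spans are illustrative, not real model output.
text = "The Nautilus left at noon ."
entities = [("PRODUCT", 4, 12), ("TIME", 21, 25)]


def mentions(text, entities):
    """Return (label, surface form) pairs for (label, start_char, end_char) spans."""
    return [(label, text[start:end]) for label, start, end in entities]


pairs = mentions(text, entities)
print(pairs)
```

A quick check like this can help catch off-by-one offsets before records are logged.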

If you go to the `gutenberg_spacy_ner` dataset in Rubrix you can explore the predictions of this model.

You can:

- Filter records containing specific entity types,
- See the most frequent "mentions" or surface forms for each entity. Mentions are the string values of specific entity types; for example, "1 month" can be the mention of a duration entity. This is useful for error analysis, to quickly spot potential issues and problematic entity types,
- Use the free-text search to find records containing specific words, 
- And validate, include or reject specific entity annotations to build a new training set.
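
The "most frequent mentions" view can also be approximated offline. The following is a rough sketch in plain Python, assuming predictions are available as `(text, entities)` pairs with `(label, start_char, end_char)` tuples as in the loop above; `mention_counts` and the toy data are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter, defaultdict


def mention_counts(predictions):
    """Count surface forms per entity label.

    `predictions` is an iterable of (text, entities) pairs, where entities
    are (label, start_char, end_char) tuples.
    """
    counts = defaultdict(Counter)
    for text, entities in predictions:
        for label, start, end in entities:
            counts[label][text[start:end]] += 1
    return counts


# Toy data standing in for model predictions (not real output).
sample = [
    ("He left at noon .", [("TIME", 11, 15)]),
    ("At noon , Oscar arrived .", [("TIME", 3, 7), ("PERSON", 10, 15)]),
]

counts = mention_counts(sample)
for label, counter in counts.items():
    # Show the three most frequent mentions for each label
    print(label, counter.most_common(3))
```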

### Using a smaller but more efficient pipeline

Now let's compare with a smaller, but more efficient pre-trained model. 

Let's first download it:

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(dataset[0]["tok_context"])

In [None]:
records = []    # Creating an empty record list to save all the records

for record in tqdm(test, total=len(test)):

    text = record["tok_context"]  # We only need the text of each instance
    doc = nlp(text)    # spaCy Doc creation
    
    # Entity annotations
    entities = [
        (ent.label_, ent.start_char, ent.end_char)  
        for ent in doc.ents
    ] 

    # Pre-tokenized input text
    tokens = [token.text  for token in doc]
    

    # Rubrix TokenClassificationRecord list
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="en_core_web_sm",
        )
    )

In [None]:
rb.log(records=records, name="gutenberg_spacy_ner")
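
Since the two loops above differ only in the pipeline and the `prediction_agent` value, they could be folded into one helper. The sketch below is one possible way to do it, not the tutorial's own method: `build_rows` is a hypothetical name, and it relies on spaCy's `nlp.pipe` to process texts as a stream instead of calling `nlp` once per record.

```python
def build_rows(nlp, texts, agent):
    """Run a spaCy pipeline over `texts` and collect the record fields.

    `nlp` is expected to be a loaded spaCy pipeline; `agent` is the label
    stored as `prediction_agent`. Returns a list of plain dicts.
    """
    rows = []
    for doc in nlp.pipe(texts):
        rows.append(
            {
                "text": doc.text,
                # Pre-tokenized input text
                "tokens": [token.text for token in doc],
                # Entity annotations as (label, start_char, end_char)
                "prediction": [
                    (ent.label_, ent.start_char, ent.end_char) for ent in doc.ents
                ],
                "prediction_agent": agent,
            }
        )
    return rows
```

Assuming `nlp`, `test`, and `rb` are defined as in the cells above, the rows could then be turned into records with something like `[rb.TokenClassificationRecord(**row) for row in build_rows(nlp, test["tok_context"], "en_core_web_sm")]`.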

## Exploring and comparing `en_core_web_sm` and `en_core_web_trf` models

If you go to your `gutenberg_spacy_ner` dataset, you can explore and compare the results of both models. 

To see only the predictions of a specific model, you can use the `predicted by` filter, which comes from the `prediction_agent` parameter of your `TokenClassificationRecord`.
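
Outside the UI, a rough comparison of the two models on a single text can be sketched with set operations on the entity tuples. This is only an illustration under the assumption that both models' predictions for the same text are available as `(label, start_char, end_char)` lists; `compare_predictions` and the example spans are made up.

```python
def compare_predictions(entities_a, entities_b):
    """Split two entity lists into shared and model-specific spans."""
    a, b = set(entities_a), set(entities_b)
    return {
        "both": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Illustrative spans for the same text, not real model output.
trf_entities = [("TIME", 11, 15), ("PERSON", 0, 2)]
sm_entities = [("TIME", 11, 15), ("ORG", 0, 2)]

diff = compare_predictions(trf_entities, sm_entities)
print(diff)
```

Exact-match comparison is strict: a span with the same boundaries but a different label counts as a disagreement on both sides.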


![spacy_models_meta](02-spacy/spacy_ner2.png "spaCy models predicted_by filter")


## Bonus: Explore the IMDB dataset

So far, both **spaCy pretrained models** seem to work pretty well. Let's try a more challenging dataset, one that is less similar to the data these models were trained on.

In [None]:
imdb = load_dataset("imdb", split="test[0:5000]")

In [None]:
records = []
for record in tqdm(imdb, total=len(imdb)):
    # We only need the text of each instance
    text = record["text"]
    
    # spaCy Doc creation
    doc = nlp(text)    
    
    # Entity annotations
    entities = [
        (ent.label_, ent.start_char, ent.end_char)  
        for ent in doc.ents
    ] 

    # Pre-tokenized input text
    tokens = [token.text  for token in doc]
    
    # Rubrix TokenClassificationRecord list
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="en_core_web_sm",
        )
    )

In [None]:
rb.log(records=records, name="imdb_spacy_ner")

Exploring this dataset highlights **the need for fine-tuning for specific domains**.

For example, if we check the most frequent mentions for Person, we find two highly frequent misclassified entities: **gore** (the film genre) and **Oscar** (the prize).

You can easily check every example by using the filters and search-box.
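
A similar check can be sketched in plain Python: given predictions as `(text, entities)` pairs, collect the texts where a given mention received a given label. The helper name `find_mentions` and the sample sentences are hypothetical and hand-written for illustration; they are not taken from the IMDB dataset.

```python
def find_mentions(predictions, label, mention):
    """Return texts in which `mention` was predicted with `label`.

    Matching is case-insensitive and compares the full entity span.
    """
    mention = mention.lower()
    return [
        text
        for text, entities in predictions
        if any(
            ent_label == label and text[start:end].lower() == mention
            for ent_label, start, end in entities
        )
    ]


# Hand-written sample predictions (not real model output).
sample = [
    ("Too much gore for me .", [("PERSON", 9, 13)]),
    ("It won an Oscar .", [("PERSON", 10, 15)]),
    ("Al Gore narrates .", [("PERSON", 0, 7)]),
]

gore_hits = find_mentions(sample, "PERSON", "gore")
print(gore_hits)
```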


<video width="100%" controls><source src="02-spacy/spacy2.mp4" type="video/mp4"></video>

## Summary
In this tutorial, you learned how to log and explore different `spaCy` NER models with Rubrix. Now you can:

- Build custom dashboards using Kibana to monitor and visualize spaCy models.
- Build training sets using pre-trained spaCy models.

## Next steps

### 📚 [Rubrix documentation](https://docs.rubrix.ml) for more guides and tutorials.

### 🙋‍♀️ Join the Rubrix community! A good place to start is the [discussion forum](https://github.com/recognai/rubrix/discussions).

### ⭐ Rubrix [GitHub repo](https://github.com/recognai/rubrix) to stay updated.