# ✨ Using Rubrix with `spaCy`

In this tutorial, we will walk through the process of using Rubrix with `spaCy`, one of the most widely used NLP libraries.


## Introduction

**Our goal is to show you how to explore `spaCy` NER predictions with Rubrix**. 

## Install tutorial dependencies

In this tutorial we will use the `datasets` and `spaCy` libraries, along with `en_core_web_trf`, a pretrained RoBERTa-based English spaCy pipeline (the model itself is downloaded in a later step). If you do not have the libraries installed, run:

In [None]:
!pip install datasets -qqq
!pip install -U spacy -qqq

## Setup Rubrix

If you have not installed and launched Rubrix, check the [installation guide](https://github.com/recognai/rubrix#get-started). 

In [1]:
import rubrix as rb

## Our dataset

For this tutorial, we are going to use the [*Gutenberg Time*](https://huggingface.co/datasets/gutenberg_time) dataset from the Hugging Face Hub. It contains all explicit time references in a dataset of 52,183 novels whose full text is available via Project Gutenberg. Extracts from novels are bound to contain some named entities. Well, technically, spaCy is going to find them.

In [None]:
from datasets import load_dataset

dataset = load_dataset("gutenberg_time", split="train[0:20]")

Let's take a look at our dataset, starting with a sneak peek at one instance and followed by the dataset summary, which shows its features and length.

In [28]:
dataset[1]

{'guten_id': '4447',
 'hour_reference': '12',
 'is_ambiguous': True,
 'time_phrase': 'the fall of the winter noon',
 'time_pos_end': 74,
 'time_pos_start': 68,
 'tok_context': "So profoundly penetrated with thoughtfulness was the tone of his voice that I could not take umbrage . The attempt to analyze his signification cost me an aching forehead , perhaps because I knew it too acutely . She was on horseback ; I on foot , Schwartz for sole witness , and a wide space of rolling silent white country around us . We had met in the fall of the winter noon by accident . ` You like my Professor ? ' said Ottilia . ` I do : I respect him for his learning . '"}

In [4]:
dataset

Dataset({
    features: ['guten_id', 'hour_reference', 'time_phrase', 'is_ambiguous', 'time_pos_start', 'time_pos_end', 'tok_context'],
    num_rows: 20
})
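
To make the fields above more concrete, here is a minimal sketch that reuses the `dataset[1]` record shown earlier, with `tok_context` truncated right after the time phrase. It assumes that `time_pos_start` and `time_pos_end` are indices into the whitespace-separated tokens of `tok_context`. That holds for this record, but it is not verified here for the whole dataset.

```python
# Fields copied from the dataset[1] output above. tok_context is truncated
# after the time phrase to keep the example short.
record = {
    "time_phrase": "the fall of the winter noon",
    "time_pos_start": 68,
    "time_pos_end": 74,
    "tok_context": (
        "So profoundly penetrated with thoughtfulness was the tone of his "
        "voice that I could not take umbrage . The attempt to analyze his "
        "signification cost me an aching forehead , perhaps because I knew "
        "it too acutely . She was on horseback ; I on foot , Schwartz for "
        "sole witness , and a wide space of rolling silent white country "
        "around us . We had met in the fall of the winter noon by accident ."
    ),
}

# Assumption: the positions index whitespace-split tokens, end exclusive.
tokens = record["tok_context"].split()
span = " ".join(tokens[record["time_pos_start"]:record["time_pos_end"]])

print(span)
print(span == record["time_phrase"])
```

If the assumption holds, `span` matches `time_phrase`, which means the time references could later be turned into token-level annotations.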

## Logging spaCy NER entities into Rubrix

### Using a Transformer-based pipeline

Let's install and load our RoBERTa-based pretrained pipeline and apply it to one of our dataset records:

In [None]:
!python -m spacy download en_core_web_trf

In [32]:
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp(dataset[0]["tok_context"])
doc

I crossed the ground she had traversed , noting every feature surrounding it , the curving wheel-track , the thin prickly sand-herbage , the wave - mounds , the sparse wet shells and pebbles , the gleaming flatness of the water , and the vast horizon-boundary of pale flat land level with shore , looking like a dead sister of the sea . By a careful examination of my watch and the sun 's altitude , I was able to calculate what would , in all likelihood , have been his height above yonder waves when her chair was turned toward the city , at a point I reached in the track . But of the matter then simultaneously occupying my mind , to recover which was the second supreme task I proposed to myself-of what . I also was thinking upon the stroke of five o'clock , I could recollect nothing . I could not even recollect whether I happened to be looking on sun and waves when she must have had them full and glorious in her face . With the heartiest consent I could give , and a blank cheque , my fath

Now let's apply the nlp pipeline to our dataset records, collecting the tokens and NER entities. 

In [33]:
records = []    # Creating an empty record list to save all the records

for record in dataset:

    text = record["tok_context"]  # We only need the text of each instance
    doc = nlp(text)    # spaCy Doc creation
    
    # Entity annotations
    entities = [
        (ent.label_, ent.start_char, ent.end_char)  
        for ent in doc.ents
    ] 

    # Pre-tokenized input text
    tokens = [token.text  for token in doc]
    

    # Rubrix TokenClassificationRecord list
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="spacy.en_core_web_trf",
        )
    )
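
The `prediction` passed to `rb.TokenClassificationRecord` above is a list of `(label, start_char, end_char)` tuples whose offsets index characters of `text`, with an exclusive end as in spaCy's `ent.end_char`. As a minimal sketch of that convention, the snippet below uses a made-up sentence, hand-written spans (not real model output), and a hypothetical helper, `span_texts`, to recover the surface form of each entity.

```python
from typing import List, Tuple

Entity = Tuple[str, int, int]  # (label, start_char, end_char), end exclusive


def span_texts(text: str, entities: List[Entity]) -> List[Tuple[str, str]]:
    """Return (label, surface form) pairs, rejecting out-of-range spans."""
    pairs = []
    for label, start, end in entities:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        pairs.append((label, text[start:end]))
    return pairs


# Illustrative sentence and spans; offsets are computed rather than hard-coded.
text = "We met in London at five o'clock ."
city_start = text.index("London")
time_start = text.index("five o'clock")
entities = [
    ("GPE", city_start, city_start + len("London")),
    ("TIME", time_start, time_start + len("five o'clock")),
]

print(span_texts(text, entities))
```

A check like this can be useful before logging, since offsets that do not line up with the text are a common source of confusing results in the UI.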

In [34]:
rb.log(records=records, name="gutenberg_spacy_ner")

BulkResponse(dataset='gutenberg_spacy_ner', processed=20, failed=0)

### Using a smaller but more efficient pipeline

Now let's compare with a smaller but more efficient pretrained model. Let's first download it:

In [6]:
!python -m spacy download en_core_web_sm -qqq

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [29]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(dataset[0]["tok_context"])
doc

I crossed the ground she had traversed , noting every feature surrounding it , the curving wheel-track , the thin prickly sand-herbage , the wave - mounds , the sparse wet shells and pebbles , the gleaming flatness of the water , and the vast horizon-boundary of pale flat land level with shore , looking like a dead sister of the sea . By a careful examination of my watch and the sun 's altitude , I was able to calculate what would , in all likelihood , have been his height above yonder waves when her chair was turned toward the city , at a point I reached in the track . But of the matter then simultaneously occupying my mind , to recover which was the second supreme task I proposed to myself-of what . I also was thinking upon the stroke of five o'clock , I could recollect nothing . I could not even recollect whether I happened to be looking on sun and waves when she must have had them full and glorious in her face . With the heartiest consent I could give , and a blank cheque , my fath

In [30]:
records = []    # Creating an empty record list to save all the records

for record in dataset:

    text = record["tok_context"]  # We only need the text of each instance
    doc = nlp(text)    # spaCy Doc creation
    
    # Entity annotations
    entities = [
        (ent.label_, ent.start_char, ent.end_char)  
        for ent in doc.ents
    ] 

    # Pre-tokenized input text
    tokens = [token.text  for token in doc]
    

    # Rubrix TokenClassificationRecord list
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="spacy.en_core_web_sm",
        )
    )

In [31]:
rb.log(records=records, name="gutenberg_spacy_ner")

BulkResponse(dataset='gutenberg_spacy_ner', processed=20, failed=0)

## Exploring and comparing `en_core_web_sm` and `en_core_web_trf` models

If you go to your `gutenberg_spacy_ner` dataset in Rubrix, you can explore and compare the results of both models.

A handy feature is the `predicted by` filter, which comes from the `prediction_agent` parameter of your `TokenClassificationRecord`.

![spacy_models_meta](img/spacy_filter_meta.png "spaCy models predicted_by filter")


Some quick qualitative findings about these two models applied to this sample:

- `en_core_web_trf` makes more conservative predictions. Most of them are accurate, but it misses a number of entities (higher precision, lower recall for entities like `CARDINAL`).
- `en_core_web_sm` has lower precision for most entity types, for example confusing `PERSON` with `ORG` even for the same surface form within the same paragraph, but has better recall for entities like `CARDINAL`.
- For `TIME` entities, both models show almost the same distribution and are quite accurate. This could be analysed further by logging the dataset's time references as `annotations`.
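
These observations are qualitative. One possible way to put rough numbers behind them is to count predicted labels per model. The sketch below assumes the per-record `entities` lists built earlier were kept separately for each model in a hypothetical `predictions` dictionary. The values in it are invented placeholders, not results from this tutorial.

```python
from collections import Counter

# Hypothetical structure: prediction agent -> one entity list per record.
# The spans below are invented placeholders, not real model output.
predictions = {
    "spacy.en_core_web_trf": [
        [("TIME", 0, 12), ("PERSON", 20, 27)],
        [("TIME", 5, 9)],
    ],
    "spacy.en_core_web_sm": [
        [("TIME", 0, 12), ("ORG", 20, 27), ("CARDINAL", 30, 33)],
        [("TIME", 5, 9), ("CARDINAL", 14, 17)],
    ],
}


def label_counts(records_entities):
    """Count how often each label is predicted across all records."""
    return Counter(
        label for entities in records_entities for label, _, _ in entities
    )


for agent, per_record in predictions.items():
    print(agent, dict(label_counts(per_record)))
```

Label counts only describe the distribution of predictions. Measuring precision and recall would still require gold annotations.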

As an illustration of these findings, here is an example of a record with the entities predicted by `en_core_web_sm` (top) and `en_core_web_trf` (bottom).

![spacy_sm_vs_trf](img/spacy_sm_vs_trf.png "spaCy sm model vs. roberta-based model")

## Summary
In this tutorial, we have learnt to log and explore `spaCy` NER models with Rubrix. 

## Next steps

We invite you to check our other tutorials and join our community. A good place to start is our [discussion forum](https://github.com/recognai/rubrix/discussions).