# Using Rubrix with spaCy v3

In this tutorial, we will walk through the process of using Rubrix to log NER predictions from spaCy v3.


## Introduction

**Our goal is to show you how to explore, with Rubrix, predictions made with spaCy** on an NER task. We will use the new capabilities of spaCy v3, which integrates Transformers into the pipeline.

## Installations and imports

We will need Hugging Face Datasets (since our dataset comes from the Hugging Face Hub), spaCy (making sure it is upgraded to v3), and an English transformer model.

In [None]:
!pip install torch torchvision torchaudio
!pip install transformers
!pip install transformers[sentencepiece]
!pip install datasets 
!pip install --upgrade spacy
!python -m spacy download en_core_web_trf

Next, we import Rubrix, then import spaCy and load the model we just downloaded.

In [1]:
import rubrix as rb

In [2]:
import spacy
nlp = spacy.load("en_core_web_trf")

For this tutorial, we are going to use the [*Gutenberg Time*](https://huggingface.co/datasets/gutenberg_time) dataset from the Hugging Face Hub. It contains all explicit time references in a dataset of 52,183 novels whose full text is available via Project Gutenberg. From extracts of novels, we are surely going to find some named entities. Well, technically, spaCy is going to find them.

In [3]:
from datasets import load_dataset

dataset = load_dataset("gutenberg_time", split="train")

Reusing dataset gutenberg_time (/Users/ignaciotalaveracepeda/.cache/huggingface/datasets/gutenberg_time/gutenberg/0.0.0/f39225c596bc6125cf8ffcfb61ea510ae8dfa497a3ba392cd80d6893ebf715ee)


## Dataset preview

Let's take a look at our dataset, starting with a sneak peek at one instance and then checking its size.

In [4]:
dataset[0]

{'guten_id': '4447',
 'hour_reference': '5',
 'is_ambiguous': True,
 'time_phrase': "five o'clock",
 'time_pos_end': 147,
 'time_pos_start': 145,
 'tok_context': "I crossed the ground she had traversed , noting every feature surrounding it , the curving wheel-track , the thin prickly sand-herbage , the wave - mounds , the sparse wet shells and pebbles , the gleaming flatness of the water , and the vast horizon-boundary of pale flat land level with shore , looking like a dead sister of the sea . By a careful examination of my watch and the sun 's altitude , I was able to calculate what would , in all likelihood , have been his height above yonder waves when her chair was turned toward the city , at a point I reached in the track . But of the matter then simultaneously occupying my mind , to recover which was the second supreme task I proposed to myself-of what . I also was thinking upon the stroke of five o'clock , I could recollect nothing . I could not even recollect whether I happene

In [5]:
dataset

Dataset({
    features: ['guten_id', 'hour_reference', 'time_phrase', 'is_ambiguous', 'time_pos_start', 'time_pos_end', 'tok_context'],
    num_rows: 120694
})

Okay, that is a lot of rows. For the sake of the tutorial, let's reduce the dataset to its first 20 instances.

In [6]:
dataset = dataset.select(range(20))
dataset

Dataset({
    features: ['guten_id', 'hour_reference', 'time_phrase', 'is_ambiguous', 'time_pos_start', 'time_pos_end', 'tok_context'],
    num_rows: 20
})

Now, let's see how to create a spaCy Doc. A spaCy Doc is essentially a sequence of tokens, which spaCy produces by running the text through the pipeline we just loaded. With the Doc in hand, and using the power of the pretrained Transformer, we will find and label our named entities.

In [7]:
doc = nlp(dataset[0]["tok_context"])
doc

I crossed the ground she had traversed , noting every feature surrounding it , the curving wheel-track , the thin prickly sand-herbage , the wave - mounds , the sparse wet shells and pebbles , the gleaming flatness of the water , and the vast horizon-boundary of pale flat land level with shore , looking like a dead sister of the sea . By a careful examination of my watch and the sun 's altitude , I was able to calculate what would , in all likelihood , have been his height above yonder waves when her chair was turned toward the city , at a point I reached in the track . But of the matter then simultaneously occupying my mind , to recover which was the second supreme task I proposed to myself-of what . I also was thinking upon the stroke of five o'clock , I could recollect nothing . I could not even recollect whether I happened to be looking on sun and waves when she must have had them full and glorious in her face . With the heartiest consent I could give , and a blank cheque , my fath

A spaCy Doc is built from a single string, so we need to repeat this process for every record in our dataset. The good thing is that, once the Doc is created, we have all the information we need to log our data into Rubrix.

In [8]:
records = []  # Empty list that will collect all the records

for row in dataset:
    text = row["tok_context"]  # We only need the text of each instance
    doc = nlp(text)  # spaCy Doc creation

    # Entity tuples: (label, start_char, end_char)
    entities = [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]

    # Token strings of the record
    tokens = [token.text for token in doc]

    # Build the Rubrix record and append it to the list
    records.append(
        rb.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="spacy v3",
            metadata={"split": "train"},
        )
    )
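
The entity-extraction step in the cell above boils down to one conversion: from a Doc to a list of `(label, start_char, end_char)` tuples. Below is a minimal sketch of that conversion as a reusable helper. The name `doc_to_prediction` is hypothetical, and the stand-in objects built with `SimpleNamespace` only mimic the three attributes read from each spaCy entity, so the snippet runs without loading a model.

```python
from types import SimpleNamespace


def doc_to_prediction(doc):
    """Return [(label, start_char, end_char), ...] for every entity in doc.

    `doc` only needs an `ents` iterable whose items expose `label_`,
    `start_char` and `end_char`, as spaCy entity spans do.
    """
    return [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]


# Stand-in for a spaCy Doc (illustration only, not real pipeline output).
text = "I was thinking upon the stroke of five o'clock ."
phrase = "five o'clock"
phrase_start = text.index(phrase)
fake_doc = SimpleNamespace(
    ents=[
        SimpleNamespace(
            label_="TIME",
            start_char=phrase_start,
            end_char=phrase_start + len(phrase),
        )
    ]
)

prediction = doc_to_prediction(fake_doc)
label, start, end = prediction[0]
print(label, repr(text[start:end]))  # the offsets slice the entity text back out
```

With a real pipeline, the same helper could be fed from `nlp.pipe(texts)`, which processes texts in batches and is usually faster than calling `nlp` once per record on larger datasets.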

spaCy has made the predictions for us, and we can log them right away! Remember that predictions and annotations must be lists of `(label, start_character, end_character)` tuples.
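
When tokens and entities come from the same Doc, the character offsets line up with the tokens by construction. If the spans ever come from another source, a quick check can catch problems before logging. The sketch below assumes that each span is expected to start and end on a token boundary; the helper names `token_offsets` and `misaligned_entities` are hypothetical and not part of Rubrix or spaCy.

```python
def token_offsets(text, tokens):
    """Locate each token in text, left to right, and return (start, end) pairs."""
    offsets, cursor = [], 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if a token is missing
        cursor = start + len(token)
        offsets.append((start, cursor))
    return offsets


def misaligned_entities(text, tokens, entities):
    """Return entities that are out of bounds, empty, or off token boundaries."""
    offsets = token_offsets(text, tokens)
    starts = {start for start, _ in offsets}
    ends = {end for _, end in offsets}
    return [
        (label, start, end)
        for label, start, end in entities
        if not (0 <= start < end <= len(text) and start in starts and end in ends)
    ]


text = "She left at five o'clock ."
tokens = ["She", "left", "at", "five", "o'clock", "."]
entities = [("TIME", 12, 24), ("TIME", 13, 24)]  # the second span starts mid-token

print(misaligned_entities(text, tokens, entities))
```

Here only the second tuple is reported, because offset 13 falls inside the token `five`.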

Let's log!

In [9]:
rb.log(records=records, name="gutenberg")

BulkResponse(dataset='gutenberg', processed=20, failed=0)
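
Once the records are logged, a quick tally of the predicted labels can serve as a sanity check against what the Rubrix UI shows. The sketch below uses made-up predictions and a hypothetical helper named `label_counts`; it only assumes that each prediction is a list of tuples whose first element is the label.

```python
from collections import Counter


def label_counts(predictions):
    """Count entity labels across a list of per-record prediction lists."""
    return Counter(entity[0] for prediction in predictions for entity in prediction)


# Made-up predictions for two records (illustration only).
predictions = [
    [("TIME", 34, 46), ("PERSON", 0, 5)],
    [("TIME", 10, 20)],
]

counts = label_counts(predictions)
print(counts.most_common())
```

With the records built earlier, the input could be assembled as `[r.prediction for r in records]`.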