# From spaCy to NIF-compliant RDF

First of all, if you are running this notebook on your own computer, you have some setting up to do! Uncomment and run the following cells if you need to install the required libraries and models.

In [174]:
# %pip install spacy

In [None]:
# import spacy.cli
# spacy.cli.download("en_core_web_sm")

Anyway, we **all** need to install my [spacy2nif](https://github.com/francescomambrini/Spacy2NIF) module to convert the `spacy` annotation into NIF/RDF. Run the following cell:

In [None]:
!pip install git+https://github.com/francescomambrini/Spacy2NIF.git

## spaCy

See the [documentation](https://spacy.io/)

`spaCy` "is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. [...] Unlike `NLTK`, which is widely used for teaching and research, spaCy focuses on providing software for production usage. `spaCy` also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow, PyTorch or MXNet through its own machine learning library Thinc. Using Thinc as its backend, `spaCy` features convolutional neural network models for part-of-speech tagging, dependency parsing, text categorization and named entity recognition (NER). Prebuilt statistical neural network models to perform these tasks are available for 23 languages, including English, Portuguese, Spanish, Russian and Chinese, and there is also a multi-language NER model. Additional support for tokenization for more than 65 languages allows users to train custom models on their own datasets as well" (from [Wikipedia](https://en.wikipedia.org/wiki/SpaCy))

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

Here is how we import the library and load the model. For most languages, models come in different sizes, in particular:
- lightweight models (faster and smaller, but less accurate), generally identified by the `sm` tag;
- larger models that are heavier to download and run (they might not be supported by all computers), identified by the `md`, `lg` and `trf` tags.

For this experiment we are happy with the most basic model for English: `en_core_web_sm`. The name means:

- `en`: the language (English)
- `core`: general-purpose pipeline
- `web`: genre of training data for the pipeline (web content)
- `sm`: the size (small model)

See [here](https://spacy.io/models) for instructions on how to download and load different models for other languages.
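
The naming convention above can be unpacked mechanically. The following is a minimal sketch, not part of spaCy: `parse_model_name` is a hypothetical helper, and it assumes that the name has exactly four underscore-separated parts (language, type, genre, size).

```python
def parse_model_name(name: str) -> dict:
    """Hypothetical helper: split a model name into its four parts."""
    lang, kind, genre, size = name.split("_")
    return {"lang": lang, "type": kind, "genre": genre, "size": size}


parts = parse_model_name("en_core_web_sm")
print(parts)
```

A name with a different number of parts makes the unpacking fail with a `ValueError`, which is a cheap way to notice that the assumption does not hold.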

In [None]:
txt = '''Three times Randolph Carter dreamed of the marvellous city, and three times was he snatched away while still he paused on the high terrace above it. All golden and lovely it blazed in the sunset, with walls, temples, colonnades, and arched bridges of veined marble, silver-basined fountains of prismatic spray in broad squares and perfumed gardens, and wide streets marching between delicate trees and blossom-laden urns and ivory statues in gleaming rows; while on steep northward slopes climbed tiers of red roofs and old peaked gables harbouring little lanes of grassy cobbles. It was a fever of the gods; a fanfare of supernal trumpets and a clash of immortal cymbals. Mystery hung about it as clouds about a fabulous unvisited mountain; and as Carter stood breathless and expectant on that balustraded parapet there swept up to him the poignancy and suspense of almost-vanished memory, the pain of lost things, and the maddening need to place again what once had an awesome and momentous place.'''

In the following code cell we run the pipeline on the text that we defined above:

In [None]:
doc = nlp(txt)

The pipeline processed our document. But what did it do, exactly? Here is a summary of the pipeline components:

In [None]:
nlp.pipe_names

## Tokens

If you loop over a `doc`, or if you index it with a number between 0 (the first token) and $n - 1$ (where $n$ is the number of tokens in the text), you access the tokens:

In [None]:
print(doc[0], type(doc[0]))

Tokens hold a series of properties stored in attributes, including their string form (their text) and other annotations produced by the pipeline, such as POS tags, lemmas, morphological analysis and more:

In [None]:
for t in doc[:5]:
    print(t.text, t.pos_, t.lemma_, t.morph)

But even more crucially for our purposes, each token stores the offset of the character where it starts (`idx`); adding the token's length gives the offset where it ends:

In [None]:
for t in doc[:5]:
    print(t.text, t.idx, t.idx + len(t))
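
To see why these offsets matter, here is a small sketch that does not use `spacy` at all. As an assumption for illustration, a regular expression that splits on whitespace stands in for the real tokenizer. The point is that a start offset plus an exclusive end offset is exactly what Python slicing needs to recover the token from the original string.

```python
import re

text = "Three times Randolph Carter dreamed of the marvellous city"

# Stand-in for a real tokenizer: every run of non-space characters is a "token".
tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

for form, start, end in tokens[:4]:
    # Slicing the original string with the two offsets gives back the token.
    assert text[start:end] == form
    print(form, start, end)
```

The end offset is exclusive, just like `t.idx + len(t)` in the cell above.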

### Sentence splitting

Since we have sentence splitting in our pipeline, let's inspect the sentences:

In [None]:
# sentence spans carry no label, so we only print their text
for s in doc.sents:
    print(s.text)

We can inspect one of those sentences even more closely. In this pipeline, sentence boundaries are set by the dependency parser, which is explicitly listed as a component of our pipeline.

In [None]:
sent = next(doc.sents)

The visualizer [`displacy`](https://spacy.io/usage/visualizers/) allows us to graphically inspect the parsing.

In [None]:
from spacy import displacy

displacy.render(sent, style="dep", jupyter=True)

But the dependency relation is also stored in the tokens' attributes:

In [None]:
# see how I access the first token here?
t = doc[0]
print(f"{t.head} -[{t.dep_}]-> {t}")

I can get the root of the sentence:

In [None]:
r = sent.root
print(r)

I can obtain the list of any token's dependents:

In [None]:
# `r` is the root of the sentence
for c in r.children:
    print(c)
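
The relation between `head`, `root` and `children` can be sketched with plain Python lists. The analysis below is written by hand for illustration (it is not output of the parser), and it follows the convention that the root is the token whose head is itself.

```python
from collections import defaultdict

words = ["Carter", "dreamed", "of", "the", "city"]
# heads[i] is the index of the head of words[i]; the root points to itself.
heads = [1, 1, 1, 4, 2]

children = defaultdict(list)
root = None
for i, h in enumerate(heads):
    if i == h:
        root = i  # the only token that is its own head
    else:
        children[h].append(i)  # register i as a dependent of its head

print("root:", words[root])
print("children of the root:", [words[i] for i in children[root]])
```

Storing only one head per token is enough to rebuild the whole tree: the list of dependents is derived by inverting that mapping.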

## Named Entities

NER is also in our pipeline, so let's inspect how it went. The list of the annotated entities is accessible from the `doc`.

In [None]:
for e in doc.ents:
    print(e.text, e.label_)

You see that an entity can be a token, but it can also be something else...

In [None]:
for e in doc.ents:
    print(e.text, type(e))

In fact, entities are correctly represented as spans (by the way, sentences are spans as well), which may or may not correspond to single tokens (in the case of "Randolph Carter", for instance, they do not).

Spans also record their start and end offsets, which comes in very handy for generating NIF-compliant representations:

In [None]:
for e in doc.ents:
    print(e.text, e.start_char, e.end_char)

The same thing can be done with sentences:

In [None]:
for s in doc.sents:
    print(f"{s.text[:5]}...", s.start_char, s.end_char)
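
Character offsets are what allows us to mint one URI per span. The sketch below uses the `char=start,end` fragment pattern that also appears in the CSV export at the end of this notebook. `span_uri` is a hypothetical helper, and whether the exporter builds its URIs in exactly this way is an assumption.

```python
base_uri = "http://example.org/doc#"
text = "Three times Randolph Carter dreamed of the marvellous city"


def span_uri(base: str, start: int, end: int) -> str:
    """Hypothetical helper: identify a substring by its character offsets."""
    return f"{base}char={start},{end}"


# Offsets of the entity, computed here by hand instead of by spaCy.
start = text.index("Randolph Carter")
end = start + len("Randolph Carter")

uri = span_uri(base_uri, start, end)
print(uri)
```

With a real `doc`, the same function could be fed `e.start_char` and `e.end_char` for every span in `doc.ents` or `doc.sents`.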

Once again, we can use `displacy` to have a look at the entity annotation within the document:

In [None]:
displacy.render(doc, style="ent", jupyter=True)

## Convert to RDF

Let us load and check the converter.

In [None]:
import importlib
import spacy2nif.exporter

importlib.reload(spacy2nif.exporter)

Here we initialize the converter. You can pass it a `base_uri` argument; if you don't, it defaults to `http://example.org/doc#`.

In [None]:
nif = spacy2nif.exporter.NIFExporter(base_uri="http://example.org/doc#")

In [None]:
g = nif.export_doc(doc)

Let's bind the namespaces!

In [None]:
g.bind('nif', str(nif.NIF))
g.bind('conll', str(nif.CONLL))

In [None]:
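# ask for Turtle explicitly: older versions of rdflib default to RDF/XML
g.serialize(destination='doc2nif.ttl', format='turtle')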
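
To get a feel for what a NIF description of a single span can look like in Turtle, here is a hand-written sketch built with plain string formatting. The class and the properties (`nif:Phrase`, `nif:anchorOf`, `nif:beginIndex`, `nif:endIndex`, `nif:referenceContext`) come from the NIF 2.0 Core ontology. Which of them the exporter actually emits, and with which URI pattern, is an assumption here, so compare the result with your own `doc2nif.ttl`.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
XSD = "http://www.w3.org/2001/XMLSchema#"
base = "http://example.org/doc#"
text = "Three times Randolph Carter dreamed of the marvellous city"

anchor = "Randolph Carter"
start = text.index(anchor)
end = start + len(anchor)

# One resource for the span, linked to the resource for the whole text.
ttl = f"""@prefix nif: <{NIF}> .
@prefix xsd: <{XSD}> .

<{base}char={start},{end}> a nif:Phrase ;
    nif:anchorOf "{anchor}" ;
    nif:beginIndex "{start}"^^xsd:nonNegativeInteger ;
    nif:endIndex "{end}"^^xsd:nonNegativeInteger ;
    nif:referenceContext <{base}char=0,{len(text)}> .
"""
print(ttl)
```

In a real workflow the triples are of course produced by the exporter and `rdflib`, not by string formatting; the sketch only shows how the offsets end up in the graph.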

## The Shadow over Innsmouth

Let's do an exercise. We are going to take the [horror novella](https://en.wikipedia.org/wiki/The_Shadow_over_Innsmouth) *The Shadow over Innsmouth* by H.P. Lovecraft and we are going to:

1. annotate it
2. export the annotation as NIF

We will get the text of the novella from the [lovecraftcorpus](https://github.com/vilmibm/lovecraftcorpus) on GitHub. The URL of the txt is the following:

In [None]:
url = "https://raw.githubusercontent.com/vilmibm/lovecraftcorpus/refs/heads/master/innsmouth.txt"

Let's retrieve the text:

In [None]:
import requests

r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
txt = r.text
print(txt[:1000])

Now we annotate it using the same pipeline as before (it may take a while, depending on your computer).

In [None]:
doc = nlp(txt)

Now let's convert it to NIF. We use the web URL of the raw text in GitHub as the document base URI.

In [None]:
nif = spacy2nif.exporter.NIFExporter(base_uri=f"{url}#", export_full_text=False)

In [None]:
g = nif.export_doc(doc)

Let's bind the namespaces!

In [None]:
g.bind('nif', str(nif.NIF))
g.bind('conll', str(nif.CONLL))
g.bind('insmouth', f"{url}#")

In [None]:
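# ask for Turtle explicitly: older versions of rdflib default to RDF/XML
g.serialize(destination='insmouth.ttl', format='turtle')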

In [None]:
displacy.render(next(doc.sents), style="ent", jupyter=True)

---

## Appendix: IOB

Most NER applications use a common format known as [IOB](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) as output. IOB annotation looks like this:

```
During	O
the	B-DATE
winter	I-DATE
of	I-DATE
1927	I-DATE
officials	O
of	O
the	O
Federal	B-ORG
government	I-ORG
in	O
Boston	B-LOC
...
```

As you may have guessed, tags like `I-ORG` or `B-LOC` are composed of two parts: (1) a prefix that is used for chunking and (2) the entity type (`ORG`, `LOC` etc.). Tokens that are *not* part of a Named Entity simply get an `O`.

Prefixes are:

- `B`: for the token that either starts the span of the NE or is the only token of the NE
- `I`: for the tokens that are either in the middle of the span or at its end

(There are many flavours, however! Implementations differ, especially in how they use `I` and `B` for spans made of only one token or for final tokens. The explanation given above works for the output of [NameTag](https://lindat.mff.cuni.cz/services/nametag/) quoted above.)
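
The scheme described above can be sketched in a few lines. `to_iob` is a hypothetical helper written for this example, and the entity spans are given by hand as token indices (end exclusive), following the sample quoted above.

```python
tokens = ["During", "the", "winter", "of", "1927", "officials",
          "of", "the", "Federal", "government", "in", "Boston"]
# (start, end, label) in token indices, with `end` exclusive
entities = [(1, 5, "DATE"), (8, 10, "ORG"), (11, 12, "LOC")]


def to_iob(n_tokens, spans):
    """Hypothetical helper: turn entity spans into one IOB tag per token."""
    tags = ["O"] * n_tokens  # by default a token is outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first (or only) token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # middle and final tokens
    return tags


tags = to_iob(len(tokens), entities)
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```

In `spacy` the same information is available on every token through the `ent_iob_` and `ent_type_` attributes.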

## Appendix 2: Entity Linking in `spacy`

`spacy` provides functionality for entity linking, i.e. for taking the named entities and matching them with IDs from knowledge bases.

There are a couple of solutions for doing this. One is included in `spacy` and can be plugged into the pipeline as a component. It requires you to create your own custom knowledge base.

The other one is the [spacy-entity-linker](https://pypi.org/project/spacy-entity-linker/) module by Martino Mensio. It must be installed and, the first time it is used, it downloads Wikidata as a `spacy` knowledge base (~1.3 GB!). The good news, if you're using Colab, is that it will be downloaded to your VM (but be aware that the **free tier of Colab has limited disk space**, around 100 GB).

But actually, I wasn't very satisfied with that.

We can try a third option! We can:

1. build a CSV file with the URIs of all the NEs
2. load it into OpenRefine
3. use the reconciliation with Wikidata



**IMPORTANT**

Make sure you have whatever text file you want uploaded on your Colab machine (right pane, click on the folder and then upload)!

If you want to follow what I did in the video you can upload [this](https://github.com/francescomambrini/Spacy2NIF/blob/main/examples/louvre-ticket-price-hike-scli-intl.txt) txt file with a [news article](https://lite.cnn.com/2025/11/28/travel/louvre-ticket-price-hike-scli-intl#) from CNN.

Once you have loaded it, here is how you open it, read it and process it with `spacy`:

In [175]:
import spacy

# or use whatever model you want
nlp = spacy.load("en_core_web_sm")

with open('louvre-ticket-price-hike-scli-intl.txt', encoding='utf-8') as f:
    txt = f.read()

doc = nlp(txt)

And here is how we generate the CSV file to be read by OpenRefine (the columns are separated by tabs). If you are working with your own data, make sure to update the `base_uri` according to your settings.

In [None]:
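base_uri = "https://lite.cnn.com/2025/11/28/travel/louvre-ticket-price-hike-scli-intl#"

with open('louvre.csv', 'w', encoding='utf-8') as out:
    out.write('URI\tText\tLabel\n')
    for e in doc.ents:
        uri = f'{base_uri}char={e.start_char},{e.end_char}'
        # Clean the text outside the f-string: reusing the same quotes (and a
        # backslash) inside `{...}` is a syntax error before Python 3.12.
        # Newlines become spaces so that every entity stays on one row.
        text = e.text.replace('\n', ' ')
        out.write(f'{uri}\t{text}\t{e.label_}\n')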
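
As an alternative to writing the rows by hand, the standard `csv` module can take care of the delimiters and of any quoting that may be needed. This is only a sketch: the two entities are hand-written stand-ins for `doc.ents`, the base URI is a placeholder, and the rows are written to an in-memory buffer instead of a file.

```python
import csv
import io

base_uri = "http://example.org/doc#"
# (text, start_char, end_char, label): hand-written stand-ins for `doc.ents`
entities = [("the Louvre", 0, 10, "ORG"), ("Paris\nFrance", 25, 37, "GPE")]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["URI", "Text", "Label"])
for text, start, end, label in entities:
    uri = f"{base_uri}char={start},{end}"
    # newlines inside an entity are replaced, so each entity is one row
    writer.writerow([uri, text.replace("\n", " "), label])

print(buffer.getvalue())
```

To write to disk, replace the buffer with a file opened with `open('louvre.csv', 'w', newline='', encoding='utf-8')`.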
  