# From spaCy to NIF-compliant RDF

First of all, we have some setting up to do! Uncomment and run the following cells if you need to install the appropriate libraries.

In [None]:
# %pip install spacy

In [129]:
import spacy.cli
spacy.cli.download("en_core_web_sm")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 21.5 MB/s  0:00:00 eta 0:00:01
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## spaCy

See the [documentation](https://spacy.io/).

`spaCy` "is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. [...] Unlike `NLTK`, which is widely used for teaching and research, spaCy focuses on providing software for production usage. `spaCy` also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow, PyTorch or MXNet through its own machine learning library Thinc. Using Thinc as its backend, `spaCy` features convolutional neural network models for part-of-speech tagging, dependency parsing, text categorization and named entity recognition (NER). Prebuilt statistical neural network models to perform these tasks are available for 23 languages, including English, Portuguese, Spanish, Russian and Chinese, and there is also a multi-language NER model. Additional support for tokenization for more than 65 languages allows users to train custom models on their own datasets as well" (from [Wikipedia](https://en.wikipedia.org/wiki/SpaCy))

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")

Here is how we import the library and load the model. For most languages, models come in several sizes, in particular:
- lightweight models (smaller and faster, but less accurate), generally identified by the `sm` suffix;
- larger models (generally more accurate, but more demanding in memory and computation, so they may not run comfortably on every machine), identified by the `md`, `lg` and `trf` suffixes.

For this experiment, the most basic model for English is enough: `en_core_web_sm`. The name means:

- `en`: the language (English)
- `core`: general-purpose pipeline
- `web`: genre of the training data for the pipeline (web content)
- `sm`: the size (small model)

See [here](https://spacy.io/models) for instructions on how to download and load different models for other languages.
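The naming scheme described above can be taken apart mechanically. The following is only a minimal sketch: `ModelName` and `parse_model_name` are hypothetical helpers written for this notebook, not part of spaCy, and they assume the name has exactly four underscore-separated parts.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str   # language code, e.g. "en"
    kind: str   # pipeline type, e.g. "core"
    genre: str  # genre of the training data, e.g. "web"
    size: str   # size indicator: "sm", "md", "lg" or "trf"


def parse_model_name(name: str) -> ModelName:
    """Split a package name such as 'en_core_web_sm' into its four parts."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected lang_kind_genre_size, got {name!r}")
    return ModelName(*parts)


print(parse_model_name("en_core_web_sm"))
# ModelName(lang='en', kind='core', genre='web', size='sm')
```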

In [2]:
txt = '''Three times Randolph Carter dreamed of the marvellous city, and three times was he snatched away while still he paused on the high terrace above it. All golden and lovely it blazed in the sunset, with walls, temples, colonnades, and arched bridges of veined marble, silver-basined fountains of prismatic spray in broad squares and perfumed gardens, and wide streets marching between delicate trees and blossom-laden urns and ivory statues in gleaming rows; while on steep northward slopes climbed tiers of red roofs and old peaked gables harbouring little lanes of grassy cobbles. It was a fever of the gods; a fanfare of supernal trumpets and a clash of immortal cymbals. Mystery hung about it as clouds about a fabulous unvisited mountain; and as Carter stood breathless and expectant on that balustraded parapet there swept up to him the poignancy and suspense of almost-vanished memory, the pain of lost things, and the maddening need to place again what once had an awesome and momentous place.'''

In the following code cell we run the pipeline on the text defined above.

In [3]:
doc = nlp(txt)

The pipeline processed our document. But what did it do, exactly? Here is a summary of the pipeline components:

In [None]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

## Tokens

If you loop over a `doc`, or index it with a number between 0 (the first token) and $n - 1$ (where $n$ is the number of tokens in the text), you access the tokens:

In [74]:
print(doc[0], type(doc[0]))

Three <class 'spacy.tokens.token.Token'>


Tokens hold a series of properties stored as attributes, including their string form (their text) and other annotations produced by the pipeline, such as POS tags, lemmas and morphological features:

In [76]:
for t in doc[:5]:
    print(t.text, t.pos_, t.lemma_, t.morph)

Three NUM three NumType=Card
times NOUN time Number=Plur
Randolph PROPN Randolph Number=Sing
Carter PROPN Carter Number=Sing
dreamed VERB dream Tense=Past|VerbForm=Fin
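The morphological analysis printed in the last column follows the `Key=Value|Key=Value` layout. As a small illustration, the string form can be split into a dictionary by hand; `parse_feats` is a hypothetical helper that works on plain strings only. spaCy's own morphology object also offers a `to_dict()` method, which is probably the better choice inside a real pipeline.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Turn a 'Key=Value|Key=Value' string into a dictionary."""
    if not feats:  # tokens without morphological features give an empty string
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


print(parse_feats("Tense=Past|VerbForm=Fin"))
# {'Tense': 'Past', 'VerbForm': 'Fin'}
```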


But even more crucially for our purposes, each token stores the character offset at which it starts (`idx`); adding the token's length gives its end offset:

In [79]:
for t in doc[:5]:
    print(t.text, t.idx, t.idx + len(t))

Three 0 5
times 6 11
Randolph 12 20
Carter 21 27
dreamed 28 35
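To check what these numbers mean, the offsets can be used to slice the original string. The sketch below does not call spaCy: it simply copies the five (text, start, end) triples from the output above and verifies them against the first words of the passage.

```python
text = "Three times Randolph Carter dreamed"

# (token text, start offset, end offset), copied from the output above
tokens = [
    ("Three", 0, 5),
    ("times", 6, 11),
    ("Randolph", 12, 20),
    ("Carter", 21, 27),
    ("dreamed", 28, 35),
]

for form, start, end in tokens:
    # the end offset is exclusive, exactly like the end of a Python slice
    assert text[start:end] == form
    print(f"text[{start}:{end}] == {form!r}")
```

Because the end offset is exclusive, `end - start` is always the length of the token in characters.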


### Sentence splitting

Since we have sentence splitting in our pipeline, let's inspect the sentences:

In [4]:
for e in doc.sents:
    print(f"{e.text}\t{e.label_}")

Three times Randolph Carter dreamed of the marvellous city, and three times was he snatched away while still he paused on the high terrace above it.	
All golden and lovely it blazed in the sunset, with walls, temples, colonnades, and arched bridges of veined marble, silver-basined fountains of prismatic spray in broad squares and perfumed gardens, and wide streets marching between delicate trees and blossom-laden urns and ivory statues in gleaming rows; while on steep northward slopes climbed tiers of red roofs and old peaked gables harbouring little lanes of grassy cobbles.	
It was a fever of the gods; a fanfare of supernal trumpets and a clash of immortal cymbals.	
Mystery hung about it as clouds about a fabulous unvisited mountain; and as Carter stood breathless and expectant on that balustraded parapet there swept up to him the poignancy and suspense of almost-vanished memory, the pain of lost things, and the maddening need to place again what once had an awesome and momentous place.

We can inspect one of those sentences more closely. In this pipeline, sentence boundaries are derived from the dependency parse, and the parser is explicitly listed among the components of our pipeline.

In [5]:
sent = next(doc.sents)

The visualizer [`displacy`](https://spacy.io/usage/visualizers/) allows us to graphically inspect the parsing.

In [151]:
from spacy import displacy

displacy.render(sent, style="dep", jupyter=True)

The dependency relations are also stored in the tokens' attributes:

In [66]:
# see how I access the first token here?
t = doc[0]
print(f"{t.head} -[{t.dep_}]-> {t}")

times -[nummod]-> Three


I can get the root of the sentence:

In [10]:
r = sent.root
print(r)

dreamed


I can obtain the list of any token's dependents:

In [67]:
# `r` is the root of the sentence
for c in r.children:
    print(c)

times
Carter
of
,
and
snatched
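The `head` and `children` attributes are two views of the same tree: if every token knows its head, the list of dependents can be rebuilt from that alone. Here is a minimal sketch of the idea in plain Python, using only the head relations visible in the outputs above. Word forms serve as keys purely for readability; real code should rely on token indices, since a form can occur more than once in a sentence.

```python
from collections import defaultdict

# (dependent, head) pairs taken from the outputs above;
# in spaCy the root of a sentence is its own head
heads = [
    ("Three", "times"),
    ("times", "dreamed"),
    ("Carter", "dreamed"),
    ("of", "dreamed"),
    (",", "dreamed"),
    ("and", "dreamed"),
    ("snatched", "dreamed"),
    ("dreamed", "dreamed"),
]

children = defaultdict(list)
root = None
for dependent, head in heads:
    if dependent == head:
        root = dependent  # the root points to itself
    else:
        children[head].append(dependent)

print(root, children[root])
# dreamed ['times', 'Carter', 'of', ',', 'and', 'snatched']
```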


## Named entities

NER is also in our pipeline, so let's inspect how it went. The list of the annotated entities is accessible from the `doc`.

In [68]:
for e in doc.ents:
    print(e.text, e.label_)

Three CARDINAL
Randolph Carter PERSON
three CARDINAL
Carter PERSON


You see that an entity can be a token, but it can also be something else...

In [69]:
for e in doc.ents:
    print(e.text, type(e))

Three <class 'spacy.tokens.span.Span'>
Randolph Carter <class 'spacy.tokens.span.Span'>
three <class 'spacy.tokens.span.Span'>
Carter <class 'spacy.tokens.span.Span'>


In fact, entities are correctly represented as spans (sentences, by the way, are spans as well), which may or may not correspond to single tokens (in the case of "Randolph Carter", for instance, the span covers two tokens).

Spans also record their start and end offsets, which comes in very handy for generating NIF-compliant representations:

In [70]:
for e in doc.ents:
    print(e.text, e.start_char, e.end_char)

Three 0 5
Randolph Carter 12 27
three 64 69
Carter 749 755


The same thing can be done with sentences:

In [71]:
for s in doc.sents:
    print(f"{s.text[:5]}...", s.start_char, s.end_char)

Three... 0 148
All g... 149 580
It wa... 581 672
Myste... 673 999
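Since entities and sentences share the same character coordinate system, the offsets alone are enough to tell which sentence an entity belongs to. The sketch below uses only the numbers printed above; `sentence_index` is a hypothetical helper. Inside spaCy the same information should also be reachable through the span's `sent` attribute.

```python
from bisect import bisect_right

# (start, end) of each sentence and (text, start, end) of each entity,
# copied from the outputs above
sentences = [(0, 148), (149, 580), (581, 672), (673, 999)]
entities = [
    ("Three", 0, 5),
    ("Randolph Carter", 12, 27),
    ("three", 64, 69),
    ("Carter", 749, 755),
]

sentence_starts = [start for start, _ in sentences]


def sentence_index(char_offset: int) -> int:
    """Index of the last sentence that starts at or before char_offset."""
    return bisect_right(sentence_starts, char_offset) - 1


for text, start, end in entities:
    i = sentence_index(start)
    s_start, s_end = sentences[i]
    # an entity never crosses a sentence boundary in this example
    assert s_start <= start and end <= s_end
    print(f"{text!r} ({start}-{end}) is in sentence {i} ({s_start}-{s_end})")
```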


Once again, we can use `displacy` to have a look at the entity annotation within the document:

In [83]:
displacy.render(doc, style="ent", jupyter=True)

## Convert to RDF

Let us load and check the converter.

In [152]:
import importlib
import spacy2nif.exporter

importlib.reload(spacy2nif.exporter)

<module 'spacy2nif.exporter' from '/Users/francesco/Documents/sync/progetti/Spacy2NIF/examples/../spacy2nif/exporter.py'>

Here we initialize the converter. You can pass it a `base_uri` argument; if you don't, it defaults to `http://example.org/doc#`.

In [87]:
nif = spacy2nif.exporter.NIFExporter(base_uri="http://example.org/doc#")

In [85]:
g = nif.export_doc(doc)

Let's bind the namespaces!

In [47]:
g.bind('nif', str(nif.NIF))
g.bind('conll', str(nif.CONLL))

In [48]:
g.serialize('doc2nif.ttl')

<Graph identifier=N23cd57cb76e34e10bd5d0799d79e3c04 (<class 'rdflib.graph.Graph'>)>
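To see why the base URI and the character offsets matter, it may help to sketch how a NIF-style identifier for a piece of text can be put together. NIF commonly identifies strings with RFC 5147 fragments of the form `char=start,end` appended to the document URI. Whether the exporter used here follows exactly this scheme is an assumption; the notebook does not show it, and `nif_string_uri` is a hypothetical helper, not part of `spacy2nif`.

```python
BASE_URI = "http://example.org/doc#"  # the default base URI mentioned above


def nif_string_uri(base_uri: str, start: int, end: int) -> str:
    """Build an RFC 5147-style URI for the text span [start, end).

    Assumption: the exporter may use a different URI scheme.
    """
    return f"{base_uri}char={start},{end}"


# (text, start, end) of two entities found earlier in the notebook
entities = [("Three", 0, 5), ("Randolph Carter", 12, 27)]

for text, start, end in entities:
    print(nif_string_uri(BASE_URI, start, end), "->", text)
```

The point of such a scheme is that the URI itself carries enough information to locate the annotated string in the source document.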

## The Shadow Over Innsmouth

Let's do an exercise. We are going to take the [horror novella](https://en.wikipedia.org/wiki/The_Shadow_over_Innsmouth) *The Shadow Over Innsmouth* by H.P. Lovecraft and we are going to:

1. annotate it
2. export the annotation as NIF

We will get the text of the novella from the [lovecraftcorpus](https://github.com/vilmibm/lovecraftcorpus) on GitHub. The URL of the txt is the following:

In [89]:
url = "https://raw.githubusercontent.com/vilmibm/lovecraftcorpus/refs/heads/master/innsmouth.txt"

Let's retrieve the text:

In [95]:
import requests
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
txt = r.text
print(txt[:1000])

THE SHADOW OVER INNSMOUTH

I

During the winter of 1927-28 officials of the Federal government made a strange and secret investigation of certain conditions in the ancient Massachusetts seaport of Innsmouth. The public first learned of it in February, when a vast series of raids and arrests occurred, followed by the deliberate burning and dynamiting--under suitable precautions--of an enormous number of crumbling, worm-eaten, and supposedly empty houses along the abandoned waterfront. Uninquiring souls let this occurrence pass as one of the major clashes in a spasmodic war on liquor.

Keener news-followers, however, wondered at the prodigious number of arrests, the abnormally large force of men used in making them, and the secrecy surrounding the disposal of the prisoners. No trials, or even definite charges were reported; nor were any of the captives seen thereafter in the regular gaols of the nation. There were vague statements about disease and concentration camps, and later about di

Now we annotate it using the same pipeline as before (it may take a while, depending on your computer).

In [96]:
doc = nlp(txt)

Now let's convert it to NIF. We use the web URL of the raw text in GitHub as the document base URI.

In [153]:
nif = spacy2nif.exporter.NIFExporter(base_uri=f"{url}#", export_full_text=False)

In [154]:
g = nif.export_doc(doc)

Let's bind the namespaces!

In [155]:
g.bind('nif', str(nif.NIF))
g.bind('conll', str(nif.CONLL))
g.bind('insmouth', f"{url}#")

In [156]:
g.serialize('insmouth.ttl')

<Graph identifier=N3e0174eee8cc43fda82955340e453d72 (<class 'rdflib.graph.Graph'>)>

In [171]:
displacy.render(next(doc.sents), style="ent", jupyter=True)

## Load NER annotation

In the next exercise, we are going to take a text annotated with NER using a common format known as [IOB](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). IOB annotation looks like this:

```
During	O
the	B-DATE
winter	I-DATE
of	I-DATE
1927	I-DATE
officials	O
of	O
the	O
Federal	B-ORG
government	I-ORG
in	O
Boston	B-LOC
...
```
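Reading this format back amounts to grouping each `B-` tag with the `I-` tags that follow it. The function below is a minimal sketch of that grouping, not the loader used later in the exercise: `iob_to_spans` is a hypothetical helper, the sample rows are taken from the example above, and an `I-` tag that does not continue an open entity of the same type is simply ignored.

```python
def iob_to_spans(rows):
    """Group (token, tag) pairs into (label, [tokens]) entity spans."""
    spans, current = [], None
    for token, tag in rows:
        if tag.startswith("B-"):
            # a B- tag always opens a new entity
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # an I- tag continues the open entity of the same type
            current[1].append(token)
        else:
            # "O" (or a stray I- tag) closes whatever entity was open
            current = None
    return spans


sample = """During\tO
the\tB-DATE
winter\tI-DATE
of\tI-DATE
1927\tI-DATE
officials\tO
of\tO
the\tO
Federal\tB-ORG
government\tI-ORG"""

rows = [line.split("\t") for line in sample.splitlines()]
for label, tokens in iob_to_spans(rows):
    print(label, " ".join(tokens))
```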

We can produce annotations in several languages using the web API of [NameTag](https://lindat.mff.cuni.cz/services/nametag/).

---