# Getting started

## Read the docs
* [spaCy](https://spacy.io/api/doc)

## Install spaCy
1. Basic installation: `conda install -c conda-forge spacy`
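2. Download the German model: `python -m spacy download de` (the `de` shortcut works on spaCy 2.x; spaCy 3.x needs a full model name such as `de_core_news_sm`)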

# Wish
I wish that in spaCy, an entity could be part of another entity. This is intentionally not the case (see for instance [here](https://github.com/explosion/spaCy/issues/2550)).

## Why is this a problem?
For some NLP tasks, you need to handle overlapping annotations. These are not necessarily annotations of the same type; they can be hierarchical annotations, such as paragraphs that contain sentences, phrases that contain tokens, or tokens that contain subword units.
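
To make "hierarchical" concrete, here is a minimal sketch that does not use spaCy at all. It stores annotations as plain character offsets and looks up the smallest enclosing annotation for each one. The `Annotation` class, the helper names and the short sample text are assumptions made up for this illustration.

```python
from typing import List, NamedTuple, Optional


class Annotation(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


def contains(outer: Annotation, inner: Annotation) -> bool:
    """True if `inner` lies inside `outer` and does not cover the same range."""
    same_range = (outer.start, outer.end) == (inner.start, inner.end)
    return outer.start <= inner.start and inner.end <= outer.end and not same_range


def parent_of(child: Annotation, annotations: List[Annotation]) -> Optional[Annotation]:
    """Return the smallest annotation that encloses `child`, or None."""
    enclosing = [a for a in annotations if contains(a, child)]
    return min(enclosing, key=lambda a: a.end - a.start, default=None)


# hypothetical sample text and hand-written offsets
text = "Es war einmal. Ein Müller war arm."
annotations = [
    Annotation(0, 34, "paragraph"),
    Annotation(0, 14, "sentence"),
    Annotation(15, 34, "sentence"),
    Annotation(19, 25, "token"),  # "Müller"
]

for a in annotations:
    parent = parent_of(a, annotations)
    print(a.label, repr(text[a.start:a.end]), "->", parent.label if parent else None)
```

Here the token "Müller" is inside the second sentence, which is inside the paragraph: three annotations cover the same characters without conflicting, because each level is stored independently.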

Let's see an example.

In [158]:
import spacy
nlp = spacy.load("de")

We will use the first few lines of Rumpelstiltskin as example text.

In [163]:
text = "Rumpelstilzchen\n\
Ein Märchen der Brüder Grimm\n\
Es war einmal ein Müller, der war arm, aber er hatte eine schöne Tochter."

print(text)
# process the document with spacy
doc = nlp(text, disable=['ner'])

Rumpelstilzchen
Ein Märchen der Brüder Grimm
Es war einmal ein Müller, der war arm, aber er hatte eine schöne Tochter.


Now, add some custom annotations.

In [164]:
from spacy.tokens.span import Span
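# create a span for the custom 'headline' annotation (characters 0 to 15);
# char_span only builds the span, it is not attached to the doc yet
headline = doc.char_span(0, 15, label="headline")
print(f"headline: {headline}")

# create a span for the custom 'subheader' annotation (characters 16 to 44)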
subheader = doc.char_span(16, 44, label="subheader")
print(f"subheader: {subheader}")


headline: Rumpelstilzchen
subheader: Ein Märchen der Brüder Grimm


Let's see if we can store these annotations as entities in the document...

In [165]:
doc.ents = list(doc.ents) + [headline, subheader]
for entity in doc.ents:
    print(f"{entity.label_}: {entity}")

headline: Rumpelstilzchen
subheader: Ein Märchen der Brüder Grimm


All good so far. Let's add another annotation. This time, we want to mark the author as an entity. 

In [166]:
# add the custom 'author' annotation
author = doc.char_span(32, 44, label="author")
print(f"author: {author}")

author: Brüder Grimm


In [167]:
doc.ents = list(doc.ents) + [author]

ValueError: [E098] Trying to set conflicting doc.ents: '(2, 7, 'subheader')' and '(5, 7, 'author')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

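Bam. That does not work: 'author' (tokens 5 to 7) lies inside 'subheader' (tokens 2 to 7), and spaCy lets a token belong to only one entity.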
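
The rule behind the error can be sketched in a few lines of plain Python. This is an illustration of the constraint, not spaCy's actual implementation; `find_conflicts` is a hypothetical helper. The token offsets for 'subheader' and 'author' come from the error message above, and the offsets for 'headline' are an assumption.

```python
from itertools import combinations
from typing import List, Tuple

# (start_token, end_token, label), end exclusive, as in the error message
TokenSpan = Tuple[int, int, str]


def find_conflicts(spans: List[TokenSpan]) -> List[Tuple[TokenSpan, TokenSpan]]:
    """Return every pair of spans that share at least one token."""
    return [
        (a, b)
        for a, b in combinations(spans, 2)
        if a[0] < b[1] and b[0] < a[1]  # half-open ranges overlap
    ]


headline = (0, 1, "headline")    # assumed token offsets
subheader = (2, 7, "subheader")  # from the error message
author = (5, 7, "author")        # from the error message

print(find_conflicts([headline, subheader]))
# → []
print(find_conflicts([headline, subheader, author]))
# → [((2, 7, 'subheader'), (5, 7, 'author'))]
```

A check like this could be run before assigning to `doc.ents` to see which annotations would have to be stored somewhere else.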

## Relation extraction as a use-case scenario

Assume you have a **relation extraction** task, for instance extracting a person's role in a political party. If you are not familiar with this NLP task, the [repository by Sebastian Ruder](https://github.com/sebastianruder/NLP-progress/blob/master/english/relationship_extraction.md) is a good starting point for an overview. For a supervised extraction model, you have to annotate some examples. One might look like this:

In [168]:
text = "Frau Müller ist SPD-Vorstand."
doc = nlp(text)

# print the Named Entities that are recognized with the default model
for entity in doc.ents:
    print(entity, entity.label_)

SPD-Vorstand ORG


spaCy's NER model already marked 'SPD-Vorstand' as a named entity (ORG). For our scenario, we need more detail: we would like to mark the characters 'SPD' as party and the characters 'Vorstand' as role. Let's try to do this.

In [169]:
# add the annotations
custom_annotations = []
party = doc.char_span(16, 19, label="party")
custom_annotations.append(party)
role = doc.char_span(20, 28, label="role")
custom_annotations.append(role)

for annotation in custom_annotations:
    print(annotation)

None
None


Oops. No annotations were created: `char_span` silently returned `None` for both. A first guess is that we tried to add spans inside an already existing annotation, the ORG entity. spaCy should at least have thrown an error, right?
Let's try again, this time with the NER component disabled so that no named entities are added.

In [170]:
doc = nlp(text, disable=['ner'])
assert len(doc.ents) == 0

# add the annotations
custom_annotations = []
party = doc.char_span(16, 19, label="party")
custom_annotations.append(party)
role = doc.char_span(20, 28, label="role")
custom_annotations.append(role)

for annotation in custom_annotations:
    print(annotation)

None
None


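This **still** does not work. Guess why? Because 'SPD-Vorstand' is a single token, and spans in spaCy are built from whole tokens. The character ranges 16 to 19 and 20 to 28 each cover only part of that token, so `char_span` returns `None`. Again, we are not allowed to add an annotation that covers only a subsequence of a token.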
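
The alignment rule can be reproduced without spaCy. The sketch below is an illustration only, not spaCy's code: `align_to_tokens` is a hypothetical helper, and the token offsets are written by hand under the assumption that the tokenizer splits the sentence into `Frau`, `Müller`, `ist`, `SPD-Vorstand` and `.`.

```python
from typing import List, Optional, Tuple

text = "Frau Müller ist SPD-Vorstand."
# assumed tokenization: (start_char, end_char) of each token, end exclusive
token_offsets = [(0, 4), (5, 11), (12, 15), (16, 28), (28, 29)]


def align_to_tokens(
    start: int, end: int, offsets: List[Tuple[int, int]]
) -> Optional[Tuple[int, int]]:
    """Map a character range to a token range, or None if it cuts a token."""
    token_starts = {s: i for i, (s, _) in enumerate(offsets)}
    token_ends = {e: i + 1 for i, (_, e) in enumerate(offsets)}
    if start not in token_starts or end not in token_ends:
        return None  # a boundary falls inside a token
    return token_starts[start], token_ends[end]


print(align_to_tokens(16, 19, token_offsets))  # 'SPD'          → None
print(align_to_tokens(20, 28, token_offsets))  # 'Vorstand'     → None
print(align_to_tokens(16, 28, token_offsets))  # 'SPD-Vorstand' → (3, 4)
```

Only the range that starts and ends on token boundaries maps to a token span; both sub-token ranges are rejected, which matches the two `None` results above.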