# NER

spaCy vs. Flair

In [2]:
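import io  # only needed when loading the PDF from a URL
import re
import string

import requests  # only needed when loading the PDF from a URL
from flair.data import Sentence
from flair.models import SequenceTagger
from tika import parser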



In the next cell, uncomment the part of the code that matches your situation. The first part loads the PDF from the E-Periodica website without saving it to your machine; the second part is for a PDF you have already downloaded.

### PDF
If your input is a PDF, either from a website or from your computer, run these cells:

In [21]:
# If you don't want to save the PDF locally, uncomment the next four lines and comment out the last one.
# url = "https://www.e-periodica.ch/cntmng?pid=grs-002%3A1984%3A76%3A%3A218"
# r = requests.get(url)
# f = io.BytesIO(r.content)
# parsed = parser.from_buffer(f)

# If you've downloaded the PDF onto your computer, use the following line:
parsed = parser.from_file('data/ner_data/grs-002_1984_76__219_d.pdf')

The E-Periodica PDFs already have the OCR embedded into them, so all you need to do is extract the text and clean it up.

In [4]:
#pdf
contents = [x.strip() for x in parsed["content"].split("\n") if x != ""]
#remove the first page
article = contents[contents.index('https://www.e-periodica.ch/digbib/about3?lang=en')+1:]
article = " ".join(article)

article = re.sub(r"¬\s+", "", article)  # rejoin words split at line ends: drop the "¬" hyphenation mark and any whitespace after it
article = article.strip()  # remove all leading and trailing whitespace
article = re.sub(r"\n", " ", article)  # replace newlines with spaces
article = re.sub(r"\. ", ".\n", article)  # break after each period (note: the next line collapses these breaks again)
article = re.sub(r"\s+", " ", article)  # collapse runs of whitespace, including newlines, into a single space
article = re.sub(r"\\", "", article)  # remove any stray backslashes
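
The cleanup steps above can be bundled into one function, which makes them easy to try on a short sample before running them on a whole article. This is only a minimal sketch: the name `clean_ocr_text` and the sample string are illustrative and not part of the original notebook.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Normalize OCR text extracted from a PDF (hypothetical helper)."""
    # Rejoin words that were split at a line end: drop the "¬" mark
    # together with the whitespace that follows it.
    text = re.sub(r"¬\s+", "", text)
    # Collapse newlines, tabs and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text)
    # Remove stray backslashes and any leading or trailing whitespace.
    return text.replace("\\", "").strip()


sample = "Die Gewerk¬\nschaft   fordert\nDatenschutz.  "
print(clean_ocr_text(sample))  # Die Gewerkschaft fordert Datenschutz.
```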

#### Text Output
If you would like to save the extracted text as a plain text file, run this cell:

In [None]:
output_filename = "data/ner_data/grs-002_1984_76__219_d.txt"
with open(output_filename, "w", encoding="utf-8") as f:  # explicit encoding keeps umlauts intact on every platform
    f.write(article)

### Text
If you have the input as a text file, run this cell:

In [None]:
#text
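
The cell above is left empty in the notebook. One way it might be filled in is sketched here, assuming a UTF-8 file such as the one written in the Text Output step. The helper name `load_text` is hypothetical.

```python
from pathlib import Path


def load_text(path: str) -> str:
    """Read a UTF-8 text file and collapse all whitespace to single spaces."""
    raw = Path(path).read_text(encoding="utf-8")
    # split() without arguments splits on any run of whitespace,
    # so newlines and repeated spaces are normalized in one step.
    return " ".join(raw.split())


# Example, using the path from the Text Output cell:
# article = load_text("data/ner_data/grs-002_1984_76__219_d.txt")
```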

### XML
If you have the input as an XML file, run this cell:

In [None]:
#xml
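
This cell is also empty in the notebook. How to extract the text depends on the XML schema, which is not specified here, so the following is only a generic sketch that discards all tags and keeps the text nodes. The helper name `xml_to_text` and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string: str) -> str:
    """Concatenate all text nodes of an XML document; tags are discarded."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text of every element in document order.
    pieces = " ".join(root.itertext())
    # Normalize whitespace. Note that inline markup inside a sentence
    # can leave an extra space before punctuation.
    return " ".join(pieces.split())


sample_xml = "<article><title>Datenschutz</title><p>Der SGB tagt in Zürich.</p></article>"
print(xml_to_text(sample_xml))  # Datenschutz Der SGB tagt in Zürich.
```

For a file on disk, `ET.parse(path).getroot()` can take the place of `ET.fromstring`.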

### Image
If your input is only available as an image file, run the OCR notebook first, save the resulting text files, and then use the cells for Text input.

## Flair
First, we'll show you how named entity tagging works with FlairNLP.

In [11]:
# load tagger, this might take a while
tagger = SequenceTagger.load("flair/ner-german-large")

# predict on the article
sentence = Sentence(article)
tagger.predict(sentence)

2025-03-19 15:49:32,425 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, B-PER, E-PER, S-LOC, B-MISC, I-MISC, E-MISC, S-PER, B-ORG, E-ORG, S-ORG, I-ORG, B-LOC, E-LOC, S-MISC, I-PER, I-LOC, <START>, <STOP>


In [12]:
people = []
places = []
organisations = []

# translation table that deletes all ASCII punctuation
strip_punct = str.maketrans('', '', string.punctuation)

for entity in sentence.get_spans('ner'):
    # The punctuation-free text is only used for the length check, which filters out
    # short fragments caused by OCR mistakes; the original entity text is what gets stored.
    cleaned = entity.text.translate(strip_punct)
    if len(cleaned) < 3:  # names are usually not shorter than three characters
        continue
    if entity.tag == "PER":  # people
        people.append(entity.text)
    elif entity.tag == "LOC":  # places
        places.append(entity.text)
    elif entity.tag == "ORG":  # organisations
        organisations.append(entity.text)

In [13]:
people

['Friedrich', 'Sulzer', 'Hansruedi Isler']

In [14]:
places

[]

In [15]:
organisations

['Verkauf Handel Transport Lebensmittel',
 'VHTL',
 'Fisco-Findus',
 'Hero Conserven',
 'VHTL',
 'VHTL']
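
Because the same entity can be tagged several times (VHTL appears three times above), it can be useful to count mentions instead of keeping a flat list. A minimal sketch using only the standard library, with the list copied from the output above:

```python
from collections import Counter

# Organisations as returned by the Flair cell above.
organisations = [
    "Verkauf Handel Transport Lebensmittel",
    "VHTL",
    "Fisco-Findus",
    "Hero Conserven",
    "VHTL",
    "VHTL",
]

# Count how often each organisation was tagged.
org_counts = Counter(organisations)

# most_common() sorts by descending count.
for name, count in org_counts.most_common():
    print(f"{count}x {name}")
```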

But this only covers people, places and organisations. What if you also want to extract dates, times and cardinal numbers? For that you need a model trained on the OntoNotes tag set. FlairNLP only offers this model trained on English content, but as you'll see, it works surprisingly well even on German text.

In [16]:
tagger_onto = SequenceTagger.load("flair/ner-english-ontonotes-large")

2025-03-19 15:50:47,026 SequenceTagger predicts: Dictionary with 76 tags: <unk>, O, B-CARDINAL, E-CARDINAL, S-PERSON, S-CARDINAL, S-PRODUCT, B-PRODUCT, I-PRODUCT, E-PRODUCT, B-WORK_OF_ART, I-WORK_OF_ART, E-WORK_OF_ART, B-PERSON, E-PERSON, S-GPE, B-DATE, I-DATE, E-DATE, S-ORDINAL, S-LANGUAGE, I-PERSON, S-EVENT, S-DATE, B-QUANTITY, E-QUANTITY, S-TIME, B-TIME, I-TIME, E-TIME, B-GPE, E-GPE, S-ORG, I-GPE, S-NORP, B-FAC, I-FAC, E-FAC, B-NORP, E-NORP, S-PERCENT, B-ORG, E-ORG, B-LANGUAGE, E-LANGUAGE, I-CARDINAL, I-ORG, S-WORK_OF_ART, I-QUANTITY, B-MONEY


In [17]:
sent_test = Sentence("Dem will eine Datenschutztagung an der ETH Zürich dienen, die von einer Gruppe Gewerkschaftern organisiert und vom SGB und seinen Verbänden unterstützt wird. Sie findet Samstag, den 31. März 1984 ab 9.15 Uhr, ganztägig statt.")
tagger_onto.predict(sent_test)

In [18]:
sent_test.get_spans('ner') #not bad, huh?

[Span[5:8]: "der ETH Zürich" → ORG (1.0000),
 Span[18:19]: "SGB" → ORG (1.0000),
 Span[27:33]: "Samstag, den 31. März 1984" → DATE (1.0000),
 Span[34:36]: "9.15 Uhr" → TIME (0.8940)]

## Spacy
Now we'll do the same thing with spaCy. On our toy examples the two libraries find largely the same core entities, but the results are not identical, as the outputs below show. Keep in mind that spaCy is significantly faster.

In [10]:
import spacy

### German NER
First, just as with FlairNLP, we show the results for regular German entity tagging.

In [6]:
nlp_de = spacy.load("de_core_news_lg")
doc_de = nlp_de(article)

In [20]:
nlp_de.get_pipe('ner').labels

('LOC', 'MISC', 'ORG', 'PER')

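Unlike FlairNLP, though, spaCy has a really nice visualization capability. Note that `displacy.serve` starts a local web server and blocks the cell until you stop it; inside a notebook, `spacy.displacy.render(doc_de, style="ent")` draws the same view inline instead.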

In [8]:
spacy.displacy.serve(doc_de, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [11]:
# Find named entities, phrases and concepts
for entity in doc_de.ents:
    print(entity.text, entity.label_)

SGB-Arbeitsprogramm MISC
Gewerkschaften ORG
Zeit MISC
Datenschutzgesetz MISC
Eidgenössischen Justiz- und Polizeidepartement ORG
Friedrich PER
Betriebskommission ORG
Gewerkschaft ORG
Sulzer ORG
Datenschutz-Artikel LOC
Gewerkschaft ORG
Verkauf Handel Transport Lebensmittel ORG
VHTL MISC
Fisco-Findus ORG
Hero Conserven ORG
Geheimbereich und Datenschutz MISC
VHTL MISC
VHTL MISC
Hansruedi Isler PER
Handkartei LOC
3einer MISC
Betriebsinterne PER
Arbeitnehmerdaten LOC
Gewerkschaft ORG
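
To see where the two German models agree, the ORG entities from both outputs can be compared as sets. This is a small sketch with the entity names copied by hand from the Flair and spaCy outputs above; the variable names are illustrative.

```python
# ORG entities copied from the Flair output above.
flair_orgs = {
    "Verkauf Handel Transport Lebensmittel",
    "VHTL",
    "Fisco-Findus",
    "Hero Conserven",
}

# ORG entities copied from the spaCy output above (duplicates removed).
spacy_orgs = {
    "Gewerkschaften",
    "Eidgenössischen Justiz- und Polizeidepartement",
    "Betriebskommission",
    "Gewerkschaft",
    "Sulzer",
    "Verkauf Handel Transport Lebensmittel",
    "Fisco-Findus",
    "Hero Conserven",
}

both = flair_orgs & spacy_orgs        # tagged as ORG by both models
only_flair = flair_orgs - spacy_orgs  # ORG for Flair only
only_spacy = spacy_orgs - flair_orgs  # ORG for spaCy only

print("both:", sorted(both))
print("only Flair:", sorted(only_flair))
print("only spaCy:", sorted(only_spacy))
```

In this comparison, "VHTL" is the only ORG that Flair finds and spaCy does not (spaCy labels it MISC), while spaCy additionally tags generic nouns such as "Gewerkschaft" as organisations.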


But what if we have some text with lots of times and dates?

In [12]:
doc_2 = nlp_de("Dem will eine Datenschutztagung an der ETH Zürich dienen, die von einer Gruppe Gewerkschaftern organisiert und vom SGB und seinen Verbänden unterstützt wird. Sie findet Samstag, den 31. März 1984 ab 9.15 Uhr, ganztägig statt.")
# Find named entities, phrases and concepts
for entity in doc_2.ents:
    print(entity.text, entity.label_)

ETH Zürich ORG
Gewerkschaftern ORG
SGB MISC


### English OntoNotes NER
In that case we need to step it up. As with FlairNLP, we switch to a model trained on the OntoNotes tag set, which is again only available for English but gives us many more labels to use.

In [None]:
nlp = spacy.load("en_core_web_trf")
doc = nlp(article)

In [None]:
nlp.get_pipe('ner').labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')

In [None]:
spacy.displacy.serve(doc, style="ent")

In [9]:
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

im Kapitel PERSON
Jahresanfang 1984 DATE
Friedrich PERSON
Jede Vereinbarung PERSON
Bewusstsein PERSON
dem 19. Juli 1983 DATE
Sulzer ORG
VHTL ORG
Fisco-Findus ORG
Hero Conserven ORG
Artikel PERSON
I.Januar 1984 DATE
1 CARDINAL
3 CARDINAL
4 CARDINAL
5 CARDINAL
VHTL ORG
zwei CARDINAL
Mängel PERSON
VHTL ORG
Hansruedi Isler PERSON
5 CARDINAL
Betriebsinterne Gründe ORG
15 CARDINAL


But does it also pick up the dates and times in our German test sentence?

In [16]:
doc_2 = nlp("Dem will eine Datenschutztagung an der ETH Zürich dienen, die von einer Gruppe Gewerkschaftern organisiert und vom SGB und seinen Verbänden unterstützt wird. Sie findet Samstag, den 31. März 1984 ab 9.15 Uhr, ganztägig statt.")
# Find named entities, phrases and concepts
for entity in doc_2.ents:
    print(entity.text, entity.label_)

ETH Zürich ORG
SGB ORG
Samstag DATE
den 31. März 1984 DATE
9.15 Uhr TIME


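It sure does! The dates and the time are recognized even though this model was trained only on English data. That suggests that some of what it learns about entity tagging generalizes across languages. The article-level output above is noisier, though: ordinary German nouns such as "Mängel" and "Bewusstsein" are tagged as PERSON, so check the results before relying on them.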