# NER
Named entity recognition is the cornerstone for many applications. From automatically [building a database of herbs](https://www.nature.com/articles/s41598-023-50179-0#additional-information) to [end-to-end Named Entity Linking](https://zenodo.org/records/13907910), it all begins with simply recognizing what entities are mentioned in text.

For German text, the two frameworks we'll introduce are [SpaCy](https://spacy.io/models) and [FlairNLP](https://github.com/flairNLP/flair?tab=readme-ov-file). We show each step for both and encourage you to test out both on your datasets to see which one performs better in your particular use-case.

In [None]:
from tika import parser
import string
from flair.data import Sentence
from flair.models import SequenceTagger
import re
import requests
import io
from bs4 import BeautifulSoup
import spacy
from xml.etree.ElementTree import Element, SubElement, ElementTree


In the PDF cell below, uncomment the part of the code that matches your situation.

The first part loads the PDF from the E-Periodica website without saving it to your machine; the second part handles a PDF you have already downloaded.

## Loading the text
Below is the code to load several input types (.pdf, .txt, .xml, and image files).

### PDF input
If you use PDFs from a website or from your computer, run these cells:

In [None]:
# To load the PDF without saving it locally, keep the next four lines active and leave the last line commented out.
url = "https://www.e-periodica.ch/cntmng?pid=grs-001%3A1921%3A13%3A%3A298"
r = requests.get(url)
f = io.BytesIO(r.content)
parsed = parser.from_buffer(f)

# If you've downloaded the PDF onto your computer, uncomment the following line and comment out the previous four:
#parsed = parser.from_file('data/ner_data/grs-001_1921_13__298_d.pdf')

The E-Periodica PDFs already have the OCR embedded into them, so all you need to do is extract the text and clean it up.

In [None]:
#pdf
contents = [x.strip() for x in parsed["content"].split("\n") if x != ""]
# drop the first page, which only holds metadata
article = contents[contents.index('https://www.e-periodica.ch/digbib/terms?lang=en')+1:]
article = "\n".join(article)  # join with newlines so words that were split across line breaks can be put back together

article = re.sub("¬\n", "", article)  # rejoin hyphenated words: remove the "¬" hyphenation mark together with the line break that follows it
article = article.strip()  # remove leading and trailing whitespace
article = re.sub("\n", " ", article)  # replace newlines with spaces
article = re.sub(r'\s+', " ", article)  # collapse repeated whitespace into a single space
article = re.sub(r'\\', "", article)  # remove backslashes
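To see what the cleanup does, here is a minimal sketch that applies the same substitutions to a short made-up string. The function name clean_ocr_text and the sample sentence are assumptions for illustration only. The sketch removes backslashes before collapsing whitespace so that no double space is left behind, which is a small reordering of the cell above.

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Apply the cleanup substitutions to a raw OCR string (illustrative helper)."""
    text = re.sub("¬\n", "", raw)      # rejoin words hyphenated across a line break
    text = text.replace("\\", "")      # drop stray backslashes
    text = text.strip()                # remove leading and trailing whitespace
    text = re.sub(r"\s+", " ", text)   # turn newlines and repeated whitespace into one space
    return text

# Hypothetical OCR output with a hyphenated word, extra spaces and a stray backslash
sample = "Die Zwangs¬\nmassregeln   des\nBundes \\ gelten.\n"
print(clean_ocr_text(sample))  # prints: Die Zwangsmassregeln des Bundes gelten.
```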

This differs slightly from how we read the embedded text from the PDF in 01_text_recognition. The resulting text is the same, but going through requests and a parser is more general: you can load the PDF from your computer or from the internet and then work with the result in exactly the same way.

#### Text Output
If you would like to now save this PDF text as a simple text file, run this cell:

In [None]:
output_filename = "data/ner_data/output/grs-001_1921_13__298_d.txt"
with open(output_filename, "w", encoding="utf-8") as f:
    f.write(article)  # article is a single string, so write() is enough

### Text Input
If you have the input as a text file, run this cell:

In [None]:
#text
input_filepath = "data/ner_data/output/grs-001_1921_13__298_d.txt"
with open(input_filepath, "r", encoding="utf-8") as f:
    article = f.read()

### XML Input
If you have the input as an XML file, run this cell:

In [None]:
#xml
input_filepath = "./data/ocr_data/grs-001_1921_13__298_d_tei.xml"
with open(input_filepath, "r", encoding="utf-8") as f:
    article = f.read()
soup = BeautifulSoup(article, features="xml")
pageText = soup.find_all(string=True)  # collect every text node in the document
article = " ".join(pageText)
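BeautifulSoup's "xml" feature relies on lxml being installed. If that is not available, the same text extraction can be sketched with the standard library alone. The miniature XML string below is a made-up stand-in for the real TEI file, not its actual content.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TEI document standing in for the real file
sample_xml = """<TEI><text><body>
<p>Der Bundesrat tagt in Bern.</p>
<p>Die ETH liegt in Zürich.</p>
</body></text></TEI>"""

root = ET.fromstring(sample_xml)
# itertext() yields every text node in document order;
# split() followed by join() collapses the whitespace between elements
article_text = " ".join(" ".join(root.itertext()).split())
print(article_text)  # prints: Der Bundesrat tagt in Bern. Die ETH liegt in Zürich.
```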

In [None]:
article

### Image Input
If you only have image files, work through the 01_text_recognition notebook first, save the resulting text files, and then run the cell under Text Input.

## Running Tagging

### Flair
First, we'll show you how named entity tagging works with FlairNLP.

In [None]:
# load tagger, this might take a while but german-large has much better performance than the smaller models
tagger = SequenceTagger.load("flair/ner-german-large")

# predict on the article you loaded above.
sentence = Sentence(article)
tagger.predict(sentence)

Let's extract certain tags.

In [None]:
people = []
places = []
organisations = []

# one list per tag we want to keep
buckets = {"PER": people, "LOC": places, "ORG": organisations}
# used to strip punctuation before measuring length, so OCR debris is not counted
strip_punct = str.maketrans('', '', string.punctuation)

for entity in sentence.get_spans('ner'):
    # names, places and organisations are usually not shorter than 3 characters
    if entity.tag in buckets and len(entity.text.translate(strip_punct)) >= 3:
        buckets[entity.tag].append(entity.text)
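The length filter can be tried without loading a model. The sketch below runs it on hand-written (text, tag) pairs; these pairs and the helper name keep are assumptions for illustration, not output from the tagger.

```python
import string

# Hypothetical (text, tag) pairs standing in for the spans a tagger would return
mock_spans = [("Bern", "LOC"), ("..", "PER"), ("S.", "PER"), ("SGB", "ORG"), ("E-T", "ORG")]

strip_punct = str.maketrans("", "", string.punctuation)

def keep(text: str, min_len: int = 3) -> bool:
    """True if at least min_len characters remain once punctuation is removed."""
    return len(text.translate(strip_punct)) >= min_len

kept = [(text, tag) for text, tag in mock_spans if keep(text)]
print(kept)  # prints: [('Bern', 'LOC'), ('SGB', 'ORG')]
```

Note that "E-T" is dropped because only two characters remain after the hyphen is removed, so the threshold may need adjusting for corpora with many short abbreviations.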

In [None]:
people

It seems we chose an article with no people mentioned. This can happen frequently, especially if the article is about international relations or laws.

In [None]:
places

In [None]:
organisations

But this only covers people, places, and organizations. What if you also want to extract dates, times, and cardinal numbers? For that you need a model trained on [OntoNotes](https://catalog.ldc.upenn.edu/LDC2013T19). FlairNLP only offers this model trained on English content, but as you'll see, it works surprisingly well even on German text.

In [None]:
tagger_onto = SequenceTagger.load("flair/ner-english-ontonotes-large")

In [None]:
sent_test = Sentence("Dem will eine Datenschutztagung an der ETH Zürich dienen, die von einer Gruppe Gewerkschaftern organisiert und vom SGB und seinen Verbänden unterstützt wird. Sie findet Samstag, den 31. März 1984 ab 9.15 Uhr, ganztägig statt.")
tagger_onto.predict(sent_test)

In [None]:
sent_test.get_spans('ner')

Not bad at all, right?

### SpaCy
Now we'll do the same thing, but with SpaCy. The results for our toy examples are identical between SpaCy and FlairNLP, but keep in mind that SpaCy is significantly faster on larger amounts of data.

#### German NER
First, just as with FlairNLP, we show the results for regular German entity tagging.

In [None]:
nlp_de = spacy.load("de_core_news_lg")
doc_de = nlp_de(article)

What tags are even possible for this model?

In [None]:
nlp_de.get_pipe('ner').labels

Unlike FlairNLP, though, SpaCy has a really nice visualization capability. (The "serve" function keeps running until you interrupt it, so don't forget to stop the execution of the cell.)

In [None]:
spacy.displacy.serve(doc_de, style="ent")

In [None]:
# Find named entities, phrases and concepts
for entity in doc_de.ents:
    print(entity.text, entity.label_)

But what if we have some text with lots of time and dates?

In [None]:
doc_2 = nlp_de("Dem will eine Datenschutztagung an der ETH Zürich dienen, die von einer Gruppe Gewerkschaftern organisiert und vom SGB und seinen Verbänden unterstützt wird. Sie findet Samstag, den 31. März 1984 ab 9.15 Uhr, ganztägig statt.")
# Find named entities, phrases and concepts
for entity in doc_2.ents:
    print(entity.text, entity.label_)

#### English Onto NER
In that case we need to step it up. Once again we use a model trained on the OntoNotes tags, which is only available for English, and we get many more labels to use.

In [None]:
nlp = spacy.load("en_core_web_trf")
doc = nlp(article)

In [None]:
nlp.get_pipe('ner').labels

Now let's visualize it! SpaCy has a built-in visualizer, but be careful: in a Jupyter notebook the "serve" function will not stop on its own, so you have to interrupt the cell before you can run the next ones. There is a function made specifically for Jupyter notebooks, "displacy.render", but it does not display properly in our case.

In [None]:
spacy.displacy.serve(doc, style="ent")

In [None]:
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

But does it work on our German text?

In [None]:
doc_2 = nlp("Dem will eine Datenschutztagung an der ETH Zürich dienen, die von einer Gruppe Gewerkschaftern organisiert und vom SGB und seinen Verbänden unterstützt wird. Sie findet Samstag, den 31. März 1984 ab 9.15 Uhr, ganztägig statt.")
# Find named entities, phrases and concepts
for entity in doc_2.ents:
    print(entity.text, entity.label_)

It sure does! Looks great, despite the fact that this model was only trained on English data. That suggests there's some kind of generality to the rules it is learning for entity tagging.

## Text to TEI XML with SpaCy NER
Digital humanities projects mostly work with XML files, and NER lends itself to being incorporated into a typical XML file structure. Here we take a text file, run NER on it and save it as a TEI XML file.

In [None]:
txt_file_path = "data/embedding_data/grs-001_1921_13__298_d.txt"
output_file_path = "data/ocr_data/output/grs-001_1921_13__298_d_tei.xml"
output_file_ner = "data/ocr_data/output/grs-001_1921_13__298_d_tei_ner.xml"

In [None]:
nlp = spacy.load("de_core_news_lg")

First we save the .txt file into a TEI XML file.

In [None]:
def create_tei_from_txt(txt_file_path, output_file_path, paragraph_delimiter="\n", page_delimiter="\n\n"):
    with open(txt_file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    
    pages = text.split(page_delimiter)
    paragraphs = [x.split(paragraph_delimiter) for x in pages]

    tei = Element('TEI')  # root element (a schema-valid TEI file would also need a teiHeader)
    text_section = SubElement(tei, 'text')
    body = SubElement(text_section, 'body')
    
    for page in paragraphs:
        SubElement(body, 'pb')  # page break: in TEI this is an empty milestone element
        for paragraph in page:
            p_para = SubElement(body, 'p')  # paragraphs follow the page break as siblings
            p_para.text = paragraph
    
    # Generate the output XML file
    tree = ElementTree(tei)
    tree.write(output_file_path, encoding='utf-8', xml_declaration=True)
    
    print(f"TEI file created: {output_file_path}")

In [None]:
create_tei_from_txt(txt_file_path, output_file_path)
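How the two delimiters carve up a text is easiest to see on a toy string. The sketch below uses the function's default delimiters on a made-up sample; the sample text is an assumption for illustration.

```python
# Hypothetical text: two paragraphs on the first page, one on the second
sample = "Erster Absatz.\nZweiter Absatz.\n\nDritter Absatz auf Seite zwei."

pages = sample.split("\n\n")                        # a blank line separates pages
paragraphs = [page.split("\n") for page in pages]   # a single newline separates paragraphs

for number, page in enumerate(paragraphs, start=1):
    print(number, page)
# prints:
# 1 ['Erster Absatz.', 'Zweiter Absatz.']
# 2 ['Dritter Absatz auf Seite zwei.']
```

Text that went through the whitespace-collapsing cleanup in the PDF section no longer contains any newlines, so it would end up as a single page with a single paragraph. If you need page or paragraph structure in the TEI file, keep the newlines when saving the text.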

Then we run NER on said file, and save it with the named entities tagged in it.

In [None]:
def create_ner_tei_from_tei(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        xml_doc = f.read()

    soup = BeautifulSoup(xml_doc, "xml")

    # map model labels to TEI element names; change the keys for flair
    tag_names = {"PER": "persName", "ORG": "orgName", "GPE": "placeName", "LOC": "placeName"}

    for entry in soup.find_all(string=True):
        text = str(entry)
        doc = nlp(text) #change here for flair
        if not doc.ents:
            continue  # nothing to tag in this text node

        # Build the new nodes directly instead of re-parsing a string with html.parser:
        # this keeps the capitalisation of the element names and escapes special characters.
        cursor = 0
        for ent in doc.ents: #change the enumeration of the entities for flair
            # plain text between the previous entity and this one
            entry.insert_before(soup.new_string(text[cursor:ent.start_char]))
            # the entity itself, wrapped in its TEI element (other labels are used as-is)
            element = soup.new_tag(tag_names.get(ent.label_, ent.label_))
            element.string = ent.text
            entry.insert_before(element)
            cursor = ent.end_char

        # the text after the last entity replaces the original node
        entry.replace_with(soup.new_string(text[cursor:]))
    
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(soup.prettify())

In [None]:
create_ner_tei_from_tei(output_file_path, output_file_ner)
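The core idea, wrapping character spans in inline tags, can be sketched without a model or an XML library. In the example below the entity offsets are computed by hand from a made-up sentence, and tag_entities and TAG_MAP are hypothetical names for illustration. The mapping uses persName, which is the element name TEI defines for personal names.

```python
from xml.sax.saxutils import escape

# Assumed mapping from model labels to TEI element names
TAG_MAP = {"PER": "persName", "ORG": "orgName", "LOC": "placeName", "GPE": "placeName"}

def tag_entities(text, entities):
    """Wrap each (start, end, label) span in an inline tag.

    entities must be sorted by start offset and must not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in entities:
        tag = TAG_MAP.get(label, label)              # unknown labels are used as-is
        pieces.append(escape(text[cursor:start]))    # text before the entity
        pieces.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        cursor = end
    pieces.append(escape(text[cursor:]))             # text after the last entity
    return "".join(pieces)

text = "Die Tagung an der ETH Zürich wird vom SGB unterstützt."
# Hand-made entity spans; a real run would take them from the NER model
entities = [
    (text.index("ETH Zürich"), text.index("ETH Zürich") + len("ETH Zürich"), "ORG"),
    (text.index("SGB"), text.index("SGB") + len("SGB"), "ORG"),
]
print(tag_entities(text, entities))
```

Working with a moving cursor instead of patching the string in place means no offset correction is needed after each inserted tag, and escaping each piece keeps characters such as "&" or "<" from breaking the XML.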

I know the results look somewhat disappointing here, for instance "Zwangsmassregeln" being tagged as a person is less than ideal...

Picking the proper model for your dataset (or fine-tuning or training one yourself) is the biggest part of this process; everything else can be automated. Here, I simply chose SpaCy's biggest German model, which was trained on news stories. A FlairNLP model might do better, or even a model trained on books.