# spaCy Named Entity Annotation using Regular Expressions

(C) 2024 by [Damir Cavar](http://damir.cavar.me/)

**Version:** 0.1, November 2024

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

The following code shows how simple regular expression matches can be converted to spaCy Named Entity labels. The goal is to produce a first-pass annotation of entities, correct those annotations manually, and generate a corpus for training neural annotation models.

In [None]:
import spacy
from spacy.util import minibatch, compounding
from spacy.training.example import Example
import re
import random
from pathlib import Path

In [None]:
nlp = spacy.load("en_core_web_trf")

In [None]:
banks = """
Bank of America
Bank of America Corporation
Citigroup
Citigroup Inc.
Citi
Goldman Sachs
The Goldman Sachs Group, Inc.
Goldman Sachs Group, Inc.
JP Morgan
JPMorgan Chase
JPMorgan Chase & Co.
JPMorganChase
Morgan Stanley
The PNC Financial Services Group, Inc.
PNC Financial Services Group, Inc.
PNC Bank
U.S. Bancorp
Wells Fargo
Wells Fargo & Company
"""

In [None]:
insurances = """
Farmers Insurance Group
Farmers
Acuity Insurance
Aflac
Aflac Incorporated
Allianz Life
Allied Insurance
Allstate
The Allstate Corporation
American Automobile Association
AAA
American Family Insurance
American Income Life Insurance Company
AIL
American International Group
AIG
Government Employees Insurance Company
GEICO
Liberty Mutual
Liberty Mutual Insurance Company
Zurich Insurance Group
Zurich Insurance Group Ltd
"""

Regular expression metacharacters in the names have to be escaped so that they match literally. The only one that occurs in these names is the period:

In [None]:
banks = banks.replace(r'.', r'\.')
insurances = insurances.replace(r'.', r'\.')
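Escaping only the period is enough for the names listed here, but other names could contain characters such as `+`, `(` or `*`. A minimal sketch of a more general approach, assuming each name is escaped individually with the standard-library function `re.escape`; the `names` list is a hypothetical excerpt used only for illustration:

```python
import re

# Hypothetical excerpt of company names; some contain regex metacharacters.
names = ["U.S. Bancorp", "JPMorgan Chase & Co.", "Citigroup Inc."]

# re.escape backslash-escapes the characters that are special in a pattern.
escaped = [re.escape(name) for name in names]

# Each escaped pattern matches its own name literally.
literal_ok = all(re.fullmatch(pattern, name) for name, pattern in zip(names, escaped))

# The escaped period no longer matches arbitrary characters.
wildcard_blocked = re.fullmatch(escaped[0], "UXSX Bancorp") is None

print(literal_ok, wildcard_blocked)
```

Escaping per name rather than on the whole multi-line string keeps the line breaks untouched, so the later `splitlines()` step would work unchanged.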

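Splitting the strings into lines and filtering in a set comprehension removes the empty lines. Because the result is a set, duplicates are dropped as well. Despite their names, `banks_list` and `insurances_list` are therefore sets, not lists.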

In [None]:
banks_list = { x for x in banks.splitlines() if x }
insurances_list = { x for x in insurances.splitlines() if x }

The sets can now be converted to named groups in a Python regular expression. The group names serve as Named Entity labels:

In [None]:

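# Sort the names longest first: alternation takes the first alternative that matches,
# so a short name such as "Citi" must not be tried before "Citigroup Inc\.".
regular_expression = re.compile( r"|".join( (r"(?P<BANK>" + r"|".join( sorted(banks_list, key=len, reverse=True) ) + r")",
                                             r"(?P<INSURANCE>" + r"|".join( sorted(insurances_list, key=len, reverse=True) ) + r")") ) )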
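Alternation in Python's `re` module is ordered: the engine tries the alternatives from left to right and takes the first one that matches, not the longest. When one name is a prefix of another, the order of the names in the pattern therefore decides which one is found. A small sketch with three names from the bank list illustrates the effect (the sample sentence is made up):

```python
import re

# Three names where each shorter one is a prefix of the longer ones.
names = {"Citi", "Citigroup", "Citigroup Inc\\."}
text = "Citigroup Inc. reported earnings."

# Shortest first: the alternation stops at the first alternative that matches.
shortest_first = re.compile("|".join(sorted(names, key=len)))
# Longest first: longer names are tried before their own prefixes.
longest_first = re.compile("|".join(sorted(names, key=len, reverse=True)))

short_match = shortest_first.search(text).group(0)
long_match = longest_first.search(text).group(0)
print(short_match)  # Citi
print(long_match)   # Citigroup Inc.
```

Since a Python set has no guaranteed iteration order, joining it without sorting can yield either behavior from one run to the next.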

Here is some sample text that we want to annotate:

In [None]:
sample_text = """
Zurich Insurance Group Ltd is a Swiss insurance company, headquartered in Zürich, and the country's largest insurer.
Wells Fargo is an American multinational financial services company with a significant global presence.
JPMorgan Chase & Co. (stylized as JPMorganChase) is an American multinational financial services firm headquartered in New York City and incorporated in Delaware. It is the largest bank in the United States and the world's largest bank by market capitalization as of 2023.
"""

In [None]:
annotations = []
for match in regular_expression.finditer(sample_text):
    # lastgroup is the name of the group that matched: BANK or INSURANCE
    label = match.lastgroup
    print(f"{label}: {match.start()} {match.end()} {match.group(0)}")
    annotations.append( (match.start(), match.end(), label) )
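A pattern built from plain names matches wherever the character sequence occurs, including inside longer words. The following is a hedged sketch of one way to restrict matches to whole words. It uses lookarounds rather than `\b`, because `\b` does not behave as intended after a name that ends in a period. The names, the label `ORG` and the sample text are hypothetical:

```python
import re

names = ["AAA", "Citi"]
alternation = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))

# Unanchored: matches anywhere, even inside another word.
loose = re.compile(f"(?P<ORG>{alternation})")
# Anchored: no word character may directly precede or follow the name.
strict = re.compile(rf"(?<!\w)(?P<ORG>{alternation})(?!\w)")

text = "Cities near the AAA office."
loose_hits = [m.group(0) for m in loose.finditer(text)]
strict_hits = [m.group(0) for m in strict.finditer(text)]
print(loose_hits)   # ['Citi', 'AAA']
print(strict_hits)  # ['AAA']
```

Whether this restriction is wanted depends on the data; for a first-pass annotation that is corrected by hand, a few spurious matches may be acceptable.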

We can also run the text through the spaCy NLP pipeline:

In [None]:
doc = nlp(sample_text.strip())

Now we can process sentence by sentence and generate the entity annotations for the training data:

In [None]:
training_data = []
for sentence in doc.sents:
    # offsets are relative to the start of sentence.text, not to the whole document
    annotations = [ (match.start(), match.end(), match.lastgroup)
                    for match in regular_expression.finditer(sentence.text) ]
    if annotations:
        training_data.append( (sentence.text, { 'entities': annotations }) )

The resulting data structure is a list of tuples. The first element of each tuple is the sentence text. The second is a dictionary whose key `entities` maps to a list of entity annotation tuples of the form `(start, end, label)`:

In [None]:
for x in training_data:
	print(x[0])
	print(x[1])
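Once the annotations have been corrected by hand, it can help to verify that every `(start, end, label)` triple still slices back to the intended string and that no two entities overlap, since spaCy's `doc.ents` does not allow overlapping spans. A minimal sketch, in which `check_entities` is a hypothetical helper and the sample tuple is written by hand:

```python
def check_entities(text, entities):
    """Return the annotated substrings; raise ValueError on bad or overlapping spans."""
    spans = []
    previous_end = 0
    # Sorting by start offset makes overlaps detectable in a single pass.
    for start, end, label in sorted(entities):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"offsets out of range: {(start, end, label)}")
        if start < previous_end:
            raise ValueError(f"overlapping span: {(start, end, label)}")
        spans.append((text[start:end], label))
        previous_end = end
    return spans

# Hand-written example in the same shape as the items of training_data.
sample = ("Wells Fargo is an American company.", {"entities": [(0, 11, "BANK")]})
checked = check_entities(sample[0], sample[1]["entities"])
print(checked)  # [('Wells Fargo', 'BANK')]
```

Printing the sliced substrings next to their labels makes off-by-one errors in manually edited offsets easy to spot.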

In [None]:
nlp_new = spacy.blank("xx")  # create blank Language class
nlp_new.add_pipe('sentencizer')
ner = nlp_new.add_pipe("ner", last=True)

In [None]:
for _, annotations in training_data:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

In [None]:
nlp_new.initialize()  # spaCy v3 name; begin_training() is a deprecated alias

In [None]:
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp_new.pipe_names if pipe not in pipe_exceptions]

In [None]:
with nlp_new.select_pipes(disable=other_pipes):  # only train NER
    for itn in range(100):
        random.shuffle(training_data)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(training_data, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            for text, annotations in batch:
                print(text)
                print(annotations)
                doc = nlp_new.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp_new.update([example],
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
        print("Losses", losses)

In [None]:
for text, _ in training_data:
    doc = nlp_new(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob_) for t in doc])

In [None]:
output_dir = Path("./models_ner/")

In [None]:
output_dir.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
nlp_new.to_disk(output_dir)

In [None]:
nlp_test = spacy.load(output_dir)

In [None]:
for text, _ in training_data:
    doc = nlp_test(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    #print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

**(C) 2024 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**