# spaCy Named Entity Annotation using Regular Expressions

(C) 2024 by [Damir Cavar](http://damir.cavar.me/)

**Version:** 0.1, November 2024

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

The following code shows how simple regular expression matches can be converted to spaCy Named Entity labels. The goal is to provide a first-level annotation of entities, correct the annotations manually, and generate a corpus for training neural annotation models.

In [9]:
import spacy
import re

In [2]:
nlp = spacy.load("en_core_web_trf")

In [60]:
banks = """
Bank of America
Bank of America Corporation
Citigroup
Citigroup Inc.
Citi
Goldman Sachs
The Goldman Sachs Group, Inc.
Goldman Sachs Group, Inc.
JP Morgan
JPMorgan Chase
JPMorgan Chase & Co.
JPMorganChase
Morgan Stanley
The PNC Financial Services Group, Inc.
PNC Financial Services Group, Inc.
PNC Bank
U.S. Bancorp
Wells Fargo
Wells Fargo & Company
"""

In [None]:
insurances = """
Farmers Insurance Group
Farmers
Acuity Insurance
Aflac
Aflac Incorporated
Allianz Life
Allied Insurance
Allstate
The Allstate Corporation
American Automobile Association
AAA
American Family Insurance
American Income Life Insurance Company
AIL
American International Group
AIG
Government Employees Insurance Company
GEICO
Liberty Mutual
Liberty Mutual Insurance Company
Zurich Insurance Group
Zurich Insurance Group Ltd
"""

Regular expression operators that occur in the names must be escaped so that they match literally. Here we escape the period, which would otherwise match any character:

In [62]:
banks = banks.replace(r'.', r'\.')
insurances = insurances.replace(r'.', r'\.')
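The replacement above handles the period only. As an alternative sketch, not what the cells above do, `re.escape` escapes every character that is special in a pattern, which also covers names containing parentheses, `+`, or `?`. It is applied per name here, because escaping the whole multi-line string would also escape the newlines. The list `names` and the test sentence are illustrative.

```python
import re

names = ["U.S. Bancorp", "JPMorgan Chase & Co."]

# re.escape backslash-escapes every character that is special in a pattern.
escaped = [re.escape(name) for name in names]
pattern = re.compile("|".join(escaped))

# The unescaped pattern "U.S. Bancorp" would also match "UXSX Bancorp".
print(pattern.findall("U.S. Bancorp and UXSX Bancorp"))  # ['U.S. Bancorp']
```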

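We can remove empty elements by filtering in the comprehension, and duplicates by building a set instead of a list. Because a regular expression alternation takes the first alternative that matches, we then sort the names longest first, so that `Citigroup Inc.` is tried before `Citi`:

In [63]:
banks_list = sorted({ x for x in banks.splitlines() if x }, key=len, reverse=True)
insurances_list = sorted({ x for x in insurances.splitlines() if x }, key=len, reverse=True)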
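Python's `re` module tries the alternatives of `|` from left to right and keeps the first one that succeeds, so the order of the names matters once they are joined. A minimal sketch with a two-name list chosen for illustration:

```python
import re

text = "Citigroup Inc. reported earnings."

# Short name first: the match stops after "Citi".
short_first = [r"Citi", r"Citigroup Inc\."]
print(re.search("|".join(short_first), text).group(0))  # Citi

# Longest name first: the full company name is matched.
long_first = sorted(short_first, key=len, reverse=True)
print(re.search("|".join(long_first), text).group(0))   # Citigroup Inc.
```

Since the iteration order of a set of strings is not fixed between Python sessions, an explicit sort also makes the compiled pattern reproducible.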

The lists can now be converted to labeled groups in Python regular expressions. We will use the labels as Named Entity tags:

In [64]:

regular_expression = re.compile( r"|".join( (r"(?P<BANK>" + r"|".join( banks_list ) + r")", 
											 r"(?P<INSURANCE>" + r"|".join( insurances_list ) + r")") ) )
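A pattern built this way matches a name wherever its characters occur, including inside a longer word. The following sketch is an addition, not part of the cells above: it wraps the alternation in the lookarounds `(?<!\w)` and `(?!\w)`. The variable names `plain` and `guarded` and the sample sentence are illustrative assumptions.

```python
import re

names = [r"PNC Bank", r"Citi"]
text = "Citi opened branches in several Cities."

plain = re.compile(r"(?P<BANK>" + "|".join(names) + r")")
# (?<!\w) and (?!\w) require that no word character touches the match.
guarded = re.compile(r"(?P<BANK>(?<!\w)(?:" + "|".join(names) + r")(?!\w))")

print([m.group(0) for m in plain.finditer(text)])    # ['Citi', 'Citi']
print([m.group(0) for m in guarded.finditer(text)])  # ['Citi']
```

Lookarounds are used instead of `\b` because a trailing `\b` fails after a name ending in a period, such as `Citigroup Inc.`, when whitespace follows.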

Here is some sample text that we want to annotate:

In [65]:
sample_text = """
Zurich Insurance Group Ltd is a Swiss insurance company, headquartered in Zürich, and the country's largest insurer.
Wells Fargo is an American multinational financial services company with a significant global presence.
JPMorgan Chase & Co. (stylized as JPMorganChase) is an American multinational financial services firm headquartered in New York City and incorporated in Delaware. It is the largest bank in the United States and the world's largest bank by market capitalization as of 2023.
"""

In [70]:
annotations = []
for match in regular_expression.finditer(sample_text):
	# lastgroup is the name of the group that matched: BANK or INSURANCE
	label = match.lastgroup
	print(f"{label}: {match.start()} {match.end()} {match.group(0)}")
	if label:
		annotations.append( (match.start(), match.end(), label) )

INSURANCE: 1 27 Zurich Insurance Group Ltd
BANK: 118 129 Wells Fargo
BANK: 222 242 JPMorgan Chase & Co.
BANK: 256 269 JPMorganChase
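The offsets printed above index into `sample_text` as defined, which begins with a newline; that is why the first match starts at 1 rather than 0. A quick sanity check is that slicing the text with each span reproduces the matched name. It is sketched here on a shortened text with a single-name pattern, both chosen for illustration:

```python
import re

text = "\nWells Fargo is an American financial services company.\n"
pattern = re.compile(r"(?P<BANK>Wells Fargo)")

annotations = [(m.start(), m.end(), m.lastgroup) for m in pattern.finditer(text)]
print(annotations)  # [(1, 12, 'BANK')]

for start, end, label in annotations:
    # Each span must slice back to exactly the annotated string.
    print(label, repr(text[start:end]))  # BANK 'Wells Fargo'
```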


We can also run the text through the spaCy NLP pipeline:

In [78]:
doc = nlp(sample_text.strip())

Now we can process sentence by sentence and generate the entity annotations for the training data:

In [80]:
training_data = []
for sentence in doc.sents:
	annotations = []
	# offsets are relative to the start of the sentence text
	for match in regular_expression.finditer(sentence.text):
		# lastgroup is the name of the group that matched
		label = match.lastgroup
		if label:
			annotations.append( (match.start(), match.end(), label) )
	# keep only sentences that contain at least one entity
	if annotations:
		training_data.append( (sentence.text, { 'entities': annotations }) )

The resulting data structure is a list of tuples. The first element of each tuple is the sentence text. The second is a dictionary whose key `entities` maps to a list of entity annotation tuples of the form `(start, end, label)`:

In [83]:
for x in training_data:
	print(x[0])
	print(x[1])

Zurich Insurance Group Ltd is a Swiss insurance company, headquartered in Zürich, and the country's largest insurer.
{'entities': [(0, 26, 'INSURANCE')]}
Wells Fargo is an American multinational financial services company with a significant global presence.
{'entities': [(0, 11, 'BANK')]}
JPMorgan Chase & Co. (stylized as JPMorganChase) is an American multinational financial services firm headquartered in New York City and incorporated in Delaware.
{'entities': [(0, 20, 'BANK'), (34, 47, 'BANK')]}
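For training with spaCy v3, tuples of this kind are usually converted to `Doc` objects and collected in a `DocBin`. The following is a minimal sketch under some assumptions: it uses a blank English pipeline for tokenization only, hard-codes two of the examples printed above instead of reading `training_data`, and leaves the output path `train.spacy` as a hypothetical name in a comment.

```python
import spacy
from spacy.tokens import DocBin

nlp_blank = spacy.blank("en")  # tokenizer only, no trained model needed
examples = [
    ("Wells Fargo is an American multinational financial services company "
     "with a significant global presence.",
     {"entities": [(0, 11, "BANK")]}),
    ("Zurich Insurance Group Ltd is a Swiss insurance company, headquartered "
     "in Zürich, and the country's largest insurer.",
     {"entities": [(0, 26, "INSURANCE")]}),
]

doc_bin = DocBin()
for text, annotation in examples:
    doc = nlp_blank.make_doc(text)
    spans = []
    for start, end, label in annotation["entities"]:
        # char_span returns None if the offsets do not fall on token boundaries
        span = doc.char_span(start, end, label=label)
        if span is not None:
            spans.append(span)
    doc.ents = spans
    doc_bin.add(doc)

print(len(doc_bin))  # number of stored documents
# doc_bin.to_disk("train.spacy")  # hypothetical output path
```

Spans for which `char_span` returns `None` are skipped silently in this sketch; in a real corpus they are worth logging, since they usually indicate a mismatch between the regular expression match and the tokenization.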


**(C) 2024 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**