# Named Entity Recognition (NER) - Overview + Implementation
* Notebook by Adam Lang
* Date: 3/27/2024
* This notebook will present an overview of Named Entity Recognition (NER) with an implementation in Python.

# NER Overview
* The process of finding and tagging named entities in a given text.
* Usually a 2-step process:
    1. Named Entity Extraction
      a. Identify noun phrases, which are connected to the rest of the sentence by direct subject or object relationships.
      b. Dependency parsing is usually done first; the named entity nouns are then identified from the parse.
    2. Named Entity Classification or "Linking"
      a. The second step is a classification problem.
      b. The goal is a model that assigns a class (category) to each named entity tagged in step 1.
      c. "General Classes" of Named Entities (varies based on business use case):
          - Person
          - Organization
          - Location
          - Drug
          - etc.
      d. When each entity is also matched to an entry in an external resource, this is generally referred to as "named entity linking with a knowledge base".
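
The two steps above can be sketched in plain Python. This is a deliberately simplified illustration, not how spaCy works: runs of capitalized words stand in for parser-based noun phrase detection, and the `GAZETTEER` lookup table stands in for a trained classifier. The names `GAZETTEER`, `extract_candidates`, and `classify` are hypothetical.

```python
import re

# Hypothetical lookup table mapping known names to entity classes (illustrative only).
GAZETTEER = {
    "Switzerland": "GPE",
    "Bern": "GPE",
    "FIFA": "ORG",
    "Alps": "LOC",
}

# One or more capitalized words separated by single spaces.
CANDIDATE_PATTERN = re.compile(r"[A-Z][\w-]*(?: [A-Z][\w-]*)*")


def extract_candidates(text):
    """Step 1 (extraction): return (span_text, start, end) for each capitalized run.

    A crude stand-in for noun phrase detection based on a dependency parse.
    """
    return [(m.group(), m.start(), m.end()) for m in CANDIDATE_PATTERN.finditer(text)]


def classify(candidates):
    """Step 2 (classification): keep only candidates with a known class."""
    return [(t, s, e, GAZETTEER[t]) for t, s, e in candidates if t in GAZETTEER]


sample = "The federal authorities of Switzerland are based in Bern."
entities = classify(extract_candidates(sample))
for span_text, start, end, label in entities:
    print(span_text, "=>", start, "=>", end, "=>", label)
```

Note that step 1 also proposes "The" as a candidate; it is dropped in step 2 only because it has no entry in the lookup table.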
      
## Named Entity Linking Approaches:
* Domain dictionary lookup (e.g. ontologies)
* Knowledge bases (e.g. Wikipedia, Google)
* APIs (e.g. Google Maps, News API)
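
A minimal sketch of the dictionary lookup approach, assuming a small hand-built alias table. The knowledge base IDs (`kb:...`), the `ALIASES` table, and the `normalize` and `link` helpers are all made up for illustration; a real system would query an ontology, knowledge base, or API instead.

```python
# Hypothetical knowledge base: canonical ID -> record (IDs are invented for illustration).
KNOWLEDGE_BASE = {
    "kb:united_states": {"name": "United States of America", "class": "GPE"},
    "kb:switzerland": {"name": "Switzerland", "class": "GPE"},
}

# Hypothetical alias table: normalized surface form -> canonical ID.
ALIASES = {
    "united states of america": "kb:united_states",
    "united states": "kb:united_states",
    "usa": "kb:united_states",
    "us": "kb:united_states",
    "america": "kb:united_states",
    "switzerland": "kb:switzerland",
    "swiss confederation": "kb:switzerland",
}


def normalize(mention):
    """Lowercase the mention, drop periods, and strip a leading 'the '."""
    cleaned = mention.lower().replace(".", "").strip()
    if cleaned.startswith("the "):
        cleaned = cleaned[4:]
    return cleaned


def link(mention):
    """Return the canonical ID for a mention, or None if it is not in the alias table."""
    return ALIASES.get(normalize(mention))


for mention in ["The United States of America", "U.S.", "the Swiss Confederation", "Jura"]:
    kb_id = link(mention)
    name = KNOWLEDGE_BASE[kb_id]["name"] if kb_id else None
    print(mention, "=>", kb_id, "=>", name)
```

Mentions that are missing from the table (here "Jura") come back as `None`, which is where an outdated or incomplete dictionary shows up in practice.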

## Challenges with NER
* Updated vs. outdated ontologies, databases or knowledge bases
* Name variation and semantic entity synonyms
* Names that fit multiple categories or classes
    * `London`: name of a person, name of a city
    * `Donald Trump Park`: name of a park (location), name of a person
    * etc.
* Missing capitalization (e.g. `london` instead of `London`)
* Punctuation inside names (e.g. `U.S.` vs. `US`)
* etc.
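
To make the "multiple classes" challenge concrete, here is a toy disambiguation sketch. It assumes a hand-written table of candidate classes and a few context cue words per class; `CANDIDATE_CLASSES`, `CONTEXT_CUES`, and `disambiguate` are hypothetical names, and a trained NER model learns this kind of context from data rather than from fixed word lists.

```python
# Hypothetical table of ambiguous names: lowercase surface form -> possible classes.
CANDIDATE_CLASSES = {
    "london": ["GPE", "PERSON"],
}

# Hypothetical context words that vote for one class.
CONTEXT_CUES = {
    "GPE": {"in", "to", "from", "city", "visited"},
    "PERSON": {"mr", "mrs", "wrote", "said", "author"},
}


def disambiguate(name, sentence):
    """Pick the class whose cue words appear most often in the sentence.

    Returns None when the name is unknown or no cue word is present,
    rather than guessing.
    """
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    candidates = CANDIDATE_CLASSES.get(name.lower(), [])
    scores = {label: len(words & CONTEXT_CUES[label]) for label in candidates}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return None
    return best


print(disambiguate("London", "Jack London wrote The Call of the Wild."))
print(disambiguate("london", "She flew to london in May."))
print(disambiguate("London", "London is ambiguous."))
```

Lowercasing the name before the lookup also sidesteps the capitalization problem for this one table, at the cost of matching ordinary lowercase words if they ever collide with a name.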



# Implementation of Named Entity Recognition in Python using spaCy
* spaCy provides a defined list of the entity types it supports: https://spacy.io/api/data-formats#named-entities
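
For quick reference while reading the output in this notebook, the labels that appear here can be kept in a small glossary. The descriptions are short paraphrases of the label scheme linked above, and `LABEL_GLOSSARY` and `describe` are hypothetical helper names. With spaCy installed, `spacy.explain(label)` gives a similar description without a hand-written table.

```python
# Short descriptions of the entity labels that appear in this notebook's output.
LABEL_GLOSSARY = {
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water",
    "DATE": "Absolute or relative dates or periods",
    "QUANTITY": "Measurements, as of weight or distance",
    "ORDINAL": '"first", "second", etc.',
    "CARDINAL": "Numerals that do not fall under another type",
}


def describe(label):
    """Return a short description for a label, or a fallback for unknown labels."""
    return LABEL_GLOSSARY.get(label, "unknown label")


for label in ["GPE", "LOC", "NORP"]:
    print(label, "=>", describe(label))
```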

In [1]:
# import spacy modules
import spacy
from spacy import displacy

In [2]:
# load the spacy model
nlp = spacy.load('en_core_web_sm')

In [4]:
# define string to parse
text = "Switzerland, officially the Swiss Confederation, is a country situated in the confluence of Western, Central, and Southern Europe. It is a federal republic composed of 26 cantons, with federal authorities based in Bern. Switzerland is a landlocked country bordered by Italy to the south, France to the west, Germany to the north, and Austria and Liechtenstein to the east. It is geographically divided among the Swiss Plateau, the Alps, and the Jura, spanning a total area of 41,285 km2 (15,940 sq mi), and land area of 39,997 km2 (15,443 sq mi). While the Alps occupy the greater part of the territory, the Swiss population of approximately 8.5 million is concentrated mostly on the plateau, where the largest cities and economic centres are located, among them Zürich, Geneva and Basel, where multiple international organisations are domiciled (such as FIFA, the UN's second-largest Office, and the Bank for International Settlements) and where the main international airports of Switzerland are."

In [5]:
# create spacy doc object
doc = nlp(text)

In [6]:
# Step 1 - extract named entities (doc.ents is a tuple of Span objects)
for entity in doc.ents:
  print(entity.text, '=>', entity.start_char, '=>', entity.end_char, '=>', entity.label_)

Switzerland => 0 => 11 => GPE
the Swiss Confederation => 24 => 47 => ORG
Western => 92 => 99 => NORP
Central => 101 => 108 => ORG
Southern Europe => 114 => 129 => LOC
26 => 168 => 170 => CARDINAL
Bern => 214 => 218 => GPE
Switzerland => 220 => 231 => GPE
Italy => 268 => 273 => GPE
France => 288 => 294 => GPE
Germany => 308 => 315 => GPE
Austria => 334 => 341 => GPE
Liechtenstein => 346 => 359 => PERSON
the Swiss Plateau => 408 => 425 => GPE
Alps => 431 => 435 => LOC
Jura => 445 => 449 => PERSON
41,285 => 476 => 482 => CARDINAL
15,940 sq mi => 488 => 500 => QUANTITY
39,997 => 520 => 526 => CARDINAL
15,443 sq mi => 532 => 544 => QUANTITY
Alps => 557 => 561 => LOC
Swiss => 608 => 613 => NORP
approximately 8.5 million => 628 => 653 => CARDINAL
Zürich => 763 => 769 => GPE
Geneva => 771 => 777 => GPE
Basel => 782 => 787 => GPE
FIFA => 855 => 859 => ORG
UN => 865 => 867 => ORG
second => 870 => 876 => ORDINAL
the Bank for International Settlements => 897 => 935 => ORG
Switzerland => 982 => 993
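
The `start_char` and `end_char` values printed above are character offsets into the original string, with the end offset exclusive, so slicing the text with them recovers the entity text. A small check using only the first sentence and three rows copied from the output (the variable names `first_sentence`, `reported`, and `sliced` are just for this sketch):

```python
# First sentence of the text above, so the offsets from the output still apply.
first_sentence = (
    "Switzerland, officially the Swiss Confederation, is a country "
    "situated in the confluence of Western, Central, and Southern Europe."
)

# (entity text, start_char, end_char) rows copied from the printed output.
reported = [
    ("Switzerland", 0, 11),
    ("the Swiss Confederation", 24, 47),
    ("Southern Europe", 114, 129),
]

# Slice the string with each pair of offsets.
sliced = [first_sentence[start:end] for _, start, end in reported]

for (expected, start, end), actual in zip(reported, sliced):
    print(f"[{start}:{end}] -> {actual!r} (matches: {actual == expected})")
```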

In [7]:
# Visualize named entities
displacy.render(doc,style='ent',jupyter=True)

Summary of output:
* Most of the output is correct. However, the model makes some errors worth noting.
* "Liechtenstein" is classified as a PERSON, but it is a country and should be a GPE.
* "the Swiss Plateau" is classified as a GPE, but it is a geographic region (a plateau, not a country or city) and should be a LOC.
* "Jura" is classified as a PERSON, but here it refers to a mountain range and should be a LOC.
* There are other errors as well (e.g. "Central" is tagged as an ORG), but the ones above are the most notable. An NER model is only as good as the data it was trained on and, depending on the model architecture, it may not generalize to new, unseen text.
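
One simple way to patch known mistakes like these is a post-processing override table applied to the model output. The sketch below works on a few `(text, label)` pairs copied from the output above; `OVERRIDES` is a hypothetical, hand-maintained table. spaCy also offers a rule-based `entity_ruler` pipeline component that can be placed before the `ner` component for a similar purpose, which is not shown here.

```python
# A few (entity text, predicted label) pairs copied from the model output above.
predicted = [
    ("Liechtenstein", "PERSON"),
    ("the Swiss Plateau", "GPE"),
    ("Jura", "PERSON"),
    ("Bern", "GPE"),
]

# Hypothetical manual overrides for entities the model is known to mislabel.
OVERRIDES = {
    "Liechtenstein": "GPE",
    "the Swiss Plateau": "LOC",
    "Jura": "LOC",
}

# Keep the predicted label unless an override exists for that exact text.
corrected = [(text, OVERRIDES.get(text, label)) for text, label in predicted]

for (text, old_label), (_, new_label) in zip(predicted, corrected):
    print(text, "=>", old_label, "=>", new_label)
```

An exact-match override like this is brittle: it applies the same label to every occurrence of the text regardless of context, so it suits a fixed, well-understood corpus better than open-ended input.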

# NER Exercise
* Read in a file called "usa.txt".
* Extract and classify the named entities.

In [12]:
# open and read the file (the with block closes the file automatically)
with open('/content/drive/MyDrive/Colab Notebooks/Classical NLP/usa.txt', 'r', encoding='utf-8') as file:
  text = file.read()

In [15]:
# print the first 20 characters of the text
print(text[:20])

The United States of


In [16]:
# spacy pipeline
import spacy
from spacy import displacy

# load spacy model
nlp = spacy.load('en_core_web_sm')

# set doc object
doc = nlp(text)


In [18]:
# NER step 1 - extract entities
for entity in doc.ents:
  print(entity.text,'=>', entity.start_char,'=>', entity.end_char, '=>', entity.label_)


The United States of America => 0 => 28 => GPE
USA => 30 => 33 => GPE
the United States => 54 => 71 => GPE
U.S. => 73 => 77 => GPE
US => 81 => 83 => GPE
America => 88 => 95 => GPE
North America => 136 => 149 => LOC
Canada => 159 => 165 => GPE
Mexico => 170 => 176 => GPE
50 => 193 => 195 => CARDINAL
five => 224 => 228 => CARDINAL
3.8 million square miles => 291 => 315 => QUANTITY
9.8 million square kilometres => 317 => 346 => QUANTITY
fourth => 377 => 383 => ORDINAL
328 million => 436 => 447 => CARDINAL
2019 => 454 => 458 => DATE
third => 470 => 475 => ORDINAL
Washington => 527 => 537 => GPE
D.C. => 539 => 543 => GPE
New York City => 575 => 588 => GPE
Paleo-Indians => 591 => 604 => NORP
Siberia => 619 => 626 => LOC
North American => 634 => 648 => NORP
at least 12,000 years ago => 658 => 683 => DATE
European => 689 => 697 => NORP
the 16th century => 720 => 736 => DATE
The United States => 738 => 755 => GPE
thirteen => 773 => 781 => CARDINAL
British => 782 => 789 => NORP
the East Coast =>

In [19]:
# visualize entities
displacy.render(doc,style='ent', jupyter=True)

Summary: For the most part, the NER model extracted and classified the entities correctly.
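
A quick way to back up a summary like this is to count entities per label and scan the less common labels for mistakes. The sketch below uses a handful of `(text, label)` pairs copied from the output above, so it runs on its own. In the notebook itself, `Counter(ent.label_ for ent in doc.ents)` would count over the full document.

```python
from collections import Counter

# A handful of (entity text, label) pairs copied from the output above.
entities = [
    ("The United States of America", "GPE"),
    ("USA", "GPE"),
    ("North America", "LOC"),
    ("Canada", "GPE"),
    ("50", "CARDINAL"),
    ("2019", "DATE"),
    ("Paleo-Indians", "NORP"),
]

# Count how many entities were assigned each label.
label_counts = Counter(label for _, label in entities)

for label, count in label_counts.most_common():
    print(label, "=>", count)
```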