# Named Entity Recognition (NER)

The goal of NER is to locate named entities mentioned in unstructured text and classify them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

## Tools

* Apache OpenNLP [https://opennlp.apache.org/](https://opennlp.apache.org/)
* SpaCy [https://spacy.io/](https://spacy.io/)
* IBM Watson [https://www.ibm.com/cloud/watson-natural-language-understanding](https://www.ibm.com/cloud/watson-natural-language-understanding)
* AWS SageMaker [https://docs.aws.amazon.com/sagemaker/latest/dg/sms-named-entity-recg.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-named-entity-recg.html)

## SpaCy

In [1]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.2.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.1 MB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (452 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.9-py2.py3-none-any.whl (20 kB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.0-py3-none-any.whl (27 kB)
Collecting wasabi<1.1.0,>=0.8.1
  Downloading wasabi-0.9.0-py3-none-any.whl (25 kB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.6-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_6

In [87]:
import spacy

### SpaCy NLP Models

In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [88]:
nlp = spacy.load("en_core_web_sm")

In [6]:
text = 'This is a text about Albert the king comming from Germany who was born in 1810 and had funny accent.'

In [7]:
doc = nlp(text)

In [10]:
print('Noun phrases:', [chunk.text for chunk in doc.noun_chunks])

Noun phrases: ['This', 'a text', 'Albert', 'the king', 'Germany', 'who', 'funny accent']


In [11]:
print('Verbs:', [token.lemma_ for token in doc if token.pos_ == 'VERB'])

Verbs: ['comme', 'bear', 'have']


In [12]:
for entity in doc.ents:
    print(entity.text, entity.label_)

Albert PERSON
Germany GPE
1810 DATE
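
NER both locates and classifies: every entity is a span of the text plus a label. spaCy exposes the character offsets of each entity as `entity.start_char` and `entity.end_char`. The sketch below does not call spaCy at all; it copies the three (text, label) pairs from the output above and recomputes the offsets with `str.find`, assuming each entity string occurs only once in the sentence, and then writes the labels back into the text.

```python
text = ('This is a text about Albert the king comming from Germany '
        'who was born in 1810 and had funny accent.')

# (entity text, label) pairs copied from the output above.
entities = [("Albert", "PERSON"), ("Germany", "GPE"), ("1810", "DATE")]

# Turn each pair into a (start, end, label) character span.
spans = []
for ent_text, label in entities:
    start = text.find(ent_text)  # first occurrence only (assumption: no repeats)
    spans.append((start, start + len(ent_text), label))

# Insert the annotations from right to left so earlier offsets stay valid.
annotated = text
for start, end, label in sorted(spans, reverse=True):
    annotated = f"{annotated[:start]}[{annotated[start:end]}]({label}){annotated[end:]}"

print(annotated)
```

Storing entities as offsets rather than strings keeps them unambiguous when the same word appears more than once in a document.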


### Entity labels available in `en_core_web_sm`
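* 'CARDINAL' = numerals that do not fall under another type
* 'DATE' = absolute or relative dates or periods
* 'EVENT' = named hurricanes, battles, wars, sports events, ...
* 'FAC' = buildings, airports, highways, bridges, ...
* 'GPE' = geopolitical entities: countries, cities, states
* 'LANGUAGE' = any named language
* 'LAW' = named documents made into laws
* 'LOC' = non-GPE locations, mountain ranges, bodies of water
* 'MONEY' = monetary values, including the unit
* 'NORP' = nationalities or religious or political groups
* 'ORDINAL' = "first", "second", ...
* 'ORG' = companies, agencies, institutions, ...
* 'PERCENT' = percentages, including "%"
* 'PERSON' = people, including fictional ones
* 'PRODUCT' = objects, vehicles, foods, ... (not services)
* 'QUANTITY' = measurements, such as weight or distance
* 'TIME' = times smaller than a day
* 'WORK_OF_ART' = titles of books, songs, ...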

In [100]:
text = 'David Pejcoch, 55D St Margarets Road, TW1 2LL, Twickenham, UK, david@pejcoch.com, +420 111 222 333, £500'

In [101]:
doc = nlp(text)

In [102]:
for entity in doc.ents:
    print(entity.text, entity.label_)

David Pejcoch PERSON
Twickenham GPE
UK GPE
+420 CARDINAL
500 MONEY
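
The output above has no entity for the postcode or the e-mail address, and only the `+420` prefix of the phone number is labelled. Structured identifiers like these are often easier to capture with rules than with a statistical model. A minimal sketch using only the standard `re` module follows; the three patterns are simplified assumptions tuned to this one string, not complete validators for e-mail addresses, UK postcodes, or phone numbers.

```python
import re

text = ('David Pejcoch, 55D St Margarets Road, TW1 2LL, Twickenham, UK, '
        'david@pejcoch.com, +420 111 222 333, £500')

# Simplified, illustrative patterns (assumptions, not full validators).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "POSTCODE": re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}\b"),
    "PHONE": re.compile(r"\+[0-9]{1,3}(?: [0-9]{3}){3}"),
}

# Collect every match per label.
found = {label: pattern.findall(text) for label, pattern in PATTERNS.items()}

for label, values in found.items():
    print(label, values)
# EMAIL ['david@pejcoch.com']
# POSTCODE ['TW1 2LL']
# PHONE ['+420 111 222 333']
```

In practice the rule-based hits would be merged with the model's entities, which is the idea behind the matching approach in the next section.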


### SpaCy Matching

In [79]:
from spacy.matcher import Matcher

In [80]:
matcher = Matcher(nlp.vocab)

In [81]:
doc = nlp('My social insurance number is 78978958 and my NINO is C789582XY.')

In [82]:
# Token-level pattern: a single token whose text contains 8 consecutive digits
pattern = [{"TEXT": {"REGEX": r"[0-9]{8}"}}]

In [83]:
matcher.add("SIN", [pattern])

In [84]:
matches = matcher(doc)

In [85]:
for match_id, start, end in matches:
    print(doc[start:end])

78978958
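
Only the 8-digit number is reported because no pattern was registered for the NINO. Note also that the pattern has no anchors: assuming the `REGEX` operator searches within each token's text, as `re.search` does, a token with nine or more digits would match as well. The sketch below illustrates both points with the standard `re` module on a whitespace-split sentence, which stands in for spaCy's tokenizer. `NINO_LIKE` is a hypothetical shape modelled on the sample token `C789582XY`, not an official format, and the nine-digit token is added only to show the over-matching.

```python
import re

# Unanchored pattern from the example and an anchored variant.
LOOSE = re.compile(r"[0-9]{8}")
STRICT = re.compile(r"^[0-9]{8}$")
# Hypothetical shape modelled on the sample token, not an official format.
NINO_LIKE = re.compile(r"^[A-Z][0-9]{6}[A-Z]{2}$")

# Crude whitespace tokenization stands in for spaCy's tokenizer here.
tokens = "My social insurance number is 78978958 and my NINO is C789582XY".split()
tokens.append("123456789")  # extra nine-digit token to show over-matching

loose_hits = [t for t in tokens if LOOSE.search(t)]
strict_hits = [t for t in tokens if STRICT.search(t)]
nino_hits = [t for t in tokens if NINO_LIKE.search(t)]

print(loose_hits)   # ['78978958', '123456789']
print(strict_hits)  # ['78978958']
print(nino_hits)    # ['C789582XY']
```

The same anchored expressions can be placed in the `REGEX` value of a Matcher pattern, with a second `matcher.add` call registering the NINO-like pattern under its own name.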
