# What is Named Entity Recognition (NER)?
Named entity recognition (NER) models are models pre-trained on a labeled corpus. The labels they learned act as categories (entity types) that can be assigned to spans of new text.

This can be as simple as identifying whether a span of text represents a person. For instance, high-profile figures such as __Abraham Lincoln, FDR, or George Washington__ can be labeled as a __person__.

NER has many extensions, but overall you can think of it as a tool for extracting categories of text without hand-writing custom regex functions or rules-based approaches.
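To make the contrast with hand-written rules concrete, here is a minimal sketch of a naive regex approach. The pattern and the helper name `find_names_with_regex` are illustrative assumptions, not part of any library; the rule simply treats two consecutive capitalized words as a name.

```python
import re

# Naive rule (an assumption for illustration): two consecutive
# capitalized words are treated as a person's name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def find_names_with_regex(text):
    """Return every substring that matches the naive two-word name rule."""
    return NAME_PATTERN.findall(text)


sample = ("For instance, high profile figures such as Abraham Lincoln, FDR, "
          "or George Washington can be labeled as a person.")

# Finds the two full names but misses the acronym "FDR".
print(find_names_with_regex(sample))

# Flags "The White" as a name, which is a false positive.
print(find_names_with_regex("The White House is in Washington."))
```

Every miss or false positive like these needs another hand-written rule, which is the maintenance burden a pretrained NER model helps you avoid.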

spaCy is an excellent NLP library for NER; its documentation can be found [here](https://spacy.io/api/entityrecognizer).

In [2]:
# Run this cell if spaCy or the small English model (en_core_web_sm) is not yet installed on your local machine.
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy
from spacy import displacy # great for visualizing entities/tokens
model = spacy.load("en_core_web_sm") # This is a pretrained NLP pipeline

# Some miscellaneous information that is not too relevant to NERs...

In [None]:
model.pipe_names # The components of the default pipeline, in the order they run.

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

In [None]:
model.pipeline # some additional information.

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x20f8af002c0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x20f8af3d5e0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x20f8ad75820>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x20f8ad754c0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x20f8afb9ac0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x20f8afbf9c0>)]
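If only the entities matter, components can be switched off temporarily with `select_pipes`. The sketch below uses a stand-in pipeline built with `spacy.blank("en")` and two rule-based components so that it runs without a model download; that stand-in is an assumption for illustration. Which components the `ner` step depends on in a given pretrained package is something to check in that package's documentation before disabling anything.

```python
import spacy

# Stand-in pipeline (assumption): rule-based components only,
# so no pretrained model has to be downloaded.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")
print(nlp.pipe_names)

# Temporarily switch off everything except the component we care about.
with nlp.select_pipes(enable=["entity_ruler"]):
    names_while_disabled = list(nlp.pipe_names)
    print(names_while_disabled)

# Outside the `with` block the full pipeline is restored.
names_after = list(nlp.pipe_names)
print(names_after)
```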

# Basic demo on what an NER can do.

In [None]:
text = "For instance, high profile figures such as Abraham Lincoln, FDR, or George Washington can be labeled as a person. I wonder what they would have thought about the USA today."

In [None]:
# pass text into model pipeline
processed_text = model(text)

In [None]:
processed_text # as we can see nothing is "out of the ordinary"

For instance, high profile figures such as Abraham Lincoln, FDR, or George Washington can be labeled as a person. I wonder what they would have thought about the USA today.

In [None]:
help(processed_text) # Show the documentation of the Doc object (output truncated below).

Help on Doc object:

class Doc(builtins.object)
 |  Doc(Vocab vocab, words=None, spaces=None, user_data=None, *, tags=None, pos=None, morphs=None, lemmas=None, heads=None, deps=None, sent_starts=None, ents=None)
 |  A sequence of Token objects. Access sentences and named entities, export
 |      annotations to numpy arrays, losslessly serialize to compressed binary
 |      strings. The `Doc` object holds an array of `TokenC` structs. The
 |      Python-level `Token` and `Span` objects are views of this array, i.e.
 |      they don't own the data themselves.
 |  
 |      EXAMPLE:
 |          Construction 1
 |          >>> doc = nlp(u'Some text')
 |  
 |          Construction 2
 |          >>> from spacy.tokens import Doc
 |          >>> doc = Doc(nlp.vocab, words=["hello", "world", "!"], spaces=[True, False, False])
 |  
 |      DOCS: https://spacy.io/api/doc
 |  
 |  Methods defined here:
 |  
 |  __bytes__(...)
 |      Doc.__bytes__(self)
 |  
 |  __getitem__(...)
 |      Get a `Token

In [None]:
processed_text.text

'For instance, high profile figures such as Abraham Lincoln, FDR, or George Washington can be labeled as a person. I wonder what they would have thought about the USA today.'

In [None]:
for word in processed_text.ents:
    print(word.text, word.label_) # entity text and its label, e.g. GPE = geopolitical entity (countries, cities, states)

Abraham Lincoln PERSON
FDR PERSON
George Washington PERSON
USA GPE
today DATE


In [None]:
displacy.render(processed_text, style="ent", jupyter=True) # https://spacy.io/usage/visualizers
# style accepts three values: "dep" for the dependency parse, "ent" for entities, and "span" for (possibly overlapping) spans stored in doc.spans.
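Outside a notebook, the same visualization can be captured as an HTML string. This is a minimal sketch: the sentence and the hand-set entities are assumptions, chosen so the example runs with a blank pipeline instead of the downloaded model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; entities are set by hand below
doc = nlp("Abraham Lincoln led the USA.")
# Token indices: Abraham=0, Lincoln=1, led=2, the=3, USA=4, "."=5
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 4, 5, label="GPE")]

# jupyter=False makes render() return the markup instead of displaying it.
# The "ents" option restricts highlighting to the listed labels.
html = displacy.render(doc, style="ent", jupyter=False, options={"ents": ["PERSON"]})
print(html[:80])
```

The returned string can be written to an `.html` file and opened in a browser.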

In [None]:
spacy.explain("PERSON") # What is this entity?

'People, including fictional'

In [None]:
spacy.explain("GPE") # What is this entity?

'Countries, cities, states'

In [None]:
spacy.explain("DATE") # What is this entity?

'Absolute or relative dates or periods'
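Rather than calling `spacy.explain` once per label, the lookups can be collected in one pass. The label list below is an assumption picked from the labels that appear in this notebook; with the loaded model, `model.get_pipe("ner").labels` should give the full set the pipeline can predict.

```python
import spacy

# Labels chosen for illustration (assumption); spacy.explain only needs
# spaCy's built-in glossary, not a downloaded model.
labels = ["PERSON", "GPE", "DATE", "ORG", "NORP"]

glossary = {label: spacy.explain(label) for label in labels}
for label, description in glossary.items():
    print(f"{label:<8}{description}")
```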

Abraham Lincoln             |  Franklin Delano Roosevelt (FDR) | George Washington
:-------------------------:|:-------------------------:|:-------------------------:
<img src="https://github.com/SpencerPao/Natural-Language-Processing/blob/main/Named_Entity_Recognition/lincoln.jpg?raw=1" width="400"/>  |  <img src="https://github.com/SpencerPao/Natural-Language-Processing/blob/main/Named_Entity_Recognition/FDR.jpg?raw=1" width="500"> | <img src="https://github.com/SpencerPao/Natural-Language-Processing/blob/main/Named_Entity_Recognition/GW.jpg?raw=1" width="500">



# Where in the NLP pipelines is NER typically used?
In industry, I have typically used NER models as a categorization tool. Scale-wise, think in terms of millions of documents with tables, graphics, and most importantly thousands of words.

In essence, NER can be used to tag datasets, whether they contain tweets, journal articles, or webpage content. This process reduces text sparsity and adds tremendous value to your dataset(s).

You can check out my [Optical Character Recognition (OCR)](https://www.youtube.com/watch?v=rCgy4d2pyyA) video on how to extract text from unstructured data.

Imagine that we have thousands upon thousands of documents and we need to categorize our data so we can actually work!
- This is where we can use an NER to help our process.

In [None]:
import glob
import pandas as pd

In [None]:
fileList = glob.glob('data/*.csv')
fileList # In this case we only have 10 documents. As we can see, these datasets are unlabeled.

['data\\0.csv',
 'data\\1.csv',
 'data\\2.csv',
 'data\\3.csv',
 'data\\4.csv',
 'data\\5.csv',
 'data\\6.csv',
 'data\\7.csv',
 'data\\8.csv',
 'data\\9.csv']
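The order `glob` returns is not guaranteed, and a plain alphabetical sort would put `10.csv` before `2.csv` once there are more than ten files. A minimal sketch of a numeric-aware listing follows; the helper name `list_csv_files` and the temporary demo directory are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def list_csv_files(data_dir):
    """Return the CSV files in data_dir, ordered by their numeric file name."""
    def sort_key(path):
        # Files whose name is not a number are placed after the numbered ones.
        return int(path.stem) if path.stem.isdigit() else float("inf")

    return sorted(Path(data_dir).glob("*.csv"), key=sort_key)


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["10.csv", "2.csv", "0.csv"]:
        (Path(tmp) / name).write_text("placeholder\n")
    ordered_names = [p.name for p in list_csv_files(tmp)]

print(ordered_names)
```

Using `pathlib` also avoids hard-coding the path separator, so the same code works on Windows and on Unix-like systems.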

# Let's get the datasets that only refer to people.
- This helps narrow our search tremendously without lifting much of a finger.
    - We can rely on already built tools to help further our search for what we need and want.

In [None]:
# Preview each file. In the next step we will only look at the first row of each file:
# if it mentions a person, we store the file name for later preprocessing.
for f in fileList:
    file_read = pd.read_csv(f, header = None)
    print(file_read)
    print("-----")

                                                    0
0   Assuming the Presidency at the depth of the Gr...
1   Born in 1882 at Hyde Park, New York–now a nati...
2   Following the example of his fifth cousin, Pre...
3   In the summer of 1921, when he was 39, disaste...
4   He was elected President in November 1932, to ...
5   By 1935 the Nation had achieved some measure o...
6   In 1936 he was re-elected by a top-heavy margi...
7   Roosevelt had pledged the United States to the...
8   When the Japanese attacked Pearl Harbor on Dec...
9   Feeling that the future peace of the world wou...
10  As the war drew to a close, Roosevelt's health...
11  The Presidential biographies on WhiteHouse.gov...
-----
                                                    0
0   Abraham Lincoln became the United States' 16th...
1   Lincoln warned the South in his Inaugural Addr...
2   Lincoln thought secession illegal, and was wil...
3   The son of a Kentucky frontiersman, Lincoln ha...
4   "I was born Feb. 1

In [None]:
# Let's see how we can use NER in action.
white_list_files = []
for f in fileList:
    file_read = pd.read_csv(f, header = None)
    print(f"Text being read in..... for file {f}")
    print(file_read.values[0][0])
    processed_text = model(file_read.values[0][0])
    print("----- Checking if there are any PERSON entities")
    is_people = False
    for word in processed_text.ents:
        print(word.text, word.label_)
        if word.label_ == "PERSON":
            is_people = True
    if is_people:
        white_list_files.append(f)
    print("-----")

print("Files that have a PERSON entity in the first paragraph:")
print(white_list_files)

Text being read in..... for file data\0.csv
Assuming the Presidency at the depth of the Great Depression, Franklin D. Roosevelt helped the American people regain faith in themselves. He brought hope as he promised prompt, vigorous action, and asserted in his Inaugural Address, "the only thing we have to fear is fear itself."
----- Checking if there are any PERSON entities
the Great Depression EVENT
Franklin D. Roosevelt PERSON
American NORP
Inaugural Address ORG
-----
Text being read in..... for file data\1.csv
Abraham Lincoln became the United States' 16th President in 1861, issuing the Emancipation Proclamation that declared forever free those slaves within the Confederacy in 1863.
----- Checking if there are any PERSON entities
Abraham Lincoln PERSON
the United States GPE
16th ORDINAL
1861 DATE
the Emancipation Proclamation FAC
1863 DATE
-----
Text being read in..... for file data\2.csv
On April 30, 1789, George Washington, standing on the balcony of Federal Hall on Wall Street 
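The filtering loop above can be wrapped in a reusable function and batched with `nlp.pipe`, which is worth considering once there are many files. This is a sketch under stated assumptions: the function name `files_with_label`, the sample paragraphs, and the rule-based stand-in pipeline are all hypothetical, used so the example runs without the pretrained model. With the notebook's `model` object you would pass that in instead.

```python
import spacy


def files_with_label(nlp, first_paragraphs, label="PERSON"):
    """Return the file names whose text contains at least one entity with `label`."""
    # as_tuples=True lets each text carry its file name through the pipeline.
    pairs = ((text, name) for name, text in first_paragraphs.items())
    return [
        name
        for doc, name in nlp.pipe(pairs, as_tuples=True)
        if any(ent.label_ == label for ent in doc.ents)
    ]


# Stand-in pipeline (assumption): an entity ruler with two hand-written patterns.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Abraham Lincoln"},
    {"label": "GPE", "pattern": "Hyde Park"},
])

# Hypothetical first paragraphs keyed by file name.
paragraphs = {
    "data/0.csv": "Abraham Lincoln became President in 1861.",
    "data/1.csv": "Hyde Park is now a national historic site.",
}

person_files = files_with_label(nlp, paragraphs)
print(person_files)
```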

# Customize YOUR NER!
- Of course, there is always manual work once you get into the details.
- This is a quick demonstration of how to tune your very own NER to fit your use cases!

In [None]:
raw_text = "CP30 and R2D2 are the droids we are looking for! Let's assume they are people for the sake of argument."

In [None]:
processed_text = model(raw_text)

In [None]:
displacy.render(processed_text, style="ent", jupyter=True) # As we can see, CP30 and R2D2 are not labeled as "DROID"; the pretrained model has no such label.



### So, how do we label new entities to an already existing model?
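- A basic example of how to do this: we tag spans in the processed text by hand with the existing PERSON label. The model itself is not retrained.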

In [None]:
from spacy.tokens import Span

In [None]:
PERSON = processed_text.vocab.strings['PERSON'] # integer ID of the existing PERSON label (we reuse this label rather than create a new one)
print(PERSON)

380


In [None]:
# Span(doc, start, end, label): start and end are token indices (end is exclusive), not character offsets.
entity_CP30 = Span(processed_text, 0, 1, label = PERSON) # token 0 is "CP30"
entity_R2D2 = Span(processed_text, 2, 3, label = PERSON) # token 2 is "R2D2"

In [None]:
processed_text.ents = list(processed_text.ents) + [entity_CP30] + [entity_R2D2]

In [None]:
displacy.render(processed_text, style="ent", jupyter=True) # CP30 and R2D2 are now labeled as "PERSON"
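`Span` counts tokens, which is easy to confuse with character positions. When you know the text of the entity rather than its token position, `Doc.char_span` does the conversion from character offsets. The sketch below uses a blank English pipeline (tokenizer only) as an assumption so that it runs without the pretrained model; the list of names is illustrative.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only (assumption, to avoid a model download)
doc = nlp("CP30 and R2D2 are the droids we are looking for!")

spans = []
for name in ["CP30", "R2D2"]:
    start = doc.text.find(name)  # character offset of the first occurrence
    # char_span returns None when the offsets do not line up with token boundaries.
    span = doc.char_span(start, start + len(name), label="PERSON")
    if span is not None:
        spans.append(span)

doc.ents = spans
labeled = [(ent.text, ent.label_) for ent in doc.ents]
print(labeled)
```

Note that assigning to `doc.ents` raises an error if two spans overlap, so with a pretrained model any existing entity covering the same tokens has to be dropped first.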

# Well, how about adding entirely new entities?
That's for a different video/notebook. So, let me know if you are interested in that!

# That's cool! What's next?
- You can check out all sorts of models that have NER built into their architecture. In fact, a great repository is [Hugging Face](https://huggingface.co/), where all sorts of NLP models are stored publicly. They have models from many different industries, so it might be worth checking out.