## Named Entity Recognition (NER)
Named Entity Recognition (NER) is an essential task of the more general discipline of Information Extraction (IE). To obtain structured information from unstructured text we wish to identify named entities. Anything with a proper name is a named entity. This would include names of people, places, organizations, vehicles, facilities, and so on.
<p>
Named Entity (NE) Recognition (NER) is the task of
classifying every word in a document into one of several
predefined categories or "none-of-the-above". In
the taxonomy of computational linguistics tasks, it
falls under the domain of "information extraction",
which extracts specific kinds of information from
documents as opposed to the more general task of
"document management" which seeks to extract all
of the information found in a document.
Since entity names form the main content of a
document, NER is a very important step toward
more intelligent information extraction and
management. The atomic elements of information
extraction -- indeed, of language as a whole -- could
be considered as the "who", "where" and "how
much" in a sentence. NER performs what is known
as surface parsing, delimiting sequences of tokens
that answer these important questions. NER can
also be used as the first step in a chain of processors:
a next level of processing could relate two or more
NEs, or perhaps even give semantics to that
relationship using a verb. In this way, further
processing could discover the "what" and "how" of
a sentence or body of text. 
    </p>
    
### Use Cases of Named Entity Recognition:

* Information Extraction Systems
* Question-Answer Systems
* Machine Translation Systems
* Automatic Summarizing Systems
* Semantic Annotation

A Hidden Markov Model (HMM) is a simple probabilistic model that has been applied to many complex sequential problems, such as speech recognition and generation, machine translation, gene recognition in bioinformatics, and human gesture recognition in computer vision. In an HMM formulation of NER, the goal is to find the most probable tag sequence $T = t_1, t_2, t_3, \ldots, t_n$ for a given word sequence $W = w_1, w_2, w_3, \ldots, w_n$. The most probable tag sequence for each sentence is generally found with the Viterbi algorithm. By Bayes' law, and because $P(W)$ is constant for a given sentence, the tagging problem is equivalent to searching for

$$\hat{T} = \arg\max_{T} \; P(T) \times P(W \mid T).$$

The probability of the NE tag sequence, $P(T)$, can be calculated under the Markov assumption, which states that the probability of a tag depends only on a small, fixed number of previous NE tags. In the trigram model used here, the probability of an NE tag depends on the two previous tags, which gives

$$P(T) = P(t_1) \times P(t_2 \mid t_1) \times P(t_3 \mid t_1, t_2) \times P(t_4 \mid t_2, t_3) \times \cdots \times P(t_n \mid t_{n-2}, t_{n-1}).$$
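
To make the search for the best tag sequence concrete, here is a minimal Viterbi sketch. It rests on several assumptions that are not in the text above: it uses a first-order (bigram) tag model instead of the trigram model to keep the code short, it factors P(W|T) into per-word emission probabilities P(w_i|t_i), and all probability tables are made-up toy values. The function name `viterbi` and the `floor` parameter are illustrative only.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most probable tag sequence under a first-order (bigram) HMM.

    Probabilities missing from the toy tables are replaced by `floor`
    so that log() is always defined.
    """
    if not words:
        return []

    def logp(table, key):
        return math.log(max(table.get(key, 0.0), floor))

    # scores[i][t]: best log-probability of any path ending in tag t at word i
    # back[i][t]:   previous tag on that best path
    scores = [{t: logp(start_p, t) + logp(emit_p[t], words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        scores.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: scores[i - 1][p] + logp(trans_p[p], t))
            scores[i][t] = (scores[i - 1][prev]
                            + logp(trans_p[prev], t)
                            + logp(emit_p[t], words[i]))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag
    tag = max(tags, key=lambda t: scores[-1][t])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    return path[::-1]


# Toy model (all numbers are invented for illustration)
tags = ["PERSON", "O"]
start_p = {"PERSON": 0.2, "O": 0.8}
trans_p = {"PERSON": {"PERSON": 0.6, "O": 0.4},
           "O": {"PERSON": 0.2, "O": 0.8}}
emit_p = {"PERSON": {"Joe": 0.5, "Biden": 0.5},
          "O": {"President": 0.5, "spoke": 0.5}}

words = ["President", "Joe", "Biden", "spoke"]
best_path = viterbi(words, tags, start_p, trans_p, emit_p)
print(list(zip(words, best_path)))
```

Working in log space avoids numerical underflow on long sentences. A trigram version would keep pairs of tags as states, at the cost of a larger table.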

## spaCy
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems.

### Feature overview
* Support for 64+ languages
* 55 trained pipelines for 17 languages
* Multi-task learning with pretrained transformers like BERT
* Pretrained word vectors
* Linguistically-motivated tokenization
* Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
* Easily extensible with custom components and attributes
* Support for custom models in PyTorch, TensorFlow and other frameworks
* Built in visualizers for syntax and NER
* Easy model packaging, deployment and workflow management
* Robust, rigorously evaluated accuracy

### spaCy’s Statistical Models
The models listed below enable spaCy to perform several NLP tasks, such as part-of-speech tagging, named entity recognition, and dependency parsing.

I’ve listed the English statistical models in spaCy along with their specifications:

* en_core_web_sm: English multi-task CNN trained on OntoNotes. This is the smallest model and ships without word vectors.
* en_core_web_md: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl.
* en_core_web_lg: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. It includes a larger vector table than en_core_web_md.

For this project I am using <b>en_core_web_sm</b>.

More details on <a href="https://spacy.io/usage/spacy-101#whats-spacy">spaCy v3.0</a>

In [1]:
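# Standard library
from collections import Counter
from pprint import pprint

# spaCy, its visualizer, and the small English pipeline
import spacy
from spacy import displacy
import en_core_web_sm

# Load the pipeline once; nlp is reused throughout the notebook
nlp = en_core_web_sm.load()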

Next, we apply `nlp` to a sentence. The whole pipeline runs in the background and returns a processed `Doc` object.

In [2]:
doc = nlp('A slight majority of Americans approve of the job performance of President Joe Biden, and at 52 per cent, his approval rating is 10 points higher than that of Donald Trump at the same point in his presidency.')
pprint([(X.text, X.label_) for X in doc.ents])

[('Americans', 'NORP'),
 ('Joe Biden', 'PERSON'),
 ('52 per cent', 'MONEY'),
 ('10', 'CARDINAL'),
 ('Donald Trump', 'PERSON')]


#### Token-Level Entity
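Here I demonstrate token-level entity annotation. Entity boundaries are often described with the BILUO tagging scheme, listed below. Note that the `ent_iob_` attribute used in the next cell reports the simpler IOB scheme, so its output contains only "B", "I", "O" and "": a single-token entity is tagged "B", and the final token of a multi-token entity is tagged "I".
* "B" means the token begins an entity
* "I" means the token is inside an entity
* "L" means the token is the final token of a multi-token entity (BILUO only)
* "U" means the token is a single-token entity (BILUO only)
* "O" means the token is outside an entity
* "" means no entity tag is set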
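
The token-level output in this notebook contains only "B", "I" and "O" codes. As a rough illustration of how those relate to the full BILUO scheme, the sketch below converts a list of `(iob, label)` pairs into BILUO tags. The function `iob_to_biluo` is a hypothetical helper written for this example, not part of spaCy, and the input pairs are typed in by hand to mirror the sample sentence.

```python
def iob_to_biluo(pairs):
    """Convert (iob_code, label) pairs into BILUO tag strings.

    A token closes its entity when the next token does not continue it,
    i.e. the next token is not tagged "I" with the same label.
    """
    tags = []
    for i, (iob, label) in enumerate(pairs):
        if iob not in ("B", "I"):
            tags.append("O")
            continue
        nxt_iob, nxt_label = pairs[i + 1] if i + 1 < len(pairs) else ("O", "")
        continues = nxt_iob == "I" and nxt_label == label
        if iob == "B":
            tags.append(("B-" if continues else "U-") + label)
        else:
            tags.append(("I-" if continues else "L-") + label)
    return tags


# Hand-written pairs in the same form as (token.ent_iob_, token.ent_type_)
pairs = [("O", ""), ("B", "NORP"), ("O", ""),
         ("B", "PERSON"), ("I", "PERSON"), ("O", ""),
         ("B", "MONEY"), ("I", "MONEY"), ("I", "MONEY")]
biluo = iob_to_biluo(pairs)
print(biluo)
```

Here the single-token entity becomes "U-NORP", and the last token of each multi-token entity becomes an "L-" tag.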

In [3]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(A, 'O', ''),
 (slight, 'O', ''),
 (majority, 'O', ''),
 (of, 'O', ''),
 (Americans, 'B', 'NORP'),
 (approve, 'O', ''),
 (of, 'O', ''),
 (the, 'O', ''),
 (job, 'O', ''),
 (performance, 'O', ''),
 (of, 'O', ''),
 (President, 'O', ''),
 (Joe, 'B', 'PERSON'),
 (Biden, 'I', 'PERSON'),
 (,, 'O', ''),
 (and, 'O', ''),
 (at, 'O', ''),
 (52, 'B', 'MONEY'),
 (per, 'I', 'MONEY'),
 (cent, 'I', 'MONEY'),
 (,, 'O', ''),
 (his, 'O', ''),
 (approval, 'O', ''),
 (rating, 'O', ''),
 (is, 'O', ''),
 (10, 'B', 'CARDINAL'),
 (points, 'O', ''),
 (higher, 'O', ''),
 (than, 'O', ''),
 (that, 'O', ''),
 (of, 'O', ''),
 (Donald, 'B', 'PERSON'),
 (Trump, 'I', 'PERSON'),
 (at, 'O', ''),
 (the, 'O', ''),
 (same, 'O', ''),
 (point, 'O', ''),
 (in, 'O', ''),
 (his, 'O', ''),
 (presidency, 'O', ''),
 (., 'O', '')]


### BeautifulSoup/requests
The enormous amount of text available on the Internet is a rich resource for NLP research. To harvest that data effectively, we need to do web scraping. The Python libraries requests and Beautiful Soup are well suited to the job.

In [4]:
from bs4 import BeautifulSoup
import requests
import re

In [5]:
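def url_to_string(url):
    """Download a page and return its visible text as a single string."""
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # fail early on HTTP errors
    # The 'html5lib' parser requires the html5lib package to be installed
    soup = BeautifulSoup(res.text, 'html5lib')
    # Drop elements whose content is not article text
    for element in soup(["script", "style", "aside"]):
        element.extract()
    # Collapse runs of newlines and tabs into single spaces
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))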
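
To show what this kind of cleaning does without fetching a page, here is a dependency-free sketch of the same idea using only the standard library's `html.parser`. It is not the function above: the class `TextExtractor` and the helper `html_to_text` are hypothetical names, and the HTML snippet is made up. Unlike a plain `get_text()` call, it joins text chunks with a space, which avoids fused tokens such as "Images)A" that show up later in this notebook. Beautiful Soup's `get_text` also accepts a separator argument that could serve the same purpose.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script, style and aside elements."""

    SKIP = {"script", "style", "aside"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join chunks with a space, then collapse all whitespace runs
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


sample_html = (
    "<html><head><style>p { color: red; }</style></head><body>"
    "<p>(Getty Images)</p><p>A slight majority of Americans</p>"
    "<script>var tracking = 1;</script></body></html>"
)
clean_text = html_to_text(sample_html)
print(clean_text)
```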

Let's extract the named entities from the Yahoo News article <a href="https://news.yahoo.com/biden-approval-rating-10-points-161424766.html">Biden’s approval rating is 10 points higher than his predecessor’s was after 100 days</a>.

In [6]:
ny_bb = url_to_string('https://news.yahoo.com/biden-approval-rating-10-points-161424766.html')
article = nlp(ny_bb)
len(article.ents)

224

<p>
After tokenization, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.
    </p>
<p>
Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name: for example, token.pos is an integer ID, while token.pos_ is the corresponding string.
 </p>
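
The idea of storing strings once and passing around integer hashes can be illustrated with a toy lookup table. This is only a sketch of the concept: `ToyStringStore` is a hypothetical class, and the hash function used here (`blake2b` truncated to 64 bits) is an arbitrary choice for the example, not what spaCy uses internally.

```python
import hashlib


class ToyStringStore:
    """Map strings to 64-bit integer IDs and back (illustration only)."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        key = int.from_bytes(digest, "little")
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        # Look up the readable string for an integer ID
        return self._by_hash[key]


store = ToyStringStore()
noun_id = store.add("NOUN")

# Annotations can now hold the compact integer instead of the string ...
annotation = {"pos": noun_id}
# ... and the readable form is recovered on demand, much like token.pos_
readable = store[annotation["pos"]]
print(noun_id, readable)
```

Because the same string always hashes to the same integer, repeated labels cost one stored string plus one integer per occurrence.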

In [16]:
for token in article:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

Biden Biden PROPN NNP npadvmod Xxxxx True False
’s ’s PART POS punct ’x False True
approval approval NOUN NN compound xxxx True False
rating rating NOUN NN nsubj xxxx True False
is be AUX VBZ ROOT xx True True
10 10 NUM CD nummod dd False False
points point NOUN NNS npadvmod xxxx True False
higher high ADJ JJR acomp xxxx True False
than than SCONJ IN prep xxxx True True
his -PRON- DET PRP$ poss xxx True True
predecessor predecessor NOUN NN pobj xxxx True False
’s ’s PUNCT , punct ’x False True
was be AUX VBD conj xxx True True
after after ADP IN prep xxxx True True
100 100 NUM CD nummod ddd False False
days day NOUN NNS pobj xxxx True False
                    SPACE _SP       False False
HOME HOME PROPN NNP nmod XXXX True False
              SPACE _SP       False False
MAIL MAIL PROPN NNP nmod XXXX True False
              SPACE _SP       False False
NEWS NEWS PROPN NNP nmod XXXX True False
              SPACE _SP       False False
FINANCE FINANCE PROPN NNP nmod XXXX True False
...
three three NUM CD nummod xxxx True True
months month NOUN NNS dobj xxxx True False
in in ADP IN prep xx True True
office office NOUN NN pobj xxxx True False
. . PUNCT . punct . False False
Mr Mr PROPN NNP compound Xx True False
Trump Trump PROPN NNP nsubj Xxxxx True False
had have AUX VBD ROOT xxx True True
an an DET DT det xx True True
approval approval NOUN NN compound xxxx True False
rating rating NOUN NN dobj xxxx True False
of of ADP IN prep xx True True
42 42 NUM CD nummod dd False False
per per NOUN NN nmod xxx True True
cent cent NOUN NN pobj xxxx True False
and and CCONJ CC cc xxx True True
his -PRON- DET PRP$ poss xxx True True
disapproval disapproval NOUN NN nsubj xxxx True False
stood stand VERB VBD conj xxxx True False
at at ADP IN prep xx True True
53 53 NUM CD pobj dd False False
per per NOUN NN prep xxx True True
cent cent NOUN NN pobj xxxx True False
at at ADP IN prep xx True True
this this DET DT det xxxx True True
time time NOUN NN pobj xxxx True False
...
John John PROPN NNP compound Xxxx True False
McCain McCain PROPN NNP appos XxXxxx True False
, , PUNCT , punct , False False
tweeted tweet VERB VBD acl xxxx True False
ominously ominously ADV RB advmod xxxx True False
: : PUNCT : punct : False False
“ " PUNCT `` punct “ False False
Some some DET DT det Xxxx True True
tea tea NOUN NN nsubj xxx True False
leaves leave VERB VBZ ccomp xxxx True False
Democrats Democrats PROPN NNPS dobj Xxxxx True False
would would VERB MD aux xxxx True True
be be AUX VB ccomp xx True True
wise wise ADJ JJ acomp xxxx True False
to to PART TO aux xx True True
read read VERB VB xcomp xxxx True False
here here ADV RB advmod xxxx True True
. . PUNCT . punct . False False
Also also ADV RB advmod Xxxx True True
, , PUNCT , punct , False False
where where ADV WRB advmod xxxx True True
’s ’s PUNCT , punct ’x False True
all all DET PDT predet xxx True True
the the DET DT det xxx True True
bipartisanship bipartisanship NOUN NN nsubj xxxx True False
...
were be AUX VBD ccomp xxxx True True
concerned concern VERB VBN acomp xxxx True False
that that SCONJ IN mark xxxx True True
Mr Mr PROPN NNP compound Xx True False
Biden Biden PROPN NNP nsubj Xxxxx True False
will will VERB MD aux xxxx True True
push push VERB VB ccomp xxxx True False
too too ADV RB advmod xxx True True
hard hard ADJ JJ advmod xxxx True False
to to PART TO aux xx True True
increase increase VERB VB advcl xxxx True False
the the DET DT det xxx True True
government government NOUN NN dobj xxxx True False
’s ’s PART POS punct ’x False True
size size NOUN NN dobj xxxx True False
and and CCONJ CC cc xxx True True
role role NOUN NN conj xxxx True False
in in ADP IN prep xx True True
society society NOUN NN pobj xxxx True False
. . PUNCT . punct . False False
Partisanship partisanship NOUN NN nsubj Xxxxx True False
after after ADP IN prep xxxx True True
the the DET DT det xxx True True
acrimonious acrimonious ADJ JJ amod xxxx True False
2020 2020 NUM CD nummod dddd False False
...
: : PUNCT : punct : False False
' ' PUNCT '' punct ' False False
He -PRON- PRON PRP nsubj Xx True True
believed believe VERB VBD ROOT xxxx True False
what what PRON WP nsubjpass xxxx True True
was be AUX VBD aux xxx True True
being be AUX VBG auxpass xxxx True True
fed feed VERB VBN ccomp xxx True False
to to PART TO prep xx True True
him'Anthony him'anthony NUM CD punct xxx'Xxxxx False False
Antonio Antonio PROPN NNP nsubj Xxxxx True False
had have AUX VBD aux xxx True True
come come VERB VBN ROOT xxxx True False
to to PART TO aux xx True True
believe believe VERB VB advcl xxxx True False
Trump Trump PROPN NNP dobj Xxxxx True False
’s ’s PART POS punct ’x False True
unfounded unfounded ADJ JJ amod xxxx True False
claims claim NOUN NNS appos xxxx True False
of of ADP IN prep xx True True
widespread widespread ADJ JJ amod xxxx True False
electoral electoral ADJ JJ amod xxxx True False
fraud fraud NOUN NN pobj xxxx True False
, , PUNCT , punct , False False
...
Here here ADV RB ROOT Xxxx True True
’s ’s PUNCT , punct ’x False True
who who PRON WP nsubj xxx True True
’s ’ VERB VBZ ccomp ’x False True
in in ADV RB prt xx True True
, , PUNCT , punct , False False
who who PRON WP nsubj xxx True True
’s ’ VERB VBZ relcl ’x False True
out out ADP RP prt xxx True True
SaturdayThe SaturdayThe PROPN NNP compound XxxxxXxx True False
Florida Florida PROPN NNP compound Xxxxx True False
Panthers Panthers PROPN NNP nsubj Xxxxx True False
and and CCONJ CC cc xxx True True
Tampa Tampa PROPN NNP compound Xxxxx True False
Bay Bay PROPN NNP compound Xxx True False
Lightning Lightning PROPN NNP conj Xxxxx True False
might may VERB MD aux xxxx True True
play play VERB VB ROOT xxxx True False
nine nine NUM CD nummod xxxx True True
straight straight ADJ JJ amod xxxx True False
times time NOUN NNS dobj xxxx True False
throughout throughout ADP IN prep xxxx True True
the the DET DT det xxx True True
rest rest NOUN NN pobj xxxx True False
of of ADP IN prep xx True True
...
is be AUX VBZ ccomp xx True True
out out ADV RB advmod xxx True True
there there ADV RB advmod xxxx True True
that that DET WDT nsubj xxxx True True
can can VERB MD aux xxx True True
hear hear VERB VB ccomp xxxx True False
this this DET DT dobj xxxx True True
, , PUNCT , punct , False False
that that DET DT nsubj xxxx True True
has have AUX VBZ advcl xxx True True
you -PRON- PRON PRP dobj xxx True True
, , PUNCT , punct , False False
please please INTJ UH intj xxxx True True
, , PUNCT , punct , False False
we -PRON- PRON PRP nsubj xx True True
’ll will VERB MD aux ’xx False True
do do AUX VB ccomp xx True True
whatever whatever DET WDT dobj xxxx True True
it -PRON- PRON PRP nsubj xx True True
takes take VERB VBZ ccomp xxxx True False
to to PART TO aux xx True True
bring bring VERB VB advcl xxxx True False
you -PRON- PRON PRP dobj xxx True True
back back ADV RB advmod xxxx True True
. . PUNCT . punct . False False
We -PRON- PRON PRP nsubj Xx True True
love love VERB VBP ROOT xxxx True False
...

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the `ents` property of a `Doc`, here `article`.

In [17]:
for ent in article.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

10 27 29 CARDINAL
100 days 77 85 DATE
Yahoo News 249 259 ORG
US 425 427 GPE
US 429 431 GPE
10 815 817 CARDINAL
100 865 868 CARDINAL
BidenDonald TrumpGustaf KilanderApril 25 1028 1068 PRODUCT
12:14 1076 1081 TIME
min readOops!Something 1087 1109 ORG
BidenDonald TrumpFirst 1233 1255 PRODUCT
Jill Biden 1261 1271 PERSON
U.S. 1276 1280 GPE
Joe Biden 1291 1300 PERSON
Marine One 1324 1334 ORG
April 24, 2021 1338 1352 DATE
Washington, DC 1356 1370 GPE
Getty 1373 1378 PERSON
Americans 1407 1416 NORP
Joe Biden 1461 1470 PERSON
52 per cent 1479 1490 MONEY
10 1515 1517 CARDINAL
Donald Trump 1545 1557 PERSON
the Washington Post/ 1607 1627 ORG
ABC News 1627 1635 ORG
Sunday 1650 1656 DATE
42 per cent 1658 1669 MONEY
Americans 1673 1682 NORP
Joe Biden 1701 1710 PERSON
his first three months 1723 1745 DATE
Trump 1760 1765 PERSON
42 per cent 1792 1803 MONEY
53 per cent 1833 1844 MONEY
Thirty-four 1872 1883 CARDINAL
Americans 1896 1905 NORP
Mr Biden 1935 1943 PERSON
35 1945 1947 DATE
...

In [7]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'CARDINAL': 23,
         'DATE': 28,
         'ORG': 51,
         'GPE': 17,
         'PRODUCT': 3,
         'TIME': 3,
         'PERSON': 46,
         'NORP': 19,
         'MONEY': 13,
         'WORK_OF_ART': 4,
         'PERCENT': 5,
         'ORDINAL': 8,
         'LOC': 1,
         'EVENT': 1,
         'FAC': 2})

In [8]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('first', 5), ('10', 4), ('Americans', 4)]

In [9]:
sentences = [x for x in article.sents]
print(sentences[5])

COVID-19           US  US           Politics  Politics           World  World           Health  Health           Science  Science           Podcasts  Podcasts        Originals  


### Visualizing the entity recognizer
The entity visualizer, `ent`, highlights named entities and their labels in a text. If you specify a list of `ents` in the options, only those entity types will be rendered; for example, you can choose to display only PERSON entities. Internally, the visualizer knows nothing about available entity types and will render whichever spans and labels it receives. This makes it especially easy to work with custom entity types.
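
As a small sketch of both points, the example below passes hand-written spans to displaCy in manual mode and restricts rendering to PERSON entities through the `ents` option. It assumes spaCy is installed. The sentence, the `span` helper and the labels are made up for illustration, and `jupyter=False` is used so that the HTML is returned as a string instead of being displayed.

```python
from spacy import displacy

text = "President Joe Biden met Americans in Washington."


def span(phrase, label):
    """Build a manual entity span from the first occurrence of phrase."""
    start = text.find(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# In manual mode displaCy renders whatever spans and labels it is given
manual_doc = {
    "text": text,
    "ents": [span("Joe Biden", "PERSON"), span("Americans", "NORP")],
    "title": None,
}

html = displacy.render(
    manual_doc,
    style="ent",
    manual=True,
    jupyter=False,               # return the markup as a string
    options={"ents": ["PERSON"]},  # only PERSON entities are highlighted
)
print(html)
```

Inside a notebook, passing `jupyter=True` instead would display the highlighted text directly, as in the cells below.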

In [10]:
displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')

In [11]:

displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})

In [12]:
# Keep text, part of speech and lemma for tokens that are neither stop words nor punctuation
[(tok.orth_, tok.pos_, tok.lemma_)
 for tok in nlp(str(sentences[20]))
 if not tok.is_stop and tok.pos_ != 'PUNCT']

[('Getty', 'PROPN', 'Getty'),
 ('Images)A', 'PROPN', 'Images)A'),
 ('slight', 'ADJ', 'slight'),
 ('majority', 'NOUN', 'majority'),
 ('Americans', 'PROPN', 'Americans'),
 ('approve', 'VERB', 'approve'),
 ('job', 'NOUN', 'job'),
 ('performance', 'NOUN', 'performance'),
 ('President', 'PROPN', 'President'),
 ('Joe', 'PROPN', 'Joe'),
 ('Biden', 'PROPN', 'Biden'),
 ('52', 'NUM', '52'),
 ('cent', 'NOUN', 'cent'),
 ('approval', 'NOUN', 'approval'),
 ('rating', 'NOUN', 'rating'),
 ('10', 'NUM', '10'),
 ('points', 'NOUN', 'point'),
 ('higher', 'ADJ', 'high'),
 ('Donald', 'PROPN', 'Donald'),
 ('Trump', 'PROPN', 'Trump'),
 ('point', 'NOUN', 'point'),
 ('presidency', 'NOUN', 'presidency')]

In [13]:
dict([(str(x), x.label_) for x in nlp(str(sentences[20])).ents])

{'Getty': 'PERSON',
 'Americans': 'NORP',
 'Joe Biden': 'PERSON',
 '52 per cent': 'MONEY',
 '10': 'CARDINAL',
 'Donald Trump': 'PERSON'}

In [14]:
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[20]])

[((, 'O', ''), (Getty, 'B', 'PERSON'), (Images)A, 'O', ''), (slight, 'O', ''), (majority, 'O', ''), (of, 'O', ''), (Americans, 'B', 'NORP'), (approve, 'O', ''), (of, 'O', ''), (the, 'O', ''), (job, 'O', ''), (performance, 'O', ''), (of, 'O', ''), (President, 'O', ''), (Joe, 'B', 'PERSON'), (Biden, 'I', 'PERSON'), (,, 'O', ''), (and, 'O', ''), (at, 'O', ''), (52, 'B', 'MONEY'), (per, 'I', 'MONEY'), (cent, 'I', 'MONEY'), (,, 'O', ''), (his, 'O', ''), (approval, 'O', ''), (rating, 'O', ''), (is, 'O', ''), (10, 'B', 'CARDINAL'), (points, 'O', ''), (higher, 'O', ''), (than, 'O', ''), (that, 'O', ''), (of, 'O', ''), (Donald, 'B', 'PERSON'), (Trump, 'I', 'PERSON'), (at, 'O', ''), (the, 'O', ''), (same, 'O', ''), (point, 'O', ''), (in, 'O', ''), (his, 'O', ''), (presidency, 'O', ''), (., 'O', '')]


In [15]:
displacy.render(article, jupyter=True, style='ent')

#### References
* spaCy API, models and usage: https://spacy.io/
* spaCy's entity recognition model (video): https://www.youtube.com/watch?v=sqDHBH9IjRU
