## Named Entity Recognition (NER)
Named Entity Recognition (NER) is a core task within Information Extraction (IE), the broader discipline of obtaining structured information from unstructured text. Roughly speaking, a named entity is anything referred to by a proper name: people, places, organizations, vehicles, facilities, and so on. In practice, NER models also tag numeric and temporal expressions, which is why labels such as MONEY, CARDINAL, and DATE appear in the outputs below.

## spaCy
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

### spaCy’s Statistical Models
The models listed below enable spaCy to perform several NLP tasks, such as part-of-speech tagging, named entity recognition, and dependency parsing.

I’ve listed below the different statistical models in spaCy along with their specifications:

* en_core_web_sm: English multi-task CNN trained on OntoNotes. 
* en_core_web_md: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl.
* en_core_web_lg: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. It ships a larger word-vector table than en_core_web_md.

For this project I am using <b>en_core_web_sm</b>.

More details on <a href="https://spacy.io/usage/spacy-101#whats-spacy">spaCy v3.0</a>

In [6]:
from collections import Counter
from pprint import pprint

import spacy
from spacy import displacy
import en_core_web_sm

# Load the small English pipeline (install it first with: python -m spacy download en_core_web_sm)
nlp = en_core_web_sm.load()

Next, we call nlp on a sentence. spaCy runs the text through its processing pipeline and returns a Doc object, whose ents attribute holds the recognized entities.

In [7]:
doc = nlp('A slight majority of Americans approve of the job performance of President Joe Biden, and at 52 per cent, his approval rating is 10 points higher than that of Donald Trump at the same point in his presidency.')
pprint([(X.text, X.label_) for X in doc.ents])

[('Americans', 'NORP'),
 ('Joe Biden', 'PERSON'),
 ('52 per cent', 'MONEY'),
 ('10', 'CARDINAL'),
 ('Donald Trump', 'PERSON')]


#### Token-Level Entity
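Here I demonstrate token-level entity annotation. The ent_iob_ attribute used below reports IOB tags, so only "B", "I", and "O" appear in the output. The BILUO scheme extends IOB with "L" and "U" to describe the entity boundaries more precisely.
* "B" means the token begins an entity
* "I" means the token is inside an entity
* "L" means the token is the final token of a multi-token entity (BILUO only)
* "U" means the token is a single-token entity (BILUO only)
* "O" means the token is outside an entity
* "" means no entity tag is set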
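To make the difference between the IOB and BILUO schemes concrete, here is a minimal pure-Python sketch that converts (IOB tag, entity type) pairs into BILUO tags. The function name iob_to_biluo_tags and the sample pairs are hypothetical and only for illustration; spaCy provides its own tag-conversion helpers in spacy.training, which you would normally prefer over a hand-rolled version.

```python
def iob_to_biluo_tags(pairs):
    """Convert (iob, entity_type) pairs into BILUO tag strings.

    Illustrative helper, not part of spaCy. Tokens with an empty tag
    ("no entity tag set") are simply treated as "O" here.
    """
    biluo = []
    for i, (iob, ent_type) in enumerate(pairs):
        if iob not in ("B", "I"):
            biluo.append("O")
            continue
        # The entity continues only if the next token is tagged "I".
        next_iob = pairs[i + 1][0] if i + 1 < len(pairs) else "O"
        continues = next_iob == "I"
        if iob == "B":
            prefix = "B" if continues else "U"  # U: single-token entity
        else:
            prefix = "I" if continues else "L"  # L: last token of the entity
        biluo.append(f"{prefix}-{ent_type}")
    return biluo


# Made-up sample, shaped like (token.ent_iob_, token.ent_type_)
sample = [
    ("B", "NORP"), ("O", ""),
    ("B", "PERSON"), ("I", "PERSON"), ("O", ""),
    ("B", "MONEY"), ("I", "MONEY"), ("I", "MONEY"),
]
print(iob_to_biluo_tags(sample))
# ['U-NORP', 'O', 'B-PERSON', 'L-PERSON', 'O', 'B-MONEY', 'I-MONEY', 'L-MONEY']
```

With a real Doc, the input could be built as [(t.ent_iob_, t.ent_type_) for t in doc].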

In [10]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(A, 'O', ''),
 (slight, 'O', ''),
 (majority, 'O', ''),
 (of, 'O', ''),
 (Americans, 'B', 'NORP'),
 (approve, 'O', ''),
 (of, 'O', ''),
 (the, 'O', ''),
 (job, 'O', ''),
 (performance, 'O', ''),
 (of, 'O', ''),
 (President, 'O', ''),
 (Joe, 'B', 'PERSON'),
 (Biden, 'I', 'PERSON'),
 (,, 'O', ''),
 (and, 'O', ''),
 (at, 'O', ''),
 (52, 'B', 'MONEY'),
 (per, 'I', 'MONEY'),
 (cent, 'I', 'MONEY'),
 (,, 'O', ''),
 (his, 'O', ''),
 (approval, 'O', ''),
 (rating, 'O', ''),
 (is, 'O', ''),
 (10, 'B', 'CARDINAL'),
 (points, 'O', ''),
 (higher, 'O', ''),
 (than, 'O', ''),
 (that, 'O', ''),
 (of, 'O', ''),
 (Donald, 'B', 'PERSON'),
 (Trump, 'I', 'PERSON'),
 (at, 'O', ''),
 (the, 'O', ''),
 (same, 'O', ''),
 (point, 'O', ''),
 (in, 'O', ''),
 (his, 'O', ''),
 (presidency, 'O', ''),
 (., 'O', '')]
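The token-level tags above carry the same information as doc.ents, just spread over individual tokens. As a rough sketch of how the two views relate, the hypothetical helper below (spans_from_iob is my own name, not a spaCy function) folds (text, IOB tag, entity type) triples back into entity spans. It joins tokens with a single space, which is a simplification: spaCy itself keeps the original whitespace.

```python
def spans_from_iob(tokens):
    """Group (text, iob, entity_type) triples into (entity_text, label) spans."""
    spans = []
    words, label = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if words:
            spans.append((" ".join(words), label))

    for text, iob, ent_type in tokens:
        if iob == "B":
            flush()
            words, label = [text], ent_type  # start a new entity
        elif iob == "I" and words:
            words.append(text)               # extend the current entity
        else:
            flush()
            words, label = [], None          # outside any entity
    flush()
    return spans


# A few tokens copied from the output above
tokens = [
    ("President", "O", ""), ("Joe", "B", "PERSON"), ("Biden", "I", "PERSON"),
    (",", "O", ""), ("52", "B", "MONEY"), ("per", "I", "MONEY"),
    ("cent", "I", "MONEY"), ("10", "B", "CARDINAL"), ("points", "O", ""),
]
print(spans_from_iob(tokens))
# [('Joe Biden', 'PERSON'), ('52 per cent', 'MONEY'), ('10', 'CARDINAL')]
```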


In [11]:
from bs4 import BeautifulSoup
import requests
import re

In [12]:
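def url_to_string(url):
    """Download a page and return its visible text as a single string."""
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(res.text, 'html5lib')  # requires the html5lib package
    # Drop elements whose text is not part of the article body.
    for tag in soup(["script", "style", "aside"]):
        tag.extract()
    # Collapse runs of newlines and tabs into single spaces.
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))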
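To see what this cleaning step does without a network call, here is a small stand-alone sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup. It is a swapped-in technique, not the function used in this notebook, and the names VisibleTextParser, html_to_text, and the sample HTML are all hypothetical. One difference is deliberate: the sketch joins text chunks with a space, whereas BeautifulSoup's get_text() uses an empty separator by default, which may be why glued tokens such as "Images)A" show up in the article output further down.

```python
import re
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "aside"}


class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, whitespace-normalized."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Join chunks with a space so adjacent elements do not run together.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


sample_html = (
    "<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
    "<body><p>Joe Biden</p><aside><p>Ad</p></aside><p>approval rating</p></body></html>"
)
print(html_to_text(sample_html))
# Joe Biden approval rating
```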

Let's extract the named entities from the Yahoo News article <a href="https://news.yahoo.com/biden-approval-rating-10-points-161424766.html">Biden’s approval rating is 10 points higher than his predecessor’s was after 100 days</a>.

In [19]:
ny_bb = url_to_string('https://news.yahoo.com/biden-approval-rating-10-points-161424766.html')
article = nlp(ny_bb)
len(article.ents)

250

In [22]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'NORP': 43, 'ORG': 37, 'DATE': 34, 'PERSON': 30, 'GPE': 29, 'CARDINAL': 27, 'MONEY': 14, 'ORDINAL': 9, 'PRODUCT': 7, 'WORK_OF_ART': 6, 'PERCENT': 5, 'TIME': 3, 'LOC': 2, 'EVENT': 2, 'QUANTITY': 1, 'LANGUAGE': 1})

In [23]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('Americans', 7), ('10', 5), ('Trump', 5)]
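The two counts above look at labels and entity texts separately. A possible next step is to combine them and ask which texts are most frequent within each label. The sketch below assumes the entities are available as (text, label) pairs; count_by_label is a hypothetical helper and the sample pairs are made up in the style of the earlier output, not taken from the article.

```python
from collections import Counter, defaultdict


def count_by_label(entities):
    """Map each entity label to a Counter of the texts carrying that label."""
    grouped = defaultdict(Counter)
    for text, label in entities:
        grouped[label][text] += 1
    return dict(grouped)


# Made-up sample pairs, shaped like (ent.text, ent.label_)
sample_entities = [
    ("Americans", "NORP"), ("Joe Biden", "PERSON"), ("52 per cent", "MONEY"),
    ("10", "CARDINAL"), ("Donald Trump", "PERSON"), ("Americans", "NORP"),
]
by_label = count_by_label(sample_entities)
print(by_label["NORP"].most_common(1))
# [('Americans', 2)]
```

For the article parsed above, the input could be built as ((ent.text, ent.label_) for ent in article.ents).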

In [24]:
sentences = [x for x in article.sents]
print(sentences[5])

COVID-19           US  US           Politics  Politics           World  World           Health  Health           Science  Science           Podcasts  Podcasts        Originals  


In [25]:
displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')

In [26]:

displacy.render(nlp(str(sentences[20])), style='dep', jupyter=True, options={'distance': 120})

In [27]:
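# Surface form, part-of-speech tag, and lemma for every token that is not a stop word or punctuation
[(tok.orth_, tok.pos_, tok.lemma_)
 for tok in nlp(str(sentences[20]))
 if not tok.is_stop and tok.pos_ != 'PUNCT']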

[('Getty', 'PROPN', 'Getty'), ('Images)A', 'PROPN', 'Images)A'), ('slight', 'ADJ', 'slight'), ('majority', 'NOUN', 'majority'), ('Americans', 'PROPN', 'Americans'), ('approve', 'VERB', 'approve'), ('job', 'NOUN', 'job'), ('performance', 'NOUN', 'performance'), ('President', 'PROPN', 'President'), ('Joe', 'PROPN', 'Joe'), ('Biden', 'PROPN', 'Biden'), ('52', 'NUM', '52'), ('cent', 'NOUN', 'cent'), ('approval', 'NOUN', 'approval'), ('rating', 'NOUN', 'rating'), ('10', 'NUM', '10'), ('points', 'NOUN', 'point'), ('higher', 'ADJ', 'high'), ('Donald', 'PROPN', 'Donald'), ('Trump', 'PROPN', 'Trump'), ('point', 'NOUN', 'point'), ('presidency', 'NOUN', 'presidency')]

In [28]:
dict([(str(x), x.label_) for x in nlp(str(sentences[20])).ents])

{'Getty': 'PERSON', 'Americans': 'NORP', 'Joe Biden': 'PERSON', '52 per cent': 'MONEY', '10': 'CARDINAL', 'Donald Trump': 'PERSON'}

In [29]:
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[20]])

[((, 'O', ''), (Getty, 'B', 'PERSON'), (Images)A, 'O', ''), (slight, 'O', ''), (majority, 'O', ''), (of, 'O', ''), (Americans, 'B', 'NORP'), (approve, 'O', ''), (of, 'O', ''), (the, 'O', ''), (job, 'O', ''), (performance, 'O', ''), (of, 'O', ''), (President, 'O', ''), (Joe, 'B', 'PERSON'), (Biden, 'I', 'PERSON'), (,, 'O', ''), (and, 'O', ''), (at, 'O', ''), (52, 'B', 'MONEY'), (per, 'I', 'MONEY'), (cent, 'I', 'MONEY'), (,, 'O', ''), (his, 'O', ''), (approval, 'O', ''), (rating, 'O', ''), (is, 'O', ''), (10, 'B', 'CARDINAL'), (points, 'O', ''), (higher, 'O', ''), (than, 'O', ''), (that, 'O', ''), (of, 'O', ''), (Donald, 'B', 'PERSON'), (Trump, 'I', 'PERSON'), (at, 'O', ''), (the, 'O', ''), (same, 'O', ''), (point, 'O', ''), (in, 'O', ''), (his, 'O', ''), (presidency, 'O', ''), (., 'O', '')]


In [30]:
displacy.render(article, jupyter=True, style='ent')