# Building a Named Entity Recogniser with NLTK and SpaCy

Named Entity Recognition (NER) helps extract information from text by locating named entities and classifying them into pre-defined categories such as people, organisations and locations. A BBC article is used here to try two methods: NLTK and SpaCy.

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from bs4 import BeautifulSoup
import requests
import re
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

## NLTK

In [2]:
def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
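To see what url_to_string does without making a network call, the same cleaning steps can be run on an inline HTML snippet. This is only a minimal sketch: the helper name html_to_string and the sample markup are hypothetical, and it uses Python's built-in html.parser rather than html5lib so that no extra parser needs to be installed.

```python
import re
from bs4 import BeautifulSoup

def html_to_string(html):
    # hypothetical helper: same cleaning as url_to_string, minus the HTTP request
    soup = BeautifulSoup(html, 'html.parser')
    # drop elements whose text is not part of the article body
    for tag in soup(['script', 'style', 'aside']):
        tag.extract()
    # collapse runs of newlines and tabs into single spaces
    return ' '.join(re.split(r'[\n\t]+', soup.get_text()))

# made-up markup for illustration
sample_html = (
    '<html><head><style>p {color: red}</style>'
    '<script>var x = 1;</script></head>'
    '<body><p>Hello</p>\n<aside>Related links</aside><p>World</p></body></html>'
)
cleaned = html_to_string(sample_html)
print(cleaned)
```

Note that get_text() joins the text of adjacent elements with no separator by default, which is a likely cause of fused tokens such as 'HomepageAccessibility' in the NLTK output further down. Passing separator=' ' to get_text() is one way to avoid that.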

In [3]:
ny_bb = url_to_string('https://www.bbc.co.uk/news/business-52068549')
article = nlp(ny_bb)
len(article.ents)

110

In [4]:
#apply word tokenisation and POS tagging to article
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

In [5]:
#print sample output
sent = preprocess(ny_bb)
sent[:10]

[('Drop', 'NN'),
 ('in', 'IN'),
 ('consumer', 'NN'),
 ('confidence', 'NN'),
 ('sends', 'VBZ'),
 ('US', 'JJ'),
 ('stocks', 'NNS'),
 ('lower', 'JJR'),
 ('again', 'RB'),
 ('-', ':')]

This generates a list of tuples, each pairing a word from the article with its part-of-speech (POS) tag. Noun phrase chunking is applied next: a regular expression over the POS tags defines rules for how the tagged tokens should be grouped into chunks, which are candidate spans for named entities.

### Chunking

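The chunking rule here says that a noun phrase (NP) should be formed whenever the chunker finds an optional determiner (DT), followed by any number of adjectives (JJ), and then a singular noun (NN). Because the rule ends in exactly one NN, plural nouns (NNS) are left unchunked and consecutive nouns such as 'consumer confidence' become separate chunks.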

In [6]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'

In [7]:
#pattern rule now used to create a chunk parser and applied to text
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs[:10])

[Tree('NP', [('Drop', 'NN')]), ('in', 'IN'), Tree('NP', [('consumer', 'NN')]), Tree('NP', [('confidence', 'NN')]), ('sends', 'VBZ'), ('US', 'JJ'), ('stocks', 'NNS'), ('lower', 'JJR'), ('again', 'RB'), ('-', ':')]
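The effect of the rule is easier to see on a short sentence. The sketch below applies the same pattern to a hand-tagged token list (the sentence and its tags are made up for illustration, so no tagger model is needed) and collects the words of each NP chunk.

```python
import nltk

# hand-tagged tokens (an assumption for illustration; no tagger model is needed)
tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cats', 'NNS')]

# same rule as in the notebook: optional determiner, any adjectives, one NN
cp = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN>}')
tree = cp.parse(tagged)

# collect the words of each NP chunk, skipping the root 'S' node
noun_phrases = [' '.join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees()
                if subtree.label() == 'NP']
print(noun_phrases)  # ['the little yellow dog']
```

'the cats' is not chunked because the rule requires the tag NN and 'cats' is tagged NNS. Widening the final element to something like <NN.*> would be one way to include plural and proper nouns.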


In [8]:
#iob tags format now used to represent chunk structure
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged[:10])

[('Drop', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('consumer', 'NN', 'B-NP'),
 ('confidence', 'NN', 'B-NP'),
 ('sends', 'VBZ', 'O'),
 ('US', 'JJ', 'O'),
 ('stocks', 'NNS', 'O'),
 ('lower', 'JJR', 'O'),
 ('again', 'RB', 'O'),
 ('-', ':', 'O')]


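The above shows one token per line, each with its POS tag and its chunk tag in IOB format: B- marks the beginning of a chunk, I- a token inside a chunk, and O a token outside any chunk. Based on this representation, a tagger can be constructed to label new sentences. Noun phrase chunks alone do not say what kind of entity a phrase is, so ne_chunk is used next: it recognises named entities using a pre-trained classifier and adds category labels like ORGANIZATION.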

In [9]:
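#download the pre-trained NE chunker and the word list it depends on
nltk.download('maxent_ne_chunker')
nltk.download('words')

#tokenise, POS-tag and then chunk the article into named entities
ne_tree = ne_chunk(pos_tag(word_tokenize(ny_bb)))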
pprint(ne_tree[:15])

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/charlottefettes/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/charlottefettes/nltk_data...
[nltk_data]   Package words is already up-to-date!


[Tree('GPE', [('Drop', 'NN')]),
 ('in', 'IN'),
 ('consumer', 'NN'),
 ('confidence', 'NN'),
 ('sends', 'VBZ'),
 Tree('GSP', [('US', 'JJ')]),
 ('stocks', 'NNS'),
 ('lower', 'JJR'),
 ('again', 'RB'),
 ('-', ':'),
 Tree('ORGANIZATION', [('BBC', 'NNP'), ('News', 'NNP')]),
 ('HomepageAccessibility', 'NNP'),
 ('linksSkip', 'NN'),
 ('to', 'TO'),
 Tree('ORGANIZATION', [('contentAccessibility', 'NN')])]
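The tree returned by ne_chunk mixes plain (word, tag) tuples with labelled subtrees, so a small loop is enough to pull out the entities. The sketch below uses a hand-built tree that mirrors part of the output above, so it runs without the chunker model. The helper name extract_entities is hypothetical.

```python
from nltk.tree import Tree

# hand-built tree mirroring part of the ne_chunk output above
ne_tree_sample = Tree('S', [
    ('sends', 'VBZ'),
    Tree('GSP', [('US', 'JJ')]),
    ('stocks', 'NNS'),
    Tree('ORGANIZATION', [('BBC', 'NNP'), ('News', 'NNP')]),
])

def extract_entities(tree):
    # hypothetical helper: return (entity text, label) for each labelled subtree
    entities = []
    for node in tree:
        # plain tokens are tuples; recognised entities are Tree objects
        if isinstance(node, Tree):
            text = ' '.join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

print(extract_entities(ne_tree_sample))
# [('US', 'GSP'), ('BBC News', 'ORGANIZATION')]
```

Applied to the full ne_tree, the same loop would also surface the misclassifications visible in the output, such as 'Drop' labelled as GPE and 'contentAccessibility' labelled as ORGANIZATION.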


## SpaCy

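SpaCy's pre-trained English model is an alternative for named entity recognition. It supports a longer list of entity types, including dates, monetary amounts and percentages as well as people, organisations and places.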

In [10]:
#reload data
ny_bb = url_to_string('https://www.bbc.co.uk/news/business-52068549')
article = nlp(ny_bb)
len(article.ents)

110

In [11]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'GPE': 16,
         'ORG': 25,
         'CARDINAL': 11,
         'DATE': 23,
         'PERSON': 13,
         'PRODUCT': 3,
         'PERCENT': 6,
         'NORP': 4,
         'MONEY': 3,
         'EVENT': 1,
         'LAW': 1,
         'LOC': 3,
         'WORK_OF_ART': 1})
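Some of these labels, such as GPE and NORP, are not self-explanatory. spacy.explain looks up a short description for a label in spaCy's built-in glossary and does not need a loaded model. A minimal sketch, with the label list copied by hand from the Counter output above:

```python
import spacy

# labels taken from the Counter output above
labels_seen = ['GPE', 'ORG', 'CARDINAL', 'DATE', 'PERSON', 'NORP', 'LOC']

# spacy.explain returns a description string, or None for unknown terms
descriptions = {label: spacy.explain(label) for label in labels_seen}
for label, description in descriptions.items():
    print(f'{label}: {description}')
```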

In [12]:
#3 most frequent entities in the text
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('US', 7), ('UK', 4), ('Friday', 3)]

In [13]:
#select one sentence to learn more
sentences = [x for x in article.sents]
print(sentences[50])

The University of Michigan survey found consumer sentiment fell 11.9 points in March - the biggest one month drop since October 2008, at the height of the global financial crisis.


In [14]:
#highlight the named entities in this sentence
displacy.render(nlp(str(sentences[50])), jupyter=True, style='ent')
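With jupyter=True the visualisation is displayed inline. Outside a notebook, displacy.render can return the HTML markup as a string instead. The sketch below assumes displaCy's manual mode, where the text and entity offsets are supplied as a dictionary rather than a parsed Doc, so it runs without a model. The offsets here are written by hand for illustration.

```python
from spacy import displacy

text = 'The University of Michigan survey found consumer sentiment fell 11.9 points in March'

# entity offsets written by hand (an assumption for illustration)
org = 'The University of Michigan'
manual_doc = {
    'text': text,
    'ents': [
        {'start': 0, 'end': len(org), 'label': 'ORG'},
        {'start': text.index('11.9'), 'end': text.index('11.9') + 4, 'label': 'CARDINAL'},
        {'start': text.index('March'), 'end': len(text), 'label': 'DATE'},
    ],
    'title': None,
}

# jupyter=False returns the markup as a string instead of displaying it
html = displacy.render(manual_doc, style='ent', manual=True, jupyter=False)
print(html[:200])
```

The returned string can be written to an .html file and opened in a browser.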

In [15]:
#visualise the above sentence's dependencies
displacy.render(nlp(str(sentences[50])), style='dep', jupyter = True, options = {'distance': 120})

In [16]:
#extract POS and lemmatise this sentence
[(x.orth_, x.pos_, x.lemma_)
 for x in nlp(str(sentences[50]))
 if not x.is_stop and x.pos_ != 'PUNCT']

[('University', 'PROPN', 'University'),
 ('Michigan', 'PROPN', 'Michigan'),
 ('survey', 'NOUN', 'survey'),
 ('found', 'VERB', 'find'),
 ('consumer', 'NOUN', 'consumer'),
 ('sentiment', 'NOUN', 'sentiment'),
 ('fell', 'VERB', 'fall'),
 ('11.9', 'NUM', '11.9'),
 ('points', 'NOUN', 'point'),
 ('March', 'PROPN', 'March'),
 ('biggest', 'ADJ', 'big'),
 ('month', 'NOUN', 'month'),
 ('drop', 'NOUN', 'drop'),
 ('October', 'PROPN', 'October'),
 ('2008', 'NUM', '2008'),
 ('height', 'NOUN', 'height'),
 ('global', 'ADJ', 'global'),
 ('financial', 'ADJ', 'financial'),
 ('crisis', 'NOUN', 'crisis')]

In [17]:
dict([(str(x), x.label_) for x in nlp(str(sentences[50])).ents])

{'The University of Michigan': 'ORG',
 '11.9': 'CARDINAL',
 'March': 'DATE',
 'one month': 'DATE',
 'October 2008': 'DATE'}

In [18]:
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[50]])

[(The, 'B', 'ORG'), (University, 'I', 'ORG'), (of, 'I', 'ORG'), (Michigan, 'I', 'ORG'), (survey, 'O', ''), (found, 'O', ''), (consumer, 'O', ''), (sentiment, 'O', ''), (fell, 'O', ''), (11.9, 'B', 'CARDINAL'), (points, 'O', ''), (in, 'O', ''), (March, 'B', 'DATE'), (-, 'O', ''), (the, 'O', ''), (biggest, 'O', ''), (one, 'B', 'DATE'), (month, 'I', 'DATE'), (drop, 'O', ''), (since, 'O', ''), (October, 'B', 'DATE'), (2008, 'I', 'DATE'), (,, 'O', ''), (at, 'O', ''), (the, 'O', ''), (height, 'O', ''), (of, 'O', ''), (the, 'O', ''), (global, 'O', ''), (financial, 'O', ''), (crisis, 'O', ''), (., 'O', '')]
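The token-level output above uses spaCy's IOB scheme: B marks the first token of an entity, I a token inside one, and O a token outside any entity. As a rough sketch of how these tags map back to the spans in article.ents, the hypothetical helper below merges (text, iob, label) triples into entity strings. The token list is an abridged copy of the output above, and joining words with single spaces is a simplification of spaCy's own whitespace handling.

```python
def group_entities(tagged_tokens):
    # hypothetical helper: merge (text, iob, label) triples into (span text, label) pairs
    spans = []
    current_words, current_label = [], None
    for text, iob, label in tagged_tokens:
        if iob == 'B':
            # a new entity starts; close any entity still open
            if current_words:
                spans.append((' '.join(current_words), current_label))
            current_words, current_label = [text], label
        elif iob == 'I' and current_words:
            # continue the entity that is currently open
            current_words.append(text)
        else:
            # outside any entity; close the open one, if there is one
            if current_words:
                spans.append((' '.join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        spans.append((' '.join(current_words), current_label))
    return spans

# abridged copy of the token-level output above
tokens = [('The', 'B', 'ORG'), ('University', 'I', 'ORG'), ('of', 'I', 'ORG'),
          ('Michigan', 'I', 'ORG'), ('survey', 'O', ''), ('fell', 'O', ''),
          ('11.9', 'B', 'CARDINAL'), ('points', 'O', ''),
          ('October', 'B', 'DATE'), ('2008', 'I', 'DATE')]
print(group_entities(tokens))
# [('The University of Michigan', 'ORG'), ('11.9', 'CARDINAL'), ('October 2008', 'DATE')]
```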


In [19]:
#visualise entire article
displacy.render(article, jupyter=True, style='ent')