<div class="alert alert-block alert-info">
Notebook Author:<br>Felix Gonzalez, P.E. <br> Adjunct Instructor, <br> Division of Professional Studies <br> Computer Science and Electrical Engineering <br> University of Maryland Baltimore County <br> fgonzale@umbc.edu
</div>

<div class="alert alert-block alert-info">
Acknowledgements:<br>Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
<br> https://www.nltk.org/book/
</div>

# NLTK Extracting Named Entities

https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python

https://stackoverflow.com/questions/31689621/how-to-traverse-an-nltk-tree-object

In [5]:
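# One-time downloads; uncomment and run after `import nltk`.
#nltk.download('maxent_ne_chunker') # Named entity chunker model.
#nltk.download('words') # Word list used by the chunker.
# word_tokenize and pos_tag need their own resources too (names vary by NLTK version).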

In [6]:
import nltk
from nltk.corpus import words, names#, state_union
from nltk.tokenize import PunktSentenceTokenizer
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

# Sample Tokenization

In [7]:
text = "Bill okay! testing capitalisation. my nice friend is called bob he lives in america."

tokenized_words = word_tokenize(text)        # Split the text into tokens.
pos_tagged_text = pos_tag(tokenized_words)   # Tag each token with its part of speech.
nltk_results = ne_chunk(pos_tagged_text)     # Group tagged tokens into named-entity chunks.

In [8]:
print(tokenized_words)

['Bill', 'okay', '!', 'testing', 'capitalisation', '.', 'my', 'nice', 'friend', 'is', 'called', 'bob', 'he', 'lives', 'in', 'america', '.']


In [9]:
print(pos_tagged_text)

[('Bill', 'NNP'), ('okay', 'PRP'), ('!', '.'), ('testing', 'VBG'), ('capitalisation', 'NN'), ('.', '.'), ('my', 'PRP$'), ('nice', 'JJ'), ('friend', 'NN'), ('is', 'VBZ'), ('called', 'VBN'), ('bob', 'NN'), ('he', 'PRP'), ('lives', 'VBZ'), ('in', 'IN'), ('america', 'NN'), ('.', '.')]


In [10]:
print(nltk_results) # ne_chunk returns the named entities as a linguistic tree (nltk.tree.Tree).
# https://www.nltk.org/api/nltk.tree.tree.html

(S
  (PERSON Bill/NNP)
  okay/PRP
  !/.
  testing/VBG
  capitalisation/NN
  ./.
  my/PRP$
  nice/JJ
  friend/NN
  is/VBZ
  called/VBN
  bob/NN
  he/PRP
  lives/VBZ
  in/IN
  america/NN
  ./.)


# NLTK Finding Entities


In [14]:
#text = "Elon Musk 889-888-8888 elonpie@tessa.net Jeff Bezos (345)123-1234 bezzi@zonbi.com Reshma Saujani example.email@email.com 888-888-8888 Barkevious Mingo"
#text = "Elon okay! testing capitalisation. my nice friend is called bob he lives in america."
text='''That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase over the same period in 2016; further notes 156 of these job losses will be in York, a city that in the last six months has seen 2,000 job losses announced and has become the most inequitable city outside of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and production machinery, has risen due to the weakness of the pound and the uncertainty over the UK's future relationship with the single market and customs union; and calls on the Government to intervene and work with hon. Members, trades unions GMB and Unite and the company to avert these job losses now and prevent further job losses across Nestlé.'''

nltk_results = ne_chunk(pos_tag(word_tokenize(text))) 
# Note: displaying the resulting Tree in a notebook uses svgling, which does single-pass
# rendering of linguistics-style constituent trees into SVG.
# The code still works without svgling installed; the tree is shown as text instead.

In [17]:
nltk_results

ModuleNotFoundError: No module named 'svgling'

Tree('S', [('That', 'IN'), ('this', 'DT'), Tree('ORGANIZATION', [('House', 'NNP')]), ('notes', 'VBZ'), ('the', 'DT'), ('announcement', 'NN'), ('of', 'IN'), ('300', 'CD'), ('redundancies', 'NNS'), ('at', 'IN'), ('the', 'DT'), Tree('GPE', [('Nestlé', 'NNP')]), ('manufacturing', 'NN'), ('factories', 'NNS'), ('in', 'IN'), Tree('GPE', [('York', 'NNP')]), (',', ','), Tree('GPE', [('Fawdon', 'NNP')]), (',', ','), Tree('PERSON', [('Halifax', 'NNP')]), ('and', 'CC'), Tree('PERSON', [('Girvan', 'NNP')]), ('and', 'CC'), ('that', 'DT'), ('production', 'NN'), ('of', 'IN'), ('the', 'DT'), Tree('ORGANIZATION', [('Blue', 'NNP'), ('Riband', 'NNP')]), ('bar', 'NN'), ('will', 'MD'), ('be', 'VB'), ('transferred', 'VBN'), ('to', 'TO'), Tree('GPE', [('Poland', 'NNP')]), (';', ':'), ('acknowledges', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('first', 'JJ'), ('three', 'CD'), ('months', 'NNS'), ('of', 'IN'), ('2017', 'CD'), Tree('PERSON', [('Nestlé', 'NNP')]), ('achieved', 'VBD'), ('£21', 'JJ'), ('billion', 'CD'), 

### Finding the Number of Person Mentions in a String



In [15]:
contains_person_name = 0

for nltk_result in nltk_results:
    # Named entities are subtrees; ordinary tokens are (word, tag) tuples.
    if isinstance(nltk_result, Tree):
        # Count each entity once, regardless of how many tokens it spans.
        if nltk_result.label() in ('PERSON', 'GPE'): # Names are sometimes detected as GPE.
            contains_person_name += 1

print('Number of Person or GPE Name Mentions: ', contains_person_name)

Number of Person or GPE Name Mentions:  14
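
The counting loop can also be wrapped in a small helper so that the set of labels is a parameter. The following is a minimal sketch, not part of the original notebook: `count_entities` is a hypothetical name, and `sample_tree` is built by hand to mirror the shape of `ne_chunk` output so the cell runs without downloading any models.

```python
from nltk.tree import Tree

def count_entities(tree, labels=("PERSON", "GPE")):
    """Count chunk subtrees whose label is in `labels` (one count per entity)."""
    return sum(
        1
        for node in tree
        if isinstance(node, Tree) and node.label() in labels
    )

# Hand-built tree with the same structure as ne_chunk output (illustrative data only).
sample_tree = Tree("S", [
    Tree("PERSON", [("Bill", "NNP")]),
    ("lives", "VBZ"),
    ("in", "IN"),
    Tree("ORGANIZATION", [("Blue", "NNP"), ("Riband", "NNP")]),
    ("near", "IN"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
])

person_or_gpe_count = count_entities(sample_tree)
organization_count = count_entities(sample_tree, labels=("ORGANIZATION",))
print(person_or_gpe_count, organization_count)
```

Because the check is made once per subtree, a two-token entity such as "New York" contributes one mention rather than two.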


# Printing Labels of Entities

In [16]:
for nltk_result in nltk_results:
    if isinstance(nltk_result, Tree): # Only subtrees are named entities.
        entity_name = ''
        for nltk_result_leaf in nltk_result.leaves():
            entity_name += nltk_result_leaf[0] + ' ' # Each leaf is a (word, tag) tuple.
        print ('Type: ', nltk_result.label(), 'Name: ', entity_name)

Type:  ORGANIZATION Name:  House 
Type:  GPE Name:  Nestlé 
Type:  GPE Name:  York 
Type:  GPE Name:  Fawdon 
Type:  PERSON Name:  Halifax 
Type:  PERSON Name:  Girvan 
Type:  ORGANIZATION Name:  Blue Riband 
Type:  GPE Name:  Poland 
Type:  PERSON Name:  Nestlé 
Type:  GPE Name:  York 
Type:  LOCATION Name:  South East 
Type:  GPE Name:  Fawdon 
Type:  GPE Name:  Newcastle 
Type:  GPE Name:  EU 
Type:  GPE Name:  EU 
Type:  ORGANIZATION Name:  UK 
Type:  GPE Name:  EU 
Type:  ORGANIZATION Name:  UK 
Type:  ORGANIZATION Name:  GMB 
Type:  ORGANIZATION Name:  Unite 
Type:  GPE Name:  Nestlé 
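
The listing above shows the same name receiving different labels (for example, Nestlé appears as both GPE and PERSON). One way to surface this is to tally the labels and group them by name. The sketch below is an assumption-labeled illustration: the `entities` list is a hand-copied subset of the printed output, and the variable names are hypothetical.

```python
from collections import Counter

# (label, name) pairs copied by hand from a subset of the printed output.
entities = [
    ("ORGANIZATION", "House"),
    ("GPE", "Nestlé"),
    ("GPE", "York"),
    ("PERSON", "Halifax"),
    ("PERSON", "Nestlé"),
    ("GPE", "EU"),
    ("ORGANIZATION", "UK"),
    ("GPE", "Nestlé"),
]

# How many times each label occurs.
label_counts = Counter(label for label, _ in entities)

# Collect the set of labels assigned to each distinct name.
labels_by_name = {}
for label, name in entities:
    labels_by_name.setdefault(name, set()).add(label)

# Names that were given more than one label.
inconsistent = {
    name: sorted(labels)
    for name, labels in labels_by_name.items()
    if len(labels) > 1
}

print(label_counts)
print(inconsistent)
```

Names that land in `inconsistent` are good candidates for manual review before the counts are relied on.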


Note that NLTK's named entity recognition seems to rely in part on capitalization to recognize names. For example, capitalized words in the middle of a sentence may be recognized as entities. In the first sample, the capitalized "Bill" was labeled PERSON, while the lowercase "bob" and "america" were not recognized as entities.

# Entity Recognition within Movie Database Overview

# END OF NOTEBOOK