# Step7: Named Entity Recognition with NLTK

###  Named Entity Recognition with NLTK - 
####One of the most major forms of chunking in natural language processing is called "Named Entity Recognition." The idea is to have the machine immediately be able to pull out "entities" like people, places, things, locations, monetary figures, and more.

####This can be a bit of a challenge, but NLTK is this built in for us. There are two major options with NLTK's named entity recognition: either recognize all named entities, or recognize named entities as their respective type, like people, places, locations, etc.

#### POS tag list:

##### CC -	coordinating conjunction
##### CD -	cardinal digit
##### DT -	determiner
##### EX -	existential there (like: "there is" ... think of it like "there exists")
##### FW -	foreign word
##### IN -	preposition/subordinating conjunction
##### JJ -	adjective	'big'
##### JJR -	adjective, comparative	'bigger'
##### JJS -	adjective, superlative	'biggest'
##### LS -	list marker	1)
##### MD -	modal	could, will
##### NN -	noun, singular 'desk'
##### NNS -	noun plural	'desks'
##### NNP -	proper noun, singular	'Harrison'
##### NNPS -	proper noun, plural	'Americans'
##### PDT -	predeterminer	'all the kids'
##### POS -	possessive ending	parent\'s
##### PRP -	personal pronoun	I, he, she
##### PRP -	possessive pronoun	my, his, hers
##### RB -	adverb	very, silently,
##### RBR -	adverb, comparative	better
##### RBS -	adverb, superlative	best
##### RP -	particle	give up
##### TO -	to	go 'to' the store.
##### UH -	interjection	errrrrrrrm
##### VB -	verb, base form	take
##### VBD -	verb, past tense	took
##### VBG -	verb, gerund/present participle	taking
##### VBN -	verb, past participle	taken
##### VBP -	verb, sing. present, non-3d	take
##### VBZ -	verb, 3rd person sing. present	takes
##### WDT -	wh-determiner	which
##### WP -	wh-pronoun	who, what
##### WP -	possessive wh-pronoun	whose
##### WRB -	wh-abverb	where, when

  ##  PunktSentenceTokenizer. - This tokenizer is capable of unsupervised machine learning, so you can actually train it on any body of text that you use

In [0]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

####  create our training and testing data:

In [2]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> 

---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> l
Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
         

True

In [0]:
#  create our training and testing data:
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

#### train the Punkt tokenizer

In [0]:
# train the Punkt tokenizer
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [0]:
# now tokenize using trained PunktSentenceTokenizer
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [28]:
print(tokenized)

["PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all.", 'Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream.', 'Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King.', '(Applause.)', 'President George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan.', '31, 2006.', "White House photo by Eric DraperEvery time I'm invited to this rostrum, I'm humbled by the privilege, and mindful of the history we've seen together.", 'We have gathered under this Capitol dome in moments of national mourning and national achievemen

#### function that will run through and tag all of the parts of speech per sentence

In [0]:
# function that will run through and tag all of the parts of speech per sentence
def process_content():
    try:
        for sentence in tokenized[:5]:
            tokenized_words= nltk.word_tokenize(sentence)
            tagged_words = nltk.pos_tag(tokenized_words)
            #print(tagged_words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            print(namedEnt) 
    except Exception as e:
        print(str(e))
        

In [27]:
process_content()

(S never/RB found/VBN right/RB one/CD ./.)
(S never/RB found/VBN right/RB one/CD ./.)
(S never/RB found/VBN right/RB one/CD ./.)
(S never/RB found/VBN right/RB one/CD ./.)
(S never/RB found/VBN right/RB one/CD ./.)


###Here, with the option of binary = True, this means either something is a named entity, or not. There will be no further detail. 

###If you set binary = False, it picked up the same things, but wound up splitting up terms like White House into "White" and "House" as if they were different, whereas we could see in the binary = True option, the named entity recognition was correct to say White House was part of the same named entity

##Depending on your goals, you may use the binary option how you see fit. Here are the types of Named Entities that you can get if you have binary as false:

###NE Type and Examples
####ORGANIZATION - Georgia-Pacific Corp., WHO
####PERSON - Eddy Bonte, President Obama
####LOCATION - Murray River, Mount Everest
####DATE - June, 2008-06-29
####TIME - two fifty a m, 1:30 p.m.
####MONEY - 175 million Canadian Dollars, GBP 10.40
####PERCENT - twenty pct, 18.75 %
####FACILITY - Washington Monument, Stonehenge
####GPE - South East Asia, Midlothian

### for more details https://pythonprogramming.net/named-entity-recognition-nltk-tutorial/