# Named entity recognition 

Words or phrases that refer to a specific person, location, or organization are called **named entities**.


**Named entity recognition** is a subtask of information extraction: the process of identifying which words in a given text are named entities.
With named entity recognition, you can find the named entities in your texts and also determine what kind of named entity they are.



Here’s the list of named entity types from the <a href="https://www.nltk.org/book/ch07.html#sec-ner">NLTK book</a>:

<table >
    <tr>
        <th>NE type</th>
        <th>Examples</th>
    </tr>
    <tr>
        <td>ORGANIZATION</td>
        <td>Georgia-Pacific Corp., WHO</td>
    </tr>
    <tr>
        <td>PERSON</td>
        <td>Eddy Bonte, President Obama</td>
    </tr>
        <tr>
        <td>LOCATION</td>
        <td>Murray River, Mount Everest</td>
    </tr>
        <tr>
        <td>DATE</td>
        <td>June, 2008-06-29</td>
    </tr>
        <tr>
        <td>TIME</td>
        <td>two fifty a m, 1:30 p.m.</td>
    </tr>
    <tr>
        <td>MONEY</td>
        <td>175 million Canadian dollars, GBP 10.40</td>
    </tr>
    <tr>
        <td>PERCENT</td>
        <td>twenty pct, 18.75 %</td>
    </tr>
        <tr>
        <td>FACILITY</td>
        <td>Washington Monument, Stonehenge</td>
    </tr>
    <tr>
        <td>GPE</td>
        <td>South East Asia, Midlothian</td>
    </tr>
    
</table>

Let’s use lotr_pos_tags again to test it out:

First, import the necessary libraries:

In [93]:
import nltk
from nltk.tokenize import word_tokenize


In [94]:
sentence="It's a dangerous business, Frodo, going out your door."

In [95]:
tk_word=word_tokenize(sentence)

In [96]:
print(tk_word)

['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']


Now tag each token in the sentence with its part of speech:

In [97]:
lotr_pos_tags=nltk.pos_tag(tk_word)
print(lotr_pos_tags)

[('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'), ('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'), ('.', '.')]
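Part-of-speech tags already give a rough hint about where named entities might be: proper nouns carry the tags NNP (singular) and NNPS (plural). The sketch below filters the tagged tokens shown above for those tags. The variable name **proper_nouns** is only illustrative, and this filter is a crude heuristic, not what the named entity chunker does internally.

```python
# POS tags copied from the output above.
lotr_pos_tags = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

# Keep only tokens tagged as proper nouns (NNP singular, NNPS plural).
proper_nouns = [word for word, tag in lotr_pos_tags if tag in {"NNP", "NNPS"}]
print(proper_nouns)
```

A proper noun is only a candidate: the tag alone does not say whether it is a person, a place, or an organization.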


You can use **nltk.ne_chunk()** to recognize **named entities**.

In [98]:
nltk.download("maxent_ne_chunker")
nltk.download("words")

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\abhir\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\abhir\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [99]:
tree = nltk.ne_chunk(lotr_pos_tags)

In [100]:
for t in tree:
    # Without binary=True, chunks carry specific labels such as PERSON, not "NE"
    if hasattr(t, "label"):
        print(t.label(), " ".join(word for word, tag in t))
    

In [101]:
tree.draw()

See how Frodo has been tagged as a PERSON? You also have the option to use the parameter binary=True <ins>if you just want to know what the named entities are but not what kind of named entity they are:</ins>

In [102]:
tree = nltk.ne_chunk(lotr_pos_tags,binary=True)

In [103]:
tree.draw()
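**tree.draw()** opens a separate window, which may not be available everywhere (for example on a remote server). As a rough alternative, you can inspect the tree as text. In the sketch below the tree is built by hand to stand in for the chunker's result, assuming Frodo is labeled PERSON as described above. In the notebook you would use the tree returned by **nltk.ne_chunk()** instead.

```python
from nltk.tree import Tree

# Hand-built stand-in for the chunker's result (an assumption for illustration).
tree = Tree("S", [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","),
    Tree("PERSON", [("Frodo", "NNP")]),
    (",", ","), ("going", "VBG"), ("out", "RP"), ("your", "PRP$"),
    ("door", "NN"), (".", "."),
])

# Print the tree as bracketed text instead of drawing it.
print(tree.pformat())

# Collect (label, text) pairs for every chunk below the root "S" node.
entities = [
    (subtree.label(), " ".join(word for word, tag in subtree.leaves()))
    for subtree in tree.subtrees(lambda t: t.label() != "S")
]
print(entities)
```

With binary=True, the same loop would report the label NE rather than a specific type.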

That’s how you can identify named entities! But you can take this one step further and extract named entities directly from your text. Create a string from which to extract named entities. 

You can use this quote from <a href="https://en.wikipedia.org/wiki/The_War_of_the_Worlds">The War of the Worlds</a>:

In [104]:
string = """
Men like Schiaparelli watched the red planet—it is odd, by-the-bye, that
for countless centuries Mars has been the star of war—but failed to
interpret the fluctuating appearances of the markings they mapped so well.
All that time the Martians must have been getting ready.

During the opposition of 1894 a great light was seen on the illuminated
part of the disk, first at the Lick Observatory, then by Perrotin of Nice,
and then by other observers. English readers heard of it first in the
issue of Nature dated August 2."""

Now create a function to extract named entities:

In [105]:
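def extract_ne(string):
    # Tokenize, tag parts of speech, then chunk named entities
    words = word_tokenize(string, language="english")
    tags = nltk.pos_tag(words)
    tree = nltk.ne_chunk(tags, binary=True)
    tree.draw()  # opens a window showing the chunk tree
    # Join each NE chunk's words into one string; the set removes repeats
    return set(
        " ".join(i[0] for i in t)
        for t in tree
        if hasattr(t, "label") and t.label() == "NE"
    )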
    

With this function, you gather all named entities, with no repeats. To do that, you:
<ul>
    <li>Tokenize the text by word</li>
    <li>Apply part-of-speech tags to those words</li>
    <li>Extract named entities based on those tags</li>
</ul>
Because you included binary=True, the named entities you’ll get won’t be labeled more specifically. You’ll just know that they’re named entities.
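If you do want the specific types, one option is to chunk without binary=True and group the results by label. The helper below is a minimal sketch: the name **group_entities_by_type** is hypothetical, and the sample tree is made up by hand for illustration, not real chunker output. In practice you would pass it the tree returned by **nltk.ne_chunk(tags)**.

```python
from collections import defaultdict

from nltk.tree import Tree


def group_entities_by_type(tree):
    """Map each chunk label (e.g. PERSON) to the set of entity strings found."""
    grouped = defaultdict(set)
    for node in tree:
        # Chunks are Tree objects; plain tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            grouped[node.label()].add(text)
    return dict(grouped)


# Made-up tree for illustration only (not real chunker output).
sample_tree = Tree("S", [
    Tree("PERSON", [("Frodo", "NNP")]),
    ("left", "VBD"),
    Tree("GPE", [("Bag", "NNP"), ("End", "NNP")]),
    (".", "."),
])

print(group_entities_by_type(sample_tree))
```

Returning sets keeps the "no repeats" behavior of the function above, while the dictionary keys preserve each entity's type.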

Now call the function on the quote to see the extracted entities:

In [106]:
extract_ne(string)

{'Lick Observatory', 'Mars', 'Nature', 'Perrotin', 'Schiaparelli'}

You missed the city of Nice, possibly because NLTK interpreted it as a regular English adjective, but you still got the following:
<ul>
    <li>An institution: 'Lick Observatory'</li>
    <li>A planet: 'Mars'</li>
    <li>A publication: 'Nature'</li>
    <li>People: 'Perrotin', 'Schiaparelli'</li>
</ul>    
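One possible way to catch misses like Nice is to supplement the chunker's result with a small, hand-maintained list of names (often called a gazetteer). The sketch below is not part of NLTK: **KNOWN_PLACES** and **add_known_places** are hypothetical names, and the list would need to be filled in for your own texts.

```python
import re

# Hypothetical, hand-maintained list of place names the chunker may miss.
KNOWN_PLACES = {"Nice"}


def add_known_places(text, entities, known_places=KNOWN_PLACES):
    """Return the entities plus any known place that appears in the text."""
    # Whole-word, case-sensitive match, so the lowercase adjective "nice" is ignored.
    found = {
        place for place in known_places
        if re.search(rf"\b{re.escape(place)}\b", text)
    }
    return set(entities) | found


text = ("first at the Lick Observatory, then by Perrotin of Nice, "
        "and then by other observers.")
entities = {"Lick Observatory", "Perrotin"}
print(sorted(add_known_places(text, entities)))
```

This is a blunt instrument: a sentence that starts with the adjective "Nice" would also match, so treat the result as a suggestion to review, not a replacement for the chunker.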