# named entity recognition

* Named Entity Recognition (NER) is a natural language processing task that identifies important named entities in text, such as people, places and organizations. Depending on the library and annotation scheme you use, entities can also include dates, states, works of art and other categories.

* NLTK lets you perform named entity recognition with its own model, and it can also interface with the Stanford CoreNLP library.

* pos_tag assigns a part-of-speech tag to each word. Some common tags are:  
NN: Noun, singular or mass  
NNS: Noun, plural  
VB: Verb, base form  
VBD: Verb, past tense  
VBG: Verb, gerund or present participle  
VBN: Verb, past participle  
JJ: Adjective  
RB: Adverb  
PRP: Personal pronoun

* The POS-tagged sentences are then passed to the ne_chunk function (short for named entity chunk), which returns the sentence as a tree in which the named entities mentioned in the unstructured text are classified into predefined categories. For example:

PERSON: Persons (e.g., John Doe, Queen Elizabeth)  
ORGANIZATION: Organizations (e.g., Apple Inc., Google)  
GPE: Geopolitical Entities (e.g., USA, California, London)  
LOCATION: Locations that are not geopolitical units (e.g., Mount Everest, Silicon Valley)  
DATE: Dates (e.g., 2023-11-05)  
TIME: Times (e.g., 3:15 PM)  
MONEY: Monetary values (e.g., $10, €20)  
PERCENT: Percentages (e.g., 25%)
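
To make the tree structure described above concrete, here is a minimal sketch of pulling (entity text, label) pairs out of a chunk tree. The tree is built by hand as a stand-in for what ne_chunk might return, so it is not real model output, and `extract_entities` is a hypothetical helper name rather than part of NLTK.

```python
from nltk.tree import Tree

# Hand-built stand-in for an ne_chunk result (illustrative only, not real model output).
# Plain (token, tag) tuples are ordinary words; nested Tree nodes are named entities.
chunked = Tree("S", [
    ("In", "IN"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
    (",", ","),
    ("I", "PRP"),
    ("like", "VBP"),
    Tree("PERSON", [("Ruth", "NNP"), ("Reichl", "NNP")]),
])


def extract_entities(tree):
    """Return (entity text, label) pairs for each labelled subtree directly under the root."""
    entities = []
    for node in tree:
        # Only entity chunks are Tree objects; untagged words stay as tuples.
        if isinstance(node, Tree):
            text = " ".join(token for token, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


print(extract_entities(chunked))
```

Checking `isinstance(node, Tree)` plays the same role as the `hasattr(chunk, "label")` test used later in these notes: both separate entity subtrees from plain word tuples.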

In [None]:
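import nltk

# Note: the tokenizer, tagger and chunker each need their NLTK data packages,
# which can be fetched with nltk.download() if they are not installed yet.
sentence = '''In New York, I like to ride the Metro to visit MOMA and some restaurants rated well by Ruth Reichl.'''
tokenized_sent = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokenized_sent)      # part-of-speech tagging: assigns a grammatical role to each word
tagged_sent[:3]

# This will display [('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]

# Chunk the tagged tokens into a tree whose labelled subtrees are the named entities
print(nltk.ne_chunk(tagged_sent))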


In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import defaultdict
import matplotlib.pyplot as plt

In [None]:
article = '''The taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character. If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic. Uber wanted to know as much as possible about the people who use its service, and those who don’t. It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies. Even if their email was notionally anonymised, this use of it was not something the users had bargained for. Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple’s phones even thought it is forbidden by the company.


Uber has also tweaked its software so that regulatory agencies that the company regarded as hostile would, when they tried to hire a driver, be given false reports about the location of its cars. Uber management booked and then cancelled rides with a rival taxi-hailing company which took their vehicles out of circulation. Uber deny this was the intention. The punishment for this behaviour was negligible. Uber promised not to use this “greyball” software against law enforcement – one wonders what would happen to someone carrying a knife who promised never to stab a policeman with it. Travis Kalanick of Uber got a personal dressing down from Tim Cook, who runs Apple, but the company did not prohibit the use of the app. Too much money was at stake for that.


Millions of people around the world value the cheapness and convenience of Uber’s rides too much to care about the lack of drivers’ rights or pay. Many of the users themselves are not much richer than the drivers. The “sharing economy” encourages the insecure and exploited to exploit others equally insecure to the profit of a tiny clique of billionaires. Silicon Valley’s culture seems hostile to humane and democratic values. The outgoing CEO of Yahoo, Marissa Mayer, who is widely judged to have been a failure, is likely to get a $186m payout. This may not be a cause for panic, any more than the previous hero worship should have been a cause for euphoria. Yet there’s an urgent political task to tame these companies, to ensure they are punished when they break the law, that they pay their taxes fairly and that they behave responsibly.'''


# Tokenize the article into sentences: sentences
sentences = sent_tokenize(article)

# Tokenize each sentence into words: token_sentences
token_sentences = [word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences] 

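# Create the named entity chunks: chunked_sentences
# list(...) materializes the result so it can safely be iterated more than once
chunked_sentences = list(nltk.ne_chunk_sents(pos_sentences, binary=True))

# Print the subtrees labelled 'NE' (binary=True marks every entity with the generic 'NE' label)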
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)


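# Re-chunk with binary=False so each entity gets a category label
# (PERSON, GPE, ORGANIZATION, ...) instead of the generic 'NE'
category_sentences = nltk.ne_chunk_sents(pos_sentences, binary=False)

# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)

# Count every labelled chunk by its category
for sent in category_sentences: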
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
            
# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories.get(v) for v in labels]

# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()
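
A general Python point matters whenever the same chunked sentences are looped over more than once, as in the cell above: a lazy iterator (generator) can be consumed only once, and a second loop over it silently sees nothing. Whether ne_chunk_sents hands back a list or a lazy iterator may depend on the NLTK version (an assumption here, not checked against a specific release), so wrapping the result in list() is a safe habit. The sketch below uses a hypothetical stand-in, `lazy_chunks`, rather than NLTK itself.

```python
def lazy_chunks(sentences):
    """Hypothetical stand-in for a chunker that yields its results lazily."""
    for sent in sentences:
        yield sent.upper()


data = ["uber", "apple"]

# Looping over a generator twice: the second pass finds it already exhausted.
lazy = lazy_chunks(data)
first_pass = list(lazy)
second_pass = list(lazy)
print(first_pass, second_pass)

# Materializing once with list() makes repeated passes safe.
materialized = list(lazy_chunks(data))
again_1 = [item for item in materialized]
again_2 = [item for item in materialized]
print(again_1 == again_2)
```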


# intro to spacy

* spaCy is an NLP library similar to Gensim, but with different implementations, including a particular focus on creating NLP pipelines to generate models and corpora.

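* A loaded pipeline has several linked objects, including the entity recognizer, an EntityRecognizer object from the pipeline module (exposed as nlp.entity in older spaCy releases and as the "ner" pipe in current ones). This is what is used to find entities in the text.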

* Then we load a new document by passing a string to the nlp object. When the document is loaded, the named entities are stored in a document attribute called ents.

* We can also investigate the labels of each entity by using indexing to pick out the first entity and the label_ attribute to see the label for that particular entity.

* spaCy comes with informal language corpora, which makes it easier to find entities in documents such as tweets and chat messages.
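
The ents and label_ attributes described above can be tried without downloading a trained model. The following is only a sketch under stated assumptions: it assumes spaCy v3 and swaps in a blank English pipeline with a rule-based entity_ruler in place of the statistical entity recognizer, and the three patterns are made up for this one sentence.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Rule-based stand-in for the statistical entity recognizer (assumes spaCy v3).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "Berlin"},
    {"label": "GPE", "pattern": "Germany"},
    {"label": "PERSON", "pattern": "Angela Merkel"},
])

doc = nlp("Berlin is the capital of Germany and the residence of Chancellor Angela Merkel.")

# doc.ents holds the entity spans; label_ is the readable label of each span.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

A trained pipeline such as en_core_web_sm fills doc.ents in the same way, but predicts the spans statistically instead of matching fixed patterns.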

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
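nlp.get_pipe('ner')  # the entity recognizer component; older spaCy releases exposed it as nlp.entity

# this will display the entity recognizer object, something like <spacy.pipeline.EntityRecognizer at 0x7f76b75e68b8>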

doc = nlp("""Berlin is the capital of Germany; and the residence of Chancellor Angela Merkel.""")
doc.ents

# this will display all the entities found in the document, such as (Berlin, Germany, Angela Merkel)

print(doc.ents[0], doc.ents[0].label_)
# this will print the first entity's text and label, such as Berlin GPE (GPE means geopolitical entity)