## What is Named Entity Recognition?
- NLP task to identify important named entities in the text
    - People, places, organizations
    - Dates, states, works of art
    - ... and other categories!
- Can be used alongside topic identification
    - ... or on its own!
- Who? What? When? Where?



## Example of NER

![](https://i.imgur.com/qEK89qR.png)

(Source: Europeana Newspapers (http://www.europeana-newspapers.eu))

- text has been highlighted for different types of entities
- dates, locations, people, and organizations
- extract information based on these entities

- use it for fact extraction and to show which entities are related, using computational language models
- ex: 
    - `Einstein` has something to do with `United States`, `Adolf Hitler` and `Germany`
    - we can see by token proximity that `Russell` and `Einstein` created the `Russell-Einstein Manifesto`
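
The token-proximity idea can be sketched in plain Python. This is only an illustration, not how any particular library relates entities: the sentence, the entity set, the `window` size, and the helper `cooccurring_entities` are all made up for the example.

```python
from itertools import combinations

# Hypothetical pre-tokenized sentence; multi-word entities are already merged.
tokens = ["Russell", "and", "Einstein", "issued", "the",
          "Russell-Einstein Manifesto", "in", "London"]
# Assumed to come from an earlier NER step.
entities = {"Russell", "Einstein", "Russell-Einstein Manifesto", "London"}


def cooccurring_entities(tokens, entities, window=3):
    """Return pairs of entities that appear within `window` tokens of each other."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    pairs = []
    for (i, a), (j, b) in combinations(positions, 2):
        if j - i <= window:
            pairs.append((a, b))
    return pairs


pairs = cooccurring_entities(tokens, entities)
print(pairs)
```

A small window keeps only entities that are mentioned close together; a larger one links more pairs but also adds weaker relations.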
    


## nltk and the Stanford CoreNLP Library
- The Stanford CoreNLP library:
    - Integrated into Python via nltk
    - Java based
    - Support for NER as well as coreference and dependency trees

In [5]:
'''
If nltk raises "LookupError: Resource averaged_perceptron_tagger not found",
download the resource once:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')
'''

import nltk

# normal sentence
sentence = '''In New York, I like to ride the Metro to visit MOMA
and some restaurants rated well by Ruth Reichl.'''


# preprocess with tokenization
tokenized_sent = nltk.word_tokenize(sentence)


# tag the sentence for parts of speech
# this adds tags for nouns, proper nouns, adjectives, verbs, etc., based on English grammar
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent[:3]

[('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]

`New` and `York` are tagged as `NNP`, the part-of-speech tag for a singular proper noun.
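
Because neighbouring proper-noun tags often belong to one name, a rough first guess at entities is to merge consecutive `NNP` tokens. The sketch below does this in plain Python on the three tagged tokens printed above; it is a simplification for illustration, not what `nltk` itself does for NER.

```python
from itertools import groupby

# First three tagged tokens, copied from the output above.
tagged_sent = [("In", "IN"), ("New", "NNP"), ("York", "NNP")]

# Group neighbouring tokens by whether they are tagged as proper nouns,
# then join each proper-noun run into one candidate name.
proper_nouns = [
    " ".join(word for word, _ in group)
    for is_nnp, group in groupby(tagged_sent, key=lambda pair: pair[1] == "NNP")
    if is_nnp
]
print(proper_nouns)
```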

## nltk's `ne_chunk()`

In [8]:
'''
ne_chunk() needs two extra resources. If nltk raises
"LookupError: Resource maxent_ne_chunker not found" or
"LookupError: Resource words not found", download them once:

  >>> import nltk
  >>> nltk.download('maxent_ne_chunker')
  >>> nltk.download('words')
'''
# named entity chunking

# returns the sentence as a tree
# the tree shows the named entity tags as their own chunks, e.g. New York as GPE, MOMA as ORGANIZATION, Ruth Reichl as PERSON
print(nltk.ne_chunk(tagged_sent))

(S
  In/IN
  (GPE New/NNP York/NNP)
  ,/,
  I/PRP
  like/VBP
  to/TO
  ride/VB
  the/DT
  (ORGANIZATION Metro/NNP)
  to/TO
  visit/VB
  (ORGANIZATION MOMA/NNP)
  and/CC
  some/DT
  restaurants/NNS
  rated/VBN
  well/RB
  by/IN
  (PERSON Ruth/NNP Reichl/NNP)
  ./.)
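
To turn the tree into a flat list of entities, one option is to read the labelled chunks back out of the printed text. The sketch below uses only the standard library and the output shown above; the regular expression is an assumption that fits this flat, one-level tree and would not handle nested chunks. With a live tree object, looping over the chunks and checking `hasattr(chunk, "label")` does the same job.

```python
import re

# Printed tree, copied from the output above.
tree_text = """(S
  In/IN
  (GPE New/NNP York/NNP)
  ,/,
  I/PRP
  like/VBP
  to/TO
  ride/VB
  the/DT
  (ORGANIZATION Metro/NNP)
  to/TO
  visit/VB
  (ORGANIZATION MOMA/NNP)
  and/CC
  some/DT
  restaurants/NNS
  rated/VBN
  well/RB
  by/IN
  (PERSON Ruth/NNP Reichl/NNP)
  ./.)"""

# Match "(LABEL word/TAG word/TAG ...)" chunks that contain no nested parentheses.
chunk_pattern = re.compile(r"\((\w+) ([^()]+)\)")

entities = []
for label, body in chunk_pattern.findall(tree_text):
    # Drop the "/TAG" suffix from every token and rejoin the words.
    words = [token.rsplit("/", 1)[0] for token in body.split()]
    entities.append((label, " ".join(words)))

print(entities)
```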


## Let's practice

## NER with NLTK
You're now going to have some fun with named-entity recognition! A scraped news article has been pre-loaded into your workspace. Your task is to use nltk to find the named entities in this article.

What might the article be about, given the names you found?

Along with nltk, sent_tokenize and word_tokenize from nltk.tokenize have been pre-imported.

In [9]:
article = '\ufeffThe taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character. If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic. Uber wanted to know as much as possible about the people who use its service, and those who don’t. It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies. Even if their email was notionally anonymised, this use of it was not something the users had bargained for. Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple’s phones even thought it is forbidden by the company.\r\n\r\n\r\nUber has also tweaked its software so that regulatory agencies that the company regarded as hostile would, when they tried to hire a driver, be given false reports about the location of its cars. Uber management booked and then cancelled rides with a rival taxi-hailing company which took their vehicles out of circulation. Uber deny this was the intention. The punishment for this behaviour was negligible. Uber promised not to use this “greyball” software against law enforcement – one wonders what would happen to someone carrying a knife who promised never to stab a policeman with it. Travis Kalanick of Uber got a personal dressing down from Tim Cook, who runs Apple, but the company did not prohibit the use of the app. Too much money was at stake for that.\r\n\r\n\r\nMillions of people around the world value the cheapness and convenience of Uber’s rides too much to care about the lack of drivers’ rights or pay. Many of the users themselves are not much richer than the drivers. 
The “sharing economy” encourages the insecure and exploited to exploit others equally insecure to the profit of a tiny clique of billionaires. Silicon Valley’s culture seems hostile to humane and democratic values. The outgoing CEO of Yahoo, Marissa Mayer, who is widely judged to have been a failure, is likely to get a $186m payout. This may not be a cause for panic, any more than the previous hero worship should have been a cause for euphoria. Yet there’s an urgent political task to tame these companies, to ensure they are punished when they break the law, that they pay their taxes fairly and that they behave responsibly.'

In [22]:
# Tokenize the article into sentences: sentences
sentences = nltk.sent_tokenize(article)

# Tokenize each sentence into words: token_sentences
token_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences] 

# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

# Test for stems of the tree with 'NE' tags
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)

(NE Uber/NNP)
(NE Beyond/NN)
(NE Apple/NNP)
(NE Uber/NNP)
(NE Uber/NNP)
(NE Travis/NNP Kalanick/NNP)
(NE Tim/NNP Cook/NNP)
(NE Apple/NNP)
(NE Silicon/NNP Valley/NNP)
(NE CEO/NNP)
(NE Yahoo/NNP)
(NE Marissa/NNP Mayer/NNP)
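
One simple way to guess the article's topic from this output is to count how often each name appears. The sketch below parses the printed chunks with the standard library only; the slicing assumes every line has the exact `(NE word/TAG ...)` shape shown above.

```python
from collections import Counter

# Printed NE chunks, copied from the output above.
ne_output = """(NE Uber/NNP)
(NE Beyond/NN)
(NE Apple/NNP)
(NE Uber/NNP)
(NE Uber/NNP)
(NE Travis/NNP Kalanick/NNP)
(NE Tim/NNP Cook/NNP)
(NE Apple/NNP)
(NE Silicon/NNP Valley/NNP)
(NE CEO/NNP)
(NE Yahoo/NNP)
(NE Marissa/NNP Mayer/NNP)"""

names = []
for line in ne_output.splitlines():
    # Strip the leading "(NE " and the trailing ")".
    body = line.strip()[len("(NE "):-1]
    # Remove the "/TAG" suffix from each token.
    names.append(" ".join(token.rsplit("/", 1)[0] for token in body.split()))

counts = Counter(names)
print(counts.most_common(2))
```

The most frequent names point to an article about Uber and Apple, while one-off hits such as `Beyond` and `CEO` look like tagging mistakes.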


## Charting practice
In this exercise, you'll use some extracted named entities and their groupings from a series of newspaper articles to chart the diversity of named entity types in the articles.

You'll use a defaultdict called ner_categories, with keys representing every named entity group type, and values to count the number of each different named entity type. You have a chunked sentence list called chunked_sentences similar to the last exercise, but this time with non-binary category names.

You can use hasattr() to determine if each chunk has a 'label' and then simply use the chunk's .label() method as the dictionary key.

In [23]:
from collections import defaultdict
import matplotlib.pyplot as plt

```python
# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)

# Create the nested for loop
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
            
# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories.get(l) for l in labels]

# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()
```
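
The `autopct='%1.1f%%'` argument labels each slice with its share of the total. The sketch below computes those shares without plotting, so the counting logic can be checked on its own. The label list is hypothetical and stands in for the `chunk.label()` values of real chunks.

```python
from collections import defaultdict

# Hypothetical labels, standing in for chunk.label() on real chunks.
labels_seen = ["PERSON", "ORGANIZATION", "GPE", "PERSON",
               "ORGANIZATION", "PERSON", "GPE", "PERSON"]

# Count each category; missing keys start at 0.
ner_categories = defaultdict(int)
for label in labels_seen:
    ner_categories[label] += 1

# Convert counts into percentages of the total, as the pie chart displays them.
total = sum(ner_categories.values())
shares = {label: 100 * n / total for label, n in ner_categories.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
```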

# Intro to SpaCy


## What is SpaCy?
- NLP library similar to gensim, with different implementations
- Focus on creating NLP pipelines to generate models and corpora
- Open-source, with extra libraries and tools
    - Displacy

## Displacy entity recognition visualizer


![](https://i.imgur.com/51zNH2v.png)

(source: https://demos.explosion.ai/displacy-ent/)

## SpaCy NER




In [25]:
! conda install spacy -y


In [27]:
'''
download
python -m spacy download en
'''

import spacy


nlp = spacy.load('en')

# entity recognizer object
nlp.entity

<spacy.pipeline.EntityRecognizer at 0x7f08a1115200>

In [28]:
# load a document by passing a string to the nlp object
doc = nlp("""Berlin is the capital of Germany; 
                  and the residence of Chancellor Angela Merkel.""")


# named entities are stored in the document attribute "ents"

# spaCy has properly tagged and identified the 3 main entities in this sentence
doc.ents

(Berlin, Germany, Angela Merkel)

In [29]:
# investigate the label of each entity
# index the first entity and print its label (GPE: geopolitical entity)
print(doc.ents[0], doc.ents[0].label_)

Berlin GPE


## Why use SpaCy for NER?
- Easy pipeline creation
- Different entity types compared to nltk
- Informal language corpora
    - Easily find entities in Tweets and chat messages
- Quickly growing!

## let's practice

## Comparing NLTK with spaCy NER
Using the same text you used in the first exercise of this chapter, you'll now see the results using spaCy's NER annotator. How will they compare?

The article has been pre-loaded as article. To minimize execution times, you'll be asked to specify the keyword arguments tagger=False, parser=False, matcher=False when loading the spaCy model, because you only care about the entity in this exercise.

In [30]:
# Import spacy
import spacy

# Instantiate the English model: nlp
nlp = spacy.load('en', tagger=False, parser=False, matcher=False)

# Create a new document: doc
doc = nlp(article)

# Print all of the found entities and their labels
for ent in doc.ents:
    print(ent.label_, ent.text)

ORG Uber
ORG Uber
ORG Apple
ORG Uber
ORG Uber
PERSON Travis Kalanick
ORG Uber
PERSON Tim Cook
ORG Apple
CARDINAL Millions
ORG Uber
GPE drivers’
LOC Silicon Valley’s
ORG Yahoo
PERSON Marissa Mayer
MONEY $186m
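
To answer "how will they compare?", the two result lists can be put side by side with set operations. The sketch below hard-codes the entity texts from the NLTK output earlier in the chapter and the spaCy output above; it uses exact string matching, which is a simplifying assumption.

```python
# Entity texts copied from the NLTK exercise output.
nltk_entities = {"Uber", "Beyond", "Apple", "Travis Kalanick", "Tim Cook",
                 "Silicon Valley", "CEO", "Yahoo", "Marissa Mayer"}

# Entity texts copied from the spaCy output above.
spacy_entities = {"Uber", "Apple", "Travis Kalanick", "Tim Cook", "Millions",
                  "drivers’", "Silicon Valley’s", "Yahoo", "Marissa Mayer",
                  "$186m"}

both = nltk_entities & spacy_entities        # found by both libraries
only_nltk = nltk_entities - spacy_entities   # found by NLTK only
only_spacy = spacy_entities - nltk_entities  # found by spaCy only

print("both:", sorted(both))
print("NLTK only:", sorted(only_nltk))
print("spaCy only:", sorted(only_spacy))
```

Exact matching treats `Silicon Valley` and `Silicon Valley’s` as different entities, so a real comparison would probably normalize the text first. spaCy also returns categories such as `MONEY` and `CARDINAL` that the binary NLTK run does not distinguish.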


# Multilingual NER with polyglot

## What is polyglot?
- NLP library which uses word vectors
- Why polyglot?
    - Vectors for many different languages
    - More than 130!
![](https://i.imgur.com/ytoaWhZ.png)


In [43]:
'''
pip install morfessor
pip install pyicu
pip install pycld2

# download spanish lang
polyglot download embeddings2.es
polyglot download ner2.es

'''


In [39]:
from polyglot.text import Text

In [44]:
text = """El presidente de la Generalitat de Cataluña,
                  Carles Puigdemont, ha afirmado hoy a la alcaldesa 
                  de Madrid, Manuela Carmena, que en su etapa de 
                  alcalde de Girona (de julio de 2011 a enero de 2016) 
                  hizo una gran promoción de Madrid."""

ptext = Text(text)

In [46]:
ptext.entities

[I-ORG(['Generalitat', 'de']),
 I-LOC(['Generalitat', 'de', 'Cataluña']),
 I-PER(['Carles', 'Puigdemont']),
 I-LOC(['Madrid']),
 I-PER(['Manuela', 'Carmena']),
 I-LOC(['Girona']),
 I-LOC(['Madrid'])]


- list of entity chunks found while parsing the text
- each chunk has a label, shown as a tag starting with `I-`
    - `I-ORG`: represents an organization
    - `I-LOC`: represents a location
    - `I-PER`: represents a person
    
- some near-duplicates at the beginning because the same phrase represents both a location and an organization
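
Overlapping chunks like the first two can be spotted by checking whether one chunk's words are a prefix of another's. The sketch below works on plain `(tag, words)` tuples copied from the output above rather than on polyglot objects, and it assumes that matching words mean the same stretch of text. The helper `is_prefix` is made up for this example.

```python
# (tag, words) pairs copied from the ptext.entities output above.
entities = [
    ("I-ORG", ["Generalitat", "de"]),
    ("I-LOC", ["Generalitat", "de", "Cataluña"]),
    ("I-PER", ["Carles", "Puigdemont"]),
    ("I-LOC", ["Madrid"]),
    ("I-PER", ["Manuela", "Carmena"]),
    ("I-LOC", ["Girona"]),
    ("I-LOC", ["Madrid"]),
]


def is_prefix(shorter, longer):
    """True if `shorter` is a strict prefix of `longer`."""
    return len(shorter) < len(longer) and longer[:len(shorter)] == shorter


# Collect every pair where one chunk starts another, longer chunk.
overlaps = [(a, b) for a in entities for b in entities if is_prefix(a[1], b[1])]

for (tag_a, words_a), (tag_b, words_b) in overlaps:
    print(tag_a, " ".join(words_a), "overlaps", tag_b, " ".join(words_b))
```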

## let's practice

## French NER with polyglot I
In this exercise and the next, you'll use the polyglot library to identify French entities. The library functions slightly differently than spacy, so you'll use a few of the new things you learned in the last video to display the named entity text and category.

You have access to the full article string in article. Additionally, the Text class of polyglot has been imported from polyglot.text.

Install the French language resources:

```shell
polyglot download embeddings2.fr
polyglot download ner2.fr
```

In [47]:
article = '''
\ufeffédition abonné\r\n\r\n\r\nDans une tribune au « Monde », l’universitaire Charles Cuvelliez estime que le fantasme d’un remplacement de l’homme par l’algorithme et le robot repose sur un malentendu.\r\n\r\n\r\nLe Monde | 10.05.2017 à 06h44 • Mis à jour le 10.05.2017 à 09h47 | Par Charles Cuvelliez (Professeur à l’Ecole polytechnique de l'université libre de Bruxelles)\r\n\r\n\r\nTRIBUNE. L’usage morbide, par certains, de Facebook Live a amené son fondateur à annoncer précipitamment le recrutement de 3 000 modérateurs supplémentaires. Il est vrai que l’intelligence artificielle (IA) est bien en peine de reconnaître des contenus violents, surtout diffusés en direct.\r\n\r\n\r\nLe quotidien affreux de ces modérateurs, contraints de visionner des horreurs à longueur de journée, mériterait pourtant qu’on les remplace vite par des machines !\r\n\r\n\r\nL’IA ne peut pas tout, mais là où elle peut beaucoup, on la maudit, accusée de détruire nos emplois, de remplacer la convivialité humaine. Ce débat repose sur un malentendu.\r\n\r\n\r\nIl vient d’une définition de l’IA qui n’a, dans la réalité, jamais pu être mise en pratique : en 1955, elle était vue comme la création de programmes informatiques qui, quoi qu’on leur confie, le feraient un jour mieux que les humains. On pensait que toute caractéristique de l’intelligence humaine pourrait un jour être si précisément décrite qu’il suffirait d’une machine pour la simuler. Ce n’est pas vrai.\r\n\r\n\r\nAngoisses infondées\r\n\r\n\r\nComme le dit un récent Livre blanc sur la question (Pourquoi il ne faut pas avoir peur de l’Intelligence arti\xadficielle, Julien Maldonato, Deloitte, mars 2017), rien ne pourra remplacer un humain dans sa globalité.\r\n\r\n\r\nL’IA, c’est de l’apprentissage automatique doté d’un processus d’ajustement de modèles statistiques à des masses de données, explique l’auteur. 
Il s’agit d’un apprentissage sur des paramètres pour lesquels une vision humaine n’explique pas pourquoi ils marchent si bien dans un contexte donné.\r\n\r\n\r\nC’est aussi ce que dit le rapport de l’Office parlementaire d’évaluation des choix scientifiques et technologiques (« Pour une intelligence artificielle maîtrisée, utile et démystifiée », 29 mars 2017), pour qui ce côté « boîte noire » explique des angoisses infondées. Ethiquement, se fonder sur l’IA pour des tâches critiques sans bien comprendre le comment...
'''

In [51]:
# Create a new text object using Polyglot's Text class: txt
txt = Text(article)

# Print each of the entities found
for ent in txt.entities:
    print(ent)
    
# Print the type of ent
print(type(ent))

['Charles', 'Cuvelliez']
['Charles', 'Cuvelliez']
['Bruxelles']
['l’IA']
['Julien', 'Maldonato']
['Deloitte']
['Ethiquement']
['l’IA']
['.']
<class 'polyglot.text.Chunk'>


## French NER with polyglot II
Here, you'll complete the work you began in the previous exercise. Your code from there has already been executed, as you can see from the output in the IPython Shell.

Your task is to use a list comprehension to create a list of tuples, in which the first element is the entity tag, and the second element is the full string of the entity text.

In [53]:
# Create the list of tuples: entities
entities = [(ent.tag, ' '.join(ent)) for ent in txt.entities]
# Print entities
print(entities)

[('I-PER', 'Charles Cuvelliez'), ('I-PER', 'Charles Cuvelliez'), ('I-ORG', 'Bruxelles'), ('I-PER', 'l’IA'), ('I-PER', 'Julien Maldonato'), ('I-ORG', 'Deloitte'), ('I-PER', 'Ethiquement'), ('I-LOC', 'l’IA'), ('I-PER', '.')]
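
The result contains repeats and at least one obvious miss (`'.'` tagged as a person). A light clean-up pass, sketched below with the standard library only, drops entries that contain no letters and removes exact duplicates. The two filter rules are assumptions chosen for this output, not part of polyglot.

```python
# (tag, text) tuples copied from the printed output above.
entities = [('I-PER', 'Charles Cuvelliez'), ('I-PER', 'Charles Cuvelliez'),
            ('I-ORG', 'Bruxelles'), ('I-PER', 'l’IA'),
            ('I-PER', 'Julien Maldonato'), ('I-ORG', 'Deloitte'),
            ('I-PER', 'Ethiquement'), ('I-LOC', 'l’IA'), ('I-PER', '.')]

cleaned = []
seen = set()
for tag, text in entities:
    # Skip entries made only of punctuation or digits.
    if not any(ch.isalpha() for ch in text):
        continue
    # Skip exact (tag, text) repeats while keeping the original order.
    if (tag, text) in seen:
        continue
    seen.add((tag, text))
    cleaned.append((tag, text))

print(cleaned)
```

Entries such as `Ethiquement` or `l’IA` still pass these filters, so this only removes the easiest noise.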


## Spanish NER with polyglot
You'll continue your exploration of polyglot now with some Spanish annotation. This article is not written by a newspaper, so it is your first example of a more blog-like text. How do you think that might compare when finding entities?

The Text object has been created as txt, and each entity has been printed, as you can see in the IPython Shell.

Your specific task is to determine how many of the entities contain the words "Márquez" or "Gabo" - these refer to the same person in different ways!

In [55]:
t = '''Lina del Castillo es profesora en el Instituto de Estudios Latinoamericanos Teresa Lozano Long (LLILAS) y el Departamento de Historia de la Universidad de Texas en Austin. Ella será la moderadora del panel “Los Mundos Políticos de Gabriel García Márquez” este viernes, Oct. 30, en el simposio Gabriel García Márquez: Vida y Legado.


LIna del Castillo


Actualmente, sus investigaciones abarcan la intersección de cartografía, disputas a las demandas de tierra y recursos, y la formación del n...el tren de medianoche que lleva a miles y miles de cadáveres uno encima del otro como tantos racimos del banano que acabarán tirados al mar. Ningún recuento periodístico podría provocar nuestra imaginación y nuestra memoria como este relato de García Márquez.


Contenido Relacionado


Lea más artículos sobre el archivo de Gabriel García Márquez


Reciba mensualmente las últimas noticias e información del Harry Ransom Center con eNews, nuestro correo electrónico mensual. ¡Suscríbase hoy!
'''

In [57]:
txt = Text(t)

In [70]:
# Initialize the count variable: count
count = 0

# Iterate over all the entities
for ent in txt.entities:
    # Check whether the entity contains 'Márquez' or 'Gabo'
    if "Márquez" in ent or "Gabo" in ent:
        # Increment count
        count+=1

# Print count
print(count)

# Calculate the share of entities that refer to "Gabo" (a fraction between 0 and 1): percentage
percentage = count / len(txt.entities)
print(percentage)

4
0.26666666666666666
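
The second number is a fraction, not yet a percentage. The short sketch below reproduces the arithmetic and formats it as a percentage. The total of 15 entities is an assumption inferred from the printed fraction (4 / 15), since the entity list itself is not shown.

```python
count = 4    # entities mentioning "Márquez" or "Gabo", from the output above
total = 15   # assumption: inferred from the printed fraction, not shown in the output

fraction = count / total
print(fraction)

# Format the fraction as a percentage with one decimal place.
print(f"{fraction:.1%}")
```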
