# Natural Language Processing
## Named Entity Recognition

What is NER?
* An NLP task that identifies important named entities in a text:
    - People, places, organizations
    - Dates, states, ...
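
To make the task concrete, here is a toy sketch of the kind of result NER produces: labeled spans of text. It uses a plain dictionary lookup, which is not how `nltk` or other real NER systems work (they rely on trained statistical models). The `GAZETTEER` contents, the label names, and the `find_entities` helper are made up for illustration.

```python
# Hypothetical hand-written gazetteer; real NER models learn entities from data.
GAZETTEER = {
    "New York": "PLACE",
    "Ruth Reichl": "PERSON",
    "MOMA": "ORGANIZATION",
}


def find_entities(text):
    """Return (start, end, label, name) for every gazetteer hit, in text order."""
    hits = []
    for name, label in GAZETTEER.items():
        start = text.find(name)
        while start != -1:
            hits.append((start, start + len(name), label, name))
            start = text.find(name, start + 1)
    return sorted(hits)


text = "In New York, I like to visit MOMA and restaurants rated by Ruth Reichl."
entities = find_entities(text)
for start, end, label, name in entities:
    print(f"{label:<13}{name!r} at [{start}:{end}]")
```

A lookup like this only finds names it already knows, which is the main reason practical NER relies on trained models instead.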

NER is available in `nltk` and through the Stanford CoreNLP library, which is:
* Integrated into Python via `nltk`
* Based on Java

In [1]:
import nltk
sentence = ''' In New York, I like to ride the Metro to visit MOMA and some restaurants rated well by Ruth Reichl, april.'''
print(sentence)

 In New York, I like to ride the Metro to visit MOMA and some restaurants rated well by Ruth Reichl, april.


In [2]:
# Tokenize the sentence into words
tokenized_sent = nltk.word_tokenize(sentence)
print(tokenized_sent)

['In', 'New', 'York', ',', 'I', 'like', 'to', 'ride', 'the', 'Metro', 'to', 'visit', 'MOMA', 'and', 'some', 'restaurants', 'rated', 'well', 'by', 'Ruth', 'Reichl', ',', 'april', '.']
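
As a rough intuition for what word tokenization does, the sketch below splits the same sentence with a single regular expression. This is a simplified stand-in, not the algorithm behind `nltk.word_tokenize`; the `simple_tokenize` helper is hypothetical. For this particular sentence the two happen to agree, but they can differ on contractions, abbreviations, and similar cases.

```python
import re

sentence = ''' In New York, I like to ride the Metro to visit MOMA and some restaurants rated well by Ruth Reichl, april.'''


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize(sentence)
print(tokens)
```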


In [3]:
# Tag each word with its part of speech: verb, preposition, noun, proper noun, ...
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent

[('In', 'IN'),
 ('New', 'NNP'),
 ('York', 'NNP'),
 (',', ','),
 ('I', 'PRP'),
 ('like', 'VBP'),
 ('to', 'TO'),
 ('ride', 'VB'),
 ('the', 'DT'),
 ('Metro', 'NNP'),
 ('to', 'TO'),
 ('visit', 'VB'),
 ('MOMA', 'NNP'),
 ('and', 'CC'),
 ('some', 'DT'),
 ('restaurants', 'NNS'),
 ('rated', 'VBN'),
 ('well', 'RB'),
 ('by', 'IN'),
 ('Ruth', 'NNP'),
 ('Reichl', 'NNP'),
 (',', ','),
 ('april', 'NN'),
 ('.', '.')]
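
The tags already hint at where entities are: consecutive `NNP` (proper noun) tokens usually belong to the same name. The sketch below groups those runs using only the tagged output shown above, copied in as a literal list. The `proper_noun_runs` helper is hypothetical, and this is a simple heuristic rather than what `nltk.ne_chunk` does internally.

```python
from itertools import groupby

# Tagged tokens copied from the output of nltk.pos_tag above.
tagged_sent = [
    ('In', 'IN'), ('New', 'NNP'), ('York', 'NNP'), (',', ','), ('I', 'PRP'),
    ('like', 'VBP'), ('to', 'TO'), ('ride', 'VB'), ('the', 'DT'),
    ('Metro', 'NNP'), ('to', 'TO'), ('visit', 'VB'), ('MOMA', 'NNP'),
    ('and', 'CC'), ('some', 'DT'), ('restaurants', 'NNS'), ('rated', 'VBN'),
    ('well', 'RB'), ('by', 'IN'), ('Ruth', 'NNP'), ('Reichl', 'NNP'),
    (',', ','), ('april', 'NN'), ('.', '.'),
]


def proper_noun_runs(tagged):
    """Join each run of consecutive NNP tokens into one candidate name."""
    runs = []
    for is_nnp, group in groupby(tagged, key=lambda pair: pair[1] == 'NNP'):
        if is_nnp:
            runs.append(' '.join(word for word, _ in group))
    return runs


candidates = proper_noun_runs(tagged_sent)
print(candidates)  # ['New York', 'Metro', 'MOMA', 'Ruth Reichl']
```

Note that the lowercase `april` was tagged `NN`, so this heuristic does not pick it up.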

In [4]:
# Chunk the tagged tokens into named entities (GPE, ORGANIZATION, PERSON, ...)
print(nltk.ne_chunk(tagged_sent))

(S
  In/IN
  (GPE New/NNP York/NNP)
  ,/,
  I/PRP
  like/VBP
  to/TO
  ride/VB
  the/DT
  (ORGANIZATION Metro/NNP)
  to/TO
  visit/VB
  (ORGANIZATION MOMA/NNP)
  and/CC
  some/DT
  restaurants/NNS
  rated/VBN
  well/RB
  by/IN
  (PERSON Ruth/NNP Reichl/NNP)
  ,/,
  april/NN
  ./.)
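
The printed result is an `nltk.Tree`: plain tokens are `(word, tag)` tuples and each entity is a labeled subtree. A minimal sketch of pulling the entities out as `(label, text)` pairs follows. To avoid depending on the downloaded chunker models, it builds a shortened version of the tree above by hand; the `extract_entities` helper is hypothetical.

```python
from nltk import Tree


def extract_entities(tree):
    """Collect (label, text) for every labeled subtree directly under the root."""
    entities = []
    for node in tree:
        if isinstance(node, Tree):
            text = " ".join(token for token, tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# Hand-built, shortened stand-in for the output of nltk.ne_chunk(tagged_sent).
chunked = Tree("S", [
    ("In", "IN"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
    (",", ","),
    ("the", "DT"),
    Tree("ORGANIZATION", [("Metro", "NNP")]),
    ("by", "IN"),
    Tree("PERSON", [("Ruth", "NNP"), ("Reichl", "NNP")]),
])

entities = extract_entities(chunked)
print(entities)
```

In the notebook itself, the same helper could be applied to `nltk.ne_chunk(tagged_sent)` directly.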


In [5]:
# Example: a longer article

# The text is written as two adjacent string literals, which Python joins into one string.
article = (
    '\ufeffThe taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character. If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic. Uber wanted to know as much as possible about the people who use its service, and those who don’t. It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies. Even if their email was notionally anonymised, this use of it was not something the users had bargained for. Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple’s phones even thought it is forbidden by the company.\r\n\r\n\r\nUber has also tweaked its software so that regulatory agencies that the company regarded as hostile would, when they tried to hire a driver, be given false reports about the location of its cars. Uber management booked and then cancelled rides with a rival taxi-hailing company which took their vehicles out of circulation. Uber deny this was the intention. The punishment for this behaviour was negligible. Uber promised not to use this “greyball” software against law enforcement – one wonders what would happen to someone carrying a knife who promised never to stab a policeman with it. Travis Kalanick of Uber got a personal dressing down from Tim Cook, who runs Apple, but the company did not prohibit the use of the app. Too much money was at stake for that.\r\n\r\n\r\nMillions of people around the world value the cheapness and convenience of Uber’s rides too much to care about the lack of drivers’ rights or pay. Many of the users themselves are not much richer than the drivers. '
    'The “sharing economy” encourages the insecure and exploited to exploit others equally insecure to the profit of a tiny clique of billionaires. Silicon Valley’s culture seems hostile to humane and democratic values. The outgoing CEO of Yahoo, Marissa Mayer, who is widely judged to have been a failure, is likely to get a $186m payout. This may not be a cause for panic, any more than the previous hero worship should have been a cause for euphoria. Yet there’s an urgent political task to tame these companies, to ensure they are punished when they break the law, that they pay their taxes fairly and that they behave responsibly.'
)

# Split the article into sentences
sentences = nltk.sent_tokenize(article)

# Tokenize each sentence into words
token_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence with parts of speech
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences]
print('WORD TAGS')
print(pos_sentences)

print('===================================================')
# Chunk each tagged sentence; binary=True labels every entity simply as "NE"
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

# Keep only the chunks we are interested in: named entities
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)

WORD TAGS
[[('\ufeffThe', 'JJ'), ('taxi-hailing', 'JJ'), ('company', 'NN'), ('Uber', 'NNP'), ('brings', 'VBZ'), ('into', 'IN'), ('very', 'RB'), ('sharp', 'JJ'), ('focus', 'VB'), ('the', 'DT'), ('question', 'NN'), ('of', 'IN'), ('whether', 'IN'), ('corporations', 'NNS'), ('can', 'MD'), ('be', 'VB'), ('said', 'VBD'), ('to', 'TO'), ('have', 'VB'), ('a', 'DT'), ('moral', 'JJ'), ('character', 'NN'), ('.', '.')], [('If', 'IN'), ('any', 'DT'), ('human', 'JJ'), ('being', 'VBG'), ('were', 'VBD'), ('to', 'TO'), ('behave', 'VB'), ('with', 'IN'), ('the', 'DT'), ('single-minded', 'JJ'), ('and', 'CC'), ('ruthless', 'JJ'), ('greed', 'NN'), ('of', 'IN'), ('the', 'DT'), ('company', 'NN'), (',', ','), ('we', 'PRP'), ('would', 'MD'), ('consider', 'VB'), ('them', 'PRP'), ('sociopathic', 'JJ'), ('.', '.')], [('Uber', 'NNP'), ('wanted', 'VBD'), ('to', 'TO'), ('know', 'VB'), ('as', 'RB'), ('much', 'JJ'), ('as', 'IN'), ('possible', 'JJ'), ('about', 'IN'), ('the', 'DT'), ('people', 'NNS'), ('who', 'WP

### spaCy

What is spaCy?
* An NLP library similar to gensim, with different implementations
* Open source
* To install it, run `conda install spacy` in Anaconda Prompt
* To install the language models, follow the instructions at https://spacy.io/usage/ or in https://www.youtube.com/watch?v=hY_0YUKVNMU
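
Before loading a model it can help to check that both the library and a language model are actually installed. spaCy's language models are distributed as ordinary Python packages, so a standard-library lookup is enough for a quick check. This is a minimal sketch: the model name `en_core_web_sm` is an assumption (it does not appear in the steps above) and `is_installed` is a hypothetical helper.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


# "en_core_web_sm" is assumed here as an example English model package.
for package in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(package) else "missing"
    print(f"{package}: {status}")
```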

In [6]:
import spacy
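# The 'en' shortcut works in spaCy 2.x; in spaCy 3+ load a model package by name, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')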

doc = nlp("""Berlin is the capital of Germany; and the residence of Chancellor Angela Merkel,1912""")

# Named entities found in the document
doc.ents


(Berlin, Germany, Angela Merkel,1912)

In [7]:
# Using the article from the earlier example,
# print the label and text of each entity
doc = nlp(article)
for ent in doc.ents:
    print(ent.label_, ent.text)

ORG Uber
ORG Uber
ORG Apple
ORG Uber
ORG Uber
PERSON Travis Kalanick
ORG Uber
PERSON Tim Cook
ORG Apple
CARDINAL Millions
ORG Uber
GPE drivers’
LOC Silicon Valley’s
ORG Yahoo
PERSON Marissa Mayer
MONEY $186m
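
A common next step is to summarize which labels and names dominate a text. The sketch below tallies the output above with `collections.Counter`; the pairs are copied in as a literal list so the example runs without a spaCy model. In the notebook, something like `Counter(ent.label_ for ent in doc.ents)` would give the label counts directly.

```python
from collections import Counter

# (label, text) pairs copied from the printed output above.
entities = [
    ("ORG", "Uber"), ("ORG", "Uber"), ("ORG", "Apple"), ("ORG", "Uber"),
    ("ORG", "Uber"), ("PERSON", "Travis Kalanick"), ("ORG", "Uber"),
    ("PERSON", "Tim Cook"), ("ORG", "Apple"), ("CARDINAL", "Millions"),
    ("ORG", "Uber"), ("GPE", "drivers’"), ("LOC", "Silicon Valley’s"),
    ("ORG", "Yahoo"), ("PERSON", "Marissa Mayer"), ("MONEY", "$186m"),
]

# How many entities of each type were found.
label_counts = Counter(label for label, _ in entities)

# Which organizations are mentioned most often.
org_counts = Counter(text for label, text in entities if label == "ORG")

print(label_counts.most_common(2))  # [('ORG', 9), ('PERSON', 3)]
print(org_counts.most_common(1))    # [('Uber', 6)]
```

Counts like these also make questionable labels easier to spot; for instance, `drivers’` tagged as `GPE` in the output above does not look like a real place.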
