# A short tutorial from the internet on NER

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('averaged_perceptron_tagger')
  

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rodolfo.lobo/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
ex = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

In [3]:
print(ex)

European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices


In [4]:
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

In [5]:
sent = preprocess(ex)

In [6]:
print(sent)

[('European', 'JJ'), ('authorities', 'NNS'), ('fined', 'VBD'), ('Google', 'NNP'), ('a', 'DT'), ('record', 'NN'), ('$', '$'), ('5.1', 'CD'), ('billion', 'CD'), ('on', 'IN'), ('Wednesday', 'NNP'), ('for', 'IN'), ('abusing', 'VBG'), ('its', 'PRP$'), ('power', 'NN'), ('in', 'IN'), ('the', 'DT'), ('mobile', 'JJ'), ('phone', 'NN'), ('market', 'NN'), ('and', 'CC'), ('ordered', 'VBD'), ('the', 'DT'), ('company', 'NN'), ('to', 'TO'), ('alter', 'VB'), ('its', 'PRP$'), ('practices', 'NNS')]


The meaning of each tag (Penn Treebank part-of-speech tags) can be found at this [link](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). The goal is to organize the information as follows:

Now we’ll implement noun phrase chunking to identify named entities using a regular expression consisting of rules that indicate how sentences should be chunked.
Our chunk pattern consists of one rule, that a noun phrase, NP, should be formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN.

In [7]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'

In [8]:
cp = nltk.RegexpParser(pattern)

In [9]:
print(cp)

chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<DT>?<JJ>*<NN>'>


In [10]:
cs = cp.parse(sent)
print(cs)

(S
  European/JJ
  authorities/NNS
  fined/VBD
  Google/NNP
  (NP a/DT record/NN)
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  (NP power/NN)
  in/IN
  (NP the/DT mobile/JJ phone/NN)
  (NP market/NN)
  and/CC
  ordered/VBD
  (NP the/DT company/NN)
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)
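
As a rough illustration of what the rule does, the tag pattern can be mimicked with the standard `re` module by writing the POS tags as one string and matching an ordinary regular expression against it. This is only a sketch under that assumption, not NLTK's actual implementation; the names `tag_string` and `rule` are made up for the example, and the tokens are copied from part of the tagged sentence above.

```python
import re

# Part of the tagged sentence, copied from the pos_tag output above.
tagged = [('abusing', 'VBG'), ('its', 'PRP$'), ('power', 'NN'), ('in', 'IN'),
          ('the', 'DT'), ('mobile', 'JJ'), ('phone', 'NN'), ('market', 'NN')]

# Write the tag sequence as one string: "<VBG><PRP$><NN><IN>..."
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Plain-regex version of the rule {<DT>?<JJ>*<NN>}.
rule = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

chunks = []
for match in rule.finditer(tag_string):
    # Convert character offsets back to token positions by counting "<".
    start = tag_string[:match.start()].count("<")
    length = match.group().count("<")
    chunks.append([word for word, _ in tagged[start:start + length]])

print(chunks)
# [['power'], ['the', 'mobile', 'phone'], ['market']]
```

Because the rule ends with exactly one `<NN>`, "phone" closes the first chunk and "market" starts a new one, which is the same split that appears in the tree above.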


In [11]:
len(cs)

24

In [12]:
cs[0]

('European', 'JJ')

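The output can be read as a tree or a hierarchy with S as the first level, denoting the sentence. It can also be displayed graphically. Note that $\mathtt{len(cs)}$ is 24 rather than 28 (the number of tokens) because each of the five NP chunks counts as a single child of S.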

In [13]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('European', 'JJ', 'O'),
 ('authorities', 'NNS', 'O'),
 ('fined', 'VBD', 'O'),
 ('Google', 'NNP', 'O'),
 ('a', 'DT', 'B-NP'),
 ('record', 'NN', 'I-NP'),
 ('$', '$', 'O'),
 ('5.1', 'CD', 'O'),
 ('billion', 'CD', 'O'),
 ('on', 'IN', 'O'),
 ('Wednesday', 'NNP', 'O'),
 ('for', 'IN', 'O'),
 ('abusing', 'VBG', 'O'),
 ('its', 'PRP$', 'O'),
 ('power', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'B-NP'),
 ('mobile', 'JJ', 'I-NP'),
 ('phone', 'NN', 'I-NP'),
 ('market', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('ordered', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('company', 'NN', 'I-NP'),
 ('to', 'TO', 'O'),
 ('alter', 'VB', 'O'),
 ('its', 'PRP$', 'O'),
 ('practices', 'NNS', 'O')]


How sentences are split and grouped into syntax trees (grammatical roles and substructures) apparently depends on the date on which the code was implemented. This short tutorial dates from 17 August 2018. The letter B before the hyphen marks the beginning of a chunk (a phrase), not of a sentence. Specifically:

1) B-NP : beginning of a noun phrase
2) I-NP : inside of a noun phrase
3) O : outside of any chunk
4) B-VP : beginning of a verb phrase
5) I-VP : inside of a verb phrase

[Link](https://www.geeksforgeeks.org/nlp-iob-tags/) to a more detailed explanation of these tags.
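
To make the tagging scheme concrete, here is a minimal sketch that turns chunk positions into IOB tags in plain Python. The helper `spans_to_iob` and the `(start, end, label)` span format are assumptions made for this example, not part of NLTK; the words are taken from the sentence above.

```python
def spans_to_iob(n_tokens, spans):
    """Return one IOB tag per token; spans are (start, end, label), end exclusive."""
    tags = ["O"] * n_tokens              # default: outside of any chunk
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the chunk
    return tags

words = ['in', 'the', 'mobile', 'phone', 'market', 'and']
spans = [(1, 4, 'NP'), (4, 5, 'NP')]     # "the mobile phone" and "market"

for word, tag in zip(words, spans_to_iob(len(words), spans)):
    print(word, tag)
```

The B prefix is what keeps the two adjacent noun phrases apart: without it, "phone" and "market" would both be tagged I-NP and look like a single chunk.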

With the tagged information, we can use another NLTK function, $\mathtt{ne\_chunk}$, to build a tree of the sentence in which named entities are labeled:


In [14]:
nltk.download('words')
nltk.download('maxent_ne_chunker')

ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)

(S
  (GPE European/JJ)
  authorities/NNS
  fined/VBD
  (PERSON Google/NNP)
  a/DT
  record/NN
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  power/NN
  in/IN
  the/DT
  mobile/JJ
  phone/NN
  market/NN
  and/CC
  ordered/VBD
  the/DT
  company/NN
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)


[nltk_data] Downloading package words to
[nltk_data]     /Users/rodolfo.lobo/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/rodolfo.lobo/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


We can see that this tool treats $\mathtt{Google}$ as a person (and the adjective $\mathtt{European}$ as a GPE). This could cause problems when we train the model!

# spaCy: apparently a better alternative for these tasks

In [15]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

The pipeline is applied only once: calling $\mathtt{nlp}$ on a text returns a $\mathtt{Doc}$ object that already carries the tokens and their entity annotations. We use the same example $\mathtt{ex}$:

In [16]:
doc = nlp(ex)
pprint([(X.text, X.label_) for X in doc.ents])

[('European', 'NORP'),
 ('Google', 'ORG'),
 ('$5.1 billion', 'MONEY'),
 ('Wednesday', 'DATE')]


The above only retrieved objects at the entity level (organization, amount of money, date, and NORP, i.e. nationalities or religious or political groups). It is also possible to inspect the annotation of every token:

In [17]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(European, 'B', 'NORP'),
 (authorities, 'O', ''),
 (fined, 'O', ''),
 (Google, 'B', 'ORG'),
 (a, 'O', ''),
 (record, 'O', ''),
 ($, 'B', 'MONEY'),
 (5.1, 'I', 'MONEY'),
 (billion, 'I', 'MONEY'),
 (on, 'O', ''),
 (Wednesday, 'B', 'DATE'),
 (for, 'O', ''),
 (abusing, 'O', ''),
 (its, 'O', ''),
 (power, 'O', ''),
 (in, 'O', ''),
 (the, 'O', ''),
 (mobile, 'O', ''),
 (phone, 'O', ''),
 (market, 'O', ''),
 (and, 'O', ''),
 (ordered, 'O', ''),
 (the, 'O', ''),
 (company, 'O', ''),
 (to, 'O', ''),
 (alter, 'O', ''),
 (its, 'O', ''),
 (practices, 'O', '')]


"B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.
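
The token-level view and the entity-level view carry the same information. As a sketch, the B/I/O triples can be merged back into entities in plain Python. The helper `merge_entities` is hypothetical, and the triples are copied from the first tokens of the output above.

```python
def merge_entities(triples):
    """Group (text, iob, label) triples into (entity text, label) pairs."""
    entities = []
    words, label = [], None
    for text, iob, ent_type in triples:
        if iob == "I" and words:
            words.append(text)           # continue the open entity
            continue
        if words:                        # any other tag closes the open entity
            entities.append((" ".join(words), label))
            words, label = [], None
        if iob == "B":
            words, label = [text], ent_type   # start a new entity
    if words:                            # entity that runs to the end of the input
        entities.append((" ".join(words), label))
    return entities

triples = [('European', 'B', 'NORP'), ('authorities', 'O', ''), ('fined', 'O', ''),
           ('Google', 'B', 'ORG'), ('a', 'O', ''), ('record', 'O', ''),
           ('$', 'B', 'MONEY'), ('5.1', 'I', 'MONEY'), ('billion', 'I', 'MONEY'),
           ('on', 'O', ''), ('Wednesday', 'B', 'DATE')]

print(merge_entities(triples))
```

The sketch joins tokens with a single space, so the amount comes out as "$ 5.1 billion", whereas spaCy keeps the original spacing and reports "$5.1 billion".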

# Now let's extract the entities present in a web page!

In [18]:
from bs4 import BeautifulSoup
import requests
import re

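def url_to_string(url):
    """Download a page and return its visible text as a single string."""
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')  # requires the html5lib package
    # Drop elements whose content is not part of the article text.
    for element in soup(["script", "style", 'aside']):
        element.extract()
    # Collapse runs of newlines and tabs into single spaces.
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))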

ny_bb = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')
article = nlp(ny_bb)
len(article.ents)

158
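
The cleaning step can be tried offline without downloading anything. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, so it is not the function used above, only an approximation of the same idea: skip `script`, `style` and `aside` content, then collapse newlines and tabs. The names `TextExtractor` and `html_to_string` and the sample HTML are invented for the example.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside the skipped tags."""
    SKIPPED = {"script", "style", "aside"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0              # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def html_to_string(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Same whitespace handling as url_to_string.
    return " ".join(re.split(r"[\n\t]+", "".join(parser.parts)))

sample = ("<html><head><style>p {color: red}</style></head><body>"
          "<p>Mr. Strzok\nwas fired.</p><script>var x = 1;</script>"
          "<aside>Related</aside></body></html>")
print(html_to_string(sample))
```

Real pages are rarely this tidy, which is one reason to prefer a tolerant parser such as the `html5lib` backend used above.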

In [19]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'PERSON': 81,
         'GPE': 13,
         'ORG': 32,
         'CARDINAL': 5,
         'DATE': 19,
         'PRODUCT': 2,
         'NORP': 3,
         'ORDINAL': 1,
         'EVENT': 1,
         'FAC': 1})

In [20]:
ny_bb2 = url_to_string('https://fitolobo.github.io/')
article2 = nlp(ny_bb2)
len(article2.ents)

73

In [21]:
labels2 = [x.label_ for x in article2.ents]
Counter(labels2)

Counter({'PERSON': 24,
         'ORG': 19,
         'GPE': 10,
         'NORP': 6,
         'DATE': 11,
         'ORDINAL': 1,
         'PRODUCT': 1,
         'CARDINAL': 1})
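
Raw counts are hard to compare because the two pages differ in length (158 versus 73 entities). A small sketch that converts each Counter into shares makes the comparison easier; the counts are copied from the two outputs above, and the helper `label_shares` is hypothetical.

```python
from collections import Counter

# Label counts copied from the two Counter outputs above.
nyt_labels = Counter({'PERSON': 81, 'GPE': 13, 'ORG': 32, 'CARDINAL': 5,
                      'DATE': 19, 'PRODUCT': 2, 'NORP': 3, 'ORDINAL': 1,
                      'EVENT': 1, 'FAC': 1})
personal_labels = Counter({'PERSON': 24, 'ORG': 19, 'GPE': 10, 'NORP': 6,
                           'DATE': 11, 'ORDINAL': 1, 'PRODUCT': 1,
                           'CARDINAL': 1})

def label_shares(counts):
    """Return each label's fraction of all entities."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

nyt_shares = label_shares(nyt_labels)
personal_shares = label_shares(personal_labels)

# One row per label, ordered by frequency in the news article.
for label, _ in nyt_labels.most_common():
    print(f"{label:<10}{nyt_shares[label]:>7.1%}"
          f"{personal_shares.get(label, 0.0):>7.1%}")
```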

The 3 most frequent entities

In [23]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('Strzok', 32), ('F.B.I.', 17), ('Trump', 13)]

Select part of the text

In [26]:
sentences = [x for x in article.sents]
print(sentences[10:20])

[Mr. Strzok had testified before the House in July about how he had not allowed his political views to interfere with the investigations he was overseeing., But Mr. Strzok’s lawyer said the deputy director of the F.B.I., David Bowdich, had overruled the Office of Professional Responsibility and fired Mr. Strzok., A spokeswoman for the F.B.I. did not respond to a message seeking comment about why Mr. Strzok was dismissed rather than demoted., Firing Mr. Strzok, however, removes a favorite target of Mr. Trump from the ranks of the F.B.I. and gives Mr. Bowdich and the F.B.I. director, Christopher A. Wray, a chance to move beyond the president’s ire., Aitan Goelman, Mr. Strzok’s lawyer, denounced his client’s dismissal., “The decision to fire Special Agent Strzok is not only a departure from typical bureau practice, but also contradicts Director Wray’s testimony to Congress and his assurances that the F.B.I. intended to follow its regular process in this and all personnel matters,” Mr. Goe

In [27]:
displacy.render(nlp(str(sentences[10:20])), jupyter=True, style='ent')

# In the case of my page

In [33]:
items2 = [x.text for x in article2.ents]
Counter(items2).most_common(3)

[('R. A.', 5), ('2020', 3), ('2019', 3)]

In [31]:
sentences2 = [x for x in article2.sents]
print(sentences2[1:2])

[I'm interested in neural network models, machine learning, biomathematics and numerical methods.]


In [32]:
displacy.render(nlp(str(sentences2[1:2])), jupyter=True, style='ent')



An answer I found regarding this error:
    
You're not expecting the right thing then :) spaCy's models for NER are trained on different datasets depending on the language. In the case of the model you're using, see here: [link](https://spacy.io/models/en#en_core_web_sm)

The dataset used to train the model you're using is called "OntoNotes 5" and that one doesn't consider cats and dogs as PERSON (as most people do). If you want to get "cats" and "dogs" as entities, you need to train your own NER model with your own data. For example, you could label some data with the ANIMAL entity using regex rules with a list of pets of interest, and using that labelled dataset, you can fine-tune the NER model to do what you want.
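
The regex labeling step suggested in that answer could look roughly like the sketch below. The pet list, the helper `label_animals` and the `(start, end, label)` output format are all assumptions for illustration; the actual fine-tuning of the NER model is not shown.

```python
import re

# Hypothetical list of pets of interest.
PETS = ["cat", "dog", "hamster"]

# Match each pet name as a whole word, with an optional plural "s".
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in PETS) + r")s?\b",
    re.IGNORECASE,
)

def label_animals(text):
    """Return (start, end, 'ANIMAL') character spans for every match."""
    return [(m.start(), m.end(), "ANIMAL") for m in pattern.finditer(text)]

text = "I like cats and dogs."
print(label_animals(text))
# [(7, 11, 'ANIMAL'), (16, 20, 'ANIMAL')]
```

Character offsets like these are a common way to express entity annotations, and a set of sentences labeled this way could serve as a starting dataset for the fine-tuning described in the answer.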

# Let's look at another page

In [34]:
ny_bb3 = url_to_string('https://www.latex-project.org/')
article3 = nlp(ny_bb3)
len(article3.ents)

49

In [35]:
labels3 = [x.label_ for x in article3.ents]
Counter(labels3)

Counter({'ORG': 17,
         'NORP': 8,
         'PRODUCT': 2,
         'PERSON': 3,
         'DATE': 11,
         'PERCENT': 1,
         'CARDINAL': 5,
         'LANGUAGE': 2})

In [36]:
items3 = [x.text for x in article3.ents]
Counter(items3).most_common(3)

[('LaTeX', 7), ('the TeX Users Group', 2), ('LaTeX Project', 2)]

In [38]:
sentences3 = [x for x in article3.sents]
print(sentences3[3:10])

[LaTeX is the de facto standard for the communication and publication of scientific documents., LaTeX is available as free software.,     You don't have to pay for using LaTeX, i.e., there are no license fees, etc., But you are, of course, invited to support the maintenance and development efforts through a donation to the TeX Users Group (choose LaTeX Project contribution) if you are satisfied with LaTeX.        , You can also sponsor the work of LaTeX team members through the GitHub sponsor program at the moment for   Frank, David and Joseph.,  , Your contribution will be   matched by GitHub in the first year and goes 100% to the developers.]


In [39]:
displacy.render(nlp(str(sentences3[3:10])), jupyter=True, style='ent')