# Example of Training spaCy for Named Entity Recognition (NER)

#### This example was built from the Kaggle code suggested by the professor and from an example on Medium. Links:
* https://www.kaggle.com/finalepoch/medical-ner-using-spacy/notebook
* https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7

In [1]:
from __future__ import unicode_literals, print_function
#import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding
from spacy.training import Example
from tqdm import tqdm

### Plain spaCy model, with examples of entities that are missed or labeled incorrectly

In [2]:
nlp_pre_treinado_sm = spacy.load('en_core_web_sm')
doc = nlp_pre_treinado_sm(u"Who is Nishant? He is in Minas Gerais or Macaé, Brazil or Brasil. I am in my apartment at Barra da Tijuca. If Anderson, Gustavo and Astolfo met up us, we will make a party.")
for token in doc:
    print(token.text, token.pos_)
print('--- ENT ---')
for ent in doc.ents:
    print(ent.text,ent.start_char, ent.end_char,ent.label_)


Who PRON
is AUX
Nishant ADJ
? PUNCT
He PRON
is AUX
in ADP
Minas PROPN
Gerais PROPN
or CCONJ
Macaé PROPN
, PUNCT
Brazil PROPN
or CCONJ
Brasil PROPN
. PUNCT
I PRON
am AUX
in ADP
my PRON
apartment NOUN
at ADP
Barra PROPN
da PROPN
Tijuca PROPN
. PUNCT
If SCONJ
Anderson PROPN
, PUNCT
Gustavo PROPN
and CCONJ
Astolfo PROPN
met VERB
up ADP
us PRON
, PUNCT
we PRON
will AUX
make VERB
a DET
party NOUN
. PUNCT
--- ENT ---
Minas Gerais 25 37 ORG
Macaé 41 46 GPE
Brazil 48 54 GPE
Brasil 58 64 PERSON
Barra da Tijuca 90 105 ORG
Anderson 110 118 PERSON
Astolfo 132 139 GPE


In [3]:
doc = nlp_pre_treinado_sm(u"Who is Kamal Khumar? Who is Nishant? Who is Maria? ")
# doc.ents yields entity spans, not tokens
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Kamal Khumar 7 19 PERSON
Maria 44 49 PERSON


#### Small training set for entity recognition

In [4]:
TRAIN_DATA = [
    ('Where is Barra da Tijuca?', {
        'entities': [(9, 24, 'LOC')]
    }),
    ('Who is Astolfo?', {
        'entities': [(7, 14, 'PERSON')]
    }),
    ('Who is Gustavo?', {
        'entities': [(7, 14, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]
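Each entry in `entities` is a `(start, end, label)` tuple of character offsets, with `end` exclusive, so slicing the text with those offsets should return exactly the entity string. A minimal sketch of a sanity check for this format follows; `check_offsets` is a hypothetical helper written for illustration, not part of spaCy.

```python
def check_offsets(train_data):
    """Return (start, end, label, span_text) for every annotated entity."""
    rows = []
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            # text[start:end] is the substring the annotation points at
            rows.append((start, end, label, text[start:end]))
    return rows


# Same format as TRAIN_DATA; offsets are [start, end) character positions.
sample = [
    ("Where is Barra da Tijuca?", {"entities": [(9, 24, "LOC")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

for start, end, label, span_text in check_offsets(sample):
    print(start, end, label, repr(span_text))
```

If a printed span has a stray space or a clipped letter, the offsets are off by one and should be fixed before training.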


### Training a model from scratch

In [5]:
nlp_blank = spacy.blank('en')  

##### In spaCy 3 it is no longer necessary to call create_pipe (the examples call create_pipe and then add_pipe)

In [6]:
if 'ner' not in nlp_blank.pipe_names:
    ner_blank = nlp_blank.add_pipe('ner')    
else:
    ner_blank = nlp_blank.get_pipe('ner')
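As a small illustration of the spaCy 3 behavior, the sketch below assumes a fresh blank English pipeline and shows that `add_pipe` takes the registered component name and returns the component itself. The variable names are chosen for this example only.

```python
import spacy

nlp = spacy.blank("en")
print(nlp.pipe_names)              # a blank pipeline starts with no components

ner = nlp.add_pipe("ner")          # pass the registered name, get the component back
print(nlp.pipe_names)

# get_pipe returns the same object that add_pipe returned
print(ner is nlp.get_pipe("ner"))
```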

##### spaCy 3 also changed this step: when iterating over a batch, you need to create *Example* objects.
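A minimal sketch of what an `Example` holds, assuming a blank English pipeline and one sentence in the same format as the training data: the tokenized text becomes the predicted side, and the annotations become a gold-standard reference `Doc`.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
text = "I like London and Berlin."
annotations = {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}

doc = nlp.make_doc(text)                        # tokenized only, no predictions yet
example = Example.from_dict(doc, annotations)   # pairs the doc with the gold annotations

# example.reference is the gold-standard Doc built from the annotations
gold = [(ent.text, ent.label_) for ent in example.reference.ents]
print(gold)
```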

In [9]:
n_iter=100
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner_blank.add_label(ent[2])

# using the same pipe exceptions as the Kaggle example
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp_blank.pipe_names if pipe not in pipe_exceptions]
with nlp_blank.disable_pipes(*other_pipes):  
    optimizer = nlp_blank.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        batches = minibatch(TRAIN_DATA, size=2)
        for batch in batches:
            for text, annotations in batch:
                doc = nlp_blank.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp_blank.update( [example],
                    drop=0.5,  
                    sgd=optimizer,
                    losses=losses)
        print(losses)

{'ner': 17.48537713289261}
{'ner': 15.800669848918915}
{'ner': 12.777700930833817}
{'ner': 10.85413184762001}
{'ner': 8.27435595728457}
{'ner': 7.547029630979523}
{'ner': 6.824094576993957}
{'ner': 6.978209424414672}
{'ner': 5.9165332875272725}
{'ner': 5.373996594109485}
{'ner': 6.201074225849879}
{'ner': 4.175207329715789}
{'ner': 4.278953146247204}
{'ner': 4.564560299518924}
{'ner': 4.764484313191133}
{'ner': 3.4053405030980457}
{'ner': 3.2561352893029767}
{'ner': 3.802131842158701}
{'ner': 3.1804286987965034}
{'ner': 3.173197740411191}
{'ner': 3.024884076604}
{'ner': 2.039549062062684}
{'ner': 2.8410471055105115}
{'ner': 2.334606500805936}
{'ner': 6.607993241588902}
{'ner': 1.686455891403225}
{'ner': 0.9864755109491186}
{'ner': 1.6112177401109133}
{'ner': 1.8702583382498312}
{'ner': 3.3655961177128453}
{'ner': 3.234964168316506}
{'ner': 1.9997512892387137}
{'ner': 0.6362937198690592}
{'ner': 3.5437545845396983}
{'ner': 1.8201272295806414}
{'ner': 0.9437481277211759}
{'ner': 1.411639
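The training loop above calls `nlp.update` once per example, so the `size=2` batching does not change the size of each update. The sketch below is one possible variation, not the method used in this notebook: it shows how `minibatch` groups items and how a whole batch of `Example` objects could be passed to a single `update` call. The three training sentences are copied from the training set, and passing the examples to `nlp.initialize` (so that the labels are read from them) is a choice made for this sketch.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

# minibatch yields lists of up to `size` items
print(list(minibatch(range(5), size=2)))

nlp = spacy.blank("en")
nlp.add_pipe("ner")
data = [
    ("Who is Astolfo?", {"entities": [(7, 14, "PERSON")]}),
    ("Who is Gustavo?", {"entities": [(7, 14, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in data]

# Labels are collected from the examples during initialization
optimizer = nlp.initialize(lambda: examples)

losses = {}
random.shuffle(examples)
for batch in minibatch(examples, size=2):
    # one update per batch instead of one per example
    nlp.update(batch, drop=0.5, sgd=optimizer, losses=losses)
print(losses)
```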

#### The result of training from scratch is very poor, although this is expected: our training set is inadequate

In [11]:
doc = nlp_blank(u"Who is Nishant? He is in Minas Gerais or Macaé, Brazil or Brasil. I am in my apartment at Barra da Tijuca. If Anderson, Gustavo and Astolfo met up us, we will make a party.")

for token in doc:
    print(token.text, token.pos_)
print('--- ENT ---')
for ent in doc.ents:
    print(ent.text,ent.start_char, ent.end_char,ent.label_)

Who 
is 
Nishant 
? 
He 
is 
in 
Minas 
Gerais 
or 
Macaé 
, 
Brazil 
or 
Brasil 
. 
I 
am 
in 
my 
apartment 
at 
Barra 
da 
Tijuca 
. 
If 
Anderson 
, 
Gustavo 
and 
Astolfo 
met 
up 
us 
, 
we 
will 
make 
a 
party 
. 
--- ENT ---
Nishant 7 14 PERSON
Minas 25 30 PERSON
Gerais 31 37 PERSON
Macaé 41 46 LOC
Brazil 48 54 LOC
Brasil 58 64 LOC
Barra da Tijuca 90 105 LOC
Anderson 110 118 PERSON
Gustavo 120 127 PERSON
Astolfo 132 139 LOC
, we will 149 158 LOC
party 166 171 LOC
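To put a number on how poor the output is, predicted spans can be compared with hand-labeled ones. The sketch below is only an illustration: `span_scores` is a hypothetical helper, the predicted spans are four rows copied from the output above, and the gold spans are an assumption about the intended labels, not data from the original notebook.

```python
def span_scores(predicted, gold):
    """Exact-match precision and recall over (start, end, label) tuples."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Four predictions taken from the output above
predicted = [
    (25, 30, "PERSON"),   # Minas
    (31, 37, "PERSON"),   # Gerais
    (90, 105, "LOC"),     # Barra da Tijuca
    (120, 127, "PERSON"), # Gustavo
]
# Assumed gold labels for the same stretch of text (hypothetical)
gold = [
    (25, 37, "LOC"),      # Minas Gerais
    (90, 105, "LOC"),     # Barra da Tijuca
    (120, 127, "PERSON"), # Gustavo
]

precision, recall = span_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict: splitting "Minas Gerais" into two spans counts as two wrong predictions and one missed entity.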


## Reusing the pretrained spaCy model

In [14]:
nlp_comp = spacy.load('en_core_web_sm')

##### In spaCy 3 it is no longer necessary to call create_pipe (the examples call create_pipe and then add_pipe)

In [15]:
if 'ner' not in nlp_comp.pipe_names:
    ner_comp = nlp_comp.add_pipe('ner')    
else:
    ner_comp = nlp_comp.get_pipe('ner')

##### To keep training a pretrained spaCy model, use *create_optimizer* instead of begin_training, which would reinitialize the model's weights

In [16]:
n_iter=100
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner_comp.add_label(ent[2])

# using the same pipe exceptions as the Kaggle example
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp_comp.pipe_names if pipe not in pipe_exceptions]
with nlp_comp.disable_pipes(*other_pipes):  
    optimizer = nlp_comp.create_optimizer()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        batches = minibatch(TRAIN_DATA, size=2)
        for batch in batches:
            for text, annotations in batch:
                doc = nlp_comp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp_comp.update( [example],
                    drop=0.5,  
                    sgd=optimizer,
                    losses=losses)
        print(losses)

{'ner': 11.165581656949426}
{'ner': 5.177444001609533}
{'ner': 5.435967956115739}
{'ner': 6.862212636792849}
{'ner': 5.361327985616445}
{'ner': 3.6201770044390473}
{'ner': 7.487692069636954}
{'ner': 2.6575723538952607}
{'ner': 2.923849790352127}
{'ner': 0.31361643224690994}
{'ner': 1.9999626883582804}
{'ner': 0.09339892612066049}
{'ner': 0.006701308884077068}
{'ner': 0.0029198915784581963}
{'ner': 4.493339208968703e-05}
{'ner': 0.004537594570875606}
{'ner': 0.0005037831261631808}
{'ner': 0.00020129984464048928}
{'ner': 0.2154272784489902}
{'ner': 6.635966388707149e-05}
{'ner': 0.03085110000431699}
{'ner': 0.12313657447292987}
{'ner': 4.8241546098040704e-05}
{'ner': 1.9482305500996857}
{'ner': 7.567607187445672e-05}
{'ner': 1.3311514316339505e-05}
{'ner': 0.3379472472738589}
{'ner': 2.0597767749759278e-05}
{'ner': 0.0195511118569808}
{'ner': 2.0958307901139746e-06}
{'ner': 3.5163345704016714e-06}
{'ner': 0.0016660147394312913}
{'ner': 0.00038475593740161187}
{'ner': 2.887879041947466e-0

#### Much better result than the previous test, as expected.

In [17]:
doc = nlp_comp(u"Who is Nishant? He is in Minas Gerais or Macaé, Brazil or Brasil. I am in my apartment at Barra da Tijuca. If Anderson, Gustavo and Astolfo met up us, we will make a party.")

for token in doc:
    print(token.text, token.pos_)
print('--- ENT ---')
for ent in doc.ents:
    print(ent.text,ent.start_char, ent.end_char,ent.label_)

Who PRON
is AUX
Nishant ADJ
? PUNCT
He PRON
is AUX
in ADP
Minas PROPN
Gerais PROPN
or CCONJ
Macaé PROPN
, PUNCT
Brazil PROPN
or CCONJ
Brasil PROPN
. PUNCT
I PRON
am AUX
in ADP
my PRON
apartment NOUN
at ADP
Barra PROPN
da PROPN
Tijuca PROPN
. PUNCT
If SCONJ
Anderson PROPN
, PUNCT
Gustavo PROPN
and CCONJ
Astolfo PROPN
met VERB
up ADP
us PRON
, PUNCT
we PRON
will AUX
make VERB
a DET
party NOUN
. PUNCT
--- ENT ---
Nishant 7 14 PERSON
Minas Gerais 25 37 LOC
Macaé 41 46 LOC
Brazil 48 54 LOC
Brasil 58 64 PERSON
Barra da Tijuca 90 105 LOC
Anderson 110 118 PERSON
Gustavo 120 127 PERSON
Astolfo 132 139 PERSON


#### Saving the model

In [18]:
output_dir = Path("C:\\temporario")
# Create the directory (and any missing parents) if it does not exist yet
output_dir.mkdir(parents=True, exist_ok=True)
nlp_comp.to_disk(output_dir)
print("Saved model to", output_dir)

Saved model to C:\temporario
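A saved pipeline can be loaded back by passing the directory to `spacy.load`. The sketch below is a self-contained round trip under stated assumptions: it uses a small blank pipeline and a temporary directory instead of the trained model and path above, and `ner_model` is a hypothetical directory name.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("ner")
example = Example.from_dict(
    nlp.make_doc("I like London."), {"entities": [(7, 13, "LOC")]}
)
# Untrained weights; initialization is enough to exercise serialization
nlp.initialize(lambda: [example])

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "ner_model"   # hypothetical directory name
    nlp.to_disk(model_dir)
    reloaded = spacy.load(model_dir)      # load the pipeline from the saved directory

print(reloaded.pipe_names)
print(reloaded.get_pipe("ner").labels)
```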
