## spaCy (version 2.0.18) is the biggest challenger to NLTK for natural language preprocessing
### Pros: most preprocessing tools (tokenization, stopwords, lemmatization, POS tagging, dependency parsing) work well and fast with Spanish; easy-to-read documentation; attractive visualization package (displacy); textacy is built on top of spaCy; provides two pretrained Spanish models
### Cons: many third-party libraries are built on top of NLTK; named entity recognition (NER) does not work well out of the box (but you can teach it)
### This notebook shows the preprocessing tools of spaCy, the ease and attractiveness of its visualization tool (displacy), and how we can edit the entities on a document so that it recognizes the ones we deem important.

In [22]:
import pandas as pd
import es_core_news_md
from spacy import displacy
from IPython.display import display, HTML
from spacy.tokens import Span

In [None]:
#DOWNLOAD - not necessary if already loaded
!python -m spacy download es_core_news_md

In [23]:
#Pass text through nlp
nlp = es_core_news_md.load()
doc = nlp(u'Sres de organización centro: por favor necesito de forma urgente que se informe al sr responsable de la grúa, que se me envíe el recibo de pago por mail o por la vía que le sea más conveniente. Necesito presentarla a mi lugar de trabajo. De haber sabido que no tenía recibo, no abonaba. Siempre actuó de buena fe. Hasta soporte ni sólo el riesgo de vida de mi mascota de 5 kilos, si no el hecho de que el sr en cuestión hablaba permanentemente por celular de cuestiones personales. Debo enviar pruebas de esto también? Gracias.')

In [27]:
#construct a pandas dataframe so that we can visualize the results
text = []
lemma = []
pos = []
pos_spec = []
dep = []
shape = []
alpha = []
stop = []
for token in doc[:40]:   
    text.append(token.text)
    lemma.append(token.lemma_)
    pos.append(token.pos_)
    pos_spec.append(token.tag_)
    dep.append(token.dep_)
    shape.append(token.shape_)
    alpha.append(token.is_alpha)
    stop.append(token.is_stop)
print(len(text))

40


In [28]:
df = pd.DataFrame({'TEXT' : text, 
                  'LEMMA' : lemma,
                 'POS' : pos,
                 'POS-Tag(more specific)' : pos_spec,
                 'DEP' : dep,
                 'SHAPE' : shape,
                 'ALPHA' : alpha,
                 'STOP?' : stop,})
df

Unnamed: 0,TEXT,LEMMA,POS,POS-Tag(more specific),DEP,SHAPE,ALPHA,STOP?
0,Sres,Sres,NOUN,NOUN__Gender=Masc|Number=Plur,ROOT,Xxxx,True,False
1,de,de,ADP,ADP__AdpType=Prep,case,xx,True,True
2,organización,organización,NOUN,NOUN__Gender=Fem|Number=Sing,nmod,xxxx,True,False
3,centro,centrar,NOUN,NOUN__Gender=Masc|Number=Sing,amod,xxxx,True,False
4,:,:,PUNCT,PUNCT__PunctType=Colo,punct,:,False,False
5,por,por,ADP,ADP__AdpType=Prep,advmod,xxx,True,True
6,favor,favor,NOUN,NOUN___,fixed,xxxx,True,False
7,necesito,necesitar,VERB,VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres...,acl,xxxx,True,False
8,de,de,ADP,ADP__AdpType=Prep,case,xx,True,True
9,forma,formar,NOUN,NOUN__Gender=Fem|Number=Sing,obl,xxxx,True,False


### Here spaCy performs better than NLTK at POS tagging and dependency parsing. Stopword flags are also built into every token (`is_stop`), which saves us a step when filtering the data.
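
As a rough illustration of that filtering step, the sketch below keeps only the lemmas of alphabetic, non-stopword tokens. To stay runnable without spaCy it works on the first ten rows of the table above, copied by hand into plain tuples; the `rows` list and the `content_lemmas` helper are hypothetical names, not part of spaCy.

```python
# First ten rows of the table above, copied by hand:
# (text, lemma, is_alpha, is_stop)
rows = [
    ("Sres", "Sres", True, False),
    ("de", "de", True, True),
    ("organización", "organización", True, False),
    ("centro", "centrar", True, False),
    (":", ":", False, False),
    ("por", "por", True, True),
    ("favor", "favor", True, False),
    ("necesito", "necesitar", True, False),
    ("de", "de", True, True),
    ("forma", "formar", True, False),
]


def content_lemmas(rows):
    """Return the lemmas of tokens that are alphabetic and not stopwords."""
    return [lemma for _text, lemma, is_alpha, is_stop in rows
            if is_alpha and not is_stop]


kept = content_lemmas(rows)
print(kept)
```

On a real `Doc` the same filter would read `[t.lemma_ for t in doc if t.is_alpha and not t.is_stop]`. Note that the table lemmatizes the nouns "centro" and "forma" to the verbs "centrar" and "formar", so the lemmas are worth spot-checking before relying on them.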

In [5]:
#Dependency visualization looks good!
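# jupyter=False makes render() return the markup as a string instead of displaying it itself
html = displacy.render(doc, style='dep', jupyter=False)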
display(HTML(html))

In [6]:
#However, NER is not so good :(
html = displacy.render(doc, style='ent', jupyter=False)
display(HTML(html))

### Let's try to edit the entities so that the name of our client is recognized

In [3]:
#This grabs all of the "entities" that spaCy recognizes, although they are all wrong.
print('BEFORE')
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

BEFORE
[('Sres', 0, 4, 'LOC'), ('Necesito', 194, 202, 'LOC'), ('Siempre actuó de buena fe', 287, 312, 'MISC'), ('Hasta soporte ni sólo', 314, 335, 'MISC'), ('Debo', 482, 486, 'PER'), ('Gracias', 519, 526, 'PER')]


In [29]:
#We want 'organización centro' tagged as "ORG" (companies, agencies and institutions). First check the current tags of both tokens.
ent_org = [doc[2].text, doc[2].ent_iob_, doc[2].ent_type_]
ent_centro = [doc[3].text, doc[3].ent_iob_, doc[3].ent_type_]
print(ent_org)
print(ent_centro)

['organización', 'O', '']
['centro', 'O', '']


In [32]:
ORG = doc.vocab.strings[u'ORG']
oc_ent = Span(doc, 2, 4, label=ORG) 
doc.ents = list(doc.ents) + [oc_ent]

In [34]:
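# It worked: 'organización centro' (chars 8-27) is now an ORG entity.
# Note: the long ORG span at chars 48-135 is not created by the cell above. Execution count 33 is
# missing from this notebook, so it presumably comes from a cell that was run and later removed.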
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('AFTER:', ents)

AFTER: [('Sres', 0, 4, 'LOC'), ('organización centro', 8, 27, 'ORG'), ('de forma urgente que se informe al sr responsable de la grúa, que se me envíe el recibo', 48, 135, 'ORG'), ('Necesito', 194, 202, 'LOC'), ('Siempre actuó de buena fe', 287, 312, 'MISC'), ('Hasta soporte ni sólo', 314, 335, 'MISC'), ('Debo', 482, 486, 'PER'), ('Gracias', 519, 526, 'PER')]
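
`Span(doc, 2, 4, ...)` takes token indices (end exclusive), while the printed entities report character offsets. The sketch below illustrates the mapping on the opening words of the text. It uses a simple regular expression as a stand-in tokenizer, which is an assumption and not spaCy's tokenizer, although it happens to agree with the token table above for this fragment. `span_char_offsets` is a hypothetical helper.

```python
import re
import unicodedata

# Opening words of the document; NFC keeps "ó" as a single character
text = unicodedata.normalize(
    "NFC", "Sres de organización centro: por favor necesito de forma urgente"
)

# Rough stand-in tokenizer (NOT spaCy's): runs of word characters,
# or single punctuation marks. Each entry is (token, start_char, end_char).
tokens = [(m.group(), m.start(), m.end())
          for m in re.finditer(r"\w+|[^\w\s]", text)]


def span_char_offsets(tokens, start, end):
    """Character offsets covered by tokens[start:end] (end exclusive)."""
    return tokens[start][1], tokens[end - 1][2]


start_char, end_char = span_char_offsets(tokens, 2, 4)
print(text[start_char:end_char], start_char, end_char)

# Token 8 is the second "de"; it starts where the long ORG span starts
print(tokens[8])
```

Tokens 2 and 3 cover characters 8 to 27, matching the `('organización centro', 8, 27, 'ORG')` entry above. Token 8 starts at character 48, the start of the other ORG span, so mixing up the two kinds of index is an easy mistake to make.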


In [7]:
#Now let's see the visualization
html = displacy.render(doc, style='ent', jupyter=False)
display(HTML(html))

### But it does not work with new text: assigning to `doc.ents` only changes that one `Doc`, not the model

In [20]:
new_doc = nlp("Sres, desafortunadamente, el modulo no reconoce organización centro como una empresa.")

In [21]:
html = displacy.render(new_doc, style='ent', jupyter=False)
display(HTML(html))

### We can train or update a spaCy statistical model to recognize entities that we deem important, such as the client's name and words relevant to auto insurance, save the result as a spaCy model, and load it whenever we want to analyze data from that client.
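
A first step toward that is preparing annotated examples. spaCy 2.x training examples are conventionally written as `(text, {"entities": [(start_char, end_char, label)]})` tuples, with character offsets rather than token indices. The sketch below only builds such tuples by plain string search; the training loop itself is left out. `make_ner_example` is a hypothetical helper, and the second sentence is a shortened version of the complaint text used earlier.

```python
def make_ner_example(text, phrase, label):
    """Build one (text, annotations) tuple, tagging every occurrence of phrase."""
    if not phrase:
        raise ValueError("phrase must not be empty")
    entities = []
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        entities.append((start, end, label))
        start = text.find(phrase, end)
    return text, {"entities": entities}


sentences = [
    "Sres, desafortunadamente, el modulo no reconoce organización centro como una empresa.",
    "Sres de organización centro: por favor necesito de forma urgente el recibo de pago.",
]

train_data = [make_ner_example(s, "organización centro", "ORG") for s in sentences]

for text, annotations in train_data:
    print(annotations["entities"])
```

Exact string matching is only a starting point: it misses capitalization and spelling variants, and a useful training set would also need sentences where other entity types are annotated, so that the model does not forget what it already knows.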