# NLP II - Exercise 1 solution

Create a tagger that can detect an object and its owner. Solve it using the EAGLES tags.

Create the tagger from scratch; you can't use the cess_esp training set.

Your tagger must be able to correctly tag at least these phrases:

- Eustaquio tiene un coche amarillo
- Hermenegildo tiene un mechero
- Sole tiene un huevo de avestruz
- María quiere una tele curva
- El niño pide tortilla

The output must be a dictionary like this: {"Person": "Eustaquio", "Object": "Coche amarillo"}

## Libraries

We will work with the NLTK library: its built-in tokeniser, the cess_esp corpus and the four taggers seen in the unit (unigram, bigram, trigram and hidden Markov model). We will also import NLTK's RegexpParser.

Finally, as an optional add-on to make the work easier, sklearn's train_test_split can be used.

In [19]:
import nltk

from nltk.tokenize import word_tokenize

# Import the training corpus
from nltk.corpus import cess_esp

# Import the taggers
from nltk import UnigramTagger, BigramTagger, TrigramTagger  # n-gram taggers
from nltk.tag.hmm import HiddenMarkovModelTagger

# Import the regular-expression chunk parser
from nltk.chunk.regexp import *

In [9]:
def tokenizar(_frase):
    
    return word_tokenize(_frase, 'spanish')

We tokenise a first sentence to tag it by hand.

In [6]:
frase = 'El niño tiene unos zapatos'
tokens = word_tokenize(frase, 'spanish')
tokens

['El', 'niño', 'tiene', 'unos', 'zapatos']

In [7]:
frase_tags = [[
    ('El', 'tdms0'),
    ('niño', 'ncms000'),
    ('tiene', 'vmip3sm'),
    ('unos', 'mcmp00'),
    ('zapatos', 'ncmp000')
]]
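The tags assigned by hand are positional: each character encodes one attribute of the word. As a rough illustration, the sketch below decodes the first four positions of a noun tag. The function name `decode_noun_tag` and the lookup tables are illustrative assumptions that cover only noun tags, not the full EAGLES specification.

```python
# Illustrative lookup tables (assumption: noun tags only, first four positions).
NOUN_TYPE = {'c': 'common', 'p': 'proper'}
GENDER = {'m': 'masculine', 'f': 'feminine', 'c': 'common', '0': None}
NUMBER = {'s': 'singular', 'p': 'plural', 'n': 'invariable', '0': None}


def decode_noun_tag(tag):
    """Decode category, type, gender and number from an EAGLES-style noun tag."""
    tag = tag.lower()
    if len(tag) < 4 or tag[0] != 'n':
        raise ValueError(f'not a noun tag: {tag!r}')
    return {
        'category': 'noun',
        'type': NOUN_TYPE.get(tag[1]),
        'gender': GENDER.get(tag[2]),
        'number': NUMBER.get(tag[3]),
    }


print(decode_noun_tag('ncms000'))  # tag used for 'niño'
print(decode_noun_tag('ncmp000'))  # tag used for 'zapatos'
```

Reading the tags this way makes it easier to see why chunk patterns that match only a tag prefix, such as `nc` for common nouns, are enough for this exercise.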

Once we have assigned a tag to each token of the phrase, we train the different taggers and run some tests.

In [37]:
# Taggers trained on the single hand-tagged sentence
uni1 = UnigramTagger(frase_tags)
tri1 = TrigramTagger(frase_tags, backoff=uni1)
hmm1 = HiddenMarkovModelTagger.train(frase_tags)

# Taggers trained on the full cess_esp corpus
uni_es = UnigramTagger(cess_esp.tagged_sents())
hmm_es = HiddenMarkovModelTagger.train(cess_esp.tagged_sents())

In [16]:
frase2 = tokenizar('El abuelo tiene un bastón')
tri1.tag(frase2)

[('El', 'tdms0'),
 ('abuelo', None),
 ('tiene', 'vmip3sm'),
 ('un', None),
 ('bastón', None)]

In [14]:
uni1.tag(frase2)

[('El', 'tdms0'),
 ('abuelo', None),
 ('tiene', 'vmip3sm'),
 ('un', None),
 ('bastón', None)]

In [18]:
hmm1.tag(frase2)

[('El', 'tdms0'),
 ('abuelo', 'ncms000'),
 ('tiene', 'vmip3sm'),
 ('un', 'mcmp00'),
 ('bastón', 'ncmp000')]
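The unigram and trigram taggers return `None` for every word that did not appear in the training sentence, because n-gram taggers only know the contexts they saw during training. One common way to avoid `None` is to end the backoff chain with a `DefaultTagger`. The sketch below shows this under one loud assumption: that guessing `ncms000` (masculine singular common noun) for unknown words is acceptable. The names `por_defecto` and `uni_backoff` are illustrative.

```python
from nltk import DefaultTagger, UnigramTagger

# The hand-tagged training sentence from the cells above.
frase_tags = [[
    ('El', 'tdms0'),
    ('niño', 'ncms000'),
    ('tiene', 'vmip3sm'),
    ('unos', 'mcmp00'),
    ('zapatos', 'ncmp000'),
]]

# Assumption for illustration: any unseen word is tagged as a common noun.
por_defecto = DefaultTagger('ncms000')
uni_backoff = UnigramTagger(frase_tags, backoff=por_defecto)

etiquetas = uni_backoff.tag(['El', 'abuelo', 'tiene', 'un', 'bastón'])
print(etiquetas)
```

Every token now receives a tag, but the default is only a guess: 'un' is labelled as a noun here, which is why a larger training corpus is still needed.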

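We manually add some names to our tagger so that it can recognise them. Note that `HiddenMarkovModelTagger.train` is a class method: calling it on `hmm_es` builds a new tagger from these six one-word sentences alone instead of extending the cess_esp model. Going by the execution counts, the cess_esp model was rebuilt afterwards (cell 37), so the outputs below appear to come from that model.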

In [33]:
hmm_es = hmm_es.train([

    [('Eustaquio', 'NP00000')],
    [('Hermenegildo', 'NP00000')],
    [('Sole', 'NP00000')],
    [('Rodolfito', 'NP00000')],
    [('Marta', 'NP00000')],
    [('Pedro', 'NP00000')]

])
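A possible alternative for teaching the pipeline a closed list of names is a lookup table placed in front of another tagger. The sketch below uses a `UnigramTagger` built from a `model` dictionary with a backoff, rather than the HMM tagger used above. The names `nombres`, `base` and `con_nombres` are illustrative, and the lowercase tag `np00000` is an assumption chosen so that a chunk pattern such as `<np.*>` would match it.

```python
from nltk import UnigramTagger

# The hand-tagged training sentence from the cells above.
frase_tags = [[
    ('El', 'tdms0'),
    ('niño', 'ncms000'),
    ('tiene', 'vmip3sm'),
    ('unos', 'mcmp00'),
    ('zapatos', 'ncmp000'),
]]

# Assumption: a fixed word -> tag table for the proper names we care about.
nombres = {
    'Eustaquio': 'np00000',
    'Hermenegildo': 'np00000',
    'Sole': 'np00000',
}

base = UnigramTagger(frase_tags)
# The lookup tagger answers first; any other word falls back to `base`.
con_nombres = UnigramTagger(model=nombres, backoff=base)

resultado = con_nombres.tag(['Sole', 'tiene', 'unos', 'zapatos'])
print(resultado)
```

This keeps the name list separate from the statistical model, so names can be added without retraining anything.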

In [38]:
corpus = [
    'Eustaquio tiene un coche amarillo',
    'Hermenegildo tiene un mechero',
    'Sole tiene un huevo de avestruz',
    'María quiere una tele curva'

]

In [39]:
frases_tags = []

for frase in corpus:
    
    tokens = tokenizar(frase)
    
    frases_tags.append(hmm_es.tag(tokens))
    
frases_tags

[[('Eustaquio', 'sn.e-SUJ'),
  ('tiene', 'vmip3s0'),
  ('un', 'di0ms0'),
  ('coche', 'ncms000'),
  ('amarillo', 'aq0ms0')],
 [('Hermenegildo', 'sn.e-SUJ'),
  ('tiene', 'vmip3s0'),
  ('un', 'di0ms0'),
  ('mechero', 'ncms000')],
 [('Sole', 'sn.e-SUJ'),
  ('tiene', 'vmip3s0'),
  ('un', 'di0ms0'),
  ('huevo', 'ncms000'),
  ('de', 'sps00'),
  ('avestruz', 'da0fs0')],
 [('María', 'sn.e-SUJ'),
  ('quiere', 'vmip3s0'),
  ('una', 'di0fs0'),
  ('tele', 'ncfs000'),
  ('curva', 'aq0fs0')]]

Now we create the chunking rules for our parser.

In [95]:
reglas = r'''

Quien: {<sn.e-SUJ>}
Qué: <di.*> { <nc.*> <di.*> <aq.*> | <nc.*> <sp.*> <da.*> | <nc.*> <sp.*> <np.*> | <nc.*> <aq.*> | <nc.*> }
Qué: {<nc.*> <sp.*> <da.*>}

'''

parser = nltk.RegexpParser(reglas)

In [52]:
def parsear(_tokens):
    return parser.parse(_tokens)

In [103]:
print(parsear(frases_tags[0]))

(S
  (Quien Eustaquio/sn.e-SUJ)
  tiene/vmip3s0
  un/di0ms0
  (Qué coche/ncms000 amarillo/aq0ms0))
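In the grammar above, the tag written before the opening brace acts as context: it must be present for the rule to fire, but it is left outside the chunk, which is why `un` stays out of the `Qué` node. The sketch below reduces the idea to a single rule. The `NP` label and the one-line grammar are illustrative assumptions, not the grammar defined above.

```python
import nltk

# Illustrative one-rule grammar: a noun with an optional adjective is chunked
# only when a determiner precedes it; the determiner itself is not chunked.
gramatica = r'NP: <di.*>{<nc.*><aq.*>?}'
chunker = nltk.RegexpParser(gramatica)


def chunks_np(etiquetadas):
    """Return the leaves of every NP chunk found in a tagged sentence."""
    arbol = chunker.parse(etiquetadas)
    return [sub.leaves() for sub in arbol.subtrees(lambda t: t.label() == 'NP')]


con_det = [('tiene', 'vmip3s0'), ('un', 'di0ms0'),
           ('coche', 'ncms000'), ('amarillo', 'aq0ms0')]
sin_det = [('pide', 'vmip3s0'), ('tortilla', 'ncfs000')]

print(chunks_np(con_det))
print(chunks_np(sin_det))
```

A noun with no determiner in front of it is not chunked by a rule that requires this context, so a sentence like "El niño pide tortilla" needs an additional rule without it.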


We create a function to extract the owner and the object from the parsed tree.

In [118]:
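def extraer(_tree):
    # Collect the owner ('Quien') and the object ('Qué') from a chunked tree.
    result = {'Quien': None, 'Qué': None}

    for nodo in _tree:
        # Chunks are nltk.Tree nodes; unchunked tokens are plain (word, tag) tuples.
        if isinstance(nodo, nltk.Tree):
            if nodo.label() == 'Quien':
                word, tag = nodo[0]
                result['Quien'] = word
            elif nodo.label() == 'Qué':
                # Join the words of the chunk; if several chunks match, the last one wins.
                result['Qué'] = ' '.join(word for word, tag in nodo)

    return result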
    

In [123]:
extraer(parsear(frases_tags[2]))

{'Quien': 'Sole', 'Qué': 'huevo de avestruz'}
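The exercise statement asks for the keys "Person" and "Object", with the object capitalised, whereas the function above returns the chunk labels as keys. A minimal sketch of the conversion follows. The helper name `to_required_format` is hypothetical and the code only assumes a dictionary shaped like the one returned above.

```python
def to_required_format(result):
    """Map the 'Quien'/'Qué' keys to the 'Person'/'Object' keys of the exercise."""
    objeto = result.get('Qué')
    if objeto:
        # Capitalise only the first letter and leave the rest untouched.
        objeto = objeto[0].upper() + objeto[1:]
    return {'Person': result.get('Quien'), 'Object': objeto}


print(to_required_format({'Quien': 'Sole', 'Qué': 'huevo de avestruz'}))
```

Keeping the conversion in its own function leaves the chunk labels free to stay in Spanish while the final output matches the requested format.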