# NLP II - Exercise 2 solution

A major restaurant chain wants to implement a chat ordering system, so that customers can simply write their order on a tablet or via WhatsApp and a virtual assistant takes care of managing the order with the kitchen and waiters.

The restaurant needs separate lists of the dishes and the drinks, along with the number of units of each item ordered (and, optionally, the ingredients).

To do this exercise, train and compare the performance of each tagger, then choose the best one. You will have to retrain your tagger if needed.

## Libraries

We will work with the NLTK library: its built-in tokeniser, the cess_esp corpus and the 4 taggers seen in the unit. We will also import the NLTK RegEx parser.

Finally, as optional add-ons that make the work easier, we will use sklearn's train_test_split and pandas.

In [1]:
import nltk

#We import the NLTK component to tokenise
from nltk.tokenize import word_tokenize

#We import the CESS corpus in Spanish
from nltk.corpus import cess_esp

#N-gram and HMM taggers
from nltk import UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag.hmm import HiddenMarkovModelTagger

#RegEx Parser (import only the class we need instead of a wildcard import)
from nltk.chunk.regexp import RegexpParser


#This will allow us to create the test and train sets.
from sklearn.model_selection import train_test_split

#Finally we import pandas
import pandas as pd

## Tagger training

To train the taggers, we need a tagged corpus in Spanish. Let's first look at what cess_esp contains.

In [2]:
cess_esp.sents()

[['El', 'grupo', 'estatal', 'Electricité_de_France', '-Fpa-', 'EDF', '-Fpt-', 'anunció', 'hoy', ',', 'jueves', ',', 'la', 'compra', 'del', '51_por_ciento', 'de', 'la', 'empresa', 'mexicana', 'Electricidad_Águila_de_Altamira', '-Fpa-', 'EAA', '-Fpt-', ',', 'creada', 'por', 'el', 'japonés', 'Mitsubishi_Corporation', 'para', 'poner_en_marcha', 'una', 'central', 'de', 'gas', 'de', '495', 'megavatios', '.'], ['Una', 'portavoz', 'de', 'EDF', 'explicó', 'a', 'EFE', 'que', 'el', 'proyecto', 'para', 'la', 'construcción', 'de', 'Altamira_2', ',', 'al', 'norte', 'de', 'Tampico', ',', 'prevé', 'la', 'utilización', 'de', 'gas', 'natural', 'como', 'combustible', 'principal', 'en', 'una', 'central', 'de', 'ciclo', 'combinado', 'que', 'debe', 'empezar', 'a', 'funcionar', 'en', 'mayo_del_2002', '.'], ...]

In [3]:
cess_esp.tagged_sents()

[[('El', 'da0ms0'), ('grupo', 'ncms000'), ('estatal', 'aq0cs0'), ('Electricité_de_France', 'np00000'), ('-Fpa-', 'Fpa'), ('EDF', 'np00000'), ('-Fpt-', 'Fpt'), ('anunció', 'vmis3s0'), ('hoy', 'rg'), (',', 'Fc'), ('jueves', 'W'), (',', 'Fc'), ('la', 'da0fs0'), ('compra', 'ncfs000'), ('del', 'spcms'), ('51_por_ciento', 'Zp'), ('de', 'sps00'), ('la', 'da0fs0'), ('empresa', 'ncfs000'), ('mexicana', 'aq0fs0'), ('Electricidad_Águila_de_Altamira', 'np00000'), ('-Fpa-', 'Fpa'), ('EAA', 'np00000'), ('-Fpt-', 'Fpt'), (',', 'Fc'), ('creada', 'aq0fsp'), ('por', 'sps00'), ('el', 'da0ms0'), ('japonés', 'aq0ms0'), ('Mitsubishi_Corporation', 'np00000'), ('para', 'sps00'), ('poner_en_marcha', 'vmn0000'), ('una', 'di0fs0'), ('central', 'ncfs000'), ('de', 'sps00'), ('gas', 'ncms000'), ('de', 'sps00'), ('495', 'Z'), ('megavatios', 'ncmp000'), ('.', 'Fp')], [('Una', 'di0fs0'), ('portavoz', 'nccs000'), ('de', 'sps00'), ('EDF', 'np00000'), ('explicó', 'vmis3s0'), ('a', 'sps00'), ('EFE', 'np00000'), ('que', 'c

As we can see, `cess_esp.sents()` returns a set of tokenised sentences on various topics.

`cess_esp.tagged_sents()` returns the same sentences with a part-of-speech tag attached to each token.

Now we will split the tagged sentences into 2 sets: 90% for training and 10% for testing. The split is made by sentence, so the sizes printed below count sentences rather than tokens.

In [4]:
#We generate the Train and Test sets
data_train, data_test = train_test_split(cess_esp.tagged_sents(), test_size=0.10, random_state=1)

print('Train tokens:',len(data_train),
      '\nTokens test:    ',len(data_test))

Train tokens: 5427 
Tokens test:     603


With the sets created, we move on to training the taggers.

To train an n-gram tagger we instantiate its class with the training set, for example `UnigramTagger(data_train)`. An n-gram tagger can take another n-gram tagger as its backoff, which is consulted whenever the first one has no answer for a token.

`HiddenMarkovModelTagger` is trained differently: we call its class method `.train()` with the training set.
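To make the backoff chain concrete, here is a minimal sketch on a two-sentence toy corpus. The toy sentences and the `DefaultTagger` fallback tag are assumptions for illustration and are not part of the exercise. Each tagger answers only for contexts it saw during training and otherwise hands the token to its backoff.

```python
from nltk import BigramTagger, DefaultTagger, UnigramTagger

# Hypothetical toy corpus, hand-tagged for illustration only.
toy_train = [
    [('una', 'di0fs0'), ('cerveza', 'ncfs000')],
    [('unos', 'di0mp0'), ('macarrones', 'ncmp000')],
]

# Last resort: label anything unknown as a masculine singular common noun.
default_toy = DefaultTagger('ncms000')
unigram_toy = UnigramTagger(toy_train, backoff=default_toy)
bigram_toy = BigramTagger(toy_train, backoff=unigram_toy)

# Without a backoff, an unseen word gets no tag at all.
no_backoff = UnigramTagger(toy_train).tag(['una', 'pizza'])

# With the chain, 'pizza' falls through bigram -> unigram -> default.
with_backoff = bigram_toy.tag(['una', 'pizza'])

print(no_backoff)
print(with_backoff)
```

In the first result `pizza` is tagged `None`; in the second it receives the default tag, which may be wrong but keeps later steps from having to handle missing tags.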



In [5]:
unigram  = UnigramTagger(data_train)
bigram   = BigramTagger(data_train, backoff=unigram)
trigram  = TrigramTagger(data_train, backoff=bigram)
hmm      = HiddenMarkovModelTagger.train(data_train)

Once the taggers have been trained, we evaluate how each of them performs on the test set. All the taggers are evaluated with the `evaluate()` method, not `train()`. On recent NLTK versions `evaluate()` is deprecated in favour of `accuracy()`, as the warnings in the output indicate.

When you run the evaluation, pay attention to the time each tagger takes to display its score. The n-gram taggers are quite fast, while the HMM takes noticeably longer.

In [6]:
print ('Hit with unigramas: %.2f %%' % (unigram.evaluate(data_test)*100))
print ('Hit with bigramas:  %.2f %%' % (bigram.evaluate(data_test)*100))
print ('Hit with trigramas: %.2f %%' % (trigram.evaluate(data_test)*100))
print ('Hit with HMMs:      %.2f %%' % (hmm.evaluate(data_test)*100))

  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  print ('Hit with unigramas: %.2f %%' % (unigram.evaluate(data_test)*100))
  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  print ('Hit with bigramas:  %.2f %%' % (bigram.evaluate(data_test)*100))
  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  print ('Hit with trigramas: %.2f %%' % (trigram.evaluate(data_test)*100))
  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  print ('Hit with HMMs:      %.2f %%' % (hmm.evaluate(data_test)*100))


Hit with unigramas: 87.27 %
Hit with bigramas:  88.78 %
Hit with trigramas: 88.77 %
Hit with HMMs:      89.57 %


Now we can train the taggers with the test data. Note that the next cell builds new taggers from `data_test` alone; it does not add the test sentences to the models trained above. As the volume of data we are using is small, we should not expect a big improvement in general terms.

If we evaluate these taggers again on the test set, we get almost 100% accuracy, because they are being scored on the very sentences they were trained on. These scores therefore say little about how the taggers handle unseen text. The gain is largest for the n-gram taggers, which essentially memorise the contexts they have seen. The HMM gains less on data it has seen, but it tends to perform better than the rest on tokens it was not trained on.

In [7]:
unigram  = UnigramTagger(data_test)
bigram   = BigramTagger(data_test, backoff=unigram)
trigram  = TrigramTagger(data_test, backoff=bigram)
hmm      = HiddenMarkovModelTagger.train(data_test)

print ('Acierto con unigramas: %.2f %%' % (unigram.evaluate(data_test)*100))
print ('Acierto con bigramas:  %.2f %%' % (bigram.evaluate(data_test)*100))
print ('Acierto con trigramas: %.2f %%' % (trigram.evaluate(data_test)*100))
print ('Acierto con HMMs:      %.2f %%' % (hmm.evaluate(data_test)*100))

Acierto con unigramas: 96.67 %
Acierto con bigramas:  98.86 %
Acierto con trigramas: 99.56 %
Acierto con HMMs:      93.62 %
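The effect of scoring a tagger on its own training data can be reproduced with a tiny memorising tagger. This is a sketch in plain Python, not the NLTK implementation; the helper names and the two hand-tagged sentences are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_lookup(tagged_sents):
    """Memorise the most frequent tag of every word seen in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def lookup_accuracy(model, tagged_sents):
    """Fraction of tokens whose memorised tag matches the gold tag."""
    total = correct = 0
    for sent in tagged_sents:
        for word, tag in sent:
            total += 1
            correct += model.get(word) == tag
    return correct / total


# Hypothetical hand-tagged sentences, for illustration only.
seen = [[('una', 'di0fs0'), ('cerveza', 'ncfs000')]]
unseen = [[('una', 'di0fs0'), ('pizza', 'ncfs000')]]

model = train_lookup(seen)
print(lookup_accuracy(model, seen))    # scored on its own training data
print(lookup_accuracy(model, unseen))  # scored on a sentence it never saw
```

The first score is perfect by construction, while the second drops as soon as an unseen word appears. The scores in the cell above should be read with the same caveat.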


Since the HMM scored best on held-out data (89.57%), we will, a priori, use it to build our order bot.

## Start building the bot

Now we will write a sentence from our bot's domain to see how the tagger handles it.

In [8]:
food_text = 'Quiero unos macarrones con queso y una cerveza'

The first step is to tokenise the sentence. For the taggers we should not remove stopwords, lemmatise or stem. On the contrary, each of those steps would remove information that the taggers use to find the tags.

In [9]:
tokens = nltk.word_tokenize(food_text)
tokens

['Quiero', 'unos', 'macarrones', 'con', 'queso', 'y', 'una', 'cerveza']

Once we have the tokens, we will use the `.tag()` function.

In [10]:
food_tagged = hmm.tag(tokens)
food_tagged

[('Quiero', 'da0mp0'),
 ('unos', 'di0mp0'),
 ('macarrones', 'ncmp000'),
 ('con', 'sps00'),
 ('queso', 'np0000l'),
 ('y', 'cc'),
 ('una', 'di0fs0'),
 ('cerveza', 'ncfs000')]

We can see that the tags obtained are not quite correct. To check them, we can look up the EAGLES tags at the link provided in the syllabus: https://www.cs.upc.edu/~nlp/tools/parole-sp.html.

In summary, we will use these tags, although we could vary them or incorporate others:

- **ncms000** : common noun, masculine singular
- **ncfs000** : common noun, feminine singular
- **ncmp000** : common noun, masculine plural
- **ncfp000** : common noun, feminine plural

- **np0000p/np0000l** : proper noun. We include it because some words in our examples are tagged this way (e.g. tuna sandwich), and some dishes may legitimately follow this structure (e.g. Toni's pizza).


- **di0ms0** : indefinite determiner, masculine singular
- **di0fs0** : indefinite determiner, feminine singular
- **di0mp0** : indefinite determiner, masculine plural
- **di0fp0** : indefinite determiner, feminine plural
- **dn0cp0** : numeral determiner, common gender, plural


- **sps00** : preposition


- **da0ms0**: definite article, masculine singular (*el*)
- **da0fs0**: definite article, feminine singular (*la*)
- **da0mp0**: definite article, masculine plural (*los*)
- **da0fp0**: definite article, feminine plural (*las*)
- **da0ns0**: definite article, neuter singular (*lo*)
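The tags in this list are positional, so a small decoder can turn them into readable descriptions. The sketch below covers only the noun and determiner tags listed above and assumes the position layout described on the EAGLES page linked earlier. `describe_tag` and its lookup tables are hypothetical helpers, not part of NLTK.

```python
# Partial lookup tables covering only the noun and determiner tags listed above.
NOUN_TYPE = {'c': 'common', 'p': 'proper'}
DET_TYPE = {'a': 'article', 'i': 'indefinite determiner', 'n': 'numeral determiner'}
GENDER = {'m': 'masculine', 'f': 'feminine', 'c': 'common gender', 'n': 'neuter'}
NUMBER = {'s': 'singular', 'p': 'plural'}


def describe_tag(tag):
    """Describe a noun or determiner tag; return 'unknown' for anything else."""
    tag = tag.lower()
    if tag.startswith('n') and len(tag) >= 4:
        # Nouns: position 2 = type, 3 = gender, 4 = number.
        parts = [NOUN_TYPE.get(tag[1]), 'noun', GENDER.get(tag[2]), NUMBER.get(tag[3])]
    elif tag.startswith('d') and len(tag) >= 5:
        # Determiners: position 2 = type, 4 = gender, 5 = number.
        parts = [DET_TYPE.get(tag[1]), GENDER.get(tag[3]), NUMBER.get(tag[4])]
    else:
        return 'unknown'
    return ' '.join(part for part in parts if part)


for example in ['ncmp000', 'di0fs0', 'np0000l', 'da0ns0', 'sps00']:
    print(example, '->', describe_tag(example))
```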

Let's review the sentence:

 * ('Quiero', 'da0mp0') -> Tagged as a determiner; it should be vmpip1s0, a verb in the present indicative
 * ('unos', 'di0mp0') -> Tagged as a masculine determiner; it should be mcmp00, a masculine cardinal numeral
 * ('macarrones', 'ncmp000') -> **Correct**: tagged as a common noun, masculine plural
 * ('con', 'sps00') -> **Correct**: tagged as a preposition
 * ('queso', 'np0000l') -> Tagged as a proper noun; it should be ncms000, a common noun, masculine singular
 * ('y', 'cc') -> **Correct**: tagged as a coordinating conjunction
 * ('una', 'di0fs0') -> Tagged as a feminine determiner; it should be mcfp00, a feminine cardinal numeral
 * ('cerveza', 'ncfs000') -> **Correct**: tagged as a common noun, feminine singular
 
 As we have seen, only 4 of the 8 tokens are tagged correctly, a 50% hit rate. What score do we get if we evaluate each tagger, for example the trigram, against these HMM tags? Let's check.

In [11]:
print ('Acierto con unigramas: %.2f %%' % (unigram.evaluate([food_tagged])*100))
print ('Acierto con bigramas:  %.2f %%' % (bigram.evaluate([food_tagged])*100))
print ('Acierto con trigramas: %.2f %%' % (trigram.evaluate([food_tagged])*100))
print ('Acierto con HMMs:      %.2f %%' % (hmm.evaluate([food_tagged])*100))

Acierto con unigramas: 50.00 %
Acierto con bigramas:  50.00 %
Acierto con trigramas: 50.00 %
Acierto con HMMs:      100.00 %


As we can see, the HMM scores 100%, which is expected since it produced the tags we are scoring against. The other taggers agree with the HMM on 50% of the tokens, but that does not mean they agree on the 4 tokens we judged correct, nor that their tags are right. Let's check what each of them actually returns.

In [12]:
print ('Unigramas:', (unigram.tag(tokens)))
print ('Bigramas: ', (bigram.tag(tokens)))
print ('Trigramas: ', (trigram.tag(tokens)))

Unigramas: [('Quiero', None), ('unos', 'di0mp0'), ('macarrones', None), ('con', 'sps00'), ('queso', None), ('y', 'cc'), ('una', 'di0fs0'), ('cerveza', None)]
Bigramas:  [('Quiero', None), ('unos', 'di0mp0'), ('macarrones', None), ('con', 'sps00'), ('queso', None), ('y', 'cc'), ('una', 'di0fs0'), ('cerveza', None)]
Trigramas:  [('Quiero', None), ('unos', 'di0mp0'), ('macarrones', None), ('con', 'sps00'), ('queso', None), ('y', 'cc'), ('una', 'di0fs0'), ('cerveza', None)]
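The 50% figures above measure agreement with the HMM output, not correctness. The sketch below makes the difference explicit by comparing the tag sequences already shown in this section; `agreement` is a hypothetical helper written for this comparison.

```python
def agreement(predicted, reference):
    """Share of positions where two tag sequences carry the same tag."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Tag sequences copied from the outputs and the manual review above.
hmm_tags = ['da0mp0', 'di0mp0', 'ncmp000', 'sps00', 'np0000l', 'cc', 'di0fs0', 'ncfs000']
ngram_tags = [None, 'di0mp0', None, 'sps00', None, 'cc', 'di0fs0', None]
manual_tags = ['vmpip1s0', 'mcmp00', 'ncmp000', 'sps00', 'ncms000', 'cc', 'mcfp00', 'ncfs000']

print(agreement(ngram_tags, hmm_tags))     # n-grams vs. HMM output
print(agreement(hmm_tags, manual_tags))    # HMM vs. manual correction
print(agreement(ngram_tags, manual_tags))  # n-grams vs. manual correction
```

The n-gram taggers match the HMM on half of the tokens, yet against the manual correction they match on only two (`con` and `y`).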


These results confirm that the HMM is the best performer here. Its tags are not entirely correct, but they are close: in some cases it confuses the gender and in others the number, which would not affect our information extraction much. Overall it assigned the right word class to 6 of the 8 tokens.

The n-gram taggers, however, returned no tag at all (`None`) for 4 of the 8 tokens, so we would have had problems using them as taggers.

## Correct the tagger

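Now that we have corrected the tags, it is time to retrain the tagger with the corrected sentence.
From here on we will work with `foodTagger`, an HMM trained only on our food sentences. `train()` is a class method that returns a new tagger built from the data it receives, so `foodTagger` does not retain what `hmm` learned from the corpus.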

In [13]:
corrected_tokens = [('Quiero', 'vmpip1s0'), ('unos', 'mcmp00'), ('macarrones', 'ncmp000'), ('con', 'sps00'), ('queso', 'ncms000'), ('y', 'cc'), ('una', 'mcfp00'), ('cerveza', 'ncfs000')]

foodTagger = hmm.train([corrected_tokens])

Now that the tagger is trained, if we pass it the same sentence again it will get every tag right, because this is exactly the sentence it was trained on.

In [14]:
food_tagged = foodTagger.tag(tokens)
food_tagged

[('Quiero', 'vmpip1s0'),
 ('unos', 'mcmp00'),
 ('macarrones', 'ncmp000'),
 ('con', 'sps00'),
 ('queso', 'ncms000'),
 ('y', 'cc'),
 ('una', 'mcfp00'),
 ('cerveza', 'ncfs000')]

Correcting the tagger is a process we will have to repeat in order to improve the results. It could be greatly shortened if we had a corpus with thousands of tagged sentences, like the initial one, but from our specific domain.
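Because `HiddenMarkovModelTagger.train()` builds a tagger from whatever sentences it is given, one way to make each correction cumulative is to keep the corrected sentences in a list and train on the whole list every time. The sketch below assumes that approach. `corrections` and `retrain_with` are hypothetical names, and the second sentence is a further hand-corrected order used only as an example.

```python
from nltk.tag.hmm import HiddenMarkovModelTagger

# Growing list of hand-corrected sentences (hypothetical helper state).
corrections = []


def retrain_with(sentence):
    """Store a corrected sentence and train a fresh HMM on all of them."""
    corrections.append(sentence)
    # train() is a class method: it builds a new tagger from the data passed in.
    return HiddenMarkovModelTagger.train(corrections)


first = [('Quiero', 'vmpip1s0'), ('unos', 'mcmp00'), ('macarrones', 'ncmp000'),
         ('con', 'sps00'), ('queso', 'ncms000'), ('y', 'cc'),
         ('una', 'mcfp00'), ('cerveza', 'ncfs000')]
second = [('dos', 'mccp00'), ('pizzas', 'ncmp000'), ('cuatro', 'sps00'),
          ('quesos', 'ncms000'), ('y', 'cc'), ('cinco', 'mcfp00'),
          ('fantas', 'ncfs000')]

tagger_v1 = retrain_with(first)
tagger_v2 = retrain_with(second)  # trained on both sentences, not just the last
print(len(corrections))
```

The same list could also be appended to `data_train`, so that the food sentences refine a corpus-trained model instead of replacing it.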

## Develop a function to recognise orders

For this we will use the RegEx parser with chunking rules. In our case, we can use the patterns seen in the theory of this unit:

- common noun : *macarrones*

- common noun + noun (common/proper) : *pizza margarita*

- common noun + preposition + noun (common/proper) : *bocadillo de atún*

- common noun + preposition + article + noun (common/proper) : *lentejas a la riojana*
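The four shapes in this list can be covered by a single chunk rule. The sketch below is a simplified variant of the grammar used in this exercise, applied to hand-assigned tags. The tags on the example words are assumptions for illustration rather than tagger output, and `toy_parser` and `food_chunks` are hypothetical names.

```python
from nltk import RegexpParser

# One rule covering the four shapes above: optional head nouns, an optional
# preposition, an optional article, and a final noun.
toy_parser = RegexpParser(r'''
    comida: {<ncms000|ncfs000|ncmp000|ncfp000>*<sps00>?<da0ms0|da0fs0>?<ncms000|ncfs000|ncmp000|ncfp000|np0000l|np0000p>}
''')

# Hand-assigned tags (assumptions for illustration, not tagger output).
examples = [
    [('macarrones', 'ncmp000')],
    [('pizza', 'ncfs000'), ('margarita', 'ncfs000')],
    [('bocadillo', 'ncms000'), ('de', 'sps00'), ('atún', 'ncms000')],
    [('lentejas', 'ncfp000'), ('a', 'sps00'), ('la', 'da0fs0'), ('riojana', 'ncfs000')],
]


def food_chunks(tagged):
    """Return the text of every 'comida' chunk found in a tagged sentence."""
    tree = toy_parser.parse(tagged)
    return [' '.join(word for word, _ in sub.leaves())
            for sub in tree.subtrees(lambda t: t.label() == 'comida')]


for tagged in examples:
    print(food_chunks(tagged))
```

Each example should come back as a single chunk. A tagger that labels *riojana* as an adjective would break the last one, which is one reason the tags and the grammar have to be tuned together.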

In [15]:
reglas = r'''
    cantidad: {<mccp00>}
    comida: {<ncms000|ncfs000|ncmp000|ncfp000>*<sps00>*<da0ms0|da0fs0|da0mp0|da0fp0|da0ns0>*<ncms000|ncfs000|ncmp000|ncfp000|np0000l|np0000p>}
    cantidad: {<di0ms0|di0fs0|di0mp0|di0fp0|dn0cp0|mcmp00|mcfp00>}
      '''

Now that the grammar is defined, let's create the function that applies it to extract the information.

In [16]:
RegexP = nltk.RegexpParser(reglas)

def parsear(phrase):
    return RegexP.parse(phrase)

In [17]:
frase_regex = parsear(corrected_tokens)
print(frase_regex)

(S
  Quiero/vmpip1s0
  (cantidad unos/mcmp00)
  (comida macarrones/ncmp000 con/sps00 queso/ncms000)
  y/cc
  (cantidad una/mcfp00)
  (comida cerveza/ncfs000))


### Function to extract the classified nodes

Once we have identified the food and the quantity of each item, we will generate the order as a JSON-style list of dictionaries.

In [18]:
def genera_comanda(tree):
    
    result = []
    
    item = {}
    item['item'] = None
    item['cantidad'] = 0
    
    elementos = 0
    
    #First we count how many items there are in the order
    for nodo in tree:
        if type(nodo) == tuple:
            continue
        tipo = nodo.label()
        if tipo == 'comida':
            elementos += 1
            
    #Now we build each order line with its quantity
    for nodo in tree:
        if type(nodo) == tuple:
            continue
        
        count = 0
        valor = ''
        
        for elemento in nodo:
            count += 1
            palabra, categoria = elemento
                
            if count == 1:
                valor = valor + palabra
            else:
                valor = valor + ' ' + palabra
            
            if nodo.label() == 'cantidad':
                item['cantidad'] = valor
            else:
                item['item'] = valor
        
        if nodo.label() == 'comida':
            result.append(item)
            item = {}
            #print(item)
        
    
    return result

In [19]:
genera_comanda(frase_regex)

[{'item': 'macarrones con queso', 'cantidad': 'unos'},
 {'cantidad': 'una', 'item': 'cerveza'}]
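In `genera_comanda` the `item` dictionary is reset to `{}` after each dish, so a dish with no quantity chunk in front of it would come out without a `cantidad` key. A more compact variant that always emits both keys might look like the sketch below. `extract_order` and the default quantity `'1'` are assumptions, not part of the original solution.

```python
from nltk import Tree


def extract_order(tree, default_quantity='1'):
    """Pair each 'comida' chunk with the closest preceding 'cantidad' chunk."""
    order = []
    quantity = default_quantity
    for node in tree:
        if not isinstance(node, Tree):
            continue  # plain (word, tag) tuples are outside any chunk
        text = ' '.join(word for word, _ in node.leaves())
        if node.label() == 'cantidad':
            quantity = text
        elif node.label() == 'comida':
            order.append({'item': text, 'cantidad': quantity})
            quantity = default_quantity  # reset for the next item
    return order


# Hand-built tree: the second item has no quantity chunk in front of it.
sample = Tree('S', [
    Tree('cantidad', [('unos', 'mcmp00')]),
    Tree('comida', [('macarrones', 'ncmp000'), ('con', 'sps00'), ('queso', 'ncms000')]),
    ('y', 'cc'),
    Tree('comida', [('cerveza', 'ncfs000')]),
])
print(extract_order(sample))
```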

Now that the function generates the order from a parsed tree, we will write a new function that does everything from scratch, starting from the raw sentence.

In [20]:
def procesa_frase(frase):
    
    tokens = nltk.word_tokenize(frase)
    print('Tokens:')
    print(tokens)
    print('\n', '-----------------------------------------', '\n')
    tags = foodTagger.tag(tokens)
    print('TAGS:')
    print(tags)
    print('\n', '-----------------------------------------', '\n')
    parsed = parsear(tags)
    print('Parsed:')
    print(parsed)
    print('\n', '-----------------------------------------', '\n')
    
    return genera_comanda(parsed)    
    

And finally, we test the function we have created to parse complete sentences.

In [21]:
fraseTest = 'pedir dos pizzas cuatro quesos y cinco fantas'

procesa_frase(fraseTest)

Tokens:
['pedir', 'dos', 'pizzas', 'cuatro', 'quesos', 'y', 'cinco', 'fantas']

 ----------------------------------------- 

TAGS:
[('pedir', 'vmpip1s0'), ('dos', 'mcmp00'), ('pizzas', 'ncmp000'), ('cuatro', 'sps00'), ('quesos', 'ncms000'), ('y', 'cc'), ('cinco', 'mcfp00'), ('fantas', 'ncfs000')]

 ----------------------------------------- 

Parsed:
(S
  pedir/vmpip1s0
  (cantidad dos/mcmp00)
  (comida pizzas/ncmp000 cuatro/sps00 quesos/ncms000)
  y/cc
  (cantidad cinco/mcfp00)
  (comida fantas/ncfs000))

 ----------------------------------------- 



[{'item': 'pizzas cuatro quesos', 'cantidad': 'dos'},
 {'cantidad': 'cinco', 'item': 'fantas'}]

Although this sentence works, others may not, for example when there is no verb before the order. Let's test it.

In [22]:
fraseTest = 'dos pizzas cuatro quesos y cinco fantas'
procesa_frase(fraseTest)

Tokens:
['dos', 'pizzas', 'cuatro', 'quesos', 'y', 'cinco', 'fantas']

 ----------------------------------------- 

TAGS:
[('dos', 'vmpip1s0'), ('pizzas', 'mcmp00'), ('cuatro', 'ncmp000'), ('quesos', 'sps00'), ('y', 'ncms000'), ('cinco', 'cc'), ('fantas', 'mcfp00')]

 ----------------------------------------- 

Parsed:
(S
  dos/vmpip1s0
  (cantidad pizzas/mcmp00)
  (comida cuatro/ncmp000 quesos/sps00 y/ncms000)
  cinco/cc
  (cantidad fantas/mcfp00))

 ----------------------------------------- 



[{'item': 'cuatro quesos y', 'cantidad': 'pizzas'}]

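Here only one item was identified, and wrongly. Having seen a single training sentence that starts with a verb, the tagger appears to reproduce that sentence's tag sequence position by position. Let's correct the tags and train our tagger on them.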

In [23]:
corrected_order = [('dos', 'mccp00'), ('pizzas', 'ncmp000'), ('cuatro', 'sps00'), ('quesos', 'ncms000'), ('y', 'cc'), ('cinco', 'mcfp00'), ('fantas', 'ncfs000')]

foodTagger = foodTagger.train([corrected_order])

We have now re-trained our tagger, so we can try the same phrase again and check its performance.

In [24]:
procesa_frase(fraseTest)

Tokens:
['dos', 'pizzas', 'cuatro', 'quesos', 'y', 'cinco', 'fantas']

 ----------------------------------------- 

TAGS:
[('dos', 'mccp00'), ('pizzas', 'ncmp000'), ('cuatro', 'sps00'), ('quesos', 'ncms000'), ('y', 'cc'), ('cinco', 'mcfp00'), ('fantas', 'ncfs000')]

 ----------------------------------------- 

Parsed:
(S
  (cantidad dos/mccp00)
  (comida pizzas/ncmp000 cuatro/sps00 quesos/ncms000)
  y/cc
  (cantidad cinco/mcfp00)
  (comida fantas/ncfs000))

 ----------------------------------------- 



[{'item': 'pizzas cuatro quesos', 'cantidad': 'dos'},
 {'cantidad': 'cinco', 'item': 'fantas'}]

In [25]:
procesa_frase('dos hamburguesas')

Tokens:
['dos', 'hamburguesas']

 ----------------------------------------- 

TAGS:
[('dos', 'mccp00'), ('hamburguesas', 'ncmp000')]

 ----------------------------------------- 

Parsed:
(S (cantidad dos/mccp00) (comida hamburguesas/ncmp000))

 ----------------------------------------- 



[{'item': 'hamburguesas', 'cantidad': 'dos'}]
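The restaurant needs the number of units of each item, while the order lines above still carry the quantity as a word. A possible post-processing step, sketched below, maps those words to integers. `QUANTITY_WORDS` is a hypothetical lookup table, and leaving vague quantities such as *unos* as `None` is a design assumption.

```python
# Hypothetical lookup table; extend it with whatever quantities the menu needs.
QUANTITY_WORDS = {
    'un': 1, 'una': 1, 'uno': 1, 'dos': 2, 'tres': 3, 'cuatro': 4,
    'cinco': 5, 'seis': 6, 'siete': 7, 'ocho': 8, 'nueve': 9, 'diez': 10,
}


def to_units(quantity):
    """Convert a quantity string to an int, or None when it is not a number."""
    word = str(quantity).strip().lower()
    if word.isdigit():
        return int(word)
    return QUANTITY_WORDS.get(word)


order = [{'item': 'pizzas cuatro quesos', 'cantidad': 'dos'},
         {'cantidad': 'cinco', 'item': 'fantas'}]
units = [{'item': line['item'], 'units': to_units(line['cantidad'])} for line in order]
print(units)
```

Items whose quantity comes back as `None` could be sent back to the customer for confirmation rather than guessed.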

With this, our order-processing bot is ready. It has room for improvement and needs further training, but it has the required functionality.