# LELA32051 Computational Linguistics Week 8

This week we are going first to take a look at the challenge of machine translation.

We'll look at German-to-English MT. Here is a set of sentences - the s stands for source and the t for target. Hopefully the translations here will be somewhat transparent to you. The only thing that might not be obvious is the use of "ja". This means yes in some context but is also use to mean something like "certainly". So "das haus ist ja gros" could be translated as "the house is certainly big" but because there isn't a perfect match from ja to certainly it tends to just be omitted in English translation as it is here.

In [None]:
s1='klein ist das haus '
t1='the house is small '
s2='das haus ist ja groß '
t2='the house is big '
s3='ja das buch ist klein '
t3='yes the book is small '
s4='das haus '
t4='the house '
s5='ein buch '
t5='a book '

We are going to use the now very familiar re.sub function to perform translation first. 
The g2e function takes German as input and should output English.

Its translation is performaed using a series of re.sub functions.

First let's take a really naive approach.

In [None]:
import re

In [None]:
def g2e(out):
    re.UNICODE
    out=re.sub("klein ","small ",out)
    out=re.sub("ist ","is ",out)
    out=re.sub("das ","the ",out)
    out=re.sub("haus ","house ",out)
    out=re.sub("groß ","big ",out)
    out=re.sub("buch ","book ",out)
    out=re.sub("ein ","a ",out)
    out=re.sub("ja ","yes ",out)
    
    return out


In [None]:
print(g2e(s1) + "\n" + g2e(s2)  + "\n" + g2e(s3)  + "\n" + g2e(s4)  + "\n" + g2e(s5))


That didn't work well. Your job is to change the rules so that the function returns the correct translation.

To make your job easier I have marked the part of speech using the following tags, based on what an automatic part of speech tagger would do (we'll look at these and how they work next week).

ADJ : adjective
AUX : auxiliary verb
ART : article/determiner
N : noun
ADV : adverb

You can make use of the tags by matching them and their associated words like this:

[^ ]+_ART

so if you wrote

re.sub("([^ ]+)_ART",\\1,out)

then it would return an article without its tag.

In [None]:
s1='klein_ADJ ist_AUX das_ART haus_N'
t1='the house is small'
s2='das_ART haus_N ist_AUX ja_ADV groß_ADJ '
t2='the house is big '
s3='ja_ADV das_ART buch_N ist_AUX klein_ADJ'
t3='the book is small '
s4='das_ART haus_N'
t4='the house '
s5='ein_ART buch_N'
t5='a book '

In [None]:
def g2e(out):
    re.UNICODE
    out=re.sub('klein_','small_',out)
    out=re.sub('ist_','is_',out)
    out=re.sub('das_','the_',out)
    out=re.sub('haus_','house_',out)
    out=re.sub('ja_','yes_',out)
    out=re.sub('groß_','big_',out)
    out=re.sub('buch_','book_',out)
    out=re.sub('ein_','a_',out)
    
    out = re.sub("_[^ ]+","",out)
    return out


In [None]:
print(g2e(s1) + "\n" + g2e(s2)  + "\n" + g2e(s3)  + "\n" + g2e(s4)  + "\n" + g2e(s5))

### Another sentence set to explore 

Update the below function to translate these sentence pairs in as few a set of rules as possible

In [None]:
s1="der_ART mann_N hat_AUX fußball_N gespielt_V"
t1="the man played football"
s2="der_ART mann_N spielt_V fußball_N" 
t2="the man plays football"
s3="der_ART mann_N hat_AUX kartoffeln_N gekocht_V"
t3="the man cooked potatoes"
s4="der_ART mann_N kocht_V kartoffeln_N"
t4="the man cooks potatoes"

In [None]:
def g2e(out):
    re.UNICODE
    out=re.sub('der_','the_',out)
    out=re.sub('mann_','man_',out)
    out=re.sub('fußball_','football_',out)
    out=re.sub('spielt_','plays_',out)
    out=re.sub('kocht_','cooks_',out)
    out=re.sub('kartoffeln_','potatoes_',out)
    
    out = re.sub("_[^ ]+","",out)
    return out

In [None]:
print(g2e(s1) + "\n" + g2e(s2)  + "\n" + g2e(s3)  + "\n" + g2e(s4)  + "\n")

And if you are really feeling brave, try accounting for these too:

In [None]:
s5="der_ART mann_N spielt_V gerne_ADV fußball_N" 
t5="the man likes playing football"
s6="der_ART mann_N hat_AUX gerne_ADV fußball_N gespielt_V" 
t6="the man liked to play football"

## Statistical machine translation

We will look next at statistical machine translation. NLTK has some built in tools for this that we can make use of.

To make sure we have latest version of nltk let's install and then restart runtime.

In [None]:
!pip install --user -U nltk

In [179]:
import nltk
nltk.download('punkt')
import math
from nltk import AlignedSent
from nltk import IBMModel1

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Build a translation table

We start by performing alignment and building a translation table

In [231]:
s1='klein ist das haus'
t1='the house is small'
s2='das haus ist ja groß'
t2='the house is big'
s3='das buch ist klein'
t3='the book is small'
s4='das buch'
t4='the book'
s4='das house'
t4='the house'

In [232]:
parallel_corpus = []
parallel_corpus.append(AlignedSent(nltk.word_tokenize(s1),nltk.word_tokenize(t1)))
parallel_corpus.append(AlignedSent(nltk.word_tokenize(s2),nltk.word_tokenize(t2)))
parallel_corpus.append(AlignedSent(nltk.word_tokenize(s3),nltk.word_tokenize(t3)))
parallel_corpus.append(AlignedSent(nltk.word_tokenize(s4),nltk.word_tokenize(t4)))
parallel_corpus.append(AlignedSent(nltk.word_tokenize(s4),nltk.word_tokenize(t5)))

In [233]:
ibm1 = IBMModel1(parallel_corpus, 50)

In [234]:
ibm1.translation_table['haus']

defaultdict(<function nltk.translate.ibm1.IBMModel1.set_uniform_probabilities.<locals>.<lambda>>,
            {None: 3.5568994961617923e-12,
             'big': 1e-12,
             'house': 0.8996368619539824,
             'is': 4.956437338488892e-06,
             'small': 1e-12,
             'the': 5.897175288270353e-06})

You can download and train on a larger aligned corpus by running this code (but beware it will take quite a while):

import nltk <br>
nltk.download('comtrans') <br>
ende=comtrans.aligned_sents('alignment-de-en.txt') <br>
ende_subset = ende[1:100] <br>
ibm3 = IBMModel3(ende_subset, 2) <br>

In [235]:
phrase_table = nltk.translate.PhraseTable()
for triple in ibm1.translation_table.items():
      for i in triple[1].items():
            phrase_table.add((triple[0],),(i[0],),math.log(i[1]))
    

In [221]:
phrase_table.translations_for(('ist',))

[PhraseTableEntry(trg_phrase=('is',), log_prob=-0.6047874182872965),
 PhraseTableEntry(trg_phrase=('the',), log_prob=-0.6543246287565938),
 PhraseTableEntry(trg_phrase=('small',), log_prob=-5.375837807320061),
 PhraseTableEntry(trg_phrase=('yes',), log_prob=-7.96649354206475),
 PhraseTableEntry(trg_phrase=('book',), log_prob=-7.96649354206475),
 PhraseTableEntry(trg_phrase=('house',), log_prob=-9.210799371689173),
 PhraseTableEntry(trg_phrase=(None,), log_prob=-14.296341101592958),
 PhraseTableEntry(trg_phrase=('big',), log_prob=-23.96770700482831)]

### Build a probabilistic language model

In [14]:
#!wget https://www.gutenberg.org/files/2554/2554-0.txt
!wget https://www.gutenberg.org/files/31100/31100.txt    
#f1 = open('2554-0.txt')
f = open('31100.txt',"r",encoding='windows-1252')
text = f.read()  
text = text + "\n" + t1 + "\n" + t2 + "\n" + t3 + "\n" + t4 + "\n" + t5 + "\n"
tokenized_text = [list(map(str.lower, nltk.word_tokenize(sent))) 
                  for sent in nltk.sent_tokenize(text)]

--2021-11-18 11:07:24--  https://www.gutenberg.org/files/31100/31100.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4454075 (4.2M) [text/plain]
Saving to: ‘31100.txt.6’


2021-11-18 11:07:25 (5.11 MB/s) - ‘31100.txt.6’ saved [4454075/4454075]



In [196]:
import nltk.lm.preprocessing
n = 4
train_data, padded_sents = nltk.lm.preprocessing.padded_everygram_pipeline(n, tokenized_text)

In [197]:
from nltk.lm import MLE
model = MLE(n) 

In [198]:
model.fit(train_data, padded_sents)


In [261]:
model.generate(8)

['and', 'hope', ',', 'and', 'leave', 'it', 'to', 'my']

In [203]:
from collections import defaultdict
language_prob = defaultdict(lambda: -999.0)
for t in nltk.ngrams(nltk.word_tokenize(t1 + " " + t2 + " " + t3 + " " + t4 + " " + t5),4):
    language_prob[t] = model.logscore(t[2],[t[0],t[1]])
language_model = type('',(object,),{'probability_change': lambda self, context, phrase: language_prob[phrase], 'probability': lambda self, phrase: language_prob[phrase]})()

In [241]:
language_prob.items()

dict_items([(('the', 'house', 'is', 'small'), -5.751065791211061), (('house', 'is', 'small', 'the'), -3.0), (('is', 'small', 'the', 'house'), 0.0), (('small', 'the', 'house', 'is'), 0.0), (('the', 'house', 'is', 'big'), -5.751065791211061), (('house', 'is', 'big', 'yes'), -2.0), (('is', 'big', 'yes', 'the'), -1.0), (('big', 'yes', 'the', 'book'), 0.0), (('yes', 'the', 'book', 'is'), 0.0), (('the', 'book', 'is', 'small'), -4.643856189774724), (('book', 'is', 'small', 'the'), 0.0), (('das',), -999.0), (('haus',), -999.0), (('ist',), -999.0), (('groß',), -999.0), ((None,), -999.0), (('is',), -999.0), (('the',), -999.0), (('house',), -999.0), (('big',), -999.0), (('small',), -999.0), (('yes',), -999.0), (('book',), -999.0), (('klein',), -999.0)])

### Combine with translation model to perform decoding

In [236]:
stack_decoder = nltk.translate.StackDecoder(phrase_table, language_model)

In [237]:
stack_decoder.distortion_factor = 1
stack_decoder.word_penalty = 0

In [238]:
stack_decoder.translate(nltk.word_tokenize("das haus ist groß"))

['is', 'house', 'big', 'the']

In [240]:
stack_decoder.translate(nltk.word_tokenize("klein ist das haus"))

['the', 'small', 'house', 'is']

### Intent classification again
We are now going to switch away from machine translation and look at the rule-based intent classification problem from your coursework. You should write rules to uniquely and correctly identify each of the following utterances: 

PlayMusic:
play the weather girls

AddToPlaylist:
add this to my italian film soundtrack playlist

RateBook:
give the restaurant guidebook 5 stars

SearchScreeningEvent:
find screenings of the book thief at around 7

BookRestaurant:
book me a table outside for 2 for dinner at the national theatre restaurant

GetWeather:
will it be warm enough to eat dinner outside at around 7 tonight

SearchCreativeWork:
find me songs films or books about restaurants

In [257]:
import random
import re
def assign_intent(utt):
  random.seed(10)
  PlayMusic_Pattern = re.compile("play|music")
  AddToPlaylist_Pattern = re.compile("add|playlist")
  RateBook_Pattern = re.compile("rate|book")
  SearchScreeningEvent_Pattern = re.compile("screening")
  BookRestaurant_Pattern = re.compile("book|restaurant")
  GetWeather_Pattern = re.compile("get|weather")
  SearchCreativeWork_Pattern = re.compile("creative")
 
  intents = ['PlayMusic', 'AddToPlaylist', 'RateBook', 'SearchScreeningEvent', 'BookRestaurant', 'GetWeather', 'SearchCreativeWork']
  selected_intents = []

  if re.search(PlayMusic_Pattern,  utt):
     selected_intents.append("PlayMusic")
  if re.search(AddToPlaylist_Pattern,  utt):
     selected_intents.append("AddToPlaylist")
  if re.search(RateBook_Pattern,  utt):
     selected_intents.append("RateBook")
  if re.search(SearchScreeningEvent_Pattern,  utt):
     selected_intents.append("SearchScreeningEvent")
  if re.search(BookRestaurant_Pattern,  utt):
     selected_intents.append("BookRestaurant")
  if re.search(GetWeather_Pattern,  utt):
     selected_intents.append("GetWeather")
  if re.search(SearchCreativeWork_Pattern,  utt):
     selected_intents.append("SearchCreativeWork")

  if len(selected_intents) > 0:
     return selected_intents
  else:
     return random.choice(intents)

In [258]:
example_inputs = ['play the weather girls','add this to my italian film soundtrack playlist','give the restaurant guidebook 5 stars','find screenings of the book thief at around 7','book me a table outside for 2 for dinner at the national theatre restaurant','will it be warm enough to eat dinner outside at around 7 tonight','find me songs films or books about restaurants']
[print(str(assign_intent(i)) + " : " + i) for i in example_inputs]

['PlayMusic', 'GetWeather'] : play the weather girls
['PlayMusic', 'AddToPlaylist'] : add this to my italian film soundtrack playlist
['RateBook', 'BookRestaurant'] : give the restaurant guidebook 5 stars
['RateBook', 'SearchScreeningEvent', 'BookRestaurant'] : find screenings of the book thief at around 7
['RateBook', 'BookRestaurant'] : book me a table outside for 2 for dinner at the national theatre restaurant
BookRestaurant : will it be warm enough to eat dinner outside at around 7 tonight
['RateBook', 'BookRestaurant'] : find me songs films or books about restaurants


[None, None, None, None, None, None, None]