# 1. Introduction to Machine Translation


We have a lot of sequence-to-sequence tasks in NLP, and one obvious example is machine translation: you have a sequence of words in one language as input, and you want to produce a sequence of words in another language as output. You can think of other examples too. Summarization, for instance, is also a sequence-to-sequence task, and you can think of it as monolingual machine translation. We will cover these examples at the end of this topic, but for now let us start with statistical machine translation and neural machine translation. We will see that some techniques are very similar in both approaches. For example, statistical machine translation needs **word alignments**, and neural networks have the **attention mechanism**, which plays a similar role.


Okay, let us begin. I think there is no need to tell you that machine translation is important; we all know that. So I would rather start with two other questions, which we skip a lot in our course and in some other courses, but which are very important to talk about.

One question is **data** and the other is **evaluation**. When you get a real NLP task in your life, it is usually not the model that is the pain; it is usually the data and the evaluation. You can have a fancy neural architecture, but if you do not have good data and you have not settled on an evaluation procedure, you are not going to get good results.

![mt_1.png](pics/mt_1.png)

First, data. What kind of data do we need for machine translation? We need parallel corpora: text in one language together with its translation into another language. Where does that come from? What sources can you think of? One source, maybe not so obvious but very good, is the **European Parliament proceedings**. There you have texts in several languages, maybe 20 languages, with very exact translations of one and the same statements. Another domain is **movies**: subtitles are translated into many languages, which is nice.

Something less useful, but still useful, is book translations or Wikipedia articles. For Wikipedia, for example, you cannot guarantee that you have the same text in two languages. But you can have something similar, for example loose translations, or at least texts on the same topic. We call such corpora comparable rather than parallel.

The OPUS website has a nice overview of many sources, so please check it out. But I want to discuss something which is not so nice: some problems with the data.

Actually, any data we have comes with lots of problems. What kinds of problems arise for machine translation?

First, the data usually comes from some specific domain. Imagine you have movie subtitles and you want to train a system for translating scientific papers. It is not going to work, right? You need data from a close domain, or you need to know how to transfer your knowledge from one domain to another; this is something to think about. Next, you can have a decent amount of data for some language pairs, like English and French or English and German, but for rare language pairs you probably have very little data, and that is a huge problem. Also, the data can be noisy, and it can be poorly aligned.

By alignment I mean that **you need to know the correspondence between the sentences, or even better, the correspondence between the words in the sentences**. This is a luxury: usually you do not have that, at least not for a huge amount of data. Okay, I think the data part is clear now, so on to the second thing, evaluation.
![mt_2.png](pics/mt_2.png)

You could say: we have some parallel data, so why don't we just split it into train and test, and use the test set to compare the correct translations with those produced by our system?

But how do we know that a translation is wrong just because it does not occur in the reference?

Language is so variable that every translator would produce a somewhat different translation. So if your system produces something different, that does not yet mean it is wrong. There is no nice answer to this question; it really is a problem. One thing you can do is use multiple references: you can have, say, five references and compare your system output to all of them. The other thing is to be very careful about how you compare. You definitely should not use just exact match; you should do something more intelligent. I am going to show you the BLEU score, a very popular measure in machine translation that tries to softly measure whether your system output is similar to the reference translation.

Okay, let me show you an example. You have a reference translation and the output of your system, and you try to compare them. You remember that we have this nice tool called n-grams, so you can compute unigrams, bigrams, and trigrams.

Do you have any idea how to use that here?

Well, first we can try to compute some precision. What does that mean? **You look at your system output, which here has six words, that is, six unigrams, and count how many of them actually occur in the reference**. So the unigram precision score is 4 out of 6. Now, tell me what the bigram score would be here. The bigram score is 3 out of 5, because you have 5 bigrams in your system output and only 3 of them ("was sent", "sent on", and "on Tuesday") occur in the reference. You can proceed and compute the 3-gram and 4-gram scores in the same way. Maybe we can just average them and get some measure. We could, but there is one problem. Imagine that the system tries to be super precise. Then it pays off for the system to output super short sentences, right?
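The precision counts above can be reproduced with a few lines of Python. This is only a sketch: the two sentences are assumed for illustration (chosen to be consistent with the counts quoted in the text, not copied from the slide), and `ngram_matches` is a hypothetical helper name. It also clips repeated n-grams to their count in the reference, which the lecture does not discuss but which is common practice.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_matches(output, reference, n):
    """Return (matched, total) n-gram counts for the system output.

    An n-gram is credited at most as many times as it occurs
    in the reference (clipped counts).
    """
    out_counts = Counter(ngrams(output, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    total = sum(out_counts.values())
    return matched, total


# Assumed example sentences, for illustration only.
reference = "e-mail was sent on tuesday".split()
output = "the letter was sent on tuesday".split()

for n in range(1, 5):
    matched, total = ngram_matches(output, reference, n)
    print(f"{n}-gram precision: {matched}/{total}")
```

With these assumed sentences the unigram and bigram lines give 4/6 and 3/5, matching the numbers in the text.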

If I am sure that some unigram should occur, I will output just that and nothing more. So, to penalize the model for this, we introduce a **brevity penalty**. The brevity penalty **divides the length of the output by the length of the reference**, so if the system outputs sentences that are too short, we will get to know that. Now, how do we compute the BLEU score out of these values?

Like this: the fourth root of the product is the geometric mean of our unigram, bigram, 3-gram, and 4-gram scores, and we multiply this mean by the brevity penalty. Okay, now let us talk about how the system actually works. This is kind of a mandatory slide on machine translation, because almost any tutorial on machine translation has it. So I decided not to be an exception and show it to you.

![mt_3.png](pics/mt_3.png)

The idea is this: we have a source sentence, and we want to translate it to get a target sentence. The first thing we can do is direct transfer: we translate the source sentence word by word and get the target sentence.

But maybe that is not so good, right? If you have ever studied a foreign language, you know that translating every word with a dictionary usually does not give you a nice, coherent translation. So we had better go to the syntactic level.

We do syntactic analysis, then we do the transfer, and then we generate the target sentence knowing what it should look like on the syntactic level.

Even better, we could try to go to the semantic level: first we analyze the source sentence and understand the meanings of its parts, then we somehow transfer these meanings into another language, and then we generate good syntactic structures with the right meaning.

And our dream, the best thing we could ever think of, would be to have an interlingua. By interlingua we mean some nice representation of the whole source sentence that is enough to generate the whole target sentence.

Actually, it is still a dream. It is still the translators' dream to have that kind of system, because it sounds so appealing. But neural translation systems have mechanisms that somewhat resemble it, and I will show you that in a couple of slides.
![mt_4.png](pics/mt_4.png)
Okay, so for now I want to show you some brief history of the area.

Like any other area, machine translation has had its bright and dark periods. In 1954 there were great expectations: in the IBM experiments, 60 sentences were translated from Russian to English, and people said, that's easy, we can solve the machine translation task completely in just three or five years. So they worked on it a lot, and after many years they concluded that it is actually not that easy. They said, well, machine translation is too expensive, and we should not build fully automatic machine translation systems. We should rather focus on tools that help human translators produce good-quality translations.
![mt_5.png](pics/mt_5.png)

These great expectations and the subsequent disappointment made the area silent for a while, but then in 1988 IBM researchers proposed word-based machine translation systems. These systems were rather simple, and we will cover them in this chapter and the next, but they were kind of the first working machine translation systems. The next important step was phrase-based machine translation systems, proposed by Philipp Koehn in 2003. This is probably what people mean by statistical machine translation now. You definitely know Google Translate, right? But maybe you have not heard about Moses. Moses is a system that allows researchers to build their own machine translation systems: you can train your models and compare them. It is a very nice tool for researchers, and it was made available in 2007.

The next, obviously very important, step is neural machine translation. It is amazing how fast neural machine translation systems went from research papers to production. Usually there is a big gap between these two things, but in this case it took just two or three years: the ideas that were proposed were implemented and launched in many companies in 2016, so we have neural machine translation now. You might be wondering what WMT is there. It is the Workshop on Machine Translation, an annual event with shared tasks, kind of an annual competition. This means you can compare your systems there, and it is a very nice venue for comparing systems from different researchers and companies and for seeing what the trends in machine translation are. It happens every year, so people who do research in this area usually keep an eye on it.

![mt_6.png](pics/mt_6.png)

This is the slide about interlingua that I promised to show you. This is how Google's neural machine translation works; there was actually a lot of hype around it, maybe even too much. Still, the idea is that you train a system on some pairs of languages. For example, on English to Japanese, Japanese to English, English to Korean, and some other pairs, you train an encoder-decoder architecture. This means you have an encoder that encodes your sentence into some hidden representation, and a decoder that takes that hidden representation and decodes it into the target sentence. Now, the nice thing is that if you take your encoder for, let's say, Japanese and your decoder for Korean and just put them together, it somehow works nicely, even though the system has never seen Japanese-to-Korean translations. This is zero-shot translation: you have never seen Japanese to Korean, but just by building a nice encoder and a nice decoder, you can stack them and get this path. So it seems that this hidden representation is kind of universal across language pairs. Well, that is not completely true, but it is at least a very promising result.



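```python
# n-gram precisions, entered by hand
one_gram = 6/8
two_gram = 4/7
three_gram = 2/6
four_gram = 1/5

# Geometric mean of the four precisions (no brevity penalty applied here)
bleu = (one_gram*two_gram*three_gram*four_gram)**(1/4)
print(bleu)  # 0.4111336169005197
```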
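Returning to the BLEU score: the n-gram precisions can also be computed from the sentences themselves and combined with the brevity penalty described earlier. The sketch below is a simplified, single-reference version under stated assumptions: the sentences are invented for illustration, `simple_bleu` is a hypothetical name, and the brevity penalty is taken as the length ratio capped at 1 so that long outputs are not rewarded. Standard BLEU implementations use an exponential penalty instead, so their numbers will differ.

```python
from collections import Counter
from math import prod


def clipped_precision(output, reference, n):
    """Share of output n-grams found in the reference (clipped counts)."""
    out = Counter(tuple(output[i:i + n]) for i in range(len(output) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(out.values())
    if total == 0:  # output shorter than n tokens
        return 0.0
    return sum(min(c, ref[g]) for g, c in out.items()) / total


def simple_bleu(output, reference, max_n=4):
    """Geometric mean of 1..max_n-gram precisions times a brevity penalty."""
    precisions = [clipped_precision(output, reference, n)
                  for n in range(1, max_n + 1)]
    geometric_mean = prod(precisions) ** (1 / max_n)
    # Assumption: length ratio capped at 1 (simplified brevity penalty).
    brevity = min(1.0, len(output) / len(reference))
    return brevity * geometric_mean


# Assumed example sentences, for illustration only.
reference = "e-mail was sent on tuesday".split()
output = "the letter was sent on tuesday".split()
short_output = "was sent on tuesday".split()

print(simple_bleu(output, reference))
print(simple_bleu(short_output, reference))
```

The second call shows why the penalty matters: every n-gram of the short output occurs in the reference, so all four precisions equal 1, and only the length ratio 4/5 keeps the score below a perfect one.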


# 2. Noisy channel: said in English, received in French

In this chapter, we will cover one main idea of statistical machine translation.

![mt_7.png](pics/mt_7.png)

Imagine you have a sentence in French, or in some other foreign language, and you want its translation into English. How do you do this? You can try to compute the probability of an English sentence given your French sentence, and then take the English sentence that maximizes this probability. Sounds very intuitive, right? Now, let us apply Bayes' rule here. Instead of computing the probability of E given F, we compute the probability of F given E, multiply it by the probability of the English sentence, and normalize by a denominator.

Can we simplify this formula further? Actually, we can. The denominator does not depend on the English sentence, which means we can just get rid of it. Now we have this formula, and the question is: why is it easier? Why do we like it more than the original formula?
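Written out, the derivation just described looks as follows. This is a sketch in notation chosen here, with $e$ for the English sentence and $f$ for the French one; the slide may use different symbols.

$$
\begin{aligned}
e^{*} &= \arg\max_{e}\, p(e \mid f) \\
      &= \arg\max_{e}\, \frac{p(f \mid e)\, p(e)}{p(f)} \\
      &= \arg\max_{e}\, p(f \mid e)\, p(e)
\end{aligned}
$$

The last step drops $p(f)$ because it is the same for every candidate $e$ and so does not change which $e$ attains the maximum.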

![mt_7.png](pics/mt_8.png)

This slide explains why. We now have two models: we have decoupled our complicated problem into two simpler ones. One is language modeling, and you actually know a lot about it: it is about producing a meaningful probability for a sequence of words.

The other is the translation model. This model does not care about the coherence of sentences. It only cares about whether E is a good translation of F, so that you do not end up with something unrelated to your source sentence.

So you have two models, one about the **language** and one about the **adequacy of the translation**. And then you have the argmax, which performs the search in your space and finds the English sentence that gives the best probability.
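To make the interplay of the two models concrete, here is a toy sketch of the argmax over a handful of candidates. Everything in it is hypothetical: the candidate sentences, the probabilities, and the `decode` helper are invented for illustration, and a real system searches a vastly larger space rather than enumerating candidates.

```python
import math

# Hypothetical scores for candidate translations e of one French sentence f.
# language_model[e] stands in for p(e); translation_model[e] for p(f | e).
language_model = {
    "the house is small": 0.020,
    "small the is house": 0.0001,
    "the home is little": 0.005,
}
translation_model = {
    "the house is small": 0.30,
    "small the is house": 0.30,
    "the home is little": 0.10,
}


def decode(candidates, lm, tm):
    """Return the candidate e that maximizes log p(f | e) + log p(e)."""
    return max(candidates, key=lambda e: math.log(tm[e]) + math.log(lm[e]))


best = decode(list(language_model), language_model, translation_model)
print(best)
```

In this toy setup the scrambled candidate has the same translation score as the fluent one, since it contains the same words; it is the language model that rules it out.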

![mt_9.png](pics/mt_9.png)

Now, I have one more interpretation for you. The noisy channel is a super popular idea, so you definitely need to know about it, and it is actually super simple. You have your source sentence, with some probability of this source sentence, and it goes through the noisy channel. The noisy channel is represented by the conditional probability of what you get as output given the channel's input. As the output, you obtain your French sentence.

So, let's say that your source sentence was corrupted by the channel, and now you have received it in French. The rest of the chapter is about how to model these two probabilities: the probability of the **sentence** and the probability of the **translation given some sentence**.

![mt_10.png](pics/mt_10.png)

Okay. First, the language model. You know a lot about it, since we covered it in week two, so I have just one slide as a recap. We need to compute the probability of a sequence of words. We apply the chain rule and factorize it into the probabilities of the next word given the previous history. You can use the Markov assumption and end up with n-gram language models, or you can use neural language models, such as an LSTM, which produce the next word given the previous words.
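As a reminder of what the n-gram case looks like in practice, here is a minimal bigram language model. It is a sketch under simplifying assumptions: the three-sentence corpus is made up, estimates are plain maximum likelihood without smoothing, and the function names are hypothetical.

```python
from collections import Counter

# Tiny hypothetical corpus with sentence boundary markers.
corpus = [
    "<s> the cat sat </s>",
    "<s> the dog sat </s>",
    "<s> the cat ran </s>",
]

context_counts = Counter()
bigram_counts = Counter()
for line in corpus:
    tokens = line.split()
    context_counts.update(tokens[:-1])             # every token that has a successor
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs


def bigram_prob(prev, word):
    """Maximum-likelihood estimate of p(word | prev), no smoothing.

    Assumes prev was seen as a context in the corpus.
    """
    return bigram_counts[(prev, word)] / context_counts[prev]


def sentence_prob(sentence):
    """Chain rule with a first-order Markov assumption."""
    tokens = sentence.split()
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p


print(sentence_prob("<s> the cat sat </s>"))
print(sentence_prob("<s> the dog ran </s>"))
```

The second sentence gets probability zero because the bigram "dog ran" never occurs in the toy corpus, which is the usual motivation for smoothing.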

![mt_11.png](pics/mt_11.png)

Now, the translation model. This is not so easy. Imagine you have a sequence of words in one language, and you need to produce the probability of a sequence of words in some other language. For example, take a foreign language like Russian, and English, and these two sentences. How do you produce these probabilities? It is not obvious to me. So let us start at the word level: we can understand something at the level of separate words in these sentences.

Okay. What can we do? We can have a translation table.

![mt_12.png](pics/mt_12.png)

Here I have the probabilities of Russian words given English words. They are normalized: each row in this matrix sums to one. These are just translations that I learn, look up in a dictionary, or build somehow. Okay, that is doable.
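One simple way to hold such a table in code is a nested dictionary with one row per English word. The entries below are hypothetical (transliterated Russian words with made-up probabilities, not the values from the slide), and `t` is just an illustrative lookup helper.

```python
import math

# Hypothetical translation table: translation_table[e][f] = p(f | e).
# Russian words are transliterated; the numbers are made up.
translation_table = {
    "cat": {"kot": 0.6, "koshka": 0.4},
    "dog": {"sobaka": 0.8, "pyos": 0.2},
    "house": {"dom": 1.0},
}


def t(f_word, e_word):
    """Look up p(f_word | e_word); unseen pairs get probability 0."""
    return translation_table.get(e_word, {}).get(f_word, 0.0)


# Each row is a distribution over Russian words, so it must sum to one.
for e_word, row in translation_table.items():
    assert math.isclose(sum(row.values()), 1.0), e_word

print(t("koshka", "cat"))
```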

![mt_13.png](pics/mt_13.png)

Now, how do I build the probability of the whole sentence from these separate probabilities? We need word alignments. The problem is that there can be reorderings between the languages, like here, or even worse, there can be one-to-many or many-to-one correspondences.
![mt_14.png](pics/mt_14.png)

For example, the word "appetit" here corresponds to "the appetite". And the word "with" here corresponds to two Russian words [FOREIGN]. This means that we need some model to build those alignments. Another example is words that appear or disappear: articles or auxiliary words can occur in one language and then just vanish in another. This is why we need word alignment models, and they are the topic of the next chapter.

# 3. Word Alignment Models

Word alignment models. This is a super important subtask in machine translation, because different languages have different word order, and we need to learn that from data. So we need to build word alignment models, and this chapter is exactly about them. Let us be a bit more formal and look at the notation for every object we have here.

![mt_15.png](pics/mt_15.png)

So, $e$ is the sentence $e_1, e_2$, and so on, and $f$ is another sentence. The length of the sentence $e$ is $I$, and the length of the sentence $f$ is $J$. Now, I need to learn the alignments between them, which I denote by $a$. Importantly, you will see that I call $e$ the **source** and $f$ the **target**. Why do I say so?

Usually, we talk about machine translation from French to English, or from a foreign language to English. Why do I now say it is the other way around?

![mt_15.png](pics/mt_16.png)

This is because we applied Bayes' rule.
![mt_17.png](pics/mt_17.png)

So, if you remember, we did this to have our decoupled model about language and about translation. And now, to build the system that translates from f to e, we need to model the probability of f given e. Now, what about word alignments? How do we represent them? 

![mt_18.png](pics/mt_18.png)

The matrix of word alignments is one nice way to do this. You have one sentence and another sentence, and zeros or ones in the matrix, so you know which words correspond to each other. Now, how many such matrices are there? Actually, a huge number. You have two options for every element of the matrix, and the size of the matrix is $I$ multiplied by $J$, so the number of possible matrices is two to the power of $I \cdot J$, and that is a lot.

So let us impose some constraints, some simplifications, to deal with this. We say that every target word is allowed to be aligned to only one source word, okay? Like here. This is a valid example. Now, what would the notation be here?

![mt_18.png](pics/mt_19.png)

We will have $a_1$, which represents the position of the source word aligned to the first target word. This is appetite, and it is the second word, so $a_1$ equals two. Now, what is $a_2$? $a_2$ equals three, because we have comes matched to [inaudible], which is the third word in our source. We can proceed and do the same for the remaining positions up to $a_6$. That's it. This is our notation; please keep it in mind so as not to get lost.
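The alignment-vector notation and the effect of the one-source-word constraint can be sketched in a few lines. The sentence lengths and the alignment vector below are hypothetical (only the first two entries, 2 and 3, follow the values mentioned in the text), and `alignment_matrix` is an illustrative helper name.

```python
def alignment_matrix(a, source_len):
    """Convert an alignment vector into a J x I matrix of zeros and ones.

    a[j - 1] is the 1-based source position aligned to target word j,
    so every row of the matrix contains exactly one 1.
    """
    matrix = [[0] * source_len for _ in a]
    for row, i in enumerate(a):
        matrix[row][i - 1] = 1
    return matrix


I, J = 4, 6              # hypothetical source and target lengths
a = [2, 3, 1, 4, 4, 3]   # hypothetical alignment vector a_1..a_J

for row in alignment_matrix(a, I):
    print(row)

# Arbitrary 0/1 matrices versus one source word per target word.
unconstrained = 2 ** (I * J)
constrained = I ** J
print(unconstrained, constrained)
```

Even for these small lengths the constraint shrinks the space from $2^{24}$ matrices to $4^6$ alignment vectors.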

![mt_20.png](pics/mt_21.png)

Now, let us build the probabilistic model. Actually, this and the next slide sketch the whole process: we are going to build the probabilistic model and figure out how we learn it. After that, we will go into the details of this probabilistic model. So, stay with me. We have our sentences, $e$ and $f$; these are our observable variables. We also have word alignments. We do not see them, but we need to model them somehow, so these are hidden variables. And we have the parameters of the model, and this is actually the most creative step: we need to decide how to parameterize our model so that it has a meaningful generative story. **If we have too many parameters, the model will probably be difficult to train. If we have too few parameters, it will probably not be general enough to describe all the data**. We will discuss this point in more detail later.

![mt_22.png](pics/mt_22.png)

But for now, let's just say that we have some probabilistic model of $f$ and $a$ given $e$ and $\Theta$. What do we do next? In all such situations, we do **likelihood maximization**: we take our data, write down the probability of seeing our data, and try to maximize it. **Now, one complication is that we do not see everything that we need to model**. We can model the probabilities of $f$ and $a$, but we do not see $a$. That is why we need to sum over all possible word alignments. On the left-hand side, you then have the probability of $f$ given all the rest, which is the likelihood of the **incomplete data**.
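In symbols, the marginalization over alignments and the resulting training objective can be sketched as follows. The notation is assumed here to match the text ($f$, $a$, $e$, $\Theta$); the sum in the objective runs over the sentence pairs of the training corpus.

$$
p(f \mid e, \Theta) = \sum_{a} p(f, a \mid e, \Theta),
\qquad
\hat{\Theta} = \arg\max_{\Theta} \sum_{(e,\, f)} \log \sum_{a} p(f, a \mid e, \Theta)
$$

The inner sum sits inside the logarithm, which is what makes direct maximization awkward.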

Likelihood maximization for incomplete data means that there are hidden variables that you do not see, and this is a bad situation. When you take the logarithm, you get the logarithm of a sum, and you do not know how to maximize it: how to take derivatives and get your maximum likelihood estimates. But we have actually already seen this case twice in our course: once with the **Hidden Markov Model** and once with **topic models**. In both cases we had hidden variables and incomplete data, and in both cases we used the **EM algorithm**. Just to recap, the EM algorithm is an iterative process with an E-step and an M-step. **The E-step is about estimating your hidden variables**: what are the best alignments we can produce right now, given our parameters? The **M-step** is the reverse: given our best guess for the word alignments, **what parameter updates maximize our likelihood**? It is not so interesting to go into the exact formulas of the EM algorithm here.

Instead, let us discuss the generative model, because that is the really creative part.

![mt_23.png](pics/mt_23.png)
Let us start with generating the length of the sentence: $J$ is the length of the target sentence. Once we have generated it, let us make an independence assumption over the target words. So we have this product over $j$, which indexes the words in our target sentence. We do not model each word directly. First, we model the alignment for every position, and then we model the exact word given that alignment. If you are scared of this formula, just look at the green parts; they are the most important thing. You model alignments, and you model words given these alignments. Everything else you see on the right is just everything we know and could condition on. And that is too much to condition on, because we would have far too many parameters. So we need to make some assumptions: we need to say that not all of those things are important in these conditions.
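For readers without the slide, the generative story just described can be sketched as the following factorization. The exact conditioning sets are an assumption here (this is the form commonly used for alignment models, with $a_1^{j-1}$ and $f_1^{j-1}$ denoting the alignments and words generated so far); the slide may write them slightly differently.

$$
p(f, a \mid e, \Theta) = p(J \mid e) \prod_{j=1}^{J}
\underbrace{p(a_j \mid a_1^{j-1}, f_1^{j-1}, J, e)}_{\text{alignment}}\;
\underbrace{p(f_j \mid a_1^{j}, f_1^{j-1}, J, e)}_{\text{word given alignment}}
$$

The two braced factors correspond to the two things being modeled: the alignment for each position, and the word given that alignment.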

![mt_23.png](pics/mt_24.png)

The first IBM model is the first attempt to simplify this generative story. It says: let us forget about the priors for word alignments and just have a uniform prior. This prior knows nothing about positions; it has just one constant to tune. The translation table is also very simple: the probability of generating a word, given that it is aligned to some source word, is just the translation probability of that word given the source word.

So, what does that look like?
![mt_23.png](pics/mt_25.png)

This is the translation table. Once we know that a word is aligned to some source word, we just take this probability out of it. So this is a very simple model, but it has a very big drawback.

It does not take the position of your word into account when producing the alignment for this word.
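Because the alignment prior in the first IBM model is uniform, the EM procedure recapped earlier becomes very short for it. The sketch below is a minimal version under stated assumptions: the three-sentence corpus is made up, there is no NULL source word, and `train_model1` is a hypothetical name, so treat it as an illustration of the E-step and M-step rather than a faithful reimplementation of the original model.

```python
from collections import defaultdict

# Hypothetical toy corpus: (source sentence e, target sentence f) pairs.
corpus = [
    ("the house".split(), "la maison".split()),
    ("the book".split(), "le livre".split()),
    ("a house".split(), "une maison".split()),
]


def train_model1(corpus, iterations=10):
    """Estimate t(f | e) with EM under a uniform alignment prior."""
    f_vocab = {f for _, fs in corpus for f in fs}
    # Uniform initialization over all co-occurring word pairs.
    t = {(f, e): 1.0 / len(f_vocab)
         for es, fs in corpus for f in fs for e in es}
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (f, e) links
        total = defaultdict(float)  # expected number of links from e
        # E-step: posterior over which source word generated each f.
        for es, fs in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    posterior = t[(f, e)] / norm
                    count[(f, e)] += posterior
                    total[e] += posterior
        # M-step: turn expected counts back into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


t = train_model1(corpus)
for f in ("la", "maison", "une"):
    print(f, round(t[(f, "house")], 3))
```

On this toy corpus "maison" ends up as the most probable translation of "house", since it is the only target word that co-occurs with "house" in both sentence pairs.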

![mt_23.png](pics/mt_26.png)

So, the second IBM model tries to do better. It says, "Okay, let us take $j$, the position of the target word, and use it to produce $a_j$, the alignment for this target word."

![mt_23.png](pics/mt_27.png)

Well, that is actually a lot of parameters.

![mt_23.png](pics/mt_28.png)

You have $I$ multiplied by $J$, which is the size of the matrix of probabilities, and that is not all. Besides this matrix, you also have different matrices for different $I$ and $J$. You cannot use one and the same matrix for all kinds of sentences: you share a matrix across all sentences with the given lengths, but for sentences of different lengths you have a different matrix.

So, this is a lot of parameters, and to improve on that we can say, "Well, the matrix is usually very close to diagonal. What if we model it as a diagonal matrix?" This is what Chris Dyer proposed in 2013.

![mt_23.png](pics/mt_29.png)

This model has only one parameter, which says how the probability mass is spread around the diagonal. This is nice because it still has some priors about positions, but it does not have too many parameters.
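One way such a single-parameter diagonal prior can look is sketched below. The functional form is an assumption made for illustration: the score of aligning target position $j$ to source position $i$ decays exponentially with the distance $|i/I - j/J|$ from the diagonal, with one parameter `lam` controlling how sharply. NULL alignments are left out, and `diagonal_prior` is a hypothetical name.

```python
import math


def diagonal_prior(j, I, J, lam):
    """Distribution over source positions i = 1..I for target position j.

    Positions close to the diagonal (i / I near j / J) get more mass;
    lam = 0 gives the uniform prior, larger lam concentrates the mass.
    """
    scores = [math.exp(-lam * abs(i / I - j / J)) for i in range(1, I + 1)]
    z = sum(scores)
    return [s / z for s in scores]


peaked = diagonal_prior(j=3, I=6, J=6, lam=4.0)
uniform = diagonal_prior(j=3, I=6, J=6, lam=0.0)
print([round(p, 3) for p in peaked])
print([round(p, 3) for p in uniform])
```

The same function works for any pair of lengths, which is the point: no separate matrix per $(I, J)$ is needed.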

![mt_23.png](pics/mt_30.png)

Now, I have one last example of alignment models for you. Actually, you already know how to build this; you just do not remember it. We had Hidden Markov Models in our course, and Hidden Markov Models can help us build transition probabilities. Why do we need them here? Imagine this pair of sentences, where a phrase at the beginning of one sentence can be aligned to a phrase at the end of the other.

But inside this phrase, you often just go word by word, without any gaps. This is nice. It means you need to learn and use this information: the previous word was aligned to position five, so maybe the next word should be aligned to position six. This is what a Hidden Markov Model can do for you: you model the probability of the next alignment given the previous alignment. Now, let us sum up what we have in this chapter.

![mt_23.png](pics/mt_31.png)


So, we have covered IBM models, a nice word-based technique for building machine translation systems. There are actually lots of problems with them that we did not cover. IBM Models 3, 4, and 5 try to cope with, for example, the problem of fertility, saying that we need to explicitly model how many output words are produced by each source word, or that we need to explicitly deal with spurious words. These are words that just appear from nowhere in the target language.

## Quiz

![mt_quiz_1.png](pics/mt_quiz_1.png)
![mt_quiz_1.png](pics/mt_quiz_2.png)
![mt_quiz_1.png](pics/mt_quiz_3.png)
![mt_quiz_1.png](pics/mt_quiz_4.png)