# 1. Encoder-decoder architecture

We are going to discuss a very important technique in neural networks: the encoder-decoder architecture and the attention mechanism. We will cover them using the example of neural machine translation, because they were originally proposed mostly for machine translation.

But now they are applied to many, many other tasks. For example, you can think about summarization or simplification of texts, or sequence-to-sequence chatbots, and many others. Now let us start with the general idea of the architecture.

![eda_1.png](pics/eda_1.png)

We have some sequence as the input, and we want to get some sequence as the output. For example, these could be two sequences in different languages, right? We have our encoder, and the task of the encoder is to build some hidden representation of the input sentence. So we get this green hidden vector that tries to **encode the whole meaning of the input sentence**. Sometimes this vector is also called the **thought vector**, because it encodes the thought of the sentence. The decoder's task is to decode this thought vector, or context vector, into some output representation, for example, the sequence of words in the other language.

Now, what types of encoders could we have here? Well, the most obvious type would be recurrent neural networks, but actually this is not the only option. Be aware that we also have convolutional neural networks, which can be very fast and nice, and they can also encode the meaning of the sentence. We could also have some hierarchical structures. For example, recursive neural networks try to use the syntax of the language and build the representation hierarchically, from the bottom to the top, and understand the sentence that way.

![eda_2.png](pics/eda_2.png)

Okay, now what is the first example of a sequence-to-sequence architecture? This is the model that was proposed in 2014, and it is rather simple. We have some LSTM module or RNN module that encodes our input sentence, and then we have an end-of-sentence token at some point. At this point, we understand that our state is our thought vector or context vector, and we need to decode starting from this moment. The decoding is conditional language modeling. You are already familiar with language modeling with neural networks, but now it is conditioned on this context vector, the green vector.

Okay, as in any other language model, you usually feed the output of the previous state as the input to the next state, and generate the next words one by one.

Now, let us go deeper and stack several layers of our LSTM model. You can do this straightforwardly like this.

![eda_3.png](pics/eda_3.png)

So let us move forward and speak about a slightly different variant of the same architecture.

![eda_4.png](pics/eda_4.png)

One problem with the previous architectures is that the green context vector can be forgotten. If you only feed it as the input to the first state of the decoder, then you are likely to forget about it by the time you come to the end of your output sentence. So it would be better to feed it at every moment. And this architecture does exactly that: it says that every state of the decoder should have three kinds of arrows that go into it:

1. the arrow from the previous state,
2. the arrow from the context vector,
3. the current input, which is the output of the previous state.

Okay, now let us go into more details with the formulas. 

![eda_5.png](pics/eda_5.png)

So your sequence modeling task is conditional, because you need to produce the probability of one sequence given another sequence, and you factorize it using the chain rule.

Also, importantly, you see that the $x$ variables are not needed anymore, because you have encoded them into the $v$ vector. The $v$ vector is obtained as the last hidden state of the encoder, and the encoder is just a recurrent neural network.
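Written out, the factorization and the encoder described above might look as follows. This is a reconstruction from the spoken description rather than a copy of the slide: $I$ and $J$ as the input and output lengths, $h_i$ as the encoder states, and $f$ as the recurrent transition are notational assumptions.

$$
p(y_1, \dots, y_J \mid x_1, \dots, x_I) = \prod_{j=1}^{J} p(y_j \mid v, y_1, \dots, y_{j-1}), \qquad h_i = f(h_{i-1}, x_i), \qquad v = h_I
$$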

The decoder is also a recurrent neural network. However, it has more inputs, right? So you see that now I concatenate the current input $y$ with the $v$ vector. And this means that I will use all kinds of information, all three of those arrows, in my transitions.

Now, how do we get predictions out of this model? Well, the easiest way is just to do softmax, right? So when you have your decoder RNN, you have the hidden states of your RNN, and they are called $s_j$.

You can just apply some linear layer, and then softmax, to get the probability of the current word given everything that we have. Awesome.
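To make the decoder step concrete, here is a minimal sketch in plain Python. It assumes a simple tanh RNN cell instead of an LSTM, and all names (`decoder_step`, `W_in`, `W_rec`, `W_out`) and the toy weights are hypothetical, chosen only for illustration. It shows the three ingredients from above: the previous state, the context vector $v$, and the previous output, followed by a linear layer and softmax.

```python
import math


def matvec(W, x):
    """Multiply a matrix W (a list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]


def decoder_step(s_prev, y_prev, v, W_in, W_rec, W_out):
    """One step of a tanh RNN decoder conditioned on the context vector v.

    The input is the concatenation [y_prev; v], so the new state depends on
    the previous state, the context vector, and the previous output.
    """
    x = y_prev + v  # list concatenation plays the role of vector concatenation
    a = matvec(W_in, x)
    b = matvec(W_rec, s_prev)
    s = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
    probs = softmax(matvec(W_out, s))  # linear layer, then softmax
    return s, probs


# Toy sizes (assumed): state 2, word embedding 2, context 2, vocabulary 3.
W_in = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2]]
W_rec = [[0.2, 0.1], [-0.3, 0.4]]
W_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
s, probs = decoder_step([0.0, 0.0], [1.0, 0.0], [0.5, -0.5], W_in, W_rec, W_out)
```

Here `probs` is a distribution over the three toy vocabulary entries, and `s` would be passed on as `s_prev` for the next step.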

![eda_6.png](pics/eda_6.png)


Now let us try to see whether those $v$ vectors are somehow meaningful. One way to do this is to say, okay, they are, let's say, three-dimensional hidden vectors. Let us do some dimensionality reduction, for example, by t-SNE or PCA, and let us plot them in just two dimensions to see what the vectors look like.

So you see that the representations of some sentences are close here, and it's nice that the model can capture that active and passive voice doesn't actually matter for the meaning of the sentence. For example, you see that the sentences "I gave her a card" and "She was given a card" are very close in this space.

![eda_7.png](pics/eda_7.png)

Okay, even though these representations are so nice, a single vector for the whole sentence is still a bottleneck. So you should think about how to avoid that. To avoid that, we will go into attention mechanisms, and this will be the topic of our next chapter.

# 2. Attention mechanism

The attention mechanism is a super powerful technique in neural networks. So let us cover it first with some pictures and then with some formulas.

![eda_8.png](pics/eda_8.png)

Just to recap, we have some encoder that has $h$ states and decoder that has some $s$ states. Now, let us imagine that we want to produce the next decoder state. So we want to compute $s_j$. How can we do this? 

In the previous chapter, we just used the $v$ vector, which was the information about the whole encoded input sentence. Instead of that, we could do something better. We can look into all states of the encoder with some weights. These alphas denote weights that tell us whether it is important to look there or here. How can we compute these alphas?

Well, we want them

1. to be **probabilities**,
2. to **capture some similarity** between our current moment in the decoder and different moments in the encoder.

This way, we'll look into more similar places, and they will give us the most important information to go next with our decoding. 

![eda_9.png](pics/eda_9.png)

If we speak about the same thing with formulas, we will say that now, instead of just one $v$ vector that we had before, we will have $v_j$, which is different for different positions of the decoder. This $v_j$ vector will be computed as a weighted average of the encoder states. The weights will be computed with softmax, because they need to be probabilities. And this softmax will be applied to the similarities of encoder and decoder states.
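The computation just described can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a dot product as the similarity and toy two-dimensional states; the function name `attention_context` is hypothetical, and which decoder state is compared (the previous or the current one) differs between formulations, so here it is simply passed in.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]


def attention_context(encoder_states, decoder_state):
    """Return the attention weights and the context vector v_j."""
    # Similarity of every encoder state h_i with the decoder state.
    scores = [dot(h, decoder_state) for h in encoder_states]
    # Softmax turns the similarities into probabilities (the alphas).
    alphas = softmax(scores)
    # v_j is the weighted average of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(a * h[k] for a, h in zip(alphas, encoder_states))
        for k in range(dim)
    ]
    return alphas, context


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder states h_1..h_3
s = [2.0, 0.0]                            # toy decoder state
alphas, v_j = attention_context(H, s)
```

In this toy example the first and third encoder states are equally similar to the decoder state, so they get equal weights, and the second state gets a smaller one.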

Now, do you have any ideas how to compute those similarities? 

![eda_10.png](pics/eda_10.png)

I have a few. So papers actually have tried lots and lots of different options, and there are just three options for you to try to memorize. 

Maybe the easiest option is at the bottom. Let us just take the dot product of the encoder and decoder states. It will give us some understanding of their similarity. Another way is to say, maybe we need some weights there, some matrix that we need to learn, and it can help us capture the similarity better. This is called multiplicative attention. And maybe we do not want to think at all about how to compute it. We just want to say, "Well, a neural network is something intelligent. Please do it for us." Then we just take one layer of a neural network and say that it needs to predict these similarities. So you see there that you have $h$ and $s$ multiplied by some matrices and summed. That's why it is called additive attention. And then you have some non-linearity applied to this.
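The three similarity options can be written side by side as small functions. This is a sketch under assumptions: the parameter names (`W`, `W_h`, `W_s`, `w`) are hypothetical, tanh is assumed as the non-linearity, and the final projection with the vector `w` in the additive variant follows the common formulation rather than anything stated above.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply a matrix W (a list of rows) by a vector x."""
    return [dot(row, x) for row in W]


def score_dot(h, s):
    """Dot-product attention: h^T s."""
    return dot(h, s)


def score_multiplicative(h, s, W):
    """Multiplicative attention: h^T W s, with a learned matrix W."""
    return dot(h, matvec(W, s))


def score_additive(h, s, W_h, W_s, w):
    """Additive attention: w^T tanh(W_h h + W_s s), a one-layer network."""
    hidden = [
        math.tanh(a + b)
        for a, b in zip(matvec(W_h, h), matvec(W_s, s))
    ]
    return dot(w, hidden)


h = [1.0, 2.0]
s = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
```

With the identity matrix as `W`, multiplicative attention reduces to the plain dot product, which is one way to see that it generalizes the simplest option.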

These are three options, and you can also have many more. Now, let us put all the things together, just to understand again how attention works.

![eda_11.png](pics/eda_11.png)

You have your conditional language modeling task. You try to predict the $y$ sequence given the $x$ sequence. And now, you encode your $x$ sequence into some $v_j$ vector, which is different for every position. This $v_j$ vector is used in the decoder. It is concatenated with the current input of the decoder. This way, the decoder is aware of all the information that it needs: the previous state, the current input, and now, this specific context vector, computed especially for the current state.

Now, let us see where attention helps.
![eda_12.png](pics/eda_12.png)

So neural machine translation had lots of problems with long sentences. You can see that the BLEU score for long sentences is much lower, though it is really okay for short ones. Neural machine translation with attention can solve this problem, and it performs really well even for long sentences.

Well, this is really intuitive, because **attention helps to focus on different parts of the sentence when you do your predictions**. For long sentences, it is really important because, otherwise, you have to encode the whole sentence into just one vector, and this is obviously not enough. Now, to better understand those $\alpha_{ij}$ weights that we have learned with attention, let us try to visualize them.

![eda_13.png](pics/eda_13.png)

These weights can be visualized as an $i$ by $j$ matrix. Let's ask, what is the most promising place in the encoder for every place in the decoder? With the light dots here, you can see those words that are aligned. So you see this is a very close analogy to the word alignments that we have covered before.

We just learn that these words are somehow similar and relevant, and that we should look at these ones to translate them into another language. This is also a good place to note that we can take some techniques from traditional methods, from word alignments, and incorporate them into neural machine translation. For example, priors for word alignments can really help here.

Now, do you think that this attention technique is really similar to how humans translate real sentences? I mean, humans also look at some places and then translate these places. They have some attention. Do you see any differences?

![eda_14.png](pics/eda_14.png)

Well, actually there is one important difference here. 

So humans save time with attention, because they look only at those places that are relevant. Here, on the contrary, we waste time, because to guess what the most relevant place is, we first need to check all the places and compute similarities for all encoder states. And only then can we say, "Okay, this piece of the encoder is the most meaningful."

![eda_15.png](pics/eda_15.png)

Now, the last story for this chapter is how to make this attention save time, not waste time. It is called local attention, and the idea is rather simple. We say, let us first try to predict what the best place to look at is. And then after that, **we will look only into some window around this place**. And we will not compute similarities for the whole sequence.

Now, first, how can you predict the best place? One easy way would be to say, "You know what? Those matrices should be strictly diagonal, and the place for position $j$ should be $j$." Well, for some languages, this might be really bad if they have different word orders, and then you can try to predict it.

How do you do this? You have this sigmoid of something complicated. The sigmoid gives you a value between zero and one, and then you scale it by the length of the input sentence, $I$. So you see that this will indeed be something between zero and $I$, which means that you will get some position in the input sentence. Now, what is inside that sigmoid? Well, you see the current decoder state $s_j$, and you just apply some transformations, as usual in neural networks.

Anyway, when you have this position $a_j$, you need to look only into a window around it and compute similarities for the attention alphas as usual. Or you can also use a Gaussian to say that those words that are in the middle of the window are even more important. So you can just multiply the alpha weights that we were computing before by some Gaussian priors.
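Here is a rough sketch of the predictive variant in plain Python. Several things are assumptions made for illustration: the transformation inside the sigmoid is taken to be a tanh layer followed by a projection (`W_p`, `w_p` are hypothetical names), the similarity is a dot product, the window half-width is a free parameter, and the Gaussian width is set to half of it. The Gaussian-scaled weights are not renormalized here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(z):
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]


def predict_position(s_j, W_p, w_p, input_len):
    """a_j = I * sigmoid(w_p^T tanh(W_p s_j)), a position between 0 and I."""
    hidden = [math.tanh(dot(row, s_j)) for row in W_p]
    return input_len * sigmoid(dot(w_p, hidden))


def local_attention(encoder_states, s_j, a_j, half_width):
    """Attention weights over a window around a_j, scaled by a Gaussian."""
    n = len(encoder_states)
    center = min(n - 1, int(a_j))
    lo = max(0, center - half_width)
    hi = min(n, center + half_width + 1)
    positions = list(range(lo, hi))
    # Similarities are computed only inside the window, not for all states.
    alphas = softmax([dot(encoder_states[i], s_j) for i in positions])
    sigma = half_width / 2.0  # assumed width of the Gaussian prior
    return {
        i: a * math.exp(-((i - a_j) ** 2) / (2.0 * sigma ** 2))
        for i, a in zip(positions, alphas)
    }


H = [[0.1 * i, 1.0] for i in range(6)]  # six toy encoder states
s_j = [1.0, 0.5]
a_j = predict_position(s_j, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], len(H))
weights = local_attention(H, s_j, a_j, half_width=2)
```

With the zero projection vector used here, the sigmoid returns 0.5, so the predicted position is the middle of the six-word input, and only the five positions in the window around it receive weights.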

![eda_16.png](pics/eda_16.png)

Now, I want to show you a comparison of different methods. You can see here that we have **global attention** and **local attention**. For local attention, we have the monotonic approach and the predictive approach, and the last one performs the best. Do you remember what is inside the brackets here? These are different ways to compute similarities for attention weights. So you remember dot product and multiplicative attention? Also, you could have location-based attention, which is even simpler. It says that we should just take $s_j$ and use it to compute those weights. This is all for this presentation, and I am looking forward to seeing you in the next one.

# 3. How to deal with a vocabulary?

This chapter is about all kinds of problems that you can have with the vocabulary in machine translation.
![eda_17.png](pics/eda_17.png)

So first, the vocabulary is usually too large, and it takes too long to compute softmax.

Second, the vocabulary is usually too small, and you have **out-of-vocabulary words**, and you need to deal with them somehow. There is a whole range of different tricks for this in neural language modeling and in machine translation.

Now, let us start with hierarchical softmax. It is a nice procedure to help you to compute softmax in a fast way.

So the idea is to build a binary tree for your words, like here. The words in this tree will have some codes. For example, the zebra will have code 01, which means that first we go to the left, which is 0, and then we go to the right, which is 1. And importantly, there is a unique mapping between words and codes in the tree.

So we are going to use this property, and we are going to produce the probabilities for words using their binary codes. This is how we do it. Instead of computing the probability of the word and normalizing it across the whole vocabulary, we are going to split it into separate terms. Each term is the probability of a digit, that is, the probability of the next decision in the tree, whether to go to the left or to the right. We multiply these probabilities together, so that the product estimates the probability of the whole word.

Now, do you believe that this product of binary probabilities is going to be normalized to one across the whole vocabulary?

Well, sounds like magic, doesn't it? But let's see that it actually happens. You see that this is your binary tree, and you see those probabilities. They are just written down for every path in the tree.

Now, you can try to sum the first two probabilities, and you note that 0.1 plus 0.9 gives 1, and this is not just random. This is always 1, just because these are the probabilities of two options, going to the left and going to the right. That's why they necessarily sum to 1. So if you sum those things, you get their common prefix, which will be 0.7 times 0.8. Then you can sum two other children in the tree, and you get, again, their common prefix. You sum two other children, and you get, again, their common prefix. And finally, you sum these two values, and you get 1. So it is clearly seen that if you go from the bottom to the top of this tree, you will get a probability equal to 1.

So just to sum it up, we split the probability into these binary decisions, and we use some tree to do this. What kind of tree do we use? Well, it is actually not so important. Interestingly, even just a random tree can help. Alternatively, you can use a Huffman tree, which takes the frequencies of the words into account. Or you can try to use some semantics, so you can just cluster your words in the data based on their similarities. Or you can use some pre-built structure that says that these words are similar and that they should be found in one hierarchy branch.
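The normalization argument can be checked numerically on a small tree. The sketch below uses a made-up five-word vocabulary and made-up branch probabilities (only the code 01 for zebra is taken from the text), so it illustrates the mechanism rather than reproducing the tree on the slide.

```python
# Probability of going right (digit "1") at each internal node, keyed by the
# code prefix that leads to that node. All values are invented for the example.
p_right = {"": 0.3, "0": 0.2, "1": 0.6, "00": 0.9}

# Every word is a leaf of the tree and has a unique binary code.
codes = {"cat": "000", "dog": "001", "zebra": "01", "fish": "10", "bird": "11"}


def word_probability(code):
    """Product of the binary decision probabilities along the path."""
    prob = 1.0
    for depth, digit in enumerate(code):
        p1 = p_right[code[:depth]]  # node reached after the first `depth` digits
        prob *= p1 if digit == "1" else 1.0 - p1
    return prob


probabilities = {word: word_probability(code) for word, code in codes.items()}
total = sum(probabilities.values())
```

No explicit normalization over the vocabulary is performed anywhere, yet `total` comes out as 1 up to floating-point error, because at every internal node the left and right probabilities sum to 1.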
Okay, now I have one last question for you about this method. What would be a way to model the probability of $d_i$ given all the other stuff?

Usually, we have softmax to model some probabilities. What would be the probability in this case?

Well, in this case, we have only two options, left and right. So instead of softmax, we can have the sigmoid function, which has just two outcomes.
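As a small illustration of this point, the sketch below models one left/right decision with a sigmoid and checks that it matches a two-way softmax. The names `branch_probabilities`, `node_vector`, and `hidden_state` are hypothetical, and scoring a node by a dot product with the hidden state is an assumption about how such a model might be parameterized.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    total = sum(e)
    return [ei / total for ei in e]


def branch_probabilities(node_vector, hidden_state):
    """Return (p_left, p_right) for one decision in the tree."""
    z = sum(n * h for n, h in zip(node_vector, hidden_state))
    p_right = sigmoid(z)          # probability of digit 1
    return 1.0 - p_right, p_right  # the two outcomes sum to 1 by construction


p_left, p_right = branch_probabilities([0.5, -1.0], [2.0, 0.3])
# A softmax over the two logits (0, z) gives the same numbers as the sigmoid.
two_way = softmax([0.0, 0.5 * 2.0 + (-1.0) * 0.3])
```

So the sigmoid is just the two-class special case of softmax, which is why each decision in the tree is cheap to normalize.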
Okay, now let us come back to some other problems with our vocabulary. The copy mechanism is something that helps with out-of-vocabulary words. Imagine you have some sentence, and some words are UNK tokens. You do not have them in the vocabulary, and you do not know how to translate them, so you get UNK tokens as the result.

But what if you know the word alignments? If you know how the words are aligned, you can use that. So you can say, okay, this UNK corresponds to that UNK, which corresponds to this source word. Let us just do dictionary translation, or let us just copy, because this is the name of some place or some other name. Let us just copy this as is. This is why it is called the copy mechanism, and the algorithm is super simple. First, you need to make sure that you somehow learn word alignments. For example, your neural machine translation system has these alignments as an input, and it tries to predict them along with the translation predictions.

Now, you get your translation with UNK tokens, and you post-process it. So you just copy the source words, or you translate them with a dictionary, or do whatever else you want.

Okay, a very simple and nice technique, but actually, there are still many problems. For example, you can have some multi-word alignments. What if the morphology of the language is complicated, and you probably want to split words into some parts? Or what if you have some informal spelling? All those things are usually out-of-vocabulary words. And these examples show you that sometimes it is very nice to go to the sub-word level. For example, for rich morphology, it would be nice to model every piece of the word independently. Or for informal spelling, it would definitely be good to model words by letters, because there is no chance of finding these words as a whole in the vocabulary.

Okay, so there are two big trends in sub-word modeling. One big trend is to do some hybrid models that somehow combine word-level models and character-level models. Another big trend is to use the same architecture, let's say, a recurrent neural network, but with some smaller units, something in between words and characters. So this is one architecture, but other units.
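Going back to the copy mechanism for a moment, its post-processing step can be sketched as follows. This is a simplified illustration: the function name `replace_unks`, the representation of alignments as a target-index to source-index dictionary, and the example sentence are all assumptions, and the alignments are taken as given rather than predicted by a model.

```python
def replace_unks(target_tokens, source_tokens, alignment, dictionary, unk="UNK"):
    """Replace every UNK in the translation using the aligned source word.

    alignment maps a target position to a source position. If the source
    word is in the dictionary, it is translated; otherwise it is copied as is.
    """
    result = []
    for j, token in enumerate(target_tokens):
        if token == unk and j in alignment:
            source_word = source_tokens[alignment[j]]
            result.append(dictionary.get(source_word, source_word))
        else:
            result.append(token)
    return result


source = ["Ich", "wohne", "in", "Ecublens"]
translation = ["I", "live", "in", "UNK"]
alignment = {0: 0, 1: 1, 2: 2, 3: 3}
fixed = replace_unks(translation, source, alignment, dictionary={})
```

Here the place name is not in the dictionary, so it is copied from the source sentence into the output unchanged.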
Okay, let us start with hybrid models.

You might know that sometimes character-based models are super useful. For example, you see the word "drinkable". If you can build a convolutional neural network that has some convolutions that represent the meaning of "drink", and some other convolutions that represent the meaning of "able", you are likely to build the meaning of the whole word, even if you have never seen it. So character-level convolutional neural networks can work fine. Also, bidirectional LSTMs can be used on the word level as well as on the character level.

Now, for our translation model, it is super nice to have hybrid models. Let us first try to work on the word level. So let us try to produce word translations.

For example, we have "a cute cat" here. "A" and "cat" are nice words. But what if "cute" is out of vocabulary? We cannot model it on the word level. In this case, let us have some separate units, some separate architecture, maybe even a completely different one, that will model the probabilities for this word on the character level, and the same for the decoder. So first, we will try to decode the sequence on the word level. Then, at some moments, the decoder will say, okay, this is some UNK token, please do something about it. And then the character-level model will switch on and do something about it.

So this is a very nice and obvious idea, and it is used a lot.
Now, the second big trend is sub-word modeling, and one good example of that is byte-pair encoding.

Let us understand what it is. Imagine you have some sentence, and you want to encode it, and you are not quite sure yet what your vocabulary is. You just have some constraints on the size of the vocabulary. So you start with characters. Everything is split into single characters. Then you say, okay, what are the most popular bigrams of my letters here?

Well, I see that S-H happens three times. So maybe I should collapse them into just one unit. You do this, and you have a different vocabulary now. Then you say, okay, what is next? Next, I see that these two letters occur a lot, so let us collapse them. These two letters also should be collapsed. And then, importantly, you can apply the same procedure to sub-word units. So here, you would collapse your bigrams and unigrams into trigrams. And actually, you can stop whenever you want.

So you can proceed until you get a vocabulary of the size that you like.

Yep, I have just told you this. One more thing to mention is how to apply this method to test data. If you have test data, you also split it into letters first. Then you know the exact rules from your training procedure, and you apply those rules to the test data to collapse all the sub-word units as needed.
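The merge procedure and its application to new text can be sketched in plain Python. This is a minimal version under assumptions: it works on a list of words without any end-of-word marker, ties between equally frequent pairs are broken by first occurrence, and the function names (`learn_bpe`, `apply_bpe`, `merge_pair`) and the example sentence are chosen for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Collapse every adjacent occurrence of `pair` into one symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn an ordered list of merge rules from training words."""
    vocab = Counter(tuple(word) for word in words)  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, count in vocab.items():
            new_vocab[merge_pair(symbols, best)] += count
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Split a new word into characters, then replay the learned merges."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


train_words = "she sells seashells by the seashore".split()
merges = learn_bpe(train_words, num_merges=5)
segmented = apply_bpe("shore", merges)
```

The number of merges directly controls the vocabulary size: each rule adds one new unit, so you stop as soon as you reach the size you are allowed. A word never seen in training, like "shore" here, is still segmented into known units.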
Awesome, so this chart shows you why it is such a cool technique. This is the vocabulary size, and this vertical line there is the size of the vocabulary that we are allowed. In the case of words, you usually have some long tail that goes outside of this allowed number of words. But with byte-pair encoding, you can get exactly this number of units, because you decide which rules to apply.

And finally, I can show you that this actually works fine. For different pairs of languages and for different tasks from, for example, WMT, which is the Workshop on Machine Translation, you can see that byte-pair encoding has a better BLEU score than word-based techniques. And actually, this BLEU score improvement is very meaningful. One or two points of BLEU is a lot. So please use this very nice technique if someday you need to build a machine translation system.