# 1. Sequence to sequence learning: one-size fits all?

Let us see how many different tasks in NLP can be solved as sequence to sequence tasks. 

![ss_1.png](pics/ss_1.png)

We have talked a lot about machine translation, which is the obvious example, but there are many other options. For example, you can do speech recognition, and there is a model called Listen, Attend and Spell. Or you can do image caption generation, which also uses an encoder-decoder architecture; the paper is called Show, Attend and Tell. These models are very similar. However, every task can be solved better if you think a little about the constraints specific to that task. In this chapter, we will speak in more detail about text simplification. We will see that we can use a plain encoder-decoder architecture, but if we think a little about the specific objectives of the task, we can improve.

Okay, let us start with summarization for now. The **summarization** task is to produce a short summary of some document.

![ss_2.png](pics/ss_2.png)

Summarization can have several types. 

![ss_3.png](pics/ss_3.png)

We can speak about extractive summarization, which means that we just extract some pieces of the original text. Or we can speak about abstractive summarization, which means that we want to generate a summary that is not necessarily taken from our text but that nicely summarizes the whole text. The latter is obviously better, but it is also more difficult, so most production systems are just extractive summarizers. Now, let us try to do abstractive summarization with a sequence to sequence model, since such a model generates the summary word by word. You get the full text as the input and the summary as the output, and you have your encoder and decoder as usual.

![ss_3.png](pics/ss_4.png)

Now, you need a good dataset to train your model. For example, the English Gigaword dataset is really huge, and it contains articles paired with their headlines.

![ss5.png](pics/ss5.png)

Now you apply your model. There is even an open-source implementation of the model, so you can use it right away and get some results. The results are rather promising. You can see the first sentence of the article and the generated headline. The text on the right is generated by our model.

Actually, there are some problems with this model, and we will speak about them in another chapter in our course. But for now, we can just say that it works somehow. And let us move forward and discuss another very related task, which is called **simplification**.

![ss_6.png](pics/ss_6.png)

The text simplification task also needs a good dataset to train on, and one good candidate is Simple Wikipedia. You have normal Wikipedia sentences aligned with Simple Wikipedia sentences. What can be different between them? For example, in the second example you just delete two pieces, and in the first example you rephrase some pieces.

What kinds of operations can you use to modify these sentences?

![ss_7.png](pics/ss_7.png)

Well, as I have already said, you can delete, you can paraphrase, or you can split one sentence into two simpler, shorter sentences.

Now, paraphrasing is a rather general approach, and you can do different things. You can reorder words, or you can do some syntactic analysis of the sentence, recognize that some syntactic structures are simpler and more common than others, and substitute one syntactic structure for another.

![ss_8.png](pics/ss_8.png)

The straightforward way to do this is a **rule-based** approach. We do not cover rule-based approaches much in our course, maybe because they are not as fancy as deep neural networks, and people usually want to hear more about deep neural networks. But to be honest, the rule-based approach is a very good thing that works in production. So if you want to be sure that your model performs okay, it is a good idea to start by implementing your own task-specific rules.

For the text simplification task, the rules can be substitutions, for instance context-free grammar rules that tell you that "solely" can be simplified to "only", or that instead of "something population" you should rather say "the population of something". There can be lots of such rules. You can either write them by hand, for example if you have some linguists, or you can learn them from data. The Paraphrase Database is a big data source, and it also contains some learned rules.

![ss_9.png](pics/ss_9.png)
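
To make the rule-based idea concrete, here is a minimal sketch with two hand-written rules taken from the examples above. The rule table, the regular expression, and the `simplify` helper are all hypothetical names invented for this illustration; a production system would have many more rules and a proper grammar.

```python
import re

# Hypothetical hand-written lexical rules: complex word -> simpler word.
LEXICAL_RULES = {"solely": "only"}

# Illustrative syntactic rule: "<Name> population" -> "the population of <Name>".
# It only fires on capitalized words, which is a simplifying assumption.
POPULATION_RULE = re.compile(r"\b([A-Z]\w+) population\b")


def simplify(sentence: str) -> str:
    # Apply the word-level substitutions first.
    for complex_word, simple_word in LEXICAL_RULES.items():
        sentence = re.sub(rf"\b{re.escape(complex_word)}\b", simple_word, sentence)
    # Then apply the syntactic rewrite.
    return POPULATION_RULE.sub(r"the population of \1", sentence)


print(simplify("Researchers studied Tokyo population solely through surveys."))
```

Even this toy version shows why rules are attractive in production: every change to the sentence can be traced back to one explicit rule.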

Another approach would be still to do some deep learning and even reinforcement learning.


It is not easy to make such a model work, but I want to give you a general, very hand-wavy idea of how it could be done. You can use an encoder-decoder architecture as usual. But this architecture is unlikely to simplify well, because it does not have any simplification objective built in. One way to build in this objective is weak supervision by reinforcement learning. What do I mean by that?

In reinforcement learning, we usually have some agents that perform some actions. So here, the actions would be to generate the next word.

Usually we also have some policy, which means the probability distribution for actions. And in this case, it will be probabilities of the next word given everything else.

And the agent performs some actions according to the policy and gets some rewards. So if the generated sentence is good, then the reward should be high. So one very creative step is how do we estimate this reward? And the idea is to do it in three parts. 

1. Adequacy is about whether the simplified sentence still states the same fact as the original sentence.
2. Fluency is about the coherence of the sentence, as judged by a language model.
3. Simplicity is whether the simplified version is indeed simpler than the original one.

![ss_10.png](pics/ss_10.png)

A very high-level architecture is as follows. You have your encoder-decoder agent that can generate some sequences. Then for every sequence you get rewards based on simplicity, relevance (adequacy) and fluency. These rewards go to the REINFORCE algorithm, which we do not cover right now, but you need to know that REINFORCE can use these rewards to update the policy of the agent. So on the next step, the agent will be more likely to take those actions that give higher rewards.

In some sense it is similar to gradient descent, you would say, but the important distinction is that the rewards are usually not differentiable. Reinforcement learning is really helpful when you cannot just write down a loss function and optimize it, but can only say: this is simple, this is not simple, so here the reward is high and here the reward is low. If the reward is like that, you cannot take gradients of it and do stochastic gradient descent. That is why you apply something a little bit more magical, which is called the **REINFORCE algorithm**.
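
The following toy sketch illustrates the idea on the smallest possible case: a "policy" that picks a single word, and a hand-made reward that cannot be differentiated. It uses the standard REINFORCE update (reward times the gradient of the log-probability of the sampled action). The vocabulary, the `reward` function, and the hyperparameters are assumptions made for illustration; this is not the model from the slide, which generates whole sentences and combines three rewards.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(word):
    # Non-differentiable toy reward: 1 for the "simple" word, 0 otherwise.
    return 1.0 if word == "only" else 0.0


def train(vocab, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(vocab)  # policy parameters
    for _ in range(steps):
        probs = softmax(logits)
        # The agent samples an action (a word) from its current policy.
        action = rng.choices(range(len(vocab)), weights=probs)[0]
        r = reward(vocab[action])
        # For a softmax policy, d log pi(action) / d logit_k = 1[k == action] - probs[k].
        for k in range(len(vocab)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * r * grad_log_prob
    return softmax(logits)


vocab = ["solely", "only", "exclusively"]
probs = train(vocab)
print({w: round(p, 3) for w, p in zip(vocab, probs)})
```

Note that the reward itself is never differentiated; only the log-probability of the sampled action is. Practical implementations usually also subtract a baseline from the reward to reduce variance, which this sketch omits.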

Now, I want to go into the details of just one piece of this slide: simplicity. How do we measure simplicity?

![ss_11.png](pics/ss_11.png)

Well, we have three kinds of information: the input, which is the normal text; the references, which are the gold simplified sentences; and the output of our system. We need to compare all of them to understand whether we perform well.

For machine translation, for example, you would compare only human references with system outputs, because the input is in another language. But here it is very important to compare all three. One nice measure that can be used is called SARI, which stands for system output against references and against the input.

![ss_12.png](pics/ss_12.png)

It computes precision scores for different types of operations: addition, copying, and deletion. For example, what would be the precision for the addition operation? Well, which terms do we add? These are the terms that occur in the output but do not occur in the input, and this is exactly what we see in the denominator. Now, how many of them occur in the reference?

This is exactly what we see in the numerator. So we have a precision score that measures how many of the added terms are indeed correct. Now you can think about recall for addition, and precision and recall for the other operations, and somehow average them to get the final score. I want to show you that this score actually works.
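
As a rough illustration of the addition precision just described, here is a simplified sketch. It works on unigram sets and a single reference, whereas the full measure uses n-grams, several references, and the other operations as well. The function name and the toy sentences are assumptions made for this example.

```python
def addition_precision(source: str, output: str, reference: str) -> float:
    src, out, ref = (set(s.lower().split()) for s in (source, output, reference))
    added = out - src      # terms in the output but not in the input (denominator)
    correct = added & ref  # added terms that also occur in the reference (numerator)
    return len(correct) / len(added) if added else 0.0


source = "about 95 species are currently accepted"
reference = "about 95 species are now accepted"

# This output replaces "currently" with "now" (good) and "accepted" with "agreed" (bad).
print(addition_precision(source, "about 95 species are now agreed", reference))
# This output keeps "currently" and only adds "agreed", which no reference supports.
print(addition_precision(source, "about 95 species are currently agreed", reference))
```

The first output gets credit for adding "now", while the second does not, which is only possible because the score looks at the input as well as the reference.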

For example, we have the input and three references and three outputs.

![ss_13.png](pics/ss_13.png)

You can see that the second output is definitely better than the third one, because "now" is a simplification of "currently", which we had in the input. This score can distinguish the two, because we compare everything with the input.

This does not happen with the BLEU score from machine translation. There, we compare only the output and the reference, and the BLEU score thinks that system number two and system number three behave just the same.

# 2. Get to the point! Summarization with pointer-generator networks

Hey, in this chapter, I'm going to cover one nice paper about summarization. 

This is a very recent paper from Chris Manning's group, and it is nice because it tells us that, on the one hand, we can use an encoder-decoder architecture and it will work somehow. On the other hand, we can think a little bit and improve a lot.

The improvement will be based on **pointer networks**, which are also a very useful tool to be aware of. Sometimes we give rather hand-wavy explanations of architectures with pictures, but sometimes it is good to go into details and see the actual formulas.

That's why I want to be very precise in this chapter, and by the end of it you will be able to understand all the details of the architecture.

![ssa_1.png](pics/ssa_1.png)

First of all, a recap: we usually have some encoder, for example a bidirectional LSTM, and then we have some attention mechanism.

![ssa_1.png](pics/ssa_2.png)

The attention mechanism produces probabilities that tell us which moments in our input sentence are the most important.

Now, you see there is an arrow on the right of the slide. Do you have any idea what this arrow means? Where does it come from?

![ssa_1.png](pics/ssa_3.png)

Well, the attention mechanism is about the **important moments of the encoder based on the current moment of the decoder**. So we definitely need the yellow part, which is the decoder, and the current state of this decoder tells us how to compute attention. To complete the scheme, we use this attention mechanism to generate our distribution over the vocabulary.

Awesome. So, this is just a recap of encoder-decoder attention architecture. 

Let us see how it works. 

![ssa_1.png](pics/ssa_4.png)

So, we have some sentences, and we try to get a summary. 

![ssa_5.png](pics/ssa_5.png)

So, the summary would be like that. First, we see some UNK tokens because the vocabulary is not big enough. Then, we also have some problems in this paragraph that we will try to improve. One problem is that the model is abstractive, so the model generates a lot, but it doesn't know that sometimes, **it will be better just to copy something from the input**. 

![ssa_6.png](pics/ssa_6.png)

So, the next architecture will tell us how to do it. Let us have a closer look into the formulas and then see how we can improve the model. 

So, first, the **attention distribution**. Do you remember the notation? Do you remember what $H$ and $S$ are? Well, $H$ is the encoder states $h_i$ and $S$ is the decoder states $s_j$. We use both of them to compute the attention weights, and we apply softmax to get probabilities. Then we use these probabilities to weigh the encoder states and get $v_j$, the context vector specific to position $j$ of the decoder.

Then how do we use it? We have seen in some other videos that we can use it to compute the next state of the decoder. In this model, we will go a slightly simpler way. Our decoder will be just a normal RNN model, but we will take the state of this RNN, $s_j$, concatenate it with $v_j$, and use the result to produce the probabilities of the outcomes. So we just concatenate them, apply some transformations, and do softmax to get the probabilities of the words in our vocabulary.
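
Written out, the computation described above looks roughly as follows, with $h_i$ the $i$-th encoder state and $s_j$ the decoder state at position $j$. This follows the formulation in the pointer-generator paper, adapted to the notation used here; the parameter names $w$, $W_h$, $W_s$, $b_{attn}$ and the symbols $e_i^j$ (unnormalized score) and $p_i^j$ (attention probability) are labels assumed for this sketch and may differ from the slide.

$$
e_i^j = w^\top \tanh\left(W_h h_i + W_s s_j + b_{attn}\right), \qquad
p^j = \mathrm{softmax}\left(e^j\right), \qquad
v_j = \sum_i p_i^j \, h_i
$$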
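
As a formula, "concatenate, transform, softmax" can be sketched as below. The two linear layers with parameters $V$, $b$, $V'$, $b'$ follow the pointer-generator paper; the name $P_{vocab}$ for the resulting distribution is an assumption of this sketch, and the exact form on the slide may differ.

$$
P_{vocab} = \mathrm{softmax}\left(V'\left(V\,[s_j; v_j] + b\right) + b'\right)
$$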

![ssa_7.png](pics/ssa_7.png)

Now, how can we improve our model? We would want to have some copy distribution. So, this distribution should tell us that sometimes it is nice just to copy something from the input. 

How can we do this? Well, we have the attention distribution, which already holds the probabilities of the different moments in the input. What if we just sum them by word? For example, suppose we have seen the word "as" twice in our input sequence. Let us say the probability of "as" should be equal to the sum of those two attention probabilities. In this way, we get a distribution over the words that occur in our input.
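
Summing attention by word takes only a few lines of code. In the sketch below, the input tokens and the attention weights are made-up numbers, and `copy_distribution` is a hypothetical helper name.

```python
from collections import defaultdict


def copy_distribution(tokens, attention):
    # Sum the attention weights over all positions where the same word occurs.
    dist = defaultdict(float)
    for token, weight in zip(tokens, attention):
        dist[token] += weight
    return dict(dist)


tokens = "as good as new".split()
attention = [0.25, 0.5, 0.125, 0.125]  # one weight per input position, sums to 1

copy_dist = copy_distribution(tokens, attention)
print(copy_dist["as"])  # attention at position 0 plus attention at position 2
```

Because the attention weights sum to one over positions, the result is again a valid probability distribution, now over the distinct input words.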

![ssa_8.png](pics/ssa_8.png)

Now, the final thing to do will be just to have a mixture of those two distributions. 

One is the copy distribution, which tells us that some words from the input are good, and the other is the generative model that we have discussed before. So, just a few more formulas.

How do we weigh these two distributions? 

![ssa_9.png](pics/ssa_9.png)

We weigh them with a generation probability $p_{gen}$, which is also some function. Everything in green on this slide is a parameter. So you learn these parameters, and thereby you learn to produce the probability that weighs the two kinds of distributions. This weighting coefficient depends on everything that you have: on the context vector $v_j$, on the decoder state $s_j$, and on the current input to the decoder. So you just apply transformations to everything that you have and then a sigmoid to get a probability.
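
In formulas, the mixture can be sketched as follows, again adapted from the pointer-generator paper. Here $x_j$ denotes the current decoder input, $P_{vocab}$ the generative distribution over the vocabulary, $p_i^j$ the attention probability of input position $i$, and $w_i$ the word at that position; the parameter names $w_v$, $w_s$, $w_x$, $b_{ptr}$ are labels assumed for this sketch.

$$
p_{gen} = \sigma\left(w_v^\top v_j + w_s^\top s_j + w_x^\top x_j + b_{ptr}\right), \qquad
P(w) = p_{gen}\, P_{vocab}(w) + \left(1 - p_{gen}\right) \sum_{i \,:\, w_i = w} p_i^j
$$

When $p_{gen}$ is close to 1 the model behaves like the plain generator, and when it is close to 0 it mostly copies from the input.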

The training objective for our model would be, as usual, cross-entropy loss with this final distribution. 

So we will try to predict the words that we need to predict. This is similar to likelihood maximization, and this is the objective we will need to optimize.
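
Concretely, the objective can be written as the negative log-likelihood of the target word under the final distribution, averaged over the summary. In this sketch, $w_j^*$ stands for the reference word at decoder position $j$, $P$ for the final (mixed) distribution at that position, and $T$ for the summary length; these symbol names are assumptions.

$$
\mathrm{loss}_j = -\log P\left(w_j^*\right), \qquad
\mathrm{loss} = \frac{1}{T} \sum_{j=1}^{T} \mathrm{loss}_j
$$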

![ssa_10.png](pics/ssa_10.png)

Now, this is just the whole architecture, just once again. We have **encoder with attention**, we have **yellow decoder**, and then we have two kinds of distributions that we weigh together and get the final distribution on top. 

![ssa_11.png](pics/ssa_11.png)

Let us see how it works. This is called the pointer-generator model because it has two pieces, a **generative model** and a **pointer network**. The part that copies phrases from the input is the pointer network. Now, you see that we are doing well: we can learn to extract some pieces from the text. But there is one drawback. You see that the model repeats some sentences or some pieces of sentences. We need one more trick here, and the trick is called the coverage mechanism. Remember that you have **attention probabilities**. You know how much attention you give to every distinct piece of the input. Now, let us just accumulate it.

![ssa_12.png](pics/ssa_12.png)

So at every step, we are going to sum all those attention distributions to some coverage vector, and this coverage vector will know that certain pieces have been attended already many times. 

How do you compute the attention then? 

Well, to compute attention, you also need to take into account the **coverage vector**. The only difference is that you have one more term there: the coverage vector multiplied by some parameters, in green as usual. But this is not enough. You also need to put it into the loss. Apart from the loss that you had before, you will have one more term, called the **coverage loss**, and the idea is to minimize the minimum of the attention probabilities and the coverage vector.

Take a moment to understand that. 

Imagine you want to attend to some moment that has already been attended to a lot. Then its coverage value is high, so the minimum is determined by the attention probability, and to minimize it you have to keep the attention probability at this moment small. Conversely, if a moment has a low coverage value, it is safe to put a high attention weight there, because the minimum will still be the low coverage value and the loss will not be high. So this loss motivates you to attend to those places that have not been attended to much yet.
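
The behaviour described above is easy to check numerically. The sketch below accumulates the coverage vector as the sum of all previous attention distributions and adds up the element-wise minimum at every decoding step, as in the pointer-generator paper. The function name and the two toy attention sequences are assumptions made for illustration.

```python
def coverage_loss(attention_steps):
    """Sum over decoding steps and input positions of min(attention, coverage)."""
    coverage = [0.0] * len(attention_steps[0])  # nothing attended before step 1
    total = 0.0
    for attention in attention_steps:
        # Penalize attention that lands on already-covered positions.
        total += sum(min(a, c) for a, c in zip(attention, coverage))
        # Accumulate this step's attention into the coverage vector.
        coverage = [c + a for c, a in zip(coverage, attention)]
    return total


repeating = [[1.0, 0.0], [1.0, 0.0]]  # attends to position 0 twice
spreading = [[1.0, 0.0], [0.0, 1.0]]  # attends to a new position each step

print(coverage_loss(repeating), coverage_loss(spreading))
```

Attending to the same position again is penalized, while moving on to an unattended position costs nothing.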

![ssa_13.png](pics/ssa_13.png)

Let us see whether the model works well and whether the coverage trick helps us to avoid repetitions. We can compute the ratio of duplicates in the summaries we produce, and we can compute the same ratio for human reference summaries. You can see that it is okay to duplicate unigrams, but it is not okay to duplicate sentences, because the green level there is really low: it is zero. The model before coverage, the red one, did not know that, and it duplicated a lot of 3-grams, 4-grams and sentences. The blue one does not duplicate them, and this is really nice.
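
One plausible way to compute such a duplicate ratio for n-grams is sketched below. The exact definition behind the plot is not given here, so treat this as an assumption: every repeated occurrence of an n-gram after its first counts as a duplicate.

```python
from collections import Counter


def duplicate_ngram_ratio(tokens, n):
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first one is a duplicate.
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(ngrams)


summary = "the cat sat the cat sat".split()
print(duplicate_ngram_ratio(summary, 1), duplicate_ngram_ratio(summary, 2))
```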

![ssa_14.png](pics/ssa_14.png)

However, we have another problem here. The summary becomes really extractive, which means that we do not generate new sentences, we just extract them from our input. Again, we can try to compare what we have with reference summaries. Let us compute the ratio of those n-grams that are novel. And you can see that for the reference summaries, you have rather high bars for all of them. 
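
The ratio of novel n-grams can be sketched in the same spirit: an n-gram of the summary is novel if it never occurs in the source text. The helper names and the toy sentences are assumptions made for this example.

```python
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(source_tokens, summary_tokens, n):
    summary_ngrams = ngram_set(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    # Novel n-grams are those in the summary that the source never contains.
    novel = summary_ngrams - ngram_set(source_tokens, n)
    return len(novel) / len(summary_ngrams)


source_tokens = "the cat sat on the mat".split()
summary_tokens = "the cat slept".split()
print(novel_ngram_ratio(source_tokens, summary_tokens, 2))
```

A purely extractive summary would score zero for every n, which is why low bars indicate copying.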

The model with the **coverage mechanism** has significantly lower levels than the model without the coverage mechanism. So in this case, coverage hurts the model a little bit.

![ssa_14.png](pics/ssa_15.png)

Again, here is a real example: this is the summary generated by the pointer-generator network plus coverage. Let us look at it: somebody says he plans to do something.

![ssa_16.png](pics/ssa_16.png)

And here in the original text, we see exactly the same sentences, but they are somehow linked. The model just links them with phrases like "he says that" and so on. Otherwise, it is just an extractive model that extracts these three important sentences.

![ssa_17.png](pics/ssa_17.png)

Now, I want to show you a quantitative comparison of the different approaches. The ROUGE score is an automatic measure for summarization. You can think of it as something like BLEU, but for summarization instead of machine translation. You can see that pointer-generator networks perform better than vanilla seq2seq plus attention, and the coverage mechanism improves the system even more.

However, all those models are not that good if we compare them to some baselines. One very competitive baseline is just to take the first three sentences of the text. But it is a very simple, extractive baseline, so there is no obvious way to improve it. It is just what you get out of this very straightforward approach.

In contrast, for the models with attention and coverage, there are ideas for how to improve them even more. So everybody hopes that neural systems will improve on that, and it seems likely that in a few years they will be able to beat those baselines.
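
To give a feeling for what ROUGE measures, here is a simplified sketch of ROUGE-N recall: the share of reference n-grams that also appear in the candidate summary, with clipped counts. Real ROUGE implementations add stemming, multiple references and other variants such as ROUGE-L; the function name and the example sentences are assumptions.

```python
from collections import Counter


def rouge_n_recall(reference_tokens, candidate_tokens, n=1):
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngram_counts(reference_tokens), ngram_counts(candidate_tokens)
    if not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    return overlap / sum(ref.values())


reference_tokens = "the cat sat on the mat".split()
candidate_tokens = "the cat lay on a mat".split()
print(rouge_n_recall(reference_tokens, candidate_tokens, n=1))
print(rouge_n_recall(reference_tokens, candidate_tokens, n=2))
```

Like BLEU, this is pure n-gram overlap, but it is normalized by the reference rather than by the candidate, so it rewards covering the reference content.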
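
The first-three-sentences baseline is simple enough to sketch in a few lines. The sentence splitter below is a naive regular expression chosen for illustration (it will break on abbreviations, for example), and `lead_k` is a hypothetical helper name.

```python
import re


def lead_k(text, k=3):
    # Naive sentence splitting: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:k])


article = (
    "The storm hit the coast on Monday. Thousands lost power. "
    "Crews began repairs on Tuesday. Schools reopened a week later."
)
print(lead_k(article))
```

The baseline works well on news because journalists tend to put the most important facts at the start of an article.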

# 3. QUIZ - Summarization and simplification

![quiz_ss_1.png](pics/quiz_ss_1.png)
![quiz_ss_1.png](pics/quiz_ssa_2.png)
![quiz_ss_1.png](pics/quiz_ss_3.png)



```python
# Imagine you have trained an encoder-decoder-attention model to generate a text summary.
# Let's say you have a vocabulary [big, black, bug, bear] and the vocabulary distribution
# at some decoding moment is [0.3, 0.4, 0.1, 0.2].
voc = ["big", "black", "bug", "bear"]
voc_dist = [0.3, 0.4, 0.1, 0.2]
voc_dict = dict(zip(voc, voc_dist))

# Now, let us consider how it changes if we add the pointer part from the paper
# "Get to the point! Summarization with pointer-generator networks" to be able
# to copy some input words.

# Consider an input sentence: "a big black bug bit a big black bear" and the
# attention distribution [0.1, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1].
sentence = "a big black bug bit a big black bear"
tokens = sentence.split()
att_dist = [0.1, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1]

# Copy distribution: sum the attention weights of every occurrence of a word.
# As in the original snippet, only in-vocabulary words are tracked here.
copy_dist = {}
for token, weight in zip(tokens, att_dist):
    if token in voc_dict:
        copy_dist[token] = copy_dist.get(token, 0.0) + weight

# What will the final distribution look like if the pointer network (copy distribution)
# is weighted equally with the generator network (vocabulary distribution)?
# Enter the probability for "big".
print((copy_dist["big"] + voc_dict["big"]) / 2)
```

```
0.25
```


![quiz_ss_1.png](pics/quiz_ss_4.png)
![quiz_ss_1.png](pics/quiz_ss_5.png)