# What if outputs are also sequences? - Seq2seq

Up to this point we have assumed that, although the inputs of our modeling problems are sequences, the outputs are single categorical values or scalars, representing either a one-step prediction or a similarity score. But what if we want to tackle problems where **the input and the output are both sequences**? (For example, translating parallel texts, answering questions, ...)

Fortunately, it turns out that sequence models are quite flexible in this regard as well:

<a href="https://i.stack.imgur.com/WSOie.png"><img src="https://drive.google.com/uc?export=view&id=1UGZ6iRVZBUnXLlvS8nAtvihwyO-a-6R0" width=65%></a>

**We can use the sequence models to capture ("encode") some inputs and generate ("decode") the desired outputs for us (utilizing their hidden states).**

A noteworthy "hero" of DL is [Ilya Sutskever](https://en.wikipedia.org/wiki/Ilya_Sutskever), who was instrumental in the elaboration of Sequence-to-sequence learning methods. 

(Not surprisingly, also a student of Hinton.)

<a href="https://www.utoronto.ca/sites/default/files/2017-03-28-IlyaSutskever.jpg"><img src="https://drive.google.com/uc?export=view&id=1IJ7GeHnyzysJqYTzvMcO2MANEVyN-vrg" width=40%></a>

He later became one of the co-founders of the OpenAI research organization.

## The inner states of LSTMs as dense vector representations of series

As mentioned before, the inner states of LSTMs represent an arbitrarily long sequence of inputs as a fixed-length hidden state vector, thus LSTMs can be regarded as sequence encoders.
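The fixed-length property can be illustrated with a minimal sketch. It assumes NumPy, uses randomly initialised weights instead of trained ones, and the names (`lstm_encode`, `W`, `U`, `b`) are hypothetical, chosen only for this illustration. A single LSTM layer is run over two sequences of very different lengths, and the final hidden state has the same size in both cases.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_encode(inputs, W, U, b):
    """Run one LSTM layer over `inputs` (steps x input_dim) and return
    the final hidden state, a vector of length hidden_dim."""
    hidden_dim = U.shape[0]
    h = np.zeros(hidden_dim)  # hidden state
    c = np.zeros(hidden_dim)  # cell state
    for x in inputs:
        # Pre-activations of all four gates at once: shape (4 * hidden_dim,)
        z = x @ W + h @ U + b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g      # forget part of the old memory, write new content
        h = o * np.tanh(c)     # expose a gated view of the memory
    return h

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
W = rng.normal(scale=0.5, size=(input_dim, 4 * hidden_dim))   # untrained, random
U = rng.normal(scale=0.5, size=(hidden_dim, 4 * hidden_dim))  # untrained, random
b = np.zeros(4 * hidden_dim)

short_seq = rng.normal(size=(4, input_dim))
long_seq = rng.normal(size=(50, input_dim))
code_short = lstm_encode(short_seq, W, U, b)
code_long = lstm_encode(long_seq, W, U, b)
print(code_short.shape, code_long.shape)
```

With trained weights, it is this final vector that would be passed on to a classifier, a similarity measure or a decoder.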

The resulting representations can be used, for example, for:

- classification (e.g., sentiment analysis in NLP),
- measuring the similarity of sequences, and thus __search__,
- sequence-to-sequence transformations, where we generate a new sequence from the hidden representation, just as in the case of language models, for example by applying "beam search".

<a href="http://suriyadeepan.github.io/img/seq2seq/seq2seq2.png"><img src="https://drive.google.com/uc?export=view&id=1slnyfW87l_HqBLXiIHUKzuMD91CN143_"></a>

LSTM-based sequence-to-sequence transformations are used in:

- **neural machine translation**
- Summarization
- Question answering
- Dialogue systems

#### Attention

Many seq2seq tasks behave in a "non-holistic" way: while the output is being generated, not all of the input information is equally important at every step. At some points it is well worth "attending to" certain elements of the input, while at other points the same elements may be completely irrelevant. Despite this, the plain encoder-decoder model is constrained to a single summarized representation and cannot access the relevant earlier hidden states of the encoder. Early on, some tricks were applied to mitigate this effect, such as feeding in the input twice or in reverse order, but the real solution proved to be the so-called **"attention mechanism"** (originally coming from ConvNets).

<a href="https://cdn-images-1.medium.com/max/1600/0*SY3nv8-J6qX1GUxk.png"><img src="https://drive.google.com/uc?export=view&id=19Fckva14TW5FpNKkdIVVGGVPqkmuAlXB"></a>

In each step, the decoder receives its previous hidden state and output, as well as a _weighted sum_ of all hidden states of the encoder as context.

The context in the $i$-th step of the decoder is:

$$ c_i = \sum_{j=1}^{T}\alpha_{ij}h_j$$

where, for each encoder hidden state $h_k$, a score is generated by a trained feedforward network $A$:

$$e_{ik} = A(h_k, s_{i-1})$$

(the inputs are the encoder state $h_k$ and $s_{i-1}$, the previous hidden state of the decoder), and the $\alpha_{ij}$ weights are obtained by taking the softmax of these scores:

$$\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{T}\exp e_{ik}}$$
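The three formulas can be traced in a small NumPy sketch. This is only an illustration under stated assumptions: the scoring network $A$ is taken to be a network with one tanh hidden layer (in the spirit of the Bahdanau et al. paper), the weights are random instead of trained, and all names (`attention_context`, `W_h`, `W_s`, `v`) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def attention_context(H, s_prev, W_h, W_s, v):
    """H: encoder states (T x enc_dim); s_prev: previous decoder state (dec_dim,)."""
    # e_ij = A(h_j, s_{i-1}): here a small tanh network that scores every h_j
    scores = np.tanh(H @ W_h + s_prev @ W_s) @ v  # shape (T,)
    alphas = softmax(scores)                      # alpha_ij, sums to 1 over j
    context = alphas @ H                          # c_i = sum_j alpha_ij h_j
    return context, alphas

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 6, 4, 3, 8
H = rng.normal(size=(T, enc_dim))          # stand-in for the encoder hidden states
s_prev = rng.normal(size=dec_dim)          # stand-in for s_{i-1}
W_h = rng.normal(size=(enc_dim, att_dim))  # untrained, random
W_s = rng.normal(size=(dec_dim, att_dim))  # untrained, random
v = rng.normal(size=att_dim)               # untrained, random

context, alphas = attention_context(H, s_prev, W_h, W_s, v)
print(alphas.round(3), alphas.sum())
print(context.shape)
```

Since the weights depend on $s_{i-1}$, the decoder gets a different context vector in every step, which is exactly what the single summarized representation could not provide.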


The classic paper about attention mechanisms is: [Bahdanau et al: "Neural machine translation by jointly learning to align and translate." (2014).](https://arxiv.org/pdf/1409.0473.pdf)

<a href="https://github.com/uzaymacar/attention-mechanisms"><img src="https://drive.google.com/uc?export=view&id=1ETOhoFu8fsWpd_20a2HtYz0h-TWSbog2"></a>


[Image, blog-post and sample code on different types of attention mechanisms](https://github.com/uzaymacar/attention-mechanisms)

# Attention is all you need! - rise of the Transformers

Although we have seen that attention mechanisms make it possible to process elaborate external memory structures, later research showed that attention mechanisms are extremely powerful in sequence modeling even without any external memory.


The __transformer__ is a powerful seq2seq encoder-decoder architecture built solely from "transformer modules" consisting of attention and feed-forward layers, without using RNNs. Despite this, in most NLP tasks (e.g., language modeling, translation, question answering) transformer-based models have significantly outperformed the "more traditional" RNN-based encoder-decoders.

## Attention in general

The basic attention schema used in transformers can be described as follows: We want to "attend" to part(s) of a certain $\mathbf X=\langle \mathbf x_1,\dots,\mathbf x_n \rangle$ sequence of vectors (embeddings). In order to do that, we transform $\mathbf X$ into a sort of "key-value store" by calculating from $\mathbf X$

- a $\mathcal K(\mathbf X) = \mathbf K = \langle \mathbf k_1,\dots, \mathbf k_n \rangle$ sequence of key vectors, one for each $\mathbf x_i$,
- a $\mathcal V(\mathbf X) = \mathbf V = \langle \mathbf v_1,\dots,\mathbf v_n \rangle$ sequence of value vectors, one for each $\mathbf x_i$,

and by generating (not necessarily from $\mathbf X$) a $\mathbf Q = \langle \mathbf q_1,\dots,\mathbf q_m\rangle$ sequence of query vectors. Using these, the "answer" to each query $\mathbf q$ can be calculated by

- first calculating a "relevance score" for each $\mathbf k_i$ key, which is simply the $\mathbf q \cdot \mathbf k_i$ dot product (in certain cases scaled by a constant),
- taking the $\langle s_1,\dots,s_n\rangle$ softmax of the scores, which forms a probability distribution over the value vectors;
- finally, calculating the answer as the weighted sum of the values: $\sum_{i} s_i \mathbf v_i$.
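The three steps above can be sketched in a few lines of NumPy. The function name `attend` and the toy keys, values and query are made up for this illustration; the optional scaling constant is taken to be $1/\sqrt{d}$ with $d$ the key dimension, as in the original transformer paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilised exponentials
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V, scale=True):
    """Q: (m, d) queries, K: (n, d) keys, V: (n, d_v) values -> (m, d_v) answers."""
    scores = Q @ K.T                             # q . k_i for every query/key pair
    if scale:
        scores = scores / np.sqrt(K.shape[-1])   # optional constant scaling
    weights = softmax(scores, axis=-1)           # one distribution over keys per query
    return weights @ V, weights                  # weighted sums of the values

# Toy key-value store with three entries (made-up numbers).
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
Q = np.array([[4.0, -4.0]])  # a single query that matches the first key best

answers, weights = attend(Q, K, V)
print(weights.round(3))
print(answers.round(3))
```

Because the softmax is "soft", the answer is dominated by the value of the best-matching key but still contains a small contribution from all the others.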

##  Attention as a layer

How can the above attention mechanism be used as a _layer_ in a network whose input is a sequence $\mathbf I = \langle \mathbf i_1,\dots, \mathbf i_n\rangle$, where the $\mathbf i_j$s are themselves vectors (embeddings)? The transformer solution is to calculate a query from each input:

$$
\mathbf Q = \mathcal Q(\mathbf I) = \langle \mathcal Q(\mathbf i_1),\dots,\mathcal Q(\mathbf i_n)\rangle 
$$
use these queries to attend to a sequence of vectors, and output simply the calculated answers.

The transformer uses two attention-layer variants, which differ only in what they attend to:

- __Self-attention__ layers attend (unsurprisingly) to their own input, while, in contrast,
- __Encoder-decoder attention__ layers, used in the decoder, attend to the output of the encoder.

## Self-attention

In a transformer self-attention layer, both the source of the queries and the target of the attention are the input embeddings. The mappings for queries, keys and values are learned projections:

<a href="http://jalammar.github.io/images/t/self-attention-matrix-calculation.png"><img src="https://drive.google.com/uc?export=view&id=1fYqDpEpgejnUavIBanfhHgbdVp_WNVNY" width="400px"></a>

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))
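As a rough sketch of the calculation shown in the figure, assuming NumPy and random matrices in place of the learned projections (the names `W_q`, `W_k`, `W_v` and the toy dimensions are chosen only for this illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n, d_model) input embeddings; returns one output vector per input."""
    # Queries, keys and values are all projections of the same input X.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n, n): every position scores every position
    return softmax(scores, axis=-1) @ V      # (n, d_k)

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 3
X = rng.normal(size=(n, d_model))
# Random stand-ins for the learned projection matrices.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Z = self_attention(X, W_q, W_k, W_v)
print(Z.shape)
```

The output has one row per input position, so the layer maps a sequence of $n$ vectors to another sequence of $n$ vectors.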

## Multi-headed attention

In order to be able to attend to different features on the basis of different queries, the transformer attention layers work with multiple sets of learned query, key and value projections, called "attention heads":

> "Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions." 

([Attention is all you need](https://arxiv.org/abs/1706.03762))

<a href="http://jalammar.github.io/images/t/transformer_attention_heads_qkv.png"><img src="https://drive.google.com/uc?export=view&id=1zB0CP-GlenMj346g1UlryVe5SOB2cFYy" width="800"></a>
(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

The outputs are collected for each head separately:

<a href="http://jalammar.github.io/images/t/transformer_attention_heads_z.png"><img src="https://drive.google.com/uc?export=view&id=15Eg3HEgeiUiW8YsaeExqD6Ra5IFo-7IA" width="800"></a>

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

concatenated, and, finally, projected back by another learned weight matrix into the basic model embedding dimension:

<a href="http://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png"><img src="https://drive.google.com/uc?export=view&id=190H1uu8SySbF0Yr3VNGjisWxagygjcxm" width="800"></a>

(In the original "Attention is all you need" paper the model embedding dimension is 512 and there are 8 attention heads, so the query, key and value vectors are all 512/8 = 64 dimensional.)
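The split, attend, concatenate and project steps can be sketched as follows, using the 512/8 dimensions mentioned above. This is a sketch under assumptions, not a reference implementation: the weights are random rather than learned, the function and variable names are hypothetical, and the per-head projection matrices are stored side by side in one 512 x 512 matrix per role, which is equivalent to having 8 separate 512 x 64 matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head): columns h*d_head..(h+1)*d_head form head h
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (n_heads, n, n)
    heads = softmax(scores, axis=-1) @ V                    # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # heads side by side again
    return concat @ W_o                                     # project back to d_model

rng = np.random.default_rng(0)
n, d_model, n_heads = 10, 512, 8
X = rng.normal(size=(n, d_model))
# Random stand-ins for the learned weight matrices.
W_q, W_k, W_v, W_o = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(4)
)

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)
```

Each head works in its own 64-dimensional subspace, and the attention for all heads is computed with a single batched matrix product.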

## Transformer modules
Similarly to most CNN architectures, transformers are built up from identical modules. Each module consists of two main components: one or two multi-headed attention layers and a position-wise feedforward network with one hidden layer, whose dimensionality is larger than the model's basic embedding dimension (2048 in the original paper). Both the attention and the feedforward layers are wrapped in residual (skip) connections and are normalized with layer norm. Two types of modules are used.

The modules in the encoder contain only self-attention:

<a href="http://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png"><img src="https://drive.google.com/uc?export=view&id=14sg-hZGrA6SF1ZadZPLKCd0yOIKu2Tcb" width="550"></a>

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))
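A simplified sketch of such an encoder module is given below. Several simplifications are assumed to keep it short: a single attention head instead of multiple heads, layer normalization without learned gain and bias, random instead of trained weights, small toy dimensions, and hypothetical names throughout (`encoder_module`, the keys of `params`).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise every position (row) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def feed_forward(X, W1, b1, W2, b2):
    # Applied to each position separately; one ReLU hidden layer of size d_ff.
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

def encoder_module(X, p):
    # Sub-layer 1: self-attention + skip connection + layer norm.
    X = layer_norm(X + self_attention(X, p["W_q"], p["W_k"], p["W_v"]))
    # Sub-layer 2: feedforward network + skip connection + layer norm.
    return layer_norm(X + feed_forward(X, p["W1"], p["b1"], p["W2"], p["b2"]))

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 16, 64  # toy sizes; the original paper uses 512 and 2048
params = {name: rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
          for name in ("W_q", "W_k", "W_v")}
params.update(
    W1=rng.normal(scale=d_model ** -0.5, size=(d_model, d_ff)), b1=np.zeros(d_ff),
    W2=rng.normal(scale=d_ff ** -0.5, size=(d_ff, d_model)), b2=np.zeros(d_model),
)

X = rng.normal(size=(n, d_model))
out = encoder_module(X, params)
print(out.shape)
```

The input and output shapes are identical, which is what allows identical modules to be stacked on top of each other.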

The modules in the decoder additionally contain an "outward attention" layer, which attends to the output of the encoder:

<a href="https://lilianweng.github.io/lil-log/assets/images/transformer-decoder.png"><img src="https://drive.google.com/uc?export=view&id=1e8sNfkgD_qvTKZy_fzBx6r06cpPgmat2" width="400"></a>

(image source: [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html))

## Encoder-decoder architecture

The full encoder-decoder architecture has the following structure:

<a href="http://nlp.seas.harvard.edu/images/the-annotated-transformer_14_0.png"><img src="https://drive.google.com/uc?export=view&id=1ysJosUbOAOmYBzE6XRLrNWlFSU7-o1fh" width="400"></a>

(image source: [The annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html))

Similarly to other (e.g., RNN-based) seq2seq architectures, the decoder part takes the previous outputs as input. In order to prevent access to information from "future outputs", the self-attention layers in the decoder use "masked attention", i.e., for each position, positions to the right are forced to have $-\infty$ input relevance score in the self-attention softmax layer.
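The masking step can be sketched as follows, assuming NumPy and random queries and keys (the function name `masked_attention_weights` is made up for this illustration). Scores above the diagonal, i.e., those that belong to positions to the right, are replaced by $-\infty$ before the softmax, so their weights become exactly zero.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention_weights(Q, K):
    """Attention weights where position i can only attend to positions <= i."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # hide "future" positions
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # random stand-ins for decoder queries
K = rng.normal(size=(4, 8))  # random stand-ins for decoder keys

weights = masked_attention_weights(Q, K)
print(weights.round(2))
```

The printed matrix is lower triangular: the first position can only attend to itself, while the last position can attend to all positions.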

The following animations show the whole transformer seq2seq architecture in action in a translation task:

<a href="http://jalammar.github.io/images/t/transformer_decoding_1.gif"><img src="https://drive.google.com/uc?export=view&id=1oT-y2wQS8MLxzEj7umw1MZo5rDTD4uui"></a>

<a href="http://jalammar.github.io/images/t/transformer_decoding_2.gif"><img src="https://drive.google.com/uc?export=view&id=15XWP6IfFFUK7V9B_1D_QrSVeDOv8JmSP"></a>

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

## Further reading

+ The original transformer paper: [Attention is all you need](https://arxiv.org/abs/1706.03762)
+ A highly readable, illustrated dissection on which this discussion drew: [The illustrated transformer](http://jalammar.github.io/illustrated-transformer/)
+ An annotated version of the original paper with implementation in PyTorch: [The annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
+ Perhaps the most important application, a special kind of language model: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

## Representation power of transformers

In the paper titled [Visualizing and Measuring the Geometry of BERT](https://arxiv.org/abs/1906.02715) the authors investigate what inner representation BERT (and probably other transformer models) builds in order to solve the language-modeling-like task it is trained on.

<a href="http://drive.google.com/uc?export=view&id=1xYGqiO6mRrQZsTWj-iiGs3-9qbKvjkhj"><img src="https://drive.google.com/uc?export=view&id=1u7jAxFZ5-XnRsD7VsKu8eyJI0bNgr2mm" width=85%></a>

Interestingly, they find:

"At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings..."

So it seems that BERT builds up a hierarchical representation in the linguistic sense, and thus has a good grasp of language in general. This is the key property that makes it a good candidate for transfer learning!

# The success of transformers

The main advantage of transformers over RNN/LSTM solutions is undeniably __parallelizability__.

The computation of attention can be executed in parallel, thus, given enough GPU resources, it is feasible to achieve __huge scale__.

<img src="http://drive.google.com/uc?export=view&id=1MYe3tGO0fOhP4gSq9OWXbuZOCEdlW-3o" width=65%>

This led to the general success of transformer architectures, especially in NLP (but also in computer vision), and to the emergence of "foundation models".

<img src="http://drive.google.com/uc?export=view&id=1jPBqyQcJGUqIJov0xosKxy7p0Ot0jTli" width=65%>
