# Seq2Seq model

__Seq2Seq__, short for ***Sequence to Sequence***, is a model that takes a sequence signal (a long sentence, a paragraph, features extracted from an image, an audio signal, etc.) and, through encoding and decoding, generates a new sequence signal (an abbreviated phrase, a text description, recognized text). It is commonly used in machine translation, image captioning, automatic dialogue, and speech recognition.

# What is a Seq2Seq model?

The core idea of the Seq2Seq model is to use a deep neural network to convert an input sequence signal into an output sequence signal. The process consists of two stages: encoding and decoding. In the classic implementation, the encoder and the decoder are each a recurrent neural network (a plain RNN, an LSTM, or a GRU), and the two networks are trained together.

**The following figure shows several input/output configurations of recurrent models related to Seq2Seq:**

<img src="image/img32.png">

**one to one:** One input predicts one output (for example, classifying a picture).
The one-to-one network structure is as follows:
<img src="image/img33.png">

**one to many:** One input predicts a sequence output (for example, image captioning: given a picture, predict a text description of it as a sequence).
The one-to-many network structure is as follows:
<img src="image/img34.png">
<center>(1) The input is fed in only once. (2) The single input is fed in again at every hidden step.</center>

**many to one:** A sequence input produces a single output (for example, sentiment analysis: given a text sequence of some length, predict whether its attitude is positive or negative).

<img src="image/img35.png">

**many to many:** A sequence input predicts a sequence output (for example, machine translation or automatic dialogue generation).

<img src="image/img36.png">
<img src="image/img37.png">


# Let's get the core idea of the Seq2Seq model

The Seq2Seq model is mainly used to convert one sequence into another, as in French-English translation. It consists of two deep neural networks, each of which can be a recurrent network such as an RNN (recurrent neural network) or an LSTM (long short-term memory). One network maps the input sequence to a fixed-dimensional vector, which is the encoding process; the other network then maps this vector to the target sequence, which is the decoding process. The model structure of Seq2Seq is shown in Fig 1: the model reads the input sentence "ABC" and generates "WXYZ" as the output sentence.

<img src="image/img38.png">

<center>Fig 1 Model structure of Seq2Seq</center>

## Encoding and decoding

Encoding and decoding are the core of the Seq2Seq model. This section uses an RNN as the example network to explain how they work. The structure of encoding and decoding is shown in Fig 2.

<img src="image/img39.png">
<center>Fig 2 Structure of encoding and decoding</center>

__Encoding:__ The RNN reads the symbols of the input sequence $X$ one at a time. Each time a symbol $x_t$ is read, the hidden state $h_{<t>}$ of the RNN is updated according to the formula below, where $f$ is a nonlinear activation function. After the end of the sequence has been read, the hidden state of the RNN is a summary $c$ of the entire input sequence.

<h3>$h_{<t>} = f(h_{<t-1>}, x_t)$</h3>
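The encoder recurrence can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: $f$ is taken to be $\tanh$ of a linear map, the initial hidden state is all zeros, and the names `rnn_step`, `encode`, `W`, and `U` are hypothetical rather than part of any library.

```python
import math
import random


def rnn_step(h_prev, x_t, W, U):
    """One update h_t = f(h_{t-1}, x_t), with f assumed to be tanh(W x_t + U h_{t-1})."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(W[j], x_t))
            + sum(u * h for u, h in zip(U[j], h_prev))
        )
        for j in range(len(h_prev))
    ]


def encode(xs, W, U, hidden_size):
    """Read the whole sequence and return the final hidden state as the summary c."""
    h = [0.0] * hidden_size  # initial hidden state (assumed to be all zeros)
    for x_t in xs:
        h = rnn_step(h, x_t, W, U)
    return h


# Toy setup: three one-hot symbols standing in for the input "ABC".
rng = random.Random(0)
input_size, hidden_size = 3, 4
W = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
U = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

c = encode(xs, W, U, hidden_size)
print(len(c))  # one value per hidden unit, whatever the input length
```

Note that the summary has the same size for any input length, which is exactly the fixed-dimensional vector the decoder receives.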

__Decoding:__ The decoder is another RNN, trained to generate the output sequence by predicting the next symbol $y_t$ given its hidden state $h_{<t>}$. Unlike in the encoder, both $y_t$ and $h_{<t>}$ are also conditioned on the previous symbol $y_{t-1}$ and on the summary $c$ of the input sequence. The hidden state of the decoder at time $t$ is calculated as follows.

<h3>$h_{<t>} = f(h_{<t-1>}, y_{t-1}, c)$</h3>

The conditional distribution for the next symbol is

<h3>$P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, c) = g(h_{<t>}, y_{t-1}, c)$</h3>
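The two decoder formulas can be sketched as a single step function plus a greedy decoding loop. This is a minimal illustration, not a reference implementation: $f$ is assumed to be $\tanh$ of a linear map, $g$ is assumed to be a softmax over a linear map (so that it yields valid probabilities), and the names `decoder_step`, `greedy_decode`, `W_h`, `W_o`, and `eos_id` are hypothetical.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decoder_step(h_prev, y_prev, c, params):
    """h_t = f(h_{t-1}, y_{t-1}, c), then P(y_t | ...) = g(h_t, y_{t-1}, c)."""
    h_t = [math.tanh(a) for a in matvec(params["W_h"], h_prev + y_prev + c)]
    probs = softmax(matvec(params["W_o"], h_t + y_prev + c))
    return h_t, probs


def greedy_decode(c, params, vocab_size, eos_id, max_len):
    """Pick the most probable symbol at each step until eos_id or max_len."""
    h = [0.0] * len(params["W_h"])
    y_prev = [0.0] * vocab_size  # no previous symbol yet (an assumption)
    output = []
    for _ in range(max_len):
        h, probs = decoder_step(h, y_prev, c, params)
        token = max(range(vocab_size), key=lambda k: probs[k])
        output.append(token)
        if token == eos_id:
            break
        y_prev = [1.0 if k == token else 0.0 for k in range(vocab_size)]
    return output


# Toy, untrained parameters: the output is meaningless, only the mechanics matter.
rng = random.Random(1)
hidden_size, vocab_size, context_size = 4, 5, 4
width = hidden_size + vocab_size + context_size
params = {
    "W_h": [[rng.uniform(-0.5, 0.5) for _ in range(width)] for _ in range(hidden_size)],
    "W_o": [[rng.uniform(-0.5, 0.5) for _ in range(width)] for _ in range(vocab_size)],
}
c = [0.1, -0.3, 0.2, 0.4]
tokens = greedy_decode(c, params, vocab_size, eos_id=0, max_len=6)
print(tokens)
```

Greedy selection is only one possible decoding strategy; it is used here because it is the shortest to write down.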

The encoder and decoder are trained jointly to maximize the conditional log-likelihood:

<img src="image/img40.png">

where $\theta$ is the set of model parameters and each $(x_n, y_n)$ is an (input sequence, output sequence) pair from the training set.
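As a rough sketch of how this objective could be evaluated, assume the decoder has already produced, for every training pair, the probability it assigned to each correct target symbol. The log-likelihood of one pair is then the sum of the log step probabilities (the chain rule applied to the conditional distribution above), and the objective is assumed here to be the average over the training set. The function names are hypothetical.

```python
import math


def sequence_log_likelihood(step_probs):
    """log p(y | x) as the sum over t of log P(y_t | y_{t-1}, ..., y_1, c)."""
    return sum(math.log(p) for p in step_probs)


def average_log_likelihood(dataset_step_probs):
    """Average the per-pair log-likelihoods over all training pairs."""
    total = sum(sequence_log_likelihood(p) for p in dataset_step_probs)
    return total / len(dataset_step_probs)


# Two toy pairs: the model gave the correct symbols probabilities
# (0.5, 0.5) in the first pair and (0.25,) in the second.
toy = [[0.5, 0.5], [0.25]]
objective = average_log_likelihood(toy)
print(objective)
```

Training adjusts $\theta$ so that this quantity goes up; in practice the negative of it is usually minimized with a gradient-based optimizer.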

The structure of the hidden unit of the RNN is shown in Fig 3. It includes an update gate $z$ and a reset gate $r$; this gated unit is what is now commonly called a GRU. The update gate decides whether the hidden state is to be updated with a new candidate hidden state. The reset gate decides whether the previous hidden state is ignored.

<img src="image/img41.png">
<center>Fig 3 Structure of hidden unit</center>

The activation of the $j^{th}$ hidden unit in the RNN is computed as follows.

The reset gate $r_j$ is computed by:

<h3>$r_j = \sigma([W_r x]_{j} + [U_r h_{<t-1>}]_j)$</h3>

where $\sigma$ is the logistic sigmoid function and $[\cdot]_j$ denotes the $j^{th}$ element of a vector. $x$ and $h_{<t-1>}$ are the input and the previous hidden state, respectively. $W_r$ and $U_r$ are learned weight matrices.

The update gate $z_j$ is computed in the same way:

<h3>$z_j = \sigma([W_z x]_j + [U_z h_{<t-1>}]_j)$</h3>

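The actual activation of the hidden unit $h_j$ is then computed by:

<h3>$h^{<t>}_j = z_j h^{<t-1>}_{j} + (1-z_j)\tilde{h}^{<t>}_{j}$</h3>

where the candidate state $\tilde{h}^{<t>}_{j}$ is

<h3>$\tilde{h}^{<t>}_{j} = \phi([W x]_j + [U(r \odot h_{<t-1>})]_j)$</h3>

with $\phi$ an activation function such as $\tanh$ and $\odot$ denoting element-wise multiplication. In this formulation, when the reset gate is close to 0, the candidate state is forced to ignore the previous hidden state and is reset with the current input only. This lets the hidden state drop information that later turns out to be irrelevant.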
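The gate computations above can be put together into one update step. The sketch below is illustrative only: it assumes the candidate state is $\tanh(Wx + U(r \odot h_{<t-1>}))$, where $\odot$ is element-wise multiplication, omits bias terms, and uses hypothetical names (`gru_step` and the dictionary keys `W_r`, `U_r`, `W_z`, `U_z`, `W`, `U`).

```python
import math
import random


def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gru_step(x, h_prev, p):
    """One hidden-state update with a reset gate r and an update gate z."""
    # r_j = sigma([W_r x]_j + [U_r h_prev]_j)
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev))]
    # z_j = sigma([W_z x]_j + [U_z h_prev]_j)
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev))]
    # Candidate state: the reset gate scales how much of h_prev is visible.
    gated = [r_j * h_j for r_j, h_j in zip(r, h_prev)]
    h_tilde = [
        math.tanh(a + b)
        for a, b in zip(matvec(p["W"], x), matvec(p["U"], gated))
    ]
    # h_j = z_j * h_prev_j + (1 - z_j) * h_tilde_j
    return [
        z_j * h_j + (1.0 - z_j) * ht_j
        for z_j, h_j, ht_j in zip(z, h_prev, h_tilde)
    ]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy, untrained parameters for a unit with 2 inputs and 3 hidden values.
rng = random.Random(2)
input_size, hidden_size = 2, 3
params = {
    "W_r": random_matrix(rng, hidden_size, input_size),
    "U_r": random_matrix(rng, hidden_size, hidden_size),
    "W_z": random_matrix(rng, hidden_size, input_size),
    "U_z": random_matrix(rng, hidden_size, hidden_size),
    "W": random_matrix(rng, hidden_size, input_size),
    "U": random_matrix(rng, hidden_size, hidden_size),
}
h_new = gru_step([1.0, -1.0], [0.0, 0.5, -0.5], params)
print(len(h_new))
```

Because each new value is a convex combination of the previous value and a $\tanh$ output, the hidden state stays bounded as long as it starts bounded.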

## The Attention Mechanism

When the original Seq2Seq model translates, the source sentence is compressed into a single fixed-length vector, which makes it difficult for the neural network to handle long sentences. The attention mechanism was proposed to address this problem. Its core idea is that each time the model generates a word of the translation, it searches for the set of positions in the source sentence where the most relevant information is concentrated. The model then predicts the new target word based on the context vector associated with these source positions and on all previously generated target words.

<img src="image/img42.png">
<center>Fig 4 Given the source sequence ($x_1, x_2, \ldots, x_T$), try to generate the $t^{th}$ target word $y_t$</center>

The attention mechanism modifies both the encoder and the decoder, as shown in Fig 4. The encoder is a bidirectional RNN: the forward RNN reads the input sequence in its original order (from $x_1$ to $x_T$) and computes a sequence of forward hidden states, while the backward RNN reads the sequence in reverse order (from $x_T$ to $x_1$) and computes a sequence of backward hidden states. The forward and backward hidden states for each position are concatenated to form the annotation of that position. Because an RNN tends to represent its most recent inputs best, each annotation focuses on the words around its position. The decoder, in turn, decides which parts of the source sentence to pay attention to. Giving the decoder an attention mechanism relieves the encoder of having to encode all the information in the source sentence into a fixed-length vector. Instead, the information can be spread across the sequence of annotations, and the decoder can selectively retrieve it as needed.
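The bidirectional encoder can be sketched as two independent passes whose per-position hidden states are joined. This is an illustrative sketch that assumes a plain $\tanh$ RNN for both directions and zero initial states; the names `rnn_states`, `bidirectional_annotations`, `fwd`, and `bwd` are hypothetical.

```python
import math


def rnn_states(xs, W, U, hidden_size):
    """Run a plain tanh RNN over xs and return the hidden state at every step."""
    h = [0.0] * hidden_size
    states = []
    for x_t in xs:
        h = [
            math.tanh(
                sum(w * x for w, x in zip(W[j], x_t))
                + sum(u * v for u, v in zip(U[j], h))
            )
            for j in range(hidden_size)
        ]
        states.append(h)
    return states


def bidirectional_annotations(xs, fwd, bwd, hidden_size):
    """Concatenate forward and backward hidden states position by position."""
    forward = rnn_states(xs, fwd["W"], fwd["U"], hidden_size)
    # Read the input in reverse, then flip the states back into source order.
    backward = rnn_states(xs[::-1], bwd["W"], bwd["U"], hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Toy, untrained weights for 2-dimensional inputs and 2 hidden values per direction.
hidden_size = 2
fwd = {"W": [[0.5, -0.3], [0.1, 0.8]], "U": [[0.2, 0.1], [0.1, 0.2]]}
bwd = {"W": [[-0.4, 0.6], [0.7, 0.2]], "U": [[0.3, 0.1], [0.1, 0.3]]}
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

annotations = bidirectional_annotations(xs, fwd, bwd, hidden_size)
print(len(annotations), len(annotations[0]))
```

Each annotation therefore carries information about the words before its position (forward half) and after it (backward half).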

The attention mechanism removes the need for the model to encode the entire source sentence into a fixed-length vector, and lets the model focus only on the information relevant to generating the next target word. This improves neural machine translation on longer sentences.
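The selective retrieval step can be sketched as a weighted average of the encoder hidden states. The sketch makes one loud simplification: relevance is scored with a plain dot product between a decoder state and each encoder state, which is an assumption chosen for brevity and not necessarily the scoring function of the original attention model (which can be a small learned network). The names `attention_context`, `query`, and `annotations` are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(query, annotations):
    """Return attention weights over source positions and the context vector."""
    # Assumed scoring function: dot product between query and each annotation.
    scores = [sum(q * a for q, a in zip(query, h_j)) for h_j in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    # Context vector: weighted sum of the annotations.
    context = [
        sum(w * h_j[k] for w, h_j in zip(weights, annotations))
        for k in range(dim)
    ]
    return weights, context


# Three source positions, each represented by a 2-dimensional encoder state.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]  # stands in for the current decoder state

weights, context = attention_context(query, annotations)
print(weights)
print(context)
```

A new context vector is computed for every target word, so the decoder is no longer limited to a single fixed summary of the source sentence.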