# Intro to Attention

**"One important property of human perception is that one does not tend to process a whole scene in its entirety at once. Instead humans focus attention selectively on parts of the visual space to acquire information when and where it is needed and combine information from different fixations over time to build up an internal representation of the scene, guiding future eye movements and decision making."**

In short, we do not process an entire image at once when we look at a picture; we process parts of it sequentially.


# Sequence to Sequence Models

Attention was created so that a model can attend to the most important parts of text or image data. Classic sequence to sequence models without attention must read the input once and then produce every part of the output from that single reading. Attention enabled state-of-the-art neural translation models, such as the one Google uses for its translation engine.

Before we jump into learning about attention models, let's recap what you've learned about sequence to sequence models. We know that RNNs excel at using and generating sequential data, and sequence to sequence models can be used in a variety of applications!

Applications: any task whose output can be represented as a sequence of vectors (images, text).

Sequence to sequence: takes in a sequence of items and produces another sequence as output.

Training data -> resulting model:

1. English phrase -> French phrase
2. News article and summary -> summary model (news bot)
3. Questions and answers -> answers model
4. Picture -> object labelling model
    
    
# Encoders and Decoders

![encode](rnn_img/encoder-decoder.png)

High-level Simplification:

 - Two recurrent nets: an encoder and a decoder
 - The encoder reads the input sequence and sends what it understands to the decoder
 - This understanding is called the 'context state'
 - The decoder generates the output sequence

![encode](rnn_img/a1.png)

   - Tokenize each word
   - Feed one word per time step, updating the hidden state of the encoder (an LSTM, for example) at each step
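
The two steps above can be sketched in plain Python. This is only a minimal illustration, not the lesson's implementation: it swaps the LSTM for a much simpler tanh recurrent cell, and the vocabulary, the weights `W_X` and `W_H`, and the helper names `tokenize`, `rnn_step`, and `encode` are all made up for the example.

```python
import math

# Hypothetical toy vocabulary; a real model would build this from a corpus.
VOCAB = {"how": 0, "are": 1, "you": 2}

# Hypothetical fixed weights for a 2-unit tanh RNN cell (a stand-in for an LSTM).
# W_X[token_id][i] is the input contribution of that token to hidden unit i.
W_X = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
# W_H[i][j] is the weight from previous hidden unit j to new hidden unit i.
W_H = [[0.4, 0.1], [-0.2, 0.3]]


def tokenize(sentence):
    """Split on whitespace and map each word to its integer id."""
    return [VOCAB[word] for word in sentence.lower().split()]


def rnn_step(token_id, hidden):
    """One time step: combine the current token with the previous hidden state."""
    size = len(hidden)
    return [
        math.tanh(W_X[token_id][i] + sum(W_H[i][j] * hidden[j] for j in range(size)))
        for i in range(size)
    ]


def encode(sentence):
    """Feed tokens one at a time; return the final hidden state."""
    hidden = [0.0, 0.0]
    for token_id in tokenize(sentence):
        hidden = rnn_step(token_id, hidden)
    return hidden


context = encode("How are you")
print(context)
```

The value returned by `encode` is what the notes call the context state: a single fixed-size vector, however many words went in.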

![encode](rnn_img/a2.png)

   - The decoder is fed the encoder's hidden state
   - It processes each word together with an associated response (a reply, or the text that follows)
   
![encode](rnn_img/a3.png)


The encoder and decoder do not have to be RNNs; they can be CNNs too!

In the example above, an LSTM is used to generate a sequence of words; LSTMs "remember" by keeping track of the input words that they see and their own hidden state.

In computer vision, we can use this kind of encoder-decoder model to generate words or captions for an input image, or even to generate an image from a sequence of input words. We'll focus on the first case, generating captions for images, and you'll learn more about caption generation in the next lesson. For now, know that we can input an image into a CNN (encoder) and generate a descriptive caption for that image using an LSTM (decoder).

## Recap for Sequence to Sequence

![encode](rnn_img/a4.png)

- The hidden state is updated for every item in the sequence
- The final hidden state is sent to the decoder as the context state
- The problem is that the encoder is confined to a single vector to represent the whole input, so long input sequences are poorly represented
- Making the hidden state larger to compensate can cause overfitting on shorter texts
- Attention solves this problem


# Attention Encoder


![encode](rnn_img/a5.png)

- The encoder processes each input just like a sequence to sequence model without attention
- Each word produces a hidden state
- This time, all of the hidden states are passed on to the attention decoder
- This gives the decoder more context, and so more information, to work with
- Each hidden state is most strongly associated with its own input word, but also captures a bit of the surrounding context

![encode](rnn_img/a7.png)

A more detailed look:

- Embeddings are created based on the input and are fed through the RNN
- hidden states are generated based on the embeddings
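
A rough sketch of these two steps follows, assuming a made-up embedding table and a simplified tanh cell in place of whatever RNN the figure uses. `EMBEDDINGS`, `W_IN`, `W_REC`, `matvec`, and `encode_all` are hypothetical names chosen for illustration. The point it tries to show is that the attention encoder keeps one hidden state per input token instead of only the last one.

```python
import math

# Hypothetical embedding table: one 2-dimensional vector per token id.
EMBEDDINGS = [[0.2, -0.1], [0.7, 0.3], [-0.4, 0.5], [0.1, 0.9]]

# Hypothetical weights of a tiny tanh RNN cell with 2 hidden units.
W_IN = [[0.6, -0.2], [0.3, 0.5]]    # embedding -> hidden
W_REC = [[0.1, 0.4], [-0.3, 0.2]]   # previous hidden -> hidden


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def encode_all(token_ids):
    """Return one hidden state per input token, in input order."""
    hidden = [0.0, 0.0]
    states = []
    for token_id in token_ids:
        embedded = EMBEDDINGS[token_id]       # look up the token's embedding
        pre = [a + b for a, b in zip(matvec(W_IN, embedded), matvec(W_REC, hidden))]
        hidden = [math.tanh(p) for p in pre]  # new hidden state
        states.append(hidden)                 # keep every state, not just the last
    return states


encoder_states = encode_all([0, 1, 2])
print(len(encoder_states))  # one hidden state per input token
```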


# Attention Decoder


![encode](rnn_img/a8.png)

- Each hidden state is scored individually
- The scores are then fed through a softmax activation function
- The context state is then created by summing the hidden states, each multiplied by its softmax score
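
The three steps above can be sketched as follows. The notes do not say how each hidden state is scored, so this example assumes a plain dot product between the decoder's hidden state and each encoder hidden state; other scoring functions exist. The function names and the numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_hidden, encoder_states):
    """Score each encoder state, softmax the scores, return (weights, context)."""
    # Assumption: dot-product scoring; the source does not name a scoring function.
    scores = [sum(d * h for d, h in zip(decoder_hidden, state)) for state in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    # Context vector: sum of the encoder states, each scaled by its softmax weight.
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states)) for i in range(size)]
    return weights, context


# Made-up numbers purely for illustration.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_hidden = [2.0, 0.0]

weights, context = attention_context(decoder_hidden, encoder_states)
print(weights)
print(context)
```

With these numbers the first and third encoder states get the same raw score, which is higher than the second's, so they receive equal and larger weights and dominate the context vector.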

![encode](rnn_img/a9.png)

- The attention step generates the context vector
- The context vector is passed through the decoder RNN, which generates a new hidden state at each time step
- The next time step takes the previous hidden state, the previous output, and the context state to produce the next output vector
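
One way to picture a decoder time step is the sketch below. It is not taken from the figure: it assumes the previous hidden state, previous output, and context vector are simply concatenated and fed to a tanh cell, with a linear layer producing the output vector. The weights `W_HIDDEN` and `W_OUT`, the helper `decoder_step`, and the context values are all hypothetical; in a full model the attention step would compute each context vector.

```python
import math

# Hypothetical decoder weights. Each row of W_HIDDEN maps the concatenation
# [prev_hidden, prev_output, context] (2 + 2 + 2 = 6 values) to one hidden unit.
W_HIDDEN = [[0.3, -0.1, 0.2, 0.4, 0.5, -0.2],
            [-0.4, 0.2, 0.1, -0.3, 0.1, 0.6]]
# W_OUT maps the new hidden state to a 2-dimensional output vector.
W_OUT = [[0.7, -0.5], [0.2, 0.9]]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def decoder_step(prev_hidden, prev_output, context):
    """One decoder time step; returns (new_hidden, output)."""
    inputs = prev_hidden + prev_output + context  # list concatenation
    new_hidden = [math.tanh(dot(row, inputs)) for row in W_HIDDEN]
    output = [dot(row, new_hidden) for row in W_OUT]
    return new_hidden, output


# Made-up context vectors standing in for what attention would produce per step.
contexts = [[0.9, 0.5], [0.2, 0.8]]
hidden, output = [0.0, 0.0], [0.0, 0.0]  # initial decoder state and output
outputs = []
for context in contexts:
    # Each step reuses the hidden state and output produced by the previous step.
    hidden, output = decoder_step(hidden, output, context)
    outputs.append(output)
print(outputs)
```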