### Attention Intro
- We know that LSTMs and GRUs can learn long-term dependencies, but how long is "long-term" in practice?
- Doing a maxpool over RNN states is like doing a maxpool over CNN features:
- it essentially says "pick the most important feature"
- By taking only the last RNN state, we hope the RNN has both found the relevant feature and remembered it all the way to the end
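
To make the contrast concrete, here is a minimal sketch in plain Python. The state values are made up for illustration, and `last_state` and `max_pool_over_time` are hypothetical helper names, not part of any library.

```python
# Toy RNN states: T = 4 time steps, M = 3 features per step (made-up numbers).
states = [
    [0.1, 0.9, 0.2],
    [0.8, 0.1, 0.3],
    [0.2, 0.2, 0.7],
    [0.1, 0.3, 0.1],
]

def last_state(states):
    """Keep only the final state and hope it remembered everything."""
    return states[-1]

def max_pool_over_time(states):
    """For each feature, keep its largest value across all time steps."""
    return [max(column) for column in zip(*states)]

last = last_state(states)
pooled = max_pool_over_time(states)
```

In this toy case each feature peaks at a different early step, so the pooled vector keeps all three peaks while the last state has lost them.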


### Attention
- The encoder is now a bidirectional LSTM (output shape = $T_x \times 2M$)

### Attention vs Regular Seq2Seq
![](https://talbaumel.github.io/blog/attention/img/att.jpg)

![](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/Feeding-Hidden-State-as-Input-to-Decoder.png)

- Regular Seq2Seq is a side-by-side network:
- the encoder passes only its final state h($T_x$) to the decoder, as the decoder's initial state

- Attention is different: the encoder passes all of its outputs (h(1), h(2), ..., h($T_x$)) to the attention network as input

### Attention Architecture
![](https://lilianweng.github.io/lil-log/assets/images/encoder-decoder-attention.png)

\begin{equation*}
context = \sum_{t'=1}^{T_x} \alpha(t') h(t') 
\end{equation*}
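
The weighted sum above can be sketched in a few lines of plain Python. The attention weights here are picked by hand (how they are actually computed comes later), and `context_vector` is a hypothetical helper name.

```python
def context_vector(alphas, h):
    """Weighted sum of encoder states: sum over t' of alpha(t') * h(t')."""
    dim = len(h[0])
    return [sum(a * h_t[k] for a, h_t in zip(alphas, h)) for k in range(dim)]

# T_x = 3 encoder states, each of size 2 (made-up numbers).
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hand-picked weights that sum to 1.
alphas = [0.5, 0.25, 0.25]

context = context_vector(alphas, h)
```

The context has the same size as a single h(t'), no matter how long the input sequence is.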


- Encoder: Bidirectional RNN
- Decoder: General LSTM
- How do we calculate the $\alpha$?


- t indexes the output sequence ($t = 1, ..., T_y$)
- $t'$ indexes the input sequence ($t' = 1, ..., T_x$)
- For a single step of the output (t) we need to look over all of the h's ($t' = 1, ..., T_x$)


\begin{equation*}
\alpha_{t'} = NeuralNet([s_{t-1},h_{t'}]), \quad t' = 1,...,T_x
\end{equation*}


- The input vector is the concatenation of s(t-1) and $h(t')$. Why?
- Where to pay attention depends not just on the encoder hidden states, but also on where we are in the output sequence

### Calculating attention weights
- z = concat[s(t-1), $h(t')$]
- z = $tanh(W_1 z + b_1)$ # layer 1 of ANN
- z = softmax($W_2 z + b_2$) # layer 2 of ANN
- Alphas must sum to 1 over time:
$\sum_{t'=1}^{T_x} \alpha(t') = 1$
- so we use a special softmax that normalizes over time (over $t'$):
\begin{equation*}
\alpha(t') = \frac{exp(out(t'))}{\sum_{\tau=1}^{T_x} exp(out(\tau))}
\end{equation*}


- out(1) = NeuralNet([s(t-1),h(1)])
- out(2) = NeuralNet([s(t-1),h(2)])
...
- out($T_x$) = NeuralNet([s(t-1),h($T_x$)])
- $\alpha = softmax(out)$
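
The steps above can be sketched in plain Python as follows. This is only an illustration: the weights are random rather than trained, and the sizes and the names `score`, `attention_weights`, and `softmax_over_time` are assumptions made for the example.

```python
import math
import random

def softmax_over_time(out):
    """Normalize the T_x scores so they are positive and sum to 1."""
    m = max(out)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in out]
    total = sum(exps)
    return [e / total for e in exps]

def score(s_prev, h_t, W1, b1, w2, b2):
    """out(t') = NeuralNet([s(t-1), h(t')]): tanh layer, then a scalar output."""
    z = s_prev + h_t  # list concatenation
    hidden = [math.tanh(sum(w * x for w, x in zip(row, z)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * x for w, x in zip(w2, hidden)) + b2

def attention_weights(s_prev, h, W1, b1, w2, b2):
    """Score every encoder state against s(t-1), then softmax over time."""
    out = [score(s_prev, h_t, W1, b1, w2, b2) for h_t in h]
    return softmax_over_time(out)

# Assumed toy sizes: M2 = 3, 2*M1 = 4, hidden layer = 5, T_x = 6.
random.seed(0)
M2, two_M1, hidden_units, T_x = 3, 4, 5, 6
W1 = [[random.uniform(-1, 1) for _ in range(M2 + two_M1)]
      for _ in range(hidden_units)]
b1 = [0.0] * hidden_units
w2 = [random.uniform(-1, 1) for _ in range(hidden_units)]
b2 = 0.0
s_prev = [0.0] * M2
h = [[random.uniform(-1, 1) for _ in range(two_M1)] for _ in range(T_x)]

alphas = attention_weights(s_prev, h, W1, b1, w2, b2)
```

The same network weights are reused for every $t'$; only the $h(t')$ part of the input changes.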

### Pseudocode
- h = encoder(input)
- s = 0
- for t in range($T_y$):  
      alphas = neural_net(s,h)  
      context = dot(alphas,h)  
      o,s = decoder_lstm(context,initial_state =s )  
      output_prediction = dense(o)  


\begin{array}{ccccc}
s(t-1) & \rightarrow & \text{Decoder} & \rightarrow & s(t)\\
& & \uparrow \text{context vector} & & \\
\end{array}


### Teacher Forcing

- Training: input(t) = [context(t), target(t-1)] (the true previous word)
- Prediction: input(t) = [context(t), y(t-1)] (the model's own previous prediction)

\begin{array}{ccccc}
s(t-1) & \rightarrow & \text{Decoder} & \rightarrow & s(t)\\
& & \uparrow & & \\
& & \text{concat} & & \\
& & \text{context} \uparrow \ \uparrow y(t-1) & & \\
\end{array}
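
A small sketch of how the decoder input differs between training and prediction. Words are represented as integer ids and one-hot vectors, and index 0 as a start-of-sentence id is an assumption of this example; `previous_word` and `decoder_input` are hypothetical helper names.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def previous_word(t, targets, predictions, training, start_index=0):
    """Pick the word fed in at step t."""
    if t == 0:
        return start_index  # assumed start-of-sentence id
    # Teacher forcing: use the true word in training, the model's own guess otherwise.
    return targets[t - 1] if training else predictions[t - 1]

def decoder_input(context, prev_index, vocab_size):
    """input(t) = [context(t), previous word] as one concatenated vector."""
    return context + one_hot(prev_index, vocab_size)

targets = [2, 1, 3]      # true word ids (made up)
predictions = [2, 3]     # model outputs so far; step 1 was predicted wrongly
context = [0.5, -0.5]    # made-up context vector for step t = 2

train_in = decoder_input(context, previous_word(2, targets, predictions, True), 4)
pred_in = decoder_input(context, previous_word(2, targets, predictions, False), 4)
```

During training the wrong guess at step 1 does not propagate, because the true word is fed in instead.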


### Structure
- Regular decoder
    - x = LSTM(M, return_sequences=True)(x)
    - y = Dense(V)(x)
- __Attention__
    - for each output step 1 ... Ty  
          alphas = attention(s,h) # already runs Tx times  
          context = alphas.dot(h)
          output,s = lstm(context)
- Key: s changes at each iteration, so it can't all be calculated at once

### Encoder
- LSTM has latent dimension = $M_1$
- shape of h($t'$) = 2$M_1$
- shape of sequence of h's = ($T_x,2M_1)$  

\begin{array}{}
h(1) &  & h(2) & & h(t') & & h(T_x) \\
\uparrow{} & & \uparrow{} & & \uparrow{} & & \uparrow{} \\
\fbox{LSTM} & \leftarrow{} & \fbox{LSTM}&\leftarrow{} & \fbox{LSTM} &\leftarrow{} & \fbox{LSTM} \\ 
\fbox{LSTM} & \rightarrow{} & \fbox{LSTM}& \rightarrow{} & \fbox{LSTM} & \rightarrow{} & \fbox{LSTM} \\ 
\uparrow{} & & \uparrow{} & & \uparrow{} & & \uparrow{} \\
x(1) &  & x(2) & & x(t') & & x(T_x) \\
\end{array}
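
To see where the $2M_1$ comes from, here is a rough sketch with a toy recurrence standing in for the LSTM (it is not an LSTM, just something that produces an $M_1$-sized state per step). `toy_rnn` and `bidirectional` are hypothetical names.

```python
import math

def toy_rnn(xs, m):
    """Stand-in for an LSTM: a fixed recurrence producing one m-dim state per step."""
    state = [0.0] * m
    states = []
    for x in xs:
        state = [math.tanh(x + 0.5 * s_k) for s_k in state]
        states.append(state)
    return states

def bidirectional(xs, m):
    """Run the recurrence in both directions and concatenate the states per step."""
    forward = toy_rnn(xs, m)
    backward = toy_rnn(xs[::-1], m)[::-1]  # re-align backward states with time
    return [f + b for f, b in zip(forward, backward)]

xs = [0.5, -1.0, 0.25, 2.0]  # T_x = 4 scalar inputs (made up)
M1 = 3
h = bidirectional(xs, M1)    # shape (T_x, 2 * M1)
```

Each h(t') holds a forward half that has seen x(1)...x(t') and a backward half that has seen x($T_x$)...x(t').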



### Decoder
- encoder: shape of $h(t')$ = $2M_1$
- decoder concat: shape of $[s(t-1),h(t')]$ = $M_2 + 2M_1$

- shape of $\alpha(t')$ = 1

- $\alpha$ will actually have the shape $(N, T_x, 1)$

- Keras 2.1.5 allows you to pass axis=1 into softmax, which normalizes over the time dimension $T_x$
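
Why the axis matters can be seen with nested Python lists standing in for an $(N, T_x, 1)$ tensor. This is a sketch, not Keras code; `softmax_last_axis` and `softmax_time_axis` are hypothetical names.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_last_axis(x):
    """Softmax over the last axis of an (N, T_x, 1) nested list."""
    return [[softmax(step) for step in sample] for sample in x]

def softmax_time_axis(x):
    """Softmax over axis 1 (time) of an (N, T_x, 1) nested list."""
    result = []
    for sample in x:
        weights = softmax([step[0] for step in sample])
        result.append([[w] for w in weights])
    return result

# N = 2 samples, T_x = 3 steps, one score per step (made-up numbers).
scores = [[[1.0], [2.0], [3.0]], [[0.0], [0.0], [0.0]]]

wrong = softmax_last_axis(scores)  # normalizes each single score on its own
right = softmax_time_axis(scores)  # normalizes across the T_x steps
```

Over the last axis every weight comes out as 1.0, because each score is normalized against itself. Over the time axis the weights of each sample sum to 1.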

### Keeping track of shapes
- $(T_x,1) \cdot (T_x,2M_1) \rightarrow (1, 2M_1)$
- $\alpha \cdot h = \sum_{t'=1}^{T_x} \alpha(t')h(t')$

- h = encoder(input)
- s = 0, c = 0
- outputs = []
- for t in range($T_y$):
        context = do_attention(s,h) # s(t-1),h(1)... $h(T_x)$
        o,s,c = decoder_lstm(context,init= [s,c])
        probabilities = final_dense(o)
        outputs.append(probabilities)
- model = Model(input,outputs)       
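
The loop above can be run end to end with toy stand-ins for each piece. Everything below is an assumption made for illustration: the scorer, `toy_decoder_step`, and `final_dense` are fixed arithmetic rather than trained Keras layers, and only the control flow and shapes are meant to match the notes.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def do_attention(s, h):
    """Score each h(t') against s(t-1), softmax over time, return the context."""
    out = [math.tanh(sum(s) + sum(h_t)) for h_t in h]  # toy scorer, not trained
    alphas = softmax(out)
    dim = len(h[0])
    return [sum(a * h_t[k] for a, h_t in zip(alphas, h)) for k in range(dim)]

def toy_decoder_step(context, s, c):
    """Stand-in for one LSTM step: returns output o, hidden state s, cell state c."""
    drive = sum(context) / len(context)
    c = [0.5 * c_k + 0.5 * drive for c_k in c]
    s = [math.tanh(c_k) for c_k in c]
    return s, s, c  # the output equals the new hidden state

def final_dense(o, vocab_size):
    """Stand-in for Dense(V) with softmax: probabilities over the vocabulary."""
    logits = [(k + 1) * sum(o) for k in range(vocab_size)]
    return softmax(logits)

def decode(h, T_y, M2, vocab_size):
    s = [0.0] * M2
    c = [0.0] * M2
    outputs = []
    for t in range(T_y):
        context = do_attention(s, h)  # depends on s(t-1), so it cannot be batched over t
        o, s, c = toy_decoder_step(context, s, c)
        outputs.append(final_dense(o, vocab_size))
    return outputs

# T_x = 3 encoder states of size 2 * M1 = 4 (made-up numbers).
h = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, -0.2, 0.0], [-0.3, 0.4, 0.2, 0.1]]
outputs = decode(h, T_y=5, M2=3, vocab_size=4)
```

The result is a list of $T_y$ probability vectors, one per output step, each of length V.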