[View in Colaboratory](https://colab.research.google.com/github/assaflehr/language-style-transfer/blob/master/notebooks/attention.ipynb)


# Attention for humans and TF/Keras
Just another notebook on attention, hopefully a simplified one.


We will go into the details of attention, but first we use a toy problem with invented numbers to build intuition. Then we will get into code, and afterwards, if you really want, you can look back at the equations.

I assume you know about seq2seq encoder-decoder models (if not, please read about them now).

## The toy problem: number to text
input: a string representing a number, like '42.5' or '-12001'
output: a text description, like 'forty-two point five' or 'minus twelve thousand and one'

### dataset
We will generate one using the num2words Python library.

### preprocessing 
We assume a one-hot-encoded input, with one vector for each **character**: 0..9, '-', '.'

We will use a one-hot-encoded output, with one vector for each **word**: "one", "two", ..., "thousand", "minus", ...
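
As a minimal sketch of the input side, assuming the character vocabulary is exactly the 12 symbols listed above (the names `CHARS`, `CHAR_TO_INDEX` and `one_hot_chars` are made up for this illustration and are not part of any library):

```python
import numpy as np

# Assumed character vocabulary: ten digits, minus sign, decimal point.
CHARS = list("0123456789-.")
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(CHARS)}

def one_hot_chars(text):
    """Return a (len(text), len(CHARS)) matrix with a single 1 per row."""
    encoded = np.zeros((len(text), len(CHARS)), dtype=np.float32)
    for row, ch in enumerate(text):
        encoded[row, CHAR_TO_INDEX[ch]] = 1.0
    return encoded

x = one_hot_chars("42.5")
print(x.shape)                    # (4, 12)
print(x.argmax(axis=1).tolist())  # [4, 2, 11, 5]
```

The output words can be encoded the same way, with a word vocabulary instead of a character vocabulary.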

## architecture
Encoder-decoder with attention (of course).


First, open this great figure from the [distill.pub blog](https://distill.pub/2016/augmented-rnns/) and keep it open while we go through an example:
![image-from-distill](https://distill.pub/2016/augmented-rnns/assets/rnn_attentional_02.svg)





To make our life simpler, let's support only 4 characters (1, 2, ., <e>) in the embeddings and use one-hot encoding.
Imagine we just finished running the encoding step on "12.1<e>...<e>" (padded with the end-of-sequence token). All RNN encoders give us two things after each character they see: an output (o-dimensional) and a hidden state (h-dimensional).

* In an attention-less architecture, we ignore everything but the last hidden state.

* With attention, we use all the per-character outputs plus the last hidden state, and ignore the other hidden states.

We configured the encoder output dimension to be 10. After training, the values correspond nicely to the following: values 0..3 are the input embedding, and values 4..9 encode the position of the character in the input (the first character sets value 9, the second sets value 8, and so on). Again, to keep it simple for us human viewers, we use one-hot encoding here; actual values will surely be different and much denser. So for '12.1<e>...' we get:

* '1'= 1,0,0,0,**0,0,0,0,0,1**
* '2'= 0,1,0,0,**0,0,0,0,1,0**
* '.'= 0,0,1,0,**0,0,0,1,0,0**
* '1'= 1,0,0,0,**0,0,1,0,0,0** (the left part is the same as for the first '1')
* '<e>'= 0,0,0,1,**0,1,0,0,0,0**
* '<e>'= 0,0,0,1,**1,0,0,0,0,0**
  
The hidden state typically has a different dimension than the output, let's say 5, and should (intuitively) include:
* info about the total number of digits, which helps the decoder know how to process the first digit: 11 (eleven) vs. 112 (one hundred and twelve).







### Attention
We will talk about a few variants of attention. We will start with toy attention, and then move to the real, useful ones.

The decoder side starts with the encoder's last hidden state. This value is used as a query into all the encoder outputs to choose which one (or few) to look at.

#### (1) toy-attention, with a few problems

query 1: look with 100% weight at the first encoder output. Then apply some logic: either write "twenty"/"thirty" right away or, if it was '1', remember that, output nothing, and wait for the next character.

query 2: look with 100% weight at the second encoder output...

How can we achieve that? With a first query like 0,0,0,0,**0,0,0,0,0,1**, which returns a non-zero score only for the first character.

The actual code does a dot product, outputs (6x10) x query (10x1) = (6x1) weights, which are all zero except the first.

Problem: most translation systems do not accept "holes" or "spaces" in the translation. In our case, we sometimes skip the first word (when the number starts with 1).

**Solution:**

Query 1 will look at the first two characters and decide whether to decode one word for both ("12" -> "twelve") or one word for the first only ("2x" -> "twenty"). The hidden state will remember which characters were already fully processed.

query: if the 1st char == '1', pass the 1st and 2nd; else pass only the 1st.
This requires more than just a vector dot product.
It can be achieved with query x W_matrix x outputs.
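
A minimal NumPy sketch of the `query x W_matrix x outputs` score follows. Everything about the query here is an assumption made for illustration: we pretend the 5-d decoder state uses dimension 0 to mean "look at position 1" and dimension 1 to mean "look at position 2", and we build `W` by hand (in a real model it is learned). In this sketch the "if" lives in the decoder state, and `W` translates the 5-d query into the 10-d space of the encoder outputs.

```python
import numpy as np

# Toy encoder outputs for "12.1<e><e>", one column per character: shape (10, 6).
outputs = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # '1'
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],  # '2'
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],  # '.'
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],  # '1'
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],  # '<e>'
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],  # '<e>'
]).T

# Hand-built W (5 x 10): query dim 0 selects output dim 9 (position 1),
# query dim 1 selects output dim 8 (position 2).
W = np.zeros((5, 10))
W[0, 9] = 1.0
W[1, 8] = 1.0

# Hypothetical decoder states.
query_two_chars = np.array([1, 1, 0, 0, 0])  # "first char is '1': need both chars"
query_one_char = np.array([1, 0, 0, 0, 0])   # "first char alone is enough"

scores_two = query_two_chars @ W @ outputs   # (5,) @ (5, 10) @ (10, 6) -> (6,)
scores_one = query_one_char @ W @ outputs
print(scores_two)  # [1. 1. 0. 0. 0. 0.]
print(scores_one)  # [1. 0. 0. 0. 0. 0.]
```

Note that `W` also removes the requirement that the query and the encoder outputs have the same dimension, which the plain dot product needs.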


### **real solutions**
In real attention mechanisms there are a few steps. Note that there are a dozen flavors of attention ("monotonic_attention", "BahdanauMonotonicAttention", "LuongAttention", etc.), which change the equations a bit.
* Read the context vector from the memory of encoder outputs:
  * Find an attention weight for each encoder output. This can be done as in the toy example, where it is 0 or 1, or in two other popular ways: query x W x outputs, and *Bahdanau attention*: FC(tanh(FC(ExO) + FC(H))).
  * In both cases, then apply softmax to normalize the result into probabilities.
  * Then take the weighted sum of the memory vectors.
  * Now we have a context vector of the same size as an encoder output.
* (Optional) Build the attention vector as a combination of the context vector with the current target hidden state: a = tanh(W x concat(c, h)). The tanh and W are added so that this step can learn.

* Create the big decoder input as the concatenation of the context vector and the regular input vector (the embedding applied to X). Start with no input in the first decoding step; after that, the input is the previous output. It can be combined

* Run the RNN (GRU) unit on this big decoder input to get a list of hidden states and outputs. Here we don't need the hidden states. To get the actual output, we apply a dense layer to each RNN output.
  
* Open question: should the attention query be h_t or h_t-1?
* Open question: how do we move from the attention result to the output?
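
The steps above can be sketched in plain NumPy. This is an illustration under stated assumptions, not a reference implementation: all weights are random (untrained), the sizes follow the toy example (10-d encoder outputs, 5-d hidden state), and the names `W_general`, `W_enc`, `W_dec`, `v`, `W_att` and the sizes `A` and `E` are invented for this sketch. The additive score is written as v . tanh(W_enc o_i + W_dec h), which is one common way to read the FC(tanh(FC(...) + FC(H))) formula.

```python
import numpy as np

rng = np.random.default_rng(0)
T, O, H = 6, 10, 5   # time steps, encoder-output size, decoder hidden size
A, E = 8, 4          # assumed attention size and target-embedding size

enc_outputs = rng.normal(size=(T, O))  # the memory: one row per input character
h = rng.normal(size=(H,))              # decoder hidden state, used as the query

def softmax(x):
    """Numerically stable softmax over a 1-d vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Scoring option 1: query x W x outputs.
W_general = rng.normal(size=(H, O))
scores_general = enc_outputs @ (h @ W_general)                  # (T,)

# Scoring option 2: additive (Bahdanau-style) score.
W_enc = rng.normal(size=(O, A))
W_dec = rng.normal(size=(H, A))
v = rng.normal(size=(A,))
scores_additive = np.tanh(enc_outputs @ W_enc + h @ W_dec) @ v  # (T,)

# Softmax turns scores into weights; the weighted sum is the context vector.
weights = softmax(scores_additive)                              # (T,), sums to 1
context = weights @ enc_outputs                                 # (O,)

# Optional attention vector: a = tanh(W x concat(c, h)).
W_att = rng.normal(size=(O + H, H))
attention_vector = np.tanh(np.concatenate([context, h]) @ W_att)  # (H,)

# Big decoder input: context vector concatenated with the embedded previous output.
prev_output_embedding = rng.normal(size=(E,))
decoder_input = np.concatenate([context, prev_output_embedding])  # (O + E,)

print(weights.sum(), context.shape, attention_vector.shape, decoder_input.shape)
```

Swapping `scores_additive` for `scores_general` in the softmax line changes the flavor of attention without touching the rest of the pipeline.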



In [1]:
!pip install num2words
from num2words import num2words
for n in [11.2, 42.5, -12001]:
    print(n, num2words(n))

Collecting num2words
  Downloading https://files.pythonhosted.org/packages/aa/6e/6d026d15d1b0fd37a9dd42ecf559f36871cee67158aff5ba652d3130e8b9/num2words-0.5.6-py2.py3-none-any.whl (64kB)
Installing collected packages: num2words
Successfully installed num2words-0.5.6
11.2 eleven point two
42.5 forty-two point five
-12001 minus twelve thousand and one


In [43]:
import numpy as np
# Toy encoder outputs for '12.1<e><e>': (input character, 10-d output vector).
in_to_out = [('1', [1,0,0,0,0,0,0,0,0,1]),
             ('2', [0,1,0,0,0,0,0,0,1,0]),
             ('.', [0,0,1,0,0,0,0,1,0,0]),
             ('1', [1,0,0,0,0,0,1,0,0,0]),  # the left part is the same as for the first '1'
             ('e', [0,0,0,1,0,1,0,0,0,0]),
             ('e', [0,0,0,1,1,0,0,0,0,0])]
# Stack the vectors as columns: shape (10, 6) = (output dim, sequence length).
outputs = np.array([vec for _, vec in in_to_out]).T
print('toy attention 1 - looking only at first character')
print('outputs', outputs.shape)

# A single 1 in the last position matches only the first character's output.
query = np.array([[0,0,0,0,0,0,0,0,0,1]])
print('query1', query.shape)

result = query @ outputs
print(f'query @ outputs  {query.shape} x {outputs.shape} =', result.shape, result)

#print ('toy attention 2 - looking into two first')
#query= np.array([[0,0,0,0,0,0,0,0,1,1]])
#W = np.eye(10)
#W[:,:]= 4 
#result = query @ W @ outputs
#print ('query @ W @ outputs',result.shape, result)


toy attention 1 - looking only at first character
outputs (10, 6)
query1 (1, 10)
query @ outputs  (1, 10) x (10, 6) = (1, 6) [[1 0 0 0 0 0]]
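
The raw scores above are already 0/1, but real attention passes them through softmax. A small sketch of what that does to the toy scores, assuming the query from the text (a single 1 in the last position); the scaling factor 10 is an arbitrary choice for illustration:

```python
import numpy as np

# Same toy encoder outputs as in the cell above: shape (10, 6).
outputs = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
]).T
query = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])

def softmax(x):
    """Numerically stable softmax along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = query @ outputs              # (1, 6): 1 for the first character, 0 elsewhere
soft_weights = softmax(scores)        # spread out: the first character gets only about a third
sharp_weights = softmax(10 * scores)  # a larger query makes the weights nearly one-hot
context = sharp_weights @ outputs.T   # (1, 10): close to the first encoder output

print(soft_weights.round(3))
print(sharp_weights.round(3))
print(context.round(3))
```

With 0/1 scores, softmax alone does not give "100%" attention; the query (or a learned W) has to produce large enough score differences for the weights to become sharp.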


## Code
See a Keras / eager TF implementation in:
https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb

See an older and more detailed version in:
https://www.tensorflow.org/tutorials/seq2seq#background_on_the_attention_mechanism
