## 1. Attention formulas

### 1.1 Computing the attention score (called alignment in some papers)
  $$ 
  a_t(s) = align(h_t, \bar{h}_s) = \frac{exp(score(h_t, \bar{h}_s))}{\sum_{s'}exp(score(h_t, \bar{h}_{s'}))}
  $$
  
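  Here $h_t$ is the decoder state at time $t$, and $\bar{h}_s$ is the encoder state at time $s$ (for a bidirectional LSTM, the encoder state is the combination of the forward and backward LSTM states).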
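
The normalization in the formula is a softmax over source positions. A minimal pure-Python sketch for a single decoder step follows; the function name `alignment_weights` and the sample scores are illustrative assumptions, not part of any library.

```python
import math

def alignment_weights(scores):
    """Softmax over source positions: a_t(s) = exp(score_s) / sum_s' exp(score_s')."""
    # Subtracting the maximum keeps exp() from overflowing and does not change the result.
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores of one decoder state against three encoder states.
weights = alignment_weights([2.0, 1.0, 0.1])
```

The weights are positive, sum to 1, and preserve the ordering of the raw scores.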
  
  There are several ways to compute $score(h_t, \bar{h}_s)$:
  
1. dot
  $$
  score(h_t, \bar{h}_s) = h_t^T \bar{h}_s 
  $$
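  Note that this requires the encoder and decoder states to have the same dimension, which is usually the case. With a bidirectional LSTM encoder, the decoder state dimension is twice that of a single-direction encoder LSTM, so that it matches the combined forward and backward states.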
  
2. general, also called "luong": multiplicative (Luong et al., EMNLP'2015)
  $$
  score(h_t, \bar{h}_s) = h_t^T W_a \bar{h}_s 
  $$
  
3. concat, also called "bahdanau": additive (Bahdanau et al., ICLR'2015)
  $$
  score(h_t, \bar{h}_s) = v_a^T tanh(W_a [h_t; \bar{h}_s])
  $$
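
The three scoring variants can be compared side by side in a small pure-Python sketch for a single pair of states. The helper names (`dot`, `matvec`, `score_dot`, `score_general`, `score_concat`) and all weight values are assumptions chosen for illustration; in a real model $W_a$ and $v_a$ are learned.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    # W is a list of rows; returns W v.
    return [dot(row, v) for row in W]

def score_dot(h_t, h_s):
    # h_t^T h_s (requires equal dimensions)
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    # h_t^T W_a h_s
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    # v_a^T tanh(W_a [h_t; h_s]); list concatenation stands in for [h_t; h_s]
    hidden = [math.tanh(x) for x in matvec(W_a, h_t + h_s)]
    return dot(v_a, hidden)

# Hypothetical 2-dimensional states and weights.
h_t = [1.0, 0.0]
h_s = [0.5, 2.0]
W_identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # shape [2, 4]
v_a = [1.0, 1.0]

s_dot = score_dot(h_t, h_s)
s_general = score_general(h_t, h_s, W_identity)
s_concat = score_concat(h_t, h_s, W_concat, v_a)
```

With an identity $W_a$ the general score reduces to the dot score, which is a quick sanity check on the implementation.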

### 1.2 Computing the context vector
  $c_t = \sum_s {a_t(s) \bar{h}_s }$
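
The weighted sum can be written out directly. This is a sketch for one example without batching; `context_vector` and the sample numbers are illustrative assumptions.

```python
def context_vector(weights, encoder_states):
    """c_t = sum_s a_t(s) * h_s for a single example."""
    dim = len(encoder_states[0])
    return [sum(w * h[i] for w, h in zip(weights, encoder_states))
            for i in range(dim)]

# Hypothetical attention weights over three encoder states of dimension 2.
weights = [0.5, 0.25, 0.25]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
c_t = context_vector(weights, encoder_states)
```

The result has the same dimension as a single encoder state, regardless of the source length.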

## 2. Walkthrough of the `tf.contrib.seq2seq` code in TensorFlow
For the attention part, the basic flow is:

1. Call `prepare_attention`, which returns `(attention_keys, attention_values, attention_score_fn, attention_construct_fn)`. Here `attention_score_fn` computes the context vector ($c_t$) above, and `attention_construct_fn` computes the decoder input for the next time step.
2. Call `attention_decoder_fn_train`, which returns the `decoder_function` needed for training.
3. Call `attention_decoder_fn_inference`, which returns the `decoder_function` needed at test time.
4. Call `dynamic_rnn_decoder(cell, decoder_fn, inputs=None, sequence_length=None, parallel_iterations=None, swap_memory=False, time_major=False, scope=None, name=None)`, which returns `(outputs, final_state, final_context_state)`. This function is described in section 2.2.

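### 2.1 `decoder_function` in detail
Note that the training-time and test-time `decoder_function`s differ in how the next time step's `cell_input` is constructed. During training it is read directly from the training sample. At test time, the `attention` computed at the previous step is used as `cell_output`, `cell_output = output_fn(cell_output)` then produces the `logits`, and an `argmax` over them gives the next `cell_input`. The English comment on `attention_decoder_fn_inference` in the source code reads: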

>The main difference between this decoder function and the `decoder_fn` in `attention_decoder_fn_train` is how `next_cell_input` is calculated. In decoder function we calculate the next input by applying an argmax across the feature dimension of the output from the decoder. This is a greedy-search approach. (Bahdanau et al., 2014) & (Sutskever et al., 2014) use beam-search instead.
    
The inputs and outputs of `decoder_function` are defined as follows:

**Args**:

- `time`: positive integer constant reflecting the current timestep.
- `cell_state`: state of RNNCell.
- `cell_input`: input provided by `dynamic_rnn_decoder`.
- `cell_output`: output of RNNCell.
- `context_state`: context state provided by `dynamic_rnn_decoder`.

**Returns**: A tuple `(done, next_state, next_input, emit_output, next_context_state)` where:

- `done`: A boolean vector indicating which sentences have reached an `end_of_sequence_id`. This is used for early stopping by the `dynamic_rnn_decoder`. When `time>=maximum_length` a boolean vector with all elements as `true` is returned.
- `next_state`: `cell_state`, this decoder function does not modify the given state.
- `next_input`: The embedding from argmax of the `cell_output` is used as `next_input`.
- `emit_output`: If `output_fn is None` the supplied `cell_output` is returned, else the `output_fn` is used to update the `cell_output` before calculating `next_input` and returning `cell_output`.
- `next_context_state`: `context_state`, this decoder function does not modify the given context state. The context state could be modified when applying e.g. beam search.
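
To make the contract concrete, here is a pure-Python sketch of a greedy inference-style decoder function that follows the same five-argument, five-result shape. It is not the TensorFlow implementation: `make_greedy_decoder_fn`, `argmax`, the list-based batches, and the string placeholders for states are all assumptions made for illustration.

```python
def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

def make_greedy_decoder_fn(output_fn, initial_state, embeddings,
                           start_id, end_id, maximum_length, batch_size):
    def decoder_fn(time, cell_state, cell_input, cell_output, context_state):
        if cell_output is None:
            # Initial call: nothing decoded yet, so start from the
            # start-of-sequence token and the supplied initial state.
            next_ids = [start_id] * batch_size
            done = [False] * batch_size
            cell_state = initial_state
        else:
            if output_fn is not None:
                cell_output = [output_fn(o) for o in cell_output]
            # Greedy choice: argmax over the feature dimension of each output.
            next_ids = [argmax(logits) for logits in cell_output]
            done = [i == end_id for i in next_ids]
        if time >= maximum_length:
            done = [True] * batch_size
        next_input = [embeddings[i] for i in next_ids]
        # State and context state are passed through unchanged.
        return done, cell_state, next_input, cell_output, context_state
    return decoder_fn

# Hypothetical vocabulary of three ids with 2-dimensional embeddings.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
decoder_fn = make_greedy_decoder_fn(None, "encoder_state", embeddings,
                                    start_id=0, end_id=2,
                                    maximum_length=5, batch_size=2)
first = decoder_fn(0, None, None, None, None)
step = decoder_fn(1, "state_1", None, [[0.1, 0.9, 0.0], [0.0, 0.2, 0.8]], None)
```

In the second call the first sentence picks id 1 and continues, while the second picks the end id 2 and is marked done.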

### 2.2 `dynamic_rnn_decoder` in detail

Dynamic RNN decoder for a sequence-to-sequence model specified by RNNCell and decoder function.

The `dynamic_rnn_decoder` is similar to the `tf.python.ops.rnn.dynamic_rnn` as the decoder does not make any assumptions of sequence length and batch size of the input.

The `dynamic_rnn_decoder` has two modes, training and inference, and expects the user to create separate functions for each.

Under both training and inference, both `cell` and `decoder_fn` are expected, where `cell` performs computation at every timestep using `raw_rnn`, and `decoder_fn` allows modeling of early stopping, output, state, and next input and context.

When training the user is expected to supply `inputs`. At every time step a slice of the supplied input is fed to the `decoder_fn`, which modifies and returns the input for the next time step.

`sequence_length` is needed at training time, i.e., when `inputs` is not None, for dynamic unrolling. At test time, when `inputs` is None, `sequence_length` is not needed.

Under inference `inputs` is expected to be `None` and the input is inferred solely from the `decoder_fn`.

**Args**:

- `cell`: An instance of RNNCell.
- `decoder_fn`: A function that takes time, cell state, cell input, cell output and context state. It returns an early stopping vector, cell state, next input, cell output and context state. Examples of decoder_fn can be found in the decoder_fn.py folder.
- `inputs`: The inputs for decoding (embedded format). If `time_major == False` (default), this must be a `Tensor` of shape: `[batch_size, max_time, ...]`. If `time_major == True`, this must be a `Tensor` of shape: `[max_time, batch_size, ...]`. The input to `cell` at each time step will be a `Tensor` with dimensions `[batch_size, ...]`.
- `sequence_length`: (optional) An int32/int64 vector sized `[batch_size]`. If `inputs` is not None and `sequence_length` is None, it is inferred from the `inputs` as the maximal possible sequence length.
- `parallel_iterations`: (Default: 32).  The number of iterations to run in parallel.  Those operations which do not have any temporal dependency and can be run in parallel, will be.  This parameter trades off time for space.  Values >> 1 use more memory but take less time, while smaller values use less memory but computations take longer.
- `swap_memory`: Transparently swap the tensors produced in forward inference but needed for back prop from GPU to CPU.  This allows training RNNs which would typically not fit on a single GPU, with very minimal (or no) performance penalty.
- `time_major`: The shape format of the `inputs` and `outputs` Tensors. If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`. If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`. Using `time_major = True` is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation.  However, most TensorFlow data is batch-major, so by default this function accepts input and emits output in batch-major form.
- `scope`: VariableScope for the `raw_rnn`; defaults to None.
- `name`: NameScope for the decoder; defaults to "dynamic_rnn_decoder"

**Returns**: A tuple `(outputs, final_state, final_context_state)` where:

- `outputs`: the RNN output 'Tensor'. If time_major == False (default), this will be a `Tensor` shaped: `[batch_size, max_time, cell.output_size]`. If time_major == True, this will be a `Tensor` shaped: `[max_time, batch_size, cell.output_size]`.
- `final_state`: The final state and will be shaped `[batch_size, cell.state_size]`.
- `final_context_state`: The context state returned by the final call to decoder_fn. This is useful if the context state maintains internal data which is required after the graph is run. For example, one way to diversify the inference output is to use a stochastic decoder_fn, in which case one would want to store the decoded outputs, not just the RNN outputs. This can be done by maintaining a TensorArray in context_state and storing the decoded output of each iteration therein.

**Raises**:

- `ValueError`: if inputs is not None and has less than three dimensions.
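
The interplay between `cell`, `decoder_fn`, `inputs`, and `sequence_length` can be pictured with a heavily simplified, eager stand-in for the decoding loop. This sketch handles one sequence, uses plain Python values, and is not how TensorFlow implements it (the real function builds a graph loop on `raw_rnn`); `run_decoder`, `make_train_decoder_fn`, `running_sum_cell`, and `max_steps` are hypothetical names.

```python
def run_decoder(cell, decoder_fn, inputs=None, sequence_length=None, max_steps=100):
    """Toy decoding loop returning (outputs, final_state, final_context_state)."""
    if inputs is not None and sequence_length is None:
        sequence_length = len(inputs)  # inferred as the maximal possible length
    outputs = []
    time = 0
    next_slice = inputs[0] if inputs is not None else None
    # Initial call: there is no cell output or cell state yet.
    done, state, cell_input, _, context_state = decoder_fn(
        time, None, next_slice, None, None)
    if done is None:  # training-style decoder_fn: stop on sequence_length
        done = time >= sequence_length
    while not done and time < max_steps:
        cell_output, state = cell(cell_input, state)
        time += 1
        if inputs is not None:
            # Training: feed the next slice of the supplied inputs.
            next_slice = inputs[time] if time < len(inputs) else 0.0
        else:
            # Inference: the next input comes solely from decoder_fn,
            # which must then also return a non-None done flag.
            next_slice = None
        done, state, cell_input, emit_output, context_state = decoder_fn(
            time, state, next_slice, cell_output, context_state)
        if done is None:
            done = time >= sequence_length
        outputs.append(emit_output)
    return outputs, state, context_state

def make_train_decoder_fn(initial_state):
    def decoder_fn(time, cell_state, cell_input, cell_output, context_state):
        if cell_state is None:
            cell_state = initial_state
        # Pass the supplied input through; leave stopping to sequence_length.
        return None, cell_state, cell_input, cell_output, context_state
    return decoder_fn

def running_sum_cell(x, state):
    # Hypothetical "RNN cell": output and new state are both state + x.
    new_state = state + x
    return new_state, new_state

outputs, final_state, final_context_state = run_decoder(
    running_sum_cell, make_train_decoder_fn(0.0), inputs=[1.0, 2.0, 3.0])
```

Here the loop runs for three steps because `sequence_length` is inferred from `inputs`, and the outputs are the running sums of the inputs.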

### 2.3 `prepare_attention`
**Args**:

- attention_states: hidden states to attend over. (Corresponds to $\bar{h}_s$ above; its shape is `[batch_size, attention_length, num_units]`.)
- attention_option: how to compute attention, either "luong" or "bahdanau".
- num_units: hidden state dimension. (The dimension of the decoder state.)
- reuse: whether to reuse variable scope.

**Returns**:

- attention_keys: to be compared with target states. ($= W_{h_s} \bar{h}_s$)
- attention_values: to be used to construct context vectors. ($= \bar{h}_s$)
- attention_score_fn: to compute similarity between key and target states. (Returned by `_create_attention_score_fn`.)
- attention_construct_fn: to build attention states. (Returned by `_create_attention_construct_fn`.)
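
The relation between the attention states, keys, and values listed above can be sketched for a single example as follows. `prepare_keys_values` and the matrix `W` are illustrative assumptions; `W` plays the role of $W_{h_s}$, which is a learned variable in the real code.

```python
def prepare_keys_values(attention_states, W):
    """attention_states: list of encoder states; W: rows of shape [num_units][state_dim]."""
    # keys = W h_s for every source position s
    keys = [[sum(w * x for w, x in zip(row, h)) for row in W]
            for h in attention_states]
    # values = h_s, the unprojected encoder states
    values = attention_states
    return keys, values

# Hypothetical encoder states and projection matrix.
states = [[1.0, 1.0], [0.0, 3.0]]
W = [[2.0, 0.0], [0.0, 1.0]]
keys, values = prepare_keys_values(states, W)
```

Projecting the keys once, before decoding starts, avoids repeating the same matrix product at every decoder step.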

### 2.4 `_create_attention_score_fn`

**Args**:

- name: to label variables.
- num_units: hidden state dimension. (The dimension of the decoder state.)
- attention_option: how to compute attention, either "luong" or "bahdanau".
- reuse: whether to reuse variable scope.

**Returns**:

- attention_score_fn: to compute similarity between key and target states.

This function returns an **`attention_score_fn(query, keys, values)`**, whose inputs and outputs are as follows:

**Args**:

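- `query`: the decoder state at time $t$ (shape: `[batch_size, num_units]`), corresponding to $h_t$ above.
- `keys`: the sequence of encoder states (shape: `[batch_size, attention_length, num_units]`), used mainly to compute the attention score. It corresponds to $\bar{h}_s$ above; in the actual implementation it is the `attention_keys` returned by `prepare_attention` ($= W_{h_s} \bar{h}_s$). Here `attention_length` is the length of the attention context: for global attention it is the length of the encoder sequence, and for local attention it is a configurable parameter.
- `values`: the tensor of encoder states to be weighted once the attention scores are known (shape: `[batch_size, attention_length, num_units]`), corresponding to $\bar{h}_s$ above and used mainly to compute the context vector.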

**Returns**:

- `context_vector`: corresponds to $c_t$ above.
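
A function with this `(query, keys, values)` shape can be sketched in pure Python for one example. The version below assumes dot-product scoring against keys that have already been projected; it illustrates the data flow only and is not the library's code.

```python
import math

def attention_score_fn(query, keys, values):
    """Return the context vector c_t for one example."""
    # Score every key against the query (dot product, an assumption here).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax over source positions gives the alignment weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the values gives the context vector.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical keys and values for two source positions.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
uniform_context = attention_score_fn([0.0, 0.0], keys, values)
peaked_context = attention_score_fn([10.0, 0.0], keys, values)
```

A zero query scores every position equally, so the context is the mean of the values; a query aligned with the first key concentrates almost all weight on the first value.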

### 2.5 `_create_attention_construct_fn`
**Args**:
    
- name: to label variables.
- num_units: hidden state dimension.
- attention_score_fn: to compute similarity between key and target states. (The function object returned by `_create_attention_score_fn`.)
- reuse: whether to reuse variable scope.

**Returns**:

- attention_construct_fn: to build attention states.

This function returns a **`construct_fn(attention_query, attention_keys, attention_values)`** function object, whose inputs and outputs are as follows:

**Args**:

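- `attention_query`: the current decoder state, corresponding to $h_t$ above.
- `attention_keys`: corresponds to the encoder's past states $\bar{h}_s$; in practice it is computed as $W_{h_s} \bar{h}_s$.
- `attention_values`: corresponds to the $\bar{h}_s$ used when computing the context vector ($c_t$).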

**Returns**:

- `attention`: computed by (1) calling `attention_score_fn` to produce the context vector ($c_t$), and (2) combining it with `attention_query` ($h_t$) and multiplying by a matrix to obtain $W [c_t, h_t]$, whose dimension is `num_units`.
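
The two steps can be sketched as a closure over a score function. This is a single-example illustration under stated assumptions: `make_construct_fn`, the stand-in `mean_score_fn`, and the matrix `W` are hypothetical, and the concatenation order follows the $[c_t, h_t]$ notation used here.

```python
def make_construct_fn(attention_score_fn, W):
    """W has num_units rows, each of length len(c_t) + len(h_t)."""
    def construct_fn(attention_query, attention_keys, attention_values):
        # Step 1: context vector c_t from the score function.
        context = attention_score_fn(attention_query, attention_keys, attention_values)
        # Step 2: concatenate [c_t, h_t] and project with W.
        concat = context + attention_query
        return [sum(w * x for w, x in zip(row, concat)) for row in W]
    return construct_fn

def mean_score_fn(query, keys, values):
    # Stand-in score function: uniform attention, i.e. the mean of the values.
    dim = len(values[0])
    return [sum(v[i] for v in values) / len(values) for i in range(dim)]

# Hypothetical W of shape [2, 4] that adds c_t and h_t element-wise.
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
construct_fn = make_construct_fn(mean_score_fn, W)
attention = construct_fn([1.0, 1.0],
                         [[0.0, 0.0], [0.0, 0.0]],
                         [[1.0, 0.0], [3.0, 2.0]])
```

The output has `num_units` entries (two here), matching the number of rows in `W`.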