### Attention Mechanism

The attention mechanism was first published in [Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf), where it is applied as a "layer" in a Seq2Seq model.

A Seq2Seq model has an **encoder** part and a **decoder** part. The attention layer passes extra information from the encoder context to the decoder. Using Andrew Ng's notation, $(h_1, h_2, \cdots, h_n)$ are the hidden states (outputs) of the encoder (say, an LSTM layer). $h_n$ is the final state of the encoder and also the initial state of the decoder, where it is denoted $s_0$. Without attention, the $i$-th decoder state (the one that produces the $i$-th translated word) is $s_i = f(s_{i-1}, y_{i-1})$, where $f$ denotes the feed-forward step. With the attention mechanism, a context $c_i$ is extracted from the encoder and fed to the decoder as an additional input: $s_i = f(s_{i-1}, y_{i-1}, c_i)$. $c_i$ is a weighted average of the encoder hidden states:

$$
c_i = \sum^{n}_{j=1}\alpha_{ij}h_j,
$$

Here the weight $\alpha_{ij}$ denotes the **"attention"** that the $i$-th output should pay to the $j$-th input, and

$$
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum^{n}_{k=1}\exp(e_{ik})},
$$

where $e_{ij}$ is computed from the previous decoder hidden state $s_{i-1}$ and the $j$-th encoder hidden state $h_{j}$ using a simple layer (with a single activation function).
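
The three formulas above can be sketched for a single decoder step in NumPy. This is a minimal illustration, not the paper's code: it assumes the additive scoring form $e_{ij} = v^\top \tanh(W_s s_{i-1} + W_h h_j)$, and the parameter names `W_s`, `W_h`, `v` and all dimensions are placeholders chosen for the example.

```python
import numpy as np

def additive_attention(s_prev, H, W_s, W_h, v):
    """Return (alpha_i, c_i) for one decoder step.

    s_prev: (d_s,)    previous decoder state s_{i-1}
    H:      (n, d_h)  encoder hidden states h_1..h_n
    W_s: (d_a, d_s), W_h: (d_a, d_h), v: (d_a,)  -- assumed parameter names
    """
    # e_ij = v^T tanh(W_s s_{i-1} + W_h h_j), one score per encoder position j
    e = np.tanh(W_s @ s_prev + H @ W_h.T) @ v
    # softmax over j; subtracting the max keeps exp() numerically stable
    e = e - e.max()
    alpha = np.exp(e) / np.exp(e).sum()
    # c_i = sum_j alpha_ij * h_j
    c = alpha @ H
    return alpha, c

rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 5, 4, 3, 6  # illustrative sizes
H = rng.normal(size=(n, d_h))
s_prev = rng.normal(size=d_s)
alpha, c = additive_attention(
    s_prev, H,
    rng.normal(size=(d_a, d_s)),
    rng.normal(size=(d_a, d_h)),
    rng.normal(size=d_a),
)
print(alpha.shape, c.shape)
```

The weights `alpha` sum to 1 over the encoder positions, and `c` has the same dimension as a single encoder state.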


### Apply to Other Model Structures

The idea of the attention mechanism can also be applied to other structures. I have used it several times in Kaggle competitions, so here I clean up and summarize the code from Kaggle kernels.

In [1]:
import numpy as np

import torch
import torch.nn as nn

import keras.backend as K
from keras import initializers, regularizers, constraints
from keras.layers import Layer

Using Theano backend.


#### Keras Implementation

https://keras.io/layers/writing-your-own-keras-layers/

- `build(input_shape)`: define weights, must set `self.built = True` at the end.
- `call(x)`: define layer logic.
- `compute_output_shape(input_shape)`: specify in case the layer modifies the shape of input.
    

In [2]:
class Attention(Layer):
    """
    Attention layer used in feed forward structure
    
    Adapted from
    author: @qqgeogor
    kaggle profile: https://www.kaggle.com/qqgeogor
    kernel: https://www.kaggle.com/qqgeogor/keras-lstm-attention-glove840b-lb-0-043
    
    Originally the idea from:
    https://arxiv.org/pdf/1512.08756.pdf
    """
    def __init__(self, step_dim, bias=True,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 **kwargs):
        """
        :param step_dim : int, number of time steps to use. After an RNN layer this is max_len.
        """
        super().__init__(**kwargs)
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')
        
        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)
        
        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)
        
        self.step_dim = step_dim
        self.bias = bias
        self.feature_dim = 0
        
    def build(self, input_shape):
        """ define weights """
        assert len(input_shape) == 3
        self.W = self.add_weight(shape=(input_shape[-1], ),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.feature_dim = input_shape[-1]
        
        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1], ),
                                      initializer='zero',
                                      name='{}_b'.format(self.name),
                                      regularizer=self.b_regularizer,
                                      constraint=self.b_constraint)
        
        self.built = True
    
    def call(self, x, mask=None):
        """ define structure """
        # e_ij = W * input + b
        e_ij = K.reshape(K.dot(K.reshape(x, (-1, self.feature_dim)), 
                                         K.reshape(self.W, (self.feature_dim, 1))),
                         (-1, self.step_dim))
        if self.bias:
            e_ij = e_ij + self.b
        
        e_ij = K.tanh(e_ij)
        a = K.exp(e_ij)
        
        if mask is not None:
            a = a * K.cast(mask, K.floatx())
        
        # softmax normalization
        a = a / K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)
        
        # output context
        weighted_input = x * a
        c = K.sum(weighted_input, axis=1)
        return c
    
    def compute_output_shape(self, input_shape):
        return input_shape[0], self.feature_dim

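To make the tensor shapes in `call()` easier to follow, here is a rough NumPy mirror of the same computation. It is only a sketch for checking shapes and masking behaviour: the function name `attention_pool`, the `eps` value, and the toy inputs are assumptions made for this example, not part of the Keras layer.

```python
import numpy as np

def attention_pool(x, W, b=None, mask=None, eps=1e-7):
    """x: (batch, step_dim, feature_dim), W: (feature_dim,),
    b: (step_dim,), mask: (batch, step_dim) with 1 = keep, 0 = padding."""
    # one score per time step: tanh(x . W + b) -> (batch, step_dim)
    e = np.tanh(x @ W + (0.0 if b is None else b))
    a = np.exp(e)
    if mask is not None:
        a = a * mask  # padded steps get zero weight
    # normalize over the time axis; eps guards against division by zero
    a = a / (a.sum(axis=1, keepdims=True) + eps)
    # weighted sum over time steps -> (batch, feature_dim)
    c = (x * a[..., None]).sum(axis=1)
    return c, a

x = np.arange(24, dtype=float).reshape(2, 3, 4) / 24.0
W = np.full(4, 0.5)
mask = np.array([[1, 1, 0], [1, 1, 1]], dtype=float)
c, a = attention_pool(x, W, b=np.zeros(3), mask=mask)
print(c.shape, a.shape)
```

The input of shape `(batch, step_dim, feature_dim)` is reduced to `(batch, feature_dim)`, which matches what `compute_output_shape` returns.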
#### Pytorch Implementation

Very neat. You only need to define the `__init__()` and `forward()` methods, as for any other neural network module.

In [3]:
class Attention(nn.Module):
    def __init__(self, feature_dim, step_dim, bias=True, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True
        self.feature_dim = feature_dim
        self.step_dim = step_dim
        self.bias = bias
        
        W = torch.zeros(feature_dim, 1)
        nn.init.xavier_uniform_(W)
        self.W = nn.Parameter(W)
        
        if bias:
            self.b = nn.Parameter(torch.zeros(step_dim))
            
    def forward(self, x, mask=None):
        e_ij = torch.mm(x.contiguous().view(-1, self.feature_dim), self.W).view(-1, self.step_dim)
        if self.bias:
            e_ij = e_ij + self.b
        
        e_ij = torch.tanh(e_ij)
        a = torch.exp(e_ij)
        if mask is not None:
            a = a * mask
        
        a = a / (torch.sum(a, 1, keepdim=True) + 1e-10)
        weighted_input = x * torch.unsqueeze(a, -1)
        c = torch.sum(weighted_input, 1)
        return c
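
As a usage sketch, the layer can sit between a recurrent layer and a classifier head to pool the per-step outputs into one vector. The code below is self-contained and restates the attention layer compactly. The class names `AttentionPool` and `LSTMAttentionClassifier` and all sizes are hypothetical, picked only for illustration.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Compact restatement of the attention layer (hypothetical name)."""
    def __init__(self, feature_dim, step_dim):
        super().__init__()
        self.W = nn.Parameter(torch.empty(feature_dim, 1))
        nn.init.xavier_uniform_(self.W)
        self.b = nn.Parameter(torch.zeros(step_dim))

    def forward(self, x, mask=None):
        # scores per time step: (batch, step_dim)
        e = torch.tanh(torch.matmul(x, self.W).squeeze(-1) + self.b)
        a = torch.exp(e)
        if mask is not None:
            a = a * mask
        a = a / (a.sum(dim=1, keepdim=True) + 1e-10)
        # weighted sum over time steps: (batch, feature_dim)
        return (x * a.unsqueeze(-1)).sum(dim=1)

class LSTMAttentionClassifier(nn.Module):
    """Bidirectional LSTM -> attention pooling -> linear head (illustrative)."""
    def __init__(self, input_dim, hidden_dim, max_len, n_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.attention = AttentionPool(2 * hidden_dim, max_len)
        self.out = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x, mask=None):
        h, _ = self.lstm(x)  # (batch, max_len, 2 * hidden_dim)
        return self.out(self.attention(h, mask))

torch.manual_seed(0)
model = LSTMAttentionClassifier(input_dim=8, hidden_dim=16,
                                max_len=10, n_classes=2)
logits = model(torch.randn(4, 10, 8))  # batch of 4 sequences, length 10
print(logits.shape)
```

Note that `step_dim` must equal the sequence length fed to the layer, because the bias has one entry per time step.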