# TensorFlow code for Self-Attention 

### References:
* [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
* [Transformer: A Novel Neural Network Architecture for Language Understanding](https://research.googleblog.com/2017/08/transformer-novel-neural-network.html)
* [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor)

In [1]:
import tensorflow as tf

In [2]:
def input_fun(**config):
    """Return a random-normal batch of shape [batch_size, sequence_length, hidden_dim]."""
    data = tf.random_normal((
        config['batch_size'], config['sequence_length'], config['hidden_dim']))
    return data

In [3]:
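def attention_fun(Q, K, scaled_=True, masked_=False):
    """Return softmax-normalized dot-product attention weights of Q against K."""
    # Raw scores: dot product of every query with every key.
    attention = tf.matmul(Q, K, transpose_b=True)  # [batch_size, sequence_length, sequence_length]

    if scaled_:
        # Divide by sqrt(d_k) so large dot products do not saturate the softmax.
        d_k = tf.cast(tf.shape(K)[-1], dtype=tf.float32)
        attention = tf.divide(attention, tf.sqrt(d_k))  # [batch_size, sequence_length, sequence_length]

    if masked_:
        raise NotImplementedError('masked attention is not implemented')

    # Normalize over the key axis so each query's weights sum to 1.
    attention = tf.nn.softmax(attention, axis=-1)  # [batch_size, sequence_length, sequence_length]
    return attention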
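
The `masked_` branch above is left unimplemented. As a rough illustration of what a causal (look-ahead) mask would do, here is a framework-free sketch in plain Python. The helper names `softmax` and `causal_attention_weights` are hypothetical and not part of the notebook. The sketch assumes a single sequence whose scores have already been scaled, and that the mask is applied before the softmax.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """Hypothetical helper: mask future positions, then softmax each row.

    scores[i][j] is the (already scaled) score of query i against key j.
    Entries with j > i are replaced by -inf, so they receive zero weight.
    """
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]


# Toy 3x3 score matrix for one sequence of length 3.
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 1.0],
          [2.0, 1.0, 0.0]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

In a TensorFlow version, the same idea would presumably be expressed by building a lower-triangular mask (for example with `tf.linalg.band_part`) and adding a large negative value to the masked scores before `tf.nn.softmax`. That is an assumption about one possible approach, not what this notebook does.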


In [4]:
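def model_fun(data, **config):
    """Single-head self-attention: Q, K, and V are all dense projections of `data`."""
    Q = tf.layers.dense(data, config['hidden_dim'])  # [batch_size, sequence_length, hidden_dim]
    K = tf.layers.dense(data, config['hidden_dim'])  # [batch_size, sequence_length, hidden_dim]
    # V is projected straight to n_classes, so the output holds per-position class scores.
    V = tf.layers.dense(data, config['n_classes'])  # [batch_size, sequence_length, n_classes]

    attention = attention_fun(Q, K)  # [batch_size, sequence_length, sequence_length]
    # Each output position is an attention-weighted average of the rows of V.
    output = tf.matmul(attention, V)  # [batch_size, sequence_length, n_classes]
    return output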

In [5]:
if __name__ == '__main__':
    inputs = input_fun(batch_size=32, sequence_length=10, hidden_dim=128)
    # Uncomment to inspect the random inputs:
    # with tf.Session() as sess: print(inputs.eval())
    outputs = model_fun(inputs, hidden_dim=128, n_classes=2)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        outputs_ = sess.run(outputs)
        print(outputs_.shape)

(32, 10, 2)
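
The printed shape can be checked by hand by tracing the two matrix products in `model_fun`. The sketch below does this with plain tuples. `matmul_shape` is a hypothetical helper written for this illustration; it assumes identical batch dimensions and does not handle broadcasting, and it is not a TensorFlow function.

```python
def matmul_shape(a, b, transpose_b=False):
    """Hypothetical helper: shape of a batched matrix product of shapes a and b."""
    if transpose_b:
        b = b[:-2] + (b[-1], b[-2])  # swap the last two axes of b
    if a[:-2] != b[:-2] or a[-1] != b[-2]:
        raise ValueError("incompatible shapes: %r and %r" % (a, b))
    return a[:-1] + (b[-1],)


# Same configuration as the run above.
batch_size, sequence_length, hidden_dim, n_classes = 32, 10, 128, 2
q_shape = k_shape = (batch_size, sequence_length, hidden_dim)
v_shape = (batch_size, sequence_length, n_classes)

# Q @ K^T gives one score per (query, key) pair.
attention_shape = matmul_shape(q_shape, k_shape, transpose_b=True)
# attention @ V mixes the rows of V for each query position.
output_shape = matmul_shape(attention_shape, v_shape)

print(attention_shape)  # (32, 10, 10)
print(output_shape)     # (32, 10, 2)
```

The softmax and the scaling step do not change shapes, so only the two `tf.matmul` calls matter for this trace.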
