<a href="https://colab.research.google.com/github/d61h6k4/notebooks/blob/master/Transformers_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install --upgrade tensorflow


In this notebook we follow Peter Bloem's blog post [Transformers from scratch](http://www.peterbloem.nl/blog/transformers).

To understand self-attention better, let's go step by step:
1. We have queries, keys and values:
$$
q_i = W_q x_i \quad k_i = W_k x_i \quad v_i = W_v x_i \\
w'_{ij} = q_i^T k_j \\
w_{ij} = \mathrm{softmax}_j(w'_{ij}) \\
y_i = \sum_j w_{ij} v_j
$$
2. Scaling the dot product by $\sqrt{k}$, where $k$ is the size of the input vectors:
$$
q_i = W_q x_i \quad k_i = W_k x_i \quad v_i = W_v x_i \\
w'_{ij} = \frac{q_i^T k_j}{\sqrt{k}} \\
w_{ij} = \mathrm{softmax}_j(w'_{ij}) \\
y_i = \sum_j w_{ij} v_j
$$
3. Multi-head attention, where $r$ indexes the head (the trick here is to store the weights of all heads in a single matrix):
$$
q_i = W^r_q x_i \quad k_i = W^r_k x_i \quad v_i = W^r_v x_i \\
w'_{ij} = \frac{q_i^T k_j}{\sqrt{k}} \\
w_{ij} = \mathrm{softmax}_j(w'_{ij}) \\
y_i = \sum_j w_{ij} v_j
$$
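
The formulas of steps 1 and 2 can be sketched for a single sequence as follows. This is a minimal illustration, not the notebook's implementation: it uses NumPy instead of TensorFlow to keep the shapes easy to inspect, and the names `softmax`, `self_attention`, `w_q`, `w_k` and `w_v` are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row maximum improves numerical stability
    # and does not change the result.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (t, k) sequence of row vectors; w_q, w_k, w_v: (k, k) matrices."""
    k = x.shape[-1]
    q = x @ w_q.T      # row i is q_i = W_q x_i
    keys = x @ w_k.T   # row j is k_j = W_k x_j
    v = x @ w_v.T      # row j is v_j = W_v x_j
    raw = q @ keys.T / np.sqrt(k)  # raw[i, j] = q_i^T k_j / sqrt(k)
    w = softmax(raw, axis=-1)      # normalise over j
    return w @ v                   # y_i = sum_j w_ij v_j

rng = np.random.default_rng(0)
t, k = 5, 4
x = rng.normal(size=(t, k))
w_q, w_k, w_v = (rng.normal(size=(k, k)) for _ in range(3))
y = self_attention(x, w_q, w_k, w_v)
print(y.shape)  # (5, 4)
```

Each output row $y_i$ is a weighted average of the value vectors, with weights that sum to one over $j$.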

Let's write out the dimensions of each operation to see exactly what happens: \\
$X$, the input, is in $\mathbb{R}^{b \times t \times k}$, where
$b$ is the batch size, $t$ is the length of the input sentence and $k$ is the size of a word vector.

$W_q \in \mathbb{R}^{h*k \times k}$ is the concatenation of $h$ matrices, each the size of $W_q$ in steps 1 and 2. \\
$q = X W_q^T \in \mathbb{R}^{b \times t \times h*k}$, which is how we will implement it. \\
$k = X W_k^T \in \mathbb{R}^{b \times t \times h*k}$ (on the left-hand side $k$ denotes the keys, not the vector size).
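
As a quick shape check of the concatenation trick, here is a small NumPy sketch (the sizes and the names `per_head` and `W_q` are chosen only for illustration). It assumes `W_q` is stored with shape `(h*k, k)` and applied to the row vectors of `X` through its transpose, and it verifies that each slice of the result matches the projection with that head's own matrix.

```python
import numpy as np

b, t, k, h = 2, 5, 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(b, t, k))

# One (k, k) matrix per head, stacked along the rows: shape (h*k, k).
per_head = [rng.normal(size=(k, k)) for _ in range(h)]
W_q = np.concatenate(per_head, axis=0)

# A single matrix product computes the queries of all heads at once.
q = X @ W_q.T
print(q.shape)  # (2, 5, 12)

# Slice r of the last axis equals the projection with head r's matrix.
r = 1
q_r = X @ per_head[r].T
print(np.allclose(q[..., r * k:(r + 1) * k], q_r))  # True
```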

The next operation is $w'_{ij} = \frac{q_i^Tk_j}{\sqrt{k}}$, which we compute in the following steps:
1. Reshape $q$ and $k$ ($b \times t \times h*k \rightarrow b \times t \times h \times k$).
2. Transpose (you can think of a transpose as swapping dimensions, so we get $b \times t \times h \times k \rightarrow b \times h \times t \times k$).
3. Reshape ($b \times h \times t \times k \rightarrow b*h \times t \times k$). Now we can think of $q$ and $k$ as $h$ times as many batches of the original (not multi-headed) $q$ and $k$ respectively.
4. For efficiency it is better to apply the scaling by $\sqrt{k}$ before the dot product, so we divide each of the arguments $q$ and $k$ by $\sqrt[4]{k}$ (because $\sqrt[4]{k} \cdot \sqrt[4]{k} = \sqrt{k}$).
5. Now we multiply $q$ and $k^T$ with a batched matrix multiplication (a matrix multiplication applied independently to each element of the batch dimension) and get $w' \in \mathbb{R}^{b*h \times t \times t}$.
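
The five steps above can be traced on random data. The sketch below uses NumPy rather than TensorFlow so that it stays lightweight; `fold_heads` is a hypothetical helper name and the sizes are arbitrary. It also checks that dividing both arguments by $\sqrt[4]{k}$ gives the same result as dividing the product by $\sqrt{k}$.

```python
import numpy as np

b, t, k, h = 2, 5, 4, 3
rng = np.random.default_rng(0)
q = rng.normal(size=(b, t, h * k))
keys = rng.normal(size=(b, t, h * k))

def fold_heads(x):
    x = x.reshape(b, t, h, k)      # step 1: split the last axis into heads
    x = x.transpose(0, 2, 1, 3)    # step 2: (b, h, t, k)
    return x.reshape(b * h, t, k)  # step 3: merge heads into the batch axis

q_f, k_f = fold_heads(q), fold_heads(keys)

# Step 4: scale each argument by the fourth root of k.
scale = k ** 0.25

# Step 5: batched matrix product over the leading (b*h) axis.
w_raw = (q_f / scale) @ (k_f / scale).transpose(0, 2, 1)
print(w_raw.shape)  # (6, 5, 5)

# Same result as scaling the product by sqrt(k) afterwards.
reference = q_f @ k_f.transpose(0, 2, 1) / np.sqrt(k)
print(np.allclose(w_raw, reference))  # True
```

After folding, entry `i * h + r` along the first axis holds head `r` of batch element `i`.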


In [0]:
import tensorflow as tf

In [0]:
class SelfAttention(tf.keras.layers.Layer):
    def __init__(self, heads=8):
        super().__init__()
        self.__heads = heads

    def build(self, input_shape):
        # k is the size of the input vectors, as in the original blog post
        k = input_shape[-1]
        # These compute the queries, keys and values for all
        # heads (as single concatenated vectors)
        self.tokeys = tf.keras.layers.Dense(k * self.__heads, activation="linear", use_bias=False)
        self.toqueries = tf.keras.layers.Dense(k * self.__heads, activation="linear", use_bias=False)
        self.tovalues = tf.keras.layers.Dense(k * self.__heads, activation="linear", use_bias=False)

        # This unifies the outputs of the different heads into
        # a single k-vector
        self.unifyheads = tf.keras.layers.Dense(k, activation="linear")

    def call(self, inputs):
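        # Batch size and sequence length may only be known at run time,
        # while the vector size k is static (a plain Python int).
        shape = tf.shape(inputs)
        b, t = shape[0], shape[1]
        k = inputs.shape[-1]
        h = self.__heads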

        queries = tf.reshape(self.toqueries(inputs), [b, t, h, k])
        keys = tf.reshape(self.tokeys(inputs), [b, t, h, k])
        values = tf.reshape(self.tovalues(inputs), [b, t, h, k])

        queries = tf.transpose(queries, perm=[0, 2, 1, 3])
        keys = tf.transpose(keys, perm=[0, 2, 1, 3])
        values = tf.transpose(values, perm=[0, 2, 1, 3])

        queries = tf.reshape(queries, [b * h, t, k])
        keys = tf.reshape(keys, [b * h, t, k])
        values = tf.reshape(values, [b * h, t, k])

        queries = queries / (k ** (1 / 4))
        keys = keys / (k ** (1 / 4))

        dot = tf.linalg.matmul(queries, keys, transpose_b=True)
        dot = tf.math.softmax(dot, axis=2)

        out = tf.linalg.matmul(dot, values)
        out = tf.reshape(out, [b, h, t, k])
        out = tf.transpose(out, perm=[0, 2, 1, 3])
        out = tf.reshape(out, [b, t, h * k])

        return self.unifyheads(out)