In [None]:
import numpy as np
from misc_tools.print_latex import print_tex

# Generative Pre-trained Transformer (GPT) networks

## What is GPT
GPT is a transformer-based neural network architecture used for sequence modeling. 
Earlier architectures such as the RNN and, by extension, the LSTM work in a cyclic, unrolled fashion: they process one sequence element at a time together with an encoded representation of the previous context. RNNs have difficulty tracking long-range (history) dependencies because of "vanishing" or "exploding" gradients: the first sequence entry affects the current entry only through many unfoldings of the RNN. The cyclic RNN architecture also prevents effective parallelization, since it is essentially a for-loop. 
The GPT architecture addresses the long-range dependency problem with an "attention mechanism", which is also parallelizable because it is implemented with matrix multiplications. 

## Basic idea of sequence prediction in GPT
The general steps of prediction with GPT are the following:
1. Transform input data entries into vectors (called `embeddings`) which, by construction,<br>
capture relations between them (and other entries of the same input type);

2. Assign an importance, or weight, to each embedding via the `attention mechanism`, <br>
based on the context of neighboring embeddings in the sequence;

3. Compute a weighted sum of the embeddings (called `aggregation`) using the weights from step 2.
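
Step 1 can be illustrated with a lookup table. This is only a minimal sketch: the names `VOCAB_SIZE`, `EMBED_DIM` and `embedding_table` are hypothetical, and the table is filled with random numbers, whereas in a real model its entries are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 5   # hypothetical number of distinct tokens
EMBED_DIM = 3    # hypothetical embedding size

# One row per token; random here, learned in a real model.
embedding_table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

token_ids = np.array([3, 0, 3, 1])   # a sequence of N = 4 token ids
H = embedding_table[token_ids]       # shape (N, EMBED_DIM), one embedding per row

print(H.shape)                       # (4, 3)
print(np.allclose(H[0], H[2]))       # True: the same token gives the same embedding
```

Note that identical tokens map to identical rows regardless of where they appear in the sequence.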


## GPT requirements for "robustness"
1. Context length can be as low as `1` and as high as some `MAX_CONTEXT_LENGTH` = $N$.

## (Dummy problem) Aggregation of all sequence elements via weighted sum

Let's assume that for a sequence of length $N$ we can compute the embeddings and store them as rows of a matrix $H$:
$$H = 
\begin{bmatrix}
\vec{h}_0^T \\\vec{h}_1^T  \\ \vdots \\ \vec{h}_{N - 1}^T
\end{bmatrix}
$$
Suppose we have an array of weights $\vec{w}^T$
$$
\vec{w}^T =
\begin{bmatrix}
w_0 & w_1 & \dots & w_{N - 1}
\end{bmatrix}
$$

Given the equation $\vec{y}^T = \vec{x}^T M$, we know that multiplying a row vector by a matrix on the right yields $\vec{y}^T$ as a linear combination of the rows of $M$, with the entries of $\vec{x}$ as coefficients.

So, in our case:
$$
\vec{h}^{T\prime} = 
\vec{w}^T H = 
\begin{bmatrix}
w_0 & w_1 & \dots & w_{N - 1}
\end{bmatrix}
\begin{bmatrix}
\vec{h}_0^T \\\vec{h}_1^T  \\ \vdots \\ \vec{h}_{N - 1}^T
\end{bmatrix}
=
\sum_{i=0}^{N-1} w_i \vec{h}_i^T
$$
$$
\mathrm{dim(\vec{h}^{T\prime})} = \mathrm{dim(\vec{h}_i^T)}
$$

In [34]:
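# Three embeddings stored as rows of H, and one scalar weight per embedding
h0, h1, h2 = H = np.array([[2,0,0],[0,-1,0],[0,0,1]])
w0, w1, w2 = wT = np.array([1,1,1])
print_tex(r'\vec{w}^{T} = ', wT, r'; \ H = ', H)
# The matrix product and the explicit weighted sum give the same row vector
print_tex(r'\vec{h}^{T\prime} = \vec{w}^TH = ', wT @ H)
print_tex(r'\vec{h}^{T\prime} = w_0 \vec{h}_0^T + w_1 \vec{h}_1^T + w_2 \vec{h}_2^T  = ', w0*h0 +  w1*h1 + w2*h2)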


<i>This dummy problem is not practical, but shows that given correct weights and embeddings, information can be collected and passed between elements of a sequence. </i>

## Attention mechanism
The attention mechanism determines the weights used for embedding aggregation.
### Query (Q) and Key (K) matrices

The weight for each individual embedding is derived from the interaction between this embedding and its neighbors.

The interaction (or attention) within a pair is not symmetric.

For example, in natural language processing (NLP) a sentence, which is a sequence of words, has a specific structure: <i>subject $\rightarrow$ verb $\rightarrow$ object</i>. Thus, the ordered pairs (subject, verb) and (verb, object) occur more frequently in text and are highly 'correlated'.

This asymmetry requires each embedding $\vec{h}_i$ to have two descriptors: 
1. key $\vec{k}_i$ - 'what am I'
2. query $\vec{q}_i$ - 'who I am looking for'
$$
\vec{q}_i = f_q(\vec{h}_i) \ \stackrel{query}{\longleftarrow} \ \vec{h}_i \ \stackrel{key}{\longrightarrow} \ f_k(\vec{h}_i) = \vec{k}_i
$$

The strength of the connection between $\vec{h}_i$ and a neighbor $\vec{h}_j$ is determined by the dot product of its `query` $\vec{q}_i$ with the neighbor's `key` $\vec{k}_j$.

$$Q=
\begin{bmatrix}
\vec{q}_0^T \\\vec{q}_1^T  \\ \vdots \\ \vec{q}_{N - 1}^T
\end{bmatrix}
; \ 
K = 
\begin{bmatrix}
\vec{k}_0^T \\\vec{k}_1^T  \\ \vdots \\ \vec{k}_{N - 1}^T
\end{bmatrix}
; \ K^T=
\begin{bmatrix}
\vec{k}_0 & \vec{k}_1 & \dots & \vec{k}_{N - 1}
\end{bmatrix}
$$

All pairwise dot products can be packed into a matrix $A$:
$$
A = Q K^T 
; \ \mathrm{dim(A)} =
[ N \times N]
$$

$$A_{i,j} = \sum_{f=0}^{F-1} Q_{i,f} (K^T)_{f,j} = \sum_{f=0}^{F-1} Q_{i,f} K_{j,f} = \vec{q}_i^T \vec{k}_j
\rightarrow A = Q K^T =
\begin{bmatrix}
\vec{q}_0^T \vec{k}_0 & \vec{q}_0^T \vec{k}_1 & \dots \\
\vec{q}_1^T \vec{k}_0 & \vec{q}_1^T \vec{k}_1 & \dots \\
\vdots & \vdots & \ddots
\end{bmatrix}
$$
Here $F$ is the length of the query and key vectors, and the $i$-th row holds the interaction strength between embedding $\vec{h}_i$ and every embedding in the sequence:
$$A_{i} = 
\begin{bmatrix}
\alpha_{i,0} & \alpha_{i,1} & \dots & \alpha_{i,N - 1}
\end{bmatrix}=
\begin{bmatrix}
\vec{q}_i^T \vec{k}_0 & \vec{q}_i^T \vec{k}_1 & \dots & \vec{q}_i^T \vec{k}_{N - 1}  
\end{bmatrix}
$$
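
As a small numerical sketch of $A = Q K^T$: the text leaves $f_q$ and $f_k$ unspecified, so the example below assumes they are plain linear maps with hypothetical matrices `W_q` and `W_k`, filled with random numbers instead of learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, E, F = 4, 6, 3              # sequence length, embedding size, query/key size

H = rng.normal(size=(N, E))    # embeddings as rows
W_q = rng.normal(size=(E, F))  # hypothetical query projection (assumed linear f_q)
W_k = rng.normal(size=(E, F))  # hypothetical key projection (assumed linear f_k)

Q = H @ W_q                    # (N, F), rows are q_i^T
K = H @ W_k                    # (N, F), rows are k_i^T
A = Q @ K.T                    # (N, N), A[i, j] = q_i . k_j

# Entry-wise check against the dot-product definition
print(np.isclose(A[1, 2], Q[1] @ K[2]))   # True
# Compare A with its transpose to see whether the scores are symmetric
print(np.allclose(A, A.T))
```

Because queries and keys come from different projections, `A[i, j]` and `A[j, i]` are in general different numbers, which matches the asymmetry argument above.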

### Value (V) matrix and aggregation
Additionally, we need a `value` representation $\vec{v}_i$ of each embedding, which is what gets aggregated using the weights $A_{i,j}$:

$$ \vec{h}_i \ \stackrel{value}{\longrightarrow} \ f_v(\vec{h}_i) = \vec{v}_i $$

We get a new neighbor-aware `value` embedding $\vec{v}_i^\prime$ by aggregating all neighbor values $\vec{v}_j$ using the weights $\alpha_{i,j}$:

$$V = 
\begin{bmatrix}
\vec{v}_0^T \\ \vec{v}_1^T  \\ \vdots \\ \vec{v}_{N - 1}^T
\end{bmatrix}
; \ \mathrm{dim(V)} =
[ N \times E]
$$

$$
\vec{v}_i^{T\prime} = 
A_{i} V = 
\begin{bmatrix}
\alpha_{i,0} & \alpha_{i,1} & \dots & \alpha_{i,N - 1}
\end{bmatrix}
\begin{bmatrix}
\vec{v}_0^T \\ \vec{v}_1^T  \\ \vdots \\ \vec{v}_{N - 1}^T
\end{bmatrix}=
\sum_{k = 0}^{N-1} \alpha_{i,k}\vec{v}_k^T 
$$
Or, packing all $\vec{v}_i^\prime$ into a matrix $V^\prime$:

$$V^\prime = 
\begin{bmatrix}
\vec{v}_0^{T\prime} \\ \vec{v}_1^{T\prime}  \\ \vdots \\ \vec{v}_{N - 1}^{T\prime}
\end{bmatrix}=
AV
; \ \mathrm{dim(V^\prime)} =
[ N \times E]
$$
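
The equivalence between the row-wise sum and the matrix product $AV$ can be checked numerically. In this sketch `A` and `V` are random stand-ins rather than outputs of a trained model, and the name `V_new` (for $V^\prime$) is chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, E = 4, 3

A = rng.normal(size=(N, N))   # stand-in attention weights alpha_{i,k}
V = rng.normal(size=(N, E))   # rows are v_k^T

V_new = A @ V                 # (N, E): all aggregated rows at once

# Row i of V_new is the weighted sum of all rows of V
i = 2
row_i = sum(A[i, k] * V[k] for k in range(N))
print(np.allclose(V_new[i], row_i))   # True
print(V_new.shape)                    # (4, 3)
```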


### Masking communication/aggregation
Embeddings in $H$ are ordered by their appearance in a sequence.<br>
Since we are performing a forecasting task, we know that the last entry in a sequence is influenced by the previous entries through some rule (which we ultimately want to uncover by training a model). In fact, this applies to every sequence entry. 

This means that for the embedding with index `i` we should consider only neighbors up to and including `i`, which can be done by setting the weights of all later positions ($k > i$) to zero and keeping only:

$$\alpha_{i,k}: k \in \{0,1,\dots, i\}$$ 

$$\vec{v}_i^{T\prime} = \sum_{k = 0}^{i} \alpha_{i,k}\vec{v}_k^T  $$

$$
\vec{v}_1^{T\prime} = 
A_{1}^{masked} V = 
\begin{bmatrix}
1 & 1 & 0  & \dots &  0 
\end{bmatrix}
\begin{bmatrix}
\vec{v}_0^T \\ \vec{v}_1^T \\ \vec{v}_2^T  \\ \vdots \\ \vec{v}_{N - 1}^T
\end{bmatrix}
=
\begin{bmatrix}
\vec{v}_0^T & + & \vec{v}_1^T & + & \vec{0}^T & + & \dots & + &  \vec{0}^T 
\end{bmatrix}
=
\vec{v}_0^T + \vec{v}_1^T
$$

Due to the ordered nature of a sequence
* the first entry $\vec{v}_0^{\prime}$ will have all summation weights zeroed except its own,
* the last entry $\vec{v}_{N-1}^{\prime}$ will have all weights available.

The resulting mask $T$ is lower-triangular. 

For example, let's perform aggregation using the mask alone:
$$
V^\prime = 
\begin{bmatrix}
\vec{v}_0^{T\prime} \\ \vec{v}_1^{T\prime} \\ \vdots \\ \vec{v}_{N-1}^{T\prime}
\end{bmatrix}=
T V = 
\begin{bmatrix}
1 & 0 & \dots & 0 \\
1 & 1 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \dots & 1
\end{bmatrix}
\begin{bmatrix}
\vec{v}_0^{T} \\ \vec{v}_1^{T} \\  \vdots \\ \vec{v}_{N-1}^{T}
\end{bmatrix}=
\begin{bmatrix}
\vec{v}_0^{T} \\ \vec{v}_0^{T} + \vec{v}_1^{T}  \\ \vdots \\ \vec{v}_0^{T} + \dots + \vec{v}_{N - 1}^{T}
\end{bmatrix}
$$

In [44]:
v0, v1, v2 = V = np.array([[2,0,0],[0,-1,0],[0,0,1]])
A = np.tril(np.ones((3,3)))  # lower-triangular mask: row i keeps entries 0..i

print_tex(r'\vec{v}_0^T =', v0, r'; \ \vec{v}_1^T =', v1, r'; \ \vec{v}_2^T =', v2, r'; \ V = ', V)
print_tex(r'A = ', A, r'; \ V^\prime = A V = ', A @ V)
# Each row of A V is a weighted sum of the rows of V with scalar weights A[i, k]
print_tex(r'A_0 V = 1 \vec{v}_0^T + 0 \vec{v}_1^T + 0 \vec{v}_2^T  = ', A[0,0]*v0 + A[0,1]*v1 + A[0,2]*v2)
print_tex(r'A_1 V = 1 \vec{v}_0^T + 1 \vec{v}_1^T + 0 \vec{v}_2^T  = ', A[1,0]*v0 + A[1,1]*v1 + A[1,2]*v2)
print_tex(r'A_2 V = 1 \vec{v}_0^T + 1 \vec{v}_1^T + 1 \vec{v}_2^T  = ', A[2,0]*v0 + A[2,1]*v1 + A[2,2]*v2)



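<i>In practice the rows $A_i$ are normalized with softmax; masked entries are set to $-\infty$ beforehand, so they receive zero weight and each row sums to one.</i>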
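
A minimal sketch of how softmax normalization can be combined with the lower-triangular mask. The helper name `masked_softmax` is hypothetical, and filling future positions with $-\infty$ before the softmax is one common way to do it, not necessarily the only one.

```python
import numpy as np

def masked_softmax(scores):
    """Row-wise softmax that ignores positions after the diagonal."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))          # True where k <= i
    masked = np.where(mask, scores, -np.inf)             # block future positions
    masked = masked - masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                             # exp(-inf) == 0
    return weights / weights.sum(axis=1, keepdims=True)

scores = np.zeros((3, 3))    # equal raw scores for every pair
A = masked_softmax(scores)
print(A)
# rows: [1, 0, 0], [1/2, 1/2, 0], [1/3, 1/3, 1/3]
```

With equal raw scores, row `i` simply averages the first `i + 1` value vectors, which is the normalized version of the mask-only aggregation shown above.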

## Positional encoding
Because summation is invariant to ordering, while the elements of a sequence must respect their order, we introduce positional encoding.

The easiest practical approach is to inject positional information $\vec{p}_i$ directly into the original embedding $\vec{h}_i$:

$$H^\prime = H + P = 
\begin{bmatrix}
\vec{h}_0^T \\\vec{h}_1^T  \\ \vdots \\ \vec{h}_{N - 1}^T
\end{bmatrix}+
\begin{bmatrix}
\vec{p}_0^T \\\vec{p}_1^T  \\ \vdots \\ \vec{p}_{N - 1}^T
\end{bmatrix}
$$

Positional encoding can be fixed or learned via NN.
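
As an illustration of a fixed encoding, the sketch below uses the sinusoidal scheme from the original Transformer paper. This is an assumption chosen for the example: the text does not prescribe a particular encoding, the function name `sinusoidal_positions` is hypothetical, and the embedding size is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(n_positions, dim):
    """Fixed sinusoidal encoding; dim is assumed to be even."""
    positions = np.arange(n_positions)[:, None]              # (N, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))    # (dim / 2,)
    angles = positions * freqs                               # (N, dim / 2)
    P = np.zeros((n_positions, dim))
    P[:, 0::2] = np.sin(angles)   # even columns
    P[:, 1::2] = np.cos(angles)   # odd columns
    return P

N, E = 4, 6
H = np.ones((N, E))                       # identical embeddings at every position
H_pos = H + sinusoidal_positions(N, E)    # H' = H + P

# Rows that were identical are now distinguishable by position
print(np.allclose(H_pos[0], H_pos[1]))    # False
```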

## Forecasting
After applying the attention mechanism, we have aggregated the $N$ old embeddings into $N$ new neighbor-aware embeddings. 
In the case of forecasting, each new embedding contains some information from its time-previous neighbors. Even if we originally needed only the final prediction, we can use all new token embeddings to predict intermediate results, which forces the model to learn to analyze sequences of any length up to the context window.

Typically, in NLP, the new embeddings are projected to the size of the vocabulary, softmax is applied, and the cross-entropy loss is computed against the known next tokens.
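
The projection, softmax and cross-entropy steps can be sketched as follows. All sizes, the output projection `W_out` and the target ids are made up for illustration, and the "new embeddings" are random stand-ins for the output of the attention step.

```python
import numpy as np

rng = np.random.default_rng(0)
N, E, VOCAB = 4, 3, 5                     # hypothetical sizes

V_new = rng.normal(size=(N, E))           # stand-in neighbor-aware embeddings
W_out = rng.normal(size=(E, VOCAB))       # hypothetical output projection
targets = np.array([1, 4, 0, 2])          # known next token for each position

logits = V_new @ W_out                                  # (N, VOCAB)
logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Cross-entropy: mean negative log-probability of the correct next token
loss = -np.log(probs[np.arange(N), targets]).mean()
print(probs.shape, loss)
```

Every one of the $N$ positions contributes a term to the loss, so a single sequence trains the model on contexts of length $1$ through $N$ at once.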