In [2]:
# Run this once if PyTorch is not already installed in the current Jupyter kernel
# !conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y

In [1]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


PyTorch can be installed together with a copy of NVIDIA's CUDA toolkit, which gives it access to the machine's dedicated GPU if there is one.
The `device` variable records whether a GPU is available, so we know whether our code can run on the GPU instead of the CPU. For more information on install options, see: https://pytorch.org/get-started/locally/
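
As a minimal sketch of how `device` is typically used afterwards (the tensor names `x`, `x_dev`, and `y` are illustrative and not part of the notebook), a tensor can be moved to the selected device so that later operations run there:

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tensors are created on the CPU by default; .to(device) places them on the target device.
x = torch.ones(2, 3)
x_dev = x.to(device)

# Operations run on the device where their inputs live.
y = x_dev @ x_dev.T
print(y.device.type)  # "cuda" on a GPU machine, "cpu" otherwise
```

The same code runs unchanged on machines with or without a GPU, which is the main reason to route everything through a single `device` variable.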

# Transformers

In this notebook we will implement one of the most popular architectures in the deep learning community: the Transformer.
Transformers were introduced in the 2017 Google paper "Attention Is All You Need", whose main idea is to use the so-called "attention mechanism" as the core of the architecture. It started as a sequence modeling architecture, more specifically for language modeling, but in recent years it has spread to almost every major field of deep learning: vision with ViT, RL with Decision Transformers, speech with the Conformer, and many more. One of my goals with this project is to implement all of these architectures and see how they compare with more traditional approaches such as LSTMs for language, CNNs for vision, and encoder-decoder models for speech.

In this notebook in particular, we will implement the basic, core transformer architecture.

## 1. The Attention Mechanism

Note: All vectors here are column vectors, so $x \in R^{n}$ is a column of size $n$ and $x^T$ is a row of size $n$.

The key idea behind transformers, as said above, is the attention mechanism. It can be represented as 3 matrices that act as parameters of the model; Query, Key, and Value are the usual names given to these matrices.
Self-attention is a sequence-to-sequence operator: it takes *t* vectors as input, each in $R^{n}$, and also outputs *t* vectors in $R^{n}$. Another way of seeing it is that it takes an element of $R^{n \times t}$ and outputs another element of $R^{n \times t}$. One way of doing this is to multiply the input matrix of vectors (let's call it $X \in R^{n \times t}$) by a matrix $W$ of size $t \times t$, so the output is another matrix (let's call it $Y \in R^{n \times t}$). This is exactly what the basic attention mechanism does. Each output column vector $y_i \in R^{n}$ is calculated as:

$$ y_i = \sum_j w_{ij}x_j \text{ where } w_{ij} = softmax(x_i^T \cdot X, \text{row wise})_j $$

That is, $w_{ij}$ is the $j$-th entry of the vector obtained by applying a softmax operation to $x_i^T \cdot X$.

Let $w_i \in R^t := softmax(x_i^T \cdot X, \text{row wise})^T$. Then:

$$ y_i = X \cdot w_i$$

Moreover, let $W := softmax(X^T \cdot X, \text{row wise})^T = [w_1 | ... | w_i | ... | w_t]$. Then:

$$ Y =  X \cdot W = [y_1 | ... | y_i | ... | y_t]$$

So we have that:

$$ Y = X \cdot softmax(X^T \cdot X, \text{row wise})^T$$
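
The equivalence between the column-by-column definition and the matrix form can be checked numerically. The following is a small sketch under assumed sizes (`n = 4`, `t = 3`) with random input; the variable names are illustrative:

```python
import torch

torch.manual_seed(0)
n, t = 4, 3                      # illustrative sizes: t column vectors of dimension n
X = torch.randn(n, t)

# Column-by-column definition: y_i = sum_j w_ij * x_j, with w_i = softmax(x_i^T X).
cols = []
for i in range(t):
    scores = X[:, i] @ X                 # x_i^T X, shape (t,)
    w_i = torch.softmax(scores, dim=0)   # weights over j, they sum to 1
    cols.append(X @ w_i)                 # y_i = X w_i, shape (n,)
Y_loop = torch.stack(cols, dim=1)        # [y_1 | ... | y_t], shape (n, t)

# Matrix form: Y = X softmax(X^T X, row wise)^T.
W = torch.softmax(X.T @ X, dim=1).T      # dim=1 normalizes each row
Y_matrix = X @ W

# Both constructions should agree up to floating-point error.
print(torch.allclose(Y_loop, Y_matrix, atol=1e-5))
```

Here `dim=1` is what "row wise" means in PyTorch terms: the softmax is taken along each row, so every row of `softmax(X.T @ X, dim=1)` sums to 1.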

Note that since $X^T \cdot X$ is a symmetric matrix (this is easy to check: $(X^T \cdot X)^T = X^T \cdot X$), the above formula can also be written as:

$$ Y = X \cdot softmax((X^T \cdot X)^T, \text{row wise})^T$$

But we also know that $softmax(B, \text{row wise})^T = softmax(B^T, \text{column wise})$ for any matrix $B$. Then:

$$ Y = X \cdot softmax(X^T \cdot X, \text{column wise})$$
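
Both steps of this rewriting can be verified with a short sketch. It assumes random test matrices; `B` is deliberately not symmetric, to show that the transpose identity does not rely on symmetry:

```python
import torch

torch.manual_seed(0)
B = torch.randn(3, 3)            # arbitrary square matrix, not symmetric in general

# Identity: softmax(B, row wise)^T == softmax(B^T, column wise)
lhs = torch.softmax(B, dim=1).T          # row-wise softmax, then transpose
rhs = torch.softmax(B.T, dim=0)          # column-wise softmax of the transpose
identity_holds = torch.allclose(lhs, rhs)

# Because X^T X is symmetric, the row-wise and column-wise forms of Y coincide.
X = torch.randn(4, 3)
Y_row = X @ torch.softmax(X.T @ X, dim=1).T
Y_col = X @ torch.softmax(X.T @ X, dim=0)
same_output = torch.allclose(Y_row, Y_col, atol=1e-5)

print(identity_holds, same_output)
```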

Notice that the input matrix $X$ appears 3 times in the formula above. Self-attention replicates this behavior with 3 different matrices, which are parameters that the model needs to optimize via backpropagation.

So, in self-attention the role of the first appearance of $X$ is played by the Value matrix, the second by the Query matrix, and the third by the Key matrix. The formula for $Y$ becomes:

$$
Y = V \cdot softmax(Q^T \cdot K, \text{column wise})
$$

Where $Q, K, V \in R^{n \times t}$, so that $Q^T \cdot K$ is a $t \times t$ matrix and $Y \in R^{n \times t}$.
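
A minimal sketch of this formula, following it exactly as written above. It assumes that `Q`, `K`, and `V` are passed in as plain tensors with `t` columns each (here all of shape `(n, t)`, so the product is defined); the function name `self_attention` and the sizes are illustrative, not part of the notebook:

```python
import torch

def self_attention(q, k, v):
    """Compute Y = V · softmax(Q^T · K, column wise) for inputs of shape (n, t)."""
    scores = q.T @ k                         # (t, t) matrix of dot products
    weights = torch.softmax(scores, dim=0)   # dim=0: each column sums to 1
    return v @ weights                       # (n, t)

torch.manual_seed(0)
n, t = 4, 3
# requires_grad=True marks the matrices as quantities to optimize by backpropagation.
Q = torch.randn(n, t, requires_grad=True)
K = torch.randn(n, t, requires_grad=True)
V = torch.randn(n, t, requires_grad=True)

Y = self_attention(Q, K, V)
Y.sum().backward()                           # gradients flow back to Q, K and V
print(Y.shape, Q.grad.shape)

# With Q = K = V = X the function reduces to the basic formula derived earlier.
X = torch.randn(n, t)
Y_basic = self_attention(X, X, X)
```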




In [None]:
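def basic_self_att_1(x):
    # Basic self-attention without learned parameters: Y = X · softmax(X^T · X, column wise)
    return x @ torch.softmax(x.T @ x, dim=0)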

In [34]:
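from torch.nn import Softmax
import numpy as np

# A symmetric 3x3 matrix, so x equals its own transpose.
x = torch.tensor([[1, 2, 3], [2, 2, 4], [3, 4, 5]], dtype=torch.float32)
print(x.shape)
print(x)
print(torch.transpose(x, 1, 0))  # same as x because x is symmetric
# Check softmax(B, row wise)^T == softmax(B^T, column wise); here B = x = x^T.
print(x.softmax(0))                  # column-wise softmax: each column sums to 1
print(x.softmax(1).transpose(0, 1))  # row-wise softmax, then transposed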

torch.Size([3, 3])
tensor([[1., 2., 3.],
        [2., 2., 4.],
        [3., 4., 5.]])
tensor([[1., 2., 3.],
        [2., 2., 4.],
        [3., 4., 5.]])
tensor([[0.0900, 0.1065, 0.0900],
        [0.2447, 0.1065, 0.2447],
        [0.6652, 0.7870, 0.6652]])
tensor([[0.0900, 0.1065, 0.0900],
        [0.2447, 0.1065, 0.2447],
        [0.6652, 0.7870, 0.6652]])
