<a href="https://colab.research.google.com/github/arkeodev/nlp/blob/main/Simple_Transformers/simple_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Transformers Architecture

## Introduction

Transformers are at the heart of modern natural language processing (NLP), driving advances in language translation, text generation, question answering, and more. Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., they moved the field away from recurrent sequence-to-sequence models such as Recurrent Neural Networks (RNNs). Unlike RNNs, which process a sequence one step at a time, transformers process all positions of a sequence at once, which makes training far easier to parallelize.

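At their core, transformers use a mechanism called "self-attention," which lets them weigh how relevant each part of the input is to every other part. Because all positions are processed simultaneously rather than one after another, the number of sequential steps per layer drops from linear in the sequence length (as in an RNN) to constant. The trade-off is that self-attention compares every pair of positions, so its compute and memory cost per layer grows quadratically with sequence length. On parallel hardware this is still much faster to train and scales well in practice.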

Today, transformers are the foundation of many popular NLP models, including ChatGPT and Gemini; the "T" in GPT stands for Transformer. These models can understand and generate human-like text, translate languages, summarize documents, and much more, showcasing the versatility of the transformer architecture on complex NLP tasks.

## Understanding the Transformer Architecture

<figure>
    <img src="https://raw.githubusercontent.com/arkeodev/nlp/main/Simple_Transformers/images/overall-architecture -of-transformers.png" width="400" height="400" alt="Transformers Architecture">
    <figcaption>Transformers Architecture</figcaption>
</figure>

A transformer consists of two main parts: the encoder, which reads the input sequence and turns it into a sequence of contextual representations, and the decoder, which uses those representations to generate the output sequence.

## Encoder and Decoder Stacks

Both the encoder and the decoder are stacks of identical layers. Each encoder layer has two sub-layers: a self-attention mechanism and a position-wise feed-forward network. Each decoder layer has the same two sub-layers plus a third one, placed between them, that performs attention over the encoder's output.
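The layer structure described above can be sketched in NumPy. This is a minimal illustration under simplifying assumptions, not the full architecture: it uses single-head attention without learned projections, omits residual connections, layer normalization, and the decoder's masking, and reuses one pair of feed-forward weights for both layers. The function names (`attention`, `feed_forward`, `encoder_layer`, `decoder_layer`) and all sizes are chosen here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on 2-D inputs (seq_len, d_model)."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores) @ values

def feed_forward(x, w1, w2):
    """Position-wise feed-forward network: linear, ReLU, linear (biases omitted)."""
    return np.maximum(0.0, x @ w1) @ w2

def encoder_layer(x, w1, w2):
    x = attention(x, x, x)            # sub-layer 1: self-attention
    return feed_forward(x, w1, w2)    # sub-layer 2: feed-forward

def decoder_layer(y, memory, w1, w2):
    y = attention(y, y, y)            # sub-layer 1: self-attention over the target
    y = attention(y, memory, memory)  # sub-layer 2: attention over the encoder output
    return feed_forward(y, w1, w2)    # sub-layer 3: feed-forward

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
source = rng.normal(size=(6, d_model))  # 6 source positions
target = rng.normal(size=(4, d_model))  # 4 target positions
w1 = rng.normal(size=(d_model, d_ff)) * 0.1
w2 = rng.normal(size=(d_ff, d_model)) * 0.1

memory = encoder_layer(source, w1, w2)
decoded = decoder_layer(target, memory, w1, w2)
print(memory.shape)   # (6, 8)
print(decoded.shape)  # (4, 8)
```

Note that the decoder output keeps the target length (4) even though it attends over 6 encoder positions: the queries come from the decoder, while the keys and values come from the encoder.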

## Key Components of Transformer Layers

### Multi-Head Self-Attention

Instead of a single attention mechanism, transformers use multiple attention heads that run in parallel. Each head applies its own learned projections to the queries, keys, and values, so different heads can capture information from different representation subspaces. The outputs of all heads are concatenated and projected back to the model dimension.
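A minimal NumPy sketch of multi-head self-attention follows. The weight matrices are random stand-ins for learned parameters, and the names `w_q`, `w_k`, `w_v`, `w_o`, `split_heads`, and the sizes used are assumptions made for this illustration. It also assumes `d_model` is divisible by `num_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each head computes its own (seq_len, seq_len) attention pattern
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads and project back to d_model
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

output, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(output.shape)   # (6, 8)
print(weights.shape)  # (2, 6, 6)
```

Splitting `d_model` across heads keeps the total cost close to that of one full-width head while letting each head learn a separate attention pattern.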

### Position-wise Feed-Forward Networks

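Each layer contains a fully connected feed-forward network that is applied to every position separately and identically: the same weights transform each element of the sequence, with no interaction between positions. Mixing information across positions is the job of the attention sub-layer; the feed-forward network then transforms each position's representation on its own.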
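The "position-wise" property can be checked directly: applying the network to the whole sequence gives the same result as applying it to each row on its own. The sketch below uses random weights and a ReLU between two linear maps; the function name `position_wise_ffn` and the sizes are illustrative assumptions.

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    """Two linear maps with a ReLU in between, applied along the last axis."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 32
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Apply to the whole sequence at once ...
whole = position_wise_ffn(x, w1, b1, w2, b2)
# ... and to each position independently
row_by_row = np.stack([position_wise_ffn(row, w1, b1, w2, b2) for row in x])

print(np.allclose(whole, row_by_row))  # True
```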

### Residual Connections and Layer Normalization

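Each sub-layer is wrapped in a residual connection followed by layer normalization. The residual path lets gradients flow directly through the network, and normalization keeps activations on a consistent scale; together they stabilize the training of deep stacks.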
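As a rough sketch, the "add and normalize" step can be written as `LayerNorm(x + Sublayer(x))`, which is the arrangement used in the original paper. The helper names `layer_norm` and `add_and_norm`, the `eps` value, and the toy sub-layer below are assumptions for illustration; `gamma` and `beta` stand in for learned scale and shift parameters.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))
w = rng.normal(size=(d_model, d_model)) * 0.1

# A toy sub-layer (a single linear map) standing in for attention or the FFN
out = add_and_norm(x, lambda t: t @ w, gamma=np.ones(d_model), beta=np.zeros(d_model))

print(out.shape)  # (6, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))  # True
```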

### Diving into Self-Attention

The self-attention mechanism is what allows transformers to process a sequence in parallel. For each element, it assigns a weight to every element in the sequence (including itself) based on how relevant that element is. Self-attention can be described with three main components: Queries, Keys, and Values.

- **Queries**: Vectors that are matched against the keys to decide which elements of the sequence matter most for a given position.
- **Keys**: Vectors paired one-to-one with the values; comparing a query with each key produces the attention weights.
- **Values**: Vectors that hold the actual information of each element; the output is the sum of the values, weighted by the query-key scores.

Imagine you are in a library with a huge collection of books (the sequence), and you are looking for information on a specific topic.

- The **query** is like your question about the topic you’re interested in.
- The **keys** represent the index or summary of each book.
- The **values** are the actual contents of the books.

The librarian (the self-attention mechanism) checks your question against all summaries (keys) to determine which books (values) contain the information you need. Every position in the sequence asks its own question, and all of these lookups are computed at the same time, which is what makes the transformer model so efficient.
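The analogy above can be made concrete with a small sketch of a single lookup. In a transformer, queries, keys, and values are typically obtained by multiplying the token representations by learned matrices; here `W_q`, `W_k`, and `W_v` are random stand-ins, and all names and sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 4
X = rng.normal(size=(seq_len, d_model))  # one row per token (one "book" each)

# Random stand-ins for learned projection matrices
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# One question: the query of token 0 is compared against every key
scores = Q[0] @ K.T / np.sqrt(d_k)  # one score per book summary
weights = softmax(scores)           # how much to read from each book
answer = weights @ V                # blend of book contents

print(weights.shape)  # (4,)
print(answer.shape)   # (4,)
```

Unlike a real librarian, attention does not pick a single book: it returns a weighted mix of all values, with the best-matching keys contributing the most.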

The following formula represents the scaled dot-product attention, which is the foundation of self-attention:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Where $Q$, $K$, and $V$ are the matrices for queries, keys, and values respectively, and $d_k$ is the dimension of the keys.
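The division by $\sqrt{d_k}$ keeps the scores in a range where the softmax does not saturate. If the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance $d_k$, so its typical magnitude grows with the dimension. The following sketch checks this empirically; the sample count and the dimensions tried are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 5_000
results = {}

for d_k in (4, 64, 256):
    q = rng.normal(size=(num_samples, d_k))
    k = rng.normal(size=(num_samples, d_k))
    dots = np.sum(q * k, axis=-1)          # one dot product per sample
    raw_std = dots.std()                   # grows roughly like sqrt(d_k)
    scaled_std = (dots / np.sqrt(d_k)).std()  # stays close to 1
    results[d_k] = (raw_std, scaled_std)
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:4.2f}")
```

Without the scaling, large dimensions would push the softmax toward near one-hot outputs, where gradients become very small.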

In code, assuming we have matrices for queries, keys, and values, it can be implemented simply as:

```python
import numpy as np

# Define the softmax function
def softmax(x, axis=-1):
    """Compute softmax values for each set of scores in x over the specified axis."""
    # Subtracting the max improves numerical stability without changing the result
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

# Define the scaled dot product attention function
def scaled_dot_product_attention(Q, K, V):
    """Q, K, V have shape (batch, seq_len, d_k). Returns (output, attention_weights)."""
    matmul_qk = np.matmul(Q, K.transpose(0, 2, 1))  # (batch, seq_len, seq_len)
    dk = K.shape[-1]
    scaled_attention_logits = matmul_qk / np.sqrt(dk)
    attention_weights = softmax(scaled_attention_logits)  # each row sums to 1
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Let's consider two sentences for our demonstration
sentences = ["The cat sits on the mat", "A dog lies on the rug"]

# Define embeddings for each unique word in the sentences
# Embeddings are crafted so that similar words have closer embeddings
word_embeddings = {
    "the": np.array([1, 0, 0, 0, 0, 0, 0, 0]),
    "cat": np.array([0, 1, 0, 0, 0.1, 0.2, 0.3, 0.4]),
    "sits": np.array([0, 0.9, 1, 0, 0.2, 0.1, 0.4, 0.3]),
    "on": np.array([0, 0, 0, 1, 0, 0, 0, 0]),
    "mat": np.array([0, 0.7, 0.2, 0.3, 0, 1, 0, 0]),
    "a": np.array([1, 0, 0, 0, 0, 0, 0, 0.1]),
    "dog": np.array([0, 0.9, 0.1, 0, 0.2, 0.3, 0.4, 0]),
    "lies": np.array([0, 1, 0.8, 0.2, 0.3, 0.1, 0.4, 0]),
    "rug": np.array([0, 0.9, 0.3, 0.4, 0, 0.9, 0.1, 0])
}
embedding_dim = len(word_embeddings["the"])

# Tokenize the sentences
tokenized_sentences = [sentence.lower().split() for sentence in sentences]

# Convert sentences to embeddings using the predefined embeddings
sentence_embeddings = [[word_embeddings[word] for word in sentence] for sentence in tokenized_sentences]

# Pad sentences to the same length with zero vectors of the embedding size
# (both example sentences already have six tokens, so no padding is added here)
max_length = max(len(sentence) for sentence in tokenized_sentences)
padded_embeddings = [
    np.array(sentence + [np.zeros(embedding_dim)] * (max_length - len(sentence)))
    for sentence in sentence_embeddings
]

# Stack the embeddings to create the Q, K, and V matrices, shape (batch, seq_len, embedding_dim).
# For simplicity the raw embeddings are reused for all three;
# a trained transformer would first apply learned projections.
Q = np.stack(padded_embeddings)
K = Q.copy()
V = Q.copy()

# Apply the scaled dot product attention function
attention_output, attention_weights = scaled_dot_product_attention(Q, K, V)

# Let's print the attention weights for the first sentence
print("Attention weights for the first sentence:")
print(attention_weights[0])

# Let's print the attention weights for the second sentence
print("Attention weights for the second sentence:")
print(attention_weights[1])
```

```
Attention weights for the first sentence:
[[0.20795408 0.14602296 0.14602296 0.14602296 0.20795408 0.14602296]
 [0.13376468 0.21181252 0.20301404 0.13376468 0.13376468 0.18387941]
 [0.12475757 0.18934399 0.26305682 0.12475757 0.12475757 0.17332649]
 [0.15299844 0.15299844 0.15299844 0.21788799 0.15299844 0.17011824]
 [0.20795408 0.14602296 0.14602296 0.14602296 0.20795408 0.14602296]
 [0.13073607 0.17971615 0.18163247 0.14536482 0.13073607 0.23181441]]
Attention weights for the second sentence:
[[0.20853701 0.14591549 0.14591549 0.14591549 0.20780103 0.14591549]
 [0.13285432 0.19670349 0.20522848 0.13285432 0.13285432 0.19950506]
 [0.12172423 0.18803512 0.24168897 0.13064304 0.12172423 0.19618442]
 [0.15039178 0.15039178 0.16141108 0.21417579 0.15039178 0.17323778]
 [0.20795408 0.14602296 0.14602296 0.14602296 0.20795408 0.14602296]
 [0.12181493 0.1829274  0.19633061 0.14031983 0.12181493 0.23679229]]
```
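The demonstration pads sentences to a common length, although both example sentences happen to have six tokens. When padding is actually present, a zero vector still receives a non-zero attention weight, because its score of 0 becomes `exp(0) = 1` inside the softmax. A common remedy is to mask padded key positions before the softmax. The sketch below is one way to do that, not part of the code above; the function name `masked_attention`, the `key_mask` argument, and the use of `-1e9` as the fill value are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def masked_attention(Q, K, V, key_mask):
    """key_mask: boolean (batch, seq_len), True for real tokens, False for padding."""
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(K.shape[-1])
    # Replace scores of padded keys with a very negative number so softmax gives ~0
    scores = np.where(key_mask[:, None, :], scores, -1e9)
    weights = softmax(scores)
    return weights @ V, weights

# A batch with one sentence: three real tokens and one zero-padded position
X = np.array([[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]])
key_mask = np.array([[True, True, True, False]])

_, unmasked = masked_attention(X, X, X, np.ones_like(key_mask))
_, masked = masked_attention(X, X, X, key_mask)

print(unmasked[0, :, -1])  # padding column receives non-zero weight
print(masked[0, :, -1])    # padding column is effectively zero
```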



### Positional Encodings

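Self-attention on its own has no notion of sequence order: if the input tokens are shuffled, each token's output stays the same and is merely shuffled along with it. To supply this information, positional encodings are added to the input embeddings, giving every position a distinct signature. This lets the model take word order into account, which is crucial in language processing.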
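One option, used in the original paper, is a fixed sinusoidal encoding: even dimensions use $\sin(pos / 10000^{2i/d_{model}})$ and odd dimensions use the matching cosine. The sketch below is a minimal NumPy version that assumes an even `d_model`; the function name and the random stand-in embeddings are chosen here for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal encodings (d_model must be even)."""
    positions = np.arange(seq_len)[:, None]         # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding

seq_len, d_model = 6, 8
pe = sinusoidal_positional_encoding(seq_len, d_model)

# Random stand-in embeddings; the encoding is simply added element-wise
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(seq_len, d_model))
inputs_with_position = embeddings + pe

print(pe.shape)  # (6, 8)
print(pe[0])     # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

Learned positional embeddings are a common alternative; the sinusoidal form needs no training and is defined for any position.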


## References

- Attention is All You Need (Base Paper): https://arxiv.org/pdf/1706.03762.pdf