# Transformer from Scratch Tutorial

This tutorial walks through cloning the `TransformerScratch` repository from GitHub, installing its dependencies, and running a forward pass through its from-scratch Transformer implementation on a small English-German example.

## Step 1: Cloning the Repository

First, clone the repository. Run the following command in your terminal (in a Jupyter notebook, prefix it with `!`):

```bash
git clone https://github.com/atikul-islam-sajib/TransformerScratch.git
```

## Step 2: Navigating to the Repository

Once the repository is cloned, move into its directory. In a Jupyter notebook on Google Colab, where the clone lands under `/content`, change the directory with the following command (adjust the path for other environments):

```python
%cd /content/TransformerScratch
```

If you are using a terminal, use:

```bash
cd TransformerScratch
```

## Step 3: Installing Dependencies

Before running the code, install the required dependencies. They are typically listed in a `requirements.txt` file at the repository root or mentioned in the `README.md`. Install them using `pip` (in a terminal, drop the leading `!`):

```python
!pip install -r requirements.txt
```

If there's no `requirements.txt` file, you may need to manually install the required packages mentioned in the `README.md` or the source code files.
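
As a minimal sketch, the check for `requirements.txt` can be scripted before installing. The helper name `pip_install_command` is hypothetical and not part of the repository; it only builds the command and leaves running it to you.

```python
import sys
from pathlib import Path


def pip_install_command(repo_dir="."):
    """Return the pip command for repo_dir, or None if requirements.txt is missing.

    Hypothetical helper for illustration; not part of TransformerScratch.
    """
    requirements = Path(repo_dir) / "requirements.txt"
    if not requirements.is_file():
        return None
    # Use the current interpreter so packages land in the active environment.
    return [sys.executable, "-m", "pip", "install", "-r", str(requirements)]


command = pip_install_command(".")
if command is None:
    print("No requirements.txt found; check README.md for the package list.")
else:
    print("Run:", " ".join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` to perform the installation from a script.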

## Step 4: Setting Up the Data

Prepare the data for the model. Here, we use five parallel sentences in English and German, where each German sentence translates the English sentence at the same index:

```python
english = [
    "The sun is shining brightly today",
    "I enjoy reading books on rainy afternoons",
    "The cat sat on the windowsill watching the birds",
    "She baked a delicious chocolate cake for dessert",
    "We went for a long walk in the park yesterday",
]

german = [
    "Die Sonne scheint heute hell",
    "Ich lese gerne Bücher an regnerischen Nachmittagen",
    "Die Katze saß auf der Fensterbank und beobachtete die Vögel",
    "Sie hat einen leckeren Schokoladenkuchen zum Nachtisch gebacken",
    "Wir sind gestern lange im Park spazieren gegangen",
]

if len(english) != len(german):
    raise ValueError("The English and German lists must contain the same number of sentences")
```

## Step 5: Defining Parameters

Define the parameters for the Transformer model:

```python
MAX_LENGTH = 200           # Maximum length of the input sequences
BATCH_SIZE = 2             # Number of samples per batch
EMBEDDING_DIMENSION = 512  # Dimensionality of the embedding vectors
NUM_ENCODER_LAYERS = 8     # Number of encoder layers in the Transformer
NUM_DECODER_LAYERS = 8     # Number of decoder layers in the Transformer
NUM_HEADS = 8              # Number of attention heads
DIM_FEEDFORWARD = 2048     # Dimensionality of the feedforward network
DROPOUT = 0.1              # Dropout rate
LAYER_NORM_EPS = 1e-5      # Epsilon for layer normalization
```
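
Multi-head attention splits each embedding vector evenly across the heads, so the embedding dimension should be divisible by the number of heads. The sketch below checks this for the values above. Whether the repository enforces the constraint itself is not confirmed here, and `head_dimension` is an illustrative name.

```python
# Values copied from the parameter block above.
EMBEDDING_DIMENSION = 512
NUM_HEADS = 8
DROPOUT = 0.1

# Each attention head works on an equal slice of the embedding vector.
if EMBEDDING_DIMENSION % NUM_HEADS != 0:
    raise ValueError("EMBEDDING_DIMENSION must be divisible by NUM_HEADS")
if not 0.0 <= DROPOUT < 1.0:
    raise ValueError("DROPOUT must be in [0, 1)")

head_dimension = EMBEDDING_DIMENSION // NUM_HEADS
print(head_dimension)  # 64
```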

## Step 6: Initializing Tokenizers

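Create one tokenizer per language. Each tokenizer's `create_dataloader()` call returns a dictionary holding the batched `dataloader` and the `vocab_size` for that language: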

```python
from src.tokenizer import Tokenizer

english_tokenizer = Tokenizer(
    text=english,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
    max_length=MAX_LENGTH,
    batch_size=BATCH_SIZE,
)
english_tokenizer_results = english_tokenizer.create_dataloader()
english_dataloader = english_tokenizer_results["dataloader"]
english_vocab_size = english_tokenizer_results["vocab_size"]

german_tokenizer = Tokenizer(
    text=german,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
    max_length=MAX_LENGTH,
    batch_size=BATCH_SIZE,
)
german_tokenizer_results = german_tokenizer.create_dataloader()
german_dataloader = german_tokenizer_results["dataloader"]
german_vocab_size = german_tokenizer_results["vocab_size"]
```
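
To make the roles of `padding`, `truncation`, and `max_length` concrete, here is a minimal word-level sketch in plain Python. It is not the repository's `Tokenizer`, whose internals are not shown in this tutorial. The function names, the `<pad>` token, and the mask convention (1 for a real token, 0 for padding) are assumptions made for illustration.

```python
def build_vocab(sentences):
    """Map each lowercase word to an integer ID; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, max_length):
    """Return (token_ids, padding_mask), both exactly max_length long."""
    ids = [vocab[word] for word in sentence.lower().split()][:max_length]  # truncate
    mask = [1] * len(ids) + [0] * (max_length - len(ids))  # 1 = token, 0 = padding
    ids = ids + [vocab["<pad>"]] * (max_length - len(ids))  # pad to max_length
    return ids, mask


sample = ["The sun is shining brightly today", "The cat sat on the windowsill"]
vocab = build_vocab(sample)
ids, mask = encode(sample[0], vocab, max_length=8)
print(len(vocab))  # 11
print(ids)         # [1, 2, 3, 4, 5, 6, 0, 0]
print(mask)        # [1, 1, 1, 1, 1, 1, 0, 0]
```

The vocabulary size counts the padding entry as well, which is why ten distinct words give a size of 11.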

## Step 7: Initializing the Embedding Layer

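Create the embedding layer. Step 9 passes both English and German batches through this one layer, so size it for the larger of the two vocabularies:

```python
from src.embedding_layer import EmbeddingLayer

embedding_layer = EmbeddingLayer(
    # Cover every token ID from either language to avoid out-of-range lookups
    vocabulary_size=max(english_vocab_size, german_vocab_size),
    dimension=EMBEDDING_DIMENSION,
    sequence_length=MAX_LENGTH,
)
```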
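
The `sequence_length` argument suggests that the layer also encodes token positions, although this tutorial does not show how. As an assumption-labeled sketch, the following plain-Python function builds the sinusoidal position table from the original Transformer design. The repository may use a different scheme, and `positional_encoding` is an illustrative name.

```python
import math


def positional_encoding(sequence_length, dimension):
    """Sinusoidal position table with shape [sequence_length][dimension]."""
    table = []
    for position in range(sequence_length):
        row = []
        for i in range(dimension):
            # Dimensions 2k and 2k+1 share one frequency: 10000^(-2k / dimension).
            angle = position / (10000 ** ((i - i % 2) / dimension))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


table = positional_encoding(sequence_length=4, dimension=6)
print(len(table), len(table[0]))  # 4 6
print(table[0])                   # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

In a model that uses this scheme, the table is added element-wise to the token embeddings so that identical words at different positions receive different vectors.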

## Step 8: Initializing the Transformer Model

Instantiate the Transformer model with the parameters defined in Step 5:

```python
from src.transformer import Transformer

transformer_model = Transformer(
    d_model=EMBEDDING_DIMENSION,
    nhead=NUM_HEADS,
    num_encoder_layers=NUM_ENCODER_LAYERS,
    num_decoder_layers=NUM_DECODER_LAYERS,
    dim_feedforward=DIM_FEEDFORWARD,
    dropout=DROPOUT,
    layer_norm_eps=LAYER_NORM_EPS,
)
```
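
For a sense of scale, the sketch below estimates the number of trainable parameters these settings imply. It assumes the standard layer layout (attention projections and feedforward layers with biases, two layer norms per encoder layer, three per decoder layer) and excludes embeddings and any output projection. The repository's implementation may differ, so treat the result as a rough estimate. All function names are illustrative.

```python
def attention_params(d_model):
    # Query, key, value, and output projections, each with weights and a bias.
    return 4 * (d_model * d_model + d_model)


def feedforward_params(d_model, dim_feedforward):
    # Two linear layers: d_model -> dim_feedforward -> d_model, with biases.
    first = d_model * dim_feedforward + dim_feedforward
    second = dim_feedforward * d_model + d_model
    return first + second


def layer_norm_params(d_model):
    return 2 * d_model  # scale and shift vectors


def transformer_params(d_model, dim_feedforward, num_encoder_layers, num_decoder_layers):
    feedforward = feedforward_params(d_model, dim_feedforward)
    encoder_layer = attention_params(d_model) + feedforward + 2 * layer_norm_params(d_model)
    # A decoder layer adds a second (cross-)attention block and a third layer norm.
    decoder_layer = 2 * attention_params(d_model) + feedforward + 3 * layer_norm_params(d_model)
    return num_encoder_layers * encoder_layer + num_decoder_layers * decoder_layer


print(transformer_params(512, 2048, 8, 8))  # 58851328
```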

## Step 9: Testing the Transformer Model

Run a forward pass on the first pair of batches and print the shape of the output tensor:

```python
for (english_batch, english_padding_mask), (german_batch, german_padding_mask) in zip(
    english_dataloader, german_dataloader
):
    english_embeddings = embedding_layer(english_batch)
    german_embeddings = embedding_layer(german_batch)

    transformer_output = transformer_model(
        x=english_embeddings,
        y=german_embeddings,
        encoder_padding_mask=english_padding_mask,
        decoder_padding_mask=german_padding_mask,
    )
    print(transformer_output.size())
    break  # Test with only the first batch
```
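
The padding masks tell the attention layers which positions hold real tokens. A common way to apply such a mask, sketched here in plain Python, is to set the attention scores of padded positions to negative infinity before the softmax so that they receive zero weight. How the repository applies its masks is not shown in this tutorial, so the mask convention (1 for a real token, 0 for padding) and the function name `masked_softmax` are assumptions.

```python
import math


def masked_softmax(scores, padding_mask):
    """Softmax over scores, ignoring positions where padding_mask is 0.

    Assumes at least one position is a real token.
    """
    # Padding positions get -inf, so exp() turns them into exactly 0.
    masked = [s if keep else float("-inf") for s, keep in zip(scores, padding_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


weights = masked_softmax([2.0, 1.0, 0.5, 0.5], padding_mask=[1, 1, 0, 0])
print([round(w, 4) for w in weights])  # [0.7311, 0.2689, 0.0, 0.0]
```

The two padded positions contribute nothing, and the remaining weights still sum to 1.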

## Alternative Approach: Tokenizing with Hugging Face `AutoTokenizer`

This variant replaces the repository's `Tokenizer` with a pretrained Hugging Face tokenizer and wraps the encoded tensors in PyTorch `DataLoader` objects. It reuses `transformer_model` from Step 8 and the parameters from Step 5. Note that `bert-base-uncased` is an English model, so it serves here only to produce token IDs for a shape test.

```python
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer

from src.embedding_layer import EmbeddingLayer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# English: tokenize, then batch the token IDs together with their attention masks
english_tokenizer = tokenizer(
    english,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
    max_length=MAX_LENGTH,
)

print("Tokenized Input IDs:", english_tokenizer["input_ids"].size())
print("Attention Mask:", english_tokenizer["attention_mask"].size())
print("*" * 50, "\n")

english_vocab_size = tokenizer.vocab_size

english_tokenizer_results = TensorDataset(
    english_tokenizer["input_ids"], english_tokenizer["attention_mask"]
)
# shuffle=False keeps the English and German batches paired when zipped below
english_tokenizer_dataloader = DataLoader(
    english_tokenizer_results, batch_size=BATCH_SIZE, shuffle=False
)

# German: same procedure with the same tokenizer
german_tokenizer = tokenizer(
    german,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
    max_length=MAX_LENGTH,
)

print("Tokenized Input IDs:", german_tokenizer["input_ids"].size())
print("Attention Mask:", german_tokenizer["attention_mask"].size())
print("*" * 50, "\n")

german_vocab_size = tokenizer.vocab_size

german_tokenizer_results = TensorDataset(
    german_tokenizer["input_ids"], german_tokenizer["attention_mask"]
)
german_tokenizer_dataloader = DataLoader(
    german_tokenizer_results, batch_size=BATCH_SIZE, shuffle=False
)

# Embedding: one layer serves both languages because they share a vocabulary
assert german_vocab_size == english_vocab_size, "Vocabulary sizes must be equal"

embedding = EmbeddingLayer(
    vocabulary_size=english_vocab_size,
    sequence_length=MAX_LENGTH,
    dimension=EMBEDDING_DIMENSION,
)

# Test the Transformer with embeddings
for (english_batch, english_padding_mask), (german_batch, german_padding_mask) in zip(
    english_tokenizer_dataloader, german_tokenizer_dataloader
):
    english_embeddings = embedding(english_batch)
    german_embeddings = embedding(german_batch)

    transformer_output = transformer_model(
        x=english_embeddings,
        y=german_embeddings,
        encoder_padding_mask=english_padding_mask,
        decoder_padding_mask=german_padding_mask,
    )
    print(transformer_output.size())
    break  # Test with only the first batch
```
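
The loop above relies on `zip` pairing the i-th English batch with the i-th German batch, which only holds when both loaders iterate in the same order. Two loaders that shuffle independently would pair English sentences with unrelated German ones. If shuffling is wanted, one option is to draw a single permutation and apply it to both sides, as in this plain-Python sketch. The name `shuffled_pair_batches` and the sample word lists are hypothetical.

```python
import random


def shuffled_pair_batches(source, target, batch_size, seed=0):
    """Yield (source_batch, target_batch) lists with sentence pairs kept aligned."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    order = list(range(len(source)))
    random.Random(seed).shuffle(order)  # one permutation shared by both sides
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        yield [source[i] for i in indices], [target[i] for i in indices]


english_sample = ["one", "two", "three", "four", "five"]
german_sample = ["eins", "zwei", "drei", "vier", "fünf"]
for english_batch, german_batch in shuffled_pair_batches(english_sample, german_sample, batch_size=2):
    print(english_batch, german_batch)
```

With PyTorch, the same effect can be had by placing the English and German tensors in a single `TensorDataset` and shuffling that one dataset.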
