#### Embedding Layer in the Transformer architecture

![Alt text](Images/01_InputEmbedding.png)

#### Snippet from the `Attention Is All You Need` paper 

![Alt text](Embeddings.png)

Let's unpack the highlighted statements from the `Embeddings and Softmax` section:

- `Transformers` operate on numerical data, so the input tokens are first converted into vectors of numbers. These vectors are called `Embeddings`
- `Input Tokens` are the units the input text is split into (words in this walkthrough; in practice often subword pieces)
- `Vectors of Dimension dmodel` Each token is represented as a vector with a fixed dimension called `dmodel` (a hyperparameter)
- `Embedding Layers` This learned layer is responsible for converting the tokens to vectors
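
The lookup described in the points above can be sketched as follows. This is only an illustration: the toy vocabulary, the whitespace tokenization, and the value of `d_model` are assumptions made for readability, not something taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary: maps each token to an integer ID.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
d_model = 4  # assumed small value so the vectors are easy to read

# Naive whitespace tokenization (an assumption for this sketch).
tokens = "the cat sat".split()
token_ids = torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])

# The embedding layer is a learnable lookup table of shape (vocab_size, d_model).
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
vectors = embedding(token_ids)  # one d_model-sized vector per token

print(token_ids.tolist())  # [1, 2, 3]
print(vectors.shape)       # torch.Size([3, 4])
```

Each row of `vectors` is simply the row of the embedding weight matrix selected by the corresponding token ID.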

#### PyTorch Code for Input Embedding layer

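```python
import math

import torch
import torch.nn as nn


class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        self.d_model = d_model        # dimension of each embedding vector
        self.vocab_size = vocab_size  # number of tokens in the vocabulary
        # Learnable lookup table with weight shape (vocab_size, d_model)
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds integer token IDs; look up their vectors and scale by sqrt(d_model)
        return self.embedding(x) * math.sqrt(self.d_model)
```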

Let's walk through the code above with an example:

* Initialization:
    - `InputEmbeddings` This class is a subclass of `torch.nn.Module`
    - `d_model` The dimension of each embedding vector
    - `vocab_size` The number of distinct tokens in the vocabulary
    - `embedding` An `nn.Embedding` layer whose weight matrix has the shape `(vocab_size, d_model)`
* Forward block:
    - Looks up the embedding vector for each token ID in `x` and, following the paper, multiplies it by `sqrt(d_model)`

* Example:
    - In our example, `d_model` is set to `4` and `vocab_size` is set to `10`
    - The embedding layer therefore has a weight matrix of shape `(10, 4)`
    - Now we have an input token "cat". Its embedding vector is randomly initialized, say as `[0.2, 0.6, -0.1, 0.4]`
    - According to the paper, the embedding weights should be multiplied by `sqrt(d_model)` = `sqrt(4)` = `2`
    - The input embedding is then transformed like this:
        * `[0.2, 0.6, -0.1, 0.4]` multiplied by `2`
        * = `[0.2*2, 0.6*2, -0.1*2, 0.4*2]`
        * = `[0.4, 1.2, -0.2, 0.8]`
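
The worked example above can be reproduced in code roughly as follows. The token ID chosen for "cat" (`CAT_ID = 3`) is a hypothetical value, and the embedding row is overwritten by hand only so that the numbers match the example; in a real model the row would be randomly initialized and then learned.

```python
import math

import torch
import torch.nn as nn


class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.embedding(x) * math.sqrt(self.d_model)


layer = InputEmbeddings(d_model=4, vocab_size=10)

CAT_ID = 3  # hypothetical ID for the token "cat"

# Overwrite the row for "cat" so it matches the example values.
with torch.no_grad():
    layer.embedding.weight[CAT_ID] = torch.tensor([0.2, 0.6, -0.1, 0.4])

token_ids = torch.tensor([[CAT_ID]])  # shape (batch=1, seq_len=1)
scaled = layer(token_ids)             # shape (batch, seq_len, d_model)

expected = torch.tensor([0.4, 1.2, -0.2, 0.8])
print(scaled.shape)                            # torch.Size([1, 1, 4])
print(torch.allclose(scaled[0, 0], expected))  # True
```

Note that the layer takes integer token IDs, not strings, and adds a trailing `d_model` dimension to whatever shape the input has.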

The final scaled `embeddings` are then passed on to the next steps of the `Transformer`.
    
    
