# Input Encoding
I think it is really important to go through the input encoding step, as it personally helps me
understand the network architecture much better when I know exactly what the input looks like.


## Input

For someone like me, who has worked in vision this whole time and doesn't know much about NLP, this part is worth reading to see how words get embedded into numerical vectors that can be fed to the network.

Unlike images, which can be converted directly into a matrix of numerical values, the idea of "converting" a word into a vector seemed really foreign to me.

### Embedding Layer - Look up table
The simplest way to transform a word into a vector is a lookup table, where each word in the data is mapped to its own vector (one row of the table). These vectors are initialized randomly and are updated continually during training. The lookup table and its initialization are already implemented in PyTorch's nn.Embedding().

Let's solidify the above concept by going through a very simple example.

## Example
### Setup
Consider a toy dataset with the following two sentences:

\[**"I like apples"** **"You like blueberries"**\]

Each word can be transformed into a $d$-dimensional embedding vector. Let's use $d=3$.

For now, let us assign each word the following index:

**I -> $0$**

**Like -> $1$**

**Apples -> $2$**

**You -> $3$**

**Blueberries -> $4$**



In total, the dataset contains $5$ unique words. Normally this number would come out of a preprocessing step, but in an example this small we can simply count them.
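
The preprocessing step mentioned above could look roughly like the sketch below. It assumes simple whitespace tokenization and lowercasing, and the names `sentences` and `word_to_index` are illustrative, not taken from any library.

```python
# Toy corpus from the example above.
sentences = ["I like apples", "You like blueberries"]

# Give each new word the next free index, in order of first appearance.
word_to_index = {}
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)

num_words = len(word_to_index)  # number of unique words in the corpus
print(word_to_index)
print(num_words)
```

With this ordering, the indices come out the same as the ones assigned by hand above, and `num_words` is the value needed for the size of the lookup table.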


### Lookup Table
The lookup table has size $5$ by $3$ (rows x columns): one row per word and one column per embedding dimension. Let's initialize a PyTorch embedding layer.

In [4]:
import torch
from torch import nn
from torch.nn import Module
import numpy as np

# Initialize parameters
num_words = 5
embedding_dim = 3
I = 0
LIKE = 1
APPLES = 2 
YOU = 3 
BLUEBERRIES = 4

# Initialize Embedding Layer
embedding_layer = nn.Embedding(num_embeddings = num_words, embedding_dim = embedding_dim)

input_words = torch.LongTensor([[I, LIKE, APPLES, YOU, BLUEBERRIES]])

lookup_table = embedding_layer(input_words)

print("Initialized LookUp Table by Pytorch's  Embedding Layer")
print(lookup_table)

Initialized LookUp Table by Pytorch's  Embedding Layer
tensor([[[ 0.6477,  1.1516, -0.7456],
         [ 0.8228,  1.0397, -0.5882],
         [-0.0581,  1.0294,  2.0208],
         [-0.1690, -0.1230, -0.1516],
         [ 0.2315, -2.6705,  0.1501]]], grad_fn=<EmbeddingBackward>)


Each row of the above tensor is the embedding vector for one word in our dataset, in index order: row $0$ is "I", row $1$ is "like", and so on.

```
input_words = torch.LongTensor([[I, LIKE, APPLES, YOU, BLUEBERRIES]])
lookup_table = embedding_layer(input_words)
```
By passing **input_words** to **embedding_layer**, we retrieve the vector that each word index maps to.
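
To make the "lookup" concrete, the sketch below checks that calling the layer with word indices returns the matching rows of its weight matrix. It builds a fresh layer with the same sizes as above, so the random values will differ from the printed table; the fixed seed is only there to make reruns repeatable.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed only so reruns give the same random table

embedding_layer = nn.Embedding(num_embeddings=5, embedding_dim=3)

# "You like blueberries" as indices (YOU = 3, LIKE = 1, BLUEBERRIES = 4).
sentence = torch.LongTensor([3, 1, 4])

looked_up = embedding_layer(sentence)        # shape = (3, 3)
indexed = embedding_layer.weight[sentence]   # plain row indexing into the table

print(looked_up.shape)
print(torch.equal(looked_up, indexed))       # same rows either way
print(embedding_layer.weight.requires_grad)  # the table is a trainable parameter
```

Because the table is an ordinary trainable parameter, the optimizer updates its rows during training just like any other weight.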

The positional part of the input encoding is computed with the following equation.
![Encoding Equation](images/sin_and_cos.png "sin and cos encoding")

In sequential data such as sentences, the value itself (such as an individual word) matters of course, but the **position** of the value is just as important.

To embed this information, the authors use the above equation to indicate the position.

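In the usual sinusoidal formulation, $pos$ is the position that the value has in the sequence, and $i$ indexes the embedding dimension (dimensions $2i$ and $2i+1$ share one frequency).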
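
Since the figure may not render everywhere, here is the sinusoidal encoding as it is usually written for the Transformer. This assumes the image shows the standard form, with $pos$ the position in the sequence, $i$ the index of a dimension pair, and $d_{model}$ the embedding size.

$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

For example, at $pos = 0$ every sine entry is $0$ and every cosine entry is $1$, so the first position is encoded as $(0, 1, 0, 1, \dots)$.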

In [3]:
np.arange(5)

array([0, 1, 2, 3, 4])

In [11]:
dim = torch.arange(5).reshape(1,1,-1)
dim.shape

torch.Size([1, 1, 5])

In [12]:
pos = torch.arange(5).reshape(1,-1,1)
pos.shape

torch.Size([1, 5, 1])

In [15]:
position = torch.arange(0, 10).unsqueeze(1)
position.shape

torch.Size([10, 1])

In [13]:
phase = pos / 10000 ** (dim / 5)  # pos / 10000^(dim / d) with d = 5; broadcasts to (1, 5, 5)
phase.shape

torch.Size([1, 5, 5])
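
The shape experiments above can be combined into a full encoding table. The sketch below is one possible way to do it with broadcasting, assuming the standard sinusoidal formula (sine on even dimensions, cosine on odd dimensions, each pair sharing the divisor $10000^{2i/d}$). The sizes `len_seq = 5` and `d_model = 4` are arbitrary choices for illustration.

```python
import torch

len_seq, d_model = 5, 4  # illustrative sizes, not taken from the text

pos = torch.arange(len_seq, dtype=torch.float).reshape(1, -1, 1)  # (1, len_seq, 1)
dim = torch.arange(d_model).reshape(1, 1, -1)                     # (1, 1, d_model)

# Dimensions 2i and 2i+1 share the exponent 2i / d_model.
exponent = (2 * (dim // 2)).float() / d_model
phase = pos / (10000.0 ** exponent)                               # (1, len_seq, d_model)

# Sine on even dimensions, cosine on odd dimensions.
pe = torch.where(dim % 2 == 0, torch.sin(phase), torch.cos(phase))

print(pe.shape)
print(pe[0, 0])  # encoding of position 0
```

Keeping `pos` and `dim` on separate axes lets broadcasting build the whole `(len_seq, d_model)` grid without an explicit loop.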

In [None]:
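class PositionalEncoding(Module):
    def __init__(self, len_seq, d_model, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        pe = torch.zeros(len_seq, d_model)
        position = torch.arange(0, len_seq, dtype=torch.float).unsqueeze(1) # shape = (len_seq, 1)
        # Assumed completion (standard sinusoidal form): divisor 10000^(2i / d_model) per dimension pair
        division_value = 10000.0 ** (torch.arange(0, d_model, 2, dtype=torch.float) / d_model)
        pe[:, 0::2] = torch.sin(position / division_value)                  # even dimensions
        pe[:, 1::2] = torch.cos(position / division_value[: d_model // 2])  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))  # shape = (1, len_seq, d_model), not trained

    def forward(self, x):
        # x: (batch, seq, d_model) with seq <= len_seq; add the encoding, then apply dropout
        return self.dropout(x + self.pe[:, : x.size(1)])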

In [None]:
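def positional_encoding(len_seq, d_model):
    pe = torch.zeros(len_seq, d_model)
    position = torch.arange(0, len_seq, dtype=torch.float).unsqueeze(1)  # shape = (len_seq, 1)
    # Assumed completion (standard sinusoidal form): pos / 10000^(2i / d_model)
    value = position / 10000.0 ** (torch.arange(0, d_model, 2, dtype=torch.float) / d_model)
    pe[:, 0::2] = torch.sin(value)                     # even dimensions
    pe[:, 1::2] = torch.cos(value[:, : d_model // 2])  # odd dimensions
    return pe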