# Transformer Notes
This notebook follows [this video](https://www.youtube.com/watch?v=kCc8FmEb1nY), which walks in depth through building the exact base model I need for this project. I will be copying a lot of the code from the video and annotating it to help myself understand how the Transformer works.

In [13]:
# Imports
from collections import Counter

# pytorch functionality
import torch
from torch import Tensor

# data
from torchtext.vocab import vocab, Vocab
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer

## Get Data
For this example I am using the WikiText2 dataset, but the model and workflow should generalize to any text.

In [14]:
train_iter, val_iter, test_iter = WikiText2()

### Tokenization 

For now I'm going to keep my tokenizer very simple. You can use a multitude of techniques for tokenizing your corpus. Here is a [library](https://github.com/openai/tiktoken) worth looking into at some point.

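The simplest tokenizers, such as the character-level one used in the video, produce very long sequences over a small token space. A subword tokenizer like the library above shifts that trade-off toward shorter sequences over a larger token space.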
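To make the sequence-length side of that trade-off concrete, here is a minimal sketch that tokenizes one sentence two ways. The sample sentence is made up for illustration, and whitespace splitting is only a crude stand-in for a real word tokenizer.

```python
# Toy comparison of tokenization granularity.
# The sentence is illustrative and does not come from the dataset.
text = "the model reads the text one token at a time"

# Character-level: every character, including spaces, becomes a token.
char_tokens = list(text)
char_vocab = sorted(set(char_tokens))

# Word-level: split on whitespace (a crude stand-in for a real word tokenizer).
word_tokens = text.split()
word_vocab = sorted(set(word_tokens))

print("char-level:", len(char_tokens), "tokens,", len(char_vocab), "distinct symbols")
print("word-level:", len(word_tokens), "tokens,", len(word_vocab), "distinct symbols")
```

On a single sentence only the difference in sequence length is visible. The difference in vocabulary size shows up on a real corpus, where the character set stays bounded while the number of distinct words keeps growing.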

#### Some Helper Functions for Data

In [34]:
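def build_vocab(in_data, tokenizer):
  # Import the factory under an alias: this notebook later rebinds the name
  # `vocab` to the built Vocab object, which would break a re-run of this function.
  from torchtext.vocab import vocab as make_vocab

  # Count how often each token occurs across the corpus
  counter = Counter()
  for string in in_data:
    counter.update(tokenizer(string))

  # With the default settings the special tokens are placed first
  return make_vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

def data_process(in_data, tokenizer, vocab: Vocab):
  # Encode each raw line as a 1-D tensor of token ids
  data = []
  for raw in in_data:
    tensor = torch.tensor([vocab[token] for token in tokenizer(raw)], dtype=torch.long)
    data.append(tensor)

  # Drop empty lines and concatenate everything into one long 1-D tensor
  return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))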
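As a rough illustration of what `data_process` produces, the sketch below redoes the same steps in pure Python, with lists standing in for tensors. `toy_stoi`, `toy_tokenizer` and `toy_data_process` are hypothetical names invented for this example; they are not part of torchtext or of the notebook.

```python
# Pure-Python stand-in for data_process: lists replace tensors.
# toy_stoi is a hypothetical token -> id table, not the notebook's real vocab.
toy_stoi = {"<unk>": 0, "<pad>": 1, "<bos>": 2, "<eos>": 3, "the": 4, "cat": 5, "sat": 6}

def toy_tokenizer(line):
    # Crude stand-in for a word tokenizer: lowercase, then split on whitespace.
    return line.lower().split()

def toy_data_process(lines, tokenizer, stoi):
    # Encode each line separately, as data_process does for each raw string.
    encoded = [[stoi[token] for token in tokenizer(line)] for line in lines]
    # Skip empty lines, then flatten; this mirrors filter(...) followed by torch.cat.
    return [idx for ids in encoded if len(ids) > 0 for idx in ids]

lines = ["The cat sat", "", "the cat"]
flat = toy_data_process(lines, toy_tokenizer, toy_stoi)
print(flat)  # [4, 5, 6, 4, 5]
```

Note that the result is a single flat stream of ids: line boundaries are gone and the empty line contributes nothing.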

##### Define Tokenizer

I am doing something slightly different from the video: I use the word-level tokenizer that torchtext provides rather than tokenizing character by character. The torchtext tooling is better suited to word-level tokens, but this choice will require some slight adjustments to the code from the video.

In [23]:
tokenizer = get_tokenizer("basic_english")

#### Build Vocabulary

In [24]:
vocab = build_vocab(train_iter, tokenizer)

print("Train Vocab Size:", len(vocab))

Train Vocab Size: 28785


A short example of how encoding and decoding work with the Vocab object:

In [28]:
encoded_word = vocab.get_stoi()["there"]
decoded_word = vocab.get_itos()[encoded_word]

print("'There' after encoding:", encoded_word)
print("'There' after decoding:", decoded_word)

'There' after encoding: 248
'There' after decoding: there
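The two lookups above are, conceptually, a dictionary (token to id) and a list (id to token). The sketch below rebuilds that pair in plain Python on a made-up two-line corpus, with the special tokens first. The `<unk>` fallback in `encode` is a choice made for this sketch; as far as I understand, a torchtext `Vocab` only falls back like this after `set_default_index` has been called, so treat that part as an assumption.

```python
from collections import Counter

# Hypothetical toy corpus; the real notebook builds its vocab from WikiText2.
specials = ["<unk>", "<pad>", "<bos>", "<eos>"]
corpus = ["there is a cat", "there is a dog"]

# Count tokens, then assign ids: specials first, then tokens in first-seen order.
counter = Counter(token for line in corpus for token in line.split())
itos = specials + [tok for tok in counter if tok not in specials]
stoi = {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens):
    # Unknown tokens fall back to the <unk> id (an assumption of this sketch).
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

def decode(ids):
    # Map each id back to its token string.
    return [itos[i] for i in ids]

ids = encode("there is a bird".split())
print(ids)          # [4, 5, 6, 0]
print(decode(ids))  # ['there', 'is', 'a', '<unk>']
```

Decoding is lossy for out-of-vocabulary words: "bird" comes back as `<unk>`, because the id stream only records that some unknown token was there.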


### Convert Data to Tensor Format
Using the data_process function defined above, build a torch tensor representation of each split based on the vocab.

In [38]:
train_data = data_process(train_iter, tokenizer, vocab)
val_data = data_process(val_iter, tokenizer, vocab)
test_data = data_process(test_iter, tokenizer, vocab)

print("Training Data Shape and Type:", train_data.shape, train_data.dtype)
print("Validation Data Shape and Type:", val_data.shape, val_data.dtype)
print("Testing Data Shape and Type:", test_data.shape, test_data.dtype)

Training Data Shape and Type: torch.Size([2049990]) torch.int64
