<a href="https://colab.research.google.com/github/WolferkGIT/Deep-Learning/blob/main/03_traininig.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import sys
transformer_path = '/content/drive/MyDrive/Colab_Notebooks/Transformers_pt_1/transformer_lm'
sys.path.append(transformer_path)
%load_ext autoreload
%autoreload 2
import modeling_transformer
from modeling_transformer import TransformerEncoderLayer, TransformerEncoder, TransformerLM

In [4]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as torchdata

# Training a transformer language model

In this notebook, we will learn how to

1. preprocess data for language modeling
2. use `torch.utils.data` to handle batching in an efficient and standard way
3. train a transformer language model

Specifically, we will use the Tiny Shakespeare dataset, which contains the complete works of William Shakespeare, to train a language model. The goal of this notebook is to walk you through the steps of pre-processing the dataset and preparing it for training using the PyTorch DataLoader, creating a language model, training it and using it to generate text.

We will train a character-based langauge model instead of word-based, because:

1. It's faster to train it to the point that it can generate text
2. We don't want to complicate the homework with BPE tokenization
3. We work with a small dataset which might not be enough to train a word-based language model

> Feel free to try training a word-based language model on a larger dataset, such as the WikiText-2 dataset, which is available in the hugginface datasets library.

# Step 1: Load and Explore the Dataset
The first step is to load the dataset and explore it. In this example, we will use the Tiny Shakespeare dataset, which contains the complete works of William Shakespeare. We can download the dataset from the following URL: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Feel free to use `wget` to download the dataset or just download the file manually and upload it to your Colab instance.

Here's how you can use `wget` to download the dataset:
```
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O tiny_shakespeare.txt
```

## Coding task 3.1: load the data and take a look

Read the file to a variable named `raw_data` and print the first 1000 characters.

### Grading criteria
**(1 point max)**

1 point if everything works

In [5]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O tiny_shakespeare.txt

--2023-08-10 16:30:54--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘tiny_shakespeare.txt’


2023-08-10 16:30:55 (55.6 MB/s) - ‘tiny_shakespeare.txt’ saved [1115394/1115394]



In [6]:
with open("tiny_shakespeare.txt", "r") as f:
    raw_data = f.read()

print(raw_data[:2000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



## Inline question 3.1: raw text preprocessing
**(1 point max, 1 extra point for creative ideas)**

Think about how you can pre-process the data (in terms of modifying the text). Provde three ideas and explain why you think they are useful or not. Think about the size of the data, tokenization method (we will use character-level language model), your computational resources, and what kind of text you want to generate. Make this answer as extensive as possible.

***Since the text inside given example has capital letters mostly for the headers, to spimplify model labor we might turn all the letters into smaller ones. It would reduce the size of vocabulary, making training a faster process.
The goal of the model trained on such text would be an ability to generate some text that retains Shakspear style of speech.
Since I declare the goal likewise, it is possible to remove all the Headers like "All:" or "Second citizen", since those are not useful anyhow for the training data if we just want to emulate Shakespear style of speech.
***

# Step 2: preparing the data for the model

## Coding task 3.2
Similar to previous homeworks, where we made a vocabualry of words, we will make a vocabulary of characters.

1. Make a vocabulary of all characters
2. Make `char2idx`
3. Make a class `Tokenizer` that stores `char2idx` and has two methods: `encode` and `decode` that encode and decode text using `char2idx` and `idx2char` dictionaries.
   * You might find it useful to create `idx2char` dictionary inside the `__init__` method of the `Tokenizer` class.
4. Create a `Tokenizer` object
5. Convert the text to a list of integers using `char2idx`, assign it to a variable named `data`
6. Print the first 100 items of `data`

It's useful to have a function that converts a sequence of indices to a string. You will need it to convert the output of the model to a text when you will be generating text, but is it also very useful for **debugging** your pre-processing code.

### Grading criteria
**(2 points max)**

1. 1 point for `char2idx` dictionary
2. 1 point for `Tokenizer` class that passes the tests below

In [35]:
# YOUR CODE STARTS HERE (our implementation is about 4 lines using comprehensions, but it's alright if yours is longer)
class CharToIndex:
    def __init__(self):
        self.char2idx = {}
        self.idx2char = {}
        self.num_chars = 0

    def mapping(self, text):
        # Create char-to-index and index-to-char mappings
        for char in text:
            if char not in self.char2idx:
                self.char2idx[char] = self.num_chars
                self.idx2char[self.num_chars] = char
                self.num_chars += 1

    def encode(self, text):
        # Convert text into a sequence of integers using the char2idx mapping
        sequence = [self.char2idx[char] for char in text]
        return sequence

    def length(self):
        return (self.num_chars)

    def decode(self, sequence):
        # Convert a sequence of integers back to the original text using the idx2char mapping
        text=[]
        for idx in sequence:
            if (idx < self.num_chars) and (idx > 0):
                text.append(self.idx2char[idx])
        return ''.join(text)
# YOUR CODE ENDS HERE

In [36]:
#_tokenizer = Tokenizer(char2idx)
for char in raw_data:
    char = char.lower()

_tokenizer = CharToIndex()
_tokenizer.mapping(raw_data)

# YOUR CODE STARTS HERE

_token_ids = _tokenizer.encode("hello")
_text = _tokenizer.decode(_token_ids)

assert isinstance(_token_ids, list), "token_ids should be a list"
assert isinstance(_token_ids[0], int), "token_ids should be a list of integers"
assert _text == "hello", "decode should work correctly and return the original text"

#del _tokenizer, _token_ids, _text

In [37]:
print(_tokenizer.encode("hello"))
print(_tokenizer.decode([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]))

[22, 8, 28, 28, 14]
irst Czen:
Bfo


In [10]:
_tokenizer.length()

65

# Chunk the data

Our data is too long to be processed in one go. We will split it into chunks of length 128. We will use the first 128 characters to predict the next character. This is a decent length for a sequence, but you can play with it if you want.

## Coding task 3.3

1. Create a list of sequences of length `MAX_LEN + 1`. Each sequence should be a list of integers. You'll see why we need `+ 1` in a minute.
   * You might need to get rid of your last example if it's shorter than `MAX_LEN + 1` characters. We need all data to be of the same length to simplify batching.
   * In the next homework we will implement batchihg for sequences of different lengths and you are probably not going to enjoy it, it's a bit tricky.
2. Split the data into training and validation sets. Use 90% of the data for training and 10% for validation.
3. Make x and y pairs for your data. Remember that we want to use the first 128 characters to predict the next character. So, `x` should be the first 128 characters and `y` should be a shifted version of the same sequence, so it's the last 128 characters. Name them `train_x` and `train_y` for the training set and `val_x` and `val_y` for the validation set.
4. Print an example from the training set. You should see that the first 128 characters are the same as the first 128 characters of the original text, and the last 128 characters are the same as the last 128 characters of the original text, shifted by one character.

You can just stride using `data[i:i+128]` for each `i` in `range(0, len(data), 128)`, no need to do anything fancy. You can figure out more complex ways to do it, just do this after all the homework is done. You receive no extra points if your homework is not finished.

### Grading criteria

1. 1 point for `data_chunks` list and train-test split
2. 1 point for dataset and dataloader objects
3. Extra point for a more interesting way to chunk the text
4. Extra point for implementing a custom dataset class

In [11]:
len(raw_data)

1115394

In [12]:
def slice_to_list(input, size):
    return [input[i:i + size] for i in range(0, len(input), size)]

test_text = "Kazoo baboo la longa kikongo Metal Gear Solid V is the best game ever"

chunk_list = slice_to_list(test_text, 3)
print(chunk_list)

['Kaz', 'oo ', 'bab', 'oo ', 'la ', 'lon', 'ga ', 'kik', 'ong', 'o M', 'eta', 'l G', 'ear', ' So', 'lid', ' V ', 'is ', 'the', ' be', 'st ', 'gam', 'e e', 'ver']


In [13]:
MAX_LEN = 128
# YOUR CODE STARTS HERE (our implementation is about 13 lines, but it's alright if yours is different)
chunk_list = slice_to_list(raw_data, MAX_LEN)
raw_data_sfift = raw_data[1:]

print(chunk_list[1])
print(chunk_list[2])

if len(chunk_list[-1])!= MAX_LEN:
    chunk_list.pop()
print(len(chunk_list[-1]))
# YOUR CODE ENDS HERE

die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:

We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talki
128


# Using `torch.utils.data`

We will use `torch.utils.data.Dataset` to create a dataset object that will be used to create a `torch.utils.data.DataLoader` object. The `DataLoader` object will be used to create batches of data.

## Coding task 3.4

Your task is to learn how to use `torch.utils.data.Dataset` and `torch.utils.data.DataLoader` classes and to apply them to our data.

1. Convert your data to tensors of type long
1. Create a `torch.utils.data.Dataset` object for each train and test data. Name them `train_dataset` and `val_dataset`. You can use the `TensorDataset` class for this or make a new class that inherits from `torch.utils.data.Dataset` and implements the `__getitem__` and `__len__` methods.
2. Try indexing `train_dataset` to get a single example and decode it using `tokenizer.decode()`. What does it contain? Use tokenizer to decode one example (both x and y). Does it look like a valid text? Are the targets shifted by one character?
1. Use the `DataLoader` class to create `train_loader` and `val_loader` objects. It will shuffle and batch data for you. You can use the following parameters:
   * `dataset` - the dataset object you created in the previous step
   * `batch_size` - your choice!
   * `shuffle` - True for training data, False for validation data
   * `num_workers` - 8, number of CPU cores to use for batch preparation
3. Try iterating over `train_loader` and print the shapes of the batches.
    * You can use `break` to stop the loop after the first iteration.
4. Try decoding a batch that you get from `train_loader`. Does it look like a valid text? Are the targets shifted by one character?

Learn more about data feeding in pytorch here: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html


**NOTE:**
1. `TensorDataset` returns a tuple of tensors. Usually these are `(x, y)` pairs, where `x` is the input and `y` is the target. In our case, `x` is the input sequence and `y` is the same sequence shifted by one character. This is how we will train our language model. We will use the first 128 characters to predict the next character.
1. You need to convert your pytorch tensor into a python list in order to use `tokenizer.decode()`. Feel free to do it in-place or modify the `decode` method of the `Tokenizer` class to accept **BOTH** python lists and pytorch tensors. You can check what datatype you have using `isinstance()` function.
2. Printing might look a bit weird because you have a lot of `\n` in the data. It is alright, just be careful when you are verifying that your data is correct.

### Grading criteria

* 1 point for `train_dataset` and `val_dataset` objects
* 1 point if each test is written and passed:
  * train dataset element is correctly processed and x and y corespond to the correct characters
  * printed the shapes of the items that you get from `train_loader`
  * decoded a batch from `train_loader` and printed the decoded text and it is correct

In [14]:
#https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
# YOUR CODE STARTS HERE (our implementation is about 13 lines)
class strings_to_tensors(torchdata.Dataset):
    def __init__(self, string_list, length):
        self.string_list = string_list
        self.fixed_length = length

    def __len__(self):
        return len(self.string_list)

    def __getitem__(self, id):
        text = self.string_list[id]
        # Convert the text to a tensor
        indecies = [ord(char) for char in text]
        tensor = torch.tensor(indecies, dtype=torch.long)  # Modify dtype if needed
        return tensor
# YOUR CODE ENDS HERE

In [15]:
chunk_tensor_list= strings_to_tensors(chunk_list, MAX_LEN)


In [16]:
from sklearn.model_selection import train_test_split
train_ratio = 0.9#defined by task
train_set, test_set = train_test_split(chunk_tensor_list, train_size = train_ratio, test_size = 1 - train_ratio)

In [17]:
#https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
BATCH_SIZE = 64
train_dataloader = torchdata.DataLoader(train_set, batch_size= BATCH_SIZE, shuffle=True)
test_dataloader = torchdata.DataLoader(test_set, batch_size= BATCH_SIZE, shuffle=True)

# Train a Transformer model

Import your `TransformerLM` model from `modeling_transormer` file and train it on the data you prepared above.
You know the drill: define a model, an optimizer, and a training loop, log everything to wandb.
You can also save your model using `TransformerLM.save_pretrained()` method and load it using `TransformerLM.from_pretrained()` method in case you want to.

### Tricky part

In PyTorch, `F.cross_entropy` expects the logits to be of shape `(batch_size, num_classes)` and the targets to be of shape `(batch_size,)` containing the class indices. In our case, the logits tensor has the shape `(batch_size, seq_len, num_classes)` and the targets are of shape `(batch_size, seq_len)`. We need to reshape the input and the targets to make them compatible with `F.cross_entropy`. You can do it like this:

```python
bs, seq_len, num_classes = logits.shape
logits = logits.reshape(bs * seq_len, num_classes)
targets = targets.reshape(bs * seq_len)
```

or, equivalently, like this:

```python
logits = logits.view(-1, num_classes)
targets = targets.view(-1)
```

Try monitoring your GPU consumption and max it out. The more efficient your code is, the faster your model will train.
During training log your loss and and accuracy. You can only log accuracy every 100 batches or so, because it is a bit slow to compute. You can also log the learning rate.
During evlauation you just need to log the perplexity, the loss, and accuracy. Perplexity is just `exp(loss)`.
Accuracy is not the most standard metric for language models, but it is very intererpretable and easy to compute. Don't expect it to be high, though.
Be mindful how frequenly you evaluate your model. You don't want to evaluate it too often, because it will slow down your training loop.

> You can also log the number of batches you process in one second (throughput) as a measure of efficiency. It is not required, but it is a good idea to monitor it.

## Coding task 3.5

Make a training loop and train your model.

### Grading criteria
**(5 points + extra points)**

* 2 points for trainig loop
* 1 point for using the GPU
* 1 point for evaluation loop (we recommend to make it into a separate function to make your code more readable)
* 1 point for wandb logging of train loss, eval loss, train accuracy, eval accuracy, eval perplexity. You can also log the learning rate, but it is not required.
* -1 point if forget to zero your gradients between batches
* -1 point if your forget to put your model to evaluation mode during evaluation and back to training mode during training
* Extra point for using a learning rate scheduler
* Extra point for any other improvements to the training loop


In [18]:
!pip install wandb
import wandb
wandb.login()

Collecting wandb
  Downloading wandb-0.15.8-py3-none-any.whl (2.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m14.6 MB/s[0m eta [36m0:00:00[0m
Collecting GitPython!=3.1.29,>=1.0.0 (from wandb)
  Downloading GitPython-3.1.32-py3-none-any.whl (188 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m188.5/188.5 kB[0m [31m14.3 MB/s[0m eta [36m0:00:00[0m
Collecting sentry-sdk>=1.0.0 (from wandb)
  Downloading sentry_sdk-1.29.2-py2.py3-none-any.whl (215 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m215.6/215.6 kB[0m [31m14.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting docker-pycreds>=0.4.0 (from wandb)
  Downloading docker_pycreds-0.4.0-py2.py3-none-any.whl (9.0 kB)
Collecting pathtools (from wandb)
  Downloading pathtools-0.1.2.tar.gz (11 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting setproctitle (from wandb)
  Downloading setproctitle-1.3.2-cp310-cp310-manylinux_2_5_x86_64.manyli

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [42]:
#from transformer_lm.modeling_transformer import TransformerLM
#already done above
# YOUR CODE STARTS HERE
num_layers=8
hidden=128
num_heads=4
fcn_hidden=128
#vocab_size=128
vocab_size=_tokenizer.length
num_epochs = 2
learning_rate = 0.005
loss_function = nn.CrossEntropyLoss()#used for CLM
wandb_configuration = {
    "attention layers": num_layers,
    "Vocab" : vocab_size,
    "MHA heads": num_heads,
    "Batch_size" : BATCH_SIZE,
    "Epochs" : num_epochs,
    "FCN size" : fcn_hidden,
    "Hidden": hidden
    }


run = wandb.init(project="Casual Language Model", config = wandb_configuration)

#-------
model_right = TransformerLM(num_layers= num_layers, hidden= hidden, num_heads= num_heads,
                      fcn_hidden= fcn_hidden, vocab_size= vocab_size, max_seq_len= MAX_LEN)
#Training the reversed version of the same model for bidirectional generation in the future
#model_left = TransformerLM(num_layers= num_layers, hidden= hidden, num_heads= num_heads,
#                      fcn_hidden= fcn_hidden, vocab_size= vocab_size, max_seq_len= MAX_LEN)

optimizer = torch.optim.Adam(model_right.parameters(), lr= learning_rate)
global_steps = 0
for epoch in range(num_epochs):
    for batch in train_dataloader:
        global_steps += 1
        inputs = batch
        optimizer.zero_grad()  # Zero the gradients
        outputs = model_right(inputs)  # Forward pass
        targets = inputs
        for target in targets:
            target = torch.tensor([-1], dtype=torch.long)#empty element in the beginnig as '-1'
            target = torch.cat((target, target[1:]), dim=0)#slide forward, replacing the first element with space

        loss = loss_function(outputs.view(-1, vocab_size), targets.view(-1))  # Calculate loss
        wandb.log({"training loss": loss}, step= global_steps)
        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights

        #Evaluation on test dataset
        if global_steps % 50 == 0:
            with torch.no_grad():
                model_right.eval()
                for test_batch in test_dataloader:
                    x = test_batch
                    y = x
                    for yi in y:
                        yi = torch.tensor([-1], dtype=torch.long)#empty element in the beginnig as '-1'
                        yi = torch.cat((yi, yi[1:]), dim=0)#slide forward, replacing the first element with space

                    outputs = model_right(x)
                    loss = loss_function(outputs.view(-1, vocab_size), y.view(-1))
                    accuracy = (outputs.argmax(dim=-1) == y).float().mean()
                    wandb.log({"validation loss": loss},step= global_steps)
                    #print(f"Validation loss: {loss:.4f}, accuracy: {accuracy:.4f}")
                model_right.train()


    #print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

# YOUR CODE ENDS HERE

VBox(children=(Label(value='0.001 MB of 0.012 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=0.077568…

TypeError: ignored

In [20]:
import os
folder_name = "trained6"
folder_path = "/content/drive/MyDrive/Colab_Notebooks/Transformers_pt_1/" + folder_name
os.makedirs(folder_path, exist_ok=True)
model_right.save_pretrained(folder_path)

# Generate text using your model

Now it's time to see what this model can do. Implement a generation function.
The idea is to start with some prefix text, predict the next character, append it to the prefix, and repeat the process.
You can stop generating text when you reach MAX_LEN tokens.

Use `torch.no_grad()` context manager to make sure that you don't compute gradients during generation, or it will blow up your GPU memory.

## Coding task 3.6

Implement a generation function that accepts a prefix text and generates the next tokens up to MAX_LEN.

### Grading criteria
**(2 points)**

* 2 points for generation function
* -1 point if you forget to put your model to evaluation mode during generation and back to training mode after generation or if you forget to use `torch.no_grad()` context manager, or if you are not using the GPU.

In [21]:
# YOUR CODE STARTS HERE (our implementation is about 10 lines)
folder_name = "trained6"
folder_path = "/content/drive/MyDrive/Colab_Notebooks/Transformers_pt_1/" + folder_name
model_right.from_pretrained(folder_path)
# YOUR CODE ENDS HERE

TransformerLM(
  (encoder): TransformerEncoder(
    (embedding): Embedding(128, 128)
    (positional_embedding): Embedding(128, 128)
    (trans_layers): ModuleList(
      (0-7): 8 x TransformerEncoderLayer(
        (self_attention): MultiHeadSelfAttention(
          (k): Linear(in_features=128, out_features=128, bias=True)
          (q): Linear(in_features=128, out_features=128, bias=True)
          (v): Linear(in_features=128, out_features=128, bias=True)
          (mix): Linear(in_features=128, out_features=128, bias=True)
          (dropout): Dropout(p=0, inplace=False)
        )
        (linear_1): Linear(in_features=128, out_features=128, bias=True)
        (linear_2): Linear(in_features=128, out_features=128, bias=True)
        (norm_1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm_2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout_1): Dropout(p=0.1, inplace=False)
        (dropout_2): Dropout(p=0.1, inplace=False)
        (relu): Re

In [50]:
def gogopowerrangers(prompt: str, top_k = 3) -> str:
    model_right.eval()
    with torch.no_grad():
        # encode
        input = torch.tensor(_tokenizer.encode(prompt), dtype=torch.long).unsqueeze(0)

        for _ in range(MAX_LEN - len(prompt)):
            outputs = model_right(input)
            probabilities = outputs[:, -1, :]
            probabilities = torch.nn.functional.softmax(probabilities, dim=-1)
            #top_k
            next_chars = torch.multinomial(probabilities, num_samples= top_k).squeeze(0)
            #print(next_chars.shape)
            random_index = torch.randint(0, len(next_chars), (1,))
            #print(random_index.shape)
            sample_char = next_chars[random_index]
            input = torch.cat([input, sample_char.unsqueeze(0)], dim=1)

    model_right.train()
    return input.squeeze(0).tolist()

prompt = 'she goes: they say that sun'
result = gogopowerrangers(prompt, 5)
print(result)
print(_tokenizer.decode(result))



[3, 22, 8, 5, 40, 14, 8, 3, 10, 5, 4, 22, 8, 20, 5, 3, 19, 20, 5, 4, 22, 19, 4, 5, 3, 21, 9, 10, 101, 101, 101, 101, 101, 32, 119, 67, 54, 111, 115, 107, 107, 107, 98, 97, 75, 75, 93, 70, 70, 70, 79, 79, 121, 110, 42, 117, 90, 90, 16, 101, 101, 101, 58, 84, 84, 115, 84, 74, 106, 17, 105, 112, 79, 79, 63, 44, 104, 108, 116, 116, 116, 87, 26, 72, 24, 68, 77, 105, 105, 104, 77, 72, 70, 118, 118, 6, 103, 119, 103, 103, 117, 110, 110, 110, 101, 101, 101, 32, 77, 102, 103, 118, 65, 65, 113, 53, 112, 74, 74, 97, 98, 97, 74, 119, 102, 32, 77, 66]
she goes: they say that sun:?qbpKc3O.mC?P?


# Exploring hyperparameters and understanding Transformers

Train at least 10 models with different hyperparameters and compare them using wandb. Write a short report.


### Grading criteria
**(5 points max + extra points)**

* 4 points for training 10+ models. 2 points if 5-9 models are trained.
* 1 point for training report that describes what you did and what you learned about the hyperparameters and efficient training.
* Extra point for every 10 more models trained (up to 10 extra points). Please be reasonalbe, training a model for 10 seconds does not count, they need to be if not converged, at least trained for a while.