In [1]:
import sys

import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data

if 'google.colab' in sys.modules:
    print("host is colab")
    !git clone --quiet https://github.com/cwinsor/uml_comp5300.git
    sys.path.append('/content/uml_comp5300/hw5_transformer/')  # make sure we can import transformer_lm
else:
    print("host is traditional server")
    sys.path.append('../')  # make sure we can import transformer_lm

host is traditional server


# Training a transformer language model

In this notebook, we will learn how to

1. preprocess data for language modeling
2. use `torch.utils.data` to handle batching in an efficient and standard way
3. train a transformer language model

Specifically, we will use the Tiny Shakespeare dataset, which contains the complete works of William Shakespeare, to train a language model. The goal of this notebook is to walk you through the steps of pre-processing the dataset and preparing it for training using the PyTorch DataLoader, creating a language model, training it and using it to generate text.

We will train a character-based langauge model instead of word-based, because:

1. It's faster to train it to the point that it can generate text
2. We don't want to complicate the homework with BPE tokenization
3. We work with a small dataset which might not be enough to train a word-based language model

> Feel free to try training a word-based language model on a larger dataset, such as the WikiText-2 dataset, which is available in the hugginface datasets library.

# Step 1: Load and Explore the Dataset
The first step is to load the dataset and explore it. In this example, we will use the Tiny Shakespeare dataset, which contains the complete works of William Shakespeare. We can download the dataset from the following URL: https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Feel free to use `wget` to download the dataset or just download the file manually and upload it to your Colab instance.

Here's how you can use `wget` to download the dataset:
```
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O tiny_shakespeare.txt
```

In [2]:
!pip install wget
import wget
import os
wget.download('https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt')
os.rename('input.txt', 'tiny_shakespeare.txt')



FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'input.txt' -> 'tiny_shakespeare.txt'

## Coding task 3.1: load the data and take a look

Read the file to a variable named `raw_data` and print the first 1000 characters.

### Grading criteria
**(1 point max)**

1 point if everything works

In [3]:
with open("tiny_shakespeare.txt", "r") as f:
    raw_data = f.read()

print(raw_data[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



## Inline question 3.1: raw text preprocessing
**(1 point max, 1 extra point for creative ideas)**

Think about how you can pre-process the data (in terms of modifying the text). Provde three ideas and explain why you think they are useful or not. Think about the size of the data, tokenization method (we will use character-level language model), your computational resources, and what kind of text you want to generate. Make this answer as extensive as possible.

***Your answer:***

1) The conversational format ("SPEAKER:\n") is significant and should be preserved. So (my opinion) keep the uppercase and all punctuation. Even with this the vocabulary will be small (about 200) because there's no hashtags or URLs.

2) No stemming, lemmitization, stop-word required.

3) "n-gram" not required as the transformer is already looking at sequences.

4) Another option that would be interesting is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). LDA is a technique used to compare objects that are comprised of a variable number of components. LDA is a general technique and can be applied in many domains. LDA is the extension of the Beta distribution to multinominal variable. It assumes components in an object are drawn from a multinominal distribution with each document sampled from the same vocabulary but with different proportions.
As applied to documents the objective is to allow comparing documents to one-another and to document sets. It is a 3-level probabilistic model consisting of words, documents and topics. An experiment is the drawing of a document, and the random variables are first a collection of one or more topics, and second a topic is comprised of (a distribution over) a set of words that is the vocabulary. The model differs from the earlier unigram, mixture of unigrams and pLSI/aspect models because it expresses a document as one or more topics, where the earlier models express a document directly as terms without a per-document topic distribution.

5) Encoding is the process of transforming the sequence of tokens into sequences of numbers. This can be a simple dictionary lookup or can be more sophisticated. For example byte-pair encoding (Sennrich) works by iteratively combining recurring adjacent pairs into a single code. The result is that common token-pairs are represented as a single code and rare words are a sequence of sub-word codes. In choosing an encoding scheme the tradeoff is between codebook size vs sequence length - a larger codebook correlates to shorter encoded sequences and vice-versa.


# Step 2: preparing the data for the model

## Coding task 3.2
Similar to previous homeworks, where we made a vocabualry of words, we will make a vocabulary of characters.

1. Make a vocabulary of all characters
2. Make `char2idx`
3. Make a class `Tokenizer` that stores `char2idx` and has two methods: `encode` and `decode` that encode and decode text using `char2idx` and `idx2char` dictionaries.
   * You might find it useful to create `idx2char` dictionary inside the `__init__` method of the `Tokenizer` class.
4. Create a `Tokenizer` object
5. Convert the text to a list of integers using `char2idx`, assign it to a variable named `data`
6. Print the first 100 items of `data`

It's useful to have a function that converts a sequence of indices to a string. You will need it to convert the output of the model to a text when you will be generating text, but is it also very useful for **debugging** your pre-processing code.

### Grading criteria
**(2 points max)**

1. 1 point for `char2idx` dictionary
2. 1 point for `Tokenizer` class that passes the tests below

In [4]:
# YOUR CODE STARTS HERE (our implementation is about 4 lines using comprehensions, but it's alright if yours is longer)

chars = sorted(list(set(raw_data)))
vocab_size = len(chars)
print(f"vocab_size {vocab_size}")
char2idx = { ch:i for i,ch in enumerate(chars) }

class Tokenizer():

    def __init__(self, c2i):
        self.c2i = c2i
        self.i2c = { i:ch for i,ch in enumerate(chars) }
        # print(f"c2i:\n{self.c2i}")
        # print(f"i2c:\n{self.i2c}")
        # assert False, "hold up"
        
    def encode(self, from_text):
        out = [self.c2i[c] for c in from_text]
        return out
    
    def decode(self, from_codes):
        if torch.is_tensor(from_codes):
            from_codes = from_codes.tolist()
        out = ''.join([self.i2c[i] for i in from_codes])
        return out

# YOUR CODE ENDS HERE

vocab_size 65


In [5]:
_tokenizer = Tokenizer(char2idx)

_token_ids = _tokenizer.encode("hello")
_text = _tokenizer.decode(_token_ids)

assert isinstance(_token_ids, list), "token_ids should be a list"
assert isinstance(_token_ids[0], int), "token_ids should be a list of integers"
assert _text == "hello", "decode should work correctly and return the original text"

del _tokenizer, _token_ids, _text

# Chunk the data

Our data is too long to be processed in one go. We will split it into chunks of length 128. We will use the first 128 characters to predict the next character. This is a decent length for a sequence, but you can play with it if you want.

## Coding task 3.3

1. Create a list of sequences of length `MAX_LEN + 1`. Each sequence should be a list of integers. You'll see why we need `+ 1` in a minute.
   * You might need to get rid of your last example if it's shorter than `MAX_LEN + 1` characters. We need all data to be of the same length to simplify batching.
   * In the next homework we will implement batchihg for sequences of different lengths and you are probably not going to enjoy it, it's a bit tricky.
2. Split the data into training and validation sets. Use 90% of the data for training and 10% for validation.
3. Make x and y pairs for your data. Remember that we want to use the first 128 characters to predict the next character. So, `x` should be the first 128 characters and `y` should be a shifted version of the same sequence, so it's the last 128 characters. Name them `train_x` and `train_y` for the training set and `val_x` and `val_y` for the validation set.
4. Print an example from the training set. You should see that the first 128 characters are the same as the first 128 characters of the original text, and the last 128 characters are the same as the last 128 characters of the original text, shifted by one character.

You can just stride using `data[i:i+128]` for each `i` in `range(0, len(data), 128)`, no need to do anything fancy. You can figure out more complex ways to do it, just do this after all the homework is done. You receive no extra points if your homework is not finished.

### Grading criteria

1. 1 point for `data_chunks` list and train-test split
2. 1 point for dataset and dataloader objects
3. Extra point for a more interesting way to chunk the text
4. Extra point for implementing a custom dataset class

In [6]:
MAX_LEN = 128

# YOUR CODE STARTS HERE (our implementation is about 13 lines, but it's alright if yours is different)

tokenizer = Tokenizer(char2idx)
raw_codes = tokenizer.encode(raw_data)

x_list = [raw_codes[i:i+128] for i in range(0, len(raw_codes), 128)]
y_list = [raw_codes[i+1:i+128+1] for i in range(0, len(raw_codes), 128)]
# get rid of the stragglers
x_list = x_list[0:-1]
y_list = y_list[0:-1]

split = int(len(x_list) * 0.9)
x_train = x_list[:split]
x_val   = x_list[split:]
y_train = y_list[:split]
y_val   = y_list[split:]

# YOUR CODE ENDS HERE

# Using `torch.utils.data`

We will use `torch.utils.data.Dataset` to create a dataset object that will be used to create a `torch.utils.data.DataLoader` object. The `DataLoader` object will be used to create batches of data.

## Coding task 3.4

Your task is to learn how to use `torch.utils.data.Dataset` and `torch.utils.data.DataLoader` classes and to apply them to our data.

1. Convert your data to tensors of type long
1. Create a `torch.utils.data.Dataset` object for each train and test data. Name them `train_dataset` and `val_dataset`. You can use the `TensorDataset` class for this or make a new class that inherits from `torch.utils.data.Dataset` and implements the `__getitem__` and `__len__` methods.
2. Try indexing `train_dataset` to get a single example and decode it using `tokenizer.decode()`. What does it contain? Use tokenizer to decode one example (both x and y). Does it look like a valid text? Are the targets shifted by one character?
1. Use the `DataLoader` class to create `train_loader` and `val_loader` objects. It will shuffle and batch data for you. You can use the following parameters:
   * `dataset` - the dataset object you created in the previous step
   * `batch_size` - your choice!
   * `shuffle` - True for training data, False for validation data
   * `num_workers` - 8, number of CPU cores to use for batch preparation
3. Try iterating over `train_loader` and print the shapes of the batches.
    * You can use `break` to stop the loop after the first iteration.
4. Try decoding a batch that you get from `train_loader`. Does it look like a valid text? Are the targets shifted by one character?

Learn more about data feeding in pytorch here: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html


**NOTE:**
1. `TensorDataset` returns a tuple of tensors. Usually these are `(x, y)` pairs, where `x` is the input and `y` is the target. In our case, `x` is the input sequence and `y` is the same sequence shifted by one character. This is how we will train our language model. We will use the first 128 characters to predict the next character.
1. You need to convert your pytorch tensor into a python list in order to use `tokenizer.decode()`. Feel free to do it in-place or modify the `decode` method of the `Tokenizer` class to accept **BOTH** python lists and pytorch tensors. You can check what datatype you have using `isinstance()` function.
2. Printing might look a bit weird because you have a lot of `\n` in the data. It is alright, just be careful when you are verifying that your data is correct.

### Grading criteria

* 1 point for `train_dataset` and `val_dataset` objects
* 1 point if each test is written and passed:
  * train dataset element is correctly processed and x and y corespond to the correct characters
  * printed the shapes of the items that you get from `train_loader`
  * decoded a batch from `train_loader` and printed the decoded text and it is correct

In [7]:
# YOUR CODE STARTS HERE (our implementation is about 13 lines)

BATCH_SIZE = 128
NUM_WORKERS = 8

x_train_tensor = torch.tensor(x_train, dtype=torch.long)
x_val_tensor = torch.tensor(x_val, dtype=torch.long)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
y_val_tensor = torch.tensor(x_val, dtype=torch.long)

train_dataset = torch.utils.data.TensorDataset(x_train_tensor, y_train_tensor)
val_dataset = torch.utils.data.TensorDataset(x_val_tensor, y_val_tensor)

train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=BATCH_SIZE,
                                           shuffle=True,
                                           num_workers=NUM_WORKERS)
val_loader = torch.utils.data.DataLoader(val_dataset,
                                         batch_size=BATCH_SIZE,
                                         shuffle=True,
                                         num_workers=NUM_WORKERS)

for x, y in train_loader:
    print("---x----")
    print(x.shape)
    print(tokenizer.decode(x[0]))
    print("---y----")
    print(y.shape)
    print(tokenizer.decode(y[0]))
    break

# YOUR CODE ENDS HERE

---x----
torch.Size([128, 128])
an proceed that toucheth us
Whereof I shall not have intelligence.
Tell him his fears are shallow, wanting instance:
And for his
---y----
torch.Size([128, 128])
n proceed that toucheth us
Whereof I shall not have intelligence.
Tell him his fears are shallow, wanting instance:
And for his 


# Train a Transformer model

Import your `TransformerLM` model from `modeling_transormer` file and train it on the data you prepared above.
You know the drill: define a model, an optimizer, and a training loop, log everything to wandb.
You can also save your model using `TransformerLM.save_pretrained()` method and load it using `TransformerLM.from_pretrained()` method in case you want to.

### Tricky part

In PyTorch, `F.cross_entropy` expects the logits to be of shape `(batch_size, num_classes)` and the targets to be of shape `(batch_size,)` containing the class indices. In our case, the logits tensor has the shape `(batch_size, seq_len, num_classes)` and the targets are of shape `(batch_size, seq_len)`. We need to reshape the input and the targets to make them compatible with `F.cross_entropy`. You can do it like this:

```python
bs, seq_len, num_classes = logits.shape
logits = logits.reshape(bs * seq_len, num_classes)
targets = targets.reshape(bs * seq_len)
```

or, equivalently, like this:

```python
logits = logits.view(-1, num_classes)
targets = targets.view(-1)
```

Try monitoring your GPU consumption and max it out. The more efficient your code is, the faster your model will train.
During training log your loss and and accuracy. You can only log accuracy every 100 batches or so, because it is a bit slow to compute. You can also log the learning rate.
During evlauation you just need to log the perplexity, the loss, and accuracy. Perplexity is just `exp(loss)`.
Accuracy is not the most standard metric for language models, but it is very intererpretable and easy to compute. Don't expect it to be high, though.
Be mindful how frequenly you evaluate your model. You don't want to evaluate it too often, because it will slow down your training loop.

> You can also log the number of batches you process in one second (throughput) as a measure of efficiency. It is not required, but it is a good idea to monitor it.

## Coding task 3.5

Make a training loop and train your model.

### Grading criteria
**(5 points + extra points)**

* 2 points for trainig loop
* 1 point for using the GPU
* 1 point for evaluation loop (we recommend to make it into a separate function to make your code more readable)
* 1 point for wandb logging of train loss, eval loss, train accuracy, eval accuracy, eval perplexity. You can also log the learning rate, but it is not required.
* -1 point if forget to zero your gradients between batches
* -1 point if your forget to put your model to evaluation mode during evaluation and back to training mode during training
* Extra point for using a learning rate scheduler
* Extra point for any other improvements to the training loop


In [13]:
from transformer_lm.modeling_transformer import TransformerLM

# YOUR CODE STARTS HERE
from datetime import datetime
import time
import wandb

assert torch.cuda.is_available(), "the code requires CUDA"
device = torch.device("cuda")
start_time = time.time()

config = {
    "run_name": datetime.now().strftime("train_%m%d_%H_%M_%S"),

    "VOCAB_SIZE": vocab_size,
    "MAX_LENGTH": MAX_LEN,

    "NUM_LAYERS": 4,
    "NUM_HIDDEN": 30,
    "NUM_HEADS": 6,
    "FCN_HIDDEN": 71,
    "LEARNING_RATE": 1e-4,
    "DROPOUT": 0.1,

    "MAX_EPOCHS": 50,  # 5000
    "BATCH_SIZE": BATCH_SIZE,
    "EVAL_EVERY": 100,
}
print("config:")
for k, v in config.items():
    print(f"{k}: {v}")

wandb.login()
wandb.init(project="hw5_transformer", config=config)
wandb.define_metric("train_loss", summary="min")
wandb.define_metric("train_accuracy", summary="max")
wandb.define_metric("val_loss", summary="min")
wandb.define_metric("val_accuracy", summary="max")

# rLM(num_layers=2, hidden=15, num_heads=3, fcn_hidden=71, vocab_size=100, max_seq_len=7)

m_ = TransformerLM(
    num_layers=config["NUM_LAYERS"],
    hidden=config["NUM_HIDDEN"],
    num_heads=config["NUM_HEADS"],
    fcn_hidden=config["FCN_HIDDEN"],
    vocab_size=config["VOCAB_SIZE"],
    max_seq_len=config["MAX_LENGTH"],
    dropout=config["DROPOUT"])

model = m_.to(device) 
wandb.watch(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=config["LEARNING_RATE"])

global_step = -1
max_val_accuracy = 0.
for epoch in range(config["MAX_EPOCHS"]):

    # training
    for train_features, train_labels in train_loader:
        global_step += 1
        train_features = train_features.to(device)
        train_labels = train_labels.to(device)

        logits = model(train_features)

        logits = logits.view(-1, config["VOCAB_SIZE"],)
        train_labels = train_labels.view(-1)

        loss = F.cross_entropy(logits, train_labels)

        _, predictions = torch.max(logits, dim=1)
        train_num_predictions = predictions.shape[0]
        train_num_correct = torch.sum(predictions == train_labels)

        train_batch_accuracy = train_num_correct / train_num_predictions

        log_dict = {
            "train_loss": loss,
            "train_accuracy": train_batch_accuracy,
        }
        # "train_num_predictions": train_num_predictions,
        # "train_num_correct": train_num_correct,
        wandb.log(step=global_step, data=log_dict)

        # update the model
        loss.backward()
        optimizer.step()
        optimizer.zero_grad

        # validation
        if global_step % config["EVAL_EVERY"] == 0:
            model.eval()
            with torch.no_grad():

                val_total_loss = 0.
                val_num_predictions = 0
                val_num_correct = 0
                val_accuracy = 0.0

                val_num_batches = 0
                for val_features, val_labels in val_loader:
                    val_num_batches += 1

                    val_features = val_features.to(device)
                    val_labels = val_labels.to(device)

                    logits = model(val_features)

                    logits = logits.view(-1, config["VOCAB_SIZE"],)
                    val_labels = val_labels.view(-1)

                    val_total_loss += F.cross_entropy(logits, val_labels)

                    _, predictions = torch.max(logits, dim=1)
                    val_num_predictions += predictions.shape[0]
                    val_num_correct += torch.sum(predictions == val_labels)

                val_accuracy = val_num_correct / val_num_predictions
                val_avg_loss = val_total_loss / val_num_batches

                if val_accuracy > max_val_accuracy:
                    max_val_accuracy = val_accuracy

                log_dict = {
                    "val_loss": val_avg_loss,
                    "val_accuracy": val_accuracy,
                }
                # "max_val_accuracy": max_val_accuracy,
                # "val_num_predictions": val_num_predictions,
                # "val_num_correct": val_num_correct,
                wandb.log(step=global_step, data=log_dict)

                # print("global_step: ", global_step,
                #       "val_loss ", val_avg_loss,
                #       " val:accuracy: ", val_accuracy,
                #       " max_val_accuracy: ", max_val_accuracy)
            model.train()

wandb.finish()

print("final max_val_accuracy: ", max_val_accuracy)

elapsed = time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time))
print(f"Took: {elapsed}")

# model.save_pretrained(config["run_name"])

# YOUR CODE ENDS HERE



config:
run_name: train_0315_14_55_21
VOCAB_SIZE: 65
MAX_LENGTH: 128
NUM_LAYERS: 4
NUM_HIDDEN: 30
NUM_HEADS: 6
FCN_HIDDEN: 71
LEARNING_RATE: 0.0001
DROPOUT: 0.1
MAX_EPOCHS: 50
BATCH_SIZE: 128
EVAL_EVERY: 100


0,1
train_accuracy,▄▆▆▇▆▇█▁▂▇▇▇▁▂▂▃▆▆▆▆▆██▅▃▃▆▄▇▇▇▇▇▆▇▇█▇██
train_loss,█▆▄▃▂▂▁▃▂▃▅▁▄▅▄▃▂▃▃▃▂▃▃▂▂▂▂▁▂▂▂▁▂▂▂▁▁▃▃▂
val_accuracy,▁█▃█▄██▄████
val_loss,█▃▂▂▃▂▃▂▃▁▃▁


VBox(children=(Label(value='Waiting for wandb.init()...\r'), FloatProgress(value=0.016666666666666666, max=1.0…

VBox(children=(Label(value='0.001 MB of 0.157 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=0.008645…

0,1
train_accuracy,▇▇█▇██▅▅▇▆▁▃▇▇▇█▇▇▄▇▇▇▇▄▁▇▄▃▇█▂▃▁▄▇▇▇▆▄▆
train_loss,▆▃▁▂▁▃▃▃▄▃▃▃▂▂▂▂▂▂▂▃▂▂▃▂▄▅▃█▄▆▇▆▄▄▃▃▄▂▃▄
val_accuracy,▁████▁█▅▂█████▂█▂█▅▃█▅██▅▂▃██▅▅
val_loss,█▃▂▁▁▃▂▃▂▂▁▂▂▁▂▁▃▁▁▃▁▄▂▁▃▂▂▁▃▃▁


final max_val_accuracy:  tensor(0.1490, device='cuda:0')
Took: 00:04:19


AttributeError: 'TransformerLM' object has no attribute 'encoder'

# Generate text using your model

Now it's time to see what this model can do. Implement a generation function.
The idea is to start with some prefix text, predict the next character, append it to the prefix, and repeat the process.
You can stop generating text when you reach MAX_LEN tokens.

Use `torch.no_grad()` context manager to make sure that you don't compute gradients during generation, or it will blow up your GPU memory.

## Coding task 3.6

Implement a generation function that accepts a prefix text and generates the next tokens up to MAX_LEN.

### Grading criteria
**(2 points)**

* 2 points for generation function
* -1 point if you forget to put your model to evaluation mode during generation and back to training mode after generation or if you forget to use `torch.no_grad()` context manager, or if you are not using the GPU.

In [37]:
# YOUR CODE STARTS HERE (our implementation is about 10 lines)

# Reference https://www.youtube.com/watch?v=kCc8FmEb1nY
def generate(idx, max_new_tokens, max_seq_len):

    # idx is (B, T) array of indices in the current context
    for _ in range(max_new_tokens):
        # crop idx to the last block_size tokens
        idx_cond = idx[:, -max_seq_len:]
        # get the predictions
        logits = model(idx_cond)
        # focus only on the last time step
        logits = logits[:, -1, :] # becomes (B, C)
        #apply softmax to get probabilities
        probs = F.softmax(logits, dim=-1) # (B, c)
        # sample from the distribution
        idx_next = torch.multinomial(probs, num_samples=1) #(B, 1)
        # append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1) #B, T+1)

    return idx

# generate from the model
model.eval()
with torch.no_grad():
    context = torch.zeros((1, 1), dtype=torch.long, device=device)
    generated = generate(context, max_new_tokens=100, max_seq_len=config["MAX_LENGTH"]).tolist()
    gen_string = tokenizer.decode(generated[0])
    print("--- generated ----")
    print(generated[0])
    print(gen_string)
_ = model.train()

# YOUR CODE ENDS HERE

--- generated ----
[0, 57, 59, 1, 61, 52, 58, 56, 53, 13, 58, 27, 61, 26, 1, 28, 31, 53, 56, 52, 46, 56, 46, 42, 45, 47, 53, 52, 10, 58, 63, 46, 59, 8, 61, 49, 1, 56, 30, 43, 46, 44, 57, 47, 42, 10, 11, 8, 15, 46, 59, 43, 11, 13, 21, 0, 46, 43, 47, 6, 46, 53, 63, 61, 58, 45, 50, 60, 36, 53, 8, 52, 21, 58, 61, 44, 17, 53, 49, 52, 44, 58, 43, 45, 47, 53, 56, 46, 50, 10, 46, 43, 45, 47, 57, 10, 41, 39, 50, 1, 53]

su wntroAtOwN PSornhrhdgion:tyhu.wk rRehfsid:;.Chue;AI
hei,hoywtglvXo.nItwfEoknftegiorhl:hegis:cal o


# Exploring hyperparameters and understanding Transformers

Train at least 10 models with different hyperparameters and compare them using wandb. Write a short report.


### Grading criteria
**(5 points max + extra points)**

* 4 points for training 10+ models. 2 points if 5-9 models are trained.
* 1 point for training report that describes what you did and what you learned about the hyperparameters and efficient training.
* Extra point for every 10 more models trained (up to 10 extra points). Please be reasonalbe, training a model for 10 seconds does not count, they need to be if not converged, at least trained for a while.