# Bigram Language Model and Generative Pretrained Transformer (GPT)

The objective of this ipynb is to train a simplified transformer model. The primary differences between the implementation:
* tokenizer (we use a character level encoder simplicity and compute constraints)
* size (we are using 1 consumer grade gpu hosted on colab and a small dataset. in practice, the models are much larger and are trained on much more data)
* efficiency


Most modern LLMs have multiple training stages, so we won't get a model that is capable of replying to you yet. However, this is the first step towards a model like ChatGPT and Llama.




In [102]:
! pip install torch



In [103]:
! pip install numpy matplotlib



In [104]:
%matplotlib inline
import torch
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from torch import nn

## Part 1: Bigram MLP for TinyShakespeare

1a). Create a list `chars` that contains all unique characters in `text`

1b). Implement `encode(s: str) -> list[int]`

1c). Implement `decode(ids: list[int]) -> str`

1d) . Create two tensors, `inputs_one_hot` and `outputs_one_hot`. Use one hot encoding. Make sure to get every consecutive pair of characters. For example, for the word 'hello', we should create the following input-output pairs
```
he
el
ll
lo
```

1e). Implement BigramOneHotMLP, a 2 layer MLP that predicts the next token. Specifically, implement the constructor, forward, and generate. The output dimension of the first layer should be 8. Use `torch.optim`. The activation function for the first layer should be `nn.LeakyReLU()`

Note: Use the `torch.nn.function.cross_entropy` loss. Read the [docs](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html) about how this loss function works. The logits are the output of a network WITHOUT an activation function applied to the last layer. There are activation functions are applied to every layer except the last.

1f). Train the BigramOneHotMLP for 1000 steps.

1g) . Create two tensors, `input_ids` and `outputs_one_hot`. These `input_ids` will be used for the embedding layer.

1h). Implement and train BigramEmbeddingMLP, a 2 layer mlp that predicts the next token. Specifically, implement the constructor, forward, and generate functions. The output dimension of the first layer should be 8. Use `torch.optim`.



Note: the output will look like gibberish


In [105]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-02-04 18:30:54--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.16’


2024-02-04 18:30:55 (95.4 MB/s) - ‘input.txt.16’ saved [1115394/1115394]



In [124]:
# For the bigram model, let's use the first 1000 characters for the data

with open('input.txt', 'r') as f:
    text = f.read()
text = text[:1000]

In [125]:
# 1a)
chars = list(set(text))
print(f"1a) {chars}")

# 1b)
c2id = {c: i for i, c in enumerate(chars)}
id2c = {i: c for c, i in c2id.items()}

def encode(s: str) -> list[int]:
    return [c2id[c] for c in s]

# 1c)
def decode(ids: list[int]) -> str:
    return ''.join([id2c[i] for i in ids])

# 1d)
def create_one_hot_inputs_and_outputs(text) -> list[torch.tensor, torch.tensor]:
    inputs = []
    outputs = []
    for i in range(len(text) - 1):
        input_idx = encode(text[i])[0]
        output_idx = encode(text[i + 1])[0]

        input_one_hot = torch.zeros(len(chars))
        output_one_hot = torch.zeros(len(chars))

        input_one_hot[input_idx] = 1
        output_one_hot[output_idx] = 1

        inputs.append(input_one_hot)
        outputs.append(output_one_hot)

    inputs_one_hot = torch.stack(inputs)
    outputs_one_hot = torch.stack(outputs)

    return inputs_one_hot, outputs_one_hot

inputs_one_hot, outputs_one_hot = create_one_hot_inputs_and_outputs(text)
# print(inputs_one_hot[1])
# print(outputs_one_hot[0])

# 1e)
vocab_size = len(chars)
class BigramOneHotMLP(nn.Module):
    def __init__(self, vocab_size):
        super(BigramOneHotMLP, self).__init__()
        self.vocab_size = vocab_size
        self.fc1 = nn.Linear(vocab_size, 8)
        self.fc2 = nn.Linear(8, vocab_size)
        self.activation = nn.LeakyReLU()

    def forward(self, x):
        x = self.activation(self.fc1(x))
        x = self.fc2(x)
        return x

    def generate(self, start='a', max_new_tokens=100) -> str:
          bigram_one_hot_mlp.eval()
          with torch.no_grad():
            current_char = start
            word = current_char
            for _ in range(max_new_tokens):
                input_tensor = F.one_hot(torch.tensor(c2id[current_char]), vocab_size).unsqueeze(0).float()
                output = bigram_one_hot_mlp(input_tensor)
                next_char_id = torch.argmax(output).item()
                next_char = [k for k, v in c2id.items() if v == next_char_id][0]
                current_char = next_char
                word += current_char
            return word

bigram_one_hot_mlp = BigramOneHotMLP(vocab_size=vocab_size)

# 1f)
import torch.optim as optim
import torch.nn.functional as F
optimizer = optim.Adam(bigram_one_hot_mlp.parameters(), lr=0.1)

# training loop
for _ in range(1000):
    optimizer.zero_grad()
    predict = bigram_one_hot_mlp(inputs_one_hot)
    loss = F.cross_entropy(predict, outputs_one_hot)
    loss.backward()
    optimizer.step()


print(bigram_one_hot_mlp.generate())

1a) ['u', 'b', ':', 'A', 'I', 'p', 'h', ',', ' ', 'Y', "'", 'g', 'C', 's', '!', 'S', '?', 'R', 'O', 'z', 'B', 'k', 'j', 't', 'c', 'w', 'F', 'y', 'd', 'l', 'v', '\n', 'M', ';', 'N', 'o', 'm', 'i', 'n', '.', 'W', 'r', 'f', 'a', 'e', 'L']
an the the the the the the the the the the the the the the the the the the the the the the the the th


In [127]:
# 1g)
def create_embedding_inputs_and_outputs(text) -> list[torch.tensor, torch.tensor]:
    input_ids = []
    outputs = []
    for i in range(len(text) - 1):
        input_idx = encode(text[i])[0]
        output_idx = encode(text[i + 1])[0]

        output_one_hot = torch.zeros(len(chars))
        output_one_hot[output_idx] = 1

        input_ids.append(input_idx)
        outputs.append(output_one_hot)

    outputs_one_hot = torch.stack(outputs)


    return torch.tensor(input_ids), outputs_one_hot
input_ids, outputs_one_hot = create_embedding_inputs_and_outputs(text)
# 1h)
class BigramEmbeddingMLP(nn.Module):
    def __init__(self, vocab_size,embedding_dim=8, hidden_size=8):
        super(BigramEmbeddingMLP, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim=embedding_dim)
        self.fc1 = nn.Linear(in_features=embedding_dim, out_features=hidden_size)
        self.fc2 = nn.Linear(in_features=hidden_size, out_features=vocab_size)
        self.activation = nn.LeakyReLU()

    def forward(self, x):
        x = self.embedding(x)
        x = self.activation(self.fc1(x))
        x = self.fc2(x)
        return x

    def generate(self, start='a', max_new_tokens=100) -> str:
        generated_text = start
        for _ in range(max_new_tokens):
            input_idx = torch.tensor([encode(generated_text[-1])[0]])
            # print(encode(generated_text[-1])[0])
            output = self.forward(input_idx)
            next_token_id = torch.argmax(output).item()
            next_char = decode([next_token_id])
            generated_text += next_char
        return generated_text

bigram_embedding_mlp = BigramEmbeddingMLP(vocab_size)

optimizer = optim.Adam(bigram_embedding_mlp.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()


# training loop
for _ in range(1000):
    optimizer.zero_grad()
    outputs = bigram_embedding_mlp(input_ids)
    # print(input_ids.shape, outputs.shape)
    # break
    loss = criterion(outputs, outputs_one_hot)
    loss.backward()
    optimizer.step()

print(bigram_embedding_mlp.generate())

are the the the the the the the the the the the the the the the the the the the the the the the the t


## Part 2: Generative Pretrained Transformer

For this part, it is best to use a gpu. In the settings at the top go to Runtime -> Change Runtime Type and select T4 GPU

In [109]:
# run nvidia-smi to check gpu usage
!nvidia-smi

Sun Feb  4 18:30:59 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L40S                    On  | 00000000:01:00.0 Off |                    0 |
| N/A   51C    P0              88W / 350W |    587MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA L40S                    On  | 00000000:21:00.0 Off |  

In [110]:
# For the gpt model, let's use the full text

with open('input.txt', 'r') as f:
    text = f.read()

Implement a character level tokenization function.

1. Create a list of unique characters in the string.
2. Implement a function `encode(s: str) -> list[int]` that takes a string and returns a list of ids 
3. Implement a function `decode(ids: list[int]) -> str` that takes a list of ids (ints) and returns a string 


In [111]:
# 1
chars = list(set(text))
print(f"1: {chars}")

# 2
c2id = {c: i for i, c in enumerate(chars)}
id2c = {i: c for c, i in c2id.items()}

def encode(s: str) -> list[int]:
    return [c2id[c] for c in s]
# 3
def decode(ids: list[int]) -> str:
    return ''.join([id2c[i] for i in ids])

1: ['q', '&', 'u', '-', 'b', ':', 'E', 'A', 'T', 'I', 'V', 'Q', 'p', 'h', ',', ' ', 'Y', "'", 'g', 'C', 's', '!', 'G', 'S', '?', '$', 'R', 'O', 'J', 'z', 'B', 'k', 'x', 'j', 't', 'c', 'w', 'F', 'y', 'X', 'U', 'P', 'd', 'l', 'v', 'K', '\n', 'Z', 'M', '3', ';', 'N', 'D', 'o', 'H', 'm', 'i', 'n', '.', 'W', 'r', 'f', 'a', 'e', 'L']


In [112]:
data = torch.tensor(encode(text), dtype=torch.long).cuda()

In [113]:
block_size = 16
data[:block_size+1]

tensor([37, 56, 60, 20, 34, 15, 19, 56, 34, 56, 29, 63, 57,  5, 46, 30, 63],
       device='cuda:0')

To train a transformer, we feed the model `n` tokens (context) and try to predict the `n+1`th token (target) in the sequence.



In [114]:
x = data[:block_size]
y = data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([37], device='cuda:0') the target: 56
when input is tensor([37, 56], device='cuda:0') the target: 60
when input is tensor([37, 56, 60], device='cuda:0') the target: 20
when input is tensor([37, 56, 60, 20], device='cuda:0') the target: 34
when input is tensor([37, 56, 60, 20, 34], device='cuda:0') the target: 15
when input is tensor([37, 56, 60, 20, 34, 15], device='cuda:0') the target: 19
when input is tensor([37, 56, 60, 20, 34, 15, 19], device='cuda:0') the target: 56
when input is tensor([37, 56, 60, 20, 34, 15, 19, 56], device='cuda:0') the target: 34
when input is tensor([37, 56, 60, 20, 34, 15, 19, 56, 34], device='cuda:0') the target: 56
when input is tensor([37, 56, 60, 20, 34, 15, 19, 56, 34, 56], device='cuda:0') the target: 29
when input is tensor([37, 56, 60, 20, 34, 15, 19, 56, 34, 56, 29], device='cuda:0') the target: 63
when input is tensor([37, 56, 60, 20, 34, 15, 19, 56, 34, 56, 29, 63], device='cuda:0') the target: 57
when input is tensor([37, 56

In [115]:
batch_size = 1024
device = 'cuda' if torch.cuda.is_available() else 'cpu'
def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
# x,y = get_batch()
# print(x.shape, y.shape)

### Single Self Attention Head
![](https://i.ibb.co/GWR1XG0/head.png)

In [116]:
import torch.nn as nn
class SelfAttentionHead(nn.Module):
    def __init__(self, feature_size = 64, head_size = 16):
      super(SelfAttentionHead, self).__init__()
      self.query = nn.Linear(feature_size, head_size)
      self.key = nn.Linear(feature_size, head_size)
      self.value = nn.Linear(feature_size, head_size)
      self.dropout = nn.Dropout(p=0.1)

    def forward(self, x):
      batch, timesteps, channels = x.shape
      Q = self.query(x)
      K = self.key(x)
      V = self.value(x)

      K_t = K.transpose(-2, -1)

      attention_scores = torch.matmul(Q, K_t) / (K.size(-1) ** 0.5)

      mask = torch.tril(torch.ones(timesteps, timesteps)).to('cuda')
      masked_attention = attention_scores.masked_fill(mask == 0, float('-inf'))
      masked_attention = F.softmax(masked_attention, dim=-1)

      attention_probs = self.dropout(masked_attention)
      attention_output = torch.matmul(attention_probs, V)

      return attention_output

### Multihead Self Attention 

`constructor`

- Create 4 `SelfAttentionHead` instances. Consider using `nn.ModuleList`
- Create a linear layer with n_embd input dim and n_embd output dim

`forward`

In the forward implementation, pass `x` through each head, then concatenate all the outputs along the feature dimension, then pass the concatenated output through the linear layer

![](https://i.ibb.co/y5SwyZZ/multihead.png)

In [117]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads=4, head_size=16, feature_size=64, dropout_rate=0.1):
        super(MultiHeadAttention, self).__init__()
        self.heads = nn.ModuleList([SelfAttentionHead(feature_size, head_size) for _ in range(num_heads)])
        self.linear = nn.Linear(num_heads * head_size, feature_size)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        head_outputs = [head(x) for head in self.heads]
        concatenated_output = torch.cat(head_outputs, dim=-1)  
        attention_output = self.linear(concatenated_output)
        output = self.dropout(attention_output)
        return output

## MLP 
Implement a 2 layer MLP


![](https://i.ibb.co/C0DtrF5/ff.png)

In [118]:
class MLP(nn.Module):
    def __init__(self, input_dim=64, hidden_dim=256, output_dim=64, dropout_rate=0.1):
        super(MLP, self).__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout_rate)
    def forward(self, x: torch.tensor) -> torch.tensor:
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        x = self.dropout(x)
        return x

## Transformer block

Layer normalization help training stability by normalizing the outputs of neurons within a single layer across all features for each individual data point, not across a full batch or a specific feature.

Dropout is a form of regularization to prevent overfitting.

This is the diagram of a transformer block:

![](https://i.ibb.co/X85C473/block.png)

In [119]:
class Block(nn.Module):
    def __init__(self, n_embd=64, n_head=4, n_inner=256, dropout_rate=0.1):
        super(Block, self).__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = MultiHeadAttention(n_head, n_embd // n_head, n_embd, dropout_rate)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd, n_inner, n_embd, dropout_rate)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


## GPT

`constructor` 

1. create the token embedding table and the position embedding table
2. create variable `self.blocks` that is a series of `Block`s. The data will pass through each block sequentially. Consider using `nn.Sequential`
3. create a layer norm layer
4. create a linear layer for predicting the next token

`forward(self, idx, targets=None)`. 

`forward` takes a batch of context ids as input of size (B, T) and returns the logits and the loss, if targets is not None. If targets is None, return the logits and None.
1. get the token by using the token embedding table created in the constructor
2. create the position embeddings
3. sum the token and position embeddings to get the model input
4. pass the model through the blocks, the layernorm layer, and the final linear layer
5. compute the loss

`generate(start_char, max_new_tokens, top_p, top_k, temperature) -> str` 
1. implement top p, top_k, and temperature for sampling



![](https://i.ibb.co/n8sbQ0V/Screenshot-2024-01-23-at-8-59-08-PM.png)

In [120]:
class GPT(nn.Module):
    def __init__(self, vocab_size, n_embd=64, n_head=4, n_layer=3, max_seq_len=128, dropout_rate=0.1):
        super(GPT, self).__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        # self.positional_embeddings = self.sin_position_encoding(seq_len=max_seq_len, d_model=n_embd).cuda()
        self.positional_embeddings = nn.Embedding(max_seq_len, n_embd)
        # print(self.positional_embeddings)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head, n_embd * n_head, dropout_rate) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)
        self.dropout = nn.Dropout(dropout_rate)
        self.position_indices = torch.arange(0, max_seq_len).unsqueeze(0).cuda()


    def forward(self, idx, targets=None):
        token_embeddings = self.token_embedding(idx)
        # print(f"token_embeddings = {token_embeddings.shape}")
        # position_embeddings = self.position_embedding[:, :idx.size(1), :]
        # positions = torch.arange(0, token_embeddings.size(1)).unsqueeze(0)
        position_indices = self.position_indices[:, :token_embeddings.size(1)].expand(token_embeddings.size(0), -1)
        # positions = positions.to(token_embeddings.device)
        # print(position_indices.shape)
        # print(position_indices)
        position_embeddings = self.positional_embeddings(position_indices)
        x = token_embeddings + position_embeddings #
        # print(f"x1: {x.shape}")
        x = self.blocks(self.dropout(x))
        # print(f"x2: {x.shape}")
        x = self.ln_f(x)
        # print(f"x3: {x.shape}")
        logits = self.head(x)
        # print(f"logits: {logits.shape}")

        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        else:
            loss = None

        return logits, loss

    def generate(self, start_char='th', max_new_tokens=32, top_p=0.9, top_k=5, temperature=1.0):
        self.eval()
        with torch.no_grad():
            generated_text = start_char
            for _ in range(max_new_tokens):
                # input_idx = torch.tensor([encode(generated_text[-1])[0]]).to('cuda')
                input_idx = torch.tensor(encode(generated_text)[-block_size:]).to('cuda').reshape(1, -1)
        
                print(f"Input: {decode(input_idx.reshape(-1).cpu().tolist())}")
                logits, _ = self.forward(input_idx)
                probabilities = F.softmax(logits[:, -1, :] / temperature, dim=-1).cpu().numpy().reshape(-1)
                chosen_index = self.top_k_top_p_filtering(probabilities, top_k=top_k, top_p=top_p)
                next_char = decode([chosen_index])
                tmp = torch.argmax(logits, dim=-1).cpu().tolist()[0]
                # print(tmp)
                print(f"Output: {decode(tmp)}")
                print("==========")
                generated_text += next_char

        return generated_text


    def top_k_top_p_filtering(self, probabilities, top_k=5, top_p=0.9):


        def top_k_sampling(probabilities, k=5):
            """
            Performs top-k sampling from a probability distribution.

            :param probabilities: A 1D numpy array containing the probability distribution.
            :param k: The number of top elements to consider for sampling.
            :return: An index sampled from the top-k distribution.
            """
            # Get indices of the top k probabilities
            top_k_indices = np.argsort(probabilities)[-k:]

            # Extract the top k probabilities
            top_k_probabilities = probabilities[top_k_indices]

            # Normalize the top k probabilities so they sum to 1
            top_k_probabilities /= top_k_probabilities.sum()

            # Sample from the top k elements
            chosen_index = np.random.choice(top_k_indices, p=top_k_probabilities)

            return chosen_index

        def top_p_sampling(probabilities, p=0.9):
            """
            Selects tokens from a probability distribution that have a cumulative probability
            greater than the threshold p.

            :param probabilities: A 1D numpy array containing the probability distribution.
            :param p: The cumulative probability threshold (0 < p <= 1).
            :return: An index sampled according to the top-p distribution.
            """
            # print(f"{probabilities=}")
            # Sort probabilities in descending order and also get the original indices
            sorted_indices = np.argsort(probabilities)[::-1]
            # print(f"{sorted_indices=}")
            sorted_probabilities = probabilities[sorted_indices]

            # print(f"{sorted_probabilities=}")

            # Compute cumulative probabilities
            cumulative_probabilities = np.cumsum(sorted_probabilities)
            # print(f"{cumulative_probabilities=}")

            # Find the threshold index
            cutoff_index = np.where(cumulative_probabilities > p)[0][0]
            # print(f"{cutoff_index=}")

            # Filter out indices and probabilities that don't meet the threshold
            filtered_indices = sorted_indices[:cutoff_index + 1]
            # print(f"{filtered_indices=}")
            filtered_probabilities = sorted_probabilities[:cutoff_index + 1]
            # print(f"{filtered_probabilities=}")

            # Normalize the filtered probabilities
            filtered_probabilities /= filtered_probabilities.sum()
            # print(f"normalized_filtered_probabilities{filtered_probabilities}")

            # # Sample from the filtered distribution
            # chosen_index = np.random.choice(filtered_indices, p=filtered_probabilities)
            # print(chosen_index)
            return filtered_probabilities

        probabilities = top_p_sampling(probabilities, p=top_p)
        chosen_index = top_k_sampling(probabilities, k=top_k)
        return chosen_index


### Training loop 

implement training loop

In [121]:
model = GPT(vocab_size=len(chars), n_embd=64, n_head=4, n_layer=2, max_seq_len=32, dropout_rate=0.01).to('cuda') # make you are running this on the GPU
max_iters = 20000

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
model.train()
from tqdm import tqdm
for iter in tqdm(range(max_iters)):
    x, y = get_batch()
    # print(x)

    # break
    _, loss = model(x, y)
    if iter % 2000 == 0:
        print(loss)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()



  0%|          | 0/20000 [00:00<?, ?it/s]

tensor(4.3530, device='cuda:0', grad_fn=<NllLossBackward0>)


 10%|█         | 2008/20000 [00:47<07:15, 41.27it/s]

tensor(1.7146, device='cuda:0', grad_fn=<NllLossBackward0>)


 20%|██        | 4007/20000 [01:32<05:16, 50.60it/s]

tensor(1.6360, device='cuda:0', grad_fn=<NllLossBackward0>)


 30%|███       | 6010/20000 [02:17<04:57, 47.05it/s]

tensor(1.6077, device='cuda:0', grad_fn=<NllLossBackward0>)


 40%|████      | 8007/20000 [03:04<05:13, 38.23it/s]

tensor(1.6105, device='cuda:0', grad_fn=<NllLossBackward0>)


 50%|█████     | 10007/20000 [03:46<03:17, 50.68it/s]

tensor(1.5661, device='cuda:0', grad_fn=<NllLossBackward0>)


 60%|██████    | 12007/20000 [04:28<03:14, 41.09it/s]

tensor(1.6158, device='cuda:0', grad_fn=<NllLossBackward0>)


 70%|███████   | 14010/20000 [05:13<01:54, 52.19it/s]

tensor(1.5697, device='cuda:0', grad_fn=<NllLossBackward0>)


 80%|████████  | 16006/20000 [05:53<01:22, 48.55it/s]

tensor(1.5612, device='cuda:0', grad_fn=<NllLossBackward0>)


 90%|█████████ | 18009/20000 [06:35<00:38, 52.08it/s]

tensor(1.5644, device='cuda:0', grad_fn=<NllLossBackward0>)


100%|██████████| 20000/20000 [07:14<00:00, 46.03it/s]


### Generate text


print some text that your model generates

In [122]:
model.generate(start_char="I tell you, fri", temperature=0.5, top_p=0.99, top_k=1)

Input: I tell you, fri
Output:  hhll mou  sooe
Input: I tell you, friq
Output:  hhll mou  sooeu
Input:  tell you, friqq
Output: thll tou  sooeuu
Input: tell you, friqqq
Output: hrl tou  sooeuuu
Input: ell you, friqqqq
Output:  l tou  sooeuuuu
Input: ll you, friqqqqq
Output: l tou  sooeuuuuu
Input: l you, friqqqqqq
Output: ltour sooeuuuuuu
Input:  you, friqqqqqqq
Output: tour sooeuuuuuuu
Input: you, friqqqqqqqq
Output:  ur sooeuuuuuuuu
Input: ou, friqqqqqqqqq
Output: ur tooeuuuuuuuuu
Input: u, friqqqqqqqqqq
Output: r tooeuuuuuuuuuu
Input: , friqqqqqqqqqqq
Output:  tooeuuuuuuuuuuu
Input:  friqqqqqqqqqqqq
Output: tooeuuuuuuuuuuuu
Input: friqqqqqqqqqqqqq
Output:  oeuuuuuuuuuuuuu
Input: riqqqqqqqqqqqqqq
Output:  nuuuuuuuuuuuuuu
Input: iqqqqqqqqqqqqqqq
Output: nuuuuuuuuuuuuuuu
Input: qqqqqqqqqqqqqqqq
Output: uuuuuuuuuuuuuuuu
Input: qqqqqqqqqqqqqqqq
Output: uuuuuuuuuuuuuuuu
Input: qqqqqqqqqqqqqqqq
Output: uuuuuuuuuuuuuuuu
Input: qqqqqqqqqqqqqqqq
Output: uuuuuuuuuuuuuuuu
Input: qqqqqqqqqqqqqqq

'I tell you, friqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq'