**Alex and Derek**

Spring 2025

CS 444: Deep Learning

#### Project 4: Transformers

In this final notebook, we will train larger GPTs on a large corpus of prose: the entire works of Shakespeare. Once a GPT is trained, you will be able to prompt it with some text and it will generate text that plausibly follows.

In [19]:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=4)

# Automatically reload your external source code
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


![Some fun](images/transformer4.png)

## Task 8. Preprocess a large corpus of text

**NOTE:** There is no Task 7. It was removed due to time constraints.

<!-- Let's write code to load in the works of Shakespeare (`shakespeare.txt`) and preprocess it so that we can try a transformer on the text. -->

Run the test code in this section to make sure the works of Shakespeare (`shakespeare.txt`) are loaded and preprocessed properly for the transformer.

In [20]:
from preprocess_corpus import load_document, make_char2ind_map, make_seqs_and_labels

### 8a. Generate corpus and vocabulary

<!-- In `preprocess_corpus.py`, implement the `load_document` function to load in the Shakespeare corpus and make the vocabulary. -->

In [21]:
corpus, vocab = load_document(path2data='shakespeare.txt')

print(f'The vocabulary has {len(vocab)} tokens and it should have 65.')
print(f'The vocabulary is (split up over multiple lines):\n{vocab[:25]}\n{vocab[25:50]}\n{vocab[50:]}\n')
print('and it should be:')
print("""['\\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
['M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
['l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']""")

print(f'The corpus has {len(corpus)} chars and it should have 1115394.')
print(55*'-')
print('The first 50 chars of the corpus is:')
print(corpus[:50])
print('and it should be:')
print('''First Citizen:
Before we proceed any further, hear''')
print(55*'-')
print('The last 50 chars of the corpus is:')
print(corpus[-50:])
print('and it should be:')
print('''eep--die, rather; wink'st
Whiles thou art waking.
''')

The vocabulary has 65 tokens and it should have 65.
The vocabulary is (split up over multiple lines):
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
['M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
['l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

and it should be:
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
['M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
['l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
The corpus has 1115394 chars and it should have 1115394.
-------------------------------------------------------
The first 50 chars of the corpus is:
First Citizen:
Before we proceed any further, hear
and it should be:
Fi
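
A minimal sketch of what `load_document` might do, assuming the corpus is a plain UTF-8 text file and the vocabulary is the sorted set of unique characters (sorting by code point would give the ordering printed above). The names `make_vocab_sketch` and `load_document_sketch` are hypothetical and are not the course's `preprocess_corpus.py` implementation.

```python
def make_vocab_sketch(corpus):
    """Return the unique characters of the corpus, sorted by code point."""
    return sorted(set(corpus))


def load_document_sketch(path2data):
    """Read the whole text file into one string and build its char vocabulary."""
    with open(path2data, encoding='utf-8') as f:
        corpus = f.read()
    return corpus, make_vocab_sketch(corpus)


# Tiny in-memory example (no file needed)
print(make_vocab_sketch('hello\n'))
```

Because `'\n'` and `' '` have the smallest code points, they sort ahead of punctuation, digits, and letters.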

### 8b. Create char2ind map

<!-- In `preprocess_corpus.py`, implement the `make_char2ind_map` function and test it below. -->

In [22]:
char2ind_map = make_char2ind_map(vocab)

print(f'Size of your char2ind map is {len(char2ind_map)} and it should be 65.')
print('Keys of your char2ind map:')
print(''.join(char2ind_map.keys()))
print("They should be \n\n !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")
print('Values of your char2ind map:')
print(list(char2ind_map.values()))
print("They should be")
print('[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]')


Size of your char2ind map is 65 and it should be 65.
Keys of your char2ind map:

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
They should be 

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Values of your char2ind map:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]
They should be
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]
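
One way the map could be built, assuming each char is assigned its position in the sorted vocabulary (consistent with the values `0` to `64` printed above). `make_char2ind_map_sketch` is a hypothetical name used only for illustration.

```python
def make_char2ind_map_sketch(vocab):
    """Map each char in the vocabulary to its index in the vocab list."""
    return {char: ind for ind, char in enumerate(vocab)}


example_map = make_char2ind_map_sketch(['\n', ' ', '!'])
print(example_map)
```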


### 8c. Create sequences of int-coded texts and labels

<!-- In `preprocess_corpus.py`, implement the `make_seqs_and_labels` function, which should extract sequential `seq_len` long chunks (*our desired sequence length for the transformer*) to form the sequences on which we will train the transformer. The labels/targets are just the chars shifted by 1 (i.e. the next char in the corpus). -->

In [23]:
seq_len = 250
seqs, labels = make_seqs_and_labels(corpus, char2ind_map, seq_len=seq_len)

print(f'The shape of your Shakespeare sequences is {seqs.shape} and it should be (4461, 250).')
print(f'The shape of your Shakespeare labels is {labels.shape} and it should be (4461, 250).')
print('The first 15 int-coded tokens of the 1st few sequences are:')
print(seqs[:5, :15].numpy())
print('they should be:')
print('''[[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0]
 [ 0 13 50 50 10  0 35 43  1 49 52 53 61  5 58]
 [ 1 41 47 58 47 64 43 52 57  6  1 58 46 43  1]
 [ 1 58 46 43  1 53 40 48 43 41 58  1 53 44  1]
 [31 43 41 53 52 42  1 15 47 58 47 64 43 52 10]]''')

print('The first 15 int-coded tokens of the last few sequences are:')
print(seqs[-5:, :15].numpy())
print('they should be:')
print('''[[57 53  1 61 43 39 49 50 63  8  1 35 47 50 50]
 [ 6  0 16 53  1 52 53 58  1 53 51 47 58  1 58]
 [58 56 39 52 45 43  1 42 56 53 61 57 47 52 43]
 [42 56 53 54 54  5 42  6  1 39 57  1 40 63  1]
 [13 26 10  0 35 46 39 58  6  1 39 56 58  1 58]]''')

The shape of your Shakespeare sequences is (4461, 250) and it should be (4461, 250).
The shape of your Shakespeare labels is (4461, 250) and it should be (4461, 250).
The first 15 int-coded tokens of the 1st few sequences are:
[[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0]
 [ 0 13 50 50 10  0 35 43  1 49 52 53 61  5 58]
 [ 1 41 47 58 47 64 43 52 57  6  1 58 46 43  1]
 [ 1 58 46 43  1 53 40 48 43 41 58  1 53 44  1]
 [31 43 41 53 52 42  1 15 47 58 47 64 43 52 10]]
they should be:
[[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0]
 [ 0 13 50 50 10  0 35 43  1 49 52 53 61  5 58]
 [ 1 41 47 58 47 64 43 52 57  6  1 58 46 43  1]
 [ 1 58 46 43  1 53 40 48 43 41 58  1 53 44  1]
 [31 43 41 53 52 42  1 15 47 58 47 64 43 52 10]]
The first 15 int-coded tokens of the last few sequences are:
[[57 53  1 61 43 39 49 50 63  8  1 35 47 50 50]
 [ 6  0 16 53  1 52 53 58  1 53 51 47 58  1 58]
 [58 56 39 52 45 43  1 42 56 53 61 57 47 52 43]
 [42 56 53 54 54  5 42  6  1 39 57  1 40 63  1]
 [13 26 10  0 35 46 39
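
A minimal sketch of the chunking described in this task, assuming non-overlapping `seq_len`-long chunks with labels shifted one char to the right. It returns plain Python lists, whereas the tested function evidently returns tensors (the test code calls `.numpy()`), so treat it as an illustration rather than the expected implementation. The name `make_seqs_and_labels_sketch` is hypothetical.

```python
def make_seqs_and_labels_sketch(corpus, char2ind_map, seq_len):
    """Split an int-coded corpus into seq_len chunks plus next-char labels."""
    ids = [char2ind_map[c] for c in corpus]
    # Reserve one extra char so the last sequence still has a full label row
    n_seqs = (len(ids) - 1) // seq_len
    seqs = [ids[i * seq_len:(i + 1) * seq_len] for i in range(n_seqs)]
    labels = [ids[i * seq_len + 1:(i + 1) * seq_len + 1] for i in range(n_seqs)]
    return seqs, labels


toy_map = {c: i for i, c in enumerate('abcdefg')}
toy_seqs, toy_labels = make_seqs_and_labels_sketch('abcdefg', toy_map, seq_len=3)
print(toy_seqs)
print(toy_labels)
```

Under this convention, the Shakespeare corpus would give `(1115394 - 1) // 250 = 4461` sequences, which agrees with the expected shape above.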

### 8d. Add padding char to dictionary

**TODO:** Add the usual padding char (`'#'`) to the char2ind map at the next available int slot.

In [24]:
# Add padding char to dictionary at the next available int slot.
# len(char2ind_map) is 65 here; len(char2ind_map)-1 would reuse 'z''s code (64).
char2ind_map['#'] = len(char2ind_map)
char2ind_map

{'\n': 0,
 ' ': 1,
 '!': 2,
 '$': 3,
 '&': 4,
 "'": 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '3': 9,
 ':': 10,
 ';': 11,
 '?': 12,
 'A': 13,
 'B': 14,
 'C': 15,
 'D': 16,
 'E': 17,
 'F': 18,
 'G': 19,
 'H': 20,
 'I': 21,
 'J': 22,
 'K': 23,
 'L': 24,
 'M': 25,
 'N': 26,
 'O': 27,
 'P': 28,
 'Q': 29,
 'R': 30,
 'S': 31,
 'T': 32,
 'U': 33,
 'V': 34,
 'W': 35,
 'X': 36,
 'Y': 37,
 'Z': 38,
 'a': 39,
 'b': 40,
 'c': 41,
 'd': 42,
 'e': 43,
 'f': 44,
 'g': 45,
 'h': 46,
 'i': 47,
 'j': 48,
 'k': 49,
 'l': 50,
 'm': 51,
 'n': 52,
 'o': 53,
 'p': 54,
 'q': 55,
 'r': 56,
 's': 57,
 't': 58,
 'u': 59,
 'v': 60,
 'w': 61,
 'x': 62,
 'y': 63,
 'z': 64,
 '#': 65}

## Task 9. Train GPT on Shakespeare

Now we are ready to train a GPT on the works of Shakespeare!

### 9a. Build `GPTMini6`

We will use a deeper transformer called `GPTMini6` for training on the Shakespeare corpus. Build the neural network then check the summary below.

In [25]:
from gpts import GPTMini6

In [26]:
# TODO: Set padding_char_enc to the int coded padding token below
padding_char_enc = char2ind_map['#']
myminigpt = GPTMini6(vocab_sz=9, seq_len=15, padding_char_enc=padding_char_enc)
myminigpt.compile(loss='temporal_cross_entropy')

---------------------------------------------------------------------------
Dense layer output(Output_layer) shape: [1, 15, 9]
Transformer_Block_5:
	Transformer_Block_5_MLP:
	Dropout layer output(Transformer_Block_5_MLP_Dropout) shape: [1, 15, 384]
	Dense layer output(Transformer_Block_5_MLP_Dense2) shape: [1, 15, 384]
	Dense layer output(Transformer_Block_5_MLP_Dense1) shape: [1, 15, 1536]
	Transformer_Block_5_MHA:
	Dropout layer output(Transformer_Block_5_MHA_Dropout) shape: [1, 15, 384]
	Dense layer output(Transformer_Block_5_MHA_Dense) shape: [1, 15, 384]
	Transformer_Block_5_MHA_Attention:
	Dropout layer output(attention_dropout) shape: [1, 6, 15, 15]
	Transformer_Block_5_MHA_QKV:
	Dense layer output(QKVBlock_Value) shape: [1, 15, 384]
	Dense layer output(QKVBlock_Key) shape: [1, 15, 384]
	Dense layer output(QKVBlock_Query) shape: [1, 15, 384]
Transformer_Block_4:
	Transformer_Block_4_MLP:
	Dropout layer output(Transformer_Block_4_MLP_Dropout) shape: [1, 15, 384]
	Dense layer outp

The above cell should output:

```
---------------------------------------------------------------------------
Dense layer output(output) shape: [1, 15, 9]
TransformerBlock_5:
	TransformerBlock_5/MLP:
	Dropout layer output(TransformerBlock_5/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_5/multihead_attention:
	Dropout layer output(TransformerBlock_5/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_5/multihead_attention/attention:
	Dropout layer output(TransformerBlock_5/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_5/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_5/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_5/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
TransformerBlock_4:
	TransformerBlock_4/MLP:
	Dropout layer output(TransformerBlock_4/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_4/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_4/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_4/multihead_attention:
	Dropout layer output(TransformerBlock_4/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_4/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_4/multihead_attention/attention:
	Dropout layer output(TransformerBlock_4/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_4/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_4/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_4/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_4/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
TransformerBlock_3:
	TransformerBlock_3/MLP:
	Dropout layer output(TransformerBlock_3/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_3/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_3/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_3/multihead_attention:
	Dropout layer output(TransformerBlock_3/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_3/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_3/multihead_attention/attention:
	Dropout layer output(TransformerBlock_3/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_3/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_3/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_3/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_3/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
TransformerBlock_2:
	TransformerBlock_2/MLP:
	Dropout layer output(TransformerBlock_2/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_2/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_2/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_2/multihead_attention:
	Dropout layer output(TransformerBlock_2/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_2/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_2/multihead_attention/attention:
	Dropout layer output(TransformerBlock_2/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_2/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_2/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_2/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_2/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
TransformerBlock_1:
	TransformerBlock_1/MLP:
	Dropout layer output(TransformerBlock_1/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_1/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_1/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_1/multihead_attention:
	Dropout layer output(TransformerBlock_1/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_1/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_1/multihead_attention/attention:
	Dropout layer output(TransformerBlock_1/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_1/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_1/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_1/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_1/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
TransformerBlock_0:
	TransformerBlock_0/MLP:
	Dropout layer output(TransformerBlock_0/MLP/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_0/MLP/dense_1) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_0/MLP/dense_0) shape: [1, 15, 1536]
	TransformerBlock_0/multihead_attention:
	Dropout layer output(TransformerBlock_0/multihead_attention/dropout) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_0/multihead_attention/dense_1) shape: [1, 15, 384]
	TransformerBlock_0/multihead_attention/attention:
	Dropout layer output(TransformerBlock_0/multihead_attention/attention/dropout) shape: [1, 6, 15, 15]
	TransformerBlock_0/multihead_attention/qkv_block:
	Dense layer output(TransformerBlock_0/multihead_attention/qkv_block/dense_v) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_0/multihead_attention/qkv_block/dense_k) shape: [1, 15, 384]
	Dense layer output(TransformerBlock_0/multihead_attention/qkv_block/dense_q) shape: [1, 15, 384]
PositionalEncodingBlock:
	Dropout layer output(PositionalEncodingBlock/dropout) shape: [1, 15, 384]
	Positional encoding layer output(PositionalEncodingBlock/positional_enc_layer) shape: [1, 15, 384]
Embedding layer output(EmbeddingLayer) shape: [1, 15, 384]
---------------------------------------------------------------------------
```

### 9b. Train `GPTMini6` on the works of Shakespeare

Use default hyperparameters except for the following:
- For the validation set, use the 1st 200 sequences. For the training set, use all sequences beyond the 1st 200.
- Batch size of `64`.
- Patience of `15`.
- Learning rate decay patience of `9`.
- Learning rate should be allowed to decay no more than `3` times.
- Limit training to `100` epochs maximum.
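
To make the patience settings concrete, here is a minimal sketch of one common convention: count the epochs since the best validation loss, decay the learning rate once that count reaches the decay patience (up to a maximum number of decays), and stop once it reaches the early-stopping patience. The function names and the exact counting rule are assumptions for illustration; the `fit` method used in this project may count differently.

```python
def epochs_since_best(val_losses):
    """Number of epochs elapsed since the lowest validation loss so far."""
    best_idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_idx


def should_stop(val_losses, patience):
    """Early stopping: no improvement for `patience` epochs."""
    return epochs_since_best(val_losses) >= patience


def maybe_decay_lr(lr, val_losses, lr_patience, decay_factor, n_decays, max_decays):
    """Return (new_lr, new_n_decays), decaying only while decays remain."""
    if n_decays < max_decays and epochs_since_best(val_losses) >= lr_patience:
        return lr * decay_factor, n_decays + 1
    return lr, n_decays


history = [3.0, 2.0, 2.1, 2.2]
print(epochs_since_best(history), should_stop(history, patience=2))
```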

Make a well-labeled plot showing the **training and validation loss** over the course of training.

In [27]:
# Create sequences and labels
seq_len = 128  # Typical sequence length for character-level models
x_int, y_int = make_seqs_and_labels(corpus, char2ind_map, seq_len)

# Create validation and training splits according to the requirements
# For the validation set, use the 1st 200 sequences
x_val = x_int[:200]
y_val = y_int[:200]

# For the training set, use all sequences beyond the 1st 200
x_train = x_int[200:]
y_train = y_int[200:]

# Define the padding character (this is used for the GPTMini6 model)
padding_char_enc = char2ind_map['#']

# Initialize the GPTMini6 model with default hyperparameters
model = GPTMini6(
    vocab_sz=len(vocab),
    seq_len=seq_len,
    padding_char_enc=padding_char_enc,
    num_heads=6,
    embed_dim=384,
    dropout_rate=0.2
)

# Compile the model with cross entropy loss and Adam optimizer
model.compile(loss='temporal_cross_entropy', optimizer='adam', lr=1e-3)

---------------------------------------------------------------------------
Dense layer output(Output_layer) shape: [1, 128, 65]
Transformer_Block_5:
	Transformer_Block_5_MLP:
	Dropout layer output(Transformer_Block_5_MLP_Dropout) shape: [1, 128, 384]
	Dense layer output(Transformer_Block_5_MLP_Dense2) shape: [1, 128, 384]
	Dense layer output(Transformer_Block_5_MLP_Dense1) shape: [1, 128, 1536]
	Transformer_Block_5_MHA:
	Dropout layer output(Transformer_Block_5_MHA_Dropout) shape: [1, 128, 384]
	Dense layer output(Transformer_Block_5_MHA_Dense) shape: [1, 128, 384]
	Transformer_Block_5_MHA_Attention:
	Dropout layer output(attention_dropout) shape: [1, 6, 128, 128]
	Transformer_Block_5_MHA_QKV:
	Dense layer output(QKVBlock_Value) shape: [1, 128, 384]
	Dense layer output(QKVBlock_Key) shape: [1, 128, 384]
	Dense layer output(QKVBlock_Query) shape: [1, 128, 384]
Transformer_Block_4:
	Transformer_Block_4_MLP:
	Dropout layer output(Transformer_Block_4_MLP_Dropout) shape: [1, 128, 384]
	Den

In [28]:
train_loss_hist, val_loss_hist, val_acc_hist, n_epochs = model.fit(
    x=x_train,
    y=y_train,
    x_val=x_val,
    y_val=y_val,
    batch_size=64,
    max_epochs=100,
    patience=15,
    lr_patience=9,
    lr_decay_factor=0.5,
    lr_max_decays=3,
    val_every=1,
    verbose=True
)


Epoch 1/100: Train Loss: 3.4318, Val Loss: 3.2835, Val Acc: 0.0000, Time: 64.61s
Epoch 2/100: Train Loss: 3.0924, Val Loss: 2.6623, Val Acc: 0.0000, Time: 22.75s
Epoch 3/100: Train Loss: 2.5985, Val Loss: 2.4346, Val Acc: 0.0000, Time: 22.80s
Epoch 4/100: Train Loss: 2.3810, Val Loss: 2.2760, Val Acc: 0.0000, Time: 22.83s
Epoch 5/100: Train Loss: 2.2294, Val Loss: 2.1163, Val Acc: 0.0000, Time: 22.86s
Epoch 6/100: Train Loss: 2.0856, Val Loss: 1.9880, Val Acc: 0.0000, Time: 22.91s
Epoch 7/100: Train Loss: 1.9469, Val Loss: 1.8761, Val Acc: 0.0000, Time: 22.92s
Epoch 8/100: Train Loss: 1.8292, Val Loss: 1.7870, Val Acc: 0.0000, Time: 22.96s
Epoch 9/100: Train Loss: 1.7377, Val Loss: 1.7147, Val Acc: 0.0000, Time: 22.96s
Epoch 10/100: Train Loss: 1.6600, Val Loss: 1.6369, Val Acc: 0.0000, Time: 22.94s
Epoch 11/100: Train Loss: 1.5981, Val Loss: 1.5983, Val Acc: 0.0000, Time: 22.94s
Epoch 12/100: Train Loss: 1.5514, Val Loss: 1.5651, Val Acc: 0.0000, Time: 22.92s
Epoch 13/100: Train Loss: 1.5109, Val Loss: 1.5412, Val Acc: 0.0000, Time: 22.94s
Epoch 14/100: Train Loss: 1.4756, Val Loss: 1.5142, Val Acc: 0.0000, Time: 22.94s
Epoch 15/100: Train Loss: 1.4431, Val Loss: 1.5015, Val Acc: 0.0000, Time: 22.93s
Epoch 16/100: Train Loss: 1.4176, Val Loss: 1.4829, Val Acc: 0.0000, Time: 22.92s
Epoch 17/100: Train Loss: 1.3936, Val Loss: 1.4836, Val Acc: 0.0000, Time: 22.94s
Epoch 18/100: Train Loss: 1.3698, Val Loss: 1.4639, Val Acc: 0.0000, Time: 22.95s
Epoch 19/100: Train Loss: 1.3507, Val Loss: 1.4531, Val Acc: 0.0000, Time: 22.96s
Epoch 20/100: Train Loss: 1.3304, Val Loss: 1.4542, Val Acc: 0.0000, Time: 22.95s
Epoch 21/100: Train Loss: 1.3162, Val Loss: 1.4487, Val Acc: 0.0000, Time: 22.92s
Epoch 22/100: Train Loss: 1.2966, Val Loss: 1.4427, Val Acc: 0.0000, Time: 22.92s
Epoch 23/100: Train Loss: 1.2816, Val Loss: 1.4323, Val Acc: 0.0000, Time: 22.94s
Epoch 24/100: Train Loss: 1.2660, Val Loss: 1.4395, Val Acc: 0.0000, Time: 22.95s
Epoch 25/100: Train Loss: 1.2529, Val Loss: 1.4435, Val Acc: 0.0000, Time: 22.92s
Epoch 26/100: Train Loss: 1.2374, Val Loss: 1.4457, Val Acc: 0.0000, Time: 22.92s
Epoch 27/100: Train Loss: 1.2209, Val Loss: 1.4516, Val Acc: 0.0000, Time: 22.93s
Epoch 28/100: Train Loss: 1.2073, Val Loss: 1.4466, Val Acc: 0.0000, Time: 22.93s
Epoch 29/100: Train Loss: 1.1968, Val Loss: 1.4599, Val Acc: 0.0000, Time: 22.93s
Epoch 30/100: Train Loss: 1.1841, Val Loss: 1.4617, Val Acc: 0.0000, Time: 22.95s
Epoch 31/100: Train Loss: 1.1685, Val Loss: 1.4625, Val Acc: 0.0000, Time: 22.93s
Current lr= 0.001 Updated lr= 0.0005
Epoch 32/100: Train Loss: 1.1276, Val Loss: 1.4600, Val Acc: 0.0000, Time: 22.93s
Current lr= 0.0005 Updated lr= 0.00025
Epoch 33/100: Train Loss: 1.0962, Val Loss: 1.4566, Val Acc: 0.0000, Time: 22.94s
Current lr= 0.00025 Updated lr= 0.000125
Epoch 34/100: Train Loss: 1.0752, Val Loss: 1.4599, Val Acc: 0.0000, Time: 22.93s
Epoch 35/100: Train Loss: 1.0668, Val Loss: 1.4608, Val Acc: 0.0000, Time: 22.94s
Epoch 36/100: Train Loss: 1.0630, Val Loss: 1.4671, Val Acc: 0.0000, Time: 22.94s
Epoch 37/100: Train Loss: 1.0594, Val Loss: 1.4737, Val Acc: 0.0000, Time: 22.94s
Finished training after 37 epochs!
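
The requested loss plot could be produced along these lines. This is a sketch that assumes `train_loss_hist` and `val_loss_hist` hold one value per epoch (with validation recorded every `val_every` epochs); if the training history is stored per mini-batch instead, the x-axis would need rescaling. `epoch_axis` and `plot_loss_curves` are hypothetical helper names.

```python
def epoch_axis(n_points, every=1):
    """1-based epoch numbers at which each recorded loss value was measured."""
    return [every * (i + 1) for i in range(n_points)]


def plot_loss_curves(train_loss_hist, val_loss_hist, val_every=1):
    """Plot training and validation loss against epoch on shared axes."""
    # Imported here so epoch_axis can be used without a plotting backend
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(epoch_axis(len(train_loss_hist)), train_loss_hist, label='Training loss')
    ax.plot(epoch_axis(len(val_loss_hist), val_every), val_loss_hist, label='Validation loss')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Temporal cross-entropy loss')
    ax.set_title('GPTMini6 on Shakespeare')
    ax.legend()
    return fig, ax


# In the notebook, after training:
# plot_loss_curves(train_loss_hist, val_loss_hist, val_every=1)
# plt.show()
```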


### 9c. Prompt GPT to generate Shakespearian text  

Have your GPT generate a large amount of text (e.g. 5000 chars) that follows a prompt of your choice (a string containing a few words or a sentence).

**Guidelines**
1. Use your `make_ind2char_mapping` from the math datasets to make the reverse map.
2. Use the `'distributed'` method for generating text.

When you turn in your project, include an example of at least one long passage of generated text by your GPT below.

In [32]:
from addition_dataset import make_ind2char_mapping

In [38]:
ind2char_map = make_ind2char_mapping(char2ind_map)

prompt = "To be, or not to be: that is the question:"
print(f"Prompt: {prompt}")

model.seq_len = 128
gen_text = model.generate_sequence(
    prompt=prompt,
    length=5000,  # Generate 5000 characters
    char2ind_map=char2ind_map,
    ind2char_map=ind2char_map,
    method='max',
    live_print=True,
)

# Print the final generated text
print('***final output***')
print(prompt + ''.join(gen_text))


Prompt: To be, or not to be: that is the question:
To be, or not to be: that is the question:
I will not stay to thee again.

PETRUCHIO:
What, hold, my lord?

KATHARINA:
I thank thee, though I did say it is not.

PETRUCHIO:
Well, my lord.

KATHARINA:
I will be so false of the sea of the sea,
And therefore I will resolve you are not a word.

PETRUCHIO:
Well, my lord.

KATHARINA:
I would not have heard the senators of the sea,
And therefore I will resolve you with me.

PETRUCHIO:
Well, my lord.

KATHARINA:
I would not have heard him from my son.

PETRUCHIO:
Well, my lord.

KATHARINA:
I would not have heard the senators:
The senators of the sea of the sea,
And therefore I will resolve you with him.

PETRUCHIO:
Well, my lord.

KATHARINA:
I would not have heard the senators of the sea,
And therefore I will resolve you with me.

PETRUCHIO:
Well, my lord.

KATHARINA:
I would not have heard him from my son.

PETRUCHIO:
Well, my lord.

KATHARINA:
I would not have heard the senators:
The senators of the sea of the sea,
And therefore I will resolve you with him.

PETRUCHIO:
Well, my lord.

KATHARINA:
I would not have heard the senators of the sea,
And therefore I will resolve you with me.

PETRUCHIO:
Well, my lord.

KATHARINA:
I would not have hea
KeyboardInterrupt: 

### 9d. Questions

**Question 9:** Rerun your generation using the `'max'` method. Which method generates better sounding/more interesting text? **Why?**

**Answer 9:**
The `'distributed'` method sounds much better because it introduces variety: it samples each char from the softmax output instead of always taking the top prediction, so the text keeps changing. With `'max'`, the model falls into a cycle where it repeats the same few lines, because it always selects the char with the highest softmax output.
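
The difference between the two methods can be illustrated with a small sketch, assuming `'max'` means taking the argmax of the softmax output and `'distributed'` means sampling an index in proportion to the softmax probabilities. The helper names are hypothetical and this is not the project's `generate_sequence` code.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_max(probs):
    """Greedy choice: always the single most likely index."""
    return max(range(len(probs)), key=probs.__getitem__)


def pick_distributed(probs, rng):
    """Sample an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = softmax([0.1, 2.0, 0.3])
rng = random.Random(0)
greedy = [pick_max(probs) for _ in range(10)]
sampled = [pick_distributed(probs, rng) for _ in range(10)]
print(greedy)
print(sampled)
```

Greedy selection is deterministic, so once the model revisits a context it has seen before, it continues the same way again, which is how a repeating cycle can arise. Sampling occasionally picks a less likely char and breaks out of such loops.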

## Extensions

### General guidelines

1. Never integrate extensions into your base project so that they change the expected behavior of core functions. If your extension changes the core design/behavior, no problem, duplicate your working base project and add features from there.
2. Check the rubric to keep in mind how extensions on this project will be graded.
3. While I may consult your code and "written log" of what you did, **I am grading your extensions based on what you present in your 3-5 min video.**
4. I suggest documenting your explorations in a "log" or "lab notebook" style (i.e. documenting your thought/progression/discovery/learning process). I'm not grading your writing, so you can keep it succinct. **Whatever is most useful to you to remember what you did.**
5. I suggest taking a hypothesis-driven approach. For example "I was curious about X so I explored Y. I found Z, which was not what I expected because..., so then tried A..."
6. Make plots to help showcase your results.
7. **More is not necessarily better.** Generally, a small number of "in-depth" extensions count for more than many "shallow" extensions.

### AI guidelines

You may use AI in almost any capacity for extensions. However, keep in mind:
1. There is no need to use AI at all!
2. You are welcome to use AI as a tool (e.g. automate something that is tedious, help you get unstuck, etc.). However, you should be coding, you should be thinking, you should be writing, you should be creating. If you are spending most (or even close to most) of your time typing into a chatbot and copy-pasting, you have probably gone too far with AI use.
3. I don't find large volumes of AI generated code/text/plots to be particularly impressive and you risk losing my interest while grading. Remember: I'm grading your extensions based on your video presentation. **More is not necessarily better.**

### Video guidelines

1. Please try to keep your video to 5 minutes (*I have other projects to grade!*). If you turn in a longer video, I make no promise that I will watch more than 5 minutes.
2. Your screen should be shared as you show me what you did. A live video of your face should also appear somewhere on the screen (e.g. picture-in-picture overlay / split screen).
3. Your partner should join you for the video and take turns talking, but, if necessary, it is fine to have only one team member present while recording the video.
4. Do not simply read text from your notebook, do not read from a prepared script. I am not grading how polished your video presentation is (see extension grading criteria on rubric). 
5. I am looking for original and creative explorations sparked by your curiosity/interest/passion in a topic. This should be apparent in your video.
6. Be natural; don't feel the need to impress me with fancy language. If it is helpful, imagine that we are talking one-on-one about your extension. Tell me what you did :)

### Extension ideas

#### 1. Generate text based on other corpora

Train one of your GPTs on a different text dataset and use it to generate text that resembles that body of work.

#### 2. GPT-1

Train OpenAI's GPT-1 model. It has the same architecture as `GPTMini6` except it has:
- 12 stacked Transformer Blocks
- 12 attention heads
- Embedding dimension of 768
- Dropout rate of 0.2

#### 3. GPT-2

Train a model in the family of OpenAI's GPT-2 models. These models share the `GPTMini6` architecture but use different values for (number of transformer blocks, embedding dimension, attention heads):

**GPT-2 Medium:** (24, 1024, 16)<br/>
**GPT-2 Large:** (36, 1280, 20)<br/>
**GPT-2 XL:** (48, 1600, 25)

Feel free to adapt/pare down based on training time and GPU resources.
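
When deciding how far to pare down, a rough weight count can help. The sketch below assumes each transformer block has four `d x d` attention projections (query, key, value, output) and an MLP with a `4d` hidden layer, as in the `GPTMini6` summary above (384 and 1536), and it ignores biases, normalization parameters, and positional encodings. `approx_num_weights` is a hypothetical helper and the result is an estimate only.

```python
def approx_num_weights(num_blocks, embed_dim, vocab_sz):
    """Rough count of weight-matrix entries in a GPT-style model."""
    attention = 4 * embed_dim * embed_dim       # Q, K, V, and output projections
    mlp = 2 * embed_dim * (4 * embed_dim)       # d -> 4d and 4d -> d dense layers
    per_block = attention + mlp                 # 12 * d^2
    embedding_and_output = 2 * vocab_sz * embed_dim
    return num_blocks * per_block + embedding_and_output


configs = {
    'GPTMini6': (6, 384),
    'GPT-2 Medium': (24, 1024),
    'GPT-2 Large': (36, 1280),
    'GPT-2 XL': (48, 1600),
}
for name, (num_blocks, embed_dim) in configs.items():
    # vocab_sz=65 assumes the char-level Shakespeare vocabulary from Task 8
    print(f'{name}: ~{approx_num_weights(num_blocks, embed_dim, 65):,} weights')
```

Since the per-block cost grows with the square of the embedding dimension, reducing the embedding dimension shrinks the model faster than removing blocks.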

#### 4. More complex arithmetic

Explore any of the following:
- Train your transformers to perform addition and/or multiplication with larger numbers.
- Add support for negative integer operands.
- Allow for longer chains of operands (e.g. `1+1+1+1=4`)
- Add support for subtraction and/or other arithmetic operations.

#### 5. Explore hyperparameters

Explore how any of the following affects the quality of the generated text and/or loss:
- Sequence length
- Embedding dimension