# Generative Pre-trained Transformer (GPT) in TensorFlow

This is a fun experiment in GPTs inspired by Andrej Karpathy: https://www.youtube.com/watch?v=kCc8FmEb1nY. He's a great teacher and has taught classes, written about ML, and posted videos about ML for years.

Karpathy walks through writing a GPT from scratch using the mini-Shakespeare dataset and PyTorch. I'm more used to TensorFlow, so I wanted to try and follow along using TF. I didn't quite write this from scratch; Andrej implements a multi-head attention layer from scratch, but TensorFlow includes a MultiHeadAttention layer so I used that. It's worth paying attention in Andrej's tutorial to see what's going on in an attention layer though; the key, value, and query concepts are useful to understand.

I also wanted to try more data, so I used the complete works of William Shakespeare instead of the mini-Shakespeare dataset.

I put a slight spin on the training/inference approach by creating "training" and "production" models. The training model takes tokens as input (the raw input characters translated into integer IDs) and outputs logits for each of the possible tokens that could come next. The production model takes characters in and spits characters out. It does this by wrapping the trained model with an encoding layer on the input and a decoding layer on the output. Encoding translates the raw characters into a small vocabulary of tokens using a tf.keras.layers.StringLookup layer. On the output side, the decoding layer samples a next token from the logits and then translates it back to a character. I think this makes it a little easier to pass in data: all you need to do on the input side is split up a string into characters and truncate or pad to the right length, and on the output side, you get characters right out of the model.
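To make the "split, truncate, or pad" step concrete, here's a minimal sketch in plain Python. The helper name `prepare_context` and the toy context size of 8 are my own choices for illustration; the padding token `'[UNK]'` is the default out-of-vocabulary token of StringLookup.

```python
def prepare_context(text: str, context_size: int, pad_token: str = "[UNK]") -> list:
    """Split text into characters, keep the last context_size, and left-pad the rest."""
    chars = list(text)[-context_size:]
    return [pad_token] * (context_size - len(chars)) + chars


# Toy context size of 8 (hypothetical; the notebook uses 32)
padded = prepare_context("   A", context_size=8)
truncated = prepare_context("To be or not to be", context_size=8)
print(padded)     # ['[UNK]', '[UNK]', '[UNK]', '[UNK]', ' ', ' ', ' ', 'A']
print(truncated)  # ['o', 't', ' ', 't', 'o', ' ', 'b', 'e']
```

Padding goes at the front so that the most recent characters always sit at the end of the window, which is the position the next-character prediction is read from.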

The original paper on Transformers ("Attention is All You Need", Vaswani et al.) is worth a read: https://arxiv.org/pdf/1706.03762.pdf. This notebook implements just the "decoder" portion, shown on the right of the paper's architecture diagram, and without the cross-attention to an encoder. The decoder predicts the next character in a sequence given a string of recent characters (e.g. given "To be or not ", it might fill in "t", "o", " ", "b", "e").

In [1]:
"""
Implementation of a Generative Pre-Trained Transformer (GPT) in TensorFlow
"""
import os
import requests
import tensorflow as tf
import numpy as np

# Define some constants
INPUT_FILE = 't8.shakespeare.txt'
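# Original source of the file:
# FILE_URL = 'https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt'
# Local mirror; swap in the URL above if you aren't serving the file yourself
FILE_URL = 'http://localhost:8000/t8.shakespeare.txt'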
TIMESERIES_CONTEXT = 32  # how many characters to use for prediction?
BATCH_SIZE = 32
NUM_HEADS = 4  # Number of attention heads in each layer
HEAD_SIZE = 32  # Number of units in each head
DROPOUT = 0.2  # Dropout regularization in the attention and dense layers
NUM_LAYERS = 6  # Number of attention layers
LEARNING_RATE = 3e-4

Transformers need some sort of information that associates each input with its position. I'm doing that here with just a range of integers fed into an embedding layer. The result gets added to the output of a second embedding layer that represents the input characters themselves. This layer class implements both.

In [2]:
class TokenAndPositionEmbedding(tf.keras.layers.Layer):
    def __init__(self, vocab_size, context_size, embed_size):
        super().__init__()
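        # Token embedding: one vector per vocabulary entry
        self.tok_embedding = tf.keras.layers.Embedding(vocab_size, embed_size)
        # Position embedding: one vector per position in the context window
        self.pos_embedding = tf.keras.layers.Embedding(context_size, embed_size)
        self.positions = tf.range(0, context_size)

    def call(self, values):
        # (B, T, C) token embeddings + (T, C) position embeddings, broadcast over the batch
        return self.tok_embedding(values) + self.pos_embedding(self.positions)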
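If it helps to see the sum without Keras in the way, here's a rough NumPy sketch of the same idea. It treats each embedding as a plain lookup table. The table names, the random values, and the toy sizes are all assumptions for illustration, not what the layer above actually learns.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, context_size, embed_size = 10, 4, 3  # toy sizes, not the notebook's

# An embedding is a trainable lookup table; random values stand in for trained ones
token_table = rng.normal(size=(vocab_size, embed_size))
position_table = rng.normal(size=(context_size, embed_size))

tokens = np.array([[1, 5, 5, 2],
                   [7, 0, 3, 5]])      # (B, T) integer token ids
positions = np.arange(context_size)   # (T,)

# (B, T, C) + (T, C): the position rows broadcast across the batch dimension
embedded = token_table[tokens] + position_table[positions]
print(embedded.shape)  # (2, 4, 3)

# The same token (5) at positions 1 and 2 ends up with two different vectors
print(np.allclose(embedded[0, 1], embedded[0, 2]))  # False
```

That last line is the whole point: without the position term, the model would see identical vectors for a repeated character and have no way to tell where in the window each one sits.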

Here's the actual Transformer Decoder layer. This is based on the Attention Is All You Need paper but with the layer normalization performed before attention and feed-forward.

In [3]:
class TransformerDecoder(tf.keras.layers.Layer):
    def __init__(self, num_heads: int, head_size: int, dropout: float):
        super().__init__()
        self.attention = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size, dropout=dropout)
        # The feed-forward block is a small multi-layer perceptron: a non-linear layer that expands to 4x the input size,
        # followed by a linear layer that maps the dimensions back to the input size
        self.feed_forward = tf.keras.Sequential([tf.keras.layers.Dense(4*num_heads*head_size, activation='gelu'),
                                                 tf.keras.layers.Dense(num_heads*head_size),
                                                 tf.keras.layers.Dropout(dropout)])
        self.layer_norm1 = tf.keras.layers.LayerNormalization()
        self.layer_norm2 = tf.keras.layers.LayerNormalization()
    
    def call(self, values):
        norm_values = self.layer_norm1(values)
        # At both the attention and feed-forward layers, we include a residual (values and attn) that helps accelerate convergence.
        # np.tri builds a lower-triangular mask so each position can only attend to itself and earlier positions
        attn = values + self.attention(norm_values, norm_values, attention_mask=np.tri(TIMESERIES_CONTEXT))
        norm_attn = self.layer_norm2(attn)
        feed_fwd = attn + self.feed_forward(norm_attn)
        # The output shape is going to be the same as the input shape
        return feed_fwd
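Since I'm leaning on the built-in MultiHeadAttention layer, here's a minimal single-head NumPy sketch of what the causal mask does inside it. This is a simplification under my own assumptions (one head, no learned query/key/value projections, no dropout), not the Keras implementation. The function name `causal_attention` is made up for this example.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask. Inputs are (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                  # (T, T): row = query, column = key
    mask = np.tri(T, dtype=bool)                   # True where key position <= query position
    scores = np.where(mask, scores, -np.inf)       # hide the future from every query
    scores -= scores.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    weights = np.exp(scores)                       # masked entries become exactly 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 time steps, 8 channels
out, weights = causal_attention(x, x, x)
print(np.tri(4, dtype=int))          # the lower-triangular mask
print(np.round(weights, 2))          # zeros above the diagonal, rows sum to 1
```

The first row of `weights` is always `[1, 0, 0, 0]`: the first character has nothing earlier to look at, so its output is just its own value vector.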

Add some utility functions for reading data and generating random slices for training.

In [4]:
def read_input(input_filename: str):
    """Read the input file and return the encoded dataset and an encoding layer"""
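    # Read the whole file as a list of single characters (and close the file when done)
    with open(input_filename) as infile:
        raw_input = list(infile.read())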
    vocabulary = sorted(list(set(raw_input)))
    encoder = tf.keras.layers.StringLookup(vocabulary=vocabulary)
    enc_input = encoder(raw_input).numpy()
    return enc_input, encoder


def generate_batches(data: np.ndarray, context_size: int, batch_size: int):
    """Generate batches of value,target pairs for training"""
    # Create shuffled starting offsets into the data
    starting_offsets = np.arange(len(data)-context_size)
    np.random.shuffle(starting_offsets)
    for batch_idx in range(0, len(starting_offsets), batch_size):
        batch_starting_offsets = starting_offsets[batch_idx:(batch_idx+batch_size)].reshape((-1,1))
        # Turn the starting offsets into indices: this will project the arange across all of the 
        # starting_offsets so something like [[5], [11], [3]] becomes [[5, 6, 7], [11, 12, 13], [3, 4, 5]]
        indices = batch_starting_offsets + np.arange(context_size)
        values = data[indices]
        # The targets (what we want to predict) are just the next characters
        targets = data[indices+1]
        yield values, targets
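The broadcasting trick in generate_batches is easier to trust with small numbers. This sketch reuses the `[[5], [11], [3]]` example from the comment above; the stand-in `data` array (100 through 119) is an assumption chosen so that values and indices are easy to tell apart.

```python
import numpy as np

data = np.arange(100, 120)            # stand-in for encoded tokens: 100, 101, ..., 119
context_size = 3
starts = np.array([[5], [11], [3]])   # (B, 1) column of starting offsets

indices = starts + np.arange(context_size)  # (B, 1) + (T,) broadcasts to (B, T)
values = data[indices]                      # what the model sees
targets = data[indices + 1]                 # the same windows shifted one step right

print(indices.tolist())  # [[5, 6, 7], [11, 12, 13], [3, 4, 5]]
print(values.tolist())   # [[105, 106, 107], [111, 112, 113], [103, 104, 105]]
print(targets.tolist())  # [[106, 107, 108], [112, 113, 114], [104, 105, 106]]

# The largest offset that generate_batches can draw still keeps the targets in bounds
last_start = len(data) - context_size - 1
print(last_start + context_size)  # 19, the last valid index into data
```

Each row of `targets` is the matching row of `values` shifted by one, so every position in the window gets its own next-character label, not just the last one.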

Code to create the model

In [5]:
def create_model(encoder):
    """Create the GPT model and return it"""
    # At each layer, I'll show the output dimensions. B means the batch dimension, T means the time dimension (characters), and C is the channel dimension (number of different values for each character)
    values_input = tf.keras.layers.Input(shape=(TIMESERIES_CONTEXT,), name='values_input')  # (B, T)
    layer = TokenAndPositionEmbedding(encoder.vocabulary_size(), TIMESERIES_CONTEXT, NUM_HEADS*HEAD_SIZE)(values_input)  # (B, T, NUM_HEADS*HEAD_SIZE)
    for _ in range(NUM_LAYERS):
        layer = TransformerDecoder(NUM_HEADS, HEAD_SIZE, DROPOUT)(layer)  # (B, T, NUM_HEADS*HEAD_SIZE)
    # One last layer normalization
    layer = tf.keras.layers.LayerNormalization()(layer)  # (B, T, NUM_HEADS*HEAD_SIZE)
    # And map to the vocabulary size
    output = tf.keras.layers.Dense(encoder.vocabulary_size())(layer)  # (B, T, C)

    model = tf.keras.Model(inputs=values_input, outputs=output)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # This loss lets us pass in integers rather than having to one-hot encode
    optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)
    model.compile(loss=loss, optimizer=optimizer)
    return model
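For a feel of what the loss number means, here's a rough NumPy version of sparse categorical cross-entropy on logits. It's a sketch of the math, not the Keras implementation. The vocabulary size of 92 is an assumption: I'm inferring it from the 15,872 embedding parameters in the model summary, which works out to (92 + 32) × 128.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(logits, targets):
    """Mean of -log softmax(logits)[target] over every (batch, time) position."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return -picked.mean()

vocab_size = 92  # assumption: inferred from the embedding parameter count
B, T = 2, 5
targets = np.random.default_rng(0).integers(0, vocab_size, size=(B, T))

# All-zero logits mean "every character is equally likely"
uniform_loss = sparse_categorical_crossentropy_from_logits(np.zeros((B, T, vocab_size)), targets)
print(round(float(uniform_loss), 2))  # 4.52, i.e. ln(92)

# Logits that put nearly all the mass on the right character drive the loss toward 0
confident = np.full((B, T, vocab_size), -20.0)
np.put_along_axis(confident, targets[..., None], 20.0, axis=-1)
confident_loss = sparse_categorical_crossentropy_from_logits(confident, targets)
print(confident_loss < 1e-6)  # True
```

So a model that knows nothing starts around 4.5 under this assumption, and anything well below that means it has picked up real structure in the text.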

Put all the pieces together: create the model, train the model

In [6]:
# Download the input file if it doesn't exist already
if not os.path.exists(INPUT_FILE):
    resp = requests.get(FILE_URL)
    resp.raise_for_status()  # fail loudly rather than saving an error page as the dataset
    with open(INPUT_FILE, 'wb') as outfile:
        outfile.write(resp.content)

data, encoder = read_input(INPUT_FILE)
model = create_model(encoder)
model.summary()
# Loss should get down to about 1.3-1.4 after one full pass
# You can pass steps_per_epoch to limit the training to fewer batches. You can get some interesting
# results at 5,000 batches and it takes a lot less time (loss about 1.9). The whole file is about 170k batches.
model.fit(generate_batches(data, TIMESERIES_CONTEXT, BATCH_SIZE))

2023-07-30 10:05:01.928704: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1
2023-07-30 10:05:01.928734: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 8.00 GB
2023-07-30 10:05:01.928739: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 2.67 GB
2023-07-30 10:05:01.928765: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-07-30 10:05:01.928778: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 values_input (InputLayer)   [(None, 32)]              0         
                                                                 
 token_and_position_embeddi  (None, 32, 128)           15872     
 ng (TokenAndPositionEmbedd                                      
 ing)                                                            
                                                                 
 transformer_decoder (Trans  (None, 32, 128)           198272    
 formerDecoder)                                                  
                                                                 
 transformer_decoder_1 (Tra  (None, 32, 128)           198272    
 nsformerDecoder)                                                
                                                                 
 transformer_decoder_2 (Tra  (None, 32, 128)           198272


 170568/Unknown - 26340s 154ms/step - loss: 1.3683

<keras.src.callbacks.History at 0x17a7dda80>
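The parameter counts in the summary can be reproduced by hand, which is a nice sanity check on the architecture. This is my own back-of-the-envelope arithmetic and it assumes biases on every dense projection. The vocabulary size of 92 is solved backwards from the embedding row, so only the decoder block count is an independent check.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

vocab_size = 92                      # assumption: 15872 / 128 - 32, solved from the summary
context_size = 32                    # TIMESERIES_CONTEXT
num_heads, head_size = 4, 32         # NUM_HEADS, HEAD_SIZE
d_model = num_heads * head_size      # 128 channels

# Token table plus position table
embedding = (vocab_size + context_size) * d_model

# Query, key, value, and output projections are each d_model -> d_model here
attention = 4 * dense_params(d_model, d_model)
# Expand to 4x, then project back
feed_forward = dense_params(d_model, 4 * d_model) + dense_params(4 * d_model, d_model)
# Two LayerNormalization layers, each with a gain and a bias per channel
layer_norms = 2 * 2 * d_model

decoder_block = attention + feed_forward + layer_norms
print(embedding)      # 15872
print(decoder_block)  # 198272
```

The attention and feed-forward halves are nearly the same size (about 66k and 132k parameters), so most of each block's capacity actually sits in the feed-forward layers.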

Now create a production model that has a character encoder layer on the input, and a decoder layer on the output that translates logits back into characters.

In [7]:
class InverseLookupLayer(tf.keras.layers.Layer):
    def __init__(self, encoder: tf.keras.layers.StringLookup):
        super().__init__()
        self.decoder = tf.keras.layers.StringLookup(vocabulary=encoder.get_vocabulary(), invert=True)
    
    def call(self, logits):
        # Pick the last character prediction in each batch row
        # Then, for each batch row, choose a character based on the logit probabilities
        choose_chars = tf.random.categorical(logits[:,-1,:], num_samples=1)
        # Decode back to characters
        return self.decoder(choose_chars)


def create_production_model(model, encoder):
    """Add a character encoder and decoder layer at the front and back of the model"""
    chars_input = tf.keras.layers.Input(shape=(TIMESERIES_CONTEXT,), dtype=tf.string)
    # Strip the input layer off the training model and add our input with an encoder instead
    prod_model = tf.keras.Sequential([chars_input,
                                      encoder] +
                                      model.layers[1:] +
                                      [InverseLookupLayer(encoder)])
    return prod_model


def generate_text(model: tf.keras.Model, context: str, num_chars: int):
    """Print the context, then sample and print num_chars more characters one at a time"""
    print(context, end='')
    chars = list(context)
    for _ in range(num_chars):
        chars = chars[-TIMESERIES_CONTEXT:]  # truncate
        chars = ['[UNK]'] * (TIMESERIES_CONTEXT - len(chars)) + chars  # pad with "unknown" strings at the front
        char = model.predict([chars], verbose=0)[0,0].decode()
        print(char, end='', flush=True)
        chars.append(char)  # append, so a sampled '[UNK]' stays a single token
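The sampling step in InverseLookupLayer is what keeps the output from being the same text every time. Here's a rough NumPy equivalent of drawing from the last time step's logits. The function name `sample_next_token` and the toy logits are assumptions for illustration; the model itself uses tf.random.categorical.

```python
import numpy as np

def sample_next_token(logits, rng):
    """Draw one token id per batch row from softmax(logits at the last time step)."""
    last = logits[:, -1, :]                                     # (B, C)
    probs = np.exp(last - last.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.array([rng.choice(len(p), p=p) for p in probs])   # (B,)

rng = np.random.default_rng(0)
logits = np.zeros((1, 3, 4))             # B=1, T=3, toy vocabulary of 4 tokens
logits[0, -1] = [0.0, 2.0, 0.0, -1.0]    # token 1 is the favourite, but not the only option

draws = np.array([sample_next_token(logits, rng)[0] for _ in range(2000)])
freq = np.bincount(draws, minlength=4) / len(draws)
print(np.round(freq, 2))                 # roughly softmax([0, 2, 0, -1])
print(int(np.argmax(logits[0, -1])))     # 1: what greedy decoding would always pick
```

Greedy decoding would pick token 1 every time, which tends to get stuck in loops; sampling in proportion to the probabilities trades a few odd characters for much more varied text.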

Put everything together and generate some text!

In [8]:
prod_model = create_production_model(model, encoder)
generate_text(prod_model, context='   A', num_chars=1000)

   A

I OM
  GlOOCCONSTAFF
  ESCALUS, in a power of beauty, the wax,
    who comes here pursues and good fairies your mothers I am clouds recounts,
   Talk of Antianus, remain thee be the Garter. I perceive thy mother's tomb,
    Here is no more than you blunt;
    Tears with our cold fathers; your returns a peril
    Did to prove ill this kingdom?
  Pedro. At my presence, ho, have you're
    To honest Master Ambassador stay into come to case.
    Look, my lord, that's potent, there anything,
    How now, though thou say'st me fought at the subject. One for a porbation
    COMINIUS OF EPHESUS, Duke of Signior

Enter SIR TOBY, and PISANIO and other ThIsby
    Here came I in thine own lord that taste be than I
    beseech your mother must make speed a little wanting set,
  When nor time to draw thee there would lay prisoners come off,
    And be fruitful.
  PEY. Hear what preciously shift thy company can I each love and life
    Zetbear this deed liveries, and cry and denatory,
   His rifts ha