<a href="https://colab.research.google.com/github/adammoss/MLiS2/blob/master/examples/llm/transformer_finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

As an example application of a transformer, let's build a GPT-like language model that predicts the probability of a sentence of $\tau$ tokens,

$
P \left( \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(\tau)} \right) = \prod_{t = 1}^{\tau} P \left( \boldsymbol{x}^{(t)} \mid \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(t-1)} \right)
$

where $\boldsymbol{x}^{(t)}$ is a vector representing a token.
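
To make the factorisation concrete, here is a minimal sketch that multiplies per-token conditional probabilities into a sentence probability. The four probabilities are made-up illustrative values, not model outputs. In practice the sum of log-probabilities is used instead of the raw product, since the product underflows for long sequences.

```python
import math

# Hypothetical conditionals P(x^(t) | x^(1), ..., x^(t-1)) for a 4-token sentence.
# These numbers are illustrative assumptions, not outputs of any model.
cond_probs = [0.2, 0.5, 0.1, 0.8]

# Chain rule: the joint probability is the product of the conditionals.
joint_prob = math.prod(cond_probs)

# Equivalent and numerically safer: sum the log-probabilities.
log_prob = sum(math.log(p) for p in cond_probs)

print("joint probability:", joint_prob)
print("exp(sum of logs): ", math.exp(log_prob))
```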

This notebook uses the 'tiny_shakespeare' dataset, but it is designed to work with other text sources as well. It shows how to prepare training and test datasets, set parameters such as batch size and context size, and fine-tune a pre-trained GPT-2 model for text generation.

In [203]:
!pip install keras_nlp
!pip install tensorflow_text



In [204]:
import itertools
import operator
import numpy as np
import sys
from datetime import datetime
import os
import requests
from tqdm.notebook import trange, tqdm
import matplotlib.pyplot as plt
import time

In [205]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

import tensorflow_datasets as tfds

import keras_nlp
import keras

import tensorflow_text as tf_text

TensorFlow version: 2.15.0


In [206]:
batch_size = 64
context_size = 256
#docs = ['_chat.txt']
docs = ['tiny_shakespeare']
#docs = ['scientific_papers/arxiv']

In [207]:
train_text = ''
test_text = ''
for doc in docs:
  if doc == 'tiny_shakespeare':
    d = tfds.load(name=doc)['train']
    train_text += next(iter(d))['text'].numpy().decode("utf-8")
    d = tfds.load(name=doc)['test']
    test_text += next(iter(d))['text'].numpy().decode("utf-8")
  elif doc == 'scientific_papers/arxiv':
    # NOTE: the dataset is loaded here but not yet added to train_text/test_text.
    d = tfds.load(name=doc)
  else:
    if not os.path.isfile(doc):
      from google.colab import files
      uploaded = files.upload()
    sentences = []
    with open(doc, 'r') as f:
      for x in f.readlines():
        if 'omitted' not in x:
          if len(x.split(']')) > 1:
            sentences.append(x.split(']')[1])
          else:
            sentences.append(x)
    text = ''.join(sentences)
    train_text += text[:int(0.8*len(text))]
    test_text += text[int(0.8*len(text)):]

In [208]:
print(train_text[:200])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


In [209]:
len(train_text.split(' '))

153276

In [210]:
split_train_ds = tf.data.Dataset.from_tensor_slices(tf.strings.split(train_text, sep=' '))
split_test_ds = tf.data.Dataset.from_tensor_slices(tf.strings.split(test_text, sep=' '))

In [211]:
train_sequences = split_train_ds.batch(50, drop_remainder=True)
test_sequences = split_test_ds.batch(50, drop_remainder=True)

In [212]:
def join_input(sequence):
  return tf.strings.reduce_join(sequence, axis=-1, separator=' ')
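
The `tf.data` steps above (split on spaces, group into batches of 50 words, drop the incomplete remainder, rejoin with spaces) can be sketched in plain Python. The helper name `chunk_words` is hypothetical and only mirrors the pipeline for illustration; it is not part of the notebook.

```python
def chunk_words(text, words_per_sequence=50):
    """Split text on single spaces and rejoin it into fixed-size word chunks.

    Mirrors split -> batch(drop_remainder=True) -> reduce_join: a trailing
    chunk with fewer than `words_per_sequence` words is discarded.
    """
    words = text.split(' ')
    n_full = len(words) // words_per_sequence
    return [
        ' '.join(words[i * words_per_sequence:(i + 1) * words_per_sequence])
        for i in range(n_full)
    ]


# Toy example with 3-word chunks; the leftover word "g" is dropped.
chunks = chunk_words("a b c d e f g", words_per_sequence=3)
print(chunks)
```

Because the split is on spaces only, newlines stay inside the words, which is why the sample sequences printed later still contain `\n`.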

In [213]:
train_ds = train_sequences.map(join_input)
test_ds = test_sequences.map(join_input)

BUFFER_SIZE = 10000

train_ds = (
    train_ds
    .shuffle(BUFFER_SIZE)
    .batch(batch_size, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE))

test_ds = (
    test_ds
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE))
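
As a rough check on the size of the training set, the arithmetic below estimates how many 50-word sequences and how many batches per epoch this pipeline yields. It assumes that `tf.strings.split(..., sep=' ')` produces the same 153276 pieces as the Python `split(' ')` count printed earlier; treat the result as an estimate rather than a measured value.

```python
num_words = 153276        # word count printed above for train_text
words_per_sequence = 50   # size used in split_train_ds.batch(50, ...)
batch_size = 64

# drop_remainder=True discards the incomplete final group at each stage.
num_sequences = num_words // words_per_sequence
steps_per_epoch = num_sequences // batch_size
dropped_sequences = num_sequences - steps_per_epoch * batch_size

print("sequences:", num_sequences)
print("batches per epoch:", steps_per_epoch)
print("sequences dropped per epoch:", dropped_sequences)
```

Under that assumption the shuffle buffer of 10000 is larger than the number of sequences, so each epoch sees a full shuffle of the data.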

In [214]:
x = next(iter(train_ds))
print(x.numpy()[0])

b"solicit him\nFor mercy to his country. Therefore, let's hence,\nAnd with our fair entreaties haste them on.\n\nFirst Senator:\nStay: whence are you?\n\nSecond Senator:\nStand, and go back.\n\nMENENIUS:\nYou guard like men; 'tis well: but, by your leave,\nI am an officer of state, and come\nTo speak with Coriolanus.\n\nFirst Senator:\nFrom whence?\n\nMENENIUS:\nFrom Rome.\n\nFirst Senator:\nYou may not pass,"


In [215]:
x = next(iter(test_ds))
print(x.numpy()[0])

b"rance ta'en\nAs shall with either part's agreement stand?\n\nBAPTISTA:\nNot in my house, Lucentio; for, you know,\nPitchers have ears, and I have many servants:\nBesides, old Gremio is hearkening still;\nAnd happily we might be interrupted.\n\nTRANIO:\nThen at my lodging, an it like you:\nThere doth my father lie; and there, this night,\nWe'll pass the business"


In [216]:
# To speed up training and generation, we do not use the full GPT-2 context length of 1024.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=context_size,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)
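
Attaching the preprocessor means the model can be trained directly on raw strings: each string is tokenised, padded or truncated, and turned into inputs, next-token labels, and a weight mask. The sketch below shows only the shifting idea on toy integer ids. The function name `make_causal_example` and `pad_id=0` are assumptions for illustration, and the real preprocessor also handles tokenisation and special start/end tokens, which are omitted here.

```python
def make_causal_example(token_ids, sequence_length, pad_id=0):
    """Build (x, y, sample_weight) for next-token prediction from one sequence.

    The ids are packed to sequence_length + 1 so that, after shifting by one,
    inputs and labels both have exactly sequence_length entries.
    """
    ids = list(token_ids[: sequence_length + 1])
    mask = [1] * len(ids)
    n_pad = sequence_length + 1 - len(ids)
    ids += [pad_id] * n_pad
    mask += [0] * n_pad

    x = ids[:-1]              # model input: every token except the last
    y = ids[1:]               # labels: the same tokens shifted left by one
    sample_weight = mask[1:]  # padded label positions get zero weight
    return x, y, sample_weight


x, y, sample_weight = make_causal_example([5, 6, 7], sequence_length=4)
print(x, y, sample_weight)
```

The zero weights are what keep padded positions out of the loss and out of the `weighted_metrics` accuracy.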

In [217]:
gpt2_lm.summary()

In [218]:
start = time.time()

output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
That Italian restaurant is now open to the public.

The restaurant was originally slated to open at 7 p.m., but has been closed due to a fire. The owners of the restaurant have since decided to open at 7 p.m. The fire is believed to have been caused by the fire.

"We have been working with the fire department and the local authorities to make sure we get the restaurant back up to the normal schedule," said chef Giovanni Graziano.

The restaurant was originally scheduled to open at 6 p.m. on Saturday. The fire department was called to the restaurant because the owner was injured in an explosion that was reported at the restaurant.

The fire department was able to get the fire out of the restaurant and to the fire station, where firefighters found the fire. Graziano was not able to confirm or deny the fire.

The fire is still under investigation by the fire marshal's department and the fire station's
TOTAL TIME ELAPSED: 15.17s


In [219]:
num_epochs = 10

# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs, validation_data=test_ds)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x78bfb8a675b0>
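
With its default `power=1.0`, `PolynomialDecay` reduces the learning rate linearly from the initial value to `end_learning_rate` over `decay_steps` and then holds it there. The plain-Python sketch below reproduces that formula. The value `decay_steps=470` is illustrative (roughly 47 batches per epoch times 10 epochs for the dataset sizes printed above), not a number read from the run.

```python
def linear_decay(step, initial_lr=5e-5, decay_steps=470, end_lr=0.0):
    """Linear (power=1) polynomial decay, clamped once decay_steps is reached."""
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) + end_lr


# Learning rate at the start, the midpoint, the end, and past the end.
for s in (0, 235, 470, 1000):
    print(s, linear_decay(s))
```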

In [224]:
start = time.time()

conversation = "A"

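for _ in range(1):
  # Use only the last 30 words of the conversation as the prompt.
  prompt = ' '.join(conversation.split()[-30:])
  output = gpt2_lm.generate(prompt, max_length=200)
  # generate() returns the prompt followed by the continuation; keep only the new text.
  conversation += output[len(prompt):]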

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")

TOTAL TIME ELAPSED: 13.94s
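
The generation loop keeps a rolling window: only the last 30 words of the text so far are fed back as the prompt, and only the newly generated part is appended. The sketch below runs the same logic for three rounds with a stand-in generator. `fake_generate` and `last_words` are hypothetical helpers used so that the example runs without the model.

```python
def last_words(conversation, n_words=30):
    """Return the final n_words whitespace-separated words, joined by spaces."""
    return ' '.join(conversation.split()[-n_words:])


def fake_generate(prompt, max_length=200):
    """Stand-in for gpt2_lm.generate: returns the prompt plus a fixed continuation."""
    return prompt + " and so on"


conversation = "A"
for _ in range(3):
    prompt = last_words(conversation)
    output = fake_generate(prompt)
    # Strip the prompt from the front so only the new text is appended.
    conversation += output[len(prompt):]

print(conversation)
```

Note that `split()` without arguments collapses all whitespace, so line breaks in the conversation are replaced by single spaces in the prompt.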


In [225]:
print(conversation)

A man
To whom I would not give aught.
But what I will do,
I'll not say, 'tis the most noble deed
Which I have done.

KING RICHARD II:
Why, what?
The most noble deed?

KING RICHARD II:
I would not give it, for I would
Have done it in the most noble way
Which is most fair.
Why, what?

KING RICHARD II:
I will not give it for
