# GPT text generation from scratch with KerasHub

**Author:** [Jesse Chan](https://github.com/jessechancy)<br>
**Date created:** 2022/07/25<br>
**Last modified:** 2022/07/25<br>
**Description:** Using KerasHub to train a mini-GPT model for text generation.

## Introduction

In this example, we will use KerasHub to build a scaled-down Generative
Pre-trained Transformer (GPT) model. GPT is a Transformer-based model that allows you to generate
sophisticated text from a prompt.

We will train the model on the [simplebooks-92](https://arxiv.org/abs/1911.12391) corpus,
which is a dataset made from several novels. It is a good dataset for this example since
it has a small vocabulary and high word frequency, which is beneficial when training a
model with few parameters.

This example combines concepts from
[Text generation with a miniature GPT](https://keras.io/examples/generative/text_generation_with_miniature_gpt/)
with KerasHub abstractions. We will demonstrate how KerasHub tokenization, layers and
metrics simplify the training
process, and then show how to generate output text using the KerasHub sampling utilities.

Note: If you are running this example on Colab,
make sure to enable the GPU runtime for faster training.

This example requires KerasHub. You can install it via the following command:
`pip install keras-hub`

## Setup

In [1]:
!pip install -q --upgrade keras-hub
!pip install -q --upgrade keras  # Upgrade to Keras 3.

In [2]:
import os
import keras_hub
import keras

import tensorflow.data as tf_data
import tensorflow.strings as tf_strings

## Settings & hyperparameters

In [3]:
# Data
BATCH_SIZE = 64
MIN_STRING_LEN = 512  # Strings shorter than this will be discarded
SEQ_LEN = 128  # Length of training sequences, in tokens

# Model
EMBED_DIM = 256
FEED_FORWARD_DIM = 128
NUM_HEADS = 3
NUM_LAYERS = 2
VOCAB_SIZE = 5000  # Limits parameters in model.

# Training
EPOCHS = 5

# Inference
NUM_TOKENS_TO_GENERATE = 80

## Load the data

Now, let's download the dataset! The SimpleBooks dataset consists of 1,573 Gutenberg books, and has
one of the smallest ratios of vocabulary size to word-level tokens. It has a vocabulary size of ~98k,
a third of WikiText-103's, with around the same number of tokens (~100M). This makes it easy to fit a small model.

In [4]:
keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
)
dir = os.path.expanduser("~/.keras/datasets/simplebooks/")

# Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/train.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
    .shuffle(buffer_size=256)
)

# Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
    tf_data.TextLineDataset(dir + "simplebooks-92-raw/valid.txt")
    .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
    .batch(BATCH_SIZE)
)

Downloading data from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip


## Train the tokenizer

We train the tokenizer on the training dataset with a vocabulary size of `VOCAB_SIZE`,
which is a tuned hyperparameter. We want to limit the vocabulary as much as possible, since,
as we will see later on, it has a large effect on the number of model parameters. We also
don't want to include *too few* vocabulary terms, or there would be too many
out-of-vocabulary (OOV) sub-words. In addition, three tokens are reserved in the vocabulary:

- `"[PAD]"` for padding sequences to `SEQ_LEN`. This token has index 0 in both
`reserved_tokens` and `vocab`, since `WordPieceTokenizer` (and other layers) consider
`0`/`vocab[0]` as the default padding.
- `"[UNK]"` for OOV sub-words, which should match the default `oov_token="[UNK]"` in
`WordPieceTokenizer`.
- `"[BOS]"` stands for beginning of sentence, but here technically it is a token
representing the beginning of each line of training data.
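
To make the role of the reserved tokens concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting, the lookup scheme used by BERT-style tokenizers. It is an illustration only, not KerasHub's implementation: the toy vocabulary and the helper names `wordpiece_tokenize` and `encode` are hypothetical, and the `"##"` continuation prefix is assumed.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-words, always trying the longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # Continuation pieces carry "##".
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits, so the whole word maps to the OOV token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the three reserved tokens come first,
# so "[PAD]" gets index 0.
toy_vocab = ["[PAD]", "[UNK]", "[BOS]", "play", "##ing", "##ed", "the"]
token_to_id = {token: i for i, token in enumerate(toy_vocab)}


def encode(text, seq_len):
    """Lowercase, split on whitespace, tokenize, then pad or truncate."""
    ids = []
    for word in text.lower().split():
        ids.extend(token_to_id[p] for p in wordpiece_tokenize(word, toy_vocab))
    ids = ids[:seq_len]
    return ids + [token_to_id["[PAD]"]] * (seq_len - len(ids))


pieces = wordpiece_tokenize("playing", toy_vocab)
encoded = encode("The playing xyz", seq_len=6)
print(pieces)   # ['play', '##ing']
print(encoded)  # [6, 3, 4, 1, 0, 0]
```

The unknown word `xyz` maps to index 1 (`"[UNK]"`), and the sequence is padded with index 0 (`"[PAD]"`).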

In [5]:
import os
import zipfile
import keras
# Get current working directory
cwd = os.getcwd()
# Download the dataset to the current working directory
file_path = keras.utils.get_file(
   fname="simplebooks.zip",
   origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
   extract=False,  # Do not extract immediately
   cache_dir=cwd  # Save it in the current working directory
)
# Extract the zip file manually to the current working directory
with zipfile.ZipFile(file_path, 'r') as zip_ref:
   zip_ref.extractall(cwd)
# Now set the dataset directory based on your current working directory
dir = os.path.join(cwd, "simplebooks/")
# Load simplebooks-92 train set and filter out short lines.
raw_train_ds = (
   tf_data.TextLineDataset(dir + "simplebooks-92-raw/train.txt")
   .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
   .batch(BATCH_SIZE)
   .shuffle(buffer_size=256)
)
# Load simplebooks-92 validation set and filter out short lines.
raw_val_ds = (
   tf_data.TextLineDataset(dir + "simplebooks-92-raw/valid.txt")
   .filter(lambda x: tf_strings.length(x) > MIN_STRING_LEN)
   .batch(BATCH_SIZE)
)
print(f"Dataset extracted to: {dir}")

Downloading data from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip
Dataset extracted to: /content/simplebooks/


In [6]:
# Train tokenizer vocabulary
vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=VOCAB_SIZE,
    lowercase=True,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)

## Load tokenizer

We use the vocabulary data to initialize
`keras_hub.tokenizers.WordPieceTokenizer`. WordPieceTokenizer is an efficient
implementation of the WordPiece algorithm used by BERT and other models. It will strip,
lower-case and do other irreversible preprocessing operations.

In [7]:
tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=SEQ_LEN,
    lowercase=True,
)

## Tokenize data

We preprocess the dataset by tokenizing and splitting it into `features` and `labels`.

In [8]:
# packer adds a start token
start_packer = keras_hub.layers.StartEndPacker(
    sequence_length=SEQ_LEN,
    start_value=tokenizer.token_to_id("[BOS]"),
)


def preprocess(inputs):
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels


# Tokenize and split into train and label sequences.
train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf_data.AUTOTUNE).prefetch(
    tf_data.AUTOTUNE
)
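
The effect of `preprocess` on a single sequence can be sketched in plain Python. This is an illustration under stated assumptions rather than the KerasHub code path: `make_example` and `pad_or_truncate` are hypothetical helpers, the token ids are made up, and the `[BOS]` and `[PAD]` ids follow the reserved-token order used above.

```python
BOS_ID, PAD_ID = 2, 0  # Ids implied by reserved_tokens=["[PAD]", "[UNK]", "[BOS]"].


def pad_or_truncate(ids, seq_len):
    """Cut the sequence to seq_len, then right-pad with PAD_ID."""
    ids = ids[:seq_len]
    return ids + [PAD_ID] * (seq_len - len(ids))


def make_example(token_ids, seq_len):
    """Build (features, labels) so that features are labels shifted right by one."""
    labels = pad_or_truncate(token_ids, seq_len)
    features = pad_or_truncate([BOS_ID] + token_ids, seq_len)
    return features, labels


features, labels = make_example([11, 12, 13], seq_len=5)
print(features)  # [2, 11, 12, 13, 0]
print(labels)    # [11, 12, 13, 0, 0]
```

At position `i` the model sees `features[: i + 1]` and is trained to predict `labels[i]`, which is the token that follows.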

## Build the model

We create our scaled down GPT model with the following layers:

- One `keras_hub.layers.TokenAndPositionEmbedding` layer, which combines the embedding
for the token and its position.
- Multiple `keras_hub.layers.TransformerDecoder` layers, with the default causal masking.
The layer has no cross-attention when run with decoder sequence only.
- One final dense linear layer.

In [9]:
inputs = keras.layers.Input(shape=(None,), dtype="int32")
# Embedding.
embedding_layer = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)
x = embedding_layer(inputs)
# Transformer decoders.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_hub.layers.TransformerDecoder(
        num_heads=NUM_HEADS,
        intermediate_dim=FEED_FORWARD_DIM,
    )
    x = decoder_layer(x)  # Giving one argument only skips cross-attention.
# Output.
outputs = keras.layers.Dense(VOCAB_SIZE)(x)
model = keras.Model(inputs=inputs, outputs=outputs)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_hub.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])

Let's take a look at our model summary - a large majority of the
parameters are in the `token_and_position_embedding` and the output `dense` layer!
This means that the vocabulary size (`VOCAB_SIZE`) has a large effect on the size of the model,
while the number of Transformer decoder layers (`NUM_LAYERS`) doesn't affect it as much.

In [10]:
model.summary()
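
As a rough check of how `VOCAB_SIZE` drives the model size, the sketch below counts the parameters of the embedding and output layers by hand. It assumes one embedding row per vocabulary entry, one per position, and a dense output layer with a bias term; the helper names are hypothetical and the decoder layers are not counted.

```python
VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 5000, 128, 256


def embedding_params(vocab_size, seq_len, embed_dim):
    """Token table plus position table."""
    return vocab_size * embed_dim + seq_len * embed_dim


def output_dense_params(vocab_size, embed_dim):
    """Weight matrix plus bias of the final projection to the vocabulary."""
    return embed_dim * vocab_size + vocab_size


emb = embedding_params(VOCAB_SIZE, SEQ_LEN, EMBED_DIM)
dense = output_dense_params(VOCAB_SIZE, EMBED_DIM)
print(emb)    # 1312768
print(dense)  # 1285000

# Doubling the vocabulary roughly doubles both counts.
emb_2x = embedding_params(2 * VOCAB_SIZE, SEQ_LEN, EMBED_DIM)
dense_2x = output_dense_params(2 * VOCAB_SIZE, EMBED_DIM)
print(emb_2x, dense_2x)  # 2592768 2570000
```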

## Training

Now that we have our model, let's train it with the `fit()` method.

In [11]:
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)

Epoch 1/5
2444/2444 ━━━━━━━━━━━━━━━━━━━━ 156s 57ms/step - loss: 4.9788 - perplexity: 176.3412 - val_loss: 4.1809 - val_perplexity: 65.4701
Epoch 2/5
2444/2444 ━━━━━━━━━━━━━━━━━━━━ 189s 54ms/step - loss: 4.1737 - perplexity: 65.0336 - val_loss: 4.0890 - val_perplexity: 59.7570
Epoch 3/5
2444/2444 ━━━━━━━━━━━━━━━━━━━━ 141s 55ms/step - loss: 4.0317 - perplexity: 56.3890 - val_loss: 4.0349 - val_perplexity: 56.6003
Epoch 4/5
2444/2444 ━━━━━━━━━━━━━━━━━━━━ 138s 54ms/step - loss: 3.9670 - perplexity: 52.8563 - val_loss: 4.0077 - val_perplexity: 55.1120
Epoch 5/5
2444/2444 ━━━━━━━━━━━━━━━━━━━━ 135s 53ms/step - loss: 3.9189 - perplexity: 50.3764 - val_loss: 3.9665 - val_perplexity: 52.8382

<keras.src.callbacks.history.History at 0x7c04dff6c7c0>

## Inference

With our trained model, we can test it out to gauge its performance. To do this
we can seed our model with an input sequence starting with the `"[BOS]"` token,
and progressively sample the model by making predictions for each subsequent
token in a loop.

To start, let's build a prompt with the same shape as our model inputs, containing
only the `"[BOS]"` token.

In [12]:
# The "packer" layer adds the [BOS] token for us.
prompt_tokens = start_packer(tokenizer([""]))
prompt_tokens

<tf.Tensor: shape=(1, 128), dtype=int32, numpy=
array([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int32)>

We will use the `keras_hub.samplers` module for inference, which requires a
callback function wrapping the model we just trained. This wrapper calls
the model and returns the logit predictions for the current token we are
generating.

Note: There are two pieces of more advanced functionality available when
defining your callback. The first is the ability to take in a `cache` of states
computed in previous generation steps, which can be used to speed up generation.
The second is the ability to output the final dense "hidden state" of each
generated token. This is used by `keras_hub.samplers.ContrastiveSampler`, which
avoids repetition by penalizing repeated hidden states. Both are optional, and
we will ignore them for now.

In [13]:

def next(prompt, cache, index):
    logits = model(prompt)[:, index - 1, :]
    # Ignore hidden states for now; only needed for contrastive search.
    hidden_states = None
    return logits, hidden_states, cache


Creating the wrapper function is the most complex part of using these functions. Now that
it's done, let's test out the different utilities, starting with greedy search.

### Greedy search

We greedily pick the most probable token at each timestep. In other words, we get the
argmax of the model output.

In [14]:
sampler = keras_hub.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after the [BOS] token.
)
txt = tokenizer.detokenize(output_tokens)
print(f"Greedy search generated text: \n{txt}\n")

Greedy search generated text: 
['[BOS] " i have been thinking of the matter , " he said , " i have heard that the king of the king of the king of the king , who , as you have said , was the king of the king of the king , who , as you have said , was the king of the king of the king , who , as you have said , was the king of the king of the king , who , as you have said , was the king of the king of the king , and the king of the king of the king of the king , and the king of the king of the king , and the king of the king']



As you can see, greedy search starts out making some sense, but quickly starts repeating
itself. This is a common problem with text generation that can be fixed by some of the
probabilistic text generation utilities shown later on!
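
The repetition seen above is easy to reproduce with a toy model. The sketch below is not the KerasHub sampler: it uses a hypothetical table of next-token scores that depend only on the previous token, whereas the trained model conditions on the whole prefix. It shows how always taking the argmax can lock decoding into a cycle.

```python
# Hypothetical next-token scores, keyed by the previous token.
TOY_LOGITS = {
    "[BOS]": {"the": 2.0, "king": 1.0, "of": 0.1},
    "the": {"king": 2.5, "of": 0.2, "the": 0.1},
    "king": {"of": 2.0, "the": 0.5, "king": 0.1},
    "of": {"the": 3.0, "king": 0.3, "of": 0.1},
}


def greedy_decode(start, num_tokens):
    """Append the highest-scoring next token at every step."""
    sequence = [start]
    for _ in range(num_tokens):
        logits = TOY_LOGITS[sequence[-1]]
        sequence.append(max(logits, key=logits.get))  # argmax over the scores
    return sequence


tokens = greedy_decode("[BOS]", num_tokens=7)
print(" ".join(tokens))  # [BOS] the king of the king of the
```

Once the most likely continuation leads back to a token already visited, a deterministic decoder has no way to leave the loop.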

### Beam search

At a high-level, beam search keeps track of the `num_beams` most probable sequences at
each timestep, and predicts the best next token from all sequences. It is an improvement
over greedy search since it stores more possibilities. However, it is less efficient than
greedy search since it has to compute and store multiple potential sequences.

**Note:** beam search with `num_beams=1` is identical to greedy search.
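
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the `BeamSampler` implementation: the probability table is hypothetical and depends only on the previous token, and `beam_search` is an illustrative helper that scores sequences by summed log-probabilities.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_PROBS = {
    "[BOS]": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.4, "b": 0.3, "c": 0.3},
    "b": {"c": 0.9, "a": 0.1},
    "c": {"c": 1.0},
}


def beam_search(start, num_tokens, num_beams):
    """Keep the num_beams highest-scoring sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(num_tokens):
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Sort all expansions by score and keep only the best num_beams.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]


greedy_like = beam_search("[BOS]", num_tokens=2, num_beams=1)
wider = beam_search("[BOS]", num_tokens=2, num_beams=2)
print(greedy_like)  # ['[BOS]', 'a', 'a']  (probability 0.6 * 0.4 = 0.24)
print(wider)        # ['[BOS]', 'b', 'c']  (probability 0.4 * 0.9 = 0.36)
```

With one beam the search commits to `a` because it is the best first step. With two beams it keeps `b` alive long enough to find the more probable two-token sequence.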

In [15]:
sampler = keras_hub.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Beam search generated text: \n{txt}\n")

Beam search generated text: 
['[BOS] " i don \' t know , " he said . " i don \' t know what i want to do , but i don \' t know what i want to do . i don \' t know what i want to do . i don \' t know what i want to do . i don \' t know what i want to do . i don \' t know what i want to do . i don \' t know what i want to do . i don \' t know what i want to do . i don \' t know what i want to do . i don \' t know what i want to do . i don \' t']



Similar to greedy search, beam search quickly starts repeating itself, since it is still
a deterministic method.

### Random search

Random search is our first probabilistic method. At each time step, it samples the next
token using the softmax probabilities provided by the model.
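
As a minimal sketch of a single sampling step, the code below converts logits to softmax probabilities and draws a token index from them. The logits are made up, `softmax` and `sample_token` are hypothetical helpers, and Python's `random` module stands in for whatever random number generator the real sampler uses.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng):
    """Draw one token index with probability proportional to softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


rng = random.Random(0)  # Fixed seed so the run is repeatable.
toy_logits = [2.0, 1.0, 0.0, -1.0]
probs = softmax(toy_logits)
samples = [sample_token(toy_logits, rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(toy_logits))]
print([round(p, 3) for p in probs])
print(counts)
```

Every token keeps a non-zero chance of being picked, including the least likely one, which is why unlikely tokens occasionally appear in the generated text.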

In [16]:
sampler = keras_hub.samplers.RandomSampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Random search generated text: \n{txt}\n")

Random search generated text: 
["[BOS] all this time the good newspaper has been apploased and generally won the handkerchiefans . the pawnees will make no difference for a circumference in life , but , in addition to your life - - at least part of your commission , you can grant us a good hand in receiving reun carl . this action is styled ' s hole , as govern ' charcome . it has greatly axplosion , for the others to come further and further . this period or is multafnesse . the men in their vine ; the"]



Voilà, no repetitions! However, with random search, we may see some nonsensical words
appearing since any word in the vocabulary has a chance of appearing with this sampling
method. This is fixed by our next search utility, top-k search.

### Top-K search

Similar to random search, we sample the next token from the probability distribution
provided by the model. The only difference is that here, we select the top `k` most
probable tokens, and redistribute the probability mass over them before sampling. This way,
we won't be sampling from low-probability tokens, and hence we should see fewer
nonsensical words!
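
The filtering step described above can be sketched as follows. This is an illustration with made-up probabilities, not the `TopKSampler` code; `top_k_filter` is a hypothetical helper.

```python
import random


def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens, then renormalize."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in top)
    return [probs[i] / kept_mass if i in top else 0.0 for i in range(len(probs))]


toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]
filtered = top_k_filter(toy_probs, k=2)
print([round(p, 3) for p in filtered])  # [0.714, 0.286, 0.0, 0.0, 0.0]

# Sampling from the filtered distribution can only return index 0 or 1.
rng = random.Random(0)
token = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(token)
```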

In [17]:
sampler = keras_hub.samplers.TopKSampler(k=10)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-K search generated text: \n{txt}\n")

Top-K search generated text: 
['[BOS] he had heard the sound of the great bird , which he had not seen him and heard that sound had come , the bird of paradise was heard to tell him that he was coming out , and that he was not only one of his friends who had escaped from the shuddering - bird - like the bird that he had heard , in the least three days , for he had been told to do all day . he was very much surprised , for , as he had told the others to do the same , and the young man had been so kind to him , he was very much in the manner . he had heard ,']



### Top-P search

Even with top-k search, there is something to improve upon. With top-k search, the
number `k` is fixed, which means it selects the same number of tokens for any probability
distribution. Consider two scenarios, one where the probability mass is concentrated over
2 words and another where the probability mass is spread evenly across 10. Should
we choose `k=2` or `k=10`? No single `k` fits both cases.

This is where top-p search comes in! Instead of choosing a `k`, we choose a probability
`p` that we want the probabilities of the top tokens to sum up to. This way, `k` is
adjusted dynamically based on the probability distribution. With `p=0.9`, if
90% of the probability mass is concentrated on the top 2 tokens, only those
2 tokens are kept to sample from. If instead the 90% is distributed over 10 tokens, the
top 10 tokens are kept.
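
The two scenarios can be checked with a small sketch. It is illustrative only: `top_p_filter` is a hypothetical helper, the distributions are made up, and it assumes the token that pushes the cumulative mass past `p` is itself kept (implementations may differ on that boundary).

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / cumulative
    return filtered


# Scenario 1: most of the mass sits on two tokens.
peaked = [0.6, 0.35] + [0.00625] * 8
# Scenario 2: the mass is spread evenly over ten tokens.
flat = [0.095] * 10 + [0.005] * 10

kept_peaked = sum(1 for x in top_p_filter(peaked, p=0.9) if x > 0)
kept_flat = sum(1 for x in top_p_filter(flat, p=0.9) if x > 0)
print(kept_peaked)  # 2
print(kept_flat)    # 10
```

The same `p` yields an effective `k` of 2 in the first case and 10 in the second.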

In [18]:
sampler = keras_hub.samplers.TopPSampler(p=0.5)
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,
)
txt = tokenizer.detokenize(output_tokens)
print(f"Top-P search generated text: \n{txt}\n")

Top-P search generated text: 
['[BOS] when he was to the left of the matgot , he was not going to take place at the head of the shaft , and was , and the scrap of the charcoal , as the old lady had said , the king had promised to be a transformed by his mother , who was proclaimed with the queen , and to attend to the court , who was the most beautiful of the time , he had told her that , when he was about to be married , he said , " go in the palace , you will go out and ask your father']



### Using callbacks for text generation

We can also wrap the utilities in a callback, which allows you to print out a prediction
sequence for every epoch of the model! Here is an example of a callback for top-k search:

In [19]:

class TopKTextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model using top-k."""

    def __init__(self, k):
        super().__init__()  # Initialize the base Callback state.
        self.sampler = keras_hub.samplers.TopKSampler(k)

    def on_epoch_end(self, epoch, logs=None):
        output_tokens = self.sampler(
            next=next,
            prompt=prompt_tokens,
            index=1,
        )
        txt = tokenizer.detokenize(output_tokens)
        print(f"Top-K search generated text: \n{txt}\n")


text_generation_callback = TopKTextGenerator(k=10)
# Dummy training loop to demonstrate callback.
model.fit(train_ds.take(1), verbose=2, epochs=2, callbacks=[text_generation_callback])

Epoch 1/2
Top-K search generated text: 
['[BOS] there was a good deal of excitement among the men , and of which were in the habit of a man who had apprehended the supernatural in the city , and of the congratulations , with its equatorials to the contrast to their homestead in the securacy , had been taken by the proclamation of the inhabitants of concluy and the inhabitants were at the same time to the proportion of the inhabitants , and had not yet seen the audibles in the city']

1/1 - 14s - 14s/step - loss: 3.8694 - perplexity: 47.9522
Epoch 2/2
Top-K search generated text: 
["[BOS] when they had left their horses they were to ride on a horse ' walk in the field to get the horses , and when they were ready with their troop , they were going off , and they were all dressed as they rode up to their horse and rode down to the stables . they were going on the horses , for the horses and horses , and horses , horses , were saddles and horses , but they were not so much afraid , they ha

<keras.src.callbacks.history.History at 0x7c04770fece0>

## Conclusion

To recap, in this example, we use KerasHub layers to train a sub-word vocabulary,
tokenize training data, create a miniature GPT model, and perform inference with the
text generation library.

If you would like to understand how Transformers work, or learn more about training the
full GPT model, here are some further readings:

- Attention Is All You Need [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)
- GPT-3 Paper [Brown et al., 2020](https://arxiv.org/abs/2005.14165)

In [20]:
!pip install transformers





In [21]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")





In [22]:
import torch

# Fine-tuning is not shown here. Note that `GPT2LMHeadModel` is a PyTorch
# module and has no Keras-style `fit()` method; fine-tuning would require a
# custom training loop or the `transformers.Trainer` API with a prepared dataset.

# Text generation function
def generate_text(prompt, max_length=50):
    # Tokenize the prompt; this returns the input IDs and an attention mask.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = gpt2_model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # Avoids the missing-mask warning.
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token.
        max_length=max_length,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example prompt
prompt = "Once upon a time in a land far away"
generated_text = generate_text(prompt)
print("Generated Text:", generated_text)


Generated Text: Once upon a time in a land far away, the sun was shining, and the moon was shining. The sun was shining, and the moon was shining. The sun was shining, and the moon was shining. The sun was shining, and the


In [23]:
# PART 2

In [24]:
!pip install git+https://github.com/keras-team/keras-hub.git -q



In [25]:
import os

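# KERAS_BACKEND only takes effect if it is set before Keras is first imported;
# restart the runtime if Keras was already imported earlier in this session.
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"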

import keras_hub
import keras
import tensorflow as tf
import time

keras.mixed_precision.set_global_policy("mixed_float16")

In [26]:
# GPT-2 Example 1 from tutorial

In [27]:
# To speed up training and generation, we use preprocessor of length 128
# instead of full length 1024.
preprocessor = keras_hub.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_hub.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)


In [28]:
start = time.time()

output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
My trip to Yosemite was an adventure that I've never experienced anywhere else. In fact, I've never been to Yosemite.

The Yosemite experience is so different from my own that I don't know how to describe it, but I have a lot of things to say about it, so here's a list of a few that are worth a read.

1. The Yosemite experience is like the experience of being a normal person.

The Yosemite experience is like being a normal person. The people there are just so nice, and I love them all.

2. There are so few people to talk to

There are so few people to talk to.

The only way to get to know the people who live there is to visit the Yosemite Valley, the largest and most important park in the world (in the Yosemite Valley, of course), and see the Yosemite Valley as an experience. It's the best place to see the world in a way that
TOTAL TIME ELAPSED: 28.80s


In [29]:
start = time.time()

output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
That Italian restaurant is still closed after the restaurant was vandalized with graffiti on the wall. (Photo: Getty Images)

The Italian restaurant that has been a fixture in the San Francisco Bay Area since the 1970s is still open.

The restaurant was vandalized by graffiti on the wall of the restaurant's restaurant in the Mission District in early August.

According to San Francisco police, a man was arrested after the restaurant was attacked and vandalized with graffiti.

The restaurant is still closed after the restaurant was vandalized with graffiti on the wall of the restaurant's restaurant in the Mission District. (Photo: Getty Images)

The San Francisco Police Department is investigating the case and will update this story if additional information is discovered.

The restaurant is open daily from 8 a.m. – 5 p.m. Monday through Friday and Saturday from 7 p.m. – 5 p.m. Sunday through Friday and from 9 a.m. –
TOTAL TIME ELAPSED: 4.18s


In [30]:
# GPT-2 Example 2

In [44]:
from transformers import pipeline

# Build a text-generation pipeline backed by the pre-trained GPT-2 checkpoint.
generator = pipeline("text-generation", model="gpt2")

prompt = "In the quiet town of Roseville, something unusual and suspicious was about to happen."

# Generate one continuation of at most 100 tokens (prompt included).
outputs = generator(prompt, max_length=100, num_return_sequences=1)

generated_text = outputs[0]["generated_text"]
print(generated_text)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In the quiet town of Roseville, something unusual and suspicious was about to happen. The local police chief, James Brown, came upon a group of young men and said, "I'm going to investigate all the murders, if I can find it. I'm going to get to the root of them, and I'm going to find them."

An arrest warrant in the investigation led to the discovery of the murder weapon and the possible use of ricocheting shots from the rifle. There


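- The generated text starts well but quickly shifts to politics, losing track of the small-town theme. It is also very repetitive, with only moderate diversity in vocabulary. The output is creative but not very relevant to the prompt. Note that the output can vary between runs, so the sample shown above may not match this description exactly.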

In [32]:
# BERT Model Example

In [54]:
from transformers import pipeline

# Sentiment classifier that predicts a rating from "1 star" to "5 stars".
classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

results = classifier("I love using BERT for natural language processing!")
print(results)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': '5 stars', 'score': 0.8061189651489258}]


- The BERT model classifies the sentiment of the text as "5 stars" with a confidence score of about 0.81, correctly detecting the positive sentiment expressed in the sentence. The result is straightforward and relevant, showing the model's effectiveness for sentiment analysis.