
Based on:
- ["Text classification from scratch"](https://keras.io/examples/nlp/text_classification_from_scratch/) by Mark Omernick, Francois Chollet
- ["Bidirectional LSTM on IMDB"](https://keras.io/examples/nlp/bidirectional_lstm_imdb/) by Francois Chollet
- ["GPT2 Text Generation with KerasNLP"](https://keras.io/examples/generative/gpt2_text_generation_with_kerasnlp/) by Chen Qian
- ["GPT text generation from scratch with KerasNLP"](https://keras.io/examples/generative/text_generation_gpt/) by Jesse Chan

Make sure you are using a GPU for this lab (otherwise some of the later steps will take a long time).



In this lab, we will revisit classifiers and RNNs, but using the Keras library, which is built on top of TensorFlow (rather than PyTorch).

We'll see two examples of getting data and making a model to do movie review prediction on IMDB data.

First, we'll work from raw text - a set of text files on disk.

## Setup

In [1]:
!pip install git+https://github.com/keras-team/keras-nlp.git -q

In [2]:
import os
import time

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import tensorflow as tf
import numpy as np
from keras import layers
import keras_nlp



## Load the data: IMDB movie review sentiment classification

Let's download the data and inspect its structure.

In [3]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  14.2M      0  0:00:05  0:00:05 --:--:-- 17.8M


The `aclImdb` folder contains a `train` and `test` subfolder, plus a few text files:

In [4]:
!ls aclImdb

README	imdb.vocab  imdbEr.txt	test  train


Inside each folder are subfolders for positive and negative data, plus files with precomputed features, and source URLs.

In [5]:
!ls aclImdb/test

labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt


In [6]:
!ls aclImdb/train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


The `aclImdb/train/pos` and `aclImdb/train/neg` folders contain text files, each of
 which represents one review (either positive or negative):

In [7]:
!cat aclImdb/train/pos/6248_7.txt

Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the directors intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated uncensored swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than put in the right context.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive and live in a world of redundancy and boring horizons. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" 

We are only interested in the `pos` and `neg` subfolders, so let's delete the other subfolder that has text files in it:

In [8]:
!rm -r aclImdb/train/unsup

You can use the utility `keras.utils.text_dataset_from_directory` to
generate a labeled `tf.data.Dataset` object from a set of text files on disk filed into class-specific folders.

Let's use it to generate the training, validation, and test datasets. The validation and training datasets are generated from two subsets of the `train` directory, with 20% of samples going to the validation dataset and 80% going to the training dataset. When using the `validation_split` & `subset` arguments, make sure to either specify a random seed, or to pass `shuffle=False`, so that the validation & training splits you get have no overlap.

In [9]:
batch_size = 32
raw_train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)
raw_val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337,
)
raw_test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
Number of batches in raw_train_ds: 625
Number of batches in raw_val_ds: 157
Number of batches in raw_test_ds: 782
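
The batch counts above follow from the split sizes and the batch size. As a quick check, here is a small sketch that assumes the final, partially filled batch is kept (which is consistent with the 157 and 782 reported above); `num_batches` is a helper name introduced only for this illustration.

```python
import math


def num_batches(num_samples, batch_size):
    """Number of batches when the last, partially filled batch is kept."""
    return math.ceil(num_samples / batch_size)


batch_size = 32
print(num_batches(20000, batch_size))  # training split (80% of 25000)
print(num_batches(5000, batch_size))   # validation split (20% of 25000)
print(num_batches(25000, batch_size))  # test set
```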


Let's look at a few samples:

In [10]:
# It's important to take a look at your raw data to ensure your normalization
# and tokenization will work as expected. We can do that by taking a few
# examples from the training set and looking at them.
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(5):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])

b'I\'ve seen tons of science fiction from the 70s; some horrendously bad, and others thought provoking and truly frightening. Soylent Green fits into the latter category. Yes, at times it\'s a little campy, and yes, the furniture is good for a giggle or two, but some of the film seems awfully prescient. Here we have a film, 9 years before Blade Runner, that dares to imagine the future as somthing dark, scary, and nihilistic. Both Charlton Heston and Edward G. Robinson fare far better in this than The Ten Commandments, and Robinson\'s assisted-suicide scene is creepily prescient of Kevorkian and his ilk. Some of the attitudes are dated (can you imagine a filmmaker getting away with the "women as furniture" concept in our oh-so-politically-correct-90s?), but it\'s rare to find a film from the Me Decade that actually can make you think. This is one I\'d love to see on the big screen, because even in a widescreen presentation, I don\'t think the overall scope of this film would receive its



## Prepare the data

We will use the `TextVectorization` layer for word splitting & indexing. In the process, we will also remove `<br />` tags.
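
To get a feel for what this kind of standardization does to a single review, here is a plain-Python sketch of the same three steps (lowercase, replace `<br />` with a space, strip punctuation). It needs no TensorFlow; `standardize_py` and the sample sentence are made up for illustration and are not part of the lab code.

```python
import re
import string


def standardize_py(text):
    """Lowercase, turn '<br />' into a space, then drop punctuation."""
    lowered = text.lower()
    no_breaks = lowered.replace("<br />", " ")
    return re.sub(f"[{re.escape(string.punctuation)}]", "", no_breaks)


sample = "Great movie!<br /><br />Don't miss it."
print(standardize_py(sample))
```

Note that removing punctuation also merges contractions ("Don't" becomes "dont"), which is worth keeping in mind when inspecting the vocabulary.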

In [11]:
import string
import re


# Having looked at our data above, we see that the raw text contains HTML break
# tags of the form '<br />'. These tags will not be removed by the default
# standardizer (which doesn't strip HTML). Because of this, we will need to
# create a custom standardization function.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(
        stripped_html, f"[{re.escape(string.punctuation)}]", ""
    )


# Model constants.
max_features = 20000
embedding_dim = 128
sequence_length = 500

# Now that we have our custom standardization, we can instantiate our text
# vectorization layer. We are using this layer to normalize, split, and map
# strings to integers, so we set our 'output_mode' to 'int'.
# Note that we're using the default split function,
# and the custom standardization defined above.
# We also set an explicit maximum sequence length, since the CNNs later in our
# model won't support ragged sequences.
vectorize_layer = keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Now that the vectorize_layer has been created, call `adapt` on a text-only
# dataset to create the vocabulary. You don't have to batch, but for very large
# datasets this means you're not keeping spare copies of the dataset in memory.

# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)
# Let's call `adapt`:
vectorize_layer.adapt(text_ds)



## Two options to vectorize the data

There are 2 ways we can use our text vectorization layer:

**Option 1: Make it part of the model**, so as to obtain a model that processes raw strings, like this:

```python
text_input = keras.Input(shape=(1,), dtype=tf.string, name='text')
x = vectorize_layer(text_input)
x = layers.Embedding(max_features + 1, embedding_dim)(x)
...
```

**Option 2: Apply it to the text dataset** to obtain a dataset of word indices, then feed it into a model that expects integer sequences as inputs.

An important difference between the two is that option 2 enables you to do
**asynchronous CPU processing and buffering** of your data when training on a GPU. So if you're training a model on a GPU, you probably want to go with this option to get the best performance. This is what we will do below.

If we were to export our model to production, we'd ship a model that accepts raw strings as input, like in the code snippet for option 1 above. This can be done after training. We do this in the "Make an end-to-end model" section below.
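
To make "a dataset of word indices" concrete, here is a toy sketch of integer vectorization in plain Python. It assumes the usual convention that index 0 is reserved for padding and index 1 for out-of-vocabulary words, with the remaining indices assigned by frequency; `build_vocab` and `to_indices` are hypothetical helpers, not Keras functions.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Most frequent tokens first; 0 is padding, 1 is out-of-vocabulary."""
    counts = Counter(tok for text in texts for tok in text.split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}


def to_indices(text, vocab, sequence_length):
    """Map tokens to ids, truncate to sequence_length, pad with 0 at the end."""
    ids = [vocab.get(tok, 1) for tok in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


texts = ["the film was good", "the film was bad", "the plot was thin"]
vocab = build_vocab(texts, max_tokens=5)
print(vocab)
print(to_indices("the film was great", vocab, sequence_length=6))
```

With `max_tokens=5` only three real words fit in the vocabulary, so "great" maps to the out-of-vocabulary index and the short sentence is padded with zeros.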

In [12]:
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label


# Vectorize the data.
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)

## Build a model

The first model we introduce uses a Convolutional Neural Network. This has not been covered in lectures (you can read more about it on [Wikipedia](https://en.wikipedia.org/wiki/Convolutional_neural_network#Building_blocks)). The general idea is that a convolution is a type of weighted average that is applied at multiple locations along the input.
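
As a rough illustration of "a weighted average applied at multiple locations", the sketch below slides a small kernel over a list of numbers with no padding and an optional stride. The helper names are made up for this example; a real `Conv1D` layer does the same thing across many channels and filters, and learns the weights.

```python
def conv1d_valid(signal, kernel, stride=1):
    """Slide `kernel` over `signal` without padding; weighted sum at each stop."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[start:start + k]))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def conv_output_length(input_length, kernel_size, stride):
    """Number of positions the kernel visits when no padding is used."""
    return (input_length - kernel_size) // stride + 1


weights = [0.25, 0.5, 0.25]  # a weighted average of three neighbours
print(conv1d_valid([4, 8, 12, 16, 20], weights))

# Sequence lengths after two unpadded convolutions with kernel size 7 and
# stride 3, starting from reviews padded to 500 tokens.
first = conv_output_length(500, kernel_size=7, stride=3)
second = conv_output_length(first, kernel_size=7, stride=3)
print(first, second)
```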

In [13]:
# An integer input for vocab indices.
inputs = keras.Input(shape=(None,), dtype="int64")

# Next, we add a layer to map those vocab indices into a space of dimensionality
# 'embedding_dim'.
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

# Conv1D + global max pooling
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

# We add a hidden layer (linear + ReLU):
x = layers.Dense(128, activation="relu")(x)
# Dropout sets some values to 0 during training, which tends to improve how well the model generalises
x = layers.Dropout(0.5)(x)

# We project onto a single unit output layer, and squash it with a sigmoid:
predictions = layers.Dense(1, activation="sigmoid", name="predictions")(x)

model = keras.Model(inputs, predictions)

# Compile the model with binary crossentropy loss and an adam optimizer.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
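
For intuition about the output layer and the loss, here is a minimal sketch of the sigmoid and of binary cross-entropy for a single example. It leaves out details a library implementation would handle, such as clipping probabilities away from 0 and 1 and averaging over a batch.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, p):
    """Loss for one example with label y_true (0 or 1) and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


p = sigmoid(0.0)  # an undecided model
print(p, binary_crossentropy(1, p))

confident = sigmoid(3.0)  # a confident positive prediction
print(binary_crossentropy(1, confident), binary_crossentropy(0, confident))
```

A confident prediction is cheap when it is right and expensive when it is wrong, which is what pushes the model towards calibrated scores.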

## Train the model

We'll run just 3 epochs (full passes over the training data) so you can see how it works. Later, if you have more time, you can run it for longer to see how well it can do.

In [14]:
epochs = 3

# Fit the model using the train and validation datasets.
model.fit(train_ds, validation_data=val_ds, epochs=epochs)

Epoch 1/3
625/625 ━━━━━━━━━━━━━━━━━━━━ 68s 106ms/step - accuracy: 0.6359 - loss: 0.5870 - val_accuracy: 0.8704 - val_loss: 0.3077
Epoch 2/3
625/625 ━━━━━━━━━━━━━━━━━━━━ 63s 101ms/step - accuracy: 0.8921 - loss: 0.2660 - val_accuracy: 0.8766 - val_loss: 0.3205
Epoch 3/3
625/625 ━━━━━━━━━━━━━━━━━━━━ 61s 98ms/step - accuracy: 0.9466 - loss: 0.1456 - val_accuracy: 0.8784 - val_loss: 0.3796


<keras.src.callbacks.history.History at 0x7f216c6fda00>

## Evaluate the model on the test set

We also evaluate on the test data. The accuracy is lower than on the training data because the test data is unseen.

In [15]:
model.evaluate(test_ds)

782/782 ━━━━━━━━━━━━━━━━━━━━ 17s 22ms/step - accuracy: 0.8623 - loss: 0.4156


[0.39533329010009766, 0.8659200072288513]

## Make an end-to-end model

If you want to obtain a model capable of processing raw strings, you can create a new model using the weights we just trained, but a different input type and an extra vectorisation step:

In [16]:
# A string input
inputs = keras.Input(shape=(1,), dtype="string")
# Turn strings into vocab indices
indices = vectorize_layer(inputs)
# Turn vocab indices into predictions
outputs = model(indices)

# Our end to end model
end_to_end_model = keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Test it with `raw_test_ds`, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)

782/782 ━━━━━━━━━━━━━━━━━━━━ 19s 23ms/step - accuracy: 0.8624 - loss: 0.4141


[0.3954782783985138, 0.8659200072288513]

# LSTM

Now let's see how to construct an LSTM in Keras.

First, we'll set up some parameters:

In [17]:
import numpy as np
import keras
from keras import layers

max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Keep at most 200 words of each movie review

Keras has code for the LSTM already, so we don't need to implement all the details ourselves.

The code below sets up the model, using integers as input (i.e., as with the CNN above, we assume something else has already converted our input words into word IDs):

In [18]:
# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Add a bidirectional LSTM
# return_sequences determines whether we get just the last vector or all:
#    True - give us an output for each word
#    False - give us just one output
x = layers.Bidirectional(layers.LSTM(64, return_sequences=False))(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()
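
The parameter counts that `model.summary()` reports can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and one bias vector); the helper functions are hypothetical and exist only for this calculation.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus bias."""
    return input_dim * units + units


emb = embedding_params(20000, 128)
bi_lstm = 2 * lstm_params(128, 64)  # forward and backward LSTM
dense = dense_params(2 * 64, 1)     # the two directions are concatenated
print(emb, bi_lstm, dense, emb + bi_lstm + dense)
```

Almost all of the parameters sit in the embedding table, which is typical for small text classifiers with a large vocabulary.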

Load the IMDB movie review sentiment data. This time we are going to use the library-provided version of the same data, which is already converted to word IDs.

In [19]:
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(
    num_words=max_features
)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
# Use pad_sequences to standardize sequence length:
# this will truncate sequences longer than 200 words and zero-pad sequences shorter than 200 words.
x_train = keras.utils.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.utils.pad_sequences(x_val, maxlen=maxlen)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
25000 Training sequences
25000 Validation sequences
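
To see what padding and truncation do to individual sequences, here is a plain-Python sketch. It mirrors what I understand to be the defaults of `keras.utils.pad_sequences` (`padding="pre"` and `truncating="pre"`): zeros are added at the front, and long sequences lose items from the front, so the last `maxlen` tokens are the ones kept. `pad_pre` is a made-up helper for illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then left-pad with `value` up to `maxlen`."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


print(pad_pre([5, 6, 7], maxlen=5))              # shorter than maxlen
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # longer than maxlen
```

If you want to keep the start of each review instead, `pad_sequences` accepts `truncating="post"`.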


Train and evaluate the model. You will see similar results to the approach above. To do better, we would need to train for longer and explore other variations on the model structure.

In [20]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))

Epoch 1/2
782/782 ━━━━━━━━━━━━━━━━━━━━ 176s 220ms/step - accuracy: 0.7187 - loss: 0.5287 - val_accuracy: 0.8555 - val_loss: 0.3441
Epoch 2/2
782/782 ━━━━━━━━━━━━━━━━━━━━ 169s 217ms/step - accuracy: 0.9020 - loss: 0.2553 - val_accuracy: 0.8592 - val_loss: 0.3259


<keras.src.callbacks.history.History at 0x7f216c66a970>

# Task 1 - Multilayer LSTM

Modify the LSTM architecture to have two layers. You can do this by having an extra layer that produces output at each position, which is fed into the next layer as an input.

To get credit, print the model summary to show your architecture.

# Text generation

Now, we are going to turn to text generation to see some variations on inference.

You will learn to load a pre-trained Language Model (LM) - [GPT-2](https://openai.com/research/better-language-models) (originally developed by OpenAI), finetune it to a specific text style, and generate text based on a user's input (also known as a prompt). You will also see how GPT2 adapts quickly to non-English languages, such as Chinese.

This example uses [Keras Core](https://keras.io/keras_core/) to work in any of `"tensorflow"`, `"jax"` or `"torch"`. Support for Keras Core is baked into KerasNLP; simply change the `"KERAS_BACKEND"` environment variable to select the backend of your choice. We select the JAX backend below.

In [23]:
# TODO
max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Keep at most 200 words of each movie review

inputs = keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(max_features, 128)(inputs)
# First layer: return_sequences=True gives one output per position,
# which the second layer consumes as its input sequence
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# Second layer: assign the result back to x so the classifier actually uses it
x = layers.LSTM(64, return_sequences=False)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

In [24]:
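# Note: KERAS_BACKEND is read when `keras` is first imported. Keras was already
# imported above with the TensorFlow backend, so to really switch backends,
# set this variable before that import and restart the kernel.
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"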

keras.mixed_precision.set_global_policy("mixed_float16")

Large Language Models are complex to build and expensive to train from scratch. Luckily there are pretrained LLMs available for use right away. Keras provides a large number of pre-trained checkpoints that allow you to experiment with very good models without needing to train them yourself.

KerasNLP also has a `Sampler` class that implements generation algorithms such as Top-K, beam and contrastive search. These samplers can be used to generate text with custom models.

## Load a pre-trained GPT-2 model and generate some text

KerasNLP provides a number of pre-trained models, such as [Google
Bert](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
and [GPT-2](https://openai.com/research/better-language-models). You can see
the list of models available in the [KerasNLP repository](https://github.com/keras-team/keras-nlp/tree/master/keras_nlp/models).

It's very easy to load the GPT-2 model as you can see below:

In [25]:
# To speed up training and generation, we use a preprocessor with sequence
# length 32 instead of the full length 1024.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=32,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/tokenizer.json...
100%|██████████| 448/448 [00:00<00:00, 640kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/assets/tokenizer/merges.txt...
100%|██████████| 446k/446k [00:00<00:00, 3.00MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/assets/tokenizer/vocabulary.json...
100%|██████████| 0.99M/0.99M [00:00<00:00, 5.79MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/config.json...
100%|██████████| 484/484 [00:00<00:00, 532kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/model.weights.h5...
100%|██████████| 475M/475M [00:04<00:00, 118MB/s]  


Once the model is loaded, you can use it to generate some text right away. Run
the cells below to give it a try. It's as simple as calling a single function
*generate()*:

In [26]:
start = time.time()

output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")

I0000 00:00:1711088766.596885     239 service.cc:145] XLA service 0x55e4725a5710 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1711088766.596933     239 service.cc:153]   StreamExecutor device (0): Host, Default Version
2024-03-22 06:26:06.859027: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1711088787.156091     239 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.



GPT-2 output:
My trip to Yosemite was pretty much over by the time I got there.
The only things that kept me from going to Yosemite were the rocks, the trees, and the snow. I was able to get through a bit of the day and I had a good time. But I didn't get to enjoy the views. It was pretty cold outside. The weather was really cold outside.
I was really excited about the weather in Yosemite, because it was a beautiful place. The only problem I have with the weather is the fact I was not able to get a good view of the mountain. I had to take a photo of the view. It was very dark, and the sun was really low. I had to take some photos of it.
I was really disappointed in the weather for the first few days. I was so excited to go there. I was very excited to get there. I really enjoyed it. It was a beautiful place. I was so happy to see the
TOTAL TIME ELAPSED: 47.28s


Try another one:

In [27]:
start = time.time()

output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
That Italian restaurant is known as "The Italian Kitchen," and it's located on the corner of the East and West avenues of Westfield, just north of the intersection of East Street and East Broadway.

It's the second restaurant in a row to be named after the Italian restaurant owner.

It's also named after the owner of the Italian restaurant that's been named after the chef at the restaurant that opened in 2011.

"We're going to make the best food in the city," said owner and chef Roberto Giustra. "And we're just excited to do it."

It's a restaurant that's been named after the owner, who's been in business for more than a decade. Giustra, who also serves as the chef at the restaurant, said the new owner will bring a lot of experience to the restaurant, which he says has been growing rapidly over the past few years and was the inspiration behind his new location in the former East Village
TOTAL TIME ELAPSED: 18.15s


Notice how much faster the second call is. This is because the computational
graph is [XLA compiled](https://www.tensorflow.org/xla) in the 1st run and
re-used in the 2nd behind the scenes.

The quality of the generated text looks OK, but we can improve it via
fine-tuning.

## More on the GPT-2 model from KerasNLP

Next up, we will actually fine-tune the model to update its parameters, but
before we do, let's take a look at the full set of tools we have for working
with GPT2.

The code of GPT2 can be found
[here](https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/gpt2/).
Conceptually the `GPT2CausalLM` can be hierarchically broken down into several
modules in KerasNLP, all of which have a *from_preset()* function that loads a
pretrained model:

- `keras_nlp.models.GPT2Tokenizer`: the tokenizer used by the GPT2 model, which is a
    [byte-pair encoder](https://huggingface.co/course/chapter6/5?fw=pt).
- `keras_nlp.models.GPT2CausalLMPreprocessor`: the preprocessor used for GPT2
    causal LM training. It does the tokenization along with other preprocessing
    work, such as creating the labels and appending the end token.
- `keras_nlp.models.GPT2Backbone`: the GPT2 model, which is a stack of
    `keras_nlp.layers.TransformerDecoder` layers. This is usually just referred to as
    `GPT2`.
- `keras_nlp.models.GPT2CausalLM`: wraps `GPT2Backbone`; it multiplies the
    output of `GPT2Backbone` by the embedding matrix to generate logits over
    vocab tokens.
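
The "creating the label" step can be pictured as shifting the token sequence by one position, so that each position is trained to predict the token that follows it. The sketch below is a simplified illustration under my own assumptions (made-up token IDs, a hypothetical `make_causal_lm_example` helper, padding ID 0); it is not the actual `GPT2CausalLMPreprocessor` implementation.

```python
def make_causal_lm_example(token_ids, sequence_length, end_id, pad_id=0):
    """Append an end token, pad or truncate, then shift to get inputs and labels."""
    ids = (token_ids + [end_id])[: sequence_length + 1]
    ids = ids + [pad_id] * (sequence_length + 1 - len(ids))
    inputs, labels = ids[:-1], ids[1:]
    # Padding positions get weight 0 so they do not contribute to the loss.
    weights = [0 if label == pad_id else 1 for label in labels]
    return inputs, labels, weights


inputs, labels, weights = make_causal_lm_example(
    [11, 12, 13], sequence_length=5, end_id=99
)
print(inputs)
print(labels)
print(weights)
```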

## Finetune on News data

Now that you know the pieces of the GPT-2 model in KerasNLP, you can go one step further and finetune the model so that it generates text in a specific style, short or long, strict or casual. In this tutorial, we will use a news dataset.

In [31]:
pip install tensorflow-datasets

Note: you may need to restart the kernel to use updated packages.


In [32]:
import tensorflow_datasets as tfds

news_ds = tfds.load("ag_news_subset", split="train", as_supervised=True)



Downloading and preparing dataset 11.24 MiB (download: 11.24 MiB, generated: 35.79 MiB, total: 47.03 MiB) to /home/studio-lab-user/tensorflow_datasets/ag_news_subset/1.0.0...


Dataset ag_news_subset downloaded and prepared to /home/studio-lab-user/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.


Let's take a look at a sample from the ag_news TensorFlow Dataset. There are two features:

- **__description__**: text of the article.
- **__label__**: the category (used in earlier labs).

In [33]:
for description, label in news_ds:
    print(description.numpy())
    print(label.numpy())
    break

b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.'
3




In our case, we are performing next word prediction in a language model, so we
only need the 'description' feature.

In [34]:
train_ds = (
    news_ds.map(lambda description, _: description)
    .batch(32)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

Now you can finetune the model using the familiar *fit()* function. Note that
`preprocessor` will be called automatically inside the `fit` method, since
`GPT2CausalLM` is a `keras_nlp.models.Task` instance.

This step takes quite a bit of GPU memory and a long time if we were to train
it all the way to a fully trained state. Here we just use part of the dataset
for demo purposes.

In [None]:
train_ds = train_ds.take(500)
num_epochs = 1

# Linearly decaying learning rate.
learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-5,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)
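
The linearly decaying learning rate used above can be written out by hand. This sketch assumes the schedule is the linear case of polynomial decay (power 1, no cycling), going from `5e-5` to `0.0` over the 500 training steps; `linear_decay` is a hypothetical helper, not a Keras function.

```python
def linear_decay(step, initial_lr, decay_steps, end_lr=0.0):
    """Learning rate after `step` updates, decaying linearly to `end_lr`."""
    step = min(step, decay_steps)  # hold at end_lr once decay_steps is reached
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left + end_lr


for step in (0, 250, 500):
    print(step, linear_decay(step, 5e-5, decay_steps=500))
```

Starting small and decaying to zero is a common choice for fine-tuning, since large updates can quickly undo what the pre-trained model already knows.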

W0000 00:00:1711089041.147362     315 assert_op.cc:38] Ignoring Assert operator compile_loss/sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/assert_equal_1/Assert/Assert
2024-03-22 06:33:31.903900: E external/local_xla/xla/service/slow_operation_alarm.cc:65] 
********************************
[Compiling module a_inference_one_step_on_data_160102__.46898] Very slow compile? If you want to file a bug, run with envvar XLA_FLAGS=--xla_dump_to=/tmp/foo and attach the results.
********************************
2024-03-22 06:38:51.828035: E external/local_xla/xla/service/slow_operation_alarm.cc:133] The operation took 7m19.924283759s

2024-03-22 06:38:51.943203: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocatio

  1/500 ━━━━━━━━━━━━━━━━━━━━ 77:38:16 560s/step - accuracy: 0.2162 - loss: 4.8677

2024-03-22 06:39:09.638537: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 6237921952 exceeds 10% of free system memory.


  2/500 ━━━━━━━━━━━━━━━━━━━━ 2:07:53 15s/step - accuracy: 0.2424 - loss: 4.6517

2024-03-22 06:39:25.042363: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 6237921952 exceeds 10% of free system memory.


  3/500 ━━━━━━━━━━━━━━━━━━━━ 2:13:56 16s/step - accuracy: 0.2487 - loss: 4.5922

2024-03-22 06:39:41.965267: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 6237921952 exceeds 10% of free system memory.


  4/500 ━━━━━━━━━━━━━━━━━━━━ 2:14:31 16s/step - accuracy: 0.2546 - loss: 4.5394

2024-03-22 06:39:58.453157: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 6237921952 exceeds 10% of free system memory.


 17/500 ━━━━━━━━━━━━━━━━━━━━ 2:08:31 16s/step - accuracy: 0.2785 - loss: 4.2954

After fine-tuning is finished, you can again generate text using the same
*generate()* function. This time, the text will be closer to news writing
style, and the generated length will be close to our preset length in the
training set.

In [None]:
start = time.time()

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")

## Into the Sampling Method

In KerasNLP, there are a few sampling methods, e.g., contrastive search,
Top-K and beam sampling. By default, our `GPT2CausalLM` uses Top-K search, but you can choose your own sampling method.

Much like optimizers and activations, there are two ways to specify your custom sampler:

- Use a string identifier, such as `"greedy"`. This gives you the sampler with its default
configuration.
- Pass a `keras_nlp.samplers.Sampler` instance. This lets you use a custom configuration.
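
To see what the simplest samplers do with the model's output, here is a toy sketch in plain Python: greedy selection always takes the highest-scoring token, while random sampling draws from the softmax distribution, optionally sharpened or flattened by a temperature. The four-word vocabulary and the logits are invented for illustration, and the helpers are not KerasNLP functions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn scores into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def random_pick(logits, temperature=1.0, rng=random):
    """Sample a token index according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


vocab = ["ball", "hoop", "court", "banana"]
logits = [2.0, 1.5, 1.0, -1.0]

print(vocab[greedy_pick(logits)])
rng = random.Random(0)
print([vocab[random_pick(logits, temperature=0.9, rng=rng)] for _ in range(5)])
```

Because greedy selection is deterministic, it tends to fall into loops on longer generations, which is why sampling-based methods are usually preferred for open-ended text.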

In [None]:
# Use a string identifier.
gpt2_lm.compile(sampler="top_k")
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Use a `Sampler` instance. `GreedySampler` tends to repeat itself.
greedy_sampler = keras_nlp.samplers.GreedySampler()
gpt2_lm.compile(sampler=greedy_sampler)

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Task 2 - Varying Samplers

Try several other samplers:

- `BeamSampler`, keeps track of the `num_beams` most probable sequences at each timestep, and predicts the best next token from all sequences.
- `ContrastiveSampler`, reweights options based on their similarity to previously generated outputs (to decrease repetition).
- `RandomSampler`, at each time step, samples the next token using the softmax probabilities provided by the model.
- `TopKSampler`, selects the top `k` most probable tokens and then samples from them.
- `TopPSampler`, like Top-K, except it selects enough tokens to cover a cumulative probability of `p` (which could be more or fewer tokens than the `k` value from the previous approach).

For more information about these, see: https://keras.io/api/keras_nlp/samplers/
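
To make the differences between these strategies concrete, here is a minimal pure-Python sketch of temperature scaling, top-k filtering and top-p filtering applied to a toy next-token distribution. It is an illustration only and not the KerasNLP implementation; the helper names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) and the toy logits are made up for this example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


# Toy logits for a hypothetical 5-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
probs = softmax(logits, temperature=0.8)
rng = random.Random(0)

# Greedy: always pick the single most probable token.
greedy_token = max(range(len(probs)), key=lambda i: probs[i])
# Random: sample from the full distribution.
random_token = sample(dict(enumerate(probs)), rng)
# Top-k: sample only among the 2 most probable tokens.
top_k_dist = top_k_filter(probs, k=2)
# Top-p: sample among as many tokens as needed to reach 90% of the mass.
top_p_dist = top_p_filter(probs, p=0.9)

print("greedy:", greedy_token)
print("random:", random_token)
print("top-k candidates:", sorted(top_k_dist))
print("top-p candidates:", sorted(top_p_dist))
```

Note how the number of top-p candidates depends on the shape of the distribution, while top-k always keeps a fixed number of tokens.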

To complete the task, show that you have implemented all five and used them to generate output.

You should experiment with the parameters for each sampler and see what patterns you observe in terms of the outputs (this is not marked).

In [None]:
# Create one instance of each of the five samplers.
# `k` and `p` are left at their defaults for the Top-K and Top-P samplers.
beam_sampler = keras_nlp.samplers.BeamSampler(num_beams=5)
contrastive_sampler = keras_nlp.samplers.ContrastiveSampler(temperature=0.7)
random_sampler = keras_nlp.samplers.RandomSampler(temperature=0.9)
topk_sampler = keras_nlp.samplers.TopKSampler(temperature=0.8)
topp_sampler = keras_nlp.samplers.TopPSampler(temperature=0.8)

# Beam search
gpt2_lm.compile(sampler=beam_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Contrastive search
gpt2_lm.compile(sampler=contrastive_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Random sampling
gpt2_lm.compile(sampler=random_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Top-K sampling
gpt2_lm.compile(sampler=topk_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Top-P sampling
gpt2_lm.compile(sampler=topp_sampler)
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)


# [optional] Finetune on Chinese Poem Dataset

We can also fine-tune GPT2 on non-English datasets. For readers who know Chinese,
this part illustrates how to fine-tune GPT2 on a Chinese poem dataset to teach our
model to become a poet!

Because GPT2 uses a byte-pair encoder, and the original pretraining dataset
contains some Chinese characters, we can use the original vocabulary to fine-tune on
a Chinese dataset.
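
The reason the original vocabulary can still represent Chinese text is that GPT-2's byte-pair encoding operates on UTF-8 bytes, so any string can be broken down into byte values 0 to 255 even when no merged token exists for it. The sketch below only illustrates that byte-level fallback using plain Python; it does not run the actual GPT-2 tokenizer, and the variable names are illustrative.

```python
text = "昨夜雨疏风骤"

# Each of these characters is encoded as three UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
byte_values = list(utf8_bytes)

print(len(text), "characters ->", len(byte_values), "bytes")
print(byte_values[:3])  # the three bytes of the first character

# Every value fits in one byte, so a byte-level vocabulary covers all of them.
assert all(0 <= b <= 255 for b in byte_values)

# Decoding the bytes recovers the original string without loss.
round_trip = bytes(byte_values).decode("utf-8")
print(round_trip == text)
```

Characters without dedicated merges are split into several byte-level tokens, so Chinese text typically consumes more tokens per character than English text. This is worth keeping in mind when choosing `max_length`.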

In [None]:
# Load the Chinese poetry dataset.
!git clone https://github.com/chinese-poetry/chinese-poetry.git

Load the text from the JSON files. We only use 全唐诗 ("All Tang poetry") for demo purposes.

In [None]:
import os
import json

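poem_dir = "chinese-poetry/全唐诗"
poem_collection = []
for file in os.listdir(poem_dir):
    # Keep only the JSON files whose names contain "poet".
    if ".json" not in file or "poet" not in file:
        continue
    full_filename = os.path.join(poem_dir, file)
    # The files contain Chinese text, so read them explicitly as UTF-8.
    with open(full_filename, "r", encoding="utf-8") as f:
        content = json.load(f)
        poem_collection.extend(content)

# Join the lines of each poem into a single string.
paragraphs = ["".join(data["paragraphs"]) for data in poem_collection]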

Let's take a look at sample data.

In [None]:
print(paragraphs[0])

This is translated by Google Translate as "Stay in Qingjidian, Chaoyang Lidi City. In the good years, people enjoy their work and sing songs on the ridges."

As in the news example, we convert the text to a TF dataset and train on only part of the data.

In [None]:
train_ds = (
    tf.data.Dataset.from_tensor_slices(paragraphs)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Running through the whole dataset takes a long time, so take only 500
# batches and run 1 epoch for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1

learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)
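
For intuition about the learning-rate schedule used above: with its default `power=1.0`, `PolynomialDecay` interpolates linearly from the initial rate to `end_learning_rate` over `decay_steps` steps. The following pure-Python sketch reproduces that formula under the assumption of default settings (`power=1.0`, `cycle=False`) and a dataset with at least 500 batches; the function name `polynomial_decay` is hypothetical and this is not the Keras implementation itself.

```python
def polynomial_decay(step, initial_lr=5e-4, decay_steps=500, end_lr=0.0, power=1.0):
    """Learning rate at `step`, held at `end_lr` once `decay_steps` is reached."""
    step = min(step, decay_steps)
    fraction_left = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left**power + end_lr


# 500 batches * 1 epoch gives 500 decay steps in the cell above.
for step in (0, 250, 500):
    print(step, polynomial_decay(step))
```

The rate starts at `5e-4`, is halved by the midpoint of training, and reaches zero on the final batch.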

Let's check the result!

In [None]:
output = gpt2_lm.generate("昨夜雨疏风骤", max_length=200) # Input is "Rainy and windy last night"
print(output)