# **Pre-Trained BERT Model**

We can download this model from Keras Hub. We will work with RoBERTa, a later variant of BERT whose name stands for "Robustly Optimized BERT Pretraining Approach". The architecture is essentially the same as BERT's; its main advantage is that it was trained on roughly 10x as much data, so its weights are better tuned.

In [2]:
import keras_hub

# Backbone here refers to the RoBERTa base layers
tokenizer = keras_hub.models.Tokenizer.from_preset("roberta_base_en")
backbone = keras_hub.models.Backbone.from_preset("roberta_base_en")

Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/config.json...


100%|██████████| 498/498 [00:00<00:00, 764kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/tokenizer.json...


100%|██████████| 463/463 [00:00<00:00, 686kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/assets/tokenizer/vocabulary.json...


100%|██████████| 0.99M/0.99M [00:01<00:00, 755kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/assets/tokenizer/merges.txt...


100%|██████████| 446k/446k [00:01<00:00, 428kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/roberta/keras/roberta_base_en/2/download/model.weights.h5...


100%|██████████| 474M/474M [00:31<00:00, 15.9MB/s]


We need to use the tokenizer that goes with the model to make sure it pre-processes our text in a way that the model expects.

In [3]:
tokenizer("The quick brown fox")

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([  133,  2119,  6219, 23602], dtype=int32)>
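To see why the tokenizer has to match the model, here is a toy sketch in plain Python. It does not use the Keras Hub API, both vocabularies are made up for illustration, and it splits on whitespace, whereas RoBERTa's real tokenizer works on subword pieces. The only point is that the same text maps to different integer IDs under different vocabularies, so IDs produced by a mismatched tokenizer would look up the wrong embedding rows.

```python
# Toy illustration only: both vocabularies below are hypothetical.
vocab_a = {"the": 0, "quick": 1, "brown": 2, "fox": 3}
vocab_b = {"fox": 0, "brown": 1, "quick": 2, "the": 3}


def encode(text, vocab):
    """Map each lowercased, whitespace-separated word to its integer ID."""
    return [vocab[word] for word in text.lower().split()]


def decode(ids, vocab):
    """Map integer IDs back to words using the given vocabulary."""
    id_to_word = {index: word for word, index in vocab.items()}
    return " ".join(id_to_word[index] for index in ids)


ids_a = encode("The quick brown fox", vocab_a)
print(ids_a)                   # IDs under vocabulary A
print(decode(ids_a, vocab_a))  # round-trips with the matching vocabulary
print(decode(ids_a, vocab_b))  # scrambled with the mismatched vocabulary
```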

Here are the base layers we loaded. The backbone is essentially an embedding layer (token plus position embeddings) followed by layer normalization, dropout on the embeddings to reduce overfitting, and then a stack of transformer encoder blocks (12 in this base preset).

In [5]:
backbone.summary()

Let's try using this pre-trained model with our IMDB reviews...

In [7]:
import os, pathlib, shutil, random, keras

zip_path = keras.utils.get_file(
    origin="https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    fname="imdb",
    extract=True,
)

imdb_extract_dir = pathlib.Path(zip_path) / "aclImdb"
train_dir = pathlib.Path("imdb_train")
test_dir = pathlib.Path("imdb_test")
val_dir = pathlib.Path("imdb_val")

shutil.copytree(imdb_extract_dir / "test", test_dir, dirs_exist_ok=True)

val_percentage = 0.2
for category in ("neg", "pos"):
    src_dir = imdb_extract_dir / "train" / category
    src_files = os.listdir(src_dir)
    random.Random(1337).shuffle(src_files)
    num_val_samples = int(len(src_files) * val_percentage)

    os.makedirs(train_dir / category, exist_ok=True)
    os.makedirs(val_dir / category, exist_ok=True)
    for index, file in enumerate(src_files):
        if index < num_val_samples:
            shutil.copy(src_dir / file, val_dir / category / file)
        else:
            shutil.copy(src_dir / file, train_dir / category / file)

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84125825/84125825 ━━━━━━━━━━━━━━━━━━━━ 13s 0us/step


Now we create our TensorFlow Dataset objects from the .txt files.

In [9]:
batch_size = 16
train_ds = keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    val_dir, batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    test_dir, batch_size=batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
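The 20,000 / 5,000 train/validation counts reported above follow from holding out 20% of each class folder after a seeded shuffle. The sketch below reproduces that arithmetic on made-up file names, assuming 12,500 training reviews per class (consistent with the counts above). The helper name `split_files` is hypothetical and does no file copying.

```python
import random


def split_files(files, val_fraction=0.2, seed=1337):
    """Shuffle a copy of `files` deterministically, then split off the
    first `val_fraction` of it as validation. Returns (train, val)."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    num_val = int(len(shuffled) * val_fraction)
    return shuffled[num_val:], shuffled[:num_val]


# Hypothetical file names standing in for one class folder.
files = [f"{i}_0.txt" for i in range(12500)]
train_files, val_files = split_files(files)

# Per class: 10,000 train and 2,500 validation, so 20,000 and 5,000 over both classes.
print(len(train_files), len(val_files))
```

Because the seed is fixed, rerunning the split gives the same partition, which keeps the validation set stable across notebook runs.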


Next, we apply the RoBERTa tokenizer to our datasets. Before the token IDs go to the RoBERTa model, however, we also need to add the start, end, and padding tokens that were present in the data used to train RoBERTa.

In [11]:
# We are 'packing' some additional tokens onto each tokenized review here:
# a start token, an end token, and padding tokens for reviews shorter than
# 512 tokens. Reviews longer than that are truncated to 512 tokens.
def preprocess(text, label):
    packer = keras_hub.layers.StartEndPacker(
        sequence_length=512,
        start_value=tokenizer.start_token_id,
        end_value=tokenizer.end_token_id,
        pad_value=tokenizer.pad_token_id,
        return_padding_mask=True,
    )

    # Tokenize the raw text first, then pack the start, end, and padding tokens around the token IDs
    token_ids, padding_mask = packer(tokenizer(text))
    return {"token_ids": token_ids, "padding_mask": padding_mask}, label

# Now that our preprocessing function is written, we can apply it to our TensorFlow Datasets
preprocessed_train_ds = train_ds.map(preprocess)
preprocessed_val_ds = val_ds.map(preprocess)
preprocessed_test_ds = test_ds.map(preprocess)
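To make the packing step concrete, here is a minimal pure-Python sketch of what it appears to do for a single sequence: truncate, wrap with start and end IDs, pad to a fixed length, and build a mask. The function `pack` is a hypothetical stand-in written for illustration, not the `StartEndPacker` implementation. The IDs 0, 2, and 1 for start, end, and padding are taken from the batch printed below and should be treated as an assumption.

```python
def pack(token_ids, sequence_length, start_id, end_id, pad_id):
    """Truncate, wrap with start/end IDs, and pad to `sequence_length`.

    Returns (packed_ids, padding_mask), where the mask is True at real
    tokens (including start/end) and False at padding positions.
    Assumes sequence_length >= 2.
    """
    # Reserve two slots for the start and end tokens.
    body = list(token_ids)[: sequence_length - 2]
    packed = [start_id] + body + [end_id]
    num_pad = sequence_length - len(packed)
    mask = [True] * len(packed) + [False] * num_pad
    return packed + [pad_id] * num_pad, mask


# Short input: gets padded out to the target length.
short_ids, short_mask = pack([133, 2119, 6219, 23602], 8, start_id=0, end_id=2, pad_id=1)
print(short_ids)   # [0, 133, 2119, 6219, 23602, 2, 1, 1]
print(short_mask)  # [True, True, True, True, True, True, False, False]

# Long input: gets truncated, but still ends with the end token.
long_ids, long_mask = pack(range(100, 120), 8, start_id=0, end_id=2, pad_id=1)
print(long_ids)    # [0, 100, 101, 102, 103, 104, 105, 2]
```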

Here is a pre-processed batch of data. Each review is now an integer sequence of length 512, paired with a padding mask that is True at real tokens and False at padding positions. The mask tells RoBERTa which positions to ignore, during both training and inference.

In [12]:
next(iter(preprocessed_train_ds))

({'token_ids': <tf.Tensor: shape=(16, 512), dtype=int32, numpy=
  array([[    0,   713,  1569, ...,     1,     1,     1],
         [    0,   100,    78, ...,   693,   109,     2],
         [    0,   713,    16, ...,     1,     1,     1],
         ...,
         [    0,  7682,   200, ...,    12, 23760,     2],
         [    0,   100, 31124, ...,     1,     1,     1],
         [    0,   100,   524, ...,     1,     1,     1]], dtype=int32)>,
  'padding_mask': <tf.Tensor: shape=(16, 512), dtype=bool, numpy=
  array([[ True,  True,  True, ..., False, False, False],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ..., False, False, False],
         ...,
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ..., False, False, False],
         [ True,  True,  True, ..., False, False, False]])>},
 <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0], dtype=int32)>)
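One common way an attention layer can honor a padding mask like the one printed above is to push the scores of masked positions to negative infinity before the softmax, so those positions receive zero weight. The sketch below is a generic illustration with made-up scores, not RoBERTa's or Keras Hub's actual implementation. It assumes at least one position is unmasked.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight wherever `mask` is False.

    Assumes at least one entry of `mask` is True.
    """
    # Replace masked-out scores with -inf so exp() maps them to exactly 0.
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores for 4 positions; the last two are padding.
weights = masked_softmax([2.0, 1.0, 0.5, 3.0], [True, True, False, False])
print(weights)  # padded positions get weight 0.0; the rest sum to 1
```

Note that the padded position with the largest raw score (3.0) still ends up with zero weight, which is the point of the mask.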

Okay, now we can add some Dense layers onto the RoBERTa backbone and use it for our predictions!

In [16]:
from keras import layers

inputs = backbone.input
x = backbone(inputs)

# Freeze the backbone layers
backbone.trainable = False

# Isolate one specific embedding from the backbone output: the one at position 0, which acts as a document-level embedding.
# In BERT this position holds the CLS (classification) token; in RoBERTa the start token plays the same role.
# It is a fixed token that is added at the beginning of every text sequence in the training data.
# During self-supervised pre-training (masked word prediction), the network learns to adjust this token's embedding based on all the words in the sequence.
# As a result, the embedding at this position tends to capture relevant information about the entire sequence.
# This is often a more useful, less noisy summary than simply averaging all the token embeddings in the sequence.
x = x[:, 0, :]

x = layers.Dropout(0.1)(x)

# Each embedding is a 768-dimensional vector; we pass it through a Dense layer with ReLU activation, apply dropout, and finish with a sigmoid prediction.
x = layers.Dense(768, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

classifier = keras.Model(inputs, outputs)

classifier.compile(
    optimizer=keras.optimizers.Adam(5e-5),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

classifier.fit(
    preprocessed_train_ds,
    validation_data=preprocessed_val_ds,
)

1250/1250 ━━━━━━━━━━━━━━━━━━━━ 117s 67ms/step - accuracy: 0.9394 - loss: 0.1709 - val_accuracy: 0.9242 - val_loss: 0.1998


<keras.src.callbacks.history.History at 0x795e6dfe6e50>
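The `x[:, 0, :]` slice in the classifier above picks the first position's vector from each sequence. The sketch below shows that operation on plain nested lists shaped (batch, sequence, hidden), next to the alternative the code comments mention: averaging token vectors, here while skipping padded positions. All values are made up, and both helper names are hypothetical.

```python
def first_token_embedding(hidden_states):
    """Equivalent of x[:, 0, :] for nested lists shaped (batch, seq, hidden)."""
    return [sequence[0] for sequence in hidden_states]


def masked_mean_embedding(hidden_states, padding_mask):
    """Average the token vectors of each sequence, skipping padded positions."""
    pooled = []
    for sequence, mask in zip(hidden_states, padding_mask):
        kept = [vec for vec, keep in zip(sequence, mask) if keep]
        pooled.append([sum(dim) / len(kept) for dim in zip(*kept)])
    return pooled


# Made-up values: 2 sequences, 3 positions each, hidden size 2.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[5.0, 6.0], [7.0, 8.0], [1.0, 0.0]],
]
mask = [
    [True, True, False],  # last position is padding
    [True, True, True],
]

cls_vectors = first_token_embedding(hidden)
mean_vectors = masked_mean_embedding(hidden, mask)
print(cls_vectors)   # [[1.0, 2.0], [5.0, 6.0]]
print(mean_vectors)  # one averaged vector per sequence
```

Either function returns one vector per sequence, so either could feed the Dense head. Which works better on a given task is an empirical question.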