# Notebook 03: Experiments on transformer models

So far, I have explored pre-transformer technology (clustering for regression, tree-based models, simple dense neural networks, CNNs, and RNNs with LSTM and GRU cells). In this notebook, I experiment with two transformer models on the sentence classification task.

>[Notebook 03: Experiments on transformer models](#scrollTo=uGcgaawZml6-)

>>[3.1 Load and check dependencies](#scrollTo=YU-tMXP7NrUx)

>>>[3.1.1 Load and install dependencies](#scrollTo=HoXBt5hhNzkH)

>>>[3.1.2 Load and preprocess data](#scrollTo=e2KR5pTXODRR)

>>[3.2 Transformer models](#scrollTo=_uSTqDxLQQ88)

>>>[3.2.1 customized BERT model](#scrollTo=9-mekik6wnJk)

>>>[3.2.2 customized GPT-2 model](#scrollTo=v1Cjn3yCwvS_)

>>[3.3 Conjectures and reasons for failing to achieve an accuracy beyond $40$ percent](#scrollTo=UX6fIN5uwzAz)



## 3.1 Load and check dependencies

### 3.1.1 Load and install dependencies

In [1]:
# check for GPU
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-0ca1e208-4df8-8f79-c99e-ad49084a7f9c)


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
from tensorflow.keras.utils import plot_model

I will be using the `keras_nlp` library to import pretrained transformer encoders. `keras_nlp` also requires `tensorflow_text` as a dependency, so we install both here.

In [3]:
!pip -q install tensorflow-text

In [4]:
!pip -q install keras-nlp

In [5]:
import tensorflow_text
import keras_nlp

In [6]:
!wget https://raw.githubusercontent.com/ZYWZong/ML_Practice_Projects/refs/heads/main/SkimLit_project_practice/SkimLit_utils.py

from SkimLit_utils import *

--2024-12-30 23:03:55--  https://raw.githubusercontent.com/ZYWZong/ML_Practice_Projects/refs/heads/main/SkimLit_project_practice/SkimLit_utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7876 (7.7K) [text/plain]
Saving to: ‘SkimLit_utils.py’


2024-12-30 23:03:56 (72.9 MB/s) - ‘SkimLit_utils.py’ saved [7876/7876]



### 3.1.2 Load and preprocess data

In [7]:
!git clone --quiet https://github.com/Franck-Dernoncourt/pubmed-rct.git

We preprocess the datasets as before.

In [8]:
data_dir = "pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

# preprocess the data as in notebook 00
train_df, dev_df, test_df = SkimLit_preprocess_master(data_dir)
data_all = SkimLit_preprocess_OneHot_NN(train_df, dev_df, test_df)
label_Encoded = SkimLit_preprocess_EncodedLabels(train_df,dev_df,test_df)

train_sentences = data_all["train_text"]
dev_sentences = data_all["dev_text"]
test_sentences = data_all["test_text"]

train_labels = data_all["train_label"]
dev_labels = data_all["dev_label"]
test_labels = data_all["test_label"]

And, as before, let's create `tf.data.Dataset` objects for our datasets.

In [9]:
train_data = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels))
dev_data = tf.data.Dataset.from_tensor_slices((dev_sentences, dev_labels))
test_data = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels))

## 3.2 Transformer models

In this section, I will build two customized transformer models for our sentence classification task, one using BERT and the other using GPT-2.

### 3.2.1 customized BERT model

**!IMPORTANT NOTE!**

> PS: After several attempts to build customized layers (including adding various dense, pooling layers) and fine-tuning hyperparameters (learning rate and sequence length), I still couldn't get the model beyond a training accuracy of $40\%$. So, below, I present my best attempt.



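I experiment with the base BERT encoder model. First, I need to further preprocess the data to be compatible with the BERT encoder. Recall from **notebook 01** that $95\%$ of our sentences are shorter than $55$ words, so a sequence length of $55$ should be sufficient for most of them. Note that BERT counts WordPiece tokens (plus special tokens) rather than words, so some sentences under $55$ words may still be truncated.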
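
The $95\%$ figure can be re-derived with a short check. The sketch below is a minimal illustration, assuming sentence length is measured in whitespace-separated words; `length_percentile` and the toy sentences are hypothetical and not part of `SkimLit_utils`.

```python
import numpy as np

def length_percentile(sentences, q=95):
    """Return the q-th percentile of whitespace-separated word counts."""
    lengths = [len(s.split()) for s in sentences]
    return float(np.percentile(lengths, q))

# toy data for illustration only; in this notebook one would pass train_sentences
toy_sentences = ["a b c", "a b", "a b c d e", "a"]
print(length_percentile(toy_sentences, q=50))  # 2.5
```

Running the same function on the training sentences with `q=95` should give a value close to the chosen sequence length if the claim from notebook 01 holds.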

In [10]:
preprocessor_BERT = keras_nlp.models.BertPreprocessor.from_preset("bert_base_en_uncased", sequence_length=55)

In [11]:
# helper function for preprocessing data for BERT encoder
def preprocess_fn_BERT(sentences, label):
    return preprocessor_BERT(sentences), label

# again use a batch of 32 and preprocess the data for BERT
train_data_BERT = train_data.map(preprocess_fn_BERT).batch(32).prefetch(tf.data.AUTOTUNE)
dev_data_BERT = dev_data.map(preprocess_fn_BERT).batch(32).prefetch(tf.data.AUTOTUNE)
test_data_BERT = test_data.map(preprocess_fn_BERT).batch(32).prefetch(tf.data.AUTOTUNE)

Now, let's construct a simple model with the base BERT encoder followed by a few dense layers. I also add a dropout layer after the BERT encoder to reduce overfitting to the training data. Although this dropout layer makes the model less interpretable, I'll still use it for this practice.

In [12]:
# load the BERT encoder
bert_encoder = keras_nlp.models.BertBackbone.from_preset("bert_base_en_uncased")
bert_encoder.trainable = False  # freeze all the parameters in BERT

# inputs
token_ids = tf.keras.layers.Input(shape=(55,), dtype=tf.int32, name="token_ids")
padding_mask = tf.keras.layers.Input(shape=(55,), dtype=tf.int32, name="padding_mask")
segment_ids = tf.keras.layers.Input(shape=(55,), dtype=tf.int32, name="segment_ids")

# pass the preprocessed inputs into the BERT encoder
encoder_input = {"token_ids": token_ids, "padding_mask": padding_mask, "segment_ids": segment_ids}
encoder_output = bert_encoder(encoder_input)

# customized dense layers
x = tf.keras.layers.Dropout(0.2)(encoder_output["pooled_output"]) # regularize

# adding dense layers to reduce the x's dimensions by stages
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dense(48, activation="relu")(x)

# output layer
output = tf.keras.layers.Dense(5, activation="softmax", name = "output_layer")(x)

model_BERT_base_custom = tf.keras.Model(inputs=[token_ids, padding_mask, segment_ids],
                                 outputs=output,
                                 name = "model_BERT_base_custom")

model_BERT_base_custom.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

model_BERT_base_custom.summary()
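
Because the encoder is frozen, only the dense head is trained. As a rough cross-check of the trainable parameter count that `summary()` reports, the head can be counted by hand. This is a sketch assuming the pooled output has $768$ features (the hidden size of BERT base); `dense_params` is a hypothetical helper, and the dropout layer contributes no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# assumption: pooled_output has 768 features (hidden size of BERT base)
layer_sizes = [768, 256, 48, 5]
head_params = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(head_params)  # 209445
```

If the trainable count in the summary differs from this number, some part of the encoder is probably not frozen as intended.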

In [13]:
history_bert = model_BERT_base_custom.fit(train_data_BERT,
                                          epochs = 5,
                                          validation_data = dev_data_BERT,
                                          validation_steps = int(0.1*len(dev_data_BERT)))

Epoch 1/5
5627/5627 ━━━━━━━━━━━━━━━━━━━━ 723s 120ms/step - accuracy: 0.3376 - loss: 1.4750 - val_accuracy: 0.3551 - val_loss: 1.4568
Epoch 2/5
5627/5627 ━━━━━━━━━━━━━━━━━━━━ 644s 114ms/step - accuracy: 0.3520 - loss: 1.4597 - val_accuracy: 0.3521 - val_loss: 1.4397
Epoch 3/5
5627/5627 ━━━━━━━━━━━━━━━━━━━━ 644s 114ms/step - accuracy: 0.3538 - loss: 1.4579 - val_accuracy: 0.3644 - val_loss: 1.4525
Epoch 4/5
5627/5627 ━━━━━━━━━━━━━━━━━━━━ 645s 115ms/step - accuracy: 0.3538 - loss: 1.4567 - val_accuracy: 0.3614 - val_loss: 1.4407
Epoch 5/5
5627/5627 ━━━━━━━━━━━━━━━━━━━━ 682s 115ms/step - accuracy: 0.3541 - loss: 1.4565 - val_accuracy: 0.3537 - val_loss: 1.4515


### 3.2.2 customized GPT-2 model

**!IMPORTANT NOTE!**

> PS: A phenomenon similar to my BERT experiments is observed here: my customized GPT-2 model also fails to get beyond $40\%$ accuracy on the training data.



In [14]:
preprocessor_GPT2 = keras_nlp.models.GPT2Tokenizer.from_preset("gpt2_base_en", sequence_length=55)

Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/config.json...


100%|██████████| 484/484 [00:00<00:00, 1.00MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/tokenizer.json...


100%|██████████| 448/448 [00:00<00:00, 300kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/assets/tokenizer/vocabulary.json...


100%|██████████| 0.99M/0.99M [00:01<00:00, 780kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/assets/tokenizer/merges.txt...


100%|██████████| 446k/446k [00:01<00:00, 436kB/s]


In [15]:
train_data = tf.data.Dataset.from_tensor_slices((train_sentences, train_labels))
dev_data = tf.data.Dataset.from_tensor_slices((dev_sentences, dev_labels))
test_data = tf.data.Dataset.from_tensor_slices((test_sentences, test_labels))

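# helper function for preprocessing data for the GPT-2 backbone
def preprocess_fn_GPT2(sentences, label):
    tokenized = preprocessor_GPT2(sentences)  # token ids, padded/truncated to length 55
    # treat id 0 as padding; this assumes the tokenizer pads with 0
    padding_mask = tf.cast(tokenized != 0, dtype=tf.int32)
    return {"token_ids": tokenized, "padding_mask": padding_mask}, label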

train_data_GPT2 = train_data.map(preprocess_fn_GPT2).batch(32).prefetch(tf.data.AUTOTUNE)
dev_data_GPT2 = dev_data.map(preprocess_fn_GPT2).batch(32).prefetch(tf.data.AUTOTUNE)
test_data_GPT2 = test_data.map(preprocess_fn_GPT2).batch(32).prefetch(tf.data.AUTOTUNE)

In [16]:
gpt2_backbone = keras_nlp.models.GPT2Backbone.from_preset("gpt2_base_en", trainable = False)
#gpt2_backbone.trainable = False

token_ids = tf.keras.layers.Input(shape=(55,), dtype=tf.int32, name="token_ids")
padding_mask = tf.keras.layers.Input(shape=(55,), dtype=tf.int32, name="padding_mask")

# Pass inputs through GPT-2 backbone
encoder_inputs = {"token_ids": token_ids, "padding_mask": padding_mask}
outputs = gpt2_backbone(encoder_inputs)

# Use the last token's embedding for classification
last_token_embedding = outputs[:, -1, :]

# perform classification
x = tf.keras.layers.Dropout(0.1)(last_token_embedding)
x = tf.keras.layers.Dense(16, activation="relu")(x)
output = tf.keras.layers.Dense(5, activation="softmax")(x)

model_GPT2 = tf.keras.Model(inputs=[token_ids, padding_mask], outputs=output, name = "model_GPT2")

# NOTE: learning_rate=0.1 is 100x the Keras default for Adam (0.001); kept as in the recorded run
model_GPT2.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
                   optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
                   metrics=["accuracy"])

model_GPT2.summary()

Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_base_en/2/download/model.weights.h5...


100%|██████████| 475M/475M [00:30<00:00, 16.1MB/s]
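
One detail worth checking in the model above: `outputs[:, -1, :]` always takes position $54$ of the sequence. Assuming the tokenizer pads on the right, that position holds a padding token for every sentence shorter than $55$ tokens, not the last real token. The NumPy sketch below shows one way to select the last non-padding position from the padding mask instead; `last_real_token` and the toy arrays are hypothetical, and the mask is assumed to be a run of ones followed by zeros.

```python
import numpy as np

def last_real_token(hidden, padding_mask):
    """Pick, per example, the hidden state at the last position where mask == 1."""
    lengths = padding_mask.sum(axis=1)        # number of real tokens per row
    last_idx = np.maximum(lengths - 1, 0)     # guard against all-padding rows
    return hidden[np.arange(hidden.shape[0]), last_idx]

# toy batch: 2 sequences, 4 positions, hidden size 3 (each value is its position index)
hidden = np.tile(np.arange(4).reshape(1, 4, 1), (2, 1, 3)).astype(float)
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 1]])
print(last_real_token(hidden, mask)[:, 0])  # [1. 3.]
```

For the first toy sequence, `hidden[:, -1, :]` would return position 3 (padding), whereas the function returns position 1. Inside a Keras model, the same selection could be expressed with `tf.gather` and `batch_dims=1`.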


In [19]:
history_GPT2 = model_GPT2.fit(train_data_GPT2,
                              epochs = 5,
                              validation_data = dev_data_GPT2,
                              validation_steps = int(0.1*len(dev_data_GPT2)))

Epoch 1/5
5627/5627 ━━━━━━━━━━━━━━━━━━━━ 657s 116ms/step - accuracy: 0.3266 - loss: 1.4780 - val_accuracy: 0.3205 - val_loss: 1.4745
Epoch 2/5
5627/5627 ━━━━━━━━━━━━━━━━━━━━ 654s 116ms/step - accuracy: 0.3252 - loss: 1.4785 - val_accuracy: 0.3288 - val_loss: 1.4586
Epoch 3/5
5627/5627 ━━━━━━━━━━━━━━━━━━━━ 651s 116ms/step - accuracy: 0.3252 - loss: 1.4784 - val_accuracy: 0.3195 - val_loss: 1.4725
Epoch 4/5
5627/5627 ━━━━━━━━━━━━━━━━━━━━ 649s 115ms/step - accuracy: 0.3253 - loss: 1.4785 - val_accuracy: 0.3394 - val_loss: 1.4613
Epoch 5/5
5627/5627 ━━━━━━━━━━━━━━━━━━━━ 649s 115ms/step - accuracy: 0.3253 - loss: 1.4780 - val_accuracy: 0.3291 - val_loss: 1.4694


## 3.3 Conjectures and reasons for failing to achieve an accuracy beyond $40$ percent


1. The pretrained transformer encoder layers yield an output dimension ($768$ for both base models) much larger than the dimension of the relevant information, which I take to be $55$ (the sequence length) for our data. The large dimension results in a very slow convergence rate when fitting the model to our training data.

2. Moreover, I am only training on top of the BERT layer for a few epochs ($\sim 10$ or fewer; the runs shown above use $5$). In fact, it has been reported that to achieve good accuracy with, for example, a BERT encoder, one may need to train for hundreds of epochs [[reference](https://www.kaggle.com/models/google/experts-bert/tensorFlow2/pubmed/2?tfhub-redirect=true)]. However, due to the limited computation budget on Google Colab, I decided not to continue experimenting with these models.
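
A quick way to put the accuracies above in context is to compare them with what a model would score while ignoring its input. Always predicting the most frequent class gives the majority-class accuracy, and predicting the class frequencies gives a cross-entropy equal to the entropy of the label distribution (at most $\ln 5 \approx 1.609$ for five classes). The sketch below computes both; `prior_baselines` and the toy labels are hypothetical, and the real training labels would need to be passed in as a flat list of class names or ids.

```python
import math
from collections import Counter

def prior_baselines(labels):
    """Return (majority-class accuracy, entropy of the label prior in nats)."""
    counts = Counter(labels)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    majority_acc = max(probs)
    prior_entropy = -sum(p * math.log(p) for p in probs)
    return majority_acc, prior_entropy

# toy labels for illustration only; replace with the actual training labels
toy_labels = ["METHODS"] * 2 + ["RESULTS"] * 2
acc, ent = prior_baselines(toy_labels)
print(acc, round(ent, 4))  # 0.5 0.6931
```

If the training accuracy and loss sit close to these two numbers, the classification head has probably learned little beyond the class prior, which would point to a problem in the setup rather than only to slow convergence.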

