In this project, we add custom layers on top of a pretrained Hugging Face TFDistilBertModel to build a transfer-learning classifier for text sentiment analysis. If no such customization is needed, Hugging Face already provides ready-made classification models, such as those loaded through AutoModelForSequenceClassification, but they are not covered here. We demonstrate how to build and train the transfer-learning model with the Keras functional API.

First, we install and import the relevant packages. To save GPU memory, we limit the maximum input sequence length to 32 tokens. The embedding dimension of DistilBERT is fixed at 768.

In [None]:
! pip install datasets transformers

In [None]:
pip install evaluate

In [82]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
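import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.python.client import device_lib
from transformers import pipeline
import evaluate

embed_dim=768 #hidden size of DistilBERT, fixed by the pretrained model
maxlen=32     #maximum number of tokens per input sequence

#List all local devices, then the GPUs only (public API instead of the private K._get_available_gpus)
print(device_lib.list_local_devices())
tf.config.list_physical_devices('GPU')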

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3876449853543412423
xla_global_id: -1
]


[]

Our dataset is from Huggingface at https://huggingface.co/datasets/mteb/tweet_sentiment_extraction.

In [4]:
from datasets import load_dataset
raw_datasets = load_dataset("mteb/tweet_sentiment_extraction")
from transformers import AutoTokenizer
from transformers import TFDistilBertModel
tokenizer=AutoTokenizer.from_pretrained('distilbert-base-cased')


Let's do some data exploration. The structure of our dataset is displayed below.

In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 27481
    })
    test: Dataset({
        features: ['id', 'text', 'label', 'label_text'],
        num_rows: 3534
    })
})

The dataset has three sentiment classes: negative (0), neutral (1) and positive (2). In many datasets the number of samples per label varies widely, which can bias a model toward the majority classes. Fortunately, that is not the case here: the three classes have reasonably similar sizes, so we skip oversampling the minority classes.

In [6]:
n_negative=[1 if label == 0 else 0 for label in raw_datasets['train']['label']]
n_neutral=[1 if label == 1 else 0 for label in raw_datasets['train']['label']]
n_positive=[1 if label == 2 else 0 for label in raw_datasets['train']['label']]
print('number of negative, neutral and positive sentiments in train set:',sum(n_negative),sum(n_neutral),sum(n_positive))
assert sum(n_negative)+sum(n_neutral)+sum(n_positive)==len(raw_datasets['train'])

number of negative, neutral and positive sentiments in train set: 7781 11118 8582
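
As a rough sketch of how the class balance could be quantified, the snippet below counts labels with collections.Counter and reports each class's share. The helper class_proportions and the toy label list are made up for illustration and are not part of the notebook; the toy list stands in for raw_datasets['train']['label'].

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: fraction of samples} for an iterable of integer labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

# Toy labels standing in for raw_datasets['train']['label'] (hypothetical data)
toy_labels = [0, 1, 1, 2, 1, 0, 2, 1]
proportions = class_proportions(toy_labels)
print(proportions)

# Ratio between the largest and the smallest class; values near 1 mean well balanced
imbalance_ratio = max(proportions.values()) / min(proportions.values())
print(imbalance_ratio)
```

With the counts printed above (7781, 11118 and 8582), the same ratio is 11118/7781, or about 1.43, which is consistent with treating the classes as reasonably balanced.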


Next, let's test our tokenizer. It splits the sentence into tokens, breaking words that are not in its vocabulary into smaller subword pieces. It then maps every token to its unique integer index. It also prepends the [CLS] token (index 101) and appends the [SEP] token (index 102), as is common practice for BERT-family models. The attention mask marks real tokens with 1 and padding with 0 so that the model ignores the padded positions; it is padded in the same way as the input ids.

In [10]:
sentence1='Hello machineLearning! ab$f5n;O'
tokenizer(sentence1)

{'input_ids': [101, 8667, 3395, 2162, 19386, 3381, 106, 170, 1830, 109, 175, 1571, 1179, 132, 152, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
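
To illustrate what "blocking" padded positions means, here is a simplified sketch of a masked softmax over attention scores. It is a conceptual illustration only, not DistilBERT's actual implementation; the function name masked_softmax and the score values are invented for the example, and at least one position is assumed to be unmasked.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over `scores` that gives zero weight to positions where the mask is 0."""
    # Replacing a score with -inf makes its exponential exactly 0
    masked = [s if m == 1 else -math.inf for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]   # hypothetical attention scores for 4 positions
attention_mask = [1, 1, 1, 0]   # the last position is padding
weights = masked_softmax(scores, attention_mask)
print(weights)
```

Even though the padded position has the highest raw score, it receives zero weight, and the remaining weights still sum to 1.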

Then we define a preprocessing function that uses the tokenizer to tokenize the sentences. We don't pad at this stage because we want dynamic padding, which can only be done after batching. We will come back to this later.

Also, many samples are longer than 32 tokens and need to be truncated. Instead of discarding the truncated part, we let over-length texts overflow into several overlapping chunks and replicate the label for each chunk, so that one text yields several samples. I did this only for demonstration purposes and did not verify whether it improves accuracy.

In [11]:
def preprocess_nopad(raw_dataset,max_length=maxlen,stride=maxlen//2):
  texts=raw_dataset['text']
  labels=raw_dataset['label']
  assert len(texts)==len(labels)
  #Tokenize without padding; over-length texts overflow into overlapping chunks
  inputs=tokenizer(
      texts,
      max_length=max_length,
      truncation=True,
      stride=stride,
      return_overflowing_tokens=True,
  )
  #overflow[i] is the index of the original text that chunk i came from
  overflow=inputs['overflow_to_sample_mapping']
  #Replicate each label once per chunk of its text
  overflowed_labels=np.array(labels)[overflow].tolist()
  assert len(inputs['input_ids'])==len(overflowed_labels)
  ret_input={'input_ids':inputs['input_ids'],'labels':overflowed_labels,
             'attention_mask':inputs['attention_mask'],'overflow':overflow}
  return ret_input
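
The overflow mechanism can be sketched in plain Python as a sliding window. This is a simplified model under stated assumptions: it ignores the special tokens that the real tokenizer adds to every chunk (and counts toward max_length), and the helper names split_with_overlap and expand_samples are hypothetical, not part of any library.

```python
def split_with_overlap(token_ids, max_length, stride):
    """Split token_ids into chunks of at most max_length tokens.

    Consecutive chunks share `stride` tokens, so the window advances by
    max_length - stride positions each time.
    """
    assert 0 <= stride < max_length
    step = max_length - stride
    chunks, start = [], 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += step
    return chunks

def expand_samples(samples, labels, max_length, stride):
    """Chunk every sample and replicate its label once per chunk."""
    chunks, chunk_labels, mapping = [], [], []
    for i, (ids, label) in enumerate(zip(samples, labels)):
        for chunk in split_with_overlap(ids, max_length, stride):
            chunks.append(chunk)
            chunk_labels.append(label)
            mapping.append(i)  # plays the role of overflow_to_sample_mapping
    return chunks, chunk_labels, mapping

samples = [list(range(10)), [100, 101, 102]]  # toy token ids
labels = [2, 0]
chunks, chunk_labels, mapping = expand_samples(samples, labels, max_length=4, stride=2)
print(chunks)
print(chunk_labels, mapping)
```

The long sample produces four overlapping chunks that all carry its label, while the short sample passes through as a single chunk.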

In [13]:
train_val_set=raw_datasets['train'].map(preprocess_nopad,
                                    batched=True,
                                    remove_columns=raw_datasets["train"].column_names)
train_val_set=train_val_set.shuffle() #shuffle returns a new dataset rather than shuffling in place
train_val_set=train_val_set.train_test_split(test_size=0.15) #train-validation split
train_set=train_val_set['train']
val_set=train_val_set['test']
print(train_set,'\n',val_set)

Dataset({
    features: ['input_ids', 'labels', 'attention_mask', 'overflow'],
    num_rows: 27923
}) 
 Dataset({
    features: ['input_ids', 'labels', 'attention_mask', 'overflow'],
    num_rows: 4928
})


Now we are ready to create batched, iterable dataloaders from the datasets. We borrow a TFAutoModelForSequenceClassification model only to call its prepare_tf_dataset function, which automatically handles batching and dynamic padding and returns tuples with the structure ({'input_ids':input_ids,'attention_mask':attention_mask},labels). We do not use its weights. We introduce this extra model because the prepare_tf_dataset function of the TFDistilBertModel we are using does not return the labels.

There are certainly other ways to build dataloaders. For example, we can use the dataset.to_tf_dataset function, but the padding will no longer be dynamic.

In [None]:
from transformers import TFAutoModelForSequenceClassification
tempmodel = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
#tempmodel: any model whose prepare_tf_dataset function returns
# ({'input_ids':input_ids,'attention_mask':attention_mask},labels). TFDistilBertModel won't do the job.
train_loader = tempmodel.prepare_tf_dataset(train_set, batch_size=16, shuffle=True, tokenizer=tokenizer)
val_loader = tempmodel.prepare_tf_dataset(val_set, batch_size=16, shuffle=False, tokenizer=tokenizer)
del tempmodel
print(train_loader)

In [16]:
print(train_loader,'\n-----------------')
for i,t in enumerate(train_loader):
  if i==0:
    print('input keys and shapes:',t[0].keys(),t[0]['input_ids'].shape,t[0]['attention_mask'].shape)
    print('labels:',t[1])
    break

<_PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(16, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(16, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(16,), dtype=tf.int64, name=None))> 
-----------------
input keys and shapes: dict_keys(['input_ids', 'attention_mask']) (16, 32) (16, 32)
labels: tf.Tensor([1 1 1 2 1 1 0 1 2 1 2 0 0 1 2 1], shape=(16,), dtype=int64)
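
Dynamic padding means that each batch is padded only to the length of its own longest sequence, which is why the element spec above shows a sequence dimension of None. A minimal pure-Python sketch of the idea follows; the helpers pad_batch and dynamic_batches are hypothetical, pad_id=0 is an assumption made for the example, and this is not how prepare_tf_dataset is implemented internally.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in one batch to the longest sequence in that batch."""
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return input_ids, attention_mask

def dynamic_batches(sequences, batch_size):
    """Yield (input_ids, attention_mask) pairs, padding each batch independently."""
    for start in range(0, len(sequences), batch_size):
        yield pad_batch(sequences[start:start + batch_size])

# Toy token id sequences of different lengths
toy_sequences = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102], [101, 5, 6, 102]]
batches = list(dynamic_batches(toy_sequences, batch_size=2))
batch_widths = [len(input_ids[0]) for input_ids, _ in batches]
print(batch_widths)
```

The two batches end up with different widths, so no batch carries more padding than its longest member requires.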


We now build our model using the Keras functional API, with a pretrained DistilBERT as the first layer. Disregarding the batch dimension, DistilBERT takes a 1D input of any length (typically <=512) and outputs a 2D tensor: one dimension equals the input length, and the other is the embedding size of 768 used throughout its multi-head attention blocks. We take the zeroth output embedding (at the position of the [CLS] token), which has size 768, and attach a dense layer to it, followed by the output layer with 3 classes. Dropout layers are added to mitigate overfitting. The dropout rates and the dense layer size are tunable hyperparameters.

For the loss, we apply a softmax to the 3 outputs of our model and use the SparseCategoricalCrossentropy loss to compare them against the labeled class, which is an integer (0, 1 or 2).
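
For a single sample, sparse categorical cross-entropy is the negative log of the probability that the model assigns to the true class. The sketch below computes it by hand, ignoring the numerical safeguards that library implementations typically include; the function names and probability values are made up for illustration.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one sample: -log of the probability assigned to the true class."""
    return -math.log(probs[label])

def mean_loss(batch_probs, batch_labels):
    """Average the per-sample losses over a batch."""
    losses = [sparse_categorical_crossentropy(p, y)
              for p, y in zip(batch_probs, batch_labels)]
    return sum(losses) / len(losses)

# Hypothetical softmax outputs for 3 samples over (negative, neutral, positive)
batch_probs = [[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.3, 0.4, 0.3]]
batch_labels = [0, 1, 2]  # integer class indices, not one-hot vectors
loss = mean_loss(batch_probs, batch_labels)
print(loss)
```

The third sample contributes the most to the loss because the model gives its true class a probability of only 0.3. The "sparse" variant is used because the labels are plain integers rather than one-hot vectors.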

In [84]:
from tensorflow.keras.layers import Input, Dense, Dropout
bert_model = TFDistilBertModel.from_pretrained('distilbert-base-cased') #pretrained but fine-tunable
inputs = {'input_ids': Input(shape=(None,), dtype=tf.int32, name='input_ids'),
    'attention_mask': Input(shape=(None,), dtype=tf.int32, name='attention_mask')}
#[0] is the last hidden state; [:,0,:] selects the embedding at the [CLS] position
bert_output = bert_model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])[0][:,0,:]
### Additional layers ###
bert_output=Dropout(0.3)(bert_output)
dense = Dense(128, activation='relu')(bert_output) #randomly initialized
dense = Dropout(0.3)(dense)
output = Dense(3, activation='softmax')(dense) #randomly initialized
######
model = keras.models.Model(inputs=inputs, outputs=output)

Some layers from the model checkpoint at distilbert-base-cased were not used when initializing TFDistilBertModel: ['vocab_projector', 'vocab_transform', 'vocab_layer_norm', 'activation_13']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


In [19]:
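#The output layer already applies softmax, so the loss takes probabilities (from_logits=False is the default)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=2e-5),
              loss=keras.losses.SparseCategoricalCrossentropy())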

We can inspect our model below. As stated, all weights in the DistilBERT model are trainable, so training a model built on top of the pretrained DistilBERT also fine-tunes the weights inside it.

In [85]:
model.summary()

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 tf_distil_bert_model_6 (TFDist  TFBaseModelOutput(l  65190912   ['input_ids[0][0]',              
 ilBertModel)                   ast_hidden_state=(N               'attention_mask[0][0]']         
                                one, None, 768),                                                  
                                 hidden_states=None                                         

In [None]:
#Function directly taken from Huggingface: https://huggingface.co/docs/transformers/tasks/sequence_classification
#This function is for validation use only
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [None]:
from transformers.keras_callbacks import KerasMetricCallback
KMC = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=val_loader) #evaluate on the validation set
ESC = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)
MCC = tf.keras.callbacks.ModelCheckpoint(
    filepath='/content/drive/MyDrive/HF_BERT_clsfr/clsfr_chpt',
    save_weights_only=True,
    monitor='val_loss',
    mode='min',
    save_best_only=True)

With all that set, we can start training. The fit method reads the inputs in the format described above: as long as the training and validation dataloaders follow that format and use keys that match the names of the model's Input layers, the training loop fetches the data from the dataloaders correctly.

In [None]:
model.fit(x=train_loader,validation_data=val_loader,callbacks=[KMC,ESC,MCC],epochs=5) #the batch size is set by the dataloaders

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
 105/1745 [>.............................] - ETA: 2:01 - loss: 0.2418

In [None]:
model.fit(x=train_loader,validation_data=val_loader,callbacks=[KMC,ESC,MCC],epochs=6) #the batch size is set by the dataloaders

Epoch 1/6
Epoch 2/6
Epoch 3/6


<keras.callbacks.History at 0x7f20a463f130>

In [86]:
model.load_weights('/content/drive/MyDrive/HF_BERT_clsfr/clsfr_chpt')

<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x7fa487d514e0>

Below is the evaluation part.

The classify function also splits over-length texts into overlapping chunks. Since one prediction is returned for each chunk, the final prediction for the original sample is the rounded numerical average of the predicted class indices over all its chunks.
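
The aggregation step can be sketched on its own, without the model. The helper aggregate_by_rounded_mean and the toy predictions below are hypothetical; the sketch assumes that every original sample index from 0 to the maximum appears at least once in the mapping.

```python
from collections import defaultdict

def aggregate_by_rounded_mean(chunk_preds, mapping):
    """Average chunk-level class indices per original sample, then round."""
    sums, counts = defaultdict(int), defaultdict(int)
    for pred, idx in zip(chunk_preds, mapping):
        sums[idx] += pred
        counts[idx] += 1
    n_samples = max(mapping) + 1
    return [round(sums[i] / counts[i]) for i in range(n_samples)]

# Toy chunk predictions: sample 0 has 3 chunks, sample 1 has 1, sample 2 has 2
chunk_preds = [2, 2, 1, 0, 0, 2]
mapping = [0, 0, 0, 1, 2, 2]
final_preds = aggregate_by_rounded_mean(chunk_preds, mapping)
print(final_preds)
```

Averaging class indices relies on the labels being ordered (negative < neutral < positive): in the toy data, one negative chunk and one positive chunk average to neutral for sample 2. Both Python's round and np.round round exact halves to the nearest even integer, so a mean of 0.5 becomes 0 and a mean of 1.5 becomes 2.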

In [66]:
from tqdm.auto import tqdm
def classify(model,texts,max_length=maxlen,stride=maxlen//2,batch_size=16):
  #Tokenize; over-length texts are split into overlapping chunks, each padded to max_length
  inputs=tokenizer(
      texts,
      max_length=max_length,
      truncation=True,
      stride=stride,
      return_overflowing_tokens=True,
      padding='max_length',
  )
  overflow=inputs['overflow_to_sample_mapping'] #chunk index -> index of the original text
  inputdict={'input_ids':np.array(inputs['input_ids']),'attention_mask':np.array(inputs['attention_mask'])}
  num_batches=(len(overflow)+batch_size-1)//batch_size #ceiling division
  pred=[]
  for n in tqdm(range(num_batches)):
    batch_input={'input_ids':inputdict['input_ids'][n*batch_size:(n+1)*batch_size],
             'attention_mask':inputdict['attention_mask'][n*batch_size:(n+1)*batch_size]}
    #Predicted class index for every chunk in the batch
    pred+=np.argmax(model(batch_input,training=False).numpy(),axis=1).tolist()
  assert len(overflow)==len(pred),(len(overflow),len(pred))
  #Average the chunk-level class indices per original text, then round
  pred_backflowed=np.zeros(overflow[-1]+1)
  counts=np.zeros_like(pred_backflowed)
  for j,idx in enumerate(overflow):
    pred_backflowed[idx]+=pred[j]
    counts[idx]+=1
  pred_backflowed=np.round(pred_backflowed/counts).astype(int)
  assert len(pred_backflowed)==len(texts)
  return pred_backflowed
def acc(arr1,arr2):
  #Fraction of positions where the two label sequences agree
  assert len(arr1)==len(arr2),(len(arr1),len(arr2))
  matches=(np.array(arr1)==np.array(arr2))
  return np.sum(matches)/len(matches)

The test accuracy does not seem as good as the training or validation accuracy. However, I have not yet taken the time to tune the hyperparameters or the model structure. It looks like there is still some serious overfitting despite the high dropout rates. Also, for a model as complex as DistilBERT (even though it is a lightweight member of its family), I doubt that a training set of about 30k samples is enough.

A closer look at the confusion matrix shows that most mistakes involve the 'neutral' sentiment. This explains a lot, since the boundaries of a 'neutral' sentiment are inherently ambiguous.

In [None]:
from sklearn.metrics import confusion_matrix
ev=classify(model,raw_datasets['test']['text'],batch_size=16)
print('test accuracy:',acc(ev,raw_datasets['test']['label']))
cm = confusion_matrix(raw_datasets['test']['label'], ev)
print('confusion matrix (vertical=label,horizontal=pred):\n',cm)

  0%|          | 0/262 [00:00<?, ?it/s]

test accuracy: 0.7710809281267685
confusion matrix (vertical=label,horizontal=pred):
 [[ 779  195   27]
 [ 217 1043  170]
 [  32  168  903]]
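
To make the reading of the confusion matrix more concrete, the sketch below derives per-class precision and recall, and the share of errors that involve the 'neutral' class, from the matrix printed above. The helper per_class_metrics is written for this illustration only.

```python
# Confusion matrix from the output above (rows = true label, columns = prediction)
cm = [[779, 195, 27],
      [217, 1043, 170],
      [32, 168, 903]]
names = ['negative', 'neutral', 'positive']

def per_class_metrics(cm):
    """Return a list of (precision, recall) tuples, one per class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        true_positive = cm[k][k]
        row_total = sum(cm[k])                       # samples whose true label is k
        col_total = sum(cm[r][k] for r in range(n))  # samples predicted as k
        metrics.append((true_positive / col_total, true_positive / row_total))
    return metrics

total = sum(sum(row) for row in cm)
correct = sum(cm[k][k] for k in range(len(cm)))
errors = total - correct
# Errors between negative and positive are the only ones not involving 'neutral'
neutral_errors = errors - cm[0][2] - cm[2][0]

for name, (precision, recall) in zip(names, per_class_metrics(cm)):
    print(f'{name}: precision={precision:.3f} recall={recall:.3f}')
print('accuracy:', correct / total)
print('share of errors involving neutral:', neutral_errors / errors)
```

Of the 809 misclassified samples, 750 (about 93%) have 'neutral' as either the true label or the prediction, while only 59 are direct confusions between negative and positive.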


Finally, let's look at some examples. I feel that some labels in the test set are not quite accurate. In the following examples, I think our model actually made the right predictions on the two samples where its predictions and the labels diverged.

In [103]:
test_idxs=np.random.randint(0,len(raw_datasets['test']),size=(15,))
samples=raw_datasets['test'][test_idxs]
preds=classify(model,samples['text'])
print('texts:')
for i,t in enumerate(samples['text']): print(i,':',t)
print('predicted:',preds)
print('labels:   ',np.array(samples['label']))
print(f'number of mistakes: {np.sum(np.array(preds)!=np.array(samples["label"]))}')

  0%|          | 0/2 [00:00<?, ?it/s]

texts:
0 :  meetings are overrated.
1 :  thanks for the #followfriday as you can see us South Africans were on holiday on fri
2 : I`ve been nudged!!! not much going on lately umped games over the wknd and i took one to the pills
3 :  // i feel your pain. i once lived in an apt for 6 mos where the previous tenant had 4 cats. burning eyes/tight lungs =  gregg
4 : _n Where can I get some?
5 : Sitting in an almost empty dorm, waiting for jordan to come to take some last things and say good bye. He graduates tomorrow.
6 :  but i do emily ahahha you scare me, so it would work
7 : Slept in, woke up with an iced coffee, lazed about & went out for a late lunch with the BF. It`s been a sweet little laid-back Saturday.
8 : nope no way in to stop  just have to put up wiv it
9 : I`ll be grand....
10 : watching the jobros live chat .. not live though  haha.
11 : Weekend is getting close. Too bad I`ll be stuck up north  Hopefully I`ll be able to get out next weekend for some real life fun.
12 : 5500 