<h2><center>Notebook Walk-through: </center>
    
<center>Fine-Tuning BERT - Optimizer Considerations and Layer Freezing
</center></h2>

In this notebook we discuss some aspects of fine-tuning BERT for a specific task, using text classification as the example. Along the way we highlight various issues you may encounter.

Specifically, we will:

* play with BERT (Hugging Face implementation): Tokenization, Layers and Output Dimensions  
* build a sentiment classifier with BERT from scratch and discuss a couple of options you may have
* train the network with various configurations and make observations that will hopefully be helpful

Note that a lot of the content will be delivered through live experimentation in the walkthrough session, and it will not be recorded in the notebook. Please watch the recording. 

Also, note that we are not attempting to reach state of the art by any means. The purpose of the notebook is to highlight some of the issues you may want to consider when fine-tuning BERT.

We start with a few common imports.


In [1]:
import numpy as np

import tensorflow as tf
import tensorflow_datasets as tfds

import transformers

from transformers import BertTokenizer, TFBertModel
from tensorflow.keras import backend as K

import logging
tf.get_logger().setLevel(logging.ERROR)

Let's check for the presence of a GPU. We'll need one (or better) to work with transformer models like BERT.

In [2]:
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Next, let's check the versions that we are using:

In [3]:
tf.__version__

'2.3.0'

In [4]:
transformers.__version__

'4.0.0'

### 1. Getting the data

We'll use the IMDB dataset, available from tensorflow_datasets.

In [5]:
train_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:80%]', 'test[80%:]'),
    as_supervised=True)

INFO:absl:No config specified, defaulting to first: imdb_reviews/plain_text
INFO:absl:Load dataset info from /home/joachim/tensorflow_datasets/imdb_reviews/plain_text/1.0.0
INFO:absl:Reusing dataset imdb_reviews (/home/joachim/tensorflow_datasets/imdb_reviews/plain_text/1.0.0)
INFO:absl:Constructing tf.data.Dataset imdb_reviews for split ('train[:80%]', 'test[80%:]'), from /home/joachim/tensorflow_datasets/imdb_reviews/plain_text/1.0.0


Let's create some training examples and test examples.

In [6]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(20000)))
test_examples_batch, test_labels_batch = next(iter(test_data.batch(5000)))
#train_examples_batch
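
The split strings and the batch sizes above fit together as follows. This is only a sketch of the arithmetic: `percent_slice_size` is a hypothetical helper (not part of tensorflow_datasets), and the sizes of 25,000 training and 25,000 test reviews are the standard IMDB numbers rather than something printed in this notebook.

```python
def percent_slice_size(total, start_pct=0, end_pct=100):
    """Number of examples selected by a slice such as 'train[:80%]'.

    Hypothetical helper; exact only when the percentages divide the total evenly.
    """
    return total * end_pct // 100 - total * start_pct // 100


IMDB_TRAIN_SIZE = 25_000  # assumed standard size of the IMDB train split
IMDB_TEST_SIZE = 25_000   # assumed standard size of the IMDB test split

n_train = percent_slice_size(IMDB_TRAIN_SIZE, end_pct=80)   # 'train[:80%]'
n_test = percent_slice_size(IMDB_TEST_SIZE, start_pct=80)   # 'test[80%:]'

print(n_train, n_test)
```

Under these assumptions, `batch(20000)` and `batch(5000)` each return the whole split as a single batch.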

In [7]:
train_examples_batch[:4]

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell a

In [8]:
train_labels_batch[:4]

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([0, 0, 0, 1])>

### 2. Preparing the model input with the BERT Tokenizer

We use 'bert-base-cased' from Hugging Face as the underlying BERT model.

In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert_model = TFBertModel.from_pretrained('bert-base-cased')

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Let's create a few training and test examples. For training time purposes, let's define a relatively short maximum length. We may modify the numbers later. 

In [10]:
num_train_examples = 2500
num_test_examples = 500
num_tiny_set = 5

max_length = 80

x_train = tokenizer([str(x.numpy())[2:] for x in train_examples_batch[:num_train_examples]], 
              max_length=max_length,
              truncation=True,
              padding='max_length', 
              return_tensors='tf')
y_train = train_labels_batch[:num_train_examples]




x_test = tokenizer([str(x.numpy())[2:] for x in test_examples_batch[:num_test_examples]], 
              max_length=max_length,
              truncation=True,
              padding='max_length', 
              return_tensors='tf')
y_test = test_labels_batch[:num_test_examples]


x_tiny = tokenizer([str(x.numpy())[2:] for x in test_examples_batch[:num_tiny_set]], 
              max_length=max_length,
              truncation=True,
              padding='max_length', 
              return_tensors='tf')
y_tiny = test_labels_batch[:num_tiny_set]
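
A side note on `str(x.numpy())[2:]`: it turns the raw bytes into their printed form and drops the leading `b'`. The small sketch below (plain Python, with a made-up review standing in for `x.numpy()`) shows what that keeps compared with decoding the bytes. With long reviews and truncation to `max_length` tokens the difference may not matter much, but it is worth knowing about.

```python
# Stand-in for `x.numpy()` on one review: a bytes object (made-up text).
raw = b"Don't miss it.\nGreat!"

# Approach used in the cell above: printed form of the bytes, minus the first two characters.
via_str = str(raw)[2:]

# Alternative: decode the bytes into text.
via_decode = raw.decode("utf-8")

print(repr(via_str))     # keeps the closing quote and a literal backslash-n
print(repr(via_decode))  # plain text with a real newline
```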

Let us look at the class imbalance:

In [11]:
print('ratio of positive examples: ', np.sum(y_train)/len(y_train))

ratio of positive examples:  0.494


OK, there are slightly more negative examples in the training set, but the classes are nearly balanced.
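
The ratio above also gives us the accuracy of the trivial 'always pick the majority class' model, which is a useful reference point later on. A minimal sketch, using a synthetic label vector with the same ratio (the helper name `majority_baseline_accuracy` is ours, not part of any library):

```python
import numpy as np


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the more frequent of the two classes."""
    positive_ratio = float(np.mean(np.asarray(labels)))
    return max(positive_ratio, 1.0 - positive_ratio)


# Synthetic labels with the same ratio of positives as reported above (0.494).
y_demo = np.array([1] * 1235 + [0] * 1265)

print("ratio of positive examples:", y_demo.mean())
print("majority-class accuracy:   ", majority_baseline_accuracy(y_demo))
```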

What did the tokenizer do?

The tokenizer created input ids, token type ids, and masks:

In [12]:
x_train.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [13]:
x_train.input_ids

<tf.Tensor: shape=(2500, 80), dtype=int32, numpy=
array([[  101,  1188,  1108, ...,  9283,  1127,   102],
       [  101,   146,  1138, ...,  1104,  1184,   102],
       [  101, 10852,  6810, ...,  1113,  1103,   102],
       ...,
       [  101,  1247,  1110, ...,  1105, 13952,   102],
       [  101,  1327,  1103, ...,  6188, 11074,   102],
       [  101,  1188,  2523, ..., 12118,  8057,   102]], dtype=int32)>

In [14]:
x_train.token_type_ids

<tf.Tensor: shape=(2500, 80), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>

In [15]:
x_train.attention_mask

<tf.Tensor: shape=(2500, 80), dtype=int32, numpy=
array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]], dtype=int32)>

No surprises...

**Questions:**

* What is the purpose of each component?
* Why do the input ids all start off with 101?
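
As a hint for these questions, here is a hand-written sketch of how the three fields are laid out for a single-sentence input. It is not the tokenizer's actual code. It assumes the special-token ids of the BERT vocabulary (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0), and `encode_single` is a hypothetical helper.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids of the BERT vocabulary


def encode_single(token_ids, max_length):
    """Lay out one already-tokenized sentence the way a BERT single-segment input looks."""
    body = token_ids[: max_length - 2]           # truncate, leaving room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)        # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad                # 0 = padding, to be ignored by attention
    token_type_ids = [0] * max_length            # only one segment, so all zeros
    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}


short = encode_single([1188, 1108], max_length=6)
long = encode_single(list(range(1000, 1010)), max_length=6)
print(short)
print(long)
```

Inputs longer than `max_length` are truncated and therefore end in 102 with an all-ones mask, which matches what we see in the training data above.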

### 3. BERT

Let's look at the first 25 weights in BERT:

In [26]:
[x.name for x in bert_model.weights][:25]

['tf_bert_model/bert/embeddings/word_embeddings/weight:0',
 'tf_bert_model/bert/embeddings/position_embeddings/embeddings:0',
 'tf_bert_model/bert/embeddings/token_type_embeddings/embeddings:0',
 'tf_bert_model/bert/embeddings/LayerNorm/gamma:0',
 'tf_bert_model/bert/embeddings/LayerNorm/beta:0',
 'tf_bert_model/bert/encoder/layer_._0/attention/self/query/kernel:0',
 'tf_bert_model/bert/encoder/layer_._0/attention/self/query/bias:0',
 'tf_bert_model/bert/encoder/layer_._0/attention/self/key/kernel:0',
 'tf_bert_model/bert/encoder/layer_._0/attention/self/key/bias:0',
 'tf_bert_model/bert/encoder/layer_._0/attention/self/value/kernel:0',
 'tf_bert_model/bert/encoder/layer_._0/attention/self/value/bias:0',
 'tf_bert_model/bert/encoder/layer_._0/attention/output/dense/kernel:0',
 'tf_bert_model/bert/encoder/layer_._0/attention/output/dense/bias:0',
 'tf_bert_model/bert/encoder/layer_._0/attention/output/LayerNorm/gamma:0',
 'tf_bert_model/bert/encoder/layer_._0/attention/output/LayerNorm/

**Question:**
* Does this make sense?

It sure does...

What are the outputs of bert_model, when applied to data?

In [17]:
bert_out = bert_model(x_tiny, output_hidden_states=True)

In [18]:
len(bert_out)

3

In [19]:
bert_out[0].numpy().shape

(5, 80, 768)

In [20]:
bert_out[1].numpy().shape

(5, 768)

In [27]:
len(bert_out[2])

13

In [29]:
[x.shape for x in bert_out[2]]

[TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768]),
 TensorShape([5, 80, 768])]

**Questions:**
* What are the interpretations of the 3 outputs?
* Are the respective dimensions as expected?
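
To check your answers, the sketch below writes down the shapes one would expect from the configuration of a BERT-base model (hidden size 768, 12 transformer layers). The helper is hypothetical and only does shape bookkeeping. The field names follow the Hugging Face output names `last_hidden_state`, `pooler_output` and `hidden_states`, which we take to correspond to indices 0, 1 and 2 above.

```python
def expected_bert_output_shapes(batch_size, seq_len, hidden_size=768, num_layers=12):
    """Expected output shapes for a BERT encoder called with output_hidden_states=True."""
    per_token = (batch_size, seq_len, hidden_size)
    return {
        # one vector per token from the last transformer layer
        "last_hidden_state": per_token,
        # one vector per example, derived from the first ([CLS]) position
        "pooler_output": (batch_size, hidden_size),
        # embedding output plus the output of each transformer layer
        "hidden_states": [per_token] * (num_layers + 1),
    }


shapes = expected_bert_output_shapes(batch_size=5, seq_len=80)
print(shapes["last_hidden_state"])
print(shapes["pooler_output"])
print(len(shapes["hidden_states"]))
```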

### 4. Building our Classification Model

Let's build our classification model from scratch and run a few configurations.

In particular, we will consider:

* Optimizer choices
* Number of BERT layers to be retrained
* Effects of freezing and unfreezing


In [22]:
def create_classification_model(hidden_size=200,
                                train_layers=-1,
                                optimizer=None):
    """
    Build a simple classification model with BERT. To keep it simple, we do not add dropout, layer norms, etc.

    train_layers: number of outer (last) transformer layers to fine-tune;
                  -1 fine-tunes all of BERT, 0 freezes all of it.
    """

    # Create the optimizer per call: a default instance in the signature
    # would be created once and shared by every model built with this function.
    if optimizer is None:
        optimizer = tf.keras.optimizers.Adam()

    input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='input_ids_layer')
    token_type_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='token_type_ids_layer')
    attention_mask = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='attention_mask_layer')

    bert_inputs = {'input_ids': input_ids,
                   'token_type_ids': token_type_ids,
                   'attention_mask': attention_mask}

    # Restrict training to the `train_layers` outer transformer layers.
    if train_layers != -1:

        # Weight names look like '.../encoder/layer_._11/...'. The trailing slash
        # keeps 'layer_._1/' from also matching 'layer_._10/' and 'layer_._11/'.
        retrain_layers = ['layer_._' + str(11 - n) + '/' for n in range(train_layers)]

        for w in bert_model.weights:
            if not any(x in w.name for x in retrain_layers):
                # Note: this sets a private attribute of the variable.
                w._trainable = False

    bert_out = bert_model(bert_inputs)

    # Use the vector at the first position ([CLS]) as the sentence representation.
    classification_token = tf.keras.layers.Lambda(lambda x: x[:, 0, :], name='get_first_vector')(bert_out[0])

    hidden = tf.keras.layers.Dense(hidden_size, name='hidden_layer')(classification_token)

    classification = tf.keras.layers.Dense(1, activation='sigmoid', name='classification_layer')(hidden)

    classification_model = tf.keras.Model(inputs=[input_ids, token_type_ids, attention_mask],
                                          outputs=[classification])

    classification_model.compile(optimizer=optimizer,
                                 loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                                 metrics=['accuracy'])

    return classification_model
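
Selecting weights by substring is easy to get subtly wrong, so here is a small stand-alone sketch of the selection logic on a handful of names. The names follow the pattern printed in section 3 but the list is illustrative, and `select_trainable` is a hypothetical helper rather than part of the model code.

```python
def select_trainable(weight_names, train_layers, num_layers=12):
    """Names that stay trainable when only the last `train_layers` layers are fine-tuned."""
    keep = ["layer_._" + str(num_layers - 1 - n) + "/" for n in range(train_layers)]
    return [name for name in weight_names if any(code in name for code in keep)]


names = [
    "tf_bert_model/bert/embeddings/word_embeddings/weight:0",
    "tf_bert_model/bert/encoder/layer_._1/attention/self/query/kernel:0",
    "tf_bert_model/bert/encoder/layer_._10/attention/self/query/kernel:0",
    "tf_bert_model/bert/encoder/layer_._11/attention/self/query/kernel:0",
    "tf_bert_model/bert/pooler/dense/kernel:0",
]

last_two = select_trainable(names, train_layers=2)
none_trainable = select_trainable(names, train_layers=0)

# A looser code such as '_1' also picks up layers 10 and 11.
loose_match = [name for name in names if "_1" in name]

print(last_two)
print(none_trainable)
print(len(loose_match))
```

Note that with this kind of selection the embeddings and the pooler are frozen for every value of `train_layers` other than -1.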

### 5. Experimentation

Let us compare a few configurations:

* 'default': Adam optimizer with default parameters (lr=0.001), all BERT layers fine-tuned
* 'smaller learning rate': Adam optimizer with lr=0.00005, all BERT layers fine-tuned
* 'frozen': Adam optimizer with default parameters (lr=0.001), all BERT layers frozen
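
If you want to extend the comparison, it can help to write the configurations down as data. The sketch below is one way to do that. `RunConfig` and `RUNS` are hypothetical names, and the two fields mirror the `optimizer` learning rate and the `train_layers` argument of `create_classification_model`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    name: str
    learning_rate: float
    train_layers: int  # -1 = fine-tune all BERT layers, 0 = freeze all of them


RUNS = [
    RunConfig("default", learning_rate=1e-3, train_layers=-1),
    RunConfig("smaller learning rate", learning_rate=5e-5, train_layers=-1),
    RunConfig("frozen", learning_rate=1e-3, train_layers=0),
]

for run in RUNS:
    print(f"{run.name}: lr={run.learning_rate}, train_layers={run.train_layers}")
```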

#### 5.1 Default

In [23]:
classification_model = create_classification_model()     

In [24]:
classification_model.fit([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask],
                         y_train,
                         validation_data=([x_test.input_ids, x_test.token_type_ids, x_test.attention_mask],
                         y_test),
                        epochs=5,
                        batch_size=8)

#classification_model([x.input_ids, x.token_type_ids, x.attention_mask])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f9974479b50>

In [25]:
classification_model.predict([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask], 
                             batch_size=8, 
                             steps=2)

array([[0.53564674],
       [0.53564674],
       [0.53564674],
       [0.5356468 ],
       [0.53564674],
       [0.53564674],
       [0.53564674],
       [0.53564674],
       [0.53564674],
       [0.53564674],
       [0.53564674],
       [0.53564674],
       [0.53564674],
       [0.53564674],
       [0.53564674],
       [0.53564674]], dtype=float32)

What is this? All essentially the same prediction? And basically not better than always predicting the majority class for each example? It may seem like "BERT is no good for this task"?!

Careful, not so! There are a number of changes one can consider:

* Change the optimizer configuration
* Freeze some BERT layers, either for the entire training cycle or for the first few epochs
* Add more data
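
Before trying these changes, it can be handy to have a quick numerical check for the symptom itself, namely predictions that are (nearly) identical for every example. A minimal sketch with NumPy follows. The helper name `looks_collapsed` and the tolerance are our own choices, and the 'varied' values are made up for illustration.

```python
import numpy as np


def looks_collapsed(predictions, tolerance=1e-3):
    """True if all predictions lie within `tolerance` of each other."""
    predictions = np.asarray(predictions, dtype=float).ravel()
    return float(predictions.max() - predictions.min()) < tolerance


collapsed = np.full((16, 1), 0.53564674)             # like the output above
varied = np.array([[0.01], [0.98], [0.03], [0.95]])  # made-up, well-separated predictions

print(looks_collapsed(collapsed))
print(looks_collapsed(varied))
```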


#### 5.2 Lower Learning Rate


In [23]:
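# Drop the previous models (if they exist) so we start again from the pre-trained weights.
try:
    del classification_model
except NameError:
    pass

try:
    del bert_model
except NameError:
    pass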

tf.keras.backend.clear_session()
bert_model = TFBertModel.from_pretrained('bert-base-cased')

classification_model = create_classification_model(optimizer=tf.keras.optimizers.Adam(0.00005))

classification_model.fit([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask],
                         y_train,
                         validation_data=([x_test.input_ids, x_test.token_type_ids, x_test.attention_mask],
                         y_test),
                        epochs=5,
                        batch_size=8)

classification_model.predict([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask], 
                             batch_size=8, 
                             steps=2)



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


array([[9.3064364e-04],
       [6.8027135e-03],
       [1.4993896e-02],
       [9.9996924e-01],
       [9.9966812e-01],
       [9.9994528e-01],
       [1.6548770e-03],
       [2.1150764e-03],
       [8.5419603e-03],
       [2.0312648e-02],
       [3.9322700e-04],
       [9.9739861e-01],
       [9.9916089e-01],
       [4.6210135e-03],
       [9.9998915e-01],
       [5.3805154e-04]], dtype=float32)

That seemed to work! It looks like the learning rate really mattered. (Of course, we have not tried to find the best model in terms of test accuracy here. We simply wanted to 'get it to work'.)
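
One possible intuition for why the learning rate matters so much: in Adam, the size of an update depends mostly on the learning rate and hardly on the size of the gradient. The sketch below is a textbook Adam step written in NumPy. It is not the Keras implementation, and the gradients are made up. It shows that the very first update moves every parameter by roughly `lr`. Whether this fully explains the collapse seen with the default settings is not something we establish here.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-7):
    """One textbook Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
    new_param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return new_param, m, v


grads = np.array([1e-3, -0.5, 10.0])   # made-up gradients of very different scale
params = np.zeros_like(grads)
zeros = np.zeros_like(grads)

first_step_sizes = {}
for lr in (1e-3, 5e-5):
    new_params, _, _ = adam_step(params, grads, zeros, zeros, t=1, lr=lr)
    first_step_sizes[lr] = np.abs(new_params - params)
    print(lr, first_step_sizes[lr])
```

With lr=0.001 each weight can move by about 0.001 in a single step, 20 times more than with lr=0.00005.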

#### 5.3 Layer Freezing

In [24]:
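# Drop the previous models (if they exist) so we start again from the pre-trained weights.
try:
    del classification_model
except NameError:
    pass

try:
    del bert_model
except NameError:
    pass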

tf.keras.backend.clear_session()
bert_model = TFBertModel.from_pretrained('bert-base-cased')

classification_model = create_classification_model(train_layers=0)

classification_model.fit([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask],
                         y_train,
                         validation_data=([x_test.input_ids, x_test.token_type_ids, x_test.attention_mask],
                         y_test),
                        epochs=5,
                        batch_size=8)

classification_model.predict([x_train.input_ids, x_train.token_type_ids, x_train.attention_mask], 
                             batch_size=8, 
                             steps=2)



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


array([[0.4683084 ],
       [0.24332495],
       [0.71891725],
       [0.94044155],
       [0.6260087 ],
       [0.76680475],
       [0.39689186],
       [0.73443514],
       [0.5678517 ],
       [0.3078529 ],
       [0.22237189],
       [0.67680615],
       [0.8657147 ],
       [0.23996763],
       [0.51509434],
       [0.27128804]], dtype=float32)

That 'worked' too! As expected, though, the final validation loss is larger and the validation accuracy is smaller.

**Questions:**
* Is that expected?
* What else is different?

But either way, all of these parameters seem to be interrelated. Experiment!
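
One thing that clearly differs between the runs is the number of weights being updated. The sketch below counts parameters by hand. It assumes the standard BERT-base configuration with the cased vocabulary (28,996 tokens, hidden size 768, feed-forward size 3072, 12 layers, 512 positions) and the classification head defined above (768 to 200 to 1). The helper names are hypothetical.

```python
def bert_base_parameter_counts(vocab_size=28996, hidden=768, intermediate=3072,
                               num_layers=12, max_positions=512, type_vocab=2):
    """Parameter counts for a BERT encoder, derived from its configuration values."""
    layer_norm = 2 * hidden                                   # gamma and beta
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm   # query, key, value, output
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    per_layer = attention + feed_forward
    pooler = hidden * hidden + hidden
    return {"embeddings": embeddings,
            "per_layer": per_layer,
            "pooler": pooler,
            "total": embeddings + num_layers * per_layer + pooler}


def head_parameter_count(hidden=768, hidden_size=200):
    """Dense(hidden_size) followed by Dense(1), both with bias."""
    return (hidden * hidden_size + hidden_size) + (hidden_size + 1)


counts = bert_base_parameter_counts()
head = head_parameter_count()

print("all of BERT plus head:", counts["total"] + head)
for train_layers in (0, 1, 4, 12):
    # With name-based freezing, only the selected encoder layers and the head are trained.
    print(train_layers, train_layers * counts["per_layer"] + head)
```

Under these assumptions the frozen run trains only the head, which is a tiny fraction of the weights updated in the fully fine-tuned runs.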

### 6. Conclusions 

While one has to be careful about generalizing from one (truncated) dataset, the pattern is pretty clear: it is not enough to simply define the model and see what you get. Some investigation is needed to make sure that the combination of model details, optimizer configuration, and data works.

One big tell is a BERT model that does no better than 'pick the majority class' (or only marginally better) while other models perform better.

There are other things to try during training as well, but the point of this notebook was to highlight a few obvious issues. Previous students ran into precisely these issues!