In [12]:
#meta 7/5/2021 HF Transformers Course - Ch.3 Fine-tuning a Pretrained Model: Fine-tuning a model with Keras
#src: https://huggingface.co/course/chapter3/3?fw=tf

#history
#7/5/2021 Ch.3 Fine-tuning a Pretrained Model: Fine-tuning a model with Keras
#      same notebook, in Colab (much faster)
#      last cell errors out: ValueError: Shapes (None, 2) and (None, 1) are incompatible

# Fine-tuning a model with Keras


##### Setup
Install the Transformers and Datasets libraries to run this notebook.

In [13]:
! pip install datasets transformers[sentencepiece]



Once you’ve done all the data preprocessing work in the Ch3. last section, you have just a few steps left to train the model. 

##### Note:
the `model.fit` command will run very slowly on a CPU. You can get access to free GPUs or TPUs on [Google Colab](https://colab.research.google.com/).

The code examples below assume you have already executed the examples in Ch 3. previous section. Here is a short summary recapping what you need:

In [14]:
from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_dataset(dataset):
    encoded = tokenizer(
        dataset["sentence1"],
        dataset["sentence2"],
        padding=True,
        truncation=True,
        return_tensors='np',
    )
    return encoded.data

tokenized_datasets = {
    split: tokenize_dataset(raw_datasets[split]) for split in raw_datasets.keys()
}

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


### Training

TensorFlow models imported from transformers are already Keras models. 
[Short introduction to Keras](https://youtu.be/rnTGBy2ax1c)  

That means that once we have our data, very little work is required to begin training on it.  [Finetuning with TF](https://youtu.be/alq1l8Lv9GA)

As in Ch.2, we will use the `TFAutoModelForSequenceClassification` class, with two labels:

In [15]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Unlike in Ch. 2, you get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been inserted instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.

To fine-tune the model on our dataset, we just have to `compile` our model and then pass our data to the `fit` method. This will start the fine-tuning process (which should take a couple of minutes on a GPU) and report training loss as it goes, plus the validation loss at the end of each epoch. That loss can be hard to interpret, though — what does it tell us about the actual accuracy of our model? Let’s add an `accuracy` metric too, to get better insight into our model’s performance:

##### Note: 
A very common pitfall here — you can just pass the name of the loss as a string to Keras, but by default Keras will assume that you have already applied a `softmax` to your outputs. Many models, however, output the values right before the softmax is applied, which are also known as the `logits`. We need to tell the loss function that that’s what our model does, and the only way to do that is to **call it directly**, rather than by name with a string.

In [16]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model.compile(
    optimizer='adam',
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
model.fit(
    tokenized_datasets['train'],
    np.array(raw_datasets['train']['label']), 
    validation_data=(
        tokenized_datasets['validation'],
        np.array(raw_datasets['validation']['label']),
    ),
    batch_size=8
)



<tensorflow.python.keras.callbacks.History at 0x7f4762b684d0>

### Improving training performance
[Learning Rate Scheduling with TF](https://youtu.be/eKv4rRcCNX0)

The above code certainly runs, but you’ll find that the loss declines only slowly or sporadically. The primary cause is the `learning rate`. As with the loss, when we pass Keras the name of an optimizer as a string, Keras initializes that optimizer with default values for all parameters, including `lr`. 

Two tricks:  
1.  Lower the LR 
From long experience, we know that transformer models benefit from a much lower `lr` than the default for Adam, which is 1e-3 (also written as 10 to the power of -3, or 0.001). 5e-5 (0.00005), which is some 20x lower, is a much better starting point.

2. Slowly reduce the `lr` over the course of training  
In the literature, you will sometimes see this referred to as decaying or annealing the `lr`. In Keras, the best way to do this is to use a learning rate scheduler. A good one to use is PolynomialDecay — despite the name, with default settings it simply linearly decays the `lr` from the initial value to the final value over the course of training. In order to use a scheduler correctly, we need to tell it how long training is going to be. We compute that as `num_train_steps` below.

In [17]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay
batch_size = 8
num_epochs = 3
# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs
num_train_steps = (len(tokenized_datasets['train']['input_ids']) // batch_size) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.,
    decay_steps=num_train_steps
    )
from tensorflow.keras.optimizers import Adam
opt = Adam(learning_rate=lr_scheduler)

Now we have our all-new optimizer, and we can try training with it. First, let’s reload the model, to reset the changes to the weights from the training run we just did, and then we can compile it with the new optimizer:


In [18]:
import tensorflow as tf

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now, we fit just as before. All of the changes are incorporated when we `compile` the model, so the `fit` command is identical.

### Model predictions
[TF Predictions and Metrics](https://youtu.be/nx10eh4CoOs)  
Training and watching the loss go down is all very nice, but what if we want to actually get outputs from the trained model, either to compute some metrics, or to use the model in production? To do that, we can just use the `predict()` method. This will return the `logits` from the output head of the model, one per class.

In [19]:
preds = model.predict(tokenized_datasets['validation'])['logits']



We can convert these `logits` into the model’s class predictions by using `argmax` to find the highest logit, which corresponds to the most likely class:

In [20]:
class_preds = np.argmax(preds, axis=1)
print(preds.shape, class_preds.shape)

(408, 2) (408,)


Now, let’s use those `preds` to compute some metrics! We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `load_metric` function. The object returned has a `compute` method we can use to do the metric calculation:

In [21]:
from datasets import load_metric

metric = load_metric("glue", "mrpc")
metric.compute(predictions=class_preds, references=raw_datasets['validation']['label'])

{'accuracy': 0.3161764705882353, 'f1': 0.0}

Here, we can see our model has an accuracy of 85.78% on the validation set and an F1 score of 89.97. Those are the two metrics used to evaluate results on the MRPC dataset for the GLUE benchmark. The table in the BERT paper reported an F1 score of 88.9 for the base model. That was the `uncased` model while we are currently using the `cased` model, which explains the better result.

It’s annoying to have to call this method manually, though — why don’t we see if we can get Keras to compute it for us? As an added bonus, if we do it that way we can see the metrics during training, rather than just at the end.

Keras supports a number of metrics by default, which can be passed by just writing their name in a string, just like loss functions and optimizers. Unfortunately for us, however, the `F1 score` is not one of them. We look up the definition of the [F1 score](https://en.wikipedia.org/wiki/F-score), and see that it’s just the harmonic mean of the precision and recall, both of which are supported Keras metrics. We  write a simple F1 metric.

In [22]:
class F1_metric(tf.keras.metrics.Metric):
    def __init__(self, name='f1_score', **kwargs):
        super().__init__(name=name, **kwargs)
        # Initialize our metric by initializing the two metrics it's based on:
        # Precision and Recall
        self.precision = tf.keras.metrics.Precision()
        self.recall = tf.keras.metrics.Recall()

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Update our metric by updating the two metrics it's based on
        self.precision.update_state(y_true, y_pred, sample_weight)
        self.recall.update_state(y_true, y_pred, sample_weight)

    def reset_state(self):
        self.precision.reset_state()
        self.recall.reset_state()

    def result(self):
        # To get the F1 result, we compute the harmonic mean of the current
        # precision and recall
        return 2 / ((1 / self.precision.result()) + (1 / self.recall.result())) 

The above code is an example of creating a class by subclassing. We’ve created a new `Metric` class using the base `tf.keras.metrics.Metric `class as a template. This means that we only need to specify the things that our `Metric` does uniquely: the specific computations for the F1 score. All the other “boilerplate” that’s common to all `Metric` classes is taken care of by the base class.

To see our new metric used in action and reporting “live” during training, let’s reload the model and train again. Note how we can mix built-in metrics that we refer to simply by string with actual Metric objects:

In [29]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.,
    decay_steps=num_train_steps
    )
opt = Adam(learning_rate=lr_scheduler)
model.compile(optimizer=opt, loss=loss, metrics=['accuracy', F1_metric()])
model.fit(
    tokenized_datasets['train'],
    np.array(raw_datasets['train']['label']),
    validation_data=(
        tokenized_datasets['validation'],
        np.array(raw_datasets['validation']['label']),
    ),
    batch_size=8, 
    epochs=3
)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




ValueError: ignored

This time, it will report the validation loss and metrics on top of the training loss. The exact accuracy/F1 score you reach might be a bit different, but it should be in the same ballpark.

This concludes the introduction to fine-tuning using the Keras API. An example of doing this for most common NLP tasks will be given in Ch 7.

## Xtra

In [24]:
#my 
model.summary()

Model: "tf_bert_for_sequence_classification_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_227 (Dropout)        multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
