<a href="https://colab.research.google.com/github/axel-sirota/nlp-and-transformers/blob/main/module4/NLPTransformers_Mod4Demo2_FineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning Transformers with HuggingFace

© Data Trainers LLC. GPL v 3.0.

Author: Axel Sirota

Hugging Face is a company with a strong open-source philosophy. Its libraries make pretrained transformers readily available, so you don't have to repeat what we did before for every application.

## Prep

In [None]:
!pip install -U datasets evaluate transformers[sentencepiece]

Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)

In [None]:
import multiprocessing
import tensorflow as tf
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
import numpy as np

import sys
import keras.backend as K
import random
import os
import pandas as pd
import warnings
import time

TRACE = False
PATIENCE = 2
EPOCHS = 3
BATCH_SIZE = 256

def set_seeds_and_trace():
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  tf.random.set_seed(42)
  random.seed(42)
  if TRACE:
    tf.debugging.set_log_device_placement(True)

def set_session_with_gpus_and_cores():
  cores = multiprocessing.cpu_count()
  gpus = len(tf.config.list_physical_devices('GPU'))
  config = tf.compat.v1.ConfigProto( device_count = {'GPU': gpus  , 'CPU': cores} , intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
  sess = tf.compat.v1.Session(config=config)
  tf.compat.v1.keras.backend.set_session(sess)

set_seeds_and_trace()
set_session_with_gpus_and_cores()
warnings.filterwarnings('ignore')

## How to download a model?

Hugging Face hosts many models, and each has its own tokenizer. Luckily, the `AutoTokenizer` class does the heavy lifting: given a checkpoint name, it loads the matching tokenizer.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
import numpy as np

raw_datasets = load_dataset("imdb")
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    # We use the checkpoint's BERT-style tokenizer: pad each batch to its longest sequence and
    # truncate at 128 tokens (or at the model's maximum, which you get from the model card, if that is lower)
    return tokenizer(example["text"], padding=True, truncation=True, max_length=128)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)


Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...
Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.

Let's see how it works!

In [None]:
tokenized_datasets['train'][2]['text']

"If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />"

In [None]:
tokenizer(tokenized_datasets['train'][2]['text'])

{'input_ids': [101, 2065, 2069, 2000, 4468, 2437, 2023, 2828, 1997, 2143, 1999, 1996, 2925, 1012, 2023, 2143, 2003, 5875, 2004, 2019, 7551, 2021, 4136, 2053, 2522, 11461, 2466, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2028, 2453, 2514, 6819, 5339, 8918, 2005, 3564, 27046, 2009, 2138, 2009, 12817, 2006, 2061, 2116, 2590, 3314, 2021, 2009, 2515, 2061, 2302, 2151, 5860, 11795, 3085, 15793, 1012, 1996, 13972, 3310, 2185, 2007, 2053, 2047, 15251, 1006, 4983, 2028, 3310, 2039, 2007, 2028, 2096, 2028, 1005, 1055, 2568, 17677, 2015, 1010, 2004, 2009, 2097, 26597, 2079, 2076, 2023, 23100, 2143, 1007, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2028, 2453, 2488, 5247, 2028, 1005, 1055, 2051, 4582, 2041, 1037, 3332, 2012, 1037, 3392, 3652, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

The BERT tokenizer (well, DistilBERT's) converts each token into its ID according to *its* vocabulary. Notice that the attention mask is all ones: every position holds a real token, so no padding was added. We already tokenized all the data with `map`; what we will do now is convert the result into a `tf.data.Dataset` object (which Keras accepts).
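
To make the token-to-ID step less magical, here is a minimal sketch of the greedy longest-match-first splitting that BERT-style WordPiece tokenizers use. The vocabulary `TOY_VOCAB`, its IDs, and the helper names `wordpiece` and `encode` are invented for illustration; the real tokenizer also handles punctuation, casing rules, and a vocabulary of roughly 30,000 entries. In the real output above, the leading `101` and trailing `102` play the role of `[CLS]` and `[SEP]`.

```python
# Toy vocabulary: tokens and IDs are invented and do not match DistilBERT's.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "film": 5, "point": 6, "##less": 7}


def wordpiece(word, vocab):
    """Split one word into sub-word pieces by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word carry the "##" prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # nothing matched: treat the whole word as unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, and wrap the IDs in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return [vocab[token] for token in tokens]


print(encode("The pointless film", TOY_VOCAB))  # [2, 4, 6, 7, 5, 3]
```

Here "pointless" is not in the toy vocabulary, so it is split into `point` and `##less`, which is why one word can map to more than one ID.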

In [None]:

tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=['input_ids'],
    label_cols=["label"],
    shuffle=True,
    batch_size=BATCH_SIZE,
)

tf_validation_dataset = tokenized_datasets["test"].to_tf_dataset(
    columns=['input_ids'],
    label_cols=["label"],
    shuffle=False,
    batch_size=BATCH_SIZE,
)

In [None]:
for x in tf_train_dataset.take(1):
  print(x)

(<tf.Tensor: shape=(256, 128), dtype=int64, numpy=
array([[  101,  2019,  2203, ...,  2001,  1037,   102],
       [  101,  1000, 15876, ...,  2008,  2023,   102],
       [  101,  5875,  1010, ...,  1028,  1026,   102],
       ...,
       [  101,  1996,  2143, ...,  3084,  2009,   102],
       [  101,  2034,  1997, ...,  1012,  1996,   102],
       [  101, 28616, 10526, ...,  2438,  4768,   102]])>, <tf.Tensor: shape=(256,), dtype=int64, numpy=
array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0

Now let's download the model. Since we are working with TensorFlow, it is very important to use a class that starts with `TFAutoModel`. There are auto models for most tasks, so you don't have to add the head manually. For example, `TFAutoModelForSequenceClassification` adds a Dense layer (WITHOUT SOFTMAX) to do the classification.

In [None]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


## Training

In [None]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay

batch_size = BATCH_SIZE
num_epochs = EPOCHS
# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_train_steps = len(tf_train_dataset) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=1e-8, decay_steps=num_train_steps
)
from tensorflow.keras.optimizers import Adam

opt = Adam(learning_rate=lr_scheduler)
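
With the default `power=1.0`, `PolynomialDecay` is simply a linear ramp from the initial to the final learning rate. The sketch below reimplements that formula in plain Python so the numbers can be inspected without TensorFlow. The function name `polynomial_decay` is ours, and the step count assumes 25,000 training samples with the last partial batch dropped, as the comment in the cell above describes.

```python
def polynomial_decay(step, initial_lr=5e-5, end_lr=1e-8, decay_steps=291, power=1.0):
    """Learning rate at a given step; constant at end_lr once decay_steps is reached."""
    step = min(step, decay_steps)
    fraction_left = 1 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr


steps_per_epoch = 25000 // 256          # 97 full batches per epoch
num_train_steps = steps_per_epoch * 3   # 291 steps over 3 epochs

for step in (0, num_train_steps // 2, num_train_steps):
    print(step, polynomial_decay(step, decay_steps=num_train_steps))
```

The rate starts at `5e-5`, is roughly halved midway through training, and reaches `1e-8` on the last step.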

In [None]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # from_logits=True is essential: Hugging Face models output logits, not probabilities

model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])
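
To see what `from_logits=True` implies, the following sketch computes the sparse categorical cross-entropy for a single example directly from raw logits, using the log-sum-exp trick for numerical stability. It is a hand-rolled illustration, not the Keras implementation; the function name is ours, and the example logits are copied from the prediction output shown further down in this notebook.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    largest = max(logits)
    # log(sum(exp(z))) computed stably by factoring out the largest logit.
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[label]


logits = [4.330706, -3.566491]
loss_if_label_0 = sparse_ce_from_logits(logits, 0)  # confident and right: tiny loss
loss_if_label_1 = sparse_ce_from_logits(logits, 1)  # confident and wrong: large loss
print(loss_if_label_0, loss_if_label_1)
```

If the loss were instead told that these values are probabilities, it would interpret numbers like `4.33` and `-3.57` as a distribution, which is why the flag must match what the model outputs.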

In [None]:
early_stopping = tf.keras.callbacks.EarlyStopping(patience=PATIENCE)
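
As a rough picture of what `patience` means, here is a simplified sketch of the early-stopping rule: stop once the monitored validation loss has failed to improve for `patience` consecutive epochs. This is not the Keras source, the function name `stopping_epoch` is ours, and the loss values are made up.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch after which training stops, or None if it never stops early."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # improvement: remember it and reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


print(stopping_epoch([0.50, 0.40, 0.45, 0.46, 0.30], patience=2))  # 4
print(stopping_epoch([0.50, 0.60, 0.70], patience=2))              # 3
```

Under this rule, with `EPOCHS = 3` and `PATIENCE = 2` the earliest possible stop is after epoch 3, which is the final epoch anyway, so the callback only starts to matter for longer runs.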


In [None]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


Oh no! We have too many parameters to train! Luckily, in Keras it is very easy to mark some layers as non-trainable.

In [None]:
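model.layers[0].trainable = False  # freeze the DistilBERT backbone; only the classification head will train
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])  # Keras recommends recompiling after changing `trainable`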

In [None]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 592,130
Non-trainable params: 66,362,880
_________________________________________________________________


*Voilà!*
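
The numbers in the two summaries can be checked by hand. The sketch below assumes DistilBERT's hidden size of 768 (not stated in the summary itself) and uses the usual dense-layer formula, weights plus biases; the backbone count is copied from `model.summary()`.

```python
HIDDEN = 768      # assumed DistilBERT hidden size
NUM_LABELS = 2


def dense_params(n_in, n_out):
    """Parameters of a Dense layer: one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out


backbone = 66_362_880                           # distilbert layer, from model.summary()
pre_classifier = dense_params(HIDDEN, HIDDEN)   # 590,592
classifier = dense_params(HIDDEN, NUM_LABELS)   # 1,538

total = backbone + pre_classifier + classifier
trainable_after_freeze = pre_classifier + classifier
print(total, trainable_after_freeze)            # 66955010 592130
```

Freezing the backbone therefore leaves under 1% of the parameters to train, which is what makes this kind of transfer learning cheap.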

In [None]:
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=num_epochs, callbacks=[early_stopping])

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7981d7a9a080>

Now we have a trained model that applied transfer learning from DistilBERT.

## Testing it out!

In [None]:
tokens = tokenizer(["This is the worst internet service provider", "Although most people say this is the worst, I like it"], padding=True, truncation=True, max_length=128)

In [None]:
tokens

{'input_ids': [[101, 2023, 2003, 1996, 5409, 4274, 2326, 10802, 102, 0, 0, 0, 0, 0], [101, 2348, 2087, 2111, 2360, 2023, 2003, 1996, 5409, 1010, 1045, 2066, 2009, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
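
The output above shows padding in action: the shorter sentence is filled with `0`s and its attention mask marks those positions with `0`. A minimal sketch of that step follows. The helper `pad_batch` is hypothetical, and its truncation is a plain slice, whereas the real tokenizer truncates the text and still keeps the closing `102` token.

```python
PAD_ID = 0  # BERT-style vocabularies reserve ID 0 for the padding token


def pad_batch(sequences, max_length):
    """Truncate to max_length, then pad every sequence to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [PAD_ID] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


first = [101, 2023, 2003, 1996, 5409, 4274, 2326, 10802, 102]
second = [101, 2348, 2087, 2111, 2360, 2023, 2003, 1996, 5409, 1010, 1045, 2066, 2009, 102]
batch = pad_batch([first, second], max_length=128)
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```

The first row comes out as the nine real IDs followed by five zeros, matching the tokenizer output above.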

In [None]:
model.predict(tokens['input_ids'])



TFSequenceClassifierOutput(loss=None, logits=array([[ 4.330706 , -3.566491 ],
       [-2.3702366,  2.5378506]], dtype=float32), hidden_states=None, attentions=None)

Notice that the predictions are not probabilities but logits!

In [None]:
tf.math.softmax(model.predict(tokens['input_ids'])['logits'])



<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[9.996283e-01, 3.716461e-04],
       [7.332442e-03, 9.926676e-01]], dtype=float32)>

In [None]:
tf.math.argmax(tf.math.softmax(model.predict(tokens['input_ids'])['logits']), axis=-1)  # axis=-1: pick the best class per sentence



<tf.Tensor: shape=(2,), dtype=int64, numpy=array([0, 1])>

And the model was correct: the first sentence is classified as negative (class 0) and the second as positive (class 1)!
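
The softmax and argmax steps are easy to reproduce by hand, and doing so highlights one subtlety: the class must be picked per row (one row per sentence). `tf.math.argmax` reduces over axis 0 when no axis is given, so passing `axis=-1` makes the row-wise intent explicit. The sketch below uses the logits printed above plus a second, made-up batch where the two directions disagree; the helper names are ours.

```python
import math


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    largest = max(row)
    exps = [math.exp(z - largest) for z in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return values.index(max(values))


logits = [[4.330706, -3.566491], [-2.3702366, 2.5378506]]
probs = [softmax(row) for row in logits]
row_wise = [argmax(row) for row in probs]
print(row_wise)  # [0, 1]

# Hypothetical batch in which both sentences belong to class 0.
other = [[2.0, 1.0], [3.0, 0.5]]
per_row = [argmax(row) for row in other]                  # [0, 0]: one class per sentence
per_column = [argmax(list(col)) for col in zip(*other)]   # [1, 0]: not a per-sentence prediction
print(per_row, per_column)
```

For the two example sentences both directions happen to give `[0, 1]`, which is why the difference is easy to miss.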

In [None]:
model.evaluate(tf_validation_dataset)



[0.3064627945423126, 0.875760018825531]