# Transfer Learning Transformers with HuggingFace

© Data Trainers LLC. GPL v 3.0.

Author: Axel Sirota

HuggingFace is a company with a strong open source philosophy that makes pretrained transformers readily available, so you don't have to repeat what we did before for every application.

## Prep

In [None]:
!pip install -U datasets evaluate "transformers[sentencepiece]"

In [None]:
import multiprocessing
import tensorflow as tf
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
import numpy as np

import sys
import keras.backend as K
import random
import os
import pandas as pd
import warnings
import time

TRACE = False
PATIENCE = 2
EPOCHS = 3
BATCH_SIZE = 256

def set_seeds_and_trace():
  os.environ['PYTHONHASHSEED'] = '0'
  np.random.seed(42)
  tf.random.set_seed(42)
  random.seed(42)
  if TRACE:
    tf.debugging.set_log_device_placement(True)

def set_session_with_gpus_and_cores():
  cores = multiprocessing.cpu_count()
  gpus = len(tf.config.list_physical_devices('GPU'))
  config = tf.compat.v1.ConfigProto(
      device_count={'GPU': gpus, 'CPU': cores},
      intra_op_parallelism_threads=1,
      inter_op_parallelism_threads=1,
  )
  sess = tf.compat.v1.Session(config=config)
  tf.compat.v1.keras.backend.set_session(sess)

set_seeds_and_trace()
set_session_with_gpus_and_cores()
warnings.filterwarnings('ignore')

## Tokenizing and loading the dataset

In HuggingFace there are many models, and each has its own tokenizer. Luckily for us, there is a class `AutoTokenizer` that does the heavy lifting once we provide a checkpoint.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np

raw_datasets = load_dataset("imdb")  # load imdb dataset
raw_datasets

Notice it is a dict-like object (a `DatasetDict`) with the train, test, and unsupervised splits to play around with.

In [None]:
raw_datasets['train'][0]  # Let's see the first review

How do we know if it's positive or negative from label=0?

In [None]:
raw_datasets['train'].features

There it is: within the features we see that index 0 is **Negative**.

Now, to tokenize the dataset, we need to load the proper tokenizer for the model we care about. And then we are going to apply it everywhere!

After this step the tokenizer converts the text into a sequence of ids, each representing a token (a word or word piece) in the BERT vocabulary.

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # Fetch the tokenizer for that checkpoint


def tokenize_function(example):
    # We are using the DistilBERT tokenizer, padding every review up to 128 tokens and
    # truncating once 128 tokens are reached (the model's own maximum, listed on its model card, is larger)
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
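
To make padding and truncation concrete, here is a minimal, dependency-free sketch of what fixing the length does to a list of ids. The helper `pad_or_truncate` and the value of `PAD_ID` are hypothetical names chosen for illustration; the real tokenizer also takes care of special tokens, which this toy version ignores.

```python
PAD_ID = 0  # hypothetical padding id, used only in this toy example


def pad_or_truncate(token_ids, max_length=128, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = list(token_ids[:max_length])   # truncation: drop anything past max_length
    n_pad = max_length - len(kept)        # padding: how many slots are still empty
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets cut
short_ids, short_mask = pad_or_truncate([101, 2023, 2003, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the results can be stacked into one rectangular tensor per batch.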


Let's see how it worked!

In [None]:
tokenized_datasets['train'][0]['text']

In [None]:
tokenizer(tokenized_datasets['train'][0]['text'])

The tokenizer from BERT (well, DistilBERT) converts each token into its ID according to *its* vocabulary. Notice that the attention mask is all ones: this call adds no padding, so every position is a real token. What we will do now is apply this to all the data and convert it into a `tf.data.Dataset` object (which Keras accepts).

In [None]:

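# Convert tokenized_datasets["train"] to a TF Dataset (one input column, so batches are plain tensors)
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids"], label_cols=["label"], shuffle=True, batch_size=BATCH_SIZE)

# Same with validation; IMDB has no validation split, so this assumes the test split stands in for it
tf_validation_dataset = tokenized_datasets["test"].to_tf_dataset(
    columns=["input_ids"], label_cols=["label"], shuffle=False, batch_size=BATCH_SIZE)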

In [None]:
for inputs, labels in tf_train_dataset.take(1):
  print(f' inputs: {inputs.shape}, labels: {labels.shape}')


## Downloading the model and preparing for training

Now let's download the model. It is very important to use a class whose name starts with `TFAutoModel`, since we are training with TensorFlow. There are auto models for most tasks, so you don't have to add the head manually; for example, `TFAutoModelForSequenceClassification` adds a Dense layer (WITHOUT SOFTMAX) to do the classification.

In [None]:

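from transformers import TFAutoModelForSequenceClassification

# Download the model for sequence classification with 2 labels (sentiment analysis)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)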

In [None]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay

batch_size = BATCH_SIZE
num_epochs = EPOCHS
# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already the number of batches per epoch.
num_train_steps = len(tf_train_dataset) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=1e-8, decay_steps=num_train_steps
)
from tensorflow.keras.optimizers import Adam

opt = Adam(learning_rate=lr_scheduler)
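
To see what the schedule does to the learning rate, here is a plain-Python sketch of polynomial decay with the default `power=1.0` (a straight line from the initial to the end value). The function `polynomial_decay` and the value of `steps_per_epoch` are assumptions made for illustration; in the notebook the real number of steps comes from `len(tf_train_dataset)`.

```python
def polynomial_decay(step, initial_lr=5e-5, end_lr=1e-8, decay_steps=294, power=1.0):
    """Learning rate at a given step; it stays at end_lr once decay_steps is reached."""
    step = min(step, decay_steps)             # no further decay after the last step
    fraction_left = 1.0 - step / decay_steps  # goes from 1.0 down to 0.0
    return (initial_lr - end_lr) * fraction_left ** power + end_lr


steps_per_epoch = 98  # hypothetical number of batches per epoch
num_epochs = 3
decay_steps = steps_per_epoch * num_epochs

# Learning rate at the start, halfway through, and at the end of training
for step in (0, decay_steps // 2, decay_steps):
    print(step, polynomial_decay(step, decay_steps=decay_steps))
```

With `power=1.0` the rate halfway through training sits midway between the initial and end values.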

In [None]:
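loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # The model outputs logits, not probabilities
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])  # Compile the model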
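
Since the classification head has no softmax, the loss has to work directly on logits. As a rough sketch of what a sparse categorical cross-entropy computed from logits amounts to for a single example, assuming integer labels (0 or 1) and made-up logit values; the helper names are hypothetical:

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_ce_from_logits(logits, label):
    """Negative log-probability that the model assigns to the true class."""
    return -math.log(softmax(logits)[label])


logits = [2.0, -1.0]  # made-up logits that favour class 0
print(sparse_ce_from_logits(logits, 0))  # true label agrees with the model: small loss
print(sparse_ce_from_logits(logits, 1))  # true label disagrees: large loss
```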

In [None]:
early_stopping = tf.keras.callbacks.EarlyStopping(patience=PATIENCE)


In [None]:
model.summary()

Oh no! We have too many parameters to train! Luckily, in Keras it is very easy to set some layers as non-trainable.

In [None]:
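model.layers[0].trainable = False  # Set the first layer (the pretrained body) as non trainable
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])  # Recompile so the change takes effect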

In [None]:
model.summary()

*Voilà!*
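
To get a feel for why freezing helps, here is a toy count of trainable parameters before and after freezing the first layer. The `ToyLayer` class, the layer names, and the round 66,000,000 figure for the pretrained body are illustrative assumptions, not values read from `model.summary()`; the two head sizes assume a hidden size of 768.

```python
from dataclasses import dataclass


@dataclass
class ToyLayer:
    name: str
    num_params: int
    trainable: bool = True


def dense_params(n_in, n_out):
    """A Dense layer has one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out


def count_trainable(layers):
    return sum(layer.num_params for layer in layers if layer.trainable)


layers = [
    ToyLayer("distilbert", 66_000_000),                  # illustrative round number
    ToyLayer("pre_classifier", dense_params(768, 768)),  # hypothetical head layer
    ToyLayer("classifier", dense_params(768, 2)),        # 2 output labels
]

before = count_trainable(layers)
layers[0].trainable = False  # freeze the pretrained body
after = count_trainable(layers)
print(before, after)
```

Only the small classification head is left to train, which is what makes transfer learning cheap.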

In [None]:
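# Fit the model
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=EPOCHS, callbacks=[early_stopping])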

Now we have a trained model that did transfer learning from DistilBERT.

## Testing it out!

In [None]:
tokens = tokenizer(["This is the worst internet service provider", "Although most people say this is the worst, I like it"], padding=True, truncation=True, max_length=128)

In [None]:
tokens

In [None]:
model.predict(tokens['input_ids'])

Notice the predictions were not probabilities but logits!

In [None]:
tf.math.softmax(model.predict(tokens['input_ids'])['logits'])

In [None]:
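# axis=-1 picks the most likely class for each sentence (the default axis=0 would compare across sentences)
tf.math.argmax(tf.math.softmax(model.predict(tokens['input_ids'])['logits']), axis=-1)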
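
Softmax and argmax have to run along the class axis (the last one) so that each sentence gets its own prediction. A dependency-free sketch with made-up logits, where the helper names are hypothetical, shows how the choice of axis changes the answer:

```python
import math


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits: one row per sentence, one column per class
logits = [[3.0, -2.0], [1.0, 0.5]]
probs = [softmax(row) for row in logits]

per_sentence = [argmax(row) for row in probs]     # along the class axis: one label per sentence
per_class = [argmax(col) for col in zip(*probs)]  # along axis 0: which sentence scores highest per class
print(per_sentence, per_class)
```

Here both sentences are assigned class 0 when the argmax runs per sentence, whereas running it down the columns answers a different question and returns different indices.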

And the model was correct!!

In [None]:
model.evaluate(tf_validation_dataset)