# Using pre-trained models

This tutorial will focus on how we can **find**, **use** and possibly **adapt** pre-trained transformer models in order to solve our tasks.

While there are many resources we can draw from, the one we will use in this tutorial is [HuggingFace](https://huggingface.co/). It is the most popular registry of pre-trained models and offers an API that lets us interact with them very conveniently.

First we'll need to install the HuggingFace libraries that we are going to use.

In [1]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

We are going to introduce HuggingFace starting from its most high-level usage and then work our way down to a lower level.

## Finding models

To find the appropriate models we'll need to either know what we're looking for or browse the HuggingFace site to see what model best suits our needs. To proceed we'll assume that we've already done that and know what model we want to use.

## Pipeline

The pipeline is the easiest way to interact with HuggingFace models. It does not require us to know anything about the underlying model or its preprocessing; it provides a general-purpose **black box** that attempts to solve our task.

Let's say that the task we want to solve is *sentiment analysis*.

In [2]:
from transformers import pipeline

pipe = pipeline('sentiment-analysis')

pipe

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


<transformers.pipelines.text_classification.TextClassificationPipeline at 0x7f2d6cbe7640>

The command above defines a [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines) for sentiment analysis. This is an object that we can call directly with a sentence or a list of sentences, and it will attempt to classify each one according to its underlying model (more on that in a bit).

In [3]:
sentences = ['I liked the food a lot.',
             'I didn\'t have a good time at the therater',
             'The sky is grey.',
             'I went to the theater',
             'Too hot to handle',
             'AI courses',
             'Nike shoes are better than Adidas']


results = pipe(sentences)

for result in results:
    print(result)

{'label': 'POSITIVE', 'score': 0.9996389150619507}
{'label': 'NEGATIVE', 'score': 0.999733030796051}
{'label': 'NEGATIVE', 'score': 0.9997350573539734}
{'label': 'POSITIVE', 'score': 0.9864035844802856}
{'label': 'NEGATIVE', 'score': 0.9996044039726257}
{'label': 'NEGATIVE', 'score': 0.9870423078536987}
{'label': 'POSITIVE', 'score': 0.9992983341217041}
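
The pipeline returns one dictionary per input, in the same order as the inputs, so results can be paired back with their sentences. The snippet below is a minimal sketch of that: it reuses the inputs and outputs printed above (copied verbatim), and the helper `low_confidence` and its `threshold` value are hypothetical names chosen for illustration, not part of the HuggingFace API.

```python
# Inputs and outputs copied verbatim from the cells above.
sentences = ['I liked the food a lot.',
             'I didn\'t have a good time at the therater',
             'The sky is grey.',
             'I went to the theater',
             'Too hot to handle',
             'AI courses',
             'Nike shoes are better than Adidas']

results = [{'label': 'POSITIVE', 'score': 0.9996389150619507},
           {'label': 'NEGATIVE', 'score': 0.999733030796051},
           {'label': 'NEGATIVE', 'score': 0.9997350573539734},
           {'label': 'POSITIVE', 'score': 0.9864035844802856},
           {'label': 'NEGATIVE', 'score': 0.9996044039726257},
           {'label': 'NEGATIVE', 'score': 0.9870423078536987},
           {'label': 'POSITIVE', 'score': 0.9992983341217041}]


def low_confidence(sentences, results, threshold=0.99):
    """Hypothetical helper: keep the predictions whose score is below threshold."""
    return [(sentence, result['label'], result['score'])
            for sentence, result in zip(sentences, results)
            if result['score'] < threshold]


# Print every prediction next to the sentence it belongs to.
for sentence, result in zip(sentences, results):
    print(f"{result['label']:<8} {result['score']:.4f}  {sentence}")

flagged = low_confidence(sentences, results)
print(flagged)
```

Note that this model only has two labels, so arguably neutral inputs such as 'The sky is grey.' still receive a label, here with a very high score. A score threshold alone is therefore not a reliable way to detect neutral text.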


For a lot of people this might be enough. However, in most cases we want to know what our pipeline actually does, i.e. what model does it use under the hood?

The best way to get a grasp of what's happening inside the pipeline is to **manually select the model**.

We can do this through HuggingFace's hub. By clicking the [*text classification* tag (sentiment analysis is a text classification task) in the models tab](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads), we get a list of candidate models that possibly do what we want. The most popular one is [`distilbert-base-uncased-finetuned-sst-2-english`](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english), which happens to be the default.

Let's say we wanted to perform sentiment analysis in another language (e.g. Spanish). We'll need a model specialized in Spanish, e.g. [`finiteautomata/beto-sentiment-analysis`](https://huggingface.co/finiteautomata/beto-sentiment-analysis).

In [4]:
model_name = 'finiteautomata/beto-sentiment-analysis'

pipe = pipeline('sentiment-analysis', model=model_name)

pipe('Qué gran jugador es Messi')

[{'label': 'POS', 'score': 0.997605562210083}]

## Loading Keras models

In some cases we might need to do something more advanced that isn't supported by the high-level HuggingFace API. In these cases we will want to load the model in a more flexible format (i.e. either PyTorch or TensorFlow). One such use case is incorporating a HuggingFace model into a larger neural network.

In [5]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = TFAutoModelForSequenceClassification.from_pretrained(model_name)  # or AutoModelForSequenceClassification

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


To use a model in HuggingFace we essentially need two parts:

1. **A preprocessing function** (in NLP we call this a **tokenizer**):
this is tasked with bringing the raw data to the format required by the model.
2. **The actual model**:
a Keras or PyTorch model object that we will use.

### Tokenizer

Let's explore the first. The main task of the tokenizer is to split the text into tokens and map each token to its id in the model's vocabulary.

In [6]:
input_ids = tokenizer('I liked the food a lot.')

input_ids

{'input_ids': [101, 1045, 4669, 1996, 2833, 1037, 2843, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In the above example, the word *food* is mapped to the id `2833`, which is the id this token had when the model was trained. This process is called **tokenization**, because it splits the input sequence into *tokens*.

In [None]:
tokenizer.tokenize('I liked the food a lot.')

['i', 'liked', 'the', 'food', 'a', 'lot', '.']

What happens if the tokenizer encounters a word that it doesn't have in its vocabulary (i.e. the model wasn't trained with a dedicated embedding for this word)?

Theoretically, each model can have its own way of dealing with this issue; most commonly, however, such words are represented by a **special token** called `[UNK]`.

In [7]:
tokenizer.tokenize('This library is called 🤗 transformers')

['this', 'library', 'is', 'called', '[UNK]', 'transformers']
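
To make the lookup and the `[UNK]` fallback concrete, here is a toy sketch in plain Python. It is not how the real tokenizer works (that one uses a learned subword vocabulary and also adds special tokens); `TOY_VOCAB`, `toy_tokenize` and `toy_encode` are hypothetical names. The word ids are the ones printed by the tokenizer above, and `100` is assumed to be the id of `[UNK]` in this vocabulary.

```python
# Toy vocabulary: word ids taken from the tokenizer output above.
# The id 100 for [UNK] is an assumption about the BERT uncased vocabulary.
TOY_VOCAB = {'[UNK]': 100, 'i': 1045, 'liked': 4669, 'the': 1996,
             'food': 2833, 'a': 1037, 'lot': 2843, '.': 1012}


def toy_tokenize(text):
    """Lowercase, split on whitespace and detach a trailing full stop."""
    tokens = []
    for word in text.lower().split():
        if len(word) > 1 and word.endswith('.'):
            tokens.extend([word[:-1], '.'])
        else:
            tokens.append(word)
    return tokens


def toy_encode(text):
    """Map each token to its id, falling back to the [UNK] id."""
    unk_id = TOY_VOCAB['[UNK]']
    return [TOY_VOCAB.get(token, unk_id) for token in toy_tokenize(text)]


print(toy_tokenize('I liked the food a lot.'))
print(toy_encode('I liked the food a lot.'))
print(toy_encode('I liked the 🤗 a lot.'))  # the emoji is not in the vocabulary
```

Any token missing from the vocabulary collapses to the same id, so the model cannot tell two different unknown words apart.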

Besides tokenization, the tokenizer does a few more things. It converts the sequence to **lowercase** (for uncased models such as this one), adds **special tokens** (e.g. `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`), can **pad** the sequence or **truncate** it to a specific length, convert it to tensors, etc.

The special tokens are an interesting concept, as they let the model understand more than just the words. For instance, BERT-style tokenizers such as this one place `[CLS]` at the beginning of every sequence and `[SEP]` at the end; if the input consists of two sequences, `[SEP]` also indicates where they are separated. Other models use different markers for the start and end of a sequence. It is important to note that the names and types of special tokens are **model specific**.

In [9]:
batch_sentences = [['Hello', "I'm", 'a', 'single', 'sentence'],
                   ['And', 'another', 'sentence'],
                   ['And', 'the', 'very', 'very', 'last', 'one'],
                   ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'],
                   ['I liked the food a lot.']]

batch = tokenizer(batch_sentences,
                  is_split_into_words=True,
                  padding=True,
                  max_length=12,
                  truncation=True,
                  return_tensors='tf')

batch

{'input_ids': <tf.Tensor: shape=(5, 12), dtype=int32, numpy=
array([[ 101, 7592, 1045, 1005, 1049, 1037, 2309, 6251,  102,    0,    0,
           0],
       [ 101, 1998, 2178, 6251,  102,    0,    0,    0,    0,    0,    0,
           0],
       [ 101, 1998, 1996, 2200, 2200, 2197, 2028,  102,    0,    0,    0,
           0],
       [ 101, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 2184,
         102],
       [ 101, 1045, 4669, 1996, 2833, 1037, 2843, 1012,  102,    0,    0,
           0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(5, 12), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]], dtype=int32)>}

In the above sequences we can note a few things:

- All sequences start with `101`, the id of the `[CLS]` token.
- All sequences end with `102`, the id of the `[SEP]` token.
- Sequences shorter than 12 tokens are padded with zeros (`0` is the id of the `[PAD]` token). The padded positions also get a `0` in the attention mask, which tells the model to ignore them.
- Sequence 4, which needs 14 tokens once the two special tokens are added, was truncated to 12, so its last two words were dropped.
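
The observations above can be reproduced with a few lines of plain Python. This is a simplified sketch, not the library's implementation: `pad_and_truncate` is a hypothetical helper, and it always pads to `max_length`, whereas `padding=True` pads to the longest sequence in the batch (which happens to be 12 here).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids observed in the batch above


def pad_and_truncate(token_ids, max_length):
    """Add special tokens, truncate to max_length, pad, and build the mask."""
    body = token_ids[:max_length - 2]      # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                  # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 = padding


# 'And another sentence': ids taken from the second row of the batch above.
short_ids, short_mask = pad_and_truncate([1998, 2178, 6251], max_length=12)
print(short_ids)
print(short_mask)

# Twelve placeholder ids (made up for illustration): two too many to fit.
long_ids, long_mask = pad_and_truncate(list(range(2000, 2012)), max_length=12)
print(long_ids)
print(long_mask)
```

The short sequence reproduces the second row of `input_ids` and `attention_mask` above, while the long one loses its last two ids and needs no padding.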

### Model

What about the model? Things here are simpler. The model is a regular Keras model object, like the ones we are familiar with.

In [10]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


A couple of things to note:
- The majority of the model (the DistilBERT encoder) is represented as a single layer. This is done for several reasons, the most important being that we don't interfere with its architecture.
- After this base layer come the layers of the classification head (`pre_classifier` and `classifier`). This is the part we can drop if we want to add our own head on top of the encoder and fine-tune that.

Other than that, we can use the model like any regular Keras model.

In [11]:
out = model(batch)

out

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[-2.095736 ,  2.1814544],
       [ 2.219225 , -1.9427992],
       [-1.3980726,  1.4048259],
       [-1.1327213,  1.2878083],
       [-3.822621 ,  4.103473 ]], dtype=float32)>, hidden_states=None, attentions=None)

In [13]:
import tensorflow as tf

tf.math.softmax(out.logits)

<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[1.3691552e-02, 9.8630852e-01],
       [9.8466289e-01, 1.5337110e-02],
       [5.7167754e-02, 9.4283229e-01],
       [8.1620552e-02, 9.1837943e-01],
       [3.6106404e-04, 9.9963892e-01]], dtype=float32)>

In [14]:
import numpy as np

np.argmax(out.logits, axis=1)

array([1, 0, 1, 1, 1])

The model returns **logits**, i.e. unnormalized scores, one per class. If we want class probabilities we apply softmax, as above; if we want the predicted class we take the argmax.
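
As a worked example of turning logits into probabilities and classes, the sketch below recomputes both steps in plain Python on the logits printed above. The `softmax` function here is a hand-written illustration, not the TensorFlow implementation.

```python
import math

# Logits copied from the model output above (5 sentences, 2 classes).
logits = [[-2.095736, 2.1814544],
          [2.219225, -1.9427992],
          [-1.3980726, 1.4048259],
          [-1.1327213, 1.2878083],
          [-3.822621, 4.103473]]


def softmax(row):
    """Exponentiate and normalize so that the values sum to 1."""
    shift = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(value - shift) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


probabilities = [softmax(row) for row in logits]
# argmax: index of the largest logit in each row.
classes = [max(range(len(row)), key=row.__getitem__) for row in logits]

for probs, cls in zip(probabilities, classes):
    print([round(p, 4) for p in probs], cls)
```

Softmax does not change the ordering of the scores, so taking the argmax of the logits or of the probabilities gives the same class.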

## Fine-tuning

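The last part we'll see is how to fine-tune a HuggingFace model on our own dataset. This is probably the most common use case for pre-trained models. As an example we use CoLA, a task from the GLUE benchmark in which each sentence is labeled as grammatically acceptable or not.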

In [15]:
from datasets import load_dataset

dataset = load_dataset('glue', 'cola')
dataset = dataset['train']

In [18]:
model_tag = 'bert-base-cased'

# Load the tokenizer and preprocess the data
tokenizer = AutoTokenizer.from_pretrained(model_tag)
tokenized_data = tokenizer(dataset['sentence'], return_tensors='np', padding=True)
tokenized_data = dict(tokenized_data)
labels = np.array(dataset['label'])

# Load the model
model = TFAutoModelForSequenceClassification.from_pretrained(model_tag)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  108310272 
                                                                 
 dropout_57 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 108,311,810
Trainable params: 108,311,810
Non-trainable params: 0
_________________________________________________________________


In [20]:
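import tensorflow as tf

# No loss is passed to compile(): HuggingFace TF models then fall back to
# their built-in loss for the task (here, classification).
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5))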
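
Since `compile` is called without a `loss`, the model uses its own built-in loss. For a classification head with two output logits this is, to the best of my understanding, sparse categorical cross-entropy computed from the logits. The sketch below shows that quantity for a single example in plain Python; `sparse_cross_entropy` is a hypothetical helper written for illustration, not a library function.

```python
import math


def sparse_cross_entropy(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    shift = max(logits)  # stabilizes the log-sum-exp
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[label]


# An undecided model (equal logits) pays log(2) on a two-class problem.
undecided = sparse_cross_entropy([0.0, 0.0], label=1)

# A confident prediction is cheap when right and expensive when wrong.
confident_right = sparse_cross_entropy([-3.8, 4.1], label=1)
confident_wrong = sparse_cross_entropy([-3.8, 4.1], label=0)

print(undecided, confident_right, confident_wrong)
```

Fine-tuning adjusts the weights, including the freshly initialized `classifier` layer, to reduce the average of this loss over the training sentences.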

In [21]:
model.fit(tokenized_data, labels)



<keras.callbacks.History at 0x7f2ce8d89480>