<a href="https://colab.research.google.com/github/almutareb/huggingface-nlp-demo/blob/main/into_the_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

# Preprocessing with a tokenizer

Like other neural networks, Transformer models can't process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which will be responsible for:

    1. Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
    2. Mapping each token to an integer
    3. Adding additional inputs that may be useful to the model
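The three steps above can be sketched in plain Python. This is only a toy illustration: the vocabulary, the special-token IDs, and the helper names `toy_tokenize` and `toy_encode` are made up for this example and do not come from any real checkpoint.

```python
import re

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "love": 5, "this": 6, "course": 7, "!": 8,
}


def toy_tokenize(text):
    """Step 1: split lowercased text into words and punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    tokens = toy_tokenize(text)
    # Step 2: map each token to an integer, falling back to [UNK] for unknown tokens.
    ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    # Step 3: add extra inputs, here special start/end tokens and an attention mask.
    ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


encoded = toy_encode("I love this course!")
print(encoded)
```

A real tokenizer applies the same idea with the exact splitting rules and vocabulary of its checkpoint, which is why those have to be downloaded.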

All this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the Model Hub. To do this, we use the AutoTokenizer class and its from_pretrained() method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model's tokenizer and cache it (so it's only downloaded the first time you run the code below).

In [3]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Once we have the tokenizer, we can pass our sentences directly to it. The main things to remember here are that you can pass one sentence or a list of sentences, and that you can specify the type of tensors you want to get back (if no type is passed, you will get a list of lists as a result).

In [5]:
raw_inputs = [
    "P've been waiting for a Transformers course for a while.",
    "I hate this so much!",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1052,  1005,  2310,  2042,  3403,  2005,  1037, 19081,  2607,
          2005,  1037,  2096,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}


# Going through the model

We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an AutoModel class which also has a from_pretrained() method:

In [7]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

In this code snippet, we have downloaded the same checkpoint we used in our pipeline before (it should actually have been cached already) and instantiated a model with it.

This architecture contains only the base Transformer module: given some inputs, it outputs what we'll call hidden states, also known as features. For each model input, we'll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

The vector output by the Transformer module is usually large. It generally has three dimensions:

    * Batch size: The number of sequences processed at a time (2 in our example).
    * Sequence length: The length of the numerical representation of the sequence (15 in our example).
    * Hidden size: The vector dimension of each model input (768 in our example).

In [8]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 15, 768])


The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers.
The output of the Transformer model is sent directly to the model head to be processed.
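To make the projection concrete, here is a minimal sketch of a head made of a single linear layer, written in plain Python. It rests on several assumptions: the hidden size is shrunk to 4 (the model above uses 768), the weights are random rather than trained, and the first token's vector is used as the summary of each sequence. Real heads may add more layers, activations, or dropout.

```python
import random

random.seed(0)

HIDDEN_SIZE = 4  # toy value; the model above has a hidden size of 768
NUM_LABELS = 2   # for example, negative and positive

# Hypothetical head parameters: one weight row per label, plus a bias per label.
weights = [[random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS


def linear_head(hidden_vector):
    """Project one hidden-state vector onto NUM_LABELS scores (logits)."""
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]


# Toy hidden states with shape (batch size 2, sequence length 3, hidden size 4).
hidden_states = [
    [[random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(3)]
    for _ in range(2)
]

# Assumption: the first token's vector stands in for the whole sequence.
logits = [linear_head(sequence[0]) for sequence in hidden_states]
print(len(logits), len(logits[0]))  # prints: 2 2
```

The shape goes from (batch size, sequence length, hidden size) to (batch size, number of labels), which is the same reduction a sequence classification head performs.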

[see this image](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg)

The model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.

There are many different architectures available in 🤗 Transformers, with each one designed around tackling a specific task. Here is a non-exhaustive list:

    - *Model (retrieve the hidden states)
    - *ForCausalLM
    - *ForMaskedLM
    - *ForMultipleChoice
    - *ForQuestionAnswering
    - *ForSequenceClassification
    - *ForTokenClassification
    - and others 🤗

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won't actually use the AutoModel class, but AutoModelForSequenceClassification:

In [9]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

print(outputs.logits.shape)

torch.Size([2, 2])


# Postprocessing

In [10]:
print(outputs.logits)

tensor([[ 3.2110, -2.7005],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


In [12]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[9.9730e-01, 2.7006e-03],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [13]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
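The postprocessing steps above (softmax over the logits, then a lookup in id2label) can be reproduced by hand. The sketch below uses only the standard library; the logits and the label mapping are copied from the outputs printed above, and the `softmax` helper is written here for illustration.

```python
import math

# Values copied from the printed outputs above.
logits = [[3.2110, -2.7005], [4.1692, -3.3464]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Turn a row of logits into probabilities that sum to 1."""
    top = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - top) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


results = []
for row in logits:
    probabilities = softmax(row)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    results.append((id2label[best], probabilities[best]))

for label, probability in results:
    print(label, round(probability, 4))
```

Picking the index of the largest probability and looking it up in id2label gives the predicted label for each sentence.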

# Creating a Transformer

We'll use the AutoModel class, which is handy when you want to instantiate any model from a checkpoint.

The AutoModel class and all of its relatives are actually simple wrappers over the wide variety of models available in the library. They are clever wrappers, as they can automatically guess the appropriate model architecture for your checkpoint and then instantiate a model with this architecture.

However, if you know the type of model you want to use, you can use the class that defines its architecture directly. Here we build a BERT model from a configuration object:

In [14]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.33.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



The configuration contains many attributes that are used to build the model. For example, the hidden_size attribute defines the size of the hidden_states vector, and num_hidden_layers defines the number of layers the Transformer model has.
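As a small worked example of how these attributes determine the model's size, the sketch below does some arithmetic on the values printed above. The dictionary is a plain copy of a few fields from that output, not a BertConfig object, and the variable names are chosen for this example.

```python
# A few values copied from the printed BertConfig above.
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "vocab_size": 30522,
    "max_position_embeddings": 512,
}

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# The token embedding matrix holds one hidden_size vector per vocabulary entry.
token_embedding_params = config["vocab_size"] * config["hidden_size"]

# The position embedding matrix holds one vector per supported position.
position_embedding_params = config["max_position_embeddings"] * config["hidden_size"]

print(head_dim)                   # 64
print(token_embedding_params)     # 23440896
print(position_embedding_params)  # 393216
```

Changing hidden_size or vocab_size in the configuration therefore changes the number of parameters the randomly initialized model starts with.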

Creating a model from the default configuration initializes it with random values. The model can be used in this state, but it will output gibberish; it needs to be trained first. We could train the model from scratch on the task at hand, but this would require a long time and a lot of data, and it would have a non-negligible environmental impact. To avoid unnecessary and duplicated effort, it's imperative to be able to share and reuse models that have already been trained.

Loading a Transformer model that is already trained is simple. We can do this using the from_pretrained() method:

In [15]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

We could replace BertModel with the equivalent AutoModel class. We'll do this from now on as this produces checkpoint-agnostic code; if your code works for one checkpoint, it should work seamlessly with another. This applies even if the architecture is different, as long as the checkpoint was trained for a similar task (for example, a sentiment analysis task).

In the code sample above we didn't use BertConfig, and instead loaded a pretrained model via the bert-base-cased identifier. This is a model checkpoint that was trained by the authors of BERT themselves.

This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

The weights have been downloaded and cached (so future calls to the from_pretrained() method won't re-download them) in the cache folder. In recent versions of the library this defaults to ~/.cache/huggingface/hub (older versions used ~/.cache/huggingface/transformers). You can customize your cache folder by setting the HF_HOME environment variable.

Saving a model is as easy as loading one. We use the save_pretrained() method, which is analogous to the from_pretrained() method:

In [None]:
model.save_pretrained("directory_on_my_computer")

This saves two files to your disk:

* config.json
* pytorch_model.bin

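The config.json file stores the configuration, which is necessary to know your model's architecture, while pytorch_model.bin stores the model weights, which are your model's parameters. Newer versions of the library may write the weights to model.safetensors instead.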

# Tokenizers
Loading and saving tokenizers is as simple as it is with models. Actually, it's based on the same two methods: from_pretrained() and save_pretrained(). These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model).

Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the BertTokenizer class:

In [18]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [20]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Saving a tokenizer is identical to saving a model:



In [None]:
tokenizer.save_pretrained("directory_on_my_computer")

Translating text to numbers is known as encoding. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

The first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method. Again, we need to use the same vocabulary used when the model was pretrained.

# Tokenization

In [27]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = 'Using a Transformer network is simple'
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That's the case here with Transformer, which is split into two tokens: Trans and ##former.
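The splitting behaviour can be sketched with a greedy longest-match-first loop, which is the strategy WordPiece tokenizers such as BERT's are generally described as using. The `wordpiece` function and the tiny vocabulary below are assumptions made for this example; the real tokenizer also normalizes text and handles punctuation.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedily split a word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary containing just the pieces needed here.
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}

sequence = "Using a Transformer network is simple"
tokens = [piece for word in sequence.split() for piece in wordpiece(word, toy_vocab)]
print(tokens)
```

With this toy vocabulary the word Transformer is not present as a whole, so the loop falls back to Trans followed by the continuation piece ##former.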

# From tokens to input IDs

In [28]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


Decoding goes the other way around: from vocabulary indices, we want to get a string back. This can be done with the decode() method:

In [29]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

Using a Transformer network is simple


Note that the decode method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence.
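The grouping step can be illustrated with a few lines of plain Python. This is a simplified sketch, assuming the ## continuation convention shown earlier; the helper name `merge_wordpieces` is made up, and the real decode method also deals with special tokens and spacing around punctuation.

```python
def merge_wordpieces(tokens):
    """Join tokens into a sentence, gluing '##' pieces onto the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append without a space
        else:
            words.append(token)     # start of a new word
    return " ".join(words)


tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
sentence = merge_wordpieces(tokens)
print(sentence)  # Using a Transformer network is simple
```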

In [32]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been learning these Machine Learning courses for a while."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

batched_ids = [ids, ids]

input_ids = torch.tensor([ids])
print("Input IDs: ", input_ids)

output = model(input_ids)
print("Logits: ", output.logits)

input_batch = torch.tensor(batched_ids)
print("batched IDs:", input_batch)

Input IDs:  tensor([[1045, 1005, 2310, 2042, 4083, 2122, 3698, 4083, 5352, 2005, 1037, 2096,
         1012]])
Logits:  tensor([[ 2.8480, -2.3911]], grad_fn=<AddmmBackward0>)
batched IDs: tensor([[1045, 1005, 2310, 2042, 4083, 2122, 3698, 4083, 5352, 2005, 1037, 2096,
         1012],
        [1045, 1005, 2310, 2042, 4083, 2122, 3698, 4083, 5352, 2005, 1037, 2096,
         1012]])


# Padding
Padding makes sure all our sentences have the same length by adding a special word called the padding token to the sentences with fewer values. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In the example below, a two-token sequence is padded to match a three-token one, and the resulting batch looks like this:

In [35]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


The second row should be the same as the logits for the second sentence, but we've got completely different values!

This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.

# Attention Masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
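To see why a 0 in the mask makes a position invisible, here is a minimal sketch of a masked softmax over attention scores. The scores are made-up numbers, and replacing masked scores with negative infinity before the softmax is one common way to implement masking, assumed here for illustration rather than taken from the library's code.

```python
import math


def attention_weights(scores, mask):
    """Softmax over scores, giving zero weight wherever mask is 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)  # assumes at least one position is unmasked
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores of one token over three positions.
scores = [2.0, 1.0, 0.5]

unmasked = attention_weights(scores, [1, 1, 1])      # padding position attended to
masked = attention_weights(scores, [1, 1, 0])        # padding position ignored
two_tokens = attention_weights(scores[:2], [1, 1])   # the unpadded sequence

print(unmasked)
print(masked)
print(two_tokens)
```

With the mask applied, the weights over the first two positions match those of the unpadded sequence, which is why masking restores the original logits.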

Let's complete the padding example with an attention mask:

In [36]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
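The padded batch and its mask were written by hand above, but both can be derived from the raw sequences. The helper below is a hypothetical sketch in plain Python: it pads on the right, and its default pad_token_id of 0 is an assumption that happens to match the padding value in the tokenizer output shown earlier. In practice you would pass tokenizer.pad_token_id.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_length = max(len(sequence) for sequence in sequences)
    input_ids = []
    attention_mask = []
    for sequence in sequences:
        padding = max_length - len(sequence)
        input_ids.append(list(sequence) + [pad_token_id] * padding)
        # 1 for real tokens, 0 for the padding positions that must be ignored.
        attention_mask.append([1] * len(sequence) + [0] * padding)
    return input_ids, attention_mask


batched_ids, attention_mask = pad_batch([[200, 200, 200], [200, 200]])
print(batched_ids)     # [[200, 200, 200], [200, 200, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

This is what the tokenizer does for you when it is called with padding=True, as in the first example of this notebook.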


With Transformer models, there is a limit to the lengths of the sequences we can pass the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:

- Use a model with a longer supported sequence length.
- Truncate your sequences.

Models have different supported sequence lengths, and some specialize in handling very long sequences. Longformer is one example, and another is LED. If you're working on a task that requires very long sequences, we recommend you take a look at those models.

Otherwise, we recommend you truncate your sequences by specifying the max_sequence_length parameter:



```
sequence = sequence[:max_sequence_length]

```
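Applied to a whole batch of token ID lists, the same slicing might look like the sketch below. The limit of 8 and the token IDs are made-up values chosen to keep the example small.

```python
max_sequence_length = 8  # toy limit; real models often allow 512 or 1024 tokens

# Made-up token IDs: one sequence longer than the limit, one shorter.
batch_of_ids = [list(range(100, 112)), [200, 200, 200]]

# Slicing leaves sequences that are already short enough unchanged.
truncated = [ids[:max_sequence_length] for ids in batch_of_ids]

print([len(ids) for ids in batch_of_ids])  # [12, 3]
print([len(ids) for ids in truncated])     # [8, 3]
```

Note that plain slicing drops whatever sits at the end of the sequence, including a closing special token if one was already added. The tokenizer call at the start of this notebook passes truncation=True, which lets the tokenizer handle truncation itself.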

