## Models (BertModel)

In [1]:
import transformers

transformers.__version__



'4.36.2'

In [2]:
from transformers import BertConfig, BertModel

In [3]:
config = BertConfig()

# Build a randomly initialized model from the configuration
model = BertModel(config)

In [4]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.36.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
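As a rough cross-check of what these hyperparameters imply, the sketch below estimates the number of trainable parameters in a `BertModel` built from this configuration. It assumes the standard BERT layout (three embedding tables plus a LayerNorm, twelve identical encoder layers, and a pooler); the helper names are made up for illustration and are not part of the library.

```python
# Hyperparameters copied from the printed BertConfig above.
config_values = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "num_hidden_layers": 12,
}


def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm_params(size):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * size


def count_bert_params(cfg):
    """Estimate trainable parameters, assuming the standard BERT layout."""
    h, ff = cfg["hidden_size"], cfg["intermediate_size"]
    # Word, position and token-type embedding tables, followed by a LayerNorm.
    embeddings = (
        cfg["vocab_size"] * h
        + cfg["max_position_embeddings"] * h
        + cfg["type_vocab_size"] * h
        + layer_norm_params(h)
    )
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * linear_params(h, h) + layer_norm_params(h)
    # Two dense layers (h -> ff -> h), followed by a LayerNorm.
    feed_forward = linear_params(h, ff) + linear_params(ff, h) + layer_norm_params(h)
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    pooler = linear_params(h, h)
    return embeddings + encoder + pooler


total_params = count_bert_params(config_values)
print(f"{total_params:,}")  # 109,482,240
```

Under those assumptions the result should agree with `sum(p.numel() for p in model.parameters())` for the model created from this configuration. Note that `num_attention_heads` does not change the count, since the heads only split the same projection matrices.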



In [5]:
from transformers import BertConfig, BertModel
config = BertConfig()
model = BertModel(config)

In [6]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

config.json: 100%|██████████| 570/570 [00:00<00:00, 222kB/s]
model.safetensors: 100%|██████████| 436M/436M [00:21<00:00, 19.9MB/s]


In [7]:
model.save_pretrained("directory_on_my_computer")

In [8]:
ls directory_on_my_computer

config.json        model.safetensors


## Using Transformers models for inference

In [9]:
sequences = ["Hello!", "Cool.", "Nice!"]

In [10]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

In [11]:
import torch 

model_input = torch.tensor(encoded_sequences)

print(model_input)

tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])


In [12]:
output = model(model_input)

print(output)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4032e-02,
           3.9393e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6915e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6739e-01, -1.8187e-01,  ...,  2.4672e-01,
           1.0441e+00, -6.1966e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0111e-02,
           3.2451e-01, -2.0996e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1741e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4876e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1042e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1319e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6063e-02,
           3.3564e-01,  2

## Tokenization

In [13]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


### Loading and saving tokenizers

In [14]:
#Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the BertTokenizer class:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

tokenizer_config.json: 100%|██████████| 29.0/29.0 [00:00<00:00, 7.94kB/s]
vocab.txt: 100%|██████████| 213k/213k [00:00<00:00, 1.55MB/s]
tokenizer.json: 100%|██████████| 436k/436k [00:00<00:00, 1.01MB/s]


In [15]:
#Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [16]:
tokenizer("Using Transformer network is simple \n")

{'input_ids': [101, 7993, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [17]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

### Encoding 


Translating text to numbers is known as encoding. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

In [18]:
#*Transforming text to tokens:*
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequences = "Using a Transfomer network is simple"
tokens = tokenizer.tokenize(sequences)

print(tokens)

['Using', 'a', 'Trans', '##fo', '##mer', 'network', 'is', 'simple']


In [19]:
# *transforming tokens into ids:*
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 14467, 4027, 2443, 1110, 3014]


In [20]:
# *Decoding ids to text*
decoding_string = tokenizer.decode([7993, 170, 13809, 14467, 4027, 2443, 1110, 3014])

print(decoding_string)

Using a Transfomer network is simple
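To illustrate what the tokenize, convert, and decode steps do conceptually, here is a minimal sketch of a greedy longest-match subword tokenizer in plain Python. The eight-entry vocabulary and all function names are hypothetical and chosen for illustration only; the real tokenizer uses the checkpoint's full vocabulary and handles casing, punctuation, and special tokens that this sketch ignores.

```python
# Hypothetical toy vocabulary (the real vocabulary is far larger).
toy_vocab = {
    "[UNK]": 0, "Using": 1, "a": 2, "Trans": 3,
    "##former": 4, "network": 5, "is": 6, "simple": 7,
}
toy_id_to_token = {i: t for t, i in toy_vocab.items()}


def wordpiece(word):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word are marked with a "##" prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in toy_vocab:
                break
            end -= 1
        if end == start:  # nothing in the vocabulary matches at this position
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text):
    """Step 1: split on whitespace, then split each word into subwords."""
    return [piece for word in text.split() for piece in wordpiece(word)]


def toy_convert_tokens_to_ids(tokens):
    """Step 2: look each token up in the vocabulary."""
    return [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]


def toy_decode(ids):
    """Reverse lookup, then glue '##' continuation pieces back onto their word."""
    return " ".join(toy_id_to_token[i] for i in ids).replace(" ##", "")


toy_tokens = toy_tokenize("Using a Transformer network is simple")
toy_ids = toy_convert_tokens_to_ids(toy_tokens)
print(toy_tokens)  # ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
print(toy_ids)     # [1, 2, 3, 4, 5, 6, 7]
print(toy_decode(toy_ids))  # Using a Transformer network is simple
```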


## Handling multiple sequences


In [22]:
## batching inputs

import torch 
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = "I've be waiting for HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequences)
ids = tokenizer.convert_tokens_to_ids(tokens)  # pass the token list, not the raw string
input_ids = torch.tensor(ids)  # 1-D tensor: no batch dimension

# This line will fail.
model(input_ids)


IndexError: too many indices for tensor of dimension 1

The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default.
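To make the missing batch dimension concrete, here is a minimal sketch in plain Python. `nested_shape` is a hypothetical helper written for this illustration, and the IDs are a few of the token IDs that appear in this section's outputs.

```python
def nested_shape(data):
    """Return the shape of a rectangular nested list, e.g. [[1, 2, 3]] -> (1, 3)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0] if data else None
    return tuple(shape)


single_ids = [1045, 1005, 2310, 2022, 3403]
batched_ids = [single_ids]  # wrapping in an outer list adds the batch dimension

print(nested_shape(single_ids))   # (5,)
print(nested_shape(batched_ids))  # (1, 5)
```

With tensors, the same effect can be obtained with `torch.tensor([ids])` or `torch.tensor(ids).unsqueeze(0)`.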

In [23]:
# You'll see that the tokenizer didn't just convert the list of input IDs into a tensor; it added a dimension on top of it:
tokenized_input = tokenizer(sequences, return_tensors="pt")

print(tokenized_input)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2022,  3403,  2005, 17662, 12172,  2607,
          2026,  2878,  2166,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [24]:
#correct output
tokenized_input = tokenizer(sequences, return_tensors="pt")

print(tokenized_input["input_ids"])

tensor([[  101,  1045,  1005,  2310,  2022,  3403,  2005, 17662, 12172,  2607,
          2026,  2878,  2166,  1012,   102]])


In [25]:
# Let's try again and add a new dimension:

import torch 
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = "I've be waiting for HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequences)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("input_IDs: ", input_ids)

output = model(input_ids)
print("Logits: ", output.logits)

input_IDs:  tensor([[ 1045,  1005,  2310,  2022,  3403,  2005, 17662, 12172,  2607,  2026,
          2878,  2166,  1012]])
Logits:  tensor([[-2.3035,  2.4422]], grad_fn=<AddmmBackward0>)


In [26]:
# Batching is the act of sending multiple sentences through the model, all at once.
# If you only have one sentence, you can build a batch containing just that sequence.
# Here, batch_ids is a batch of two identical sequences:

batch_ids = [ids, ids]
print(ids)

[1045, 1005, 2310, 2022, 3403, 2005, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


### Padding the inputs

In [32]:
# The following list of lists cannot be converted to a tensor:
batch_ids = [
    [200,200,200],
    [200,200]
]
print(batch_ids)

[[200, 200, 200], [200, 200]]


Padding makes sure all our sentences have the same length by adding a special word called the padding token to the sentences with fewer values. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In our example, the resulting tensor looks like this:

In [33]:
# Adding padding to make our tensors have a rectangular shape
padding_id = 100

batch_ids = [
    [200,200,200],
    [200,200, padding_id]
]
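The manual padding above can be generalized with a small helper. This is a minimal sketch in plain Python: `pad_batch` is a hypothetical function, not a library API, and the padding ID of 100 simply mirrors the placeholder value used in the cell above (in practice the ID comes from the tokenizer).

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence with pad_id up to the longest sequence in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


ragged_ids = [[200, 200, 200], [200, 200]]
padded_ids = pad_batch(ragged_ids, pad_id=100)
print(padded_ids)  # [[200, 200, 200], [200, 200, 100]]
```

Because every row now has the same length, the result can be converted to a tensor.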

There's something wrong with the logits in the batched predictions computed in the next cell: the second row should be the same as the logits for the second sentence, but we've got completely different values!

* This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.
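The effect described above can be reproduced on a single row of attention scores. The following is a minimal sketch with made-up scores, not the model's actual computation: it shows that an unmasked padding token takes attention weight away from the real tokens, while masking it restores the weights of the unpadded sequence. Real implementations typically add a large negative value to masked scores before the softmax, which has the same effect as zeroing them out here.

```python
import math


def attention_weights(scores, mask=None):
    """Softmax over one row of attention scores; masked positions get weight 0."""
    if mask is None:
        mask = [1] * len(scores)
    exps = [math.exp(s) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores of one query token against two real tokens...
real_scores = [2.0, 1.0]
# ...and the same row after a padding token (score 0.5) is appended.
padded_scores = [2.0, 1.0, 0.5]

unpadded = attention_weights(real_scores)
unmasked = attention_weights(padded_scores)
masked = attention_weights(padded_scores, mask=[1, 1, 0])

print("unpadded:", [round(w, 3) for w in unpadded])
print("unmasked:", [round(w, 3) for w in unmasked])  # real tokens lose weight to padding
print("masked:  ", [round(w, 3) for w in masked])    # real-token weights match unpadded
```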

In [36]:
# The padding token ID can be found in tokenizer.pad_token_id. Let's use it and send our two sentences through the model individually and batched together:

import torch
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence_1_ids = [[200,200,200]]
sequence_2_ids = [[200,200]]


batch_ids = [
    [200,200,200],
    [200,200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence_1_ids)).logits)
print(model(torch.tensor(sequence_2_ids)).logits)
print(model(torch.tensor(batch_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


### Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s:

* 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).

In [38]:
# Let's fix the previous error by adding an attention mask:

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

# Now we get the same logits for the second sentence in the batch.
# Notice how the last value of the second sequence is a padding ID, which is a 0 value in the attention mask.

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
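Writing the mask by hand does not scale beyond toy batches. As a minimal sketch (the helper name is hypothetical), the mask can be derived from the original sequence lengths, assuming padding is always appended on the right:

```python
def build_attention_mask(lengths, padded_length):
    """Return 1 for each real token and 0 for each right-padding position."""
    return [[1] * n + [0] * (padded_length - n) for n in lengths]


unpadded_ids = [[200, 200, 200], [200, 200]]
lengths = [len(seq) for seq in unpadded_ids]
mask = build_attention_mask(lengths, padded_length=max(lengths))
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

Building the mask from lengths, rather than by comparing each ID with the padding ID, avoids masking a real token that happens to share the padding ID.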


### Longer sequences

With Transformer models, there is a limit to the length of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:

   * Use a model with a longer supported sequence length.
   * Truncate your sequences.

In [41]:
# Otherwise, we recommend you truncate your sequences to a maximum length:

max_sequence_length = 512  # illustrative limit; use your model's actual maximum
sequence = ids  # token IDs built in an earlier cell
sequence = sequence[:max_sequence_length]
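Plain slicing also cuts off the closing special token when the sequence already contains `[CLS]` and `[SEP]`. The sketch below is one way to truncate while keeping both; `truncate_ids` is a hypothetical helper, and the IDs 101 and 102 are assumed to be the `[CLS]` and `[SEP]` IDs, as in the BERT-style outputs shown earlier.

```python
def truncate_ids(ids, max_length, cls_id=101, sep_id=102):
    """Cut a [CLS] ... [SEP] sequence to max_length, keeping both special tokens."""
    if len(ids) <= max_length:
        return ids
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    body = ids[1:-1]  # drop the special tokens before cutting
    return [cls_id] + body[: max_length - 2] + [sep_id]


# Input IDs copied from the tokenizer output shown earlier in this section.
full_ids = [101, 1045, 1005, 2310, 2022, 3403, 2005, 17662,
            12172, 2607, 2026, 2878, 2166, 1012, 102]
short_ids = truncate_ids(full_ids, max_length=8)
print(short_ids)  # [101, 1045, 1005, 2310, 2022, 3403, 2005, 102]
```

In practice the tokenizer can do this itself when called with `truncation=True` and a `max_length` value.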

## Put it all together

When you call your tokenizer directly on the sentence, you get back inputs that are ready to pass through your model. In the cell below, the `model_inputs` variable contains everything that's necessary for a model to operate well. For DistilBERT, that includes the input IDs as well as the attention mask. Other models that accept additional inputs will also have those output by the tokenizer object.

In [42]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

It also handles multiple sequences at a time, with no change in the API:


In [43]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

It can pad according to several objectives:

In [45]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model's max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)