# Hugging Face
In this notebook, we'll get to know the Hugging Face ecosystem by loading a dataset, encoding the input data, running a model, and evaluating the results.

In [2]:
%pip install -q datasets ipywidgets

Note: you may need to restart the kernel to use updated packages.


Take a look at the [Hugging Face datasets hub](https://huggingface.co/datasets). Find the MRPC (Microsoft Research Paraphrase Corpus) dataset, which is part of the GLUE (General Language Understanding Evaluation) benchmark. Download the validation split of the dataset with the `load_dataset` function from the `datasets` library.

In [3]:
from datasets import load_dataset

dataset = load_dataset("glue", "mrpc", split="validation")

print(dataset)
print(dataset[0])

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 408
})
{'sentence1': "He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .", 'sentence2': '" The foodservice pie business does not fit our long-term growth strategy .', 'label': 1, 'idx': 9}


## Encoding
With Transformer models (we will get to know them in more detail later in the course), tokenization is tied to the model itself: each pretrained checkpoint comes with the tokenizer and vocabulary it was trained with. We first install Hugging Face's `transformers` library.

In [4]:
%pip install transformers

Collecting transformers
  Using cached transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting regex!=2019.12.17 (from transformers)
  Using cached regex-2024.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting safetensors>=0.4.1 (from transformers)
  Using cached safetensors-0.4.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Using cached tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Using cached transformers-4.44.2-py3-none-any.whl (9.5 MB)
Using cached regex-2024.9.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (792 kB)
Using cached safetensors-0.4.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (435 kB)
Using cached tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: safetensors, regex, tokenizers, transforme

Use the [model page of the base-uncased version of BERT](https://huggingface.co/bert-base-uncased) to initialize a `BertTokenizer`.

In [5]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')




Encode the first sentence of the first example in the dataset. Look at the outputs of the following functions:
- `tokenizer(sentence)`
- `tokenizer.encode(sentence)`
- `tokenizer.tokenize(sentence)`
- `tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence))`

In [6]:
sentence = dataset[0]['sentence1']

print(f"Keys: \n {tokenizer(sentence).keys()}")
print(f"Sub-words: \n {tokenizer(sentence)}")
print(f"Input Ids: \n {tokenizer.encode(sentence)}")
print(f"Splitting Text: \n {tokenizer.tokenize(sentence)}")
print(f"Tokenize then map to Ids: \n {tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence))}")

Keys: 
 dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
Sub-words: 
 {'input_ids': [101, 2002, 2056, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2987, 1005, 1056, 4906, 1996, 2194, 1005, 1055, 2146, 1011, 2744, 3930, 5656, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Input Ids: 
 [101, 2002, 2056, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2987, 1005, 1056, 4906, 1996, 2194, 1005, 1055, 2146, 1011, 2744, 3930, 5656, 1012, 102]
Splitting Text: 
 ['he', 'said', 'the', 'foods', '##er', '##vic', '##e', 'pie', 'business', 'doesn', "'", 't', 'fit', 'the', 'company', "'", 's', 'long', '-', 'term', 'growth', 'strategy', '.']
Tokenize then map to Ids: 
 [2002, 2056, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2987, 1005, 1056, 4906, 1996, 2194, 1005, 1055, 2146, 1011, 2744, 3930, 5656, 1012]
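
The sub-word split shown above (`foodservice` becoming `foods`, `##er`, `##vic`, `##e`) comes from WordPiece, which matches the longest known piece first and then continues with `##`-prefixed pieces. The following is a minimal sketch of that greedy matching, not the library's implementation; the function name `wordpiece` and the tiny `TOY_VOCAB` are assumptions made for illustration and stand in for the real `bert-base-uncased` vocabulary.

```python
# Toy vocabulary: a hypothetical subset chosen for this illustration,
# not the real bert-base-uncased vocabulary file.
TOY_VOCAB = {"he", "said", "the", "foods", "##er", "##vic", "##e", "pie", "business"}


def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word into sub-words by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


tokens = []
for word in "he said the foodservice pie business".split():
    tokens.extend(wordpiece(word, TOY_VOCAB))
print(tokens)
# ['he', 'said', 'the', 'foods', '##er', '##vic', '##e', 'pie', 'business']
```

A word for which no piece matches is mapped to `[UNK]` as a whole, which is why a large sub-word vocabulary rarely produces unknown tokens.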


**Decoding.** Check out the various ways of decoding: `.decode`, `.convert_ids_to_tokens`, `.convert_tokens_to_string`.

In [7]:
input_ids = tokenizer(sentence)['input_ids']
print(tokenizer.decode(input_ids))
print(tokenizer.decode(input_ids, skip_special_tokens=True))
print(tokenizer.decode(input_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))
print(tokenizer.convert_ids_to_tokens(input_ids))
print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids)))

[CLS] he said the foodservice pie business doesn't fit the company's long - term growth strategy. [SEP]
he said the foodservice pie business doesn't fit the company's long - term growth strategy.
he said the foodservice pie business doesn't fit the company's long - term growth strategy.
['[CLS]', 'he', 'said', 'the', 'foods', '##er', '##vic', '##e', 'pie', 'business', 'doesn', "'", 't', 'fit', 'the', 'company', "'", 's', 'long', '-', 'term', 'growth', 'strategy', '.', '[SEP]']
[CLS] he said the foodservice pie business doesn ' t fit the company ' s long - term growth strategy . [SEP]
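
The difference between the last output line and the `.decode` outputs comes down to two string operations: gluing `##` pieces back onto the previous token, and removing the spaces that tokenization left around punctuation. The sketch below re-creates both in plain Python. It is a simplified illustration, not the library code: the helper names `tokens_to_string` and `clean_up_spaces` are hypothetical, and only a few of the clean-up rules are included.

```python
def tokens_to_string(tokens):
    """Join WordPiece tokens and glue '##' continuation pieces onto the previous piece."""
    return " ".join(tokens).replace(" ##", "").strip()


def clean_up_spaces(text):
    """Remove the spaces that tokenization left before punctuation and around apostrophes."""
    for spaced, compact in [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","), (" ' ", "'")]:
        text = text.replace(spaced, compact)
    return text


tokens = ["he", "said", "the", "foods", "##er", "##vic", "##e", "pie", "business",
          "doesn", "'", "t", "fit", "."]
raw = tokens_to_string(tokens)
print(raw)                   # he said the foodservice pie business doesn ' t fit .
print(clean_up_spaces(raw))  # he said the foodservice pie business doesn't fit.
```

Note that `long - term` keeps its spaces in the outputs above, which is consistent with a clean-up that has no rule for hyphens.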


Use the NLP section of the [quickstart guide](https://huggingface.co/docs/datasets/quickstart) to apply encoding to the entire dataset.

In [8]:
def encode(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")

dataset = dataset.map(encode, batched=True)
print(dataset[0]['input_ids'])
print(dataset[1]['input_ids'])

[101, 2002, 2056, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2987, 1005, 1056, 4906, 1996, 2194, 1005, 1055, 2146, 1011, 2744, 3930, 5656, 1012, 102, 1000, 1996, 9440, 2121, 7903, 2063, 11345, 2449, 2515, 2025, 4906, 2256, 2146, 1011, 2744, 3930, 5656, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
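
The structure of this output (101, sentence 1, 102, sentence 2, 102, then zeros) can be sketched in plain Python. The helper `encode_pair` below is hypothetical and only mirrors the layout visible in the outputs above, where 101, 102 and 0 appear as the ids of `[CLS]`, `[SEP]` and padding; truncation is deliberately left out.

```python
def encode_pair(ids_a, ids_b, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Build BERT-style inputs for a sentence pair, padded to max_length."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)

    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("pair is longer than max_length; truncation is not sketched here")
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,  # padding gets segment 0
        "attention_mask": attention_mask + [0] * n_pad,  # padding is masked out
    }


# First few ids of each sentence, copied from the outputs above.
encoded = encode_pair([2002, 2056, 1996], [1000, 1996, 9440], max_length=12)
for key, value in encoded.items():
    print(key, value)
# input_ids [101, 2002, 2056, 1996, 102, 1000, 1996, 9440, 102, 0, 0, 0]
# token_type_ids [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
# attention_mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The `attention_mask` is what lets the model ignore the padded positions, and `token_type_ids` is what tells it where the second sentence starts.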

`BertForSequenceClassification` expects the targets under the name `labels`, while the dataset calls the column `label`. We therefore add a `labels` column that copies `label`.

In [9]:
dataset = dataset.map(lambda examples: {"labels": examples["label"]}, batched=True)

- Use the guide again to set the data format to "torch". Make sure the columns `input_ids`, `token_type_ids`, `attention_mask` and `labels` are present.
- Create a data loader with a `batch_size` of 4.

In [10]:
import torch

dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)
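
As a rough mental model of what the data loader yields, the sketch below batches a list of per-example dicts into dicts of per-column lists. It is a plain-Python illustration with hypothetical helper names (`collate`, `iter_batches`) and made-up rows; the real `DataLoader` stacks tensors rather than building lists.

```python
def collate(rows):
    """Turn a list of per-example dicts into one dict of per-column lists."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def iter_batches(rows, batch_size):
    """Yield consecutive batches in dataset order (no shuffling)."""
    for start in range(0, len(rows), batch_size):
        yield collate(rows[start:start + batch_size])


# Hypothetical stand-in rows; the real rows hold tensors of token ids.
rows = [{"input_ids": [101, i, 102], "labels": i % 2} for i in range(408)]
batches = list(iter_batches(rows, batch_size=4))
print(len(batches))          # 102
print(batches[0]["labels"])  # [0, 1, 0, 1]
```

Because nothing is shuffled, the first batch always holds the first four examples of the split.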

## Model
We now load a pretrained BERT model and perform sequence classification on the MRPC dataset. Load the `BertForSequenceClassification` model. Set the model to evaluation mode by calling `.eval()` on the object.

In [11]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.eval() # If you just want to use, but not train the model -> put it into eval mode
print(model)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

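In evaluation mode, layers such as dropout behave differently from training: no dropout is applied. PyTorch already rescales the surviving activations by 1/(1-p) during training, so no rescaling is needed at evaluation time. Note that `.eval()` does not disable gradient tracking. We can always set the model back to train mode with `.train()`.

To avoid storing gradient information in the forward pass, we should additionally call the model in a `torch.no_grad()` context. Inside it, the results of all computations have `requires_grad=False`; the parameters' own `.requires_grad` fields are left unchanged.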
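
To make the dropout behaviour concrete, here is a minimal pure-Python sketch of inverted dropout, the scheme `torch.nn.Dropout` uses: survivors are scaled by 1/(1-p) while training, and evaluation mode is the identity. The function `dropout` and its `rng` argument are assumptions for this illustration, not PyTorch's API.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout on a list of floats."""
    if not training or p == 0.0:
        return list(values)  # evaluation mode: identity, no rescaling needed
    scale = 1.0 / (1.0 - p)
    # Training mode: zero each value with probability p, scale the survivors.
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)
activations = [1.0] * 10
print(dropout(activations, p=0.1, training=True, rng=rng))   # zeros and 1/(1-p) values
print(dropout(activations, p=0.1, training=False, rng=rng))  # unchanged input
```

Scaling during training keeps the expected activation the same in both modes, which is why the evaluation path can simply pass values through.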

## Forward pass
Now we run the model on a single batch: get a batch from the dataloader and pass it to the model. Call the model as `model(...)` rather than `model.forward(...)`, because hooks registered on the module may not run with the latter, as mentioned in this [PyTorch forum question](https://discuss.pytorch.org/t/any-different-between-model-input-and-model-forward-input/3690).

- Run a single batch through the model.
- Get the output logits.
- Run a softmax function on them (use `torch.nn.functional.softmax`) to get output probabilities.
- Display the result (i.e. whether sentence2 is a paraphrase of sentence1).
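
As a reminder of what the softmax step computes, the sketch below implements it for a single example in plain Python. The logits are hypothetical values; in MRPC, label 1 means the two sentences are equivalent.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-0.4, 0.9]  # hypothetical logits for one sentence pair
probs = softmax(logits)
prediction = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
print(probs, prediction)
```

Since softmax is monotonic, taking the argmax of the probabilities gives the same prediction as taking the argmax of the logits.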

In [20]:
import torch
from torch.nn import functional as F

# Get a single batch from the dataloader
batch = next(iter(dataloader))

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
batch = {k: v.to(device) for k, v in batch.items()}

model.to(device)

with torch.no_grad():
    outputs = model(input_ids=batch["input_ids"],
                   token_type_ids=batch["token_type_ids"],
                   attention_mask=batch["attention_mask"])
    
    logits = outputs.logits

probs = F.softmax(logits, dim=-1)

predictions = torch.argmax(probs, dim=-1)

for i, (prob, pred) in enumerate(zip(probs, predictions)):
    paraphrase = "Yes" if pred == 1 else "No"
    ids, segments = batch['input_ids'][i], batch['token_type_ids'][i]
    # token_type_ids is 0 for sentence1 (and padding) and 1 for sentence2
    print(f"Sentence 1: {tokenizer.decode(ids[segments == 0], skip_special_tokens=True)}")
    print(f"Sentence 2: {tokenizer.decode(ids[segments == 1], skip_special_tokens=True)}")
    print(f"Is sentence2 a paraphrase of sentence1? {paraphrase}")
    print(f"Probabilities: {prob}\n")

Sentence 1: he said the foodservice pie business doesn't fit the company's long - term growth strategy.
Sentence 2: " the foodservice pie business does not fit our long - term growth strategy.
Is sentence2 a paraphrase of sentence1? Yes
Probabilities: tensor([0.2025, 0.7975], device='cuda:0')

Sentence 1: magnarelli said racicot hated the iraqi regime and looked forward to using his long years of training in the war.
Sentence 2: his wife said he was " 100 percent behind george bush " and looked forward to using his years of training in the war.
Is sentence2 a paraphrase of sentence1? Yes
Prob

In [None]:
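# Course solution:
with torch.no_grad():
    for batch in dataloader:
        print(len(batch))  # number of columns in the batch dict, not the batch size
        # Move the batch to the model's device in case the model was moved to the GPU
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(
            input_ids=batch["input_ids"],
            token_type_ids=batch["token_type_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch['labels'],
        )
        # if your batch object has all the right fields: outputs = model(**batch)
        print(outputs.keys())
        print(outputs.loss)
        print(outputs.logits)
        break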

In [None]:
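# Course solution (the loop body was incomplete; the prints below are one possible completion):
from torch.nn import functional as F

probs = F.softmax(outputs.logits, dim=-1)
print(probs)
# zip stops after the first batch because probs holds only batch_size rows
for sent1, sent2, label, prob in zip(dataset['sentence1'], dataset['sentence2'], dataset['labels'], probs):
    print(f"Sentence 1: {sent1}")
    print(f"Sentence 2: {sent2}")
    print(f"Label: {int(label)}, predicted: {int(prob.argmax())}, probabilities: {prob.tolist()}")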

**Question:** Load the model again (execute the cell just below the [Model](#Model) section), run the forward pass and your evaluation. Why do you get different results?

**Answer:**
The results differ because of how the model is initialized. The `bert-base-uncased` checkpoint contains no weights for the classification head, so `classifier.weight` and `classifier.bias` are newly initialized on every load, as the warning printed when loading the model says. The encoder weights are the same each time, but the freshly initialized classifier makes the output effectively random. The data loader is not the cause: it was created without `shuffle=True`, so `next(iter(dataloader))` returns the same first batch every time.
- To get the same initialization on every load, call `torch.manual_seed(...)` with a fixed seed before loading the model.
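
The effect of a freshly initialized classification head can be imitated without PyTorch. The sketch below uses Python's `random` module as a stand-in for `torch.manual_seed`; the helpers `init_head` and `logits`, the feature vector and the standard deviation of 0.02 are all assumptions made for illustration.

```python
import random


def init_head(hidden_size, num_labels, seed=None, std=0.02):
    """Draw a fresh weight matrix for a linear classification head."""
    rng = random.Random(seed)  # seed=None gives a different draw on every call
    return [[rng.gauss(0.0, std) for _ in range(hidden_size)] for _ in range(num_labels)]


def logits(head, features):
    """Apply the head to one feature vector (bias omitted for brevity)."""
    return [sum(w * x for w, x in zip(row, features)) for row in head]


features = [1.0] * 8  # hypothetical pooled output of the (unchanged) encoder
print(logits(init_head(8, 2), features))          # differs from call to call
print(logits(init_head(8, 2), features))
print(logits(init_head(8, 2, seed=0), features))  # identical for a fixed seed
print(logits(init_head(8, 2, seed=0), features))
```

The encoder output stays the same in every call; only the head changes, which is enough to flip the predicted class.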


## Evaluate a trained model
We now download a different model instead: `textattack/bert-base-uncased-QQP`. This model was trained to detect duplicate questions on Quora, which is essentially our paraphrase detection task, but on a different dataset. Let's see how well it performs.

In [23]:
tokenizer = BertTokenizer.from_pretrained("textattack/bert-base-uncased-QQP")
model = BertForSequenceClassification.from_pretrained("textattack/bert-base-uncased-QQP")
model.eval()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [25]:
batch = next(iter(dataloader))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {k: v.to(device) for k, v in batch.items()}

model.to(device)

with torch.no_grad():
    outputs = model(input_ids=batch["input_ids"],
                    token_type_ids=batch["token_type_ids"],
                    attention_mask=batch["attention_mask"])
    
    logits = outputs.logits

probs = F.softmax(logits, dim=-1)

predictions = torch.argmax(probs, dim=-1)

for i, (prob, pred) in enumerate(zip(probs, predictions)):
    paraphrase = "Yes" if pred == 1 else "No"
    ids, segments = batch["input_ids"][i], batch["token_type_ids"][i]
    # token_type_ids is 0 for sentence1 (and padding) and 1 for sentence2
    sentence1 = tokenizer.decode(ids[segments == 0], skip_special_tokens=True)
    sentence2 = tokenizer.decode(ids[segments == 1], skip_special_tokens=True)
    print(f"Sentence 1: {sentence1}")
    print(f"Sentence 2: {sentence2}")
    print(f"Is sentence2 a paraphrase of sentence1? {paraphrase}")
    print(f"Probabilities: {prob}\n")

Sentence 1: he said the foodservice pie business doesn't fit the company's long - term growth strategy.
Sentence 2: " the foodservice pie business does not fit our long - term growth strategy.
Is sentence2 a paraphrase of sentence1? Yes
Probabilities: tensor([0.0240, 0.9760], device='cuda:0')

Sentence 1: magnarelli said racicot hated the iraqi regime and looked forward to using his long years of training in the war.
Sentence 2: his wife said he was " 100 percent behind george bush " and looked forward to using his years of training in the war.
Is sentence2 a paraphrase of sentence1? No
Proba
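
To judge how well the model performs beyond a single batch, the predictions for the whole validation split can be compared with the `labels` column. GLUE reports accuracy and F1 for MRPC; the sketch below computes both in plain Python. The helper names and the example values are hypothetical, not results from the model.

```python
def accuracy(predictions, labels):
    """Fraction of examples where the predicted class equals the label."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def f1_score(predictions, labels, positive=1):
    """F1 of the positive class, computed from true/false positives and false negatives."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


predictions = [1, 0, 1, 1]  # hypothetical argmax outputs collected over batches
labels = [1, 0, 0, 1]       # hypothetical gold labels
print(accuracy(predictions, labels))  # 0.75
print(f1_score(predictions, labels))  # 0.8
```

In the notebook, `predictions` would be gathered by looping over all batches of the data loader inside `torch.no_grad()` and appending the argmax of each row of logits.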