# Processing the data (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also, log in to Hugging Face:

In [2]:
from huggingface_hub import notebook_login

notebook_login()


Continuing with the example from the [previous chapter](https://huggingface.co/course/chapter2), here is how we would train a <font color='blue'>sequence classifier</font> on <font color='blue'>one batch</font> in PyTorch:

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as Chapter 2
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# New code
batch["labels"] = torch.tensor([1, 1])

optimizer = torch.optim.AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Of course, just training the model on two sentences is not going to yield very good results. To get <font color='blue'>better results</font>, you will need to prepare a <font color='blue'>bigger dataset</font>.

In this section we will use the <font color='blue'>Microsoft Research Paraphrase Corpus</font> (MRPC) dataset as an example. It was introduced in a [paper](https://www.aclweb.org/anthology/I05-5002.pdf) by William B. Dolan and Chris Brockett. The dataset consists of <font color='blue'>5,801 pairs</font> of <font color='blue'>sentences</font>, with a <font color='blue'>label indicating</font> whether they are <font color='blue'>paraphrases</font> or not (i.e., whether <font color='blue'>both</font> sentences <font color='blue'>mean</font> the <font color='blue'>same thing</font>). We've selected it for this chapter because it's a small dataset, so it's easy to experiment with training on it.

**Loading a dataset from the Hub**

The Hub doesn't just contain models; it also has <font color='blue'>multiple datasets</font> in lots of <font color='blue'>different languages</font>. You can browse the datasets [here](https://huggingface.co/datasets), and we recommend you <font color='blue'>try</font> to <font color='blue'>load and process a new dataset</font> once you have gone through this section (see the general documentation [here](https://huggingface.co/docs/datasets/loading)). But for now, let's focus on the <font color='blue'>MRPC dataset</font>. This is one of the 10 datasets composing the [GLUE benchmark](https://gluebenchmark.com/), which is an <font color='blue'>academic benchmark</font> that is used to measure the <font color='blue'>performance</font> of <font color='blue'>ML models</font> across 10 different text classification tasks.

The 🤗 Datasets library provides a simple command to <font color='blue'>download</font> and <font color='blue'>cache</font> a <font color='blue'>dataset</font> on the <font color='blue'>Hub</font>. We can download the MRPC dataset like this:

In [6]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

As you can see, we get a <font color='blue'>`DatasetDict` object</font> which contains the <font color='blue'>training</font> set, the <font color='blue'>validation</font> set, and the <font color='blue'>test</font> set. Each of those contains <font color='blue'>several columns</font> (`sentence1`, `sentence2`, `label`, and `idx`) and a <font color='blue'>variable number</font> of <font color='blue'>rows</font>, which are the number of elements in each set (so, there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set).

This command <font color='blue'>downloads</font> and <font color='blue'>caches the dataset</font>, by default in `~/.cache/huggingface/datasets`. Recall from Chapter 2 that you can customize your cache folder by setting the `HF_HOME` environment variable.
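As a small illustration of the cache setting, the sketch below points `HF_HOME` at a custom directory. The folder name `hf_cache_demo` is a hypothetical placeholder, and we are assuming the variable is set before the Hugging Face libraries are imported (so in a notebook it belongs in the very first cell). With it set, the datasets cache is expected to live in a `datasets` subfolder of that directory.

```python
import os
from pathlib import Path

# Hypothetical cache location; any writable directory works.
custom_hf_home = Path.home() / "hf_cache_demo"

# Set the variable before importing datasets/transformers so they pick it up.
os.environ["HF_HOME"] = str(custom_hf_home)

# Assumption: with HF_HOME set, datasets are cached under <HF_HOME>/datasets.
expected_datasets_cache = custom_hf_home / "datasets"
print(expected_datasets_cache)
```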

We can <font color='blue'>access each pair</font> of <font color='blue'>sentences</font> in our `raw_datasets` object by <font color='blue'>indexing</font>, like with a dictionary:

In [7]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

We can see the <font color='blue'>labels</font> are <font color='blue'>already integers</font>, so we won't have to do any preprocessing there. To know <font color='blue'>which integer</font> corresponds to <font color='blue'>which label</font>, we can <font color='blue'>inspect</font> the <font color='blue'>features</font> of our <font color='blue'>`raw_train_dataset`</font>. This will tell us the type of each column:

In [8]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

Behind the scenes, <font color='blue'>`label`</font> is of <font color='blue'>type `ClassLabel`</font>, and the <font color='blue'>mapping</font> of <font color='blue'>integers to label names</font> is stored in its <font color='blue'>`names` attribute</font>. Here:

- <font color='blue'>0</font> corresponds to <font color='blue'>not_equivalent</font>,
- and <font color='blue'>1</font> corresponds to <font color='blue'>equivalent</font>.
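To make the mapping concrete, here is a minimal plain-Python sketch of the lookup in both directions. The `names` list is copied from the output above; the helper names `int_to_name` and `name_to_int` are ours, for illustration only. The `ClassLabel` feature itself offers `int2str()` and `str2int()` methods for the same purpose.

```python
# Label names in the order reported by raw_train_dataset.features["label"].
names = ["not_equivalent", "equivalent"]

def int_to_name(label_id):
    """Return the label name stored at position label_id."""
    return names[label_id]

def name_to_int(label_name):
    """Return the integer ID of a label name."""
    return names.index(label_name)

print(int_to_name(1))
print(name_to_int("not_equivalent"))
```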

**Try it out!** Look at element 15 of the training set and element 87 of the validation set. What are their labels?

In [None]:
# Exercise
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[14]     # Element 15 of the training set

{'sentence1': 'Gyorgy Heizler , head of the local disaster unit , said the coach was carrying 38 passengers .',
 'sentence2': 'The head of the local disaster unit , Gyorgy Heizler , said the coach driver had failed to heed red stop lights .',
 'label': 0,
 'idx': 15}

In [None]:
raw_validation_dataset = raw_datasets["validation"]
raw_validation_dataset[86]     # Element 87 of the validation set

{'sentence1': 'He was arrested Friday night at an Alpharetta seafood restaurant while dining with his wife , singer Whitney Houston .',
 'sentence2': 'He was arrested again Friday night at an Alpharetta restaurant where he was having dinner with his wife .',
 'label': 1,
 'idx': 796}

**Preprocessing a dataset**

To <font color='blue'>preprocess</font> the <font color='blue'>dataset</font>, we need to <font color='blue'>convert</font> the <font color='blue'>text to numbers</font> the model can make sense of. As you saw in the [previous chapter](https://huggingface.co/course/chapter2), this is done with a <font color='blue'>tokenizer</font>. We can <font color='blue'>feed</font> the <font color='blue'>tokenizer one sentence</font> or a <font color='blue'>list of sentences</font>, so we can <font color='blue'>directly tokenize</font> all the first sentences and all the second sentences of each pair like this:

In [11]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

However, we <font color='blue'>can't</font> just <font color='blue'>pass two sequences</font> to the <font color='blue'>model</font> and <font color='blue'>get a prediction</font> of whether the <font color='blue'>two sentences</font> are <font color='blue'>paraphrases</font> or not. We need to handle the <font color='blue'>two sequences</font> as a <font color='blue'>pair</font>, and <font color='blue'>apply</font> the <font color='blue'>appropriate preprocessing</font>. Fortunately, the <font color='blue'>tokenizer</font> can also take a <font color='blue'>pair of sequences</font> and <font color='blue'>prepare it</font> the way our BERT model expects:



In [12]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
for key, value in inputs.items():
    print(f"{key}: {value}")

input_ids: [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102]
token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


We discussed the <font color='blue'>input_ids</font> and <font color='blue'>attention_mask</font> keys in [Chapter 2](https://huggingface.co/course/chapter2), but we put off talking about <font color='blue'>token_type_ids</font>. In this example, this is what <font color='blue'>tells the model</font> which <font color='blue'>part</font> of the <font color='blue'>input</font> is the <font color='blue'>first sentence</font> and which is the <font color='blue'>second sentence</font>.

**Try it out!** Take element 15 of the training set and tokenize the two sentences separately and as a pair. What's the difference between the two results?

This is what happens when you tokenize the two sentences separately:

In [None]:
# Exercise
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
ele15 = raw_datasets["train"][14]
tokenized_sentences_1 = tokenizer(ele15["sentence1"])
tokenized_sentences_2 = tokenizer(ele15["sentence2"])
for key in tokenized_sentences_1:
  print(key, tokenized_sentences_1[key])
print()
for key in tokenized_sentences_2:
  print(key, tokenized_sentences_2[key])

input_ids [101, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 2056, 1996, 2873, 2001, 4755, 4229, 5467, 1012, 102]
token_type_ids [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

input_ids [101, 1996, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2056, 1996, 2873, 4062, 2018, 3478, 2000, 18235, 2094, 2417, 2644, 4597, 1012, 102]
token_type_ids [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


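Notice that each sentence is tokenized on its own: each list of `input_ids` starts with `[CLS]` (ID 101) and ends with `[SEP]` (ID 102), and the `token_type_ids` are all 0. When you tokenize the two sentences as a pair, the tokenizer combines them into a single sequence of `input_ids`, and the `token_type_ids` mark which sentence each token belongs to: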

In [None]:
# Exercise
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
ele15 = raw_datasets["train"][14]
tokenized_sentences_both = tokenizer(ele15["sentence1"], ele15["sentence2"])
for key in tokenized_sentences_both:
  print(key, tokenized_sentences_both[key]) # The token type id tells you which sentence the token came from

input_ids [101, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 2056, 1996, 2873, 2001, 4755, 4229, 5467, 1012, 102, 1996, 2132, 1997, 1996, 2334, 7071, 3131, 1010, 1043, 7677, 22637, 2002, 10993, 3917, 1010, 2056, 1996, 2873, 4062, 2018, 3478, 2000, 18235, 2094, 2417, 2644, 4597, 1012, 102]
token_type_ids [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
attention_mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Going back to the `inputs` from our earlier two-sentence example, if we <font color='blue'>decode</font> the <font color='blue'>IDs</font> inside `input_ids` <font color='blue'>back to words</font>, we get:

In [None]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']

So we see the <font color='blue'>model expects</font> the <font color='blue'>inputs</font> to be <font color='blue'>of the form</font> `[CLS]` sentence1 `[SEP]` sentence2 `[SEP]` when there are <font color='blue'>two sentences</font>. Aligning this with the `token_type_ids` gives us:

In [None]:
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))
print(inputs["token_type_ids"])

['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]


As you can see, the parts of the input corresponding to <font color='blue'>`[CLS]` sentence1 `[SEP]`</font> all have a <font color='blue'>token type ID</font> of <font color='blue'>0</font>, while the <font color='blue'>other parts</font>, corresponding to sentence2 `[SEP]`, all have a <font color='blue'>token type ID</font> of <font color='blue'>1</font>.
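The layout described above can be sketched in a few lines of plain Python. This illustrates the `[CLS]`/`[SEP]` layout only, not the tokenizer's actual implementation; `build_pair_inputs` is a hypothetical helper, and it works on already-split token strings rather than IDs.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Lay out two token lists the way BERT expects a sentence pair."""
    first_segment = ["[CLS]"] + tokens_a + ["[SEP]"]
    second_segment = tokens_b + ["[SEP]"]
    tokens = first_segment + second_segment
    # Segment 0 covers [CLS] sentence1 [SEP]; segment 1 covers sentence2 [SEP].
    token_type_ids = [0] * len(first_segment) + [1] * len(second_segment)
    return tokens, token_type_ids

tokens, token_type_ids = build_pair_inputs(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)
print(tokens)
print(token_type_ids)
```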

Note that if you select a <font color='blue'>different checkpoint</font>, you won't necessarily have the <font color='blue'>`token_type_ids`</font> in <font color='blue'>your tokenized inputs</font> (for instance, they're not returned if you use a DistilBERT model). They are <font color='blue'>only returned</font> when the <font color='blue'>model</font> will <font color='blue'>know what to do with them</font>, because it has <font color='blue'>seen them</font> during its <font color='blue'>pretraining</font>.

Here, <font color='blue'>BERT</font> is <font color='blue'>pretrained</font> with <font color='blue'>token type IDs</font>, and on top of the masked language modeling objective we talked about in [Chapter 1](https://huggingface.co/course/chapter1), it has an <font color='blue'>additional objective</font> called <font color='blue'>next sentence prediction</font>. The goal with this task is to <font color='blue'>model the relationship</font> between <font color='blue'>pairs of sentences</font>.

With next sentence prediction, the <font color='blue'>model</font> is provided <font color='blue'>pairs of sentences</font> (with randomly masked tokens) and <font color='blue'>asked to predict</font> whether the <font color='blue'>second</font> sentence <font color='blue'>follows the first</font>. To make the task non-trivial, <font color='blue'>half of the time</font> the <font color='blue'>sentences follow each other</font> in the original document they were extracted from, and the other half of the time the two sentences come from two different documents.
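As a rough illustration of how such training pairs could be assembled, the sketch below builds them from toy documents. It is a simplified picture under our own assumptions, not BERT's actual data pipeline: the `make_nsp_pairs` helper, the toy documents, and the coin flip via Python's `random` module are all illustrative, and the token masking is omitted.

```python
import random

def make_nsp_pairs(documents, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples from lists of sentences."""
    rng = random.Random(seed)
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that really follows.
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: a sentence drawn from a different document.
                other_docs = [d for j, d in enumerate(documents) if j != doc_index]
                pairs.append((doc[i], rng.choice(rng.choice(other_docs)), False))
    return pairs

documents = [
    ["The sky is blue.", "It rarely rains here.", "Summers are long."],
    ["Bread needs flour.", "Yeast makes it rise.", "Bake it until golden."],
]
for sentence_a, sentence_b, is_next in make_nsp_pairs(documents):
    print(is_next, "|", sentence_a, "->", sentence_b)
```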

In general, you <font color='blue'>don't need to worry</font> about whether or not there are `token_type_ids` in your tokenized inputs: <font color='blue'>**as long as you use the same checkpoint for the tokenizer and the model**</font>, everything will be fine as the tokenizer knows what to provide to its model.

Now that we have seen how our tokenizer can deal with <font color='blue'>one pair of sentences</font>, we can use it to <font color='blue'>tokenize our whole dataset</font>: like in the [previous chapter](https://huggingface.co/course/chapter2), we can <font color='blue'>feed</font> the tokenizer a <font color='blue'>list of pairs of sentences</font> by giving it the list of first sentences, then the list of second sentences. This is also compatible with the padding and truncation options we saw in [Chapter 2](https://huggingface.co/course/chapter2). So, one way to preprocess the training dataset is:

In [15]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

This works well, but it has the <font color='blue'>disadvantage</font> of <font color='blue'>returning a dictionary</font> (with the keys `input_ids`, `attention_mask`, and `token_type_ids`, and values that are lists of lists). It will also only work if you have <font color='blue'>enough RAM</font> to <font color='blue'>store</font> your <font color='blue'>whole dataset</font> during the tokenization (whereas the datasets from the 🤗 Datasets library are [Apache Arrow](https://arrow.apache.org/) files stored on the disk, so you only keep the samples you ask for loaded in memory).

To <font color='blue'>keep</font> the <font color='blue'>data as a dataset</font>, we will use the [Dataset.map()](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map) method. This also allows us some <font color='blue'>extra flexibility</font>, if we need <font color='blue'>more preprocessing</font> done than <font color='blue'>just tokenization</font>. The `map()` method works by <font color='blue'>applying a function</font> on <font color='blue'>each element</font> of the <font color='blue'>dataset</font>, so let's define a function that tokenizes our inputs:

In [16]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

This function takes a <font color='blue'>dictionary</font> (like the items of our dataset) and <font color='blue'>returns</font> a <font color='blue'>new dictionary</font> with the <font color='blue'>keys</font> `input_ids`, `attention_mask`, and `token_type_ids`. Note that it <font color='blue'>also works</font> if the example dictionary contains <font color='blue'>several samples</font> (each key as a list of sentences) since the <font color='blue'>tokenizer</font> works on <font color='blue'>lists of pairs</font> of <font color='blue'>sentences</font>, as seen before. This will allow us to use the option <font color='blue'>`batched=True`</font> in our call to `map()`, which will greatly <font color='blue'>speed up</font> the <font color='blue'>tokenization</font>. The <font color='blue'>**tokenizer**</font> is <font color='blue'>**backed by a tokenizer written in Rust**</font> from the [🤗 Tokenizers](https://github.com/huggingface/tokenizers) library. This tokenizer can be very fast, but only if we give it lots of inputs at once.

Note that we've <font color='blue'>left</font> the <font color='blue'>padding argument out</font> in our tokenization function for now. This is because <font color='blue'>padding all the samples</font> to the <font color='blue'>maximum length</font> is <font color='blue'>not efficient</font>: it's better to <font color='blue'>pad the samples</font> when we're <font color='blue'>building a batch</font>, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!

Here is how we <font color='blue'>apply</font> the <font color='blue'>tokenization function</font> on <font color='blue'>all our datasets</font> at once. We're using `batched=True` in our call to `map()` so the function is <font color='blue'>applied to multiple elements</font> of our <font color='blue'>dataset at once</font>, and not on each element separately. This allows for faster preprocessing.

The way the 🤗 Datasets library applies this processing is by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function:

In [19]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

You can even use <font color='blue'>multiprocessing</font> when <font color='blue'>applying your preprocessing function</font> with `map()` by passing along a <font color='blue'>`num_proc`</font> argument. We didn't do this here because the 🤗 Tokenizers library already <font color='blue'>uses multiple threads</font> to tokenize our samples faster, but if you are not using a fast tokenizer backed by this library, this could speed up your preprocessing.

Our `tokenize_function` returns a dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`, so those <font color='blue'>three fields</font> are <font color='blue'>added to all splits</font> of our <font color='blue'>dataset</font>. Note that we could also have <font color='blue'>changed existing fields</font> if our <font color='blue'>preprocessing function</font> returned a <font color='blue'>new value</font> for an <font color='blue'>existing key</font> in the dataset to which we applied `map()`.
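To see why a function written for one example also works with `batched=True`, and how returned keys become columns, here is a toy in-memory imitation of a batched map. It is a sketch under our own assumptions: `toy_batched_map` and `toy_tokenize` are hypothetical helpers, whitespace splitting stands in for a real tokenizer, and the real `Dataset.map()` does far more (Arrow storage, caching, multiprocessing).

```python
def toy_batched_map(columns, function, batch_size=2):
    """Apply function to slices of a column-oriented table and merge the results."""
    num_rows = len(next(iter(columns.values())))
    result = {}
    for start in range(0, num_rows, batch_size):
        # In batched mode the function receives a dict of lists, not a single row.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        # Returned keys add new columns or overwrite existing ones.
        merged = {**batch, **function(batch)}
        for name, values in merged.items():
            result.setdefault(name, []).extend(values)
    return result

def toy_tokenize(batch):
    """Whitespace 'tokenizer' applied to each sentence pair in the batch."""
    tokens = [
        (first + " " + second).split()
        for first, second in zip(batch["sentence1"], batch["sentence2"])
    ]
    return {"tokens": tokens}

table = {
    "sentence1": ["A cat sat.", "Dogs bark.", "It rained."],
    "sentence2": ["A cat was sitting.", "Dogs are loud.", "The sun shone."],
    "label": [1, 1, 0],
}
mapped = toy_batched_map(table, toy_tokenize)
print(list(mapped.keys()))
print(mapped["tokens"][0])
```

Because `toy_tokenize` only indexes columns by name and zips them, it does not care whether it is handed one batch of two rows or one batch of a thousand, which is the same property that lets `tokenize_function` run in batched mode.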

The last thing we will need to do is <font color='blue'>pad all the examples</font> to the <font color='blue'>length of the longest element</font> when we batch elements together — a technique we refer to as <font color='blue'>dynamic padding</font>.

**Dynamic padding**

The function that is responsible for putting together samples inside a batch is called a <font color='blue'>collate function</font>. It's an <font color='blue'>argument</font> you can <font color='blue'>pass</font> when you build a <font color='blue'>DataLoader</font>, the <font color='blue'>default</font> being a <font color='blue'>function</font> that will just <font color='blue'>convert your samples</font> to <font color='blue'>PyTorch tensors</font> and <font color='blue'>concatenate</font> them (recursively if your elements are lists, tuples, or dictionaries). This <font color='blue'>won't</font> be possible in our case since the <font color='blue'>inputs</font> we have <font color='blue'>won't all be</font> of the <font color='blue'>same size</font>. We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding. This will speed up training by quite a bit, but note that if you're training on a TPU it can cause problems — TPUs prefer fixed shapes, even when that requires extra padding.

To do this in practice, we have to define a <font color='blue'>collate function</font> that will apply the <font color='blue'>correct amount of padding</font> to the <font color='blue'>items of the dataset</font> we want to <font color='blue'>batch together</font>. Fortunately, the 🤗 Transformers library provides us with such a function via <font color='blue'>`DataCollatorWithPadding`</font>. It <font color='blue'>takes a tokenizer</font> when you <font color='blue'>instantiate it</font> (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need:

In [20]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

To test this new toy, let's <font color='blue'>grab</font> a <font color='blue'>few samples</font> from our <font color='blue'>training set</font> that we would like to <font color='blue'>batch together</font>. Here, we <font color='blue'>remove</font> the <font color='blue'>columns `idx`, `sentence1`, and `sentence2`</font> as they <font color='blue'>won't be needed</font> (the two sentence columns <font color='blue'>contain strings</font>, and we can't create tensors with strings), and have a look at the length of each entry in the batch:

In [21]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

No surprise, we get <font color='blue'>samples</font> of <font color='blue'>varying length</font>, from <font color='blue'>32 to 67</font>. <font color='blue'>Dynamic padding</font> means the <font color='blue'>samples in this batch</font> should all be <font color='blue'>padded</font> to a <font color='blue'>length of 67</font>, the <font color='blue'>maximum length</font> inside the <font color='blue'>batch</font>. Without dynamic padding, all of the samples would have to be padded to the maximum length in the whole dataset, or the maximum length the model can accept. Let's double-check that our `data_collator` is dynamically padding the batch properly:

In [22]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}
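To make the mechanics concrete, here is a minimal pure-Python sketch of what a padding collate function does. It is an illustration, not the implementation of `DataCollatorWithPadding`: the `pad_batch` helper is hypothetical, it assumes right-side padding with a pad token ID of 0 (BERT's `[PAD]`), and it returns plain lists rather than tensors.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in the batch to the length of its longest member."""
    max_length = max(len(ids) for ids in features["input_ids"])
    batch = {"input_ids": [], "attention_mask": []}
    for ids in features["input_ids"]:
        num_padding = max_length - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * num_padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(ids) + [0] * num_padding)
    return batch

# Dummy sequences with the same lengths as the eight samples above.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
features = {"input_ids": [[101] + [2023] * (n - 2) + [102] for n in lengths]}

padded = pad_batch(features)
print([len(ids) for ids in padded["input_ids"]])

# Padding tokens added for this batch, versus padding to BERT's maximum length of 512.
print(sum(max(lengths) - n for n in lengths), "vs", sum(512 - n for n in lengths))
```

The second `print` compares how many padding tokens each strategy adds for the same eight samples, which is where the savings from dynamic padding come from.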

Looking good! Now that we've gone from <font color='blue'>raw text</font> to <font color='blue'>batches</font> our model can deal with, we're <font color='blue'>ready to fine-tune</font> it!

**Try it out!** Replicate the preprocessing on the GLUE SST-2 dataset. It's a little bit different since it's composed of single sentences instead of pairs, but the rest of what we did should look the same. For a harder challenge, try to write a preprocessing function that works on any of the GLUE tasks.

**Preprocessing the GLUE SST-2 dataset:** We will go through the process of preprocessing the [GLUE SST-2 dataset](https://huggingface.co/datasets/gimmaru/glue-sst2) using the 🤗 Datasets and 🤗 Transformers libraries. SST-2 is a single-sentence text classification task, which makes it slightly different from GLUE tasks such as MRPC that involve pairs of sentences. We'll cover loading the dataset, tokenization, and dynamic padding.

**1. Loading the Dataset:**

We'll start by loading the SST-2 dataset from the 🤗 Datasets library. This dataset consists of single sentences along with their corresponding labels.



In [24]:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "sst2")

**2. Tokenization and Preprocessing:**

Now that we have the raw dataset, let's preprocess the data by tokenizing the sentences using a pretrained tokenizer. We'll define a tokenization function and then apply it to the entire dataset.

In [27]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

# Tokenize the entire dataset
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)


**3. Dynamic Padding:**

Dynamic padding allows us to pad the batch to the length of the longest sequence within that batch, instead of padding to the maximum sequence length in the entire dataset. This improves efficiency by reducing unnecessary padding.

In [28]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Example: Select a few samples from the training set
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence"]}

# Apply dynamic padding using data_collator
batch = data_collator(samples)

{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 29]),
 'token_type_ids': torch.Size([8, 29]),
 'attention_mask': torch.Size([8, 29]),
 'labels': torch.Size([8])}

The `batch` dictionary now contains the keys `input_ids`, `attention_mask`, `token_type_ids`, and `labels`, each mapped to a PyTorch tensor. The `attention_mask` marks which positions hold real tokens (1) and which hold padding (0), while `token_type_ids` distinguish between the two sentences in paired tasks. Since SST-2 has a single sentence per example, they are all 0 here.
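For the harder challenge mentioned earlier, one possible approach is to look up the text column names for each task. The sketch below assumes the column names used by the `glue` dataset on the Hub (worth double-checking with `raw_datasets["train"].features`). `make_tokenize_function` is a hypothetical helper, and `fake_tokenizer` is a stand-in so the example runs without downloading anything.

```python
# Text columns per GLUE task (assumed from the Hub's "glue" dataset; verify before use).
GLUE_TASK_TO_KEYS = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

def make_tokenize_function(task, tokenizer):
    """Return a function usable with Dataset.map(..., batched=True) for the given task."""
    first_key, second_key = GLUE_TASK_TO_KEYS[task]

    def tokenize_function(example):
        if second_key is None:
            # Single-sentence task such as SST-2.
            return tokenizer(example[first_key], truncation=True)
        # Sentence-pair task such as MRPC.
        return tokenizer(example[first_key], example[second_key], truncation=True)

    return tokenize_function

def fake_tokenizer(first, second=None, truncation=False):
    """Stand-in tokenizer that simply reports what it was called with."""
    return {"first": first, "second": second}

print(make_tokenize_function("sst2", fake_tokenizer)({"sentence": ["great movie"]}))
print(make_tokenize_function("mrpc", fake_tokenizer)(
    {"sentence1": ["a"], "sentence2": ["b"]}
))
```

With the real tokenizer, the call would look like `raw_datasets.map(make_tokenize_function("sst2", tokenizer), batched=True)`.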