In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
os.environ["WANDB_ENTITY"] = "clif"
os.environ["WANDB_PROJECT"] = "adapters"

In [2]:
import torch
from torch.utils.data import DataLoader

from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
    AdamW,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    set_seed,
)

from tqdm.auto import tqdm

import datasets
import transformers



This notebook can run with any model checkpoint on the [model hub](https://huggingface.co/models) that has a version with a classification head. Here we select [`bert-base-cased`](https://huggingface.co/bert-base-cased).

In [3]:
model_checkpoint = "bert-base-cased"

The next two sections explain how we load and prepare our data for our model. If you are only interested in seeing how 🤗 Accelerate works, feel free to skip them (but make sure to execute all cells!).

## Load the data

To load the dataset, we use the `load_dataset` function from 🤗 Datasets. It will download and cache it (so the download won't happen if we restart the notebook).

In [4]:
raw_datasets = load_dataset("glue", "mrpc")

In [5]:
raw_datasets["train"][0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.
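
One possible version of such a function is sketched below. It is an assumption rather than the original helper: the name `pick_random_examples` and its parameters are hypothetical, and it relies only on `len()` and integer indexing, which `raw_datasets["train"]` supports. A small toy list stands in for the real dataset so the sketch runs on its own.

```python
import random


def pick_random_examples(dataset, num_examples=3, seed=None):
    """Return `num_examples` distinct, randomly picked rows of `dataset`.

    `dataset` can be any object that supports len() and integer indexing.
    """
    if num_examples > len(dataset):
        raise ValueError("Cannot pick more examples than the dataset contains.")
    rng = random.Random(seed)
    # Sample indices without replacement so no row is shown twice.
    indices = rng.sample(range(len(dataset)), num_examples)
    return [dataset[i] for i in indices]


# Toy stand-in for raw_datasets["train"] (hypothetical rows, same keys as MRPC).
toy_dataset = [
    {"sentence1": "A cat sat.", "sentence2": "A cat was sitting.", "label": 1, "idx": 0},
    {"sentence1": "It rained.", "sentence2": "The sun was out.", "label": 0, "idx": 1},
    {"sentence1": "He left early.", "sentence2": "He departed early.", "label": 1, "idx": 2},
]

for example in pick_random_examples(toy_dataset, num_examples=2, seed=0):
    print(example)
```

On the real data, the call would be `pick_random_examples(raw_datasets["train"])`.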

## Preprocess the data

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

By default (unless you pass `use_fast=False` to the call above), it will use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, but if the previous call raised an error, pass `use_fast=False` to fall back to the slow tokenizer.

You can directly call this tokenizer on one sentence or a pair of sentences:

In [7]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it.")

{'input_ids': [101, 8667, 117, 1142, 1141, 5650, 106, 102, 1262, 1142, 5650, 2947, 1114, 1122, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [8]:
def tokenize_function(examples):
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length", max_length=128)
    return outputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [9]:
tokenize_function(raw_datasets['train'][:5])

{'input_ids': [[101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, 1270, 107, 1103, 7737, 107, 117, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 11336, 6732, 3384, 1106, 1140, 1112, 1178, 107, 1103, 7737, 107, 117, 7277, 2180, 5303, 4806, 1117, 1711, 1104, 9938, 4267, 12223, 21811, 1117, 2554, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 10684, 2599, 9717, 1161, 2205, 11288, 1377, 112, 188, 1196, 4147, 1103, 4129, 1106, 19770, 2787, 1107, 1772, 1111, 109, 123, 119, 126, 3775, 119, 102, 10684, 2599, 9717, 1161, 3306, 11288, 1377, 112, 188, 1107, 1876, 1111, 109, 5691, 1495, 1550, 1105, 1962, 1122, 1106, 19770, 2787, 1111, 109, 122, 119, 129, 3775, 1107, 1772, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

To apply this function to all the sentences (or pairs of sentences) in our dataset, we use the `map` method of the `raw_datasets` object we created earlier. This applies the function to all the elements of all the splits in `raw_datasets`, so our training, validation and testing data will be preprocessed in a single command.


In [10]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, remove_columns=["idx", "sentence1", "sentence2"])

In [11]:
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

To double-check we only have columns that are accepted as arguments for the model we will instantiate, we can look at them here.

In [12]:
tokenized_datasets["train"].features

{'labels': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

The model we will be using is a `BertForSequenceClassification`. We can check its signature in the [Transformers documentation](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertForSequenceClassification) and all seems to be right! The last step is to set our datasets to the `"torch"` format, so that each item is now a dictionary with tensor values.

In [13]:
tokenized_datasets.set_format("torch")

## A first look at the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is sequence classification, we use the `AutoModelForSequenceClassification` class. As with the tokenizer, the `from_pretrained` method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (which is 2 here):

In [14]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
import adapters

adapters.init(model)

model.add_adapter("my_adapter")
model.train_adapter("my_adapter")

print(model.adapter_summary())

Name                     Architecture         #Param      %Param  Active   Train
--------------------------------------------------------------------------------
my_adapter               bottleneck          894,528       0.826       1       1
--------------------------------------------------------------------------------
Full model                               108,310,272     100.000               0
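
The parameter count in this summary can be reproduced by hand. The sketch below assumes, rather than reads from the library, that the default configuration places one bottleneck adapter (a down-projection and an up-projection, each with a bias) in each of BERT-base's 12 layers, with hidden size 768 and a reduction factor of 16, giving a bottleneck width of 48. Under these assumptions the arithmetic matches the 894,528 figure above; `bottleneck_adapter_params` is a hypothetical helper, not part of the `adapters` package.

```python
def bottleneck_adapter_params(hidden_size, reduction_factor, num_layers):
    """Count weights and biases of one down/up projection pair per layer."""
    bottleneck = hidden_size // reduction_factor
    down_projection = hidden_size * bottleneck + bottleneck  # weight + bias
    up_projection = bottleneck * hidden_size + hidden_size   # weight + bias
    return num_layers * (down_projection + up_projection)


# Assumed BERT-base dimensions and default reduction factor (see lead-in).
adapter_params = bottleneck_adapter_params(hidden_size=768, reduction_factor=16, num_layers=12)
full_model_params = 108_310_272  # "Full model" row of the summary above

print(adapter_params)                                      # 894528
print(round(100 * adapter_params / full_model_params, 3))  # 0.826
```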


In [16]:
def create_dataloaders(train_batch_size=8, eval_batch_size=32):
    train_dataloader = DataLoader(
        tokenized_datasets["train"], shuffle=True, batch_size=train_batch_size
    )
    eval_dataloader = DataLoader(
        tokenized_datasets["validation"], shuffle=False, batch_size=eval_batch_size
    )
    return train_dataloader, eval_dataloader

Let's create our training and evaluation dataloaders with the default batch sizes.

In [17]:
train_dataloader, eval_dataloader = create_dataloaders()

The last piece we will need for the model evaluation is the metric. The `datasets` library provides a function `load_metric` that allows us to easily create a `datasets.Metric` object we can use.

In [18]:
metric = load_metric("glue", "mrpc")

  metric = load_metric("glue", "mrpc")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


To use this object on some predictions, we call the `compute` method with `predictions` and `references` to get our metric results.
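
For MRPC the result is a dictionary with `accuracy` and `f1`, as in the training output at the end of this notebook. The plain-Python sketch below is not the 🤗 Datasets implementation; it only illustrates what these two numbers measure. The helper `accuracy_and_f1` is hypothetical, and it assumes label 1 (`equivalent`) is the positive class.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Compute accuracy and binary F1 for two equally long label sequences."""
    if len(predictions) != len(references) or len(predictions) == 0:
        raise ValueError("predictions and references must be non-empty and equally long.")
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    true_pos = sum(p == positive_label and r == positive_label for p, r in pairs)
    false_pos = sum(p == positive_label and r != positive_label for p, r in pairs)
    false_neg = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Toy predictions: three of four are correct, with one false positive.
result = accuracy_and_f1(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)
```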

## Fine-tuning the model

We are now ready to fine-tune this model on our dataset. Everything related to training needs to be in one big training function, which `notebook_launcher` will execute on each process (each TPU core, or a single GPU as in the run below).

It will use this dictionary of hyperparameters, so tweak anything you like in here!

In [19]:
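hyperparameters = {
    "learning_rate": 1e-4,
    "num_epochs": 1,
    "train_batch_size": 8,  # Per-process size; the effective batch size is this x the number of processes (x 8 on a TPU)
    "eval_batch_size": 32,  # Per-process size; the effective batch size is this x the number of processes (x 8 on a TPU)
    "seed": 42,
}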
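
To see how the batch size and the number of processes determine the length of an epoch, here is a small arithmetic sketch. It assumes the MRPC training split has 3,668 examples, which is consistent with the 459 steps shown by the progress bar at the end of this notebook; `steps_per_epoch` is a hypothetical helper, not an Accelerate function.

```python
import math


def steps_per_epoch(num_examples, per_process_batch_size, num_processes=1):
    """Number of optimizer steps each process performs in one epoch."""
    effective_batch_size = per_process_batch_size * num_processes
    return math.ceil(num_examples / effective_batch_size)


NUM_TRAIN_EXAMPLES = 3668  # assumed size of the MRPC training split

print(steps_per_epoch(NUM_TRAIN_EXAMPLES, per_process_batch_size=8, num_processes=1))  # 459
print(steps_per_epoch(NUM_TRAIN_EXAMPLES, per_process_batch_size=8, num_processes=8))  # 58
```

With eight processes, each step consumes 64 examples, so the same epoch takes far fewer steps and the learning rate or warmup may need retuning.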

The two most important things to remember for training on TPUs are that your `Accelerator` object has to be defined inside your training function, and that your model should be created outside it.

If you define your `Accelerator` in another cell that gets executed before the final launch (for debugging), you will need to restart your notebook, as the line `accelerator = Accelerator()` needs to be executed for the first time inside the training function spawned on each TPU core.

This is because that line will look for a TPU device, and if you set it outside of the distributed training launched by `notebook_launcher`, it will perform setup that cannot be undone in your runtime and you will only have access to one TPU core until you restart the notebook.

The reason we declare the model outside the training function is that, on a TPU launched from a notebook, the same single model object is used and is passed back and forth between all the cores automatically.

Since we can't explore each piece in separate cells, comments have been left in the code. This is all pretty standard and you will notice how little the code changes from a regular training loop! The main lines added are:

- `accelerator = Accelerator()` to initialize the distributed setup,
- send all objects to `accelerator.prepare`,
- replace `loss.backward()` with `accelerator.backward(loss)`,
- use `accelerator.gather` to gather all predictions and labels before storing them in our list of predictions/labels,
- truncate predictions and labels as the prepared evaluation dataloader has a few more samples to make batches of the same size on each process.

The first three are for distributed training, the last two for distributed evaluation. If you don't care about distributed evaluation, you can also just replace that part by your standard evaluation loop launched on the main process only.
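
The truncation step is easier to follow with numbers. The sketch below is a plain-Python simulation, not Accelerate's actual sharding code: it assumes every process must receive full batches of the same size and that missing samples are filled by wrapping around to the start of the dataset. The helper `simulate_gathered_indices` is hypothetical; the figure of 408 validation examples matches the accuracy reported at the end of this notebook (301/408).

```python
import math


def simulate_gathered_indices(num_samples, per_process_batch_size, num_processes):
    """Sample indices obtained after gathering equally sized batches from all processes."""
    if num_processes == 1:
        # A single process sees the dataset exactly once, so nothing is padded.
        return list(range(num_samples))
    global_batch_size = per_process_batch_size * num_processes
    num_steps = math.ceil(num_samples / global_batch_size)
    # Assumption: the last global batch is completed by wrapping around to the start.
    return [i % num_samples for i in range(num_steps * global_batch_size)]


num_validation = 408
gathered = simulate_gathered_indices(num_validation, per_process_batch_size=32, num_processes=8)
print(len(gathered))   # 512: more predictions than validation examples

# Slicing to the dataset length drops the duplicated samples at the end.
truncated = gathered[:num_validation]
print(len(truncated))  # 408
```

Without the slice, the duplicated samples would be counted twice in the metric.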

Other changes (which are purely cosmetic to make the output of the training readable) are:

- some logging behavior behind an `if accelerator.is_main_process:` check,
- disable the progress bar if `accelerator.is_main_process` is `False`,
- use `accelerator.print` instead of `print`.

In [20]:
def training_function(model):
    # Initialize accelerator
    accelerator = Accelerator()

    # To have only one message (and not 8) per logs of Transformers or Datasets, we set the logging verbosity
    # to INFO for the main process only.
    if accelerator.is_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    train_dataloader, eval_dataloader = create_dataloaders(
        train_batch_size=hyperparameters["train_batch_size"], eval_batch_size=hyperparameters["eval_batch_size"]
    )
    # Set the seed for reproducibility (data shuffling, dropout). The model and its randomly initialized
    # head are created outside this function, so this call does not affect them.
    set_seed(hyperparameters["seed"])

    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=hyperparameters["learning_rate"])

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )

    num_epochs = hyperparameters["num_epochs"]
    # Instantiate learning rate scheduler after preparing the training dataloader as the prepare method
    # may change its length.
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=100,
        num_training_steps=len(train_dataloader) * num_epochs,
    )

    # Instantiate a progress bar to keep track of training. Note that we only enable it on the main
    # process to avoid having 8 progress bars.
    progress_bar = tqdm(range(num_epochs * len(train_dataloader)), disable=not accelerator.is_main_process)
    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

        model.eval()
        all_predictions = []
        all_labels = []

        for step, batch in enumerate(eval_dataloader):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)

            # We gather predictions and labels from the 8 TPUs to have them all.
            all_predictions.append(accelerator.gather(predictions))
            all_labels.append(accelerator.gather(batch["labels"]))

        # Concatenate all predictions and labels.
        # The last thing we need to do is to truncate the predictions and labels we concatenated
        # together as the prepared evaluation dataloader has a little bit more elements to make
        # batches of the same size on each process.
        all_predictions = torch.cat(all_predictions)[:len(tokenized_datasets["validation"])]
        all_labels = torch.cat(all_labels)[:len(tokenized_datasets["validation"])]

        eval_metric = metric.compute(predictions=all_predictions, references=all_labels)

        # Use accelerator.print to print only on the main process.
        accelerator.print(f"epoch {epoch}:", eval_metric)
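
The learning-rate schedule built by `get_linear_schedule_with_warmup` in the function above can be written out in a few lines. The sketch below mirrors the documented behavior (linear increase from 0 during warmup, then linear decay to 0) in plain Python; `linear_warmup_multiplier` is a hypothetical name, and the 459 total steps are taken from the single-GPU run in this notebook.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 down to 0 at the last training step.
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-4             # hyperparameters["learning_rate"]
num_warmup_steps = 100     # value passed in training_function
num_training_steps = 459   # steps in the single-GPU, one-epoch run

for step in (0, 50, 100, 280, 459):
    lr = base_lr * linear_warmup_multiplier(step, num_warmup_steps, num_training_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With only 459 steps in total, the 100 warmup steps cover roughly a fifth of training, which is worth keeping in mind when changing `num_epochs` or the batch size.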

And we're ready for launch! It's super easy with the `notebook_launcher` from the Accelerate library.

In [23]:
from accelerate import notebook_launcher

notebook_launcher(training_function, (model,), num_processes=1)

Launching training on one GPU.




100%|██████████| 459/459 [00:40<00:00, 11.22it/s]

epoch 0: {'accuracy': 0.7377450980392157, 'f1': 0.826580226904376}




