<a href="https://colab.research.google.com/github/WZX1998/MT/blob/master/Copy_of_NAACL_2019_Tutorial_on_Transfer_Learning_in_Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook accompanying the NAACL 2019 tutorial on "Transfer Learning in Natural Language Processing"

The tutorial will be given on June 2 at NAACL 2019 in Minneapolis, MN, USA by [Sebastian Ruder](http://ruder.io/), [Matthew Peters](https://www.linkedin.com/in/petersmatthew), [Swabha Swayamdipta](http://www.cs.cmu.edu/~sswayamd/index.html) and [Thomas Wolf](http://thomwolf.io/).

You can check the [webpage](https://naacl2019.org/program/tutorials/) of NAACL tutorials for more information.

Further material: [slides](http://tiny.cc/NAACLTransfer) and [code](http://tiny.cc/NAACLTransferCode).



# Running the notebook

This notebook is shared in `view only mode`.
If you want to run the cells inside it, you should either:

- click on `open in playground mode` in the `file` menu (your changes won't be saved, though), or
- save a copy in your drive that you can open in `edit mode` to be able to save your changes.

## Install and notebook preparation

In [1]:
!pip install pytorch-pretrained-bert pytorch-ignite ipdb



# Introduction

This notebook accompanies the tutorial given at NAACL 2019 on Transfer Learning in Natural Language Processing. It presents the full workflow of a transfer learning approach in NLP.

But first, why do we use transfer learning in Natural Language Processing?

* Many NLP tasks share common knowledge about language (e.g. linguistic representations, structural similarities).
* Tasks can inform each other, e.g. syntax and semantics.
* Annotated data is rare, so we want to make use of as much supervision as is available.
* Empirically, transfer learning has resulted in SOTA for many supervised NLP tasks (e.g. classification, information extraction, Q&A).

What this tutorial is about:

* Goal: provide a broad overview of transfer methods in NLP, focusing on the most empirically successful methods as of today (mid-2019).
* Provide practical, hands-on advice → by the end of the tutorial, everyone should be able to apply recent advances to a text classification task.

What this is not:
* Comprehensive (it’s impossible to cover all related papers in one tutorial!)
* Multilingual: this tutorial mostly covers work done in English; extensibility to other languages depends on the availability of supervision.

# Colab and codebase

The [github repo](http://tiny.cc/NAACLTransferCode) and its notebook version, the present Colab notebook, try to present a few of the major transfer learning techniques that have emerged over the past years in the simplest and most compact way. The code does not attempt to be state-of-the-art, although effort has been made to achieve reasonable performance and, with limited modifications, to be competitive with the current state of the art.

Special effort has been made to

* ensure the code can be used as easily as possible, in particular by hosting pretrained models and datasets,
* keep the code as compact and self-contained as possible to make it easy to manipulate and understand.

# Hands-on #1: Pretraining

## Our model

So, let's start by creating the model that will be the backbone of our work. We will pretrain this model and use it to experiment with various transfer learning schemes.

In the tutorial, we decided to use an architecture that is almost exactly the GPT-2 architecture (the only difference is the use of a ReLU non-linearity instead of GELU, for code conciseness).

In [0]:
import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, embed_dim, hidden_dim, num_embeddings, num_max_positions, num_heads, num_layers, dropout, causal):
        super().__init__()
        self.causal = causal
        self.tokens_embeddings = nn.Embedding(num_embeddings, embed_dim)
        self.position_embeddings = nn.Embedding(num_max_positions, embed_dim)
        self.dropout = nn.Dropout(dropout)

        self.attentions, self.feed_forwards = nn.ModuleList(), nn.ModuleList()
        self.layer_norms_1, self.layer_norms_2 = nn.ModuleList(), nn.ModuleList()
        for _ in range(num_layers):
            self.attentions.append(nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout))
            self.feed_forwards.append(nn.Sequential(nn.Linear(embed_dim, hidden_dim),
                                                    nn.ReLU(),
                                                    nn.Linear(hidden_dim, embed_dim)))
            self.layer_norms_1.append(nn.LayerNorm(embed_dim, eps=1e-12))
            self.layer_norms_2.append(nn.LayerNorm(embed_dim, eps=1e-12))

    def forward(self, x, padding_mask=None):
        """ x has shape [seq length, batch], padding_mask has shape [batch, seq length] """
        positions = torch.arange(len(x), device=x.device).unsqueeze(-1)
        h = self.tokens_embeddings(x)
        h = h + self.position_embeddings(positions).expand_as(h)
        h = self.dropout(h)

        attn_mask = None
        if self.causal:
            attn_mask = torch.full((len(x), len(x)), -float('Inf'), device=h.device, dtype=h.dtype)
            attn_mask = torch.triu(attn_mask, diagonal=1)

        for layer_norm_1, attention, layer_norm_2, feed_forward in zip(self.layer_norms_1, self.attentions,
                                                                       self.layer_norms_2, self.feed_forwards):
            h = layer_norm_1(h)
            x, _ = attention(h, h, h, attn_mask=attn_mask, need_weights=False, key_padding_mask=padding_mask)
            x = self.dropout(x)
            h = x + h

            h = layer_norm_2(h)
            x = feed_forward(h)
            x = self.dropout(x)
            h = x + h
        return h
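
To make the causal masking in `forward` more concrete, here is a minimal standalone sketch of the same `torch.full` + `torch.triu` construction on a toy sequence of length 4. The all-zero attention scores are an assumption chosen only to make the resulting weights easy to read.

```python
import torch

seq_len = 4

# Start from a matrix full of -inf, then keep only the strictly upper triangle:
# entry [i, j] is -inf when j > i (a future position) and 0 otherwise.
attn_mask = torch.full((seq_len, seq_len), float("-inf"))
attn_mask = torch.triu(attn_mask, diagonal=1)

# Adding the mask to the attention scores before the softmax gives
# future positions a weight of exactly zero.
scores = torch.zeros(seq_len, seq_len)  # toy scores, all equal
weights = torch.softmax(scores + attn_mask, dim=-1)

print(attn_mask)
print(weights)
```

Each row of `weights` spreads its attention over the current and previous positions only, which is what lets the model be trained to predict the next token.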

## Pretraining

To pretrain our model, we need to:
* add a head on top of our model hidden states: we choose a language modeling head with tied weights,
* initialize the weights, and
* choose and define a loss function: we choose a simple cross-entropy loss.
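
As a minimal sketch of what "tied weights" means in the first bullet: the output projection reuses the very same parameter tensor as the input embedding matrix, so no extra output matrix is learned. The toy sizes below are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 16  # toy sizes, not the tutorial's values

embedding = nn.Embedding(vocab_size, embed_dim)         # weight shape: [vocab, embed]
lm_head = nn.Linear(embed_dim, vocab_size, bias=False)  # weight shape: [vocab, embed]

# Tie the weights: both modules now point to the same Parameter.
lm_head.weight = embedding.weight

hidden_states = torch.randn(5, 2, embed_dim)  # [seq length, batch, embed_dim]
logits = lm_head(hidden_states)               # [seq length, batch, vocab_size]

shares_storage = lm_head.weight.data_ptr() == embedding.weight.data_ptr()
print(logits.shape, shares_storage)
```

Because the two shapes match, the assignment works without any transposition, and a gradient step on one module updates the other as well.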

We add these elements with a *pretraining* module extending our *Transformer* model:

In [0]:
class TransformerWithLMHead(nn.Module):
    def __init__(self, config):
        """ Transformer with a language modeling head on top (tied weights) """
        super().__init__()
        self.config = config
        self.transformer = Transformer(config.embed_dim, config.hidden_dim, config.num_embeddings,
                                       config.num_max_positions, config.num_heads, config.num_layers,
                                       config.dropout, causal=not config.mlm)

        self.lm_head = nn.Linear(config.embed_dim, config.num_embeddings, bias=False)
        self.apply(self.init_weights)
        self.tie_weights()

    def tie_weights(self):
        self.lm_head.weight = self.transformer.tokens_embeddings.weight

    def init_weights(self, module):
        """ initialize weights - nn.MultiheadAttention is already initialized by PyTorch (xavier) """
        if isinstance(module, (nn.Linear, nn.Embedding, nn.LayerNorm)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if isinstance(module, (nn.Linear, nn.LayerNorm)) and module.bias is not None:
            module.bias.data.zero_()

    def forward(self, x, labels=None, padding_mask=None):
        """ x has shape [seq length, batch], padding_mask has shape [batch, seq length] """
        hidden_states = self.transformer(x, padding_mask)
        logits = self.lm_head(hidden_states)

        if labels is not None:
            shift_logits = logits[:-1] if self.transformer.causal else logits
            shift_labels = labels[1:] if self.transformer.causal else labels
            loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
            return logits, loss

        return logits
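
The label shifting in the causal branch of `forward` can be hard to picture, so here is a small standalone sketch of it on random tensors (the sizes are arbitrary assumptions). It also illustrates how `ignore_index=-1` removes positions from the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch_size, vocab_size = 6, 2, 11  # toy sizes
logits = torch.randn(seq_len, batch_size, vocab_size)
labels = torch.randint(0, vocab_size, (seq_len, batch_size))

# Position t predicts token t+1: drop the last logits and the first labels.
shift_logits = logits[:-1]
shift_labels = labels[1:]

loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
loss = loss_fct(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))

# Labels set to -1 (e.g. for padding) do not contribute to the loss.
masked_labels = shift_labels.clone()
masked_labels[-1, :] = -1
masked_loss = loss_fct(shift_logits.reshape(-1, vocab_size), masked_labels.reshape(-1))

# Same value as computing the loss on the non-ignored positions only.
reference = nn.CrossEntropyLoss()(shift_logits[:-1].reshape(-1, vocab_size),
                                  shift_labels[:-1].reshape(-1))
print(loss.item(), masked_loss.item(), reference.item())
```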

## Tokenizer

Let's start by preparing a pretraining dataset and creating our model for pretraining.

The following code elements are extracted from the pretraining script of the repo accompanying the tutorial: https://github.com/huggingface/naacl_transfer_learning_tutorial

To simplify pre-processing, we'll use a pre-defined open vocabulary tokenizer: the tokenizer of the BERT base cased model.

In [0]:
from pytorch_pretrained_bert import BertTokenizer, cached_path

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

100%|██████████| 213450/213450 [00:00<00:00, 5152242.06B/s]


## Configuration

We define a simple configuration (the repo uses argparse; here we use a namedtuple with similar attributes).

In [0]:
from collections import namedtuple

Config = namedtuple('Config',
  field_names="embed_dim, hidden_dim, num_max_positions, num_embeddings      , num_heads, num_layers," 
              "dropout, initializer_range, batch_size, lr, max_norm, n_epochs, n_warmup,"
              "mlm, gradient_accumulation_steps, device, log_dir, dataset_cache")
args = Config( 410      , 2100      , 256              , len(tokenizer.vocab), 10       , 16        ,
               0.1    , 0.02             , 16        , 2.5e-4, 1.0 , 50     , 1000    ,
               False, 4, "cuda" if torch.cuda.is_available() else "cpu", "./"   , "./dataset_cache.bin")
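
A few quantities follow directly from these values and are worth checking by hand. The sketch below uses a reduced, hypothetical `MiniConfig` that holds only the fields it needs, with values copied from the configuration above.

```python
from collections import namedtuple

# Hypothetical reduced config: only the fields used in the checks below.
MiniConfig = namedtuple(
    "MiniConfig",
    "embed_dim, num_heads, batch_size, gradient_accumulation_steps, num_max_positions")
cfg = MiniConfig(embed_dim=410, num_heads=10, batch_size=16,
                 gradient_accumulation_steps=4, num_max_positions=256)

# nn.MultiheadAttention requires embed_dim to be divisible by num_heads.
assert cfg.embed_dim % cfg.num_heads == 0
head_dim = cfg.embed_dim // cfg.num_heads

# With gradient accumulation, one optimizer step sees several mini-batches.
effective_batch_size = cfg.batch_size * cfg.gradient_accumulation_steps
tokens_per_update = effective_batch_size * cfg.num_max_positions

print(head_dim, effective_batch_size, tokens_per_update)
```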

## Preparing a pretraining dataset

We download a large dataset for pretraining: WikiText-103, with 103M tokens.

To go faster, we download an already tokenized version. This tokenized dataset is the cache file you obtain when running the training script of the tutorial repo.

In [0]:
dataset_file = cached_path("https://s3.amazonaws.com/datasets.huggingface.co/wikitext-103/"
                           "wikitext-103-train-tokenized-bert.bin")
datasets = torch.load(dataset_file)

# Convert our encoded dataset to torch.tensors and reshape in blocks of the transformer's input length
for split_name in ['train', 'valid']:
    tensor = torch.tensor(datasets[split_name], dtype=torch.long)
    num_sequences = (tensor.size(0) // args.num_max_positions) * args.num_max_positions
    datasets[split_name] = tensor.narrow(0, 0, num_sequences).view(-1, args.num_max_positions)

100%|██████████| 329949905/329949905 [00:06<00:00, 53136921.55B/s]
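
The truncate-and-reshape step above is easier to follow on a toy example. The sketch below applies the same `narrow` + `view` calls to a made-up stream of 10 token ids with a block size of 4 (both values are assumptions for illustration).

```python
import torch

block_size = 4             # stands in for args.num_max_positions
tokens = torch.arange(10)  # toy "dataset": one long stream of 10 token ids

# Keep the largest number of tokens that is a multiple of the block size...
usable_tokens = (tokens.size(0) // block_size) * block_size
# ...drop the remainder and reshape into rows of block_size tokens.
blocks = tokens.narrow(0, 0, usable_tokens).view(-1, block_size)

print(blocks)
```

The last two tokens do not fill a complete block and are discarded, so every row has exactly the input length the transformer expects.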


## Creating model and optimizer

In [0]:
model = TransformerWithLMHead(args).to(args.device)
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)

## Preparing our training loop

In [0]:
import os
from torch.utils.data import DataLoader
from ignite.engine import Engine, Events
from ignite.metrics import RunningAverage, Accuracy
from ignite.handlers import ModelCheckpoint
from ignite.contrib.handlers import CosineAnnealingScheduler, PiecewiseLinear, create_lr_scheduler_with_warmup, ProgressBar

train_dataloader = DataLoader(datasets['train'], batch_size=args.batch_size, shuffle=True)

# Define training function
def update(engine, batch):
    model.train()
    batch = batch.transpose(0, 1).contiguous().to(args.device)  # to shape [seq length, batch]
    logits, loss = model(batch, labels=batch)
    loss = loss / args.gradient_accumulation_steps
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_norm)
    if engine.state.iteration % args.gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()
trainer = Engine(update)

# Add progressbar with loss
RunningAverage(output_transform=lambda x: x).attach(trainer, "loss")
ProgressBar(persist=True).attach(trainer, metric_names=['loss'])

# Learning rate schedule: linearly warm-up to lr and then decrease the learning rate to zero with cosine
cos_scheduler = CosineAnnealingScheduler(optimizer, 'lr', args.lr, 0.0, len(train_dataloader) * args.n_epochs)
scheduler = create_lr_scheduler_with_warmup(cos_scheduler, 0.0, args.lr, args.n_warmup)
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)

# Save checkpoints and training config
checkpoint_handler = ModelCheckpoint(args.log_dir, 'checkpoint', save_interval=1, n_saved=5, require_empty=False)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {'mymodel': model})
torch.save(args, os.path.join(args.log_dir, 'training_args.bin'))
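
The `update` function above divides the loss by `args.gradient_accumulation_steps` and only steps the optimizer every few iterations. As a rough illustration of why that works, the sketch below checks that accumulating the scaled gradients of several equally sized micro-batches gives the same gradient as one large batch. The toy linear model, MSE loss and random data are assumptions chosen for brevity, and gradient clipping is left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
inputs, targets = torch.randn(8, 3), torch.randn(8, 1)
loss_fct = nn.MSELoss()  # averages over the examples of a batch
accumulation_steps = 4

# Reference: gradient of the loss on the full batch of 8 examples.
model.zero_grad()
loss_fct(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2 examples, each loss divided by 4.
model.zero_grad()
for x, y in zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps)):
    loss = loss_fct(model(x), y) / accumulation_steps
    loss.backward()  # gradients add up in .grad until zero_grad() is called
accumulated_grad = model.weight.grad.clone()

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```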



## Training!

In [0]:
# trainer.run(train_dataloader, max_epochs=args.n_epochs)

# Hands-on #2: Adapting our pretrained model

Now let's fine-tune our model.

We will start with a simple fine-tuning process:
* keep the model architecture unchanged
* add a classification head on top of our model
* use an additional embedding to trigger the classification behavior.

**Target task: TREC**

We are going to use the Text REtrieval Conference (TREC) Question Classification dataset as our target task and dataset (Xin Li, Dan Roth, Learning Question Classifiers. COLING'02, Aug., 2002).

The TREC dataset contains 5500 labeled questions in the training set and another 500 in the test set. The dataset has 6 labels. The average sentence length is 10 and the vocabulary size is 8700.

References:
* https://nlp.stanford.edu/courses/cs224n/2004/may-steinberg-project.pdf
* http://cogcomp.org/Data/QA/QC/
* http://www.aclweb.org/anthology/C02-1150

## A simple scheme

Let's define the architecture of our adaptation model.

We will slightly change the architecture of the pre-training model by replacing the pre-training head (language modeling) with a classification head:

In [0]:
class TransformerWithClfHead(nn.Module):
    def __init__(self, config, fine_tuning_config):
        super().__init__()
        self.config = fine_tuning_config
        self.transformer = Transformer(config.embed_dim, config.hidden_dim, config.num_embeddings,
                                       config.num_max_positions, config.num_heads, config.num_layers,
                                       fine_tuning_config.dropout, causal=not config.mlm)
        
        self.classification_head = nn.Linear(config.embed_dim, fine_tuning_config.num_classes)

        self.apply(self.init_weights)

    def init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding, nn.LayerNorm)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if isinstance(module, (nn.Linear, nn.LayerNorm)) and module.bias is not None:
            module.bias.data.zero_()

    def forward(self, x, clf_tokens_mask, clf_labels=None, padding_mask=None):
        hidden_states = self.transformer(x, padding_mask)

        clf_tokens_states = (hidden_states * clf_tokens_mask.unsqueeze(-1).float()).sum(dim=0)
        clf_logits = self.classification_head(clf_tokens_states)

        if clf_labels is not None:
            loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
            loss = loss_fct(clf_logits.view(-1, clf_logits.size(-1)), clf_labels.view(-1))
            return clf_logits, loss
        return clf_logits
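
The masked sum in `forward` is a compact way of picking, for each sample, the hidden state at the position of the classification token. Here is a small sketch of that selection on random tensors; the sizes and the `[CLS]` positions are made up for illustration.

```python
import torch

torch.manual_seed(0)
seq_len, batch_size, embed_dim = 5, 2, 3
hidden_states = torch.randn(seq_len, batch_size, embed_dim)  # [seq length, batch, dim]

# Hypothetical positions of the classification token in each sample.
cls_positions = [2, 4]
clf_tokens_mask = torch.zeros(seq_len, batch_size, dtype=torch.bool)
for sample_index, position in enumerate(cls_positions):
    clf_tokens_mask[position, sample_index] = True

# Zero out every position except the classification token, then sum over the sequence.
clf_tokens_states = (hidden_states * clf_tokens_mask.unsqueeze(-1).float()).sum(dim=0)

print(clf_tokens_states.shape)  # one vector per sample: [batch, dim]
```

This only selects a single hidden state per sample if the mask contains exactly one `True` per column, i.e. one classification token per sequence.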

## Fine-tuning configuration

We need an additional configuration for the fine-tuning, which defines the number of classes of our classification head (6 classes for the TREC dataset) and a few optimization settings for adaptation (learning rate, batch size...).

In [0]:
AdaptationConfig = namedtuple('AdaptationConfig',
  field_names="num_classes, dropout, initializer_range, batch_size, lr, max_norm, n_epochs,"
              "n_warmup, valid_set_prop, gradient_accumulation_steps, device,"
              "log_dir, dataset_cache")
adapt_args = AdaptationConfig(
               6          , 0.1    , 0.02             , 16        , 6.5e-5, 1.0   , 3,
               10      , 0.1           , 1, "cuda" if torch.cuda.is_available() else "cpu",
               "./"   , "./dataset_cache.bin")

## Load and prepare TREC dataset

Let's load the TREC dataset and prepare it by adding a classification token at the end of each sample:

In [0]:
import random
from torch.utils.data import TensorDataset, random_split

dataset_file = cached_path("https://s3.amazonaws.com/datasets.huggingface.co/trec/"
                           "trec-tokenized-bert.bin")
datasets = torch.load(dataset_file)

for split_name in ['train', 'test']:

    # Trim the samples to the transformer's input length minus 1 & add a classification token
    datasets[split_name] = [x[:args.num_max_positions-1] + [tokenizer.vocab['[CLS]']]
                            for x in datasets[split_name]]

    # Pad the dataset to max length
    padding_length = max(len(x) for x in datasets[split_name])
    datasets[split_name] = [x + [tokenizer.vocab['[PAD]']] * (padding_length - len(x))
                            for x in datasets[split_name]]

    # Convert to torch.Tensor and gather inputs and labels
    tensor = torch.tensor(datasets[split_name], dtype=torch.long)
    labels = torch.tensor(datasets[split_name + '_labels'], dtype=torch.long)
    datasets[split_name] = TensorDataset(tensor, labels)

# Create a validation dataset from a fraction of the training dataset
valid_size = int(adapt_args.valid_set_prop * len(datasets['train']))
train_size = len(datasets['train']) - valid_size
valid_dataset, train_dataset = random_split(datasets['train'], [valid_size, train_size])

train_loader = DataLoader(train_dataset, batch_size=adapt_args.batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=adapt_args.batch_size, shuffle=False)
test_loader = DataLoader(datasets['test'], batch_size=adapt_args.batch_size, shuffle=False)

100%|██████████| 250835/250835 [00:00<00:00, 6456844.51B/s]
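
The trimming, classification-token and padding steps above can be tried on a few toy sequences in plain Python. In this sketch the `prepare` helper is hypothetical, and `CLS_ID` and `PAD_ID` are placeholder ids rather than values read from the tokenizer.

```python
def prepare(samples, num_max_positions, cls_id, pad_id):
    """Trim each sample, append a classification token, then pad to a common length."""
    # Leave room for the classification token within the maximum input length.
    trimmed = [x[:num_max_positions - 1] + [cls_id] for x in samples]
    # Pad every sample to the length of the longest one.
    padding_length = max(len(x) for x in trimmed)
    return [x + [pad_id] * (padding_length - len(x)) for x in trimmed]

CLS_ID, PAD_ID = 101, 0  # placeholder ids (assumption)
samples = [[5, 6, 7], [8, 9], [1, 2, 3, 4, 5, 6]]
prepared = prepare(samples, num_max_positions=4, cls_id=CLS_ID, pad_id=PAD_ID)

for row in prepared:
    print(row)
```

Every sample ends up with exactly one classification token, placed right after its last real token and before any padding.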


## Create adaptation model and load pretrained weights

We can now instantiate our adaptation model and load the pretrained weights of the core model into it. The added weights (classification head) are randomly initialized.

In [0]:
# If you have pretrained a model in the first section, you can use its weights
# state_dict = model.state_dict()

# Otherwise, just load pretrained model weights (and reload the training config as well)
state_dict = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_checkpoint.pth"), map_location='cpu')
args = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_training_args.bin"))

adaptation_model = TransformerWithClfHead(config=args, fine_tuning_config=adapt_args).to(adapt_args.device)

incompatible_keys = adaptation_model.load_state_dict(state_dict, strict=False)
print(f"Parameters discarded from the pretrained model: {incompatible_keys.unexpected_keys}")
print(f"Parameters added in the adaptation model: {incompatible_keys.missing_keys}")

100%|██████████| 201626725/201626725 [00:05<00:00, 40134140.83B/s]
100%|██████████| 837/837 [00:00<00:00, 350957.96B/s]


Parameters discarded from the pretrained model: ['lm_head.weight']
Parameters added in the adaptation model: ['classification_head.weight', 'classification_head.bias']
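
The same behaviour of `load_state_dict(..., strict=False)` can be reproduced on two tiny modules. In this sketch the `Pretrained` and `Adapted` classes are hypothetical stand-ins for the pretraining and adaptation models.

```python
import torch.nn as nn

class Pretrained(nn.Module):
    """Toy stand-in for the pretraining model: shared body + language modeling head."""
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(4, 4)
        self.lm_head = nn.Linear(4, 10, bias=False)

class Adapted(nn.Module):
    """Toy stand-in for the adaptation model: shared body + classification head."""
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(4, 4)
        self.classification_head = nn.Linear(4, 6)

# strict=False loads every matching key and reports the rest instead of raising an error.
result = Adapted().load_state_dict(Pretrained().state_dict(), strict=False)

print("Discarded:", result.unexpected_keys)
print("Added:", result.missing_keys)
```

Keys listed as missing keep whatever values the new model's own initialization gave them.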


## Preparing our fine-tuning loop

In [0]:
optimizer = torch.optim.Adam(adaptation_model.parameters(), lr=adapt_args.lr)

# Training function and trainer
def update(engine, batch):
    adaptation_model.train()
    batch, labels = (t.to(adapt_args.device) for t in batch)
    inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
    _, loss = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']), clf_labels=labels,
                               padding_mask=(batch == tokenizer.vocab['[PAD]']))
    loss = loss / adapt_args.gradient_accumulation_steps
    loss.backward()
    torch.nn.utils.clip_grad_norm_(adaptation_model.parameters(), adapt_args.max_norm)
    if engine.state.iteration % adapt_args.gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()
trainer = Engine(update)

# Evaluation function and evaluator (evaluator output is the input of the metrics)
def inference(engine, batch):
    adaptation_model.eval()
    with torch.no_grad():
        batch, labels = (t.to(adapt_args.device) for t in batch)
        inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
        clf_logits = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']),
                                      padding_mask=(batch == tokenizer.vocab['[PAD]']))
    return clf_logits, labels
evaluator = Engine(inference)

# Attach metric to evaluator & evaluation to trainer: evaluate on valid set after each epoch
Accuracy().attach(evaluator, "accuracy")
@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
    evaluator.run(valid_loader)
    print(f"Validation Epoch: {engine.state.epoch} Error rate: {100*(1 - evaluator.state.metrics['accuracy'])}")

# Learning rate schedule: linearly warm-up to lr and then to zero
scheduler = PiecewiseLinear(optimizer, 'lr', [(0, 0.0), (adapt_args.n_warmup, adapt_args.lr),
                                              (len(train_loader)*adapt_args.n_epochs, 0.0)])
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)

# Add progressbar with loss
RunningAverage(output_transform=lambda x: x).attach(trainer, "loss")
ProgressBar(persist=True).attach(trainer, metric_names=['loss'])

# Save checkpoints and finetuning config
checkpoint_handler = ModelCheckpoint(adapt_args.log_dir, 'finetuning_checkpoint', save_interval=1, require_empty=False)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {'mymodel': adaptation_model})
torch.save(adapt_args, os.path.join(adapt_args.log_dir, 'fine_tuning_args.bin'))
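
For reference, the warm-up-then-decay schedule configured above can be written in a few lines of plain Python. This is a sketch of linear interpolation between milestones, not ignite's actual `PiecewiseLinear` code, and `steps_per_epoch = 300` is a made-up value standing in for `len(train_loader)`.

```python
def piecewise_linear(step, milestones):
    """Linearly interpolate a value between (step, value) milestones."""
    if step <= milestones[0][0]:
        return milestones[0][1]
    for (start, start_value), (end, end_value) in zip(milestones, milestones[1:]):
        if start <= step <= end:
            return start_value + (end_value - start_value) * (step - start) / (end - start)
    return milestones[-1][1]  # past the last milestone: keep the final value

lr, n_warmup, n_epochs = 6.5e-5, 10, 3  # values from the fine-tuning configuration
steps_per_epoch = 300                   # assumption for illustration
milestones = [(0, 0.0), (n_warmup, lr), (steps_per_epoch * n_epochs, 0.0)]

for step in (0, 5, 10, 455, 900):
    print(step, piecewise_linear(step, milestones))
```

The learning rate climbs from zero to its peak during the first `n_warmup` iterations and then decays linearly to zero by the end of training.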

## Fine-tuning on TREC!

In [0]:
trainer.run(train_loader, max_epochs=adapt_args.n_epochs)

In [0]:
evaluator.run(test_loader)
print(f"Test Results - Error rate: {100*(1.00 - evaluator.state.metrics['accuracy']):.3f}")


# Hands-on #3: Using Adapters and freezing

Now let's see if we can add adapters inside our model to reduce the number of parameters to update during fine-tuning.

### New backbone model

We need to update our backbone model.

Let's create a new Transformer class which will inherit from our original Transformer and add adapter modules after the self-attention and the feed-forward modules.

In [0]:
class TransformerWithAdapters(Transformer):
    def __init__(self, adapters_dim, embed_dim, hidden_dim, num_embeddings, num_max_positions,
                 num_heads, num_layers, dropout, causal):
        """ Transformer with adapters (small bottleneck layers) """
        super().__init__(embed_dim, hidden_dim, num_embeddings, num_max_positions, num_heads, num_layers,
                         dropout, causal)
        self.adapters_1 = nn.ModuleList()
        self.adapters_2 = nn.ModuleList()
        for _ in range(num_layers):
          
            self.adapters_1.append(nn.Sequential(nn.Linear(embed_dim, adapters_dim),
                                                 nn.ReLU(),
                                                 nn.Linear(adapters_dim, embed_dim)))
            
            self.adapters_2.append(nn.Sequential(nn.Linear(embed_dim, adapters_dim),
                                                 nn.ReLU(),
                                                 nn.Linear(adapters_dim, embed_dim)))

    def forward(self, x, padding_mask=None):
        """ x has shape [seq length, batch], padding_mask has shape [batch, seq length] """
        positions = torch.arange(len(x), device=x.device).unsqueeze(-1)
        h = self.tokens_embeddings(x)
        h = h + self.position_embeddings(positions).expand_as(h)
        h = self.dropout(h)

        attn_mask = None
        if self.causal:
            attn_mask = torch.full((len(x), len(x)), -float('Inf'), device=h.device, dtype=h.dtype)
            attn_mask = torch.triu(attn_mask, diagonal=1)

        for (layer_norm_1, attention, adapter_1, layer_norm_2, feed_forward, adapter_2)\
                          in zip(self.layer_norms_1, self.attentions,    self.adapters_1,
                                 self.layer_norms_2, self.feed_forwards, self.adapters_2):
            h = layer_norm_1(h)
            x, _ = attention(h, h, h, attn_mask=attn_mask, need_weights=False, key_padding_mask=padding_mask)
            x = self.dropout(x)
            
            x = adapter_1(x) + x  # Add an adapter with a skip-connection after attention module
            
            h = x + h

            h = layer_norm_2(h)
            x = feed_forward(h)
            x = self.dropout(x)
            
            x = adapter_2(x) + x  # Add an adapter with a skip-connection after feed-forward module
            
            h = x + h
        return h
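
To get a feel for how small these adapters are, here is a standalone sketch of a single bottleneck adapter with its skip-connection. The `Adapter` class is a hypothetical helper written for this example (the tutorial code inlines the layers instead), and the sizes are the ones used in this notebook: an embedding dimension of 410 and an adapter dimension of 32.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck layer with a skip-connection (hypothetical helper for illustration)."""
    def __init__(self, embed_dim, adapters_dim):
        super().__init__()
        self.bottleneck = nn.Sequential(nn.Linear(embed_dim, adapters_dim),  # project down
                                        nn.ReLU(),
                                        nn.Linear(adapters_dim, embed_dim))  # project back up

    def forward(self, x):
        return self.bottleneck(x) + x  # skip-connection keeps the original signal

adapter = Adapter(embed_dim=410, adapters_dim=32)
x = torch.randn(7, 2, 410)  # [seq length, batch, embed_dim]
out = adapter(x)

num_params = sum(p.numel() for p in adapter.parameters())
print(out.shape, num_params)
```

The output has the same shape as the input, so an adapter can be dropped in after any sub-layer without changing the rest of the model.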

### Adaptation model

Again, we build an adaptation model on top of this new backbone, using the same code as previously.

In [0]:
class TransformerWithClfHeadAndAdapters(nn.Module):
    def __init__(self, config, fine_tuning_config):
        """ Transformer with a classification head and adapters. """
        super().__init__()
        self.config = fine_tuning_config
        self.transformer = TransformerWithAdapters(fine_tuning_config.adapters_dim, config.embed_dim, config.hidden_dim,
                                                   config.num_embeddings, config.num_max_positions, config.num_heads,
                                                   config.num_layers, fine_tuning_config.dropout, causal=not config.mlm)

        self.classification_head = nn.Linear(config.embed_dim, fine_tuning_config.num_classes)
        self.apply(self.init_weights)

    def init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding, nn.LayerNorm)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if isinstance(module, (nn.Linear, nn.LayerNorm)) and module.bias is not None:
            module.bias.data.zero_()

    def forward(self, x, clf_tokens_mask, lm_labels=None, clf_labels=None, padding_mask=None):
        hidden_states = self.transformer(x, padding_mask)

        clf_tokens_states = (hidden_states * clf_tokens_mask.unsqueeze(-1).float()).sum(dim=0)
        clf_logits = self.classification_head(clf_tokens_states)

        if clf_labels is not None:
            loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
            loss = loss_fct(clf_logits.view(-1, clf_logits.size(-1)), clf_labels.view(-1))
            return clf_logits, loss

        return clf_logits

### Configuration

We need a new configuration for our modified model in which we can indicate the hidden dimension for the adapters.

All the adapters are untrained, so we have added more randomly initialized parameters than in our previous example. Let's compensate by increasing the learning rate by a factor of 10 (from 6.5e-5 to 6.5e-4).

In [0]:
AdaptationConfig = namedtuple('AdaptationConfig',
  field_names="adapters_dim, num_classes, dropout, initializer_range, batch_size, lr, max_norm, n_epochs,"
              "n_warmup, valid_set_prop, gradient_accumulation_steps, device,"
              "log_dir, dataset_cache")
adapt_args = AdaptationConfig(
               32         , 6          , 0.1    , 0.02             , 16        , 6.5e-4, 1.0   , 3,
               10      , 0.1           , 1, "cuda" if torch.cuda.is_available() else "cpu",
               "./"   , "./dataset_cache.bin")

### Create adaptation model and load pretrained weights

We can now instantiate our adaptation model and load the pretrained weights of the core model into it. The added weights (classification head and adapters) are randomly initialized.

In [0]:
# If you have pretrained a model in the first section, you can use its weights
# state_dict = model.state_dict()

# Otherwise, just load pretrained model weights (and reload the training config as well)
state_dict = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_checkpoint.pth"), map_location='cpu')
args = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_training_args.bin"))

adaptation_model = TransformerWithClfHeadAndAdapters(config=args, fine_tuning_config=adapt_args)
adaptation_model.to(adapt_args.device)

incompatible_keys = adaptation_model.load_state_dict(state_dict, strict=False)
print(f"Parameters discarded from the pretrained model: {incompatible_keys.unexpected_keys}")
print(f"Parameters added in the adaptation model: {incompatible_keys.missing_keys}")

In [0]:
for name, param in adaptation_model.named_parameters():
    if 'embeddings' not in name and 'classification' not in name and 'adapters_1' not in name and 'adapters_2' not in name:
        param.detach_()
        param.requires_grad = False
        
    else:
        param.requires_grad = True

full_parameters = sum(p.numel() for p in adaptation_model.parameters())
trained_parameters = sum(p.numel() for p in adaptation_model.parameters() if p.requires_grad)
        
print(f"We will train {trained_parameters:.3e} parameters out of {full_parameters:.3e},"
      f" i.e. {100 * trained_parameters/full_parameters:.2f}%")

### Training loop

And the training loop is identical to our previous case. 

In [0]:
optimizer = torch.optim.Adam(adaptation_model.parameters(), lr=adapt_args.lr)

# Training function and trainer
def update(engine, batch):
    adaptation_model.train()
    batch, labels = (t.to(adapt_args.device) for t in batch)
    inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
    _, loss = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']), clf_labels=labels,
                               padding_mask=(batch == tokenizer.vocab['[PAD]']))
    loss = loss / adapt_args.gradient_accumulation_steps
    loss.backward()
    torch.nn.utils.clip_grad_norm_(adaptation_model.parameters(), adapt_args.max_norm)
    if engine.state.iteration % adapt_args.gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()
trainer = Engine(update)

# Evaluation function and evaluator (evaluator output is the input of the metrics)
def inference(engine, batch):
    adaptation_model.eval()
    with torch.no_grad():
        batch, labels = (t.to(adapt_args.device) for t in batch)
        inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
        clf_logits = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']),
                                      padding_mask=(batch == tokenizer.vocab['[PAD]']))
    return clf_logits, labels
evaluator = Engine(inference)

# Attach metric to evaluator & evaluation to trainer: evaluate on valid set after each epoch
Accuracy().attach(evaluator, "accuracy")
@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
    evaluator.run(valid_loader)
    print(f"Validation Epoch: {engine.state.epoch} Error rate: {100*(1 - evaluator.state.metrics['accuracy'])}")

# Learning rate schedule: linearly warm-up to lr and then to zero
scheduler = PiecewiseLinear(optimizer, 'lr', [(0, 0.0), (adapt_args.n_warmup, adapt_args.lr),
                                              (len(train_loader)*adapt_args.n_epochs, 0.0)])
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)

# Add progressbar with loss
RunningAverage(output_transform=lambda x: x).attach(trainer, "loss")
ProgressBar(persist=True).attach(trainer, metric_names=['loss'])

# Save checkpoints and finetuning config
checkpoint_handler = ModelCheckpoint(adapt_args.log_dir, 'finetuning_checkpoint', save_interval=1, require_empty=False)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {'mymodel': adaptation_model})
torch.save(args, os.path.join(adapt_args.log_dir, 'fine_tuning_args.bin'))
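
To make the warm-up-then-decay schedule above concrete, here is a minimal sketch of a piecewise-linear function in plain Python. It is not ignite's `PiecewiseLinear` implementation, only an illustration of the shape it is configured to produce here. `piecewise_linear`, `peak_lr` and `total_steps` are hypothetical names, and the numbers are made up.

```python
def piecewise_linear(step, milestones):
    """Linearly interpolate between (step, value) milestones; clamp outside the range."""
    if step <= milestones[0][0]:
        return milestones[0][1]
    for (s0, v0), (s1, v1) in zip(milestones, milestones[1:]):
        if s0 <= step <= s1:
            return v0 + (v1 - v0) * (step - s0) / (s1 - s0)
    return milestones[-1][1]

# Hypothetical values, chosen only for illustration
peak_lr, n_warmup, total_steps = 1e-3, 10, 110
milestones = [(0, 0.0), (n_warmup, peak_lr), (total_steps, 0.0)]

for step in (0, 5, 10, 60, 110):
    print(step, piecewise_linear(step, milestones))
```

The learning rate rises from zero to its peak during the first `n_warmup` iterations and then decays linearly back to zero at the last iteration.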

### Run the training

In [0]:
trainer.run(train_loader, max_epochs=adapt_args.n_epochs)

In [0]:
evaluator.run(test_loader)
print(f"Test Results - Error rate: {100*(1.00 - evaluator.state.metrics['accuracy']):.3f}")

# Hands-on #4: Using gradual unfreezing

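Let's implement a simple progressive unfreezing schedule.

We start with only the embeddings and the classification head trainable, then progressively unfreeze the transformer blocks during training, one block at a time from the top of the model down.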

### Fine-tuning configuration

We can use a higher learning rate since the layers are progressively unfrozen.

In [0]:
AdaptationConfig = namedtuple('AdaptationConfig',
  field_names="num_classes, dropout, initializer_range, batch_size, lr, max_norm, n_epochs,"
              "n_warmup, valid_set_prop, gradient_accumulation_steps, device,"
              "log_dir, dataset_cache")
adapt_args = AdaptationConfig(
               6          , 0.1    , 0.02             , 16        , 6.5e-4, 1.0   , 10,
               10      , 0.1           , 1, "cuda" if torch.cuda.is_available() else "cpu",
               "./"   , "./dataset_cache.bin")

### Reload the model and prepare the training loop

We use exactly the code of the previous cells.

In [0]:
# If you have pretrained a model in the first section, you can use its weights
# state_dict = model.state_dict()

# Otherwise, just load pretrained model weights (and reload the training config as well)
state_dict = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_checkpoint.pth"), map_location='cpu')
args = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_training_args.bin"))

adaptation_model = TransformerWithClfHead(config=args, fine_tuning_config=adapt_args)
adaptation_model.to(adapt_args.device)

incompatible_keys = adaptation_model.load_state_dict(state_dict, strict=False)
print(f"Parameters discarded from the pretrained model: {incompatible_keys.unexpected_keys}")
print(f"Parameters added in the adaptation model: {incompatible_keys.missing_keys}")

In [0]:
optimizer = torch.optim.Adam(adaptation_model.parameters(), lr=adapt_args.lr)

# Training function and trainer
def update(engine, batch):
    adaptation_model.train()
    batch, labels = (t.to(adapt_args.device) for t in batch)
    inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
    _, loss = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']), clf_labels=labels,
                               padding_mask=(batch == tokenizer.vocab['[PAD]']))
    loss = loss / adapt_args.gradient_accumulation_steps
    loss.backward()
    torch.nn.utils.clip_grad_norm_(adaptation_model.parameters(), adapt_args.max_norm)
    if engine.state.iteration % adapt_args.gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()
trainer = Engine(update)

# Evaluation function and evaluator (evaluator output is the input of the metrics)
def inference(engine, batch):
    adaptation_model.eval()
    with torch.no_grad():
        batch, labels = (t.to(adapt_args.device) for t in batch)
        inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
        clf_logits = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']),
                                      padding_mask=(batch == tokenizer.vocab['[PAD]']))
    return clf_logits, labels
evaluator = Engine(inference)

# Attach metric to evaluator & evaluation to trainer: evaluate on valid set after each epoch
Accuracy().attach(evaluator, "accuracy")
@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
    evaluator.run(valid_loader)
    print(f"Validation Epoch: {engine.state.epoch} Error rate: {100*(1 - evaluator.state.metrics['accuracy'])}")

# Learning rate schedule: linearly warm-up to lr and then to zero
scheduler = PiecewiseLinear(optimizer, 'lr', [(0, 0.0), (adapt_args.n_warmup, adapt_args.lr),
                                              (len(train_loader)*adapt_args.n_epochs, 0.0)])
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)

# Add progressbar with loss
RunningAverage(output_transform=lambda x: x).attach(trainer, "loss")
ProgressBar(persist=True).attach(trainer, metric_names=['loss'])

# Save checkpoints and finetuning config
checkpoint_handler = ModelCheckpoint(adapt_args.log_dir, 'finetuning_checkpoint', save_interval=1, require_empty=False)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {'mymodel': adaptation_model})
torch.save(args, os.path.join(adapt_args.log_dir, 'fine_tuning_args.bin'))

### Now let's add a progressive unfreezing method and freeze the inside of our model.

In [0]:
for name, param in adaptation_model.named_parameters():
    if 'embeddings' not in name and 'classification' not in name:
        param.detach_()
        param.requires_grad = False
        
    else:
        param.requires_grad = True

full_parameters = sum(p.numel() for p in adaptation_model.parameters())
trained_parameters = sum(p.numel() for p in adaptation_model.parameters() if p.requires_grad)
        
print(f"We will start by training {trained_parameters:3e} parameters out of {full_parameters:3e},"
      f" i.e. {100 * trained_parameters/full_parameters:.2f}%")

In [0]:
import re

# We will unfreeze blocks regularly during training: one block every `unfreezing_interval` steps
unfreezing_interval = int(len(train_loader) * adapt_args.n_epochs / (args.num_layers + 1))

@trainer.on(Events.ITERATION_COMPLETED)
def unfreeze_layer_if_needed(engine):
    if engine.state.iteration % unfreezing_interval == 0:
        # Which layer should we unfreeze now
        unfreezing_index = args.num_layers - (engine.state.iteration // unfreezing_interval)

        # Let's unfreeze it
        unfreezed = []
        for name, param in adaptation_model.named_parameters():
            if re.match(r"transformer\.[^\.]*\." + str(unfreezing_index) + r"\.", name):
                unfreezed.append(name)
                param.requires_grad = True  # note the plural: `require_grad` would silently do nothing
        print(f"Unfreezing block {unfreezing_index} with {unfreezed}")
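
The arithmetic of this handler can be checked without running the training. The sketch below replays it in plain Python on made-up sizes. `unfreezing_schedule` is a helper introduced only for this illustration, and the values of `steps_per_epoch`, `n_epochs` and `num_layers` are hypothetical.

```python
def unfreezing_schedule(steps_per_epoch, n_epochs, num_layers):
    """Return {iteration: block index} following the same arithmetic as the handler."""
    total_steps = steps_per_epoch * n_epochs
    interval = int(total_steps / (num_layers + 1))
    schedule = {}
    for iteration in range(1, total_steps + 1):
        if iteration % interval == 0:
            index = num_layers - (iteration // interval)
            if 0 <= index < num_layers:  # an index outside this range matches no block
                schedule[iteration] = index
    return schedule

# Hypothetical sizes: 30 steps per epoch, 2 epochs, 4 transformer blocks
schedule = unfreezing_schedule(steps_per_epoch=30, n_epochs=2, num_layers=4)
for iteration, index in schedule.items():
    print(f"iteration {iteration}: unfreeze block {index}")
```

With these sizes the top block (index 3) is unfrozen first and block 0 last. The final trigger at the last iteration yields index -1, which matches no parameter name, so nothing happens there.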

### Run the training

In [0]:
trainer.run(train_loader, max_epochs=adapt_args.n_epochs)

In [0]:
evaluator.run(test_loader)
print(f"Test Results - Error rate: {100*(1.00 - evaluator.state.metrics['accuracy']):.3f}")

# Hands-on #5: Using discriminative learning

Here is an experiment on varying the learning rate along the model depth for fine-tuning.


### Configuration

Let's prepare a configuration file with a new hyper-parameter (`decreasing_factor`) controlling the learning rate decrease factor along the model depth.

In [0]:
AdaptationConfig = namedtuple('AdaptationConfig',
  field_names="num_classes, dropout, initializer_range, batch_size, lr, max_norm, n_epochs,"
              "n_warmup, valid_set_prop, gradient_accumulation_steps, device,"
              "log_dir, dataset_cache, decreasing_factor")
adapt_args = AdaptationConfig(
               6          , 0.1    , 0.02             , 16        , 6.5e-4, 1.0   , 10,
               10      , 0.1           , 1, "cuda" if torch.cuda.is_available() else "cpu",
               "./"   , "./dataset_cache.bin", 2.6)

### Reload the model

We reuse exactly the code of the previous experiment

In [0]:
# If you have pretrained a model in the first section, you can use its weights
# state_dict = model.state_dict()

# Otherwise, just load pretrained model weights (and reload the training config as well)
state_dict = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_checkpoint.pth"), map_location='cpu')
args = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_training_args.bin"))

adaptation_model = TransformerWithClfHead(config=args, fine_tuning_config=adapt_args)
adaptation_model.to(adapt_args.device)

incompatible_keys = adaptation_model.load_state_dict(state_dict, strict=False)
print(f"Parameters discarded from the pretrained model: {incompatible_keys.unexpected_keys}")
print(f"Parameters added in the adaptation model: {incompatible_keys.missing_keys}")

Parameters discarded from the pretrained model: ['lm_head.weight']
Parameters added in the adaptation model: ['classification_head.weight', 'classification_head.bias']


### Prepare the optimizer

To easily distinguish between parameters, we will prepare the optimizer with distinct parameter groups, one for each layer.

In [0]:
import re

# Build parameter groups by layer. Names count from the top of the model:
# the last block is named '1' and the first block is named str(args.num_layers)
parameter_groups = []
for i in range(args.num_layers):
    name_pattern = r"transformer\.[^\.]*\." + str(i) + r"\."
    group = {'name': str(args.num_layers - i),
             'params': [p for n, p in adaptation_model.named_parameters() if re.match(name_pattern, n)]}
    parameter_groups.append(group)

# Add the rest of the parameters (embeddings and classification layer) in a group labeled '0'
name_pattern = r"transformer\.[^\.]*\.\d*\."
group = {'name': '0',
         'params': [p for n, p in adaptation_model.named_parameters() if not re.match(name_pattern, n)]}
parameter_groups.append(group)

# Sanity check that we still have the same number of parameters
assert sum(p.numel() for g in parameter_groups for p in g['params'])\
    == sum(p.numel() for p in adaptation_model.parameters())

optimizer = torch.optim.Adam(parameter_groups, lr=adapt_args.lr)
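
To see what the two regular expressions above select, here is a small sketch that applies them to a handful of parameter names. The names are hypothetical and only mimic the `transformer.<module list>.<index>.<...>` pattern the regexes expect, and `num_layers` is set to 2 to keep the output short.

```python
import re

# Hypothetical parameter names, for illustration only
names = [
    "transformer.tokens_embeddings.weight",
    "transformer.attentions.0.in_proj_weight",
    "transformer.feed_forwards.0.0.weight",
    "transformer.attentions.1.in_proj_weight",
    "transformer.layer_norms_1.1.bias",
    "classification_head.weight",
]
num_layers = 2

groups = {}
for i in range(num_layers):
    pattern = r"transformer\.[^\.]*\." + str(i) + r"\."
    groups[str(num_layers - i)] = [n for n in names if re.match(pattern, n)]

# Everything that belongs to no block (embeddings, classification head) goes to group '0'
any_block = r"transformer\.[^\.]*\.\d*\."
groups["0"] = [n for n in names if not re.match(any_block, n)]

for name, members in groups.items():
    print(name, members)
```

Block 0 ends up in group `'2'` and block 1 in group `'1'`, while the embeddings and the classification head fall into group `'0'`.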

### Prepare training loop

We reuse exactly the code of the previous experiment

In [0]:
# Training function and trainer
def update(engine, batch):
    adaptation_model.train()
    batch, labels = (t.to(adapt_args.device) for t in batch)
    inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
    _, loss = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']), clf_labels=labels,
                               padding_mask=(batch == tokenizer.vocab['[PAD]']))
    loss = loss / adapt_args.gradient_accumulation_steps
    loss.backward()
    torch.nn.utils.clip_grad_norm_(adaptation_model.parameters(), adapt_args.max_norm)
    if engine.state.iteration % adapt_args.gradient_accumulation_steps == 0:
#        print([(p['name'], p["lr"]) for p in optimizer.param_groups])
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()
trainer = Engine(update)

# Evaluation function and evaluator (evaluator output is the input of the metrics)
def inference(engine, batch):
    adaptation_model.eval()
    with torch.no_grad():
        batch, labels = (t.to(adapt_args.device) for t in batch)
        inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
        clf_logits = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']),
                                      padding_mask=(batch == tokenizer.vocab['[PAD]']))
    return clf_logits, labels
evaluator = Engine(inference)

# Attach metric to evaluator & evaluation to trainer: evaluate on valid set after each epoch
Accuracy().attach(evaluator, "accuracy")
@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
    evaluator.run(valid_loader)
    print(f"Validation Epoch: {engine.state.epoch} Error rate: {100*(1 - evaluator.state.metrics['accuracy'])}")

# Learning rate schedule: linearly warm-up to lr and then to zero
scheduler = PiecewiseLinear(optimizer, 'lr', [(0, 0.0), (adapt_args.n_warmup, adapt_args.lr),
                                              (len(train_loader)*adapt_args.n_epochs, 0.0)])
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)

# Add progressbar with loss
RunningAverage(output_transform=lambda x: x).attach(trainer, "loss")
ProgressBar(persist=True).attach(trainer, metric_names=['loss'])

# Save checkpoints and finetuning config
checkpoint_handler = ModelCheckpoint(adapt_args.log_dir, 'finetuning_checkpoint', save_interval=1, require_empty=False)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {'mymodel': adaptation_model})
torch.save(args, os.path.join(adapt_args.log_dir, 'fine_tuning_args.bin'))

### Control the learning rate of each parameter group

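We finish by adding a handler that controls the learning rate of the inner layers of our model according to our decreasing schedule. It is attached after the scheduler, so on every iteration the scheduler first sets the base learning rate of all groups and the handler then divides it by `decreasing_factor ** layer_index`.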

In [0]:
@trainer.on(Events.ITERATION_STARTED)
def update_layer_learning_rates(engine):
    for param_group in optimizer.param_groups:
        layer_index = int(param_group["name"])
        param_group["lr"] = param_group["lr"] / (adapt_args.decreasing_factor ** layer_index)
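
The sketch below simulates two iterations with plain dictionaries standing in for `optimizer.param_groups`. It assumes that the scheduler assigns the same base value to every group at the start of each iteration, before the depth scaling runs. `scheduler_step` and `scale_by_depth` are hypothetical helpers, and the factor and learning rates are made-up values.

```python
# Stand-ins for optimizer.param_groups, named as in the cell that builds the groups:
# a larger name means a block closer to the input, '0' holds embeddings and classifier.
param_groups = [{"name": str(n), "lr": 0.0} for n in (3, 2, 1, 0)]
decreasing_factor = 2.0  # hypothetical value, chosen for readability

def scheduler_step(base_lr):
    # Assumption: the scheduler overwrites the lr of every group on each iteration
    for group in param_groups:
        group["lr"] = base_lr

def scale_by_depth():
    for group in param_groups:
        group["lr"] = group["lr"] / (decreasing_factor ** int(group["name"]))

for base_lr in (1e-3, 8e-4):  # two consecutive iterations
    scheduler_step(base_lr)
    scale_by_depth()
    print({g["name"]: g["lr"] for g in param_groups})
```

Because the base value is written again before each division, the scaling does not compound from one iteration to the next. If the two handlers ran in the opposite order, the scaling would simply be overwritten.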

### Run the training

In [0]:
trainer.run(train_loader, max_epochs=adapt_args.n_epochs)

Validation Epoch: 1 Error rate: 16.697247706422015
Validation Epoch: 2 Error rate: 8.807339449541285
Validation Epoch: 3 Error rate: 7.889908256880728
Validation Epoch: 4 Error rate: 8.25688073394495
Validation Epoch: 5 Error rate: 8.440366972477065
Validation Epoch: 6 Error rate: 7.706422018348624
Validation Epoch: 7 Error rate: 7.889908256880728
Validation Epoch: 8 Error rate: 7.522935779816509
Validation Epoch: 9 Error rate: 7.155963302752289
Validation Epoch: 10 Error rate: 7.339449541284404

In [0]:
evaluator.run(test_loader)
print(f"Test Results - Error rate: {100*(1.00 - evaluator.state.metrics['accuracy']):.3f}")

# Hands-on #6: Using multi-task learning

Here is a simple example of multi-tasking: using a language modeling loss in addition to the classification loss.

For more complex multi-tasking scenarios, you probably want to turn to a more specialized framework like [Snorkel MeTaL](https://github.com/HazyResearch/metal) or [AllenNLP](https://allennlp.org/).

We'll start by defining an adaptation model with both a language modeling head and a classification head. It mirrors the pretraining model with its language modeling head and adds a classification head on top of the same transformer.

In [0]:
class TransformerWithClfHeadAndLMHead(nn.Module):
    def __init__(self, config, fine_tuning_config):
        super().__init__()
        self.config = fine_tuning_config
        self.transformer = Transformer(config.embed_dim, config.hidden_dim, config.num_embeddings,
                                       config.num_max_positions, config.num_heads, config.num_layers,
                                       config.dropout, causal=not config.mlm)

        self.lm_head = nn.Linear(config.embed_dim, config.num_embeddings, bias=False)
        self.classification_head = nn.Linear(config.embed_dim, fine_tuning_config.num_classes)

        self.apply(self.init_weights)
        self.tie_weights()

    def tie_weights(self):
        self.lm_head.weight = self.transformer.tokens_embeddings.weight

    def init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding, nn.LayerNorm)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if isinstance(module, (nn.Linear, nn.LayerNorm)) and module.bias is not None:
            module.bias.data.zero_()

    def forward(self, x, clf_tokens_mask, lm_labels=None, clf_labels=None, padding_mask=None):
        """ x and clf_tokens_mask have shape [seq length, batch] padding_mask has shape [batch, seq length] """
        hidden_states = self.transformer(x, padding_mask)

        lm_logits = self.lm_head(hidden_states)
        clf_tokens_states = (hidden_states * clf_tokens_mask.unsqueeze(-1).float()).sum(dim=0)
        clf_logits = self.classification_head(clf_tokens_states)

        loss = []
        if clf_labels is not None:
            loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
            loss.append(loss_fct(clf_logits.view(-1, clf_logits.size(-1)), clf_labels.view(-1)))

        if lm_labels is not None:
            shift_logits = lm_logits[:-1] if self.transformer.causal else lm_logits
            shift_labels = lm_labels[1:] if self.transformer.causal else lm_labels
            loss_fct = nn.CrossEntropyLoss(ignore_index=-1)
            loss.append(loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)))

        if len(loss):
            return (lm_logits, clf_logits), loss

        return lm_logits, clf_logits
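
The shift applied to the logits and labels in the causal case can be illustrated with plain lists. This is only a sketch of the alignment, using a made-up sentence: position `t` of a causal model sees tokens `0..t`, so its prediction is scored against token `t + 1`.

```python
tokens = ["what", "is", "transfer", "learning", "[CLS]"]

# Positions whose prediction has a target: all but the last (like lm_logits[:-1])
inputs_seen = tokens[:-1]
# The token that follows each of these positions (like lm_labels[1:])
targets = tokens[1:]

pairs = list(zip(inputs_seen, targets))
for seen, target in pairs:
    print(f"after {seen!r} predict {target!r}")
```

With a masked language model no shift is needed, since each position is scored against the token at that same position.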

### Configuration

Let's prepare a configuration with two new hyper-parameters (`clf_loss_coef` and `lm_loss_coef`) that control the relative weights of the classification loss and the language modeling loss in the total loss.

In [0]:
AdaptationConfig = namedtuple('AdaptationConfig',
  field_names="num_classes, dropout, initializer_range, batch_size, lr, max_norm, n_epochs,"
              "n_warmup, valid_set_prop, gradient_accumulation_steps, device,"
              "log_dir, dataset_cache, clf_loss_coef, lm_loss_coef")
adapt_args = AdaptationConfig(
               6          , 0.1    , 0.02             , 16        , 6.5e-5, 1.0   , 6,
               10      , 0.1           , 1, "cuda" if torch.cuda.is_available() else "cpu",
               "./"   , "./dataset_cache.bin", 1.0, 0.5)

### Create adaptation model and load pretrained weights

We can now instantiate our adaptation model and load the pretrained weights of the core model. The added weights (classification head) will be freshly initialized.

In this case we make sure the weights on the language modeling head are still tied to the input embeddings.

In [0]:
# If you have pretrained a model in the first section, you can use its weights
# state_dict = model.state_dict()

# Otherwise, just load pretrained model weights (and reload the training config as well)
state_dict = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_checkpoint.pth"), map_location='cpu')
args = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_training_args.bin"))

adaptation_model = TransformerWithClfHeadAndLMHead(config=args, fine_tuning_config=adapt_args).to(adapt_args.device)

incompatible_keys = adaptation_model.load_state_dict(state_dict, strict=False)
adaptation_model.tie_weights()  # Make sure weights are tied after loading checkpoint

print(f"Parameters discarded from the pretrained model: {incompatible_keys.unexpected_keys}")
print(f"Parameters added in the adaptation model: {incompatible_keys.missing_keys}")

Parameters discarded from the pretrained model: []
Parameters added in the adaptation model: ['classification_head.weight', 'classification_head.bias']


### Reload the dataset

We use exactly the code of the previous case.

In [0]:
import random
from torch.utils.data import TensorDataset, random_split

dataset_file = cached_path("https://s3.amazonaws.com/datasets.huggingface.co/trec/"
                           "trec-tokenized-bert.bin")
datasets = torch.load(dataset_file)

for split_name in ['train', 'test']:

    # Trim the samples to the transformer's input length minus 1 & add a classification token
    datasets[split_name] = [x[:args.num_max_positions-1] + [tokenizer.vocab['[CLS]']]
                            for x in datasets[split_name]]

    # Pad the dataset to max length
    padding_length = max(len(x) for x in datasets[split_name])
    datasets[split_name] = [x + [tokenizer.vocab['[PAD]']] * (padding_length - len(x))
                            for x in datasets[split_name]]

    # Convert to torch.Tensor and gather inputs and labels
    tensor = torch.tensor(datasets[split_name], dtype=torch.long)
    labels = torch.tensor(datasets[split_name + '_labels'], dtype=torch.long)
    datasets[split_name] = TensorDataset(tensor, labels)

# Create a validation dataset from a fraction of the training dataset
valid_size = int(adapt_args.valid_set_prop * len(datasets['train']))
train_size = len(datasets['train']) - valid_size
valid_dataset, train_dataset = random_split(datasets['train'], [valid_size, train_size])

train_loader = DataLoader(train_dataset, batch_size=adapt_args.batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=adapt_args.batch_size, shuffle=False)
test_loader = DataLoader(datasets['test'], batch_size=adapt_args.batch_size, shuffle=False)

### Prepare training loop

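We reuse the code of the previous experiment with one change: the update function now also passes `lm_labels` to the model and combines the classification loss and the language modeling loss using the two coefficients.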

In [0]:
optimizer = torch.optim.Adam(adaptation_model.parameters(), lr=adapt_args.lr)

# Training function and trainer
def update(engine, batch):
    adaptation_model.train()
    batch, labels = (t.to(adapt_args.device) for t in batch)
    inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
    
    _, losses = adaptation_model(inputs,
                                 clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']),
                                 clf_labels=labels,
                                 lm_labels=inputs,
                                 padding_mask=(batch == tokenizer.vocab['[PAD]']))

    clf_loss, lm_loss = losses
    loss = (adapt_args.clf_loss_coef * clf_loss
          + adapt_args.lm_loss_coef  * lm_loss) / adapt_args.gradient_accumulation_steps
    
    loss.backward()
    torch.nn.utils.clip_grad_norm_(adaptation_model.parameters(), adapt_args.max_norm)
    if engine.state.iteration % adapt_args.gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()
trainer = Engine(update)

# Evaluation function and evaluator (evaluator output is the input of the metrics)
def inference(engine, batch):
    adaptation_model.eval()
    with torch.no_grad():
        batch, labels = (t.to(adapt_args.device) for t in batch)
        inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
        _, clf_logits = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']),
                                                 padding_mask=(batch == tokenizer.vocab['[PAD]']))
    return clf_logits, labels
evaluator = Engine(inference)

# Attach metric to evaluator & evaluation to trainer: evaluate on valid set after each epoch
Accuracy().attach(evaluator, "accuracy")
@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
    evaluator.run(valid_loader)
    print(f"Validation Epoch: {engine.state.epoch} Error rate: {100*(1 - evaluator.state.metrics['accuracy'])}")

# Learning rate schedule: linearly warm-up to lr and then to zero
scheduler = PiecewiseLinear(optimizer, 'lr', [(0, 0.0), (adapt_args.n_warmup, adapt_args.lr),
                                              (len(train_loader)*adapt_args.n_epochs, 0.0)])
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)

# Add progressbar with loss
RunningAverage(output_transform=lambda x: x).attach(trainer, "loss")
ProgressBar(persist=True).attach(trainer, metric_names=['loss'])

# Save checkpoints and finetuning config
checkpoint_handler = ModelCheckpoint(adapt_args.log_dir, 'finetuning_checkpoint', save_interval=1, require_empty=False)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {'mymodel': adaptation_model})
torch.save(args, os.path.join(adapt_args.log_dir, 'fine_tuning_args.bin'))
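
The way `update` combines the two losses can be written as a small standalone function. This is a sketch with plain floats instead of tensors. `combine_losses` is a hypothetical helper, the default coefficients are the ones from the configuration above, and the loss values are made up.

```python
def combine_losses(clf_loss, lm_loss, clf_loss_coef=1.0, lm_loss_coef=0.5, accumulation_steps=1):
    """Weighted sum of the two task losses, scaled for gradient accumulation."""
    return (clf_loss_coef * clf_loss + lm_loss_coef * lm_loss) / accumulation_steps

# Hypothetical loss values for one batch
total = combine_losses(clf_loss=0.8, lm_loss=3.0)
print(total)
```

Setting `lm_loss_coef` to zero recovers the single-task classification objective, which makes the coefficient a convenient knob for checking how much the auxiliary loss helps.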

### Run the training

In [0]:
trainer.run(train_loader, max_epochs=adapt_args.n_epochs)

Validation Epoch: 1 Error rate: 8.99082568807339
Validation Epoch: 2 Error rate: 7.522935779816509
Validation Epoch: 3 Error rate: 7.522935779816509
Validation Epoch: 4 Error rate: 7.155963302752289
Validation Epoch: 5 Error rate: 5.871559633027523
Validation Epoch: 6 Error rate: 5.5045871559633035

In [0]:
evaluator.run(test_loader)
print(f"Test Results - Error rate: {100*(1.00 - evaluator.state.metrics['accuracy']):.3f}")

Test Results - Error rate: 4.000


# Bonus: Using a masked-language modeling model

This is a bonus extension: try using a language model pretrained with a masked language modeling objective.

In [0]:
AdaptationConfig = namedtuple('AdaptationConfig',
  field_names="num_classes, dropout, initializer_range, batch_size, lr, max_norm, n_epochs,"
              "n_warmup, valid_set_prop, gradient_accumulation_steps, device,"
              "log_dir, dataset_cache")
adapt_args = AdaptationConfig(
               6          , 0.1    , 0.02             , 16        , 1e-4, 1.0   , 10,
               10      , 0.1           , 1, "cuda" if torch.cuda.is_available() else "cpu",
               "./"   , "./dataset_cache.bin")

In [0]:
# If you have pretrained a model in the first section, you can use its weights
# state_dict = model.state_dict()

# Otherwise, just load pretrained model weights (and reload the training config as well)
state_dict = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_checkpoint_mlm.pth"), map_location='cpu')
args = torch.load(cached_path("https://s3.amazonaws.com/models.huggingface.co/"
                                    "naacl-2019-tutorial/model_training_args_mlm.bin"))

adaptation_model = TransformerWithClfHead(config=args, fine_tuning_config=adapt_args).to(adapt_args.device)

incompatible_keys = adaptation_model.load_state_dict(state_dict, strict=False)
print(f"Parameters discarded from the pretrained model: {incompatible_keys.unexpected_keys}")
print(f"Parameters added in the adaptation model: {incompatible_keys.missing_keys}")

Parameters discarded from the pretrained model: ['lm_head.weight']
Parameters added in the adaptation model: ['classification_head.weight', 'classification_head.bias']


## Loading the dataset

We make one change compared to our causal model here:
we put the classification token at the beginning of each sample instead of the end. It seems to be easier for the model to learn the new embedding when it is always at the same location in the input sequence.

In [0]:
import random
from torch.utils.data import TensorDataset, random_split

dataset_file = cached_path("https://s3.amazonaws.com/datasets.huggingface.co/trec/"
                           "trec-tokenized-bert.bin")
datasets = torch.load(dataset_file)

for split_name in ['train', 'test']:

    # Trim the samples to the transformer's input length minus 1
    # and add a classification token at the beginning
    datasets[split_name] = [[tokenizer.vocab['[CLS]']] + x[:args.num_max_positions-1]
                            for x in datasets[split_name]]

    # Pad the dataset to max length
    padding_length = max(len(x) for x in datasets[split_name])
    datasets[split_name] = [x + [tokenizer.vocab['[PAD]']] * (padding_length - len(x))
                            for x in datasets[split_name]]

    # Convert to torch.Tensor and gather inputs and labels
    tensor = torch.tensor(datasets[split_name], dtype=torch.long)
    labels = torch.tensor(datasets[split_name + '_labels'], dtype=torch.long)
    datasets[split_name] = TensorDataset(tensor, labels)

# Create a validation dataset from a fraction of the training dataset
valid_size = int(adapt_args.valid_set_prop * len(datasets['train']))
train_size = len(datasets['train']) - valid_size
valid_dataset, train_dataset = random_split(datasets['train'], [valid_size, train_size])

train_loader = DataLoader(train_dataset, batch_size=adapt_args.batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=adapt_args.batch_size, shuffle=False)
test_loader = DataLoader(datasets['test'], batch_size=adapt_args.batch_size, shuffle=False)
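
To illustrate why the position of the classification token differs between the two layouts, here is a sketch on two made-up samples with string tokens instead of vocabulary ids. `build` and `pad` are hypothetical helpers that follow the same trim, add and pad steps as the cell above.

```python
CLS, PAD = "[CLS]", "[PAD]"
samples = [["how", "far", "is", "it"], ["who", "won"]]

def build(sample, max_positions, cls_first):
    trimmed = sample[:max_positions - 1]  # leave room for the classification token
    return [CLS] + trimmed if cls_first else trimmed + [CLS]

def pad(batch):
    length = max(len(x) for x in batch)
    return [x + [PAD] * (length - len(x)) for x in batch]

rows_cls_last = pad([build(s, max_positions=8, cls_first=False) for s in samples])
rows_cls_first = pad([build(s, max_positions=8, cls_first=True) for s in samples])

print([row.index(CLS) for row in rows_cls_last])
print([row.index(CLS) for row in rows_cls_first])
```

When the token is appended, its position depends on the length of each sample. When it is prepended, it always sits at position 0.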

## Preparing the training loop

In [0]:
optimizer = torch.optim.Adam(adaptation_model.parameters(), lr=adapt_args.lr)

# Training function and trainer
def update(engine, batch):
    adaptation_model.train()
    batch, labels = (t.to(adapt_args.device) for t in batch)
    inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
    _, loss = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']), clf_labels=labels,
                               padding_mask=(batch == tokenizer.vocab['[PAD]']))
    loss = loss / adapt_args.gradient_accumulation_steps
    loss.backward()
    torch.nn.utils.clip_grad_norm_(adaptation_model.parameters(), adapt_args.max_norm)
    if engine.state.iteration % adapt_args.gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()
trainer = Engine(update)

# Evaluation function and evaluator (evaluator output is the input of the metrics)
def inference(engine, batch):
    adaptation_model.eval()
    with torch.no_grad():
        batch, labels = (t.to(adapt_args.device) for t in batch)
        inputs = batch.transpose(0, 1).contiguous()  # to shape [seq length, batch]
        clf_logits = adaptation_model(inputs, clf_tokens_mask=(inputs == tokenizer.vocab['[CLS]']),
                                      padding_mask=(batch == tokenizer.vocab['[PAD]']))
    return clf_logits, labels
evaluator = Engine(inference)

# Attach metric to evaluator & evaluation to trainer: evaluate on valid set after each epoch
Accuracy().attach(evaluator, "accuracy")
@trainer.on(Events.EPOCH_COMPLETED)
def log_validation_results(engine):
    evaluator.run(valid_loader)
    print(f"Validation Epoch: {engine.state.epoch} Error rate: {100*(1 - evaluator.state.metrics['accuracy'])}")

# Learning rate schedule: linearly warm-up to lr and then to zero
scheduler = PiecewiseLinear(optimizer, 'lr', [(0, 0.0), (adapt_args.n_warmup, adapt_args.lr),
                                              (len(train_loader)*adapt_args.n_epochs, 0.0)])
trainer.add_event_handler(Events.ITERATION_STARTED, scheduler)

# Add progressbar with loss
RunningAverage(output_transform=lambda x: x).attach(trainer, "loss")
ProgressBar(persist=True).attach(trainer, metric_names=['loss'])

# Save checkpoints and finetuning config
checkpoint_handler = ModelCheckpoint(adapt_args.log_dir, 'finetuning_checkpoint', save_interval=1, require_empty=False)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint_handler, {'mymodel': adaptation_model})
torch.save(args, os.path.join(adapt_args.log_dir, 'fine_tuning_args.bin'))

## Running the training

In [0]:
trainer.run(train_loader, max_epochs=adapt_args.n_epochs)

Validation Epoch: 1 Error rate: 17.431192660550455
Validation Epoch: 2 Error rate: 10.82568807339449
Validation Epoch: 3 Error rate: 9.541284403669724
Validation Epoch: 4 Error rate: 9.174311926605505
Validation Epoch: 5 Error rate: 9.174311926605505
Validation Epoch: 6 Error rate: 8.62385321100917
Validation Epoch: 7 Error rate: 8.807339449541285
Validation Epoch: 8 Error rate: 8.25688073394495
Validation Epoch: 9 Error rate: 8.62385321100917
Validation Epoch: 10 Error rate: 7.706422018348624

In [0]:
evaluator.run(test_loader)
print(f"Test Results - Error rate: {100*(1.00 - evaluator.state.metrics['accuracy']):.3f}")

Test Results - Error rate: 3.800
