In [1]:
# hide
!nvidia-smi

Fri May  7 09:40:07 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [1]:
# hide
import sys
if 'google.colab' in sys.modules: 
    !pip install -Uqq fastai transformers datasets wandb
    !pip install -q git+https://github.com/aikindergarten/fasthugs.git



# Finetuning Transformers on GLUE benchmark

In [3]:
#collapse_input
from transformers import AutoModelForSequenceClassification
from fastai.text.all import *
from fastai.callback.wandb import *

from fasthugs.learner import TransLearner
from fasthugs.data import TransformersTextBlock, TextGetter, get_splits

from datasets import load_dataset, concatenate_datasets

import wandb
import gc

In [4]:
#hide
%env WANDB_ENTITY=fastai_community
%env WANDB_PROJECT=glue-benchmark

env: WANDB_ENTITY=fastai_community
env: WANDB_PROJECT=glue-benchmark


## Introduction

In this blog post we will look at how to combine the power of HuggingFace with the great flexibility of fastai. For this purpose we will finetune on [GLUE benchmark](https://gluebenchmark.com/) tasks. Fun fact: GLUE was introduced in this [paper](https://arxiv.org/abs/1804.07461) in 2018 as a tough-to-beat benchmark to challenge NLP systems, and in just about a year the new [SuperGLUE](https://arxiv.org/abs/1905.00537) benchmark was introduced because the original GLUE had become too easy for the models.

To give you a grasp of what we are dealing with, here is a brief summary of the GLUE tasks:

In [13]:
#hide_input
abreviations=["cola","sst2","mrpc","stsb","qqp","mnli","qnli","rte","wnli"]
name = [
    "Corpus of Linguistic Acceptability",
    "Stanford Sentiment Treebank",
    "Microsoft Research Paraphrase Corpus",
    "Semantic Textual Similarity Benchmark",
    "Quora question pair",
    "Multi-Genre Natural Language Inference",
    "Question NLI (from the Stanford Question Answering Dataset)",
    "Recognizing Textual Entailment",
    "Winograd NLI (from the Winograd Schema Challenge)"
]
descriptions = [
    "Determine whether it is a grammatical sentence",
    "Predict the sentiment of a given sentence",
    "Determine whether the sentences in the pair are semantically equivalent",
    "Determine similarity score for 2 sentences",
    "Determine if 2 questions are the same (paraphrase)",
    "Predict whether the premise entails, contradicts or is neutral to the hypothesis",
    "Determine whether the context sentence contains the answer to the question",
    "Determine whether one sentence entails another",
    "Predict if the sentence with the pronoun substituted is entailed by the original sentence"
]
df = pd.DataFrame({'Name':name,
                   'Task description':descriptions,
                   'Size':['8.5k','67k','3.7k','7k','364k','393k','105k','2.5k', '634'],
                   'Metrics':['matthews_corrcoef','accuracy','f1/accuracy','pearsonr/spearmanr',
                              'f1/accuracy','accuracy','accuracy','accuracy','accuracy']},
                   index=abreviations)
display_df(df)

| | Name | Task description | Size | Metrics |
|---|---|---|---|---|
| cola | Corpus of Linguistic Acceptability | Determine whether it is a grammatical sentence | 8.5k | matthews_corrcoef |
| sst2 | Stanford Sentiment Treebank | Predict the sentiment of a given sentence | 67k | accuracy |
| mrpc | Microsoft Research Paraphrase Corpus | Determine whether the sentences in the pair are semantically equivalent | 3.7k | f1/accuracy |
| stsb | Semantic Textual Similarity Benchmark | Determine similarity score for 2 sentences | 7k | pearsonr/spearmanr |
| qqp | Quora question pair | Determine if 2 questions are the same (paraphrase) | 364k | f1/accuracy |
| mnli | Multi-Genre Natural Language Inference | Predict whether the premise entails, contradicts or is neutral to the hypothesis | 393k | accuracy |
| qnli | Question NLI (from the Stanford Question Answering Dataset) | Determine whether the context sentence contains the answer to the question | 105k | accuracy |
| rte | Recognizing Textual Entailment | Determine whether one sentence entails another | 2.5k | accuracy |
| wnli | Winograd NLI (from the Winograd Schema Challenge) | Predict if the sentence with the pronoun substituted is entailed by the original sentence | 634 | accuracy |
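
CoLA is the only task in the summary scored with the Matthews correlation coefficient instead of accuracy. The snippet below is a minimal pure-Python sketch of that metric for binary labels, not the fastai `MatthewsCorrCoef` implementation. The toy labels are made up for illustration, and returning 0 when the denominator vanishes is a convention assumed here.

```python
from math import sqrt

def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: return 0 when the denominator is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0

def accuracy_score(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels (made up): a model that always predicts class 1
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(f"accuracy={acc:.2f}, mcc={mcc:.2f}")
```

A constant prediction can still score a high accuracy on skewed labels, while the correlation coefficient stays at zero, which is why it is often preferred for imbalanced data.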


As you can see, some of the datasets are really small. We'll look at how one can address this.

## Setup

Let's define the main settings for the run in one place:

In [None]:
ds_name = 'glue'
model_name = "distilroberta-base"

max_len = 512
bs = 32
val_bs = bs*2

n_epoch = 4
lr = 2e-5
wd = 0.
opt_func = Adam
diff_lr_decay_factor = 0

To make switching between datasets smooth, I'll define a couple of dictionaries containing per-task information. We'll need the metrics, the text fields used to retrieve the data, and the number of outputs for the model.

In [8]:
GLUE_TASKS = ["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
def validate_task():
    assert task in GLUE_TASKS

In [9]:
#collapse_output
glue_metrics = {
    'cola':[MatthewsCorrCoef()],
    'sst2':[accuracy],
    'mrpc':[F1Score(), accuracy],
    'stsb':[PearsonCorrCoef(), SpearmanCorrCoef()],
    'qqp': [F1Score(), accuracy],
    'mnli':[accuracy],
    'qnli':[accuracy],
    'rte': [accuracy],
    'wnli':[accuracy],
}

glue_textfields = {
    'cola':['sentence', None],
    'sst2':['sentence', None],
    'mrpc':['sentence1', 'sentence2'],
    'stsb':['sentence1', 'sentence2'],
    'qqp': ['question1', 'question2'],
    'mnli':['premise', 'hypothesis'],
    'qnli':['question', 'sentence'],
    'rte': ['sentence1', 'sentence2'],
    'wnli':['sentence1', 'sentence2'],
}

glue_num_labels = {'mnli':3, 'stsb':1}
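
To show how these dictionaries are meant to be consumed, here is a small sketch with trimmed-down copies of them. The helper `task_settings` is hypothetical (it is not part of fasthugs or of this notebook); it only illustrates dropping the `None` placeholder for single-sentence tasks and falling back to 2 labels when a task is not listed in `glue_num_labels`.

```python
# Trimmed-down copies of the dictionaries defined above
glue_textfields = {
    'sst2': ['sentence', None],
    'mnli': ['premise', 'hypothesis'],
    'stsb': ['sentence1', 'sentence2'],
}
glue_num_labels = {'mnli': 3, 'stsb': 1}

def task_settings(task):
    """Hypothetical helper: gather the per-task settings in one place."""
    fields = [f for f in glue_textfields[task] if f]  # drop the None placeholder
    return {'text_fields': fields,
            'num_labels': glue_num_labels.get(task, 2)}  # default: binary task

for name in ['sst2', 'mnli', 'stsb']:
    print(name, task_settings(name))
```

Note that the lookup key has to be the variable `task`: `glue_num_labels.get('task', 2)` with the string literal would always fall back to 2.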

In [None]:
#collapse_input
def layerwise_splitter(model):
    emb = L(model.base_model.embeddings)
    layers = L(model.base_model.encoder.layer.children())
    clf = L(m for m in list(model.children())[1:] if params(m))
    groups = emb + layers + clf
    return groups.map(params)
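
The parameter groups returned by `layerwise_splitter` matter once `diff_lr_decay_factor` is non-zero, because the learning rate then becomes `slice(lr*diff_lr_decay_factor**k, lr)`. As far as I understand, fastai spreads such a slice geometrically over the parameter groups. The sketch below reimplements that spacing in pure Python under this assumption; `k = 8` (embeddings, 6 encoder layers, classifier head) and `decay = 0.9` are assumed example values.

```python
def even_mults(start, stop, n):
    """Return n geometrically spaced values from start to stop."""
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

lr = 2e-5
decay = 0.9  # example value for diff_lr_decay_factor
k = 8        # assumed number of groups: embeddings + 6 layers + classifier head

# Lowest rate goes to the first group (embeddings), highest to the last (head)
group_lrs = even_mults(lr * decay ** k, lr, k)
for i, group_lr in enumerate(group_lrs):
    print(f"group {i}: lr={group_lr:.2e}")
```

Under this assumption the ratio between neighbouring groups is `decay**(k/(k-1))`, slightly smaller than `decay` itself, because the slice fixes both endpoints.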

## Running a GLUE task

In [None]:
#hide_output
task = 'sst2'; validate_task()
ds = load_dataset(ds_name, task)

In [None]:
valid_ = 'validation_matched' if task=='mnli' else 'validation'
len(ds['train']), len(ds[valid_])

(67349, 872)

In [None]:
train_idx, valid_idx = get_splits(ds, valid=valid_)
train_ds = concatenate_datasets([ds['train'], ds[valid_]])
train_ds[0]

{'idx': 0,
 'label': 0,
 'sentence': 'hide new secretions from the parental units '}

Here I use the number of characters as a proxy for the length of the tokenized text to speed up `dls` creation.

In [None]:
lens = train_ds.map(lambda s: {'len': sum([len(s[i]) for i in glue_textfields[task] if i])},
                    remove_columns=train_ds.column_names, num_proc=2, keep_in_memory=True)
train_lens = lens.select(train_idx)['len']
valid_lens = lens.select(valid_idx)['len']
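
Precomputed lengths are useful presumably because the dataloader can group samples of similar length and so waste less compute on padding. The sketch below illustrates why a cheap character count can be good enough for that. The toy texts are made up, and whitespace splitting stands in for a real tokenizer.

```python
def padding_tokens(token_lens, bs):
    """Total padding needed when consecutive items are batched together."""
    total = 0
    for i in range(0, len(token_lens), bs):
        batch = token_lens[i:i + bs]
        total += sum(max(batch) - n for n in batch)
    return total

# Toy texts (made up); whitespace splitting stands in for a real tokenizer
texts = ["a b", "a b c d e f g h", "a", "a b c d e f", "a b c", "a b c d e f g"]

def token_len(text):
    return len(text.split())

original_order = [token_len(t) for t in texts]
by_char_len = [token_len(t) for t in sorted(texts, key=len)]  # sort by characters

pad_original = padding_tokens(original_order, bs=2)
pad_sorted = padding_tokens(by_char_len, bs=2)
print(f"padding tokens: original order={pad_original}, sorted by chars={pad_sorted}")
```

The character count never touches the tokenizer, so it is cheap to compute over the whole dataset, and it only has to rank samples roughly in the same order as their token counts.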





In [None]:
blocks = [TransformersTextBlock(pretrained_model_name=model_name),CategoryBlock()]
dblock = DataBlock(blocks=blocks,
                   get_x=TextGetter(*glue_textfields[task]),
                   get_y=ItemGetter('label'),
                   splitter=IndexSplitter(valid_idx))

In [None]:
dl_kwargs=[{'res':train_lens}, {'val_res':valid_lens}]
dls = dblock.dataloaders(train_ds, bs=bs, val_bs=val_bs, dl_kwargs=dl_kwargs)
dls.show_batch(max_n=4)

Unnamed: 0,text,category
0,"... spiced with humor ('i speak fluent flatula,'advises denlopp after a rather, er, bubbly exchange with an alien deckhand ) and witty updatings ( silver's parrot has been replaced with morph, a cute alien creature who mimics everyone and everything around )",1
1,"stopped thinking about how good it all was, and started doing nothing but reacting to it - feeling a part of its grand locations, thinking urgently as the protagonists struggled, feeling at the mercy of its inventiveness, gasping at its visual delights.",1
2,"ozpetek offers an aids subtext, skims over the realities of gay sex, and presents yet another tired old vision of the gay community as an all-inclusive world where uptight, middle class bores like antonia can feel good about themselves.",0
3,"'s... worth the extra effort to see an artist, still committed to growth in his ninth decade, change while remaining true to his principles with a film whose very subject is, quite pointedly, about the peril of such efforts.",1


### Single run

The GLUE benchmark contains 9 tasks, and it might be cumbersome to systematize the results. To make the analysis simpler and much more powerful, I will be using the Weights & Biases tracking platform.
Even better, thanks to Morgan McGuire (@morg) we have an open W&B project. You just need to log your runs under the `glue-benchmark` project and set `entity="fastai_community"`, and your results will be added to the pool for further investigation of hyperparameters. The fastest way to start participating is to fork this notebook, as it is set up to run any of the GLUE tasks with minimal changes.
There is a lot to try: a gradual unfreezing strategy is reported not to be helpful when finetuning Transformer-based models (for example, see the discussion [here](https://github.com/huggingface/transformers/pull/11533)); differential learning rates are used in NLP [[1](https://arxiv.org/abs/1905.05583), [2](https://arxiv.org/abs/2003.10555)] but are not common practice. Do we need weight decay and, if so, how much and where? Which suggestions from the LR finder work best? These are only a few of the many open questions.
An even more interesting one: how does all this scale with dataset and model size?

Deep learning is, as of now, a highly empirical field, and experiments require both some engineering and compute. This post aims to fuel a community effort towards finding empirical truth by joining small forces together. Even if you're new to NLP, do not hesitate to participate and run a couple of experiments while learning along the way!

In [None]:
WANDB_NAME = f'{ds_name}-{task}-{model_name}'
GROUP = f'{ds_name}-{task}-{model_name}-{lr:.0e}'
if diff_lr_decay_factor: GROUP += f"diff_lr_{diff_lr_decay_factor}"
NOTES = f'finetuning {model_name} with {opt_func.__name__} lr={lr:.0e}'
TAGS =[model_name, ds_name, opt_func.__name__]

In [None]:
#hide_output 
wandb.init(reinit=True, project="glue-benchmark", entity="fastai_community",
           name=WANDB_NAME, group=GROUP, notes=NOTES, tags=TAGS);

In [None]:
#hide_output
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=glue_num_labels.get(task, 2))
metrics = glue_metrics[task]
learn = TransLearner(dls, model, metrics=metrics, opt_func=opt_func, splitter=layerwise_splitter)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight'

In [None]:
if diff_lr_decay_factor != 0:
    k = len(layerwise_splitter(model))
    lr = slice(lr*diff_lr_decay_factor**k,lr)

metric_to_monitor = metrics[0].name if isinstance(metrics[0], Metric) else metrics[0].__name__
cbs = [WandbCallback(log_preds=False, log_model=False),
       SaveModelCallback(monitor=metric_to_monitor, fname=f'{model_name}-{task}')]
learn.fit_one_cycle(4, lr, wd=wd, cbs=cbs)

Could not gather input dimensions


epoch,train_loss,valid_loss,accuracy,time
0,0.2344,0.234248,0.915138,03:09
1,0.173527,0.24728,0.917431,03:11
2,0.097726,0.246916,0.924312,03:10
3,0.078707,0.26363,0.925459,03:11


Better model found at epoch 0 with accuracy value: 0.9151375889778137.
Better model found at epoch 1 with accuracy value: 0.9174311757087708.
Better model found at epoch 2 with accuracy value: 0.9243119359016418.
Better model found at epoch 3 with accuracy value: 0.9254587292671204.


In [None]:
learn.show_results()

Unnamed: 0,text,category,category_
0,"the movie has an infectious exuberance that will engage anyone with a passing interest in the skate/surf culture, the l.a. beach scene and the imaginative ( and sometimes illegal ) ways kids can make a playground out of the refuse of adults.",1,1
1,"what really makes it special is that it pulls us into its world, gives us a hero whose suffering and triumphs we can share, surrounds him with interesting characters and sends us out of the theater feeling we've shared a great adventure.",1,1
2,this is a train wreck of an action film -- a stupefying attempt by the filmmakers to force-feed james bond into the mindless xxx mold and throw 40 years of cinematic history down the toilet in favor of bright flashes and loud bangs.,0,0
3,"it's one of those baseball pictures where the hero is stoic, the wife is patient, the kids are as cute as all get-out and the odds against success are long enough to intimidate, but short enough to make a dream seem possible.",1,1
4,"though perry and hurley make inspiring efforts to breathe life into the disjointed, haphazard script by jay scherick and david ronn, neither the actors nor director reginald hudlin can make it more than fitfully entertaining.",0,1
5,"may be far from the best of the series, but it's assured, wonderfully respectful of its past and thrilling enough to make it abundantly clear that this movie phenomenon has once again reinvented itself for a new generation.",1,1
6,"despite all evidence to the contrary, this clunker has somehow managed to pose as an actual feature movie, the kind that charges full admission and gets hyped on tv and purports to amuse small children and ostensible adults.",0,0
7,"it's inoffensive, cheerful, built to inspire the young people, set to an unending soundtrack of beach party pop numbers and aside from its remarkable camerawork and awesome scenery, it's about as exciting as a sunburn.",0,1
8,"but the power of these ( subjects ) is obscured by the majority of the film that shows a stationary camera on a subject that could be mistaken for giving a public oration, rather than contributing to a film's narrative.",0,0


In [None]:
#hide
# test_dl = dls.test_dl(ds['test'])
# preds = learn.get_preds(dl=test_dl)

In [None]:
#hide
del learn
gc.collect()
torch.cuda.empty_cache()

### Sweeps

Finding the perfect learning rate for a task isn't easy. Add weight decay, different optimizers, differential learning rates and various schedulers to the mix, and the search for the best hyperparameters becomes a really big task. For that reason there exist automated tools for hyperparameter search. Here we'll look at the `sweep` functionality provided by W&B.
It not only facilitates hyperparameter tuning but also enables great visualization of the results, which might help with further analysis.

In [None]:
wandb.login()

Failed to query for notebook name, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable
wandb: Currently logged in as: fastai_community (use `wandb login --relogin` to force relogin)


True

In [None]:
def train():
    with wandb.init() as run:
        cfg = run.config
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=glue_num_labels.get(task, 2))
        metrics = glue_metrics[task]
        k = len(layerwise_splitter(model))
        # Use a slice (differential learning rates) only when a decay factor is set
        lr = slice(cfg.lr*cfg.diff_lr_decay_factor**k, cfg.lr) if cfg.diff_lr_decay_factor else cfg.lr
        learn = TransLearner(dls, model, metrics=metrics, opt_func=Adam, splitter=layerwise_splitter)
        learn.fit_one_cycle(n_epoch, lr, wd=cfg.wd, cbs=[WandbCallback(log_preds=False, log_model=False)])
        del learn
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

In [None]:
metrics = glue_metrics[task]
metric_to_monitor = metrics[0].name if isinstance(metrics[0], Metric) else metrics[0].__name__
sweep_name = f"glue-{task}-sweep"
sweep_config = {
  "name": sweep_name,
  "method": "random",
  "parameters": {
        "lr": {"values":[1e-5,2e-5,3e-5,5e-5, 1e-4, 3e-4]},
        "wd":{"values":[0.,1e-2,5e-2]},
        "diff_lr_decay_factor":{"values":[0., 0.9, 0.8, 0.7, 0.6]}
    },
  "metric":{"goal": "maximize", "name": metric_to_monitor},
  "early_terminate": {"type": "hyperband", "s": 2, "eta": 3, "max_iter": 40}
}
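
To get a feeling for what `"method": "random"` means for this grid, here is a minimal sketch of random search over the same values. It is an illustration only, not the W&B implementation: each hyperparameter is drawn independently and uniformly from its list, and the fixed seed is just there to make the example repeatable.

```python
import random
from itertools import product

# Same value grid as in sweep_config["parameters"]
grid = {
    "lr": [1e-5, 2e-5, 3e-5, 5e-5, 1e-4, 3e-4],
    "wd": [0., 1e-2, 5e-2],
    "diff_lr_decay_factor": [0., 0.9, 0.8, 0.7, 0.6],
}

# Size of the full grid, for comparison with the number of sampled runs
n_combinations = len(list(product(*grid.values())))

def sample_config(rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in grid.items()}

rng = random.Random(0)  # fixed seed only to make the illustration repeatable
samples = [sample_config(rng) for _ in range(5)]
print(f"{n_combinations} combinations in the full grid")
for cfg in samples:
    print(cfg)
```

With 90 combinations and a few minutes per run, an exhaustive grid search gets expensive quickly, which is one reason to sample configurations instead.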

In [None]:
sweep_id = wandb.sweep(sweep_config, project='glue-benchmark', entity="fastai_community")

Create sweep with ID: dwio5fl2
Sweep URL: https://wandb.ai/fastai_community/uncategorized/sweeps/dwio5fl2


In [None]:
wandb.agent(sweep_id, function=train)

In [None]:
wandb.finish()

## Another task example: MultiNLI

In [None]:
task = 'mnli'
validate_task()

In [None]:
ds = load_dataset(ds_name, task)

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


In [None]:
train_idx, valid_idx = get_splits(ds, valid='validation_matched')
train_ds = concatenate_datasets([ds['train'], ds['validation_matched']])

In [None]:
train_ds[0]

{'hypothesis': 'Product and geography are what make cream skimming work. ',
 'idx': 0,
 'label': 1,
 'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.'}

In [None]:
lens = train_ds.map(lambda s: {'len': len(s['premise'])+len(s['hypothesis'])}, remove_columns=train_ds.column_names, num_proc=4, keep_in_memory=True)
train_lens = lens.select(train_idx)['len']
valid_lens = lens.select(valid_idx)['len']







In [None]:
dblock = DataBlock(blocks = [TransformersTextBlock(pretrained_model_name=model_name),
                             CategoryBlock()],
                   get_x=TextGetter(*glue_textfields[task]),
                   get_y=ItemGetter('label'),
                   splitter=IndexSplitter(valid_idx))

In [None]:
%%time
dl_kwargs=[{'res':train_lens}, {'val_res':valid_lens}]
dls = dblock.dataloaders(train_ds, bs=bs, val_bs=val_bs, dl_kwargs=dl_kwargs, num_workers=4)

CPU times: user 1min 55s, sys: 2.15 s, total: 1min 57s
Wall time: 1min 57s


In [None]:
dls.show_batch(max_n=4)

Unnamed: 0,text,text_,category
0,well uh that's kind of obvious i mean they're even carrying it to to where now uh that they advertise on TV you know if your if you uh you know have done this or if you need this uh uh we'll sue for you and you don't have to pay us unless you but then what they don't tell you is that if you if they win you give them at least a third of the of the thing that they win so i don't know it is uh it's getting to be more business now rather than uh actually uh dealing with the crime than with uh um the uh punishment they the the lawyers are just in it for the money i'm i'm convinced i know i i agree with you i think you're real you're very right that the politicians should i think they,I think that there should be an equal representation of backgrounds in our politicians.,0
1,that's funny yeah and that is a good short term thing though that little things like that that overall though i just think we're just going to i don't know see i know i guess i'm kind of leery of this topic because i know that Bush is real for the new world order the one world government and alleviating all you know national debt between all of the nations but i see that to be a potential power problem later with um who's going to be in charge with this new world order and i you know i'm uncomfortable with that much power being in one place but i know we already have a new money system we already have new bills printed for the US Treasury already has our new bills printed for new currency and i mean i've seen them and so i know that the long-term,I hope all power is concentrated into one country.,2
2,that's right um we were down in Dallas right after Christmas and on the way back we stopped in Louisiana to visit my brother and we were driving my husband's Toyota pick up truck well we made a quick little stop when we got to Baton Rouge and he came came back out and the car the truck wouldn't stop i mean it wouldn't start so gave it a somebody came along and helped give it a little push and the next morning they took it to the garage and it was just a small private garage and he said it was the starter motor proba bly and he was going to take it off and either repair it or replace it or whatever and we got a call in the middle of the morning and he said i've got good news and bad news uh the starter motor,He said the starter motor was probably fine and that it must be something else.,2
3,not really yeah really well that's pretty wild we yeah we used it for fleas we had fleas in our yard real bad last year and we did that um i just i'm not basically i like to mow the lawn believe it or not but i sometimes have problems starting the mower so a lot of times i don't get out and do it but my husband basically does most of it and he does the you know edging and all that kind of thing and we're renting and so we don't really put a lot of money into the uh you know this like this lawn could probably stand a couple of loads of dirt and some Saint Augustine we just we have winter rye out back and we have i don't even know what it is out front but um we this is the first house we've,We don't worry much about the lawn because we rent.,0


### Tracking with W&B

In [None]:
WANDB_NAME = f'{ds_name}-{task}-{model_name}'
GROUP = f'{ds_name}-{task}-{model_name}-{lr:.0e}'
NOTES = f'finetuning {model_name} with Adam lr={lr:.0e}'
TAGS =[model_name, ds_name, 'adam']

In [None]:
#hide_output 
wandb.init(reinit=True, project="glue-benchmark", entity="fastai_community",
           name=WANDB_NAME, group=GROUP, notes=NOTES, tags=TAGS);

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: fastai_community (use `wandb login --relogin` to force relogin)


### Training

In [None]:
#hide_output
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
metrics = glue_metrics[task]
learn = TransLearner(dls, model, metrics=metrics)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight'

In [None]:
metric_to_monitor = metrics[0].name if isinstance(metrics[0], Metric) else metrics[0].__name__
cbs = [WandbCallback(log_preds=False, log_model=False),
       SaveModelCallback(monitor=metric_to_monitor, fname=f'{model_name}-{task}')]
learn.fit_one_cycle(8, lr, wd=wd, cbs=cbs)

Could not gather input dimensions


epoch,train_loss,valid_loss,accuracy,time


In [None]:
learn.show_results()

In [None]:
valid_mm_dl = dls.test_dl(ds['validation_mismatched'], with_labels=True)
learn.validate(dl=valid_mm_dl)

## Low resource tasks 

Notice that the `rte` task has only 2.5k samples in the training set. This is not much at all for a nontrivial language task like this one. But we can try a small trick. The MNLI task is quite similar and has much more training data, so let's reuse the model trained on it to improve the RTE score. This trick is common practice and was employed in the original [RoBERTa paper](https://arxiv.org/abs/1907.11692) when reporting GLUE scores.

In [None]:
#hide_output
task = 'rte'; validate_task()

ds = load_dataset(ds_name, task)

valid_ = 'validation_matched' if task=='mnli' else 'validation'
len(ds['train']), len(ds[valid_])

train_idx, valid_idx = get_splits(ds, valid=valid_)
train_ds = concatenate_datasets([ds['train'], ds[valid_]])

Reusing dataset glue (/root/.cache/huggingface/datasets/glue/rte/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


In [None]:
train_ds[0]

{'idx': 0,
 'label': 1,
 'sentence1': 'No Weapons of Mass Destruction Found in Iraq Yet.',
 'sentence2': 'Weapons of Mass Destruction Found in Iraq.'}

In [None]:
dblock = DataBlock(blocks = [TransformersTextBlock(pretrained_model_name=model_name), CategoryBlock()],
                   get_x=TextGetter(*glue_textfields[task]),
                   get_y=ItemGetter('label'),
                   splitter=IndexSplitter(valid_idx))

In [None]:
dls = dblock.dataloaders(train_ds, bs=bs, val_bs=val_bs)
dls.show_batch(max_n=4)

Unnamed: 0,text,text_,category
0,No Weapons of Mass Destruction Found in Iraq Yet.,Weapons of Mass Destruction Found in Iraq.,1
1,The most recent poll carried out by NOP market research in January revealed that 61% of Britons are opposed to joining the euro.,The introduction of the euro has been opposed.,0
2,"The disappearance of York University chef Claudia Lawrence is now being treated as suspected murder, North Yorkshire Police said. However detectives said they had not found any proof that the 35-year-old, who went missing on 18 March, was dead. Her father Peter Lawrence made a direct appeal to his daughter to contact him five weeks after she disappeared. His plea came at a news conference held shortly after a £10,000 reward was offered to help find Miss Lawrence. Crimestoppers said the sum they were offering was ""significantly higher"" than usual because of public interest in the case.",Claudia Lawrence is 35 years old.,0
3,"A Continental Connection flight from Newark to Buffalo crashed into a house about four to six miles from Buffalo Niagara International Airport on Thursday night, killing 50 people, officials said. Continental Airlines Flight 3407 is a daily commuter flight from Newark Liberty International Airport in Newark, New Jersey to Buffalo, New York, operated under the Continental Connection brand by Virginia-based regional airline Colgan Air.",A daily commuter flight crashed in New York.,0


In [None]:
WANDB_NAME = f'{ds_name}-{task}-{model_name}'
GROUP = f'{ds_name}-{task}-{model_name}-{lr:.0e}'
if diff_lr_decay_factor: GROUP += f"diff_lr_{diff_lr_decay_factor}"
NOTES = f'finetuning {model_name} with {opt_func.__name__} lr={lr:.0e}'
TAGS =[model_name, ds_name, opt_func.__name__]

In [None]:
#hide_output 
wandb.init(reinit=True, project="fasthugs", entity="fastai_community",
           name=WANDB_NAME, group=GROUP, notes=NOTES, tags=TAGS);

In [None]:
#hide_output
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=glue_num_labels.get(task, 2))
metrics = glue_metrics[task]
learn = TransLearner(dls, model, metrics=metrics, opt_func=opt_func)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight'

In [None]:
try:
    learn.load('distilroberta-base-mnli', with_opt=False, strict=False)
except RuntimeError as e:
    print(e)

Error(s) in loading state_dict for RobertaForSequenceClassification:
	size mismatch for classifier.out_proj.weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).
	size mismatch for classifier.out_proj.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).
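
The reported mismatch comes from the classification head: the MNLI checkpoint has 3 outputs, while the RTE model has 2. One way to make the warm start explicit is to keep only the checkpoint entries whose shapes match the new model and to leave the head freshly initialized. The sketch below shows that filtering step in plain Python. `FakeTensor`, `split_by_shape` and the `encoder.layer.0.weight` key are hypothetical stand-ins for illustration, not fasthugs or PyTorch API.

```python
from collections import namedtuple

# Stand-in for a tensor: only the .shape attribute matters in this sketch
FakeTensor = namedtuple("FakeTensor", ["shape"])

def split_by_shape(checkpoint, model_state):
    """Split checkpoint entries into shape-compatible ones and skipped keys."""
    loadable, skipped = {}, []
    for name, value in checkpoint.items():
        target = model_state.get(name)
        if target is not None and tuple(target.shape) == tuple(value.shape):
            loadable[name] = value
        else:
            skipped.append(name)
    return loadable, skipped

# Made-up miniature state dicts mirroring the shapes in the error message
checkpoint = {
    "encoder.layer.0.weight": FakeTensor((768, 768)),
    "classifier.out_proj.weight": FakeTensor((3, 768)),
    "classifier.out_proj.bias": FakeTensor((3,)),
}
model_state = {
    "encoder.layer.0.weight": FakeTensor((768, 768)),
    "classifier.out_proj.weight": FakeTensor((2, 768)),
    "classifier.out_proj.bias": FakeTensor((2,)),
}

loadable, skipped = split_by_shape(checkpoint, model_state)
print("load:", sorted(loadable))
print("skip:", skipped)
```

With real PyTorch state dicts, the filtered dictionary could then be passed to `model.load_state_dict(loadable, strict=False)`, so that only the output projection keeps its random initialization.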


In [None]:
if diff_lr_decay_factor != 0:
    k = len(layerwise_splitter(model))
    lr = slice(lr*diff_lr_decay_factor**k,lr)

metric_to_monitor = metrics[0].name if isinstance(metrics[0], Metric) else metrics[0].__name__
cbs = [WandbCallback(log_preds=False, log_model=False),
       SaveModelCallback(monitor=metric_to_monitor, fname=f'{model_name}-{task}')]
learn.fit_one_cycle(10, lr, wd=wd, cbs=cbs, pct_start=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,0.569979,0.56589,0.693141,00:30
1,0.51128,0.529077,0.736462,00:31
2,0.409093,0.60169,0.743682,00:31
3,0.265996,0.763166,0.736462,00:31
4,0.171846,0.770063,0.754513,00:32
5,0.098103,0.922156,0.768953,00:32
6,0.067698,1.030401,0.761733,00:31
7,0.048222,1.007513,0.772563,00:31
8,0.034855,1.05637,0.765343,00:32
9,0.021131,1.069907,0.761733,00:32


As one can see, by using this simple trick we've improved on the result reported in the HuggingFace [model card](https://huggingface.co/distilroberta-base#evaluation-results) by some 10%. Pretty nice, huh?

Just to be sure that the improvement is due to starting from the model finetuned on `mnli`, let's do another run starting from vanilla `distilroberta`:

In [None]:
#hide_output
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=glue_num_labels.get(task, 2))
metrics = glue_metrics[task]
learn = TransLearner(dls, model, metrics=metrics, opt_func=opt_func)

In [None]:
learn.fit_one_cycle(10, lr, wd=wd, cbs=cbs, pct_start=0.1)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight'

epoch,train_loss,valid_loss,accuracy,time
0,0.695126,0.691306,0.527076,00:31
1,0.692349,0.692152,0.480144,00:31
2,0.678994,0.64174,0.624549,00:31
3,0.602276,0.600447,0.67148,00:31
4,0.488653,0.662074,0.6787,00:31
5,0.37743,0.683057,0.6787,00:31
6,0.269494,0.967499,0.65704,00:31
7,0.182777,1.01697,0.685921,00:32
8,0.140067,1.038462,0.696751,00:31
9,0.11393,1.068865,0.68231,00:32
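
To put the two RTE runs side by side, the snippet below takes the per-epoch validation accuracies from the two training logs above and compares the best epoch of each. It is a quick sanity check on single runs, so the gap should be read as indicative rather than as a statistically solid estimate.

```python
# Validation accuracy per epoch, copied from the two RTE training logs above
warm = [0.693141, 0.736462, 0.743682, 0.736462, 0.754513,
        0.768953, 0.761733, 0.772563, 0.765343, 0.761733]  # start from MNLI model
cold = [0.527076, 0.480144, 0.624549, 0.67148, 0.6787,
        0.6787, 0.65704, 0.685921, 0.696751, 0.68231]      # vanilla distilroberta

best_warm, best_cold = max(warm), max(cold)
gain_points = (best_warm - best_cold) * 100  # difference in percentage points

print(f"warm start: best accuracy {best_warm:.4f} at epoch {warm.index(best_warm)}")
print(f"cold start: best accuracy {best_cold:.4f} at epoch {cold.index(best_cold)}")
print(f"gain from warm start: {gain_points:.2f} percentage points")
```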


The same holds for the STSB task, which has 7k training samples. You can compare the results for cold and warm starts in this [report]().