## Implementing "Don't Stop Pretraining" with HuggingFace

* The language model fine-tuning part mostly follows [this notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling.ipynb).
* The classification fine-tuning part mostly follows [this notebook](https://github.com/YipingNUS/huggingface-learning-notes/blob/master/fine-tuning-sst.ipynb).

In [1]:
from transformers import AutoModelForPreTraining, AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments

from fastai.data.external import untar_data
from fastbook import *

import numpy as np

In [38]:
from sklearn.datasets import load_files
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import datasets
from datasets import load_dataset, Dataset

In [5]:
model_checkpoint = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True, max_length=512)

## 1. Load and preprocess the dataset for language model

* I purposely loaded the dataset manually instead of using the built-in dataset to demonstrate how to load a custom dataset.

In [3]:
path = untar_data(URLs.IMDB)
(path).ls()

(#7) [Path('/storage/data/imdb/README'),Path('/storage/data/imdb/imdb.vocab'),Path('/storage/data/imdb/tmp_lm'),Path('/storage/data/imdb/tmp_clas'),Path('/storage/data/imdb/unsup'),Path('/storage/data/imdb/test'),Path('/storage/data/imdb/train')]

In [4]:
train_data = load_files(path/'train', encoding='utf-8')
test_data = load_files(path/'test', encoding='utf-8')

In [6]:
all_texts = np.concatenate([train_data.data, test_data.data])
all_texts.shape

(50000,)

In [7]:
# use a very small test set
train_dataset = Dataset.from_dict({'text': all_texts[:-1000]})
test_dataset = Dataset.from_dict({'text': all_texts[-1000:]})

In [8]:
def tokenize_function(examples):
    return tokenizer(examples["text"], max_length=512, truncation=True)

In [None]:
tokenized_test_datasets = test_dataset.map(tokenize_function, batched=True, num_proc=8, remove_columns=["text"])
tokenized_train_datasets = train_dataset.map(tokenize_function, batched=True, num_proc=8, remove_columns=["text"])
tokenized_test_datasets.set_format()

Now for the harder part: we need to concatenate all our texts together and then split the result into small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option lets us change the number of examples in the dataset by returning a different number of examples than we received. This way, we can create our new samples from a batch of examples.

First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in your GPU RAM, so here we take a smaller value of just 128.

In [13]:
# block_size = tokenizer.model_max_length
block_size = 128

In [10]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could pad instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

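First, note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library apply the shift to the right themselves, so we don't need to do it manually. (That shift applies to causal language modeling; for the masked-LM objective used below, `DataCollatorForLanguageModeling` rebuilds the labels from the masked inputs.)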
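
To make the concatenate-then-chunk step concrete, here is a minimal sketch that runs the same grouping logic on a made-up batch. The token ids and the tiny `block_size = 4` are illustrative assumptions (the notebook uses 128); only plain Python lists are needed.

```python
# Toy stand-in for the notebook's block_size (the notebook itself uses 128).
block_size = 4

def group_texts(examples):
    # Flatten each column (a list of token lists) into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Keep only as many tokens as fill whole blocks; the tail is dropped.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Three made-up "tokenized reviews": 3 + 5 + 2 = 10 token ids in total.
batch = {
    "input_ids": [[0, 11, 2], [0, 21, 22, 23, 2], [0, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts(batch)
print(grouped["input_ids"])  # two blocks of 4 ids; the last 2 ids are dropped
```

Three input examples become two output examples here, which is why `batched=True` is required: the function returns a different number of rows than it received.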

Also note that by default, the `map` method sends batches of 1,000 examples to the preprocessing function. So here, the remainder is dropped once per 1,000 examples, which makes each batch's concatenated tokens a multiple of `block_size`. You can adjust this behavior by passing a larger batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [27]:
test_lm_datasets = tokenized_test_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=8,
)
train_lm_datasets = tokenized_train_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=8,
)

## 2. Fine-tune the language model (DAPT/TAPT)

The `lm_head` has the following structure:

```
 (lm_head): RobertaLMHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (decoder): Linear(in_features=768, out_features=50265, bias=True)
  )
```

The `classifier` head is similarly two layers:

```
 (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=2, bias=True)
  )
```
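
To put the two printed heads in perspective, the parameter counts implied by the shapes above can be tallied by hand. This is plain arithmetic on the printed `in_features`/`out_features`; the helper names are illustrative, and in RoBERTa-style models the `decoder` weight is typically tied to the input embeddings, so it is not all additional storage.

```python
def linear_params(in_features, out_features):
    # Weight matrix plus bias vector.
    return in_features * out_features + out_features

def layer_norm_params(size):
    # Elementwise scale and shift.
    return 2 * size

lm_head = {
    "dense": linear_params(768, 768),
    "layer_norm": layer_norm_params(768),
    "decoder": linear_params(768, 50265),
}
classifier = {
    "dense": linear_params(768, 768),
    "out_proj": linear_params(768, 2),  # Dropout has no parameters
}

print("lm_head:", sum(lm_head.values()), lm_head)
print("classifier:", sum(classifier.values()), classifier)
```

The vocabulary-sized `decoder` dominates the LM head, while the classification head is small enough that initializing it from scratch is cheap.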

In [None]:
model = AutoModelForPreTraining.from_pretrained(model_checkpoint)  # equivalent to AutoModelForMaskedLM for this checkpoint, but more general

In [22]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
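
As a rough illustration of what `mlm_probability=0.15` means, the sketch below mimics the usual BERT-style masking rule in plain Python: about 15% of non-special tokens are selected, of which roughly 80% become the mask token, 10% a random token, and 10% stay unchanged, while unselected positions get the label `-100` so the loss ignores them. The function name `mask_tokens` and the RoBERTa ids (`<s>`=0, `</s>`=2, `<mask>`=50264, vocabulary size 50265) are assumptions for the example; this is not the library's implementation.

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=random):
    inputs = list(input_ids)
    labels = [-100] * len(inputs)  # -100 is ignored by the cross-entropy loss
    for i, tok in enumerate(input_ids):
        # Never mask special tokens; select the rest with probability 0.15.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

rng = random.Random(0)
ids = [0] + list(range(100, 300)) + [2]  # made-up sequence with <s> ... </s>
masked, labels = mask_tokens(ids, mask_id=50264, vocab_size=50265,
                             special_ids={0, 1, 2}, rng=rng)
n_selected = sum(label != -100 for label in labels)
print(f"{n_selected} of {len(ids) - 2} tokens selected for prediction")
```

Because the masking is redrawn every time a batch is collated, the model sees different masked positions for the same block across epochs.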

In [32]:
training_args = TrainingArguments(
    "test-clm",
    num_train_epochs=3, 
    evaluation_strategy="steps",
    eval_steps=100, 
    learning_rate=3e-4,
    warmup_ratio=0.2,
    weight_decay=0.01,
    per_device_train_batch_size=80,
    per_device_eval_batch_size=200,
    fp16=True
)
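
To see what `warmup_ratio=0.2` does to `learning_rate=3e-4`, here is a small sketch of a linear warmup followed by linear decay, which is the shape of the `Trainer`'s default `"linear"` scheduler. The function `linear_warmup_lr` is a hypothetical helper written for this note, and the total of 3867 steps is taken from the `global_step` this run reports.

```python
import math

def linear_warmup_lr(step, total_steps, base_lr=3e-4, warmup_ratio=0.2):
    # Number of steps spent ramping up from 0 to base_lr.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    # After warmup, decay linearly to 0 at the final step.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

total_steps = 3867
for step in (0, 387, 774, 2000, 3867):
    print(step, f"{linear_warmup_lr(step, total_steps):.2e}")
```

With these settings the learning rate peaks about a fifth of the way through training and then falls for the remaining steps.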

In [33]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_lm_datasets,
    eval_dataset=test_lm_datasets,
    data_collator=data_collator,
)

In [34]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Runtime | Samples Per Second |
|---|---|---|---|---|
| 1 | 4.428 | 5.367817 | 10.9045 | 189.646 |
| 2 | 7.1751 | 7.149477 | 10.625 | 194.635 |
| 3 | 7.1639 | 7.185637 | 10.6315 | 194.516 |


TrainOutput(global_step=3867, training_loss=6.055203528807457, metrics={'train_runtime': 4555.7204, 'train_samples_per_second': 0.849, 'total_flos': 1.9513037829586176e+16, 'epoch': 3.0})
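
A common way to read language-model losses is to convert them to perplexity, the exponential of the cross-entropy. The short sketch below does this for the validation losses reported above; the variable names are illustrative, and it assumes the reported loss is the mean cross-entropy in nats.

```python
import math

# Validation losses copied from the training log above, keyed by epoch.
eval_losses = {1: 5.367817, 2: 7.149477, 3: 7.185637}

perplexities = {epoch: math.exp(loss) for epoch, loss in eval_losses.items()}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity {ppl:.1f}")
```

On this scale the increase in validation loss between epoch 1 and epoch 2 shows up as a several-fold increase in perplexity.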

In [41]:
model.save_pretrained('distilroberta-base-imdb-lm-finetuned')

## 3. Fine-tuning the classifier

In [39]:
model = AutoModelForSequenceClassification.from_pretrained('distilroberta-base-imdb-lm-finetuned') 

Some weights of the model checkpoint at distilroberta-base-imdb-lm-finetuned were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base-imdb-lm-finetuned and are newly initialized: ['classifier.dense.weight'

**Note:** Since we split and preprocess the data slightly differently than for the LM, we need to reprocess the dataset.

In [40]:
train_dataset = Dataset.from_dict({'text': train_data.data, 'label': train_data.target})
test_dataset = Dataset.from_dict({'text': test_data.data, 'label': test_data.target})

In [41]:
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True, max_length=200)

train_dataset = train_dataset.map(tokenize, batched=True, batch_size=1024, remove_columns=["text"])
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=1024, remove_columns=["text"])
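
To illustrate what `padding=True, truncation=True, max_length=200` does within one `map` batch, here is a simplified sketch on plain lists: sequences are cut to `max_length`, then padded to the longest sequence remaining in that batch, with an attention mask marking real tokens. The helper `pad_and_truncate` and the pad id of 1 (RoBERTa's `<pad>`) are assumptions for the example; unlike the real tokenizer, this sketch truncates by simple slicing and does not preserve the closing special token.

```python
def pad_and_truncate(batch_ids, max_length=200, pad_id=1):
    # Cut every sequence down to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch_ids]
    # Pad to the longest sequence in this batch, not to max_length.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; max_length=5 keeps the example readable.
toy_batch = [[0, 5, 6, 2], [0, 7, 8, 9, 10, 11, 2], [0, 2]]
padded = pad_and_truncate(toy_batch, max_length=5)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Since padding happens per `map` batch, every row within a batch of 1024 ends up with the same length, capped at 200.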

In [42]:
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

In [46]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

training_args = TrainingArguments(
    output_dir='./results',           # output directory
    num_train_epochs=3,               # total number of training epochs
    evaluation_strategy="epoch",      # evaluate at the end of every epoch
    learning_rate=3e-4,
    per_device_train_batch_size=128,  # batch size per device during training
    per_device_eval_batch_size=400,   # batch size for evaluation
    warmup_ratio=0.2,                 # fraction of training steps used for learning-rate warmup
    weight_decay=0.01,                # strength of weight decay
    group_by_length=True,             # batch samples of similar length together
    fp16=True,                        # mixed-precision training
    logging_dir='./logs',             # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset            # evaluation dataset
)

In [None]:
trainer.train()
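
For intuition about what `compute_metrics` reports, the sketch below reproduces the same quantities on a handful of made-up logits using only plain Python: the predicted class is the argmax over the two logits, and precision, recall, and F1 are computed for the positive class (label 1), as with `average='binary'`. The helper `binary_metrics` and the toy data are illustrative assumptions.

```python
def binary_metrics(logits, labels):
    # Predicted class = index of the largest logit in each row.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "f1": f1,
            "precision": precision, "recall": recall}

toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4], [0.0, 2.0]]
toy_labels = [0, 1, 0, 1, 1]
metrics = binary_metrics(toy_logits, toy_labels)
print(metrics)
```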

### Result for baseline model without DAPT/TAPT

In [45]:
# baseline. everything else remains the same
model = AutoModelForSequenceClassification.from_pretrained('distilroberta-base') 

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight'

In [None]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset            # evaluation dataset
)

In [47]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|
| 1 | No log | 0.321027 | 0.88216 | 0.877485 | 0.913736 | 0.844 | 113.4275 | 220.405 |
| 2 | No log | 0.266819 | 0.89736 | 0.901626 | 0.865651 | 0.94072 | 113.1416 | 220.962 |
| 3 | 0.262100 | 0.297293 | 0.90428 | 0.902838 | 0.916646 | 0.88944 | 113.301 | 220.651 |


TrainOutput(global_step=588, training_loss=0.23647036195612278, metrics={'train_runtime': 1434.3154, 'train_samples_per_second': 0.41, 'total_flos': 7390794420000000.0, 'epoch': 3.0})
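
As a quick sanity check on the baseline numbers, F1 is the harmonic mean of precision and recall, so the F1 column can be recomputed from the other two. The sketch below does this for the three epochs reported above; the values are copied from that output and the variable names are illustrative.

```python
# (precision, recall, reported F1) per epoch, copied from the results above.
rows = [
    (0.913736, 0.844, 0.877485),
    (0.865651, 0.94072, 0.901626),
    (0.916646, 0.88944, 0.902838),
]

recomputed = []
for epoch, (precision, recall, reported_f1) in enumerate(rows, start=1):
    f1 = 2 * precision * recall / (precision + recall)
    recomputed.append(f1)
    print(f"epoch {epoch}: recomputed F1 {f1:.6f}, reported {reported_f1}")
```

The recomputed values should match the reported column up to rounding, which confirms the columns are internally consistent.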

In [31]:
with torch.no_grad():
    torch.cuda.empty_cache()