# Fine-tuning a masked language model (PyTorch)

This notebook is based on the [Hugging Face](https://huggingface.co/learn/nlp-course/chapter7/3?fw=pt#fine-tuning-distilbert-with-the-trainer-api) tutorial on fine-tuning DistilBERT with the Trainer API.

In [28]:
!nvidia-smi

/bin/bash: nvidia-smi: command not found


Install the Transformers, Datasets, and Evaluate libraries to run this notebook. We will also use the fine-tuned models with [bert-extractive-summarizer](https://github.com/dmmiller612/bert-extractive-summarizer).

In [29]:
!pip install datasets evaluate transformers[sentencepiece] -q
!pip install accelerate -q
!pip install bert-extractive-summarizer -q
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


In [65]:
import re
import pandas as pd
from transformers import (
    AutoConfig,
    AutoModel,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    default_data_collator,
    pipeline,
)
import torch
from datasets import load_dataset, DatasetDict, Dataset
import collections
import numpy as np
import math
from summarizer import Summarizer

In [47]:
RANDOM_STATE = 42

We want to push the trained model to the Hugging Face Hub, so let's log in:

In [31]:
from huggingface_hub import notebook_login

notebook_login()


In [32]:
model_checkpoints = [
    "bert-base-uncased",
    "distilbert-base-uncased",
    "google/mobilebert-uncased",
    "albert-base-v2"
]

for model_config in model_checkpoints:
  model = AutoModelForMaskedLM.from_pretrained(model_config)
  num_parameters = model.num_parameters() / 1_000_000
  print(f'{model_config} number of parameters: {round(num_parameters)}M')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


bert-base-uncased number of parameters: 110M
distilbert-base-uncased number of parameters: 67M


Some weights of the model checkpoint at google/mobilebert-uncased were not used when initializing MobileBertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


google/mobilebert-uncased number of parameters: 37M
albert-base-v2 number of parameters: 11M


Let's compare model sizes (number of parameters):
- BERT 110M
- DistilBERT 67M
- MobileBERT 37M
- ALBERT 11M

The ALBERT model has 10 times fewer parameters than BERT and 6 times fewer than DistilBERT!
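The ratios quoted above can be checked with a few lines of arithmetic. This is only a sketch that reuses the rounded parameter counts printed by the previous cell; the dictionary name `param_counts_m` is introduced here for illustration.

```python
# Parameter counts in millions, copied from the output above (rounded values).
param_counts_m = {"BERT": 110, "DistilBERT": 67, "MobileBERT": 37, "ALBERT": 11}

# Use the smallest model as the reference point.
smallest = min(param_counts_m, key=param_counts_m.get)
ratios = {
    name: count / param_counts_m[smallest]
    for name, count in param_counts_m.items()
}

for name, ratio in ratios.items():
    print(f"{name} is {ratio:.1f}x the size of {smallest}")
```

Because the inputs are already rounded to whole millions, the ratios are approximate.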

In [33]:
model_checkpoint = model_checkpoints[2]

In [34]:
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)



In [35]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

We can test the model on a mask-filling task:

In [36]:
text = "This is a great [MASK]."

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great success.'
'>>> This is a great achievement.'
'>>> This is a great deal.'
'>>> This is a great example.'
'>>> This is a great victory.'
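The cell above ranks candidates by raw logits. To read such scores as probabilities, a softmax can be applied first. The sketch below uses plain Python and made-up logits (the values in `toy_logits` are hypothetical, not real model output), and it normalizes over the five candidates only rather than over the full vocabulary.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for five candidate tokens (not real model output).
candidates = ["success", "achievement", "deal", "example", "victory"]
toy_logits = [9.1, 8.7, 8.5, 8.2, 8.0]

probs = softmax(toy_logits)
ranked = sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)

for token, prob in ranked:
    print(f"{token}: {prob:.3f}")
```

Softmax preserves the ordering of the logits, so the top-5 ranking is the same either way.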


Download a dataset of ML paper abstracts from ArXiv:

In [37]:
arxiv_dataset = load_dataset("CShorten/ML-ArXiv-Papers", split="train")
print(len(arxiv_dataset))
arxiv_dataset



117592


Dataset({
    features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
    num_rows: 117592
})

Remove everything unnecessary (the unused index columns and `\n` characters in titles and abstracts):

In [51]:
def simple_clean_text(example):
  example['title'] = example['title'].replace('\n', ' ').strip()
  example['abstract'] = example['abstract'].replace('\n', ' ').strip()
  return example

arxiv_prepared_dataset = arxiv_dataset.remove_columns(['Unnamed: 0.1', 'Unnamed: 0'])
arxiv_prepared_dataset = arxiv_prepared_dataset.map(simple_clean_text)
arxiv_prepared_dataset



Dataset({
    features: ['title', 'abstract'],
    num_rows: 117592
})

Then split it into train/val and test parts:

In [53]:
arxiv_prepared_dataset_split = arxiv_prepared_dataset.train_test_split(test_size=0.1, seed=RANDOM_STATE)
arxiv_prepared_dataset_split



DatasetDict({
    train: Dataset({
        features: ['title', 'abstract'],
        num_rows: 105832
    })
    test: Dataset({
        features: ['title', 'abstract'],
        num_rows: 11760
    })
})

In [54]:
arxiv_prepared_dataset_split['train'][123]

{'title': 'The Implicit Bias of AdaGrad on Separable Data',
 'abstract': 'We study the implicit bias of AdaGrad on separable linear classification problems. We show that AdaGrad converges to a direction that can be characterized as the solution of a quadratic optimization problem with the same feasible set as the hard SVM problem. We also give a discussion about how different choices of the hyperparameters of AdaGrad might impact this direction. This provides a deeper understanding of why adaptive methods do not seem to have the generalization ability as good as gradient descent does in practice.'}

Now let's save the prepared copy on Hugging Face:

In [55]:
arxiv_prepared_dataset_split.push_to_hub('aalksii/ml-arxiv-papers')




Now we can try to pull the dataset:

In [58]:
arxiv_dataset = load_dataset('aalksii/ml-arxiv-papers', split='train')
arxiv_dataset


Dataset({
    features: ['title', 'abstract'],
    num_rows: 105832
})

Sanity check -- filter out rows whose abstract is None or an empty string:

In [81]:
# Make test dataset:
ds = Dataset.from_pandas(pd.DataFrame({
    'title': ['abc', None, 'def', 'asd'],
    'abstract': ['qwe', 'r', None, '']
}))

In [82]:
# Return an explicit boolean: keep rows whose abstract is neither None nor empty
sanity_filter = lambda example: example['abstract'] is not None and len(example['abstract']) > 0
ds.filter(sanity_filter)[:]

Filter:   0%|          | 0/4 [00:00<?, ? examples/s]

{'title': ['abc', None], 'abstract': ['qwe', 'r']}

The filter keeps only the rows with a non-empty abstract, as intended. Applied to the real dataset, it removes nothing -- the row count stays at 105832:

In [83]:
arxiv_dataset.filter(sanity_filter)



Dataset({
    features: ['title', 'abstract'],
    num_rows: 105832
})

It is important to remove:
1. LaTeX math delimited by dollar signs, such as `$T$`,
2. LaTeX commands such as `\sqrt{n}` and `\epsilon`,
3. URLs.


In [None]:
latex_pattern = r'(\$+)(?:(?!\1)[\s\S])*\1|\\[a-zA-Z]+(?:\{[^\}]+\})?|http[s]?://\S+'

def clean_text(example):
  example['abstract'] = re.sub(latex_pattern, '', example['abstract'])
  example['abstract'] = example['abstract'].replace('\n', ' ').strip()
  return example
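To see what the pattern does, here is a minimal sketch that applies it to a made-up sentence (the string `raw` is hypothetical and written only for this illustration). The final whitespace-collapsing step is an optional extra that the notebook does not perform.

```python
import re

# Same pattern as above: $...$ math, \commands with an optional {argument}, and URLs.
latex_pattern = r'(\$+)(?:(?!\1)[\s\S])*\1|\\[a-zA-Z]+(?:\{[^\}]+\})?|http[s]?://\S+'

# Hypothetical input, written only for this illustration.
raw = r"We bound the error by $O(\sqrt{n})$ and use \epsilon noise; see https://example.com/paper for details."

# Remove every matched span; this leaves gaps where the math used to be.
stripped = re.sub(latex_pattern, '', raw)

# Optional extra step (not in the notebook): collapse the doubled spaces.
collapsed = " ".join(stripped.split())
print(collapsed)
```

Removing whole math spans leaves some sentences with missing pieces, which is a trade-off of this simple approach.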

In [None]:
arxiv_dataset[6]['abstract']

'  We consider inapproximability of the correlation clustering problem defined\nas follows: Given a graph $G = (V,E)$ where each edge is labeled either "+"\n(similar) or "-" (dissimilar), correlation clustering seeks to partition the\nvertices into clusters so that the number of pairs correctly (resp.\nincorrectly) classified with respect to the labels is maximized (resp.\nminimized). The two complementary problems are called MaxAgree and MinDisagree,\nrespectively, and have been studied on complete graphs, where every edge is\nlabeled, and general graphs, where some edge might not have been labeled.\nNatural edge-weighted versions of both problems have been studied as well. Let\nS-MaxAgree denote the weighted problem where all weights are taken from set S,\nwe show that S-MaxAgree with weights bounded by $O(|V|^{1/2-\\delta})$\nessentially belongs to the same hardness class in the following sense: if there\nis a polynomial time algorithm that approximates S-MaxAgree within a factor of

In [None]:
clean_text(arxiv_dataset[6])['abstract']

'We consider inapproximability of the correlation clustering problem defined as follows: Given a graph  where each edge is labeled either "+" (similar) or "-" (dissimilar), correlation clustering seeks to partition the vertices into clusters so that the number of pairs correctly (resp. incorrectly) classified with respect to the labels is maximized (resp. minimized). The two complementary problems are called MaxAgree and MinDisagree, respectively, and have been studied on complete graphs, where every edge is labeled, and general graphs, where some edge might not have been labeled. Natural edge-weighted versions of both problems have been studied as well. Let S-MaxAgree denote the weighted problem where all weights are taken from set S, we show that S-MaxAgree with weights bounded by  essentially belongs to the same hardness class in the following sense: if there is a polynomial time algorithm that approximates S-MaxAgree within a factor of  with high probability, then for any choice of

In [None]:
# Average abstract length in characters (not tokens)
abstract_lengths = [len(a) for a in arxiv_dataset['abstract']]
average_length = int(sum(abstract_lengths) / len(abstract_lengths))
average_length

1157

In [None]:
arxiv_dataset = arxiv_dataset.select(range(10000))
len(arxiv_dataset)

10000

In [None]:
arxiv_dataset = arxiv_dataset.map(clean_text)



First, we need to tokenize each abstract in the dataset:

In [None]:
def tokenize_function(examples):
    result = tokenizer(examples["abstract"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = arxiv_dataset.map(
    tokenize_function, batched=True, remove_columns=['title', 'abstract']
)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [None]:
tokenized_datasets

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
    num_rows: 10000
})

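Next, we concatenate all tokenized abstracts and split the result into fixed-size chunks. This lets the model use all of the text instead of losing the tail of long abstracts to truncation (token sequences longer than the model's limit would be truncated). Here the tokenizer's model_max_length is only a very large placeholder value, which means its configuration sets no limit, so we choose the chunk size ourselves: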

In [None]:
tokenizer.model_max_length

1000000000000000019884624838656

In [None]:
chunk_size = 128

As an example of dividing the tokens into chunks, let's take the first 3 rows:

In [None]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets[:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Abstract {idx} length: {len(sample)} tokens'")

'>>> Abstract 0 length: 195 tokens'
'>>> Abstract 1 length: 351 tokens'
'>>> Abstract 2 length: 327 tokens'


In [None]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated abstract length: {total_length}'")

'>>> Concatenated abstract length: 873'


In [None]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 105'


Now create a function that will group rows of tokens for each batch:

In [None]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
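The same concatenate-then-chunk logic can be tried on tiny lists to see how the remainder is dropped. This is a sketch in plain Python; the helper name `chunk_concatenated` and the toy token ids are made up for illustration.

```python
def chunk_concatenated(token_lists, chunk_size):
    """Concatenate token lists and split them into full chunks, dropping the remainder."""
    flat = [tok for tokens in token_lists for tok in tokens]
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i : i + chunk_size] for i in range(0, usable, chunk_size)]


# Three toy "abstracts" with 5, 7 and 3 tokens (15 tokens in total).
toy = [list(range(0, 5)), list(range(5, 12)), list(range(12, 15))]
chunks = chunk_concatenated(toy, chunk_size=4)

for chunk in chunks:
    print(chunk)

# The same arithmetic for the 873 concatenated tokens shown earlier:
full_chunks, dropped = divmod(873, 128)
print(f"{full_chunks} full chunks, {dropped} tokens dropped")
```

With `chunk_size=128`, the three-abstract example from earlier keeps six full chunks and drops the final 105 tokens.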

Note that the chunked dataset has more rows than the original one:

In [None]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

An example of the tokenized text:

In [None]:
tokenizer.decode(lm_datasets[1]["input_ids"])

'suitable regularity conditions on the admissible predictors, the underlying family of probability distributions and the loss function, we give an information - theoretic characterization of achievable predictor performance in terms of conditional distortion - rate functions. the ideas are illustrated on the example of nonparametric regression in gaussian noise. [SEP] [CLS] in a sensor network, in practice, the communication among sensors is subject to : ( 1 ) errors or failures at random times ; ( 3 ) costs ; and ( 2 ) constraints since sensors and networks operate under scarce resources, such as power, data rate, or communication. the signal - to -'

Next, let's randomly mask some of the tokens with a data collator so the model can be trained with masked language modeling (MLM):

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
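As a rough illustration of what random token masking does, here is a simplified sketch in plain Python. It is not the library's implementation: the function `mask_tokens` is hypothetical, the token ids are toy values, and special tokens are not protected. It follows the common BERT-style recipe in which a selected token is replaced by the mask token 80% of the time, by a random token 10% of the time, and left unchanged otherwise.

```python
import random


def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15, seed=0):
    """Simplified MLM masking; labels are -100 wherever no prediction is required."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = token_id  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id  # replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise keep the original token in place
    return masked, labels


# Toy ids; 103 and 30522 are assumed values for the mask id and vocabulary size.
toy_ids = list(range(1000, 1040))
masked_ids, labels = mask_tokens(toy_ids, mask_token_id=103, vocab_size=30522)

print(masked_ids)
print(labels)
```

The value -100 is used for ignored positions because the loss in PyTorch skips targets with that index by default.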

In [None]:
samples = [lm_datasets[i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

You're using a MobileBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



'>>> [CLS] the [MASK] of statistical learning [MASK] to construct a predictor of [MASK] random variable as a function [MASK] a related random variable on the [MASK] [MASK] an i. [MASK]. d [MASK] [MASK] sample from the joint distribution of. allowable predictors commits drawn [MASK] some [MASK] [MASK], and the goal is to approach as [MASK]ptotically the performance ( [MASK] loss ) of the best [MASK]or in the [MASK]. we consider the setting in which [MASK] [MASK] perfect observation of the - part of [MASK] sample, while the - part has to be communicated at some finite bit rate. the [MASK] of the [MASK] values is allowed to depend on the - [MASK] [MASK] under'

'>>> suitable regularity conditions on the admissible predictors, the underlying [MASK] [MASK] probability distributions and [MASK] loss function, we give an information - theoretic characterization of ac barevable predictor performance in terms of conditional [MASK] - rate functions. the ideas are illustrated on the example of no

Alternatively, we can define a whole word masking (WWM) collator that masks all tokens of a word together rather than individual tokens. I didn't find any advantage of WWM over token masking. Note, however, that the Trainer configured below passes the WWM collator, and the reported results were obtained with WWM.

In [None]:
wwm_probability = 0.15


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
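The word-to-token mapping is the core of the collator above, so it may help to run that step alone. The sketch below extracts it into a hypothetical helper, `group_tokens_by_word`, and feeds it made-up `word_ids` for a five-token input in which one word is assumed to be split into three pieces.

```python
import collections


def group_tokens_by_word(word_ids):
    """Map a running word index to the token positions that belong to that word."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # special tokens such as [CLS] and [SEP] have no word id
        if word_id != current_word:
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)


# Hypothetical word_ids: [CLS], one-piece word, three-piece word, one-piece word, [SEP].
toy_word_ids = [None, 0, 1, 1, 1, 2, None]
word_to_tokens = group_tokens_by_word(toy_word_ids)
print(word_to_tokens)
```

Masking word 1 in this example would replace token positions 2, 3 and 4 together, which is what distinguishes WWM from per-token masking.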

In [None]:
samples = [lm_datasets[i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] [MASK] problem of statistical learning is [MASK] construct a predictor of [MASK] random variable as a function [MASK] [MASK] related random variable on the basis of [MASK] i. i. d [MASK] training sample from the joint [MASK] of. allowable predictors are drawn from some specified class, and the goal [MASK] to approach asymptotically the performance [MASK] expected loss ) of the best predictor in [MASK] class [MASK] we consider the setting in which one has [MASK] [MASK] of the - part of the sample, while the - [MASK] [MASK] to [MASK] communicated at [MASK] finite bit rate. the encoding of the [MASK] values is allowed to [MASK] on the [MASK] values. under'

'>>> suitable [MASK] [MASK] conditions on the admissible [MASK] [MASK], the underlying family of probability distributions and the loss function [MASK] we give an information - theoretic characterization [MASK] achievable predictor performance in [MASK] of conditional distortion - [MASK] [MASK] [MASK] the ideas are illustra

Now we can split the dataset into train, val and test parts:

In [None]:
train_val_test_dataset = lm_datasets.shuffle(seed=42).train_test_split(test_size=0.2, seed=42)
val_test_dataset = train_val_test_dataset['test'].train_test_split(test_size=0.5, seed=42)

downsampled_dataset = DatasetDict({
    'train': train_val_test_dataset['train'],
    'val': val_test_dataset['train'],
    'test': val_test_dataset['test'],
})

downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 12348
    })
    val: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1544
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1544
    })
})
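The two-stage split gives roughly 80/10/10 proportions. The arithmetic can be sketched as follows, assuming the test portion is rounded up at each stage, which is consistent with the row counts printed above. The function name `two_stage_split_sizes` is made up for this illustration.

```python
import math


def two_stage_split_sizes(n_rows, holdout_fraction=0.2, test_share=0.5):
    """Sizes of train/val/test when a holdout set is split off and then halved."""
    n_holdout = math.ceil(n_rows * holdout_fraction)  # assumption: test size is rounded up
    n_train = n_rows - n_holdout
    n_test = math.ceil(n_holdout * test_share)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


total_rows = 12348 + 1544 + 1544  # rows in the chunked dataset, from the output above
sizes = two_stage_split_sizes(total_rows)
print(sizes)
```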

In [None]:
batch_size = 64

# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

output_dir = f"{model_name}-ml-arxiv-papers"

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
    remove_unused_columns=False
)
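A quick check of the step arithmetic implied by these settings, as a sketch under stated assumptions: a single device, no gradient accumulation, and the default of 3 training epochs (the arguments above do not set `num_train_epochs`). The row count is taken from the split printed earlier.

```python
import math

train_rows = 12348  # rows in the train split shown above
batch_size = 64
num_epochs = 3  # assumed default, since num_train_epochs is not set above

# Logging interval used above: floor division, so slightly less than one epoch.
logging_steps = train_rows // batch_size

# Optimizer steps per epoch: the last, smaller batch still counts as a step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

print(logging_steps, steps_per_epoch, total_steps)
```

Under these assumptions the total agrees with the `global_step=579` reported for DistilBERT in the results further down.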

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["val"],
    # data_collator=data_collator,
    data_collator=whole_word_masking_data_collator,
    tokenizer=tokenizer,
)

In [None]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

In [None]:
trainer.train()

In [None]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Results of training on the first 10k rows of the dataset (about 15k chunks after tokenization and grouping), batch size 64, WWM:

1. ALBERT: perplexity 52->22

`TrainOutput(global_step=597, training_loss=3.203148154956811, metrics={'train_runtime': 393.3845, 'train_samples_per_second': 96.669, 'train_steps_per_second': 1.518, 'total_flos': 213664196247552.0, 'train_loss': 3.203148154956811, 'epoch': 3.0})`

2. DistilBERT: perplexity 68->21

`TrainOutput(global_step=579, training_loss=3.283159638110014, metrics={'train_runtime': 205.4545, 'train_samples_per_second': 180.303, 'train_steps_per_second': 2.818, 'total_flos': 1227648866605056.0, 'train_loss': 3.283159638110014, 'epoch': 3.0})`

As we can see, the two models reach a similar final perplexity (22 vs. 21), while DistilBERT trains in about half the time (205 s vs. 393 s).
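Perplexity here is just the exponential of the evaluation loss, so the reported numbers can be converted back to losses for comparison. The sketch below does that for the rounded perplexities listed above; the helper names are introduced for illustration only.

```python
import math


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)


def loss_from_perplexity(ppl):
    """Inverse of perplexity(): recover the loss from a perplexity value."""
    return math.log(ppl)


# (before, after) perplexities as reported above, rounded.
reported = {"ALBERT": (52, 22), "DistilBERT": (68, 21)}

for name, (before, after) in reported.items():
    loss_before = loss_from_perplexity(before)
    loss_after = loss_from_perplexity(after)
    print(f"{name}: eval loss {loss_before:.2f} -> {loss_after:.2f} "
          f"(drop of {loss_before - loss_after:.2f})")
```

Because the perplexities are rounded, the recovered losses are approximate.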

In [None]:
# trainer.save_model(model_checkpoint)

In [None]:
# model_ = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
# tokenizer_ = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
doc = arxiv_dataset[55]['abstract']

In [None]:
len(doc)

In [None]:
output_dir

Let's now use our fine-tuned model as a feature extractor for the summarizer:

In [None]:
# Load model, model config and tokenizer via Transformers
custom_config = AutoConfig.from_pretrained(output_dir)
custom_config.output_hidden_states=True
custom_tokenizer = AutoTokenizer.from_pretrained(output_dir)
custom_model = AutoModel.from_pretrained(output_dir, 
                                         config=custom_config)


summarizer = Summarizer(custom_model=custom_model, 
                        custom_tokenizer=custom_tokenizer)

In [None]:
output = summarizer(doc, ratio=0.2, return_as_list=True)

In [None]:
doc

In [None]:
output

In [None]:
f"The summary is shorter than the input by {round(100 - 100*len(' '.join(output))/len(doc))}%"

In [None]:
unmasker = pipeline('fill-mask', 
                    model=output_dir,
                    # model=model_, 
                    # tokenizer=tokenizer_
                    )
unmasker("We measure the time (elapsed run [MASK]), space (RAM and disk space requirements), and fit (tensor reconstruction accuracy) of the four algorithms")