# Fine-tuning a masked language model (PyTorch)

This notebook is based on the [Hugging Face course tutorial](https://huggingface.co/learn/nlp-course/chapter7/3?fw=pt#fine-tuning-distilbert-with-the-trainer-api) on fine-tuning a masked language model with the Trainer API.

In [1]:
!nvidia-smi

Tue May 16 10:23:02 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Install the Transformers, Datasets, and Evaluate libraries to run this notebook. We will also plug the fine-tuned model into [bert-extractive-summarizer](https://github.com/dmmiller612/bert-extractive-summarizer).

In [2]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
!pip install bert-extractive-summarizer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
from transformers import (
    AutoConfig,
    AutoModel,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    default_data_collator,
    pipeline,
)
import torch
from datasets import load_dataset
import collections
import numpy as np
import math
from summarizer import Summarizer

In [4]:
model_checkpoint_distilbert = "distilbert-base-uncased"
model_checkpoint_albert = "albert-base-v2"

distilbert = AutoModelForMaskedLM.from_pretrained(model_checkpoint_distilbert)
albert = AutoModelForMaskedLM.from_pretrained(model_checkpoint_albert)

ALBERT (`albert-base-v2`) has roughly 10 times fewer parameters than BERT base and about 6 times fewer than DistilBERT:

In [5]:
distilbert_num_parameters = distilbert.num_parameters() / 1_000_000
albert_num_parameters = albert.num_parameters() / 1_000_000

print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> ALBERT number of parameters: {round(albert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> ALBERT number of parameters: 11M'
'>>> BERT number of parameters: 110M'
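
The ratios quoted above can be checked from the rounded counts in the output. This is a small arithmetic sketch; the dictionary and helper name are illustrative only and the figures are the rounded values printed by the cell, not exact parameter counts.

```python
# Rounded parameter counts in millions, copied from the output above
params_m = {"BERT": 110, "DistilBERT": 67, "ALBERT": 11}


def ratio_to_albert(name):
    """How many times larger the named model is than ALBERT."""
    return params_m[name] / params_m["ALBERT"]


bert_ratio = ratio_to_albert("BERT")
distilbert_ratio = ratio_to_albert("DistilBERT")

print(f"BERT / ALBERT: {bert_ratio:.1f}x")
print(f"DistilBERT / ALBERT: {distilbert_ratio:.1f}x")
```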


In [6]:
text = "This is a great [MASK]."

In [7]:
model_checkpoint = model_checkpoint_albert
model = albert

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [9]:
inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great idea.'
'>>> This is a great recipe.'
'>>> This is a great site.'
'>>> This is a great tutorial.'
'>>> This is a great website.'
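
The selection step in the cell above (take the logits at the `[MASK]` position, keep the highest-scoring candidates, and substitute each one into the text) can be illustrated without a model. The vocabulary and scores below are made up for illustration and are not the model's real logits.

```python
# Hypothetical toy vocabulary and made-up logits for the [MASK] position
toy_vocab = ["idea", "recipe", "car", "site", "the"]
toy_logits = [3.2, 2.9, -1.0, 2.5, 0.1]


def top_k_indices(scores, k):
    """Return the indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


template = "This is a great [MASK]."
candidates = [
    template.replace("[MASK]", toy_vocab[i]) for i in top_k_indices(toy_logits, 3)
]

for sentence in candidates:
    print(sentence)
```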


In [10]:
arxiv_dataset = load_dataset("CShorten/ML-ArXiv-Papers", split="train")
len(arxiv_dataset)



117592

In [11]:
# Average abstract length in characters (not tokens)
abstract_lengths = [len(a) for a in arxiv_dataset['abstract']]
average_length = int(sum(abstract_lengths) / len(abstract_lengths))
average_length

1157

In [12]:
arxiv_dataset = arxiv_dataset.select(range(30000))
len(arxiv_dataset)

30000

In [13]:
arxiv_dataset[0]

{'Unnamed: 0.1': 0,
 'Unnamed: 0': 0.0,
 'title': 'Learning from compressed observations',
 'abstract': '  The problem of statistical learning is to construct a predictor of a random\nvariable $Y$ as a function of a related random variable $X$ on the basis of an\ni.i.d. training sample from the joint distribution of $(X,Y)$. Allowable\npredictors are drawn from some specified class, and the goal is to approach\nasymptotically the performance (expected loss) of the best predictor in the\nclass. We consider the setting in which one has perfect observation of the\n$X$-part of the sample, while the $Y$-part has to be communicated at some\nfinite bit rate. The encoding of the $Y$-values is allowed to depend on the\n$X$-values. Under suitable regularity conditions on the admissible predictors,\nthe underlying family of probability distributions and the loss function, we\ngive an information-theoretic characterization of achievable predictor\nperformance in terms of conditional distortion-rat

In [14]:
def tokenize_function(examples):
    result = tokenizer(examples["abstract"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = arxiv_dataset.map(
    tokenize_function, batched=True, remove_columns=['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract']
)

Map:   0%|          | 0/30000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (515 > 512). Running this sequence through the model will result in indexing errors


In [15]:
tokenized_datasets

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
    num_rows: 30000
})

In [16]:
tokenizer.model_max_length

512

In [17]:
chunk_size = 128

In [18]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets[:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Abstract {idx} length: {len(sample)}'")

'>>> Abstract 0 length: 230'
'>>> Abstract 1 length: 370'
'>>> Abstract 2 length: 347'


In [19]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated abstract length: {total_length}'")

'>>> Concatenated abstract length: 947'


In [20]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 51'
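
The chunk lengths printed above follow from integer division of the concatenated length by `chunk_size`. A quick check of that arithmetic, using the numbers from this notebook:

```python
total_length = 947  # concatenated length of the three sample abstracts
chunk_size = 128

# Number of full chunks and the size of the leftover final chunk
full_chunks, remainder = divmod(total_length, chunk_size)

print(f"{full_chunks} full chunks of {chunk_size} tokens, plus {remainder} leftover tokens")
```

The leftover chunk is shorter than `chunk_size`, so it either has to be padded or dropped before batching.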


In [21]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
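
To see what `group_texts` does to a batch, here is a standalone toy version run on short lists. The function name `group_texts_toy` and the sample data are illustrative; the logic mirrors the cell above, with `chunk_size` passed as an argument instead of read from a global.

```python
def group_texts_toy(examples, chunk_size):
    # Concatenate every column's lists into one long list
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the trailing remainder so every chunk has exactly chunk_size items
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start as a copy of the inputs; the collator masks the inputs later
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Three "documents" of 3, 4 and 3 tokens: 10 tokens in total
toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
toy_result = group_texts_toy(toy_batch, chunk_size=4)

print(toy_result["input_ids"])
print(toy_result["labels"])
```

Three input rows become two output rows here, and tokens 9 and 10 are discarded. The same effect is why the row count changes when the function is mapped over the real dataset with `batched=True`.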

In [22]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Map:   0%|          | 0/30000 [00:00<?, ? examples/s]

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
    num_rows: 50903
})

In [23]:
tokenizer.decode(lm_datasets[1]["input_ids"])

'communicated at some finite bit rate. the encoding of the $y$-values is allowed to depend on the $x$-values. under suitable regularity conditions on the admissible predictors, the underlying family of probability distributions and the loss function, we give an information-theoretic characterization of achievable predictor performance in terms of conditional distortion-rate functions. the ideas are illustrated on the example of nonparametric regression in gaussian noise.[SEP][CLS] in a sensor network, in practice, the communication among sensors is subject to:(1) errors or failures at random'

In [24]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [25]:
samples = [lm_datasets[i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

You're using a AlbertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



'>>> [CLS] the problem of statistical learning is to construct a predictor of[MASK] random variable[MASK]y$ as a function of a related[MASK] variable $x[MASK] on the basis of an i.i.[MASK]. training[MASK] from the joint distribution of $(x,y[MASK]$. allowable[MASK]ors are drawn from some specified class, and the goal is to approach asymptotically[MASK] performance (expected loss) of the best predictor in the class[MASK] we consider the setting in which one[MASK] resin observation of the[MASK]x$-[MASK][MASK] the sample, while the $y$-part has to be'

'>>> communicated at some[MASK] bit rate[MASK][MASK] encoding of the $y$-values is allowed to depend[MASK] the[MASK]x$-values. under[MASK] regularity conditions on[MASK] admissible predictors, the[MASK] family[MASK] probability distribution striker and[MASK] loss function, weern an information-theoretic[MASK]characterization of a[MASK]evable predictor performance in terms of conditional distortion-rate functions. the ideas are illustrated 
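
The decoded chunks above contain a few unexpected words (for example `resin` and `striker`) in addition to `[MASK]` tokens. This is consistent with the standard BERT-style masking rule, in which a selected token is usually replaced by `[MASK]`, sometimes replaced by a random token, and sometimes left unchanged. The sketch below is a simplified pure-Python illustration of that rule under an assumed 80/10/10 split; it is not the library's implementation, it does not skip special tokens, and `mask_tokens_sketch`, `mask_id`, and `vocab_size` are hypothetical names and values.

```python
import random


def mask_tokens_sketch(token_ids, mask_id, vocab_size, mlm_probability=0.15, seed=0):
    """Simplified BERT-style masking: returns (masked_inputs, labels)."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions ignored by the loss
    for i, tok in enumerate(token_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # assumed 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # assumed 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


toy_ids = list(range(100, 120))  # placeholder token ids
masked_inputs, masked_labels = mask_tokens_sketch(toy_ids, mask_id=4, vocab_size=30000)
print(masked_inputs)
print(masked_labels)
```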

In [26]:
wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
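
The first half of the collator groups token positions by the word they belong to, so that all sub-word pieces of a word are masked together. The grouping logic can be tried on its own with a hand-written `word_ids` list; the helper name `build_word_mapping` and the sample list are illustrative.

```python
import collections


def build_word_mapping(word_ids):
    """Map a running word index to the token positions that make up that word."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is not None:  # None marks special tokens such as [CLS] and [SEP]
            if word_id != current_word:
                current_word = word_id
                current_word_index += 1
            mapping[current_word_index].append(idx)
    return dict(mapping)


# Special token, a one-piece word, a two-piece word, a one-piece word, special token
toy_word_ids = [None, 0, 1, 1, 2, None]
toy_mapping = build_word_mapping(toy_word_ids)
print(toy_mapping)
```

Masking word 1 in this example would replace both positions 2 and 3, which is the difference from token-level masking.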

In [27]:
samples = [lm_datasets[i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] the[MASK] of statistical learning is to construct[MASK] predictor of a random variable[MASK][MASK][MASK] as a function of a related random[MASK] $x$ on[MASK][MASK] of[MASK] i.i.d. training sample[MASK] the joint distribution[MASK] $(x,y)$. allowable predictors are drawn from some specified[MASK][MASK][MASK] the goal is to[MASK][MASK][MASK][MASK][MASK][MASK][MASK] performance (expected loss) of the[MASK] predictor in the class. we consider[MASK] setting in which one has perfect observation of the $x$-part of[MASK] sample, while the $y$-part[MASK][MASK] be'

'>>> communicated at some finite bit rate. the encoding[MASK] the $y$-values is[MASK] to depend on the $x$-values.[MASK] suitable regularity conditions[MASK] the admissible predictors, the underlying family of probability distributions and the[MASK] function, we[MASK][MASK] information-theoretic characterization of achievable predictor performance in terms of conditional distortion-rate[MASK][MASK] the ideas[MASK] illustr

In [28]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets.train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [29]:
batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-arxiv",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    # push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)

In [30]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [31]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 149837.82


In [32]:
trainer.train()



Epoch,Training Loss,Validation Loss
1,3.738,2.66431
2,2.5497,2.502258
3,2.4458,2.482439


TrainOutput(global_step=471, training_loss=2.9081290846417662, metrics={'train_runtime': 296.7657, 'train_samples_per_second': 101.09, 'train_steps_per_second': 1.587, 'total_flos': 168558059520000.0, 'train_loss': 2.9081290846417662, 'epoch': 3.0})
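
The `global_step=471` in the output can be reproduced from the training configuration. This sketch assumes a single device, no gradient accumulation, and three epochs (matching `'epoch': 3.0` in the output); the variable names other than `batch_size` and `logging_steps` are illustrative.

```python
import math

train_size = 10_000
batch_size = 64
num_epochs = 3  # assumption: matches 'epoch': 3.0 in the TrainOutput above

# Value used for logging_steps in the notebook (floor division)
logging_steps = train_size // batch_size

# The last, partially filled batch still counts as an optimizer step
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"logging_steps={logging_steps}, steps_per_epoch={steps_per_epoch}, total_steps={total_steps}")
```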

In [33]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 11.34
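
Perplexity here is the exponential of the evaluation cross-entropy loss, so each reported perplexity corresponds to a loss value and vice versa. The sketch below converts the numbers printed in this notebook in both directions. The epoch-3 validation loss (2.482439) gives a perplexity close to 12 rather than 11.34; one plausible explanation is that the collator draws new random masks on every evaluation pass, so the loss differs slightly between runs.

```python
import math


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)


# Loss implied by the perplexity reported before fine-tuning
loss_before = math.log(149837.82)
# Loss implied by the perplexity reported after fine-tuning
loss_after = math.log(11.34)
# Perplexity implied by the epoch-3 validation loss in the training table
ppl_epoch3 = perplexity(2.482439)

print(f"loss before fine-tuning: {loss_before:.2f}")
print(f"loss after fine-tuning:  {loss_after:.2f}")
print(f"perplexity at epoch 3:   {ppl_epoch3:.2f}")
```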


In [34]:
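# Note: despite the directory name, the model fine-tuned above is ALBERT
# (model_checkpoint = model_checkpoint_albert), not DistilBERT.
saved_model_path = 'distilbert_arxiv_finetuned'
trainer.save_model(saved_model_path)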

In [35]:
model_ = AutoModelForMaskedLM.from_pretrained(saved_model_path)
tokenizer_ = AutoTokenizer.from_pretrained(saved_model_path)

In [36]:
doc = arxiv_dataset[55]['abstract']

In [37]:
len(doc)

1162

In [38]:
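# Load the fine-tuned config, tokenizer, and base model (without the MLM head) via Transformers
custom_config = AutoConfig.from_pretrained(saved_model_path)
# Expose hidden states so the summarizer can use them as sentence embeddings
custom_config.output_hidden_states = True
custom_tokenizer = AutoTokenizer.from_pretrained(saved_model_path)
custom_model = AutoModel.from_pretrained(saved_model_path, config=custom_config)

# Note: this rebinds `model`, which until now referred to the masked language model
model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)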

Some weights of the model checkpoint at distilbert_arxiv_finetuned were not used when initializing AlbertModel: ['predictions.decoder.weight', 'predictions.LayerNorm.bias', 'predictions.LayerNorm.weight', 'predictions.dense.bias', 'predictions.bias', 'predictions.dense.weight', 'predictions.decoder.bias']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertModel were not initialized from the model checkpoint at distilbert_arxiv_finetuned and are newly initialized: ['albert.pooler.bias', 'albert.pooler.weight']
You should probably TRAIN this model on a down-stream tas

In [39]:
output = model(doc, ratio=0.2, return_as_list=True)



In [40]:
doc

'  Higher-order tensor decompositions are analogous to the familiar Singular\nValue Decomposition (SVD), but they transcend the limitations of matrices\n(second-order tensors). SVD is a powerful tool that has achieved impressive\nresults in information retrieval, collaborative filtering, computational\nlinguistics, computational vision, and other fields. However, SVD is limited to\ntwo-dimensional arrays of data (two modes), and many potential applications\nhave three or more modes, which require higher-order tensor decompositions.\nThis paper evaluates four algorithms for higher-order tensor decomposition:\nHigher-Order Singular Value Decomposition (HO-SVD), Higher-Order Orthogonal\nIteration (HOOI), Slice Projection (SP), and Multislice Projection (MP). We\nmeasure the time (elapsed run time), space (RAM and disk space requirements),\nand fit (tensor reconstruction accuracy) of the four algorithms, under a\nvariety of conditions. We find that standard implementations of HO-SVD and HO

In [41]:
output

['Higher-order tensor decompositions are analogous to the familiar Singular\nValue Decomposition (SVD), but they transcend the limitations of matrices\n(second-order tensors).',
 'We find that standard implementations of HO-SVD and HOOI\ndo not scale up to larger tensors, due to increasing RAM requirements.']

In [42]:
f"The summary is {round(100 - 100*len(' '.join(output))/len(doc))}% shorter than the input"

'The summary is 74% shorter than the input'