# Fine-tuning a masked language model (PyTorch)

This is a tutorial available from [Huggingface library](https://huggingface.co/course/chapter7/6?fw=pt) on domain adaptation of Masked Language model such as BERT and RoBERTa.

This note take a step further to conduct domain adaptation to academic writing genre using [Elsevier OA CC-BY Corpus](https://huggingface.co/datasets/orieg/elsevier-oa-cc-by).

Install the Transformers and Datasets libraries to run this notebook.

In [4]:
!pip install datasets transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the followin line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs

zsh:1: no matches found: transformers[sentencepiece]
The operation couldn’t be completed. Unable to locate a Java Runtime that supports apt.
Please visit http://www.java.com for information on installing Java.



You will need to setup git, adapt your email and name in the following cell.

In [2]:
!git config --global user.email "e.masaki0101@gmail.com"
!git config --global user.name "egumasa"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [3]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [5]:
import torch
torch.has_mps

True

In [6]:
device = torch.device('mps')

In [7]:
from transformers import AutoModelForMaskedLM

#model_checkpoint = "distilbert-base-uncased"
model_checkpoint = "distilroberta-base"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint).to(device) #moving to MPS

In [9]:
model.device

device(type='mps', index=0)

This allows to download the model.
Then you can check the parameter by model.num_parameters()

In [10]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 82M'
'>>> BERT number of parameters: 110M'


RoBERTa uses <mask> for the mask token.
BERT uses [MASK] for the masking token.

In [14]:
text = "This is a great <mask>."

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

AutoTokenizer.from_pretrained() allows you to input tokens and make it a tenfor...?

In [21]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

RuntimeError: Placeholder storage has not been allocated on MPS device!

In [22]:
inputs = tokenizer(text, return_tensors="pt")
inputs

{'input_ids': tensor([[    0,   713,    16,    10,   372, 50264,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [23]:
token_logits #this is the tensor for each word.

NameError: name 'token_logits' is not defined

In [24]:
token_logits.size()

NameError: name 'token_logits' is not defined

In [25]:
from datasets import load_dataset

# imdb_dataset = load_dataset("imdb")
# imdb_dataset

from datasets import load_dataset

dataset = load_dataset("orieg/elsevier-oa-cc-by")

No config specified, defaulting to: elsevier_oa_cc_by/all
Reusing dataset elsevier_oa_cc_by (/Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
#type(imdb_dataset)
type(dataset)

datasets.dataset_dict.DatasetDict

In [None]:
#len(imdb_dataset)
len(dataset)

3

In [19]:
dataset['train']
dataset['validation']

Dataset({
    features: ['title', 'abstract', 'subjareas', 'keywords', 'asjc', 'body_text', 'author_highlights'],
    num_rows: 4009
})

In [21]:
len(dataset['validation']['body_text'])

4009

In [None]:
sample = dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Title: {row['title']}'")
    print(f"'>>> abstract: {row['abstract']}'")
    text = " ".join(row['body_text'])
    print(f"'>>> body_text: {text}'")
    


'>>> Title: The evolution of the platyrrhine talus: A comparative analysis of the phenetic affinities of the Miocene platyrrhines with their modern relatives'
'>>> abstract: Platyrrhines are a diverse group of primates that presently occupy a broad range of tropical-equatorial environments in the Americas. However, most of the fossil platyrrhine species of the early Miocene have been found at middle and high latitudes. Although the fossil record of New World monkeys has improved considerably over the past several years, it is still difficult to trace the origin of major modern clades. One of the most commonly preserved anatomical structures of early platyrrhines is the talus. This work provides an analysis of the phenetic affinities of extant platyrrhine tali and their Miocene counterparts through geometric morphometrics and a series of phylogenetic comparative analyses. Geometric morphometrics was used to quantify talar shape affinities, while locomotor mode percentages (LMPs) were u

In [26]:
# This is an attempt to concatenate the string before processing the data.
def conc_string(example):
  return {'text': " ".join(example['body_text'])}

dataset = dataset.map(conc_string)

Loading cached processed dataset at /Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117/cache-8243647c4a34b0c6.arrow
Loading cached processed dataset at /Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117/cache-54b77145fc2e4edb.arrow
Loading cached processed dataset at /Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117/cache-9641162288e967c1.arrow


In [27]:
def tokenize_function(examples):
    #result = tokenizer(examples["text"]) #, padding = True, truncation = True)
    #string = " ".join(examples["body_text"])
    #examples['text'] = " ".join(examples['body_text'])
    result = tokenizer(examples["text"], truncation=True)
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
# tokenized_datasets = dataset.map(
#     tokenize_function, batched=True, remove_columns=["text", "label"]
# )
# tokenized_datasets
tokenized_datasets = dataset.map(tokenize_function,
                                 batched=True,
                                 remove_columns=[
                                     'title', 'abstract', 'subjareas',
                                     'keywords', 'asjc', 'author_highlights',
                                     'body_text', 'text'
                                 ])
tokenized_datasets

Loading cached processed dataset at /Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117/cache-9cfb9ac1939cb447.arrow


  0%|          | 0/5 [00:00<?, ?ba/s]

Loading cached processed dataset at /Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117/cache-88928e3699d52052.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 32072
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 4008
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 4009
    })
})

In [28]:
tokenizer.model_max_length

512

In [29]:
chunk_size = 500

In [30]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:15]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 512'
'>>> Review 1 length: 512'
'>>> Review 2 length: 512'
'>>> Review 3 length: 512'
'>>> Review 4 length: 512'
'>>> Review 5 length: 512'
'>>> Review 6 length: 512'
'>>> Review 7 length: 512'
'>>> Review 8 length: 512'
'>>> Review 9 length: 512'
'>>> Review 10 length: 512'
'>>> Review 11 length: 512'
'>>> Review 12 length: 512'
'>>> Review 13 length: 512'
'>>> Review 14 length: 512'


In [28]:
# tokenized_samples.pop('text')

KeyError: 'text'

In [31]:
def del_attr(example):
  return example.pop('text')

# dataset = dataset.map(del_attr)

In [32]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}

total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 7680'


In [33]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 500'
'>>> Chunk length: 180'


In [34]:
def group_texts(examples):
    #examples.pop('text') #this is necessary to run the concatenation
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

In [None]:
# tokenized_datasets['train'].pop('text')

AttributeError: ignored

In [35]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Loading cached processed dataset at /Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117/cache-934e9e3d1869b4cd.arrow


  0%|          | 0/5 [00:00<?, ?ba/s]

Loading cached processed dataset at /Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117/cache-20b9f509ddb66d4d.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 32828
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 4103
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 4103
    })
})

In [36]:
lm_datasets["train"][1]["input_ids"]

[43,
 36,
 14484,
 25890,
 4400,
 1076,
 482,
 1014,
 4397,
 12238,
 2068,
 2,
 0,
 4771,
 24811,
 37788,
 3766,
 29,
 32,
 762,
 12628,
 2192,
 11,
 484,
 2292,
 8223,
 1168,
 1588,
 7823,
 6,
 16674,
 215,
 25,
 36846,
 3914,
 2457,
 26609,
 8,
 3553,
 118,
 3914,
 2457,
 26609,
 6,
 8,
 33561,
 24811,
 37965,
 16254,
 808,
 3175,
 34596,
 215,
 25,
 5,
 263,
 4325,
 3892,
 4203,
 853,
 757,
 219,
 28366,
 284,
 566,
 643,
 646,
 134,
 8174,
 1216,
 18291,
 64,
 28,
 33689,
 1538,
 30,
 1169,
 4747,
 50,
 4003,
 1975,
 415,
 7776,
 13240,
 6448,
 4,
 3687,
 5,
 1337,
 4747,
 6448,
 577,
 6,
 2269,
 1988,
 2816,
 12992,
 22356,
 34774,
 3258,
 646,
 176,
 742,
 8,
 34228,
 25286,
 18303,
 1258,
 646,
 246,
 742,
 33,
 57,
 2885,
 4,
 635,
 6,
 5,
 614,
 4532,
 3363,
 36,
 658,
 7,
 654,
 8871,
 8,
 5,
 1198,
 12,
 45888,
 9,
 5,
 44017,
 25,
 157,
 25,
 5,
 239,
 425,
 9,
 5,
 3868,
 4204,
 8,
 11690,
 73,
 7779,
 30999,
 6,
 4067,
 6,
 32,
 49,
 1049,
 38940,
 646,
 176,
 8174,
 6479

In [35]:
lm_datasets["train"][1]["input_ids"][0]

43

In [33]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"][0])

')'

In [34]:
tokenizer.decode(lm_datasets["train"][2]["input_ids"][1])

' addition'

In [37]:
from transformers.data.data_collator import DataCollatorForWholeWordMask
from transformers import DataCollatorForLanguageModeling

#data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
data_collator = DataCollatorForWholeWordMask(tokenizer = tokenizer, mlm_probability =0.15 )

In [38]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> <s>Glioblastoma is the most common and most aggressive<mask> brain malignancy. Even with maximum feasible surgical resection with radiotherapy and adjuvant temozolomide (TMZ<mask> survival rates<mask>ZX<mask> median of 14.6 months from diagnosis in molecular<mask> unselected patients (<mask>upp et al., 2005). Radiotherapy and TMZ provide better survival outcomes than<mask>otherapy alone to treat gl Mimoblastoma (Yang et al., 2014). Both extent of rese<mask> and residual volume are significantly associated with<mask> and recurrence (Chaichana et al., 2014). Gross total resection is associated with survival<mask>, but it is not always<mask> because the side Chrom neurological functions is necessary.<mask> current multimodality treatments� surgery, radiotherapy, chemotherapy for<mask> tumor are<mask> not completely satisfying.<mask>ethyl is<mask>iocyanate (PEITC) is onese the most extensively studied is<mask>iocyan<mask>voteMoon et al<mask> 2011<mask><mask><mask> chemopreventive act



In [42]:
# data_collator(samples)
samples

AttributeError: 'list' object has no attribute 'keys'

In [39]:
for chunk in data_collator(samples)["labels"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

OverflowError: out of range integral type conversion attempted

In [39]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):

    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id

    return default_data_collator(features)

In [44]:
# def whole_word_masking_data_collator(features):

#     for feature in features:
#         word_ids = feature.pop("word_ids")

#         # Create a map between words and corresponding token indices
#         mapping = collections.defaultdict(list)
#         current_word_index = -1
#         current_word = None
#         for idx, sent_id in enumerate(word_ids):
#           mapping.append([])
#           for idy, word_id in enumerate(sent_id):
#             if word_id is not None:
#                 if word_id != current_word:
#                     current_word = word_id
#                     current_word_index += 1
#                 mapping[idx][current_word_index].append(idy)

#         # Randomly mask words
#         for x, mp in enumerate(mapping):
#           mask = np.random.binomial(1, wwm_probability, (len(mp),))
#           input_ids = feature["input_ids"][x]
#           labels = feature["labels"][x]
#           new_labels = [-100] * len(labels)
#           for word_id in np.where(mask)[0]:
#               word_id = word_id.item()
#               for idx in mp[word_id]:
#                   new_labels[idx] = labels[idx]
#                   input_ids[idx] = tokenizer.mask_token_id

#     return default_data_collator(features)

In [40]:
samples = [lm_datasets["train"][i] for i in range(2)]


In [41]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")
    print(f"\n'>>> {tokenizer.convert_ids_to_tokens(chunk)}'")


'>>> <s>Glioblastoma<mask> the<mask><mask> and most aggressive primary brain malignancy. Even<mask> maximum feasible surgical resection with<mask><mask> and adjuvant temozolomide (TMZ<mask><mask> rates are<mask><mask> median<mask> 14<mask>6 months from diagnosis in molecularly unselected patients<mask>Stupp et al<mask> 2005). Radiotherapy and TMZ provide better survival outcomes<mask> radiotherapy alone<mask> treat<mask><mask><mask><mask><mask> (<mask><mask> al., 2014). Both<mask><mask> resection<mask> residual volume are significantly associated with survival and recurrence<mask>Chaichana et al.,<mask><mask><mask> total resection is associated with survival improvement, but it is not always possible because the preservation of neurological functions is necessary. The current multimodality treatments<mask> surgery,<mask><mask><mask> chemotherapy for<mask> tumor are still<mask> completely satisfying. Phenethyl isothiocyanate (PEITC) is one of the most extensively studied isothiocyanate

In [85]:
train_size = 1000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

Loading cached split indices for dataset at /Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117/cache-69e8c160865e4fa3.arrow and /Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117/cache-bb51babeda02bca5.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 100
    })
})

In [28]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [44]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-academic",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    #fp16=True,
    logging_steps=logging_steps,
)

In [46]:
from transformers import Trainer

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=downsampled_dataset["train"],
#     eval_dataset=downsampled_dataset["test"],
#     data_collator=data_collator,
# )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


/Volumes/GoogleDrive/My Drive/Colab Notebooks/TorchNLP-Notes/3.Transformer/distilroberta-base-finetuned-academic is already a clone of https://huggingface.co/egumasa/distilroberta-base-finetuned-academic. Make sure you pull the latest changes with `repo.git_pull()`.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

In [47]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

The following columns in the evaluation set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 100
  Batch size = 64


  0%|          | 0/2 [00:00<?, ?it/s]

>>> Perplexity: 9.82


In [66]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `RobertaForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 10000
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 471


  0%|          | 0/471 [00:00<?, ?it/s]



KeyboardInterrupt: 

In [None]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 64


>>> Perplexity: 11.85


In [None]:
trainer.push_to_hub()

Saving model checkpoint to distilbert-base-uncased-finetuned-imdb
Configuration saved in distilbert-base-uncased-finetuned-imdb/config.json
Model weights saved in distilbert-base-uncased-finetuned-imdb/pytorch_model.bin


Upload file pytorch_model.bin:   0%|          | 3.34k/256M [00:00<?, ?B/s]

Upload file runs/Jun06_10-38-15_c68116698c68/1654512304.3934739/events.out.tfevents.1654512304.c68116698c68.74…

Upload file runs/Jun06_10-38-15_c68116698c68/events.out.tfevents.1654512516.c68116698c68.74.2: 100%|##########…

Upload file training_args.bin: 100%|##########| 3.17k/3.17k [00:00<?, ?B/s]

Upload file runs/Jun06_10-38-15_c68116698c68/events.out.tfevents.1654512300.c68116698c68.74.0:  65%|######4   …

remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/egumasa/distilbert-base-uncased-finetuned-imdb
   beff6ca..cb2e827  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Masked Language Modeling', 'type': 'fill-mask'}, 'dataset': {'name': 'imdb', 'type': 'imdb', 'args': 'plain_text'}}
remote: Enforcing permissions...        
remote: Allowed refs: all        
To https://huggingface.co/egumasa/distilbert-base-uncased-finetuned-imdb
   cb2e827..4481b98  main -> main



'https://huggingface.co/egumasa/distilbert-base-uncased-finetuned-imdb/commit/cb2e8273a22577d1ed6c5057beb9f777e0a58d67'

In [75]:
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}

In [86]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])

eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)

# eval_dataset = eval_dataset.rename_columns(
#     {
#         "masked_input_ids": "input_ids",
#         "masked_attention_mask": "attention_mask",
#         "masked_labels": "labels",
#     }
# )

Loading cached processed dataset at /Users/masakieguchi/.cache/huggingface/datasets/orieg___elsevier_oa_cc_by/all/1.0.1/90990052f835613074d87af3592fd8eaee912f17ca4d0401bcf6d2d791d45117/cache-f43a729d795438a7.arrow


In [94]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

In [88]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [89]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [95]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [91]:
from huggingface_hub import get_full_repo_name

model_name = "distilroberta-base-uncased-finetuned-acd"
repo_name = get_full_repo_name(model_name)
repo_name

'egumasa/distilroberta-base-uncased-finetuned-acd'

In [92]:
from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


/Volumes/GoogleDrive/My Drive/Colab Notebooks/TorchNLP-Notes/3.Transformer/distilroberta-base-uncased-finetuned-acd is already a clone of https://huggingface.co/egumasa/distilroberta-base-uncased-finetuned-acd. Make sure you pull the latest changes with `repo.git_pull()`.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

In [97]:
for epoch in range(num_train_epochs):
    # Training
    #model.train()
    for batch in train_dataloader:
        print(batch.keys())
        print(batch['labels'])
        # batch_mps = {
        #     'input_ids': batch['input_ids'].to(device),
        #     'attention_mask': batch['attention_mask'].to(device),
        #     'labels': batch['labels'].to(device)
        # }

dict_keys(['input_ids', 'labels'])
tensor([[1848, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., -100, -100, -100],
        ...,
        [-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., -100,    7, -100],
        [-100, -100, -100,  ..., -100, -100, -100]])
dict_keys(['input_ids', 'labels'])
tensor([[-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., 9415, -100, -100],
        [-100, -100, -100,  ..., -100, -100, -100],
        ...,
        [-100, -100, -100,  ..., -100, -100, -100],
        [-100, 1466,  322,  ..., -100, -100, -100],
        [1822, -100, -100,  ..., -100, -100, -100]])
dict_keys(['input_ids', 'labels'])
tensor([[ -100,  2456,  -100,  ...,  -100,  -100,  -100],
        [  435,  -100,  -100,  ...,  -100,    10,  -100],
        [ -100,    50,  -100,  ...,  -100,  -100,    67],
        ...,
        [    5,  -100,  -100,  ...,    30,  -100,  -100]

KeyboardInterrupt: 

In [101]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        batch_mps = {
            'input_ids': batch['input_ids'].to(device),
            'labels': batch['labels'].to(device)
        }

        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

  0%|          | 0/48 [00:00<?, ?it/s]

TypeError: forward() got an unexpected keyword argument 'masked_input_ids'

https://huggingface.co/egumasa/distilbert-base-uncased-finetuned-imdb

In [None]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="egumasa/distilbert-base-uncased-finetuned-imdb"
)

loading configuration file https://huggingface.co/egumasa/distilbert-base-uncased-finetuned-imdb/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/ab07f1c3cc7d75a451f1c8c8da751204dcc93bc7246c6362e6545e6b1dd82f62.be4f374030e5ae5e858f9d83c8a42915a38e58004e88ffc1867a4b1fbcb6c058
Model config DistilBertConfig {
  "_name_or_path": "egumasa/distilbert-base-uncased-finetuned-imdb",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.19.2",
  "vocab_size": 30522
}

loading configuration file https://huggingface.co/egumasa/distilbert-base-uncased-finetune

OSError: ignored

In [None]:
preds = mask_filler("The paper [MASK] that the movie was interesting.")

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> the paper said that the movie was interesting.
>>> the paper stated that the movie was interesting.
>>> the paper noted that the movie was interesting.
>>> the paper commented that the movie was interesting.
>>> the paper wrote that the movie was interesting.
