# Fine-tuning a masked language model (PyTorch)

This notebook fine-tunes pretrained BERT models for masked language modelling on the WikiText-2 dataset, sweeping over several weight decay and dropout values. It is adapted from the Hugging Face course tutorial of the same name (https://huggingface.co/course/chapter7/3?fw=tf). It also pre-processes the WikiText dataset and saves it to disk.

In [2]:
from transformers import BertForMaskedLM

model_checkpoint = "bert-base-uncased"
model = BertForMaskedLM.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
bert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> BERT number of parameters: {round(bert_num_parameters)}M'")


'>>> BERT number of parameters: 110M'


In [4]:
text = "This is a great [MASK]."

In [5]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(model_checkpoint)

In [6]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great idea.'
'>>> This is a great day.'
'>>> This is a great place.'
'>>> This is a great time.'
'>>> This is a great thing.'


In [7]:
from datasets import load_dataset

wiki_dataset = load_dataset("wikitext", "wikitext-2-v1")
wiki_dataset

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

In [8]:
sample = wiki_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Text: {row['text']}'")

'>>> Text:  <unk> , short @-@ arc , high pressure xenon arc lamps have a color temperature closely <unk> noon sunlight and are used in solar simulators . That is , the <unk> of these lamps closely <unk> a heated black body <unk> that has a temperature close to that observed from the Sun . After they were first introduced during the 1940s , these lamps began replacing the shorter @-@ lived carbon arc lamps in movie <unk> . They are employed in typical 35mm , <unk> and the new digital <unk> film projection systems , automotive <unk> <unk> , high @-@ end " tactical " <unk> and other specialized uses . These arc lamps are an excellent source of short wavelength ultraviolet radiation and they have intense emissions in the near infrared , which is used in some night vision systems . 
'

'>>> Text:  Field Marshal Antonio José de Sucre is portrayed as an intimate friend of the General . The historical Antonio José de Sucre , the Field Marshal of <unk> , had been the most trusted general of Si

In [9]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# batched=True tokenizes many rows per call; the multithreaded speed-up applies only to fast tokenizers
tokenized_datasets = wiki_dataset.map(
    tokenize_function, batched=True, remove_columns=["text"]
)
tokenized_datasets

DatasetDict({
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3760
    })
})

In [10]:
tokenizer.model_max_length

512

In [11]:
chunk_size = 128

In [12]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Text {idx} length: {len(sample)}'")

'>>> Text 0 length: 2'
'>>> Text 1 length: 9'
'>>> Text 2 length: 2'


In [13]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated texts length: {total_length}'")

'>>> Concatenated texts length: 13'


In [14]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 13'


In [15]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
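
To make the concatenate-then-chunk behaviour of `group_texts` easier to see, here is a minimal sketch on a toy batch. The helper name `group_texts_toy`, the toy token ids, and the chunk size of 5 are illustrative assumptions, not part of the notebook's pipeline.

```python
def group_texts_toy(examples, chunk_size):
    """Concatenate each feature's lists, then cut them into fixed-size chunks."""
    # Flatten the list of lists for every feature (here only input_ids)
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the trailing remainder that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start as a copy of the inputs; masking is applied later by the collator
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Three toy "texts" of lengths 4, 3 and 5 (12 tokens in total)
toy_batch = {"input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]]}
toy_chunks = group_texts_toy(toy_batch, chunk_size=5)
print(toy_chunks["input_ids"])
# → [[101, 7, 8, 102, 101], [9, 102, 101, 5, 6]]
```

The 12 tokens yield two full chunks of 5, and the last 2 tokens are discarded. Chunks can also span text boundaries, which is why `[SEP] [CLS]` can appear in the middle of a decoded training example.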

In [16]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2405
    })
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 19247
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2089
    })
})

In [18]:
lm_datasets.save_to_disk("processed_dataset")

In [29]:
from datasets import DatasetDict

# load_from_disk is a static method, so no empty DatasetDict is needed first
lm_datasets = DatasetDict.load_from_disk("processed_dataset")
lm_datasets

DatasetDict({
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2405
    })
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 19247
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2089
    })
})

In [30]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

'runs parallel to the first game and follows the " nameless ", a penal military unit serving the nation of gallia during the second europan war who perform secret black operations and are pitted against the imperial unit " < unk > raven ". [SEP] [CLS] the game began development in 2010, carrying over a large portion of the work done on valkyria chronicles ii. while it retained the standard features of the series, it also underwent multiple adjustments, such as making the game more < unk > for series newcomers. character designer < unk > honjou and composer hitoshi sakimoto both returned from previous entries, along with'

In [18]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
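
The collator applies the masking at batch time. The following is a simplified pure-Python sketch of the default masking scheme as documented for `DataCollatorForLanguageModeling` (select tokens with probability `mlm_probability`; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). It is not the library's implementation. The function name `mask_tokens`, the `seed` argument, and the example ids are assumptions made for illustration.

```python
import random

MASK_ID = 103        # [MASK] id in the bert-base-uncased vocabulary
VOCAB_SIZE = 30522   # vocab_size reported in the BertConfig
SPECIAL_IDS = {0, 101, 102, 103}  # [PAD], [CLS], [SEP], [MASK]
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_tokens(input_ids, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for pos, token in enumerate(input_ids):
        # Special tokens are never selected; others are selected at random
        if token in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue
        labels[pos] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = MASK_ID                     # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[pos] = rng.randrange(VOCAB_SIZE)   # 10%: replace with a random id
        # remaining 10%: leave the token unchanged
    return masked, labels


example_ids = [101, 2023, 2003, 1037, 2307, 2801, 1012, 102]  # arbitrary example ids
masked_ids, labels = mask_tokens(example_ids)
print(masked_ids)
print(labels)
```

Only positions whose label is not `-100` contribute to the loss, so the evaluation loss (and therefore the perplexity) is measured on the selected tokens alone.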

In [22]:
print(len(lm_datasets["train"]))

19247


In [23]:
from transformers import TrainingArguments
import math

weight_decays = [0, 0.1, 0.01, 0.001]

for decay_val in weight_decays:
    
    model_checkpoint = "bert-base-uncased"
    model = BertForMaskedLM.from_pretrained(model_checkpoint)

    batch_size = 20
    # Show the training loss with every epoch
    logging_steps = len(lm_datasets["train"]) // batch_size
    model_name = model_checkpoint.split("/")[-1]

    training_args = TrainingArguments(
        output_dir=f"weight_decay_{decay_val}",
        overwrite_output_dir=True,
        evaluation_strategy="epoch",
        learning_rate=5e-5,
        num_train_epochs=15,
        weight_decay=decay_val,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        push_to_hub=False,
        fp16=True,
        logging_steps=logging_steps,
    )
    
    from transformers import Trainer

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_datasets["train"],
        eval_dataset=lm_datasets["validation"],
        data_collator=data_collator,
    )
    
    eval_results = trainer.evaluate()
    print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")    

    trainer.train()
    
    eval_results = trainer.evaluate()
    print(f">>> Weight decay: {decay_val}Perplexity:{math.exp(eval_results['eval_loss'])}")
    
    trainer.save_model()

Using amp half precision backend
***** Running Evaluation *****
  Num examples = 2089
  Batch size = 20


***** Running training *****
  Num examples = 19247
  Num Epochs = 15
  Instantaneous batch size per device = 20
  Total train batch size (w. parallel, distributed & accumulation) = 20
  Gradient Accumulation steps = 1
  Total optimization steps = 14445
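
The step count in this log follows from the dataset size and batch size. A short worked check, assuming a single device and no gradient accumulation (as the log lines above indicate):

```python
import math

num_examples = 19247  # training chunks
batch_size = 20
num_epochs = 15

# The last, partially filled batch still counts as one optimization step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# The notebook sets logging_steps with floor division instead
logging_steps = num_examples // batch_size

print(steps_per_epoch, total_steps, logging_steps)
# → 963 14445 962
```

Because `logging_steps` (962) is one less than the steps per epoch (963), the logged training loss drifts one step earlier relative to the epoch boundary with each epoch.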


>>> Perplexity: 10.49


Epoch,Training Loss,Validation Loss
1,1.8226,1.489798
2,1.6751,1.488142
3,1.6064,1.472458
4,1.5518,1.464766
5,1.5066,1.45656
6,1.4715,1.437576
7,1.4391,1.462479
8,1.402,1.418378
9,1.3619,1.425511
10,1.3497,1.410594


Saving model checkpoint to weight_decay_0\checkpoint-500
Configuration saved in weight_decay_0\checkpoint-500\config.json
Model weights saved in weight_decay_0\checkpoint-500\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2089
  Batch size = 20
Saving model checkpoint to weight_decay_0\checkpoint-1000
Configuration saved in weight_decay_0\checkpoint-1000\config.json
Model weights saved in weight_decay_0\checkpoint-1000\pytorch_model.bin
Saving model checkpoint to weight_decay_0\checkpoint-1500
Configuration saved in weight_decay_0\checkpoint-1500\config.json
Model weights saved in weight_decay_0\checkpoint-1500\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2089
  Batch size = 20
Saving model checkpoint to weight_decay_0\checkpoint-2000
Configuration saved in weight_decay_0\checkpoint-2000\config.json
Model weights saved in weight_decay_0\checkpoint-2000\pytorch_model.bin
Saving model checkpoint to weight_decay_0\checkpoint-2500
Configuration saved

Saving model checkpoint to weight_decay_0
Configuration saved in weight_decay_0\config.json


>>> Weight decay: 0Perplexity:4.068027780016066


Model weights saved in weight_decay_0\pytorch_model.bin
loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at C:\Users\Noah/.cache\huggingface\transformers\3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.16.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface

***** Running training *****
  Num examples = 19247
  Num Epochs = 15
  Instantaneous batch size per device = 20
  Total train batch size (w. parallel, distributed & accumulation) = 20
  Gradient Accumulation steps = 1
  Total optimization steps = 14445


>>> Perplexity: 10.49


Epoch,Training Loss,Validation Loss
1,1.8228,1.489305
2,1.6748,1.490565
3,1.6062,1.474359
4,1.5507,1.465934
5,1.5051,1.456819
6,1.4644,1.432261
7,1.4312,1.456484
8,1.3931,1.412177
9,1.3508,1.416174
10,1.3374,1.402703


Saving model checkpoint to weight_decay_0.1
Configuration saved in weight_decay_0.1\config.json


>>> Weight decay: 0.1Perplexity:4.030469501050297


Model weights saved in weight_decay_0.1\pytorch_model.bin

***** Running training *****
  Num examples = 19247
  Num Epochs = 15
  Instantaneous batch size per device = 20
  Total train batch size (w. parallel, distributed & accumulation) = 20
  Gradient Accumulation steps = 1
  Total optimization steps = 14445


>>> Perplexity: 10.49


Epoch,Training Loss,Validation Loss
1,1.8229,1.489171
2,1.6754,1.489993
3,1.6063,1.472989
4,1.5523,1.465808
5,1.5067,1.458348
6,1.4718,1.438198
7,1.439,1.462922
8,1.4017,1.417821
9,1.3614,1.426324
10,1.3494,1.410945


Saving model checkpoint to weight_decay_0.01
Configuration saved in weight_decay_0.01\config.json


>>> Weight decay: 0.01Perplexity:4.067527345803425


Model weights saved in weight_decay_0.01\pytorch_model.bin

***** Running training *****
  Num examples = 19247
  Num Epochs = 15
  Instantaneous batch size per device = 20
  Total train batch size (w. parallel, distributed & accumulation) = 20
  Gradient Accumulation steps = 1
  Total optimization steps = 14445


>>> Perplexity: 10.49


Epoch,Training Loss,Validation Loss
1,1.8224,1.488343
2,1.6753,1.487519
3,1.6063,1.474778
4,1.5517,1.462987
5,1.507,1.457924
6,1.4717,1.437412
7,1.4393,1.462606
8,1.4019,1.416968
9,1.3622,1.425924
10,1.35,1.411779


Saving model checkpoint to weight_decay_0.001
Configuration saved in weight_decay_0.001\config.json


>>> Weight decay: 0.001Perplexity:4.069863232278728


Model weights saved in weight_decay_0.001\pytorch_model.bin
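
The four final perplexities reported above can be compared side by side. This is a small sketch that only re-uses the printed numbers; the variable names are illustrative, and the loss column is recovered as the natural log of the perplexity.

```python
import math

# Final validation perplexity per weight decay value, copied from the output above
perplexities = {
    0: 4.068027780016066,
    0.1: 4.030469501050297,
    0.01: 4.067527345803425,
    0.001: 4.069863232278728,
}

best_decay = min(perplexities, key=perplexities.get)
spread = max(perplexities.values()) - min(perplexities.values())

for decay, ppl in sorted(perplexities.items()):
    # perplexity = exp(eval_loss), so eval_loss = ln(perplexity)
    print(f"weight_decay={decay:<6} perplexity={ppl:.3f} eval_loss={math.log(ppl):.4f}")
print(f"best weight_decay: {best_decay}, spread: {spread:.3f}")
```

The lowest perplexity comes from `weight_decay=0.1`, but the spread across all four settings is only about 0.04. With a single run per setting, a gap of that size may well fall within run-to-run variation.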


In [24]:
from transformers import BertConfig, TrainingArguments
import math
dropouts = [0, 0.2, 0.4]

for dropout_val in dropouts:
    batch_size = 20
    # Show the training loss with every epoch
    logging_steps = len(lm_datasets["train"]) // batch_size
    model_name = model_checkpoint.split("/")[-1]

    model_checkpoint = "bert-base-uncased"
    # Start from the checkpoint's own config and override only the dropout rates
    dropout_config = BertConfig.from_pretrained(
        model_checkpoint,
        hidden_dropout_prob=dropout_val,
        attention_probs_dropout_prob=dropout_val,
    )
    model = BertForMaskedLM.from_pretrained(model_checkpoint, config=dropout_config)
    
    training_args = TrainingArguments(
        output_dir=f"dropout_{dropout_val}",
        overwrite_output_dir=True,
        evaluation_strategy="epoch",
        learning_rate=5e-5,
        num_train_epochs=15,
        weight_decay=0.01,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        push_to_hub=False,
        fp16=True,
        logging_steps=logging_steps,
    )
    
    from transformers import Trainer

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_datasets["train"],
        eval_dataset=lm_datasets["validation"],
        data_collator=data_collator,
    )
    
    eval_results = trainer.evaluate()
    print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")    

    trainer.train()
    
    eval_results = trainer.evaluate()
    print(f">>> Dropout: {dropout_val}Perplexity:{math.exp(eval_results['eval_loss'])}")
    
    trainer.save_model()

loading weights file https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin from cache at C:\Users\Noah/.cache\huggingface\transformers\a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
All the weights of BertForMaskedLM were initialized from the model checkpoint at ber

***** Running training *****
  Num examples = 19247
  Num Epochs = 15
  Instantaneous batch size per device = 20
  Total train batch size (w. parallel, distributed & accumulation) = 20
  Gradient Accumulation steps = 1
  Total optimization steps = 14445


>>> Perplexity: 10.49


Epoch,Training Loss,Validation Loss
1,1.6879,1.477764
2,1.5359,1.478608
3,1.4625,1.464348
4,1.4006,1.454949
5,1.3481,1.449947
6,1.3092,1.431754
7,1.2692,1.454478
8,1.2274,1.414745
9,1.1829,1.420637
10,1.1674,1.417111


Saving model checkpoint to dropout_0
Configuration saved in dropout_0\config.json


>>> Dropout: 0Perplexity:4.097257217826608


Model weights saved in dropout_0\pytorch_model.bin

***** Running training *****
  Num examples = 19247
  Num Epochs = 15
  Instantaneous batch size per device = 20
  Total train batch size (w. parallel, distributed & accumulation) = 20
  Gradient Accumulation steps = 1
  Total optimization steps = 14445


>>> Perplexity: 10.49


Epoch,Training Loss,Validation Loss
1,2.038,1.518632
2,1.8591,1.510691
3,1.7921,1.501466
4,1.7365,1.485995
5,1.6927,1.477181
6,1.6594,1.460625
7,1.6279,1.483993
8,1.5943,1.448256
9,1.5581,1.443808
10,1.5465,1.432633


Saving model checkpoint to dropout_0.2
Configuration saved in dropout_0.2\config.json


>>> Dropout: 0.2Perplexity:4.135483218561814


Model weights saved in dropout_0.2\pytorch_model.bin

***** Running training *****
  Num examples = 19247
  Num Epochs = 15
  Instantaneous batch size per device = 20
  Total train batch size (w. parallel, distributed & accumulation) = 20
  Gradient Accumulation steps = 1
  Total optimization steps = 14445


>>> Perplexity: 10.49


Epoch,Training Loss,Validation Loss
1,3.0515,1.727323
2,2.6562,1.689313
3,2.5001,1.668526
4,2.4165,1.649227
5,2.3539,1.649804
6,2.3008,1.625283
7,2.2576,1.645586
8,2.2201,1.60351
9,2.1782,1.599943
10,2.158,1.592281


Saving model checkpoint to dropout_0.4
Configuration saved in dropout_0.4\config.json


>>> Dropout: 0.4Perplexity:4.848238999974596


Model weights saved in dropout_0.4\pytorch_model.bin
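
One pattern in the dropout tables is worth a quick check: at higher dropout the training loss sits above the validation loss. The training loss is logged with dropout active, while evaluation runs with dropout disabled, so the two columns are not measured under the same conditions. The sketch below only re-uses the epoch 10 rows (the last rows shown in each table); the variable names are illustrative.

```python
# (dropout, training loss, validation loss) at epoch 10, copied from the tables above
last_rows = [
    (0.0, 1.1674, 1.417111),
    (0.2, 1.5465, 1.432633),
    (0.4, 2.1580, 1.592281),
]

# Positive gap: training loss is higher than validation loss
gaps = {dropout: round(train - val, 4) for dropout, train, val in last_rows}

for dropout, gap in gaps.items():
    print(f"dropout={dropout}: train - validation = {gap:+.4f}")
# → dropout=0.0: train - validation = -0.2497
# → dropout=0.2: train - validation = +0.1139
# → dropout=0.4: train - validation = +0.5657
```

The gap grows with the dropout rate, which fits the noisier training-time forward pass. It says nothing by itself about which setting generalizes best; for that, the final perplexities printed above are the relevant numbers.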
