<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Pre-training LLMs with Hugging Face**

Estimated time: **45** minutes

# Introduction

This project aims to introduce you to the process of pretraining large language models (LLMs) using the popular Hugging Face library. Hugging Face is a leading open-source platform for natural language processing that provides a wide range of pretrained models and tools for fine-tuning and deploying these models.

You will learn how to load pre-trained models from Hugging Face and make inferences using the Pipeline module. Additionally, you will learn how to further train pre-trained LLMs on your own data (self-supervised fine-tuning). By the end of this lab, you will have a solid understanding of how to pretrain LLMs and store them to later fine-tune for your specific use cases. This will empower you to create powerful and customized natural language processing solutions.


# __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-required-libraries">Installing required libraries</a></li>
            <li><a href="#Importing-required-libraries">Importing required libraries</a></li>
        </ol>
    </li>
    <li><a href="#Pretraining-and-self-supervised-fine-tuning">Pretraining and self-supervised fine-tuning</a>
        <ol>
            <li><a href="#Importing-required-datasets">Importing required datasets</a></li>
            <li><a href="#Loading-the-saved-model">Loading the saved model</a></li>
            <li><a href="#Inferencing-a-pretrained-BERT-model">Inferencing a pretrained BERT model</a></li>
        </ol>
    </li>
    <li><a href="#Exercise">Exercise</a></li>
</ol>


---


# Objectives

After completing this lab, you will be able to:


 - Load pretrained LLMs from Hugging Face and make inferences using the pipeline module
 - Train pretrained LLMs on your data 
 - Store LLMs to fine-tune them for specific use cases
 


---


# Setup


### Installing required libraries
The following required libraries are pre-installed in the Skills Network Labs environment. However, if you run these notebook commands in a different Jupyter environment (e.g. Watson Studio or Anaconda), you will need to install these libraries by removing the `#` sign before `%pip` in the code cell below:

_PS: To run this lab in your own environment, please note that the versions of libraries may differ due to dependencies._


In [3]:
# All libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented out.
# %pip install -q pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 torch==2.1.0+cu118
# - Update a specific package
# %pip install pmdarima -U
# - Update a package to a specific version
# %pip install --upgrade pmdarima==2.0.2
# Note: If your environment doesn't support "%pip install", use "!mamba install"

The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:


In [4]:
# %pip install transformers==4.40.0
%pip install -U git+https://github.com/huggingface/transformers
# datasets (2.15.0)
%pip install datasets
# Quote specifiers that contain ">" or "*" so the shell does not interpret them
%pip install "portalocker>=2.0.0"
%pip install -q -U git+https://github.com/huggingface/accelerate.git
%pip install torch==2.3.0
%pip install -U torchvision
%pip install "protobuf==3.20.*"

Note: you may need to restart the kernel to use updated packages.

### Importing required libraries

_It is recommended that you import all required libraries in one place (here):_
* Note: If you get an error after running the cell below, try restarting the kernel, as some packages only take effect after a restart.


In [6]:
import torch
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)
from transformers import pipeline
from datasets import load_dataset

from tqdm.auto import tqdm
import math
import time
import os


# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

Disable tokenizer parallelism to avoid deadlocks.


In [7]:
# Set the environment variable TOKENIZERS_PARALLELISM to 'false'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'


---


# Pretraining and self-supervised fine-tuning


Pretraining is a technique used in natural language processing (NLP) to train large language models (LLMs) on a vast corpus of unlabeled text data. The goal is to capture the general patterns and semantic relationships present in natural language, allowing the model to develop a deep understanding of language structure and meaning.

The motivation behind pretraining transformers is to address the limitations of traditional NLP approaches that often require significant amounts of labeled data for each specific task. By leveraging the abundance of unlabeled text data, pretraining enables the model to learn fundamental language skills through self-supervised objectives, facilitating transfer learning.

The pretraining objectives, such as masked language modeling (MLM) and next sentence prediction (NSP), play a crucial role in the success of transformer models. Pretrained models can be further tuned by training them on domain-specific unlabeled data, which is known as self-supervised fine-tuning.

Also, the model can be fine-tuned on specific downstream tasks using labeled data, a process known as supervised fine-tuning, further improving its performance.

In the following sections of this lab, you will explore pretraining objectives, loading pretrained models, data preparation, and the fine-tuning process. By the end, you will have a solid understanding of pretraining and self-supervised fine-tuning, empowering you to apply these techniques to solve real-world NLP problems.


Let's start with loading a pretrained model from Hugging Face and making an inference:


In [11]:
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

pipe = pipeline("text-generation", model=model,tokenizer=tokenizer)
print(pipe("This movie was really")[0]["generated_text"])

Device set to use cpu


This movie was really good. I was really surprised by how good it was.
I was


## Pre-training Objectives

Pre-training objectives are crucial components of the pre-training process for transformers. These objectives define the tasks that the model is trained on during the pre-training phase, allowing it to learn meaningful contextual representations of language. Three commonly used pre-training objectives are masked language modeling (MLM), next sentence prediction (NSP), and next token prediction.

1. Masked Language Modeling (MLM):
   Masked language modeling involves randomly masking some words in a sentence and training the model to predict the masked words based on the context provided by the surrounding words (i.e., words that appear both before and after the masked word). The objective is to enable the model to learn contextual understanding and fill in missing information.

2. Next Sentence Prediction (NSP):
   Next sentence prediction involves training the model to predict whether two sentences are consecutive in the original text or randomly chosen from the corpus. This objective helps the model learn sentence-level relationships and understand the coherence between sentences.

3. Next Token Prediction:
   In this objective, the model is trained to predict the next token in a sequence of text. The model is presented with a sequence of text and must learn to predict the most likely next token based only on the tokens that precede it.

It's important to note that different pre-trained models may use variations or combinations of these objectives, depending on the specific architecture and training setup.
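To make the last two objectives concrete, here is a minimal pure-Python sketch of how training examples could be built for next token prediction and NSP. The helper names `next_token_pairs` and `nsp_pairs` are hypothetical and are not part of the Transformers library; real pipelines work on token IDs and batches rather than word lists.

```python
import random


def next_token_pairs(tokens):
    """Return (context, target) pairs: each prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def nsp_pairs(sentences, rng):
    """Build (sentence_a, sentence_b, is_next) examples for next sentence prediction.

    Assumes at least three distinct sentences so a negative example can be drawn.
    """
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the true following sentence
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence except sentence_a and its true successor
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs


tokens = ["this", "movie", "was", "really", "good"]
for context, target in next_token_pairs(tokens):
    print(context, "->", target)

sentences = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
for example in nsp_pairs(sentences, random.Random(0)):
    print(example)
```

The MLM objective is handled later in this lab by a data collator, so it is not repeated here.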


## Self-supervised training of a BERT model
Training a BERT (Bidirectional Encoder Representations from Transformers) model is a complex and time-consuming process that requires a large corpus of unlabeled text data and significant computational resources. This lab therefore uses a simplified exercise to demonstrate the steps involved in pre-training a BERT model using the Masked Language Modeling (MLM) objective.

For this exercise, we'll use the Hugging Face Transformers library, which provides pre-implemented BERT models and tools for pre-training. You will be instructed to:
- Prepare the train dataset
- Train a Tokenizer
- Preprocess the dataset
- Pre-train BERT using an MLM task
- Evaluate the trained model


### Importing required datasets

The WikiText dataset is a widely used benchmark dataset in the field of natural language processing (NLP). The dataset contains a large amount of text extracted from Wikipedia, which is a vast online encyclopedia covering a wide range of topics. The articles in the WikiText dataset are preprocessed to remove formatting, hyperlinks, and other metadata, resulting in a clean text corpus.

The WikiText dataset has 4 different configs, and each config is divided into three parts: a training set, a validation set, and a test set. The training set is used for training language models, while the validation and test sets are used for evaluating the performance of the models.
First, let's load the `wikitext-2-raw-v1` config, which returns all three splits.

_Note: The original BERT was pretrained on Wikipedia and BookCorpus datasets._


In [9]:
# Load the datasets
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

Generating test split: 100%|██████████| 4358/4358 [00:00<00:00, 442242.74 examples/s]
Generating train split: 100%|██████████| 36718/36718 [00:00<00:00, 1806993.64 examples/s]
Generating validation split: 100%|██████████| 3760/3760 [00:00<00:00, 939788.04 examples/s]


Let's check the dataset:


In [12]:
print(dataset)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


Check a sample record:


In [13]:
# Check a sample record
dataset["train"][400]

{'text': " When Mason was injured in warm @-@ ups late in the year , Columbus was without an active goaltender on their roster . To remedy the situation , the team signed former University of Michigan goaltender Shawn Hunwick to a one @-@ day , amateur tryout contract . After being eliminated from the NCAA Tournament just days prior , Hunwick skipped an astronomy class and drove his worn down 2003 Ford Ranger to Columbus to make the game . He served as the back @-@ up to Allen York during the game , and the following day , he signed a contract for the remainder of the year . With Mason returning from injury , Hunwick was third on the team 's depth chart when an injury to York allowed Hunwick to remain as the back @-@ up for the final two games of the year . In the final game of the season , the Blue Jackets were leading the Islanders 7 – 3 with 2 : 33 remaining when , at the behest of his teammates , Head Coach Todd Richards put Hunwick in to finish the game . He did not face a shot . 

This dataset contains 36,718 rows of training data. If you do not run the code on a GPU-powered notebook, you will need to decrease the size of the dataset to be able to complete the training. The cell below keeps the first 1,000 training rows and the first 200 test rows; adjust the ranges, or comment the lines out, to train on a different portion of the dataset:


In [14]:
dataset["train"] = dataset["train"].select(range(1000))
dataset["test"] = dataset["test"].select(range(200))

In [15]:
print(dataset)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 200
    })
    train: Dataset({
        features: ['text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


The text files written below are used next to create `TextDataset` objects for training:


In [16]:
# Path to save the datasets to text files
output_file_train = "wikitext_dataset_train.txt"
output_file_test = "wikitext_dataset_test.txt"

# Open the output file in write mode
with open(output_file_train, "w", encoding="utf-8") as f:
    # Iterate over each example in the dataset
    for example in dataset["train"]:
        # Write the example text to the file
        f.write(example["text"] + "\n")

# Open the output file in write mode
with open(output_file_test, "w", encoding="utf-8") as f:
    # Iterate over each example in the dataset
    for example in dataset["test"]:
        # Write the example text to the file
        f.write(example["text"] + "\n")

You need to define a tokenizer to be used for tokenizing the dataset.


In [17]:
# create a tokenizer from existing one to re-use special tokens
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

In [18]:
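model_name = 'bert-base-uncased'

# is_decoder is a model configuration option, so it is passed to the model rather than the tokenizer.
# BERT loaded as a causal LM (`BertLMHeadModel`) needs is_decoder=True to be used as a standalone decoder.
model = AutoModelForCausalLM.from_pretrained(model_name, is_decoder=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
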


### Training a Tokenizer (Optional)

In the previous cell, you created a tokenizer instance from a pre-trained BERT tokenizer. The code below trains a new tokenizer on your own dataset instead. This is especially helpful when using transformers in specialized domains such as medicine, where the vocabulary differs from the general-purpose text that existing tokenizers were built on. (You can skip this step if you do not want to train the tokenizer on your specific data):


In [19]:
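## Create a Python generator to load the training text batch by batch
def batch_iterator(batch_size=10000):
    # Iterate over the rows of the training split (len(dataset) would count the splits, not the rows)
    for i in tqdm(range(0, len(dataset["train"]), batch_size)):
        yield dataset["train"][i : i + batch_size]["text"]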

## create a tokenizer from existing one to re-use special tokens
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

## train the tokenizer using our own dataset
bert_tokenizer = bert_tokenizer.train_new_from_iterator(text_iterator=batch_iterator(), vocab_size=30522)

100%|██████████| 1/1 [00:00<00:00, 10.09it/s]


### Pretraining

In this step, we define the configuration of the BERT model and create the model:
#### Define the BERT Configuration
Here, we define the configuration settings for a BERT model using `BertConfig`. This includes setting various parameters related to the model's architecture:
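- **vocab_size**: Specifies the size of the vocabulary. This number should match the vocabulary size used by the tokenizer, so the code below reads it from the tokenizer with `len(bert_tokenizer.get_vocab())`. The original `bert-base-uncased` vocabulary has 30,522 entries; a tokenizer trained on a small corpus can end up with fewer.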
- **hidden_size=768**: Sets the size of the hidden layers.
- **num_hidden_layers=12**: Determines the number of hidden layers in the transformer model.
- **num_attention_heads=12**: Sets the number of attention heads in each attention layer.
- **intermediate_size=3072**: Specifies the size of the "intermediate" (i.e., feed-forward) layer within the transformer.



In [20]:
# Define the BERT configuration
config = BertConfig(
    vocab_size=len(bert_tokenizer.get_vocab()),  # Vocabulary size (must equal the tokenizer's vocabulary size)
    hidden_size=768,  # Set the hidden size
    num_hidden_layers=12,  # Set the number of layers
    num_attention_heads=12,  # Set the number of attention heads
    intermediate_size=3072,  # Set the intermediate size
)
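As a rough check on what this configuration implies, the parameter count can be worked out by hand. The sketch below is an estimate under stated assumptions, not the library's own accounting: it assumes the default maximum of 512 positions and 2 token types, no pooling layer, and an MLM output layer whose weights are tied to the word embeddings. The function name `bert_mlm_param_count` is hypothetical.

```python
def bert_mlm_param_count(vocab_size, hidden_size=768, num_hidden_layers=12,
                         intermediate_size=3072, max_position_embeddings=512,
                         type_vocab_size=2):
    """Estimate the trainable parameters of a BERT masked-LM model, component by component."""
    layer_norm = 2 * hidden_size  # scale and bias vectors

    # Word, position, and token-type embedding tables plus one LayerNorm
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm

    # Query, key, value, and output projections (weights + biases) plus one LayerNorm
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm

    # Two feed-forward projections (weights + biases) plus one LayerNorm
    feed_forward = (
        (hidden_size * intermediate_size + intermediate_size)
        + (intermediate_size * hidden_size + hidden_size)
        + layer_norm
    )
    encoder = num_hidden_layers * (attention + feed_forward)

    # MLM head: dense transform + LayerNorm + output bias (output weights assumed tied)
    mlm_head = (hidden_size * hidden_size + hidden_size) + layer_norm + vocab_size

    return {
        "embeddings": embeddings,
        "encoder": encoder,
        "mlm_head": mlm_head,
        "total": embeddings + encoder + mlm_head,
    }


for name, count in bert_mlm_param_count(vocab_size=30522).items():
    print(f"{name:>10}: {count:,}")
```

Note that `num_attention_heads` does not appear in the formula: the heads split the hidden dimension rather than adding parameters. Once the model has been created, you can compare the estimate with `sum(p.numel() for p in model.parameters())`.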

Create the BERT model for pre-training:


In [21]:
# Create the BERT model for pre-training
model = BertForMaskedLM(config)

BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


Check the model architecture:


In [22]:
# Check the model architecture
model

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(12575, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

### Define the Training Dataset
Here, we define a training dataset using the `TextDataset` class, which is suited for loading and processing text data for training language models. This setup typically involves a few key parameters:

- **tokenizer=bert_tokenizer**: Specifies the tokenizer to be used. Here, `bert_tokenizer` is an instance of a BERT tokenizer, responsible for converting text into tokens that the model can understand.
- **file_path="wikitext_dataset_train.txt"**: The path to the pre-training data file. This should point to a text file containing the training data.
- **block_size=128**: Sets the desired block size for training. This defines the length, in tokens, of the sequences that the model will be trained on.

The `TextDataset` class is designed to take large pieces of text (such as those found in the specified file), tokenize them, and efficiently handle them in manageable blocks of the specified size.
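The blocking step can be sketched in plain Python. This is an illustration of the idea rather than the library's implementation: it assumes each block reserves two positions for a start and an end special token, and that a trailing chunk too short to fill a block is dropped. The function name `chunk_into_blocks` and the IDs used in the example are hypothetical.

```python
def chunk_into_blocks(token_ids, block_size, cls_id, sep_id):
    """Split one long token sequence into fixed-size blocks wrapped in special tokens."""
    body = block_size - 2  # positions left after reserving the two special tokens
    blocks = []
    # Stop early enough that every block is full; leftover tokens are dropped
    for start in range(0, len(token_ids) - body + 1, body):
        blocks.append([cls_id] + token_ids[start:start + body] + [sep_id])
    return blocks


# Ten token IDs, block size 5 -> three blocks of three tokens each; the last ID is dropped
blocks = chunk_into_blocks(list(range(10, 20)), block_size=5, cls_id=2, sep_id=3)
for block in blocks:
    print(block)
```

Because the text is tokenized as one long sequence before it is cut into blocks, a warning that the sequence is longer than the model's maximum length at this stage refers to the uncut text, not to the blocks the model is trained on.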



In [23]:
# Prepare the pre-training data as a TextDataset
train_dataset = TextDataset(
    tokenizer=bert_tokenizer,
    file_path="wikitext_dataset_train.txt",  # Path to your pre-training data file
    block_size=128  # Set the desired block size for training
)
test_dataset = TextDataset(
    tokenizer=bert_tokenizer,
    file_path="wikitext_dataset_test.txt",  # Path to your pre-training data file
    block_size=128  # Set the desired block size for training
)

Token indices sequence length is longer than the specified maximum sequence length for this model (55468 > 512). Running this sequence through the model will result in indexing errors


Examining one sample shows the token IDs of a single block, whose length equals the block size (128):


In [24]:
train_dataset[0]

tensor([    2,    31,   619,   799,  1414,    31,  3541,   593,   619,    22,
           29, 12314,   799,    12,  3189,    29,   105,   103,  5412,    15,
          490,    17,   619,   184,   171,  2553,    22,    13,    15,  3533,
         3584,   191,   216,   619,   799,  1414,  1540,  2033,    15,   256,
           35,  5441,  1124,    32,    16,    32,  1843,  2188,   398,  1762,
          251,  3402,   188,  3160,    17,  2434,   209,   171,  2290,  4153,
           17,  1349,   180,  1685,  1213,   180,  2033,    15,   288,   256,
          171,  1310,   398,   180,   171,   619,  1220,    17,  6800,   171,
          859,  7536,   184,  5441,   188,  1151,    32,    16,    32,   533,
         1972,   216,   408,  4372,    15,   171,  1504, 10328,  9719,   191,
          171,   447,   398,   188,  6353,   171,     6,  1864,     6,    15,
           35,  3527,  1062,  2489,  4113,   171,  1807,   184,  3575,   514,
          171,   753,  7207,   527,   407,   852,  2471,     3])

Then, we prepare data for the MLM task (masking random tokens):

### Define the Data Collator for Language Modeling
The code below sets up a `DataCollatorForLanguageModeling` from the Hugging Face Transformers library. A data collator is used during training to dynamically create batches of data. For language modeling, particularly for models like BERT that use masked language modeling (MLM), this collator prepares training batches by automatically masking tokens according to a specified probability. Here are the details of the parameters used:

- **tokenizer=bert_tokenizer**: Specifies the tokenizer to be used with the data collator. The collator uses it to find the mask token and the other special tokens, and to pad sequences in a batch to the same length.
- **mlm=True**: Indicates that the data collator should mask tokens for masked language modeling training. This parameter being set to `True` configures the collator to randomly mask some of the tokens in the input data, which the model will then attempt to predict.
- **mlm_probability=0.15**: Sets the probability with which each token is selected for prediction. A probability of 0.15 means that, on average, 15% of the tokens in any sequence are selected. By default, 80% of the selected tokens are replaced with the mask token, 10% are replaced with a random token, and 10% are left unchanged, which is why the sample output below contains a few changed token IDs that are not the mask ID.
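The masking rule can be sketched in plain Python. This is a simplified illustration of the 80%/10%/10% scheme, not the collator's actual tensor code; the function name `mask_tokens` and the example IDs are hypothetical. Positions that are not selected receive the label `-100`, the value the loss function ignores.

```python
import random

IGNORE_INDEX = -100  # label value for positions that do not contribute to the loss


def mask_tokens(input_ids, special_ids, mask_id, vocab_size, rng, mlm_probability=0.15):
    """Return (masked_input_ids, labels) for one sequence under the MLM objective."""
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        # Special tokens are never selected; other tokens are selected with mlm_probability
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


ids = [2] + list(range(100, 130)) + [3]  # assumed IDs: 2 = start token, 3 = end token
masked, labels = mask_tokens(ids, special_ids={2, 3}, mask_id=4, vocab_size=12575,
                             rng=random.Random(0))
print(masked)
print(labels)
```

Because the selection is random, each epoch sees a different masking of the same block, which is the reason the masking is done in the collator rather than once when the dataset is built.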


In [25]:
# Prepare the data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)

In [26]:
# check how collator transforms a sample input data record
data_collator([train_dataset[0]])

{'input_ids': tensor([[    2,    31,   619,  2130,  1414,    31,  3541,     4,   619,    22,
             29,     4,   799,    12,  3189,    29,     4,   103,  5412,    15,
            490,     4,   619,   184,   171,  2553,    22,    13,    15,  3533,
           3584,   191,     4,   619,   799,  1414,  1540,  2033,    15,   256,
             35,  5441,  1124,    32,    16,     4,  1843,  2188,   398,  1762,
            251,  3402,     4,     4,    17,  2434,   209,   171,  2290,  4153,
             17,  1349,   180,  1685,  1213,   180,  2033,    15,     4,   256,
            171,  1310,   398,   180,     4,   619,  1220,    17,  6800,   171,
            859,  7536,   184,     4,   188,  1151,    32,    16,  1662,   533,
           1972,   216,   408,  4372,    15,   171,  1504, 10328,  9719,   191,
            171,   447,   398,   188,  6353,   171,     6,  1864,     6,    15,
             35,  3527,  1062,  2489,  4113,   171,  1807,   184,  3575,   514,
            171,     4,    

Now, we train the BERT model using the `Trainer` class. (For a complete list of training arguments, check [here](https://huggingface.co/docs/transformers/v4.33.2/en/main_classes/trainer#transformers.TrainingArguments).)
The `TrainingArguments` in the cell below control how the model is trained, evaluated, and saved:

- **output_dir="./trained_model"**: Specifies the directory where the trained model and other output files will be saved.
- **overwrite_output_dir=True**: If set to `True`, this will overwrite the contents of the output directory if it already exists. This is useful when running experiments multiple times.
- **do_eval=True**: Enables evaluation of the model. If `True`, the model will be evaluated at the specified intervals.
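- **evaluation_strategy="epoch"**: Defines when the model should be evaluated. Setting this to "epoch" means the model will be evaluated at the end of each epoch. (Recent Transformers releases name this argument `eval_strategy`; if your installed version rejects `evaluation_strategy`, try the newer name.)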
- **learning_rate=5e-5**: Sets the learning rate for training the model. This is a typical learning rate for fine-tuning BERT-like models.
- **num_train_epochs=10**: Specifies the number of training epochs. Each epoch involves a full pass over the training data.
- **per_device_train_batch_size=2**: Sets the batch size for training on each device. This should be set based on the memory capacity of your hardware.
- **save_total_limit=2**: Limits the total number of model checkpoints to be saved. Only the most recent two checkpoints will be kept.
- **logging_steps=20**: Determines how often to log training information, which can help monitor the training process.
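These settings determine how many optimizer steps the run takes and how the learning rate evolves. The sketch below is a worked example under assumptions: a single device, no gradient accumulation, no warmup, and a learning rate that decays linearly to zero. The figure of 440 training blocks is inferred from the progress bar in this lab's training log (220 steps per epoch at batch size 2), not read from the dataset, and both function names are hypothetical.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def linear_decay_lr(base_lr, step, total_steps):
    """Learning rate after `step` steps of a linear decay to zero with no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


total = total_training_steps(num_examples=440, batch_size=2, num_epochs=10)
print(total)  # 2200

for step in (20, 220, 440):
    print(step, linear_decay_lr(5e-5, step, total))
```

Under these assumptions the values at steps 20, 220, and 440 come out near 4.95e-05, 4.5e-05, and 4e-05, which agrees with the `learning_rate` entries in the training log.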


In [31]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir="./trained_model",  # Specify the output directory for the trained model
    overwrite_output_dir=True,
    do_eval=True,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    num_train_epochs=10,  # Specify the number of training epochs
    per_device_train_batch_size=2,  # Set the batch size for training
    save_total_limit=2,  # Limit the total number of saved checkpoints
    logging_steps=20,  # Log training metrics every 20 steps
)

# Instantiate the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Start the pre-training
trainer.train()

  1%|          | 20/2200 [00:13<23:37,  1.54it/s]

{'loss': 8.6854, 'grad_norm': 7.9176154136657715, 'learning_rate': 4.9545454545454553e-05, 'epoch': 0.09}


  2%|▏         | 40/2200 [00:26<23:09,  1.55it/s]

{'loss': 7.7963, 'grad_norm': 6.877941608428955, 'learning_rate': 4.909090909090909e-05, 'epoch': 0.18}


  3%|▎         | 60/2200 [00:39<22:34,  1.58it/s]

{'loss': 7.672, 'grad_norm': 6.9616265296936035, 'learning_rate': 4.863636363636364e-05, 'epoch': 0.27}


  4%|▎         | 80/2200 [00:52<22:51,  1.55it/s]

{'loss': 7.2586, 'grad_norm': 6.833195209503174, 'learning_rate': 4.8181818181818186e-05, 'epoch': 0.36}


  5%|▍         | 100/2200 [01:05<22:34,  1.55it/s]

{'loss': 7.1587, 'grad_norm': 7.211719036102295, 'learning_rate': 4.772727272727273e-05, 'epoch': 0.45}


  5%|▌         | 120/2200 [01:18<21:56,  1.58it/s]

{'loss': 7.2264, 'grad_norm': 6.316149711608887, 'learning_rate': 4.7272727272727275e-05, 'epoch': 0.55}


  6%|▋         | 140/2200 [01:31<21:38,  1.59it/s]

{'loss': 7.1101, 'grad_norm': 5.497359275817871, 'learning_rate': 4.681818181818182e-05, 'epoch': 0.64}


  7%|▋         | 160/2200 [01:43<21:08,  1.61it/s]

{'loss': 7.0734, 'grad_norm': 6.176764488220215, 'learning_rate': 4.636363636363636e-05, 'epoch': 0.73}


  8%|▊         | 180/2200 [01:56<21:19,  1.58it/s]

{'loss': 6.9881, 'grad_norm': 6.242911338806152, 'learning_rate': 4.5909090909090914e-05, 'epoch': 0.82}


  9%|▉         | 200/2200 [02:08<20:49,  1.60it/s]

{'loss': 6.9933, 'grad_norm': 5.604017734527588, 'learning_rate': 4.545454545454546e-05, 'epoch': 0.91}


 10%|█         | 220/2200 [02:21<20:34,  1.60it/s]

{'loss': 7.1042, 'grad_norm': 5.359533309936523, 'learning_rate': 4.5e-05, 'epoch': 1.0}


                                                  
 10%|█         | 220/2200 [02:25<20:34,  1.60it/s]

{'eval_loss': 7.898343086242676, 'eval_runtime': 4.4645, 'eval_samples_per_second': 23.071, 'eval_steps_per_second': 2.912, 'epoch': 1.0}


 11%|█         | 240/2200 [02:38<21:52,  1.49it/s]  

{'loss': 6.8234, 'grad_norm': 5.851937770843506, 'learning_rate': 4.454545454545455e-05, 'epoch': 1.09}


 12%|█▏        | 260/2200 [02:51<20:43,  1.56it/s]

{'loss': 6.8225, 'grad_norm': 6.21433162689209, 'learning_rate': 4.409090909090909e-05, 'epoch': 1.18}


 13%|█▎        | 280/2200 [03:04<20:27,  1.56it/s]

{'loss': 6.8061, 'grad_norm': 4.921005725860596, 'learning_rate': 4.3636363636363636e-05, 'epoch': 1.27}


 14%|█▎        | 300/2200 [03:17<20:05,  1.58it/s]

{'loss': 6.9762, 'grad_norm': 5.653116703033447, 'learning_rate': 4.318181818181819e-05, 'epoch': 1.36}


 15%|█▍        | 320/2200 [03:29<19:23,  1.62it/s]

{'loss': 7.1415, 'grad_norm': 5.1109843254089355, 'learning_rate': 4.2727272727272724e-05, 'epoch': 1.45}


 15%|█▌        | 340/2200 [03:42<19:05,  1.62it/s]

{'loss': 6.9506, 'grad_norm': 5.632959365844727, 'learning_rate': 4.2272727272727275e-05, 'epoch': 1.55}


 16%|█▋        | 360/2200 [03:54<19:08,  1.60it/s]

{'loss': 6.7768, 'grad_norm': 4.918776035308838, 'learning_rate': 4.181818181818182e-05, 'epoch': 1.64}


 17%|█▋        | 380/2200 [04:07<19:30,  1.55it/s]

{'loss': 6.7739, 'grad_norm': 5.500741004943848, 'learning_rate': 4.1363636363636364e-05, 'epoch': 1.73}


 18%|█▊        | 400/2200 [04:20<18:22,  1.63it/s]

{'loss': 7.0603, 'grad_norm': 5.681009292602539, 'learning_rate': 4.0909090909090915e-05, 'epoch': 1.82}


 19%|█▉        | 420/2200 [04:32<18:55,  1.57it/s]

{'loss': 6.9691, 'grad_norm': 4.311225891113281, 'learning_rate': 4.045454545454546e-05, 'epoch': 1.91}


 20%|██        | 440/2200 [04:44<17:09,  1.71it/s]

{'loss': 6.8034, 'grad_norm': 5.501510143280029, 'learning_rate': 4e-05, 'epoch': 2.0}


                                                  
 20%|██        | 440/2200 [04:49<17:09,  1.71it/s]

{'eval_loss': 7.988618850708008, 'eval_runtime': 4.2869, 'eval_samples_per_second': 24.027, 'eval_steps_per_second': 3.032, 'epoch': 2.0}


 21%|██        | 460/2200 [05:01<17:41,  1.64it/s]

{'loss': 7.0108, 'grad_norm': 5.463977336883545, 'learning_rate': 3.954545454545455e-05, 'epoch': 2.09}


 22%|██▏       | 480/2200 [05:13<16:39,  1.72it/s]

{'loss': 6.8901, 'grad_norm': 5.646495819091797, 'learning_rate': 3.909090909090909e-05, 'epoch': 2.18}


 23%|██▎       | 500/2200 [05:25<17:03,  1.66it/s]

{'loss': 6.8794, 'grad_norm': 4.787571907043457, 'learning_rate': 3.8636363636363636e-05, 'epoch': 2.27}


 24%|██▎       | 520/2200 [05:42<16:38,  1.68it/s]  

{'loss': 6.7924, 'grad_norm': 5.253170013427734, 'learning_rate': 3.818181818181819e-05, 'epoch': 2.36}


 25%|██▍       | 540/2200 [05:54<16:41,  1.66it/s]

{'loss': 6.911, 'grad_norm': 4.960644721984863, 'learning_rate': 3.7727272727272725e-05, 'epoch': 2.45}


 25%|██▌       | 560/2200 [06:06<16:17,  1.68it/s]

{'loss': 6.8957, 'grad_norm': 5.300518989562988, 'learning_rate': 3.7272727272727276e-05, 'epoch': 2.55}


 26%|██▋       | 580/2200 [06:18<16:38,  1.62it/s]

{'loss': 6.9509, 'grad_norm': 5.891037464141846, 'learning_rate': 3.681818181818182e-05, 'epoch': 2.64}


 27%|██▋       | 600/2200 [06:31<17:22,  1.54it/s]

{'loss': 6.8585, 'grad_norm': 4.70629358291626, 'learning_rate': 3.6363636363636364e-05, 'epoch': 2.73}


 28%|██▊       | 620/2200 [06:43<15:31,  1.70it/s]

{'loss': 7.0128, 'grad_norm': 5.0897536277771, 'learning_rate': 3.590909090909091e-05, 'epoch': 2.82}


 29%|██▉       | 640/2200 [06:55<15:08,  1.72it/s]

{'loss': 6.9196, 'grad_norm': 5.418538570404053, 'learning_rate': 3.545454545454546e-05, 'epoch': 2.91}


 30%|███       | 660/2200 [07:07<14:44,  1.74it/s]

{'loss': 6.7506, 'grad_norm': 5.332900047302246, 'learning_rate': 3.5e-05, 'epoch': 3.0}


                                                  
 30%|███       | 660/2200 [07:11<14:44,  1.74it/s]

{'eval_loss': 8.077054023742676, 'eval_runtime': 4.406, 'eval_samples_per_second': 23.377, 'eval_steps_per_second': 2.951, 'epoch': 3.0}


 31%|███       | 680/2200 [07:23<14:42,  1.72it/s]

{'loss': 6.8501, 'grad_norm': 5.486556053161621, 'learning_rate': 3.454545454545455e-05, 'epoch': 3.09}


 32%|███▏      | 700/2200 [07:35<14:31,  1.72it/s]

{'loss': 6.7417, 'grad_norm': 5.315493583679199, 'learning_rate': 3.409090909090909e-05, 'epoch': 3.18}


 33%|███▎      | 720/2200 [07:46<14:27,  1.71it/s]

{'loss': 7.0218, 'grad_norm': 5.296100616455078, 'learning_rate': 3.3636363636363636e-05, 'epoch': 3.27}


 34%|███▎      | 740/2200 [07:58<13:57,  1.74it/s]

{'loss': 6.7724, 'grad_norm': 4.756875991821289, 'learning_rate': 3.318181818181819e-05, 'epoch': 3.36}


 35%|███▍      | 760/2200 [08:09<13:40,  1.75it/s]

{'loss': 6.7015, 'grad_norm': 5.597079277038574, 'learning_rate': 3.272727272727273e-05, 'epoch': 3.45}


 35%|███▌      | 780/2200 [08:21<13:32,  1.75it/s]

{'loss': 6.9017, 'grad_norm': 4.529539108276367, 'learning_rate': 3.2272727272727276e-05, 'epoch': 3.55}


 36%|███▋      | 800/2200 [08:32<13:17,  1.75it/s]

{'loss': 6.9803, 'grad_norm': 5.777277946472168, 'learning_rate': 3.181818181818182e-05, 'epoch': 3.64}


 37%|███▋      | 820/2200 [08:44<13:04,  1.76it/s]

{'loss': 6.946, 'grad_norm': 5.125247478485107, 'learning_rate': 3.1363636363636365e-05, 'epoch': 3.73}


 38%|███▊      | 840/2200 [08:55<12:58,  1.75it/s]

{'loss': 6.7627, 'grad_norm': 4.698935031890869, 'learning_rate': 3.090909090909091e-05, 'epoch': 3.82}


 39%|███▉      | 860/2200 [09:07<12:41,  1.76it/s]

{'loss': 6.792, 'grad_norm': 4.561478614807129, 'learning_rate': 3.0454545454545456e-05, 'epoch': 3.91}


 40%|████      | 880/2200 [09:18<12:42,  1.73it/s]

{'loss': 6.8872, 'grad_norm': 5.3332133293151855, 'learning_rate': 3e-05, 'epoch': 4.0}


                                                  
 40%|████      | 880/2200 [09:23<12:42,  1.73it/s]

{'eval_loss': 8.16103458404541, 'eval_runtime': 4.3984, 'eval_samples_per_second': 23.418, 'eval_steps_per_second': 2.956, 'epoch': 4.0}


 41%|████      | 900/2200 [09:35<12:34,  1.72it/s]

{'loss': 6.8564, 'grad_norm': 5.353549957275391, 'learning_rate': 2.954545454545455e-05, 'epoch': 4.09}


 42%|████▏     | 920/2200 [09:46<12:24,  1.72it/s]

{'loss': 6.6681, 'grad_norm': 4.911650657653809, 'learning_rate': 2.909090909090909e-05, 'epoch': 4.18}


 43%|████▎     | 940/2200 [09:58<12:15,  1.71it/s]

{'loss': 6.9156, 'grad_norm': 5.606752872467041, 'learning_rate': 2.863636363636364e-05, 'epoch': 4.27}


 44%|████▎     | 960/2200 [10:10<12:01,  1.72it/s]

{'loss': 6.9168, 'grad_norm': 4.91315221786499, 'learning_rate': 2.818181818181818e-05, 'epoch': 4.36}


 45%|████▍     | 980/2200 [10:21<11:50,  1.72it/s]

{'loss': 6.92, 'grad_norm': 5.263175010681152, 'learning_rate': 2.772727272727273e-05, 'epoch': 4.45}


 45%|████▌     | 1000/2200 [10:33<11:44,  1.70it/s]

{'loss': 6.811, 'grad_norm': 4.652979373931885, 'learning_rate': 2.7272727272727273e-05, 'epoch': 4.55}


 46%|████▋     | 1020/2200 [10:49<11:26,  1.72it/s]

{'loss': 6.915, 'grad_norm': 4.310720443725586, 'learning_rate': 2.681818181818182e-05, 'epoch': 4.64}


 47%|████▋     | 1040/2200 [11:01<11:15,  1.72it/s]

{'loss': 6.8852, 'grad_norm': 4.993492603302002, 'learning_rate': 2.636363636363636e-05, 'epoch': 4.73}


 48%|████▊     | 1060/2200 [11:12<11:06,  1.71it/s]

{'loss': 6.8186, 'grad_norm': 4.899363994598389, 'learning_rate': 2.590909090909091e-05, 'epoch': 4.82}


 49%|████▉     | 1080/2200 [11:24<11:00,  1.70it/s]

{'loss': 6.8041, 'grad_norm': 5.355813980102539, 'learning_rate': 2.5454545454545454e-05, 'epoch': 4.91}


 50%|█████     | 1100/2200 [11:36<10:40,  1.72it/s]

{'loss': 6.9561, 'grad_norm': 5.472437858581543, 'learning_rate': 2.5e-05, 'epoch': 5.0}


                                                   
 50%|█████     | 1100/2200 [11:40<10:40,  1.72it/s]

{'eval_loss': 8.20773696899414, 'eval_runtime': 4.3987, 'eval_samples_per_second': 23.416, 'eval_steps_per_second': 2.955, 'epoch': 5.0}


 51%|█████     | 1120/2200 [11:52<10:24,  1.73it/s]

{'loss': 6.7878, 'grad_norm': 5.413906574249268, 'learning_rate': 2.4545454545454545e-05, 'epoch': 5.09}


 52%|█████▏    | 1140/2200 [12:04<10:21,  1.71it/s]

{'loss': 6.6887, 'grad_norm': 5.311718463897705, 'learning_rate': 2.4090909090909093e-05, 'epoch': 5.18}


 53%|█████▎    | 1160/2200 [12:15<09:54,  1.75it/s]

{'loss': 6.8721, 'grad_norm': 4.6929731369018555, 'learning_rate': 2.3636363636363637e-05, 'epoch': 5.27}


 54%|█████▎    | 1180/2200 [12:27<09:56,  1.71it/s]

{'loss': 6.8393, 'grad_norm': 4.819812774658203, 'learning_rate': 2.318181818181818e-05, 'epoch': 5.36}


 55%|█████▍    | 1200/2200 [12:39<09:30,  1.75it/s]

{'loss': 8.6725, 'grad_norm': 5.260977745056152, 'learning_rate': 2.272727272727273e-05, 'epoch': 5.45}


 55%|█████▌    | 1220/2200 [12:50<09:17,  1.76it/s]

{'loss': 7.0802, 'grad_norm': 6.328892230987549, 'learning_rate': 2.2272727272727274e-05, 'epoch': 5.55}


 56%|█████▋    | 1240/2200 [13:02<09:05,  1.76it/s]

{'loss': 6.6772, 'grad_norm': 4.880925178527832, 'learning_rate': 2.1818181818181818e-05, 'epoch': 5.64}


 57%|█████▋    | 1260/2200 [13:13<08:52,  1.76it/s]

{'loss': 6.8621, 'grad_norm': 5.449679374694824, 'learning_rate': 2.1363636363636362e-05, 'epoch': 5.73}


 58%|█████▊    | 1280/2200 [13:25<08:44,  1.75it/s]

{'loss': 6.5471, 'grad_norm': 6.194889068603516, 'learning_rate': 2.090909090909091e-05, 'epoch': 5.82}


 59%|█████▉    | 1300/2200 [13:36<08:32,  1.76it/s]

{'loss': 6.7196, 'grad_norm': 6.796223163604736, 'learning_rate': 2.0454545454545457e-05, 'epoch': 5.91}


 60%|██████    | 1320/2200 [13:48<08:22,  1.75it/s]

{'loss': 6.6819, 'grad_norm': 5.312436580657959, 'learning_rate': 2e-05, 'epoch': 6.0}


                                                   
 60%|██████    | 1320/2200 [13:52<08:22,  1.75it/s]

{'eval_loss': 8.239398002624512, 'eval_runtime': 4.3982, 'eval_samples_per_second': 23.418, 'eval_steps_per_second': 2.956, 'epoch': 6.0}


 61%|██████    | 1340/2200 [14:04<08:29,  1.69it/s]

{'loss': 6.7874, 'grad_norm': 5.362356662750244, 'learning_rate': 1.9545454545454546e-05, 'epoch': 6.09}


 62%|██████▏   | 1360/2200 [14:16<08:04,  1.73it/s]

{'loss': 6.8024, 'grad_norm': 5.021642208099365, 'learning_rate': 1.9090909090909094e-05, 'epoch': 6.18}


 63%|██████▎   | 1380/2200 [14:27<08:01,  1.70it/s]

{'loss': 6.6418, 'grad_norm': 4.746716022491455, 'learning_rate': 1.8636363636363638e-05, 'epoch': 6.27}


 64%|██████▎   | 1400/2200 [14:39<07:45,  1.72it/s]

{'loss': 6.7071, 'grad_norm': 5.107680797576904, 'learning_rate': 1.8181818181818182e-05, 'epoch': 6.36}


 65%|██████▍   | 1420/2200 [14:51<07:33,  1.72it/s]

{'loss': 6.855, 'grad_norm': 4.9152350425720215, 'learning_rate': 1.772727272727273e-05, 'epoch': 6.45}


 65%|██████▌   | 1440/2200 [15:02<07:23,  1.71it/s]

{'loss': 6.7346, 'grad_norm': 4.554744243621826, 'learning_rate': 1.7272727272727274e-05, 'epoch': 6.55}


 66%|██████▋   | 1460/2200 [15:14<07:11,  1.72it/s]

{'loss': 6.8105, 'grad_norm': 5.873269557952881, 'learning_rate': 1.6818181818181818e-05, 'epoch': 6.64}


 67%|██████▋   | 1480/2200 [15:26<06:56,  1.73it/s]

{'loss': 6.7383, 'grad_norm': 5.49183988571167, 'learning_rate': 1.6363636363636366e-05, 'epoch': 6.73}


 68%|██████▊   | 1500/2200 [15:38<06:43,  1.74it/s]

{'loss': 6.8463, 'grad_norm': 5.053542137145996, 'learning_rate': 1.590909090909091e-05, 'epoch': 6.82}


 69%|██████▉   | 1520/2200 [15:54<06:38,  1.71it/s]

{'loss': 6.6826, 'grad_norm': 5.023460388183594, 'learning_rate': 1.5454545454545454e-05, 'epoch': 6.91}


 70%|███████   | 1540/2200 [16:06<06:22,  1.72it/s]

{'loss': 6.6167, 'grad_norm': 5.70521354675293, 'learning_rate': 1.5e-05, 'epoch': 7.0}


                                                   
 70%|███████   | 1540/2200 [16:10<06:22,  1.72it/s]

{'eval_loss': 8.363091468811035, 'eval_runtime': 4.3353, 'eval_samples_per_second': 23.758, 'eval_steps_per_second': 2.999, 'epoch': 7.0}


 71%|███████   | 1560/2200 [16:22<06:14,  1.71it/s]

{'loss': 6.8568, 'grad_norm': 6.6288018226623535, 'learning_rate': 1.4545454545454545e-05, 'epoch': 7.09}


 72%|███████▏  | 1580/2200 [16:34<05:56,  1.74it/s]

{'loss': 6.8552, 'grad_norm': 5.850818157196045, 'learning_rate': 1.409090909090909e-05, 'epoch': 7.18}


 73%|███████▎  | 1600/2200 [16:46<05:53,  1.70it/s]

{'loss': 6.8497, 'grad_norm': 5.369394779205322, 'learning_rate': 1.3636363636363637e-05, 'epoch': 7.27}


 74%|███████▎  | 1620/2200 [16:57<05:38,  1.71it/s]

{'loss': 6.7611, 'grad_norm': 5.115549087524414, 'learning_rate': 1.318181818181818e-05, 'epoch': 7.36}


 75%|███████▍  | 1640/2200 [17:09<05:29,  1.70it/s]

{'loss': 6.7687, 'grad_norm': 5.412525653839111, 'learning_rate': 1.2727272727272727e-05, 'epoch': 7.45}


 75%|███████▌  | 1660/2200 [17:21<05:15,  1.71it/s]

{'loss': 6.6191, 'grad_norm': 4.506237506866455, 'learning_rate': 1.2272727272727273e-05, 'epoch': 7.55}


 76%|███████▋  | 1680/2200 [17:33<05:03,  1.71it/s]

{'loss': 6.8867, 'grad_norm': 4.6405415534973145, 'learning_rate': 1.1818181818181819e-05, 'epoch': 7.64}


 77%|███████▋  | 1700/2200 [17:44<04:49,  1.73it/s]

{'loss': 6.707, 'grad_norm': 4.9450860023498535, 'learning_rate': 1.1363636363636365e-05, 'epoch': 7.73}


 78%|███████▊  | 1720/2200 [17:56<04:34,  1.75it/s]

{'loss': 6.7947, 'grad_norm': 6.156165599822998, 'learning_rate': 1.0909090909090909e-05, 'epoch': 7.82}


 79%|███████▉  | 1740/2200 [18:07<04:20,  1.76it/s]

{'loss': 6.8575, 'grad_norm': 4.857022285461426, 'learning_rate': 1.0454545454545455e-05, 'epoch': 7.91}


 80%|████████  | 1760/2200 [18:19<04:10,  1.76it/s]

{'loss': 6.8409, 'grad_norm': 4.812510013580322, 'learning_rate': 1e-05, 'epoch': 8.0}


                                                   
 80%|████████  | 1760/2200 [18:23<04:10,  1.76it/s]

{'eval_loss': 8.23674488067627, 'eval_runtime': 4.4492, 'eval_samples_per_second': 23.15, 'eval_steps_per_second': 2.922, 'epoch': 8.0}


 81%|████████  | 1780/2200 [18:35<04:07,  1.70it/s]

{'loss': 6.883, 'grad_norm': 5.015549659729004, 'learning_rate': 9.545454545454547e-06, 'epoch': 8.09}


 82%|████████▏ | 1800/2200 [18:47<03:51,  1.73it/s]

{'loss': 6.6765, 'grad_norm': 4.690027713775635, 'learning_rate': 9.090909090909091e-06, 'epoch': 8.18}


 83%|████████▎ | 1820/2200 [18:59<03:49,  1.66it/s]

{'loss': 6.7644, 'grad_norm': 4.659084796905518, 'learning_rate': 8.636363636363637e-06, 'epoch': 8.27}


 84%|████████▎ | 1840/2200 [19:11<03:27,  1.73it/s]

{'loss': 6.6485, 'grad_norm': 4.568099498748779, 'learning_rate': 8.181818181818183e-06, 'epoch': 8.36}


 85%|████████▍ | 1860/2200 [19:22<03:16,  1.73it/s]

{'loss': 6.6408, 'grad_norm': 5.527541637420654, 'learning_rate': 7.727272727272727e-06, 'epoch': 8.45}


 85%|████████▌ | 1880/2200 [19:34<03:04,  1.73it/s]

{'loss': 6.5976, 'grad_norm': 5.466050624847412, 'learning_rate': 7.272727272727272e-06, 'epoch': 8.55}


 86%|████████▋ | 1900/2200 [19:46<02:54,  1.72it/s]

{'loss': 6.8336, 'grad_norm': 4.779109954833984, 'learning_rate': 6.818181818181818e-06, 'epoch': 8.64}


 87%|████████▋ | 1920/2200 [19:57<02:43,  1.71it/s]

{'loss': 6.8949, 'grad_norm': 4.8586506843566895, 'learning_rate': 6.363636363636363e-06, 'epoch': 8.73}


 88%|████████▊ | 1940/2200 [20:09<02:32,  1.71it/s]

{'loss': 6.7763, 'grad_norm': 5.319136142730713, 'learning_rate': 5.909090909090909e-06, 'epoch': 8.82}


 89%|████████▉ | 1960/2200 [20:21<02:20,  1.71it/s]

{'loss': 6.9531, 'grad_norm': 6.259253025054932, 'learning_rate': 5.4545454545454545e-06, 'epoch': 8.91}


 90%|█████████ | 1980/2200 [20:32<02:08,  1.72it/s]

{'loss': 6.9334, 'grad_norm': 4.414152145385742, 'learning_rate': 5e-06, 'epoch': 9.0}


                                                   
 90%|█████████ | 1980/2200 [20:37<02:08,  1.72it/s]

{'eval_loss': 8.381245613098145, 'eval_runtime': 4.3701, 'eval_samples_per_second': 23.569, 'eval_steps_per_second': 2.975, 'epoch': 9.0}


 91%|█████████ | 2000/2200 [20:49<01:55,  1.73it/s]

{'loss': 6.4592, 'grad_norm': 6.558801174163818, 'learning_rate': 4.5454545454545455e-06, 'epoch': 9.09}


 92%|█████████▏| 2020/2200 [21:07<01:47,  1.67it/s]

{'loss': 6.8006, 'grad_norm': 5.29049825668335, 'learning_rate': 4.0909090909090915e-06, 'epoch': 9.18}


 93%|█████████▎| 2040/2200 [21:19<01:31,  1.74it/s]

{'loss': 6.8004, 'grad_norm': 5.10560417175293, 'learning_rate': 3.636363636363636e-06, 'epoch': 9.27}


 94%|█████████▎| 2060/2200 [21:30<01:22,  1.71it/s]

{'loss': 6.6757, 'grad_norm': 5.200350284576416, 'learning_rate': 3.1818181818181817e-06, 'epoch': 9.36}


 95%|█████████▍| 2080/2200 [21:42<01:10,  1.71it/s]

{'loss': 6.8404, 'grad_norm': 5.327524185180664, 'learning_rate': 2.7272727272727272e-06, 'epoch': 9.45}


 95%|█████████▌| 2100/2200 [21:54<00:57,  1.73it/s]

{'loss': 6.8274, 'grad_norm': 4.971123695373535, 'learning_rate': 2.2727272727272728e-06, 'epoch': 9.55}


 96%|█████████▋| 2120/2200 [22:06<00:45,  1.75it/s]

{'loss': 6.5952, 'grad_norm': 4.928767204284668, 'learning_rate': 1.818181818181818e-06, 'epoch': 9.64}


 97%|█████████▋| 2140/2200 [22:17<00:34,  1.74it/s]

{'loss': 6.9341, 'grad_norm': 5.303535461425781, 'learning_rate': 1.3636363636363636e-06, 'epoch': 9.73}


 98%|█████████▊| 2160/2200 [22:29<00:23,  1.68it/s]

{'loss': 6.7201, 'grad_norm': 5.099649906158447, 'learning_rate': 9.09090909090909e-07, 'epoch': 9.82}


 99%|█████████▉| 2180/2200 [22:41<00:11,  1.74it/s]

{'loss': 6.8119, 'grad_norm': 5.022861003875732, 'learning_rate': 4.545454545454545e-07, 'epoch': 9.91}


100%|██████████| 2200/2200 [22:52<00:00,  1.71it/s]

{'loss': 6.8374, 'grad_norm': 4.74929666519165, 'learning_rate': 0.0, 'epoch': 10.0}


                                                   
100%|██████████| 2200/2200 [23:02<00:00,  1.59it/s]

{'eval_loss': 8.378961563110352, 'eval_runtime': 4.4166, 'eval_samples_per_second': 23.321, 'eval_steps_per_second': 2.943, 'epoch': 10.0}
{'train_runtime': 1382.3563, 'train_samples_per_second': 3.183, 'train_steps_per_second': 1.591, 'train_loss': 6.8910849137739705, 'epoch': 10.0}





TrainOutput(global_step=2200, training_loss=6.8910849137739705, metrics={'train_runtime': 1382.3563, 'train_samples_per_second': 3.183, 'train_steps_per_second': 1.591, 'total_flos': 289464647577600.0, 'train_loss': 6.8910849137739705, 'epoch': 10.0})
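
The `learning_rate` values in the log above drop by the same amount every 20 steps and reach 0.0 at step 2200, which is consistent with a linear decay from 5e-5 with no warmup. The sketch below reproduces a few of the logged values under that assumption. `linear_lr` is a hypothetical helper written for illustration; it is not part of the `Trainer` API.

```python
def linear_lr(step, total_steps=2200, base_lr=5e-5):
    """Linearly decay the learning rate from base_lr to 0 over total_steps.

    Assumes no warmup, which matches the values printed in the training log.
    """
    return base_lr * (1 - step / total_steps)

# Compare against a few steps that appear in the log
for step in (20, 220, 1100, 2200):
    print(f"step {step:>4}: learning_rate = {linear_lr(step):.6e}")
```

Comparing such a schedule with the logged values is a quick sanity check that training ran for the expected number of steps (here 220 steps per epoch for 10 epochs).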

## Evaluating Model Performance

Let's check the performance of the trained model using perplexity, a metric commonly used to compare different language models or different configurations of the same model. Perplexity is the exponential of the average cross-entropy loss, so it can be computed directly from the evaluation loss.
After training, perplexity is calculated on a held-out evaluation dataset: the dataset is fed through the model, and the probabilities the model assigns at the masked positions are compared with the actual tokens that were masked.

A lower perplexity indicates that the model is better at predicting the masked tokens. On held-out data, it suggests that the model has learned useful representations and generalizes to unseen text.


In [32]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

100%|██████████| 13/13 [00:04<00:00,  3.15it/s]

Perplexity: 3906.12
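
To make the relationship between loss and perplexity concrete, here is a minimal sketch that computes perplexity from per-token log-probabilities. The probabilities are made-up illustrative values, and `perplexity` is a hypothetical helper rather than a Hugging Face function.

```python
import math

def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood (natural log)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical probabilities a model assigned to the correct token
# at four masked positions
probs = [0.5, 0.25, 0.125, 0.125]
log_probs = [math.log(p) for p in probs]

print(f"Perplexity: {perplexity(log_probs):.2f}")  # prints Perplexity: 4.76
```

A model that spreads its probability uniformly over N candidate tokens has a perplexity of exactly N, so perplexity can be read as the effective number of tokens the model is choosing between. By the same formula, the perplexity of 3906.12 reported above corresponds to an evaluation loss of about 8.27.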





## Loading the saved model
If you want to skip training, you can instead load a model that was already trained for 10 epochs by running the following cell:


In [None]:
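# Download the saved model weights
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/BeXRxFT2EyQAmBHvxVaMYQ/bert-scratch-model.pt'
# Load the weights into the existing model, mapping all tensors to the CPU
model.load_state_dict(torch.load('bert-scratch-model.pt', map_location=torch.device('cpu')))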

The simplest way to try out the model for inference is to use it in a `pipeline()`. Instantiate a pipeline for the `fill-mask` task with your model and tokenizer, and pass your text to it. Optionally, use the `top_k` parameter to specify how many predictions to return:


In [33]:
# Define the input text with a masked token
text = "This is a [MASK] movie!"

# Create a pipeline for the "fill-mask" task
mask_filler = pipeline("fill-mask", model=model, tokenizer=bert_tokenizer)

# Generate predictions by filling the mask in the input text
results = mask_filler(text)  # Optionally pass top_k to change how many predictions are returned

# Print the predicted sequences
for result in results:
    print(f"Predicted token: {result['token_str']}, Confidence: {result['score']:.2f}")

Device set to use cpu


Predicted token: the, Confidence: 0.06
Predicted token: ,, Confidence: 0.05
Predicted token: ., Confidence: 0.04
Predicted token: of, Confidence: 0.03
Predicted token: and, Confidence: 0.02
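
The confidence values printed above are probabilities over the vocabulary at the `[MASK]` position, and the pipeline returns the highest-scoring candidates. The following simplified sketch shows that idea with a softmax over a handful of scores. The vocabulary and logits are made-up values, and `top_k_predictions` is a hypothetical helper, not the pipeline's actual implementation.

```python
import math

def top_k_predictions(logits, vocab, k=5):
    """Convert raw scores to probabilities and return the k most likely tokens."""
    # Softmax, subtracting the maximum for numerical stability
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank tokens by probability, highest first
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical five-token vocabulary and scores at the masked position
vocab = ["great", "the", "bad", ",", "movie"]
logits = [2.0, 0.5, 1.0, 0.1, -1.0]

for token, score in top_k_predictions(logits, vocab, k=3):
    print(f"Predicted token: {token}, Confidence: {score:.2f}")
```

When the scores are nearly flat across the vocabulary, as they are for an undertrained model, even the top prediction receives only a small share of the probability.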


You can see that [MASK] is filled with very common tokens such as "the", "," and "." rather than with words that fit the context, and all of the confidence scores are low. This weak performance can be due to insufficient training, too little training data, the model architecture, or untuned hyperparameters. Let's try a pretrained model from Hugging Face:


## Inferencing a pretrained BERT model


In [34]:
# Load the pretrained BERT model and tokenizer
pretrained_model = BertForMaskedLM.from_pretrained('bert-base-uncased')
pretrained_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Define the input text with a masked token
text = "This is a [MASK] movie!"

# Create the pipeline
mask_filler = pipeline(task='fill-mask', model=pretrained_model, tokenizer=pretrained_tokenizer)

# Perform inference using the pipeline
results = mask_filler(text)
for result in results:
    print(f"Predicted token: {result['token_str']}, Confidence: {result['score']:.2f}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Predicted token: great, Confidence: 0.16
Predicted token: horror, Confidence: 0.08
Predicted token: good, Confidence: 0.08
Predicted token: bad, Confidence: 0.05
Predicted token: fantastic, Confidence: 0.04


This pretrained model performs far better than the model you just trained for a few epochs on a single dataset. Still, pretrained models cannot be used directly for specific tasks such as sentiment extraction or sequence classification. This is why supervised fine-tuning methods are used to adapt them.


---


## Exercise


1. Create a model and tokenizer using the Hugging Face library.
2. Go to this [link](https://huggingface.co/datasets?task_categories=task_categories:text-classification&sort=trending) to browse text classification datasets.
3. Choose a text classification dataset that you can load, for instance `stanfordnlp/snli`.
4. Use that dataset to train your model (please be mindful of the resources available for training) and evaluate it.

   >Note: The lab environment may not have enough resources for this training, which can cause the kernel to die.


<details><summary>Click here for a hint</summary>

-   SNLI has 3 labels; examples without a gold label are marked `-1` and should be filtered out before training
-   You can use `load_dataset("stanfordnlp/snli")` to load the dataset
</details>


<details><summary>Click here for the solution</summary>

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

# Load the SNLI dataset
snli = load_dataset("stanfordnlp/snli")

# Preprocessing function
def preprocess_function(examples):
  premise = examples["premise"]
  hypothesis = examples["hypothesis"]
  return tokenizer(premise, hypothesis, padding="max_length", truncation=True)

model_name = "bert-base-uncased"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

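# SNLI marks examples without a gold label with -1; drop them first,
# otherwise the loss raises "IndexError: Target -1 is out of bounds."
train_data = snli["train"].filter(lambda example: example["label"] != -1)
val_data = snli["validation"].filter(lambda example: example["label"] != -1)

# Apply preprocessing to training and validation sets
train_encoded = train_data.map(preprocess_function, batched=True)
val_encoded = val_data.map(preprocess_function, batched=True)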

# Training function (replace with your training loop)
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",  # Replace with your output directory
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_encoded,
    eval_dataset=val_encoded,
)

trainer.train()

# Evaluation function (replace with your metrics)
from sklearn.metrics import accuracy_score

# trainer.predict returns three fields: predictions, label_ids, and metrics
output = trainer.predict(val_encoded)
accuracy = accuracy_score(output.label_ids, output.predictions.argmax(-1))
print(f"Accuracy on validation set: {accuracy:.4f}")

```

</details>


In [35]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset

# Load the SNLI dataset
snli = load_dataset("stanfordnlp/snli")

# Preprocessing function
def preprocess_function(examples):
  premise = examples["premise"]
  hypothesis = examples["hypothesis"]
  return tokenizer(premise, hypothesis, padding="max_length", truncation=True)

model_name = "bert-base-uncased"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

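# SNLI marks examples without a gold label with -1; drop them first,
# otherwise the loss raises "IndexError: Target -1 is out of bounds."
train_data = snli["train"].filter(lambda example: example["label"] != -1)
val_data = snli["validation"].filter(lambda example: example["label"] != -1)

# Apply preprocessing to training and validation sets
train_encoded = train_data.map(preprocess_function, batched=True)
val_encoded = val_data.map(preprocess_function, batched=True)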

# Training function (replace with your training loop)
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",  # Replace with your output directory
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_encoded,
    eval_dataset=val_encoded,
)

trainer.train()

# Evaluation function (replace with your metrics)
from sklearn.metrics import accuracy_score

# trainer.predict returns three fields: predictions, label_ids, and metrics
output = trainer.predict(val_encoded)
accuracy = accuracy_score(output.label_ids, output.predictions.argmax(-1))
print(f"Accuracy on validation set: {accuracy:.4f}")

Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 1325507.70 examples/s]
Generating validation split: 100%|██████████| 10000/10000 [00:00<00:00, 1999286.91 examples/s]
Generating train split: 100%|██████████| 550152/550152 [00:00<00:00, 4348367.57 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 550152/550152 [02:08<00:00, 4282.09 examples/s]
Map: 100%|██████████| 10000/10000 [00:02<00:00, 4334.01 examples/s]
  0%|          | 56/275076 [01:26<117:22:40,  1.54s/it]

IndexError: Target -1 is out of bounds.
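
The `IndexError: Target -1 is out of bounds.` recorded above is what the loss function raises when a training batch contains the label `-1`, which SNLI uses for sentence pairs that have no gold label. Removing those examples before training avoids the error. The sketch below illustrates the filtering step on a few made-up records with the same fields as SNLI; `drop_unlabeled` is a hypothetical helper.

```python
# Hypothetical SNLI-style records (premise, hypothesis, label)
examples = [
    {"premise": "A man is playing a guitar.", "hypothesis": "A person makes music.", "label": 0},
    {"premise": "A dog runs on the beach.", "hypothesis": "The dog is asleep.", "label": 2},
    {"premise": "Two kids are outside.", "hypothesis": "The kids are siblings.", "label": -1},
]

def drop_unlabeled(rows, missing_label=-1):
    """Keep only the rows that carry a usable class label."""
    return [row for row in rows if row["label"] != missing_label]

labeled = drop_unlabeled(examples)
print(f"Kept {len(labeled)} of {len(examples)} examples")

# With a Hugging Face dataset, the equivalent step would be:
# snli["train"].filter(lambda example: example["label"] != -1)
```

Checking the set of label values in a dataset before training is a cheap way to catch this kind of problem before hours of training time are spent.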

# Congratulations! You have completed the lab


## Authors


[Fateme Akbari](https://author.skills.network/instructors/fateme_akbari)


© Copyright IBM Corporation. All rights reserved.
