# Learning objectives

This is a series of 4 notebooks. By the end, you should be able to:
1. Explain the difference between `pretrain`, `finetune`, and `continue pretrain`
2. Understand when and why to use `pretrain`, `finetune`, and `continue pretrain`
3. Implement `continue pretrain` of a MaskedLM and use the output model for a downstream `finetune` task

- LLM used: RoBERTa-base
- Dataset used: climate change policy dataset
- Machine Learning Libraries: HuggingFace, TensorFlow

Note: The notebook was compiled in Google Colab, with file streaming from Google Drive. To run the notebook in your own environment, change the file path as needed.

In [None]:
PATH = '/content/drive/MyDrive/Projects/Pretrain_MLM_PERMANENT/'

# Concepts

## What is the difference between `pretrain` and `finetune`?



A language model is `pretrained` on a `large` dataset to learn general representations of language. The model is `pretrained` for an `initial task`. For BERT, the initial tasks were fill-mask and next-sentence-prediction.

In contrast, `finetune` refers to the process of taking a `pretrained` model and `repurposing` it for a `different task`. This is generally done with a much `smaller` dataset. For example, we could take BERT and a movie review dataset to `classify` whether the movie reviews are positive or negative (ie, sentiment analysis).

Notice that in this example, BERT was not trained to perform classification, but the model can be `repurposed` to perform the new task (ie, `finetune`). This works well (compared to training a sentiment analysis model from scratch using only the movie review dataset) because the `pretrained` model has already learned general representations of language from the large dataset it was trained on.

## What is `continue pretraining`?

`Continue pretraining` means we take the `pretrained` model (together with the trained weights) and a new dataset and continue training the model for its `initial task`. For example, we can `continue pretraining` BERT for the initial tasks of fill-mask and next-sentence-prediction.

Notice the difference between `continue pretraining` and `finetuning`. In both cases, we train the `pretrained` model on a new dataset, but when we `continue pretraining`, we continue to train the model for its `initial tasks`, whereas when we `finetune`, the model is `repurposed` to a different task.

## Why `continue pretraining`?



`Continue pretraining` a `pretrained` model could help the model gain a better representation of language within a `specific domain` (ie, Domain-Adaptive Pretraining) or a `specific task` (ie, Task-Adaptive Pretraining). [Support](https://aclanthology.org/2020.acl-main.740.pdf)

For example, BERT was `pretrained` using (mostly) Wikipedia text, so it might not have learned good representations of `domain specific` text, such as `medical terms`. You might find it beneficial to `continue pretraining` BERT on `domain specific` text before using it on a downstream task within that domain, which can potentially boost performance.

Another benefit of `continue pretraining` a `self-supervised` model is that it does not need human-annotated labels, which can be expensive and scarce. Imagine you want to train a classifier that uses doctor notes to predict diagnoses. You have lots of doctor notes, but only a small portion of them are labelled with a diagnosis. If you only use the labelled portion to train your classifier, the majority of your data is essentially wasted. You might then consider using all the doctor notes to `continue pretraining` the model so it gains a better representation of the domain and task of interest, before moving on to building your classification model using the labelled data.

# Word of caution:

Before jumping into `continue pretraining` a large language model, understand that this can be a very expensive endeavor due to the sheer size (number of parameters) of the model. While `continue pretraining` could boost performance, this is not guaranteed, and the trade-off between time/resources and performance should be carefully considered.

The notebook is configured to use a T4 GPU for the `continue pretraining` task, which takes ~3 hours to run with this configuration. It can be run in less time on a more powerful GPU (eg, A100) with more GPU RAM and a larger batch_size (and learning_rate).

# Import Libraries

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# standard libraries
import pandas as pd
import numpy as np
import math

# train/val/test split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# for models
import tensorflow as tf

!pip install transformers -q
import transformers
from transformers import pipeline, TFRobertaModel, TFRobertaForMaskedLM, RobertaTokenizer, DataCollatorForLanguageModeling

!pip install datasets -q
import datasets

# for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for evaluation metrics
from sklearn.metrics import classification_report

!pip install evaluate -q
from evaluate import load

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Load RoBERTa


[RoBERTa](https://arxiv.org/abs/1907.11692) is one of the BERT family models that uses a `masked-language-model` approach as its initial task. The researchers who introduced RoBERTa investigated the design choices made in the original BERT model and identified a few areas of improvement that led to performance gains. Below are a few key areas where RoBERTa differs from BERT:

1. In BERT, masking was performed only once during data preprocessing, resulting in a single static mask per sequence (`static masking`). To avoid reusing the same mask in every epoch, the training data was duplicated 10 times so that each sequence was masked in 10 different ways over the 40 epochs of training, which still meant each training sequence was seen with the same mask four times. RoBERTa instead employed `dynamic masking`, where a new masking pattern is generated every time a sequence is fed to the model.
2. Unlike BERT, where the initial training tasks consisted of fill-mask and next-sentence-prediction, RoBERTa removed the next-sentence-prediction task from its initial tasks and focused only on the fill-mask task.
3. Further, unlike BERT, where short sequences were randomly injected and the sequence length was reduced for the first 90% of updates, RoBERTa was trained only with full-length sequences (T = 512 tokens). RoBERTa also used a larger byte-level Byte-Pair Encoding vocabulary as well as larger batch sizes in its training.
4. RoBERTa was also trained on a different and substantially larger dataset than BERT (roughly 160 GB of text compared with roughly 16 GB).
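
The contrast between `static masking` and `dynamic masking` in point 1 can be pictured with a small pure-Python sketch. The helper `choose_mask_positions` and the toy sentence are hypothetical, and the 15% rate simply mirrors the masking probability used later in this notebook. This is an illustration of the idea, not the BERT or RoBERTa implementation.

```python
import random

def choose_mask_positions(num_tokens, rng, mask_prob=0.15):
    """Pick a random set of token positions to mask (at least one)."""
    num_to_mask = max(1, round(num_tokens * mask_prob))
    return sorted(rng.sample(range(num_tokens), num_to_mask))

# Toy, whitespace-tokenized sentence (hypothetical example data)
tokens = "climate policies can reduce emissions across many sectors of the economy".split()
NUM_EPOCHS = 4

# Static masking: positions are chosen once during preprocessing and then reused.
static_positions = choose_mask_positions(len(tokens), random.Random(0))
static_masks = [static_positions for _ in range(NUM_EPOCHS)]

# Dynamic masking: positions are re-sampled every time the sequence is served.
rng = random.Random(0)
dynamic_masks = [choose_mask_positions(len(tokens), rng) for _ in range(NUM_EPOCHS)]

print("static :", static_masks)
print("dynamic:", dynamic_masks)
```

With static masking the model sees the same masked positions in every epoch, whereas with dynamic masking the positions are drawn again on each pass, so they will generally vary.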

- [GitHub](https://github.com/facebookresearch/fairseq/tree/main/examples/roberta)
- [HuggingFace](https://huggingface.co/docs/transformers/model_doc/roberta)

Keep in mind that `RoBERTa` uses a case-sensitive `Byte-Pair Encoding` tokenizer, and that `base` has 12 layers while `large` has 24. For the showcase here, we will use `roberta-base`, but the same techniques can be easily adapted to roberta-large.

In [None]:
CHECKPOINT = 'roberta-base'
TOKENIZER = RobertaTokenizer.from_pretrained(CHECKPOINT)
ROBERTA_MODEL = TFRobertaModel.from_pretrained(CHECKPOINT)
ROBERTA_MLM = TFRobertaForMaskedLM.from_pretrained(CHECKPOINT)

# per RoBERTa default
MAX_LEN = 512
VOCAB_SIZE = 50265
HIDDEN_DIM = 768

BATCH_SIZE = 8
LR_SCHEDULE = tf.keras.optimizers.schedules.PolynomialDecay(initial_learning_rate=1e-6,
                                                            decay_steps=5336,
                                                            end_learning_rate=1e-9,
                                                            power=1.0)
OPTIMIZER = tf.keras.optimizers.Adam(learning_rate=LR_SCHEDULE,
                                      beta_1=0.9,
                                      beta_2=0.98,
                                      epsilon=1e-06,
                                      clipnorm=0.0)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from a PyTorch model that you expect to be exact
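
To get a feel for the `LR_SCHEDULE` defined above, here is a minimal pure-Python sketch of the polynomial decay formula (without cycling), using the same hyperparameters. The function name `polynomial_decay` is hypothetical. With `power=1.0` the decay is simply linear from `initial_learning_rate` down to `end_learning_rate`.

```python
def polynomial_decay(step, initial_lr=1e-6, end_lr=1e-9, decay_steps=5336, power=1.0):
    """Learning rate after `step` optimizer updates (no cycling)."""
    step = min(step, decay_steps)          # the rate stays at end_lr once decay_steps is reached
    fraction_left = 1 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr

# Inspect the rate at the start, midway, at decay_steps, and beyond it
for step in (0, 2668, 5336, 7000):
    print(step, polynomial_decay(step))
```

Note that with 20,100 training sequences, a batch size of 8, and 3 epochs, training runs for roughly 7,500 updates. Under those assumptions the rate would already sit at `end_learning_rate` for the last part of training, so `decay_steps` is worth revisiting if you change the dataset size, batch size, or number of epochs.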

# Define reusable functions

In [None]:
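def to_tokenize(texts):
  """Tokenize a list of strings into fixed-length (MAX_LEN) TensorFlow tensors."""
  return TOKENIZER(texts,
                   add_special_tokens=True,
                   max_length=MAX_LEN,
                   padding='max_length',    # pad shorter sequences up to MAX_LEN
                   return_token_type_ids=True,
                   truncation=True,         # cut longer sequences down to MAX_LEN
                   return_tensors="tf"
                   )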
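
As a rough picture of what `padding='max_length'` and `truncation=True` do to each sequence, here is a pure-Python sketch. The helper `pad_or_truncate` and the token ids 100, 200, ... are made up for illustration. The special-token ids follow RoBERTa's vocabulary (`<s>` = 0, `<pad>` = 1, `</s>` = 2), and a tiny `max_len` is used so the result is easy to read. The real tokenizer does considerably more than this.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # RoBERTa's <s>, <pad>, </s> ids

def pad_or_truncate(content_ids, max_len):
    """Wrap token ids in <s> ... </s>, then truncate or pad to exactly max_len."""
    content_ids = content_ids[: max_len - 2]     # leave room for the two special tokens
    input_ids = [BOS_ID] + content_ids + [EOS_ID]
    attention_mask = [1] * len(input_ids)        # 1 = real token
    num_pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * num_pad,
        "attention_mask": attention_mask + [0] * num_pad,   # 0 = padding
    }

short = pad_or_truncate([100, 200, 300], max_len=8)          # shorter than max_len: padded
long = pad_or_truncate(list(range(100, 120)), max_len=8)     # longer than max_len: truncated
print(short)
print(long)
```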

In [None]:
def show_results(model, feature, label, classes):
  yhat_val = model.predict(feature)
  yhat_val_result = np.argmax(yhat_val, axis=-1)

  print('Validation classification Report \n')
  print(classification_report(label, yhat_val_result, target_names=classes))

  ax = sns.heatmap(tf.math.confusion_matrix(label, yhat_val_result),
                 annot=True,
                 fmt='.0f',
                 cmap='Blues',
                 yticklabels=classes,
                 xticklabels=classes,
                 cbar=False)

  ax.set(xlabel='Predicted Label', ylabel='True Label')
  plt.title('Validation Confusion Matrix')
  plt.show()

# Load Dataset

The dataset is sourced from the [climate policy database](https://climatepolicydatabase.org/). The database is updated periodically to include climate policies adopted from different countries around the globe.

The dataset used in this notebook is the 2023 version downloaded on 1/22/2024. The documentation can be found [here](https://climatepolicydatabase.org/sites/default/files/2023-05/CPDB%20Codebook%20v1.2.pdf).

This dataset is not one of the 'classic' datasets commonly used in learning NLP. Rather, it is a real-life dataset curated by climate change professionals. As we work through the dataset and the models, you will also recognize some of the pain points (eg, data quality and model performance) of building machine learning models in real life.

In [None]:
df = pd.read_csv(f'{PATH}climate_policy_database_policies_export.csv')

df.head()

| | policy_id | country_iso | country_update | policy_name | policy_title | jurisdiction | supranational_region | country | subnational_region | policy_city_or_local | ... | end_date | high_impact | policy_objective | reference | last_update | impact_indicators.comments | impact_indicators.name | impact_indicators.value | impact_indicators.base_year | impact_indicators.target_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 211001480 | FRA | Sporadic | Incandescent Lamp Phase-out France (2008) | Incandescent Lamp Phase-out | Country | | France | | | ... | | Unknown | Mitigation, Energy access | http://www.developpement-durable.gouv.fr/artic... | | | | | | |
| 1 | 211001564 | IDN | Annual | Jakarta Regulation No. 38/12 on Green Building... | Jakarta Regulation No. 38/12 on Green Buildings | City | | Indonesia | Jakarta | | ... | | Unknown | Mitigation | https://www.iea.org/policies/2523-jakarta-regu... | | | | | | |
| 2 | 211002621 | AUS | Annual | Safeguarding the Future: Australia (1998) | Safeguarding the Future: | Country | | Australia | | | ... | 2007.0 | Unknown | Mitigation | http://www.climatechange.gov.au | | | | | | |
| 3 | 211004420 | BRB | Sporadic | National Climate Change Policy Barbados (2012) | National Climate Change Policy | Country | | Barbados | | | ... | | Unknown | Mitigation, Adaptation | http://gisbarbados.gov.bb/blog/barbados-nation... | | | | | | |
| 4 | 211004934 | BGR | Sporadic | Spatial Planning Act Bulgaria (2001) | Spatial Planning Act | Country | | Bulgaria | | | ... | | | Mitigation | https://www.mrrb.bg/en/spatial-development-act... | | | | | | |


In [None]:
df.columns

Index(['policy_id', 'country_iso', 'country_update', 'policy_name',
       'policy_title', 'jurisdiction', 'supranational_region', 'country',
       'subnational_region', 'policy_city_or_local', 'policy_instrument',
       'sector', 'policy_description', 'policy_type', 'stringency',
       'policy_status', 'decision_date', 'start_date', 'end_date',
       'high_impact', 'policy_objective', 'reference', 'last_update',
       'impact_indicators.comments', 'impact_indicators.name',
       'impact_indicators.value', 'impact_indicators.base_year',
       'impact_indicators.target_year'],
      dtype='object')

# Identify modeling objectives:
Our modeling objective is to create a `classification` model by using policy_description to predict policy_type.

Each of the 4 notebooks implements one of the models below:
1. Train a FFNN from scratch
2. Finetune a pretrained LLM (RoBERTa-base)
3. Continue pretraining the pretrained LLM (RoBERTa-base)
4. Finetune the continue pretrained LLM (RoBERTa-base)

This notebook showcases the implementation of:
3. Continue pretraining the pretrained LLM (RoBERTa-base)

# Prepare dataset

Remove columns not used for the task

In [None]:
cols = ['policy_description', 'policy_type']
df = df[cols]

df.head()

| | policy_description | policy_type |
|---|---|---|
| 0 | | Energy efficiency |
| 1 | The policy focuses on energy efficiency, water... | Energy efficiency |
| 2 | | Renewables |
| 3 | The primary goal of the policy is to establish... | Energy service demand reduction and resource e... |
| 4 | Legislative | Non-energy use |


For this showcase, I will simply drop instances with missing values. Think about how else you might work with datasets that have lots of missing values.

In [None]:
df = df.dropna()
len(df)

4241

Remove any duplicate instances with the same policy_description. Think about why there are so many duplicate instances, and how you might make better use of datasets that have lots of duplicate data.

In [None]:
df = df.drop_duplicates(subset=['policy_description'],keep='first')
len(df)

3591

Let's also take a look at the label for potential cleaning.

In [None]:
df['policy_type'].value_counts()

Renewables                                                                                                                                               691
Energy efficiency                                                                                                                                        595
Other low-carbon technologies and fuel switch                                                                                                            293
Energy efficiency, Energy service demand reduction and resource efficiency, Non-energy use, Other low-carbon technologies and fuel switch, Renewables    290
Energy service demand reduction and resource efficiency, Energy efficiency, Renewables, Other low-carbon technologies and fuel switch, Non-energy use    285
                                                                                                                                                        ... 
Energy efficiency, Renewables, Energy service demand reduc

Based on section 3.14 Mitigation area (policy type) of the documentation, this field holds one or more choices from the policy matrix.

For this exercise, if there are multiple types, I will treat them as a new class 'Multi'. Think about how else you might tackle the label cleaning.

Since a Pandas dataframe is a fairly heavy object, it is often more efficient to operate on NumPy arrays or plain lists, even though the same operation could be applied to the dataframe using the apply() method.

In [None]:
policy_types = []

for policy_type in df['policy_type']:
  if len(policy_type.split(',')) > 1:
    policy_types.append('Multi')
  else:
    policy_types.append(policy_type)

assert len(policy_types) == len(df)
df['policy_type'] = policy_types

df['policy_type'].value_counts()

Multi                                                      1578
Renewables                                                  691
Energy efficiency                                           595
Other low-carbon technologies and fuel switch               293
Non-energy use                                              267
Energy service demand reduction and resource efficiency     131
Unknown                                                      36
Name: policy_type, dtype: int64
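
As one possible answer to the question above, the comma-separated labels could be kept and treated as a multi-label problem instead of being collapsed into 'Multi'. The sketch below is an alternative that this notebook does not use. The three label strings are toy examples in the same format as the dataset, and all variable names are hypothetical.

```python
# Toy label strings in the same comma-separated format as policy_type
raw_labels = [
    "Renewables",
    "Energy efficiency, Renewables",
    "Non-energy use",
]

# Split each label string into its individual policy types.
label_sets = [[part.strip() for part in label.split(",")] for label in raw_labels]

# Fixed, sorted vocabulary of the individual types seen in the data.
vocabulary = sorted({part for parts in label_sets for part in parts})

# One 0/1 indicator per vocabulary entry for every example (multi-hot encoding).
multi_hot = [[int(name in parts) for name in vocabulary] for parts in label_sets]

print(vocabulary)
print(multi_hot)
```

A classifier trained on such targets would need one sigmoid output per policy type rather than a single softmax over classes, which is a different modeling setup from the one used in this series.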

# Train/val/test split

Now that the data is ready, let's split to train/validation/test to prepare for building the model. Once the test set is split out, we want to make sure we do not touch the test set at all to avoid any data leakage.

In [None]:
X = df['policy_description'].to_list()
y = df['policy_type'].to_list()

Since the model expects classes as numeric values, we will encode the classes first.

In [None]:
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)

classes = list(label_encoder.inverse_transform(range(df['policy_type'].nunique())))
classes

['Energy efficiency',
 'Energy service demand reduction and resource efficiency',
 'Multi',
 'Non-energy use',
 'Other low-carbon technologies and fuel switch',
 'Renewables',
 'Unknown']
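
The class ids follow the alphabetical order of the class names, which is why 'Energy efficiency' is 0 and 'Unknown' is 6. A minimal pure-Python sketch of the same idea, using a toy label list and hypothetical variable names (it mimics the mapping, not scikit-learn's implementation):

```python
# Toy labels (a subset of the classes above)
toy_labels = ["Renewables", "Multi", "Energy efficiency", "Multi"]

# Sorted unique class names define the id of each class.
toy_classes = sorted(set(toy_labels))
class_to_id = {name: idx for idx, name in enumerate(toy_classes)}

encoded = [class_to_id[name] for name in toy_labels]   # names -> integer ids
decoded = [toy_classes[idx] for idx in encoded]        # integer ids -> names

print(toy_classes)
print(encoded)
```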

We'll assign 30% of the total data to test, and split the remaining 70% between train and validation at an 80/20 ratio.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1234)
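
Because the second split is applied to what is left after the first, the effective proportions are roughly 56% train, 14% validation, and 30% test. The arithmetic can be sketched as follows, assuming the held-out portion of each split is rounded up (an assumption that reproduces the training-set size of 2010 printed by this notebook). The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of the (kept, held-out) parts for a fractional test_size."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 3591                                  # rows left after dropping duplicates
n_rest, n_test = split_sizes(n_total, 0.3)      # first split: hold out the test set
n_train, n_val = split_sizes(n_rest, 0.2)       # second split: carve validation out of the rest

print(n_train, n_val, n_test)
print([round(n / n_total, 2) for n in (n_train, n_val, n_test)])
```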

In [None]:
print('--- training set ---')
print('number of samples: ', len(X_train))
print('examples of features: ', X_train[:5])
print('examples of labels: ', y_train[:5])

print('--- validation set ---')
print('number of samples: ', len(X_val))
print('examples of features: ', X_val[:5])
print('examples of labels: ', y_val[:5])

print('--- test set ---')
print('number of samples: ', len(X_test))
print('examples of features: ', X_test[:5])
print('examples of labels: ', y_test[:5])

--- training set ---
number of samples:  2010
examples of features:  ['The Hawaii Energy Code (HEC) adopted the 2015 IECC and ASHRAE 90.1-2013 with amendments by the State Building Council on July 14, 2015. The HEC must be adopted separately by the four counties. State law (Act 164, 2014), requires that if the counties do not adopt HEC by 2017, the HEC becomes the interim code for the counties. (For details, see reference below.) IEA/IRENA Global Renewable Energy Policies and Measures Database © OECD/IEA and IRENA, [November 2020]', 'The Small Power Producers (SPP) framework created an enabling environment for private project developments of projects up to 10MW in 2008, through standardised power purchase agreements. (For details, see reference below.)', 'The Action Plan for Renewable Energy Promotion in Mali was established to achieve the renewable energy target of increasing the share of renewables in TPES from less than 1% in 2002 to 15% in 2020. The energy policy is defined by 5 ma

# Tokenize the feature using RoBERTa tokenizer

In [None]:
X_train_tokenized = to_tokenize(X_train)
X_val_tokenized = to_tokenize(X_val)
X_test_tokenized = to_tokenize(X_test)

# Model 3: Continue pretraining the pretrained LLM (RoBERTa-base)

To continue pretraining the model, we'll first duplicate each training sentence 10 times. Since only 15% of the tokens are masked each time a sentence is seen, and `dynamic masking` draws a new mask for every copy, the duplicates expose the model to more masked variations of each sentence.

In [None]:
X_train_repeated = [item for item in X_train for _ in range(10)]
X_val_repeated = [item for item in X_val for _ in range(10)]

len(X_train_repeated)

20100

Tokenize the features and set the labels to be the same as the features (self-supervised training)

In [None]:
X_train_repeated_tokenized = to_tokenize(X_train_repeated)
X_train_repeated_tokenized['labels'] = X_train_repeated_tokenized['input_ids']

X_val_repeated_tokenized = to_tokenize(X_val_repeated)
X_val_repeated_tokenized['labels'] = X_val_repeated_tokenized['input_ids']
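
Although the labels start out as a copy of the input ids, a masked-language-modeling data collator such as the one used in the training function below rewrites both at batch time. The following pure-Python sketch shows the usual recipe under stated assumptions: about 15% of non-special tokens are selected, of which 80% become `<mask>`, 10% become a random token, and 10% are left unchanged, while the labels of all unselected positions are set to -100 so the loss ignores them. The helper `mask_for_mlm` and the toy ids are hypothetical, and this is not the library's code.

```python
import random

MASK_ID = 50264        # RoBERTa's <mask> id
VOCAB_SIZE = 50265
IGNORE_INDEX = -100    # label value that the loss skips

def mask_for_mlm(input_ids, special_ids, rng, mlm_probability=0.15):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    masked, labels = [], []
    for token_id in input_ids:
        if token_id in special_ids or rng.random() >= mlm_probability:
            masked.append(token_id)          # token is left alone
            labels.append(IGNORE_INDEX)      # and does not contribute to the loss
            continue
        labels.append(token_id)              # the model must recover the original id
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_ID)                      # 80%: replace with <mask>
        elif roll < 0.9:
            masked.append(rng.randrange(VOCAB_SIZE))    # 10%: replace with a random token
        else:
            masked.append(token_id)                     # 10%: keep the original token
    return masked, labels

# Toy sequence: <s>, 40 made-up token ids, </s>, then padding
input_ids = [0] + list(range(1000, 1040)) + [2] + [1] * 6
masked, labels = mask_for_mlm(input_ids, special_ids={0, 1, 2}, rng=random.Random(0))

print(masked)
print(labels)
```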

Reformat the data to the format expected by the model

In [None]:
reformat_train_sentence = datasets.Dataset.from_dict({key: X_train_repeated_tokenized[key].numpy() for key in X_train_repeated_tokenized.keys()})
reformat_val_sentence = datasets.Dataset.from_dict({key: X_val_repeated_tokenized[key].numpy() for key in X_val_repeated_tokenized.keys()})

In [None]:
def continue_pretrain(mlm_model=ROBERTA_MLM,
                      train_data=reformat_train_sentence,
                      val_data=reformat_val_sentence,
                      continue_flag=True):
  """Evaluate (continue_flag=False) or continue pretraining (continue_flag=True) the MaskedLM.

  Note: the model is updated in place, so the returned object is the same one passed in.
  """

  # Use DataCollatorForLanguageModeling to implement `dynamic masking`
  data_collator = DataCollatorForLanguageModeling(tokenizer=TOKENIZER,
                                                  mlm_probability=0.15,
                                                  return_tensors='tf')

  # Build the tf.data pipelines from the function arguments (not the globals)
  train_dataset = mlm_model.prepare_tf_dataset(train_data,
                                               collate_fn=data_collator,
                                               shuffle=True,
                                               batch_size=BATCH_SIZE
                                               )

  val_dataset = mlm_model.prepare_tf_dataset(val_data,
                                             collate_fn=data_collator,
                                             shuffle=True,
                                             batch_size=BATCH_SIZE
                                             )

  # No loss is passed, so the model uses its internal masked-LM loss
  mlm_model.compile(optimizer=OPTIMIZER)

  # fit the model if continue pretraining, otherwise only evaluate the original model
  if not continue_flag:
    loss = mlm_model.evaluate(val_dataset)
    print(f"Original Perplexity: {math.exp(loss):.2f}")
  else:
    # batching is already handled by prepare_tf_dataset, so no batch_size here
    mlm_model.fit(train_dataset,
                  validation_data=val_dataset,
                  epochs=3,
                  verbose=1
                  )
    loss = mlm_model.evaluate(val_dataset)
    print(f"Trained Perplexity: {math.exp(loss):.2f}")

  return mlm_model

First, evaluate the original pretrained model on the validation set to get a baseline perplexity.

In [None]:
og_pretrained = continue_pretrain(continue_flag=False)

Original Perplexity: 5.22


Then continue pretraining the pretrained model on our dataset.

In [None]:
new_pretrained = continue_pretrain(continue_flag=True)

Epoch 1/3
Epoch 2/3
Epoch 3/3
Trained Perplexity: 3.55


The validation perplexity decreased from 5.22 to 3.55, indicating that the model now predicts masked tokens in this domain better than the original model did.
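
Perplexity here is the exponential of the average cross-entropy loss over the masked tokens, which is what `math.exp(loss)` computes in the training function. A small pure-Python sketch with made-up probabilities may help build intuition; the helper `perplexity` and the probability values are hypothetical.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood assigned to the correct tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Probabilities a model assigned to the correct token at four masked positions
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])   # always torn between 4 equally likely options
confident = perplexity([0.9, 0.8, 0.95, 0.85])         # mostly sure of the right answer

print(uniform_guess, confident)
```

A model that is always torn between 4 equally likely options has a perplexity of about 4, so a drop in perplexity can loosely be read as the model choosing among fewer plausible candidates for each masked token.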

# Save the MaskedLM after continue pretraining so we can load it later for finetuning again

In [None]:
pretrain_path = f'{PATH}continue_pretrained'
new_pretrained.save_pretrained(pretrain_path)