### Pegasus-TML

#### Additional pre-training task on encoder Pegasus to gain knowledge of Control-flow relations in processes
- Experiment:
    - Pegasus is given another pre-training task, Control-flow Relation Learning, using Triplet Margin Loss (TML) training 
    - Based on the control-flow relations exist in the processes, triplets (anchor, pos, neg) are extracted from the process text 
- Process data:
    - Triplets (anchor, pos, neg)
- Outline:
    - Track the experiment and its results with WandB (Weights & Biases)
    - Define the experiment, data loading, training and validation 
    - Validation loss is tracked to apply early stopping and prevent overfitting

#### Reference
- Pegasus Hugging Face: 
https://huggingface.co/docs/transformers/model_doc/pegasus
- Hugging Face Fine-tuning Transformer tutorial:
https://huggingface.co/docs/transformers/training
- TripletMarginLoss:
https://pytorch.org/docs/stable/generated/torch.nn.TripletMarginLoss.html
- WandB pipeline:
https://colab.research.google.com/github/wandb/examples/blob/master/colabs/pytorch/Simple_PyTorch_Integration.ipynb#scrollTo=FH61NWlVR_SL
- Early stopping:
https://wandb.ai/ayush-thakur/huggingface/reports/Early-Stopping-in-HuggingFace-Examples--Vmlldzo0MzE2MTM

#### Environment Setup 
- Amazon SageMaker Studio
- Kernel - Python 3 (Data Science)

In [2]:
# %%capture
# !pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
# !pip3 install transformers
# !pip3 install sentencepiece
# !pip3 install wandb --upgrade

#### Import Libraries

In [3]:
import json
import random
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import PegasusForConditionalGeneration, PegasusTokenizerFast
from transformers.optimization import Adafactor
from tqdm.auto import tqdm

import wandb

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


Moving 0 files to the new cache system


0it [00:00, ?it/s]

#### WandB Setup

In [4]:
# Ensure deterministic behavior
torch.backends.cudnn.deterministic = True
random.seed(hash("setting random seeds") % 2**32 - 1)
np.random.seed(hash("improves reproducibility") % 2**32 - 1)
torch.manual_seed(hash("by removing stochasticity") % 2**32 - 1)
torch.cuda.manual_seed_all(hash("so runs are repeatable") % 2**32 - 1)

In [5]:
# Device configuration
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [6]:
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

#### Define the Experiment and Pipeline 

In [7]:
# Define the configuration of the experiment
config = dict(
    epochs = 20,
    batch_size = 16,
    optimizer = "adafactor",
    es_patience = 5, # early stopping patience steps
    loss_function = "triplet-margin-loss",
    dataset = "bpmai-29-10-2019",
    architecture = "encoder-seq2seq-pegasus", # TML trained on the encoder part of Pegasus model
    retrain = False, # True if continue training from checkpoint of previous iteration
    input_model = "", # specify path of input model if continue training or left blank
    output_model = "" # specify path to save output model, i.e., "./model_TML/TML_{}_epoch.pth"
)

##### Track metadata and hyperparameters with wandb.init

In [8]:
# Define the training pipeline
def model_pipeline(hyperparameters):
    with wandb.init(project="wandb-project-name", entity="wandb-entity-name", config=hyperparameters):
        config = wandb.config
        # set model, data loaders, optimizer, and early stopping with defined config
        model, train_loader, val_loader, optimizer = make(config)
        es = EarlyStopping(patience = config.es_patience)
        # train and validate with early stopping applied
        train_and_val(model, train_loader, val_loader, optimizer, es, config)

    return model

##### Set model, data loaders and optimizer with defined configuration

In [9]:
def make(config):
    # set pretrained tokenizer, model and optimizer
    model_name = 'google/pegasus-large' # 'google/pegasus-xsum'
    tokenizer = PegasusTokenizerFast.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name, return_dict=True)
    optimizer = Adafactor(model.parameters(), scale_parameter=True, relative_step=True, warmup_init=True, lr=None)
    
    # if continue training from checkpoint of previous iteration
    if config.retrain: 
        load(model, optimizer, config.input_model)
        model = PegasusForConditionalGeneration.from_pretrained(model_name, output_hidden_states=True, output_attentions=True, return_dict=True)
        if torch.cuda.device_count() > 1:
            print("Let's use", torch.cuda.device_count(), "GPUs!")
            model = nn.DataParallel(model)
    model.to(device)
    
    # set data loaders
    train_loader = make_loader(train_data, tokenizer, shuffle=True, batch_size=config.batch_size)
    val_loader = make_loader(val_data, tokenizer, shuffle=True, batch_size=config.batch_size)
    # test print data
    for anchor, positive, negative in train_loader:
        break
    print({k: v.shape for k, v in anchor.items()})
    
    return model, train_loader, val_loader, optimizer

#### Define the Data Loading and Model
##### Load data

In [10]:
with open('./data/triplet_train_dataset.json', 'r') as f:
    t_data = json.load(f)
with open('./data/triplet_val_dataset.json', 'r') as f:
    v_data = json.load(f)

train_data = []
val_data = []
train_temp = t_data['easy_negatives'] + t_data['negatives'] + t_data['one_step_away_negs'] + t_data['hard_negatives']
val_temp = v_data['easy_negatives'] + v_data['negatives'] + v_data['one_step_away_negs'] + v_data['hard_negatives']
for d1 in train_temp:
    train_data.append([x.lower() for x in d1])
for d2 in val_temp:
    val_data.append([x.lower() for x in d2])

##### Define Triplet Dataset

In [11]:
class TripletDataset(torch.utils.data.Dataset):
    def __init__(self, triplets, tokenizer):
        self.triplets = triplets
        self.tokenizer = tokenizer
        
    def __getitem__(self, idx):
        triplet = self.triplets[idx]
        triplet_encodings = self.tokenizer(triplet, truncation=True, padding='max_length', max_length=50)
        anchor = {key: torch.tensor(val[0]) for key, val in triplet_encodings.items()}
        positive = {key: torch.tensor(val[1]) for key, val in triplet_encodings.items()}
        negative = {key: torch.tensor(val[2]) for key, val in triplet_encodings.items()}
        # set dummy labels for the decoder part of Pegasus, which is left untrained in this model training process
        anchor['labels'] = torch.tensor(triplet_encodings['input_ids'][0])
        positive['labels'] = torch.tensor(triplet_encodings['input_ids'][1])
        negative['labels'] = torch.tensor(triplet_encodings['input_ids'][2])
        return anchor, positive, negative # input_ids, attention_mask, labels
    
    def __len__(self):
        return len(self.triplets)

##### Define Triplet Data Loader

In [12]:
def make_loader(triplets, tokenizer, shuffle, batch_size):
    triplet_dataset = TripletDataset(triplets, tokenizer)
    triplet_dataloader = DataLoader(
        dataset=triplet_dataset, shuffle=shuffle, batch_size=batch_size
    )
    return triplet_dataloader

##### Define Pegasus-TML Model

In [13]:
class PegasusTMLModel(nn.TripletMarginLoss):
    def __init__(self, model, margin: float = 1.0, p: float = 2., eps: float = 1e-6, 
                 swap: bool = False, size_average=None, reduce=None, reduction: str = 'mean'):
        super().__init__(margin, p, eps, swap, size_average, reduce, reduction)
        self.model = model
        
    def forward(self, triplet_batch):       
        # retrieve triplets in batch
        anchors, positives, negatives = triplet_batch[0], triplet_batch[1], triplet_batch[2]
        # get index of </s> end-of-sentence token representing each anchor, pos and neg
        anchor_eos, positive_eos, negative_eos = get_eos_idx(anchors), get_eos_idx(positives), get_eos_idx(negatives)
        # get model outputs of anchor, pos and neg sentences
        model_output_a = self.model(**anchors)
        model_output_p = self.model(**positives)
        model_output_n = self.model(**negatives)
        # get </s> token from encoder output - last hidden layer
        encoder_output_a = model_output_a.encoder_last_hidden_state
        encoder_output_p = model_output_p.encoder_last_hidden_state
        encoder_output_n = model_output_n.encoder_last_hidden_state
        a_eos = torch.vstack([encoder_output_a[i][anchor_eos[i]] for i in range(encoder_output_a.size(0))])
        p_eos = torch.vstack([encoder_output_p[i][positive_eos[i]] for i in range(encoder_output_p.size(0))])
        n_eos = torch.vstack([encoder_output_n[i][negative_eos[i]] for i in range(encoder_output_n.size(0))])     
        # compute the loss
        triplet_margin_loss = F.triplet_margin_loss(a_eos, p_eos, n_eos, 
                                                    margin=self.margin, p=self.p,
                                                    eps=self.eps, swap=self.swap, 
                                                    reduction=self.reduction)        

        return triplet_margin_loss

In [14]:
# define function used to extract the indexes of </s> end-of-sentence tokens representing each sentence
def get_eos_idx(batch):
    for input_ids in batch['input_ids']:
        eos_id = input_ids == 1
        idx = eos_id.nonzero()[0]
        if 'eos_idx' in locals():
            eos_idx = torch.cat((eos_idx, idx), 0)
        else:
            eos_idx = eos_id.nonzero()[0]
    return eos_idx

#### Define Early Stopping 


In [15]:
class EarlyStopping(object):
    def __init__(self, mode='min', min_delta=0, patience=10, percentage=False):
        self.mode = mode
        self.min_delta = min_delta
        self.patience = patience
        self.best = None
        self.num_bad_epochs = 0
        self.is_better = None
        self._init_is_better(mode, min_delta, percentage)

        if patience == 0:
            self.is_better = lambda a, b: True
            self.step = lambda a: False

    def step(self, metrics):
        if self.best is None:
            self.best = metrics
            return False

        if torch.isnan(metrics):
            return True

        if self.is_better(metrics, self.best):
            self.num_bad_epochs = 0
            self.best = metrics
        else:
            self.num_bad_epochs += 1

        if self.num_bad_epochs >= self.patience:
            return True

        return False

    def _init_is_better(self, mode, min_delta, percentage):
        if mode not in {'min', 'max'}:
            raise ValueError('mode ' + mode + ' is unknown!')
        if not percentage:
            if mode == 'min':
                self.is_better = lambda a, best: a < best - min_delta
            if mode == 'max':
                self.is_better = lambda a, best: a > best + min_delta
        else:
            if mode == 'min':
                self.is_better = lambda a, best: a < best - (
                            best * min_delta / 100)
            if mode == 'max':
                self.is_better = lambda a, best: a > best + (
                            best * min_delta / 100)

#### Define Training Logic
##### Track gradients and weights with wandb.watch and everything else, i.e. loss, with wandb.log

In [16]:
def train_and_val(model, train_loader, val_loader, optimizer, es, config):
    # set the model to train
    pegasus_tml_model = PegasusTMLModel(model)
    wandb.watch(pegasus_tml_model, log="all", log_freq=10)

    # run training and track with wandb
    total_batches = len(train_loader) * config.epochs
    print('num_training_steps', total_batches)
    progress_bar = tqdm(range(total_batches))

    batch_ct = 0
    running_loss = 0.
    last_loss = 0.
    for epoch in range(config.epochs):
        model.train()
        for idx, triplet_batch in enumerate(train_loader):
            loss = train_batch(idx, triplet_batch, pegasus_tml_model, optimizer, progress_bar)
            batch_ct += 1
            # report metrics every 25th batch
            running_loss += loss.item()
            if (batch_ct % 25) == 0:
                last_loss = running_loss / 25 # log loss in average term
                train_log(last_loss, batch_ct, epoch)
                running_loss = 0.
        # validate model after train at each epoch
        model.eval()
        val_loss = val(pegasus_tml_model, val_loader)
        val_log(val_loss, batch_ct, epoch) # log validation loss
        # save model after each epoch
        output_model = config.output_model.format(epoch+1)
        save(model, optimizer, output_model)
        # check whether to apply early stopping (number of patience step)
        if es.step(val_loss):
            break


##### Define functions needed in the training loop

In [17]:
def train_batch(idx, batch, model, optimizer, progress_bar):                                                                                 
    triplet_items = []
    for item in batch:
        triplet_items.append({k: v.to(device) for k, v in item.items()}) 
    # forward pass
    loss = model(triplet_items)      
    # backward pass
    optimizer.zero_grad()
    loss.backward()
    # step with optimizer every 2 step (batch accumulation)
    if (idx+1) % 2 == 0:
        optimizer.step()
        progress_bar.update(1)

    return loss

In [18]:
def val(model, val_loader):
    with torch.no_grad():
        loss = 0
        for _, triplet_batch in enumerate(val_loader):
            triplet_items = []
            for item in triplet_batch:
                triplet_items.append({k: v.to(device) for k, v in item.items()})
            loss += model(triplet_items)
        # output loss in average
        loss /= len(val_loader)
    
    return loss

In [19]:
def train_log(loss, batch_num, epoch):
    wandb.log({"epoch": epoch, "loss": loss}, step=batch_num)
    print(f"Loss after " + str(batch_num).zfill(5) + f" steps: {loss:.3f}")

def val_log(loss, batch_num, epoch):
    wandb.log({"val_loss": loss})
    print(f"Validation Loss after " + str(batch_num).zfill(5) + f" training steps: {loss:.3f}")

In [20]:
def save(model, optimizer, output_model):
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict()
    }, output_model)

def load(model, optimizer, output_model):
    checkpoint = torch.load(output_model)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

#### Build, train and analyze the model with the pipeline

In [None]:
model = model_pipeline(config)

[34m[1mwandb[0m: Currently logged in as: [33myentingwang[0m ([33myenting-thesis[0m). Use [1m`wandb login --relogin`[0m to force relogin


Downloading:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.09k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

{'input_ids': torch.Size([16, 50]), 'attention_mask': torch.Size([16, 50]), 'labels': torch.Size([16, 50])}
num_training_steps 15140


  0%|          | 0/15140 [00:00<?, ?it/s]

Loss after 00025 steps: 1.045
Loss after 00050 steps: 1.004
Loss after 00075 steps: 1.020
Loss after 00100 steps: 1.042
Loss after 00125 steps: 1.001
Loss after 00150 steps: 1.054
Loss after 00175 steps: 0.998
Loss after 00200 steps: 0.998
Loss after 00225 steps: 1.063
Loss after 00250 steps: 1.016
Loss after 00275 steps: 0.968
Loss after 00300 steps: 0.992
Loss after 00325 steps: 0.982
Loss after 00350 steps: 0.984
Loss after 00375 steps: 0.977
Loss after 00400 steps: 1.001
Loss after 00425 steps: 0.985
Loss after 00450 steps: 0.962
Loss after 00475 steps: 0.964
Loss after 00500 steps: 0.958
Loss after 00525 steps: 0.986
Loss after 00550 steps: 0.960
Loss after 00575 steps: 0.931
Loss after 00600 steps: 0.986
Loss after 00625 steps: 0.941
Loss after 00650 steps: 0.957
Loss after 00675 steps: 0.936
Loss after 00700 steps: 0.946
Loss after 00725 steps: 0.905
Loss after 00750 steps: 0.979
Validation Loss after 00757 training steps: 0.855
Loss after 00775 steps: 0.931
Loss after 00800 ste