### **Fine-tuning the BART model**

This notebook walks through fine-tuning a BART model from Hugging Face (a checkpoint published by sshleifer), which already performs well at summarizing text documents. Since many summarization datasets are available, I want to fine-tune the model to see whether its performance improves and how fine-tuning affects it across different use cases.

The dataset used in this notebook is **wiki_lingua** from the GEM Benchmark, a large-scale dataset for cross-lingual summarization covering 18 languages. The data was extracted from articles on the WikiHow site.

To start, I use the **distilbart-xsum-12-6** model and the **english** portion of the wiki_lingua dataset for experimentation. In future work, I'd also like to try the BARTpho model (created by VinAI), which is aimed at Vietnamese text summarization, together with the **vietnamese** portion of the wiki_lingua dataset.

#### **Setup**

The notebook was originally intended to be run locally, but because of limited GPU and memory I switched to Google Colab. However, Google Colab sessions cannot stay active for long without a Pro subscription, so I moved to Kaggle, which can run the notebook in the background without timing out (until the quota is reached).

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
!pip install datasets
!pip install transformers
!pip install rouge_score
!pip install sentencepiece

In [68]:
import os
import numpy as np
import torch
import datasets
from transformers import (
    BartForConditionalGeneration,
    AutoTokenizer,
    Seq2SeqTrainingArguments, 
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
import nltk

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

#### **Model and Tokenizer**

In [69]:
model_name = "sshleifer/distilbart-xsum-12-6"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

encoder_max_len = 256
decoder_max_len = 64

In [70]:
# Check whether the tokenizer and model vocabulary sizes match.
# len(tokenizer) also counts added tokens, so it is used for both the
# printout and the comparison.
mismatch = False

print(f"Tokenizer: {len(tokenizer)}")
print(f"Model: {model.config.vocab_size}")

if len(tokenizer) != model.config.vocab_size:
    mismatch = True

In [71]:
if mismatch:
    # Resize the embedding matrix so that it covers every tokenizer entry
    model.resize_token_embeddings(len(tokenizer))
    print(f"Tokenizer: {len(tokenizer)}")
    print(f"Model: {model.config.vocab_size}")

#### **Data preparation**
**Read data**

In [7]:
# # Use local dataset
# src = "drive/MyDrive/Personal Workspace/Colab Notebooks/NLP/data_sm.jsonl"
# data = datasets.load_dataset("json", data_files=src)

# train_val_test = data["train"].train_test_split(shuffle=True, seed=42, test_size=0.1)

# dataset = datasets.DatasetDict({
#     "train": train_val_test["train"], # Train
#     "val": train_val_test["test"], # Validation
# })

In [72]:
# Download dataset
language = "english"

data = datasets.load_dataset("wiki_lingua", name=language, split="train[:1000]")

**Preprocessing and Split**

In [73]:
def flatten(dataset):
    return {
        "document": dataset["article"]["document"],
        "summary": dataset["article"]["summary"],
    }


def list2samples(dataset):
    documents = []
    summaries = []
    for sample in zip(dataset["document"], dataset["summary"]):
        if len(sample[0]) > 0:
            documents += sample[0]
            summaries += sample[1]
    return {"document": documents, "summary": summaries}


dataset = data.map(flatten, remove_columns=["article", "url"])
dataset = dataset.map(list2samples, batched=True)

train_data_txt, validation_data_txt = dataset.train_test_split(test_size=0.2).values()
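
To make the two mapping steps concrete, here is a minimal sketch on a hand-made record. The record layout (an `article` dict holding parallel `document` and `summary` lists) mirrors what the code above reads; the texts, the URL, and the names `toy_row` and `toy_batch` are invented for illustration and are not taken from the dataset.

```python
# Toy stand-in for one wiki_lingua row: each article holds parallel lists of
# section texts and section summaries (contents invented for illustration).
toy_row = {
    "url": "https://example.com/how-to",
    "article": {
        "document": ["Step one text.", "Step two text."],
        "summary": ["Do one.", "Do two."],
    },
}


def flatten(row):
    # Lift the nested lists to top-level columns
    return {
        "document": row["article"]["document"],
        "summary": row["article"]["summary"],
    }


def list2samples(batch):
    # Turn a batch of per-article lists into one flat list of samples,
    # skipping articles that have no sections
    documents, summaries = [], []
    for docs, sums in zip(batch["document"], batch["summary"]):
        if len(docs) > 0:
            documents += docs
            summaries += sums
    return {"document": documents, "summary": summaries}


flat = flatten(toy_row)
# A batched map receives each column as a list of rows; the second,
# empty article is dropped by list2samples
toy_batch = {"document": [flat["document"], []], "summary": [flat["summary"], []]}
samples = list2samples(toy_batch)
print(samples)
```

Each section of an article becomes its own training sample, which is why the number of samples after this step can exceed the number of articles that were loaded.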

**Tokenize data**

In [74]:
def batch_tokenizing(batch, tokenizer, max_input_len, max_output_len):
    input_, output_ = batch["document"], batch["summary"]
    input_tokenized = tokenizer(
        input_, padding="max_length", max_length=max_input_len, truncation=True
    )
    output_tokenized = tokenizer(
        output_, padding="max_length", max_length=max_output_len, truncation=True
    )

    batch = {key: value for key, value in input_tokenized.items()}

    # Replace padding token ids with -100 so the loss ignores padded positions
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in label]
        for label in output_tokenized["input_ids"]
    ]

    return batch

train_data = train_data_txt.map(
    lambda batch: batch_tokenizing(
        batch, tokenizer, encoder_max_len, decoder_max_len
    ),
    batched=True,
    remove_columns=train_data_txt.column_names,
)

val_data = validation_data_txt.map(
    lambda batch: batch_tokenizing(
        batch, tokenizer, encoder_max_len, decoder_max_len
    ),
    batched=True,
    remove_columns=validation_data_txt.column_names,
)
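
The label masking inside `batch_tokenizing` can be checked in isolation. The sketch below assumes a padding id of 1 (the value BART tokenizers normally use, hard-coded here rather than read from a tokenizer) and uses made-up token ids; `mask_padding` and `toy_labels` are hypothetical names.

```python
PAD_TOKEN_ID = 1      # assumed padding id, normally tokenizer.pad_token_id
IGNORE_INDEX = -100   # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    # Swap every padding id for IGNORE_INDEX so padded positions add no loss
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in seq]
        for seq in label_ids
    ]


# Two padded target sequences with invented token ids
toy_labels = [[0, 713, 16, 2, 1, 1], [0, 100, 2, 1, 1, 1]]
masked = mask_padding(toy_labels)
print(masked)
```

Only the labels are masked this way; the encoder inputs keep their padding ids and rely on the attention mask instead.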

#### **Training model**

**Metrics**

In [75]:
nltk.download("punkt", quiet=True)

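# Note: load_metric is deprecated in newer releases of datasets; the
# evaluate library (evaluate.load("rouge")) is its replacement there.
metric = datasets.load_metric("rouge")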

def postprocess_data(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # Put each sentence on its own line, as expected by the ROUGE-Lsum calculation
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

def calculate_metric(eval_result):
    preds, labels = eval_result
    if isinstance(preds, tuple):
        preds = preds[0]
    
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Preprocess prediction and label for metric computation
    decoded_preds, decoded_labels = postprocess_data(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    # Average length of the generated sequences, counted in non-padding tokens
    pred_len = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["res_"] = np.mean(pred_len)
    result = {key: round(val, 4) for key, val in result.items()}

    return result
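
For intuition about the numbers `calculate_metric` reports, the following is a simplified ROUGE-1 F-measure written by hand. It is only a sketch: it splits on whitespace and skips the stemming that `use_stemmer=True` applies, so its values will not match the library exactly. The function name `rouge1_f` and the two sentences are invented for illustration.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    # Unigram overlap between prediction and reference (clipped counts)
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    # F-measure is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("wash the car with soap", "wash your car with soap and water")
print(round(score * 100, 4))  # scaled to 0-100, as calculate_metric does
```

Four of the five predicted words appear in the seven-word reference, giving a precision of 0.8 and a recall of about 0.57.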

**Training arguments**

In [59]:
from huggingface_hub import notebook_login

notebook_login()

In [76]:
train_args = Seq2SeqTrainingArguments(
    output_dir="distilbart-ftn-wiki_lingua",
    num_train_epochs=1,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=5,
    per_device_eval_batch_size=5,
    warmup_steps=420,
    weight_decay=0.1,
    label_smoothing_factor=0.1,
    predict_with_generate=True,
    logging_dir="logs",
    logging_steps=50,
    save_total_limit=3,
)

data_colla = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=train_args,
    data_collator=data_colla,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    compute_metrics=calculate_metric,
)
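
The interplay between `per_device_train_batch_size`, `num_train_epochs` and `warmup_steps` is easier to judge once the step count is worked out. This is a rough sketch that assumes a single device and no gradient accumulation; `hypothetical_train_samples` is a made-up figure, not the size of the actual training split, and `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_samples, batch_size, epochs):
    """Optimizer steps on a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


hypothetical_train_samples = 4000  # made-up figure for illustration only
total_steps = total_training_steps(hypothetical_train_samples, batch_size=5, epochs=1)
warmup_share = 420 / total_steps  # fraction of training spent in warmup

print(f"total steps: {total_steps}, warmup share: {warmup_share:.1%}")
```

With the default linear schedule, the learning rate rises during the warmup steps and decays afterwards, so it is worth checking that `warmup_steps` stays well below the total step count for the real split.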

In [77]:
os.environ["WANDB_DISABLED"] = "true"

**Train model (fine-tune)**

In [78]:
trainer.train()

#### **Evaluation and comparison**

Compare the summaries produced by the fine-tuned BART model with those from the original BART model.

In [79]:
def generate_summary(samples, model):
    inputs = tokenizer(
        samples["document"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_len,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return outputs, output_str

# Reload the pretrained checkpoint as the baseline, using the class imported above
original_model = BartForConditionalGeneration.from_pretrained(model_name)

sample_test = validation_data_txt.select(range(15))

summary_before = generate_summary(sample_test, original_model)[1]
summary_after = generate_summary(sample_test, model)[1]

In [56]:
from tabulate import tabulate

In [80]:
print(tabulate(
        zip(
            range(len(summary_after)),
            summary_before,
            summary_after,
        ),
        headers=["ID", "Summary before", "Summary after"]
    )
)

print("\nSource document:\n")
print(tabulate(list(enumerate(sample_test["document"])), headers=["ID", "Document"]))

print("\nTarget summary:\n")
print(tabulate(list(enumerate(sample_test["summary"])), headers=["ID", "Target summary"]))

#### **Share the model on the Hugging Face Hub**

In [63]:
!sudo apt-get install git-lfs
!git lfs install

In [81]:
model.push_to_hub("distilbart-ftn-wiki_lingua", use_temp_dir=True)