### **Fine-tuning the BART model**

This notebook walks through fine-tuning a BART model from Hugging Face (published by sshleifer), which already performs well at summarizing text documents. Since many summarization datasets are available, I want to fine-tune the model to see whether its performance improves and how the tuning process affects it across different use cases.

The dataset used in this notebook is **multi_news** from LILY LAB, which consists of news articles and human-written summaries of these articles from the site newser.com.

To begin with, I use the **distilbart-cnn-12-6** model and the **English** portion of the multi_news dataset for experimentation. In future work, I would also like to try the BARTpho model (created by VinAI), which targets Vietnamese text summarization, together with the **Vietnamese** portion of the wiki_lingua dataset.

#### **Setup**

The notebook was intended to be run locally, but for lack of GPU and memory I switched to Google Colab. However, Google Colab cannot stay active for long without a Pro subscription, so I decided to use Kaggle, which can run the notebook in the background without timing out (until the quota is reached).

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
!pip install datasets
!pip install transformers
!pip install rouge_score
!pip install sentencepiece

Collecting rouge_score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Installing collected packages: rouge_score
Successfully installed rouge_score-0.0.4

In [3]:
#from huggingface_hub import notebook_login

#notebook_login()

In [4]:
import os
import numpy as np
import torch
import datasets
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM, 
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments, 
    Seq2SeqTrainer
)
import nltk

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["WANDB_DISABLED"] = "true"

#### **Model and Tokenizer**

In [5]:
organization = "sshleifer"
model_name = "distilbart-cnn-12-6"

tokenizer = AutoTokenizer.from_pretrained(f"{organization}/{model_name}")
model = AutoModelForSeq2SeqLM.from_pretrained(f"{organization}/{model_name}")

encoder_max_len = 512
decoder_max_len = 128


In [6]:
# Check whether the tokenizer and model vocabulary sizes match
print(f"Tokenizer: {tokenizer.vocab_size}")
print(f"Model: {model.config.vocab_size}")

# len(tokenizer) also counts added tokens, so it is the size the embeddings must cover
mismatch = len(tokenizer) != model.config.vocab_size

Tokenizer: 50265
Model: 50264


In [7]:
if mismatch:
    model.resize_token_embeddings(len(tokenizer))
    print(f"Tokenizer: {tokenizer.vocab_size}")
    print(f"Model: {model.config.vocab_size}")

Tokenizer: 50265
Model: 50265


#### **Data preparation**
**Read data**

In [8]:
# # Use local dataset
# src = "drive/MyDrive/Personal Workspace/Colab Notebooks/NLP/data_sm.jsonl"
# data = datasets.load_dataset("json", data_files=src)

# train_val_test = data["train"].train_test_split(shuffle=True, seed=42, test_size=0.1)

# dataset = datasets.DatasetDict({
#     "train": train_val_test["train"], # Train
#     "val": train_val_test["test"], # Validation
# })

In [9]:
# Download dataset
language = "default"
dataset_name = "multi_news"

ds_size = 2000
data = datasets.load_dataset(dataset_name, name=language, split=f"train[:{ds_size}]")

Downloading builder script:   0%|          | 0.00/1.61k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/932 [00:00<?, ?B/s]

Downloading and preparing dataset multi_news/default (download: 245.06 MiB, generated: 667.72 MiB, post-processed: Unknown size, total: 912.78 MiB) to /root/.cache/huggingface/datasets/multi_news/default/1.0.0/2e145a8e21361ba4ee46fef70640ab946a3e8d425002f104d2cda99a9efca376...


Downloading data:   0%|          | 0.00/257M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/44972 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5622 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5622 [00:00<?, ? examples/s]

Dataset multi_news downloaded and prepared to /root/.cache/huggingface/datasets/multi_news/default/1.0.0/2e145a8e21361ba4ee46fef70640ab946a3e8d425002f104d2cda99a9efca376. Subsequent calls will reuse this data.


**Preprocessing and Split**

In [10]:
# def flatten(dataset):
#     return {
#         "document": dataset["article"]["document"],
#         "summary": dataset["article"]["summary"],
#     }


# def list2samples(dataset):
#     documents = []
#     summaries = []
#     for sample in zip(dataset["document"], dataset["summary"]):
#         if len(sample[0]) > 0:
#             documents += sample[0]
#             summaries += sample[1]
#     return {"document": documents, "summary": summaries}


# # dataset = data.map(flatten, remove_columns=["article", "url"])
# dataset = data.map(list2samples, batched=True)

split_ratio = 0.1
train_data_txt, validation_data_txt = data.train_test_split(test_size=split_ratio).values()

**Tokenize data**

In [11]:
def batch_tokenizing(batch, tokenizer, max_input_len, max_output_len):
    """Tokenize a batch of documents and summaries, truncating both."""
    input_, output_ = batch["document"], batch["summary"]
    input_tokenized = tokenizer(
        input_, max_length=max_input_len, truncation=True
    )

    # Tokenize the summaries in target mode so they can serve as labels
    with tokenizer.as_target_tokenizer():
        output_tokenized = tokenizer(
            output_, max_length=max_output_len, truncation=True
        )

    # Copy the encoder inputs, then attach the summary token IDs as labels
    batch = {key: value for key, value in input_tokenized.items()}
    batch["labels"] = output_tokenized["input_ids"]

    return batch

train_data = train_data_txt.map(
    lambda batch: batch_tokenizing(
        batch, tokenizer, encoder_max_len, decoder_max_len
    ),
    batched=True,
    remove_columns=train_data_txt.column_names,
)

val_data = validation_data_txt.map(
    lambda batch: batch_tokenizing(
        batch, tokenizer, encoder_max_len, decoder_max_len
    ),
    batched=True,
    remove_columns=validation_data_txt.column_names,
)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]
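
The tokenization step truncates sequences but does not pad them; padding is left to the `DataCollatorForSeq2Seq` created later, which pads each batch to its longest item. The following is only a minimal pure-Python sketch of that idea, not the library's implementation. The function name `pad_batch` and the toy token IDs are hypothetical, the pad ID of 1 is an assumption about the BART tokenizer, and -100 is the label value that the loss ignores (the metric function later swaps it back to the pad ID before decoding).

```python
def pad_batch(features, pad_token_id=1, label_pad_id=-100):
    """Pad every feature to the longest input and longest label in the batch.

    pad_token_id=1 is assumed here for illustration; label_pad_id=-100 marks
    label positions that should not contribute to the loss.
    """
    max_in = max(len(f["input_ids"]) for f in features)
    max_out = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_out = max_out - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_out)
    return batch


# Toy features with made-up token IDs
features = [
    {"input_ids": [0, 11, 12, 2], "labels": [0, 21, 2]},
    {"input_ids": [0, 13, 2], "labels": [0, 22, 23, 24, 2]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Padding per batch rather than to `encoder_max_len` up front keeps short batches small, which matters when GPU memory is tight.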

#### **Training model**

**Metrics**

In [12]:
nltk.download("punkt", quiet=True)

metric = datasets.load_metric("rouge")

def postprocess_data(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # Join sentences with newlines, the format rougeLsum expects
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

def calculate_metric(eval_result):
    preds, labels = eval_result
    if isinstance(preds, tuple):
        preds = preds[0]
    
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Postprocess predictions and labels for metric computation
    decoded_preds, decoded_labels = postprocess_data(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    pred_len = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(pred_len)
    result = {key: round(val, 4) for key, val in result.items()}

    return result

Downloading builder script:   0%|          | 0.00/2.16k [00:00<?, ?B/s]
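
To make the reported scores easier to read, here is a simplified sketch of what a ROUGE-1 F-measure captures: unigram overlap between a prediction and a reference. It is an illustration only and will not reproduce the `rouge` metric exactly, since it assumes lowercased whitespace tokenization and skips stemming; the function name `rouge1_f` is hypothetical.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure (simplified: whitespace tokens, no stemming)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the cat sat on the mat", "the cat lay on the mat")
print(round(score * 100, 2))
```

ROUGE-2 applies the same idea to bigrams, while ROUGE-L and ROUGE-Lsum are based on the longest common subsequence instead of fixed n-grams.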

**Training arguments**

In [13]:
!sudo apt-get install git-lfs
!git lfs install




The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 35 not upgraded.
Need to get 3316 kB of archives.
After this operation, 11.1 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 git-lfs amd64 2.9.2-1 [3316 kB]
Fetched 3316 kB in 3s (1156 kB/s)
Selecting previously unselected package git-lfs.
(Reading database ... 108264 files and directories currently installed.)
Preparing to unpack .../git-lfs_2.9.2-1_amd64.deb ...
Unpacking git-lfs (2.9.2-1) ...
Setting up git-lfs (2.9.2-1) ...
Processing triggers for man-db (2.9.1-1) ...
Error: Failed to call git rev-parse --git-dir: exit status 128 
Git LFS initialized.


In [14]:
from transformers import EarlyStoppingCallback as callback_early

In [15]:
batch_size = 4
eval_interval = 2

train_args = Seq2SeqTrainingArguments(
    report_to="none",  # disable logging integrations explicitly
    output_dir=f"{model_name}-ftn-{dataset_name}",
    num_train_epochs=1,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    do_train=True,
    do_eval=True,
    seed=42,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_dir="logs",
    logging_steps=100,
    eval_steps=int(ds_size*split_ratio*eval_interval),
    save_steps=int(ds_size*split_ratio*eval_interval),
    save_total_limit=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
    load_best_model_at_end=True,
    hub_token=os.environ.get("HF_TOKEN"),  # read the token from the environment (variable name is an assumption); never hard-code it
    evaluation_strategy="steps",
    save_strategy="steps",
    hub_model_id=f"datien228/{model_name}-ftn-{dataset_name}",
)

data_colla = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=train_args,
    data_collator=data_colla,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    compute_metrics=calculate_metric,
    callbacks=[callback_early(early_stopping_patience=3)]
)

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Cloning https://huggingface.co/datien228/distilbart-cnn-12-6-ftn-multi_news into local empty directory.


Download file pytorch_model.bin:   0%|          | 410/1.14G [00:00<?, ?B/s]

Download file training_args.bin: 100%|##########| 3.17k/3.17k [00:00<?, ?B/s]

Clean file training_args.bin:  32%|###1      | 1.00k/3.17k [00:00<?, ?B/s]

Clean file pytorch_model.bin:   0%|          | 1.00k/1.14G [00:00<?, ?B/s]

Using amp half precision backend
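
The trainer is configured with `early_stopping_patience=3`, evaluating on the validation loss. As a rough sketch of the patience logic (a simplification, not the `EarlyStoppingCallback` source; the function name `stopping_step` and the loss values are hypothetical), training stops once the monitored loss has failed to improve for `patience` consecutive evaluations:

```python
def stopping_step(eval_losses, patience=3):
    """Return the index of the evaluation that triggers early stopping, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            # Improvement: remember the new best and reset the counter
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Made-up validation losses: the best value is at index 1,
# followed by three evaluations without improvement
print(stopping_step([4.0, 3.9, 3.95, 3.97, 3.96]))
# A single evaluation can never trigger the stop
print(stopping_step([4.02]))
```

Under this sketch, a run needs at least `patience + 1` evaluations before it can stop early, so with evaluations 400 steps apart the callback only becomes relevant for runs that are considerably longer.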


**Train model (fine-tune)**

In [16]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [17]:
trainer.train()

***** Running training *****
  Num examples = 1800
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 450


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 400 | 4.1427 | 4.020215 | 39.9832 | 13.0653 | 22.1761 | 34.5466 | 132.41 |


***** Running Evaluation *****
  Num examples = 200
  Batch size = 4
Saving model checkpoint to distilbart-cnn-12-6-ftn-multi_news/checkpoint-400
Configuration saved in distilbart-cnn-12-6-ftn-multi_news/checkpoint-400/config.json
Model weights saved in distilbart-cnn-12-6-ftn-multi_news/checkpoint-400/pytorch_model.bin
tokenizer config file saved in distilbart-cnn-12-6-ftn-multi_news/checkpoint-400/tokenizer_config.json
Special tokens file saved in distilbart-cnn-12-6-ftn-multi_news/checkpoint-400/special_tokens_map.json
tokenizer config file saved in distilbart-cnn-12-6-ftn-multi_news/tokenizer_config.json
Special tokens file saved in distilbart-cnn-12-6-ftn-multi_news/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from distilbart-cnn-12-6-ftn-multi_news/checkpoint-400 (score: 4.020214557647705).


TrainOutput(global_step=450, training_loss=4.260530700683594, metrics={'train_runtime': 538.3075, 'train_samples_per_second': 3.344, 'train_steps_per_second': 0.836, 'total_flos': 1393120876953600.0, 'train_loss': 4.260530700683594, 'epoch': 1.0})
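
The numbers in the training log (1800 training examples, 450 optimization steps, one checkpoint at step 400) follow directly from the settings chosen earlier. The arithmetic can be reproduced as below, assuming a single device and no gradient accumulation, as the log reports; the derived variable names (`num_val`, `num_train`, `steps_per_epoch`, `total_steps`, `num_evals`) are introduced here for illustration.

```python
import math

# Values taken from the notebook cells above
ds_size = 2000
split_ratio = 0.1
eval_interval = 2
batch_size = 4
num_train_epochs = 1

num_val = int(ds_size * split_ratio)          # validation examples
num_train = ds_size - num_val                 # training examples
steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Same expression as eval_steps / save_steps in the training arguments
eval_steps = int(ds_size * split_ratio * eval_interval)
num_evals = total_steps // eval_steps

print(num_train, num_val, total_steps, eval_steps, num_evals)
```

With these settings the run is evaluated only once, at step 400, so the metrics table has a single row and checkpoint-400 is necessarily the best model loaded at the end. A smaller `eval_steps` or more epochs would give more evaluation points to compare.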

#### **Share model to HuggingFace Hub**

In [18]:
# Push the model and tokenizer separately, or together through the trainer
put_together = True

if put_together:
    trainer.push_to_hub()
else:
    model.push_to_hub(f"{model_name}-ftn-{dataset_name}", use_temp_dir=True)
    tokenizer.push_to_hub(f"{model_name}-ftn-{dataset_name}", use_temp_dir=True)

Saving model checkpoint to distilbart-cnn-12-6-ftn-multi_news
Configuration saved in distilbart-cnn-12-6-ftn-multi_news/config.json
Model weights saved in distilbart-cnn-12-6-ftn-multi_news/pytorch_model.bin
tokenizer config file saved in distilbart-cnn-12-6-ftn-multi_news/tokenizer_config.json
Special tokens file saved in distilbart-cnn-12-6-ftn-multi_news/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 32.0k/1.14G [00:00<?, ?B/s]

remote: error: cannot lock ref 'refs/heads/main': is at 37ab3dee88f9d998231eec71a30b9b49e4c7e166 but expected 67045532b2ba81f7a3b6e1b2360d12a9c89d454a        
To https://huggingface.co/datien228/distilbart-cnn-12-6-ftn-multi_news
 ! [remote rejected] main -> main (failed to update ref)
error: failed to push some refs to 'https://user:[REDACTED]@huggingface.co/datien228/distilbart-cnn-12-6-ftn-multi_news'

Error pushing update to the model card. Please read logs and retry.



#### **Evaluation and comparison**

Compare the summaries from the fine-tuned BART model with those from the original BART model.

In [19]:
def generate_summary(samples, model):
    inputs = tokenizer(
        samples["document"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_len,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return outputs, output_str

original_model = AutoModelForSeq2SeqLM.from_pretrained(f"{organization}/{model_name}")

sample_test = validation_data_txt.select(range(5))

summary_before = generate_summary(sample_test, original_model)[1]
summary_after = generate_summary(sample_test, model)[1]

loading configuration file https://huggingface.co/sshleifer/distilbart-cnn-12-6/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/adac95cf641be69365b3dd7fe00d4114b3c7c77fb0572931db31a92d4995053b.a50597c2c8b540e8d07e03ca4d58bf615a365f134fb10ca988f4f67881789178
Model config BartConfig {
  "_name_or_path": "sshleifer/distilbart-cnn-12-6",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_tok

In [20]:
from tabulate import tabulate

In [21]:
print(tabulate(
        zip(
            range(len(summary_after)),
            summary_after,
        ),
        headers=["ID", "Summary after"]
    )
)

print("\nSource document:\n")
print(tabulate(list(enumerate(sample_test["document"])), headers=["ID", "Document"]))

print("\nTarget summary:\n")
print(tabulate(list(enumerate(sample_test["summary"])), headers=["ID", "Target summary"]))

  ID  Summary after
----  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   0  – A German film-maker has started digging in the Bavarian town of Mittenwald, hoping to find a hidden gold hidden in a piece of sheet music written by Hitler's aide Martin Bormann. The piece is thought to have been marked up by the Führer's aide in the waning days of the second world war, reports the Guardian. In an interview with Der Spiegel, Leo