<a href="https://colab.research.google.com/github/Zypperman/DBTT_G1_GRP3/blob/main/Usecase_4_1_V1(Summariser).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use Case 4.1: Application document summarisation for identification

### Problem statement
In the dashboard, it can be hard to tell applicants apart, since only a few details are shown for each. Thai names add to the problem: they are typically long, which makes matching documents to applicants error-prone. A short summary is especially useful for business loans, where a large volume of documents such as a business proposal or an earnings report is involved.

----

### Proposed solution and workflow

We propose to summarise the documents that a loan applicant submits, and to provide the result as a 15-word summary, available as a tooltip on the documentation icon in the officer's applicant overview.

----
### Impact and risks

This will let officers work more efficiently: they can recall applicants by the content of documents they have handled before, or simply preview what to expect from the documentation.

----

### Possible areas for future improvements to be made:

We can fine-tune the pre-trained BART model for our task using PEFT (Parameter-Efficient Fine-Tuning) and related memory-efficient training techniques such as [RoSA](https://arxiv.org/abs/2401.04679) and [GaLore](https://arxiv.org/abs/2403.03507) to obtain better responses.


# Note: Implementation Theory and Source Code heavily referenced from Ashwin N ([Medium](https://medium.com/@ashwinnaidu1991/multi-document-summarization-with-bart-c06db25df62a) \| [Kaggle](https://www.kaggle.com/code/ashwinnaidu/textsummarization/notebook))

# PS: make sure to connect to a runtime with a GPU.

In [1]:
!pip install llmware transformers huggingface_hub
!pip install PyPDF2


First, we preprocess the document that we will summarise. For demonstration purposes the sample is kept short: the `get_pages` helper below can trim a PDF to its first few pages (five as written), although its call is left commented out here.

In [4]:
# download document
import urllib.request
from PyPDF2 import PdfReader, PdfWriter  # PdfWriter is used by get_pages to write a trimmed copy
import os

pdf_url = "https://raw.githubusercontent.com/Zypperman/DBTT_G1_GRP3/main/Data/Bank_of_thailand_Statement_2023_EN_short.pdf"
filename = "documents/example.pdf"
os.makedirs('./documents',exist_ok=True)

def download_pdf_from_github(url, filename):

    filepath = f"./{filename}"

    # Download the file
    print("Downloading PDF from GitHub")
    urllib.request.urlretrieve(url, filepath)
    print(f"Download complete! File saved at: {filepath}")
    return filepath

downloaded_filepath = download_pdf_from_github(pdf_url, filename)

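def get_pages(input_pdf_path, output_pdf_path, max_pages=5):
    """Copy the first max_pages pages of a PDF into a new file."""
    reader = PdfReader(input_pdf_path)
    writer = PdfWriter()

    for i in range(min(max_pages, len(reader.pages))):
        writer.add_page(reader.pages[i])
    with open(output_pdf_path, 'wb') as output_file:
        writer.write(output_file)

# Optional: uncomment to write a trimmed copy of the download
# get_pages(filename, "example.pdf")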

Downloading PDF from GitHub
Download complete! File saved at: ./documents/example.pdf


In [6]:
from llmware.library import Library

fp = './documents/'
os.makedirs(fp,exist_ok=True)
lib = Library().create_new_library("my_library")
lib.add_files(input_folder_path=fp, chunk_size=400, max_chunk_size=600, smart_chunking=0)

# standard call to 'ingest' files into a library (implicitly calls Parser and manages the details)
lib.add_files(input_folder_path=fp)

# NOTE: The above code is meant for RAG applications and is supposed to facilitate larger scale document ingestion. For now, we are able to simply parse the text content into a single string for demonstration.


# Load the PDF file
reader = PdfReader("./"+filename)

Input_text = []

# Extract text from each page
for page in reader.pages:
    Input_text.append(page.extract_text())

INFO:llmware.parsers:update:  Duplicate files (skipped): 1
INFO:llmware.parsers:update:  Total uploaded: 0
INFO:llmware.parsers:update:  Duplicate files (skipped): 1
INFO:llmware.parsers:update:  Total uploaded: 0


Now, we load the summarisation model, DistilBART, together with its tokenizer:

In [7]:
from transformers import BartForConditionalGeneration, AutoTokenizer
model_ckpt = "sshleifer/distilbart-cnn-6-6"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BartForConditionalGeneration.from_pretrained(model_ckpt)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



The model can be fine-tuned using the code in the cell below. For this demonstration we treat fine-tuning as an MLOps responsibility, so the cell is shown for reference only and is not executed.

In [None]:
# referencing the dataset used from the example, Multi-news:

from datasets import load_dataset
from transformers import DataCollatorForSeq2Seq
from transformers import TrainingArguments, Trainer


dataset = load_dataset("multi_news")

# tokenize sample text pairs of {text : summary}

def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch["document"], max_length=1024, truncation=True)

    # text_target tokenizes the summaries as labels; it replaces the
    # deprecated `with tokenizer.as_target_tokenizer():` context manager
    target_encodings = tokenizer(text_target=example_batch["summary"], max_length=256, truncation=True)

    return {"input_ids": input_encodings["input_ids"],
            "attention_mask": input_encodings["attention_mask"],
            "labels": target_encodings["input_ids"]}

dataset_pt = dataset.map(convert_examples_to_features, batched=True)

# collate data in the seq2seq format
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# set up the training arguments
training_args = TrainingArguments(
    output_dir='bart-multi-news',
    num_train_epochs=1,
    warmup_steps=500,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    logging_steps=10,
    push_to_hub=False,
    eval_strategy='steps',  # older transformers releases name this `evaluation_strategy`
    eval_steps=500,
    save_steps=1_000_000,  # large enough to effectively disable intermediate checkpoints
    gradient_accumulation_steps=16,
)

# creating the trainer
trainer = Trainer(model=model, args=training_args, tokenizer=tokenizer,
                  data_collator=seq2seq_data_collator,
                  train_dataset=dataset_pt["train"],
                  eval_dataset=dataset_pt["validation"])
trainer.train()


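To generate a 15-word summary, we estimate the token budget with the common rule of thumb that one token is about 0.75 words, so each word takes about 1.33 tokens and 15 words need roughly 20 tokens. To allow for some overhead, including the special start and end tokens that count towards the limit, we cap generation at 30 tokens.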
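
The token budget can be sanity-checked with a few lines of arithmetic. The sketch below assumes the common rule of thumb that one token corresponds to roughly 0.75 English words (so a word is about 1.33 tokens); the real ratio depends on the tokenizer and the text, and the name `tokens_for_words` is purely illustrative.

```python
import math

WORDS_PER_TOKEN = 0.75  # assumed rule of thumb, not measured on this tokenizer

def tokens_for_words(n_words, words_per_token=WORDS_PER_TOKEN):
    """Estimate how many tokens are needed to express n_words words."""
    return math.ceil(n_words / words_per_token)

budget = tokens_for_words(15)
print(budget)        # estimated tokens for a 15-word summary
print(budget <= 30)  # whether a 30-token cap leaves headroom
```

Under this assumption a 15-word summary needs about 20 tokens, so a cap of 30 leaves room for special tokens and longer-than-average words.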

In [20]:
import torch

# Use the GPU when one is available; fall back to the CPU otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

sample_text = ''.join(Input_text)

# BART accepts at most 1024 input tokens, so longer documents are truncated
inputs = tokenizer(sample_text, max_length=1024, truncation=True, return_tensors='pt')
inputs = {k: v.to(device) for k, v in inputs.items()}

summaries = model.generate(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=30)
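
Because the tokenizer call truncates at 1024 tokens, any text beyond that point never reaches the model. One possible workaround, which is not part of the original notebook, is to split the text into overlapping word windows and summarise each window separately. The sketch below only shows the splitting step; `split_into_windows` and its default sizes are hypothetical choices, and word counts are used as a rough stand-in for token counts.

```python
def split_into_windows(text, window_words=600, overlap_words=50):
    """Split text into overlapping windows of at most window_words words."""
    if overlap_words >= window_words:
        raise ValueError("overlap_words must be smaller than window_words")
    words = text.split()
    step = window_words - overlap_words
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + window_words]))
        if start + window_words >= len(words):
            break  # this window already reaches the end of the text
    return windows

demo = " ".join(f"w{i}" for i in range(10))
print(split_into_windows(demo, window_words=4, overlap_words=1))
```

Each window could then be passed through the same tokenizer and `model.generate` call as the full text, with the per-window summaries combined afterwards.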



The model returns a sequence of token IDs, which we can now decode into the text of our summary:

In [21]:
decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=True) for s in summaries]

We can now print the decoded summary. Because generation is capped at 30 tokens, the text stops mid-sentence:

In [22]:
print(decoded_summaries[0])


 State Audit Office of the Kingdom of Thailand has audited the financial statements of the Bank of Thailand (the Bank), which comprise the statement
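
The decoded text above runs to more than 15 words and stops mid-sentence, so it still needs trimming before it can serve as the tooltip described in the proposed solution. A minimal sketch of that last step, assuming a simple whitespace word count is acceptable; `to_tooltip` is a hypothetical helper, not part of the notebook:

```python
def to_tooltip(summary, max_words=15):
    """Cap a summary at max_words words, marking any cut with an ellipsis."""
    words = summary.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + "..."

example = ("State Audit Office of the Kingdom of Thailand has audited the "
           "financial statements of the Bank of Thailand")
print(to_tooltip(example))
```

In the notebook, the same helper could be applied to `decoded_summaries[0]` before the text is sent to the dashboard.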
