# Use Case 4.0: Application-tagging for identification

### Problem statement
In the dashboard, it can be difficult to distinguish applicants from one another, given that only a few details are shown. Thai names pose a particular problem since they are typically rather long, which makes identifying documents error-prone. Tagging is especially useful when the applications are for business loans and involve a large volume of documents, such as a business proposal or an earnings report.

----

### Proposed solution and workflow

We propose to attach 3 tags to each loan application, drawn from a pre-selected pool of tags. This is a guided topic modelling and tagging task, so we will use BERTopic for it.

----
### Impact and risks

Doing so will enable the officer to work more efficiently, as they will be able to recall applicants by the semantic content they dealt with previously, or simply get a preview of what to expect from the documentation.

----

### Possible areas for future improvement

We can fine-tune the pre-trained model behind BERTopic (for example, its embedding model) for our task using PEFT (Parameter-Efficient Fine-Tuning) techniques like [RoSA](https://arxiv.org/abs/2401.04679) and [GaLore](https://arxiv.org/abs/2403.03507) to obtain better tags. Alternatively, we can tune other components of BERTopic, such as its representation models, given that BERTopic is a very versatile and modular framework.

# Note: Implementation Theory and Source Code referenced from Ashwin N ([Medium](https://medium.com/@ashwinnaidu1991/multi-document-summarization-with-bart-c06db25df62a) \| [Kaggle](https://www.kaggle.com/code/ashwinnaidu/textsummarization/notebook))

# PS: make sure to connect to a runtime with a GPU.

In [23]:
!pip install llmware transformers huggingface_hub
!pip install bertopic
!pip install PyPDF2
!pip install umap-learn hdbscan scipy




First, we will preprocess the document that we will be using for tagging. For demonstration purposes, we will also save a shortened copy containing only the first 10 pages.

In [46]:
# download document
import urllib.request
from PyPDF2 import PdfReader, PdfWriter # for capacity building, if we decide to expand on the number of pages at 1 go
import os

pdf_url = "https://raw.githubusercontent.com/Zypperman/DBTT_G1_GRP3/main/Data/Bank_of_thailand_Statement_2023_EN.pdf"
filename = "documents/example.pdf"
os.makedirs('./documents',exist_ok=True)

def download_pdf_from_github(url, filename):

    filepath = f"./{filename}"

    # Download the file
    print("Downloading PDF from GitHub")
    urllib.request.urlretrieve(url, filepath)
    print(f"Download complete! File saved at: {filepath}")
    return filepath

downloaded_filepath = download_pdf_from_github(pdf_url, filename)

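def get_pages(input_pdf_path, output_pdf_path, num_pages=10):
    """Write the first `num_pages` pages of a PDF to a new file."""
    reader = PdfReader(input_pdf_path)
    writer = PdfWriter()

    # stop early if the document has fewer pages than requested
    for i in range(min(num_pages, len(reader.pages))):
        writer.add_page(reader.pages[i])
    with open(output_pdf_path, 'wb') as output_file:
        writer.write(output_file)

get_pages(downloaded_filepath, "documents/example_short.pdf")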

Downloading PDF from GitHub
Download complete! File saved at: ./documents/example.pdf


In [49]:
from llmware.library import Library

fp = './documents/'
os.makedirs(fp,exist_ok=True)
lib = Library().create_new_library("my_library")
lib.add_files(input_folder_path=fp, chunk_size=400, max_chunk_size=600, smart_chunking=0)

# standard call to 'ingest' files into a library (implicitly calls Parser and manages the details)
lib.add_files(input_folder_path=fp)

# NOTE: The above code is meant for RAG applications and is supposed to facilitate larger scale document ingestion. For now, we are able to simply parse the text content into a single string for demonstration.


# Load the PDF file
filename = "documents/example.pdf"
reader = PdfReader("./"+filename)

Input_text = []

# Extract text from each page
for page in reader.pages:
    Input_text.append(page.extract_text())
print(len(Input_text))

INFO:llmware.parsers:update:  Duplicate files (skipped): 2
INFO:llmware.parsers:update:  Total uploaded: 0
INFO:llmware.parsers:update:  Duplicate files (skipped): 2
INFO:llmware.parsers:update:  Total uploaded: 0


37
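
The tagging cell later drops pages with five words or fewer, which shifts list positions away from the original page numbers. The sketch below shows one way to keep the page number alongside the text while filtering. It is only an illustration: `keep_informative_pages` and `min_words` are hypothetical names, and the sample strings are made up rather than taken from the PDF.

```python
def keep_informative_pages(page_texts, min_words=6):
    """Return (page_number, text) pairs for pages with at least min_words words.

    Page numbers are 1-based. A value of None is treated as an empty page.
    """
    kept = []
    for page_number, text in enumerate(page_texts, start=1):
        # collapse line breaks and repeated spaces left over from extraction
        text = " ".join((text or "").split())
        if len(text.split()) >= min_words:
            kept.append((page_number, text))
    return kept


# Made-up page texts, for illustration only
sample_pages = ["", "Annual report", "Loans to households grew moderately over the year"]
kept_pages = keep_informative_pages(sample_pages)
print(kept_pages)
# [(3, 'Loans to households grew moderately over the year')]
```

With pairs like these, the texts can be passed to the model while the page numbers are kept for reporting which page received which topic.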


Now, we set up the topic model, BERTopic:

In [51]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Using 25 hand-selected topics
topics_to_model = ['Mortgage Loans (Home Loans)',
 'Auto Loans',
 'Personal Loans',
 'Small Business Loans',
 'Home Equity Loans & HELOCs',
 'Student Loans',
 'Construction Loans',
 'Debt Consolidation Loans',
 'Payday Loans (Short-Term High-Interest Loans)',
 'Commercial Real Estate Loans',
 'Credit Score & Credit History',
 'Debt-to-Income Ratio (DTI)',
 'Employment & Income Stability',
 'Loan Purpose & Use of Funds',
 'Collateral (for Secured Loans)',
 'Down Payment or Equity Available',
 'Loan Term Preferences',
 'Existing Debt Obligations',
 'Bankruptcy/Foreclosure History',
 'Documentation Accuracy',
 'Interest Rate Type (Fixed vs. Variable)',
 'Co-Signer or Guarantor Involvement',
 'Regulatory Compliance & Loan Eligibility',
 'Borrower’s Savings & Financial Reserves',
 'Loan-to-Value Ratio (LTV)']

seed_topic_list = [[i] for i in topics_to_model]

# using a smaller embedding model for optimization
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

hdbscan_model = HDBSCAN(min_cluster_size=5, prediction_data=True)

topic_model = BERTopic(seed_topic_list=seed_topic_list,
                        umap_model=UMAP(n_components=10),
                        hdbscan_model=hdbscan_model,
                        embedding_model=embedding_model,
                        nr_topics=25,
                        low_memory=True
                        )
Input_text = [doc for doc in Input_text if len(doc.split()) > 5]

topics, probs = topic_model.fit_transform(Input_text)
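
Guided topic modelling uses the seed list to steer clustering, but the topic IDs returned by `fit_transform` are cluster indices rather than positions in the seed list. One possible way to attach a tag name to each discovered topic is to compare a vector for each topic with a vector for each tag and pick the closest match. The sketch below is a minimal illustration under assumptions: `cosine` and `label_topics` are hypothetical helpers, and the toy 2-D vectors stand in for real embeddings (for example, ones produced by the `embedding_model` above).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_topics(topic_vectors, label_vectors):
    """Map each topic ID to the name of its most similar label.

    topic_vectors: {topic_id: vector}, e.g. the mean embedding of a topic's pages
    label_vectors: {label_name: vector}, e.g. the embedding of each tag name
    """
    mapping = {}
    for topic_id, vec in topic_vectors.items():
        if topic_id == -1:  # outlier cluster: leave untagged
            continue
        mapping[topic_id] = max(
            label_vectors, key=lambda name: cosine(vec, label_vectors[name])
        )
    return mapping


# Toy vectors for illustration only; real embeddings have hundreds of dimensions
toy_labels = {"Auto Loans": [1.0, 0.0], "Student Loans": [0.0, 1.0]}
toy_topics = {-1: [0.5, 0.5], 0: [0.1, 0.9], 1: [0.8, 0.2]}
topic_names = label_topics(toy_topics, toy_labels)
print(topic_names)
# {0: 'Student Loans', 1: 'Auto Loans'}
```

A lookup built this way does not depend on the order in which topics happen to be numbered.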

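The model has assigned each page one topic ID, with -1 marking pages it could not place in any topic. Note that BERTopic's topic IDs are cluster indices and are not guaranteed to follow the order of `topics_to_model`, so the tag names printed below are indicative only. We can now view the results: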

In [74]:
print("Topics by page:")
print(topics)
print("="*40)
print("Unique topics:")

# collect topic IDs in order of first appearance
topic_unique = []
for i in topics:
    if i not in topic_unique:
        topic_unique.append(i)

# -1 is how BERTopic indicates that it isn't sure what topic to assign a page.
# NOTE: indexing topics_to_model by topic ID assumes the IDs follow the seed order.
for i in topic_unique:
    print(topics_to_model[i] if i != -1 else 'Bertopic was unsure')
print("="*40)

print("Probabilities (of the assigned topic per page):")
print(probs)

print("="*40)
print("Share of pages per tag (by counting frequency of page labels):")
from collections import defaultdict

Topics_count = defaultdict(int)

for i in topics:
    Topics_count[i] += 1

# divide by the actual number of pages rather than a hard-coded 37
Top_probs = {'Bertopic was unsure' if i == -1 else topics_to_model[i]: j/len(topics)*100 for i, j in Topics_count.items()}

for i,j in Top_probs.items():
    print('tag:',i,'\n',f"share of pages: {round(j,2)}%",'\n')

Topics by page:
[1, 1, 1, 1, -1, 2, 1, 1, 1, 2, 2, -1, 1, -1, -1, 2, -1, -1, 2, -1, 2, 2, 2, -1, -1, -1, -1, 0, 0, 0, -1, 0, 0, -1, -1, 2, -1]
========================================
Unique topics:
Auto Loans
Bertopic was unsure
Personal Loans
Mortgage Loans (Home Loans)
========================================
Probabilities (of the assigned topic per page):
[1.         1.         1.         1.         0.         1.
 0.98763085 1.         0.94671824 1.         1.         0.
 0.95501062 0.         0.         0.92570402 0.         0.
 0.91793763 0.         1.         1.         0.93854553 0.
 0.         0.         0.         1.         1.         1.
 0.         1.         1.         0.         0.         0.95472905
 0.        ]
========================================
Share of pages per tag (by counting frequency of page labels):
tag: Auto Loans 
 share of pages: 21.62% 

tag: Bertopic was unsure 
 share of pages: 40.54% 

tag: Personal Loans 
 share of pages: 24.32% 

tag: Mortgage Loans (Home Loans) 
 share of pages: 13.51% 
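
The proposal calls for 3 tags per application, so the page-level labels still need to be reduced to a short list. The sketch below shows one simple way to do that: count how often each topic ID occurs, ignore the unassigned pages (-1), and keep the most frequent IDs. `top_tags` is a hypothetical helper, and the ID-to-name lookup mirrors the notebook's own `topics_to_model[i]` indexing, which assumes topic IDs line up with the tag list.

```python
from collections import Counter


def top_tags(page_topics, tag_names, k=3):
    """Return up to k tag names, most frequent first, ignoring outlier pages (-1)."""
    counts = Counter(t for t in page_topics if t != -1)
    return [tag_names[t] for t, _ in counts.most_common(k)]


# Page-level topic IDs copied from the output above
page_topics = [1, 1, 1, 1, -1, 2, 1, 1, 1, 2, 2, -1, 1, -1, -1, 2, -1, -1, 2,
               -1, 2, 2, 2, -1, -1, -1, -1, 0, 0, 0, -1, 0, 0, -1, -1, 2, -1]
# First three entries of topics_to_model (assumed to match topic IDs 0, 1, 2)
tag_names = ['Mortgage Loans (Home Loans)', 'Auto Loans', 'Personal Loans']

application_tags = top_tags(page_topics, tag_names)
print(application_tags)
# ['Personal Loans', 'Auto Loans', 'Mortgage Loans (Home Loans)']
```

Counting pages is a coarse heuristic. Weighting each page by its assignment probability would be a natural refinement.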



Training and fine-tuning can be performed on the representation and embedding models used, but for demonstration purposes we will treat that as the responsibility of MLOps. As a further improvement, we can consider performing unsupervised topic modelling to see if any distinct patterns emerge.