<a href="https://colab.research.google.com/github/Wrynaft/FinancialDataMining/blob/branch%2Fsimulation/WIE3007_DM_GroupProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **WIE3007 Data Mining Group Project**

This project must be relevant to the financial/business sector. We simulate a bank loan risk dataset and use it for loan default prediction.

- Target variable: Default_Status (1 = default, 0 = no default)
- Numerical features: Credit_Score, Annual_Income, Loan_Amount
- Textual features: Loan_Purpose_Text, Risk_Analyst_Note


The business objectives are outlined as follows:

- To automate the loan approval process and reduce the manual workload of risk analysts while maintaining decision quality.
- To minimise financial risk for banking institutions by accurately identifying high-risk loan applicants before funds are approved.

## DO NOT RUN THESE CELLS!!!!

**1. Dataset Simulation**

In [None]:
!pip install -q -U google-generativeai

In [None]:
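import google.generativeai as genai
from google.colab import files
import pandas as pd
import json
import os
import time
import math

# Configuration
# Read the API key from the environment rather than hardcoding it in the notebook.
# GOOGLE_API_KEY is an assumed variable name; set it before running this cell.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])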

model = genai.GenerativeModel("models/gemini-2.5-pro")
print("Model configured: gemini-2.5-pro")

Model configured: gemini-2.5-pro


**Logic behind loan default assignment based on credit score and loan amount**

Every customer starts with a 10% (0.1) base probability of default, which we take as a reasonable assumption for this simulated dataset. A credit score below 600 is treated as high risk and adds 40 percentage points, since a lower credit score implies a higher default probability. A loan amount above 40,000 adds a further 20 percentage points, since a larger loan is harder to repay. The prompt in the next cell passes these rules to the model only as a qualitative hint ("likely 1 if ..."), so the generated labels follow them approximately rather than exactly.
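
The additive rule described above can be sketched as a small function. This is a minimal illustration, not the code that produced the dataset (the labels came from the Gemini prompt). The names `default_probability` and `assign_default` are hypothetical, and the seeded random generator is an assumption made so the draw is reproducible.

```python
import random

BASE_RATE = 0.10           # base probability of default for every customer
LOW_SCORE_PENALTY = 0.40   # added when Credit_Score < 600
LARGE_LOAN_PENALTY = 0.20  # added when Loan_Amount > 40000


def default_probability(credit_score: int, loan_amount: float) -> float:
    """Return the default probability implied by the additive rule."""
    prob = BASE_RATE
    if credit_score < 600:
        prob += LOW_SCORE_PENALTY
    if loan_amount > 40_000:
        prob += LARGE_LOAN_PENALTY
    return prob


def assign_default(credit_score: int, loan_amount: float, rng: random.Random) -> int:
    """Draw a 0/1 Default_Status from the rule's probability."""
    return int(rng.random() < default_probability(credit_score, loan_amount))


rng = random.Random(42)  # hypothetical seed, for reproducibility only
for score, amount in [(580, 42500.5), (790, 25000.0), (720, 48000.0)]:
    p = default_probability(score, amount)
    label = assign_default(score, amount, rng)
    print(f"score={score}, amount={amount}: p={p:.2f}, label={label}")
```

Under this rule the probabilities range from 0.10 (good score, small loan) to 0.70 (low score and large loan).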

In [None]:
# Parameters
TARGET_RECORDS = 1000
BATCH_SIZE = 20
NUM_LOOPS = math.ceil(TARGET_RECORDS / BATCH_SIZE)

print(f"Goal: {TARGET_RECORDS} records in {NUM_LOOPS} batches.")
print("Starting generation...")

# Container for all data
all_records = []

# Prompt
prompt_template = """
Generate {batch_size} synthetic loan application records as a JSON array.
Return ONLY valid JSON.
Fields:
- Transaction_ID (UUID)
- Customer_Name
- Application_Date (YYYY-MM-DD, last 2 years)
- Annual_Income (float, >=30000)
- Credit_Score (int 300-850, positively correlated with income)
- Loan_Amount (float 5000-50000)
- Loan_Purpose_Text
- Risk_Analyst_Note (1 short sentence)
- Default_Status (0 or 1; likely 1 if Credit_Score < 600 or Loan_Amount > 40000)
"""

# Generation loops
for i in range(NUM_LOOPS):
    success = False
    attempts = 0

    while not success and attempts < 3:
        print(f"Batch {i+1}/{NUM_LOOPS}...", end=" ")

        try:
            response = model.generate_content(
                prompt_template.format(batch_size=BATCH_SIZE),
                generation_config={
                    "max_output_tokens": 8192,
                    "temperature": 0.7,
                    "response_mime_type": "application/json"
                }
            )

            # Parse JSON
            batch_data = json.loads(response.text)

            if isinstance(batch_data, list) and len(batch_data) > 0:
                all_records.extend(batch_data)
                print(f"Success! (Total: {len(all_records)})")
                success = True

                # OPTIONAL: Small sleep to keep the Pro model connection stable
                time.sleep(1)
            else:
                print("Invalid JSON. Retrying...")
                attempts += 1
                time.sleep(2)

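        except Exception as e:
            print(f"\nError: {e}")
            if "429" in str(e):
                print("!!! QUOTA HIT !!!")
                print("Pausing for 60 seconds...")
                time.sleep(60)
            else:
                time.sleep(5)
            attempts += 1

    # Report batches that exhausted all retries so the gap is not silent
    if not success:
        print(f"Batch {i+1}/{NUM_LOOPS} failed after {attempts} attempts; skipping.")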

Goal: 1000 records in 50 batches.
Starting generation...
Batch 1/50... Success! (Total: 20)
Batch 2/50... 
Error: Unterminated string starting at: line 135 column 5 (char 4468)
Batch 2/50... Success! (Total: 40)
Batch 3/50... Success! (Total: 60)
Batch 4/50... Success! (Total: 80)
Batch 5/50... Success! (Total: 100)
Batch 6/50... Success! (Total: 120)
Batch 7/50... Success! (Total: 140)
Batch 8/50... Success! (Total: 160)
Batch 9/50... Success! (Total: 180)
Batch 10/50... Success! (Total: 200)
Batch 11/50... Success! (Total: 220)
Batch 12/50... Success! (Total: 240)
Batch 13/50... Success! (Total: 260)
Batch 14/50... Success! (Total: 280)
Batch 15/50... Success! (Total: 300)
Batch 16/50... Success! (Total: 320)
Batch 17/50... 
Error: Unterminated string starting at: line 197 column 5 (char 6781)
Batch 17/50... Success! (Total: 340)
Batch 18/50... Success! (Total: 360)
Batch 19/50... Success! (Total: 380)
Batch 20/50... 
Error: Expecting property name enclosed in double quotes: line 193
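
The log shows batches failing on malformed JSON and being retried. One way to make a batch that exhausts its retries visible, rather than discovering the gap afterwards, is to record the failed batch numbers. The sketch below is an illustration under stated assumptions: `generate_with_retry` and `fake_generate` are hypothetical names, and `fake_generate` stands in for the `model.generate_content(...)` call so the example runs without an API key.

```python
import json
from typing import Callable, List, Optional


def generate_with_retry(generate: Callable[[], str], max_attempts: int = 3) -> Optional[list]:
    """Call generate() until it returns a non-empty JSON array, or give up."""
    for _ in range(max_attempts):
        try:
            data = json.loads(generate())
        except json.JSONDecodeError:
            continue  # malformed JSON: try again
        if isinstance(data, list) and data:
            return data
    return None


# Hypothetical stand-in for the model call: truncated JSON first, then valid JSON.
responses = iter(['[{"Credit_Score": 580', '[{"Credit_Score": 580}]'])


def fake_generate() -> str:
    return next(responses)


all_rows: List[dict] = []
failed_batches: List[int] = []

batch = generate_with_retry(fake_generate)
if batch is None:
    failed_batches.append(1)  # remember which batch to regenerate later
else:
    all_rows.extend(batch)

print(f"rows={len(all_rows)}, failed batches={failed_batches}")
```

Keeping a `failed_batches` list means the repair step can loop over exactly the batches that were skipped.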

In [None]:
# Repair missing batch (Batch 43)
# We want exactly 20 more records to fill the gap from the failed batch

MISSING_BATCH_SIZE = 20

print(f"Current total records: {len(all_records)}")
print(f"Attempting to generate missing batch of {MISSING_BATCH_SIZE} records...")

success = False
attempts = 0

while not success and attempts < 3:
    try:
        # Use the same prompt template and model from your previous cell
        response = model.generate_content(
            prompt_template.format(batch_size=MISSING_BATCH_SIZE),
            generation_config={
                "max_output_tokens": 8192,
                "temperature": 0.7,
                "response_mime_type": "application/json"
            }
        )

        new_data = json.loads(response.text)

        if isinstance(new_data, list) and len(new_data) > 0:
            # APPEND to the existing list (do not overwrite it)
            all_records.extend(new_data)
            print(f"Success! Added {len(new_data)} records.")
            print(f"Final Total: {len(all_records)}")
            success = True
        else:
            print("Invalid JSON received. Retrying...")
            attempts += 1
            time.sleep(2)

    except Exception as e:
        print(f"Error: {e}")
        attempts += 1
        time.sleep(5)

Current total records: 980
Attempting to generate missing batch of 20 records...
Success! Added 20 records.
Final Total: 1000


In [None]:
# Save & download final dataset
if len(all_records) > 0:
    print("\nProcessing CSV...")
    df = pd.DataFrame(all_records)

    # Save to local Colab memory
    filename = "synthetic_loan_data.csv"
    df.to_csv(filename, index=False)
    print(f"Saved {len(df)} rows to {filename}")

    # Trigger download (files was imported from google.colab in the setup cell)
    try:
        files.download(filename)
    except Exception as e:
        print(f"Download failed ({e}). Check the 'Files' folder on the left.")
else:
    print("No records to save.")


Processing CSV...
Saved 1000 rows to synthetic_loan_data.csv



In [None]:
# The model emitted 3 near-duplicate columns with inconsistent casing
# (Customer_name, Annual_income, Risk_Analyst_note), which are mostly null.
# Drop these 3 columns, then download the CSV file again.

df = pd.DataFrame(all_records)

print(f"Original shape: {df.shape} (Rows, Columns)")
print(f"Columns before: {df.columns.tolist()}")

# Drop by name rather than by position, so the result does not depend on column order
duplicate_cols = ['Customer_name', 'Annual_income', 'Risk_Analyst_note']
df_cleaned = df.drop(columns=duplicate_cols)

print(f"New shape: {df_cleaned.shape}")
print(f"Columns after: {df_cleaned.columns.tolist()}")

filename = "synthetic_loan_data_cleaned.csv"
df_cleaned.to_csv(filename, index=False)
print(f"\nSaved cleaned file to {filename}")

try:
    files.download(filename)
except Exception:
    print("Download failed. Check the 'Files' folder on the left.")


Original shape: (1000, 12) (Rows, Columns)
Columns before: ['Transaction_ID', 'Customer_Name', 'Application_Date', 'Annual_Income', 'Credit_Score', 'Loan_Amount', 'Loan_Purpose_Text', 'Risk_Analyst_Note', 'Default_Status', 'Customer_name', 'Annual_income', 'Risk_Analyst_note']
New shape: (1000, 9)
Columns after: ['Transaction_ID', 'Customer_Name', 'Application_Date', 'Annual_Income', 'Credit_Score', 'Loan_Amount', 'Loan_Purpose_Text', 'Risk_Analyst_Note', 'Default_Status']

Saved cleaned file to synthetic_loan_data_cleaned.csv
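
Dropping the three case-variant columns discards any values that appear only there. An alternative, sketched below under the assumption that each row fills either the canonical column or its variant, is to fold the variants into the canonical columns before dropping them. The helper name `coalesce_variant_columns` and the toy rows are hypothetical.

```python
import pandas as pd

# Variant column -> canonical column, as listed in the output above
VARIANTS = {
    "Customer_name": "Customer_Name",
    "Annual_income": "Annual_Income",
    "Risk_Analyst_note": "Risk_Analyst_Note",
}


def coalesce_variant_columns(frame: pd.DataFrame, variants: dict) -> pd.DataFrame:
    """Fill gaps in each canonical column from its variant, then drop the variant."""
    out = frame.copy()
    for variant, canonical in variants.items():
        if variant not in out.columns:
            continue
        if canonical in out.columns:
            out[canonical] = out[canonical].fillna(out[variant])
        else:
            out[canonical] = out[variant]
        out = out.drop(columns=variant)
    return out


# Toy rows (illustrative only): the second record used the lowercase variants
toy = pd.DataFrame([
    {"Customer_Name": "Jennifer Smith", "Annual_Income": 45000.0},
    {"Customer_name": "Mary Johnson", "Annual_income": 82000.0},
])
merged = coalesce_variant_columns(toy, VARIANTS)
print(merged)
```

Whether this is preferable to dropping depends on how many rows carry data only in the variant columns, which is worth checking with `df.isna().sum()` first.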



In [None]:
# Check counts of each category
print(df['Default_Status'].value_counts())

Default_Status
0    575
1    425
Name: count, dtype: int64
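
With 575 non-defaults and 425 defaults, 42.5% of records are in the positive class. A quick way to view proportions instead of raw counts, and to compare the default rate on either side of the credit score threshold used in the prompt, is sketched below. The toy frame and its labels are illustrative only; on the real data the same calls would apply to `df`.

```python
import pandas as pd

# Illustrative rows only, not taken from the generated dataset
toy = pd.DataFrame({
    "Credit_Score":   [580, 790, 695, 510, 720, 560],
    "Default_Status": [1,   0,   0,   1,   0,   0],
})

# Class proportions rather than raw counts
proportions = toy["Default_Status"].value_counts(normalize=True)
print(proportions)

# Default rate on each side of the Credit_Score < 600 threshold
low_score = toy["Credit_Score"] < 600
rate_by_group = toy.groupby(low_score)["Default_Status"].mean()
print(rate_by_group)
```

If the generated labels follow the prompt's hint, the default rate in the low-score group should be clearly higher than in the other group.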


# Feature Engineering

In [1]:
import pandas as pd

# Load the CSV file into a pandas DataFrame
df = pd.read_csv('synthetic_loan_data_cleaned.csv')

# Display the first 5 rows of the DataFrame
print(df.head())

print("DataFrame loaded and first 5 rows displayed successfully.")

                         Transaction_ID     Customer_Name Application_Date  \
0  a1b2c3d4-e5f6-7890-1234-567890abcdef    Jennifer Smith       2023-05-15   
1  b2c3d4e5-f6a7-8901-2345-67890abcdef1  Michael Williams       2024-01-20   
2  c3d4e5f6-a7b8-9012-3456-7890abcdef12      Mary Johnson       2022-11-01   
3  d4e5f6a7-b8c9-0123-4567-890abcdef123       David Brown       2023-09-10   
4  e5f6a7b8-c9d0-1234-5678-90abcdef1234    Patricia Jones       2024-03-22   

   Annual_Income  Credit_Score  Loan_Amount    Loan_Purpose_Text  \
0       45000.00           580      42500.5   Debt Consolidation   
1      155000.75           790      25000.0     Home Improvement   
2       82000.00           695      15000.0         Car Purchase   
3       38000.50           510       8000.0     Medical Expenses   
4      120000.00           720      48000.0  Business Investment   

                                   Risk_Analyst_Note  Default_Status  
0  High risk due to low credit score and large lo..

Concatenate the 'Loan_Purpose_Text' and 'Risk_Analyst_Note' columns into a new 'Combined_Text' column, using a period and a space as the separator, then display the first few rows to verify the result.

In [2]:
df['Combined_Text'] = df['Loan_Purpose_Text'] + '. ' + df['Risk_Analyst_Note']
print(df[['Loan_Purpose_Text', 'Risk_Analyst_Note', 'Combined_Text']].head())
print("Combined_Text column created and displayed successfully.")

     Loan_Purpose_Text                                  Risk_Analyst_Note  \
0   Debt Consolidation  High risk due to low credit score and large lo...   
1     Home Improvement       Low risk profile with strong credit history.   
2         Car Purchase     Standard application, requires routine checks.   
3     Medical Expenses                 High risk due to low credit score.   
4  Business Investment         Large loan amount requires further review.   

                                       Combined_Text  
0  Debt Consolidation. High risk due to low credi...  
1  Home Improvement. Low risk profile with strong...  
2  Car Purchase. Standard application, requires r...  
3  Medical Expenses. High risk due to low credit ...  
4  Business Investment. Large loan amount require...  
Combined_Text column created and displayed successfully.
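
String concatenation with `+` yields NaN for any row where either column is missing. If the loaded CSV has gaps in the text columns (an assumption; the notebook does not check this), a NaN-tolerant variant could look like the following sketch. The function name `combine_text` is hypothetical.

```python
import pandas as pd


def combine_text(frame: pd.DataFrame, cols: list, sep: str = ". ") -> pd.Series:
    """Join the non-missing, non-empty values of cols row by row."""
    def join_row(row: pd.Series) -> str:
        parts = [str(v).strip() for v in row if pd.notna(v) and str(v).strip()]
        return sep.join(parts)
    return frame[cols].apply(join_row, axis=1)


# Toy rows: the second one has no analyst note
toy = pd.DataFrame({
    "Loan_Purpose_Text": ["Car Purchase", "Medical Expenses"],
    "Risk_Analyst_Note": ["Standard application, requires routine checks.", None],
})
toy["Combined_Text"] = combine_text(toy, ["Loan_Purpose_Text", "Risk_Analyst_Note"])
print(toy["Combined_Text"].tolist())
```

Rows with a missing note keep their loan purpose text instead of becoming NaN, so they can still be passed to the text models.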


## Installing Libraries

In [4]:
!pip install torch torchvision
!pip install transformers

INFO: pip is looking at multiple versions of torchvision to determine which version is compatible with other requirements. This could take a while.
Collecting torchvision
  Downloading torchvision-0.24.1-cp311-cp311-win_amd64.whl.metadata (5.9 kB)
Collecting torch
  Using cached torch-2.9.1-cp311-cp311-win_amd64.whl.metadata (30 kB)
Collecting sympy>=1.13.3 (from torch)
  Using cached sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Downloading torchvision-0.24.1-cp311-cp311-win_amd64.whl (4.0 MB)
Using cached torch-2.9.1-cp311-cp311-win_amd64.whl (111.0 MB)
Using cached sympy-1.14.0-py3-none-any.whl (6.3 MB)
Installing collected packages: sympy, torch, torchvision

  Attempting uninstall: sympy

    Found existing installation: sympy 1.13.1

   --------------------------

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.5.1+cu121 requires torch==2.5.1+cu121, but you have torch 2.9.1 which is incompatible.




## Initialise Hugging Face Pipelines

Load pre-trained Hugging Face models and initialise pipelines for sentiment analysis and zero-shot classification. These pipelines will be used to extract features from the text columns.

In [3]:
import torch
from transformers import pipeline

# Determine the device to use (GPU if available, otherwise CPU)
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'cuda' if device == 0 else 'cpu'}")

# Initialize sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model="Sigma/financial-sentiment-analysis", device=device, framework="pt")
print("Sentiment analysis pipeline initialized.")

# Initialize zero-shot classification pipeline
zero_shot_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=device, framework="pt")
print("Zero-shot classification pipeline initialized.")

Using device: cuda


Device set to use cuda:0


Sentiment analysis pipeline initialized.


Device set to use cuda:0


Zero-shot classification pipeline initialized.
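
Once initialised, the two pipelines can be applied to `Combined_Text` to produce model features. The sketch below shows one possible shape for that step; it is not taken from the notebook. The candidate labels, the output column names, and the function `extract_text_features` are hypothetical. The stub callables imitate the pipelines' return shapes (a list of `{'label', 'score'}` dicts for sentiment analysis, and a dict with `'labels'` and `'scores'` sorted by score for single-input zero-shot classification) so the example runs without downloading any model.

```python
import pandas as pd

CANDIDATE_LABELS = ["high risk", "low risk"]  # hypothetical label set


def extract_text_features(texts, sentiment_fn, zero_shot_fn, candidate_labels):
    """Return a DataFrame with one row of text-derived features per input text."""
    rows = []
    for text in texts:
        sentiment = sentiment_fn(text)[0]                 # {'label': ..., 'score': ...}
        zero_shot = zero_shot_fn(text, candidate_labels)  # labels sorted by score
        rows.append({
            "Sentiment_Label": sentiment["label"],
            "Sentiment_Score": sentiment["score"],
            "Risk_Category": zero_shot["labels"][0],
            "Risk_Category_Score": zero_shot["scores"][0],
        })
    return pd.DataFrame(rows)


# Stubs standing in for sentiment_pipeline and zero_shot_pipeline
def stub_sentiment(text):
    label = "negative" if "High risk" in text else "neutral"
    return [{"label": label, "score": 0.9}]


def stub_zero_shot(text, candidate_labels):
    first = [lab for lab in candidate_labels if lab in text.lower()]
    rest = [lab for lab in candidate_labels if lab not in first]
    return {"sequence": text, "labels": first + rest, "scores": [0.8, 0.2]}


features = extract_text_features(
    ["Medical Expenses. High risk due to low credit score.",
     "Home Improvement. Low risk profile with strong credit history."],
    stub_sentiment, stub_zero_shot, CANDIDATE_LABELS,
)
print(features)
```

With the real models, `sentiment_pipeline` and `zero_shot_pipeline` would be passed in place of the stubs, and `df['Combined_Text'].tolist()` in place of the two example strings.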
