<a href="https://colab.research.google.com/github/Wrynaft/FinancialDataMining/blob/branch%2Fsimulation/WIE3007_DM_GroupProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **WIE3007 Data Mining Group Project**

This project must be relevant to the financial/business sector. We simulate a bank loan risk dataset and use it for loan default prediction.

- Target variable: Loan_Default (Yes/No)
- Numerical features: Credit score, annual income, loan amount
- Textual features: Loan purpose description, risk analyst note


The business objectives are outlined as follows:

- To automate the loan approval process and reduce the manual workload of risk analysts while maintaining decision quality.
- To minimise financial risk for banking institutions by accurately identifying high-risk loan applicants before funds are approved.

## DO NOT RUN THESE CELLS!!!!

**1. Dataset Simulation**

In [None]:
!pip install -q -U google-generativeai

In [None]:
import google.generativeai as genai
from google.colab import files
import pandas as pd
import json
import time
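import math
import os

# Configuration
# Read the API key from an environment variable instead of hard-coding it in the notebook.
# The variable name GOOGLE_API_KEY is an assumption; set it before running this cell.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])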

model = genai.GenerativeModel("models/gemini-2.5-pro")
print("Model configured: gemini-2.5-pro")

Model configured: gemini-2.5-pro


**Logic behind loan default assignment based on credit score and loan amount**

Every customer starts with a 10% (0.1) base probability of defaulting, which we assume as a realistic baseline for this simulated dataset. A credit score below 600 is treated as high risk and adds 40 percentage points to the default probability, since a lower credit score implies a higher chance of default. A loan amount above 40,000 adds a further 20 percentage points, since a larger loan is harder to repay.
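
The rule above can be sketched as a small function. This is only an illustration under stated assumptions: in this notebook the `Default_Status` label is assigned by the Gemini model from the prompt, not by this code, and the names `default_probability` and `sample_default` are hypothetical helpers introduced here.

```python
import random

# Values taken from the description above
BASE_RATE = 0.10           # base chance of default for every customer
LOW_SCORE_PENALTY = 0.40   # added when Credit_Score < 600
LARGE_LOAN_PENALTY = 0.20  # added when Loan_Amount > 40000


def default_probability(credit_score, loan_amount):
    """Return the default probability implied by the rule above."""
    p = BASE_RATE
    if credit_score < 600:
        p += LOW_SCORE_PENALTY
    if loan_amount > 40000:
        p += LARGE_LOAN_PENALTY
    return p


def sample_default(credit_score, loan_amount, rng=random):
    """Draw a 0/1 default label using the probability above (hypothetical helper)."""
    return int(rng.random() < default_probability(credit_score, loan_amount))


print(round(default_probability(720, 15000), 2))  # 0.1
print(round(default_probability(550, 15000), 2))  # 0.5
print(round(default_probability(720, 45000), 2))  # 0.3
print(round(default_probability(550, 45000), 2))  # 0.7
```

Under this reading, the worst case (low score and large loan) gives a 70% default probability, so even the riskiest simulated applicants do not always default.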

In [None]:
# Parameters
TARGET_RECORDS = 1000
BATCH_SIZE = 20
NUM_LOOPS = math.ceil(TARGET_RECORDS / BATCH_SIZE)

print(f"Goal: {TARGET_RECORDS} records in {NUM_LOOPS} batches.")
print("Starting generation...")

# Container for all data
all_records = []

# Prompt
prompt_template = """
Generate {batch_size} synthetic loan application records as a JSON array.
Return ONLY valid JSON.
Fields:
- Transaction_ID (UUID)
- Customer_Name
- Application_Date (YYYY-MM-DD, last 2 years)
- Annual_Income (float, >=30000)
- Credit_Score (int 300-850, positively correlated with income)
- Loan_Amount (float 5000-50000)
- Loan_Purpose_Text
- Risk_Analyst_Note (1 short sentence)
- Default_Status (0 or 1; likely 1 if Credit_Score < 600 or Loan_Amount > 40000)
"""

# Generation loops
for i in range(NUM_LOOPS):
    success = False
    attempts = 0

    while not success and attempts < 3:
        print(f"Batch {i+1}/{NUM_LOOPS}...", end=" ")

        try:
            response = model.generate_content(
                prompt_template.format(batch_size=BATCH_SIZE),
                generation_config={
                    "max_output_tokens": 8192,
                    "temperature": 0.7,
                    "response_mime_type": "application/json"
                }
            )

            # Parse JSON
            batch_data = json.loads(response.text)

            if isinstance(batch_data, list) and len(batch_data) > 0:
                all_records.extend(batch_data)
                print(f"Success! (Total: {len(all_records)})")
                success = True

                # OPTIONAL: Small sleep to keep the Pro model connection stable
                time.sleep(1)
            else:
                print("Invalid JSON. Retrying...")
                attempts += 1
                time.sleep(2)

        except Exception as e:
            print(f"\nError: {e}")
            if "429" in str(e):
                print("!!! QUOTA HIT !!!")
                print("Pausing for 60 seconds...")
                time.sleep(60)
            else:
                time.sleep(5)
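            attempts += 1

    # Report a batch that used up its retries so the gap can be repaired later
    if not success:
        print(f"Batch {i+1} failed after {attempts} attempts; skipping.")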

Goal: 1000 records in 50 batches.
Starting generation...
Batch 1/50... Success! (Total: 20)
Batch 2/50... 
Error: Unterminated string starting at: line 135 column 5 (char 4468)
Batch 2/50... Success! (Total: 40)
Batch 3/50... Success! (Total: 60)
Batch 4/50... Success! (Total: 80)
Batch 5/50... Success! (Total: 100)
Batch 6/50... Success! (Total: 120)
Batch 7/50... Success! (Total: 140)
Batch 8/50... Success! (Total: 160)
Batch 9/50... Success! (Total: 180)
Batch 10/50... Success! (Total: 200)
Batch 11/50... Success! (Total: 220)
Batch 12/50... Success! (Total: 240)
Batch 13/50... Success! (Total: 260)
Batch 14/50... Success! (Total: 280)
Batch 15/50... Success! (Total: 300)
Batch 16/50... Success! (Total: 320)
Batch 17/50... 
Error: Unterminated string starting at: line 197 column 5 (char 6781)
Batch 17/50... Success! (Total: 340)
Batch 18/50... Success! (Total: 360)
Batch 19/50... Success! (Total: 380)
Batch 20/50... 
Error: Expecting property name enclosed in double quotes: line 193

In [None]:
# Repair missing batch (Batch 43)
# We want exactly 20 more records to fill the gap from the failed batch

MISSING_BATCH_SIZE = 20

print(f"Current total records: {len(all_records)}")
print(f"Attempting to generate missing batch of {MISSING_BATCH_SIZE} records...")

success = False
attempts = 0

while not success and attempts < 3:
    try:
        # Use the same prompt template and model as the previous cell
        response = model.generate_content(
            prompt_template.format(batch_size=MISSING_BATCH_SIZE),
            generation_config={
                "max_output_tokens": 8192,
                "temperature": 0.7,
                "response_mime_type": "application/json"
            }
        )

        new_data = json.loads(response.text)

        if isinstance(new_data, list) and len(new_data) > 0:
            # APPEND to the existing list (do not overwrite it)
            all_records.extend(new_data)
            print(f"Success! Added {len(new_data)} records.")
            print(f"Final Total: {len(all_records)}")
            success = True
        else:
            print("Invalid JSON received. Retrying...")
            attempts += 1
            time.sleep(2)

    except Exception as e:
        print(f"Error: {e}")
        attempts += 1
        time.sleep(5)

Current total records: 980
Attempting to generate missing batch of 20 records...
Success! Added 20 records.
Final Total: 1000
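
Because the records come from a language model, it may be worth checking them against the field constraints stated in the prompt before building a DataFrame. The sketch below is not part of the original pipeline: `REQUIRED_FIELDS` and `is_valid_record` are hypothetical names, and the sample records are made up purely for illustration.

```python
# Field names copied from the generation prompt
REQUIRED_FIELDS = {
    "Transaction_ID", "Customer_Name", "Application_Date", "Annual_Income",
    "Credit_Score", "Loan_Amount", "Loan_Purpose_Text", "Risk_Analyst_Note",
    "Default_Status",
}


def is_valid_record(record):
    """Return True if a record has every prompt field and in-range values."""
    if not isinstance(record, dict) or not REQUIRED_FIELDS.issubset(record):
        return False
    try:
        return (
            float(record["Annual_Income"]) >= 30000
            and 300 <= int(record["Credit_Score"]) <= 850
            and 5000 <= float(record["Loan_Amount"]) <= 50000
            and int(record["Default_Status"]) in (0, 1)
        )
    except (TypeError, ValueError):
        # A non-numeric value in a numeric field counts as invalid
        return False


# Made-up sample records, for illustration only
sample_records = [
    {
        "Transaction_ID": "00000000-0000-0000-0000-000000000001",
        "Customer_Name": "Example Customer",
        "Application_Date": "2024-01-15",
        "Annual_Income": 52000.0,
        "Credit_Score": 640,
        "Loan_Amount": 12000.0,
        "Loan_Purpose_Text": "Home renovation",
        "Risk_Analyst_Note": "Stable income, moderate risk.",
        "Default_Status": 0,
    },
    {"Transaction_ID": "incomplete-record", "Credit_Score": 900},
]

valid_records = [r for r in sample_records if is_valid_record(r)]
print(f"{len(valid_records)} of {len(sample_records)} records passed validation")
```

In the notebook, the same filter could be applied to `all_records` in place of `sample_records`.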
