# Explanation of the Bank Classification Dataset

## Overview
This document provides a detailed explanation of the synthetic dataset generated for a bank-related classification problem. The dataset is designed to simulate real-world scenarios where a bank predicts customer behavior, such as loan default or creditworthiness. It includes a mix of numerical and categorical features, ensuring that all columns are meaningful and relevant to the banking domain.

---

## Dataset Details
### **Rows and Columns**
- **Number of Rows:** 1,000,000
- **Number of Columns:** 23 (13 numerical and 10 categorical, excluding the target variable)

### **Target Variable**
- **Name:** `Target`
- **Type:** Categorical
- **Values:**
  - `Default`: Represents customers who default on their loans.
  - `No Default`: Represents customers who do not default.

### **Numerical Features**
1. **Age**: Customer's age in years (range: 18-75).
2. **Annual_Income**: Customer's annual income in USD, following a normal distribution centered around $50,000 with a standard deviation of $15,000.
3. **Loan_Amount**: The amount of the loan issued in USD, normally distributed around $20,000 with a standard deviation of $10,000.
4. **Savings_Balance**: Customer's current savings account balance in USD, normally distributed around $10,000 with a standard deviation of $5,000.
5. **Years_as_Customer**: Number of years the customer has been with the bank (range: 1-30).
6. **Credit_Score**: Customer's credit score (range: 300-850).
7. **Debt_to_Income_Ratio**: Debt-to-income ratio, sampled uniformly (range: 0.0-1.0).
8. **Credit_Utilization**: Ratio of credit used to the total available credit, sampled uniformly (range: 0.0-1.0).
9. **Number_of_Credit_Accounts**: Number of credit accounts held by the customer (range: 1-15).
10. **Loan_Term_in_Years**: Duration of the loan in years (range: 1-30).
11. **Monthly_Installment**: Monthly installment payment in USD (mean: $2,000, standard deviation: $800). The script samples it independently rather than deriving it from loan amount and term.
12. **Overdraft_Amount**: Overdraft amount used by the customer in USD (mean: $5,000, standard deviation: $2,000).
13. **Annual_Expenditure**: Customer's annual expenditure in USD (mean: $40,000, standard deviation: $12,000).

### **Categorical Features**
1. **Gender**: Customer's gender (“Male”, “Female”).
2. **Marital_Status**: Customer's marital status (“Single”, “Married”).
3. **Employment_Status**: Employment status (“Employed”, “Unemployed”, “Self-Employed”, “Student”, “Retired”).
4. **Education_Level**: Highest level of education completed (“High School”, “Bachelor’s”, “Master’s”, “Doctorate”).
5. **Loan_Purpose**: Purpose of the loan (“Car”, “Home”, “Education”, “Personal”, “Business”).
6. **Has_Credit_Card**: Indicates whether the customer owns a credit card (“Yes”, “No”).
7. **Is_Homeowner**: Indicates whether the customer owns a home (“Yes”, “No”).
8. **Account_Type**: Type of account held by the customer (“Savings”, “Business”).
9. **Customer_Segment**: Categorizes customers into segments (“Premium”, “Regular”).
10. **Preferred_Contact_Channel**: Preferred method of contact (“Email”, “Phone”, “In-Person”).

---

## Code Explanation
### **Numerical Features Generation**
The numerical features are generated using random distributions to simulate realistic data:
- **Uniform Distribution:** Used for features like `Debt_to_Income_Ratio` and `Credit_Utilization`.
- **Normal Distribution:** Used for features like `Annual_Income` and `Loan_Amount` to mimic typical customer distributions.
- **Integer Ranges:** Used for features like `Age` and `Years_as_Customer` to ensure realistic values.
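
Two details are worth checking when sampling these columns. `np.random.randint(low, high)` excludes `high`, so a documented range of 18-75 needs `high=76`. Normal draws are also unbounded, so monetary columns can come out negative unless they are clipped. The sketch below illustrates both points on a small sample. It uses NumPy's newer `Generator.integers`, which follows the same exclusive-upper-bound convention; the clipping step and the variable names are illustrative assumptions, not part of the generation script.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator for reproducibility
n = 100_000

# The upper bound is exclusive, so use 76 to include age 75
age = rng.integers(18, 76, size=n)

# Normal draws are unbounded; clip at zero if negative amounts make no sense
loan_raw = rng.normal(20_000, 10_000, size=n)
loan_clipped = np.clip(loan_raw, 0, None).round(2)

print("Age range:", age.min(), "-", age.max())
print("Share of negative raw loan amounts:", (loan_raw < 0).mean())
print("Minimum after clipping:", loan_clipped.min())
```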

### **Categorical Features Generation**
The categorical features are generated using `np.random.choice`. Most columns pass explicit probabilities through the `p` argument; `Employment_Status` and `Loan_Purpose` omit it, so their categories are sampled uniformly. For instance:
- `Gender` uses a 58%-42% split between "Male" and "Female."
- `Marital_Status` uses a 51%-49% split between "Single" and "Married."
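
A quick way to confirm that `np.random.choice` honours the requested probabilities is to compare them with the empirical frequencies. This is a minimal sketch on a smaller sample; the three-way split is the one used for the `Preferred_Contact_Channel` column, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
categories = ["Email", "Phone", "In-Person"]
probabilities = [0.5, 0.3, 0.2]

# Draw a sample and measure how often each category appears
sample = np.random.choice(categories, size=100_000, p=probabilities)
observed = pd.Series(sample).value_counts(normalize=True)

for category, expected in zip(categories, probabilities):
    print(f"{category}: expected {expected:.2f}, observed {observed[category]:.3f}")
```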

### **Derived Features**
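Features such as `Debt_to_Income_Ratio`, `Credit_Utilization`, and `Monthly_Installment` would be derived from other attributes in real banking data. The generation script samples each of them independently (uniform for the two ratios, normal for the installment), so they carry no built-in relationship to income, loan amount, or loan term.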
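
If `Monthly_Installment` were to be tied to the loan rather than sampled on its own, one option is the standard fixed-rate amortization formula. The sketch below is an illustration only: the 5% annual interest rate and the function name `monthly_installment` are assumptions that do not appear in the generation script.

```python
import numpy as np

def monthly_installment(loan_amount, term_years, annual_rate=0.05):
    """Fixed-rate amortized payment: P * r / (1 - (1 + r) ** -n)."""
    loan_amount = np.asarray(loan_amount, dtype=float)
    n_payments = np.asarray(term_years, dtype=float) * 12
    monthly_rate = annual_rate / 12  # assumed rate, hypothetical
    return loan_amount * monthly_rate / (1 - (1 + monthly_rate) ** -n_payments)

# Same principal, two different terms: the longer term has the lower payment
payments = monthly_installment([20_000, 20_000], [5, 10])
print(payments.round(2))
```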

### **Target Variable**
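The target variable (`Target`) is binary and reflects a 20%-80% split between "Default" and "No Default" cases, mimicking the class imbalance common in bank data. It is sampled independently of every feature, so no feature has a built-in relationship with the outcome.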

### **Script Execution**
The dataset is saved as a CSV file:
```python
# Save the dataset to a CSV file
data.to_csv("bank_classification_dataset.csv", index=False)
```

---

## Use Case
This dataset is suitable for:
- Training and testing classification models (e.g., logistic regression, random forests, neural networks).
- Practicing feature engineering, data preprocessing, and exploratory data analysis (EDA).
- Experimenting with imbalanced classification problems.

In [1]:
# imports
import numpy as np
import pandas as pd

In [None]:
# Set seed for reproducibility
np.random.seed(42)

# Define the number of rows
n_rows = 1_000_000

# Create numerical features
def generate_numerical_features(n_rows):
    return {
        # np.random.randint excludes the upper bound, so each `high` is one
        # above the documented maximum
        "Age": np.random.randint(18, 76, size=n_rows),
        "Annual_Income": np.random.normal(50000, 15000, size=n_rows).round(2),
        "Loan_Amount": np.random.normal(20000, 10000, size=n_rows).round(2),
        "Savings_Balance": np.random.normal(10000, 5000, size=n_rows).round(2),
        "Years_as_Customer": np.random.randint(1, 31, size=n_rows),
        "Credit_Score": np.random.randint(300, 851, size=n_rows),
        "Debt_to_Income_Ratio": np.random.uniform(0, 1, size=n_rows).round(2),
        "Credit_Utilization": np.random.uniform(0, 1, size=n_rows).round(2),
        "Number_of_Credit_Accounts": np.random.randint(1, 16, size=n_rows),
        "Loan_Term_in_Years": np.random.randint(1, 31, size=n_rows),
        "Monthly_Installment": (np.random.normal(2000, 800, size=n_rows)).round(2),
        "Overdraft_Amount": np.random.normal(5000, 2000, size=n_rows).round(2),
        "Annual_Expenditure": np.random.normal(40000, 12000, size=n_rows).round(2),
    }

# Create categorical features
def generate_categorical_features(n_rows):
    return {
        "Gender": np.random.choice(["Male", "Female"], size=n_rows, p=[0.58, 0.42]),
        "Marital_Status": np.random.choice(["Single", "Married"], size=n_rows, p=[0.51, 0.49]),
        "Employment_Status": np.random.choice(["Employed", "Unemployed", "Self-Employed", "Student", "Retired"], size=n_rows),
        "Education_Level": np.random.choice(["High School", "Bachelor's", "Master's", "Doctorate"], size=n_rows, p=[0.4, 0.4, 0.15, 0.05]),
        "Loan_Purpose": np.random.choice(["Car", "Home", "Education", "Personal", "Business"], size=n_rows),
        "Has_Credit_Card": np.random.choice(["Yes", "No"], size=n_rows, p=[0.7, 0.3]),
        "Is_Homeowner": np.random.choice(["Yes", "No"], size=n_rows, p=[0.6, 0.4]),
        "Account_Type": np.random.choice(["Savings", "Business"], size=n_rows, p=[0.65, 0.35]),
        "Customer_Segment": np.random.choice(["Premium", "Regular"], size=n_rows, p=[0.52, 0.48]),
        "Preferred_Contact_Channel": np.random.choice(["Email", "Phone", "In-Person"], size=n_rows, p=[0.5, 0.3, 0.2]),
    }

# Generate the target variable
def generate_target_variable(n_rows):
    return np.random.choice(["Default", "No Default"], size=n_rows, p=[0.2, 0.8])

# Combine all features into a DataFrame
numerical_features = generate_numerical_features(n_rows)
categorical_features = generate_categorical_features(n_rows)

data = pd.DataFrame({**numerical_features, **categorical_features})

# Add the target variable
data["Target"] = generate_target_variable(n_rows)

# Save the dataset to a CSV file
data.to_csv("bank_classification_dataset.csv", index=False)

print("Dataset generated and saved as 'bank_classification_dataset.csv'")


# EDA

In [None]:
# load the dataset
data = pd.read_csv("bank_classification_dataset.csv")
# copy the dataset
df = data.copy()
#shape of the dataset
print(df.shape)

In [None]:
# Basic statistics of the dataset
from IPython.display import display

data_summary = df.describe(include='all')

# Check for missing values
missing_values = df.isnull().sum()

# Display results
display(data_summary)
display(missing_values)


In [None]:
df.head()

### 1. **Correlation Heatmap for Numerical Features**

**Code:**
```python
plt.figure(figsize=(12, 8))
numerical_features = data.select_dtypes(include=['float64', 'int64']).columns
correlation_matrix = data[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Heatmap for Numerical Features")
plt.show()
```

**Explanation:**
This heatmap visualizes the correlation between all numerical features in the dataset. Correlation values range from -1 to 1:
- **1**: Perfect positive correlation.
- **-1**: Perfect negative correlation.
- **0**: No correlation.

**Insights:**
- Features with strong positive or negative correlations can indicate multicollinearity, which may need to be addressed during model building.
- In real banking data, `Debt_to_Income_Ratio` and `Loan_Amount` would likely be correlated. Here every numerical feature is sampled independently, so the off-diagonal cells should all be close to 0.
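
To make the scale of the heatmap concrete, the following sketch computes Pearson correlations for one independent pair and one linearly related pair of columns. The column names and the linear relationship are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000

demo = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, size=n),
    "independent_score": rng.uniform(0, 1, size=n),
})
# A column built from income plus noise, so it correlates with income
demo["spending"] = 0.8 * demo["income"] + rng.normal(0, 5_000, size=n)

corr = demo.corr()
print(corr.round(2))
```

The independent pair lands near 0, while the constructed pair lands close to 1.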

---

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the style for the plots
sns.set(style="whitegrid")

# 1. Correlation Heatmap for Numerical Features
plt.figure(figsize=(12, 8))
numerical_features = data.select_dtypes(include=['float64', 'int64']).columns
correlation_matrix = data[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Heatmap for Numerical Features")
plt.show()


### 2. **Boxplot: Loan Amount vs Loan Purpose**

**Code:**
```python
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Loan_Purpose', y='Loan_Amount')
plt.title("Loan Amount Distribution by Loan Purpose")
plt.xticks(rotation=45)
plt.show()
```

**Explanation:**
This boxplot shows the distribution of loan amounts for each loan purpose. Each box spans the interquartile range (IQR) with a line at the median, and the whiskers extend to the most extreme values within 1.5 × IQR of the box. Points beyond the whiskers are drawn as outliers.

**Insights:**
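- The plot shows which loan purposes have higher median loan amounts. In real data, `Home` loans would typically exceed `Car` loans; in this synthetic dataset `Loan_Amount` is sampled independently of `Loan_Purpose`, so the five boxes should look nearly identical.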
- Outliers can indicate exceptionally large loans for specific purposes, which might warrant further investigation.
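
The whisker limits follow the usual 1.5 × IQR rule, which can be reproduced by hand. A minimal sketch on a toy series (the values are invented for illustration):

```python
import pandas as pd

amounts = pd.Series([1, 2, 3, 4, 100])

# Quartiles and interquartile range
q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Values outside these limits are flagged as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = amounts[(amounts < lower_limit) | (amounts > upper_limit)]

print(f"IQR: {iqr}, limits: [{lower_limit}, {upper_limit}]")
print("Outliers:", outliers.tolist())
```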

---

In [None]:
# 2. Boxplot: Loan Amount vs Loan Purpose
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Loan_Purpose', y='Loan_Amount')
plt.title("Loan Amount Distribution by Loan Purpose")
plt.xticks(rotation=45)
plt.show()

### 3. **Bar Plot: Target Distribution**

**Code:**
```python
plt.figure(figsize=(8, 6))
sns.countplot(data=data, x='Target')
plt.title("Target Distribution (Default vs No Default)")
plt.show()
```

**Explanation:**
This bar plot shows the distribution of the target variable (`Default` vs `No Default`). It helps us understand the class balance in the dataset.

**Insights:**
- If one class is significantly larger than the other (here roughly 80% `No Default`), the dataset is imbalanced. This is common in real-world banking scenarios and usually calls for techniques like oversampling, undersampling, or class weighting during model training, together with evaluation metrics other than plain accuracy.
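
Class weighting is the lightest of these options. As a sketch, scikit-learn's `compute_class_weight` with the `"balanced"` setting assigns each class the weight `n_samples / (n_classes * class_count)`. The toy label array below mimics the 20%-80% split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = np.array(["Default"] * 20 + ["No Default"] * 80)
classes = np.unique(labels)

# "balanced" gives the rarer class a proportionally larger weight
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes, weights))
print(class_weights)
```

Many scikit-learn classifiers, including `RandomForestClassifier`, accept `class_weight="balanced"` directly.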

---

In [None]:
# 3. Bar Plot: Target Distribution
plt.figure(figsize=(8, 6))
sns.countplot(data=data, x='Target')
plt.title("Target Distribution (Default vs No Default)")
plt.show()

### 4. **Scatter Plot: Annual Income vs Loan Amount**

**Code:**
```python
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Annual_Income', y='Loan_Amount', hue='Target', alpha=0.3)
plt.title("Annual Income vs Loan Amount (by Target)")
plt.show()
```

**Explanation:**
This scatter plot visualizes the relationship between `Annual Income` and `Loan Amount`, colored by the target variable (`Default` vs `No Default`).

**Insights:**
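- In real data, this view can reveal clusters of customers with specific income and loan amount combinations.
- Defaults may cluster in specific ranges of income and loan amount, such as lower incomes and higher loans, indicating potential risk areas. Because `Target` is sampled independently of both variables here, the two classes should overlap completely.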

---

In [None]:
# 4. Scatter Plot: Annual Income vs Loan Amount
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Annual_Income', y='Loan_Amount', hue='Target', alpha=0.3)
plt.title("Annual Income vs Loan Amount (by Target)")
plt.show()

### 5. **Distribution of Debt-to-Income Ratio by Target**

**Code:**
```python
plt.figure(figsize=(10, 6))
sns.histplot(data=data, x='Debt_to_Income_Ratio', hue='Target', kde=True, bins=30, alpha=0.5)
plt.title("Debt-to-Income Ratio Distribution by Target")
plt.show()
```

**Explanation:**
This histogram shows the distribution of `Debt-to-Income Ratio` for each target class (`Default` vs `No Default`), with a kernel density estimate (KDE) overlay.

**Insights:**
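- In real lending data, customers with higher debt-to-income ratios are more likely to default, which would show up as the `Default` histogram shifting toward higher values. In this dataset `Debt_to_Income_Ratio` is uniform and independent of `Target`, so both classes should appear flat and differ only in height.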
- The KDE helps visualize the overall distribution and the overlap between the two classes, which is useful for assessing feature separability.

---

In [None]:
# 5. Distribution of Debt-to-Income Ratio by Target
plt.figure(figsize=(10, 6))
sns.histplot(data=data, x='Debt_to_Income_Ratio', hue='Target', kde=True, bins=30, alpha=0.5)
plt.title("Debt-to-Income Ratio Distribution by Target")
plt.show()

# Data Preprocessing Steps with Explanations and Comparisons

## Step 1: Handling Missing Values

### Explanation:
Missing values can bias a machine learning model and reduce its predictive power. The synthetic dataset is generated without missing values, so the imputers defined here act as a safeguard for real data. The strategies are:
- **Numerical Features:** Missing values are imputed using the **mean** because it is simple and leaves the feature's mean unchanged, although it reduces the variance.
  - **Alternative:** Median imputation is robust to outliers and skew; for roughly symmetric data, such as a normal distribution, it gives nearly the same value as the mean.
- **Categorical Features:** Missing values are imputed using the **most frequent value** because it is computationally efficient and preserves the mode of the data.
  - **Alternative:** K-Nearest Neighbors (KNN) imputation considers neighboring data points but is computationally expensive, especially on a dataset with one million rows.

### Code:
```python
from sklearn.impute import SimpleImputer

# Select the numerical and categorical columns
numerical_features = data.select_dtypes(include=["float64", "int64"]).columns
categorical_features = data.select_dtypes(include=["object"]).columns

# Define the imputers; they only take effect once fit_transform is called
numerical_imputer = SimpleImputer(strategy="mean")  # Mean for numerical features
categorical_imputer = SimpleImputer(strategy="most_frequent")  # Mode for categorical features
```
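
The difference between mean and median imputation is easiest to see on a column with an outlier. A minimal sketch on toy data (the values are invented):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One missing entry and one large value (9.0) that pulls the mean upward
column = np.array([[1.0], [2.0], [np.nan], [9.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(column)
median_filled = SimpleImputer(strategy="median").fit_transform(column)

print("Mean imputation:  ", mean_filled.ravel())
print("Median imputation:", median_filled.ravel())
```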

---



In [None]:
#imports
from sklearn.impute import SimpleImputer

# Ensure column names are clean
df.columns = df.columns.str.strip()

# Separate features and target variable
if "Target" not in df.columns:
    raise ValueError("The column 'Target' is missing from the dataset.")

X = df.drop(columns=["Target"])  # Exclude the target variable
y = df["Target"]  # Target variable remains unchanged

# Dynamically define numerical and categorical features based on X
numerical_features = X.select_dtypes(include=["float64", "int64"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()

# Imputation replaces missing entries with a substitute value such as the mean,
# median, or mode of the observed values. Unlike dropping rows, it preserves
# the sample size.
numerical_imputer = SimpleImputer(strategy="mean")  # Mean for numerical features
categorical_imputer = SimpleImputer(strategy="most_frequent")  # Mode for categorical features

# Apply the imputers (values stay unchanged when nothing is missing)
X[numerical_features] = numerical_imputer.fit_transform(X[numerical_features])
X[categorical_features] = categorical_imputer.fit_transform(X[categorical_features])

print("Categorical features:", categorical_features)
print('-' * 250)
print("Numerical features:", numerical_features)

In [None]:
# checking unique values in categorical features
X[categorical_features].nunique()

In [None]:
# checking unique values in numerical features
X[numerical_features].nunique()

In [None]:
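from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# Ordinal Encoding
# Education_Level has a natural order, so spell it out; by default OrdinalEncoder
# sorts categories alphabetically. Loan_Purpose is nominal and keeps alphabetical codes.
ordinal_cols = ['Education_Level', 'Loan_Purpose']
education_order = ["High School", "Bachelor's", "Master's", "Doctorate"]
loan_purpose_order = sorted(X['Loan_Purpose'].unique())
ordinal_encoder = OrdinalEncoder(categories=[education_order, loan_purpose_order])
X[ordinal_cols] = ordinal_encoder.fit_transform(X[ordinal_cols])

# Label Encoding (one encoder per column so each mapping can be inverted later)
label_cols = ['Gender', 'Marital_Status', 'Employment_Status', 'Has_Credit_Card', 
              'Is_Homeowner', 'Account_Type', 'Customer_Segment', 'Preferred_Contact_Channel']
label_encoders = {}
for col in label_cols:
    label_encoders[col] = LabelEncoder()
    X[col] = label_encoders[col].fit_transform(X[col])

# Also encode the target variable. LabelEncoder sorts classes alphabetically,
# so "Default" becomes 0 and "No Default" becomes 1.
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(y)

# Display the first few rows of the transformed dataset
X.head()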


In [None]:
# display the encoded target variable
y
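
Because `LabelEncoder` sorts class names alphabetically, `Default` becomes 0 and `No Default` becomes 1. Scikit-learn metrics that assume a positive label of 1 therefore treat `No Default` as the positive class. A small sketch to confirm the mapping:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["No Default", "Default", "No Default"])

# classes_ holds the labels in the order of their integer codes
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded)
```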

In [None]:
# checking outliers in numerical features
plt.figure(figsize=(18, 12))
X[numerical_features].boxplot()
plt.title("Boxplot of Numerical Features")
plt.xticks(rotation=45)
plt.show()

In [18]:
# Define a function to cap outliers
def cap_outliers(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Cap outliers
    df_capped = df.clip(lower=lower_bound, upper=upper_bound, axis=1)
    return df_capped

# Cap outliers in numerical features
X[numerical_features] = cap_outliers(X[numerical_features])


In [None]:
# plotting the boxplot after capping the outliers
plt.figure(figsize=(18, 12))
X[numerical_features].boxplot()
plt.title("Boxplot of Numerical Features (After Capping Outliers)")
plt.xticks(rotation=45)
plt.show()

In [None]:
# checking for class imbalance
class_distribution = df["Target"].value_counts(normalize=True) * 100
print(class_distribution)

In [None]:
# Split the data into training and testing sets before resampling,
# so the test set contains only original (non-synthetic) rows
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Display the shape of the training and testing sets 
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

In [None]:
# handling class imbalance on the training set only
from imblearn.over_sampling import SMOTE

# Apply SMOTE. Note that SMOTE interpolates every column, including the
# integer-encoded categorical ones; imblearn's SMOTENC is an alternative
# that treats categorical columns separately.
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Display the class distribution after applying SMOTE
y_resampled_distribution = pd.Series(y_train).value_counts(normalize=True) * 100
print(y_resampled_distribution)
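
Resampling is generally fitted on the training portion only, so that the test set keeps the original class balance and contains no synthetic rows. The sketch below shows that order of operations with plain random oversampling (duplicating minority rows, not SMOTE) on toy data. All names in it are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
features = pd.DataFrame({"x1": rng.normal(size=1_000), "x2": rng.normal(size=1_000)})
labels = pd.Series(rng.choice([0, 1], size=1_000, p=[0.2, 0.8]), name="label")

# 1. Split first; stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels)

# 2. Oversample the minority class within the training set only
counts = y_tr.value_counts()
minority = counts.idxmin()
n_extra = counts.max() - counts.min()
extra_idx = y_tr[y_tr == minority].sample(
    n=n_extra, replace=True, random_state=42).index

X_tr_bal = pd.concat([X_tr, X_tr.loc[extra_idx]])
y_tr_bal = pd.concat([y_tr, y_tr.loc[extra_idx]])

print("Balanced training labels:", y_tr_bal.value_counts().to_dict())
print("Test set class shares:", y_te.value_counts(normalize=True).round(2).to_dict())
```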

In [23]:
# Standardize the numerical features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

In [None]:
# training the model
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
from sklearn.metrics import classification_report, accuracy_score

# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))
print('+'*50)

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print('+'*50)
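
On an imbalanced target, accuracy alone can look good for a model that has learned nothing. As a sketch, a baseline that always predicts the majority class on a 20%-80% split reaches 80% accuracy while its balanced accuracy stays at chance level. The arrays below are toy data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 0 = "Default", 1 = "No Default", mirroring the label encoding
y_true_demo = np.array([0] * 20 + [1] * 80)
y_majority = np.ones_like(y_true_demo)  # always predict "No Default"

baseline_accuracy = accuracy_score(y_true_demo, y_majority)
baseline_balanced = balanced_accuracy_score(y_true_demo, y_majority)

print("Accuracy:", baseline_accuracy)
print("Balanced accuracy:", baseline_balanced)
```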

In [None]:
# Feature Importance
feature_importance = rf_classifier.feature_importances_
feature_importance_df = pd.DataFrame({"Feature": X.columns, "Importance": feature_importance})
feature_importance_df = feature_importance_df.sort_values(by="Importance", ascending=False)

# Plot the Feature Importance   
plt.figure(figsize=(12, 8))
sns.barplot(data=feature_importance_df, x="Importance", y="Feature", palette="viridis")
plt.title("Feature Importance")
plt.show()


In [None]:
# treat the categorical features as sensitive features,
# excluding Loan_Purpose and Education_Level
sensitive_features = [feature for feature in categorical_features if feature not in ['Loan_Purpose', 'Education_Level']]
print(sensitive_features)

In [None]:
import os
import time
import subprocess
import shutil
import csv
from tempfile import gettempdir

from tqdm import tqdm
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score, precision_score, recall_score
from bokeh.io import output_file, save
from bokeh.plotting import figure
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from fairlearn.metrics import selection_rate, count, false_positive_rate, false_negative_rate, true_positive_rate, true_negative_rate
# Create the plots directory
os.makedirs("plots", exist_ok=True)

# Prepare the report data
report_data = []


def generate_screenshot(html_file, output_png):
    """
    Detects the best browser, constructs the command, and generates a screenshot of the given HTML file.

    Parameters:
        html_file (str): Path to the input HTML file.
        output_png (str): Path to save the output PNG screenshot.

    Returns:
        bool: True if the screenshot was successfully generated, False otherwise.
    """
    browser_priority = ["chrome", "edge", "safari", "firefox", "opera"]
    browser_paths = {
        "chrome": [
            "chrome", "google-chrome", "chrome.exe",
            r"C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
            r"C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe"
        ],
        "edge": [
            "msedge", "msedge.exe",
            r"C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe",
            r"C:\\Program Files\\Microsoft\\Edge\\Application\\msedge.exe"
        ],
        "safari": ["/Applications/Safari.app/Contents/MacOS/Safari"],
        "firefox": [
            "firefox", "firefox.exe",
            r"C:\\Program Files\\Mozilla Firefox\\firefox.exe",
            r"C:\\Program Files (x86)\\Mozilla Firefox\\firefox.exe"
        ],
        "opera": [
            "opera", "opera.exe",
            r"C:\\Program Files\\Opera\\launcher.exe",
            r"C:\\Program Files (x86)\\Opera\\launcher.exe"
        ]
    }

    def find_browser():
        for browser in browser_priority:
            for executable in browser_paths.get(browser, []):
                path = shutil.which(executable) or (
                    executable if os.path.isfile(executable) else None)
                if path:
                    return browser, path
        return None, None

    def build_command(browser, browser_path, html_file, output_png):
        preset_width, preset_height = 1280,520
        if browser in ["chrome", "edge", "opera"]:
            return [
                browser_path, "--headless", "--disable-gpu",
                f"--window-size={preset_width},{preset_height}",
                f"--screenshot={output_png}",
                f"file:///{html_file}"
            ]
        elif browser == "firefox":
            return [
                browser_path, "--headless",
                "--window-size", f"{preset_width},{preset_height}",
                "--screenshot", output_png,
                f"file:///{html_file}"
            ]
        elif browser == "safari":
            raise NotImplementedError(
                "Safari does not natively support headless mode.")
        else:
            raise ValueError(
                f"Unsupported browser for headless mode: {browser}")

    def execute_command(command):
        # Any failure (non-zero exit code, missing executable, ...) counts as no screenshot
        try:
            subprocess.run(
                command, check=True, capture_output=True, text=True)
            return True
        except Exception:
            return False

    if not os.path.isfile(html_file):
        return False

    browser, browser_path = find_browser()
    if not browser or not browser_path:
        return False

    try:
        command = build_command(browser, browser_path, html_file, output_png)
        return execute_command(command)
    except (NotImplementedError, ValueError):
        return False


# Start the main processing
y_true = y_test
start_time = time.time()

total_rows = len(y_test)
sensitive_columns = sensitive_features

# Main processing loop
for i in tqdm(sensitive_columns, desc="Processing sensitive columns", total=len(sensitive_columns)):
    iteration_start_time = time.time()

    metrics = {
        "Accuracy": accuracy_score,
        "Precision": lambda y_true, y_pred: precision_score(y_true, y_pred, zero_division=0),
        "False Positive Rate": false_positive_rate,
        "False Negative Rate": false_negative_rate,
        "True Positive Rate": true_positive_rate,
        "True Negative Rate": true_negative_rate,
        "Selection Rate": selection_rate,
        "Count": count,
    }

    protected_class = X_test[i]
    metric_frame = MetricFrame(
        metrics=metrics,
        y_true=y_test,
        y_pred=y_pred,
        sensitive_features=protected_class
    )

    iteration_end_time = time.time()
    metric_time = iteration_end_time - iteration_start_time
    print(
        f"Time taken for metrics calculation (feature '{i}'): {metric_time:.4f} seconds")

    metrics_data = metric_frame.by_group.reset_index()
    sensitive_feature_column = metrics_data.columns[0]
    metrics_data = metrics_data.melt(
        id_vars=[sensitive_feature_column], var_name="Metric", value_name="Value")

    colors = [
        "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728",
        "#9467bd", "#8c564b", "#e377c2", "#7f7f7f"
    ]

    plots = []
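    for idx, metric in enumerate(metrics_data['Metric'].unique()):
        # Use a separate name so the dataset variable `data` is not overwritten,
        # and copy the slice before modifying it
        metric_rows = metrics_data[metrics_data['Metric'] == metric].copy()
        metric_rows[sensitive_feature_column] = metric_rows[
            sensitive_feature_column].astype(str)
        source = ColumnDataSource(metric_rows)

        p = figure(
            title=metric,
            x_range=list(metric_rows[sensitive_feature_column].unique()),
            height=200,
            width=300
        )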
        p.vbar(x=sensitive_feature_column, top='Value', width=0.9,
               source=source, color=colors[idx % len(colors)])
        p.xaxis.axis_label = sensitive_feature_column
        p.yaxis.axis_label = metric
        p.title.text_font_size = "14pt"
        plots.append(p)

    grid = gridplot([plots[:4], plots[4:]])

    html_file = os.path.abspath(os.path.join(
        "plots", f"metrics_bokeh_{i}.html"))
    output_file(html_file)
    save(grid)

    png_start_time = time.time()
    output_png = html_file.replace(".html", ".png")
    success = generate_screenshot(html_file, output_png)
    png_end_time = time.time()

    image_time = png_end_time - png_start_time
    if success:
        print(
            f"Time taken to save PNG (feature '{i}'): {image_time:.4f} seconds")
    else:
        print(f"Failed to generate PNG for feature '{i}'")

    # Add data to the report
    report_data.append({
        "Sensitive Column": i,
        "Metric Calculation Time (s)": metric_time,
        "Image Generation Time (s)": image_time,
    })

end_time = time.time()
total_time = end_time - start_time
print(f"Total execution time: {total_time:.4f} seconds")

# Add summary to the report
total_sensitive_columns = len(sensitive_columns)
report_data.append({
    "Sensitive Column": "Summary",
    "Metric Calculation Time (s)": "---",
    "Image Generation Time (s)": "---",
    "Total Sensitive Columns": total_sensitive_columns,
    "Total Rows": total_rows,
    "Total Execution Time (s)": total_time
})

# Write the report to a CSV file
report_file = os.path.join("plots", "report.csv")
with open(report_file, mode="w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=[
        "Sensitive Column", "Metric Calculation Time (s)", "Image Generation Time (s)",
        "Total Sensitive Columns", "Total Rows", "Total Execution Time (s)"
    ])
    writer.writeheader()
    writer.writerows(report_data)

print(f"Report saved to {report_file}")
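
For readers without `fairlearn` installed, the per-group table that `MetricFrame.by_group` produces can be approximated with a pandas `groupby`. This is a simplified sketch on toy data, not fairlearn's implementation; it covers only accuracy, selection rate (the share of predictions equal to 1), and group size.

```python
import pandas as pd

results = pd.DataFrame({
    "group":  ["A", "A", "A", "B", "B", "B"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 1, 1, 0, 0, 0],
})
# 1 where the prediction matches the true label, 0 otherwise
results["correct"] = (results["y_true"] == results["y_pred"]).astype(int)

by_group = results.groupby("group").agg(
    accuracy=("correct", "mean"),
    selection_rate=("y_pred", "mean"),
    count=("y_pred", "size"),
)
print(by_group)
```

In this toy example both groups have the same accuracy but opposite selection rates, which is the kind of gap a per-group breakdown is meant to surface.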