# Explanation of the Bank Classification Dataset

## Overview
This document provides a detailed explanation of the synthetic dataset generated for a bank-related classification problem. The dataset is designed to simulate real-world scenarios where a bank predicts customer behavior, such as loan default or creditworthiness. It includes a mix of numerical and categorical features, ensuring that all columns are meaningful and relevant to the banking domain.

---

## Dataset Details
### **Rows and Columns**
- **Number of Rows:** 1,000,000
- **Number of Columns:** 23 features (13 numerical and 10 categorical), excluding the target variable

### **Target Variable**
- **Name:** `Target`
- **Type:** Categorical
- **Values:**
  - `Default`: Represents customers who default on their loans.
  - `No Default`: Represents customers who do not default.

### **Numerical Features**
1. **Age**: Customer's age in years (range: 18-75).
2. **Annual_Income**: Customer's annual income in USD, following a normal distribution centered around $50,000 with a standard deviation of $15,000.
3. **Loan_Amount**: The amount of the loan issued in USD, normally distributed around $20,000 with a standard deviation of $10,000.
4. **Savings_Balance**: Customer's current savings account balance in USD, normally distributed around $10,000 with a standard deviation of $5,000.
5. **Years_as_Customer**: Number of years the customer has been with the bank (range: 1-30).
6. **Credit_Score**: Customer's credit score (range: 300-850).
7. **Debt_to_Income_Ratio**: Debt-to-income ratio, sampled uniformly (range: 0.0-1.0).
8. **Credit_Utilization**: Ratio of credit used to the total available credit (range: 0.0-1.0).
9. **Number_of_Credit_Accounts**: Number of credit accounts held by the customer (range: 1-15).
10. **Loan_Term_in_Years**: Duration of the loan in years (range: 1-30).
11. **Monthly_Installment**: Monthly installment payment in USD (mean: $2,000, standard deviation: $800), sampled independently of loan amount and term.
12. **Overdraft_Amount**: Overdraft amount used by the customer in USD (mean: $5,000, standard deviation: $2,000).
13. **Annual_Expenditure**: Customer's annual expenditure in USD (mean: $40,000, standard deviation: $12,000).

### **Categorical Features**
1. **Gender**: Customer's gender (“Male”, “Female”).
2. **Marital_Status**: Customer's marital status (“Single”, “Married”).
3. **Employment_Status**: Employment status (“Employed”, “Unemployed”, “Self-Employed”, “Student”, “Retired”).
4. **Education_Level**: Highest level of education completed (“High School”, “Bachelor’s”, “Master’s”, “Doctorate”).
5. **Loan_Purpose**: Purpose of the loan (“Car”, “Home”, “Education”, “Personal”, “Business”).
6. **Has_Credit_Card**: Indicates whether the customer owns a credit card (“Yes”, “No”).
7. **Is_Homeowner**: Indicates whether the customer owns a home (“Yes”, “No”).
8. **Account_Type**: Type of account held by the customer (“Savings”, “Business”).
9. **Customer_Segment**: Categorizes customers into segments (“Premium”, “Regular”).
10. **Preferred_Contact_Channel**: Preferred method of contact (“Email”, “Phone”, “In-Person”).

---

## Code Explanation
### **Numerical Features Generation**
The numerical features are generated using random distributions to simulate realistic data:
- **Uniform Distribution:** Used for features like `Debt_to_Income_Ratio` and `Credit_Utilization`.
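- **Normal Distribution:** Used for features like `Annual_Income` and `Loan_Amount` to mimic typical customer distributions. The draws are not truncated, so a small share of values can be negative (about 2% of `Loan_Amount`, whose mean sits two standard deviations above zero).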
- **Integer Ranges:** Used for features like `Age` and `Years_as_Customer` to ensure realistic values.
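
The three sampling patterns above can be sketched as follows. This is a minimal illustration, not the script itself: it uses NumPy's `Generator` API instead of the legacy `np.random.*` functions, and the seed, the sample size, and the clipping at zero are assumptions added for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for reproducibility only
n = 10_000                       # small illustrative sample size

# Integer range: endpoint=True makes the upper bound inclusive (18 to 75).
age = rng.integers(18, 75, size=n, endpoint=True)

# Normal distribution: clip at zero so no income is negative (an added assumption).
annual_income = np.clip(rng.normal(50_000, 15_000, size=n), 0, None).round(2)

# Uniform distribution on [0, 1), rounded to two decimals.
credit_utilization = rng.uniform(0, 1, size=n).round(2)

print(age.min(), age.max())
print(annual_income.min() >= 0)
```

Whether to clip, truncate, or leave negative draws in place is a modelling choice; the original script leaves them in.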

### **Categorical Features Generation**
The categorical features are generated using `np.random.choice` with predefined probabilities to simulate real-world distributions. For instance:
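- `Gender` uses a 58%-42% split between "Male" and "Female."
- `Marital_Status` uses a 51%-49% split between "Single" and "Married."
- Features generated without a `p` argument, such as `Employment_Status` and `Loan_Purpose`, draw every category with equal probability.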
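
As a quick check of how weighted sampling behaves, the sketch below draws labels with the probabilities the script passes for `Gender` and compares the empirical shares with the requested ones. The seed and sample size are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
n = 100_000

# Draw labels with the same probabilities the script uses for `Gender`.
gender = rng.choice(["Male", "Female"], size=n, p=[0.58, 0.42])

# Empirical share of each label; these should land close to 0.58 and 0.42.
labels, counts = np.unique(gender, return_counts=True)
shares = dict(zip(labels.tolist(), (counts / n).tolist()))
print(shares)
```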

### **Derived Features**
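Features that would normally be derived in practice, such as `Debt_to_Income_Ratio`, `Credit_Utilization`, and `Monthly_Installment`, are sampled independently in this script rather than calculated from other attributes. As a result they are not tied to `Loan_Amount`, `Annual_Income`, or `Loan_Term_in_Years`.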
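
If `Monthly_Installment` were computed from the loan amount and term, one option would be the standard fixed-payment amortization formula. The sketch below is a hedged illustration: the function name `monthly_installment` and the 6% `annual_rate` are hypothetical, since the dataset has no interest-rate column.

```python
import numpy as np

def monthly_installment(loan_amount, term_years, annual_rate=0.06):
    """Fixed monthly payment for a fully amortizing loan.

    `annual_rate` is a hypothetical nominal annual interest rate;
    the dataset itself does not contain one.
    """
    loan_amount = np.asarray(loan_amount, dtype=float)
    n_payments = np.asarray(term_years, dtype=float) * 12
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # Without interest the payment is simply principal / number of payments.
        return loan_amount / n_payments
    return loan_amount * monthly_rate / (1 - (1 + monthly_rate) ** (-n_payments))

payment = monthly_installment(20_000, 5)
print(round(float(payment), 2))
```

The function also accepts arrays, so it could be applied to whole `Loan_Amount` and `Loan_Term_in_Years` columns at once.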

### **Target Variable**
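The target variable (`Target`) is binary and reflects a 20%-80% split between "Default" and "No Default" cases, mimicking common bank scenarios. It is sampled independently of the features, so no feature carries real predictive signal about the target.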

### **Script Execution**
The dataset is saved as a CSV file:
```python
# Save the dataset to a CSV file
data.to_csv("bank_classification_dataset.csv", index=False)
```

---

## Use Case
This dataset is suitable for:
- Training and testing classification models (e.g., logistic regression, random forests, neural networks).
- Practicing feature engineering, data preprocessing, and exploratory data analysis (EDA).
- Experimenting with imbalanced classification problems.

In [1]:
# imports
import numpy as np
import pandas as pd

In [2]:
# Set seed for reproducibility
np.random.seed(42)

# Define the number of rows
n_rows = 1_000_000

# Create numerical features
def generate_numerical_features(n_rows):
    return {
        # np.random.randint excludes the upper bound, so add 1 to include it
        "Age": np.random.randint(18, 76, size=n_rows),
        "Annual_Income": np.random.normal(50000, 15000, size=n_rows).round(2),
        "Loan_Amount": np.random.normal(20000, 10000, size=n_rows).round(2),
        "Savings_Balance": np.random.normal(10000, 5000, size=n_rows).round(2),
        "Years_as_Customer": np.random.randint(1, 31, size=n_rows),
        "Credit_Score": np.random.randint(300, 851, size=n_rows),
        "Debt_to_Income_Ratio": np.random.uniform(0, 1, size=n_rows).round(2),
        "Credit_Utilization": np.random.uniform(0, 1, size=n_rows).round(2),
        "Number_of_Credit_Accounts": np.random.randint(1, 16, size=n_rows),
        "Loan_Term_in_Years": np.random.randint(1, 31, size=n_rows),
        "Monthly_Installment": (np.random.normal(2000, 800, size=n_rows)).round(2),
        "Overdraft_Amount": np.random.normal(5000, 2000, size=n_rows).round(2),
        "Annual_Expenditure": np.random.normal(40000, 12000, size=n_rows).round(2),
    }

# Create categorical features
def generate_categorical_features(n_rows):
    return {
        "Gender": np.random.choice(["Male", "Female"], size=n_rows, p=[0.58, 0.42]),
        "Marital_Status": np.random.choice(["Single", "Married"], size=n_rows, p=[0.51, 0.49]),
        "Employment_Status": np.random.choice(["Employed", "Unemployed", "Self-Employed", "Student", "Retired"], size=n_rows),
        "Education_Level": np.random.choice(["High School", "Bachelor's", "Master's", "Doctorate"], size=n_rows, p=[0.4, 0.4, 0.15, 0.05]),
        "Loan_Purpose": np.random.choice(["Car", "Home", "Education", "Personal", "Business"], size=n_rows),
        "Has_Credit_Card": np.random.choice(["Yes", "No"], size=n_rows, p=[0.7, 0.3]),
        "Is_Homeowner": np.random.choice(["Yes", "No"], size=n_rows, p=[0.6, 0.4]),
        "Account_Type": np.random.choice(["Savings", "Business"], size=n_rows, p=[0.65, 0.35]),
        "Customer_Segment": np.random.choice(["Premium", "Regular"], size=n_rows, p=[0.52, 0.48]),
        "Preferred_Contact_Channel": np.random.choice(["Email", "Phone", "In-Person"], size=n_rows, p=[0.5, 0.3, 0.2]),
    }

# Generate the target variable
def generate_target_variable(n_rows):
    return np.random.choice(["Default", "No Default"], size=n_rows, p=[0.2, 0.8])

# Combine all features into a DataFrame
numerical_features = generate_numerical_features(n_rows)
categorical_features = generate_categorical_features(n_rows)

data = pd.DataFrame({**numerical_features, **categorical_features})

# Add the target variable
data["Target"] = generate_target_variable(n_rows)

# Save the dataset to a CSV file
data.to_csv("bank_classification_dataset.csv", index=False)

print("Dataset generated and saved as 'bank_classification_dataset.csv'")


# EDA

In [None]:
# load the dataset
data = pd.read_csv("bank_classification_dataset.csv")
# copy the dataset
df = data.copy()
# shape of the dataset
print(df.shape)

In [None]:
from IPython.display import display

# Basic statistics of the dataset
data_summary = df.describe(include='all')

# Check for missing values
missing_values = df.isnull().sum()

# Display results
display(data_summary)
display(missing_values)


In [None]:
df.head()

### 1. **Correlation Heatmap for Numerical Features**

**Code:**
```python
plt.figure(figsize=(12, 8))
numerical_features = data.select_dtypes(include=['float64', 'int64']).columns
correlation_matrix = data[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Heatmap for Numerical Features")
plt.show()
```

**Explanation:**
This heatmap visualizes the pairwise Pearson correlation (the default of `DataFrame.corr()`) between all numerical features in the dataset. Correlation values range from -1 to 1:
- **1**: Perfect positive linear correlation.
- **-1**: Perfect negative linear correlation.
- **0**: No linear correlation.

**Insights:**
- Features with strong positive or negative correlations can indicate multicollinearity, which may need to be addressed during model building.
- For example, in real lending data `Debt_to_Income_Ratio` and `Loan_Amount` often correlate, reflecting a relationship between a customer's debt and the size of their loan. In this synthetic dataset every feature is sampled independently, so the off-diagonal values should be close to 0.
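
To make the contrast concrete, the sketch below compares two independently sampled columns with a column that is computed from another one. The seed, the sample size, and the helper column `Loan_Amount_in_Thousands` are assumptions introduced only for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # assumed seed
n = 10_000

# Two independently sampled columns, mirroring how the script draws its features.
sample = pd.DataFrame({
    "Loan_Amount": rng.normal(20_000, 10_000, size=n),
    "Debt_to_Income_Ratio": rng.uniform(0, 1, size=n),
})

# A hypothetical column built from another one, to show what a real dependency looks like.
sample["Loan_Amount_in_Thousands"] = sample["Loan_Amount"] / 1_000

corr = sample.corr()
print(corr.round(2))
```

The independent pair should show a coefficient near 0, while the rescaled column correlates perfectly with its source.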

---

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the style for the plots
sns.set(style="whitegrid")

# 1. Correlation Heatmap for Numerical Features
plt.figure(figsize=(12, 8))
numerical_features = data.select_dtypes(include=['float64', 'int64']).columns
correlation_matrix = data[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Heatmap for Numerical Features")
plt.show()


### 2. **Boxplot: Loan Amount vs Loan Purpose**

**Code:**
```python
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Loan_Purpose', y='Loan_Amount')
plt.title("Loan Amount Distribution by Loan Purpose")
plt.xticks(rotation=45)
plt.show()
```

**Explanation:**
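This boxplot shows the distribution of loan amounts for each loan purpose. Each box spans the interquartile range (IQR) of the loan amounts, the line inside it marks the median, and the whiskers extend to the most extreme values within 1.5 × IQR of the box. Points beyond the whiskers are drawn individually as outliers.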

**Insights:**
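- You can compare median loan amounts across purposes. In real lending data `Home` loans would typically be larger than `Car` loans; here `Loan_Amount` is sampled independently of `Loan_Purpose`, so the boxes should look nearly identical.
- Outliers indicate exceptionally large loan amounts (or negative ones, because the normal draws are not truncated), which might warrant further investigation.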

---

In [None]:
# 2. Boxplot: Loan Amount vs Loan Purpose
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Loan_Purpose', y='Loan_Amount')
plt.title("Loan Amount Distribution by Loan Purpose")
plt.xticks(rotation=45)
plt.show()

### 3. **Bar Plot: Target Distribution**

**Code:**
```python
plt.figure(figsize=(8, 6))
sns.countplot(data=data, x='Target')
plt.title("Target Distribution (Default vs No Default)")
plt.show()
```

**Explanation:**
This bar plot shows the distribution of the target variable (`Default` vs `No Default`). It helps us understand the class balance in the dataset.

**Insights:**
- If one class is significantly larger than the other (here roughly 80% `No Default` against 20% `Default`), the dataset is imbalanced. This is common in real-world banking scenarios and often calls for techniques like oversampling, undersampling, or class weighting during model training.
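
Class weighting can be illustrated with a small calculation. The sketch below uses a toy target with the same 20%/80% split and applies the "balanced" heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn uses for `class_weight="balanced"`; the toy array is an assumption for the example.

```python
import numpy as np

# Toy target with the same 20% / 80% split as the dataset.
y = np.array(["Default"] * 20 + ["No Default"] * 80)

classes, counts = np.unique(y, return_counts=True)

# "Balanced" weighting: n_samples / (n_classes * class_count).
weights = len(y) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

The minority class receives the larger weight, so each of its errors counts more during training without changing the data itself.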

---

In [None]:
# 3. Bar Plot: Target Distribution
plt.figure(figsize=(8, 6))
sns.countplot(data=data, x='Target')
plt.title("Target Distribution (Default vs No Default)")
plt.show()

### 4. **Scatter Plot: Annual Income vs Loan Amount**

**Code:**
```python
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Annual_Income', y='Loan_Amount', hue='Target', alpha=0.3)
plt.title("Annual Income vs Loan Amount (by Target)")
plt.show()
```

**Explanation:**
This scatter plot visualizes the relationship between `Annual_Income` and `Loan_Amount`, colored by the target variable (`Default` vs `No Default`).

**Insights:**
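- With real data you might observe clusters of customers with specific income and loan amount combinations, and defaults concentrated in risk areas such as lower incomes combined with higher loans.
- In this synthetic dataset both features and the target are sampled independently, so expect a single roughly elliptical cloud with the two classes mixed throughout.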

---

In [None]:
# 4. Scatter Plot: Annual Income vs Loan Amount
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Annual_Income', y='Loan_Amount', hue='Target', alpha=0.3)
plt.title("Annual Income vs Loan Amount (by Target)")
plt.show()

### 5. **Distribution of Debt-to-Income Ratio by Target**

**Code:**
```python
plt.figure(figsize=(10, 6))
sns.histplot(data=data, x='Debt_to_Income_Ratio', hue='Target', kde=True, bins=30, alpha=0.5)
plt.title("Debt-to-Income Ratio Distribution by Target")
plt.show()
```

**Explanation:**
This histogram shows the distribution of `Debt_to_Income_Ratio` for each target class (`Default` vs `No Default`), with a kernel density estimate (KDE) overlay.

**Insights:**
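- In real lending data, customers with higher `Debt_to_Income_Ratio` values tend to default more often, which would show up as a shift in the `Default` histogram. Here the feature is uniform and the target is sampled independently of it, so both histograms should look flat and differ only in height.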
- The KDE helps visualize the overall distribution and the overlap between the two classes, which is useful for assessing feature separability.

---

In [None]:
# 5. Distribution of Debt-to-Income Ratio by Target
plt.figure(figsize=(10, 6))
sns.histplot(data=data, x='Debt_to_Income_Ratio', hue='Target', kde=True, bins=30, alpha=0.5)
plt.title("Debt-to-Income Ratio Distribution by Target")
plt.show()

# Data Preprocessing Steps with Explanations and Comparisons

## Step 1: Handling Missing Values

### Explanation:
Missing values can introduce bias and reduce the predictive power of a machine learning model. This synthetic dataset is generated without missing values, so the imputers below act as a safeguard rather than a required step. The strategies are:
- **Numerical Features:** Missing values are imputed using the **mean** because it is simple and leaves the feature's mean unchanged, although it shrinks the variance.
  - **Alternative:** Median imputation is more robust to outliers and skew. For roughly normal data the mean and median are nearly identical, so the choice matters little.
- **Categorical Features:** Missing values are imputed using the **most frequent value** because it is computationally cheap and always yields a valid category.
  - **Alternative:** K-Nearest Neighbors (KNN) imputation considers neighboring data points but is computationally expensive on a dataset of this size and may introduce noise.
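
The difference between the strategies is easiest to see on a toy column. The sketch below uses plain pandas `fillna` rather than `SimpleImputer`, purely to expose the arithmetic; the values and column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy numerical column with one missing value and one extreme value.
income = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])

mean_filled = income.fillna(income.mean())      # mean of the observed values
median_filled = income.fillna(income.median())  # median of the observed values

# Toy categorical column imputed with its most frequent value.
channel = pd.Series(["Email", "Phone", "Email", None])
mode_filled = channel.fillna(channel.mode()[0])

print(mean_filled.iloc[-1], median_filled.iloc[-1], mode_filled.iloc[-1])
```

The single extreme value pulls the mean far above the typical observation, while the median stays close to the bulk of the data.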

### Code:
```python
# Impute missing values for numerical and categorical features
from sklearn.impute import SimpleImputer

numerical_features = data.select_dtypes(include=["float64", "int64"]).columns
categorical_features = data.select_dtypes(include=["object"]).columns

numerical_imputer = SimpleImputer(strategy="mean")  # Mean for numerical features
categorical_imputer = SimpleImputer(strategy="most_frequent")  # Mode for categorical features
```

---



In [None]:
#imports
from sklearn.impute import SimpleImputer

# Ensure column names are clean
df.columns = df.columns.str.strip()

# Separate features and target variable
if "Target" not in df.columns:
    raise ValueError("The column 'Target' is missing from the dataset.")

X = df.drop(columns=["Target"])  # Exclude the target variable
y = df["Target"]  # Target variable remains unchanged

# Dynamically define numerical and categorical features based on X
numerical_features = X.select_dtypes(include=["float64", "int64"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()

numerical_imputer = SimpleImputer(strategy="mean")  # Mean for numerical features
categorical_imputer = SimpleImputer(strategy="most_frequent")  # Mode for categorical features

# Apply the imputers (a no-op here, because the generated data has no missing values)
X[numerical_features] = numerical_imputer.fit_transform(X[numerical_features])
X[categorical_features] = categorical_imputer.fit_transform(X[categorical_features])

# What is imputation?
# Imputation replaces missing values with substitutes such as the mean, median,
# or mode of the observed values. Unlike dropping incomplete rows, it preserves
# the sample size, at the cost of slightly distorting the feature's distribution.

print("Categorical features :",categorical_features)
print('-'*250)
print("numerical features :",numerical_features)
print('-'*250)
print("Unique Values in Categorical features :",X[categorical_features].nunique())
print('-'*250)
print("Unique values in Numerical features :",X[numerical_features].nunique())
print('-'*250)


# Explanation of Encoding in the Code

## Code:

```python
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

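# Ordinal Encoding
# Education_Level has a natural order, so its categories are listed from lowest to highest.
# Loan_Purpose has no inherent order; its categories are listed only to fix the codes.
ordinal_cols = ['Education_Level', 'Loan_Purpose']
ordinal_encoder = OrdinalEncoder(categories=[
    ["High School", "Bachelor's", "Master's", "Doctorate"],
    ["Car", "Home", "Education", "Personal", "Business"],
])
X[ordinal_cols] = ordinal_encoder.fit_transform(X[ordinal_cols])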

# Label Encoding
label_cols = ['Gender', 'Marital_Status', 'Employment_Status', 'Has_Credit_Card', 
              'Is_Homeowner', 'Account_Type', 'Customer_Segment', 'Preferred_Contact_Channel']
label_encoder = LabelEncoder()
for col in label_cols:
    X[col] = label_encoder.fit_transform(X[col])

# Also encoding the target variable
y = label_encoder.fit_transform(y)

# Display the first few rows of the transformed dataset
X.head()
```

## What This Code Does:

This code transforms "categorical data" (data that consists of categories like "Male/Female" or "Yes/No") into a numeric format so that machine learning algorithms can understand it.

### Why Encoding is Necessary
Most machine learning libraries, scikit-learn included, expect numeric input rather than text. For example, a model won't understand words like "Male" or "Female," but it can work with numbers like `0` and `1`.

### What Types of Encoding Are Used Here?

1. **Ordinal Encoding**:
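   - Used for features (columns) that have a natural order. For example, `Education_Level` has the categories:
     - "High School"
     - "Bachelor's"
     - "Master's"
     - "Doctorate"
   - The intended codes follow that order:
     - High School → 0
     - Bachelor's → 1
     - Master's → 2
     - Doctorate → 3
   - `OrdinalEncoder` only produces this mapping when the order is passed through its `categories` parameter. With the default settings it sorts the categories alphabetically (Bachelor's → 0, Doctorate → 1, High School → 2, Master's → 3), which does not reflect the educational order.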

2. **Label Encoding**:
   - Used for features with no specific order, like "Gender" or "Marital Status." `LabelEncoder` assigns codes in alphabetical order of the labels. For example:
     - Gender:
       - Female → 0
       - Male → 1
     - Marital Status:
       - Married → 0
       - Single → 1

   - Similarly, the target variable (`y`) is also encoded to turn the labels into numbers: `Default` → 0 and `No Default` → 1.
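
The difference between an explicit order and an alphabetical one can be checked on a few values. The sketch below uses pandas and NumPy instead of the scikit-learn encoders, as an assumption-free way to show the two mappings; the sample values are illustrative.

```python
import numpy as np
import pandas as pd

education = pd.Series(["Master's", "High School", "Doctorate", "Bachelor's"])

# Ordinal codes: the category order is stated explicitly, lowest to highest.
order = ["High School", "Bachelor's", "Master's", "Doctorate"]
ordinal_codes = pd.Categorical(education, categories=order, ordered=True).codes

# Alphabetical codes: what an encoding based on sorted labels produces.
labels, alphabetical_codes = np.unique(education.tolist(), return_inverse=True)

print(ordinal_codes.tolist())
print(alphabetical_codes.tolist())
```

Only the first mapping preserves the ranking of education levels; the second assigns "High School" a higher code than "Doctorate".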

### Why Not Use One-Hot Encoding?
One-hot encoding is another way to transform categories into numbers, but it creates additional columns for each category. For example:
   - "Male" and "Female" would become two separate columns like:
     - Male: [1, 0]
     - Female: [0, 1]
   
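   Since this dataset's categorical features mostly have binary values or only a few unique categories, the notebook uses integer codes to keep the column count unchanged. The trade-off is that integer codes impose an arbitrary order on nominal features with more than two categories (for example `Employment_Status`), which linear and distance-based models may misread as a ranking; tree-based models are less affected.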
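
For comparison, one-hot encoding of a single column can be sketched with `pd.get_dummies`. The toy frame below is an assumption for the example; the notebook itself does not one-hot encode any column.

```python
import pandas as pd

loans = pd.DataFrame({"Loan_Purpose": ["Car", "Home", "Education", "Car"]})

# One column per category; each row has exactly one active indicator.
one_hot = pd.get_dummies(loans, columns=["Loan_Purpose"], dtype=int)
print(one_hot)
```

With five loan purposes in the full dataset, this column alone would expand into five indicator columns.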

### Step-by-Step Explanation:

1. **Import Libraries**:
   - The code imports tools (`OrdinalEncoder` and `LabelEncoder`) from `sklearn`, a popular library for machine learning in Python.

2. **Ordinal Encoding**:
   - Two columns, `'Education_Level'` and `'Loan_Purpose'`, are encoded using `OrdinalEncoder`.
   - Only `'Education_Level'` has a natural order; `'Loan_Purpose'` is nominal, so its integer codes are arbitrary labels rather than a ranking.

3. **Label Encoding**:
   - A list of other categorical columns (like `'Gender'`, `'Marital_Status'`, etc.) is encoded using `LabelEncoder`.
   - Each category is replaced with a number.
   - This is done using a loop to apply the encoding to each column.

4. **Target Variable Encoding**:
   - The target variable (`y`), which the model predicts, is also encoded into numeric format using `LabelEncoder`.

5. **View Transformed Data**:
   - Finally, the first few rows of the transformed dataset are displayed with `X.head()`.

### Why This Approach Was Chosen:
- **Efficiency**: Encoding with numbers keeps the dataset simple and compact.
- **Suitability**: Since most of the categorical data has a small number of unique values (like "Yes/No"), using `LabelEncoder` and `OrdinalEncoder` avoids the unnecessary complexity of creating extra columns, as one-hot encoding would.
- **Compatibility**: The encoded data is now ready to be used with machine learning algorithms.


In [None]:
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

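# Ordinal Encoding
# Education_Level has a natural order, so its categories are listed from lowest to highest.
# Loan_Purpose has no inherent order; its categories are listed only to fix the codes.
ordinal_cols = ['Education_Level', 'Loan_Purpose']
ordinal_encoder = OrdinalEncoder(categories=[
    ["High School", "Bachelor's", "Master's", "Doctorate"],
    ["Car", "Home", "Education", "Personal", "Business"],
])
X[ordinal_cols] = ordinal_encoder.fit_transform(X[ordinal_cols])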

# Label Encoding
label_cols = ['Gender', 'Marital_Status', 'Employment_Status', 'Has_Credit_Card', 
              'Is_Homeowner', 'Account_Type', 'Customer_Segment', 'Preferred_Contact_Channel']
label_encoder = LabelEncoder()
for col in label_cols:
    X[col] = label_encoder.fit_transform(X[col])

# Also encoding the target variable
y = label_encoder.fit_transform(y)

# Display the first few rows of the transformed dataset
X.head()


# Explanation of Checking Outliers in Numerical Features

## Code:

```python
# Checking outliers in numerical features
plt.figure(figsize=(18, 12))
X[numerical_features].boxplot()
plt.title("Boxplot of Numerical Features")
plt.xticks(rotation=45)
plt.show()
```

## What This Code Does:

This code visualizes the numerical features of the dataset to check for outliers using a **boxplot**. Outliers are data points that are significantly different from the rest of the data. Detecting and handling outliers is important because they can negatively impact the performance of machine learning models.

### Step-by-Step Explanation:

1. **Set the Figure Size**:
   - `plt.figure(figsize=(18, 12))` sets the size of the figure to be large enough to display the boxplots for all numerical features clearly.

2. **Boxplot Creation**:
   - `X[numerical_features].boxplot()` creates a boxplot for each numerical feature in the dataset (`numerical_features` is a list of column names containing numerical data).
   - A boxplot is a statistical graph that shows the distribution of data:
     - The box represents the interquartile range (IQR), where 50% of the data lies.
     - The line inside the box shows the median.
     - Points outside the whiskers are potential outliers.

3. **Title**:
   - `plt.title("Boxplot of Numerical Features")` adds a title to the plot for better readability.

4. **Rotate X-Axis Labels**:
   - `plt.xticks(rotation=45)` rotates the labels on the x-axis by 45 degrees. This helps when the feature names are long and need to be displayed more clearly.

5. **Display the Plot**:
   - `plt.show()` renders the boxplot.

### Why Use a Boxplot?
- **Outlier Detection**: Boxplots are an effective way to visually identify outliers in numerical data. Outliers are usually shown as individual points outside the whiskers.
- **Distribution Overview**: Boxplots provide a summary of the distribution of data, including the median, range, and variability.
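
The rule a boxplot uses to flag points can also be computed directly. The sketch below applies the 1.5 × IQR fences to a small made-up array; the values are illustrative and not taken from the dataset.

```python
import numpy as np

values = np.array([10, 12, 13, 14, 15, 16, 18, 95], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones a boxplot draws individually.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers)
```

Counting flagged points this way is a useful complement to the plot when a dataset has too many rows to inspect visually.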

In [None]:
# Checking outliers in numerical features
plt.figure(figsize=(18, 12))
X[numerical_features].boxplot()
plt.title("Boxplot of Numerical Features")
plt.xticks(rotation=45)
plt.show()

# Explanation of Capping Outliers in Numerical Features

## Code:

```python
# Define a function to cap outliers
def cap_outliers(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Cap outliers
    df_capped = df.clip(lower=lower_bound, upper=upper_bound, axis=1)
    return df_capped

# Cap outliers in numerical features
X[numerical_features] = cap_outliers(X[numerical_features])
```

## What This Code Does:

This code identifies and modifies extreme values (outliers) in numerical features of a dataset to bring them within a reasonable range. It uses a method based on the **Interquartile Range (IQR)** to determine what counts as an outlier and then "caps" (or replaces) those values with the calculated boundaries.

### Step-by-Step Explanation:

1. **Define the Function**:
   - The function `cap_outliers(df)` takes a DataFrame (`df`) as input and adjusts any extreme values in its columns.

2. **Calculate Quartiles**:
   - `Q1 = df.quantile(0.25)` calculates the 25th percentile (lower quartile).
   - `Q3 = df.quantile(0.75)` calculates the 75th percentile (upper quartile).
   - The Interquartile Range (IQR) is calculated as `IQR = Q3 - Q1`. This range contains the middle 50% of the data.

3. **Determine Bounds**:
   - **Lower Bound**: `Q1 - 1.5 * IQR` — values below this are considered outliers.
   - **Upper Bound**: `Q3 + 1.5 * IQR` — values above this are considered outliers.

4. **Cap Outliers**:
   - `df.clip(lower=lower_bound, upper=upper_bound, axis=1)` replaces values below the lower bound with the lower bound and values above the upper bound with the upper bound.

5. **Apply the Function**:
   - The function is applied to the numerical features of the dataset (`X[numerical_features]`), and the capped data replaces the original values.

### Why Cap Outliers?

- **Improves Model Performance**: Extreme values can skew the results of machine learning models, leading to poor predictions.
- **Retains Rows**: Instead of removing outliers, this method adjusts them to be within a reasonable range, so no rows are dropped, although the capped values no longer carry their original magnitude.
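
A variant worth considering learns the bounds on one subset and reuses them on another, which keeps statistics from unseen data out of the preprocessing. The sketch below is an assumption-labelled alternative to `cap_outliers`, not the notebook's method; the helper names `fit_iqr_bounds` and `apply_iqr_bounds` and the toy frames are hypothetical.

```python
import pandas as pd

def fit_iqr_bounds(train: pd.DataFrame):
    """Return per-column (lower, upper) IQR bounds learned from `train`."""
    q1 = train.quantile(0.25)
    q3 = train.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def apply_iqr_bounds(frame: pd.DataFrame, lower: pd.Series, upper: pd.Series):
    """Clip each column of `frame` to bounds learned elsewhere."""
    return frame.clip(lower=lower, upper=upper, axis=1)

train = pd.DataFrame({"a": [10.0, 12, 13, 14, 15, 16, 18, 95],
                      "b": [1.0, 2, 3, 4, 5, 6, 7, 8]})
test = pd.DataFrame({"a": [11.0, 200.0], "b": [0.0, 9.0]})

lower, upper = fit_iqr_bounds(train)
train_capped = apply_iqr_bounds(train, lower, upper)
test_capped = apply_iqr_bounds(test, lower, upper)
print(test_capped)
```

Column `a` has its extreme values pulled back to the upper bound, while column `b` stays unchanged because all of its values fall inside the bounds.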

---

In [18]:
# Define a function to cap outliers
def cap_outliers(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Cap outliers
    df_capped = df.clip(lower=lower_bound, upper=upper_bound, axis=1)
    return df_capped

# Cap outliers in numerical features
X[numerical_features] = cap_outliers(X[numerical_features])


In [None]:
# Plotting the boxplot after capping the outliers
plt.figure(figsize=(18, 12))
X[numerical_features].boxplot()
plt.title("Boxplot of Numerical Features (After Capping Outliers)")
plt.xticks(rotation=45)
plt.show()

# Explanation of Checking for Class Imbalance

## Code:

```python
# Checking for class imbalance
class_distribution = df["Target"].value_counts(normalize=True) * 100
print(class_distribution)
```

## What This Code Does:

This code checks whether there is a class imbalance in the target variable (`"Target"`) of the dataset. A class imbalance occurs when one class has significantly more data points than the others, which can negatively impact the performance of machine learning models.

### Step-by-Step Explanation:

1. **Access the Target Column**:
   - `df["Target"]` extracts the column named `"Target"` from the DataFrame `df`. This is the variable that the model is trying to predict.

2. **Calculate Class Distribution**:
   - `value_counts(normalize=True)` counts the occurrences of each unique value in the `"Target"` column and normalizes the counts to show percentages instead of raw counts.
   - Multiplying by `100` converts the proportions into percentages.

3. **Print the Distribution**:
   - `print(class_distribution)` displays the percentage of data points belonging to each class in the `"Target"` column.

### Why Check for Class Imbalance?

- **Model Bias**: Machine learning models can become biased towards the majority class if the dataset is imbalanced. For example, if 90% of the data points belong to class A and only 10% to class B, the model may learn to predict class A for most cases and ignore class B.
- **Performance Issues**: An imbalanced dataset may result in misleading performance metrics (e.g., high accuracy but poor precision/recall for the minority class).
- **Next Steps**: If a class imbalance is detected, techniques like oversampling (e.g., SMOTE) or undersampling can be applied to balance the dataset.
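
The point about misleading metrics can be shown with a baseline that always predicts the majority class. In the sketch below the label `1` marks the minority class, which is a convention chosen for this toy example only and differs from the notebook's encoding.

```python
import numpy as np

# Toy labels with a 20% / 80% split; 1 marks the minority class here.
y_true = np.array([1] * 20 + [0] * 80)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall_minority = true_positives / (y_true == 1).sum()

print(accuracy, recall_minority)
```

The baseline reaches 80% accuracy while finding none of the minority cases, which is why accuracy alone is a poor yardstick on imbalanced data.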

---

In [None]:
# checking for class imbalance
class_distribution = df["Target"].value_counts(normalize=True) * 100
print(class_distribution)

# Explanation of Handling Class Imbalance with SMOTE

## Code:

```python
# Handling class imbalance before splitting the data
from imblearn.over_sampling import SMOTE

# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Display the class distribution after applying SMOTE
y_resampled_distribution = pd.Series(y_resampled).value_counts(normalize=True) * 100
print(y_resampled_distribution)
```

## What This Code Does:

This code uses **SMOTE (Synthetic Minority Oversampling Technique)** to address the class imbalance in the dataset. SMOTE is a technique that generates synthetic samples for the minority class to balance the dataset. The code ensures that the target variable (`y`) has an equal or near-equal number of instances for each class.

### Step-by-Step Explanation:

1. **Import SMOTE**:
   - The code imports `SMOTE` from the `imblearn` library, which specializes in handling imbalanced datasets.

2. **Initialize SMOTE**:
   - `smote = SMOTE(random_state=42)` initializes the SMOTE object. The `random_state` parameter ensures reproducibility by setting a fixed random seed.

3. **Apply SMOTE**:
   - `smote.fit_resample(X, y)` oversamples the minority class by creating synthetic data points.
   - This results in a new dataset with balanced classes:
     - `X_resampled`: The new feature set with synthetic samples added.
     - `y_resampled`: The new target variable with balanced classes.

4. **Check Class Distribution**:
   - `pd.Series(y_resampled).value_counts(normalize=True) * 100` calculates the percentage distribution of each class in the resampled target variable.
   - `print(y_resampled_distribution)` displays the class distribution after applying SMOTE.

### Why Use SMOTE?

- **Balances the Dataset**: SMOTE generates synthetic samples instead of duplicating existing ones, ensuring a better representation of the minority class.
- **Improves Model Performance**: A balanced dataset helps machine learning models learn equally well for all classes, improving metrics like precision and recall for the minority class.
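- **Before Splitting the Data**: In this notebook SMOTE is applied before the train/test split, which balances both subsets. The drawback is that synthetic samples, interpolated from points that may end up in the training set, also land in the test set, so test metrics can look better than they would on real data. Splitting first and resampling only the training set avoids this leakage.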

### Notes:
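- While SMOTE improves class balance, it should be used carefully, especially for very small datasets, as it introduces synthetic data that may not fully represent the original distribution.
- Because SMOTE interpolates between feature vectors, it also produces fractional values for the integer-encoded categorical columns. `SMOTENC` from the same library is designed for data that mixes numerical and categorical features.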
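
The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the `imblearn` implementation: it skips the nearest-neighbor search and uses two hand-picked minority points with made-up values.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed

# Two minority-class points: [Credit_Score, integer-encoded Loan_Purpose].
point = np.array([600.0, 0.0])
neighbor = np.array([700.0, 3.0])

# SMOTE-style step: move a random fraction of the way toward the neighbor.
gap = rng.uniform(0, 1)
synthetic = point + gap * (neighbor - point)

print(synthetic)
```

The second coordinate generally lands between two category codes, which illustrates why plain SMOTE is a questionable fit for encoded categorical columns.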

---

In [None]:
# handling class imbalance before splitting the data
from imblearn.over_sampling import SMOTE

# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Display the class distribution after applying SMOTE
y_resampled_distribution = pd.Series(y_resampled).value_counts(normalize=True) * 100
print(y_resampled_distribution)

# Explanation of Splitting the Data into Training and Testing Sets

## Code:

```python
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets 
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
```

## What This Code Does:

This code splits the resampled dataset into two subsets:
1. **Training Set**: Used to train the machine learning model.
2. **Testing Set**: Used to evaluate the model's performance on unseen data.

### Step-by-Step Explanation:

1. **Import the Function**:
   - `train_test_split` is imported from `sklearn.model_selection`. This function simplifies the process of dividing data into training and testing subsets.

2. **Apply the Split**:
   - `train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)` performs the following:
     - **Inputs**:
       - `X_resampled` and `y_resampled`: The features and target variable after handling class imbalance.
       - `test_size=0.2`: Allocates 20% of the data to the testing set and 80% to the training set.
       - `random_state=42`: Ensures reproducibility by fixing the random seed.
     - **Outputs**:
       - `X_train` and `y_train`: The training set (features and target).
       - `X_test` and `y_test`: The testing set (features and target).

3. **Display the Shapes**:
   - `X_train.shape`, `y_train.shape`, `X_test.shape`, and `y_test.shape` show the dimensions of the resulting subsets, confirming the split proportions.

### Why Split the Data?

- **Train-Test Separation**: The model is trained on one subset of the data and evaluated on a different subset, so overfitting shows up as a gap between training and test performance instead of going unnoticed.
- **Testing Generalization**: Helps assess how well the model performs on unseen data, which is critical for real-world applications.
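
One caveat about ordering: the split above runs on `X_resampled`, so synthetic rows created by SMOTE can end up in the test set. A common alternative is to split first and resample only the training portion. The sketch below illustrates that ordering with the standard library only. It uses plain random oversampling in place of SMOTE (a deliberate simplification), and `split_indices` and `oversample_minority` are hypothetical helpers, not part of scikit-learn or imbalanced-learn.

```python
import random
from collections import Counter


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


def oversample_minority(labels, seed=42):
    """Hypothetical helper: duplicate random minority rows until classes match.

    Returns positions (into `labels`) that make up the balanced sample.
    """
    rng = random.Random(seed)
    by_class = {}
    for pos, label in enumerate(labels):
        by_class.setdefault(label, []).append(pos)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        # Draw extra rows with replacement to reach the majority class size.
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced


# Toy imbalanced labels: 90 "No Default" rows and 10 "Default" rows.
y = ["No Default"] * 90 + ["Default"] * 10

# 1) Split first, so the test set holds only original rows.
train_idx, test_idx = split_indices(len(y))
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]

# 2) Resample the training portion only.
balanced_positions = oversample_minority(y_train)
y_train_balanced = [y_train[p] for p in balanced_positions]

print("train before:", Counter(y_train))
print("train after: ", Counter(y_train_balanced))
print("test (untouched):", Counter(y_test))
```

With this ordering the test metrics describe performance on real customers at the original class ratio, which is usually what matters in deployment.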

---


In [None]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets 
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

# Explanation of Standardizing Numerical Features

## Code:

```python
# Standardize the numerical features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])
```

## What This Code Does:

This code standardizes the numerical features in the dataset, which means it transforms the data so that each feature has a mean of `0` and a standard deviation of `1`. Standardization is a common preprocessing step in machine learning to ensure that features are on a similar scale, which can improve the performance and convergence of certain models.

### Step-by-Step Explanation:

1. **Import the StandardScaler**:
   - The `StandardScaler` is imported from `sklearn.preprocessing`. It is a tool that standardizes features by removing the mean and scaling to unit variance.

2. **Initialize the Scaler**:
   - `scaler = StandardScaler()` creates an instance of the scaler.

3. **Fit and Transform Training Data**:
   - `scaler.fit_transform(X_train[numerical_features])` calculates the mean and standard deviation of each numerical feature in the training data and then standardizes the data using those values.
   - After this step, each numerical feature in the training set has:
     - Mean = 0
     - Standard deviation = 1

4. **Transform Testing Data**:
   - `scaler.transform(X_test[numerical_features])` uses the mean and standard deviation calculated from the training data to standardize the testing data.
   - **Note**: The testing data is only transformed (not fit) to avoid data leakage.

5. **Update the Datasets**:
   - The standardized values are saved back into `X_train` and `X_test` for the numerical features.
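
The fit-on-train, transform-on-test pattern described in the steps above can be sketched without scikit-learn. This is a minimal illustration for a single feature, assuming made-up income values; `fit_scaler` and `transform` are hypothetical helpers. Like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in values]


# Made-up Annual_Income values, for illustration only.
income_train = [32_000.0, 45_000.0, 50_000.0, 61_000.0, 72_000.0]
income_test = [40_000.0, 90_000.0]

# The statistics come from the training data only.
mean, std = fit_scaler(income_train)
train_scaled = transform(income_train, mean, std)

# The test data reuses the training statistics; it is never re-fitted.
test_scaled = transform(income_test, mean, std)

print("train mean/std:", round(fmean(train_scaled), 6), round(pstdev(train_scaled), 6))
print("test  mean/std:", round(fmean(test_scaled), 6), round(pstdev(test_scaled), 6))
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, and that is expected: they are measured against the training distribution.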

### Why Standardize Data?

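- **Improves Model Performance**: Scale-sensitive algorithms, such as logistic regression, support vector machines (SVMs), and neural networks, usually perform better when input features are on a similar scale. Tree-based models, including the Random Forest trained in the next step, are largely insensitive to feature scale, so this step matters most if one of those other models is used.
- **Speeds Up Training**: Gradient-based optimization algorithms typically converge faster on standardized data.
- **Prevents Bias**: In distance-based or regularized models, features with larger ranges can dominate others if not standardized, which may bias the model.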

---


In [23]:
# Standardize the numerical features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

# Explanation of Training and Evaluating the Model

## Code:

```python
# Training the model
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
from sklearn.metrics import classification_report, accuracy_score

# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))
print('+'*50)

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print('+'*50)
```

## What This Code Does:

This code trains a machine learning model using the **Random Forest Classifier** algorithm to classify the data into its target classes. It also evaluates the model's performance using a **classification report** and **accuracy score**.

### Step-by-Step Explanation:

1. **Import the Random Forest Classifier**:
   - `RandomForestClassifier` is imported from `sklearn.ensemble`. This algorithm builds multiple decision trees and combines their outputs for more accurate and stable predictions.

2. **Initialize the Model**:
   - `rf_classifier = RandomForestClassifier(random_state=42)` creates an instance of the classifier. The `random_state=42` ensures reproducibility of the results by fixing the random seed.

3. **Train the Model**:
   - `rf_classifier.fit(X_train, y_train)` trains the Random Forest Classifier on the training data (`X_train` and `y_train`). During this process:
     - The algorithm creates multiple decision trees.
     - Each tree is trained on a bootstrap sample of the rows and considers a random subset of the features at each split. This randomness decorrelates the trees and makes the ensemble more robust.

4. **Make Predictions**:
   - `y_pred = rf_classifier.predict(X_test)` predicts the target variable for the test set (`X_test`) using the trained model.

5. **Evaluate the Model**:
   - **Classification Report**:
     - `classification_report(y_test, y_pred)` generates a detailed report showing the precision, recall, F1-score, and support for each class.
       - **Precision**: How many of the predicted positive cases were actually positive.
       - **Recall**: How many of the actual positive cases were correctly predicted.
       - **F1-score**: The harmonic mean of precision and recall.
       - **Support**: The number of actual instances for each class.
   - **Accuracy Score**:
     - `accuracy_score(y_test, y_pred)` calculates the overall accuracy, which is the proportion of correctly classified instances.

6. **Print Results**:
   - The classification report and accuracy score are printed to provide insights into the model's performance.
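
To make the metric definitions above concrete, the following sketch computes them by hand for one class. It assumes a tiny made-up set of labels and treats `Default` as the positive class; `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive="Default"):
    """Compute precision, recall, F1, support, and accuracy for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": tp + fn,  # number of actual positive rows
        "accuracy": correct / len(pairs),
    }


# Made-up labels, for illustration only.
y_true_toy = ["Default", "Default", "Default", "No Default",
              "No Default", "No Default", "No Default", "Default"]
y_pred_toy = ["Default", "Default", "No Default", "No Default",
              "Default", "No Default", "No Default", "Default"]

metrics = binary_metrics(y_true_toy, y_pred_toy)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

`classification_report` prints the same quantities once per class, each time treating that class as the positive one.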

### Why Use Random Forest?

- **Accuracy**: Combines the predictions of multiple decision trees, reducing overfitting and improving accuracy.
- **Versatility**: Works well for both classification and regression problems.
- **Feature Importance**: Can identify the most important features for prediction.
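
As a rough sketch of the feature-importance point, a fitted `RandomForestClassifier` exposes impurity-based scores through its `feature_importances_` attribute. The example below assumes scikit-learn is installed and uses synthetic data from `make_classification`. The column names are borrowed from this dataset purely as labels; the values are not the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 5 features, 3 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=42,
)

# Labels only; the synthetic columns have no real connection to these names.
feature_names = ["Age", "Annual_Income", "Loan_Amount",
                 "Credit_Score", "Savings_Balance"]

demo_forest = RandomForestClassifier(n_estimators=50, random_state=42)
demo_forest.fit(X_demo, y_demo)

# Impurity-based importances, one score per feature, normalized to sum to 1.
importances = demo_forest.feature_importances_
ranked = sorted(zip(feature_names, importances),
                key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can overstate high-cardinality features, so they are best read as a first indication rather than a definitive ranking.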

### Notes:
- The accuracy score gives an overall measure of performance but should be considered alongside the precision, recall, and F1-score, especially for imbalanced datasets.

---

In [None]:
# training the model
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
from sklearn.metrics import classification_report, accuracy_score

# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))
print('+'*50)

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print('+'*50)

In [None]:
# Treat the categorical features as sensitive features,
# excluding Loan_Purpose and Education_Level
sensitive_features = [feature for feature in categorical_features if feature not in ['Loan_Purpose', 'Education_Level']]
print(sensitive_features)

In [32]:
# Use only two sensitive features: Employment_Status and Has_Credit_Card
slice_sensitive_features = ['Employment_Status', 'Has_Credit_Card']

# Explanation of AIValidator Help

## Code:

```python
from cimcon.AIValidator import AIValidator as avi

# Create an object of the class
ai_obj = avi()

# Display the available tests and functions
ai_obj.aiv_help()
```

## What This Code Does:

This code imports the `AIValidator` class from the `cimcon` library, initializes an instance of the class, and displays a list of supported tests and functions. It also provides guidance on how to access details for specific topics or functionalities.

### Step-by-Step Explanation:

1. **Import the AIValidator Class**:
   - `from cimcon.AIValidator import AIValidator as avi` imports the `AIValidator` class, which is renamed to `avi` for convenience.

2. **Initialize the AIValidator Object**:
   - `ai_obj = avi()` creates an instance of the `AIValidator` class. This object (`ai_obj`) is used to interact with the library and access its functions.

3. **Display Help Information**:
   - `ai_obj.aiv_help()` displays:
     - **Available Tests**: A list of tests that can be run, such as `Fairness`, `Interpretability`, and `LLMRiskAssessment`.
     - **Available Functions**: Methods provided by the class, such as `run_test`, `auth_config`, and `save_result`.
   - The output also shows how to get detailed information about any specific test or function:
     ```python
     ai_obj.aiv_help('<TopicName>')
     ```

### Example Output:

The output includes:

- **Tests**:
  - `Fairness`: Tests related to fairness in AI models.
  - `Interpretability`: Tests related to explaining model decisions.
  - `DataDriftAndDataQuality`: Checks for data drift and quality.
  - `LLMHallucination`: Assesses hallucinations in Large Language Models (LLMs).

- **Functions**:
  - `run_test`: Executes a specific test.
  - `auth_config`: Configures authentication settings.
  - `add_euc`: Adds EUC (End User Computing) details.
  - `save_result`: Saves the test results.

### Why Use `aiv_help()`?

- **Overview**: Quickly provides an overview of the library’s capabilities.
- **Guidance**: Explains how to access more detailed information for each test or function.

### Next Steps:

To explore a specific test or function, use the `aiv_help()` method with the topic name. For example:
```python
ai_obj.aiv_help('Fairness')
ai_obj.aiv_help('run_test')
```

---

In [None]:
from cimcon.AIValidator import AIValidator as avi
# Create an object of the class
ai_obj = avi()
ai_obj.aiv_help()

# Explanation of Running a Fairness Test Using AIValidator

## Code:

```python
params = {'X_test': X_test,
          'y_test': y_test,
          'y_pred': y_pred,
          'sensitive_feature': slice_sensitive_features,
          }

ai_obj.run_test("Fairness", **params)
```

## What This Code Does:

This code uses the `AIValidator` library to run a **Fairness Test** on the model's predictions. It evaluates how fair the model's predictions are with respect to one or more sensitive features.

### Step-by-Step Explanation:

1. **Define Parameters**:
   - A dictionary named `params` is created to hold the required inputs for the fairness test:
     - `X_test`: The test set features used for prediction.
     - `y_test`: The actual target values for the test set.
     - `y_pred`: The predicted target values generated by the model.
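     - `sensitive_feature`: The column name(s) of the feature(s) treated as sensitive. Here it is `slice_sensitive_features`, i.e. `Employment_Status` and `Has_Credit_Card`; in other settings it might be an attribute such as gender or race.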

2. **Run the Fairness Test**:
   - `ai_obj.run_test("Fairness", **params)` executes the fairness test:
     - `"Fairness"`: Specifies the type of test to run.
     - `**params`: Passes the parameters defined in the `params` dictionary as arguments to the function.

3. **What Happens in the Fairness Test**:
   - The test evaluates how the model's predictions vary based on the values of the sensitive feature(s).
   - It may compute fairness metrics like:
     - **Disparate Impact**: The ratio of favorable-prediction rates between groups.
     - **Equal Opportunity**: Compares true positive rates across groups.
     - **Demographic Parity**: Checks whether the rate of positive predictions is the same across groups, i.e. whether predictions are independent of the sensitive attribute.
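
How `AIValidator` computes its fairness metrics internally is not documented here. The sketch below only illustrates the textbook definitions of the three metrics named above. It assumes a made-up sample grouped by `Employment_Status` and treats a `No Default` prediction as the favorable outcome; `group_rates` is a hypothetical helper.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups, positive="No Default"):
    """Return the selection rate and true positive rate for each group."""
    stats = defaultdict(lambda: {"n": 0, "pred_pos": 0,
                                 "actual_pos": 0, "true_pos": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        s = stats[group]
        s["n"] += 1
        s["pred_pos"] += pred == positive
        s["actual_pos"] += truth == positive
        s["true_pos"] += truth == positive and pred == positive
    return {
        g: {
            "selection_rate": s["pred_pos"] / s["n"],
            "tpr": (s["true_pos"] / s["actual_pos"]
                    if s["actual_pos"] else float("nan")),
        }
        for g, s in stats.items()
    }


# Made-up sample, for illustration only.
groups_toy = ["Employed"] * 4 + ["Unemployed"] * 4
y_true_fair = ["No Default", "No Default", "Default", "No Default",
               "No Default", "Default", "No Default", "No Default"]
y_pred_fair = ["No Default", "No Default", "Default", "No Default",
               "No Default", "Default", "Default", "Default"]

rates = group_rates(y_true_fair, y_pred_fair, groups_toy)
selection = [r["selection_rate"] for r in rates.values()]
tprs = [r["tpr"] for r in rates.values()]

# Demographic parity: gap in favorable-prediction rates (0 means parity).
demographic_parity_diff = max(selection) - min(selection)
# Disparate impact: ratio of the lowest to the highest rate (1 means parity).
disparate_impact = min(selection) / max(selection)
# Equal opportunity: gap in true positive rates (0 means parity).
equal_opportunity_diff = max(tprs) - min(tprs)

print("per-group rates:", rates)
print("demographic parity difference:", demographic_parity_diff)
print("disparate impact ratio:", disparate_impact)
print("equal opportunity difference:", equal_opportunity_diff)
```

In this toy sample the unemployed group receives the favorable prediction far less often, so all three metrics flag a disparity.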

### Why Run a Fairness Test?

- **Bias Detection**: Identifies whether the model is biased against certain groups based on sensitive features.
- **Compliance**: Helps demonstrate that the model adheres to ethical standards and regulations.
- **Model Improvement**: Provides insights to refine the model for fairer predictions.

---

In [None]:
params = {'X_test': X_test,
          'y_test': y_test,
          'y_pred': y_pred,
          'sensitive_feature': slice_sensitive_features,
          }

ai_obj.run_test("Fairness", **params)