# Machine Learning Pipeline - Clean Code Version

This notebook demonstrates a comprehensive machine learning pipeline, including data preprocessing, feature engineering, model training, and evaluation for both binary and multi-class classification tasks. Below is an outline of each key section:

1. **Library Imports**: Loads all essential libraries for data handling, visualization, model training, evaluation, and saving.

2. **Data Loading and Cleaning**: Reads the dataset, standardizes column names, and applies initial data quality checks for missing and infinite values.

3. **Data Preprocessing**:
   - Categorical features are filled and encoded using binary encoding.
   - Missing values in numerical features are imputed.
   - Missing data indicators are created for further analysis.

4. **Feature Engineering**:
   - Calculates Variance Inflation Factor (VIF) to identify and remove features with high multicollinearity.
   - Applies Recursive Feature Elimination (RFE) with Linear Regression and RandomForest for feature selection.

5. **Multi-Class Strategy Setup**:
   - Defines a function to wrap models in appropriate multi-class strategies (`OneVsOne` or `OneVsRest`) when applicable.

6. **Model Evaluation Functions**:
   - `evaluate_model_single`: Evaluates binary classification models, displaying ROC and Precision-Recall curves, metrics, and confusion matrix.
   - `evaluate_model_multi`: Evaluates multi-class classification models with class-specific ROC and Precision-Recall curves.

7. **Column Removal**: Removes specified columns from training and test datasets to ensure only relevant features are included in modeling.

8. **Model Training and Evaluation**:
   - Iterates over models, evaluating each based on user-defined selection and multi-class strategy, with progress tracked by a progress bar.
   - Stores and displays the evaluation metrics, including accuracy, precision, recall, F1-score, ROC-AUC, and cross-validation accuracy.

9. **Results Summary and Visualization**:
   - Summarizes results in a DataFrame and saves them to a CSV file.
   - Plots a comparison of performance metrics across models for quick assessment.

10. **Model Saving and Reloading**:
    - Saves all trained models to disk for future use.
    - Demonstrates reloading saved models and making predictions to validate accuracy.

This pipeline is designed to handle both binary and multi-class problems, supports multiple models, and provides detailed performance analysis and visualization for decision-making.

# Let the Fun Begin

**Below code Block Explanation**: This block imports essential libraries required for data handling, encoding, visualization, machine learning models, feature selection, and evaluation metrics. Grouping imports helps keep the code organized, and importing them all at once avoids repetitive imports later in the code.

In [None]:
# === Import Required Libraries ===

# Data Manipulation and Preprocessing
import pandas as pd         # Core data manipulation library
import numpy as np          # Mathematical operations
import category_encoders as ce  # Encoding categorical variables

# Statistical Analysis
from scipy import stats
from scipy.stats import normaltest, shapiro, anderson, kstest, skew
from statsmodels.stats.outliers_influence import variance_inflation_factor  # For multicollinearity checks (VIF)

# Data Visualization
import matplotlib.pyplot as plt         # Basic plotting
import seaborn as sns                   # Advanced static visualizations with themes
import plotly.express as px             # Interactive plots
import missingno as msno                # Visualizing missing data

# Data Preprocessing and Encoding
from sklearn.preprocessing import StandardScaler, LabelEncoder, label_binarize  # Scaling and label encoding
from sklearn.model_selection import train_test_split, cross_val_score           # Data splitting and cross-validation

# Machine Learning Models
from sklearn.linear_model import LogisticRegression, LinearRegression           # Linear models for classification and regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor          # Decision Tree models
from sklearn.ensemble import (                                                 
    RandomForestClassifier, RandomForestRegressor,                              # Random Forest models
    GradientBoostingClassifier, GradientBoostingRegressor                       # Gradient Boosting models
)
from sklearn.svm import SVC, SVR                                               # Support Vector Machine for classification and regression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor        # K-Nearest Neighbors for classification and regression
from sklearn.naive_bayes import GaussianNB                                     # Naive Bayes for classification
from sklearn.cluster import KMeans                                             # KMeans clustering
from sklearn.decomposition import PCA                                          # Principal Component Analysis (PCA) for dimensionality reduction
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier         # Multi-class classification strategies

# Advanced Machine Learning Models
from xgboost import XGBClassifier         # Extreme Gradient Boosting
from lightgbm import LGBMClassifier       # Light Gradient Boosting Machine
from catboost import CatBoostClassifier   # CatBoost Gradient Boosting

# Feature Selection
from sklearn.feature_selection import RFE  # Recursive Feature Elimination (RFE) for feature selection

# Model Evaluation Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,      # Basic classification metrics
    roc_auc_score, confusion_matrix,                             # Advanced metrics and confusion matrix
    roc_curve, precision_recall_curve, average_precision_score    # Curve metrics for model evaluation
)

# Deep Learning with TensorFlow/Keras
from tensorflow.keras.models import Sequential          # Sequential model setup in Keras
from tensorflow.keras.layers import Dense               # Dense layers for neural networks

# Utilities
from tqdm import tqdm       # Progress bar for loops
import joblib               # For saving/loading models

**Below code Block Explanation**: This block loads the dataset and standardizes column names by removing extra spaces, converting to lowercase, and replacing spaces with underscores for better accessibility in code. Printing the cleaned headers and the shape of the DataFrame provides a quick verification that the data has loaded correctly.

In [None]:
# Load and Clean Data
df = pd.read_csv("c:\\Users\\kiera\\OneDrive\\Documents\\GitHub\\dsif-git-main-project\\elvtr_main_project\\data\\1-raw\\lending-club-2007-2020Q3\\Loan_status_2007-2020Q3-100ksample.csv")
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print("Cleaned headers:", df.columns.tolist())
print(df.shape)
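To see the header clean-up in isolation, here is a minimal sketch on a toy frame. The column names are invented for illustration and do not come from the Lending Club file.

```python
import pandas as pd

# Hypothetical headers with stray spaces and mixed case (not from the real dataset)
toy = pd.DataFrame(columns=[" Loan Amnt ", "Int Rate", "addr_state"])

# Same chain as in the cell above: trim, lowercase, then swap inner spaces for underscores
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_")

print(toy.columns.tolist())  # ['loan_amnt', 'int_rate', 'addr_state']
```

The order matters: stripping first means leading and trailing spaces do not turn into leading and trailing underscores.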

**Below code Block Explanation**: Here, the display options for pandas are set to show all columns and rows, which is useful during data exploration to get a complete view. Additionally, a white grid theme is applied to Seaborn plots, providing a consistent look for visualizations.

In [None]:
# Data Exploration and Display Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
sns.set_theme(style='whitegrid')

**Below code Block Explanation**: Now let's look at our Data Frame.

In [None]:
df.head()

Our data contains 143 columns and 99999 rows. It comprises numerical (float, int) and categorical (object) data.

Our target/Y feature is `loan_status`. Let's look through our feature list and determine which fields are of most value when predicting `loan_status`.

**Below Code Block Explanation**: Based on the definitions in the data dictionary, we've listed below all features that are pre-hardship flags and useful for our analysis.

This will form our initial `feature_list`.

In [None]:
feature_list = [
    'acc_now_delinq', 'acc_open_past_24mths', 'addr_state', 'all_util', 'annual_inc', 
    'annual_inc_joint', 'application_type', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 
    'chargeoff_within_12_mths', 'collections_12_mths_ex_med', 'delinq_2yrs', 'delinq_amnt', 
    'dti', 'dti_joint', 'earliest_cr_line', 'emp_length', 'emp_title', 
    'fico_range_high', 'fico_range_low', 'funded_amnt', 'funded_amnt_inv', 'grade', 
    'home_ownership', 'il_util', 'initial_list_status', 'inq_fi', 'inq_last_12m', 
    'inq_last_6mths', 'installment', 'int_rate', 'issue_d', 'loan_amnt', 'loan_status', 
    'max_bal_bc', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 
    'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_last_delinq', 
    'mths_since_last_major_derog', 'mths_since_last_record', 'mths_since_rcnt_il', 
    'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 
    'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 
    'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 
    'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_120dpd_2m', 'num_tl_30dpd', 
    'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'open_acc', 'open_acc_6m', 'open_il_12m', 
    'open_il_24m', 'open_act_il', 'open_rv_12m', 'open_rv_24m', 'out_prncp', 
    'out_prncp_inv', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'policy_code', 'pub_rec', 
    'pub_rec_bankruptcies', 'purpose', 'pymnt_plan', 'revol_bal', 'revol_util', 
    'sub_grade', 'tax_liens', 'term', 'title', 'tot_coll_amt', 'tot_cur_bal', 
    'tot_hi_cred_lim', 'total_acc', 'total_bal_ex_mort', 'total_bal_il', 'total_bc_limit', 
    'total_cu_tl', 'total_il_high_credit_limit', 'total_pymnt', 'total_pymnt_inv', 
    'total_rec_int', 'total_rec_late_fee', 'total_rec_prncp', 'total_rev_hi_lim', 
    'verification_status', 'zip_code'
]

**Below Code Block Explanation**: This block filters our data frame (`df`) to the `feature_list`.

In [None]:
print(f"Our data frame is comprised of (rows, cols): {df[feature_list].shape}")
df[feature_list].head(25)

**Below Code Block Explanation**: The code below creates a bar chart so we can evaluate our `loan_status` data.

In [None]:
# Calculate the value counts for loan status
loan_status_counts = df['loan_status'].value_counts()

# Plot with matplotlib
plt.figure(figsize=(7, 5))  # figure size in inches: width=7, height=5
loan_status_counts.plot(kind='bar')

# Set title and labels
plt.title("Loan Status Counts")
plt.xlabel("Loan Status")
plt.ylabel("Count")

# Show the plot
plt.show()

Let's simplify this list and create a more black and white view.

**Below Code Block Explanation**: This block creates a logical grouping for Paid Loans and Defaulted Loans. It then filters the data frame, retaining only the rows that contain Paid Loans or Defaulted Loans.

In [None]:
# Define the logical groupings for 'loan_status'
loan_status_groupings = {
    'Fully Paid': 'Paid Loan',
    'Does not meet the credit policy. Status:Fully Paid': 'Paid Loan',
    'Charged Off': 'Defaulted Loan',
    'Does not meet the credit policy. Status:Charged Off': 'Defaulted Loan',
    'Default': 'Defaulted Loan'
}

# Apply the grouping to the 'loan_status' column
df['loan_status_grouped_kn'] = df['loan_status'].replace(loan_status_groupings)

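# Retain only rows with 'Paid Loan' or 'Defaulted Loan'
# .copy() makes df an independent frame, so adding columns later does not trigger SettingWithCopyWarning
df = df[df['loan_status_grouped_kn'].isin(['Paid Loan', 'Defaulted Loan'])].copy()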

# Verify the groupings
print(df['loan_status_grouped_kn'].value_counts())
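The group-then-filter step can also be sketched with `Series.map`, which returns `NaN` for any status missing from the dictionary. This is only an illustrative alternative on toy data, not the approach used in the notebook. The `'Current'` value stands in for any status outside the two groups.

```python
import pandas as pd

# Toy statuses; 'Current' is an assumed example of a status we do not want to keep
statuses = pd.Series(["Fully Paid", "Charged Off", "Current", "Default"])

groupings = {
    "Fully Paid": "Paid Loan",
    "Charged Off": "Defaulted Loan",
    "Default": "Defaulted Loan",
}

# map() yields NaN for keys absent from the dict, so unmapped rows are easy to drop
grouped = statuses.map(groupings)
kept = grouped.dropna()

print(kept.value_counts())
```

With `replace`, unmapped statuses keep their original label and need the explicit `isin` filter. With `map`, they become missing and a `dropna` is enough.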

In [None]:
# Newly created columns
new_columns = ['loan_status_grouped_kn']  # Replace with columns name

# Add new columns to feature_list if they're not already in the list
feature_list.extend([col for col in new_columns if col not in feature_list])

In [None]:
df.shape

**Below Code Block Explanation**: This block visualizes the distribution of loan amounts for each loan status category using Kernel Density Estimation (KDE) plots. It iterates over the unique values of loan_status_grouped_kn and creates a filled KDE plot for each category (Paid Loan or Defaulted Loan). The resulting graph allows for a comparison of loan amount distributions between different loan statuses, providing insights into any potential differences in loan amount trends across the two groups.

In [None]:
plt.figure(figsize=(10, 6))
for status in df['loan_status_grouped_kn'].unique():
    sns.kdeplot(df[df['loan_status_grouped_kn'] == status]['loan_amnt'], label=status, fill=True)
plt.title('Distribution of Loan Amount by Loan Status')
plt.xlabel('Loan Amount')
plt.ylabel('Density')
plt.legend(title="Loan Status")
plt.show()

Let's look at `loan_status` against employment length.

In [None]:
df['emp_length'].info()

In [None]:
df['emp_length'].value_counts()

Here we can see that `emp_length` is a string value. We'll change this later on, but for our initial analysis let's create a cleaned numeric copy, `emp_length_cleaned`, so we can compare it against loan status.

**Below Code Block Explanation**: This block processes the emp_length column, converting it to numeric values and handling missing data. It then creates a crosstab of emp_length_cleaned and loan_status_grouped_kn, calculates the percentage distribution of loan statuses for each employment length, and formats these percentages to two decimal places with a percentage sign. The resulting table (emp_length_percentage) shows the loan status distribution across different employment lengths, sorted in descending order.

In [None]:
# Set the option to handle future downcasting behavior
pd.set_option('future.no_silent_downcasting', True)

# Map the emp_length strings to integer years; 'n/a' is treated as missing
df['emp_length_cleaned'] = df['emp_length'].replace({
    '10+ years': 10,
    '9 years': 9, '8 years': 8, '7 years': 7, '6 years': 6, '5 years': 5,
    '4 years': 4, '3 years': 3, '2 years': 2, '1 year': 1, '< 1 year': 0,
    'n/a': None  # Assuming 'n/a' represents missing values
}).astype('Int64')  # Use 'Int64' for integer with support for NaN

# Drop any NaN values in emp_length_cleaned if necessary
df = df.dropna(subset=['emp_length_cleaned'])

# Create a crosstab of loan_status_grouped_kn and emp_length_cleaned
emp_length_counts = pd.crosstab(df['emp_length_cleaned'], df['loan_status_grouped_kn'])

# Calculate the percentage for each loan status within each employment length year
emp_length_percentage = emp_length_counts.div(emp_length_counts.sum(axis=1), axis=0) * 100

# Sort the index of emp_length_percentage in descending order
emp_length_percentage = emp_length_percentage.sort_index(ascending=False)

# Format each column to two decimal places with a % sign using map
for col in emp_length_percentage.columns:
    emp_length_percentage[col] = emp_length_percentage[col].map(lambda x: f"{x:.2f} %")

# Display the resulting table
emp_length_percentage
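The row-percentage step above can also be expressed with the `normalize` argument of `pd.crosstab`. The following is a small sketch on invented data, assuming the same column names as the notebook.

```python
import pandas as pd

# Toy data: two loans at 0 years of employment, four at 10+ years
toy = pd.DataFrame({
    "emp_length_cleaned": [0, 0, 10, 10, 10, 10],
    "loan_status_grouped_kn": [
        "Paid Loan", "Defaulted Loan",
        "Paid Loan", "Paid Loan", "Paid Loan", "Defaulted Loan",
    ],
})

# normalize='index' divides each row by its row total, giving shares per employment length
share = pd.crosstab(
    toy["emp_length_cleaned"], toy["loan_status_grouped_kn"], normalize="index"
) * 100

print(share.round(2))
```

Keeping the values numeric, rather than formatting them as strings with a `%` sign, means the table can still be sorted or plotted afterwards.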

Now that we've finished our initial EDA, let's compare some of the categorical features with our target variable.

In [None]:
# Newly created columns
new_columns = ['emp_length_cleaned']  # Replace with columns name

# Add new columns to feature_list if they're not already in the list
feature_list.extend([col for col in new_columns if col not in feature_list])

**Below Code Block Explanation**: This block generates a cross-tabulation between loan_status_grouped_kn (loan status) and purpose. It uses pd.crosstab() to calculate the number of loans for each combination of loan status and loan purpose, resulting in a summary table (comparison_loan_status_purpose). This table provides insight into how different loan purposes relate to loan outcomes, helping to identify patterns between the purpose of the loan and its status (e.g., "Paid Loan" or "Defaulted Loan").

In [None]:
# Create a cross-tabulation of purpose and loan_status
comparison_loan_status_purpose = pd.crosstab(df['loan_status_grouped_kn'], df['purpose'])

# Display the result
comparison_loan_status_purpose

**Below Code Block Explanation**: This block sorts loan purposes by total loan counts across all statuses in descending order. It then creates a stacked bar chart to visualize the distribution of loan statuses for each purpose (comparison_loan_status_purpose_sorted). Labels and a title are added for clarity, and x-axis labels are rotated for readability. The stacked chart helps highlight how each loan status (e.g., "Paid Loan" or "Defaulted Loan") contributes to the total loan count for each purpose.

In [None]:
# Sort data by total loan counts across all statuses for each purpose (descending order)
sorted_data = comparison_loan_status_purpose.sum(axis=0).sort_values(ascending=False)
sorted_columns = sorted_data.index
comparison_loan_status_purpose_sorted = comparison_loan_status_purpose[sorted_columns]

# Plot a stacked bar chart
fig, ax = plt.subplots(figsize=(12, 8))

# Plot each loan status as a stacked segment
comparison_loan_status_purpose_sorted.T.plot(kind='bar', stacked=True, ax=ax)

# Set labels and title
ax.set_xlabel('Purpose')
ax.set_ylabel('Number of Loans')
ax.set_title('Distribution of Loan Statuses by Loan Purpose')

# Rotate x-axis labels for readability
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.legend(title='Loan Status')

# Display the plot
plt.show()


**Below Code Block Explanation**: This block generates a cross-tabulation between loan_status_grouped_kn (loan status) and verification_status. Using pd.crosstab(), it calculates the count of loans for each combination of loan status and verification status, resulting in a summary table (comparison_loan_status_ver_status). This table helps in understanding the distribution of different loan statuses (e.g., "Paid Loan" or "Defaulted Loan") across various verification statuses, providing insights into how verification affects loan outcomes.

In [None]:
# Create a cross-tabulation of loan status and verification_status
comparison_loan_status_ver_status = pd.crosstab(df['loan_status_grouped_kn'], df['verification_status'])

# Display the result
comparison_loan_status_ver_status

**Below Code Block Explanation**: This block visualizes the distribution of loan statuses (loan_status_grouped_kn) across different verification statuses using a stacked bar chart. It first transposes the comparison_loan_status_ver_status DataFrame so that each verification status becomes one bar, with the loan statuses as stacked segments. The chart is labeled with appropriate axis labels (Verification Status and Number of Loans) and a title (Distribution of Loan Statuses by Verification Status). The x-axis labels are kept horizontal (rotation=0) because the category names are short, and tight_layout() ensures that all elements fit properly within the figure. A legend is included to indicate the different loan statuses in the stacked bars.

In [None]:
import matplotlib.pyplot as plt

# Plot a stacked bar chart
fig, ax = plt.subplots(figsize=(10, 6))

# Transpose the DataFrame for easier plotting and plot each loan status as a stacked segment
comparison_loan_status_ver_status.T.plot(kind='bar', stacked=True, ax=ax)

# Set labels and title
ax.set_xlabel('Verification Status')
ax.set_ylabel('Number of Loans')
ax.set_title('Distribution of Loan Statuses by Verification Status')

# Keep x-axis labels horizontal; the verification statuses are short enough to read
plt.xticks(rotation=0)
plt.tight_layout()
plt.legend(title='Loan Status')

# Display the plot
plt.show()

This is an interesting view: for Not Verified and Source Verified, charged-off loans make up roughly 15-20% of the loans in each status, while for Verified they account for roughly 23%. I was expecting more of the charged-off loans to sit in the Not Verified `verification_status`.

**Below Code Block Explanation**: This block creates a cross-tabulation between loan_status_grouped_kn (loan status) and addr_state (state). It uses pd.crosstab() to count the number of loans for each combination of loan status and state, providing a summary table that shows how loan statuses (such as "Paid Loan" or "Defaulted Loan") are distributed across different states. The resulting DataFrame, comparison_loan_status_addr_state, helps in understanding patterns or variations in loan outcomes by geographic location.

In [None]:
# Create a cross-tabulation of loan status and addr_state
comparison_loan_status_addr_state = pd.crosstab(df['loan_status_grouped_kn'], df['addr_state'])

# Display the result
comparison_loan_status_addr_state

**Below Code Block Explanation**: This block visualizes the number of defaulted loans by state. It first filters the data to only include loans with a Defaulted Loan status and sorts the states in descending order based on the number of defaulted loans. A bar plot is then generated to display this information, using salmon-colored bars for better visual appeal. The plot includes labels for the x-axis (State) and y-axis (Number of Defaulted Loans), as well as a title. The x-axis labels are rotated for readability, and tight_layout() is applied to ensure the plot elements fit well within the figure.

In [None]:
# Select the 'Defaulted Loan' row of the crosstab (defaulted loan counts per state)
defaulted_by_state = comparison_loan_status_addr_state.loc['Defaulted Loan']

# Sort the data by the number of defaulted loans in descending order
defaulted_by_state_sorted = defaulted_by_state.sort_values(ascending=False)

# Plot the data
plt.figure(figsize=(12, 8))
defaulted_by_state_sorted.plot(kind='bar', color='salmon', edgecolor='black')

# Set labels and title
plt.xlabel("State")
plt.ylabel("Number of Defaulted Loans")
plt.title("Number of Defaulted Loans by State (Sorted)")

# Rotate x-axis labels for readability
plt.xticks(rotation=45, ha='right')
plt.tight_layout()

# Show the plot
plt.show()


In [None]:
import matplotlib.pyplot as plt

# Group data by 'loan_status' and 'addr_state' and count occurrences
grouped_data = df.groupby(['addr_state', 'loan_status_grouped_kn']).size().unstack()

# Sort the data by the total count of loan statuses in descending order
grouped_data = grouped_data.loc[grouped_data.sum(axis=1).sort_values(ascending=False).index]

# Plot the bar chart
grouped_data.plot(kind='bar', stacked=True, figsize=(10, 7))

# Adding labels and title
plt.title('Loan Status by Address State')
plt.xlabel('Address State')
plt.ylabel('Count of Loan Status')
plt.legend(title='Loan Status')

# Display the plot
plt.show()


**Below Code Block Explanation**: This block calculates and displays the percentage of loans that defaulted for each state. It first calculates the total number of loans for each state and then extracts the number of loans that defaulted. The percentage of defaulted loans is calculated by dividing the number of defaulted loans by the total number of loans for each state, multiplying by 100. Finally, a DataFrame is created to combine the total loans, defaulted loans, and percentage of defaulted loans for easy viewing, sorted by the percentage of defaulted loans in descending order to highlight states with the highest default rates.

In [None]:
# Calculate the total number of loans for each state
total_loans_by_state = comparison_loan_status_addr_state.sum(axis=0)

# Extract the number of defaulted loans for each state
defaulted_loans_by_state = comparison_loan_status_addr_state.loc['Defaulted Loan']

# Calculate the percentage of defaulted loans
defaulted_percentage_by_state = (defaulted_loans_by_state / total_loans_by_state) * 100

# Combine into a DataFrame for easy viewing
defaulted_percentage_df = pd.DataFrame({
    'Total Loans': total_loans_by_state,
    'Defaulted Loans': defaulted_loans_by_state,
    '% Defaulted': defaulted_percentage_by_state
})

# Display the result
defaulted_percentage_df = defaulted_percentage_df.sort_values(by='% Defaulted', ascending=False)
defaulted_percentage_df
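The same default-rate table can be sketched directly from the loan-level rows with a `groupby`, without building the crosstab first. The data below is invented, and `total_loans`, `defaulted_loans`, and `pct_defaulted` are illustrative names rather than the notebook's own.

```python
import pandas as pd

# Toy data: four loans in CA (one defaulted), two in TX (one defaulted)
toy = pd.DataFrame({
    "addr_state": ["CA", "CA", "CA", "CA", "TX", "TX"],
    "loan_status_grouped_kn": [
        "Paid Loan", "Paid Loan", "Paid Loan", "Defaulted Loan",
        "Paid Loan", "Defaulted Loan",
    ],
})

# Boolean flag per loan; summing booleans counts the True values
is_default = toy["loan_status_grouped_kn"].eq("Defaulted Loan")

summary = is_default.groupby(toy["addr_state"]).agg(
    total_loans="count", defaulted_loans="sum"
)
summary["pct_defaulted"] = summary["defaulted_loans"] / summary["total_loans"] * 100
summary = summary.sort_values("pct_defaulted", ascending=False)

print(summary)
```

When reading such a table, it helps to look at the total alongside the percentage: a state with very few loans can show an extreme default rate from only one or two defaults.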


#### Key Observations
- **Top States**: The states with the highest loan counts include **CA (California)**, **TX (Texas)**, **NY (New York)**, and **FL (Florida)**. These states exhibit a high volume of loans, likely due to their larger populations and economic activities.
- **Completed Loans**: The **Completed** loan status (blue segment) constitutes a significant portion in most states, suggesting a high rate of loan completion across regions.
- **In Progress Loans**: The **In Progress** loan status (green segment) appears prominently in states with higher loan volumes, indicating ongoing loan activities.
- **Defaulted and Late Loans**: **Defaulted** (orange) and **Late** (red) loans make up smaller portions of the overall loan distribution. However, states with higher loan counts (e.g., CA, TX, NY) also show relatively higher counts in these categories.


# Data split for Analysis

In [None]:
def split_data_frame(features_list, df):
    """
    Splits the provided DataFrame into three lists containing Boolean, Numerical, and Categorical column names.
    Converts floats with trailing zeros into integers and replaces NaN values with 0 for integers, 0.00 for floats.

    Parameters:
    features_list (list): List of column names to be checked.
    df (pd.DataFrame): The input DataFrame to split.

    Returns:
    tuple: A tuple containing three lists (boolean_cols, numerical_cols, categorical_cols).
    """
    boolean_cols = []
    numerical_cols = []
    categorical_cols = []

    # Define acceptable boolean values
    acceptable_boolean_values = {0, 1, True, False, 0.0, 1.0}

    for col in features_list:
        # Treat each column explicitly as a Series
        column_series = df[col]

        # Handle cases where columns might be interpreted incorrectly
        if pd.api.types.is_bool_dtype(column_series) or all(column_series.dropna().isin(acceptable_boolean_values)):
            boolean_cols.append(col)
        elif pd.api.types.is_numeric_dtype(column_series):
            # Check for floats with trailing zeros
            if column_series.dtype == 'float64':
                # Check if all float values are equivalent to integers
                if all(column_series.dropna() == column_series.dropna().astype(int)):
                    df[col] = column_series.fillna(0).astype(int)  # Replace NaNs with 0 and convert to int
                else:
                    df[col] = column_series.fillna(0.00)  # Replace NaNs with 0.00 for floats
                numerical_cols.append(col)
            else:
                df[col] = column_series.fillna(0)  # Replace NaNs with 0 for integers
                numerical_cols.append(col)
        else:
            categorical_cols.append(col)
    
    # Print a summary of the count of columns in each list
    print("Summary of column counts:")
    print(f"boolean_list contains {len(boolean_cols)} values")
    print(f"numerical_list contains {len(numerical_cols)} values")
    print(f"categorical_list contains {len(categorical_cols)} values")
    # Use the features_list parameter here, not the global feature_list
    print(f"The data frame we'll continue our analysis with contains (rows, cols): {df[features_list].shape}")

    return boolean_cols, numerical_cols, categorical_cols

# Instructions:
# Call the split_data_frame function above and capture its three output lists.
# These lists are used later to plot graphs that help evaluate the data and spot transformations we may have missed.
# Example: boolean_list, numerical_list, categorical_list = split_data_frame(new_features, df_dropped)


In [None]:
boolean_list, numerical_list, categorical_list = split_data_frame(feature_list, df)
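To make the sorting rule easier to follow, here is a reduced sketch of the same idea on a toy frame. The helper name `classify_columns` is hypothetical, and unlike `split_data_frame` it neither fills missing values nor modifies the frame it is given.

```python
import pandas as pd

def classify_columns(frame, columns):
    """Sort column names into boolean-like, numerical, and categorical lists."""
    boolean_cols, numerical_cols, categorical_cols = [], [], []
    for col in columns:
        values = frame[col].dropna()
        # Boolean-like: a bool dtype, or a non-empty column holding only 0/1
        if pd.api.types.is_bool_dtype(frame[col]) or (
            len(values) > 0 and values.isin([0, 1]).all()
        ):
            boolean_cols.append(col)
        elif pd.api.types.is_numeric_dtype(frame[col]):
            numerical_cols.append(col)
        else:
            categorical_cols.append(col)
    return boolean_cols, numerical_cols, categorical_cols

# Invented columns, one per category
toy = pd.DataFrame({
    "flag": [0, 1, 1],
    "amount": [100.5, 250.0, None],
    "grade": ["A", "B", None],
})

print(classify_columns(toy, toy.columns))  # (['flag'], ['amount'], ['grade'])
```

The `len(values) > 0` guard is a deliberate difference in this sketch: without it, a column that is entirely missing passes the 0/1 test vacuously and is labeled boolean.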

In [None]:
df_clean = df[feature_list].copy()

In [None]:
df_clean.shape

In [None]:
df_clean[boolean_list].head()

In [None]:
df_clean[numerical_list].head()

In [None]:
df_clean[categorical_list].head()

# Missing Value Analysis

**Below code Block Explanation**: This block identifies columns with missing values and sorts them in descending order by count. It provides a clear view of the extent of missing data, aiding decisions on handling or imputing missing values based on the proportion of NaNs in each column.

In [None]:
# Check for Missing Values
nan_list = df_clean.isna().sum()
if nan_list.sum() == 0:
    print("No column has NaN values")
else:
    print("Columns with NaN values (sorted high to low):")
    print(nan_list[nan_list > 0].sort_values(ascending=False))
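Raw counts are easier to judge next to the share of rows they represent. A minimal sketch, on invented data, of reporting missingness as a percentage:

```python
import numpy as np
import pandas as pd

# Toy frame: emp_title is missing in 2 of 4 rows, dti in 1 of 4, grade in none
toy = pd.DataFrame({
    "emp_title": ["nurse", None, None, "teacher"],
    "dti": [10.0, np.nan, 12.5, 8.0],
    "grade": ["A", "B", "C", "D"],
})

# isna().mean() gives the fraction of missing rows per column
missing_pct = toy.isna().mean().mul(100)
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

print(missing_pct)
```

Here `emp_title` comes out at 50% and `dti` at 25%, while `grade` is dropped from the report because it has no missing values.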


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def plot_missing_value_analysis(df, feature_list, target_column='loan_status_grouped_kn'):
    # Check if there are any missing values in the specified columns
    missing_cols = [col for col in feature_list if df[col].isnull().any()]
    if not missing_cols:
        print("There are no null values to analyse. Implying that NaN values have been transformed (0, mean, median, mode, etc.).")
        return

    # Initialize dictionaries to store results for plotting
    missing_dict = {}
    not_missing_dict = {}

    # Function to collect percentages for missing and non-missing data
    def missing_value_analysis(column):
        missing = df[df[column].isnull()][target_column].value_counts(normalize=True) * 100
        not_missing = df[df[column].notnull()][target_column].value_counts(normalize=True) * 100
        missing_dict[column] = missing
        not_missing_dict[column] = not_missing

    # Apply the function for all columns in missing_cols
    for col in missing_cols:
        missing_value_analysis(col)

    # Create DataFrames for heatmaps
    missing_df = pd.DataFrame(missing_dict).fillna(0)  # Fill NaN with 0 for heatmap display
    not_missing_df = pd.DataFrame(not_missing_dict).fillna(0)

    # Plotting heatmaps one below the other
    fig, ax = plt.subplots(2, 1, figsize=(12, 16), gridspec_kw={'height_ratios': [1, 1]})  # Adjust aspect ratio

    # Heatmap for missing data
    sns.heatmap(missing_df, annot=False, cmap="Blues", ax=ax[0], cbar_kws={"shrink": .75})
    ax[0].set_title('Percentage of Loan Status for Missing Data')
    ax[0].tick_params(axis='x', rotation=90, labelsize=10)  # Rotate x-axis labels for readability
    ax[0].tick_params(axis='y', labelsize=10)  # Adjust y-axis label size

    # Heatmap for non-missing data
    sns.heatmap(not_missing_df, annot=False, cmap="Greens", ax=ax[1], cbar_kws={"shrink": .75})
    ax[1].set_title('Percentage of Loan Status for Non-Missing Data')
    ax[1].tick_params(axis='x', rotation=90, labelsize=10)  # Rotate x-axis labels for readability
    ax[1].tick_params(axis='y', labelsize=10)  # Adjust y-axis label size

    # Adjust layout to prevent overlap
    plt.tight_layout()
    plt.show()

# Example usage
# plot_missing_value_analysis(df, numerical_list, target_column='loan_status_grouped_kn')
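The heatmaps rest on one comparison: the target distribution for rows where a column is missing versus rows where it is present. For a single column, that comparison can be sketched without any plotting, as below. The data is invented and the label `delinq_info` is an illustrative name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mths_since_last_delinq": [np.nan, np.nan, 5.0, 12.0, 30.0, np.nan],
    "loan_status_grouped_kn": [
        "Paid Loan", "Paid Loan", "Paid Loan",
        "Defaulted Loan", "Defaulted Loan", "Defaulted Loan",
    ],
})

# Label each row by whether the feature is missing
is_missing = (
    toy["mths_since_last_delinq"]
    .isna()
    .map({True: "missing", False: "present"})
    .rename("delinq_info")
)

# Share of each loan status within the missing and present groups, in percent
comparison = (
    toy.groupby(is_missing)["loan_status_grouped_kn"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
    * 100
)

print(comparison.round(1))
```

If the two rows of such a table differ noticeably, missingness itself may carry information about the target, which is the motivation for missing-value indicator columns.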

In [None]:
df['loan_status_grouped_kn'].head()

In [None]:
plot_missing_value_analysis(df_clean, categorical_list, target_column='loan_status_grouped_kn')

In [None]:
plot_missing_value_analysis(df_clean, numerical_list, target_column='loan_status_grouped_kn')

In [None]:
import missingno as msno

def visualize_missing_data(df, feature_list):
    """
    Visualize missing data in specified columns of a DataFrame.

    Parameters:
    - df (pd.DataFrame): The DataFrame to analyze.
    - feature_list (list): List of columns to check for missing values.

    Returns:
    - None. Displays visualizations of missing data if present.
    """
    # Identify columns from feature_list with missing values
    missing_values = df[feature_list].isnull().sum()
    missing_cols = missing_values[missing_values > 0].index

    # Check if there are any missing values to display
    if not missing_cols.empty:
        print("Categorical Data Missing Values\n")
        
        # Filter DataFrame to include only columns with missing values
        missing_values_graph = df[missing_cols]
        
        # Visualize the missing data using the missingno library
        msno.matrix(missing_values_graph)
        msno.bar(missing_values_graph)
        msno.heatmap(missing_values_graph)
    else:
        print("No columns with missing values found in the specified feature list.")

# Example usage
# visualize_missing_data(df, feature_list)


In [None]:
visualize_missing_data(df, categorical_list)

In [None]:
# Visualize the missing data using the missingno library
msno.matrix(df)
msno.bar(df)
msno.heatmap(df)
# msno.dendrogram(df)  # Removed from the final analysis: it shows the same data in a different form and clutters the document

**Below code Block Explanation**: This block generates binary indicators for missing values in the categorical (object-typed) columns, adding one `<column>_missing` flag per column that marks where data was missing. These indicators can be useful as features when the fact that a value is missing is itself informative.

In [None]:
# Create Missing Data Indicators for Categorical Columns
def create_missing_indicators(df):
    missing_indicators = pd.DataFrame(index=df.index)  # Create an empty DataFrame to store missing indicators
    
    # Loop through all columns in df
    for col in df.columns:
        # Check if column is categorical
        if df[col].dtype == 'object':
            # Create missing indicator for the categorical column
            missing_indicators[f"{col}_missing"] = df[col].isna().astype(int)
    
    # Concatenate the original DataFrame with the missing indicators
    df_with_indicators = pd.concat([df, missing_indicators], axis=1)
    
    return df_with_indicators

# Run the function on the DataFrame to create missing indicators for categorical columns
df_with_missing_indicators = create_missing_indicators(df)

# Display the first few rows of the resulting DataFrame
print(df_with_missing_indicators.head())
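
To make the effect of the indicator columns concrete, here is a minimal sketch on a hypothetical toy frame (the column names and values are illustrative, not taken from the dataset). It repeats the same idea as `create_missing_indicators` and shows that only object-typed columns receive a flag.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "emp_title": pd.Series(["nurse", None, "teacher", None], dtype=object),
    "annual_inc": [52000.0, 61000.0, np.nan, 48000.0],
})

# One 0/1 flag per object-typed column, recorded before any imputation.
flags = pd.DataFrame(index=toy.index)
for col in toy.columns:
    if toy[col].dtype == "object":
        flags[f"{col}_missing"] = toy[col].isna().astype(int)

toy_with_flags = pd.concat([toy, flags], axis=1)
print(toy_with_flags)
```

Note that `annual_inc` also contains a missing value but gets no flag, because the loop only looks at object columns.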

**Below code Block Explanation**: This block preprocesses categorical variables by filling missing values with "Other" and applying binary encoding. Binary encoding is used here to handle high-cardinality categorical features efficiently, making the encoded features suitable for machine learning models. The target column (loan_status) is excluded from encoding to avoid unintended transformations.

In [None]:
# Preprocess Categorical Variables with Progress Bar
# Work on the frame that carries the *_missing indicators so they are kept in X_encoded
categorical_columns = [col for col in df_with_missing_indicators.select_dtypes(include=['object', 'category']).columns if col != 'loan_status']
for col in tqdm(categorical_columns, desc="Filling missing values in categorical columns"):
    # Fill one column per iteration so the progress bar reflects the actual work done
    df_with_missing_indicators[col] = df_with_missing_indicators[col].fillna("Other")

binary_encoder = ce.BinaryEncoder(cols=categorical_columns, drop_invariant=True)
X_encoded = binary_encoder.fit_transform(df_with_missing_indicators.drop(columns=['loan_status']))

print(f"DataFrame shape after categorical preprocessing: {X_encoded.shape}")
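
The idea behind binary encoding can be sketched in plain Python. The helper `binary_encode` below is hypothetical and written only for illustration; the ordinal assignment used by `ce.BinaryEncoder` may differ, so treat this as a sketch of the principle rather than a reproduction of the library's output.

```python
def binary_encode(values):
    """Map each distinct category to an integer, then spread its bits over columns."""
    categories = sorted(set(values))
    # Ordinals start at 1 here; this is a choice made for the sketch.
    ordinal = {cat: i for i, cat in enumerate(categories, start=1)}
    # Number of bit columns needed to represent the largest ordinal.
    n_bits = len(categories).bit_length()
    rows = []
    for value in values:
        code = ordinal[value]
        # Most significant bit first.
        rows.append([(code >> shift) & 1 for shift in range(n_bits - 1, -1, -1)])
    return ordinal, rows


# Hypothetical category values, including the "Other" fill value.
purposes = ["car", "wedding", "car", "house", "medical", "Other"]
mapping, encoded = binary_encode(purposes)

for value, bits in zip(purposes, encoded):
    print(f"{value:>8} -> {bits}")
```

Five distinct categories fit into three bit columns here, whereas one-hot encoding would need five. That gap grows quickly with cardinality, which is the efficiency argument made above.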

**Below code Block Explanation**: This code fills any remaining missing values in numerical columns with zero, ensuring there are no NaN values in the dataset, which might disrupt model training. Zero imputation is quick, but it can distort features for which zero is either a meaningful value or an implausible one, so it may not be suitable for every column.

In [None]:
# Fill Missing Numerical Values with Progress Bar
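# Keep X_encoded as a DataFrame; collect the numeric column names separately
numerical_columns = X_encoded.select_dtypes(include=['number']).columns
for col in tqdm(numerical_columns, desc="Imputing missing numerical values"):
    X_encoded[col] = X_encoded[col].fillna(0)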

print(f"DataFrame shape after numerical preprocessing: {X_encoded.shape}")
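
As a rough illustration of why zero imputation may not suit every column, the sketch below compares zero fill with median fill on a hypothetical series (the values are made up, and median fill is shown only for comparison; it is not what the notebook does).

```python
import numpy as np
import pandas as pd

# Hypothetical debt-to-income values with two gaps.
dti = pd.Series([12.0, 18.5, np.nan, 22.0, np.nan, 15.5], name="dti")

zero_filled = dti.fillna(0)
median_filled = dti.fillna(dti.median())

print("mean of observed values:", dti.mean())
print("mean after zero fill:   ", zero_filled.mean())
print("mean after median fill: ", median_filled.mean())
```

Zero fill pulls the column mean well below the mean of the observed values, while median fill leaves it close to the observed centre. Which behaviour is preferable depends on what a missing value means for the feature.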

**Below code Block Explanation**: This function checks the DataFrame for infinite values, which can disrupt calculations and model training. If any columns contain infinite values, it lists them; otherwise, it confirms no infinite values exist. This is a helpful quality check step before proceeding with further data processing.

In [None]:
# Check for Infinite Values
def check_infinity(df):
    try:
        # Apply np.isinf only to numeric columns
        numeric_cols = df.select_dtypes(include=[np.number])
        infinite_mask = np.isinf(numeric_cols)
        infinite_list = infinite_mask.sum()
        if infinite_list.sum() == 0:
            print("No column has infinite values")
        else:
            print("Columns with infinite values:")
            print(infinite_list[infinite_list > 0].sort_values(ascending=False))
    except Exception as e:
        # Identify the columns that may be causing the error
        problematic_cols = []
        for col in df.columns:
            try:
                np.isinf(df[col])
            except TypeError:
                problematic_cols.append(col)
        
        print(f"An error occurred while checking for infinite values: {e}")
        if problematic_cols:
            print(f"The following columns may be causing the issue due to incompatible types: {problematic_cols}")

check_infinity(X_encoded)
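
Infinite values typically come from divisions by zero in derived ratios. The following sketch, on a hypothetical toy frame, shows how such a value appears, how it is counted with `np.isinf`, and one possible way to neutralise it by converting it to NaN (an assumption for illustration; the notebook itself only reports the columns).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; a ratio feature divides by a column that contains zeros.
toy = pd.DataFrame({
    "installment": [250.0, 310.0, 0.0],
    "annual_inc": [60000.0, 0.0, 0.0],
})
with np.errstate(divide="ignore", invalid="ignore"):
    toy["payment_to_income"] = toy["installment"].to_numpy() / toy["annual_inc"].to_numpy()

# Count infinite entries per numeric column, as check_infinity does.
inf_counts = np.isinf(toy.select_dtypes(include=[np.number])).sum()
print(inf_counts[inf_counts > 0])

# One option: turn +/-inf into NaN so the usual imputation step can handle them.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
print(cleaned)
```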

**Below code Block Explanation**: Here, X and y are defined as the feature matrix and target variable, respectively. The data is then split into training and testing sets, reserving 20% of the data for testing. This separation is essential for evaluating model performance on unseen data.

In [None]:
# Ensure that X_encoded and y are aligned
assert len(X_encoded) == len(df), "Mismatch between X_encoded and df, ensure they are aligned before splitting."

# Define X and y Variables
X = X_encoded
y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


**Below code Block Explanation**: The target variable (loan_status) is label-encoded to convert categorical values into numeric labels, which are necessary for most machine learning algorithms. This step ensures compatibility with scikit-learn’s models.

In [None]:
# Encode Target Variable
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
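
A small sketch of what the label encoder does, using hypothetical status labels (the actual values of `loan_status` are not shown in this notebook). The encoder is fitted on the training labels only and then reused for the test labels, mirroring the cell above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only.
train_labels = ["Fully Paid", "Charged Off", "Current", "Fully Paid"]
test_labels = ["Current", "Charged Off"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_labels)  # learns the sorted class list
test_codes = encoder.transform(test_labels)        # reuses the same mapping

print(list(encoder.classes_))
print(train_codes, test_codes)
print(list(encoder.inverse_transform(test_codes)))

# A label that was never seen during fit cannot be transformed.
try:
    encoder.transform(["Late"])
    unseen_raised = False
except ValueError:
    unseen_raised = True
print("unseen label raises ValueError:", unseen_raised)
```

Because `transform` raises on unseen labels, a rare class that ends up only in the test split will break this step, which is worth keeping in mind with a purely random split.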

**Below code Block Explanation**: The following cells calculate the Variance Inflation Factor (VIF) for each feature to assess multicollinearity. The first cell checks X_encoded for NaN or infinite values, which would invalidate the VIF computation. High VIF values indicate strong correlation among predictors, which can affect model stability and interpretability. This information guides the removal of redundant features.

In [None]:
# Check for Missing or Infinite Values in DataFrame
def check_missing_or_infinite(df):
    columns_with_issues = []
    for col in df.columns:
        # Only process columns if they are numeric
        if df[col].dtype in [np.float64, np.float32, np.int64, np.int32]:
            if df[col].isna().any() or np.isinf(df[col]).any():
                columns_with_issues.append(col)

    if len(columns_with_issues) > 0:
        print(f"The following columns contain NaN or infinite values: {columns_with_issues}")
    else:
        print("No columns contain NaN or infinite values.")

# Run the check on X_encoded before calculating VIF
check_missing_or_infinite(X_encoded)

In [None]:
X_encoded[['dti', 'mths_since_last_delinq', 'mths_since_last_record', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'annual_inc_joint', 'dti_joint', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m', 'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'chargeoff_within_12_mths', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens', 'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit', 'revol_bal_joint', 'sec_app_fico_range_low', 'sec_app_fico_range_high', 'sec_app_inq_last_6mths', 'sec_app_mort_acc', 'sec_app_open_acc', 'sec_app_revol_util', 'sec_app_open_act_il', 'sec_app_num_rev_accts', 'sec_app_chargeoff_within_12_mths', 'sec_app_collections_12_mths_ex_med', 'deferral_term', 'hardship_amount', 'hardship_length', 'hardship_dpd', 'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount', 'hardship_last_payment_amount']].head()

In [None]:
X_encoded[['dti', 'mths_since_last_delinq', 'mths_since_last_record', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'annual_inc_joint', 'dti_joint', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m', 'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'chargeoff_within_12_mths', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens', 'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit', 'revol_bal_joint', 'sec_app_fico_range_low', 'sec_app_fico_range_high', 'sec_app_inq_last_6mths', 'sec_app_mort_acc', 'sec_app_open_acc', 'sec_app_revol_util', 'sec_app_open_act_il', 'sec_app_num_rev_accts', 'sec_app_chargeoff_within_12_mths', 'sec_app_collections_12_mths_ex_med', 'deferral_term', 'hardship_amount', 'hardship_length', 'hardship_dpd', 'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount', 'hardship_last_payment_amount']].info()

In [None]:
# Function to calculate VIF with Progress Bar
def calculate_vif(X):
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [
        variance_inflation_factor(X.values, i) for i in tqdm(range(X.shape[1]), desc="Calculating VIF")
    ]
    return vif_data

# Calculate VIF for the original DataFrame
vif_data_original = calculate_vif(X_encoded)

# Remove features with high VIF
high_vif_features = vif_data_original[vif_data_original["VIF"] > 5]["feature"].tolist()
X_encoded = X_encoded.drop(columns=high_vif_features)

# Calculate VIF and print results for missing data indicators
missing_indicators_columns = [col for col in X_encoded.columns if '_missing' in col]
vif_data_missing = calculate_vif(X_encoded[missing_indicators_columns])
print("VIF for missing data indicators:\n", vif_data_missing)

**Below code Block Explanation**: This block removes features with a VIF above 5, indicating high collinearity. By filtering out these features, we reduce redundancy, making the feature set more interpretable and less prone to multicollinearity issues.

In [None]:
# Remove High-VIF Features
high_vif_features = vif_data_original[vif_data_original["VIF"] > 5]["feature"].tolist()
X_vif_reduced = X.drop(columns=high_vif_features)
print("\nFeatures remaining after VIF filtering:\n", X_vif_reduced.columns)
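
For intuition about the threshold, VIF for feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing that feature on all the others. The sketch below computes it with NumPy on synthetic data. The helper `vif_from_r2` is hypothetical, and it adds an intercept to each regression; as far as I know, statsmodels' `variance_inflation_factor` regresses on the columns exactly as given, so its numbers can differ unless a constant column is included.

```python
import numpy as np


def vif_from_r2(X, i):
    """VIF of column i: regress it on the other columns plus an intercept."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)


# Synthetic data: c is nearly a copy of a, b is independent of both.
rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X_toy = np.column_stack([a, b, c])

vifs = [vif_from_r2(X_toy, i) for i in range(X_toy.shape[1])]
for name, value in zip(["a", "b", "c"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

Both `a` and `c` exceed the threshold of 5, so dropping every high-VIF feature in one pass removes both members of a correlated pair. An iterative variant that drops one feature and recomputes would keep one of them.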


**Below code Block Explanation**: Using Recursive Feature Elimination (RFE) with a linear regression model, this block selects the top 5 most informative features from the reduced feature set (X_vif_reduced). RFE iteratively removes the least important features based on the model's criteria, improving model efficiency.

In [None]:
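# Recursive Feature Elimination (RFE)
# Fit on the training rows only, using the label-encoded target (the raw y holds string labels)
model = LinearRegression()
rfe = RFE(estimator=model, n_features_to_select=5)
X_rfe_reduced = rfe.fit_transform(X_vif_reduced.loc[X_train.index], y_train)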

selected_features = X_vif_reduced.columns[rfe.support_]
print("\nFeatures selected by RFE:\n", selected_features)

**Below code Block Explanation**: This block applies RFE with a RandomForestClassifier to the numerical columns of the training set. It keeps 48 features and removes up to 18 features per iteration (`step=18`). The features selected by the RandomForest RFE are then displayed, showing which numerical predictors the forest ranks as most informative.

In [None]:
# RandomForest RFE for Numerical Columns
rf = RandomForestClassifier(n_estimators=150, random_state=42)
rfe = RFE(estimator=rf, n_features_to_select=48, step=18, verbose=3)
X_train_numerical = X_train.select_dtypes(include=['number'])
rfe.fit(X_train_numerical, y_train)

selected_features_rf = X_train_numerical.columns[rfe.support_]
print("\nNumerical Features selected by RandomForest RFE:\n", selected_features_rf)
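
The interaction between `n_features_to_select` and `step` is easier to see on a small synthetic problem. This is a minimal sketch with made-up sizes and a logistic regression estimator chosen for speed; it is not the configuration used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, 3 of them informative.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=42
)

# Start from 10 features, drop 3 per round, stop at 4: 10 -> 7 -> 4.
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4, step=3)
selector.fit(X_toy, y_toy)

print("kept mask:", selector.support_)
print("ranking:  ", selector.ranking_)  # 1 = kept; larger = eliminated earlier
```

A large `step` means fewer model fits but a coarser elimination, since many features are discarded on the basis of a single importance ranking.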

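**Below code Block Explanation**: This block imports and instantiates the candidate classifiers, stored in a dictionary called models that maps a display name to an estimator. A fixed random_state is set wherever the estimator supports it, so that results are reproducible.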

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Define a dictionary of models to evaluate
models = {
    "Logistic Regression": LogisticRegression(max_iter=200, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "Support Vector Machine": SVC(probability=True, random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "XGBoost": XGBClassifier(use_label_encoder=False, eval_metric="logloss", random_state=42),
    "LightGBM": LGBMClassifier(random_state=42),
    "CatBoost": CatBoostClassifier(silent=True, random_state=42)
}


**Below code Block Explanation**: This block evaluates the models in the models dictionary, using a progress bar to track completion. For each model, it chooses the evaluation function (evaluate_model_multi or evaluate_model_single) based on whether multi_class_strategy is set. If models is None, it prints a message instead of running the evaluation. The result of each evaluation is appended to a list for later review or analysis.

In [None]:
# Evaluating models based on user selection with progress bar
results = []
if models is not None:
    for name, model in tqdm(models.items(), desc="Evaluating models"):
        # Choose the correct evaluation function based on model type
        if multi_class_strategy:
            result = evaluate_model_multi(
                name, model, X_train, X_test, y_train, y_test,
                save_path=SAVE_PATH,
                multi_class_strategy=multi_class_strategy  # Pass strategy to evaluation function
            )
        else:
            result = evaluate_model_single(
                name, model, X_train, X_test, y_train, y_test,
                save_path=SAVE_PATH
            )
            
        results.append(result)
else:
    print("Model evaluation was not performed due to invalid selection.")


**Below code Block Explanation**: This block converts the list of evaluation results into a DataFrame for a clear summary and displays it. It also saves the results to a CSV file in SAVE_PATH, making it easier to analyze or share the performance metrics across different models.

In [None]:
# Convert results list to a DataFrame for better visualization and analysis
results_df = pd.DataFrame(results)
print("\nEvaluation Results Summary:")
print(results_df)

# Optionally, save results to a CSV file
results_df.to_csv(f"{SAVE_PATH}/model_evaluation_summary.csv", index=False)

**Below code Block Explanation**: This block generates a line plot to compare different performance metrics across models. Each line represents a metric, such as accuracy or F1-score, helping identify which model performs best in each area. This visualization is valuable for quickly assessing model strengths and trade-offs.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming `results` is the list of dictionaries created by the model evaluation function
# Convert the list of results into a DataFrame for tabular display
results_df = pd.DataFrame(results)

# Display results as a table for easy comparison
print("\nModel Performance Metrics:")
display(results_df)  # If running in a Jupyter notebook, this will display a nice formatted table

# Visualize Performance Comparison
plt.figure(figsize=(12, 6))
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'Cross-Validation Mean Accuracy']

# Plot each metric for all models
for metric in metrics:
    plt.plot(results_df['Model'], results_df[metric], marker='o', label=metric)

# Customize the plot
plt.title("Model Performance Comparison")
plt.xlabel("Model")
plt.ylabel("Score")
plt.legend(loc="best")
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

**Below code Block Explanation**: This block saves each model to disk, enabling easy reuse without retraining. Each model is saved with a descriptive filename, storing it in the specified SAVE_PATH. This is especially useful when working with multiple models and allows for future analysis or deployment.

In [None]:
# Save All Models for Future Use
for name, model in models.items():
    joblib.dump(model, f"{SAVE_PATH}/{name}_final_model.pkl")
print("\nAll models have been saved for future use.")
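
A minimal round-trip sketch of saving and reloading a fitted estimator with joblib. The `safe_filename` helper is hypothetical and only illustrates one way to avoid spaces in file names such as "Support Vector Machine"; the dummy classifier and temporary directory stand in for the real models and SAVE_PATH.

```python
import os
import re
import tempfile

import joblib
from sklearn.dummy import DummyClassifier


def safe_filename(model_name):
    """Replace runs of characters other than letters, digits, '_' and '-' with '_'."""
    return re.sub(r"[^A-Za-z0-9_-]+", "_", model_name).strip("_")


# Stand-in for a trained model from the models dictionary.
model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [1, 1, 0])

with tempfile.TemporaryDirectory() as save_dir:
    filename = f"{safe_filename('Support Vector Machine')}_final_model.pkl"
    path = os.path.join(save_dir, filename)
    joblib.dump(model, path)
    restored = joblib.load(path)

restored_prediction = restored.predict([[5]])
print(filename, restored_prediction)
```

Only fitted estimators are worth saving. If the evaluation functions fit a clone or a wrapped copy of each model, the objects in the dictionary may still be unfitted at this point.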


In [None]:
# === SHAP Explanations for Model Interpretability ===
import shap

# Initialize SHAP Explainer for each model
def explain_model_with_shap(model, X_sample, model_name="Model"):
    """
    Generates SHAP explanations for the given model and dataset sample.

    Parameters:
    ----------
    model : estimator object
        The trained model to explain.
    X_sample : DataFrame
        A sample of the dataset to generate SHAP values for.
    model_name : str
        Name of the model for display in plots.
    """
    print(f"\nGenerating SHAP explanations for {model_name}...")
    
    # shap.Explainer picks an algorithm automatically (for example, a tree explainer for tree-based models);
    # X_sample is passed as background data when the model exposes predict_proba
    if hasattr(model, 'predict_proba'):
        explainer = shap.Explainer(model, X_sample, check_additivity=False)
    else:
        explainer = shap.Explainer(model)
    
    # Calculate SHAP values
    shap_values = explainer(X_sample)

    # Plot feature importance summary
    shap.summary_plot(shap_values, X_sample, plot_type="bar", show=True)
    
    # Detailed summary plot with individual SHAP values per feature and instance
    shap.summary_plot(shap_values, X_sample, show=True)

    # Example force plot for the first prediction (requires a single instance)
    shap.force_plot(explainer.expected_value, shap_values[0, :], X_sample.iloc[0, :], matplotlib=True)


# Choose a subset of data to explain (e.g., a random sample of 100 rows)
X_sample = X_test.sample(100, random_state=42)

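# Explain the main trained model (example: the Random Forest from the models dictionary).
# Pass the fitted estimator instance, not the RandomForestClassifier class itself.
explain_model_with_shap(models["Random Forest"], X_sample, model_name="RandomForestClassifier")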

# Loop to explain all models in a dictionary (if multiple models)
for name, model in models.items():
    explain_model_with_shap(model, X_sample, model_name=name)