# Exploratory Data Analysis

This notebook explores the major features of the Kaggle dataset [Realistic Loan Approval Dataset | US & Canada](https://www.kaggle.com/datasets/parthpatel2130/realistic-loan-approval-dataset-us-and-canada/data) and prepares it for classification.


## Description of dataset from Author

1️⃣ Real-World Approval Logic

The dataset implements actual banking criteria:

 - DTI ratio > 50% = automatic rejection
 - Defaults on file = instant reject
 - Credit score bands match real lending thresholds
 - Employment verification for loans ≥$20K
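
The two hard rejection rules above can be sketched as a small pre-screening function. This is only an illustration: the function name `hard_reject`, its parameter names, and the choice to express DTI as a fraction are assumptions, not the dataset author's generator code.

```python
def hard_reject(debt_to_income: float, defaults_on_file: int) -> bool:
    """Return True when either hard rejection rule from the description fires.

    Assumes debt_to_income is a fraction (0.5 == 50%); the parameter names are
    hypothetical and not taken from the dataset author's code.
    """
    # Rule 1: a DTI ratio above 50% is an automatic rejection
    if debt_to_income > 0.50:
        return True
    # Rule 2: any default on file is an instant rejection
    if defaults_on_file > 0:
        return True
    return False


print(hard_reject(0.62, 0))  # DTI above the 50% cutoff
print(hard_reject(0.30, 1))  # a default on file
print(hard_reject(0.30, 0))  # neither rule fires
```

The credit score bands and employment verification rule are left out because the description does not give their exact thresholds.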


2️⃣ Realistic Correlations

 - Higher income → Better credit scores
 - Older applicants → Longer credit history
 - Students → Lower income, special treatment for small loans
 - Loan intent affects approval (Education best, Debt Consolidation worst)


3️⃣ Product-Specific Rules

 - Cards: More lenient, higher limits
 - Personal Loans: Standard criteria, up to $100K
 - Line of Credit: Capped at $50K, manual review for high amounts


4️⃣ Edge Cases Included

 - Young applicants (age 18) building first credit
 - Students with thin credit files
 - Self-employed with variable income
 - High debt-to-income ratios
 - Multiple delinquencies


## Exploration Steps

1. Fetch Data and Store Artifact
2. Initial Visual and Tabular Assessment
3. Data Cleaning
4. Model Training and Selection
5. Model Assessment Against Baseline

In [1]:
# Python 3 Standard Library
import os
from pathlib import Path
import re

# if you are re-running this on your system, you'll probably need to change this path to match where you are storing these files
PROJECT_ROOT_PATH = Path(os.environ['USERPROFILE']) / "OneDrive" / "Education" / "WGU" / "Capstone"

# making sure I'm in the right directory for EDA, need to be in root
if not re.match(r'.*Capstone$', os.getcwd()):
    os.chdir(PROJECT_ROOT_PATH)

# Data Science Modules
## Data Analytics and Visualization
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm  # `tqdm.tqdm_notebook` is deprecated in favor of this import

## Machine Learning
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_validate, train_test_split
from sklearn.metrics import make_scorer, roc_auc_score, accuracy_score, precision_score, fbeta_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Custom modules
from src.utilities import new_logger, save_atomic

# Setting Pandas DataFrame options
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

This is a module-wide logger that is tied to this notebook. During operation, it may also trigger writes to `logs/utils/src.utilities.log` when using `src.utilities` functions.

In [2]:
logger = new_logger("eda.data_exploration", "logs")

### 1. Fetch Data and Store Artifact

This dataset was downloaded locally as a CSV file.

I'm being careful to use `Path` in order to create absolute references to files on disk. This will help prevent odd behavior for relative references when run on other operating systems.
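
As a minimal sketch of that idea, the snippet below builds a path from its parts and resolves it against the current working directory. The variable names are illustrative only; the file name is the one used in this notebook.

```python
from pathlib import Path

# Build the path from parts so the separator matches the host OS
relative_path = Path("data") / "Loan_approval_data_2025.csv"

# resolve() anchors the relative path to the current working directory
absolute_path = relative_path.resolve()

print(relative_path.as_posix())
print(absolute_path.is_absolute())
```

Because the result depends on the working directory, this only works as intended after the notebook has changed into the project root.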

In [3]:
# Fetch local dataset
data_path = Path('data/Loan_approval_data_2025.csv').resolve()
logger.info(f"Attempting to fetch data from '{data_path}'")

loan_appr_df = pd.read_csv(data_path)
logger.info(f"Successfully created DataFrame {loan_appr_df.shape}")

In [4]:
# saving original in Parquet format for submission as artifact
orig_data_path_parquet = save_atomic(loan_appr_df, Path("data/loan_approval_data_2025.orig.parquet"), fmt="parquet")

### 2. Initial Visual and Tabular Assessment

#### Tabular Assessment (Directed and Non-Directed)

During this part of the process I am looking for multiple issues common to datasets from the Internet:
- Missing data
- Malformed data/typos
- Improper data ranges
- Improper data types

The first section deals with looking for missing data and dtypes, then we move on to looking at the correlations between numeric features as well as their distributions.
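
The checks listed above can also be gathered into a single table. The helper below is a sketch under assumptions: `quality_summary` is a hypothetical name, and the toy frame holds made-up values rather than rows from the loan dataset.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing values, and cardinality for every column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


# Toy frame with made-up values, used only to demonstrate the helper
toy = pd.DataFrame({
    "age": [25, 40, None],
    "occupation_status": ["Employed", "Student", "Employed"],
})
print(quality_summary(toy))
```

Improper ranges still need a look at `describe()`, since a valid range depends on what each column means.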

In [None]:
loan_appr_df.info()

There doesn't seem to be any missing data. This makes sense because this is a synthetic dataset, though it is based on realistic business rules surrounding approving or denying loans.

In [None]:
loan_appr_df.isna().sum()

Now I'll look for duplicated rows within the dataset. Since this is synthetically generated data, I don't expect to see any duplicated rows. This will exclude the `customer_id` column since that column will make each row unique by default.

A value of 0 indicates that there are no duplicated rows.
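
To illustrate why the identifier is excluded, here is a small sketch on a toy frame with made-up values. Only the `customer_id` column name comes from the dataset.

```python
import pandas as pd

# Toy frame with made-up values: rows 0 and 2 match once the ID is ignored
toy = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "age": [30, 45, 30],
    "loan_amount": [5000, 12000, 5000],
})

# Every row is unique while the identifier is included
dupes_with_id = toy.duplicated().sum()

# Dropping the identifier exposes the repeated feature values
dupes_without_id = toy.drop(columns=["customer_id"]).duplicated().sum()

print(dupes_with_id, dupes_without_id)
```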

In [None]:
loan_appr_df.drop(columns=['customer_id']).duplicated().sum()

In [None]:
loan_appr_df.describe()

The basic descriptive statistics for each numeric column look reasonable based on my knowledge of these columns.

Now I'll do a non-directed assessment of the head, tail, and a random sample of rows within the dataset. This is an attempt to look for consistency throughout the data. I am also looking for situations where the columns should be joined, melted, or otherwise engineered to produce a better analysis.

In [None]:
loan_appr_df.head(25)

In [None]:
loan_appr_df.tail(25)

In [None]:
loan_appr_df.sample(25)

#### Exploratory Visualizations

First, I isolate the columns into their various types. Since there are so few `object` columns, I will create that column grouping first, and then use it to narrow down to the numeric types. For ease of analysis, I will make two separate numeric groupings - one with the target variable `loan_status` and one without.

Then, the `obj_cols` group will be examined for repeated values, since a small set of repeated values indicates a good candidate for a categorical column.

Next, the `num_cols` will be analyzed using a correlation heatmap to look at relationships among variables. I may also look at various slices of the data to see how those relationships change.

Finally, I will take a look at the distributions of the `num_cols` group to get a sense of how normal each distribution is, or whether it has unique characteristics such as discrete values, or bimodality.
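
The repeated-values check for categorical candidates can be expressed as a cardinality test. The sketch below rests on assumptions: `categorical_candidates` and its `max_unique` cutoff of 20 are hypothetical choices, and the toy frame contains made-up values.

```python
import pandas as pd


def categorical_candidates(df: pd.DataFrame, max_unique: int = 20) -> list[str]:
    """List non-numeric columns whose distinct-value count is at most max_unique."""
    return [
        col for col in df.columns
        if not pd.api.types.is_numeric_dtype(df[col])
        and df[col].nunique() <= max_unique
    ]


# Toy frame with made-up values
toy = pd.DataFrame({
    "customer_id": [f"C{i}" for i in range(100)],
    "loan_intent": ["Education", "Medical"] * 50,
    "age": range(100),
})
print(categorical_candidates(toy))
```

An identifier column such as `customer_id` is filtered out naturally, because every one of its values is distinct.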

In [7]:
# Categorical columns, currently represented as `object` (string) dtypes
obj_cols = [col for col in loan_appr_df.dtypes[loan_appr_df.dtypes == 'object'].index if col != 'customer_id']

# all numerical columns including the `loan_status` target variable
num_cols_with_target = loan_appr_df.drop(columns=obj_cols).drop(columns=['customer_id']).columns.values

# all numerical columns excluding the `loan_status` target variable
num_cols = loan_appr_df.drop(columns=obj_cols).drop(columns=['customer_id', 'loan_status']).columns.values

#### Categorical Feature Exploration

Moving on to the object features, I'll look at the values each one currently takes.

In [None]:
for col in obj_cols:
    print(f"{loan_appr_df[col].value_counts()}\n")

Based on what we can see above, these columns are all good candidates for becoming categorical. Let's look at the distributions of each feature to see if they make sense.

In [None]:
for col in obj_cols:
    val_counts = loan_appr_df[col].value_counts()
    fig, ax = plt.subplots(figsize=(15,9))
    ax.bar(val_counts.index.values, val_counts.values)
    # label with the column name directly; `val_counts.index.name` is None on older pandas versions
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    ax.set_title(f"Frequency of {col}")

These categories seem like they make sense, and no changes are needed. This is expected from a synthetic dataset.

#### Numerical Feature Exploration

In [None]:
# create a correlation matrix with all of the numeric values
correlation_matrix = loan_appr_df[num_cols_with_target].corr()

In [None]:
# create a Seaborn heatmap showing all of the correlation values
plt.figure(figsize=(15,10))
sns.heatmap(correlation_matrix, annot=True, cmap="vlag", fmt=".2f")
plt.title("Loan Application: Numeric Variables Correlation Heatmap")

Moving on, we can summarize how each numeric variable varies against every other numeric variable using Seaborn's `pairplot()` function. I will take a look at each distribution individually as well, below.

In [None]:
sns.pairplot(loan_appr_df[num_cols_with_target])

In [None]:
for col in num_cols:
    fig, ax = plt.subplots(figsize=(15,9))
    n, bins, patches = ax.hist(loan_appr_df[col])
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    ax.set_title(f"Distribution of {col}")

These distributions look appropriate for each feature, and no major changes are required. None of the features is normally distributed, which is expected for data describing human populations. The only "bimodal" feature is `loan_status`, which doesn't really count since it is the binary response variable (the classification we're making).

The non-continuous distributions for `derogatory_marks`, `defaults_on_file`, and `delinquencies_last_2yrs` are expected, as these are discrete variables.

### 3. Data Cleaning

Based on what I can see in this dataset, there is barely any cleaning that must take place. Instead, there simply needs to be a pipeline to create a `OneHotEncoder` for the categorical columns and `StandardScaler` for the numeric columns for use in a Random Forest Classifier.

At this point, I am dropping the customer ID column because it isn't useful for classification. This could also have been done earlier in the process. An alternative treatment is to use this column as the index, replacing the existing non-semantic index.

In addition, I will adjust the dtype of the object columns to be categorical for more efficient storage of data moving forward, and to solidify the idea that they are categories that will benefit from One-Hot Encoding before training the various models.
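
The storage benefit of the `category` dtype can be checked directly. This is a sketch on a made-up column of repeated labels, not a measurement of the loan dataset.

```python
import pandas as pd

# Toy column with made-up values repeated many times
labels = pd.Series(["Employed", "Self-Employed", "Student"] * 10_000)

# deep=True counts the string payloads as well as the pointers to them
as_object_bytes = labels.astype(object).memory_usage(deep=True)
as_category_bytes = labels.astype("category").memory_usage(deep=True)

print(as_object_bytes, as_category_bytes)
```

The categorical version stores each distinct label once plus a small integer code per row, which is why it shrinks as the number of repeats grows.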

Finally, I have removed the `payment_to_income_ratio` column because it has perfect correlation with `loan_to_income_ratio` and does not add any value to this analysis.
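
Redundant columns like this can be found programmatically from the correlation matrix. The following is a minimal sketch; `highly_correlated_pairs` and its 0.999 threshold are assumptions, and the toy data is randomly generated rather than taken from the loan dataset.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.999) -> list[tuple[str, str, float]]:
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.corr(numeric_only=True)
    cols = list(corr.columns)
    pairs = []
    for i, col_a in enumerate(cols):
        # only visit each unordered pair once
        for col_b in cols[i + 1:]:
            r = corr.loc[col_a, col_b]
            if abs(r) >= threshold:
                pairs.append((col_a, col_b, float(r)))
    return pairs


# Toy frame: `b` is an exact multiple of `a`, while `c` is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 3.0 * a, "c": rng.normal(size=200)})
print(highly_correlated_pairs(toy))
```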

In [8]:
# remove the unique customer_id identifier and the redundant payment_to_income_ratio column
loan_appr_wip = loan_appr_df.drop(columns=['customer_id', 'payment_to_income_ratio'])

In [9]:
# make each object feature a category feature instead
for col in obj_cols:
    loan_appr_wip[col] = loan_appr_wip[col].astype('category')

In [10]:
# save clean dataset as Parquet, for logging to W&B later
data_path = save_atomic(loan_appr_wip, Path("data/loan_approval_data_2025.clean.parquet"), fmt="parquet")

Now that cleaning is complete, let's perform a quick look at the clean DataFrame to make sure that the dtypes are correct and we still have no null values.

In [11]:
loan_appr_wip.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   age                      50000 non-null  int64   
 1   occupation_status        50000 non-null  category
 2   years_employed           50000 non-null  float64 
 3   annual_income            50000 non-null  int64   
 4   credit_score             50000 non-null  int64   
 5   credit_history_years     50000 non-null  float64 
 6   savings_assets           50000 non-null  int64   
 7   current_debt             50000 non-null  int64   
 8   defaults_on_file         50000 non-null  int64   
 9   delinquencies_last_2yrs  50000 non-null  int64   
 10  derogatory_marks         50000 non-null  int64   
 11  product_type             50000 non-null  category
 12  loan_intent              50000 non-null  category
 13  loan_amount              50000 non-null  int64   
 14  intere

Now I have to redefine the various column lists, since the WIP DataFrame is not the same as the original DataFrame.

In [12]:
# Categorical columns, now stored with the `category` dtype
obj_cols = [col for col in loan_appr_wip.dtypes[loan_appr_wip.dtypes == 'category'].index]

# all numerical columns including the `loan_status` target variable
num_cols_with_target = loan_appr_wip.drop(columns=obj_cols).columns.values

# all numerical columns excluding the `loan_status` target variable
num_cols = loan_appr_wip.drop(columns=obj_cols).drop(columns=['loan_status']).columns.values

In [14]:
obj_cols

['occupation_status', 'product_type', 'loan_intent']

### 4. Model Training and Selection

This section of the notebook provides the basis for the model training and selection process that will be implemented as a part of the completed MLOps pipeline.

First, I will split the X and y variables. Then, I will generate the training and testing sets. One alternative that may produce a better model is to perform cross-validation, switching up the training and testing sets to take full advantage of all of our available data. TODO: look into implementing cross-validation in this case, how would I treat the metrics and the model in that case?

Next, I will create a machine learning pipeline that includes the preprocessing steps for both the categorical (`OneHotEncoder`) and numeric (`StandardScaler`) variables. This ensures that the trained pipeline can be used for both training and inference as the same steps are being performed in both instances (provided the DataFrame columns are similar).
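
A toy version of that idea is sketched below. The column names are borrowed from the dataset, but the rows are made up, and `LogisticRegression` is used only as a stand-in final step.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data with made-up values
train = pd.DataFrame({
    "loan_intent": ["Education", "Medical", "Education", "Medical"],
    "annual_income": [30_000, 90_000, 35_000, 85_000],
})
target = [0, 1, 0, 1]

toy_preproc = ColumnTransformer([
    ("cat", OneHotEncoder(), ["loan_intent"]),
    ("num", StandardScaler(), ["annual_income"]),
])
toy_pipeline = Pipeline([("preprocessing", toy_preproc), ("clf", LogisticRegression())])

# fit() learns the encoder categories and scaler statistics from the training data only
toy_pipeline.fit(train, target)

# predict() reuses those fitted transformers on unseen rows before calling the model
new_rows = pd.DataFrame({"loan_intent": ["Medical"], "annual_income": [88_000]})
predictions = toy_pipeline.predict(new_rows)
print(predictions)
```

Keeping the transformers inside the pipeline also means that, during cross-validation, they are refit on each training fold rather than on the full dataset.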

Finally, that machine learning pipeline will be tuned using `GridSearchCV`, which lets me identify the best hyperparameters for this dataset.
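
As a minimal sketch of grid search over a pipeline, the example below tunes one hyperparameter on synthetic data from `make_classification`. The data, step names, and grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the loan dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Keys follow the '<step name>__<parameter>' convention so the search reaches into the pipeline
toy_grid = {"clf__C": [0.01, 1.0, 100.0]}

search = GridSearchCV(toy_pipeline, toy_grid, cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With no `scoring` argument, the search ranks candidates by the estimator's default score, which is accuracy for classifiers.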

In [15]:
# split the data into X and y sets
y = loan_appr_wip.pop('loan_status')
X = loan_appr_wip

In [16]:
X.shape
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   age                      50000 non-null  int64   
 1   occupation_status        50000 non-null  category
 2   years_employed           50000 non-null  float64 
 3   annual_income            50000 non-null  int64   
 4   credit_score             50000 non-null  int64   
 5   credit_history_years     50000 non-null  float64 
 6   savings_assets           50000 non-null  int64   
 7   current_debt             50000 non-null  int64   
 8   defaults_on_file         50000 non-null  int64   
 9   delinquencies_last_2yrs  50000 non-null  int64   
 10  derogatory_marks         50000 non-null  int64   
 11  product_type             50000 non-null  category
 12  loan_intent              50000 non-null  category
 13  loan_amount              50000 non-null  int64   
 14  intere

In [17]:
y.shape

(50000,)

In [18]:
# Building a column transformer out of OneHotEncoder and StandardScaler
logger.info("Starting inference training pipeline for all models")
cat_preproc = OneHotEncoder()
logger.debug(f"Created OneHotEncoder for columns {obj_cols}")
num_preproc = StandardScaler()
logger.debug(f"Created StandardScaler for columns {num_cols}")

preproc = ColumnTransformer(
    transformers=[
        ("cat_transform", cat_preproc, obj_cols),
        ("num_transform", num_preproc, num_cols)
    ],
    remainder='drop',
    verbose=True
)
logger.debug("Created ColumnTransformer for categorical and numerical preprocessing with all above columns. Any other columns will be dropped.")

In [19]:
preproc

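ColumnTransformer

| Parameter | Value |
| --- | --- |
| transformers | [('cat_transform', ...), ('num_transform', ...)] |
| remainder | 'drop' |
| sparse_threshold | 0.3 |
| n_jobs | None |
| transformer_weights | None |
| verbose | True |
| verbose_feature_names_out | True |
| force_int_remainder_cols | 'deprecated' |

OneHotEncoder (cat_transform)

| Parameter | Value |
| --- | --- |
| categories | 'auto' |
| drop | None |
| sparse_output | True |
| dtype | <class 'numpy.float64'> |
| handle_unknown | 'error' |
| min_frequency | None |
| max_categories | None |
| feature_name_combiner | 'concat' |

StandardScaler (num_transform)

| Parameter | Value |
| --- | --- |
| copy | True |
| with_mean | True |
| with_std | True |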

In [20]:
def nested_cross_validation(
    pipeline: Pipeline,
    X: pd.DataFrame,
    y: pd.Series,
    est_name: str,
    param_grid: dict[str, list],
    trials: int = 1,
    outer_cv_splits: int = 5,
    inner_cv_splits: int = 3,
    random_state: int = 72925,
    verbose: int = 0,
    n_jobs: int = -1,
) -> pd.DataFrame:
    """Using a Machine Learning Pipeline and Parameter Grid, perform nested Cross-Validation and return a DataFrame with the results.

    Please note that this function can take a LONG time to run if the hyperparameter space is large, or if the combined number of cross-validation folds and trials is very large.

    This function has a very high time complexity and should only be run if you have a lot of time to spare. To put this in concrete terms, if we have four hyperparameters with 3 options each, we have a hyperparameter space of 3^4 = 81 combinations. With the default 3 inner splits and 5 outer splits, that is 81 x 3 x 5 = 1,215 inner fits per trial, plus one refit per outer fold.

    Args:
        pipeline (sklearn.pipeline.Pipeline):
            The machine learning Pipeline that contains the preprocessed columns and the model to use.
        X (pd.DataFrame):
            The X matrix to use for training and testing, contains only the predictor variables.
        y (pd.Series):
            The y array to use for training and testing, contains only the response variable.
        est_name (str):
            The estimator name, for logging.
        param_grid (dict[str, list]):
            A parameter grid dictionary with the parameter names as 'model__parameter' as keys and the list of hyperparameter options as the values.
        trials (int):
            The number of cross-validation trials to perform, defaults to 1.
        outer_cv_splits (int):
            The number of cross-validation splits for each trial, defaults to 5.
        inner_cv_splits (int):
            The number of hyperparameter tuning splits for each outer cross-validation fold, defaults to 3.
        random_state (int):
            The random state to use for better comparison across models. Defaults to 72925.
        verbose (int):
            The level of verbosity to use for the GridSearchCV and cross_validate methods, defaults to 0, letting the loops show the progress.
        n_jobs (int):
            The number of processors to use to parallelize the jobs. Defaults to -1.
    
    Returns:
        pd.DataFrame: a results DataFrame with the fit_time, score_time, estimator object, F-Beta, Accuracy, Precision, ROC_AUC, and the best hyperparameters for that model.
    """
    logger.info(f"Starting nested cross-validation for {est_name}")
    logger.info(f"Using parameter grid for GridSearchCV: {param_grid}")

    nested_model_results = []

    # inner tqdm loop showing trials, from Walters 2022
    for trial in tqdm(range(trials), desc=f"{est_name} Cross Validation Trials", leave=True):
        # offset the seed per trial; reusing one seed would give every trial identical folds
        trial_seed = random_state + trial

        # define Inner CV Loop
        inner_cv = KFold(n_splits=inner_cv_splits, shuffle=True, random_state=trial_seed)
        grid_search = GridSearchCV(pipeline, param_grid, verbose=verbose, cv=inner_cv, n_jobs=n_jobs)

        # define Outer CV Loop
        outer_cv = KFold(n_splits=outer_cv_splits, shuffle=True, random_state=trial_seed)
        nested_model_results.append(
            cross_validate(
                grid_search,
                X,
                y,
                cv=outer_cv,
                scoring={
                    "fbeta": make_scorer(fbeta_score, beta=0.5),
                    "accuracy": make_scorer(accuracy_score),
                    "precision": make_scorer(precision_score),
                    # the built-in scorer computes AUC from decision scores/probabilities, not hard labels
                    "roc_auc": "roc_auc"
                },
                return_estimator=True,
                verbose=verbose,
                n_jobs=n_jobs
            )
        )
    
    # place results in a single DataFrame, one row per outer fold per trial
    results = pd.concat([pd.DataFrame(result) for result in nested_model_results], ignore_index=True)

    # add the hyperparameters to the results DataFrame
    results['hyperparameters'] = [est.best_params_ for est in results['estimator']]
    # name the model
    results['model'] = est_name

    return results
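
To make the cost of the function above concrete, the number of model fits can be counted before running anything. This is a sketch under assumptions: `count_nested_cv_fits` is a hypothetical helper, it assumes `GridSearchCV` keeps its default `refit=True`, and it treats the grid as a single dictionary whose options are fully crossed.

```python
from math import prod


def count_nested_cv_fits(param_grid: dict[str, list], trials: int = 1,
                         outer_cv_splits: int = 5, inner_cv_splits: int = 3) -> int:
    """Count model fits performed by nested cross-validation with a grid search.

    Each outer fold runs a full grid search (every combination on every inner
    fold) and then refits the best combination once on that outer training set.
    """
    n_combinations = prod(len(options) for options in param_grid.values())
    fits_per_outer_fold = n_combinations * inner_cv_splits + 1
    return trials * outer_cv_splits * fits_per_outer_fold


# Four hypothetical hyperparameters with three options each: 3**4 = 81 combinations
example_grid = {f"clf__param_{i}": [1, 2, 3] for i in range(4)}
print(count_nested_cv_fits(example_grid))            # one trial
print(count_nested_cv_fits(example_grid, trials=5))  # five trials
```

Passing one of the real parameter grids to this helper gives a quick estimate of whether a run is feasible on the available hardware.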


To actually perform the training and tuning, run the following code with the list of model definitions below. Each entry contains a model's name, its pipeline object, and its hyperparameter grid.

In [22]:
model_definitions = [
    {
        "name": "Support Vector Machine",
        "pipeline": Pipeline([
            ('preprocessing', preproc), # Preprocessing step
            ('clf', SVC(gamma='scale', max_iter=-1,
                        random_state=72925)) # Model step
        ]),
        "param_grid": {
            "clf__C": list(np.logspace(-4,4,4)),
            "clf__kernel": ['rbf', 'poly'],
            "clf__degree": [3,4,5] # for 'poly' kernel only
        }
    },
    {
        "name": "Logistic Regression",
        "pipeline": Pipeline([
            ('preprocessing', preproc), # Preprocessing step
            ('clf', LogisticRegression(penalty='l2', random_state=72925)) # Model step
        ]),
        "param_grid": {
            # C must be positive; four values log-spaced from 1e-4 to 1e4 (the default of 1.0 is not included)
            "clf__C": list(np.logspace(-4,4,4)),
            "clf__solver": ['lbfgs', 'sag', 'saga', 'newton-cholesky'],
            "clf__max_iter": [x for x in range(1000,2001,250)]
        }
    },
    {
        "name": "Gaussian Naive Bayes",
        "pipeline": Pipeline([
            ('preprocessing', preproc), # Preprocessing step
            ('clf', GaussianNB()) # Model step
        ]),
        "param_grid": {
            "clf__var_smoothing": list(np.logspace(0,-9, num=100))
        }
    },
    {
        "name": "Adaptive Boosting",
        "pipeline": Pipeline([
            ('preprocessing', preproc), # Preprocessing step
            ('clf', AdaBoostClassifier(random_state=72925)) # Model step
        ]),
        "param_grid": {
            "clf__n_estimators": [x for x in range(50,250,50)],
            "clf__learning_rate": [10**x for x in [-2,-1,0]]
        }
    },
    {
        "name": "Random Forest",
        "pipeline": Pipeline([
            ('preprocessing', preproc), # Preprocessing step
            ('clf', RandomForestClassifier(random_state=72925)) # Model step, seeded like the other models
        ]),
        "param_grid": {
            'clf__n_estimators': [10**x for x in range(0,4)],
            'clf__max_features': ['sqrt'],
            'clf__max_depth': [x for x in range(1,6)],
            'clf__min_samples_split': [x*2 for x in range(1,6)]
        }
    },
]

In [23]:
temp_results_path = Path('data/model_metrics.wip.parquet')
final_results_path = Path('data/model_metrics.final.parquet')

if not final_results_path.exists():
    if not temp_results_path.exists():
        # no file found on disk, create an empty DataFrame to start with
        all_results = pd.DataFrame()
    else:
        # load the existing DataFrame and add to it without overwriting initially
        all_results = pd.read_parquet(temp_results_path)

    for model in tqdm(model_definitions, desc="Model Training Experiment Loop", leave=True):
        logger.debug(f"Training iteration with {model['name']} as the inference model.")

        _result = nested_cross_validation(model['pipeline'], X, y, model['name'], model['param_grid'], trials=5, verbose=3)
        logger.debug(f"Results DataFrame for {model['name']} is complete with shape {_result.shape}")

        # log the specific _result DataFrame to disk for recovery purposes, this is a very long training process and we don't want to overwrite if we can help it
        _model_save_path = Path(f"data/model_metrics_{model['name'].lower().replace(' ', '-')}.wip.parquet")
        logger.info(f"Saving the results from the {model['name']} training runs to {_model_save_path} for safe-keeping.")
        save_atomic(_result.drop(columns=['estimator']), _model_save_path)

        if all_results.shape[0] == 0:
            # empty DataFrame
            all_results = _result
            logger.debug(f"Initialized all_results DataFrame from the {model['name']} _result DataFrame.")
        else:
            # adding to DataFrame instead
            all_results = pd.concat([all_results, _result])
            logger.debug(f"Added the {model['name']} _result DataFrame to the existing all_results DataFrame. Current shape: {all_results.shape}")
        
        # log DataFrame to disk so that it can be recovered if training is interrupted
        logger.info(f"Saving the results from all currently completed training runs to {temp_results_path} for safe-keeping.")
        save_atomic(all_results.drop(columns=['estimator']), temp_results_path)

    # if you reach this point, CONGRATS you are done, thanks for playing!
    logger.info(f"Saving the final results of training to {final_results_path}.")
    save_atomic(all_results.drop(columns=['estimator']), final_results_path)
else:
    # final results have been created, load into DataFrame and don't repeat the process
    all_results = pd.read_parquet(final_results_path)

Model Training Experiment Loop:   0%|          | 0/5 [00:00<?, ?it/s]

Support Vector Machine Cross Validation Trials:   0%|          | 0/5 [00:00<?, ?it/s]

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 14 concurrent workers.


OSError: [WinError 1450] Insufficient system resources exist to complete the requested service

In [None]:
all_results.info()

In [None]:
all_results

In [None]:
all_results.groupby(by=['model']).agg({
    'fit_time': 'mean',
    'score_time': 'mean',
    'test_fbeta': 'mean',
    'test_accuracy': 'mean',
    'test_precision': 'mean', 
    'test_roc_auc': 'mean'
})

The code blocks below define each model individually so that any one of them can be trained separately from the rest as needed. They have been commented out to allow the core training loop to run as normal.

**Random Forest**

In [None]:
# rf_pipeline = Pipeline([
#     ('preprocessing', preproc), # Preprocessing step
#     ('clf', RandomForestClassifier()) # Model step
# ])
# logger.debug("Pipeline created with RandomForestClassifier as the inference model.")

# # Define the parameter grid for hyperparameter tuning
# rf_param_grid = {
#     'clf__n_estimators': [10**x for x in range(0,4)],
#     'clf__max_features': [0.3, 0.5, 'sqrt'],
#     'clf__max_depth': [x for x in range(1,6)],
#     'clf__min_samples_split': [x*2 for x in range(1,6)]
# }

# rf_results = nested_cross_validation(rf_pipeline, X, y, "Random Forest", rf_param_grid, trials=5)

To compare how multiple classification models perform against one another, the remaining blocks define the other classifiers.

**Logistic Regression**

In [None]:
# TODO: train a logistic regression model on this same data, with hyperparameter tuning
# lr_pipeline = Pipeline([
#     ('preprocessing', preproc), # Preprocessing step
#     ('clf', LogisticRegression(penalty='l2', random_state=72925)) # Model step
# ])
# logger.debug("Pipeline created with LogisticRegression as the inference model.")

# lr_param_grid = {
#     # C must be positive; four values log-spaced from 1e-4 to 1e4 (the default of 1.0 is not included)
#     "clf__C": list(np.logspace(-4,4,4)),
#     "clf__solver": ['lbfgs', 'liblinear', 'sag', 'saga', 'newton-cholesky'],
#     "clf__max_iter": [x for x in range(1000,2001,250)]
# }
# lr_results = nested_cross_validation(lr_pipeline, X, y, "Logistic Regression", lr_param_grid, trials=5)

**AdaBoost**

In [None]:
# TODO: train an AdaBoost model on this same data, with hyperparameter tuning
# ab_pipeline = Pipeline([
#     ('preprocessing', preproc), # Preprocessing step
#     ('clf', AdaBoostClassifier(random_state=72925)) # Model step
# ])
# logger.debug("Pipeline created with AdaBoost as the inference model.")

# ab_param_grid = {
#     "clf__n_estimators": [x for x in range(50,250,50)],
#     "clf__learning_rate": [10**x for x in [-2,-1,0]],
#     "clf__estimator__max_depth": [1,2,3]
# }

# ab_results = nested_cross_validation(ab_pipeline, X, y, "Adaptive Boosting", ab_param_grid, trials=5)

**Support Vector Machine**

In [None]:
# TODO: train an SVM model on this same data, with hyperparameter tuning
# svc_pipeline = Pipeline([
#     ('preprocessing', preproc), # Preprocessing step
#     ('clf', SVC(gamma='scale', max_iter=-1,
#                 random_state=72925)) # Model step
# ])
# logger.debug("Pipeline created with SVM as the inference model.")

# svc_param_grid = {
#     "clf__C": list(np.logspace(-4,4,4)),
#     "clf__kernel": ['rbf', 'poly'],
#     "clf__degree": [3,4,5,6] # for 'poly' kernel only
# }

# svc_results = nested_cross_validation(svc_pipeline, X, y, "Support Vector Machine", svc_param_grid, trials=5)

**Gaussian Naive Bayes**

In [None]:
# TODO: train a naive Bayes model on this same data, with hyperparameter tuning
# be VERY careful about collinearity and assumptions about independence... identify them and show where in the data this is a concern
# gnb_pipeline = Pipeline([
#     ('preprocessing', preproc), # Preprocessing step
#     ('clf', GaussianNB()) # Model step
# ])
# logger.debug("Pipeline created with GaussianNB as the inference model.")

# gnb_param_grid = {
#     "clf__var_smoothing": list(np.logspace(0,-9, num=100))
# }

# gnb_results = nested_cross_validation(gnb_pipeline, X, y, "Gaussian Naive Bayes", gnb_param_grid, trials=5)

### 5. Model Assessment Against Baseline

Note: I will probably not compare against a baseline after all.

Now that we've determined which method will work best for our data, we'll attempt to further tune that model and then compare its results against the baseline estimator.

In [None]:
# Split data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7292025)

In [None]:
# Creating the dummy classifier baselines using multiple strategies

dummy_strategies = {
    "most_frequent": DummyClassifier(strategy="most_frequent"),
    "stratified": DummyClassifier(strategy='stratified', random_state=72925),
    "uniform": DummyClassifier(strategy="uniform", random_state=72925),
    "constant_0": DummyClassifier(strategy="constant", constant=0)
}

# evaluate and fit each dummy strategy
results_data = {}

for name, clf in dummy_strategies.items():
    clf.fit(X_train, y_train)
    y_pred_dummy = clf.predict(X_test)
    results_data[name] = [
        accuracy_score(y_test, y_pred_dummy),
        precision_score(y_test, y_pred_dummy, zero_division=0),
        fbeta_score(y_test, y_pred_dummy, beta=0.5, zero_division=0)
    ]

### 6. Model Assessment Using McNemar's Test


In [None]:
import mlxtend
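
McNemar's test compares two classifiers on the same test set by looking only at the samples where exactly one of them is correct. The sketch below uses just the standard library so the arithmetic is visible; `mcnemar_test` is a hypothetical helper, it applies the continuity-corrected chi-squared form, and the labels and predictions are made up. It is not the `mlxtend` implementation imported above.

```python
from math import erfc, sqrt


def mcnemar_test(y_true, pred_a, pred_b) -> tuple[float, float]:
    """McNemar's chi-squared test with continuity correction for two classifiers.

    Returns (statistic, p_value). Only the discordant pairs matter:
    b = A correct and B wrong, c = A wrong and B correct.
    """
    b = sum(1 for t, pa, pb in zip(y_true, pred_a, pred_b) if pa == t and pb != t)
    c = sum(1 for t, pa, pb in zip(y_true, pred_a, pred_b) if pa != t and pb == t)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree on correctness
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-squared distribution with 1 degree of freedom
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Made-up labels: model A is always right, model B misses the first 12 samples
y_true = [1] * 20
pred_a = [1] * 20
pred_b = [0] * 12 + [1] * 8

statistic, p_value = mcnemar_test(y_true, pred_a, pred_b)
print(statistic, p_value)
```

When the number of discordant pairs is small, an exact binomial version of the test is usually preferred over this chi-squared approximation.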