# Baseline Modeling

This notebook implements some of the baseline modeling for the project. It reconstructs the data processing pipeline from an earlier notebook, trains a logistic regression model as a baseline, and uses its coefficients to identify the primary drivers of default risk. A Random Forest classifier is then trained as a more flexible, non-linear model to compare performance and assess whether more complex models improve predictive power.

## Goals for this notebook:
- Implement preprocessing pipeline developed in earlier notebooks
- Develop logistic regression model with processed training data
- Deliver insights from the logistic regression model to inform which features are the most important for other algorithms
- Develop a Random Forest model to reaffirm findings in logistic regression model
- Compare logistic regression and Random Forest classifier models to determine feature importance going forward
- Define conclusions drawn from baseline modeling


In [46]:
import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

## Beginning of the data processing pipeline

In [5]:
# Opening file

root = Path.cwd().parent

path = root / "data" / "interim" / "application_train.csv"

df = pd.read_csv(path)

In [6]:
# Working on a copy so the original dataframe stays intact
df_processing = df.copy()

In [7]:
# Determining the aggregated missingness of columns in the dataframe
missing = df_processing.isna().mean()

In [8]:
# Creating a boolean mask of the columns to drop from the dataframe (those with more than 50% missing data)
cols_to_drop = missing > 0.5

In [9]:
# Dropping the columns using indexing
df_processing = df_processing.loc[:, ~cols_to_drop].copy()

In [10]:
# Replacing sentinel values ('365243') in DAYS_EMPLOYED with NaN values
df_processing["DAYS_EMPLOYED"] = df_processing["DAYS_EMPLOYED"].replace(365243, np.nan)

In [11]:
# Setting up the dataframe to be split into training and validation dataframes. y-variable is "TARGET", as it is the variable indicating a default within an account
# y is stratified to ensure the default rate is generally similar in the training and validation datasets

df_processing = df_processing.copy()

X = df_processing.drop(columns=["TARGET"])
y = df_processing["TARGET"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y,
    test_size = 0.2,
    stratify = y,
    random_state = 69
)

In [12]:
# Selecting numeric columns from the X-dataframes to ensure imputation is only applied to numeric columns
numeric_cols = X_train.select_dtypes(include=[np.number]).columns

In [13]:
# Using scikit-learn's SimpleImputer to impute NaN values in numeric columns only (fitted on the training data, then applied to validation)
imputer = SimpleImputer(strategy = 'median')

X_train = X_train.copy()
X_valid = X_valid.copy()

X_train[numeric_cols] = imputer.fit_transform(X_train[numeric_cols])
X_valid[numeric_cols] = imputer.transform(X_valid[numeric_cols])

In [14]:
# Indexing categorical columns from the X-dataframes to ensure One-Hot encoding is only applied to categorical columns
categorical_cols = X_train.select_dtypes(include=["object"]).columns

In [15]:
# Defining the OneHotEncoder
one_hot_encoding = OneHotEncoder(
    handle_unknown = "ignore",
    sparse_output = False
)

In [16]:
# Applying One Hot encoding to the X-dataframes
X_train_cat = one_hot_encoding.fit_transform(X_train[categorical_cols])
X_valid_cat = one_hot_encoding.transform(X_valid[categorical_cols])

In [17]:
# Filtering the X-dataframes to select only numeric columns to be combined with the one-hot encoded categorical columns
X_train_num = X_train[numeric_cols].to_numpy()
X_valid_num = X_valid[numeric_cols].to_numpy()

In [18]:
# Combining the numeric and categorical columns back into two imputed and one-hot encoded X arrays
X_train_final = np.hstack([X_train_num, X_train_cat])
X_valid_final = np.hstack([X_valid_num, X_valid_cat])

In [19]:
# Reassigning dataset name for clarity
X_train = X_train_final
X_valid = X_valid_final

## End of the data processing pipeline

At this stage we have four objects ready to be passed into scikit-learn models: X_train, X_valid, y_train, and y_valid.
- X_train: NumPy array used to train the ML models in the following sections. "TARGET" is dropped, missing numeric values are imputed with the training median, and categorical values are one-hot encoded.
- X_valid: NumPy array used to validate the ML models in the following sections. "TARGET" is dropped, and the imputer and encoder fitted on the training data are applied.
- y_train: pandas Series containing the default indicator "TARGET" for the training rows. The split is stratified so that training and validation data have similar default rates (8.07%). y_train is the dependent variable when fitting the logistic regression and Random Forest classifier models.
- y_valid: pandas Series containing the default indicator "TARGET" for the validation rows, with the same stratified default rate (8.07%). y_valid is used to evaluate the predictions of the trained models.


## Beginning of the modeling section

In [20]:
# Scaling datasets for use with Logistic Regression
scaler = StandardScaler()
X_train_num_scaled = scaler.fit_transform(X_train_num)
X_valid_num_scaled = scaler.transform(X_valid_num)

X_train_scaled = np.hstack([X_train_num_scaled, X_train_cat])
X_valid_scaled = np.hstack([X_valid_num_scaled, X_valid_cat])

# Initializing logistic regression
model_lr = LogisticRegression(
    max_iter = 3000,
    class_weight="balanced"
)
model_lr.fit(X_train_scaled, y_train)

LogisticRegression(class_weight='balanced', max_iter=3000)
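With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * count)`, so the rare default class counts for much more per row than the non-default class. The sketch below illustrates that formula using the validation-set class counts reported later in this notebook; the model itself is weighted from the training-set counts, which are assumed to have a near-identical ratio because the split is stratified. The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Return {label: weight} using n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }

# Validation-set counts (non-defaulters, defaulters), used for illustration only
weights = balanced_class_weights({0: 56538, 1: 4965})

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.2f}")
```

Because each defaulter carries roughly eleven times the weight of a non-defaulter, the fitted probabilities are shifted upward relative to the raw default rate, which helps explain why a 0.5 threshold flags a large share of applicants.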


In [56]:
y_valid_proba_lr = model_lr.predict_proba(X_valid_scaled)[:,1]

In [57]:
threshold = 0.5
y_pred = (y_valid_proba_lr >= threshold).astype(int)

In [26]:
cm = confusion_matrix(y_valid, y_pred)
print(cm)

[[38984 17554]
 [ 1670  3295]]


In [31]:
#  Calculating recall
# Recall = TP / (TP + FN), read from the confusion matrix instead of hard-coded counts
recall = cm[1, 1] / (cm[1, 1] + cm[1, 0])
print(f"The model caught {recall * 100:.1f}% of defaulters")

The model caught 66.4% of defaulters


In [36]:
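# Precision = TP / (TP + FP), read from the confusion matrix instead of hard-coded counts
precision = cm[1, 1] / (cm[1, 1] + cm[0, 1])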
print(f"The model flags {precision*100:.1f}% that actually default")

The model flags 15.8% that actually default


At a 0.5 decision threshold the model produces:
- 38984 true negatives: cases correctly predicted not to default (non-defaulters)
- 17554 false positives: cases predicted to default that were actually safe (mislabeled non-defaulters)
- 1670 false negatives: cases predicted safe that actually defaulted (missed defaulters)
- 3295 true positives: cases predicted to default that actually defaulted (caught defaulters)

Based on the precision and recall scores:
- Recall: the model caught 66.4% of defaulters
- Precision: 15.8% of the applicants the model flagged actually defaulted

The model flags risk quite aggressively, as shown by the 17554 false positives, but it catches the majority of defaulters (66.4%). Next I test thresholds of 0.3 (should catch more and flag more) and 0.7 (should catch less and flag less) to see how the model behaves.
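As a small illustration of why a lower threshold should both catch more and flag more, the sketch below applies three thresholds to a handful of made-up probabilities. The values in `toy_proba` and `toy_truth` are hypothetical and are not taken from the model; `flag_at_threshold` is an illustrative helper.

```python
def flag_at_threshold(probabilities, threshold):
    """Return 1 where the predicted default probability meets the threshold, else 0."""
    return [int(p >= threshold) for p in probabilities]

# Hypothetical predicted probabilities and true labels (1 = default)
toy_proba = [0.10, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.90]
toy_truth = [0, 0, 1, 0, 0, 1, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    flags = flag_at_threshold(toy_proba, threshold)
    # A defaulter is "caught" when it is both flagged and truly a default
    caught = sum(f and t for f, t in zip(flags, toy_truth))
    print(f"threshold={threshold}: flagged={sum(flags)}, defaulters caught={caught}")
```

The flagged set at a higher threshold is always a subset of the flagged set at a lower one, so recall can only fall as the threshold rises.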

In [58]:
# 0.3 decision threshold
threshold = 0.3
y_pred = (y_valid_proba_lr >= threshold).astype(int)
cm = confusion_matrix(y_valid, y_pred)
print(cm)

[[18905 37633]
 [  464  4501]]


In [38]:
# Recall and precision read from the 0.3-threshold confusion matrix
recall = cm[1, 1] / (cm[1, 1] + cm[1, 0])
print(f"The model caught {recall * 100:.1f}% of defaulters")
precision = cm[1, 1] / (cm[1, 1] + cm[0, 1])
print(f"The model flags {precision*100:.1f}% that actually default")

The model caught 90.7% of defaulters
The model flags 10.7% that actually default


At a 0.3 decision threshold the model produces:
- 18905 true negatives: cases correctly predicted not to default (non-defaulters)
- 37633 false positives: cases predicted to default that were actually safe (mislabeled non-defaulters)
- 464 false negatives: cases predicted safe that actually defaulted (missed defaulters)
- 4501 true positives: cases predicted to default that actually defaulted (caught defaulters)

Based on the precision and recall scores:
- Recall: the model caught 90.7% of defaulters
- Precision: 10.7% of the applicants the model flagged actually defaulted

This decision threshold is *extremely* aggressive. While it offers extremely high recall for identifying who actually defaults, it also flags many people who are safe, which represents a lot of lost business. In general, I think this threshold is too sensitive and disruptive for real-world operations.

In [59]:
# 0.7 decision threshold
threshold = 0.7
y_pred = (y_valid_proba_lr >= threshold).astype(int)
cm = confusion_matrix(y_valid, y_pred)
print(cm)

[[51600  4938]
 [ 3369  1596]]


In [40]:
# Recall and precision read from the 0.7-threshold confusion matrix
recall = cm[1, 1] / (cm[1, 1] + cm[1, 0])
print(f"The model caught {recall * 100:.1f}% of defaulters")
precision = cm[1, 1] / (cm[1, 1] + cm[0, 1])
print(f"The model flags {precision*100:.1f}% that actually default")

The model caught 32.1% of defaulters
The model flags 24.4% that actually default


At a 0.7 decision threshold the model produces:
- 51600 true negatives: cases correctly predicted not to default (non-defaulters)
- 4938 false positives: cases predicted to default that were actually safe (mislabeled non-defaulters)
- 3369 false negatives: cases predicted safe that actually defaulted (missed defaulters)
- 1596 true positives: cases predicted to default that actually defaulted (caught defaulters)

Based on the precision and recall scores:
- Recall: the model caught 32.1% of defaulters
- Precision: 24.4% of the applicants the model flagged actually defaulted

This decision threshold is very conservative at flagging potential defaulters. While it minimizes disruption to good customers, it doesn't provide much protection to the bank against default risk.

Based on the 3 decision thresholds tested for the logistic regression model, I believe that the original 0.5 threshold provides the most balanced solution in a real business setting. While it is neither the most sensitive at catching defaulters nor the most permissive towards good customers, it provides adequate performance in both.
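To compare the three thresholds side by side, the confusion-matrix counts printed above can be collected into one summary. This is a sketch that only re-derives recall, precision, and the share of applicants flagged from those counts; the names `confusion_by_threshold` and `summarize` are illustrative.

```python
# (TN, FP, FN, TP) copied from the confusion matrices printed for each threshold
confusion_by_threshold = {
    0.3: (18905, 37633, 464, 4501),
    0.5: (38984, 17554, 1670, 3295),
    0.7: (51600, 4938, 3369, 1596),
}

def summarize(tn, fp, fn, tp):
    """Recall, precision, and share of all applicants flagged as risky."""
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "flagged_share": (tp + fp) / (tn + fp + fn + tp),
    }

summary = {t: summarize(*counts) for t, counts in confusion_by_threshold.items()}

for t, row in summary.items():
    print(
        f"threshold {t}: recall {row['recall']:.1%}, "
        f"precision {row['precision']:.1%}, flagged {row['flagged_share']:.1%}"
    )
```

The flagged share gives a rough sense of the operational cost of each threshold, since every flagged applicant is a potential declined or reviewed loan.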

Below I determine feature importance for the logistic regression model. For reference, in a logistic regression model:
- Positive coefficient = increases default risk
- Negative coefficient = decreases default risk
- Larger absolute value of a coefficient = more influence on the predicted outcome
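One way to make coefficient sizes more tangible is to convert them to odds ratios with `exp(coefficient)`. The sketch below does this for three coefficients taken from the feature table in this notebook; `odds_ratio` is an illustrative helper. Note the assumption about units: numeric features were standardized, so "one unit" is one standard deviation, while the one-hot features were left unscaled, so "one unit" is the category being present versus absent.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds of default for a one-unit increase in the feature."""
    return math.exp(coefficient)

# Coefficients as reported in the logistic regression feature table
examples = {
    "AMT_CREDIT": 0.924915,
    "AMT_GOODS_PRICE": -0.988365,
    "NAME_INCOME_TYPE_Pensioner": -0.63817,
}

for feature, coef in examples.items():
    print(f"{feature}: odds of default multiplied by {odds_ratio(coef):.2f}")
```

These ratios describe the fitted model with all other features held fixed; with regularization and correlated inputs they should be read as model behavior rather than causal effects.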

In [43]:
# Getting list of feature names. Using .get_feature_names_out to restore original categorical column names.
num_features = list(numeric_cols)
cat_features = list(one_hot_encoding.get_feature_names_out(categorical_cols))

feature_names = num_features + cat_features

In [44]:
coefs = model_lr.coef_[0]

importance_df_lr = pd.DataFrame({
    "feature": feature_names,
    "Coefficient": coefs,
    "abs_coefficient": np.abs(coefs)
}).sort_values("abs_coefficient", ascending = False)

In [45]:
importance_df_lr.head(20)

Unnamed: 0,feature,Coefficient,abs_coefficient
5,AMT_GOODS_PRICE,-0.988365,0.988365
171,ORGANIZATION_TYPE_Realtor,0.957105,0.957105
3,AMT_CREDIT,0.924915,0.924915
87,NAME_INCOME_TYPE_Pensioner,-0.63817,0.63817
90,NAME_INCOME_TYPE_Unemployed,0.585866,0.585866
92,NAME_EDUCATION_TYPE_Academic degree,-0.55726,0.55726
189,ORGANIZATION_TYPE_Transport: type 3,0.556985,0.556985
152,ORGANIZATION_TYPE_Industry: type 12,-0.490501,0.490501
28,EXT_SOURCE_3,-0.485058,0.485058
164,ORGANIZATION_TYPE_Legal Services,0.457407,0.457407


The top 3 most impactful features in the logistic regression model were:
- AMT_GOODS_PRICE: The price of the goods being financed (car, school, personal)
- ORGANIZATION_TYPE_Realtor: One of the one-hot encoded variables representing job sector
- AMT_CREDIT: An expected entry in the top 3; larger loans carry a higher default risk

Some insight on the relationship between AMT_GOODS_PRICE and AMT_CREDIT. Because both are in the model, each coefficient describes the effect of one variable with the other held constant:
- larger loan for the same goods price = higher default risk
- higher goods price for the same loan amount = lower default risk

Conceptually, this could represent something like a down payment, which indicates higher financial stability and more personal investment in ensuring the loan doesn't default (because the borrower would then 'lose' their down payment). This implies a relationship between these two features: the lower the ratio of loan amount to goods price, the lower the risk of default.
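The ratio described above could be captured as a single engineered feature. The following is a hedged sketch of that idea on plain numbers; `credit_to_goods_ratio` is a hypothetical helper and the applicant values are made up, so this is not a feature that exists in the dataset or in this notebook's pipeline.

```python
def credit_to_goods_ratio(amt_credit, amt_goods_price):
    """Return AMT_CREDIT / AMT_GOODS_PRICE, or None when the ratio is undefined."""
    if amt_credit is None or amt_goods_price is None or amt_goods_price <= 0:
        return None
    return amt_credit / amt_goods_price

# Hypothetical applicants buying goods priced at 450,000
fully_financed = credit_to_goods_ratio(500_000, 450_000)     # loan exceeds the goods price
with_down_payment = credit_to_goods_ratio(360_000, 450_000)  # part of the price paid up front

print(f"fully financed: {fully_financed:.2f}")
print(f"with down payment: {with_down_payment:.2f}")
```

Under the interpretation above, the applicant with the lower ratio would be expected to carry less default risk, all else equal.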

Some other observations: organization type seems to be one of the most important features in determining default risk. It appears, with both positive and negative effects on default risk, in 12 of the top 20 most impactful features. This suggests that some jobs provide more stable sources of income than others (contract vs. salary, job volatility, income predictability). Jobs with highly variable income, like realty, affect default risk heavily.

Some other expected insights: EXT_SOURCE_2 and EXT_SOURCE_3 represent scores from external bureaus (like a credit score), and a higher score lowers the risk of default.

Additionally, it appears that graduating college reduces default risk, being unemployed increases default risk, and being a pensioner decreases default risk.

The top 5 most important drivers of default risk appear to be (in no particular order):
- Where you work (and its volatility)
- Your loan amount
- Your credit history
- Your education
- The proportion between loan amount and goods price

Below we are going to begin modeling the dataset on a Random Forest classifier

In [75]:
model_clf = RandomForestClassifier(
    class_weight="balanced",
    min_samples_leaf = 50,
    random_state = 69)

model_clf.fit(X_train, y_train)

RandomForestClassifier(class_weight='balanced', min_samples_leaf=50, random_state=69)


In [76]:
y_valid_proba_clf = model_clf.predict_proba(X_valid)[:,1]

threshold = 0.5
y_pred_clf = (y_valid_proba_clf >= threshold).astype(int)

# Stored for later inspection; a bare expression in the middle of a cell is not displayed
cm_clf = confusion_matrix(y_valid, y_pred_clf)
print(classification_report(y_valid, y_pred_clf))

              precision    recall  f1-score   support

           0       0.95      0.80      0.87     56538
           1       0.19      0.52      0.28      4965

    accuracy                           0.78     61503
   macro avg       0.57      0.66      0.57     61503
weighted avg       0.89      0.78      0.82     61503



Interpreting this report for the Random Forest classifier:
- The model catches 52% of all defaulters (recall for class 1)
- Of the people the model flags as defaulters, 19% actually default (precision for class 1)
- This means roughly 1 in 5 people the model flags as defaulters actually default

- The model correctly clears 80% of non-defaulters (recall for class 0)
- Of the people the model predicts to be safe, 95% are actually safe (precision for class 0)
- This means roughly 1 in 20 people the model predicts to be safe actually default

In [79]:
importances = model_clf.feature_importances_

importance_df_clf = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
}).sort_values("importance", ascending = False)

importance_df_clf.head(20)

Unnamed: 0,feature,importance
28,EXT_SOURCE_3,0.155519
27,EXT_SOURCE_2,0.145784
8,DAYS_EMPLOYED,0.051541
7,DAYS_BIRTH,0.045489
40,DAYS_LAST_PHONE_CHANGE,0.040441
5,AMT_GOODS_PRICE,0.03618
3,AMT_CREDIT,0.033999
10,DAYS_ID_PUBLISH,0.033858
4,AMT_ANNUITY,0.031903
9,DAYS_REGISTRATION,0.029238


The top 3 most important features in Random Forest Classifier:
- EXT_SOURCE_3/EXT_SOURCE_2: External reporting (credit history)
- DAYS_EMPLOYED: Employment history
- DAYS_BIRTH: Age

Present again are AMT_GOODS_PRICE and AMT_CREDIT, whose relationship we explored earlier, along with credit history (EXT_SOURCE_2/3) and education level.

Interestingly, the Random Forest and logistic regression models took different routes when determining default risk. Many of the most heavily weighted features in the Random Forest focus on age, where someone lives, and how long they have lived or worked in the area. This may indicate that early splits in the trees separate wealthy, older, established individuals into low-risk groups. This is consistent with the idea that someone is less likely to default when they are well established and wealthy.

Top 5 features for Random Forest (in particular order):
- Credit history
- Applicant stability (age, employment length, how long they have lived in one spot)
- Indicators of wealth and living conditions (home size, home cost, home area)
- Loan size and structure (amount, proportion of financing to cost of item, how much the monthly payment is)
- Education/occupation 

**As a note, SK_ID_CURR is an identifier and has no real meaning**

Now to calculate the capture rate for both models within the riskiest 5%, 10%, 15%, and 20% of applicants, ranked by predicted default probability.

I build an evaluation dataframe for each model, run a capture rate function on both, then compare the results.

In [80]:
# Setting up df to perform capture rate analysis on logistic regression model
df_eval_lr = pd.DataFrame({
    "y" : y_valid.values,
    "p" : y_valid_proba_lr
}).sort_values("p", ascending=False)

# Setting up the df to evaluate capture rate on Random Forest Classifier
df_eval_clf = pd.DataFrame({
    "y": y_valid.values,
    "p": y_valid_proba_clf
}).sort_values("p", ascending=False)

In [81]:
def capture_and_lift(df_eval, top_percents=(0.05, 0.10, 0.15, 0.20)):
    # df_eval must already be sorted by predicted probability "p" in descending order
    results = []
    total_defaults = df_eval["y"].sum()
    
    for p in top_percents:
        k = int(p * len(df_eval))
        top_slice = df_eval.iloc[:k]
        
        captured = top_slice["y"].sum()
        capture_rate = captured / total_defaults
        
        lift = capture_rate / p
        
        results.append({
            "Top %": f"{int(p*100)}%",
            "Captured % of defaulters": round(capture_rate * 100, 2),
            "Lift vs Random": round(lift, 2)
        })
    
    return pd.DataFrame(results)

In [82]:
capture_and_lift(df_eval_lr)

Unnamed: 0,Top %,Captured % of defaulters,Lift vs Random
0,5%,18.57,3.71
1,10%,31.02,3.1
2,15%,40.1,2.67
3,20%,49.1,2.46


In [83]:
capture_and_lift(df_eval_clf)

Unnamed: 0,Top %,Captured % of defaulters,Lift vs Random
0,5%,18.35,3.67
1,10%,30.27,3.03
2,15%,39.94,2.66
3,20%,48.72,2.44
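The two tables can be compared directly. The sketch below re-enters the capture rates reported above, recomputes lift as capture rate divided by the fraction of applicants reviewed, and shows the gap between the models in percentage points. The variable and function names are illustrative.

```python
# Capture rates (% of defaulters) copied from the two tables above
top_fractions = [0.05, 0.10, 0.15, 0.20]
captured_lr = [18.57, 31.02, 40.10, 49.10]  # logistic regression
captured_rf = [18.35, 30.27, 39.94, 48.72]  # Random Forest

def lift(captured_pct, fraction):
    """Capture rate relative to random selection of the same fraction of applicants."""
    return (captured_pct / 100) / fraction

# Positive gap means logistic regression captured more defaulters at that cutoff
gaps = [lr - rf for lr, rf in zip(captured_lr, captured_rf)]

for frac, lr, rf, gap in zip(top_fractions, captured_lr, captured_rf, gaps):
    print(
        f"top {frac:.0%}: LR lift {lift(lr, frac):.2f}, "
        f"RF lift {lift(rf, frac):.2f}, gap {gap:.2f} pp"
    )
```

The gaps are all below one percentage point, so the ranking quality of the two models is very similar on this validation split.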


## Baseline Modeling Conclusions

### Notebook Summary:
To recap, in this notebook we reconstructed the data preprocessing pipeline, trained two ML models (Logistic Regression and Random Forest Classifier) using the processed data, and evaluated performance with relevant metrics including precision & recall, capture and lift rate, and feature importance.

#### Logistic Regression
The logistic regression model performed best with a decision threshold of 0.5 and produced an interpretable and aggressive (flagged often) result that was strong at catching defaulters. In general, this model weighted job sector/volatility, education level, loan amount, item value, and credit history as the primary features for determining default risk. This pattern appears to center on linear financial ratios and the categorical effects that influence them (loan-to-value, job sector volatility).

#### Random Forest Classifier
The Random Forest classifier, following the logistic regression model, used a decision threshold of 0.5 and produced a more conservative (flagged less, caught less) and more precise (a larger share of flagged cases defaulted) model. In general, this model seemed to build a profile of low-risk customers based on applicant stability features (time at residence, age), socioeconomic features (residence cost/features, education level), credit history, loan amount, and item value.

#### Choosing a Champion model
Based on the performance of each model, logistic regression is selected as the champion model going forward. Logistic regression may flag more false positives, but on recall at the 0.5 threshold (66.4% vs. 52%) it performs better. Additionally, on capture rate and lift the models are close, but logistic regression is slightly ahead at every cutoff tested.

In a larger-scope project, there is potential value in using both models in a complementary way, for example by applying the more aggressive logistic regression model to higher-risk segments and the more conservative Random Forest to lower-risk segments where minimizing unnecessary intervention is important.

#### Next Steps
The next phase of the project will focus on improving model performance through feature engineering (e.g., financial ratios such as loan-to-value), incorporating additional datasets such as bureau credit history, and re-evaluating both models under the expanded feature set.