# Term Project Proposal

### Names: Alexander Romero-Barrionuevo, Dylan Lam

### EIDs: ANR3784, DXL85

In [21]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [22]:
# import data
df = pd.read_csv('train.csv')

----

### Examine and Clean the Dataset

In [23]:
# data shapes
print('dataset shape', df.shape)

dataset shape (614, 13)


In [24]:
# data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [25]:
df.describe()

| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
|---|---|---|---|---|---|
| count | 614.0 | 614.0 | 592.0 | 600.0 | 564.0 |
| mean | 5403.459283 | 1621.245798 | 146.412162 | 342.0 | 0.842199 |
| std | 6109.041673 | 2926.248369 | 85.587325 | 65.12041 | 0.364878 |
| min | 150.0 | 0.0 | 9.0 | 12.0 | 0.0 |
| 25% | 2877.5 | 0.0 | 100.0 | 360.0 | 1.0 |
| 50% | 3812.5 | 1188.5 | 128.0 | 360.0 | 1.0 |
| 75% | 5795.0 | 2297.25 | 168.0 | 360.0 | 1.0 |
| max | 81000.0 | 41667.0 | 700.0 | 480.0 | 1.0 |


In [26]:
# quick look at the dataset
df

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 609 | LP002978 | Female | No | 0 | Graduate | No | 2900 | 0.0 | 71.0 | 360.0 | 1.0 | Rural | Y |
| 610 | LP002979 | Male | Yes | 3+ | Graduate | No | 4106 | 0.0 | 40.0 | 180.0 | 1.0 | Rural | Y |
| 611 | LP002983 | Male | Yes | 1 | Graduate | No | 8072 | 240.0 | 253.0 | 360.0 | 1.0 | Urban | Y |
| 612 | LP002984 | Male | Yes | 2 | Graduate | No | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | Urban | Y |


In [27]:
# examine the number of null values per column
df.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [28]:
# find total sum of duplicates in the dataset
print('Total sum of duplicates in the dataset:', df.duplicated().sum())

Total sum of duplicates in the dataset: 0


In [29]:
# remove Loan_ID from the dataframe
df = df.drop(columns=['Loan_ID'])
df.head(5)

| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [30]:
# function to replace null values of categorical columns with the mode
def replace_null_with_mode(df, categorical_columns):
    for col in categorical_columns:
        mode_val = df[col].mode()[0]
        # assign back rather than calling fillna(inplace=True) on a column
        # selection, which recent pandas versions warn about
        df[col] = df[col].fillna(mode_val)

# function to replace null values of numerical columns with the mean
def replace_null_with_mean(df, numerical_columns):
    for col in numerical_columns:
        mean_val = df[col].mean()
        df[col] = df[col].fillna(mean_val)


# replace null values with respective modes or means
# NOTE: Credit_History only takes the values 0 and 1, so mean imputation
# fills its gaps with about 0.842
categorical_columns = ['Gender', 'Married', 'Dependents', 'Self_Employed']
numerical_columns = ['LoanAmount', 'Loan_Amount_Term', 'Credit_History']
replace_null_with_mode(df, categorical_columns)
replace_null_with_mean(df, numerical_columns)

In [31]:
# verify null values have been replaced
df.isna().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
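
Because `Credit_History` only takes the values 0 and 1, filling its gaps with the mean introduces a third, fractional value (about 0.842 in this dataset). The following is a minimal sketch of one alternative, filling such a column with its mode instead. It runs on a small made-up frame rather than `train.csv`, and the helper name `impute_binary_with_mode` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data (values are made up for illustration)
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, np.nan, 1.0, np.nan],
    "LoanAmount": [128.0, 66.0, np.nan, 120.0, 141.0, 100.0],
})

def impute_binary_with_mode(frame, columns):
    """Return a copy where nulls in the given columns are replaced by the column mode."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

imputed = impute_binary_with_mode(toy, ["Credit_History"])

# The column still contains only the original two values
print(imputed["Credit_History"].value_counts())
```

Whether the mode is the right fill value for a missing credit history is a modeling decision; the sketch only shows that it keeps the column binary.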

----

### Feature Engineering

In [32]:
# Create categorical and numerical columns to train model
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area','Credit_History']
numerical_columns = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term' ]

In [33]:
# separate the target column from the feature columns
target = df['Loan_Status']
train_df = df.drop(['Loan_Status'], axis=1)

In [34]:
# Encode categorical columns
encoder = OneHotEncoder()
encoded = encoder.fit_transform(train_df[categorical_columns])
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out(categorical_columns))
encoded_train_df = pd.concat([train_df[numerical_columns], encoded_df], axis=1)

In [35]:
encoded_train_df

| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Gender_Female | Gender_Male | Married_No | Married_Yes | Dependents_0 | Dependents_1 | ... | Education_Graduate | Education_Not Graduate | Self_Employed_No | Self_Employed_Yes | Property_Area_Rural | Property_Area_Semiurban | Property_Area_Urban | Credit_History_0.0 | Credit_History_0.8421985815602837 | Credit_History_1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5849 | 0.0 | 146.412162 | 360.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 4583 | 1508.0 | 128.000000 | 360.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 3000 | 0.0 | 66.000000 | 360.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | 2583 | 2358.0 | 120.000000 | 360.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 6000 | 0.0 | 141.000000 | 360.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 609 | 2900 | 0.0 | 71.000000 | 360.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 610 | 4106 | 0.0 | 40.000000 | 180.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 611 | 8072 | 240.0 | 253.000000 | 360.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 612 | 7583 | 0.0 | 187.000000 | 360.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |


----

### **Data model**

In [36]:
# Split the data for training and testing
X = encoded_train_df
y = target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
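
The split above is not stratified, so the share of "N" labels in the test set can drift from the share in the full data (the hold-out report further down has 43 "N" samples out of 123). As a hedged illustration, the sketch below shows what the `stratify` argument of `train_test_split` does on synthetic labels; `X_demo` and `y_demo` are made-up stand-ins, not the loan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 30/70 split (illustrative, not the loan data)
y_demo = np.array(["N"] * 30 + ["Y"] * 70)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both subsets keep the 30/70 class ratio
print("train N share:", (y_tr == "N").mean())
print("test N share:", (y_te == "N").mean())
```

Applied to this notebook, the equivalent change would be passing `stratify=y` in the call above; that would change the reported hold-out numbers, so it is shown here only as an option.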

In [37]:
# Create a Random Forest model and fit it
model = RandomForestClassifier(n_estimators = 100, random_state = 42)
model.fit(X_train, y_train)

# Model predictions
predictions = model.predict(X_test)

In [38]:
# Evaluate model performance
report = classification_report(y_test, predictions)
print('Classification Report:')
print(report)

Classification Report:
              precision    recall  f1-score   support

           N       0.78      0.42      0.55        43
           Y       0.75      0.94      0.83        80

    accuracy                           0.76       123
   macro avg       0.77      0.68      0.69       123
weighted avg       0.76      0.76      0.73       123
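
To make the report easier to read, the F1 column can be reproduced from the precision and recall columns, since F1 is their harmonic mean. The sketch below does this with the rounded values printed above; the function name `f1_from` is hypothetical, and because the report itself works from unrounded values, small differences in the last digit are possible in general.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision/recall values taken from the report above
f1_n = f1_from(0.78, 0.42)  # class "N"
f1_y = f1_from(0.75, 0.94)  # class "Y"

# Macro average: unweighted mean of the per-class F1 scores
macro_f1 = (f1_n + f1_y) / 2

print(round(f1_n, 2), round(f1_y, 2), round(macro_f1, 2))  # 0.55 0.83 0.69
```

The low F1 for "N" comes almost entirely from its recall of 0.42: the model finds fewer than half of the "N" samples in the test set.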



### **Improving Model Performance**

In [39]:
# Import libraries (RandomForestClassifier is already imported above)
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler


# Standardize numerical features
# NOTE: the scaler is fit on all rows before the folds are split, so
# test-fold statistics are visible during training
numerical_scaler = StandardScaler()
scaled_numerical = numerical_scaler.fit_transform(train_df[numerical_columns])
scaled_numerical_df = pd.DataFrame(scaled_numerical, columns=numerical_columns)

# Combine processed features
# NOTE: encoded_train_df already holds the unscaled numerical columns, so
# each numerical feature appears twice here (unscaled and scaled, same name)
preprocessed_features = pd.concat([encoded_train_df, scaled_numerical_df], axis=1)

# Stratified K-fold keeps the Y/N ratio roughly equal in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Define hyperparameter grid for Random Forest (optional)
param_grid_rf = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [3, 4, 5, 6],  # List of values to explore
}



# Loop through Random Forest model with cross-validation
print("\nEvaluating Model: Random Forest")
for train_index, test_index in skf.split(preprocessed_features, target):
    X_train, X_test = preprocessed_features.iloc[train_index], preprocessed_features.iloc[test_index]
    y_train, y_test = target.iloc[train_index], target.iloc[test_index]

    # Hyperparameter tuning with RandomizedSearchCV
    # (the grid has only 4 x 4 = 16 combinations, fewer than n_iter=50)
    rand_search = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=3, n_iter=50, scoring='f1_macro')
    rand_search.fit(X_train, y_train)
    best_model = rand_search.best_estimator_

    # Make predictions and evaluate model performance
    predictions = best_model.predict(X_test)
    report = classification_report(y_test, predictions)
    print(f"\nFold Report:\n{report}")



Evaluating Model: Random Forest





Fold Report:
              precision    recall  f1-score   support

           N       0.94      0.45      0.61        38
           Y       0.80      0.99      0.88        85

    accuracy                           0.82       123
   macro avg       0.87      0.72      0.75       123
weighted avg       0.84      0.82      0.80       123






Fold Report:
              precision    recall  f1-score   support

           N       0.90      0.47      0.62        38
           Y       0.81      0.98      0.88        85

    accuracy                           0.82       123
   macro avg       0.85      0.73      0.75       123
weighted avg       0.83      0.82      0.80       123






Fold Report:
              precision    recall  f1-score   support

           N       0.94      0.41      0.57        39
           Y       0.78      0.99      0.87        84

    accuracy                           0.80       123
   macro avg       0.86      0.70      0.72       123
weighted avg       0.83      0.80      0.78       123






Fold Report:
              precision    recall  f1-score   support

           N       0.82      0.36      0.50        39
           Y       0.76      0.96      0.85        84

    accuracy                           0.77       123
   macro avg       0.79      0.66      0.68       123
weighted avg       0.78      0.77      0.74       123






Fold Report:
              precision    recall  f1-score   support

           N       0.90      0.47      0.62        38
           Y       0.80      0.98      0.88        84

    accuracy                           0.82       122
   macro avg       0.85      0.72      0.75       122
weighted avg       0.83      0.82      0.80       122
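
The loop above scales the numerical columns on the full dataset and then concatenates them with `encoded_train_df`, which already contains the unscaled versions. One common way to avoid both issues is to put the preprocessing inside a scikit-learn `Pipeline`, so each fold fits its own scaler and encoder. The following is a sketch under assumptions, not the notebook's method: it uses synthetic data with three made-up columns (`X_demo`, `y_demo`) and fixed hyperparameters instead of a search.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the loan dataset)
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "ApplicantIncome": rng.integers(1500, 10000, size=n),
    "LoanAmount": rng.normal(146, 85, size=n).clip(min=9),
    "Property_Area": rng.choice(["Urban", "Rural", "Semiurban"], size=n),
})
y_demo = pd.Series(rng.choice(["N", "Y"], size=n, p=[0.3, 0.7]))

# Each column is transformed exactly once: scaled OR one-hot encoded
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["ApplicantIncome", "LoanAmount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Property_Area"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, max_depth=4, random_state=42)),
])

# The pipeline is refit inside every fold, so no test-fold statistics leak
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=skf, scoring="f1_macro")
print(scores.round(2), round(scores.mean(), 2))
```

Since the labels here are random, the scores themselves carry no meaning; the point is the structure. The same `pipeline` object could be passed to `RandomizedSearchCV` with parameter names prefixed by `model__`.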



## Analysis of Random Forest Model Performance

This analysis evaluates the tuned Random Forest model using 5-fold stratified cross-validation.

**Overall Accuracy:**

* The Random Forest model averaged about 81% accuracy across the five folds (range 77% to 82%), roughly 5 percentage points above the 76% the original model scored on its single hold-out split.
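
The averages quoted here can be checked directly from the reports. The short sketch below recomputes the mean fold accuracy and the gap to the baseline from the accuracy values printed in the five fold reports and the earlier hold-out report; the variable names are illustrative.

```python
from statistics import mean

# Accuracy values copied from the five fold reports
fold_accuracy = [0.82, 0.82, 0.80, 0.77, 0.82]
# Accuracy of the original model on its single hold-out split
baseline_accuracy = 0.76

cv_mean = mean(fold_accuracy)
gain_points = (cv_mean - baseline_accuracy) * 100

print(f"mean={cv_mean:.3f}, gain={gain_points:.1f} points")  # mean=0.806, gain=4.6 points
```

The two numbers come from different evaluation schemes (one hold-out split versus five folds), so the gap is indicative rather than a like-for-like comparison.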

**Class Imbalance:**

* The dataset is imbalanced: each test fold holds about 38 "N" samples (likely loan rejections) against about 85 "Y" samples (likely approvals). The recall figures reflect this: 0.36 to 0.47 for "N" versus 0.96 to 0.99 for "Y", so the model misses more than half of the rejections.
* The F1 score, which balances precision and recall, is likewise lower for "N" (0.50 to 0.62) than for "Y" (0.85 to 0.88). Precision for "N" is high (0.82 to 0.94), so the weakness is missed rejections rather than false alarms.
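
The size of the imbalance can be read off the `support` column of the fold reports, because the five test folds together cover every row exactly once. A small sketch of that arithmetic, with the support values copied from the reports above:

```python
# Support per fold, copied from the five fold reports
n_support = [38, 38, 39, 39, 38]  # class "N"
y_support = [85, 85, 84, 84, 84]  # class "Y"

total_n, total_y = sum(n_support), sum(y_support)
share_n = total_n / (total_n + total_y)

print(total_n, total_y, round(share_n, 3))  # 192 422 0.313
```

The totals add up to the 614 rows of the dataset, with "N" making up a little under a third of the labels.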

**Next Steps:**

* Address the class imbalance, for example by weighting classes (`class_weight='balanced'` in `RandomForestClassifier`) or by resampling the minority class.
* Widen the hyperparameter grid or lower `n_iter`: the current grid has only 16 combinations (4 values of `n_estimators` × 4 values of `max_depth`), so `RandomizedSearchCV` cannot sample 50 unique combinations from it.
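
Both points can be explored with a few lines of code. The sketch below first counts the combinations in the notebook's parameter grid with `ParameterGrid`, then compares minority-class recall with and without `class_weight='balanced'`. The comparison runs on synthetic data from `make_classification`, not on the loan dataset, so it only illustrates the mechanics; whether class weighting helps here would have to be measured on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Grid from the notebook: 4 x 4 combinations, fewer than n_iter=50
param_grid_rf = {"n_estimators": [50, 100, 150, 200], "max_depth": [3, 4, 5, 6]}
n_combinations = len(ParameterGrid(param_grid_rf))
print("grid size:", n_combinations)

# Synthetic imbalanced data (about 30% minority class 0), NOT the loan data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.3, 0.7], flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Compare minority-class recall without and with class weighting
recalls = {}
for weight in (None, "balanced"):
    clf = RandomForestClassifier(
        n_estimators=100, max_depth=4, class_weight=weight, random_state=42
    )
    clf.fit(X_tr, y_tr)
    recalls[weight] = recall_score(y_te, clf.predict(X_te), pos_label=0)
    print(weight, round(recalls[weight], 2))
```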

While the Random Forest model shows decent overall accuracy, its low recall on loan rejections needs to be addressed. The model should be re-evaluated after class imbalance techniques have been applied.