# 🏦 Project: Loan Eligibility Prediction
## 🚀 Phase 2: Data Preprocessing & Machine Learning Pipeline

---

### 📖 **Overview**
Welcome to the engine room of the project. After exploring the data in **Phase 1 (EDA)**, we now transition to building predictive models. The goal is to automate the loan eligibility decision in real time, based on the customer details submitted in the online application form.

### 🎯 **The Mission**
To build a robust binary classifier that predicts `Loan_Status` (Approved/Rejected).
* **Business Goal:** Minimize risk for the bank while ensuring eligible applicants aren't turned away.
* **Key Metrics:** We prioritize **Accuracy** and **Weighted F1-Score** to balance precision and recall.

### ⚙️ **Notebook Workflow**
1.  **Preprocessing & Imputation:** Using `KNNImputer` for numerical gaps and Mode for categorical gaps.
2.  **Feature Engineering:** Log-transforming skewed financial data (`Total_Income`) and One-Hot Encoding categories.
3.  **Baseline Screening:** Testing **14 different algorithms** (Linear, Trees, Ensembles, SVMs) to find top performers.
4.  **Hyperparameter Tuning:** Using `GridSearchCV` to optimize the best candidates.
5.  **Final Selection:** Choosing the "Champion Model" for the final evaluation phase.

---

In [1]:
# --- Data Manipulation ---
import numpy as np
import pandas as pd

# --- Scikit-Learn: Preprocessing & Imputation ---
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer  # For Log scaling
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV

# --- Scikit-Learn: Models ---
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

# --- Scikit-Learn: Metrics ---
from sklearn.metrics import accuracy_score, f1_score

# --- Configuration ---
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## 2. 📂 Data Loading & Initial Inspection
We load the preprocessed dataset saved from the previous EDA phase.
* **Source:** `preprocessed_loan.csv`
* **Action:** Inspect the first few rows to confirm the data loaded correctly.

In [2]:
# Load the dataset
df = pd.read_csv("../data/preprocessed_loan.csv")  # forward slashes work on Windows, macOS and Linux

# Display the first 5 rows to check structure
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | Total_Income | Loan_Amount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 6091.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 4941.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


## 3. 🛠️ Feature Formatting & Target Encoding
Before splitting the data, we must ensure features have the correct data types.

1.  **Loan_Amount_Term:** Converted to `object` (categorical) because loan terms are discrete categories (e.g., 360 months, 180 months), not continuous numbers.
2.  **Credit_History:** Converted to `category` as it represents a binary state (0 or 1).
3.  **Loan_Status (Target):** We map the target variable `'Y'`/`'N'` to binary `1`/`0` for machine learning compatibility.

In [3]:
# 1. Cast Loan_Amount_Term to Int64 (handles NaNs) then to object (categorical)
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].astype('Int64')
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].astype('object')

# 2. Cast Credit_History to category
df['Credit_History'] = df['Credit_History'].astype('category')

# 3. Check distribution of the target variable before encoding
print("Target Class Distribution (Before Encoding):")
print(df['Loan_Status'].value_counts())

# 4. Encode Target: Y -> 1 (Approved), N -> 0 (Rejected)
df['Loan_Status'] = df['Loan_Status'].map({'Y':1, 'N':0})

# Verify the changes
df.info()

Target Class Distribution (Before Encoding):
Loan_Status
Y    422
N    192
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   Loan_ID           614 non-null    object  
 1   Gender            601 non-null    object  
 2   Married           611 non-null    object  
 3   Dependents        599 non-null    object  
 4   Education         614 non-null    object  
 5   Self_Employed     582 non-null    object  
 6   Total_Income      614 non-null    float64 
 7   Loan_Amount       592 non-null    float64 
 8   Loan_Amount_Term  600 non-null    object  
 9   Credit_History    564 non-null    category
 10  Property_Area     614 non-null    object  
 11  Loan_Status       614 non-null    int64   
dtypes: category(1), float64(2), int64(1), object(8)
memory usage: 53.6+ KB
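
The reason for the intermediate `Int64` step in the cell above can be seen on a toy series. This is a minimal sketch with illustrative values (only the 360 and 180 month terms come from the text); the variable names are hypothetical. Casting floats straight to `object` keeps the trailing `.0`, so the eventual category labels would read `360.0`, whereas going through the nullable `Int64` dtype gives clean integer labels and still tolerates missing entries.

```python
import numpy as np
import pandas as pd

# Toy loan terms in months, with one missing entry (illustrative values)
terms = pd.Series([360.0, 180.0, np.nan])

direct = terms.astype("object")                   # floats are kept: 360.0, 180.0, nan
via_int = terms.astype("Int64").astype("object")  # nullable ints: 360, 180, <NA>

print([str(v) for v in direct])
print([str(v) for v in via_int])

# Target encoding on a toy series: Y -> 1 (Approved), N -> 0 (Rejected)
status = pd.Series(["Y", "N", "Y"]).map({"Y": 1, "N": 0})
print(status.tolist())
```

Any label not present in the mapping dictionary would become `NaN` after `.map`, so it is worth checking the target for missing values after encoding.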


## 4. 📊 Feature Engineering Strategy
We separate our features into **Numerical** and **Categorical** groups. This is crucial because they require different preprocessing pipelines:
* **Numeric:** Requires imputation and a log transform (to compress right-skewed values such as very high incomes).
* **Categorical:** Requires imputation and encoding (to convert text labels to numbers).

### **Feature Groups:**
* **Target:** `Loan_Status`
* **Numeric:** `Total_Income`, `Loan_Amount`
* **Categorical:** Gender, Married, Dependents, Education, Self_Employed, Loan_Amount_Term, Credit_History, Property_Area.

In [4]:
# Define feature groups
target_feature = 'Loan_Status'

numeric_features = ['Total_Income', 'Loan_Amount']

categorical_features = ['Gender',
                        'Married',
                        'Dependents',
                        'Education',
                        'Self_Employed',
                        'Loan_Amount_Term',
                        'Credit_History',
                        'Property_Area'
                            ]

print(f"✅ Numeric Features: {numeric_features}")
print(f"✅ Categorical Features: {categorical_features}")

# Ensure all categorical features are strictly cast to type 'category'
# This saves memory and ensures compatibility with certain sklearn selectors
for col in categorical_features:
    df[col] = df[col].astype('category')

✅ Numeric Features: ['Total_Income', 'Loan_Amount']
✅ Categorical Features: ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area']


## 5. ✂️ Data Splitting (Train-Validation-Test)
To build a robust model and prevent overfitting, we use a **three-way split strategy**:

1.  **Training Set (64%):** Used to fit the models.
2.  **Validation Set (16%):** Used for unbiased model evaluation and hyperparameter tuning during the development phase.
3.  **Test Set (20%):** Held out completely until the very end to provide a final performance estimate.

**Method:** We use `stratify=y` to ensure the proportion of Approved/Rejected loans remains consistent across all three datasets.

In [5]:
# Separate features (X) and target (y)
X = df.drop("Loan_Status", axis=1)
y = df["Loan_Status"]

# 1. First Split: Separate out the Test set (20%)
X_train_temp, X_test, y_train_temp, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Essential for imbalanced datasets
)

# 2. Second Split: Separate the remaining 80% into Train and Validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_temp, y_train_temp,
    test_size=0.2, # 0.2 * 0.8 = 0.16 (16% of total data)
    random_state=42,
    stratify=y_train_temp
)

# Optional: Save sets to disk for reproducibility in other notebooks
# X_test.to_csv(r'../data/X_test.csv', index=False)
# y_test.to_csv(r'../data/y_test.csv', index=False)
# X_train.to_csv(r"../data/X_train.csv", index=False)
# y_train.to_csv(r"../data/y_train.csv", index=False)
# X_valid.to_csv(r"../data/X_valid.csv", index=False)
# y_valid.to_csv(r"../data/y_valid.csv", index=False)

# Check the shape of the resulting splits
print(f"Training Shape:   {X_train.shape}")
print(f"Validation Shape: {X_valid.shape}")
print(f"Test Shape:       {X_test.shape}")

Training Shape:   (392, 11)
Validation Shape: (99, 11)
Test Shape:       (123, 11)
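
As a sanity check on the split arithmetic and on what `stratify` buys us, the sketch below repeats the two-stage split on synthetic labels that have the same class counts as the dataset (422 approved, 192 rejected). The placeholder feature column and the `*_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's class counts: 422 approved (1), 192 rejected (0)
y_demo = np.array([1] * 422 + [0] * 192)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Stage 1: hold out 20% as the test set
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
# Stage 2: split the remaining 80% into train and validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.2, random_state=42, stratify=y_tmp
)

for label, part in [("train", y_tr), ("valid", y_va), ("test", y_te)]:
    print(f"{label:>5}: n={len(part):3d}, "
          f"share of data={len(part) / len(y_demo):.3f}, "
          f"approved rate={part.mean():.3f}")
```

The approved rate should stay close to the overall rate in every subset, which is the property stratification is meant to preserve.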


## 6. ⚙️ Preprocessing Pipelines
Most scikit-learn models cannot handle missing values or raw text. We build **Pipelines** to automate the cleanup process. Because the imputers and encoders are fitted on the training data only and then applied unchanged to the validation and test sets, this prevents **data leakage**.

### **Transformation Strategy:**
1.  **Numeric Pipeline (`num`):**
    * **Imputation:** We use `KNNImputer` (K-Nearest Neighbors). Instead of just filling with the "average," this looks at the most similar borrowers to estimate a missing value such as the loan amount.
    * **Log transform:** We apply `np.log1p`, i.e. log(1 + x). Financial data (like Income) is often right-skewed. The log transform compresses large values and makes the distribution more symmetric, which helps models like Logistic Regression and SVM.
2.  **Categorical Pipeline (`cat`):**
    * **Imputation:** We use `SimpleImputer(strategy='most_frequent')` to fill missing text with the most common category (Mode).
    * **Encoding:** We use `OneHotEncoder` to convert categories (e.g., "Graduate", "Not Graduate") into binary columns (1s and 0s).
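
To illustrate the effect of the log transform described above, here is a small sketch on synthetic data. The log-normal "incomes" are an assumption standing in for a right-skewed financial column, not the real `Total_Income` values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "incomes" (log-normal); not the real dataset
incomes = pd.Series(rng.lognormal(mean=8.5, sigma=0.8, size=1000))

skew_before = incomes.skew()
skew_after = np.log1p(incomes).skew()  # log1p(x) = log(1 + x), safe at x = 0

print(f"skewness before log1p: {skew_before:.2f}")
print(f"skewness after  log1p: {skew_after:.2f}")
```

A skewness near zero after the transform indicates a roughly symmetric distribution; how close the real columns get depends on their actual shape.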

In [6]:
# 1. Define distinct steps for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=5)),                   # Fill missing using neighbors
    ('logtransformer', FunctionTransformer(np.log1p, validate=False)) # Log transform for skewness
])

# 2. Define distinct steps for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),     # Fill missing with mode
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)) # Convert to binary
])

# 3. Combine them into a single ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numeric_transformer, numeric_features)
    ])

# Confirm the preprocessor was built (evaluate `preprocessor` in a cell to render its diagram)
print("✅ Preprocessing pipeline created successfully.")

✅ Preprocessing pipeline created successfully.


## 7. 🧪 Baseline Model Screening
We don't know yet which algorithm will best understand the patterns in loan approvals. Therefore, we define a "dictionary" of distinct classifiers to test them all at once.

**We are testing 14 different algorithms across 4 families:**
1.  **Linear Models:** Logistic Regression, Ridge, SGD (Good baselines).
2.  **Tree-Based:** Decision Tree, Random Forest, Extra Trees (Good for capturing non-linear complex rules).
3.  **Boosting:** AdaBoost, Gradient Boosting (High performance, builds weak learners into strong ones).
4.  **Others:** SVM, KNN, Naive Bayes (Gaussian/Bernoulli), Discriminant Analysis.

In [7]:
# Dictionary of models to evaluate
models = {
    # Linear & Distance based
    "Logistic Regression": LogisticRegression(),
    "Ridge Classifier": RidgeClassifier(),
    "SGD Classifier": SGDClassifier(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),

    # Tree & Ensemble based
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Extra Trees": ExtraTreesClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),

    # Bayesian & Discriminant
    "GaussianNB": GaussianNB(),
    "BernoulliNB": BernoulliNB(),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Quadratic Discriminant Analysis": QuadraticDiscriminantAnalysis()
}

print(f"✅ Initialized {len(models)} models for screening.")

✅ Initialized 14 models for screening.


## 8. 🏃‍♂️ Model Training & Evaluation Loop
Here, we iterate through our dictionary of 14 models. For each algorithm, we create a temporary **Pipeline** that:
1.  **Accepts raw data.**
2.  **Runs the Preprocessor** (imputes missing values, log-transforms numeric columns, one-hot encodes categories).
3.  **Fits the Model** on the `X_train` data.
4.  **Predicts** results on the `X_valid` (Validation) data.

**Metrics Used:**
* **Accuracy:** Overall correctness (Correct Predictions / Total Predictions).
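* **F1 Score (Weighted):** F1 is the harmonic mean of Precision and Recall. The weighted variant computes F1 separately for each class and averages the results, weighting each class by its number of true samples. This is often more informative than accuracy for imbalanced loan datasets, where we want to balance the risk of approving bad loans against rejecting good ones.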
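
As a worked illustration of the weighted F1 score, the sketch below computes it by hand from per-class F1 values and compares the result with scikit-learn's `average='weighted'`. The toy labels are made up for the example.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (illustrative): 6 approved (1) and 4 rejected (0)
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 1, 1])

# Per-class F1, returned in label order: [class 0, class 1]
per_class_f1 = f1_score(y_true, y_pred, average=None)

# Support = number of true samples per class: [count of 0s, count of 1s]
support = np.bincount(y_true)

# Weighted F1 = support-weighted mean of the per-class F1 scores
manual_weighted = np.sum(per_class_f1 * support) / support.sum()
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")

print("per-class F1:", np.round(per_class_f1, 3))
print("manual weighted F1: ", round(manual_weighted, 3))
print("sklearn weighted F1:", round(sklearn_weighted, 3))
```

Because the majority class carries more weight, a model that does poorly on the minority class can still post a fairly high weighted F1, so it is worth inspecting the per-class values as well.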

In [8]:
results = {}

print("🚀 Starting model training loop...")

for name, model in models.items():
    # Create a pipeline for the specific model
    # This ensures the preprocessor runs immediately before the model trains
    final_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        (name, model)
    ])
    
    # Train the model
    final_pipeline.fit(X_train, y_train)
    
    # Predict on Validation set
    y_pred = final_pipeline.predict(X_valid)

    # Store metrics
    results[name] = {
        "Accuracy": accuracy_score(y_valid, y_pred),
        "F1 Score": f1_score(y_valid, y_pred, average='weighted') 
    }
    
    print(f"   ✅ {name} trained.")

print("🏁 Loop finished.")

🚀 Starting model training loop...
   ✅ Logistic Regression trained.
   ✅ Ridge Classifier trained.
   ✅ SGD Classifier trained.
   ✅ KNN trained.
   ✅ SVM trained.
   ✅ Decision Tree trained.
   ✅ Random Forest trained.
   ✅ Extra Trees trained.
   ✅ AdaBoost trained.
   ✅ Gradient Boosting trained.
   ✅ GaussianNB trained.
   ✅ BernoulliNB trained.
   ✅ Linear Discriminant Analysis trained.
   ✅ Quadratic Discriminant Analysis trained.
🏁 Loop finished.




## 9. 🏆 Performance Leaderboard
We convert our results dictionary into a Pandas DataFrame to easily compare the models. We sort by **Accuracy** to see the top performers.

In [9]:
# Convert results to DataFrame and transpose so models are rows
results_df = pd.DataFrame(results).T

# Sort by Accuracy to find the best models
results_df = results_df.sort_values(by='Accuracy', ascending=False)

# Display the top 10 models
results_df.head(10)

| Model | Accuracy | F1 Score |
|---|---|---|
| BernoulliNB | 0.818182 | 0.796131 |
| Ridge Classifier | 0.808081 | 0.782492 |
| Linear Discriminant Analysis | 0.808081 | 0.782492 |
| SVM | 0.79798 | 0.768464 |
| SGD Classifier | 0.79798 | 0.768464 |
| Logistic Regression | 0.79798 | 0.768464 |
| Random Forest | 0.79798 | 0.785473 |
| AdaBoost | 0.787879 | 0.759596 |
| Gradient Boosting | 0.787879 | 0.768994 |
| Extra Trees | 0.777778 | 0.770568 |
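
Several models tie on Accuracy in this leaderboard, so the order within a tie is arbitrary when sorting on one column. One possible refinement, sketched below on four rows copied from the leaderboard, is to use the F1 Score as a tie-breaker. The `board` and `ranked` names are hypothetical.

```python
import pandas as pd

# A few rows copied from the leaderboard
board = pd.DataFrame(
    {
        "Accuracy": [0.797980, 0.818182, 0.797980, 0.808081],
        "F1 Score": [0.768464, 0.796131, 0.785473, 0.782492],
    },
    index=["SVM", "BernoulliNB", "Random Forest", "Ridge Classifier"],
)

# Sort by Accuracy first, then by F1 Score among models with equal Accuracy
ranked = board.sort_values(by=["Accuracy", "F1 Score"], ascending=False)
print(ranked)
```

With this ordering, Random Forest moves ahead of SVM because its weighted F1 is higher at the same accuracy.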


## 10. 🎛️ Hyperparameter Tuning
Based on the screening results, we select four candidates for fine-tuning, mixing strong screeners with tree-based models for contrast. We use **GridSearchCV** to exhaustively search a specified parameter grid and find the best configuration for each model.

**Selected Models for Tuning:**
1.  **Bernoulli Naive Bayes:** The top scorer in screening; we will tune the smoothing parameter (`alpha`).
2.  **Extra Trees:** A randomized tree ensemble; we will tune the number of trees and the split constraints.
3.  **Decision Tree:** Included as a simpler baseline to compare against the ensembles.
4.  **Support Vector Machine (SVM):** A robust classifier; we will tune the regularization (`C`) and kernel type.

In [10]:
# Define the parameter grid for selected models
model_params = {
    # 1. Bernoulli Naive Bayes
    "BernoulliNB": {
        "model": BernoulliNB(),
        "params": {
            "model__alpha": [0.01, 0.1, 0.5, 1.0, 5.0, 10.0], # Smoothing parameter
            "model__fit_prior": [True, False]
        }
    },

    # 2. Extra Trees Classifier
    "Extra Trees": {
        "model": ExtraTreesClassifier(),
        "params": {
            "model__n_estimators": [50, 100, 200],        # Number of trees
            "model__max_depth": [None, 5, 10, 20],        # Max depth of tree
            "model__min_samples_split": [2, 5, 10],       # Min samples required to split
            "model__min_samples_leaf": [1, 2, 4],         # Min samples in a leaf
            "model__bootstrap": [True, False]
        }
    },

    # 3. Decision Tree
    "Decision Tree": {
        "model": DecisionTreeClassifier(),
        "params": {
            "model__criterion": ["gini", "entropy", "log_loss"],
            "model__max_depth": [None, 5, 10, 20],
            "model__min_samples_split": [2, 5, 10],
            "model__min_samples_leaf": [1, 2, 4]
        }
    },

    # 4. Support Vector Machine (SVM)
    "SVM": {
        "model": SVC(),
        "params": {
            "model__C": [0.1, 1, 10, 100],             # Regularization parameter
            "model__kernel": ["linear", "rbf"],        # Kernel type
            "model__gamma": ["scale", "auto"]          # Kernel coefficient
        }
    }
}
print("✅ Parameter grids defined.")

✅ Parameter grids defined.
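
Before launching an exhaustive search it can help to estimate its cost. The sketch below counts the candidate combinations in each grid with `ParameterGrid` and multiplies by the number of cross-validation folds. The grid values are copied from the cell above; the `grids`, `sizes`, and `cv_folds` names are hypothetical, and `cv_folds = 5` assumes a 5-fold search.

```python
from sklearn.model_selection import ParameterGrid

# Grid values copied from the parameter definitions (estimators are not needed to count)
grids = {
    "BernoulliNB": {
        "model__alpha": [0.01, 0.1, 0.5, 1.0, 5.0, 10.0],
        "model__fit_prior": [True, False],
    },
    "Extra Trees": {
        "model__n_estimators": [50, 100, 200],
        "model__max_depth": [None, 5, 10, 20],
        "model__min_samples_split": [2, 5, 10],
        "model__min_samples_leaf": [1, 2, 4],
        "model__bootstrap": [True, False],
    },
    "Decision Tree": {
        "model__criterion": ["gini", "entropy", "log_loss"],
        "model__max_depth": [None, 5, 10, 20],
        "model__min_samples_split": [2, 5, 10],
        "model__min_samples_leaf": [1, 2, 4],
    },
    "SVM": {
        "model__C": [0.1, 1, 10, 100],
        "model__kernel": ["linear", "rbf"],
        "model__gamma": ["scale", "auto"],
    },
}

cv_folds = 5  # assumed number of cross-validation folds

# Number of parameter combinations per model
sizes = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}

for name, n_candidates in sizes.items():
    print(f"{name}: {n_candidates} candidates x {cv_folds} folds "
          f"= {n_candidates * cv_folds} fits")
```

Each fit also reruns the preprocessing pipeline, including `KNNImputer`, so the Extra Trees grid is by far the most expensive of the four.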


## 11. 🔍 Running GridSearchCV
We loop through the selected models. For each combination of parameters, we use **5-Fold Cross-Validation**: the training data is split into 5 chunks, and the model trains on 4 and is scored on the remaining 1, rotating 5 times. Averaging over the folds makes the "Best Score" more reliable than a single train/validation split.
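
The fold rotation can be sketched directly. For classifiers, `GridSearchCV` with an integer `cv` uses stratified folds, so the example uses `StratifiedKFold` on synthetic labels sized like the 392-row training set. The class counts and `*_demo` names are assumptions chosen to roughly match the dataset's balance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training labels: 392 rows, roughly 69% approved
y_demo = np.array([1] * 270 + [0] * 122)
X_demo = np.zeros((len(y_demo), 1))  # features do not matter for showing the split

skf = StratifiedKFold(n_splits=5)

fold_sizes = []
seen = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo), start=1):
    fold_sizes.append(len(val_idx))
    seen.extend(val_idx.tolist())
    print(f"fold {fold}: train on {len(train_idx)} rows, "
          f"score on {len(val_idx)} rows, "
          f"approved rate in scoring fold = {y_demo[val_idx].mean():.3f}")
```

Every row is used for scoring exactly once, and the reported best score is the mean of the five per-fold scores.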

In [None]:
results_tuning = {}

print("🚀 Starting Grid Search...")

for name, mp in model_params.items():
    # Construct the pipeline
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('model', mp["model"])
    ])
    
    # Initialize Grid Search
    # cv=5: 5-fold cross-validation
    # scoring='accuracy': optimizing for accuracy
    clf = GridSearchCV(pipe, mp["params"], cv=5, scoring='accuracy', error_score=np.nan)
    
    # Fit on the training set
    clf.fit(X_train, y_train)
    
    # Store the best results
    results_tuning[name] = {
        "Best Parameters": clf.best_params_,
        "Best Accuracy": clf.best_score_
    }
    
    print(f"   ✅ {name} tuned. Best Score: {clf.best_score_:.4f}")

print("🏁 Tuning finished.")

🚀 Starting Grid Search...
   ✅ BernoulliNB tuned. Best Score: 0.7985


## 12. 📊 Tuning Results & Model Selection
Let's view the best parameters found for each model.

In [None]:
# Convert to DataFrame for easy viewing
results_tuning_df = pd.DataFrame(results_tuning).T
results_tuning_df

## 13. ✅ Final Model Selection
**Conclusion:**
After reviewing the Grid Search results, we observe strong performances from BernoulliNB, Extra Trees, and SVM.

**Decision:** We will proceed with the **Support Vector Machine (SVM)** using the optimal hyperparameters identified during tuning.

**Winning Hyperparameters:**
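* **C (Regularization):** `0.1` (relatively strong regularization, which helps limit overfitting)
* **Kernel:** `linear` (a linear decision boundary in the one-hot-encoded feature space was sufficient; this does not imply the classes are perfectly separable)
* **Gamma:** `scale` (ignored by the linear kernel, so it has no effect on this model)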

**Next Steps:**
In the final evaluation notebook (`03_final_evaluation.ipynb`), we will:
1.  Instantiate the SVM with these specific parameters.
2.  Train it on the full training data.
3.  Evaluate it on the unseen **Test Set** (X_test) to get the final performance metrics.
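
As a rough preview of those steps, the sketch below instantiates an SVM with the reported hyperparameters and scores it on a held-out split. It runs on synthetic data from `make_classification`, which is only a stand-in: the real notebook would use the preprocessing pipeline and the loan data, and the accuracy printed here says nothing about the actual model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); the real features come from the preprocessor
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Hyperparameters reported in the conclusion; gamma is unused by the linear kernel
final_svm = SVC(C=0.1, kernel="linear", gamma="scale")
final_svm.fit(X_tr, y_tr)

# Evaluate once on the held-out rows
test_accuracy = final_svm.score(X_te, y_te)
print(f"Accuracy on the synthetic hold-out set: {test_accuracy:.3f}")
```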