# Upgraded Loan Eligibility Prediction Model

This notebook details the process of building an improved machine learning model for predicting loan eligibility. 

**Upgrades include:**
1.  **Feature Engineering**: Introduction of a synthetic CIBIL score and other insightful features like Total Income and Loan-to-Income Ratio.
2.  **Data Preprocessing**: Use of log transformation to reduce skew and `StandardScaler` to standardize features (zero mean, unit variance).
3.  **Advanced Model**: Implementation of a `RandomForestClassifier` for better predictive performance.
4.  **Model Persistence**: Saving the trained model and the data scaler for use in the Flask web application.

## 1. Import Libraries and Load Data

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('data/train.csv')

## 2. Data Cleaning and Preprocessing

In [None]:
# Drop Loan_ID as it's not useful for prediction
df = df.drop('Loan_ID', axis=1)

# Handle missing values for categorical features with mode
categorical_cols = ['Gender', 'Married', 'Dependents', 'Self_Employed']
for col in categorical_cols:
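    # Assign the result back; fillna(inplace=True) on a selected column may not update df in newer pandas versions
    df[col] = df[col].fillna(df[col].mode()[0])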

# Handle missing values for numerical features with median (more robust to outliers than mean)
numerical_cols = ['LoanAmount', 'Loan_Amount_Term', 'Credit_History']
for col in numerical_cols:
    # Assign the result back for the same reason as above
    df[col] = df[col].fillna(df[col].median())
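
The imputation choices above can be checked on a toy frame. This is a minimal sketch with made-up values (the `toy` frame is hypothetical, not taken from `data/train.csv`); it shows the mode filling a categorical gap and how far the median and mean fill values diverge when one outlier is present.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one categorical and one numerical column
toy = pd.DataFrame({
    "Gender": ["Male", "Male", np.nan, "Female"],
    "LoanAmount": [100.0, 120.0, np.nan, 1000.0],
})

# Mode imputation: the most frequent category replaces the missing entry
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])

# Median imputation: the outlier (1000.0) pulls the mean up far more than the median
median_fill = toy["LoanAmount"].median()
mean_fill = toy["LoanAmount"].mean()
toy["LoanAmount"] = toy["LoanAmount"].fillna(median_fill)

print(toy)
print(f"median fill: {median_fill}, mean would have been: {mean_fill:.1f}")
```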

## 3. Feature Engineering

In [None]:
# Seed NumPy's global RNG so the synthetic scores are the same on every run
np.random.seed(42)

# Function to generate a synthetic CIBIL score based on Credit_History
def generate_cibil(credit_history):
    if credit_history == 1.0:
        # Good credit history -> higher CIBIL score range (680 to 899; upper bound is exclusive)
        return np.random.randint(680, 900)
    else: # credit_history == 0.0
        # Bad credit history -> lower CIBIL score range (300 to 679)
        return np.random.randint(300, 680)

# Apply the function to create the CIBIL_Score column
df['CIBIL_Score'] = df['Credit_History'].apply(generate_cibil)

# Create Total_Income feature
df['Total_Income'] = df['ApplicantIncome'] + df['CoapplicantIncome']

# Create Loan_to_Income_Ratio feature
df['Loan_to_Income_Ratio'] = df['LoanAmount'] / df['Total_Income']

# Create EMI-like feature (Loan Amount per Term)
df['EMI_feature'] = df['LoanAmount'] / df['Loan_Amount_Term']

# Log transformation, log(1 + x), to reduce the right skew of these numerical columns
# np.log1p(x) computes log(1 + x), so zero values map to 0
df['ApplicantIncome_log'] = np.log1p(df['ApplicantIncome'])
df['CoapplicantIncome_log'] = np.log1p(df['CoapplicantIncome'])
df['LoanAmount_log'] = np.log1p(df['LoanAmount'])
df['Total_Income_log'] = np.log1p(df['Total_Income'])

# Drop the original skewed columns (including Total_Income); their log-transformed versions are kept
df = df.drop(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Total_Income'], axis=1)
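
The feature-engineering steps above can be tried on a few hypothetical applicants. This sketch is an illustration under assumptions, not the notebook's exact code: it swaps the row-wise `apply` for a vectorized `np.where` with a seeded `np.random.default_rng` generator, and all amounts in `toy` are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical applicants; amounts are illustrative only
toy = pd.DataFrame({
    "ApplicantIncome": [2500, 5000, 40000],
    "CoapplicantIncome": [0, 1500, 0],
    "LoanAmount": [100.0, 150.0, 600.0],
    "Credit_History": [1.0, 0.0, 1.0],
})

# A seeded Generator makes the synthetic score repeatable
rng = np.random.default_rng(42)
good = toy["Credit_History"] == 1.0
toy["CIBIL_Score"] = np.where(
    good,
    rng.integers(680, 900, size=len(toy)),  # upper bound is exclusive
    rng.integers(300, 680, size=len(toy)),
)

toy["Total_Income"] = toy["ApplicantIncome"] + toy["CoapplicantIncome"]
toy["Loan_to_Income_Ratio"] = toy["LoanAmount"] / toy["Total_Income"]

# log1p(x) == log(1 + x), so a zero income maps to 0 instead of -inf
toy["CoapplicantIncome_log"] = np.log1p(toy["CoapplicantIncome"])
toy["Total_Income_log"] = np.log1p(toy["Total_Income"])

print(toy.round(3))
```

Note that the ratio divides by `Total_Income`, so a row where both incomes are zero would produce an infinite value and would need separate handling.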

## 4. Encoding Categorical Variables

In [None]:
# Convert categorical columns to numerical using one-hot encoding
df = pd.get_dummies(df, columns=['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area'], drop_first=True)

# Map Loan_Status to 0 and 1
df['Loan_Status'] = df['Loan_Status'].map({'Y': 1, 'N': 0})
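
One practical consequence of one-hot encoding is that a single new applicant rarely contains every category, so its encoded columns will not match the training layout. The sketch below, using hypothetical rows and a hypothetical `feature_cols` list, shows one way to realign the columns with `DataFrame.reindex`; whether the Flask application does this is not stated in the notebook.

```python
import pandas as pd

# Hypothetical training rows
train = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Loan_Status": ["Y", "N", "Y"],
})
train_enc = pd.get_dummies(train, columns=["Property_Area"], drop_first=True, dtype=int)
train_enc["Loan_Status"] = train_enc["Loan_Status"].map({"Y": 1, "N": 0})

# Remember the feature layout seen during training
feature_cols = [c for c in train_enc.columns if c != "Loan_Status"]

# A single new applicant: get_dummies only sees one category here,
# so the columns are realigned to the training layout
new = pd.DataFrame({"Property_Area": ["Urban"]})
new_enc = pd.get_dummies(new, columns=["Property_Area"], dtype=int)
new_enc = new_enc.reindex(columns=feature_cols, fill_value=0)

print(feature_cols)
print(new_enc)
```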

## 5. Model Training and Evaluation

In [None]:
# Define features (X) and target (y)
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize all feature columns; the scaler is fit on the training split only so test-set statistics do not leak into it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the RandomForestClassifier
model = RandomForestClassifier(n_estimators=150, max_depth=10, random_state=42)
model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = model.predict(X_test_scaled)

# Evaluate the model
print(f"Accuracy Score: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
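
The split, scale, train, evaluate sequence above can be reproduced without the CSV. The following sketch uses synthetic data from `make_classification` as a stand-in (an assumption, with roughly 70% positive labels) and the same hyperparameters as the cell above; the `*_demo` names are hypothetical. It prints the class balance in each split to show the effect of `stratify`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.3, 0.7], random_state=42
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, then reuse it on the test split
demo_scaler = StandardScaler()
X_tr_s = demo_scaler.fit_transform(X_tr)
X_te_s = demo_scaler.transform(X_te)

demo_model = RandomForestClassifier(n_estimators=150, max_depth=10, random_state=42)
demo_model.fit(X_tr_s, y_tr)
demo_acc = accuracy_score(y_te, demo_model.predict(X_te_s))

print(f"train positive rate: {y_tr.mean():.3f}, test positive rate: {y_te.mean():.3f}")
print(f"largest |mean| of scaled training features: {np.abs(X_tr_s.mean(axis=0)).max():.2e}")
print(f"test accuracy: {demo_acc:.3f}")
```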

## 6. Save the Model and Scaler

In [None]:
# Make sure the output directory exists, then save the trained model to a file
import os
os.makedirs('model', exist_ok=True)
with open('model/loan_eligibility_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Save the scaler to a file
with open('model/scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

print("Model and Scaler have been saved successfully.")
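
On the consuming side, the saved scaler has to be applied before the saved model predicts, in the same order as during training. This is a minimal round-trip sketch under assumptions: it trains tiny hypothetical objects (`small_model`, `small_scaler`) on random data and writes them to a temporary directory rather than to `model/`. It does not show the Flask application's actual loading code.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical training set, just to have fitted objects to save
rng = np.random.default_rng(0)
X_small = rng.normal(size=(40, 3))
y_small = (X_small[:, 0] > 0).astype(int)

small_scaler = StandardScaler().fit(X_small)
small_model = RandomForestClassifier(n_estimators=10, random_state=0)
small_model.fit(small_scaler.transform(X_small), y_small)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "loan_eligibility_model.pkl")
    scaler_path = os.path.join(tmp, "scaler.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(small_model, f)
    with open(scaler_path, "wb") as f:
        pickle.dump(small_scaler, f)

    # Loading side: read both objects back from disk
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)

# Apply the saved scaler before predicting, as during training
new_rows = rng.normal(size=(5, 3))
same = np.array_equal(
    small_model.predict(small_scaler.transform(new_rows)),
    loaded_model.predict(loaded_scaler.transform(new_rows)),
)
print("round-trip predictions match:", same)
```

Pickle files should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that produced them.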