# **Business Problem Understanding**
Dream Housing Finance company deals in all kinds of home loans. It has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility for it.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('LoanData.csv')
data.head()

In [None]:
data.columns

# **Data Understanding**
 - Loan_ID  ---> Unique loan id
 - Gender  ---> Male / female
 - Married  ---> Applicant Married or not
 - Dependents  ---> Number of dependents
 - Education  ---> Applicant's education (Graduate / Not Graduate)
 - Self_Employed  ---> Self employed or not
 - ApplicantIncome  ---> Applicant income
 - CoapplicantIncome  ---> Coapplicant income
 - LoanAmount  ---> Loan amount in thousands
 - Loan_Amount_Term  ---> Loan amount term in months
 - Credit_History  ---> Credit history meets guidelines
 - Property_Area  ---> Urban, Semiurban, or Rural area
 - Loan_Status  ---> Loan approved or not **Target Variable**

In [None]:
data.info()

In [None]:
data['Loan_ID'].nunique()

**Drop Unimportant columns as per the Feature selection (Filter Methods)**

In [7]:
data.drop(columns=['Loan_ID'],inplace= True)
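`Loan_ID` is dropped because it takes a different value on every row, so it cannot help separate approved from rejected loans. A minimal sketch of that check follows. The helper name `find_id_like_columns` and the toy values are assumptions for illustration and are not taken from `LoanData.csv`. A genuinely continuous column can also be all-distinct, so the result is a hint, not a rule.

```python
import pandas as pd


def find_id_like_columns(df: pd.DataFrame) -> list:
    """Return the columns in which every row holds a distinct value (hypothetical helper)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) == len(df)]


# Toy data for illustration only; these values are not from LoanData.csv.
toy = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003", "LP004"],
    "Gender": ["Male", "Female", "Male", "Male"],
    "LoanAmount": [120, 66, 120, 141],
})

id_like = find_id_like_columns(toy)
print(id_like)
```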

In [None]:
data['Gender'].unique()

In [None]:
data['Gender'].value_counts()

In [None]:
data['Married'].unique()

In [None]:
data['Married'].value_counts()

In [None]:
data['Dependents'].unique()

In [None]:
data['Dependents'].value_counts()

In [None]:
data['Education'].unique()

In [None]:
data['Education'].value_counts()

In [None]:
data['Self_Employed'].unique()

In [None]:
data['Self_Employed'].value_counts()

In [None]:
data['ApplicantIncome'].describe()

In [None]:
data['CoapplicantIncome'].describe()

**Create New columns as per the Requirement**

In [20]:
data['Income'] = data['ApplicantIncome'] + data['CoapplicantIncome']
data.drop(columns=['ApplicantIncome','CoapplicantIncome'],inplace=True)

In [None]:
data['Income'].describe()

In [None]:
data['Loan_Amount_Term'].unique()

In [None]:
data['Loan_Amount_Term'].value_counts()

In [None]:
data['Credit_History'].unique()

In [25]:
data['Credit_History']= data['Credit_History'].replace({1:'good',0:'bad'})

In [None]:
data["Credit_History"].unique()

In [None]:
data['Credit_History'].value_counts()

In [None]:
data['Property_Area'].unique()

In [None]:
data['Property_Area'].value_counts()

In [None]:
data['Loan_Status'].unique()

In [None]:
data['Loan_Status'].value_counts()

In [32]:
continous = ['Income','LoanAmount']

discrete_categorical = ['Gender', 'Married', 'Education',
                        'Self_Employed', 'Credit_History',
                        'Property_Area', 'Loan_Status']

discrete_count =['Dependents','Loan_Amount_Term']

**Exploratory Data Analysis (EDA)**

For Continuous Variables

In [None]:
data[continous].describe().T

In [None]:
plt.rcParams['figure.figsize'] = (18,8)

plt.subplot(1,2,1)
sns.histplot(data['Income'],kde=True)

plt.subplot(1,2,2)
sns.histplot(data['LoanAmount'],kde=True)

plt.suptitle('Univariate Analysis on Numerical Columns')
plt.show()

In [None]:
data[continous].skew()

In [None]:
sns.heatmap(data[continous].corr(), annot=True)
plt.show()

In [None]:
sns.pairplot(data[continous])
plt.show()

For Discrete Variables

In [None]:
data[discrete_categorical].describe().T

In [None]:
plt.rcParams['figure.figsize'] = (18,8)

# Pass each column as x= so the call works across seaborn versions
plt.subplot(2,3,1)
sns.countplot(x=data['Gender'])

plt.subplot(2,3,2)
sns.countplot(x=data['Married'])

plt.subplot(2,3,3)
sns.countplot(x=data['Self_Employed'])

plt.subplot(2,3,4)
sns.countplot(x=data['Property_Area'])

plt.subplot(2,3,5)
sns.countplot(x=data['Education'])

plt.subplot(2,3,6)
sns.countplot(x=data['Loan_Status'])

plt.suptitle('Univariate Analysis on Categorical Columns')
plt.show()

**Checking the Missing values**

In [None]:
data.isnull().sum()
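Raw counts from `isnull().sum()` are easier to judge next to the share of rows they represent. The sketch below shows one way to tabulate both. The frame `toy` and its values are made up for illustration and do not reproduce the loan dataset.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; these values are not from LoanData.csv.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [120.0, np.nan, np.nan, 141.0],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

# Count and percentage of missing values per column, worst column first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```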

**Checking for skewness**

In [None]:
data[continous].skew()

**Checking the Outliers**

In [None]:
# Visualize using boxplot

plt.subplot(1,2,1)
sns.boxplot(x=data['Income'])

plt.subplot(1,2,2)
sns.boxplot(x=data['LoanAmount'])

plt.suptitle('Outliers in the Data')
plt.show()
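The points drawn beyond the whiskers of a boxplot follow, by seaborn's default, the 1.5 × IQR rule. The sketch below applies the same rule numerically so the outliers can be counted. The helper `iqr_outlier_mask` and the income values are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy incomes for illustration only; one value is deliberately extreme.
toy_income = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 81000])

mask = iqr_outlier_mask(toy_income)
print("Number of outliers:", int(mask.sum()))
print("Outlier values:", toy_income[mask].tolist())
```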

# **Step-3 Data Preprocessing**
1. **Data Cleaning**
    - Wrong Data 
    - Missing Values
    - Wrong Data Type
    - Duplicate Data
    - Outliers
2. **Data Wrangling**
    - Transformation (applicable to continuous variables)
    - Scaling (applicable to continuous variables)
    - Encoding (applicable to discrete categorical variables)
    - **Do not apply data wrangling to count variables**
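The wrangling rules above (scale continuous columns, encode categorical columns, leave count columns alone) can be expressed in one object. The sketch below uses scikit-learn's `ColumnTransformer` with one-hot encoding. This is a swapped-in alternative shown for illustration, not the approach this notebook takes, and the toy values are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data for illustration only; these values are not from LoanData.csv.
toy = pd.DataFrame({
    "Income": [5000.0, 3200.0, 8100.0, 4300.0],
    "LoanAmount": [128.0, 66.0, 210.0, 95.0],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "Dependents": [0, 1, 3, 2],
})

preprocess = ColumnTransformer([
    # Continuous columns are standardized.
    ("scale", StandardScaler(), ["Income", "LoanAmount"]),
    # Categorical columns are one-hot encoded.
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["Property_Area"]),
    # Count columns pass through unchanged.
    ("counts", "passthrough", ["Dependents"]),
])

features = preprocess.fit_transform(toy)
print(features.shape)  # 2 scaled + 3 one-hot + 1 passthrough columns
```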

**Wrong Data Treatment**

In [43]:
data['Dependents']=data['Dependents'].replace({"3+":3})

**Missing Value Treatment**

In [44]:
data['Dependents']=data['Dependents'].fillna(0)

data['Gender']=data['Gender'].fillna(data['Gender'].mode()[0])
data['Married']=data['Married'].fillna(data['Married'].mode()[0])
data['Self_Employed']=data['Self_Employed'].fillna(data['Self_Employed'].mode()[0])

data = data.dropna(subset=["Income",'LoanAmount','Loan_Amount_Term','Credit_History'])

**Data Type Conversion**

In [45]:
data['Dependents'] = data['Dependents'].astype('int')
data['Loan_Amount_Term'] = data['Loan_Amount_Term'].astype('int')

**Outlier Treatment**

In [46]:
# Outliers should be retained

**Transformation**

In [47]:
from scipy.stats import boxcox

# Box-Cox requires strictly positive values; the second return value is the fitted lambda
data['Income'],a = boxcox(data['Income'])
data['LoanAmount'],c = boxcox(data['LoanAmount'])
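Box-Cox fits a power parameter lambda that makes a strictly positive, skewed variable more symmetric. The sketch below, on synthetic right-skewed data standing in for income, shows the skew before and after and how `scipy.special.inv_boxcox` maps values back to the original scale. All numbers here are generated, not taken from the loan dataset.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox, skew

# Synthetic, strictly positive, right-skewed values (an assumption for illustration).
rng = np.random.default_rng(0)
toy_income = rng.lognormal(mean=8.5, sigma=0.6, size=500)

# With no lmbda argument, boxcox returns the transformed data and the fitted lambda.
transformed, lam = boxcox(toy_income)

# The stored lambda is what allows mapping back to the original scale.
restored = inv_boxcox(transformed, lam)

print("skew before:", skew(toy_income))
print("skew after :", skew(transformed))
print("lambda     :", lam)
```

Keeping the fitted lambda matters for scoring new applications: calling `boxcox(new_values, lmbda=lam)` applies the same transformation instead of fitting a new one.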

In [None]:
data[continous].skew()

**Encoding**

In [49]:
data['Gender'] = data['Gender'].replace({'Male': 1, 'Female':0})
data['Married'] = data['Married'].replace({'Yes': 1, 'No':0})
data['Education'] = data['Education'].replace({'Graduate': 1, 'Not Graduate':0})
data['Self_Employed'] = data['Self_Employed'].replace({'Yes':1, 'No':0})
data['Property_Area'] = data['Property_Area'].replace({'Rural': 0, 'Semiurban': 1, 'Urban':2})
data['Credit_History'] = data['Credit_History'].replace({'good': 1, 'bad' : 0})
data['Loan_Status'] = data['Loan_Status'].replace({'Y':1, 'N':0})

In [None]:
data.head()

**X & Y**

In [51]:
X = data.drop('Loan_Status',axis=1)
y = data['Loan_Status']

**Best Random State**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, cross_val_score

Train = []
CV = []

# Try 100 candidate seeds and record the train and cross-validation R2 for each split
for i in range (0,100):
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state = i)
    
    model = LinearRegression()
    model.fit(X_train,y_train)
    
    ypred_train = model.predict(X_train)
    
    Train.append(r2_score(y_train,ypred_train))
    CV.append(cross_val_score(model,X_train,y_train,cv=5).mean())

# Keep the seeds whose train and CV scores differ by at most 0.05,
# then pick the one with the highest CV score
em = pd.DataFrame({'Train':Train,'CV':CV})
gm = em[(abs(em['Train']-em['CV'])<=0.05)]
rs = gm [gm['CV']==gm['CV'].max()].index.tolist()[0]
print('Best Random State Number :',rs)

**Train Test Split**

In [53]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state = 10)
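When the approved and rejected classes are imbalanced, a purely random split can leave the test set with a different class mix than the training set. The sketch below shows the optional `stratify` argument of `train_test_split`, which preserves the class ratio in both parts. It is a variation on the split used here, shown on made-up data with a 70/30 class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a made-up 70/30 class balance (assumptions for illustration).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 70 + [0] * 30)

# stratify=y_toy keeps the share of each class the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10, stratify=y_toy
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test :", y_te.mean())
```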

# **Modelling**

In [54]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import GridSearchCV, cross_val_score

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

**Logistic Regression**

In [None]:
# Modelling
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()
logistic_model.fit(X_train,y_train)

# Predict & Evaluate on train data
y_pred_train = logistic_model.predict(X_train) # default threshold value is 0.5
print("Train Accuracy :", accuracy_score(y_train, y_pred_train))

# Cross Validation on train data
print('CV Score :', cross_val_score(logistic_model,X_train,y_train,cv=5,scoring="accuracy").mean())

# Predict on test data
y_pred_test = logistic_model.predict(X_test)
print("Test Accuracy :", accuracy_score(y_test, y_pred_test))

In [None]:
print(confusion_matrix(y_test,y_pred_test))
print(classification_report(y_test,y_pred_test))
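The precision and recall figures in the classification report come straight from the four cells of the confusion matrix. The worked sketch below uses ten made-up labels to show the arithmetic. In scikit-learn's layout, rows are actual classes and columns are predicted classes, so `ravel()` yields TN, FP, FN, TP for a binary problem.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = loan approved, 0 = loan rejected.
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0, 1, 0, 1, 1, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of the loans predicted as approved, how many really were approved.
precision = tp / (tp + fp)
# Recall: of the loans that really were approved, how many the model found.
recall = tp / (tp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall)
```

Here a false positive is a loan the model would approve although it was actually rejected, which is usually the costlier error for a lender.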

In [None]:
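from sklearn.metrics import roc_curve,auc,RocCurveDisplay,roc_auc_score

# The ROC curve sweeps a threshold over scores, so use the predicted
# probability of class 1 rather than the hard 0/1 predictions
y_score_test = logistic_model.predict_proba(X_test)[:,1]

fpr, tpr, thresholds = roc_curve(y_test,y_score_test)
roc_auc = auc(fpr,tpr)
RocCurveDisplay(fpr=fpr,tpr=tpr,roc_auc=roc_auc).plot()
plt.show()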

In [None]:
print ('AUC Score:',roc_auc_score(y_test,logistic_model.predict_proba(X_test)[:,1]))
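A ROC curve is traced by sweeping a decision threshold over continuous scores. Hard 0/1 predictions offer only one meaningful cut point, so the curve collapses to a single corner. The sketch below, on synthetic data from `make_classification`, compares the number of thresholds available in each case. The data and the names `X_toy`, `y_toy`, and `clf` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the loan dataset).
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
hard_labels = clf.predict(X_te)           # 0/1 decisions at the 0.5 threshold
scores = clf.predict_proba(X_te)[:, 1]    # probability of class 1

_, _, thr_labels = roc_curve(y_te, hard_labels)
_, _, thr_scores = roc_curve(y_te, scores)

print("thresholds from hard labels  :", len(thr_labels))
print("thresholds from probabilities:", len(thr_scores))
print("AUC from hard labels  :", roc_auc_score(y_te, hard_labels))
print("AUC from probabilities:", roc_auc_score(y_te, scores))
```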

**KNN**

In [None]:
# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

# Define the estimator and parameter grid
estimator = KNeighborsClassifier()
param_grid = {'n_neighbors': list(range(1, 50)), 'p': [1, 2]}

# Perform Grid Search
knn_grid = GridSearchCV(estimator, param_grid, cv=5, scoring='accuracy')
knn_grid.fit(X_train, y_train)

# KNN Model with Best Parameters
knn_grid.best_estimator_

In [None]:
# Modelling
knn_model = knn_grid.best_estimator_
knn_model.fit(X_train,y_train)

# Predicting on Train data
y_pred = knn_model.predict(X_train)
print ("KNN Model Train Accuracy:", accuracy_score(y_train, y_pred))

#Cross Validation on Train data
print ("KNN Model Cross Validation Accuracy:", cross_val_score(knn_model, X_train, y_train, cv=5, scoring="accuracy").mean())

# Predicting on Test data
y_pred_test = knn_model.predict(X_test)
print ("KNN Model Test Accuracy:", accuracy_score(y_test, y_pred_test))
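KNN classifies by distance, so features measured on larger numeric scales can dominate the neighbour search. One common remedy is to standardize inside a `Pipeline`, so that scaling is fitted separately within each cross-validation fold. The sketch below shows this on synthetic data. It is an optional variation, not the configuration used in this notebook, and the grid values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the loan dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"knn__n_neighbors": list(range(1, 30, 2)), "knn__p": [1, 2]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print("test accuracy:", grid.score(X_te, y_te))
```

With the default `refit=True`, `GridSearchCV` already refits the best configuration on the full training set, so `grid` can be used for prediction directly.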

**SVM**

In [61]:
# Hyperparameter Tuning
from sklearn.svm import SVC

# Define the estimator and parameter grid
estimator = SVC()
param_grid = {'C': [0.01,0.1,1], 'kernel': ['linear', 'rbf', 'poly','sigmoid']}

# Perform Grid Search
svm_grid = GridSearchCV(estimator, param_grid, cv=5, scoring='accuracy')
svm_grid.fit(X_train, y_train)

# SVM Model with Best Parameters
svm_grid.best_estimator_

In [62]:
# Modelling
svm_model = svm_grid.best_estimator_
svm_model.fit(X_train,y_train)

# Predicting on Train data
y_pred = svm_model.predict(X_train)
print ("SVM Model Train Accuracy:", accuracy_score(y_train, y_pred))

#Cross Validation on Train data
print ("SVM Model Cross Validation Accuracy:", cross_val_score(svm_model, X_train, y_train, cv=5, scoring="accuracy").mean())

# Predicting on Test data
y_pred_test = svm_model.predict(X_test)
print ("SVM Model Test Accuracy:", accuracy_score(y_test, y_pred_test))

**Decision Tree**

In [None]:
model = DecisionTreeClassifier(random_state=1)  # integer seed; True was interpreted as 1
model.fit(X_train,y_train)

In [None]:
from sklearn.tree import plot_tree
plt.figure(figsize=(20,20),dpi=300)
plot_tree(model,filled=True,
          feature_names=X_train.columns.tolist(),
          class_names=['0','1'])
plt.show()

In [None]:
# Hyperparameter tuning
estimator = DecisionTreeClassifier(random_state=1)  # integer seed; True was interpreted as 1

param_grid = {'criterion':['gini','entropy'],
              'max_depth':list(range(1,16))}

dt_grid = GridSearchCV(estimator,param_grid,scoring='accuracy',cv=5)
dt_grid.fit(X_train,y_train)

dt = dt_grid.best_estimator_
dt

In [None]:
# Important features: keep those the tuned tree actually uses (importance > 0)
feats_dt = pd.DataFrame(data=dt.feature_importances_,
                        index=X.columns,
                        columns=['Importance'])

imporatant_features_dt = feats_dt[feats_dt['Importance']>0].index.tolist()
imporatant_features_dt
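The `Importance > 0` filter works because a fitted tree assigns zero importance to any feature it never splits on, and the importances sum to 1 across all features. The sketch below demonstrates this on made-up data with one informative column and one constant column. The column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: "signal" determines the label, "constant" carries no information.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=200),
    "constant": np.ones(200),
})
y_toy = (X_toy["signal"] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_toy, y_toy)

importance = pd.Series(tree.feature_importances_, index=X_toy.columns)
kept = importance[importance > 0].index.tolist()

print(importance)
print("features kept:", kept)
```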

**Creating Decision tree model with best hyperparameters and important features**

In [None]:
# Select train & test data
X_train_dt = X_train[imporatant_features_dt]
X_test_dt = X_test[imporatant_features_dt]

# Modelling 
dt = dt_grid.best_estimator_
dt.fit(X_train_dt,y_train)

# Evaluation
ypred_train = dt.predict(X_train_dt)
ypred_test = dt.predict(X_test_dt)

print ("Train Accuracy :",accuracy_score(y_train,ypred_train))
print ("CV Score :",cross_val_score(dt,X_train_dt,y_train,cv=5,scoring='accuracy').mean())
print ("Test Accuracy :",accuracy_score(y_test,ypred_test))

**Random Forest Classifier**

In [None]:
# Hyperparameter tuning
estimator = RandomForestClassifier(random_state=1)  # integer seed; True was interpreted as 1

param_grid = {'n_estimators':list(range(1,51))}

rf_grid = GridSearchCV(estimator,param_grid, scoring="accuracy",cv=5)
rf_grid.fit(X_train,y_train)

rf = rf_grid.best_estimator_
rf

In [None]:
# Important features
feats_rf = pd.DataFrame(data = rf.feature_importances_,
                        index=X.columns,
                        columns = ['Importance'])

imporatant_features_rf = feats_rf[feats_rf['Importance']>0].index.tolist()
imporatant_features_rf

In [None]:
X_train_rf = X_train[imporatant_features_rf]
X_test_rf = X_test[imporatant_features_rf]

#Modelling
rf = rf_grid.best_estimator_
rf.fit(X_train_rf,y_train)

# Evaluation
ypred_train = rf.predict(X_train_rf)
ypred_test = rf.predict(X_test_rf)

print ("Train Accuracy :",accuracy_score(y_train,ypred_train))
print ("CV Score :",cross_val_score(rf,X_train_rf,y_train,cv=5,scoring='accuracy').mean())
print ("Test Accuracy :",accuracy_score(y_test,ypred_test))

**AdaBoost Classifier**

In [None]:
estimator = AdaBoostClassifier(random_state=1)  # integer seed; True was interpreted as 1
param_grid = {'n_estimators':list(range(1,51))}

ab_grid = GridSearchCV(estimator,param_grid, scoring="accuracy",cv=5)
ab_grid.fit(X_train,y_train)

ab = ab_grid.best_estimator_
ab

In [None]:
# Important features
feats_ab = pd.DataFrame(data = ab.feature_importances_,
                        index=X.columns,
                        columns = ['Importance'])

imporatant_features_ab = feats_ab[feats_ab['Importance']>0].index.tolist()
imporatant_features_ab

In [None]:
X_train_ab = X_train[imporatant_features_ab]
X_test_ab = X_test[imporatant_features_ab]

#Modelling
ab = ab_grid.best_estimator_
ab.fit(X_train_ab,y_train)

# Evaluation
ypred_train = ab.predict(X_train_ab)
ypred_test = ab.predict(X_test_ab)

print ("Train Accuracy :",accuracy_score(y_train,ypred_train))
print ("CV Score :",cross_val_score(ab,X_train_ab,y_train,cv=5,scoring='accuracy').mean())
print ("Test Accuracy :",accuracy_score(y_test,ypred_test))

**Gradient Boost Classifier**

In [None]:
estimator = GradientBoostingClassifier(random_state=1)  # integer seed; True was interpreted as 1
param_grid = {'n_estimators':list(range(1,10)),
              'learning_rate':[0.1,0.2,0.3,0.4,0.5,0.7,0.8,0.9,1.0]}

gb_grid = GridSearchCV(estimator,param_grid, scoring="accuracy",cv=5)
gb_grid.fit(X_train,y_train)

gb = gb_grid.best_estimator_
gb

In [None]:
# Important features
feats_gb = pd.DataFrame(data = gb.feature_importances_,
                        index=X.columns,
                        columns = ['Importance'])

imporatant_features_gb = feats_gb[feats_gb['Importance']>0].index.tolist()
imporatant_features_gb

In [None]:
X_train_gb = X_train[imporatant_features_gb]
X_test_gb = X_test[imporatant_features_gb]

#Modelling
gb = gb_grid.best_estimator_
gb.fit(X_train_gb,y_train)

# Evaluation
ypred_train = gb.predict(X_train_gb)
ypred_test = gb.predict(X_test_gb)

print ("Train Accuracy :",accuracy_score(y_train,ypred_train))
print ("CV Score :",cross_val_score(gb,X_train_gb,y_train,cv=5,scoring='accuracy').mean())
print ("Test Accuracy :",accuracy_score(y_test,ypred_test))

**XGBoost Classifier**

In [None]:
# Hyperparameter tuning
estimator = XGBClassifier()
param_grid = {'n_estimators':[10,20,40,100],
              'max_depth':[3,4,5],
              'gamma':[0,0.15,0.3,0.5,1]}

xgb_grid = GridSearchCV(estimator,param_grid, scoring="accuracy",cv=5)
xgb_grid.fit(X_train,y_train)

xgb = xgb_grid.best_estimator_
xgb

In [None]:
# Important features
feats_xgb = pd.DataFrame(data = xgb.feature_importances_,
                        index=X.columns,
                        columns = ['Importance'])

imporatant_features_xgb = feats_xgb[feats_xgb['Importance']>0].index.tolist()
imporatant_features_xgb

In [None]:
X_train_xgb = X_train[imporatant_features_xgb]
X_test_xgb = X_test[imporatant_features_xgb]

#Modelling
xgb = xgb_grid.best_estimator_
xgb.fit(X_train_xgb,y_train)

# Evaluation
ypred_train = xgb.predict(X_train_xgb)
ypred_test = xgb.predict(X_test_xgb)

print ("Train Accuracy :",accuracy_score(y_train,ypred_train))
print ("CV Score :",cross_val_score(xgb,X_train_xgb,y_train,cv=5,scoring='accuracy').mean())
print ("Test Accuracy :",accuracy_score(y_test,ypred_test))

# **Save the best model**

In [None]:
from joblib import dump

dump(dt,'loan.joblib')
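A saved model is only useful if it can be loaded back and produces the same predictions. The sketch below shows a round trip with `joblib.dump` and `joblib.load` on a small stand-in model. The file name `toy_model.joblib` and the synthetic data are assumptions for illustration. A real loaded model must receive the same feature columns, in the same order, that it was trained on.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the trained model (not the loan model itself).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
model = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print("Predictions identical after reload:", same_predictions)
```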