### Domain:
**Finance and Banking**
#### Context:
Dream Housing Finance offers a full range of home loans and operates across urban, semi-urban and rural areas. Customers first apply for a home loan, after which the company manually validates their eligibility. 
The company wants to automate the eligibility check based on the details customers provide when filling in the online application form. 
They need a web application where a user can register, log in, and enter the required details, such as Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others, to check eligibility for a home loan.

**Project Objective:**
 1. This is a standard supervised classification task: predict whether a customer is eligible for a loan based on a given set of independent variables.
 2. Build a Python Flask ML application where a user registers with a username and password, logs in to the website, and then enters their details to check whether they are eligible for a loan.


**Dataset Description:**
- **Loan ID**: Unique Loan ID
- **Gender**: Male or Female 
- **Married**: Applicant married (Y/N)
- **Dependents**: Number of dependents
- **Self employed**: Self employed (Y/N)
- **Education**: Graduate/Not Graduate
- **Applicant Income**: Applicant income (in dollars)
- **Co Applicant Income**: Co Applicant Income (in dollars)
- **Loan Amount**: Loan amount (in thousands of dollars)
- **Loan Amount Term**: Term of loan in months
- **Credit History**: Credit history meets guidelines (1 = Yes, 0 = No)
- **Property area**: Urban/Semi Urban/Rural
- **Loan Status(Target)**: Loan Approved (Y/N)


### 1. Import required libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

### 2. Load the dataset

In [3]:
df = pd.read_csv('loan_approval_data.csv')
df.head(5)

Unnamed: 0,loan_id,gender,married,dependents,education,self_employed,applicantincome,coapplicantincome,loanamount,loan_amount_term,credit_history,property_area,loan_status
0,lp001002,male,no,0.0,graduate,no,5849,0.0,,360.0,1.0,urban,y
1,lp001003,male,yes,1.0,graduate,no,4583,1508.0,128.0,360.0,1.0,rural,n
2,lp001005,male,yes,0.0,graduate,yes,3000,0.0,66.0,360.0,1.0,urban,y
3,lp001006,male,yes,0.0,not graduate,no,2583,2358.0,120.0,360.0,1.0,urban,y
4,lp001008,male,no,0.0,graduate,no,6000,0.0,141.0,360.0,1.0,urban,y


In [4]:
df.tail(5)

Unnamed: 0,loan_id,gender,married,dependents,education,self_employed,applicantincome,coapplicantincome,loanamount,loan_amount_term,credit_history,property_area,loan_status
609,lp002978,female,no,0.0,graduate,no,2900,0.0,71.0,360.0,1.0,rural,y
610,lp002979,male,yes,3.0,graduate,no,4106,0.0,40.0,180.0,1.0,rural,y
611,lp002983,male,yes,1.0,graduate,no,8072,240.0,253.0,360.0,1.0,urban,y
612,lp002984,male,yes,2.0,graduate,no,7583,0.0,187.0,360.0,1.0,urban,y
613,lp002990,female,no,0.0,graduate,yes,4583,0.0,133.0,360.0,0.0,semiurban,n


### 3. Check the shape and basic information of the dataset.

In [5]:
df.shape

(614, 13)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   loan_id            614 non-null    object 
 1   gender             601 non-null    object 
 2   married            611 non-null    object 
 3   dependents         599 non-null    float64
 4   education          614 non-null    object 
 5   self_employed      582 non-null    object 
 6   applicantincome    614 non-null    int64  
 7   coapplicantincome  614 non-null    float64
 8   loanamount         592 non-null    float64
 9   loan_amount_term   600 non-null    float64
 10  credit_history     564 non-null    float64
 11  property_area      614 non-null    object 
 12  loan_status        614 non-null    object 
dtypes: float64(5), int64(1), object(7)
memory usage: 62.5+ KB


### 4. Check for duplicate records in the dataset. If present, drop them.

In [6]:
len(df[df.duplicated()])

0
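
The dataset has no duplicates, so nothing is dropped here. For reference, a minimal sketch of how duplicates could be counted and removed, using a hypothetical toy frame (`toy` and its values are illustrative, not taken from the loan data):

```python
import pandas as pd

# Hypothetical toy frame; the last row repeats the first one.
toy = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "applicantincome": [5849, 2900, 5849],
})

# duplicated() marks every row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduped.shape)
```

Running the check before dropping `loan_id` only finds rows that are identical in every column, including the ID.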

### 5. Drop the columns that are redundant for the analysis

In [7]:
df.drop(['loan_id'],axis=1,inplace=True)

### 6. Check the percentage of missing values in each column of the data frame. Impute the missing values if there are any

In [8]:
df.isnull().sum()/len(df)*100

gender               2.117264
married              0.488599
dependents           2.442997
education            0.000000
self_employed        5.211726
applicantincome      0.000000
coapplicantincome    0.000000
loanamount           3.583062
loan_amount_term     2.280130
credit_history       8.143322
property_area        0.000000
loan_status          0.000000
dtype: float64

In [9]:
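# Categorical and discrete columns: fill with the most frequent value (mode).
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
df['married'] = df['married'].fillna(df['married'].mode()[0])
df['dependents'] = df['dependents'].fillna(df['dependents'].mode()[0])
df['self_employed'] = df['self_employed'].fillna(df['self_employed'].mode()[0])
df['credit_history'] = df['credit_history'].fillna(df['credit_history'].mode()[0])
df['loan_amount_term'] = df['loan_amount_term'].fillna(df['loan_amount_term'].mode()[0])
# Continuous column: fill with the median, which is robust to outliers
df['loanamount'] = df['loanamount'].fillna(df['loanamount'].median())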

df.isnull().sum()/len(df)*100

gender               0.0
married              0.0
dependents           0.0
education            0.0
self_employed        0.0
applicantincome      0.0
coapplicantincome    0.0
loanamount           0.0
loan_amount_term     0.0
credit_history       0.0
property_area        0.0
loan_status          0.0
dtype: float64
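
The same mode/median imputation can be written as a loop over column groups. The sketch below runs on a hypothetical toy frame (`toy`, `mode_cols` and `median_cols` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "gender": ["male", None, "male", "female"],
    "credit_history": [1.0, 1.0, np.nan, 0.0],
    "loanamount": [128.0, np.nan, 66.0, 120.0],
})

# Percentage of missing values per column (equivalent to isnull().sum() / len(toy) * 100)
missing_pct = toy.isnull().mean() * 100

mode_cols = ["gender", "credit_history"]   # categorical / discrete columns
median_cols = ["loanamount"]               # continuous columns

for col in mode_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])
for col in median_cols:
    toy[col] = toy[col].fillna(toy[col].median())

print(missing_pct)
print(toy)
```

Grouping the columns this way keeps the imputation rule for each column type in one place.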

### 7. Encode the categorical columns

In [10]:
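# Categorical columns that are still stored as strings (object dtype)
categorical_cols = ["gender", "married", "education", "self_employed", "property_area", "loan_status"]
object_type_variables = [col for col in categorical_cols if df[col].dtype == object]
object_type_variables


le = LabelEncoder()

def encoder(df):
    # Label-encode each categorical column in place.
    # LabelEncoder sorts classes alphabetically, e.g. rural=0, semiurban=1, urban=2.
    for col in object_type_variables:
        df[col] = le.fit_transform(df[col].astype(str)).astype(int)
encoder(df)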

In [11]:
df.head()

Unnamed: 0,gender,married,dependents,education,self_employed,applicantincome,coapplicantincome,loanamount,loan_amount_term,credit_history,property_area,loan_status
0,1,0,0.0,0,0,5849,0.0,128.0,360.0,1.0,2,1
1,1,1,1.0,0,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0.0,0,1,3000,0.0,66.0,360.0,1.0,2,1
3,1,1,0.0,1,0,2583,2358.0,120.0,360.0,1.0,2,1
4,1,0,0.0,0,0,6000,0.0,141.0,360.0,1.0,2,1


In [12]:
df.tail(5)

Unnamed: 0,gender,married,dependents,education,self_employed,applicantincome,coapplicantincome,loanamount,loan_amount_term,credit_history,property_area,loan_status
609,0,0,0.0,0,0,2900,0.0,71.0,360.0,1.0,0,1
610,1,1,3.0,0,0,4106,0.0,40.0,180.0,1.0,0,1
611,1,1,1.0,0,0,8072,240.0,253.0,360.0,1.0,2,1
612,1,1,2.0,0,0,7583,0.0,187.0,360.0,1.0,2,1
613,0,0,0.0,0,1,4583,0.0,133.0,360.0,0.0,1,0
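
Because a single `LabelEncoder` is refit for every column, only the mapping of the last column survives. If the mappings are needed later, for example to encode user input at prediction time, one option is to keep one encoder per column. The following is a sketch on a hypothetical toy frame; the `encoders` dictionary is an assumption, not part of the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame using values that appear in the dataset
toy = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "property_area": ["urban", "rural", "semiurban"],
})

encoders = {}  # one fitted encoder per column, kept for later reuse
for col in ["gender", "property_area"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col].astype(str))
    encoders[col] = enc

# classes_ is sorted alphabetically; the position of a class is its integer code
area_classes = encoders["property_area"].classes_
area_codes = {str(label): code for code, label in enumerate(area_classes)}

# Encode a new value with the already fitted encoder
new_code = int(encoders["property_area"].transform(["semiurban"])[0])

print(area_codes, new_code)
```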


### 8. Separate the target and independent features.

In [13]:
X = df.drop(['loan_status'],axis=1)
y = df['loan_status']

In [14]:
X.head()

Unnamed: 0,gender,married,dependents,education,self_employed,applicantincome,coapplicantincome,loanamount,loan_amount_term,credit_history,property_area
0,1,0,0.0,0,0,5849,0.0,128.0,360.0,1.0,2
1,1,1,1.0,0,0,4583,1508.0,128.0,360.0,1.0,0
2,1,1,0.0,0,1,3000,0.0,66.0,360.0,1.0,2
3,1,1,0.0,1,0,2583,2358.0,120.0,360.0,1.0,2
4,1,0,0.0,0,0,6000,0.0,141.0,360.0,1.0,2


In [15]:
y.head()

0    1
1    0
2    1
3    1
4    1
Name: loan_status, dtype: int32
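
At prediction time the web application has to turn one applicant's form values into a row with the same column order and integer codes as `X`. A minimal sketch follows, assuming the codes visible in the encoded output above; `FEATURE_ORDER`, `CODES` and `to_feature_row` are hypothetical names introduced for illustration:

```python
import pandas as pd

# Column order of X, as shown in X.head()
FEATURE_ORDER = [
    "gender", "married", "dependents", "education", "self_employed",
    "applicantincome", "coapplicantincome", "loanamount",
    "loan_amount_term", "credit_history", "property_area",
]

# Integer codes read off the encoded output (alphabetical label order)
CODES = {
    "gender": {"female": 0, "male": 1},
    "married": {"no": 0, "yes": 1},
    "education": {"graduate": 0, "not graduate": 1},
    "self_employed": {"no": 0, "yes": 1},
    "property_area": {"rural": 0, "semiurban": 1, "urban": 2},
}

def to_feature_row(applicant: dict) -> pd.DataFrame:
    """Build a one-row frame in the column order the model was trained on."""
    row = {}
    for col in FEATURE_ORDER:
        value = applicant[col]
        if col in CODES:
            # Normalise the text before looking up its code
            value = CODES[col][str(value).strip().lower()]
        row[col] = value
    return pd.DataFrame([row], columns=FEATURE_ORDER)

# First record of the dataset, with the imputed loan amount of 128.0
applicant = {
    "gender": "Male", "married": "No", "dependents": 0, "education": "Graduate",
    "self_employed": "No", "applicantincome": 5849, "coapplicantincome": 0.0,
    "loanamount": 128.0, "loan_amount_term": 360.0, "credit_history": 1.0,
    "property_area": "Urban",
}
row = to_feature_row(applicant)
print(row)
```

The resulting row matches the first row of `X.head()` and could be passed directly to `model.predict`.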

### 9. Split the data into train and test.

In [16]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)

print(X_train.shape,X_test.shape)
print(y_train.shape,y_test.shape)

(429, 11) (185, 11)
(429,) (185,)
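
The split above is purely random. When the classes are imbalanced, passing `stratify=y` keeps the approved/rejected ratio the same in both parts. This is an optional variation, sketched here on synthetic data (the arrays below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 70 of class 1 and 30 of class 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 70 + [0] * 30)

# stratify=y_demo preserves the 70/30 class ratio in both train and test
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(Xd_train.shape, Xd_test.shape)
print(yd_train.mean(), yd_test.mean())
```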


### 10. Build a Random Forest classifier model and check the accuracy for train and test.

In [17]:
# The target is a binary class label, so a classifier is used rather than a regressor
model = RandomForestClassifier(n_estimators = 100)
model.fit(X_train, y_train)

RandomForestClassifier()

In [18]:
y_pred = model.predict(X_test)

print('Train accuracy:', accuracy_score(y_train, model.predict(X_train)))
print('Test accuracy:', accuracy_score(y_test, y_pred))
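
Since the objective is framed as classification, one way to evaluate a model for this task is with accuracy and a confusion matrix. The sketch below uses synthetic data rather than the loan dataset, so the numbers it prints say nothing about the real model; the `random_state` values are arbitrary choices for reproducibility:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary problem: the label depends on the first two features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(Xd_train, yd_train)

train_acc = accuracy_score(yd_train, clf.predict(Xd_train))
test_pred = clf.predict(Xd_test)
test_acc = accuracy_score(yd_test, test_pred)

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(yd_test, test_pred, labels=[0, 1])

print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
print(cm)
```

A large gap between train and test accuracy would suggest overfitting, which is common for fully grown random forests on small datasets.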

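### 11. Save the model as a pickle file with the .pkl extension

In [19]:
import pickle

# Save the trained model; the context manager closes the file handle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load it back to confirm the file can be read
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)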
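
A quick way to confirm that a pickled model survives the round trip is to compare predictions before and after loading. The sketch below uses a small synthetic model and a temporary directory so that it does not overwrite the real `model.pkl`:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model, only for demonstrating the round trip
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write to a temporary location instead of the project directory
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(clf, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same = bool(np.array_equal(clf.predict(X_demo), restored.predict(X_demo)))
print(same)
```

Pickle files should be loaded with the same scikit-learn version that created them, and only from trusted sources, because unpickling can execute arbitrary code.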