### Domain:
**Finance and Banking.**

#### Context:
Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, after which the company manually validates the customer's eligibility.
The company wants to automate the eligibility check based on the details customers provide when filling in the online application form.
They need a web application where a user can register, log in, and enter the required details, such as Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, and Credit History, to check eligibility for a home loan.

**Project Objective:**
 1. This is a standard supervised classification task: predict whether a customer is eligible for a loan based on a given set of independent variables.
 2. Build a Python Flask ML application in which a user registers with a username and password, logs in to the website, and then enters their details to check whether they are eligible for a loan.

**Dataset Description:**
- **Loan ID**: Unique Loan ID
- **Gender**: Male or Female 
- **Married**: Applicant married (Y/N)
- **Dependents**: Number of dependents
- **Self employed**: Self employed (Y/N)
- **Education**: Graduate/Undergraduate
- **Applicant Income**: Applicant income (in dollars)
- **Co Applicant Income**: Co-applicant income (in dollars)
- **Loan Amount**: Loan amount (in thousands of dollars)
- **Loan Amount Term**: Term of loan in months
- **Credit History**: Whether credit history meets guidelines (1 = Yes, 0 = No)
- **Property area**: Urban/Semi-urban/Rural
- **Loan Status(Target)**: Loan Approved (Y/N)

### 1. Import required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix
import datetime

### 2. Load the dataset

In [None]:
# Reading and loading the dataset
dataFrame = pd.read_csv('loan_approval_data.csv')
dataFrame.head(5)

In [None]:
dataFrame.tail(5)

### 3. Check the shape and basic information of the dataset.

In [None]:
## check the shape 
dataFrame.shape

In [None]:
## dataset information
dataFrame.info()

In [None]:
dataFrame.describe()

### 4. Check for duplicate records in the dataset. If present, drop them

In [None]:
## Checking for duplicate records (counts rows identical to an earlier row)
dataFrame.duplicated().sum()
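The heading asks for duplicates to be dropped, while the cell above only counts them. A minimal sketch of both steps follows, using a small made-up frame (the values and the `toy` name are illustrative assumptions, not rows from the dataset). It also shows that when an identifier column such as `loan_id` is unique per row, a full-row check cannot find duplicates, so comparing only the remaining columns may be more informative.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 2 differ only in loan_id
toy = pd.DataFrame({
    "loan_id": ["LP001", "LP002", "LP003"],
    "gender": ["Male", "Female", "Male"],
    "loanamount": [120.0, 100.0, 120.0],
    "loan_status": ["Y", "N", "Y"],
})

# Full-row comparison: the unique identifier hides the repeated record
full_row_dupes = int(toy.duplicated().sum())

# Compare everything except the identifier
mask = toy.drop(columns="loan_id").duplicated()
n_duplicates = int(mask.sum())

# Keep the first occurrence of each record and renumber the index
deduplicated = toy.loc[~mask].reset_index(drop=True)

print(full_row_dupes, n_duplicates, deduplicated.shape)
```

On the real data, `dataFrame = dataFrame.drop_duplicates()` would be the equivalent one-liner once it is clear which columns should define a duplicate.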

### 5. Drop the columns that are redundant for the analysis

In [None]:
## removing redundant columns (the loan ID is an identifier, not a predictor)
dataFrame = dataFrame.drop(columns=['loan_id'])

### 6. Check the percentage of missing values in each column of the data frame. Impute the missing values if there are any

In [None]:
# Percentage of missing values in each column before imputation
dataFrame.isnull().sum()/len(dataFrame)*100

In [None]:
# Replacing missing values: mode for categorical/discrete columns, median for the loan amount.
# Assigning the result back avoids chained `inplace` calls, which newer pandas versions warn about.
mode_columns = ['gender', 'married', 'dependents', 'self_employed', 'credit_history', 'loan_amount_term']
for column in mode_columns:
    dataFrame[column] = dataFrame[column].fillna(dataFrame[column].mode()[0])
dataFrame['loanamount'] = dataFrame['loanamount'].fillna(dataFrame['loanamount'].median())

In [None]:
# Percentage of missing values in each column after imputation
dataFrame.isnull().sum()/len(dataFrame)*100
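To make the imputation step easier to verify, here is a small self-contained sketch on made-up data (the `toy` frame and its values are assumptions for illustration). It fills categorical or discrete columns with the mode and the numeric loan amount with the median, then checks that no missing values remain.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({
    "gender": ["Male", None, "Male", "Female"],
    "credit_history": [1.0, 1.0, np.nan, 0.0],
    "loanamount": [100.0, np.nan, 140.0, 120.0],
})

# isnull().mean() is the fraction of missing entries per column
missing_pct_before = toy.isnull().mean() * 100

# Mode imputation for categorical / discrete columns
for col in ["gender", "credit_history"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Median imputation for the skew-prone numeric column
toy["loanamount"] = toy["loanamount"].fillna(toy["loanamount"].median())

missing_pct_after = toy.isnull().mean() * 100
print(missing_pct_before, missing_pct_after, sep="\n")
```

The median is often preferred over the mean for amounts because it is less sensitive to a few very large loans; whether that holds for this dataset would need to be checked against its distribution.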

### 7. Encode the categorical columns

In [None]:
## Label-encode the categorical variables "gender", "married", "education", "self_employed", "property_area" and "loan_status"

candidate_columns = ["gender", "married", "education", "self_employed", "property_area", "loan_status"]
# Keep only the candidates that are still stored as strings (object dtype)
categories = [column for column in candidate_columns if dataFrame[column].dtype == object]
categories

In [None]:
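labelEncoder = LabelEncoder()

def encoder(dataFrame):
    """Label-encode every column listed in `categories`, modifying the frame in place."""
    for column in categories:
        # The shared encoder is refit for each column, so only the last column's mapping is retained
        dataFrame[column] = labelEncoder.fit_transform(dataFrame[column].astype(str)).astype(int)

encoder(dataFrame)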

In [None]:
dataFrame.head()

In [None]:
dataFrame.tail(5)
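Because a single `LabelEncoder` is refit for every column, the label-to-integer mappings for earlier columns are lost, which can make it harder to encode user input consistently in the web application. One possible alternative, sketched here on made-up data, keeps one fitted encoder per column in a dictionary (the `encoders` name and the toy values are assumptions, not part of the original notebook).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "property_area": ["Urban", "Rural", "Semiurban"],
    "loan_status": ["Y", "N", "Y"],
})

encoders = {}  # column name -> fitted LabelEncoder
for col in ["gender", "property_area", "loan_status"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col].astype(str))
    encoders[col] = enc

# LabelEncoder assigns codes in sorted label order
gender_mapping = {str(label): code for code, label in enumerate(encoders["gender"].classes_)}

# Codes can be turned back into the original labels when needed
decoded_status = [str(v) for v in encoders["loan_status"].inverse_transform([1, 0])]

print(gender_mapping, decoded_status)
```

The dictionary could be pickled alongside the model so that the same mappings are applied at prediction time.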

### 8. Separate the target and independent features.

In [None]:
X = dataFrame.drop(['loan_status'],axis=1)
Y = dataFrame['loan_status']

In [None]:
X.head()

In [None]:
Y.head()

### 9. Split the data into train and test.

In [None]:
# Splitting the data into training and testing sets (70% / 30%)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

print(X_train.shape,X_test.shape)
print(Y_train.shape,Y_test.shape)
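If the approved and rejected classes turn out to be imbalanced, a plain random split can leave the two subsets with different class proportions. A hedged sketch of a stratified split follows; it uses the `stratify` parameter of `train_test_split` on made-up data (the `X_toy` and `y_toy` names and values are illustrative assumptions).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 applicants, 14 approved (1) and 6 rejected (0)
X_toy = pd.DataFrame({
    "credit_history": [1] * 14 + [0] * 6,
    "loanamount": list(range(100, 300, 10)),
})
y_toy = pd.Series([1] * 14 + [0] * 6, name="loan_status")

# stratify=y_toy keeps the class ratio roughly equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of approved loans in each subset
```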

### 10. Build a Random Forest Regressor model and check the r2-score for train and test.

In [None]:
# import libraries
from sklearn.ensemble import RandomForestRegressor

# Fit a forest of 100 trees on the training split
randomForestRegressor = RandomForestRegressor(n_estimators=100)
randomForestRegressor.fit(X_train, Y_train)

In [None]:
# make predictions and compute the r2-score on both splits
from sklearn.metrics import r2_score
y_train_pred = randomForestRegressor.predict(X_train)
y_test_pred = randomForestRegressor.predict(X_test)

r2_score_train = r2_score(Y_train, y_train_pred)
r2_score_test = r2_score(Y_test, y_test_pred)

print('Train - r2-score:', r2_score_train)
print('Test - r2-score:', r2_score_test)
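The project objective frames this as a classification task, so a classifier-based variant may be worth comparing against the regressor. The sketch below is an illustration under stated assumptions, not the notebook's method: it swaps in `RandomForestClassifier`, uses synthetic data from `make_classification` in place of the loan data, and reports accuracy and a confusion matrix with the `accuracy_score` and `confusion_matrix` functions imported in section 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the loan features (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Same ensemble size as the regressor; random_state fixed for reproducibility
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)

y_te_pred = clf.predict(X_te)          # hard 0/1 labels rather than continuous scores
train_acc = accuracy_score(y_tr, clf.predict(X_tr))
test_acc = accuracy_score(y_te, y_te_pred)
cm = confusion_matrix(y_te, y_te_pred)  # rows: true class, columns: predicted class

print('Train - accuracy:', train_acc)
print('Test - accuracy:', test_acc)
print(cm)
```

A large gap between train and test accuracy would suggest overfitting, in which case limiting tree depth (for example via `max_depth`) is one option to explore.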

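### 11. Create a pickle file with the .pkl extension

In [None]:
import pickle

# Save the trained model; the context manager closes the file handle
with open('modelFile.pkl', 'wb') as model_file:
    pickle.dump(randomForestRegressor, model_file)

# Reload the model to confirm the file can be read back
with open('modelFile.pkl', 'rb') as model_file:
    model = pickle.load(model_file)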
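A quick way to gain confidence in the saved file is to check that the reloaded model reproduces the original predictions. The sketch below does this with a small stand-in regressor trained on random data and a temporary directory (the `demo_model` name, the data, and the temporary path are assumptions for illustration). Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in data and model (hypothetical), small enough to train instantly
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modelFile.pkl")
    with open(path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give exactly the same predictions
same_predictions = bool(np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```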