# Loan Classification Problem
Loan approval prediction is a classification task: the lender reviews background information about an applicant and decides whether the bank should grant the loan. Factors such as credit score, loan amount, lifestyle, career, and assets drive the decision. If applicants with a similar profile have repaid their dues on time in the past, a new application is more likely to be approved as well.

<br>

## Table of Contents
- [ 1 - Packages ](#1)
- [ 2 - Understanding the data ](#2)
  - [ 2.1 Loading and visualizing the data ](#2.1)
  - [ 2.2 Understanding the dataset features ](#2.2)
- [ 3 - Data preprocessing ](#3)
  - [ 3.1 Data Cleaning ](#3.1)
  - [ 3.2 Feature transformation ](#3.2)
  - [ 3.3 Conclusion ](#3.3)
- [ 4 - Models Implementation ](#4)
  - [ 4.1 Gradient Boosting Classifier ](#4.1)
  - [ 4.2 Random Forest Classifier ](#4.2)
  - [ 4.3 Decision Tree Classifier ](#4.3)
  - [ 4.4 K-Neighbors Classifier ](#4.4)
  - [ 4.5 Linear Support Vector Classifier ](#4.5)
  - [ 4.6 XGB Classifier ](#4.6)
  - [ 4.7 Logistic Regression Classifier ](#4.7)
- [ 5 - Evaluating the Models ](#5)
  - [ 5.1 Gradient Boosting Classifier ](#5.1)
  - [ 5.2 Random Forest Classifier ](#5.2)
  - [ 5.3 Decision Tree Classifier ](#5.3)
  - [ 5.4 K-Neighbors Classifier ](#5.4)
  - [ 5.5 Linear Support Vector Classifier ](#5.5)
  - [ 5.6 XGB Classifier ](#5.6)
  - [ 5.7 Logistic Regression Classifier ](#5.7)
- [ 6 - Conclusion ](#6)


<a id="1"></a>
## 1 - Packages

First, let's import all the packages that we will need during this project:

- [numpy](https://numpy.org) is the fundamental package for scientific computing with Python.
- [pandas](https://pandas.pydata.org) is a popular package for data analysis and manipulation in Python.
- [matplotlib](https://matplotlib.org) is a popular library for plotting graphs in Python.
- [seaborn](https://seaborn.pydata.org) is a popular Python data visualization library based on matplotlib.
- [sklearn](https://scikit-learn.org) is a widely used, simple and efficient tool for predictive data analysis in Python.
- [xgboost](https://xgboost.readthedocs.io) is an optimized distributed gradient boosting library that implements machine learning algorithms in Python.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import MinMaxScaler

<a id="2"></a>
## 2 - Understanding the data

The data comes from *<a href="https://www.kaggle.com/datasets/burak3ergun/loan-data-set">Kaggle</a>* and describes the attributes of many applicants together with the status of their loans.
<br>
<a id="2.1"></a>
### 2.1 Loading and visualizing the data

In [None]:
#Read data
df = pd.read_csv("assets/data.csv")

#### View the features
Let's get more familiar with the dataset.

In [None]:
#preview data
df.head()

<a id="2.2"></a>
### 2.2 Understanding the dataset features
##### Displaying information about the features

In [None]:
#Preview data information
df.info()

##### The meaning of the features

| Variable Name     | Description                                                                        |
|-------------------|------------------------------------------------------------------------------------|
| Loan_ID           | Loan reference number(unique ID)                                                   |
| Gender            | Applicant gender(Male or Female)                                                   |
| Married           | Applicant marital status(Married or not married)                                   |
| Dependents        | Number of family members                                                           |
| Education         | Applicant education/qualification(graduate or not graduate)                        |
| Self_Employed     | Applicant employment status(yes for self-employed, no for employed/others)         |
| ApplicantIncome   | Applicant's monthly salary/income                                                  |
| CoapplicantIncome | Additional applicant's monthly salary/income                                       |
| LoanAmount        | Loan amount                                                                        |
| Loan_Amount_Term  | The loan's repayment period (in days)                                              |
| Credit_History    | Records of previous credit history(0: bad credit history, 1: good credit history)  |
| Property_Area     | The location of property(Rural/Semiurban/Urban)                                    |
| Loan_Status       | Status of loan(Y: accepted, N: not accepted)                                       |



<a id="3"></a>
## 3 - Data preprocessing
Data preprocessing transforms the raw data into a useful and efficient format through steps such as cleaning, transformation, and reduction.
<br>

<a id="3.1"></a>
### 3.1 Data Cleaning
First, let's check whether the data have missing values or not.

In [None]:
#Preview data information
df.info()

In [None]:
#Check missing values
df.isnull().sum()

- Therefore, the features `Gender`, `Married`, `Dependents`, `Self_Employed`, `LoanAmount`, `Loan_Amount_Term` and `Credit_History` have missing values.

- Let's fix them one by one.
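Before going feature by feature, the per-column checks can be condensed into a single overview. The following is a minimal sketch, not part of the original notebook: the helper name `missing_summary` and the toy frame are made up for illustration, and the same function could be called on `df`.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for the loan data (values are invented)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, None, None, 66.0],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

report = missing_summary(toy)
print(report)
```

Columns without missing values, such as `Education` in the toy frame, are left out of the report.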

#### Gender - Missing Value

In [None]:
# percent of missing "Gender"
print('Percent of missing "Gender" records is %.2f%%' %((df['Gender'].isnull().sum()/df.shape[0])*100))
# %.2f formats the value with two decimals; %% prints a literal percent sign.
print("Number of applicants grouped by gender:")
print(df['Gender'].value_counts())

##### Visualizing Gender

In [None]:
#visuals
df['Gender'].value_counts().plot.bar(rot=0)

In [None]:
sns.countplot(x='Gender', data=df, palette = 'Set2')

#### Married - Missing Value

In [None]:
# percent of missing "Married"
print('Percent of missing "Married" records is %.2f%%' %((df['Married'].isnull().sum()/df.shape[0])*100))
print("Number of applicants grouped by marital status:")
print(df['Married'].value_counts())

##### Visualizing Married

In [None]:
#visuals
df['Married'].value_counts().plot.bar(rot=0)

In [None]:
sns.countplot(x='Married', data=df, palette = 'Set2')

#### Dependents - Missing Value

In [None]:
# percent of missing "Dependents"
print('Percent of missing "Dependents" records is %.2f%%' %((df['Dependents'].isnull().sum()/df.shape[0])*100))
print("Number of applicants grouped by dependents:")
print(df['Dependents'].value_counts())

##### Visualizing Dependents

In [None]:
#visuals
df['Dependents'].value_counts().plot.bar(rot=0)

In [None]:
sns.countplot(x='Dependents', data=df, palette = 'Set2')

#### Education - Missing Value

In [None]:
# percent of missing "Education"
print('Percent of missing "Education" records is %.2f%%' %((df['Education'].isnull().sum()/df.shape[0])*100))
print("Number of applicants grouped by education:")
print(df['Education'].value_counts())

##### Visualizing Education

In [None]:
#visuals
df['Education'].value_counts().plot.bar(rot=0)

In [None]:
sns.countplot(x='Education', data=df, palette = 'Set2')

#### Self Employed - Missing Value

In [None]:
# percent of missing "Self_Employed"
print('Percent of missing "Self_Employed" records is %.2f%%' %((df['Self_Employed'].isnull().sum()/df.shape[0])*100))
print("Number of applicants grouped by self-employment status:")
print(df['Self_Employed'].value_counts())

##### Visualizing Self Employed

In [None]:
#visuals
df['Self_Employed'].value_counts().plot.bar(rot=0)

In [None]:
sns.countplot(x='Self_Employed', data=df, palette = 'Set2')

#### Loan Amount - Missing Value

In [None]:
# percent of missing "LoanAmount"
print('Percent of missing "LoanAmount" records is %.2f%%' %((df['LoanAmount'].isnull().sum()/df.shape[0])*100))

##### Visualizing Loan Amount

In [None]:
#visuals
ax = df["LoanAmount"].hist(density=True, stacked=True, color='teal', alpha=0.6)
df["LoanAmount"].plot(kind='density', color='teal')
ax.set(xlabel='Loan Amount')
plt.show()

##### Loan Amount is skewed and has outliers

In [None]:
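# Print each statistic explicitly: a cell only displays its last expression
print('Median:', df['LoanAmount'].median())
print('Mode:', df['LoanAmount'].mode()[0])
print('Mean:', df['LoanAmount'].mean())
# Draw the two plots on separate figures so they do not overlap
sns.boxplot(y='LoanAmount', data=df)
plt.show()
sns.histplot(data=df, x='LoanAmount')
plt.show()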

#### Loan Amount Term - Missing Value

In [None]:
# percent of missing "Loan_Amount_Term"
print('Percent of missing "Loan_Amount_Term" records is %.2f%%' %((df['Loan_Amount_Term'].isnull().sum()/df.shape[0])*100))
print("Number of applicants grouped by loan amount term:")
print(df['Loan_Amount_Term'].value_counts())

##### Visualizing Loan Amount Term

In [None]:
#visuals
sns.countplot(x='Loan_Amount_Term', data=df, palette = 'Set2')

In [None]:
sns.histplot(data=df, x='Loan_Amount_Term')

#### Credit History - Missing Value

In [None]:
# percent of missing "Credit_History"
print('Percent of missing "Credit_History" records is %.2f%%' %((df['Credit_History'].isnull().sum()/df.shape[0])*100))
print("Number of applicants grouped by credit history:")
print(df['Credit_History'].value_counts())

##### Visualizing Credit History

In [None]:
sns.countplot(x='Credit_History', data=df, palette = 'Set2')

#### Conclusion
*Based on the previous visualizations, we fill the missing values of each feature as follows*:
- `Gender`: Male (mode).
- `Married`: Yes (mode).
- `Dependents`: 0 (mode).
- `Self_Employed`: No (mode).
- `LoanAmount`: the median of the data (the feature is numeric and skewed with outliers, so the mode is not meaningful).
- `Loan_Amount_Term`: 360 (mode).
- `Credit_History`: 1.0 (mode).

In [None]:
train_data = df.copy()
# Assign the result back instead of calling fillna(inplace=True) on a single column;
# newer pandas versions warn about that pattern and may not update train_data.
train_data['Gender'] = train_data['Gender'].fillna(train_data['Gender'].mode()[0])
train_data['Married'] = train_data['Married'].fillna(train_data['Married'].mode()[0])
train_data['Dependents'] = train_data['Dependents'].fillna(train_data['Dependents'].mode()[0])
train_data['Self_Employed'] = train_data['Self_Employed'].fillna(train_data['Self_Employed'].mode()[0])
train_data['LoanAmount'] = train_data['LoanAmount'].fillna(train_data['LoanAmount'].median())
train_data['Loan_Amount_Term'] = train_data['Loan_Amount_Term'].fillna(train_data['Loan_Amount_Term'].mode()[0])
train_data['Credit_History'] = train_data['Credit_History'].fillna(train_data['Credit_History'].mode()[0])
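The same mode/median rule can also be written as one dictionary of fill values passed to `DataFrame.fillna`. This is a minimal sketch on an invented toy frame, not the notebook's own code; the variable names `toy`, `fill_values` and `filled` are placeholders.

```python
import pandas as pd

# Toy frame with the same kinds of gaps as the loan data (values are invented)
toy = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female"],
    "LoanAmount": [100.0, None, 200.0, 600.0],
})

# One fill value per column: mode for the categorical column, median for the numeric one
fill_values = {
    "Gender": toy["Gender"].mode()[0],
    "LoanAmount": toy["LoanAmount"].median(),
}

# fillna with a dict fills each column with its own value and returns a new frame
filled = toy.fillna(value=fill_values)
print(filled)
```

Keeping the fill values in a dictionary makes the chosen imputation visible in one place and leaves the original frame untouched.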

In [None]:
#Check missing values
train_data.isnull().sum()
# train_data

<strong>The data looks fine: there are no more missing values, so we can move on to the next step.</strong>

<a id="3.2"></a>
### 3.2 Feature transformation
Here, we transform the data into a form suitable for the mining process by encoding the categorical features and normalizing the value ranges.
<br>
#### Convert categorical (object) columns to numeric (int64)
First, define the mapping from each category label to an integer code.

In [None]:
gender_stat = {"Female": 1, "Male": 2}
yes_no_stat = {'No': 1, 'Yes': 2}
dependents_stat = {'0': 0, '1': 1, '2': 2, '3+': 3}
education_stat = {'Not Graduate': 1, 'Graduate': 2}
property_stat = {'Semiurban': 0, 'Urban': 1, 'Rural': 2}

Now replace the categorical values with their integer codes.

In [None]:
train_data['Gender'] = train_data['Gender'].replace(gender_stat)
train_data['Married'] = train_data['Married'].replace(yes_no_stat)
train_data['Dependents'] = train_data['Dependents'].replace(dependents_stat)
train_data['Education'] = train_data['Education'].replace(education_stat)
train_data['Self_Employed'] = train_data['Self_Employed'].replace(yes_no_stat)
train_data['Property_Area'] = train_data['Property_Area'].replace(property_stat)

Let's preview the data once more.

In [None]:
train_data.head()

#### Feature Scaling
The features are on very different scales, as the following example shows.

In [None]:
sns.countplot(x='Loan_Amount_Term',data=df,palette='Set3')
# Loan_Amount_Term is numerical data that does not follow a normal distribution

##### Min-Max Normalization
For the numeric features:

In [None]:
# Min-max scaler for the numeric columns (positions 6 to 9)
min_max_scaler = MinMaxScaler()
num_cols = train_data.columns[6:10]
# Assign whole columns so that integer columns can become float without dtype warnings
train_data[num_cols] = min_max_scaler.fit_transform(train_data[num_cols])
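Min-max normalization rescales every feature to the range [0, 1] with x' = (x - min) / (max - min), which is what `MinMaxScaler` does with its default settings. The following is a minimal NumPy sketch of that formula; the helper name `min_max_scale` and the numbers are invented for illustration.

```python
import numpy as np

def min_max_scale(values) -> np.ndarray:
    """Rescale each column of a 2-D array to [0, 1] via (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero
    col_range = np.where(col_range == 0, 1.0, col_range)
    return (values - col_min) / col_range

# Two toy columns on very different scales: an income and a loan term
toy = np.array([
    [2000.0, 120.0],
    [4000.0, 360.0],
    [10000.0, 240.0],
])

scaled = min_max_scale(toy)
print(scaled)
```

After scaling, both columns span exactly 0 to 1, so neither dominates distance-based models such as K-Neighbors purely because of its unit.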

Let's preview the data.

In [None]:
train_data.head()

For the `Property_Area` and `Dependents` features:

In [None]:
# Scale the integer-encoded Dependents and Property_Area columns to [0, 1]
prop_depend_cols = ['Dependents', 'Property_Area']
prop_depend_scaler = MinMaxScaler()
# Assign whole columns so the integer codes can become float without dtype warnings
train_data[prop_depend_cols] = prop_depend_scaler.fit_transform(train_data[prop_depend_cols])

Let's preview the data again.

In [None]:
train_data.head()

<a id="3.3"></a>
### 3.3 Conclusion
Let's look at the data statistics before preprocessing.

In [None]:
#Preview data information
df.info()
df.isnull().sum()

Now, after data preprocessing

In [None]:
train_data.info()
train_data.isnull().sum()

In [None]:
train_data.describe()

<a id="4"></a>
## 4 - Models Implementation
First, we split the data into a training set and a test set.

In [None]:
#split data
x = train_data.iloc[:,1:12]
y = train_data.iloc[:,12]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=1234)
expected_y = y_test
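When the two classes are not equally frequent, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below illustrates that option on invented toy data; it is a possible variation, not what the notebook does, and the names `X_toy` and `y_toy` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 75% labelled 'Y' and 25% labelled 'N' (invented values)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["Y"] * 15 + ["N"] * 5)

# stratify=y_toy preserves the 75/25 class ratio in both the training and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1234, stratify=y_toy
)

print("train:", dict(zip(*np.unique(y_tr, return_counts=True))))
print("test: ", dict(zip(*np.unique(y_te, return_counts=True))))
```

With a small test set, stratification avoids a split where the minority class is almost absent from the evaluation data.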

<a id="4.1"></a>
### 4.1 Gradient Boosting Classifier

In [None]:
#GradientBoostingClassifier
GBC = GradientBoostingClassifier()
GBC.fit(X_train, y_train)
GBC_predicted_y = GBC.predict(X_test)

<a id="4.2"></a>
### 4.2 Random Forest Classifier

In [None]:
#RandomForestClassifier
RFC = RandomForestClassifier(n_estimators=10)
RFC.fit(X_train, y_train)
RFC_predicted_y = RFC.predict(X_test)

<a id="4.3"></a>
### 4.3 Decision Tree Classifier

In [None]:
#DecisionTreeClassifier
DTC = DecisionTreeClassifier()
DTC.fit(X_train, y_train)
DTC_predicted_y = DTC.predict(X_test)

<a id="4.4"></a>
### 4.4 K-Neighbors Classifier

In [None]:
#KNeighborsClassifier
KNN = KNeighborsClassifier()
KNN.fit(X_train, y_train)
KNN_predicted_y = KNN.predict(X_test)

<a id="4.5"></a>
### 4.5 Linear Support Vector Classifier

In [None]:
#LinearSVC
SVM = svm.LinearSVC(max_iter=5000)
SVM.fit(X_train, y_train)
SVM_predicted_y = SVM.predict(X_test)

<a id="4.6"></a>
### 4.6 XGB Classifier

In [None]:
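#XGBClassifier
# xgboost expects class labels encoded as 0..n-1, so encode N/Y as 0/1 for fitting
XGBC = xgb.XGBClassifier()
XGBC.fit(X_train, (y_train == 'Y').astype(int))
# Map the predictions back to N/Y so they can be compared with y_test
XGBC_predicted_y = np.where(XGBC.predict(X_test) == 1, 'Y', 'N')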

<a id="4.7"></a>
### 4.7 Logistic Regression Classifier

In [None]:
#LogisticRegression
LRC = LogisticRegression()
LRC.fit(X_train, y_train)
LRC_predicted_y = LRC.predict(X_test)

<a id="5"></a>
## 5 - Evaluating the Models

Let's prepare a list to store all the scores for the conclusion later.

In [None]:
scores = []
classifier = ('Gradient Boosting' , 'Random Forest' ,'Decision Tree' , 'K-Nearest Neighbor' , 'SVM' ,'XGBoost','LogisticRegression')
y_pos = np.arange(len(classifier))

<a id="5.1"></a>
### 5.1 Gradient Boosting Classifier

##### Accuracy

In [None]:
GBC_accuracy_score = accuracy_score(expected_y, GBC_predicted_y)*100
scores.append(GBC_accuracy_score)
print('The accuracy of GBC classification is %.2f%%' % GBC_accuracy_score)

##### F1-Score

In [None]:
print('The F1 Score of GBC classification is %.3f' %(f1_score(expected_y, GBC_predicted_y, average='micro')))
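For single-label classification, the micro-averaged F1 score is always equal to the accuracy, so `average='micro'` repeats the number from the previous cell. The sketch below, on invented labels, shows this and contrasts it with the F1 score of the approved class alone, obtained through the `pos_label` parameter of `f1_score`; it is an illustration rather than a change to the evaluation used here.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented ground truth and predictions using the same Y/N labels as Loan_Status
y_true = ["Y", "Y", "Y", "N", "N", "Y", "N", "Y"]
y_pred = ["Y", "Y", "N", "N", "Y", "Y", "Y", "Y"]

acc = accuracy_score(y_true, y_pred)
# Micro-averaging pools all decisions, which reduces to accuracy here
f1_micro = f1_score(y_true, y_pred, average="micro")
# Binary F1 for the 'Y' class only (default average='binary')
f1_approved = f1_score(y_true, y_pred, pos_label="Y")

print("accuracy:    %.3f" % acc)
print("micro F1:    %.3f" % f1_micro)
print("F1 for 'Y':  %.3f" % f1_approved)
```

The class-specific score carries information that accuracy does not, which can matter when approvals and rejections are imbalanced.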

##### Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, GBC_predicted_y)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['N', 'Y'])
disp.plot(cmap=plt.cm.Blues)
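A confusion matrix in scikit-learn's layout has the true classes as rows and the predicted classes as columns, so with the label order `['N', 'Y']` the four cells are true negatives, false positives, false negatives and true positives. The sketch below derives precision, recall and specificity from such a matrix; the counts are invented and do not come from any model in this notebook.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix, rows = true class, columns = predicted class,
# both in the order ['N', 'Y'] (the counts are made up)
cm_toy = np.array([
    [20, 18],   # true N: 20 predicted N, 18 predicted Y
    [3, 82],    # true Y:  3 predicted N, 82 predicted Y
])

# ravel() flattens row by row: tn, fp, fn, tp
tn, fp, fn, tp = cm_toy.ravel()

precision = tp / (tp + fp)      # share of predicted approvals that were correct
recall = tp / (tp + fn)         # share of actual approvals that were found
specificity = tn / (tn + fp)    # share of actual rejections that were found

print("precision:   %.3f" % precision)
print("recall:      %.3f" % recall)
print("specificity: %.3f" % specificity)
```

A low specificity next to a high recall indicates a model that approves most applications, including many that should have been rejected.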

<a id="5.2"></a>
### 5.2 Random Forest Classifier

##### Accuracy

In [None]:
RFC_accuracy_score = accuracy_score(expected_y, RFC_predicted_y)*100
scores.append(RFC_accuracy_score)
print('The accuracy of RFC classification is %.2f%%' % RFC_accuracy_score)

##### F1-Score

In [None]:
print('The F1 Score of RFC classification is %.3f' %(f1_score(expected_y, RFC_predicted_y, average='micro')))

##### Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, RFC_predicted_y)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['N', 'Y'])
disp.plot(cmap=plt.cm.Blues)

<a id="5.3"></a>
### 5.3 Decision Tree Classifier

##### Accuracy

In [None]:
DTC_accuracy_score = accuracy_score(expected_y, DTC_predicted_y)*100
scores.append(DTC_accuracy_score)
print('The accuracy of DTC classification is %.2f%%' % DTC_accuracy_score)

##### F1-Score

In [None]:
print('The F1 Score of DTC classification is %.3f' %(f1_score(expected_y, DTC_predicted_y, average='micro')))

##### Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, DTC_predicted_y)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['N', 'Y'])
disp.plot(cmap=plt.cm.Blues)

<a id="5.4"></a>
### 5.4 K-Neighbors Classifier

##### Accuracy

In [None]:
KNN_accuracy_score = accuracy_score(expected_y, KNN_predicted_y)*100
scores.append(KNN_accuracy_score)
print('The accuracy of KNN classification is %.2f%%' % KNN_accuracy_score)

##### F1-Score

In [None]:
print('The F1 Score of KNN classification is %.3f' %(f1_score(expected_y, KNN_predicted_y, average='micro')))

##### Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, KNN_predicted_y)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['N', 'Y'])
disp.plot(cmap=plt.cm.Blues)

<a id="5.5"></a>
### 5.5 Linear Support Vector Classifier

##### Accuracy

In [None]:
SVM_accuracy_score = accuracy_score(expected_y, SVM_predicted_y)*100
scores.append(SVM_accuracy_score)
print('The accuracy of SVM classification is %.2f%%' % SVM_accuracy_score)

##### F1-Score

In [None]:
print('The F1 Score of SVM classification is %.3f' %(f1_score(expected_y, SVM_predicted_y, average='micro')))

##### Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, SVM_predicted_y)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['N', 'Y'])
disp.plot(cmap=plt.cm.Blues)

<a id="5.6"></a>
### 5.6 XGB Classifier

##### Accuracy

In [None]:
XGB_accuracy_score = accuracy_score(expected_y, XGBC_predicted_y)*100
scores.append(XGB_accuracy_score)
print('The accuracy of XGBC classification is %.2f%%' % XGB_accuracy_score)

##### F1-Score

In [None]:
print('The F1 Score of XGBC classification is %.3f' %(f1_score(expected_y, XGBC_predicted_y, average='micro')))

##### Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, XGBC_predicted_y)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['N', 'Y'])
disp.plot(cmap=plt.cm.Blues)

<a id="5.7"></a>
### 5.7 Logistic Regression Classifier

##### Accuracy

In [None]:
LR_accuracy_score = accuracy_score(expected_y, LRC_predicted_y)*100
scores.append(LR_accuracy_score)
print('The accuracy of LRC classification is %.2f%%' % LR_accuracy_score)

##### F1-Score

In [None]:
print('The F1 Score of LRC classification is %.3f' %(f1_score(expected_y, LRC_predicted_y, average='micro')))

##### Confusion Matrix

In [None]:
cm = confusion_matrix(y_test, LRC_predicted_y)
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['N', 'Y'])
disp.plot(cmap=plt.cm.Blues)

<a id="6"></a>
## 6 - Conclusion

Let's compare the accuracy of all the previous classification models.

In [None]:
plt.barh(y_pos, scores, align='center', alpha=0.5)
plt.yticks(y_pos, classifier)
plt.xlabel('Accuracy (%)')
plt.title('Classification Performance')
plt.show()

##### Result: the *Gradient Boosting Classifier* achieves the highest accuracy among the classification algorithms compared.