# Programming for Data Science and Artificial Intelligence

## Case Study - Loan Prediction

In this workshop we will be working on a dataset called the Loan Prediction dataset.

This dataset contains data on loan applications and whether each loan was approved or not.

We are given 2 sets: the training set and the test set.

The training set contains 614 samples and 13 features, 12 of which are the independent variables and the last feature `Loan_Status` is the dependent variable.

The test set contains 367 samples with the same 12 features but without the `Loan_Status` column, so it represents the unseen data that we will apply our model to.

Our goal is to analyze the data to understand the problem and create a classification model for predicting `Loan_Status`.

The first thing to do is to clean the data by filling in missing values and converting categorical data to numeric values. We will use the Python libraries pandas and sklearn to help with the data cleaning and preparation.

Our tasks will be divided into 2 parts:

1. Exploratory Data Analysis
    * Load and view the Dataset
    * Are there any null values? How will you wrangle/handle them?
    * Are there any outliers? How will you wrangle/handle them?
    * Do you notice any patterns or anomalies in the data? Can you plot them?
2. Statistical Analysis
    * Training a machine learning model for Loan prediction

# 1) Exploratory Data Analysis

* Load and view the dataset
* Are there any null values or outliers? How will you wrangle/handle them?
* Do you notice any patterns or anomalies in the data? Can you plot them?

## 1.1) Load and view the Dataset

In [1]:
# Import Pandas
import pandas as pd


# Import the 2 datasets from these paths
# 'data/train_LoanPrediction.csv'
# 'data/test_LoanPrediction.csv'
train_data = pd.read_csv('data/train_LoanPrediction.csv')
test_data  = pd.read_csv('data/test_LoanPrediction.csv')


In [None]:
# Check the shape of Training and Test set
print('Training data shape : ', train_data.shape)
print('Test data shape     : ', test_data.shape)

In [None]:
# Let's see the "head" of the training set
train_data.head()

In [None]:
# Let's see the "info" of the training set
# Notice that this will tell us the non-null counts and the dtypes of each column
train_data.info()

## 1.2) Are there any null values? How will you handle them?

In [None]:
# Let's check for missing values in the Training and Test set again with isnull()
print('Missing values in Train data : \n', train_data.isnull().sum() )
print("="*30)
print('Missing values in Test data : \n', test_data.isnull().sum() )

### Filling categorical values


Let's first focus on the `Married` column. We can see from the output above that `Married` has 3 missing values in the training set and 0 missing values in the test set, so we only need to fill the training set.

We will compute the distribution of the categories in the training set, then fill in the missing values in approximately the same ratio with fillna().

Here are the steps:

   * Compute ratio of each category value in the training set
   * Divide the missing values based on ratio
   * Fill in the missing values according to the ratio
   * Don't forget to print the values before and after filling the missing values for confirmation
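
The steps above can be sketched on a small made-up column first. The values in `toy_married` are hypothetical and only serve to make the arithmetic easy to follow. As a small variation, this sketch gives the remainder of the missing entries to the second category instead of rounding both shares, so the two limits always add up to the number of missing values.

```python
import pandas as pd

# Hypothetical column: 5 x 'Yes', 2 x 'No', 3 missing
toy_married = pd.Series(['Yes', 'Yes', 'No', 'Yes', None, 'No', 'Yes', None, None, 'Yes'])

counts = toy_married.value_counts()
n_missing = int(toy_married.isnull().sum())

# Share of 'Yes' among the non-missing values
ratio_yes = counts['Yes'] / counts.sum()

# Number of missing entries assigned to each category
n_yes = int(round(ratio_yes * n_missing))
n_no = n_missing - n_yes

# Note: fillna() rejects limit=0, so a category with no share is skipped
filled = toy_married
if n_yes > 0:
    filled = filled.fillna('Yes', limit=n_yes)
if n_no > 0:
    filled = filled.fillna('No', limit=n_no)

print('Before :', counts.to_dict())
print('After  :', filled.value_counts().to_dict())
```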

In [None]:
# Count the values in the 'Married' column.
print(train_data['Married'].value_counts())

In [None]:
# Compute ratio of each category value in the training set
married = train_data['Married'].value_counts()

# Index by label so the result does not depend on the order of value_counts()
ratio_married = married['Yes'] / married.sum()
ratio_not_married = married['No'] / married.sum()

print('Number of unique elements in Married variable : ', married.shape)
print('Ratio of Married to all     : ', ratio_married)
print('Ratio of Not Married to all : ', ratio_not_married)

In [None]:
# Divide the missing values based on ratio
yes_num_train = round(ratio_married * train_data['Married'].isnull().sum())
no_num_train  = round(ratio_not_married * train_data['Married'].isnull().sum())

In [None]:
# Fill in the missing values according to the ratio
# Hint : use the parameter called 'limit' in fillna()
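# Assign the result back: fillna(inplace=True) on a selected column
# may not update the DataFrame in recent pandas versions
train_data['Married'] = train_data['Married'].fillna('Yes', limit = yes_num_train)
train_data['Married'] = train_data['Married'].fillna('No', limit = no_num_train)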

# Check if all missing data were filled
print(train_data['Married'].value_counts()) 
print('Missing values in Train data : \n', train_data.isnull().sum() )
print("="*30)
print('Missing values in Test data : \n', test_data.isnull().sum() )

Now the number of missing values in the `Married` attribute is 0. We have successfully filled the `Married` column!

But we still have to fill the following categorical columns :
        
        - Gender
        - Dependents
        - Self_Employed
        - Loan_Amount_Term
        - Credit_History
        
For some of them we need to fill in both the training and test set!

So let's write a function that can calculate the ratio from the training data and fill the missing values accordingly.
* Notice that we will use the distribution from the TRAINING set to fill in both the training and test set!

In [None]:
# Your function here
# Fills missing values in `column_name` using the category ratios of the training set.
# Pass fill_data_2=None if there is only one DataFrame to fill.
def fill_data_with_ratio(train_data, fill_data_1, fill_data_2, column_name):
    
    # The category counts always come from the training set
    count_column = train_data[column_name].value_counts()
    value_list = list(count_column.index)
    
    limits_1 = []
    limits_2 = []
    
    for value in value_list:
        ratio = count_column[value] / count_column.sum()
        limits_1.append(int(round(ratio * fill_data_1[column_name].isnull().sum())))
        if fill_data_2 is not None:
            limits_2.append(int(round(ratio * fill_data_2[column_name].isnull().sum())))
    
    for id_, limit in enumerate(limits_1):
        # fillna() rejects limit=0, so use at least 1
        if limit == 0 :
            limit = 1
        fill_data_1[column_name] = fill_data_1[column_name].fillna(value_list[id_], limit = limit)
        
    if fill_data_2 is not None:
        for id_, limit in enumerate(limits_2):
            if limit == 0 :
                limit = 1
            fill_data_2[column_name] = fill_data_2[column_name].fillna(value_list[id_], limit = limit)

In [None]:
# Let's use our function with the 'Gender' column !
# This column has missing values in both the training and test set.

# Count the values of 'Gender' Before filling
print("========== BEFORE ==========")
print(train_data['Gender'].value_counts())
print(test_data['Gender'].value_counts())

fill_data_with_ratio(train_data, train_data, test_data, 'Gender')

# Count the values of 'Gender' after filling
print("========== AFTER ==========")
print(train_data['Gender'].value_counts())
print(test_data['Gender'].value_counts())

#### Repeat this step for all categorical columns

In [None]:
# Let's use our function to fill the 'Dependents' column !
# This column has missing values in both the training and test set.

# Count the values of 'Dependents' Before filling
print("========== BEFORE ==========")
print(train_data['Dependents'].value_counts())
print(test_data['Dependents'].value_counts())

# Use your function here
fill_data_with_ratio(train_data, train_data, test_data, 'Dependents')

# Count the values of 'Dependents' after filling
print("========== AFTER ==========")
print(train_data['Dependents'].value_counts())
print(test_data['Dependents'].value_counts())

In [None]:
# Notice that we have value 0, 1, 2 ,3+
# Let's convert category value "3+" to "4"
# so that we can convert them to int and it will be easier for the model later

# 'replace' value '3+' with '4' and assign the result back to the column
train_data['Dependents'] = train_data['Dependents'].replace('3+', '4')
test_data['Dependents'] = test_data['Dependents'].replace('3+', '4')

# Notice that the values are still of type string, we will fix this later!

In [None]:
# Let's use our function with the 'Self_Employed' column !
# This column also has missing values in both the training and test set.

# Count the values of 'Self_Employed' Before filling
print("========== BEFORE ==========")
print(train_data['Self_Employed'].value_counts())
print(test_data['Self_Employed'].value_counts())

fill_data_with_ratio(train_data, train_data, test_data, 'Self_Employed')

# Count the values of 'Self_Employed' after filling
print("========== AFTER ==========")
print(train_data['Self_Employed'].value_counts())
print(test_data['Self_Employed'].value_counts())

In [None]:
# Let's use our function with the 'Loan_Amount_Term' column !
# This column also has missing values in both the training and test set.

# Count the values of 'Loan_Amount_Term' Before filling
print("========== BEFORE ==========")
print(train_data['Loan_Amount_Term'].value_counts())
print(test_data['Loan_Amount_Term'].value_counts())

fill_data_with_ratio(train_data, train_data, test_data, 'Loan_Amount_Term')

# Count the values of 'Loan_Amount_Term' after filling
print("========== AFTER ==========")
print(train_data['Loan_Amount_Term'].value_counts())
print(test_data['Loan_Amount_Term'].value_counts())

In [None]:
# Let's use our function with the 'Credit_History' column !
# This column also has missing values in both the training and test set.

# Count the values of 'Credit_History' Before filling
print("========== BEFORE ==========")
print(train_data['Credit_History'].value_counts())
print(test_data['Credit_History'].value_counts())

fill_data_with_ratio(train_data, train_data, test_data, 'Credit_History')

# Count the values of 'Credit_History' after filling
print("========== AFTER ==========")
print(train_data['Credit_History'].value_counts())
print(test_data['Credit_History'].value_counts())

### Filling Numerical Values
Finally, `LoanAmount` still has some missing values.
This column contains numeric values.
We should check the distribution of the data before deciding how to fill them.

In [None]:
# plot a histogram to see the data distribution of 'LoanAmount'

import matplotlib.pyplot as plt
import seaborn as sns

# displot creates its own figure, so set the size with height and aspect
sns.displot(train_data['LoanAmount'], height=4, aspect=2)
plt.title('LoanAmount distribution', size=16)
plt.ylabel('count');

In [None]:
# Also try plotting a box plot
train_data['LoanAmount'].plot(kind='box', figsize=(3,4), patch_artist=True)

There are some outliers, so we will use the median of the training set to fill the missing values. Unlike the mean, the median is barely affected by extreme values.
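
To see why the median is the safer choice, here is a minimal sketch on made-up numbers. The values in `toy_amount` are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical loan amounts with one extreme value and two missing entries
toy_amount = pd.Series([100.0, 120.0, 130.0, 150.0, 700.0, None, None])

toy_mean = toy_amount.mean()      # pulled upwards by the single value of 700
toy_median = toy_amount.median()  # stays close to the bulk of the data

print('mean   :', toy_mean)
print('median :', toy_median)

# Fill the missing entries with the median
toy_filled = toy_amount.fillna(toy_median)
print(toy_filled.tolist())
```

Here the mean lands well above four of the five observed values, while the median stays among them.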

In [None]:
# Use the median of the Training set to fill both the training and test set.
loan_amount_median = train_data['LoanAmount'].median()
train_data['LoanAmount'] = train_data['LoanAmount'].fillna(loan_amount_median)
test_data['LoanAmount'] = test_data['LoanAmount'].fillna(loan_amount_median)

In [None]:
# Check null values in every column again; there should be no missing data left now
print('Missing values in Train data : \n', train_data.isnull().sum() )
print("="*30)
print('Missing values in Test data : \n', test_data.isnull().sum() )

## 1.3) Are there any outliers? How will you handle them?

In [None]:
# Check the outliers of all numerical columns 

# select columns to plot
import numpy as np
df_to_plot = train_data.drop(columns=['Loan_ID','Loan_Amount_Term','Credit_History']).select_dtypes(include=np.number)

# make a subplot out of df_to_plot and plot the box plots
df_to_plot.plot(subplots=True, layout=(4,4), kind='box', figsize=(12,14), patch_artist=True)
plt.subplots_adjust(wspace=0.5);

#### Multiple features contain outliers, but nothing indicates data entry errors.
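
If you want a number to go with the box plots, one option is to count the points that fall outside the whiskers. The sketch below assumes the common 1.5 × IQR rule, which is also the default whisker length of matplotlib box plots. The helper `count_iqr_outliers` and the values in `toy_income` are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Hypothetical incomes with one very large value
toy_income = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 4500, 81000])

n_outliers = count_iqr_outliers(toy_income)
print('Number of outliers :', n_outliers)
```

On the real data this could be applied column by column, for example `count_iqr_outliers(train_data['ApplicantIncome'])`.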

## 1.4) Do you notice any patterns or anomalies in the data? Can you plot them?

### Distribution Plots for categorical data
Let's draw distribution plots to see how many applicants fall into each category.

In [None]:
# import seaborn
import seaborn as sns
sns.set_theme(style="whitegrid")

# plot a distribution plot of 'Gender'
sns.displot(train_data['Gender'])
plt.title('Gender', size=16)
plt.ylabel('count');

In [None]:
# plot a distribution plot of 'Married'
sns.displot(train_data['Married'])
plt.title('Married', size=16)
plt.ylabel('count')

In [None]:
# plot a distribution plot of 'Dependents'
sns.displot(train_data['Dependents'])
plt.title('Dependents', size=16)
plt.ylabel('count')

In [None]:
# plot a distribution plot of 'Education'
sns.displot(train_data['Education'])
plt.title('Education', size=16)
plt.ylabel('count')

In [None]:
# plot a distribution plot of 'Self_Employed'
sns.displot(train_data['Self_Employed'])
plt.title('Self_Employed', size=16)
plt.ylabel('count')

In [None]:
# plot a distribution plot of 'Credit_History'
sns.displot(train_data['Credit_History'])
plt.title('Credit_History', size=16)
plt.ylabel('count')

In [None]:
# plot a distribution plot of 'Property_Area'
sns.displot(train_data['Property_Area'])
plt.title('Property_Area', size=16)
plt.ylabel('count')

### Distribution Plots for Numerical Data

In [None]:
# plot a distribution plot of 'ApplicantIncome'
sns.displot(train_data[train_data['ApplicantIncome']<20000]['ApplicantIncome'])
plt.title('ApplicantIncome distribution', size=16)
plt.ylabel('count');

In [None]:
# plot a distribution plot of 'CoapplicantIncome'
sns.displot(train_data[train_data['CoapplicantIncome']<10000]['CoapplicantIncome'])
plt.title('CoapplicantIncome distribution', size=16)
plt.ylabel('count');

In [None]:
# plot a distribution plot of 'LoanAmount'
sns.displot(train_data[train_data['LoanAmount']<10000]['LoanAmount'])
plt.title('LoanAmount distribution', size=16)
plt.ylabel('count');

### Regression Model between 2 features

We can also plot the data together with a fitted linear regression line for `ApplicantIncome` and `LoanAmount`.

In [None]:
sns.lmplot(x='ApplicantIncome', y='LoanAmount', data=train_data[train_data['ApplicantIncome'] < 10000]);

In [None]:
# Let's do the same for `CoapplicantIncome` and `LoanAmount`
sns.lmplot(x='CoapplicantIncome', y='LoanAmount', data=train_data[train_data['CoapplicantIncome'] < 6000]);

### Box Plots

Let's plot some box plots to see the relationships between some of the attributes with `LoanAmount`

In [None]:
# Make a boxplot with x as 'Dependents' and y as 'LoanAmount'
plt.figure(figsize=(4,4))
sns.boxplot(x='Dependents', y='LoanAmount', data=train_data);

In [None]:
# Make a boxplot with x as 'Education' and y as 'LoanAmount'
plt.figure(figsize=(4,4))
sns.boxplot(x='Education', y='LoanAmount', data=train_data);

In [None]:
# Make a boxplot with x as 'Property_Area' and y as 'LoanAmount'
plt.figure(figsize=(4,4))
sns.boxplot(x='Property_Area', y='LoanAmount', data=train_data);

In [None]:
# Make a boxplot with x as 'Married' and y as 'LoanAmount'
plt.figure(figsize=(4,4))
sns.boxplot(x='Married', y='LoanAmount', data=train_data);

In [None]:
# Make a boxplot with x as 'Self_Employed' and y as 'LoanAmount'
plt.figure(figsize=(4,4))
sns.boxplot(x='Self_Employed', y='LoanAmount', data=train_data);

# 2) Statistical Analysis
In this workshop we will only do one type of statistical analysis
   * Training a machine learning model for Loan prediction

We will train the model to decide if we should approve or reject the loan.

The independent variables or X will be:
    
    - Gender
    - Married
    - Dependents
    - Education
    - Self_Employed
    - ApplicantIncome
    - CoapplicantIncome
    - LoanAmount
    - Loan_Amount_Term
    - Credit_History
    - Property_Area
    
These will be used to predict the dependent variable or y, which is `Loan_Status`.

### Preparing Data for the model

First, we will convert all values to type int or float so that the model can process them, as we can see here that some columns still have `object` as dtype.

In [None]:
print(train_data.info())
print("="*50)
print(test_data.info())

In [None]:
# First we will drop the `Loan_ID` column as it will not help with learning of the model.
train_data.drop(columns=['Loan_ID'], inplace=True)
test_data.drop(columns=['Loan_ID'], inplace=True)

train_data.head()

We can replace the strings with numbers.

In [None]:
# Next, let's convert the `Gender` column to 0.0 for Male and 1.0 for Female
# map() returns a new numeric column, which we assign back to the DataFrame
gender_map = {'Male': 0.0, 'Female': 1.0}

train_data['Gender'] = train_data['Gender'].map(gender_map)
test_data['Gender'] = test_data['Gender'].map(gender_map)

Let's repeat this step for all of the categorical columns!

There are a lot more columns to do!
So you might want to use the `factorize` function!
(https://pandas.pydata.org/docs/reference/api/pandas.factorize.html)

or we can just replace them.
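
As a quick illustration of what `pd.factorize` returns, here is a minimal sketch on a made-up column. The values in `toy_area` are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column
toy_area = pd.Series(['Urban', 'Rural', 'Semiurban', 'Urban', 'Rural'])

# sort=True assigns the codes in sorted order of the category labels
codes, uniques = pd.factorize(toy_area, sort=True)

print(codes)          # one integer code per row
print(list(uniques))  # the label that each code stands for
```

The position of a label in `uniques` is its code, so the mapping can always be read back from the second return value.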

In [None]:
# Convert values in 'Married' to int
train_married = pd.Categorical(list(train_data['Married']), categories=['No','Yes'])
test_married = pd.Categorical(list(test_data['Married']), categories=['No','Yes'])

train_codes, uniques = pd.factorize(train_married,sort=True)
train_data['Married'] = train_codes

test_codes, uniques = pd.factorize(test_married, sort=True)
test_data['Married'] = test_codes

In [None]:
# Convert values in 'Education' to int
train_education = pd.Categorical(list(train_data['Education']), categories=['Not Graduate','Graduate'])
test_education = pd.Categorical(list(test_data['Education']), categories=['Not Graduate','Graduate'])

train_codes, uniques = pd.factorize(train_education,sort=True)
train_data['Education'] = train_codes

test_codes, uniques = pd.factorize(test_education, sort=True)
test_data['Education'] = test_codes

In [None]:
# Convert values in 'Self_Employed' to int
train_selfem = pd.Categorical(list(train_data['Self_Employed']), categories=['No','Yes'])
test_selfem = pd.Categorical(list(test_data['Self_Employed']), categories=['No','Yes'])

train_codes, uniques = pd.factorize(train_selfem,sort=True)
train_data['Self_Employed'] = train_codes

test_codes, uniques = pd.factorize(test_selfem, sort=True)
test_data['Self_Employed'] = test_codes

In [None]:
# Convert values in 'Property_Area' to int
train_prop = pd.Categorical(list(train_data['Property_Area']), categories=['Rural','Semiurban','Urban'])
test_prop = pd.Categorical(list(test_data['Property_Area']), categories=['Rural','Semiurban','Urban'])

train_codes, uniques = pd.factorize(train_prop,sort=True)
train_data['Property_Area'] = train_codes

test_codes, uniques = pd.factorize(test_prop, sort=True)
test_data['Property_Area'] = test_codes

In [None]:
# Convert values in 'Loan_Status' to int
train_status = pd.Categorical(list(train_data['Loan_Status']), categories=['N','Y'])

train_codes, uniques = pd.factorize(train_status,sort=True)
train_data['Loan_Status'] = train_codes

In [None]:
# Convert values in 'Dependents' to int
train_data['Dependents'] = train_data['Dependents'].astype(int)
test_data['Dependents'] = test_data['Dependents'].astype(int)

In [None]:
# Let's see which columns are still not float or int
train_data.info()

In [None]:
test_data.info()

#### All data is either float or int now!

## 2.1) Training a machine learning model for Loan prediction

First split the training data into a training set and a validation set.
Then use them to train and validate the model.

You can use any classification model from sklearn to classify `Loan_Status`.
Experiment with various hyperparameters. You may use cross-validation, a validation curve, a learning curve, or grid search as you like, but don't forget to report the validation accuracy and keep the best model.

Then, use your best model to predict `Loan_Status` for the test set.
(We cannot calculate the accuracy on the test data because we don't have the ground truth, i.e. the real values of `Loan_Status`.)
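
One possible way to combine hyperparameter search with cross-validation is `GridSearchCV`. The sketch below is only an illustration: it runs on synthetic data from `make_classification` as a stand-in for the loan data, and the grid of `C` values and `max_iter=1000` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared loan data (hypothetical, 11 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=42)

# Candidate values for the regularisation strength C (arbitrary choice)
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training split for every candidate
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

# best_estimator_ is refit on the whole training split by default
best_model = search.best_estimator_
val_accuracy = best_model.score(X_va, y_va)

print('Best parameters     :', search.best_params_)
print('Mean CV accuracy    :', search.best_score_)
print('Validation accuracy :', val_accuracy)
```

The same pattern works with any other sklearn classifier; only the estimator and the parameter grid change.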

### Create X and y data for a Machine Learning Model (or any other classification model)

Let's see which shape sklearn needs for Logistic Regression
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


In [None]:
# print shape of train and test data
print(train_data.shape)
print(test_data.shape)

`train_data` has 1 more column than `test_data`. That column is the `Loan_Status` column that we want to predict (our y)!
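
sklearn estimators expect X as a 2-D array of shape (n_samples, n_features) and y as a 1-D array of shape (n_samples,). A minimal sketch on a made-up frame, selecting the target by name instead of by position (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with two features and the target column
toy_df = pd.DataFrame({
    'ApplicantIncome': [5000, 3000, 4000],
    'Credit_History': [1.0, 0.0, 1.0],
    'Loan_Status': [1, 0, 1],
})

# Everything except the target goes into X, the target alone into y
X_toy = toy_df.drop(columns=['Loan_Status']).to_numpy()
y_toy = toy_df['Loan_Status'].to_numpy()

print(X_toy.shape)  # (n_samples, n_features)
print(y_toy.shape)  # (n_samples,)
```

Selecting by name gives the same result as slicing by position as long as `Loan_Status` is the last column, but it keeps working if the column order changes.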

In [None]:
# Let's convert train_data and test_data to X_train, y_train and X_test
X = train_data.iloc[:,:-1].to_numpy()
print(X.shape)

y = train_data.iloc[:,-1].to_numpy()
print(y.shape)

X_test = test_data.to_numpy()
print(X_test.shape)

### Split the data

In [None]:
# split X and y into X_train, X_val, y_train, y_val
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

### Train a classification model

In [None]:
# import a classification model
from sklearn.linear_model import LogisticRegression

In [None]:
# Fit the model on the training set
clf = LogisticRegression(random_state=0)
clf.fit(X_train, y_train)

### Report Validation Accuracy

In [None]:
# report the model accuracy on validation set
from sklearn.metrics import classification_report
y_pred = clf.predict(X_val)
print(classification_report(y_val, y_pred))
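
`classification_report` prints precision, recall and F1 per class. If a single accuracy number or a confusion matrix is easier to compare between models, `accuracy_score` and `confusion_matrix` can be used as well. The labels below are hypothetical and only show how the functions behave.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = approved, 0 = rejected)
y_true_toy = [1, 0, 1, 1, 0, 1]
y_pred_toy = [1, 0, 0, 1, 1, 1]

# Fraction of predictions that match the true labels
acc = accuracy_score(y_true_toy, y_pred_toy)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)

print('Accuracy :', acc)
print(cm)
```

With the notebook's variables, the equivalent call would be `accuracy_score(y_val, y_pred)`.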

### Use the best model to predict the Test Set

In [None]:
y_test_pred = clf.predict(X_test)
print(y_test_pred)