In [4]:
#import pandas
import pandas as pd
#read in csv as a dataframe
data = pd.read_csv('LoanApprovalPrediction.csv')
#check the shape of the dataframe
data.shape

(598, 13)

In [5]:
#some quick info about the dataframe
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 598 entries, 0 to 597
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            598 non-null    object 
 1   Gender             598 non-null    object 
 2   Married            598 non-null    object 
 3   Dependents         586 non-null    float64
 4   Education          598 non-null    object 
 5   Self_Employed      598 non-null    object 
 6   ApplicantIncome    598 non-null    int64  
 7   CoapplicantIncome  598 non-null    float64
 8   LoanAmount         577 non-null    float64
 9   Loan_Amount_Term   584 non-null    float64
 10  Credit_History     549 non-null    float64
 11  Property_Area      598 non-null    object 
 12  Loan_Status        598 non-null    object 
dtypes: float64(5), int64(1), object(7)
memory usage: 60.9+ KB


In [8]:
#here we will check for missing values
data.isna().sum()

Loan_ID               0
Gender                0
Married               0
Dependents           12
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           21
Loan_Amount_Term     14
Credit_History       49
Property_Area         0
Loan_Status           0
dtype: int64

In [7]:
#checking the number of unique values in Loan_ID
data.Loan_ID.nunique()

598

In [9]:
#dropping the Loan_ID column because it will have no impact on the models as the ID is simply a unique identifier
data.drop('Loan_ID', axis=1, inplace=True)

In [10]:
#checking for how many missing values are in the dataframe total
data.isna().sum().sum()

#96 missing values across 598 rows: at most ~16% of rows (96/598) have a gap, which is enough that it needs to be cleaned up if we want an accurate model/accurate results

96
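
The 96 above counts missing cells, not rows. A minimal sketch of how the two differ, using a small made-up frame (the `toy` data and the variable names are assumptions for illustration, not the loan data):

```python
import numpy as np
import pandas as pd

# hypothetical toy frame, not the loan dataset
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 120.0, np.nan],
    "Credit_History": [1.0, np.nan, np.nan, 0.0],
})

# number of missing cells (what isna().sum().sum() reports)
total_missing = int(toy.isna().sum().sum())
# number of rows with at least one missing cell
rows_with_missing = int(toy.isna().any(axis=1).sum())

share_of_cells = total_missing / toy.size
share_of_rows = rows_with_missing / len(toy)
```

Running `data.isna().any(axis=1).sum()` on the real dataframe would give the exact number of affected rows.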

In [11]:
#mapping gender from Male/Female to 0/1
data.Gender = data.Gender.map({'Male': 0, 'Female':1})

In [18]:
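#note: without parentheses this displays the bound method (and the dataframe repr) instead of calling it; data.info() gives the summary
data.info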

<bound method DataFrame.info of      Gender Married  Dependents     Education Self_Employed  ApplicantIncome  \
0         0      No         0.0      Graduate            No             5849   
1         0     Yes         1.0      Graduate            No             4583   
2         0     Yes         0.0      Graduate           Yes             3000   
3         0     Yes         0.0  Not Graduate            No             2583   
4         0      No         0.0      Graduate            No             6000   
..      ...     ...         ...           ...           ...              ...   
593       1      No         0.0      Graduate            No             2900   
594       0     Yes         3.0      Graduate            No             4106   
595       0     Yes         1.0      Graduate            No             8072   
596       0     Yes         2.0      Graduate            No             7583   
597       1      No         0.0      Graduate           Yes             4583   

     Co

In [12]:
#import packages for data processing
from sklearn.preprocessing import LabelEncoder
#create the label encoder
le = LabelEncoder()
#check for object type columns
obj = (data.dtypes == 'object')
#for each object column (obj[obj] keeps only the True entries), run le.fit_transform on that column
for col in list(obj[obj].index):
    data[col] = le.fit_transform(data[col])
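
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to one column. The input values are made up for illustration. Note that the loop above refits the same `le` on every column, so afterwards `le.classes_` only reflects the last column encoded.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# hypothetical column values, not read from the loan data
codes = le.fit_transform(["Y", "N", "Y", "Y"])

# classes_ holds the original labels in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
```

Here `mapping` shows which integer each original label was assigned, which is useful when interpreting predictions later.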

In [13]:
#applying the mean for each column to fill in any missing values
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())
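
One side effect of mean imputation is worth seeing on a small example: for a 0/1 column such as Credit_History, the mean is a fraction, so the filled value is neither 0 nor 1. The sketch below uses made-up data and shows filling with the most frequent value as one possible alternative. This is an assumption for comparison, not what the notebook does.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame, not the loan dataset
toy = pd.DataFrame({
    "LoanAmount": [100.0, 120.0, np.nan, 140.0],
    "Credit_History": [1.0, 1.0, 0.0, np.nan],
})

# mean imputation, equivalent to the loop above
mean_filled = toy.fillna(toy.mean())

# alternative for the binary column: fill with the most frequent value
mode_filled = toy.copy()
most_common = toy["Credit_History"].mode()[0]
mode_filled["Credit_History"] = toy["Credit_History"].fillna(most_common)
```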

In [14]:
#checking to confirm we have 0 missing values
data.isnull().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [15]:
#setting up the model inputs: x = all columns except Loan_Status (the features), y = Loan_Status (the target)
x = data.drop('Loan_Status', axis=1)
y = data.Loan_Status

In [16]:
#import train_test_split from sklearn
from sklearn.model_selection import train_test_split
#split the data into training and testing sets, 70% training, 30% testing, random_state = 7
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)

In [25]:
#import packages for modeling and serialization
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
import pickle

#create the model
model = RidgeClassifier()

#function to fit the model and return the accuracy score (as a percentage)
def modeling(model):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return accuracy_score(y_test, y_pred) * 100

#fit and score first, so that the pickled model is the trained one
accuracy = modeling(model)

#serializing the fitted model for future use
with open('train_model.pkl', mode='wb') as pkl:
    pickle.dump(model, pkl)

accuracy

82.22222222222221
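
A quick way to check that a serialized model is usable is to load it back and compare predictions. The sketch below does the round trip in memory on synthetic data from `make_classification` (an assumption, so it runs without the CSV). With the file written above, the equivalent would be `pickle.load` on `train_model.pkl` opened in `'rb'` mode.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# synthetic stand-in for the loan features (11 columns, like x above)
X, y = make_classification(n_samples=200, n_features=11, random_state=7)
model = RidgeClassifier().fit(X, y)

# serialize and restore the fitted model in memory
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# a fitted model carries coef_ and predicts the same as the original
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.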

I tried several models before determining that the RidgeClassifier had the best performance on this data. 

That led me to the question: what is a RidgeClassifier and how does it work? Let's explore a bit of that now.

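Ridge regression is a style of regression that helps models avoid overfitting by making the model more general. scikit-learn's RidgeClassifier converts the class labels to -1 and 1 and then fits a ridge regression to them, so understanding ridge regression explains the classifier. 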

Overfitting occurs when a model fits the training data very closely (very low training loss) but fails to generalize and performs poorly on the test data. 

Ridge regression adds a penalty on large weights during training. This shrinks the weights and introduces a small bias, trading a slightly worse fit on the training data for better performance during testing. 

Least squares regression = min(sum of squared distances from the expected result)

Ridge regression = min(sum of squared distances from the expected result + ALPHA * slope^2)

Here alpha starts at 0, which gives exactly the least squares slope. As alpha increases, the slope shrinks and the model becomes less sensitive to variations in the independent variable. With several features, the penalty is alpha times the sum of the squared coefficients. 
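
The effect of alpha on the slope can be seen in a small sketch. It uses synthetic one-feature data (the true slope of 3, the noise level, and the alpha values are all assumptions for illustration) and scikit-learn's `LinearRegression` as the alpha = 0 baseline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# synthetic data: y is roughly 3 * x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
y = 3.0 * x[:, 0] + rng.normal(scale=0.5, size=50)

# alpha = 0 corresponds to ordinary least squares
slopes = [float(LinearRegression().fit(x, y).coef_[0])]

# larger alpha means a stronger penalty and a smaller slope
alphas = [1.0, 10.0, 100.0]
for alpha in alphas:
    slopes.append(float(Ridge(alpha=alpha).fit(x, y).coef_[0]))

for label, slope in zip(["0 (least squares)"] + alphas, slopes):
    print(f"alpha = {label}: slope = {slope:.3f}")
```

The printed slopes decrease toward 0 as alpha grows, while never changing sign.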

Because this was the best-performing model, a plausible explanation is that part of the error across the other models came from being overly sensitive to the independent variables and failing to generalize across the complexities of the data. That said, we used a fairly small data set here (598 rows), and I believe that with some additional data and a few small model tweaks an accuracy of 90+% is achievable. 
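
One example of a small model tweak is tuning alpha with cross-validation rather than using the default of 1.0. The sketch below is only an illustration on synthetic data. The alpha grid, the 5 folds, and the `make_classification` stand-in are assumptions, and nothing here shows that tuning would reach 90% on the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the loan features and target
X, y = make_classification(n_samples=300, n_features=11, random_state=7)

# hypothetical grid of penalty strengths to compare
alpha_grid = [0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(
    RidgeClassifier(),
    param_grid={"alpha": alpha_grid},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_score = search.best_score_  # mean cross-validated accuracy, between 0 and 1
```

On the real data, the search would be fit on `x_train` and `y_train` only, keeping the test set for the final score.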

