# Introduction

This notebook analyses a loan dataset taken from a GitHub user ([prasertcbs](https://github.com/prasertcbs/basic-dataset)). No extra information came with it: we have the loan amount, for example, but not its currency or frequency, and we don't know where the data was originally taken from. However, (a) the columns are all self-explanatory and (b) there are relatively few rows (only 614), so I wanted to see whether a model could be trained to predict loan approval from such a small sample.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_palette("Set2")

df = pd.read_csv('Loan-Approval-Prediction.csv')
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.tail()

In [None]:
df.info()

Most of the features are categorical and there are missing values in some of the columns. For most of these the nulls make up at most about 5% of the values so we should be okay to impute these. For the credit history, however, this rises to 8% so we will have to take a further look at this. 

In [None]:
df.isna().mean()
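
As a minimal sketch, the same check can be extended to show counts alongside percentages, sorted so that the most affected columns come first. The `toy` frame and the `missing_summary` helper are made up for illustration; in the notebook the helper would be called on `df` instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data: the values are made up for illustration only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Male", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0, 141.0],
    "Credit_History": [1.0, np.nan, np.nan, 1.0, 0.0],
    "Education": ["Graduate"] * 5,
})


def missing_summary(frame):
    """Return missing-value counts and percentages, most affected columns first."""
    summary = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
    })
    return summary.sort_values("pct_missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```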

A brief look over the head of the dataframe also gives rise to the following observations.
- The dependents column looks like an integer count but is stored as a string.
- The coapplicant income, loan amount and loan term columns are encoded as floats but look as though they should be integers.
- The credit history column is given as a float but looks as though it should be binary.
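
These observations come from eyeballing the head of the dataframe, so it may be worth confirming them programmatically. The sketch below uses a small made-up `toy` frame with the same column names; the helpers `is_integer_valued` and `is_binary` are hypothetical names introduced here, not part of pandas.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data: the values are made up for illustration only.
toy = pd.DataFrame({
    "Dependents": ["0", "1", "3+", None, "2"],
    "CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0, 4196.0],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0, 141.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0, 1.0],
})


def is_integer_valued(series):
    """True if every non-missing value in a numeric column is a whole number."""
    values = series.dropna()
    return bool((values == values.round()).all())


def is_binary(series):
    """True if the non-missing values are a subset of {0, 1}."""
    return set(series.dropna().unique()) <= {0.0, 1.0}


# Entries that are present but cannot be parsed as numbers explain
# why a count-like column ends up stored as strings.
as_number = pd.to_numeric(toy["Dependents"], errors="coerce")
non_numeric = toy.loc[as_number.isna() & toy["Dependents"].notna(), "Dependents"].unique().tolist()

print(non_numeric)
print(is_integer_valued(toy["CoapplicantIncome"]), is_binary(toy["Credit_History"]))
```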

Given that the dataset is small, we'll take a quick look at each feature in turn to make sure we know what we're dealing with before modelling. First, we'll change the credit history dtype to object so that it is treated in the same way as the other categorical features during this inspection, even though it will be made binary for modelling.

In [None]:
df['Credit_History'] = df['Credit_History'].astype('object')

In [None]:
categorical = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 
               'Property_Area']

numerical = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

In [None]:
#Inspect numerical columns
df[numerical].hist(grid=False, bins=40, alpha=0.7)

plt.tight_layout();

In [None]:
df[numerical].describe()

- `ApplicantIncome`, `CoapplicantIncome` and `LoanAmount` are all right-skewed.
- The majority of applicant incomes are below 6000, but there are a few outliers, with at least one as high as 81000.
- Most coapplicant incomes are below 2000 and 25% of coapplicants have zero income.
- The median loan amount is 128 and 75% of loans are less than 170, though these go as high as 700. 
- The vast majority of the loans have a fixed term of 360 (the unit is not documented; months, i.e. a 30-year term, seems more plausible than days). We note that these are discrete values and so should be treated categorically for imputation.
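
One possible way to act on these observations at the imputation stage is sketched below: the median for a right-skewed continuous column and the most frequent value for the discrete loan term. Using the median is an assumption on my part rather than a choice made in this notebook, and the `toy` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data: the values are made up for illustration only.
toy = pd.DataFrame({
    "LoanAmount": [128.0, 66.0, np.nan, 120.0, 700.0],
    "Loan_Amount_Term": [360.0, 360.0, 180.0, np.nan, 360.0],
})

# A positive sample skewness indicates a right-skewed distribution.
skewness = toy["LoanAmount"].skew()

imputed = toy.copy()

# Right-skewed continuous column: the median is less sensitive to outliers than the mean.
imputed["LoanAmount"] = imputed["LoanAmount"].fillna(imputed["LoanAmount"].median())

# Discrete column: fill with the most frequent value, as for a categorical feature.
most_common_term = imputed["Loan_Amount_Term"].mode().iloc[0]
imputed["Loan_Amount_Term"] = imputed["Loan_Amount_Term"].fillna(most_common_term)

print(imputed)
```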

Now we'll check that the string entries in the categorical columns look okay and plot each feature as a countplot.

In [None]:
#Check that there aren't any typos in the cat entries
for cat in categorical:
    print(df[cat].value_counts(), '\n')

In [None]:
#Inspect categorical columns
plt.figure(figsize=(12, 8))
plt.subplots_adjust(hspace=0.3, wspace=0.3)

for i, cat in enumerate(categorical):
    ax = plt.subplot(3, 3, i+1)
    sns.countplot(x=cat, data=df, ax=ax, hue='Loan_Status')
    plt.legend([], frameon=False)

plt.legend(bbox_to_anchor=(1.05, 1.0), loc='upper left', borderaxespad=0)

In [None]:
df[categorical].describe(include='object')

- Most applicants are male (489 of the 601 that we have information for) and most are labelled as graduates (480 of 614), though we do not have any information on whether this refers to high school, college or university.
- Most applicants with a recorded credit history have a value of 1 (475 of 564), and most applicants are not self-employed.
- About twice as many applicants are married as are not, and more than half have no dependents; the rest have 1, 2 or 3+.

We finally examine our target variable, `Loan_Status`, which is the result of the application.

In [None]:
sns.countplot(x='Loan_Status', data=df);

There are more than twice as many approved loans as rejected ones, so we have an imbalanced dataset, though not severely so. If our model struggles to predict the minority class later on, we may need to address this (e.g. via oversampling), but for the moment we'll simply continue with our preparation for the modelling stage.
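
To make the oversampling idea concrete, here is a minimal sketch of random oversampling with plain pandas. The `toy` data is made up and `oversample_minority` is a hypothetical helper written for this example; a dedicated library could be used instead.

```python
import pandas as pd

# Toy stand-in for the loan data: the values are made up for illustration only.
toy = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y", "Y", "Y", "N", "Y"],
})

# Share of each class, to quantify the imbalance.
class_share = toy["Loan_Status"].value_counts(normalize=True)


def oversample_minority(frame, target, random_state=0):
    """Resample smaller classes with replacement until all classes match the largest."""
    counts = frame[target].value_counts()
    majority_size = counts.max()
    parts = []
    for label, count in counts.items():
        subset = frame[frame[target] == label]
        if count < majority_size:
            extra = subset.sample(n=majority_size - count, replace=True,
                                  random_state=random_state)
            subset = pd.concat([subset, extra])
        parts.append(subset)
    # Shuffle so that the classes are not grouped together in the output.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


balanced = oversample_minority(toy, "Loan_Status")
print(class_share)
print(balanced["Loan_Status"].value_counts())
```

Any resampling of this kind should be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.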

A natural next step in the analysis would be to plot each feature against the loan status to look for relationships or trends that point to factors influencing the outcome of the application. For the moment, though, we'll just proceed to preprocessing ahead of modelling.
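
For reference, a feature-versus-outcome comparison of this kind could start from a simple table of approval rates per category. The sketch below uses made-up `toy` values, so the rates it prints say nothing about the real data; `approval_rate_by` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for the loan data: the values are made up for illustration only.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
})


def approval_rate_by(frame, feature, target="Loan_Status"):
    """Row-normalised crosstab: the share of each outcome within each feature level."""
    return pd.crosstab(frame[feature], frame[target], normalize="index")


rates = approval_rate_by(toy, "Credit_History")
print(rates)
```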