<a href="https://colab.research.google.com/github/fdeiab/loan-default-prediction/blob/main/Loan_Default_Predicition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background

Loan approval is a critical decision for any lending institution, and it can go wrong in two ways: the bank rejects a creditworthy individual or company and loses potential income, or it lends to a party exhibiting risky behaviour and incurs heavy losses. Because of this uncertainty, banks stand to gain much from a good loan default prediction model.

**The major goal of this notebook is to assess whether a loan should be approved by predicting whether it will default.**

# The Data

The [dataset](https://www.kaggle.com/datasets/larsen0966/sba-loans-case-data-set) used in this project, "SBA Case", is a subset of the "National SBA" data, which contains historical data from 1987 through 2014. The subset covers only loans in the state of California from 1989 to 2012. It contains **2,102** records with **35** attributes:

| **Variable Name**   | **Description of Variable**                            |
|:-------------------:|:------------------------------------------------------:|
|  LoanNr\_ChkDgt     |  Identifier – Primary key                              |
|  Name               |  Borrower name                                         |
|  City               |  Borrower city                                         |
|  State              |  Borrower state                                        |
|  Zip                |  Borrower zip code                                     |
|  Bank               |  Bank name                                             |
|  BankState          |  Bank state                                            |
|  NAICS              |  North American industry classification system code    |
|  ApprovalDate       |  Date SBA commitment issued                            |
|  ApprovalFY         |  Fiscal year of commitment                             |
|  Term               |  Loan term in months                                   |
|  NoEmp              |  Number of business employees                          |
|  NewExist           |  1 = Existing business, 2 = New business               |
|  CreateJob          |  Number of jobs created                                |
|  RetainedJob        |  Number of jobs retained                               |
|  FranchiseCode      |  Franchise code, \(00000 or 00001\) = No franchise     |
|  UrbanRural         |  1 = Urban, 2 = Rural, 0 = Undefined                   |
|  RevLineCr          |  Revolving line of credit: Y = Yes, N = No             |
|  LowDoc             |  LowDoc Loan Program: Y = Yes, N = No                  |
|  ChgOffDate         |  The date when a loan is declared to be in default     |
|  DisbursementDate   |  Disbursement date                                     |
|  DisbursementGross  |  Amount disbursed                                      |
|  BalanceGross       |  Gross amount outstanding                              |
|  MIS\_Status        |  Loan status charged off = CHGOFF, Paid in full = PIF  |
|  ChgOffPrinGr       |  Charged\-off amount                                   |
|  GrAppv             |  Gross amount of loan approved by bank                 |
|  SBA\_Appv          |  SBA's guaranteed amount of approved loan              |
|  New         |  =1 if NewExist=2 \(New Business\), =0 if NewExist=1 \(Existing Business\)                                                                  |
|  Portion     |  Proportion of gross amount guaranteed by SBA                                                                                               |
|  RealEstate  |  =1 if loan is backed by real estate, =0 otherwise                                                                                          |
|  Recession   |  =1 if loan is active during Great Recession, =0 otherwise                                                                                  |
|  Selected    |  =1 if the data are selected as training data to build model for assignment, =0 if the data are selected as testing data to validate model  |
|  Default     |  =1 if MIS\_Status=CHGOFF, =0 if MIS\_Status=PIF                                                                                            |
|  daysterm    |  Extra variable generated when creating “Recession” in Section 4\.1\.6                                                                      |
|  xx          |  Extra variable generated when creating “Recession” in Section 4\.1\.6                                                                      |


This project uses logistic regression to estimate the probability that a loan defaults and to report odds ratios for the predictors.
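
As a rough illustration of how a logistic regression turns predictors into a default probability and how a coefficient maps to an odds ratio, here is a minimal sketch. The intercept, the coefficient values, and the choice of `New` and `RealEstate` as predictors are hypothetical and are not fitted to this dataset; the helper names `default_probability` and `odds_ratio` are also made up for this example.

```python
import math


def default_probability(intercept, coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratio(coefficient):
    """Multiplicative change in the odds of default per one-unit increase in a predictor."""
    return math.exp(coefficient)


# Hypothetical values for illustration only (not estimated from the SBA data).
intercept = -1.0
coefficients = [0.5, -2.0]   # assumed coefficients for New and RealEstate
features = [1, 0]            # a new business whose loan is not backed by real estate

p = default_probability(intercept, coefficients, features)
odds = p / (1.0 - p)         # odds of default implied by the probability

print(f"P(default) = {p:.3f}, odds = {odds:.3f}")
print(f"Odds ratio for RealEstate = {odds_ratio(coefficients[1]):.3f}")
```

An odds ratio below 1 means the predictor is associated with lower odds of default, and a value above 1 with higher odds, holding the other predictors fixed.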

Target value for this project: **MIS\_Status** / **Default**
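
The CSV already ships with the derived columns described in the table, so nothing needs to be recomputed. Still, the definitions can be checked with a short sketch like the one below. It runs on a small made-up frame rather than the real file, the helper name `add_derived_columns` is hypothetical, and computing `Portion` as `SBA_Appv / GrAppv` is an assumption based on the column description ("proportion of gross amount guaranteed by SBA").

```python
import pandas as pd


def add_derived_columns(df):
    """Recreate New, Portion and Default from the raw columns (illustrative only)."""
    out = df.copy()
    # New = 1 if NewExist == 2 (new business), 0 otherwise
    out["New"] = (out["NewExist"] == 2).astype(int)
    # Assumption: Portion is the SBA-guaranteed amount divided by the gross approved amount
    out["Portion"] = out["SBA_Appv"] / out["GrAppv"]
    # Default = 1 if the loan was charged off, 0 otherwise
    out["Default"] = (out["MIS_Status"] == "CHGOFF").astype(int)
    return out


# Toy records invented for this example; status labels follow the table above.
toy = pd.DataFrame({
    "NewExist": [1.0, 2.0, 1.0],
    "GrAppv": [4500.0, 100000.0, 300000.0],
    "SBA_Appv": [2250.0, 75000.0, 150000.0],
    "MIS_Status": ["PIF", "CHGOFF", "PIF"],
})

derived = add_derived_columns(toy)
print(derived[["New", "Portion", "Default"]])
```

Comparing columns recomputed this way against the ones already in the file would be one way to confirm that `Default` really mirrors `MIS_Status` before using it as the target.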

In [3]:
#@title #Importing Libraries & Loading The Dataset 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the SBA case CSV; the path assumes Google Drive is already mounted at /content/drive
df = pd.read_csv('/content/drive/MyDrive/SBAcase.11.13.17.csv')



# Step 1: Exploratory Data Analysis

First, we'll examine the number of null values present, the column data types, and the summary statistics to tell us more about the dataset.


In [6]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2102 entries, 0 to 2101
Data columns (total 35 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Selected           2102 non-null   int64  
 1   LoanNr_ChkDgt      2102 non-null   int64  
 2   Name               2102 non-null   object 
 3   City               2102 non-null   object 
 4   State              2102 non-null   object 
 5   Zip                2102 non-null   int64  
 6   Bank               2099 non-null   object 
 7   BankState          2099 non-null   object 
 8   NAICS              2102 non-null   int64  
 9   ApprovalDate       2102 non-null   int64  
 10  ApprovalFY         2102 non-null   int64  
 11  Term               2102 non-null   int64  
 12  NoEmp              2102 non-null   int64  
 13  NewExist           2101 non-null   float64
 14  CreateJob          2102 non-null   int64  
 15  RetainedJob        2102 non-null   int64  
 16  FranchiseCode      2102 

|       | Selected | LoanNr\_ChkDgt | Zip | NAICS | ApprovalDate | ApprovalFY | Term | NoEmp | NewExist | CreateJob | ... | ChgOffPrinGr | GrAppv | SBA\_Appv | New | RealEstate | Portion | Recession | daysterm | xx | Default |
|:------|---------:|---------------:|----:|------:|-------------:|-----------:|-----:|------:|---------:|----------:|:---:|-------------:|-------:|----------:|----:|-----------:|--------:|----------:|---------:|---:|--------:|
| count | 2102.0 | 2102.0 | 2102.0 | 2102.0 | 2102.0 | 2102.0 | 2102.0 | 2102.0 | 2101.0 | 2102.0 | ... | 2102.0 | 2102.0 | 2102.0 | 2102.0 | 2102.0 | 2102.0 | 2102.0 | 2102.0 | 2099.0 | 2102.0 |
| mean | 0.5 | 4469172000.0 | 92698.612274 | 531630.90295 | 16179.58706 | 2004.03568 | 126.980495 | 10.150809 | 1.153736 | 2.549952 | ... | 20029.08 | 233064.1 | 189175.9 | 0.154139 | 0.2745 | 0.671055 | 0.068506 | 3809.414843 | 20076.896141 | 0.326356 |
| std | 0.500119 | 2530069000.0 | 1878.208435 | 521.836986 | 1454.931276 | 4.006321 | 93.798944 | 34.40242 | 0.362099 | 8.010175 | ... | 75432.29 | 343631.0 | 298926.8 | 0.361168 | 0.446368 | 0.186519 | 0.252673 | 2813.968318 | 2854.510377 | 0.468991 |
| min | 0.0 | 1004285000.0 | 65757.0 | 531110.0 | 10554.0 | 1989.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 4500.0 | 2250.0 | 0.0 | 0.0 | 0.29677 | 0.0 | 0.0 | 11524.0 | 0.0 |
| 25% | 0.0 | 2392978000.0 | 91402.0 | 531210.0 | 15695.75 | 2003.0 | 60.0 | 2.0 | 1.0 | 0.0 | ... | 0.0 | 30000.0 | 15000.0 | 0.0 | 0.0 | 0.5 | 0.0 | 1800.0 | 18316.5 | 0.0 |
| 50% | 0.5 | 3621730000.0 | 92559.5 | 531312.0 | 16556.0 | 2005.0 | 84.0 | 3.0 | 1.0 | 0.0 | ... | 0.0 | 61000.0 | 41680.0 | 0.0 | 0.0 | 0.5 | 0.0 | 2520.0 | 19270.0 | 0.0 |
| 75% | 1.0 | 6551607000.0 | 94127.75 | 532230.0 | 17149.75 | 2007.0 | 240.0 | 8.0 | 1.0 | 2.0 | ... | 15073.5 | 300000.0 | 239756.2 | 0.0 | 1.0 | 0.85 | 0.0 | 7200.0 | 22335.0 | 1.0 |
| max | 1.0 | 9958873000.0 | 96161.0 | 533110.0 | 18911.0 | 2012.0 | 306.0 | 650.0 | 2.0 | 130.0 | ... | 1509550.0 | 2350000.0 | 2115000.0 | 1.0 | 1.0 | 1.0 | 1.0 | 9180.0 | 27598.0 | 1.0 |
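
The output above shows a few incomplete columns: `Bank` and `BankState` have 2099 non-null values out of 2102, `NewExist` has 2101, and the `count` row reports 2099 for `xx`. A compact way to list only the columns with gaps might look like the sketch below. The helper name `missing_value_summary` is made up for this example, and it is demonstrated on a small invented frame rather than on `df`.

```python
import numpy as np
import pandas as pd


def missing_value_summary(df):
    """Return the count and percentage of missing values for columns that have any."""
    missing = df.isnull().sum()
    missing = missing[missing > 0]          # keep only columns with at least one null
    percent = (100 * missing / len(df)).round(2)
    return pd.DataFrame({"missing": missing, "percent": percent})


# Toy data invented for illustration; column names borrowed from the dataset.
toy = pd.DataFrame({
    "Bank": ["A", None, "C", "D"],
    "NewExist": [1.0, 2.0, np.nan, 1.0],
    "Term": [60, 84, 240, 120],
})

summary = missing_value_summary(toy)
print(summary)
```

Calling `missing_value_summary(df)` on the loaded dataset should surface the same columns that `df.info()` flags, in a form that is easier to scan than the full listing.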


