# Homework 2 (Loans)

Description from the Kaggle competition:  
The goal is to decide whether or not to approve a loan to a new client based on the predictors that are provided. If you predict a 1, that means that you are predicting that the customer will pay back the loan. The data columns are described in the MetaData.csv file. The response for training is the MIS_Status column, where PIF = paid in full = a successful loan.
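
As a small illustration of that labeling rule, here is a sketch of turning MIS_Status values into 0/1 targets. The helper name `encode_status` and the non-PIF label `'CHGOFF'` are assumptions made up for this example; the only value the description confirms is PIF.

```python
def encode_status(statuses):
    """Map MIS_Status values to 1 (PIF, paid in full) or 0 (anything else)."""
    return [1 if status == 'PIF' else 0 for status in statuses]

# toy statuses; 'CHGOFF' is a hypothetical label for an unsuccessful loan
toy_statuses = ['PIF', 'CHGOFF', 'PIF', 'PIF']
toy_targets = encode_status(toy_statuses)
print(toy_targets)
```

With a pandas Series such as `trainResp`, the same mapping could be written as `(trainResp == 'PIF').astype(int)`.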

You'll evaluate using the mean F1 score (see the Kaggle overview page for more information). Mean F1 scores closer to 1 are better scores. On the leaderboard I've included a benchmark which I created by randomly predicting 0 or 1 for the test data set (see random.guess on the leaderboard).
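
For reference, here is a minimal sketch of the F1 score for the binary case in plain Python. The function name `f1_score_binary` and the toy labels are my own illustration, not part of the competition. How Kaggle averages the score is described on its overview page.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (1 = loan predicted to be paid in full)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # no correct positive predictions, so the score is 0 by convention here
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# toy labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
score = f1_score_binary(y_true, y_pred)
print(score)
```

For binary 0/1 labels, `sklearn.metrics.f1_score(y_true, y_pred)` computes the same quantity.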

In [32]:
# import analysis packages
import keras
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn.compose import make_column_selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import tensorflow as tf

### Data Cleaning

In [38]:
# read data from .csvs
trainDF = pd.read_csv('./loan_train.csv')
testDF = pd.read_csv('./loan_test.csv')

# separate response/prediction columns
trainResp = trainDF['MIS_Status']
trainDF.drop('MIS_Status', axis = 1, inplace = True)
testIDs = testDF['CustomerId']
testDF.drop('CustomerId', axis = 1, inplace = True)

# combine data sets for preprocessing
trainDF['source'] = 'train'
testDF['source'] = 'test'
fullDF = pd.concat([trainDF, testDF], axis = 0)

# check data frame dimensions
display(trainDF.shape)
display(testDF.shape)
display(fullDF.shape)

# factor categorical predictors (cast codes to strings so they are treated as categories)
for col in ['NAICS', 'NewExist', 'UrbanRural', 'New', 'RealEstate', 'Recession']:
    fullDF[col] = fullDF[col].astype(str)

# collapse flag columns to 'Y' / 'N' (anything other than 'Y' becomes 'N')
fullDF['RevLineCr'] = np.where(fullDF['RevLineCr'] == 'Y', 'Y', 'N')
fullDF['LowDoc'] = np.where(fullDF['LowDoc'] == 'Y', 'Y', 'N')

# selected predictors
predictors = ['NAICS', 'Term', 'NoEmp', 'CreateJob', 'RetainedJob', 'UrbanRural', 'DisbursementGross', 'GrAppv', 'New', 'Portion', 'Recession']
src = fullDF['source']
fullDF = fullDF[predictors]

# check data types
display(fullDF.info())

# peek at data
display(fullDF.head())

(1102, 31)

(1000, 31)

(2102, 31)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2102 entries, 0 to 999
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   NAICS              2102 non-null   object 
 1   Term               2102 non-null   int64  
 2   NoEmp              2102 non-null   int64  
 3   CreateJob          2102 non-null   int64  
 4   RetainedJob        2102 non-null   int64  
 5   UrbanRural         2102 non-null   object 
 6   DisbursementGross  2102 non-null   int64  
 7   GrAppv             2102 non-null   int64  
 8   New                2102 non-null   object 
 9   Portion            2102 non-null   float64
 10  Recession          2102 non-null   object 
dtypes: float64(1), int64(6), object(4)
memory usage: 197.1+ KB


None

Unnamed: 0,NAICS,Term,NoEmp,CreateJob,RetainedJob,UrbanRural,DisbursementGross,GrAppv,New,Portion,Recession
0,531210,84,2,0,0,2,11000,11000,0,0.5,0
1,531312,300,7,0,7,1,866800,866800,0,0.75,0
2,532230,84,3,0,3,1,77377,85000,0,0.5,0
3,531312,300,10,0,0,1,800100,810000,0,0.75,0
4,531210,300,2,0,2,1,1054200,1056200,0,0.72611,0
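
The `source` tag saved in `src` is what allows the combined frame to be split back into training and test rows after preprocessing. Below is a minimal sketch of that pattern on toy frames; the names `train`, `test`, `features`, and `is_train` are made up for the illustration. Because `pd.concat` keeps the original row labels (the `info()` output above shows 2102 entries labeled 0 to 999), the index contains duplicates, so the sketch uses a positional boolean mask rather than label alignment.

```python
import pandas as pd

# toy stand-ins for the training and test frames
train = pd.DataFrame({'Term': [84, 300, 60]})
test = pd.DataFrame({'Term': [120, 240]})

# tag each row with its origin, then stack the frames
train['source'] = 'train'
test['source'] = 'test'
full = pd.concat([train, test], axis=0)

# keep the tag aside and drop it from the feature columns
src = full['source']
features = full.drop(columns='source')

# positional mask: safe even though the index has duplicate labels
is_train = (src == 'train').to_numpy()
train_part = features[is_train]
test_part = features[~is_train]

print(train_part.shape, test_part.shape)
```

The same mask could be applied to any frame that has one row per row of the combined data, in the same order.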


In [40]:
# scale numeric predictors and encode categorical predictors
findNumPredictors = make_column_selector(dtype_exclude = object)
findCatPredictors = make_column_selector(dtype_include = object)
transform = make_column_transformer((MinMaxScaler(), findNumPredictors),
                                    (OneHotEncoder(), findCatPredictors))

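# transform data (fit once and reuse the fitted transformer)
transformed = transform.fit_transform(fullDF)

# get new column names
colNames = transform.get_feature_names_out()

# build a sparse data frame and peek at the result
modelDF = pd.DataFrame.sparse.from_spmatrix(transformed, columns = colNames)
display(modelDF.head())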

Unnamed: 0,minmaxscaler__Term,minmaxscaler__NoEmp,minmaxscaler__CreateJob,minmaxscaler__RetainedJob,minmaxscaler__DisbursementGross,minmaxscaler__GrAppv,minmaxscaler__Portion,onehotencoder__NAICS_531110,onehotencoder__NAICS_531120,onehotencoder__NAICS_531130,...,onehotencoder__NAICS_532420,onehotencoder__NAICS_532490,onehotencoder__NAICS_533110,onehotencoder__UrbanRural_0,onehotencoder__UrbanRural_1,onehotencoder__UrbanRural_2,onehotencoder__New_0,onehotencoder__New_1,onehotencoder__Recession_0,onehotencoder__Recession_1
0,0.27451,0.003077,0.0,0.0,0.002669,0.002771,0.288995,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
1,0.980392,0.010769,0.0,0.013084,0.373118,0.36764,0.644498,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2,0.27451,0.004615,0.0,0.005607,0.031401,0.034321,0.288995,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,0.980392,0.015385,0.0,0.0,0.344246,0.343424,0.644498,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
4,0.980392,0.003077,0.0,0.003738,0.454238,0.448391,0.610525,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
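
To make the two transformations concrete, here is a hand-rolled sketch in plain Python of what min-max scaling and one-hot encoding compute. The helpers `min_max_scale` and `one_hot` are illustrative stand-ins, not the scikit-learn implementations, and the toy values are assumptions chosen so that terms of 84 and 300 reproduce the scaled values shown above (0.27451 and 0.980392).

```python
def min_max_scale(values):
    """Rescale values to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # constant column: map everything to 0.0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows

# hypothetical Term values spanning 0 to 306
toy_terms = [0, 84, 300, 306]
scaled = min_max_scale(toy_terms)

# hypothetical UrbanRural codes, already cast to strings
cats, encoded = one_hot(['2', '1', '1', '0'])

print([round(s, 6) for s in scaled])
print(cats, encoded[0])
```

Each category becomes its own indicator column, which is why the single NAICS column expands into many `onehotencoder__NAICS_*` columns in the transformed frame.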
