## Classification on Credit Card Application data

In this Jupyter Notebook, I clean and transform confidential credit card application data and then use it for classification. Specifically, I predict whether one attribute, *A16*, equals "+" by fitting linear probability and logistic regression models.

In [1]:
import pandas as pd
import numpy as np
# warning code taken from 
# https://stackoverflow.com/questions/40659212/futurewarning-elementwise-comparison-failed-returning-scalar-but-in-the-futur
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Silence solver convergence warnings from scikit-learn.
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter(action='ignore', category=ConvergenceWarning)
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import sklearn.linear_model as slm
from sklearn.model_selection import train_test_split
from random import sample

In [2]:
# Load the data.
credit = pd.read_csv("data/credit-card-applications.csv.bz2")
credit.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [3]:
# Check dataframe dimensions.
print(credit.shape)

(690, 16)


In [4]:
# Check for missing values.
credit = credit.replace('?', np.nan)
missings = credit.isna().sum()
print(missings)

A1     12
A2     12
A3      0
A4      6
A5      6
A6      9
A7      9
A8      0
A9      0
A10     0
A11     0
A12     0
A13     0
A14    13
A15     0
A16     0
dtype: int64


This dataset describes credit card applications and has 690 rows and 16 variables. Missing values are encoded as `?`; per the associated metadata, 37 cases (5%) have one or more missing values. The variable names A1-A16 and the symbols inside each variable were deliberately made meaningless to protect applicant information and the confidentiality of the data.
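The per-column counts above do not directly show how many cases have at least one missing value. A minimal sketch of that row-level count follows; the `toy` frame and its values are hypothetical stand-ins, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; '?' marks a missing value.
toy = pd.DataFrame({
    "A1": ["b", "?", "a", "b"],
    "A2": ["30.83", "58.67", "?", "27.83"],
    "A14": ["00202", "?", "00280", "00100"],
})

toy = toy.replace("?", np.nan)

# Missing values per column (the same summary the cell above prints).
per_column = toy.isna().sum()

# Number of rows (cases) with at least one missing value, and their share.
rows_with_missing = int(toy.isna().any(axis=1).sum())
share = rows_with_missing / len(toy)

print(per_column)
print(rows_with_missing, share)
```

Running the same two lines on `credit` would be one way to check the metadata's figure of 37 cases.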

In [5]:
# Ensure the variables are of appropriate type.
print(credit.dtypes)

A1      object
A2      object
A3     float64
A4      object
A5      object
A6      object
A7      object
A8     float64
A9      object
A10     object
A11      int64
A12     object
A13     object
A14     object
A15      int64
A16     object
dtype: object


The dataset contains a mixture of variable types: `object`, `float64`, and `int64`. This is as expected: strings are stored as objects, and `A3` and `A8`, which contain decimal values, are floats. `A11` and `A15` hold whole numbers and are correctly typed as `int64`. `A9`, `A10`, and `A12` look boolean but are stored as objects because their values are the abbreviations `t` and `f`, which likely stand for `true` and `false`, respectively. I was surprised that `A2` was an object even though its values look like floats. The cause is the `?` placeholder: because some entries were the string `?` when the file was read, pandas could not parse the column as numeric and kept every entry as a string. Replacing `?` with `NaN` afterwards does not change the dtype, so the column has to be converted explicitly. The same applies to `A14`.
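The effect of a `?` placeholder on type inference can be reproduced on a tiny example. This is a sketch on a hypothetical three-row CSV, not the real file; it shows two standard pandas options, `na_values` in `read_csv` and `pd.to_numeric` with `errors="coerce"`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the '?' in A2 mimics the missing-value marker.
raw = "A2,A3\n30.83,0.0\n?,4.46\n24.5,0.5\n"

# Read as-is: the '?' string prevents A2 from being parsed as a number.
as_read = pd.read_csv(io.StringIO(raw))
print(as_read.dtypes)

# Option 1: tell read_csv which token means "missing".
parsed = pd.read_csv(io.StringIO(raw), na_values="?")
print(parsed.dtypes)

# Option 2: convert after the fact; unparseable entries become NaN.
converted = pd.to_numeric(as_read["A2"], errors="coerce")
print(converted.dtype)
```

Passing `na_values="?"` at load time would avoid both the later `replace` and the explicit `astype('float')`, at the cost of deciding on the missing-value marker before inspecting the data.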

In [6]:
# Describe the variables: for the categorical variables print the possible categories, for the numeric
# variables compute the means and range.
print("A1:", pd.unique(credit['A1']), "A4:", pd.unique(credit['A4']), "A5", pd.unique(credit['A5']),\
      "A6:", pd.unique(credit['A6']), "A7:", pd.unique(credit['A7']), "A9:", pd.unique(credit['A9']),\
      "A10:", pd.unique(credit['A10']), "A12:", pd.unique(credit['A12']),\
      "A13:", pd.unique(credit['A13']), "A14:", pd.unique(credit['A14']), "A16:", pd.unique(credit['A16']))

credit['A2'] = credit['A2'].astype('float')
mean_beforeA2 = credit['A2'].mean()
credit['A2'] = credit['A2'].fillna(mean_beforeA2)
print("A2: mean:", credit['A2'].mean(), "range:", credit['A2'].max() - credit['A2'].min())

# A3, A8, A11 and A15 have no missing values (see In [4]), so they need no imputation.
for col in ['A3', 'A8', 'A11', 'A15']:
    print(f"{col}: mean:", credit[col].mean(), "range:", credit[col].max() - credit[col].min())

A1: ['b' 'a' nan] A4: ['u' 'y' nan 'l'] A5 ['g' 'p' nan 'gg'] A6: ['w' 'q' 'm' 'r' 'cc' 'k' 'c' 'd' 'x' 'i' 'e' 'aa' 'ff' 'j' nan] A7: ['v' 'h' 'bb' 'ff' 'j' 'z' nan 'o' 'dd' 'n'] A9: ['t' 'f'] A10: ['t' 'f'] A12: ['f' 't'] A13: ['g' 's' 'p'] A14: ['00202' '00043' '00280' '00100' '00120' '00360' '00164' '00080' '00180'
 '00052' '00128' '00260' '00000' '00320' '00396' '00096' '00200' '00300'
 '00145' '00500' '00168' '00434' '00583' '00030' '00240' '00070' '00455'
 '00311' '00216' '00491' '00400' '00239' '00160' '00711' '00250' '00520'
 '00515' '00420' nan '00980' '00443' '00140' '00094' '00368' '00288'
 '00928' '00188' '00112' '00171' '00268' '00167' '00075' '00152' '00176'
 '00329' '00212' '00410' '00274' '00375' '00408' '00350' '00204' '00040'
 '00181' '00399' '00440' '00093' '00060' '00395' '00393' '00021' '00029'
 '00102' '00431' '00370' '00024' '00020' '00129' '00510' '00195' '00144'
 '00380' '00049' '00050' '00381' '00150' '00117' '00056' '00211' '00230'
 '00156' '00022' '00228' '

I treated `A2` as a float rather than as an object because its entries are numbers, so a mean and a range describe it better than a list of distinct values. I replaced its missing values (marked `?` in the raw file) with the mean of the variable. This leaves the mean unchanged and keeps every row available for modeling.
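The mean-imputation step can be illustrated on a small series. The values below are hypothetical; the sketch only shows that `mean()` skips `NaN` by default and that filling with the mean does not move it.

```python
import numpy as np
import pandas as pd

# Hypothetical values; NaN stands for an entry that was '?' in the raw file.
a2 = pd.Series([30.0, np.nan, 24.0, 27.0], name="A2")

mean_before = a2.mean()             # mean() ignores NaN by default
a2_filled = a2.fillna(mean_before)  # replace the missing entry with that mean

print("mean before:", mean_before)
print("mean after: ", a2_filled.mean())
print("range:", a2_filled.max() - a2_filled.min())
```

Mean imputation keeps the average intact but shrinks the variance slightly, which is worth keeping in mind when many values are missing.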

In [7]:
# Split your data into training/testing part.
X =  np.stack((credit['A2'].values, credit['A3'].values), axis=1)
credit['A16'] = credit['A16'].replace('+', 1)
credit['A16'] = credit['A16'].replace('-', 0)
y = credit['A16'].values.astype('int')
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size = 0.2)

In [8]:
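# Create a simple initial LPM (linear probability model) predicting A16 being positive.
m = slm.LinearRegression().fit(Xtr, ytr)
# Note: predictions are made for every row of X (training and testing), not only Xte.
yhat = m.predict(X)
# Convert the fitted values to class labels with a 0.5 cutoff.
yhat = np.where(yhat >= 0.5, 1, 0)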

To build a meaningful crosstab for the linear regression, I converted the fitted values in the `yhat` array to class labels: 1 if the value is `>= 0.5` and 0 if it is `< 0.5`.
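The thresholding and the crosstab can be sketched on a handful of made-up fitted values. The arrays `yhat_raw` and `y_true` are hypothetical; note that a linear probability model can produce fitted values below 0 or above 1, which the cutoff handles without special treatment.

```python
import numpy as np
import pandas as pd

# Hypothetical fitted values from a linear probability model and the true labels.
yhat_raw = np.array([-0.10, 0.20, 0.50, 0.75, 1.30])
y_true = np.array([0, 1, 1, 1, 0])

# A single comparison assigns 1 to values >= 0.5 and 0 to everything else.
yhat_class = (yhat_raw >= 0.5).astype(int)

# Confusion matrix: rows are actual classes, columns are predicted classes.
table = pd.crosstab(y_true, yhat_class, rownames=["actual"], colnames=["predicted"])
print(table)
print("Accuracy:", np.mean(y_true == yhat_class))
```

Accuracy is the share of cases on the diagonal of this table, so the crosstab also shows which kind of error the model makes more often.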

In [9]:
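# Compute the prediction accuracy using this model.
# Because yhat was predicted from all of X, this is accuracy on the full data set
# (training and testing rows together), not on the testing rows alone.
pd.crosstab(y, yhat)
print("Accuracy:", np.mean(y == yhat))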

Accuracy: 0.618840579710145
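For comparison, accuracy on the held-out rows alone could be computed as sketched below. The data here is synthetic, and `X_demo`, `y_demo`, `m_demo` and the fixed `random_state` are assumptions made only so the example is self-contained and repeatable.

```python
import numpy as np
import sklearn.linear_model as slm
from sklearn.model_selection import train_test_split

# Synthetic stand-in for two numeric predictors and a 0/1 outcome.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# random_state is fixed here only so the sketch gives the same split every time.
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

m_demo = slm.LinearRegression().fit(Xtr, ytr)

# Predict on the held-out rows only, then apply the 0.5 cutoff.
yhat_te = (m_demo.predict(Xte) >= 0.5).astype(int)
test_accuracy = np.mean(yte == yhat_te)
print("Test accuracy:", test_accuracy)
```

In the notebook, the equivalent change would be to predict from `Xte` and compare against `yte` instead of `X` and `y`.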


In [10]:
# Repeat these two steps with a similar logistic regression model.
m_log = slm.LogisticRegression(solver='lbfgs').fit(Xtr, ytr)
yhat_log = m_log.predict(X)
pd.crosstab(y, yhat_log)
print("Accuracy:", np.mean(y == yhat_log))

Accuracy: 0.6260869565217392


In [11]:
# Develop two more complex models adding more variables from the data.
# Fit linear probability and logistic regression versions of both models.

# First Complex Model
credit['A10'] = credit['A10'].replace('t', True)
credit['A10'] = credit['A10'].replace('f', False)
credit['A10'] = credit['A10'].values.astype('int')

X1 =  np.stack((credit['A2'].values, credit['A3'].values, credit['A11'].values), axis=1)
Xtr, Xte, ytr, yte = train_test_split(X1, y, test_size = 0.2)

m1_lin = slm.LinearRegression().fit(Xtr, ytr)
yhat1_lin = m1_lin.predict(X1)
yhat1_lin = np.where(yhat1_lin >= 0.5, 1, yhat1_lin) 
yhat1_lin = np.where(yhat1_lin < 0.5, 0, yhat1_lin) 
crossed_lin1 = pd.crosstab(y, yhat1_lin)
print("First Model:")
print("Linear Regression Accuracy:", np.mean(y == yhat1_lin))

m1_log = slm.LogisticRegression(solver="lbfgs").fit(Xtr, ytr)
yhat1_log = m1_log.predict(X1)
crossed_log = pd.crosstab(y, yhat1_log)
print("Logistic Regression Accuracy:", np.mean(y == yhat1_log))

# Second Complex Model
credit['A1'] = credit['A1'].fillna('a')
credit['A1'] = credit['A1'].replace('a', 0)
credit['A1'] = credit['A1'].replace('b', 1)
credit['A4'] = credit['A4'].fillna('u')
credit['A4'] = (credit.A4 != "u").values.astype('int')
credit['A13'] = credit['A13'].fillna('g')
credit['A13'] = (credit.A13 != "g").values.astype('int')
credit['A9'] = credit['A9'].replace('t', True)
credit['A9'] = credit['A9'].replace('f', False)
credit['A9'] = credit['A9'].values.astype('int')
credit['A12'] = credit['A12'].replace('t', True)
credit['A12'] = credit['A12'].replace('f', False)
credit['A12'] = credit['A12'].values.astype('int')

X2 =  np.stack((credit['A1'].values, credit['A2'].values, credit['A3'].values, credit['A4'].values,\
                credit['A8'].values, credit['A9'].values, credit['A10'].values, credit['A11'].values,\
                credit['A12'], credit['A15']), axis=1)
Xtr, Xte, ytr, yte = train_test_split(X2, y, test_size = 0.2)

m2_lin = slm.LinearRegression().fit(Xtr, ytr)
yhat2_lin = m2_lin.predict(X2)
yhat2_lin = np.where(yhat2_lin >= 0.5, 1, yhat2_lin) 
yhat2_lin = np.where(yhat2_lin < 0.5, 0, yhat2_lin) 
crossed_lin2 = pd.crosstab(y, yhat2_lin)
print()
print("Second Model:")
print("Linear Regression Accuracy:", np.mean(y == yhat2_lin))

m2_log = slm.LogisticRegression(solver="lbfgs", max_iter=1000).fit(Xtr, ytr)
yhat2_log = m2_log.predict(X2)
crossed_log2 = pd.crosstab(y, yhat2_log)
print("Logistic Regression Accuracy:", np.mean(y == yhat2_log))

First Model:
Linear Regression Accuracy: 0.7246376811594203
Logistic Regression Accuracy: 0.7391304347826086

Second Model:
Linear Regression Accuracy: 0.8579710144927536
Logistic Regression Accuracy: 0.8565217391304348


In [12]:
# Finally, present my best model (in terms of accuracy on test data).

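In the output shown above, accuracy rose as I added predictors. The logistic regression was slightly more accurate than the linear probability model for the initial model (0.626 vs. 0.619) and for the first complex model (0.739 vs. 0.725), while the two were essentially tied for the second complex model (0.858 for linear regression vs. 0.857 for logistic regression). The best model was therefore my second complex model. It likely performed better because it contains more variables that are pertinent predictors of `A16`. Two caveats apply: these accuracies are computed on all rows (`X`, `X1`, `X2`) rather than only on the held-out testing rows, and `train_test_split` is called without a fixed `random_state`, so the exact figures change from run to run.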
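One way to make such a comparison repeatable is to split the row indices once, with a fixed seed, and score every model on the same held-out rows. The sketch below uses synthetic data; the column names are borrowed from the notebook for readability only, and `demo`, `target`, `feature_sets` and `results` are hypothetical names.

```python
import numpy as np
import pandas as pd
import sklearn.linear_model as slm
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; only A9 is related to the outcome by construction.
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "A2": rng.normal(30, 10, n),
    "A3": rng.exponential(4, n),
    "A9": rng.integers(0, 2, n),
})
target = ((demo["A9"] + rng.normal(0, 0.4, n)) > 0.5).astype(int).to_numpy()

# Split row indices once so every model sees the same training and testing rows.
idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.2, random_state=0)

feature_sets = {"small": ["A2", "A3"], "large": ["A2", "A3", "A9"]}
results = {}
for name, cols in feature_sets.items():
    Xtr = demo.iloc[idx_tr][cols].to_numpy()
    Xte = demo.iloc[idx_te][cols].to_numpy()
    lpm = slm.LinearRegression().fit(Xtr, target[idx_tr])
    logit = slm.LogisticRegression(max_iter=1000).fit(Xtr, target[idx_tr])
    # Accuracy on held-out rows only; the LPM needs the 0.5 cutoff first.
    results[name] = {
        "lpm": np.mean((lpm.predict(Xte) >= 0.5).astype(int) == target[idx_te]),
        "logit": np.mean(logit.predict(Xte) == target[idx_te]),
    }

print(pd.DataFrame(results).T)
```

Sharing one split means differences between rows of the resulting table reflect the feature sets and model types rather than which rows happened to land in the testing part.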