# Credit card application study

<p style="text-align: justify;">
    <br /> 
For the past 10 years I worked at a Bank where one of my duties was credit analysis and approval. Back in 2010 we did them manually on an excel spreadsheet and with the documentation printed out. In the last few years we implemented a software based on rules that discarded or approved the credit application, but it allways seemed to me that it was an inflexible model, and the features were taken into account individually and not in a combiened way. So I came across this dataset from the UCI Machine Learning Repository about credit cards approvals and tried to implement a ML model to see if it can be predicted accurately.
    <br /> 
<br /> 
Altough te column names are available, the data was anonymized to protect the privacy.
</p>

---

### Extract and transform

In [1]:
import pandas as pd
df = pd.read_csv('datasets/cc_approvals.data', header=None)

In [2]:
df.columns = [['Gender',
               'Age',
               'Debt',
               'Married',
               'BankCustomer',
               'EducationLevel',
               'Ethnicity',
               'YearsEmployed',
               'PriorDefault',
               'Employed',
               'CreditScore',
               'DriversLicense',
               'Citizen',
               'ZipCode',
               'Income',
               'ApprovalStatus']]
df.head()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,ApprovalStatus
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [3]:
import numpy as np

# There is some values incorrectly labeled, for example:
print('incorrect label example : '+ df.iloc[673][0])

# Replace the '?'s with NaN
df = df.replace('?', np.NaN)

# There is almost 5% of missing data, as it isn't much we imputate the mean
df.fillna(df.mean(), inplace=True)

# Count the number of NaNs in the dataset to verify
print('')
print('Count of NaNs:')
print(df.isnull().sum())

for col in df.columns:
    # Check if the column is of object type
    if df[col].dtypes == 'object':
        # Impute with the most frequent value
        df = df.fillna(df[col].value_counts)

# Count the number of NaNs in the dataset and print the counts to verify
print('')
print('Count of NaNs after mean imputation:')
print(df.isna().sum())

incorrect label example : ?

Count of NaNs:
Gender            12
Age               12
Debt               0
Married            6
BankCustomer       6
EducationLevel     9
Ethnicity          9
YearsEmployed      0
PriorDefault       0
Employed           0
CreditScore        0
DriversLicense     0
Citizen            0
ZipCode           13
Income             0
ApprovalStatus     0
dtype: int64

Count of NaNs after mean imputation:
Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
DriversLicense    0
Citizen           0
ZipCode           0
Income            0
ApprovalStatus    0
dtype: int64


---

### ML models

#### Pre-procesisng

In [4]:
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings("ignore")

# Use LabelEncoder to transform labels to numeric
le = LabelEncoder()
for col in df.columns.values:
    if df[col].dtypes =='object':
        df[col]=le.fit_transform(df[col].astype(str))  

# Drop the features Driving License and ZipCode and convert the DataFrame to a NumPy array
df = df.drop(['DriversLicense', 'ZipCode'], axis=1)
#df = df.values

# Segregate features and labels into separate variables
#X,y = df[:,0:13].values , df[:,13].values
X = df.iloc[:, 0:13].values
y = df.iloc[:, 13].values

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0,1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.fit_transform(X_test)

def results(model):
    ''' Takes a clasiffication model executed 
    and returns accuracy and confusion matrix'''
    pd.options.display.float_format = '{:,.0f}'.format
    print('Accuracy: ', '{:,.2f}'.format(model.score(rescaledX_test, y_test)))
    print('')
    print('Confusion Matrix: ')
    print(confusion_matrix(y_test, y_pred))

# Set seed for reproducibility
seed = 1

#### Logistic Regression classifier

In [5]:
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression(random_state=seed)

# Fit to the train set and meassure
logreg.fit(rescaledX_train, y_train)
y_pred = logreg.predict(rescaledX_test)

# Results
results(logreg)

Accuracy:  0.84

Confusion Matrix: 
[[92 11]
 [26 99]]


#### Random Forest classifier

In [6]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate a Random Forest classifier with default parameter values
rforest = RandomForestClassifier(random_state=seed)

# Fit to the train set and meassure
rforest.fit(rescaledX_train, y_train)
y_pred = rforest.predict(rescaledX_test)

# Results
results(rforest)

Accuracy:  0.76

Confusion Matrix: 
[[92 11]
 [44 81]]


#### XGBoost classifier

In [7]:
import xgboost

# Instantiate a XGBoost classifier with default parameter values
xgboost = xgboost.XGBClassifier(random_state=seed)

# Fit to the train set and meassure
xgboost.fit(rescaledX_train, y_train)
y_pred = xgboost.predict(rescaledX_test)

# Results
results(xgboost)

Accuracy:  0.84

Confusion Matrix: 
[[ 89  14]
 [ 22 103]]


<p style="text-align: justify;">
As it is a credit card approval, the highest risk is in the customers that shouldn't get a credit card but they obtain it. Those are the false positives. So even though the LogReg and XGB preformed the same in accuracy, the count of false positives is less in the LogReg.


#### Hyperparameter tunning on selected model (LogReg)

In [8]:
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
param_grid = {'tol': [0.01, 0.001, 0.0001, 0.00001],
              'max_iter': np.arange(0,151,20)}

# Iter over parameters with a cross validation of 5
logreg_cv = GridSearchCV(logreg, param_grid, cv=5) 
logreg_cv.fit(X, y)

# Return best parameters
print(logreg_cv.best_params_)
logreg_cv.best_score_

{'max_iter': 40, 'tol': 0.0001}


0.8478260869565217

#### Export the final model

In [15]:
from joblib import dump

# Fit the best params
logreg = LogisticRegression()
logreg.set_params(**logreg_cv.best_params_)

# Fit to the train set and meassure
logreg.fit(rescaledX_train, y_train)

dump(logreg, 'cc_approval_logreg.joblib') 

['cc_approval_logreg.joblib']

---