# Credit card application study

<p style="text-align: justify;">
    <br /> 
For the past 10 years I worked at a Bank where one of my duties was credit analysis and approval. Back in 2010 we did them manually on an excel spreadsheet and with the documentation printed out. In the last few years we implemented a software based on rules that discarded or approved the credit application, but it always seemed to me that it was an inflexible model, and the features were taken into account individually and not in a combined way. So I came across this dataset from the UCI Machine Learning Repository about credit cards approvals and tried to implement a ML model to see if it can be predicted accurately.
    <br /> 
<br /> 
Although te column names are available, the data was anonymized to protect the privacy.
</p>

---

## 1. Extract and transform

In [1]:
import pandas as pd
df = pd.read_csv('datasets/cc_approvals.data', header=None)

In [2]:
df.columns = [['Gender',
               'Age',
               'Debt',
               'Married',
               'BankCustomer',
               'EducationLevel',
               'Ethnicity',
               'YearsEmployed',
               'PriorDefault',
               'Employed',
               'CreditScore',
               'DriversLicense',
               'Citizen',
               'ZipCode',
               'Income',
               'ApprovalStatus']]
df.head()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,ApprovalStatus
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [3]:
import numpy as np

# There is almost 5% of missing categorical data, as it isn't significant we impute the mean
df = df.replace('?', np.NaN)
df.fillna(df.mean(), inplace=True)

# For the numeric missing values we impute the most frequent value
for col in df.columns:
    if df[col].dtypes == 'object':
        df = df.fillna(df[col].value_counts)

assert df.isna().sum().sum(axis = 0, skipna = False) == 0

---

## 2. ML models

### 2.1 Pre-procesisng

In [4]:
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings("ignore")

# Use LabelEncoder to transform labels to numeric
le = LabelEncoder()
for col in df.columns.values:
    if df[col].dtypes =='object':
        df[col]=le.fit_transform(df[col].astype(str))  

# Drop non relevant features
df = df.drop(['DriversLicense', 'ZipCode','Ethnicity'], axis=1)

X = df.iloc[:, 0:-1].values
y = df.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)

# Rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0,1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

def print_performance(model):
    ''' Takes a classification model executed 
    and returns accuracy and confusion matrix'''
    pd.options.display.float_format = '{:,.0f}'.format
    print('Accuracy: ', '{:,.3f}'.format(model.score(X_test, y_test)))
    print('\n','Confusion Matrix: ')
    print(confusion_matrix(y_test, y_pred))

### 2.2 Logistic Regression classifier

In [5]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()

# Fit and predict
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

print_performance(log_reg)

Accuracy:  0.838

 Confusion Matrix: 
[[92 11]
 [26 99]]


### 2.3 Random Forest classifier

In [6]:
from sklearn.ensemble import RandomForestClassifier

r_forest = RandomForestClassifier(random_state=np.random.seed(5))

# Fit and predict
r_forest.fit(X_train, y_train)
y_pred = r_forest.predict(X_test)

print_performance(r_forest)

Accuracy:  0.855

 Confusion Matrix: 
[[ 91  12]
 [ 21 104]]


### 2.4 XGBoost classifier

In [7]:
import xgboost

xgboost = xgboost.XGBClassifier()

# Fit and predict
xgboost.fit(X_train, y_train)
y_pred = xgboost.predict(X_test)

print_performance(xgboost)

Accuracy:  0.855

 Confusion Matrix: 
[[ 89  14]
 [ 19 106]]


<p style="text-align: justify;">
As it is a credit card approval, the highest risk is in the customers that shouldn't get a credit card but they obtain it. Those are the false positives. So even though the RandomForest and XGB preformed the same in accuracy, the count of false positives is less in the RandomForest. We will try to improve Random Forest accuracy by tuning.


---

## 3. Hyperparameter tunning on Random Forest 

In [8]:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': np.arange(50,151,25),
              'max_depth': [2,3,4],
              'min_samples_leaf': [1,2],
              'min_samples_split':[2,3,4,5]
}

r_forest_cv = GridSearchCV(RandomForestClassifier(), param_grid, cv=4) 
r_forest_cv.fit(X, y)

print(r_forest_cv.best_params_)
r_forest_cv.best_score_

{'max_depth': 2, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 125}


0.8608695652173913

Even though the sample is small we managed to improve the accuracy by tuning hyperparameters. The best fit of the 3 is the Random Forest and it would be nice to try it with a bigger sample vs the XGboost to check if this persists.

---

## 4. Export the final model

In [9]:
from joblib import dump

# Fit the best params to the final model
r_forest = RandomForestClassifier()
r_forest.set_params(**r_forest_cv.best_params_)
r_forest.fit(X, y)

# Persist the final model
dump(r_forest, 'cc_approval_r_forest.joblib') 

['cc_approval_r_forest.joblib']

---