## Credit Card Approvals

In the present work we will do some machine learning with linear and non-linear models for the predictions of the approvals of credit card applications. We are going to use the free data set provided by the UCI Machine Learning Repository, I present the link:

http://archive.ics.uci.edu/ml/datasets/credit+approval

### Preprocessing

But first, we'll clean up a bit before we begin.

In [1]:
#
# Here we import everything you need
#
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

In [2]:
# 
# It is always good practice to open a file with the context manager 'with'
#
with open(r'C:\Users\Pau\temp\crx (1).txt','r') as file:
    
    data = file.readlines()
    
df = pd.DataFrame(data)
print(df.head())

                                                   0
0      b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n
1  a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560...
2    a,24.50,0.5,u,g,q,h,1.5,t,f,0,f,g,00280,824,+\n
3   b,27.83,1.54,u,g,w,v,3.75,t,t,05,t,g,00100,3,+\n
4   b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,00120,0,+\n


In [3]:
#
# We must to split the column 0, we use the comma as a separator 
#
df = pd.concat([df[[0]], df[0].str.split(',', expand=True)], axis=1)
print(df.head())

                                                  0  0      1      2  3  4   \
0      b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n  b  30.83      0  u  g   
1  a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560...  a  58.67   4.46  u  g   
2    a,24.50,0.5,u,g,q,h,1.5,t,f,0,f,g,00280,824,+\n  a  24.50    0.5  u  g   
3   b,27.83,1.54,u,g,w,v,3.75,t,t,05,t,g,00100,3,+\n  b  27.83   1.54  u  g   
4   b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,00120,0,+\n  b  20.17  5.625  u  g   

  5  6     7  8  9   10 11 12     13   14   15  
0  w  v  1.25  t  t  01  f  g  00202    0  +\n  
1  q  h  3.04  t  t  06  f  g  00043  560  +\n  
2  q  h   1.5  t  f   0  f  g  00280  824  +\n  
3  w  v  3.75  t  t  05  t  g  00100    3  +\n  
4  w  v  1.71  t  f   0  f  s  00120    0  +\n  


In [4]:
#
# Select the split columns
#
df = df.iloc[:, 1:]
print(df)

    0      1       2  3  4   5   6     7  8  9   10 11 12     13   14   15
0    b  30.83       0  u  g   w   v  1.25  t  t  01  f  g  00202    0  +\n
1    a  58.67    4.46  u  g   q   h  3.04  t  t  06  f  g  00043  560  +\n
2    a  24.50     0.5  u  g   q   h   1.5  t  f   0  f  g  00280  824  +\n
3    b  27.83    1.54  u  g   w   v  3.75  t  t  05  t  g  00100    3  +\n
4    b  20.17   5.625  u  g   w   v  1.71  t  f   0  f  s  00120    0  +\n
..  ..    ...     ... .. ..  ..  ..   ... .. ..  .. .. ..    ...  ...  ...
685  b  21.08  10.085  y  p   e   h  1.25  f  f   0  f  g  00260    0  -\n
686  a  22.67    0.75  u  g   c   v     2  f  t  02  t  g  00200  394  -\n
687  a  25.25    13.5  y  p  ff  ff     2  f  t  01  t  g  00200    1  -\n
688  b  17.92   0.205  u  g  aa   v  0.04  f  f   0  f  g  00280  750  -\n
689  b  35.00   3.375  u  g   c   h  8.29  f  f   0  t  g  00000    0  -\n

[690 rows x 16 columns]


For privacy, there is no name for data features, but we can get a good idea here:

http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html

Columns 2, 7, 10 and 14 seem to be numerical and 15 the output according to the blog.

In [5]:
#
# Change the type of the columns 2, 7, 10 and 14
#
l=[2,7,10,14]
for i in l:
    df[i] = pd.to_numeric(df[i])
print(df.dtypes)

0      object
1      object
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12     object
13     object
14      int64
15     object
dtype: object


In [6]:
#
# Print the shape, information and main statistics of the data frame
#
print('shape' + str(df.shape))
print("\n")
print('info')
print("\n")
print(df.info())
print("\n")
print('description')
print("\n")
print(df.describe())

shape(690, 16)


info


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB
None


description


               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean    

In [7]:
#
# Visual check for possible missing values
#
print(df.iloc[-20:-1,:])

    0      1       2  3  4   5   6      7  8  9   10 11 12     13   14   15
670  b  47.17   5.835  u  g   w   v  5.500  f  f   0  f  g  00465  150  -\n
671  b  25.83  12.835  u  g  cc   v  0.500  f  f   0  f  g  00000    2  -\n
672  a  50.25   0.835  u  g  aa   v  0.500  f  f   0  t  g  00240  117  -\n
673  ?  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -\n
674  a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -\n
675  a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -\n
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -\n
677  b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -\n
678  a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -\n
679  a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -\n
680  b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -\n
681  b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -\n
682  b  17.0

In [8]:
#
# Replace the '?'s with NaN
#
df = df.replace( '?', np.nan)

print(df.iloc[-20:-1])

      0      1       2  3  4   5   6      7  8  9   10 11 12     13   14   15
670    b  47.17   5.835  u  g   w   v  5.500  f  f   0  f  g  00465  150  -\n
671    b  25.83  12.835  u  g  cc   v  0.500  f  f   0  f  g  00000    2  -\n
672    a  50.25   0.835  u  g  aa   v  0.500  f  f   0  t  g  00240  117  -\n
673  NaN  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -\n
674    a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -\n
675    a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -\n
676    a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -\n
677    b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -\n
678    a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -\n
679    a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -\n
680    b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -\n
681    b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00

In [9]:
# fill the missing values with the mean of the corresponding column
df.fillna(df.mean(), inplace=True)

# Count the NaNs values to verify
print(df.isnull().sum())

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64


In [10]:
# non-numeric columns have no mean, so a good practice is to fill in missing values with the most frequent value.
for col in df.columns:
    # Check if the column is of object type
    if df[col].dtypes == 'object':
        # Impute with the most frequent value
        df[col] = df[col].fillna(df[col].value_counts().index[0])

# Count the NaNs values to verify
print(df.isnull().sum())

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64


### Machine Learning Model

#### Convert to numeric data

Now that we have pre-processed our data, we will move on to the machine learning process, but there is still a concern, most machine learning models only work with numeric data.

We will use LabelEncoder from scikit-learn packages to convert the object data to numeric. 

In [11]:
# Instantiate LabelEncoder
LE = LabelEncoder()

for i in df.columns:
    if df[i].dtypes == 'object':
    # Use LabelEncoder to do the numeric transformation
        df[i]=LE.fit_transform(df[i])

#### Feature Selection

The features 'DriverLicense' and 'ZipCode'(11 and 13 respectively) do not appear to be relevant for credit card approvals, so for the sake of simplicity, we discard them.

In [12]:
# Drop the features 11 and 13 
df2 = df.drop([11, 13], axis=1)
# Convert to numpy array
df2 = df2.to_numpy()

# Separate features and labels
X, y = df2[:,0:13] , df2[:,13]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                y,
                                test_size = 0.35,
                                random_state = 42) 

#### Scaling data

Machine learning models only see numbers, so if you have different ranges ... let's say thousands and tens, some features will be more important in the model than others. For example, 'Employee' only has values of zeros and ones ... while 'Income' is in thousands ... there is no way to compare to each other unless we scale them first.

In [13]:
Scaler = MinMaxScaler(feature_range=(0, 1)) 
rescaled_X_train = Scaler.fit_transform(X_train)
rescaled_X_test = Scaler.fit_transform(X_test)

### Linear Model : Logistic Regression

We choose Logistic Regression, it is a simple model but it works well in this type of classification problem where the labels are binary.

In [14]:
# Create an Instance of the classifier
lr = LogisticRegression() 

# Fit logreg to the train set

lr.fit(rescaled_X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [15]:
# Predict
y_pred = lr.predict(rescaled_X_test)

# Get the accuracy 
print("Accuracy : ", lr.score(rescaled_X_test, y_test))

# It is always good practice to do the confusion matrix in binary classification
print(confusion_matrix(y_test, y_pred))

Accuracy :  0.8512396694214877
[[100   9]
 [ 27 106]]


#### GridSearchCV

Its seems to work pretty well, but lets do some GridSearchCV, to see if we can get any improvement by changing the default hyperparameters.

In [16]:
# Create a dict for the hyper-parameters
param_grid = dict({'tol':[0.1, 0.01, 0.001, 0.0001], 'max_iter':[50, 100, 150, 200]})

# Instantiate GridSearchCV 
grid_model = GridSearchCV(estimator = lr, param_grid = param_grid, cv = 5)

# Fit data...
grid_model_result = grid_model.fit(rescaled_X_train, y_train)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.850362 using {'max_iter': 50, 'tol': 0.1}


In [17]:
# Finally, let's see how the best parameters work.
# Define the hyperparameters
lr1 = LogisticRegression(max_iter = 50, tol = 0.1) 

# Train the model...

lr1.fit(rescaled_X_train, y_train)

# Predict
y_pred1 = lr1.predict(rescaled_X_test)

# Get the accuracy ...
print("Accuracy : ", lr1.score(rescaled_X_test, y_test))

# Print the confusion matrix 
print(confusion_matrix(y_test, y_pred1))

Accuracy :  0.8512396694214877
[[100   9]
 [ 27 106]]


its returns the same as before... 

### Non-Linear Model: SVC

We can also try with a non-linear model to see if out perform the previous. Support Vector Machine Classifier is commonly used in classification problems.

In [18]:
# Create an Instance of the classifier
svm = SVC()

# Create a dict for the hyper-parameters
param_grid_svc = dict({'gamma' : [0.001, 0.01, 0.05, 0.1, 1, 2]}) # Smaller gamma leads to smoother limits ...

# Instantiate GridSearchCV 
grid_model_svc = GridSearchCV(estimator = svm, param_grid = param_grid_svc, cv = 5)

# Fit data...
grid_model_svc_result = grid_model_svc.fit(rescaled_X_train, y_train)

# Summarize results
best_score_svc, best_params_svc = grid_model_svc_result.best_score_, grid_model_svc_result.best_params_
print("Best: %f using %s" % (best_score_svc, best_params_svc))

Best: 0.857054 using {'gamma': 0.01}


In [19]:
# Finally, let's see how the best parameter works.

# Define the hyperparameter
svm1 = SVC(gamma = 0.01)

# Train the model...
svm1.fit(rescaled_X_train, y_train)

# Predict
y_pred_svm1 = svm1.predict(rescaled_X_test)

# Get the accuracy ...
print("Accuracy : ", svm1.score(rescaled_X_test, y_test))

# Print the confusion matrix 
print(confusion_matrix(y_test, y_pred_svm1))

Accuracy :  0.8512396694214877
[[101   8]
 [ 28 105]]


### Conclusions

The accuracy is roughly the same as the linear model. In fact, it is not a bad prediction at all, but we thought that we could do better.

Sometimes the simplest models are sufficient for these types of classification problems.