![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

In [163]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV


# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None) 
(cc_apps.head(10))
#CHANGE THE COLUMN NAMES TO string
cc_apps.columns = cc_apps.columns.astype(str)

### Inspecting the applications

In [164]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print("\n")

# Print DataFrame information
cc_apps_info = cc_apps.info()
print(cc_apps_info)

print("\n")

# Inspect missing values in the dataset
cc_apps.tail(30)



                2           7         10             12
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
660,b,22.25,9.0,u,g,aa,v,0.085,f,f,0,g,0,-
661,b,29.83,3.5,u,g,c,v,0.165,f,f,0,g,0,-
662,a,23.5,1.5,u,g,w,v,0.875,f,f,0,g,0,-
663,b,32.08,4.0,y,p,cc,v,1.5,f,f,0,g,0,-
664,b,31.08,1.5,y,p,w,v,0.04,f,f,0,s,0,-
665,b,31.83,0.04,y,p,m,v,0.04,f,f,0,g,0,-
666,a,21.75,11.75,u,g,c,v,0.25,f,f,0,g,0,-
667,a,17.92,0.54,u,g,c,v,1.75,f,t,1,g,5,-
668,b,30.33,0.5,u,g,d,h,0.085,f,f,0,s,0,-
669,b,51.83,2.04,y,p,ff,ff,1.5,f,f,0,g,1,-


### split the data into train and test sets

In [165]:

cc_apps_train , cc_apps_test= train_test_split(cc_apps, test_size=0.33, random_state=42)

### missing values handling

In [166]:
# Replace the '?'s with NaN
cc_apps_train = cc_apps_train.replace('?', np.nan)
cc_apps_test = cc_apps_test.replace('?', np.nan)

# Impute the missing values with mean imputation for the numeric columns only
numeric_columns = cc_apps_train.select_dtypes(include=np.number).columns
cc_apps_train[numeric_columns] = cc_apps_train[numeric_columns].fillna(cc_apps_train[numeric_columns].mean())
cc_apps_test[numeric_columns] = cc_apps_test[numeric_columns].fillna(cc_apps_train[numeric_columns].mean())

# Impute the missing values with the most frequent value for the categorical columns only

categorical_columns = cc_apps_train.select_dtypes(include='object').columns
cc_apps_train[categorical_columns] = cc_apps_train[categorical_columns].fillna(cc_apps_train[categorical_columns].mode().iloc[0])
cc_apps_test[categorical_columns] = cc_apps_test[categorical_columns].fillna(cc_apps_train[categorical_columns].mode().iloc[0])


#check for missing values
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())



0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
dtype: int64
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
dtype: int64


### Preprocessing the data


In [167]:
#USE GET DUMMIES TO CONVERT CATEGORICAL DATA TO NUMERIC
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)


#reindex the columns in the test set to match the train set
cc_apps_test = cc_apps_test.reindex(columns = cc_apps_train.columns, fill_value=0)


#split the data into features and target
X_train, y_train = cc_apps_train.drop(columns=cc_apps_train.columns[-1]), cc_apps_train[cc_apps_train.columns[-1]]
X_test, y_test = cc_apps_test.drop(columns=cc_apps_test.columns[-1]), cc_apps_test[cc_apps_test.columns[-1]]






In [168]:
# Instantiate the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the features and transform only the numeric columns
X_train[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])
X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])



### Fitting different models with grid search

In [169]:
# Instantiate the classifier
models = [LogisticRegression(), RandomForestClassifier(random_state=42)]
model_names = ['Logistic Regression', 'AdaBoost', 'Gradient Boosting', 'Random Forest']
for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name} confusion matrix:\n {confusion_matrix(y_test, y_pred)}")

# Hyperparameter tuning
# logistic regression
param_grid_lr = {'C': [0.01, 0.1, 1, 10, 100],
                 'max_iter': [100, 200, 300]}
lr_cv = GridSearchCV(LogisticRegression(random_state=42), param_grid_lr, cv=5, n_jobs=-1)
lr_cv.fit(X_train, y_train)
print(f"Best parameters for Logistic Regression: {lr_cv.best_params_}")


# Random Forest
param_grid_rf = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 7]}
rf_cv = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, n_jobs=-1)
rf_cv.fit(X_train, y_train)
print(f"Best parameters for Random Forest: {rf_cv.best_params_}")

best_models = [lr_cv.best_estimator_, rf_cv.best_estimator_]
scores= [lr_cv.best_score_, rf_cv.best_score_]




Logistic Regression confusion matrix:
 [[103   0]
 [  0 125]]
AdaBoost confusion matrix:
 [[102   1]
 [  0 125]]
Best parameters for Logistic Regression: {'C': 1, 'max_iter': 100}
Best parameters for Random Forest: {'max_depth': 7, 'n_estimators': 50}


### final scores

In [170]:
# max accuracy scores
best_scores = max(scores)


Logistic Regression accuracy: 1.0
Random Forest accuracy: 0.9780701754385965
