![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of credit card applications. Many are rejected, for reasons such as high loan balances, low income, or too many inquiries on the applicant's credit report. Reviewing these applications manually is mundane, error-prone, and time-consuming (and time is money!). Luckily, the task can be automated with machine learning, and most commercial banks do so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

In [40]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None) 
cc_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,g,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,g,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,g,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,s,0,+



### 1. Preprocessing the data



In [41]:
# This code creates a copy of the DataFrame `cc_apps` and assigns it to `cc_apps_cleaned`.
# It then replaces all occurrences of the '?' character in `cc_apps_cleaned` with NaN (Not a Number) values.

cc_apps_cleaned = cc_apps.copy()
cc_apps_cleaned.replace('?', np.nan, inplace=True)

In [42]:
print(cc_apps_cleaned.dtypes)

0      object
1      object
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12      int64
13     object
dtype: object
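
Column 1 is reported as `object` even though the preview shows numbers in it (30.83, 58.67, ...). A likely cause is that the '?' placeholders made pandas read the whole column as text. The following is a minimal sketch of one way to recover a numeric dtype, using a small made-up frame; `toy` and its values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the situation: a numeric column that was read
# as text because one entry is the '?' placeholder.
toy = pd.DataFrame({1: ["30.83", "58.67", "?", "27.83"]})

# pd.to_numeric parses the strings; errors="coerce" turns anything that
# cannot be parsed (here '?') into NaN instead of raising an error.
toy[1] = pd.to_numeric(toy[1], errors="coerce")

print(toy.dtypes)
print(toy[1].isna().sum())
```

After a conversion like this, the column would be imputed with its mean and passed through `pd.get_dummies` unchanged, rather than being treated as a categorical column.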


In [43]:
# Count how many times each distinct value occurs in column 0 of `cc_apps_cleaned`.
cc_apps_cleaned[0].value_counts()

b    468
a    210
Name: 0, dtype: int64

In [44]:
# Compute the mean of each numeric column in `cc_apps_cleaned`.
# `numeric_only=True` restricts the calculation to numeric columns; recent pandas
# versions raise an error on text columns without it.
# The result is a Series indexed by column name.

cc_apps_cleaned.mean(numeric_only=True)

2        4.758725
7        2.223406
10       2.400000
12    1017.385507
dtype: float64

In [45]:
# Fill missing values column by column, choosing the strategy from the column's dtype.
# Note: column 1 holds numbers but is still `object` here, so it is filled with its
# most frequent value rather than its mean.

for col in cc_apps_cleaned.columns:
    # Check if the column's data type is 'object' (typically used for categorical data)
    if cc_apps_cleaned[col].dtype == "object":
        # Find the most frequent value in the column
        most_frequent = cc_apps_cleaned[col].value_counts().index[0]
        # Assign the filled column back; calling fillna(..., inplace=True) on a
        # selected column is deprecated and may not update the DataFrame.
        cc_apps_cleaned[col] = cc_apps_cleaned[col].fillna(most_frequent)
    else:
        # Calculate the mean of the column
        mean_value = cc_apps_cleaned[col].mean()
        # Fill missing values in the column with the mean value
        cc_apps_cleaned[col] = cc_apps_cleaned[col].fillna(mean_value)
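
To see the effect of this rule on a small scale, here is a sketch on a made-up frame. `impute_by_dtype` is a hypothetical helper name, and it uses `mode()` rather than `value_counts()` to find the most frequent value, so treat it as an illustration rather than the notebook's exact method.

```python
import numpy as np
import pandas as pd

def impute_by_dtype(df):
    """Return a copy with NaNs filled: mean for numeric columns, mode otherwise."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() ignores NaN and returns the most frequent value(s)
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Illustrative data: one text column and one numeric column, each with a gap.
toy = pd.DataFrame({
    "kind": ["a", "b", "b", np.nan],
    "amount": [1.0, np.nan, 3.0, 5.0],
})
filled = impute_by_dtype(toy)
print(filled)
```

Returning a new DataFrame instead of modifying the input makes it easy to compare the data before and after imputation.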

In [46]:
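# Convert the text columns in `cc_apps_cleaned` to numeric indicator (dummy) columns
# with `pd.get_dummies`.

# `drop_first=True` drops the first category of each text column, since that category
# is implied whenever all the remaining indicators for the column are 0.

# Column 1 is still text at this point, so each distinct value in it becomes its own
# indicator (`1_15.17`, `1_15.75`, ...), which is why the result has 383 columns.

# The target (column 13) is encoded as well: `13_-` is 1 where the original label
# is '-' and 0 where it is '+'.

# The encoded DataFrame is stored in `cc_app_dummies`.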

cc_app_dummies = pd.get_dummies(cc_apps_cleaned, drop_first=True)
print(cc_app_dummies.head())

       2     7  10   12  0_b  1_15.17  ...  6_z  8_t  9_t  11_p  11_s  13_-
0  0.000  1.25   1    0    1        0  ...    0    1    1     0     0     0
1  4.460  3.04   6  560    0        0  ...    0    1    1     0     0     0
2  0.500  1.50   0  824    0        0  ...    0    1    0     0     0     0
3  1.540  3.75   5    3    1        0  ...    0    1    1     0     0     0
4  5.625  1.71   0    0    1        0  ...    0    1    0     0     1     0

[5 rows x 383 columns]
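
The behavior of `drop_first=True`, and the way a numeric column stored as text inflates the column count, can be seen on a tiny made-up frame. The column names and values below are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0],
    "color": ["red", "green", "blue"],
})

# Numeric columns pass through unchanged. Each text column becomes one
# indicator per category, minus the first category in sorted order ("blue").
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['amount', 'color_green', 'color_red']

# A numeric column stored as text is treated like any other text column:
# every distinct value (except the first) gets its own indicator.
as_text = pd.get_dummies(toy.astype({"amount": str}), drop_first=True)
print(as_text.shape[1])  # 4
```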


**2. Prepare the data for Modeling**

In [47]:
print(cc_app_dummies.columns)

Index([        2,         7,        10,        12,     '0_b', '1_15.17',
       '1_15.75', '1_15.83', '1_15.92', '1_16.00',
       ...
           '6_j',     '6_n',     '6_o',     '6_v',     '6_z',     '8_t',
           '9_t',    '11_p',    '11_s',    '13_-'],
      dtype='object', length=383)


In [48]:
# This code is used to separate the features (X) and the target variable (y) from the DataFrame `cc_app_dummies`.

# `X` will contain all the columns except the last one.
# `iloc[:, :-1]` selects all rows and all columns except the last one.
X = cc_app_dummies.iloc[:, :-1].values

# `y` will contain only the last column, `13_-` (1 where the original label is '-', 0 where it is '+').
# `iloc[:, -1]` selects all rows and only the last column.
y = cc_app_dummies.iloc[:, -1].values

In [49]:
# This code splits the dataset into training and testing sets.

# `train_test_split` is a function from the `sklearn.model_selection` module.
# It takes the features (X) and the target variable (y) and splits them into training and testing sets.

# `test_size=0.2` means that 20% of the data will be used for testing, and 80% will be used for training.

# `random_state=42` is a seed value to ensure that the split is reproducible. 
# Using the same seed value will always produce the same split.

# `X_train` and `y_train` will contain the training data.
# `X_test` and `y_test` will contain the testing data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
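
As a quick check of what `test_size` and `random_state` do, the sketch below splits a small made-up array. The names `X_toy` and `y_toy` are illustrative and unrelated to the credit card data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Repeating the call with the same seed reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```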

In [50]:
# This code scales the features in the training and test sets using StandardScaler from the sklearn library.

# StandardScaler standardizes features by removing the mean and scaling to unit variance.
# This means each feature will have a mean of 0 and a standard deviation of 1.

# Create an instance of StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform the training data
# `fit_transform` calculates the mean and standard deviation on the training data and scales it accordingly.
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data using the same scaler
# `transform` scales the test data using the mean and standard deviation calculated from the training data.
X_test_scaled = scaler.transform(X_test)


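# Compare the raw features with the scaled ones.
# Note: the second line prints the mean of the scaled *training* set but the standard
# deviation of the scaled *test* set, so the latter is not expected to be exactly 1.
print(np.mean(X), np.std(X))
print(np.mean(X_train_scaled), np.std(X_test_scaled))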

2.703086994460885 271.4041878829651
5.315659219504316e-18 0.8309442044804253
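
The guarantee that `StandardScaler` gives is per column and applies to the data it was fitted on. The sketch below checks this on made-up data with two features on very different scales; all names and values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: two features with different means and spreads.
X_train_toy = rng.normal(loc=[5.0, 500.0], scale=[1.0, 100.0], size=(80, 2))
X_test_toy = rng.normal(loc=[5.0, 500.0], scale=[1.0, 100.0], size=(20, 2))

toy_scaler = StandardScaler()
X_train_s = toy_scaler.fit_transform(X_train_toy)
X_test_s = toy_scaler.transform(X_test_toy)

# Per-column statistics of the training set are 0 and 1 by construction.
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))

# The test set reuses the training statistics, so its per-column mean and
# standard deviation are only approximately 0 and 1.
print(X_test_s.mean(axis=0), X_test_s.std(axis=0))
```

Checking the statistics with `axis=0` is more informative than a single global `np.mean` or `np.std`, because scaling is applied to each feature separately.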


**3. Training the Model**

In [51]:
# Create an instance of LogisticRegression
logreg = LogisticRegression()

# Fit the logistic regression model on the scaled training data
# `fit` trains the model using the training data and the corresponding labels
logreg.fit(X_train_scaled, y_train)

# Predict the labels for the scaled test data
# `predict` uses the trained model to make predictions on the test data
y_pred = logreg.predict(X_test_scaled)

# Compute the confusion matrix to evaluate the accuracy of the classification
# `confusion_matrix` compares the true labels (y_test) with the predicted labels (y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# The confusion matrix is stored in the variable `conf_matrix`
print(conf_matrix)

# The output is a confusion matrix, a table used to evaluate a classification model.
# Rows are the true labels (y_test) and columns are the predicted labels (y_pred),
# so with labels 0 and 1 the layout is [[TN, FP], [FN, TP]].
# From these four counts you can derive accuracy, precision, recall, and other metrics.

[[52 18]
 [12 56]]
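
The usual metrics can be worked out by hand from these four counts. The sketch below assumes scikit-learn's layout for a binary problem, where rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]` with class 1 as the positive class.

```python
# Counts copied from the matrix above, read as [[TN, FP], [FN, TP]].
tn, fp = 52, 18
fn, tp = 12, 56

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that were correct
precision = tp / (tp + fp)     # of the predicted 1s, how many were truly 1
recall = tp / (tp + fn)        # of the true 1s, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.783 precision=0.757 recall=0.824
```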


**4. Finding the best scoring model**

In [52]:
# Define a dictionary `params_grid` that contains the hyperparameters to be tuned.
# "C" is the inverse of regularization strength; smaller values specify stronger regularization.
# "penalty" specifies the norm used in the penalization ('l1' for L1 norm, 'l2' for L2 norm).
# "solver" specifies the algorithm to use in the optimization problem ('liblinear' is a good choice for small datasets).
params_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ['l1', 'l2'],
    "solver":['liblinear']
}

# Create an instance of GridSearchCV.
# `estimator` is the model to be used (LogisticRegression in this case).
# `param_grid` is the dictionary of hyperparameters to be tuned.
# `cv` is the number of cross-validation folds (5 in this case).
grid_search = GridSearchCV(
    estimator = LogisticRegression(),
    param_grid = params_grid,
    cv = 5
)

# Fit the GridSearchCV object to the scaled training data.
# This will train the model for each combination of hyperparameters in `param_grid`
# and evaluate it using 5-fold cross-validation.
grid_search.fit(X_train_scaled, y_train)

# `grid_search.best_score_` contains the best mean cross-validated score obtained
# during the grid search.
grid_search.best_score_

0.8768550368550369
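
One caveat with this setup is that the scaler was fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling statistics. A common alternative is to wrap the scaler and the model in a `Pipeline` and tune that instead. The sketch below uses synthetic data from `make_classification`, not the credit card dataset, and tunes only `C` to stay short; the step names `scaler` and `clf` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, used only to make the example runnable.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1, 10]}, cv=5)

# Pass unscaled data: the pipeline refits the scaler inside each CV fold.
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print(grid.score(X_te, y_te))
```

With this arrangement the same object handles scaling and prediction, so the test data can also be passed in unscaled.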

In [53]:
# `grid_search.best_estimator_` retrieves the model that gave the best performance
# during the grid search. This model is already trained on the entire training set
# using the best combination of hyperparameters found during the grid search.
best_model = grid_search.best_estimator_

# `print(best_model)` outputs the details of the best model, including the hyperparameters
# that were used. This helps in understanding which combination of hyperparameters
# resulted in the best performance.
print(best_model)

LogisticRegression(C=1, penalty='l1', solver='liblinear')


In [54]:
# `GridSearchCV` refits the best estimator on the full training set by default
# (`refit=True`), so this call is redundant: it trains the same model again on the
# same scaled training data (`X_train_scaled`) and labels (`y_train`).

best_model.fit(X_train_scaled, y_train)

# For a classifier such as `LogisticRegression`, `score` returns the mean accuracy
# on the given data. Here it is evaluated on the scaled test data (`X_test_scaled`)
# and the corresponding labels (`y_test`), and stored in `best_score`.

best_score = best_model.score(X_test_scaled, y_test)

# Print the test accuracy, which indicates how well the model generalizes to unseen data.

print(best_score)

0.782608695652174
