<h1 style='font-weight: bold; color: #5AA49D; font-size: 3rem'>Credit Card Approval Machine Learning Model</h1>
<p><u>By:</u> Minh Nguyen</p>
<p><u>Date:</u> April 30th 2022</p>
<p><u>Data Source:</u> UCI Machine Learning Repository (<a href='https://archive.ics.uci.edu/ml/datasets/Credit+Approval?msclkid=200008bdc4a311ec9f500a3245a2bfb1'>UCI</a>)</p>

<h2 style='font-weight: bold; color: #5AA49D'>1. Introduction</h2>

<p>How do banks know whether or not to approve you for a credit card based on just some information about you? Thanks to Machine Learning (ML), many banks have been able to build their own models to predict how reliable an applicant is. For this project, I will apply different classification ML models to the data and pick the model that performs best (the highest accuracy).</p>
<p>Special thanks to <a href='https://www.kaggle.com/samuelcortinhas'>SAMUEL CORTINHAS</a> for cleaning and transforming the data into CSV format.</p>

<h2 style='font-weight: bold; color: #5AA49D'>2. Data Analysis</h2>

<p>Before we create a model, let's do some exploratory data analysis to find insights, trends, and outliers from the data.</p>

<h3 style='color: #5AA49D'>2.1. Importing libraries and data</h3>

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

from sklearn.model_selection import KFold
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.feature_selection import RFE

In [None]:
df = pd.read_csv('../input/credit-card-approval-clean-data/clean_dataset.csv', dtype={'ZipCode': str})

<h3 style='color: #5AA49D'>2.2. Data Overview</h3>

In [None]:
df.head()

In [None]:
print(df.columns)

<h4>Data Information</h4>
<ul>
    <li><b>Gender:</b> 1=Male, 0=Female</li>
    <li><b>Age:</b> in years</li>
    <li><b>Debt:</b> outstanding debt (scaled)</li>
    <li><b>Married:</b> 1=Married, 0=Single/Divorced/etc.</li>
    <li><b>BankCustomer:</b> 1=has a bank account, 0=doesn't have a bank account</li>
    <li><b>Industry:</b> current or most recent job sector</li>
    <li><b>Ethnicity:</b> ethnicity</li>
    <li><b>YearsEmployed:</b> years employed</li>
    <li><b>PriorDefault:</b> 1=has prior default, 0=no prior default</li>
    <li><b>Employed:</b> 1=employed, 0=unemployed</li>
    <li><b>CreditScore:</b> credit score (scaled)</li>
    <li><b>DriversLicense:</b> 1=has driver license, 0=no driver license</li>
    <li><b>Citizen:</b> citizenship, either ByBirth, ByOtherMeans or Temporary</li>
    <li><b>ZipCode:</b> zip code</li>
    <li><b>Income:</b> income (scaled)</li>
    <li><b>Approved:</b> 1=approved, 0=not approved</li>
</ul>

<p>As you can see from the table, some numeric values like debt and income are scaled, which means that the values do not represent actual amounts. For example, an income of 560 doesn't mean an income of $560/year; the value 560 has been scaled relative to the rest of the data in that feature.</p>

In [None]:
cols = ['Industry', 'Ethnicity', 'Citizen']

for col in cols:
    print(f'--- {col} ---')
    print(df[col].unique())
    print('\n')

In [None]:
print(f'Number of rows: {df.shape[0]}\nNumber of columns: {df.shape[1]}')

In [None]:
print(f'Number of null values: {df.isnull().values.sum()}')
print(f'Number of duplicated values: {df.duplicated().values.sum()}')

<h3 style='color: #5AA49D'>2.3. Data Visualization</h3>

In [None]:
# setting graphing format

plt.rcParams['figure.figsize'] = (10, 8)
font_fmt = {'fontweight': 'bold',
           'fontsize': 20}


In [None]:
df.head()

In [None]:
numeric_cols = ['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income']
boolean_cols = ['Gender', 'Married', 'BankCustomer', 'PriorDefault', 'Employed', 'DriversLicense', 'Approved']
string_cols = ['Industry', 'Ethnicity', 'Citizen']

In [None]:
plt.figure(figsize=(18, 12))

for i, plot in enumerate(numeric_cols):
    plt.subplot(2, 3, i + 1)
    plt.title(f'{plot}', fontdict=font_fmt)
    plt.subplots_adjust(hspace=0.2)
    sns.boxplot(data=df, y=plot)

In [None]:
for col in numeric_cols:
    print(f'--- {col} ---')
    print(df[col].describe())
    print('\n')

In [None]:
plt.figure(figsize=(18, 12))

for i, plot in enumerate(numeric_cols):
    plt.subplot(2, 3, i + 1)
    plt.title(f'{plot} by Approval', fontdict=font_fmt)
    plt.subplots_adjust(hspace=0.2)
    sns.boxplot(data=df, y=plot, x='Approved')
    plt.xlabel('')


plt.show()

In [None]:
for col in numeric_cols:
    for a in range(2):
        if a == 0:
            print(f'--- {col} (Not Approved) ---')
        else:
            print(f'--- {col} (Approved) ---')
        print(df[col][df['Approved']==a].describe())
        print('\n')

In [None]:
plt.figure(figsize=(18, 12))

for i, plot in enumerate(boolean_cols):
    plt.subplot(3, 3, i + 1)
    plt.title(plot, fontdict=font_fmt)
    plt.subplots_adjust(hspace=0.3)
    sns.countplot(x=df[plot])
    plt.xlabel('')

In [None]:
plt.figure(figsize=(18, 12))

for i, plot in enumerate(boolean_cols):
    if plot != 'Approved':
        plt.subplot(3, 3, i + 1)
        plt.title(f'{plot} by approval', fontdict=font_fmt)
        plt.subplots_adjust(hspace=0.3)
        # the feature goes on the x-axis and Approved is the hue,
        # so the legend labels below match the hue levels (0 and 1)
        sns.countplot(data=df, x=plot, hue='Approved')
        plt.xlabel('')
        plt.legend(['Not Approved', 'Approved'], loc='upper right')

In [None]:
# string_cols does not contain ZipCode, so no filtering is needed here
for i, plot in enumerate(string_cols):
    plt.subplot(2, 2, i + 1)
    plt.title(f'{plot}', fontdict=font_fmt)
    plt.subplots_adjust(wspace=0.3, hspace=0.3)
    sns.countplot(y=df[plot])

In [None]:
# catplot is a figure-level function that creates its own figure,
# so a separate plt.figure() call would only produce an empty figure
for plot in string_cols:
    sns.catplot(y=plot, col="Approved",
                data=df, kind="count",
                height=6, aspect=1,
                order=df[plot].value_counts().index)

<h3 style='color: #5AA49D'>2.4. Key Findings</h3>

<ul>
    <li>Based on descriptive analysis, <b>YearsEmployed</b>, <b>CreditScore</b>, and <b>Income</b> might affect the chance of credit card approval; the higher the numbers, the higher the chance of approval.</li>
    <li><b>PriorDefault</b> and <b>Employed</b> also seem to be significant factors in determining approval status. In this dataset, having one or more prior defaults (PriorDefault = 1) is associated with a higher chance of approval. Being employed is also associated with a higher chance of approval.</li>
</ul>
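<p>The findings above come from reading the plots. As a rough numeric check, the approval rate within each level of a boolean feature can be computed directly. The sketch below uses a small made-up table (<code>toy</code>) and a hypothetical helper (<code>approval_rate_by</code>); neither is part of the original notebook, and with the real data <code>df</code> would take the place of <code>toy</code>.</p>

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the real dataset
toy = pd.DataFrame({
    'PriorDefault': [1, 1, 1, 0, 0, 0, 1, 0],
    'Employed':     [1, 0, 1, 0, 0, 1, 1, 0],
    'Approved':     [1, 1, 1, 0, 0, 1, 0, 0],
})


def approval_rate_by(frame, col, target='Approved'):
    """Share of approved applications within each level of `col`."""
    return frame.groupby(col)[target].mean()


for col in ['PriorDefault', 'Employed']:
    print(f'--- {col} ---')
    print(approval_rate_by(toy, col))
    print('\n')
```

<p>Because <b>Approved</b> is coded as 0/1, its mean within a group is the approval rate for that group.</p>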

<h2 style='color: #5AA49D'>3. Data Wrangling</h2>

In [None]:
df.head()

In [None]:
# Remove ZipCode, Industry, Ethnicity, and Citizen from the training dataset.
# I think using this information to decide credit card approval would be unethical.

df_copy = df.drop(string_cols + ['ZipCode'], axis=1)
print(string_cols)
df_copy.head()

In [None]:
# splitting X (variables) and y (output)
X = df_copy.drop('Approved', axis=1)
y = df_copy['Approved']

In [None]:
# rescaling data
sc = MinMaxScaler(feature_range=(0,1))

X = sc.fit_transform(X)
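<p>One caveat with scaling the full dataset up front is that the scaler sees the rows that will later be used for testing. A common alternative, sketched below, is to put the scaler inside a <code>Pipeline</code> so that it is refit on the training folds only. This is an illustration on synthetic data from <code>make_classification</code>, not the method used in this notebook; the names <code>X_demo</code>, <code>y_demo</code>, <code>scaled_lr</code>, and <code>cv_demo</code> are assumptions made for the example.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the credit data (assumption: 11 numeric features)
X_demo, y_demo = make_classification(n_samples=300, n_features=11,
                                     n_informative=5, random_state=10)

# The scaler is a pipeline step, so cross_val_score fits it
# on each training fold and only applies it to the held-out fold
scaled_lr = Pipeline(steps=[
    ('scale', MinMaxScaler(feature_range=(0, 1))),
    ('lr', LogisticRegression(max_iter=1000)),
])

cv_demo = KFold(n_splits=5, shuffle=True, random_state=10)
demo_scores = cross_val_score(scaled_lr, X_demo, y_demo,
                              cv=cv_demo, scoring='accuracy')
print(f'{demo_scores.mean():.4f} ({demo_scores.std():.4f})')
```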

<h2 style='color: #5AA49D'>4. Machine Learning Model</h2>

<p>First, we will evaluate different classification models to see which produces the best accuracy score on this dataset.</p>
<p>Then, we can work on improving and verifying our model's accuracy.</p>

In [None]:
# this cell of code is copied from Edureka! with some slight modifications

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn

for name, model in models:
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    scoring = f"{name}: {round(cv_results.mean(), 4)} ({round(cv_results.std(),4)})"
    print(scoring)


<p><b>Logistic Regression</b> and <b>Linear Discriminant Analysis</b> achieve the best accuracy scores among these classification models (84.49% and 85.36%, respectively).</p>
<p>According to Wikipedia,</p>
<ul>
    <li><a href='https://en.wikipedia.org/wiki/Logistic_regression'>Logistic Regression</a> "is a statistical model that models the probability of one event (out of two alternatives) taking place by having the log-odds (the logarithm of the odds) for the event be a linear combination of one or more independent variables ('predictors')".</li>
    <li><a href="https://en.wikipedia.org/wiki/Linear_discriminant_analysis">Linear Discriminant Analysis</a> "is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification."</li>
</ul>
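<p>To make the log-odds definition of Logistic Regression concrete, the sketch below fits a model on synthetic data and rebuilds its predicted probabilities by hand: a linear combination of the features gives the log-odds, and the sigmoid function turns the log-odds into a probability. The data and the names <code>X_toy</code>, <code>y_toy</code>, and <code>lr_toy</code> are assumptions made for this example only.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the credit card dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=10)

lr_toy = LogisticRegression().fit(X_toy, y_toy)

# log-odds = linear combination of the predictors plus the intercept
log_odds = X_toy @ lr_toy.coef_.ravel() + lr_toy.intercept_[0]

# the sigmoid function maps log-odds to a probability between 0 and 1
manual_proba = 1.0 / (1.0 + np.exp(-log_odds))

# probability of class 1 as reported by scikit-learn
sklearn_proba = lr_toy.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))
```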
<p>For this project, I will be using <b>Logistic Regression</b> since it is fairly easy to understand while still achieving a good accuracy score.</p>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

test_model = LogisticRegression()
test_model.fit(X_train, y_train)

predictions = test_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy score for Logistic Regression model: {accuracy}')

<p>An accuracy score of 89.86% on the test set is very good, although it comes from a single train/test split.</p>
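<p>Accuracy alone does not show which kind of mistake a model makes. The <code>confusion_matrix</code> and <code>classification_report</code> functions imported at the top of this notebook can break the result down by class. The sketch below runs on synthetic data, so its numbers say nothing about the credit card model; all variable names here are assumptions made for the example.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the credit card dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=10)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=10)

clf = LogisticRegression().fit(Xs_train, ys_train)
ys_pred = clf.predict(Xs_test)

# rows are the true classes, columns are the predicted classes
cm = confusion_matrix(ys_test, ys_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f'False approvals: {fp}, false rejections: {fn}')

# precision, recall, and F1 for each class
print(classification_report(ys_test, ys_pred, labels=[0, 1],
                            target_names=['Not Approved', 'Approved']))
```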
<p>Let's try to improve the accuracy of the model and make sure that our accuracy estimate is not overly optimistic using:</p>
<ul>
    <li><a href='https://machinelearningmastery.com/rfe-feature-selection-in-python/'>Recursive Feature Elimination</a> (RFE) is a feature selection algorithm that repeatedly fits the model, ranks the features by importance (for Logistic Regression, by the size of their coefficients), and eliminates the weakest ones until the requested number of features remains.</li>
    <li><a href='https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/'>Repeated K-Fold cross validation</a> runs K-Fold cross validation several times with different splits, which helps to reduce the noise in the accuracy estimate compared to a single KFold run.</li>
</ul>
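<p>The two techniques above can be sketched on synthetic data before applying them to the credit card dataset. The example below is an illustration only: the data comes from <code>make_classification</code>, and the names <code>X_rfe</code>, <code>y_rfe</code>, <code>selector</code>, and <code>rkf</code> are assumptions, not part of the original notebook.</p>

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data: 8 features, of which 3 are informative
X_rfe, y_rfe = make_classification(n_samples=300, n_features=8,
                                   n_informative=3, n_redundant=0,
                                   random_state=10)

# RFE drops the weakest feature one at a time until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_rfe, y_rfe)
print(selector.support_)   # True for the features that were kept
print(selector.ranking_)   # 1 = kept, larger numbers were eliminated earlier

# 5 folds repeated 3 times gives 15 accuracy scores instead of 5
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=10)
rkf_scores = cross_val_score(LogisticRegression(max_iter=1000),
                             X_rfe, y_rfe, cv=rkf, scoring='accuracy')
print(len(rkf_scores), rkf_scores.mean(), rkf_scores.std())
```

<p>Averaging over more splits makes the mean accuracy less dependent on how any single split happened to fall.</p>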

In [None]:
# we don't know how many features to choose so 5 features can be a good start for our RFE

rfe = RFE(LogisticRegression(), n_features_to_select=5)
rfe.fit(X_train, y_train)

In [None]:
df_rank = pd.DataFrame(data=rfe.ranking_, index=df_copy.columns[:-1])
df_rank.rename(columns={0:'rank'}, inplace=True)
df_rank.sort_values('rank', ascending=True)

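<p>When selecting 5 features, <b>BankCustomer</b>, <b>YearsEmployed</b>, <b>PriorDefault</b>, <b>Employed</b>, and <b>CreditScore</b> seem to be the 5 most important factors in our prediction model.</p>
<p>Now, we can create a pipeline that combines RFE with Logistic Regression and evaluate it with Repeated K-Fold cross validation for every possible number of selected features.</p>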

In [None]:
results = []

for i in range(1, df_rank.shape[0]+1):
    pipeline = Pipeline(steps=[('rfe', RFE(LogisticRegression(), n_features_to_select=i)), ('lg', LogisticRegression())])
    cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=10)

    scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')
    
    results.append([i, scores.mean(), scores.std()])

In [None]:
df_pipeline = pd.DataFrame(data=results, columns=['n_feature', 'mean', 'std'])
df_pipeline

In [None]:
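# both statistics are computed over the per-feature-count mean accuracies
print(f"Mean of CV accuracy across feature counts: {df_pipeline['mean'].mean()}")
print(f"Std of mean CV accuracy across feature counts: {df_pipeline['mean'].std()}")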

In [None]:
df_pipeline[df_pipeline['mean'] == df_pipeline['mean'].max()]

<p>Selecting 6 of the 11 features in the dataset gives the best accuracy.</p>
<p>Let's look at the ranking of each feature:</p>

In [None]:
rfe = RFE(LogisticRegression(), n_features_to_select=6)
rfe.fit(X_train, y_train)
df_rank = pd.DataFrame(data=rfe.ranking_, index=df_copy.columns[:-1])
df_rank.rename(columns={0:'rank'}, inplace=True)
df_rank.sort_values('rank', ascending=True)

<p>In addition to our first 5 features, <b>Income</b> is another feature that seems to have high importance in determining the credit card approval.</p>
<p>Let's visualize these features:</p>

In [None]:
cols = ['BankCustomer', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore', 'Income', 'Approved']

new_df = df[cols]
new_df.head()

In [None]:
plt.figure(figsize=(24, 8))

plt.subplot(141)
sns.boxplot(data=new_df, y='YearsEmployed', x='Approved')
plt.title('YearsEmployed', fontdict=font_fmt)
plt.subplot(142)
sns.boxplot(data=new_df, y='CreditScore', x='Approved')
plt.title('CreditScore', fontdict=font_fmt)
plt.subplot(143)
sns.boxplot(data=new_df, y='Income', x='Approved')
plt.title('Income', fontdict=font_fmt)
plt.subplot(144)
sns.boxplot(data=new_df[new_df['Income']<10000], y='Income', x='Approved')
plt.title('Income (<10000)', fontdict=font_fmt)

plt.show()

In [None]:
plt.figure(figsize=(24, 8))

plt.subplot(131)
sns.countplot(data=new_df, x='BankCustomer', hue='Approved')
plt.title('BankCustomer', fontdict=font_fmt)
plt.subplot(132)
sns.countplot(data=new_df, x='PriorDefault', hue='Approved')
plt.title('PriorDefault', fontdict=font_fmt)
plt.subplot(133)
sns.countplot(data=new_df, x='Employed', hue='Approved')
plt.title('Employed', fontdict=font_fmt)

plt.show()

<h2 style='font-weight: bold; color: #5AA49D'>5. Conclusion</h2>

<p>We will be using <b>Logistic Regression</b> with <b>6</b> of the most important features for our predictive model:</p>
<ul>
    <li>BankCustomer</li>
    <li>YearsEmployed</li>
    <li>PriorDefault</li>
    <li>Employed</li>
    <li>CreditScore</li>
    <li>Income</li>
</ul>
<p>Our classification model has a mean cross-validated accuracy of <b>85.6% (standard deviation = 0.027)</b>. This model is fairly accurate in predicting whether to approve an applicant for a credit card or not.</p>
<p>I think we can also remove Gender, Age, and marital status from our set of features, since using them could be considered unethical for the ML model. In this case, however, RFE already eliminated them automatically as unimportant features.</p>
<p>In the banking industry, I think that this can be a good start in determining the reliability of an applicant for a credit card. However, in reality, there are different types and levels of credit cards, so each credit card would likely need its own approval model.</p>