**Introduction**

In today's banking systems, processing credit card applications efficiently is crucial. Commercial banks receive thousands of applications daily, many of which are rejected due to factors such as high loan balances, low income levels, or excessive credit inquiries. Manually reviewing these applications is not only tedious and time-consuming but also prone to human error.

To streamline this process, many financial institutions now leverage machine learning to automate credit risk assessment. In this project, we replicate this real-world scenario by building a machine learning model that predicts whether a credit card application should be approved or not.

We use the Credit Card Approval dataset from the UCI Machine Learning Repository. This dataset contains a mix of numerical and categorical features, with varying data ranges and some missing values. The project involves cleaning and preprocessing the data, exploring feature relationships through exploratory data analysis (EDA), and training a classification model to make accurate approval predictions.

Finally, we deploy the trained model as a Flask web application. The app includes:

A login page for user authentication

A form interface where users can input application data

A results page displaying the model's prediction

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
import joblib

In [2]:
df = pd.read_csv("/content/crx.data",header=None)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


In [4]:
df.columns = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer',
              'EducationLevel', 'Ethnicity', 'YearsEmployed', 'PriorDefault',
              'Employed', 'CreditScore', 'DriversLicense', 'Citizen',
              'ZipCode', 'Income','Approved']

In [5]:
df.head()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


Let's see if there is a dataset issues

In [6]:
# Replace the '?'s with NaN

df = df.replace('?',np.nan)

In [7]:
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

In [8]:
df.describe()

Unnamed: 0,Age,Debt,YearsEmployed,CreditScore,Income
count,678.0,690.0,690.0,690.0,690.0
mean,31.568171,4.758725,2.223406,2.4,1017.385507
std,11.957862,4.978163,3.346513,4.86294,5210.102598
min,13.75,0.0,0.0,0.0,0.0
25%,22.6025,1.0,0.165,0.0,0.0
50%,28.46,2.75,1.0,0.0,5.0
75%,38.23,7.2075,2.625,3.0,395.5
max,80.25,28.0,28.5,67.0,100000.0


Our variables are Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally the ApprovalStatus.

In [9]:
num_col = df.select_dtypes(include=['float64','int64']).columns

df[num_col] = df[num_col].fillna(df[num_col].mean())

In [10]:
for col in df.columns:
  if df[col].dtypes == 'object':
    df = df.fillna(df[col].value_counts().index[0])

In [11]:
df.isnull().sum()

Unnamed: 0,0
Gender,0
Age,0
Debt,0
Married,0
BankCustomer,0
EducationLevel,0
Ethnicity,0
YearsEmployed,0
PriorDefault,0
Employed,0


**Preprocessing the data**

In [12]:
# Drop the features 11 and 13
df = df.drop(['Ethnicity','CreditScore','ZipCode', 'DriversLicense','Citizen','EducationLevel','BankCustomer','Married'], axis=1)

In [13]:
df['Gender'] = df['Gender'].map({'a': 0, 'b': 1})
df['Approved'] = df['Approved'].map({'-': 0, '+': 1})
df['PriorDefault'] = df['PriorDefault'].map({'f': 0, 't': 1})
df['Employed'] = df['Employed'].map({'f': 0, 't': 1})

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Gender         690 non-null    int64  
 1   Age            690 non-null    float64
 2   Debt           690 non-null    float64
 3   YearsEmployed  690 non-null    float64
 4   PriorDefault   690 non-null    int64  
 5   Employed       690 non-null    int64  
 6   Income         690 non-null    int64  
 7   Approved       690 non-null    int64  
dtypes: float64(3), int64(5)
memory usage: 43.3 KB


In [15]:
le = LabelEncoder()

In [16]:
# Use LabelEncoder to do the numeric transformation

for col in df.columns.to_numpy():
  if df[col].dtypes =='object':
    df[col]=le.fit_transform(df[col])

In [17]:
df.columns

Index(['Gender', 'Age', 'Debt', 'YearsEmployed', 'PriorDefault', 'Employed',
       'Income', 'Approved'],
      dtype='object')

In [18]:
df['Gender'].value_counts()

Unnamed: 0_level_0,count
Gender,Unnamed: 1_level_1
1,480
0,210


In [19]:
Variables = ['Gender', 'Age', 'Debt', 'YearsEmployed', 'PriorDefault', 'Employed','Income', 'Approved']

for var in Variables:
  print(var)
  print(df[var].unique())

Gender
[1 0]
Age
[30.83       58.67       24.5        27.83       20.17       32.08
 33.17       22.92       54.42       42.5        22.08       29.92
 38.25       48.08       45.83       36.67       28.25       23.25
 21.83       19.17       25.         47.75       27.42       41.17
 15.83       47.         56.58       57.42       42.08       29.25
 42.         49.5        36.75       22.58       27.25       23.
 27.75       54.58       34.17       28.92       29.67       39.58
 56.42       54.33       41.         31.92       41.5        23.92
 25.75       26.         37.42       34.92       34.25       23.33
 23.17       44.33       35.17       43.25       56.75       31.67
 23.42       20.42       26.67       36.         25.5        19.42
 32.33       34.83       38.58       44.25       44.83       20.67
 34.08       21.67       21.5        49.58       27.67       39.83
 31.56817109 37.17       25.67       34.         49.         62.5
 31.42       52.33       28.75       28.58      

Splitting the dataset into train and test sets

In [20]:
# convert the DataFrame to a NumPy array
df = df.to_numpy()

In [21]:
df.shape

(690, 8)

In [22]:
X,y = df[:,0:7], df[:,7]

In [23]:
# Split into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=40)

In [24]:
scaler = MinMaxScaler(feature_range=(0,1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

logistic regression model to the train set

In [25]:
log_regre = LogisticRegression()

In [26]:
log_regre.fit(X_train_scaled,y_train)

Predictions and evaluating performance

In [27]:
y_pred = log_regre.predict(X_test_scaled)

In [28]:
Accuracy = log_regre.score(X_test_scaled, y_test)

print("Accuracy of logistic regression classifier: ", Accuracy)

Accuracy of logistic regression classifier:  0.8695652173913043


In [29]:
confusion_matrix(y_test,y_pred)

array([[90, 18],
       [ 9, 90]])

Grid searching and making the model perform better

In [30]:
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

param_grid = dict(tol=tol, max_iter=max_iter)

Finding the best performing model

In [31]:
# GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=log_regre, param_grid=param_grid, cv=5)

# rescale X and assign it to rescaledX
X_scaled = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(X_scaled,y)

In [32]:
# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.855072 using {'max_iter': 100, 'tol': 0.01}


In [33]:
model = LogisticRegression(max_iter=100, tol=0.001)


model.fit(X_train_scaled,y_train)

y_pred = model.predict(X_test_scaled)

Accuracy = model.score(X_test_scaled, y_test)

print("Accuracy of logistic regression classifier: ", Accuracy)

Accuracy of logistic regression classifier:  0.8695652173913043


In [34]:
joblib.dump(model,'model.pkl')

['model.pkl']

### 📝 Input Guide: What Values You Can Enter

Please refer to the list below to know the values you can enter for each field in the form:

- **Gender**  
  Enter `1` for **Male** or `0` for **Female**.

- **Age**  
  Enter your **age in years**.  
  _Example_: `25.5`, `42.0`, `30`

- **Debt**  
  Enter your **current debt amount** in thousands of currency units.  
  _Example_: `0`, `4.5`, `12.75`


- **Years Employed**  
  Enter the number of years you have been employed.  
  _Example_: `3.5`, `10`, `0.75`

- **Prior Default**  
  Enter `1` if you have **defaulted before**, or `0` if **not**.

- **Employed**  
  Enter `1` if you are **currently employed**, or `0` if **not**.


- **Income**  
  Enter your **yearly income**.  
  _Example_: `15000`, `4200`, `0`


In [35]:
mj = joblib.load('model.pkl')

In [36]:
mj.predict([[0,2,1,1,0,0,0]])

array([0.])

In [37]:
mj.predict([[1,35,0,10,0,1,10000]])

array([1.])