# BANKING CREDIT CARD TRANSACTION
### Prediction of Valid or Fraud

## Introduction

Banking customers submit a large number of credit card transactions every day. In any given second, thousands or even millions of transactions may enter the banking system.

However, are all the submitted transactions valid, or do some come from parties attempting to defraud the Bank?   
In fact, of all the transactions entering the system, only 99% are genuinely submitted by customers; the other 1% are fraudulent transactions submitted by irresponsible parties. Although the share of fraud is very small, the sheer number of transactions and their large nominal values still cause losses for the Bank.

If the fraud is allowed to continue, more and more irresponsible parties will be encouraged to do the same, and the 1% figure will keep growing over time. This would cause major losses for the Bank, both material and non-material, such as a loss of customer trust that ultimately drives customers to competing banks.

Because a very large number of transactions must be processed quickly and accurately (one of the benefits of using credit cards), it is impossible for humans to check manually whether every submitted transaction is valid or fraudulent.

Therefore, an automatic validation system is needed to process every transaction that enters the banking system, with a target accuracy of 100%, using <em>"Machine Learning"</em>.

Why must the system's accuracy be 100%?   
Because if the automatic validation system is only 99% accurate, it has failed in practice: with 1% fraud, accepting every transaction without any check is already correct 99% of the time, so nothing changes after the machine learning validation system is introduced. In some cases (<90% accuracy), evaluation errors can even harm customers.
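
The reasoning above can be checked with a few lines of arithmetic. The sketch below uses illustrative numbers only (1,000 transactions with the 1% fraud rate quoted in this introduction, not data from the dataset) and a hypothetical baseline that accepts every transaction as valid.

```python
import numpy as np

# Illustrative numbers only: 1,000 transactions, 1% of them fraudulent.
n_total, n_fraud = 1000, 10
y_true = np.array([1] * n_fraud + [0] * (n_total - n_fraud))  # 1 = fraud, 0 = valid

# Hypothetical baseline: accept every transaction as valid.
y_baseline = np.zeros(n_total, dtype=int)

# Share of transactions the baseline labels correctly.
baseline_accuracy = (y_true == y_baseline).mean()

# Share of fraudulent transactions the baseline actually catches.
caught = ((y_true == 1) & (y_baseline == 1)).sum()
baseline_fraud_recall = caught / n_fraud

print("Baseline accuracy    :", baseline_accuracy)
print("Baseline fraud recall:", baseline_fraud_recall)
```

Under these assumptions the baseline is right for 990 of 1,000 transactions while catching none of the 10 fraudulent ones, which is why a 99% score on its own says little about fraud detection.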

## Import Dataset

The dataset used in this portfolio is the <a href="http://archive.ics.uci.edu/ml/datasets/credit approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.

In [1]:
import pandas as pd

cc_dataset = pd.read_csv('datasets/cc_approvals.data')
cc_dataset.head()

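| | b | 30.83 | 0 | u | g | w | v | 1.25 | t | t.1 | 01 | f | g.1 | 00202 | 0.1 | + |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 1 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 2 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 3 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |
| 4 | b | 32.08 | 4.0 | u | g | m | v | 2.5 | t | f | 0 | t | g | 360 | 0 | + |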


The table above shows credit card transaction history that has been anonymized to protect customer privacy, so the original column names are not available. However, the blog post <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">"Analysis of Credit Approval Data"</a> suggests that the columns of this banking dataset are, in order:
- Gender
- Age
- Debt
- Married
- BankCustomer
- EducationLevel
- Ethnicity
- YearEmployed
- PriorDefault
- Employed
- CreditScore
- DriverLicense
- Citizen
- ZipCode
- Income
- ApprovalStatus
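
The header of the first table (b, 30.83, 0, ...) is in fact the first record of the file: the raw file has no header line, so `read_csv` with default settings consumes that record as column names. A minimal sketch, assuming the same comma-separated layout, of passing the names above at load time so that no record is lost. The in-memory sample is a stand-in for `datasets/cc_approvals.data`, built from values shown in the table.

```python
import io
import pandas as pd

# Column names taken from the list above.
column_names = ['Gender', 'Age', 'Debt', 'Married', 'BankCustomer',
                'EducationLevel', 'Ethnicity', 'YearEmployed',
                'PriorDefault', 'Employed', 'CreditScore',
                'DriverLicense', 'Citizen', 'ZipCode', 'Income',
                'ApprovalStatus']

# Stand-in for the real file: three header-less records copied from the
# table shown earlier (the first one is the record that became the header).
raw = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+\n"
    "a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+\n"
)

# header=None keeps the first record as data; names= labels the columns.
sample = pd.read_csv(raw, header=None, names=column_names)
print(sample.shape)
print(sample.head())
```

With the real file, the same two arguments would replace the separate renaming step.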

For convenience, the columns are renamed accordingly:

In [2]:
cc_dataset.columns = ['Gender','Age','Debt','Married','BankCustomer',
                      'EducationLevel','Ethnicity','YearEmployed',
                      'PriorDefault','Employed','CreditScore',
                      'Driverlicense','Citizen','ZipCode','Income',
                      'ApprovalStatus']

cc_dataset.head()

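| | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearEmployed | PriorDefault | Employed | CreditScore | Driverlicense | Citizen | ZipCode | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 1 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 2 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 3 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |
| 4 | b | 32.08 | 4.0 | u | g | m | v | 2.5 | t | f | 0 | t | g | 360 | 0 | + |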


Now the dataset is easier to read.
However, before a <em>machine learning</em> model can be built, any problems in the dataset must first be identified.

In [3]:
print(cc_dataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689 entries, 0 to 688
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          689 non-null    object 
 1   Age             689 non-null    object 
 2   Debt            689 non-null    float64
 3   Married         689 non-null    object 
 4   BankCustomer    689 non-null    object 
 5   EducationLevel  689 non-null    object 
 6   Ethnicity       689 non-null    object 
 7   YearEmployed    689 non-null    float64
 8   PriorDefault    689 non-null    object 
 9   Employed        689 non-null    object 
 10  CreditScore     689 non-null    int64  
 11  Driverlicense   689 non-null    object 
 12  Citizen         689 non-null    object 
 13  ZipCode         689 non-null    object 
 14  Income          689 non-null    int64  
 15  ApprovalStatus  689 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.2+ KB
None


No null values are detected.

A further check looks for strings that are commonly used as placeholders for missing values:

In [4]:
print(cc_dataset[cc_dataset.values=='?'])
print(cc_dataset[cc_dataset.values==' '])
print(cc_dataset[cc_dataset.values==''])

    Gender    Age   Debt Married BankCustomer EducationLevel Ethnicity  \
70       b  34.83  4.000       u            g              d        bb   
82       a      ?  3.500       u            g              d         v   
85       b      ?  0.375       u            g              d         v   
91       b      ?  5.000       y            p             aa         v   
96       b      ?  0.500       u            g              c        bb   
..     ...    ...    ...     ...          ...            ...       ...   
621      a  25.58  0.000       ?            ?              ?         ?   
621      a  25.58  0.000       ?            ?              ?         ?   
625      b  22.00  7.835       y            p              i        bb   
640      ?  33.17  2.250       y            p             cc         v   
672      ?  29.50  2.000       y            p              e         h   

     YearEmployed PriorDefault Employed  CreditScore Driverlicense Citizen  \
70         12.500            t   

Some cells use the string '?' as a placeholder for a missing value. The affected columns are:

In [5]:
print(cc_dataset.columns[cc_dataset.isin(['?']).any()])
print(cc_dataset.isin(['?']).sum())
n_missing = cc_dataset.isin(['?']).sum().sum()
print('total missing values =',n_missing)

Index(['Gender', 'Age', 'Married', 'BankCustomer', 'EducationLevel',
       'Ethnicity', 'ZipCode'],
      dtype='object')
Gender            12
Age               12
Debt               0
Married            6
BankCustomer       6
EducationLevel     9
Ethnicity          9
YearEmployed       0
PriorDefault       0
Employed           0
CreditScore        0
Driverlicense      0
Citizen            0
ZipCode           13
Income             0
ApprovalStatus     0
dtype: int64
total missing values = 67


The check shows 67 cells across 7 columns that contain the value '?'. These are missing values.

Most machine learning models cannot be trained while missing values remain, so this problem must be addressed first. Before that, however, the columns considered unimportant or unrelated to the validity of a credit card transaction are removed.

In [6]:
cc_dataset = cc_dataset.drop(columns=['ZipCode','Driverlicense'])
cc_dataset.head()

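| | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearEmployed | PriorDefault | Employed | CreditScore | Citizen | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 1 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 2 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 3 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |
| 4 | b | 32.08 | 4.0 | u | g | m | v | 2.5 | t | f | 0 | g | 0 | + |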


## Dealing with Missing Values

In [7]:
import numpy as np

cc_dataset = cc_dataset.replace('?', np.nan)
cc_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689 entries, 0 to 688
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          677 non-null    object 
 1   Age             677 non-null    object 
 2   Debt            689 non-null    float64
 3   Married         683 non-null    object 
 4   BankCustomer    683 non-null    object 
 5   EducationLevel  680 non-null    object 
 6   Ethnicity       680 non-null    object 
 7   YearEmployed    689 non-null    float64
 8   PriorDefault    689 non-null    object 
 9   Employed        689 non-null    object 
 10  CreditScore     689 non-null    int64  
 11  Citizen         689 non-null    object 
 12  Income          689 non-null    int64  
 13  ApprovalStatus  689 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.5+ KB


Based on the info() output, only columns of the object data type have missing values. They are therefore filled with the most frequent value of each column.

In [8]:
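for col in cc_dataset.columns:
    if cc_dataset[col].dtype == 'object':
        # Fill only this column, using its own most frequent value.
        # Calling fillna on the whole DataFrame would fill every column
        # with the mode of the first object column.
        most_frequent = cc_dataset[col].value_counts().index[0]
        cc_dataset[col] = cc_dataset[col].fillna(most_frequent)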

cc_dataset.isnull().sum()

Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearEmployed      0
PriorDefault      0
Employed          0
CreditScore       0
Citizen           0
Income            0
ApprovalStatus    0
dtype: int64
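
`Age` is stored as object only because the '?' placeholder prevented it from being parsed as a number. A possible refinement, sketched below on a small made-up frame, converts `Age` to numeric first and fills it with the median, then fills the remaining categorical columns with their own most frequent value. The sample values and the choice of the median are assumptions for illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Small made-up sample; values are illustrative, not taken from the dataset.
df = pd.DataFrame({
    'Gender': ['a', 'b', np.nan, 'b'],
    'Age': ['30.5', np.nan, '22.0', '41.5'],
    'Married': ['u', 'y', 'u', np.nan],
})

# Convert Age to numeric; anything unparseable would become NaN.
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill each remaining non-numeric column with its own mode.
for col in df.select_dtypes(exclude='number').columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
print(df.dtypes)
```

Keeping `Age` numeric also means it stays a single column after dummy encoding instead of one column per distinct age.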

##### Missing values have been resolved.

## Making Machine Learning Models

#### 1. Data Preprocessing

At this stage, the credit card dataset, now free of missing values, is split into two parts for training and testing the machine learning model. All columns are then converted to numeric types and scaled to a common range.
1. Split the dataset (train-test)
2. Convert the data types to numeric
3. Scale the data

In [9]:
from sklearn.model_selection import train_test_split

cc_train, cc_test = train_test_split(cc_dataset,test_size=0.3,random_state=50)

cc_train = pd.get_dummies(cc_train)
cc_test = pd.get_dummies(cc_test)
print('After Dummy')
print('Train shape :', cc_train.shape)
print('Test shape  :', cc_test.shape)
print('\n')

# Reindex the test set columns to match the training set after dummy encoding
cc_test = cc_test.reindex(columns=cc_train.columns,fill_value=0)
print('After Reindex Dummy')
print('Train shape :', cc_train.shape)
print('Test shape  :', cc_test.shape)

After Dummy
Train shape : (482, 330)
Test shape  : (207, 215)


After Reindex Dummy
Train shape : (482, 330)
Test shape  : (207, 330)


Next, the features (X) and the target (y) are separated, and the features are scaled to the range 0 to 1.

In [10]:
X_train, y_train = cc_train.iloc[:,:-1].values, cc_train.iloc[:,[-1]].values
X_test, y_test = cc_test.iloc[:,:-1].values, cc_test.iloc[:,[-1]].values

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
scaledX_train = scaler.fit_transform(X_train)
scaledX_test = scaler.transform(X_test)
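
Because `pd.get_dummies` also encodes `ApprovalStatus`, the last two columns of the encoded frame are complementary indicators of the target, and `iloc[:, :-1]` leaves one of them among the features, where a model can read the answer directly. A minimal sketch, on a small made-up frame, of separating the target before encoding so that the features carry no copy of it. The sample values and the 0/1 mapping are assumptions for illustration.

```python
import pandas as pd

# Small made-up sample with the same kinds of columns as the dataset.
df = pd.DataFrame({
    'Debt': [4.46, 0.5, 1.54, 5.625],
    'PriorDefault': ['t', 'f', 't', 'f'],
    'ApprovalStatus': ['+', '-', '+', '-'],
})

# Encoding everything at once: the target is one-hot encoded with the features.
encoded = pd.get_dummies(df)
leaky_feature_names = list(encoded.columns[:-1])
print(leaky_feature_names)  # still contains 'ApprovalStatus_+'

# Alternative: split off the target first, then encode the features only.
y = (df['ApprovalStatus'] == '+').astype(int)  # assumed mapping: '+' -> 1
X = pd.get_dummies(df.drop(columns='ApprovalStatus'))
print(list(X.columns))
```

Applied to the real data, the same split would be done before `train_test_split`, with the reindexing step kept for the feature columns only.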

#### 2. Model Training

One of the main advantages of using a credit card is the convenience and speed of transactions. The automatic validation system must therefore be both accurate and fast, which calls for a machine learning model that is lightweight yet effective.

Among the many types of machine learning models, logistic regression is one of the lightest for this kind of classification. To get the best performance from it on this dataset, a grid search is run over its parameters.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

tol = [0.01,0.001,0.0001]
max_iter = [100,150,200]
parameter = dict(tol=tol,max_iter=max_iter)

logreg = LogisticRegression()
grid_model = GridSearchCV(estimator=logreg,param_grid=parameter,cv=13)

grid_model_result = grid_model.fit(scaledX_train,y_train.ravel())
print("Best Accuracy   : %f" % (grid_model_result.best_score_))
print("Best Parameter  :", grid_model_result.best_params_)

Best Accuracy   : 1.000000
Best Parameter  : {'max_iter': 100, 'tol': 0.01}
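
The grid above varies only the solver's stopping criteria (`tol` and `max_iter`). A hedged sketch of a wider search follows, assuming synthetic data from `make_classification` in place of the credit card features and an illustrative grid over the regularization strength `C`. The scaler sits inside a `Pipeline` so that it is refit on the training part of each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; not the credit card dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=50)

pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('logreg', LogisticRegression(max_iter=200)),
])

# Illustrative grid: parameter names are prefixed with the pipeline step name.
param_grid = {
    'logreg__C': [0.01, 0.1, 1, 10],
    'logreg__tol': [0.01, 0.001, 0.0001],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
search.fit(X, y)

print("Best CV accuracy:", search.best_score_)
print("Best parameters :", search.best_params_)
```

The fitted `search.best_estimator_` is the whole pipeline, so it can be applied to unscaled test features directly.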


In [12]:
best_model = grid_model_result.best_estimator_

score_test = best_model.score(scaledX_test,y_test)
print("Model Accuracy on the test dataset", score_test)

Model Accuracy on the test dataset 1.0


In [13]:
from sklearn.metrics import confusion_matrix
y_pred = best_model.predict(scaledX_test)

confusion_matrix(y_test,y_pred)

array([[ 97,   0],
       [  0, 110]])

Based on the accuracy score and the confusion matrix, the model classifies every transaction in the test dataset correctly, with the following results:
### 97 Fraud Transactions
### 110 Valid Transactions
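
The raw array does not show which row belongs to which class; that depends on how the target was encoded. A small sketch, using made-up labels rather than the model's predictions, wraps the matrix in a labeled DataFrame and prints per-class precision and recall. The class names and the 0/1 mapping are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration; assumed mapping: 0 = fraud, 1 = valid.
y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]
class_names = ['fraud', 'valid']

# Rows are actual classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
cm_table = pd.DataFrame(
    cm,
    index=[f'actual {name}' for name in class_names],
    columns=[f'predicted {name}' for name in class_names],
)
print(cm_table)
print(classification_report(y_true, y_pred, target_names=class_names))
```

Checking which encoded value stands for fraud before naming the rows avoids reading the two counts the wrong way round.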