## Credit Card Fraud Detection

## Business Understanding
Credit card fraud detection is the process of identifying suspicious or unauthorized use of credit card information. It is important for businesses to have an effective fraud detection system in place to protect their customers' financial information and to minimize financial losses from fraudulent transactions.

* Machine learning algorithms - these algorithms learn patterns from historical transaction data and use them to predict whether new transactions are fraudulent.

It's important for businesses to continuously monitor and update their fraud detection systems to stay ahead of evolving threats and to ensure that they are able to effectively identify and prevent fraud. Additionally, businesses must comply with industry regulations such as the Payment Card Industry Data Security Standard (PCI DSS) to protect customer data and maintain customer trust.

## Data Understanding

In [40]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [41]:
# Load dataset
df = pd.read_csv('creditcard/creditcard.csv')

# first five rows of the dataset
df.head(5)

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [42]:
# check for number of columns and rows
rows,columns = df.shape

# Print the number of rows and columns in one statement
print(f"Number of rows: {rows}; Number of columns: {columns}")

Number of rows: 284807; Number of columns: 31


In [43]:
# Get the column names
column_names = df.columns

# Use a for loop to print the column names
for i, column_name in enumerate(column_names):
    print(f"Column {i+1}: {column_name}")

Column 1: Time
Column 2: V1
Column 3: V2
Column 4: V3
Column 5: V4
Column 6: V5
Column 7: V6
Column 8: V7
Column 9: V8
Column 10: V9
Column 11: V10
Column 12: V11
Column 13: V12
Column 14: V13
Column 15: V14
Column 16: V15
Column 17: V16
Column 18: V17
Column 19: V18
Column 20: V19
Column 21: V20
Column 22: V21
Column 23: V22
Column 24: V23
Column 25: V24
Column 26: V25
Column 27: V26
Column 28: V27
Column 29: V28
Column 30: Amount
Column 31: class


In [44]:

# Specify the column to check
column_name = 'class'

# Get the values of the specified column
column_values = df[column_name].unique()

# Use a for loop to print the values of the specified column
for i, value in enumerate(column_values):
    print(f"Value {i+1}: {value}")

Value 1: 0
Value 2: 1


* Class 0 --> Non-fraudulent
* Class 1 --> Fraudulent

In [45]:
# Check for missing values
# Get the column names
column_names = df.columns

# Use a for loop to check for missing values in each column
for column_name in column_names:
    missing_values = df[column_name].isnull().sum()
    print(f"{column_name}: {missing_values} missing values")

Time: 0 missing values
V1: 0 missing values
V2: 0 missing values
V3: 0 missing values
V4: 0 missing values
V5: 0 missing values
V6: 0 missing values
V7: 0 missing values
V8: 0 missing values
V9: 0 missing values
V10: 0 missing values
V11: 0 missing values
V12: 0 missing values
V13: 0 missing values
V14: 0 missing values
V15: 0 missing values
V16: 0 missing values
V17: 0 missing values
V18: 0 missing values
V19: 0 missing values
V20: 0 missing values
V21: 0 missing values
V22: 0 missing values
V23: 0 missing values
V24: 0 missing values
V25: 0 missing values
V26: 0 missing values
V27: 0 missing values
V28: 0 missing values
Amount: 0 missing values
class: 0 missing values


In [46]:
# check for duplicates
duplicates = df.duplicated().sum()

# Print the number of duplicates
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 1081
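The notebook reports the duplicate rows but keeps them in `df`. If one chose to remove them, a minimal sketch could look like the following. It runs on a made-up toy frame (the values and the names `toy` and `deduped` are illustrative assumptions, not part of the original analysis).

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0, 1.0],
    "Amount": [2.69, 2.69, 10.0, 12.5],
    "class": [0, 0, 0, 1],
})

# duplicated() marks every row that is identical to an earlier row.
n_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicate rows: {n_dupes}; rows after dropping: {len(deduped)}")
```

Whether dropping is appropriate depends on whether identical rows are true repeats or distinct transactions that happen to share values, which the anonymized features do not reveal.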


In [47]:
# Get the column names
column_names = df.columns

# Use a for loop to check for duplicated values in each column
for column_name in column_names:
    duplicates = df[column_name].duplicated().sum()
    print(f"{column_name}: {duplicates} duplicated values")

Time: 160215 duplicated values
V1: 9154 duplicated values
V2: 9152 duplicated values
V3: 9150 duplicated values
V4: 9153 duplicated values
V5: 9150 duplicated values
V6: 9155 duplicated values
V7: 9156 duplicated values
V8: 9164 duplicated values
V9: 9151 duplicated values
V10: 9161 duplicated values
V11: 9159 duplicated values
V12: 9153 duplicated values
V13: 9150 duplicated values
V14: 9154 duplicated values
V15: 9154 duplicated values
V16: 9162 duplicated values
V17: 9161 duplicated values
V18: 9152 duplicated values
V19: 9162 duplicated values
V20: 9175 duplicated values
V21: 9190 duplicated values
V22: 9163 duplicated values
V23: 9196 duplicated values
V24: 9162 duplicated values
V25: 9167 duplicated values
V26: 9160 duplicated values
V27: 9210 duplicated values
V28: 9249 duplicated values
Amount: 252040 duplicated values
class: 284805 duplicated values


## Data Preparation

In [48]:
# dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [49]:
# data distribution
df['class'].value_counts()

0    284315
1       492
Name: class, dtype: int64

In [50]:
"""
Dataset is imbalanced
"""

'\nDataset is imbalanced\n'
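To put a number on the imbalance, the class counts printed by `value_counts()` can be turned into proportions. The sketch below hard-codes those two counts; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 284315, 1: 492}

total = sum(counts.values())
fraud_rate = counts[1] / total

# Accuracy a model would get by always predicting class 0 (non-fraudulent).
baseline_accuracy = counts[0] / total

print(f"Fraud share of all transactions: {fraud_rate:.4%}")
print(f"Accuracy of an always-0 baseline: {baseline_accuracy:.4%}")
```

Because the always-0 baseline already scores very high on accuracy, accuracy on the full dataset says little about fraud detection, which is one motivation for rebalancing the data and for reporting precision and recall.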

In [51]:
# separate data
norm = df[df['class'] == 0]
fraud = df[df['class'] == 1]

In [52]:
# print shape of the variables
dataframes = [norm, fraud]
for data in dataframes:
    print("Shape of dataframe: ", data.shape)



Shape of dataframe:  (284315, 31)
Shape of dataframe:  (492, 31)


In [53]:
# Data statistics
norm.Amount.describe()

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

In [54]:
fraud.Amount.describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

In [55]:
# compare mean for the fraud and normal data
df.groupby('class').mean()

class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


In [56]:
# sampling
norm_samp = norm.sample(n=492)

In [57]:
# concatenate the sampled normal rows with the fraud rows
new_df = pd.concat([norm_samp, fraud],axis=0)
new_df.head(5)

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,class
279234,168729.0,0.018184,0.768485,0.172684,-0.769906,0.59146,-0.634593,0.865897,-0.014814,-0.216557,...,-0.234318,-0.553228,0.023422,-0.35457,-0.490981,0.142097,0.239997,0.083222,2.69,0
10938,18769.0,-2.146416,-1.596222,2.560461,-1.332597,1.306061,-1.099903,-1.272066,0.476996,2.455747,...,0.119163,0.097449,0.078743,-0.066076,0.438715,-0.797729,0.01586,0.093013,21.05,0
173658,121585.0,1.639938,-0.908127,0.832521,1.845868,-1.652753,0.105208,-1.147006,0.23983,1.992198,...,0.262163,0.936092,0.118866,0.464955,-0.308215,-0.560345,0.098183,0.004622,109.99,0
119020,75299.0,-0.70512,1.126895,1.201219,0.438523,0.348542,-0.364667,0.79133,-0.150286,-0.412665,...,0.163367,0.440549,-0.130025,0.080528,0.173362,-0.373281,-0.324178,0.004187,9.24,0
21547,31731.0,-0.692342,0.955848,1.237962,0.817452,1.1147,-0.042481,0.783874,0.050176,-1.422808,...,0.148604,0.404951,-0.406803,-0.26637,0.602503,-0.084888,0.064739,0.058097,1.79,0


In [58]:
new_df['class'].value_counts()

0    492
1    492
Name: class, dtype: int64
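The undersampling above calls `sample(n=492)` without a seed, so each run draws a different subset of normal rows and the downstream metrics will vary. A minimal sketch of a reproducible variant follows, on a made-up toy frame. The seed value 42 and the names `toy`, `norm_toy`, `fraud_toy` and `balanced` are assumptions for illustration.

```python
import pandas as pd

# Toy imbalanced frame (made-up values): 8 normal rows, 2 fraud rows.
toy = pd.DataFrame({
    "Amount": [5.0, 12.0, 7.5, 30.0, 1.0, 99.0, 45.0, 3.2, 250.0, 1.0],
    "class": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

fraud_toy = toy[toy["class"] == 1]
norm_toy = toy[toy["class"] == 0]

# Draw as many normal rows as there are fraud rows; fixing random_state
# makes the draw repeatable (42 is an arbitrary choice).
norm_toy_samp = norm_toy.sample(n=len(fraud_toy), random_state=42)

# Combine, then shuffle so the two classes are not stored as two blocks.
balanced = (
    pd.concat([norm_toy_samp, fraud_toy], axis=0)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["class"].value_counts())
```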

In [59]:
new_df.groupby('class').mean()

class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,92816.739837,-0.125509,-0.042867,0.042232,0.049525,-0.014395,-0.014182,-0.051047,0.017644,-0.011017,...,-0.039999,0.051157,-0.028532,0.000382,-0.00073,-0.006382,-0.0033,-0.00057,-0.012989,86.312195
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


In [60]:
# split data into Features and Target
X = new_df.drop(columns='class')
y = new_df['class']

In [67]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

In [62]:
# split data into training & testing data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=49)
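The split above does not pass `stratify`, so the class proportions in the test set can drift away from the 50/50 balance of `new_df`. A small sketch of a stratified split on made-up data is shown below; `X_demo`, `y_demo` and the other names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up balanced data: 40 rows, 3 features, 20 rows per class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = np.array([0] * 20 + [1] * 20)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=49, stratify=y_demo
)

print("Test-set class counts:", np.bincount(y_te))
```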

## Modelling

Logistic Regression

In [63]:
# instantiate the Logistic Regression model
lr = LogisticRegression()

In [64]:
# training the Logistic Regression Model with Training Data
lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
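The convergence warning above suggests either raising `max_iter` or scaling the data. One way to do both is to wrap a `StandardScaler` and the model in a pipeline. The sketch below uses synthetic data rather than the credit card dataset, and `max_iter=1000` is an arbitrary assumption, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is blown up to mimic a
# large-valued column such as Time.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_demo[:, 0] *= 100_000

# The scaler is fitted inside the pipeline, so at predict time the
# same training-set statistics are reused automatically.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

print("Solver iterations used:", scaled_lr[-1].n_iter_)
```

Applied to the notebook, the same pipeline object would be fitted with `X_train, y_train` in place of the synthetic arrays.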

Model Evaluation

In [65]:

# predictions on the training data
X_train_prediction = lr.predict(X_train)

# Accuracy Score
# accuracy on training data (true labels first, then predictions)
training_data_accuracy = accuracy_score(y_train, X_train_prediction)
print('Accuracy on Training data : ', training_data_accuracy)


Accuracy on Training data :  0.9349593495934959


In [66]:
# accuracy on test data
X_test_prediction = lr.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction)
print('Accuracy score on Test Data : ', test_data_accuracy)

Accuracy score on Test Data :  0.943089430894309


In [68]:
# Calculate precision, recall, F1 score, and support
precision, recall, f1_score, support = precision_recall_fscore_support(y_test, X_test_prediction, average='binary')

print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1_score)

Precision:  0.9830508474576272
Recall:  0.90625
F1 score:  0.943089430894309
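As a sanity check, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the two values printed above.

```python
# Values copied from the logistic regression output above.
precision = 0.9830508474576272
recall = 0.90625

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("F1 recomputed from precision and recall:", f1)
```

The result agrees with the reported F1 score up to floating-point rounding.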


KNN Classifier

In [69]:
from sklearn.neighbors import KNeighborsClassifier

In [70]:
# Train the KNN classifier
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)


KNeighborsClassifier()

Model Evaluation

In [71]:
# Make prediction on training data
y_train_pred = knn_clf.predict(X_train)
# Make predictions on the test set
y_pred = knn_clf.predict(X_test)
# Evaluate the model's performance (true labels first, then predictions)
print("Accuracy Score on Training Data:", accuracy_score(y_train, y_train_pred))
print("Accuracy Score on Test Data:", accuracy_score(y_test, y_pred))

Accuracy Score on Training Data: 0.7642276422764228
Accuracy Score on Test Data: 0.6219512195121951


In [72]:
# Calculate precision, recall, F1 score, and support
precision, recall, f1_score, support = precision_recall_fscore_support(y_test, y_pred, average='binary')

print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1_score)

Precision:  0.6422764227642277
Recall:  0.6171875
F1 score:  0.6294820717131474


Random Forest

In [73]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

In [74]:
# Train the Random Forest classifier
rand_clf = RandomForestClassifier()
rand_clf.fit(X_train, y_train)

RandomForestClassifier()

Model Evaluation

In [75]:
# Make predictions on the training data
rand_train_y_pred = rand_clf.predict(X_train)
# Make predictions on the test data
rand_pred = rand_clf.predict(X_test)
# Evaluate the model's performance
print("Accuracy Score on Test Data:", accuracy_score(y_test, rand_pred))

Accuracy Score on Test Data: 0.9471544715447154


In [76]:
# Calculate precision, recall, F1 score, and support
precision, recall, f1_score, support = precision_recall_fscore_support(y_test, rand_pred, average='binary')

print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1_score)

Precision:  0.9752066115702479
Recall:  0.921875
F1 score:  0.9477911646586346
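`roc_auc_score` and `roc_curve` are imported earlier but not used in the cells shown. A minimal sketch of how they could be applied to a random forest follows. It uses synthetic data, so the number it prints says nothing about the credit card model. `n_estimators=50` and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the credit card dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=49, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# ROC analysis needs scores, not hard labels: take the predicted
# probability of the positive class (column 1).
scores = forest.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, scores)
fpr, tpr, thresholds = roc_curve(y_te, scores)

print(f"ROC AUC: {auc:.3f} ({len(thresholds)} thresholds)")
```

The `fpr` and `tpr` arrays can be passed to `plt.plot(fpr, tpr)` to draw the curve.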


SVC Model

In [77]:
from sklearn.svm import SVC


In [78]:
# Train the SVM classifier
clf = SVC(kernel='linear', C=1)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model's performance (accuracy_score is already imported above)
print("SVM Accuracy:", accuracy_score(y_test, y_pred))

SVM Accuracy: 0.9227642276422764


In [79]:
# Calculate precision, recall, F1 score, and support
precision, recall, f1_score, support = precision_recall_fscore_support(y_test, y_pred, average='binary')

print("Precision: ", precision)
print("Recall: ", recall)
print("F1 score: ", f1_score)

Precision:  1.0
Recall:  0.8515625
F1 score:  0.919831223628692


Train the Random Forest classifier with tuned hyperparameters

In [80]:
from sklearn.model_selection import GridSearchCV

In [81]:
# define the hyperparameters to be tuned
param_grid = {'n_estimators': [10, 50, 100, 200],
              'max_depth': [5, 10, 15, 20, 25],
              'min_samples_split': [2, 5, 10, 20],
              'min_samples_leaf': [1, 2, 4, 8]}
# create an instance of the Random Forest classifier
clf = RandomForestClassifier()

# create an instance of the GridSearchCV with the hyperparameters and the classifier
grid_search = GridSearchCV(clf, param_grid, cv=5)

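# fit the grid search
# note: this uses the full balanced dataset (X, y), not only X_train, y_train
grid_search.fit(X, y)

# print the best parameters
print("Best parameters found: ", grid_search.best_params_)

# print the best score (mean cross-validated accuracy of the best parameters)
print("Best score found: ", grid_search.best_score_)

# make predictions using the best classifier
# note: X is the same data the search was fitted on, so this is not a held-out test
y_pred = grid_search.predict(X)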

Best parameters found:  {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
Best score found:  0.93698850098415
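The grid search above is fitted on `X, y`, which includes the rows later treated as test data. A common alternative is to search on the training split only and score once on the held-out split. The sketch below illustrates that on synthetic data with a deliberately small grid. The grid values, `cv=3` and all variable names are assumptions chosen to keep the example fast, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, not the credit card dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=49, stratify=y_demo
)

# A small grid so the example runs quickly.
small_grid = {"n_estimators": [10, 50], "max_depth": [5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)

# Cross-validation happens inside the training split only.
search.fit(X_tr, y_tr)

# The held-out split is touched once, after the search is finished.
held_out_accuracy = accuracy_score(y_te, search.predict(X_te))

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", held_out_accuracy)
```

With `refit=True` (the default), `search.predict` uses the best estimator refitted on the whole training split.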
