# Task 5 : Credit Card Fraud Detection

*Varad Deshmukh*

**Instructions**

* Build a machine learning model to identify fraudulent credit card transactions.
* Preprocess and normalize the transaction data, handle class imbalance issues, and split the dataset into training and testing sets.
* Train a classification algorithm, such as logistic regression or random forests, to classify transactions as fraudulent or genuine.
* Evaluate the model's performance using metrics like precision, recall, and F1-score, and consider techniques like oversampling or undersampling for improving results.

> ## Importing Libraries

In this project, along with the usual `NumPy` and `pandas`, we will need several modules from the `sklearn` library. To keep each import close to where it is used, we will import modules and classes as they are required and explain their purpose at that point.

In [10]:
# importing necessary libraries
import numpy as np
import pandas as pd

> ## Loading the dataset

The dataset for this project has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. We will be using the same data, with appropriate citations at the end.

The data is in the form of a '.csv' file named 'creditcard.csv', which contains about 2.85 lakh (284,807) rows and 31 columns. The names of all columns except three (Time, Amount and Class) have been replaced with unidentifiable names, V1 to V28, owing to privacy and security concerns.

The data has a column named 'Class', a categorical variable showing whether the transaction was legitimate or fraudulent. It takes two values, '0' and '1', for legitimate and fraudulent transactions, respectively.

In [11]:
# load the dataset in the form of a pandas DataFrame
data = pd.read_csv('creditcard.csv')

# view first 5 rows of the dataset
data.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


> ## Data Preprocessing

Data Preprocessing involves understanding the datatypes of the columns, checking for any missing values and subsequently cleaning the data.

In [12]:
# fetch information about the columns and their datatypes
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

This shows that of the 31 columns, only the 'Class' column has data type 'int64'; the rest are 'float64'. Moreover, the memory usage of the dataset is about 67.4 MB. Now we check for missing data.

In [13]:
# check for missing data
data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

All the values being zero means that we have no missing values in our dataset. We now check whether the 'Class' column can be treated as categorical.

In [14]:
# check unique values in the 'Class' column
data['Class'].nunique()

2

This shows that the 'Class' column can indeed be treated as categorical, as it has only two values, indicating whether a transaction was legitimate or fraudulent.

> ## Normalizing the data

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. In our case, the data includes both negative and positive quantities spanning wide ranges. It is often a good choice to normalize data, as many machine learning algorithms run and converge faster when the data is normalized.
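
As a minimal sketch of what min-max scaling does, the snippet below applies the formula `(x - min) / (max - min)` column by column to a small made-up array. The values are illustrative and not taken from the dataset. This is the same formula that scikit-learn's `MinMaxScaler` applies with its default `feature_range=(0, 1)`.

```python
import numpy as np

# toy matrix: two columns on very different scales (hypothetical values)
toy = np.array([[-2.0, 100.0],
                [ 0.0, 250.0],
                [ 2.0, 400.0]])

# per-column minimum and maximum
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)

# min-max scaling: every column is mapped onto the range [0, 1]
toy_scaled = (toy - col_min) / (col_max - col_min)
print(toy_scaled)
```

After scaling, each column runs from 0 to 1 regardless of its original range or sign.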

So, to normalize our data, we use the `MinMaxScaler()` class from the `scikit-learn` library.

In [15]:
# import MinMaxScaler() from scikit-learn
from sklearn.preprocessing import MinMaxScaler

In [16]:
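# scale every column to the [0, 1] range
# fit_transform() fits the scaler and transforms the data in a single call,
# so a separate fit() beforehand is not needed
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)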

# make a pandas DataFrame out of the scaled data, to be used for further modelling
scaled_data = pd.DataFrame(scaled, columns=data.columns)

In [17]:
# show what the scaled data looks like
scaled_data.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.935192 | 0.76649 | 0.881365 | 0.313023 | 0.763439 | 0.267669 | 0.266815 | 0.786444 | 0.475312 | ... | 0.561184 | 0.522992 | 0.663793 | 0.391253 | 0.585122 | 0.394557 | 0.418976 | 0.312697 | 0.005824 | 0.0 |
| 1 | 0.0 | 0.978542 | 0.770067 | 0.840298 | 0.271796 | 0.76612 | 0.262192 | 0.264875 | 0.786298 | 0.453981 | ... | 0.55784 | 0.480237 | 0.666938 | 0.33644 | 0.58729 | 0.446013 | 0.416345 | 0.313423 | 0.000105 | 0.0 |
| 2 | 6e-06 | 0.935217 | 0.753118 | 0.868141 | 0.268766 | 0.762329 | 0.281122 | 0.270177 | 0.788042 | 0.410603 | ... | 0.565477 | 0.54603 | 0.678939 | 0.289354 | 0.559515 | 0.402727 | 0.415489 | 0.311911 | 0.014739 | 0.0 |
| 3 | 6e-06 | 0.941878 | 0.765304 | 0.868484 | 0.213661 | 0.765647 | 0.275559 | 0.266803 | 0.789434 | 0.414999 | ... | 0.559734 | 0.510277 | 0.662607 | 0.223826 | 0.614245 | 0.389197 | 0.417669 | 0.314371 | 0.004807 | 0.0 |
| 4 | 1.2e-05 | 0.938617 | 0.77652 | 0.864251 | 0.269796 | 0.762975 | 0.263984 | 0.268968 | 0.782484 | 0.49095 | ... | 0.561327 | 0.547271 | 0.663392 | 0.40127 | 0.566343 | 0.507497 | 0.420561 | 0.31749 | 0.002724 | 0.0 |


> ## Handling Class Imbalance Issues

The Class Imbalance problem occurs when there is a severe skew in the class distribution of our training data. This skewness can influence the performance of our machine learning algorithms.

To see this in our data, let us count the rows for each value of the 'Class' feature. 0 means the transaction is legitimate and 1 means it is fraudulent.

In [18]:
# group by 'Class' and find count
scaled_data['Class'].value_counts()

0.0    284315
1.0       492
Name: Class, dtype: int64

This shows that the majority of transactions in our data are legitimate and have a 'Class' value of 0. Of the 2.85 lakh transactions, only 492 (about 0.17%) are fraudulent. This imbalance in the class distribution of our training data may affect our machine learning algorithm, because it may end up ignoring the minority class almost entirely. This is an issue because the minority class is often the class we are most interested in. Here, the minority class is 'Class' = 1, i.e. the transaction is fraudulent, which is precisely what we are interested in.
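
To make the risk concrete, consider a hypothetical model that labels every transaction as legitimate. Using only the class counts shown above, a short calculation (a sketch, not part of the original notebook) gives its accuracy and its recall on the fraud class:

```python
# class counts taken from the value_counts() output above
n_legit = 284315
n_fraud = 492
n_total = n_legit + n_fraud

# a hypothetical "always legitimate" classifier gets every legitimate row
# right and every fraudulent row wrong
baseline_accuracy = n_legit / n_total
baseline_fraud_recall = 0 / n_fraud

print('accuracy     : {:.2f}%'.format(baseline_accuracy * 100))
print('fraud recall : {:.2f}%'.format(baseline_fraud_recall * 100))
```

Such a model would be about 99.83% accurate while catching no fraud at all, which is why accuracy alone is a misleading measure on the imbalanced data.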

So, to correct this imbalance, we use random resampling. There are basically two types: oversampling and undersampling. In oversampling, we duplicate randomly chosen samples from the minority class; in undersampling, we keep a random subset of the majority class and discard the rest. Here, we choose undersampling.

For that, we first separate the data into data for legitimate transactions and data for fraud transactions, using the 'Class' feature value.

In [19]:
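# separate legit and fraud transactions data
# (the mask is built from scaled_data itself, so the mask and the frame always match)
legit_transactions = scaled_data[scaled_data['Class'] == 0]
fraud_transactions = scaled_data[scaled_data['Class'] == 1]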

In [20]:
# number of legit transactions
legit_transactions.shape

(284315, 31)

In [21]:
# number of fraud transactions
fraud_transactions.shape

(492, 31)

Now, for performing the undersampling, we use the `RandomUnderSampler` class from the `imblearn` library, which is built for such imbalanced class problems.

In [22]:
from imblearn.under_sampling import RandomUnderSampler

In [23]:
# separating the features (X) and target (y)
X = scaled_data.drop(columns='Class')
y = scaled_data['Class']

In [24]:
# instantiating the undersampler
usampler = RandomUnderSampler()

In [25]:
# undersampling the features and target
X_rus, y_rus = usampler.fit_resample(X, y)

In [26]:
# resultant undersampled feature matrix
X_rus.shape

(984, 30)

In [27]:
# resultant undersampled target
y_rus.shape

(984,)

After undersampling, the dataset has 984 rows, which is twice the 492 fraudulent transactions: `RandomUnderSampler` keeps every fraudulent transaction and, by default, randomly draws an equal number of legitimate ones. We will therefore train our model on the same number of legitimate and fraudulent transactions, which eliminates the class imbalance.
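
The balance can be checked directly by counting the labels in the resampled target. The sketch below illustrates the idea on a small made-up DataFrame and performs the undersampling with plain pandas rather than `imblearn`. The names `toy`, `minority`, `majority`, `majority_sample` and `balanced` are illustrative assumptions and not part of the notebook.

```python
import pandas as pd

# toy data: 8 legitimate rows (Class 0) and 2 fraudulent rows (Class 1)
toy = pd.DataFrame({
    'Amount': [10, 20, 30, 40, 50, 60, 70, 80, 900, 950],
    'Class':  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

minority = toy[toy['Class'] == 1]
majority = toy[toy['Class'] == 0]

# random undersampling: keep as many majority rows as there are minority rows
majority_sample = majority.sample(n=len(minority), random_state=0)
balanced = pd.concat([majority_sample, minority])

# both classes now appear equally often
print(balanced['Class'].value_counts())
```

In the notebook itself, the corresponding check would be `y_rus.value_counts()`, which should report 492 rows for each class.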

> ## Modelling the data

We employ the Logistic Regression model for this project, fitted with the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm, a popular numerical optimization algorithm. To split the data into training and test sets, we use the `train_test_split` function. We also import some important metrics that will tell us how well our model predicts whether a transaction is legitimate.

In [28]:
# importing necessary modules from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, f1_score, confusion_matrix, recall_score, classification_report

In [29]:
# splitting the data into training and test data (80/20 hold-out split)
X_train, X_test, y_train, y_test = train_test_split(X_rus, y_rus, test_size=0.2, random_state=50)

In [30]:
# how our training and test data look
print('Shapes :')
print('X_train : ', X_train.shape)
print('X_test : ', X_test.shape)
print('y_train : ', y_train.shape)
print('y_test : ', y_test.shape)

Shapes :
X_train :  (787, 30)
X_test :  (197, 30)
y_train :  (787,)
y_test :  (197,)


In [31]:
# instantiating a logistic regression model
# specifying a higher number of iterations (max_iter) resolves the convergence warnings
model = LogisticRegression(solver='lbfgs', max_iter=1000)

In [32]:
# fitting the model to our training data
model.fit(X_train, y_train)

In [33]:
# ascertaining the accuracy of predictions on our training data
train_predictions = model.predict(X_train)
training_accuracy = accuracy_score(y_train, train_predictions)
print('Accuracy on training data : {:.2f}%'.format(training_accuracy * 100))

Accuracy on training data : 92.12%


Having trained the model on our training data, we now ascertain its accuracy on our test data.

In [34]:
# evaluating the model on our test data and ascertaining its accuracy
test_predictions = model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print('Accuracy on test data : {:.2f}%'.format(test_accuracy * 100))

Accuracy on test data : 90.36%


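The Logistic Regression model classifies 90.36% of the test transactions correctly as legitimate or fraudulent. Note that this figure is measured on the balanced, undersampled test set of 197 transactions, not on the original, heavily imbalanced distribution.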
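
One caveat with the steps above is that the scaler was fitted on the full dataset before the split, so the test rows influenced the scaling. A common alternative, sketched here on synthetic data from `make_classification` (an assumption for illustration, not the credit card data), is to split first and let a `Pipeline` fit the scaler on the training rows only. The `stratify` argument keeps the class ratio the same in both splits. The names `X_demo`, `y_demo` and `pipe` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the undersampled data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=50)

# split first, stratifying on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=50, stratify=y_demo
)

# the scaler is fitted inside the pipeline, on the training rows only
pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('clf', LogisticRegression(solver='lbfgs', max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

demo_accuracy = pipe.score(X_te, y_te)
print('test accuracy : {:.2f}'.format(demo_accuracy))
```

When the pipeline later predicts on new rows, it applies the scaling learned from the training data automatically, so the scaler and the model cannot drift apart.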

> ## Metrics for our model

We find the Precision, Recall and F1 Score for our model to ascertain its performance. We also find a classification report to summarize the values.

In [35]:
precision = precision_score(y_test, test_predictions)
print('precision : {:.2f}%'.format(precision * 100))

precision : 100.00%


In [36]:
recall = recall_score(y_test, test_predictions)
print('recall : {:.2f}%'.format(recall * 100))

recall : 82.41%


In [37]:
# store the result under a new name so the imported f1_score function is not overwritten
f1 = f1_score(y_test, test_predictions)
print('f1 : {:.2f}%'.format(f1 * 100))

f1 : 90.36%


In [38]:
print(classification_report(y_test, test_predictions))

              precision    recall  f1-score   support

         0.0       0.82      1.00      0.90        89
         1.0       1.00      0.82      0.90       108

    accuracy                           0.90       197
   macro avg       0.91      0.91      0.90       197
weighted avg       0.92      0.90      0.90       197
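
The figures above are consistent with a confusion matrix of 89 true negatives, 0 false positives, 19 false negatives and 89 true positives. These counts are inferred from the reported support, precision and recall rather than printed by the notebook, so treat them as an assumption. The sketch below recomputes the fraud-class metrics from them:

```python
# confusion-matrix counts inferred from the report (an assumption, see text)
tn, fp, fn, tp = 89, 0, 19, 89

# share of fraud predictions that are actually fraud
fraud_precision = tp / (tp + fp)
# share of actual frauds that the model catches
fraud_recall = tp / (tp + fn)
# harmonic mean of precision and recall
fraud_f1 = 2 * fraud_precision * fraud_recall / (fraud_precision + fraud_recall)
# share of all test transactions classified correctly
overall_accuracy = (tp + tn) / (tp + tn + fp + fn)

for name, value in [('precision', fraud_precision), ('recall', fraud_recall),
                    ('f1', fraud_f1), ('accuracy', overall_accuracy)]:
    print('{} : {:.2f}%'.format(name, value * 100))
```

Under these counts, the 19 false negatives are fraudulent transactions that the model labelled as legitimate, which is usually the costlier kind of error in fraud detection.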



The model we have built can now be used to classify new transactions. We have to feed the model the same 30 features it was trained on (Time, V1 to V28 and Amount), scaled in the same way as the training data, and it will predict whether the transaction is legitimate or fraudulent. On our test data, the model was found to be around 90.36% accurate. There is scope for improvement, using more sophisticated machine learning algorithms and more elaborate resampling techniques.