<h1><center> Credit Card Fraud Detection </center></h1>
<h2><center> Part 1. Exploratory Data Analysis </center></h2>
<h2><center> Vinay Kumar </center></h2>

# Contents

- Introduction
- Visualizing individual features
- Relationships among the features

# 1. Introduction

In [None]:
# Importing necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import plotly.express as px

## Data

Source: https://www.kaggle.com/mlg-ulb/creditcardfraud

The dataset contains information on transactions made with credit cards by European cardholders over two days in September $2013$. It comprises a total of $284807$ transactions, of which $492$ were fraudulent. The dataset is therefore highly imbalanced: the positive class (fraudulent transactions) accounts for only $0.173\%$ of all transactions.
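
As a quick check of the imbalance figure, the percentage can be recomputed from the two counts quoted above. This is only arithmetic on those published totals, not a computation on the loaded data.

```python
# Counts quoted in the dataset description
n_total = 284807   # all transactions
n_fraud = 492      # fraudulent transactions

fraud_share = n_fraud / n_total
authentic_per_fraud = round((n_total - n_fraud) / n_fraud)

print(f"Fraud share: {100 * fraud_share:.3f}%")
print(f"Authentic transactions per fraudulent one: about {authentic_per_fraud}")
```

A ratio of this size is why accuracy alone would be a misleading measure for any classifier built on this data.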

For a given transaction, the feature **Time** represents the time (in seconds) elapsed between that transaction and the very first transaction in the dataset, **Amount** represents the amount of the transaction, and **Class** represents the status of the transaction with respect to authenticity. The class of an authentic (resp. fraudulent) transaction is taken to be $0$ (resp. $1$). The remaining variables (**V1** to **V28**) are obtained by applying a principal component analysis (PCA) transformation to original features that are not available due to confidentiality.
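
To illustrate what a PCA transformation does, here is a minimal sketch on synthetic data. The original features are confidential, so the matrix `X` and its mixing weights are made-up stand-ins. The exact preprocessing used by the dataset's providers is not known, and this is not a reproduction of it. The sketch shows only that principal component scores are mutually uncorrelated even when the input columns are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the confidential features:
# three correlated columns built from two latent factors plus noise.
latent = rng.normal(size=(1000, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.7]])
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 3))

# PCA via singular value decomposition of the centered data
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
scores = X_centered @ Vt.T   # component scores, analogous to V1, V2, ...

corr_original = np.corrcoef(X, rowvar=False)
corr_scores = np.corrcoef(scores, rowvar=False)

print(np.round(corr_original, 2))  # input columns are correlated
print(np.round(corr_scores, 2))    # component scores are uncorrelated
```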

In [None]:
# The dataset

data = pd.read_csv('../input/creditcardfraud/creditcard.csv')
data

In [None]:
# Statistical descriptions of the features

features = data.drop(['Class'], axis = 1)
features.describe()

## Objectives of the project

### Primary objective:

**Classification of transactions as authentic or fraudulent**. To be precise, given the data on **Time**, **Amount** and transformed features **V1** to **V28** for a particular transaction, our goal is to correctly classify the transaction as **authentic** or **fraudulent**. We employ different techniques to build classification models and compare them using various evaluation metrics.

### Secondary objectives:

Answering the following questions using machine learning and statistical tools and techniques.

- When a fraudulent transaction is made, is it followed soon by one or more such fraudulent transactions? In other words, do the attackers make consecutive fraudulent transactions in a short span of time?


- Is the amount of a fraudulent transaction generally larger than that of an authentic transaction?


- Is there any indication in the data that fraudulent transactions occur during high-transaction periods?


- It is seen from the data that the number of transactions is high in some time intervals and low in between. Is the occurrence of fraudulent transactions related to these time intervals?


- There are a few time points that exhibit a high number of fraudulent transactions. Is this due to a high number of total transactions or to some other reason?
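
One possible way to approach the first question is to look at the gaps between consecutive fraudulent transactions. The sketch below uses a small made-up table in place of the real data, and the `threshold` of 600 seconds is a hypothetical cutoff for "a short span of time". The dataset as described has no card identifier, so such gaps pool all cardholders together.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'Time':  [0, 30, 45, 400, 410, 415, 9000, 9500],
    'Class': [0,  1,  1,   0,   1,   0,    1,    0],
})

# Seconds elapsed between each fraudulent transaction and the previous one
fraud_times = toy.loc[toy['Class'] == 1, 'Time'].sort_values()
gaps = fraud_times.diff().dropna()

threshold = 600  # hypothetical cutoff, in seconds
share_close = (gaps <= threshold).mean()

print(gaps.tolist())
print(f"Share of gaps within {threshold} s: {share_close:.2f}")
```

On the real data, the same computation would use `data` in place of `toy`. Comparing the resulting gap distribution against gaps between randomly chosen transactions would indicate whether any clustering is specific to fraud.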

#### In this part we carry out exploratory data analysis for the features in the dataset.

# 2. Visualizing individual features

## Class

First we analyze the feature which is the main object of the study: The class variable, which indicates if a particular transaction is authentic or fraudulent.

In [None]:
# Splitting of the data by authenticity of transactions

data_authentic = data[data['Class'] == 0] # authentic transactions only
data_fraud = data[data['Class'] == 1] # fraud transactions only

# Class frequencies

class_label = ['Authentic', 'Fraud']
class_frequency = [len(data_authentic), len(data_fraud)]

fig1 = px.pie(values = class_frequency,
             names = class_label,
             title = 'Frequency comparison of authentic and fraudulent transactions',
             template = 'ggplot2'
            )
fig1.show()

It is evident that the data is extremely imbalanced with authentic transactions being the majority class and fraudulent transactions being the minority class. Next we analyze the frequency of transactions made over time elapsed starting from the first transaction.

## Time

In [None]:
# Transaction frequency over time

fig1 = px.histogram(data,
                   x = 'Time',
                   nbins = 200,
                   title = 'Distribution of transactions over time',
                   template = 'ggplot2'
                  )
fig1.show()

**Observation:** The number of transactions is particularly high in certain time intervals and low in between.
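
Since **Time** is measured in seconds and the data covers two days, one way to examine this pattern is to fold the elapsed time into a 24-hour cycle. The sketch below does this on a few made-up time values. The column names `hour_elapsed` and `hour_of_cycle` are illustrative. Because the clock time of the first transaction is not given, hour 0 here means "the first hour after the first transaction", not necessarily midnight.

```python
import pandas as pd

# Toy elapsed times in seconds (made up), spanning two 24-hour cycles
toy = pd.DataFrame({'Time': [0, 3599, 3600, 86399, 86400, 90000, 172799]})

toy['hour_elapsed'] = toy['Time'] // 3600        # whole hours since the first transaction
toy['hour_of_cycle'] = toy['hour_elapsed'] % 24  # position within a 24-hour cycle

counts = toy['hour_of_cycle'].value_counts().sort_index()
print(counts)
```

Applied to the real data, grouping by `hour_of_cycle` and `Class` would give hourly fraud rates that can be compared with the overall transaction volume.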

In [None]:
# Histogram for fraudulent transactions

fig1 = px.histogram(data_fraud,
                   x = 'Time',
                   nbins = 200,
                   title = 'Distribution of fraudulent transactions over time',
                   template = 'ggplot2'
                  )
fig1.show()

**Observation:** There are certain spikes in the data that indicate a high number of fraudulent transactions at certain time points.

Next we visualize the distribution of the transaction amount. It is seen from the data that this feature is strongly positively skewed. Hence we also use a logarithmic scale on the y-axis to obtain a more informative visualization.

## Amount

In [None]:
# Transaction amount

fig1 = px.histogram(data,
                   x = 'Amount',
                   nbins = 200,
                   title = 'Distribution of transaction amount',
                   #log_y = True,
                   template = 'ggplot2'
                  )
fig1.show()

fig2 = px.histogram(data,
                   x = 'Amount',
                   nbins = 200,
                   title = 'Distribution of transaction amount on logarithmic scale',
                   log_y = True,
                   template = 'ggplot2'
                  )
fig2.show()

The distribution remains strongly positively skewed even with a logarithmic y-axis, which motivates us to apply a log transformation to the amount data itself.

In [None]:
# Transaction amount after log transformation

# Zero amounts (if any) map to -inf under log2; suppress the resulting divide-by-zero warning
np.seterr(divide = 'ignore')

df = pd.DataFrame({'Class': data['Class'],
                   'log_amount': np.log2(data['Amount'])})

fig1 = px.histogram(df,
                   x = 'log_amount',
                   nbins = 200,
                   title = 'Distribution of transaction amount after log transformation',
                   #log_y = True,
                   template = 'ggplot2'
                  )
fig1.show()

Since this gives a more symmetric distribution, we are motivated to work with the transformed amount data, from which the original amounts can easily be recovered.
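
The following sketch, on a few made-up amounts, shows how the transformation can be inverted. The base-2 logarithm is undone by raising 2 to the transformed value, but it is defined only for positive amounts, and a zero amount becomes negative infinity. As a possible alternative (an assumption, not what this notebook uses), `np.log1p` keeps zero amounts finite and is inverted by `np.expm1`.

```python
import numpy as np
import pandas as pd

amount = pd.Series([0.0, 1.0, 7.5, 250.0, 20000.0])  # toy values

# log2, as in the cell above: only positive amounts have a finite image
positive = amount[amount > 0]
log2_amount = np.log2(positive)
recovered = 2.0 ** log2_amount

# Alternative: log(1 + x) is finite at zero and inverted by exp(x) - 1
log1p_amount = np.log1p(amount)
recovered_all = np.expm1(log1p_amount)

print(np.allclose(recovered, positive))
print(np.allclose(recovered_all, amount))
```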

In [None]:
# Visualizations of authentic and fraudulent transactions after log transformation

# Boolean fraud indicator for each transaction (True = fraudulent)
fraud_status = data['Class'].astype(bool).tolist()

fig1 = px.violin(df,
             x = 'Class',
             y = 'log_amount',
             color = fraud_status,
             title = 'Distribution of Amount for Authentic and Fraudulent transactions after log transformation',
             template = 'ggplot2'
            )
fig1.show()

It is clear from the plots that most of the large-amount transactions are authentic, which may be due to the extra security measures applied to high-amount transactions, in the form of multiple passwords and OTPs.
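
A numerical summary can complement the violin plot. The sketch below runs on a small made-up table, and the cutoff `large_cutoff` is a hypothetical threshold for a "large" amount. It compares per-class summary statistics of **Amount** and computes the share of fraudulent transactions among the large ones.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'Amount': [5.0, 12.0, 80.0, 1500.0, 3.0, 99.0, 1.0, 250.0],
    'Class':  [0,   0,    0,    0,      1,   1,    1,   0],
})

# Per-class count, median and maximum of the transaction amount
summary = toy.groupby('Class')['Amount'].agg(['count', 'median', 'max'])
print(summary)

# Share of fraudulent transactions among large-amount transactions
large_cutoff = 100.0  # hypothetical threshold
large = toy[toy['Amount'] >= large_cutoff]
fraud_share_large = large['Class'].mean()
print(f"Fraud share among amounts >= {large_cutoff}: {fraud_share_large:.2f}")
```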

## V1-V28

In [None]:
# Function to print histogram of a chosen feature

def hist(data, feature):
    fig1 = px.histogram(data,
                   x = feature,
                   nbins = 200,
                   title = 'Distribution of {}'.format(feature),
                   template = 'ggplot2'
                  )
    fig1.show()
    
# Function to print boxplot and violinplot of a chosen feature

def box_violin(data, feature):
    fig, (ax1, ax2) = plt.subplots(1,2, figsize=(16.1, 6))
    sns.boxplot(x = data[feature], ax = ax1)
    sns.violinplot(x = data[feature], ax = ax2)
    plt.show()
    
# Function to combine the above two functions

def eda(data, feature):
    hist(data, feature)
    box_violin(data, feature)

In [None]:
# List of features obtained by PCA transformation

features_pca = list(data.columns)
features_pca.remove('Time')
features_pca.remove('Amount')
features_pca.remove('Class')

In [None]:
# Visualizations and statistical descriptions of the features obtained by PCA transformation

for feature in features_pca:
    eda(data, feature)

# 3. Relationships among the features

First we analyze how the amount of transaction behaves with respect to time.

## Amount vs Time

In [None]:
fig1 = px.scatter(data,
                 x = 'Time',
                 y = 'Amount',
                 color = fraud_status,
                 #marginal_x = 'rug',
                 #marginal_y = 'rug',
                 #size = class_rescaled,
                 title = 'Amount vs Time',
                 template = 'ggplot2'
                )
fig1.show()

We split the scatterplot into two subplots, one for authentic transactions and the other for fraudulent transactions.

In [None]:
fig1 = px.scatter(data,
                 x = 'Time',
                 y = 'Amount', 
                 facet_col = fraud_status,
                 color = fraud_status,
                 title = 'Amount vs Time',
                 template = 'ggplot2'
                )
fig1.show()

Note that *facet_col=False* corresponds to the authentic transactions and *facet_col=True* corresponds to the fraudulent transactions. We zoom into the second subplot a bit to get a clearer picture.

In [None]:
# Amount vs Time for fraudulent transactions - scatterplot

fig1 = px.scatter(data_fraud,
                 x = 'Time',
                 y = 'Amount',
                 title = 'Amount vs Time for fraudulent transactions',
                 template = 'ggplot2'
                )
fig1.show()

In [None]:
print('Correlation coefficient between Time and Amount')
print('\n')
print('For all transactions: {}'.format(data['Time'].corr(data['Amount'])))
print('For authentic transactions: {}'.format(data_authentic['Time'].corr(data_authentic['Amount'])))
print('For fraudulent transactions: {}'.format(data_fraud['Time'].corr(data_fraud['Amount'])))

**Observation:** Time and Amount appear to be approximately uncorrelated, which is echoed even when authentic and fraudulent transactions are considered separately.

Next we examine bivariate scatterplots and linear relationships between certain pairs of feature variables, which exhibit contrasting correlation structures for authentic and fraudulent transactions. Such a phenomenon occurs for a number of pairs, but we analyze $5$ specific pairs among these for the sake of brevity.
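
The per-class correlations reported in this section all follow the same pattern, which could be wrapped in a small helper. The function name `corr_by_class` and the synthetic columns `x` and `y` below are illustrative assumptions. The toy data is constructed so that the two variables are unrelated in class 0 and strongly negatively related in class 1, mimicking the kind of contrast examined here.

```python
import numpy as np
import pandas as pd

def corr_by_class(df, x, y, label='Class'):
    """Pearson correlation of columns x and y, overall and within each class."""
    result = {'all': df[x].corr(df[y])}
    for value, group in df.groupby(label):
        result[f'class {value}'] = group[x].corr(group[y])
    return pd.Series(result)

# Synthetic data: y is pure noise in class 0 and close to -x in class 1
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
noise = rng.normal(size=n)
labels = np.repeat([0, 1], n // 2)
toy = pd.DataFrame({
    'Class': labels,
    'x': x,
    'y': np.where(labels == 1, -x + 0.3 * noise, noise),
})

result = corr_by_class(toy, 'x', 'y')
print(result)
```

With the real data, a call such as `corr_by_class(data, 'V1', 'V2')` would return the same three numbers that the print statements in the cells of this section produce.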

## V3 vs Time

In [None]:
fig1 = px.scatter(data,
                 x = 'Time',
                 y = 'V3', 
                 facet_col = fraud_status,
                 color = fraud_status,
                 title = 'V3 vs Time',
                 template = 'ggplot2'
                )
fig1.show()

In [None]:
print('Correlation coefficient between V3 and Time')
print('\n')
print('For all transactions: {}'.format(data['V3'].corr(data['Time'])))
print('For authentic transactions: {}'.format(data_authentic['V3'].corr(data_authentic['Time'])))
print('For fraudulent transactions: {}'.format(data_fraud['V3'].corr(data_fraud['Time'])))

#### Observations:

- V3 and Time have moderate negative correlation for authentic transactions.
- However, they have slightly positive correlation for fraudulent transactions.

## Amount vs V20

In [None]:
fig1 = px.scatter(data,
                 x = 'V20',
                 y = 'Amount', 
                 facet_col = fraud_status,
                 color = fraud_status,
                 title = 'Amount vs V20',
                 template = 'ggplot2'
                )
fig1.show()

In [None]:
print('Correlation coefficient between Amount and V20')
print('\n')
print('For all transactions: {}'.format(data['Amount'].corr(data['V20'])))
print('For authentic transactions: {}'.format(data_authentic['Amount'].corr(data_authentic['V20'])))
print('For fraudulent transactions: {}'.format(data_fraud['Amount'].corr(data_fraud['V20'])))

#### Observations:

- Amount and V20 have moderate positive correlation for authentic transactions.
- However, they are approximately uncorrelated for fraudulent transactions.

## V2 vs V1

In [None]:
fig1 = px.scatter(data,
                 x = 'V1',
                 y = 'V2', 
                 facet_col = fraud_status,
                 color = fraud_status,
                 title = 'V2 vs V1',
                 template = 'ggplot2'
                )
fig1.show()

In [None]:
print('Correlation coefficient between V1 and V2')
print('\n')
print('For all transactions: {}'.format(data['V1'].corr(data['V2'])))
print('For authentic transactions: {}'.format(data_authentic['V1'].corr(data_authentic['V2'])))
print('For fraudulent transactions: {}'.format(data_fraud['V1'].corr(data_fraud['V2'])))

#### Observations:

- V1 and V2 are approximately uncorrelated for authentic transactions.
- However, they have significant negative correlation for fraudulent transactions.

## V3 vs V2

In [None]:
fig1 = px.scatter(data,
                 x = 'V2',
                 y = 'V3', 
                 facet_col = fraud_status,
                 color = fraud_status,
                 title = 'V3 vs V2',
                 template = 'ggplot2'
                )
fig1.show()

In [None]:
print('Correlation coefficient between V2 and V3')
print('\n')
print('For all transactions: {}'.format(data['V2'].corr(data['V3'])))
print('For authentic transactions: {}'.format(data_authentic['V2'].corr(data_authentic['V3'])))
print('For fraudulent transactions: {}'.format(data_fraud['V2'].corr(data_fraud['V3'])))

#### Observations:

- V2 and V3 are approximately uncorrelated for authentic transactions.
- However, they have significant negative correlation for fraudulent transactions.

## V3 vs V1

In [None]:
fig1 = px.scatter(data,
                 x = 'V1',
                 y = 'V3', 
                 facet_col = fraud_status,
                 color = fraud_status,
                 title = 'V3 vs V1',
                 template = 'ggplot2'
                )
fig1.show()

In [None]:
print('Correlation coefficient between V1 and V3')
print('\n')
print('For all transactions: {}'.format(data['V1'].corr(data['V3'])))
print('For authentic transactions: {}'.format(data_authentic['V1'].corr(data_authentic['V3'])))
print('For fraudulent transactions: {}'.format(data_fraud['V1'].corr(data_fraud['V3'])))

#### Observations:

- V1 and V3 are approximately uncorrelated for authentic transactions.
- However, they have significant positive correlation for fraudulent transactions.

## Multicollinearity

We check for multicollinearity among the features using a heat map, which displays the correlation coefficient of each pair of features via color intensity.

In [None]:
# Heat map of the feature variables

features = data.drop(['Class'], axis = 1)

fig, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(features.corr(), center = 0, cmap = 'Blues')
ax.set_title('Heat map of the feature variables')

#### Observations:

- As expected, the PCA-engineered features are uncorrelated.
- There exist nonzero correlations between time and some PCA-engineered features, as well as between amount and some PCA-engineered features.

If one considers only the authentic transactions, then the overall structure of the heat map remains the same, although moderate changes are visible.

In [None]:
# Heat map of the feature variables for authentic transactions

features_authentic = data_authentic.drop(['Class'], axis = 1)

fig, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(features_authentic.corr(), center = 0, cmap = 'Blues')
ax.set_title('Heat map of the feature variables for authentic transactions')

However, the fraudulent transactions show a significantly different correlation structure among the features.

In [None]:
# Heat map of the feature variables for fraudulent transactions

features_fraud = data_fraud.drop(['Class'], axis = 1)

fig, ax = plt.subplots(figsize=(16, 12))
sns.heatmap(features_fraud.corr(), center = 0, cmap = 'Blues')
ax.set_title('Heat map of the feature variables for fraudulent transactions')
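
Differences between two heat maps can be hard to judge by eye. One option is to subtract the per-class correlation matrices and rank the feature pairs by absolute difference. The sketch below does this on synthetic data with hypothetical columns `A`, `B` and `C`, where a relationship between `A` and `B` is induced in class 1 only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 400
toy = pd.DataFrame(rng.normal(size=(n, 3)), columns=['A', 'B', 'C'])
toy['Class'] = np.repeat([0, 1], n // 2)

# Induce a strong A-B relationship in class 1 only
mask = toy['Class'] == 1
toy.loc[mask, 'B'] = toy.loc[mask, 'A'] + 0.2 * toy.loc[mask, 'B']

features = toy.drop(columns='Class')
corr_0 = features[toy['Class'] == 0].corr()
corr_1 = features[toy['Class'] == 1].corr()
diff = (corr_1 - corr_0).abs()

# Keep each unordered pair once (upper triangle, excluding the diagonal)
upper = np.triu(np.ones(diff.shape, dtype=bool), k=1)
pairs = diff.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs.head())
```

On the real data, the same steps applied to `data_authentic` and `data_fraud` would list the feature pairs whose correlation changes most between the two classes.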

Note that the analysis involving the transformed variables **V1-V28** does not reflect any relationship among the original variables from which they are engineered. We have included EDA for these variables for the sake of completeness.