<a href="https://www.kaggle.com/code/yaramahrous/creditcardfraud-eda?scriptVersionId=192695343" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')


In [None]:
df = pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv')

# Exploring the data

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isna().sum()


In [None]:
df['Class'].value_counts()

In [None]:
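# 'Time' holds seconds elapsed since the first transaction, so this is an hour bucket, not a wall-clock hour.
# It is derived before the split so that df_normal and df_fraud both carry the 'Hour' column.
df['Hour'] = np.ceil(df['Time'] / 3600) % 24

In [None]:
# .drop() returns a new frame, which avoids modifying a slice of df in place
df_normal = df[df['Class'] == 0].drop(columns='Class')
df_fraud = df[df['Class'] == 1].drop(columns='Class')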

In [None]:
df_fraud.describe()


In [None]:
df_normal.describe()

# Visualizing the data

In [None]:
class_labels = {0: 'Normal', 1: 'Fraud'}
class_counts = df['Class'].value_counts()
labels = [class_labels[i] for i in class_counts.index]
sizes = class_counts.values
fig, axes = plt.subplots(1, 2, figsize=(15, 7))


axes[0].pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90,)
axes[0].set_title('Class Distribution (Pie Chart)')


sns.countplot(data=df, x='Class', ax=axes[1])
axes[1].set_title('Class Distribution (Count Plot)')
axes[1].set_xlabel('Class')
axes[1].set_ylabel('Count')

axes[1].set_xticklabels(['Normal', 'Fraud'])

# Adjust layout
plt.tight_layout()


Normal transactions vastly outnumber fraudulent ones, so the dataset is heavily imbalanced.
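
To put a number on the imbalance, the share of each class and the normal-to-fraud ratio can be computed directly. The sketch below is only illustrative: `class_balance` is a hypothetical helper and `toy_labels` is a made-up label series, not the real data. On the notebook's data one would pass `df['Class']` instead.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each class label."""
    counts = labels.value_counts().sort_index()
    percent = (counts / counts.sum() * 100).round(3)
    return pd.DataFrame({"count": counts, "percent": percent})

# Toy stand-in for df['Class']: 998 normal rows and 2 fraud rows (invented numbers)
toy_labels = pd.Series([0] * 998 + [1] * 2, name="Class")

balance = class_balance(toy_labels)
imbalance_ratio = balance.loc[0, "count"] / balance.loc[1, "count"]

print(balance)
print(f"Normal-to-fraud ratio: {imbalance_ratio:.0f}:1")
```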

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 7))

sns.kdeplot(df_normal['Amount'],ax=axes[0],  color='b', label='Normal') 
axes[0].set_title('Amount Distribution (Normal Transactions)')
sns.kdeplot(df_fraud['Amount'], ax=axes[1], color='r', label='Fraud')   
axes[1].set_title('Amount Distribution (Fraud Transactions)')

In [None]:
df_normal['Amount'].skew()

In [None]:
df_fraud['Amount'].skew()

The amount distribution of normal transactions is far more skewed than that of fraud transactions.
One possible explanation is that fraudsters keep their amounts close to typical values to avoid raising suspicion.
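
For reference, `Series.skew()` reports the bias-adjusted sample skewness (the adjusted Fisher-Pearson coefficient); values well above 0 indicate a long right tail. The sketch below reproduces that figure by hand on invented amounts. `sample_skewness` and `toy_amounts` are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def sample_skewness(values: pd.Series) -> float:
    """Adjusted Fisher-Pearson skewness: sqrt(n(n-1)) / (n-2) * m3 / m2**1.5."""
    x = values.to_numpy(dtype=float)
    n = x.size
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)  # second central moment
    m3 = np.mean(dev ** 3)  # third central moment
    return float(np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5)

# Invented amounts with one large outlier, giving a long right tail
toy_amounts = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 50.0, 400.0])

manual = sample_skewness(toy_amounts)
print(manual, toy_amounts.skew())  # the two values should agree
```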

In [None]:
px.histogram(df_normal, x='Amount', nbins=100, title='Normal Transactions')

In [None]:
px.histogram(df_fraud, x='Amount', nbins=100, title='Fraud Transactions')

In [None]:
px.histogram(df_fraud, x='Hour', barmode='group', title='Fraud Transactions by Hour')

Fraud transactions are spread throughout the day, with peaks around hours 12-13 and 2-3.


In [None]:
px.histogram(df_normal, x='Hour', barmode='group', title='Normal Transactions by Hour')


Normal transactions start to peak after hour 10 and stay at that level until hour 23.
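
Because the two classes differ so much in size, raw counts per hour are hard to compare side by side. One option is to look at each class's share of transactions per hour. The sketch below assumes a frame with `Hour` and `Class` columns like `df` above; the `toy` values are invented for illustration.

```python
import pandas as pd

# Invented rows standing in for df[['Hour', 'Class']]
toy = pd.DataFrame({
    "Hour":  [2, 2, 12, 12, 12, 14, 14, 14, 14, 20],
    "Class": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Column-normalised crosstab: each class column sums to 1
hourly_share = pd.crosstab(toy["Hour"], toy["Class"], normalize="columns")

# Hour with the largest share, per class
peak_hour = hourly_share.idxmax()

print(hourly_share.round(2))
print(peak_hour)
```

On the real data, `pd.crosstab(df['Hour'], df['Class'], normalize='columns')` would give the equivalent table.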

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle('Amount vs Hour')

ax[0].scatter(df_fraud.Amount, df_fraud.Hour, alpha=0.5)
ax[0].set_title('Fraud')
ax[0].set_xlabel('Amount')
ax[0].set_ylabel('Hour')

ax[1].scatter(df_normal.Amount, df_normal.Hour, alpha=0.5)
ax[1].set_title('Normal')
ax[1].set_xlabel('Amount')
ax[1].set_ylabel('Hour')

Most fraud transactions have an amount below 500, with a peak at hour 12.
Most normal transactions have an amount below 5000, with a peak at hour 14.
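
Statements about where most amounts fall can be checked numerically rather than read off the scatter plot. A minimal sketch, with a hypothetical helper `share_below` and invented amounts:

```python
import pandas as pd

def share_below(amounts: pd.Series, threshold: float) -> float:
    """Fraction of values strictly below the threshold."""
    return float((amounts < threshold).mean())

# Invented amounts standing in for df_fraud['Amount']
toy_fraud_amounts = pd.Series([1.0, 99.99, 250.0, 480.0, 1200.0])

fraction = share_below(toy_fraud_amounts, 500)
print(fraction)                                 # 4 of the 5 toy values are below 500
print(toy_fraud_amounts.quantile([0.5, 0.95]))  # median and 95th percentile
```

On the notebook's data the same check would be `share_below(df_fraud['Amount'], 500)` and `share_below(df_normal['Amount'], 5000)`.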

In [None]:
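# Seconds elapsed since the previous transaction of the same class.
# assign() returns a new frame, so no slice of df is modified in place.
df_fraud = df_fraud.assign(Time_diff=df_fraud['Time'].diff().fillna(0))
df_normal = df_normal.assign(Time_diff=df_normal['Time'].diff().fillna(0))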

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle('Time Diff')

sns.histplot(df_fraud['Time_diff'], kde=True, bins=100,ax=ax[0])    
ax[0].set_title('Histogram of Time Diff for Fraud Transactions')
# 'Time' is recorded in seconds, so the differences are in seconds as well
ax[0].set_xlabel('Time Diff in seconds')
ax[0].set_ylabel('Frequency')

sns.histplot(df_normal['Time_diff'], kde=True, bins=100, ax=ax[1])
ax[1].set_title('Histogram of Time Diff for Normal Transactions')
ax[1].set_xlabel('Time Diff in seconds')
ax[1].set_ylabel('Frequency')

The time difference between consecutive normal transactions is much smaller than that between consecutive fraud transactions.

This is expected, since normal transactions are far more frequent than fraud ones. Fraudsters may also spread their transactions over different times to avoid getting caught.
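
`Time` in this dataset counts seconds elapsed since the first transaction, so `diff()` yields gaps in seconds. The sketch below, on invented timestamps, compares the median gap per class and illustrates why a rarer class necessarily shows larger gaps over the same time span. The `toy` frame is an assumption, not real data.

```python
import pandas as pd

# Invented timestamps (seconds) with two rare "fraud" rows
toy = pd.DataFrame({
    "Time":  [0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
    "Class": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
})

# Gap to the previous transaction of the same class, in seconds
toy["Time_diff"] = toy.groupby("Class")["Time"].diff()

# The first row of each class has no predecessor (NaN) and is skipped by median()
median_gap = toy.groupby("Class")["Time_diff"].median()
print(median_gap)
```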

In [None]:
px.imshow(df_fraud.corr(), title='Fraud Transactions Correlation Matrix',text_auto=True, width=800, height=800)

In fraud transactions, there are very strong positive correlations between (V1 & V3), (V1 & V5), (V12 & V17), (V16 & V18), (V16 & V17), and (V17 & V18).

There are also very strong negative correlations between (V1 & V2), (V2 & V3), (V2 & V5), (V3 & V7), (V11 & V14), (V11 & V12), and (V21 & V22).
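
Reading pairs off a large heatmap is error-prone, so they can also be listed programmatically. The sketch below uses a hypothetical helper `strong_pairs` on a small synthetic frame. The 0.8 cutoff is an arbitrary assumption, not a threshold taken from the analysis above.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells), if any survive stacking, fail this comparison and are dropped
    return pairs[pairs.abs() > threshold].sort_values()

x = np.arange(10, dtype=float)
toy = pd.DataFrame({
    "A": x,
    "B": 2 * x + 1,                             # rises with A
    "C": -x,                                    # falls as A rises
    "D": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],   # alternating signal, weakly related to A
})

pairs = strong_pairs(toy)
print(pairs)
```

On the notebook's data, `strong_pairs(df_fraud)` would list the candidate pairs for the fraud subset.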


In [None]:
columns_to_plot = [col for col in df_fraud.columns if col not in ['Amount', 'Time', 'Hour', 'Time_diff']]
n_cols = 3  # Number of columns for subplots
n_rows = int(np.ceil(len(columns_to_plot) / n_cols))  # Number of rows for subplots

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))
axes = axes.flatten()

for i, column in enumerate(columns_to_plot):
    sns.kdeplot(data=df_fraud, x=column,  ax=axes[i], color='red',label='Fraud')
    sns.kdeplot(data=df_normal, x=column, ax=axes[i], color='blue',label='Normal')
    # kdeplot draws a density estimate, not a histogram of counts
    axes[i].set_title(f'Density of {column}')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Density')
    axes[i].legend()

for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()