## 1. Introduction
> As online payments grow in popularity, the risk of fraudulent transactions rises with them, posing a significant financial threat to both businesses and consumers. The objective of this analysis is to identify the patterns that distinguish fraudulent from non-fraudulent payments, so that customers can better judge the legitimacy of their transactions.

## 2. Import Needed Libraries and Dataset
> The dataset comes from Kaggle and contains historical information about fraudulent transactions that can be used to detect fraud in online payments. Link to source: https://www.kaggle.com/datasets/jainilcoder/online-payment-fraud-detection.
> The dataset has the following columns:
> * step: Unit of time, where 1 step equals 1 hour
> * type: Type of online transaction
> * amount: Amount of the transaction
> * nameOrig: Identifier of the customer who starts the transaction (a code rather than a real name)
> * oldbalanceOrg: Balance of the sender before the transaction
> * newbalanceOrig: Balance of the sender after the transaction
> * nameDest: Identifier of the recipient (destination)
> * oldbalanceDest: Balance of the recipient before the transaction
> * newbalanceDest: Balance of the recipient after the transaction
> * isFraud: Whether the transaction is fraudulent (1) or not (0)
> * isFlaggedFraud: Whether the system flagged the transaction as fraud (1) or not (0)

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [15]:
df = pd.read_csv("onlinefraud.csv")

## 3. Data Cleaning

In [16]:
# Count the missing values in each column
df.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

> * There are no missing values

In [17]:
# Show dataset information to check the data type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


> * Columns holding monetary amounts are of type float64
> * Columns holding names, along with the transaction type, are of type object
> * The binary flag columns (isFraud, isFlaggedFraud) are int64, as is step, which counts hours
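
> The data types listed above are pandas defaults, and the frame occupies more than 500 MB. As a minimal sketch, assuming memory becomes a concern, narrower types could be assigned to some columns. The helper name `compact_dtypes` and the small stand-in frame are hypothetical and not part of the original notebook; the choice of `int16` relies on `step` staying below 32,767 (its maximum in this dataset is 743).

```python
import pandas as pd

# Small stand-in frame with illustrative values (not the full dataset).
sample = pd.DataFrame({
    "step": [1, 1, 1],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER"],
    "amount": [9839.64, 1864.28, 181.0],
    "isFraud": [0, 0, 1],
    "isFlaggedFraud": [0, 0, 0],
})

def compact_dtypes(frame):
    """Return a copy of the frame with narrower dtypes (hypothetical helper)."""
    out = frame.copy()
    # Few distinct transaction types, so a categorical column is a natural fit.
    out["type"] = out["type"].astype("category")
    # step counts hours; assumed to fit in a 16-bit integer.
    out["step"] = out["step"].astype("int16")
    # The flag columns only take the values 0 and 1.
    for col in ("isFraud", "isFlaggedFraud"):
        out[col] = out[col].astype("int8")
    return out

compact = compact_dtypes(sample)
print(compact.dtypes)
```

> On the full frame, the same function could be applied as `df = compact_dtypes(df)`; comparing `df.memory_usage(deep=True).sum()` before and after would show whether the change is worthwhile.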

In [18]:
print("Number of duplicate rows in df:", df.duplicated().sum())

Number of duplicate rows in df: 0


> * There are no duplicate rows

> ### Data cleaning process is done:
> * There are no missing values.
> * All columns have appropriate data types.
> * There are no duplicate rows.
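
> The three checks in this section can be gathered into one reusable function. The following is only a sketch: `cleaning_report` is a hypothetical helper, not something the original notebook defines, and the demo frame is made up so that it contains one missing value and one duplicate row.

```python
import pandas as pd

def cleaning_report(frame):
    """Summarize missing values, duplicate rows and dtypes (hypothetical helper)."""
    return {
        # Total number of missing cells across all columns.
        "missing_values": int(frame.isnull().sum().sum()),
        # Rows that exactly repeat an earlier row.
        "duplicate_rows": int(frame.duplicated().sum()),
        # Data type of each column, as text for easy inspection.
        "dtypes": {col: str(dtype) for col, dtype in frame.dtypes.items()},
    }

# Made-up demo data: the second row repeats the first, the third has a missing amount.
demo = pd.DataFrame({
    "amount": [181.0, 181.0, None],
    "type": ["TRANSFER", "TRANSFER", "PAYMENT"],
})

report = cleaning_report(demo)
print(report)
```

> Calling `cleaning_report(df)` on the dataset loaded earlier should report zero missing values and zero duplicate rows, in line with the outputs shown in this section.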

## 4. Exploratory Data Analysis (EDA)

In [19]:
# Read the first 5 rows to explore the data
df.head()

|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |
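
> The column descriptions suggest a simple sanity check: the sender's balance should fall by the transaction amount, and the recipient's balance should rise by it. The sketch below applies that check to the five rows shown by `df.head()`, retyped by hand; the column names `orig_consistent` and `dest_consistent` are hypothetical additions. It says nothing about the rest of the dataset.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(), restricted to the balance-related columns.
head = pd.DataFrame({
    "amount": [9839.64, 1864.28, 181.0, 181.0, 11668.14],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "newbalanceOrig": [160296.36, 19384.72, 0.0, 0.0, 29885.86],
    "oldbalanceDest": [0.0, 0.0, 0.0, 21182.0, 0.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# Sender side: old balance minus amount should equal the new balance.
# np.isclose avoids false mismatches caused by floating-point rounding.
head["orig_consistent"] = np.isclose(
    head["oldbalanceOrg"] - head["amount"], head["newbalanceOrig"]
)

# Recipient side: old balance plus amount should equal the new balance.
head["dest_consistent"] = np.isclose(
    head["oldbalanceDest"] + head["amount"], head["newbalanceDest"]
)

print(head[["orig_consistent", "dest_consistent"]])
```

> In these five rows the sender balances all reconcile with the amount, while none of the recipient balances do. Whether that pattern holds more broadly, and whether it relates to fraud, would need to be checked on the full frame.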


In [20]:
# Get the shape (rows, columns) of the dataset
df.shape

(6362620, 11)

In [7]:
# Describe the dataset
df.describe()

|   | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 |
| mean | 243.3972 | 179861.9 | 833883.1 | 855113.7 | 1100702.0 | 1224996.0 | 0.00129082 | 2.514687e-06 |
| std | 142.332 | 603858.2 | 2888243.0 | 2924049.0 | 3399180.0 | 3674129.0 | 0.0359048 | 0.001585775 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 156.0 | 13389.57 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 239.0 | 74871.94 | 14208.0 | 0.0 | 132705.7 | 214661.4 | 0.0 | 0.0 |
| 75% | 335.0 | 208721.5 | 107315.2 | 144258.4 | 943036.7 | 1111909.0 | 0.0 | 0.0 |
| max | 743.0 | 92445520.0 | 59585040.0 | 49585040.0 | 356015900.0 | 356179300.0 | 1.0 | 1.0 |
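
> Because `isFraud` and `isFlaggedFraud` only take the values 0 and 1, their means in the summary are the share of rows equal to 1. As a rough worked example, the approximate number of fraudulent and flagged transactions can be recovered from those means and the row count. The variable names are illustrative, and the results are approximate because the printed means are rounded.

```python
# Values copied from the df.describe() output.
n_rows = 6_362_620
fraud_rate = 0.00129082        # mean of isFraud
flagged_rate = 2.514687e-06    # mean of isFlaggedFraud

# For a 0/1 column, mean * row count gives the number of ones.
approx_fraud = round(n_rows * fraud_rate)
approx_flagged = round(n_rows * flagged_rate)

print(f"Fraud share: {fraud_rate:.4%}")
print(f"Approximate fraudulent transactions: {approx_fraud}")
print(f"Approximate flagged transactions: {approx_flagged}")
```

> This gives roughly 8,213 fraudulent transactions (about 0.13% of all rows) and about 16 flagged ones, so the classes are heavily imbalanced. Exact counts can be read directly from the data with `df["isFraud"].value_counts()`.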


## 5. Data Mining / Data Modelling