# Financial Fraud Detection

### The Data
The data used for this analysis is a synthetic dataset of digital transactions generated with a simulator called PaySim. PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs of a mobile money service operating in an African country. It aggregates anonymized data from the private dataset to generate a synthetic dataset and then injects fraudulent transactions.<br><br>
View data here: https://www.kaggle.com/datasets/ealaxi/paysim1<br><br>
Class variable: `isFraud`

In [74]:
import pandas as pd
from IPython.display import display

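# Load the PaySim transactions
df = pd.read_csv('fraud-detection-syn-dataset.csv')
# Drop `isFlaggedFraud`; `isFraud` is the class variable used in this analysis
del df['isFlaggedFraud']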

display(df)

,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0
...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1
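
Since `isFlaggedFraud` is dropped right after loading, one alternative is to skip the column at read time. The sketch below is an illustration, not the method used in this notebook. It assumes a small in-memory CSV (`sample_csv`, a hypothetical stand-in built from the first rows shown above plus an assumed `isFlaggedFraud` column of zeros) and passes a callable to the `usecols` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: three rows from the output above,
# with an assumed isFlaggedFraud column appended for illustration only.
sample_csv = """step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
"""

# usecols accepts a callable that is evaluated against each column name;
# only columns for which it returns True are parsed.
sample_df = pd.read_csv(
    io.StringIO(sample_csv),
    usecols=lambda col: col != 'isFlaggedFraud',
)

print(list(sample_df.columns))
```

On a file with millions of rows this avoids parsing a column that is discarded anyway, although whether the saving matters here is untested.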


### Data Cleaning

In [75]:
display(df.dtypes)

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
dtype: object

The class variable `isFraud` is stored as `int64`. We convert it to `object` so that it is treated as a categorical label rather than a numeric quantity in the summaries that follow.

In [97]:
df['isFraud'] = df['isFraud'].astype('object')
display(df.dtypes.tail(1))

isFraud    object
dtype: object
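
To see why the conversion matters, the following minimal sketch uses a toy frame (`toy`, with hypothetical values, not the PaySim data) and compares which columns `describe()` reports before and after the cast. It assumes only standard pandas behaviour: the default `describe()` on a mixed frame covers numeric columns, and `include="object"` covers object columns.

```python
import pandas as pd

# Toy frame with hypothetical values; only the dtypes matter here.
toy = pd.DataFrame({
    'amount': [181.0, 9839.64, 1864.28, 181.0],
    'isFraud': [1, 0, 0, 1],
})

# While isFraud is int64, describe() summarises it like any other number.
numeric_before = list(toy.describe().columns)

toy['isFraud'] = toy['isFraud'].astype('object')

# After the cast, the default numeric summary no longer includes the label...
numeric_after = list(toy.describe().columns)
# ...and the object summary picks it up instead.
categorical_after = list(toy.describe(include='object').columns)

print(numeric_before, numeric_after, categorical_after)
```

Casting to pandas' `category` dtype would be another way to mark the column as a label; the notebook uses `object`, so the sketch does the same.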

Next, we compute summary statistics for this pandas `DataFrame` in two parts:
1. Summary of numerical attributes:

In [90]:
# Summarise numeric columns at the quartiles (these are also the pandas defaults)
desc_df_num = df.describe(percentiles=[.25, .5, .75])
display(desc_df_num)

,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest
count,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0
mean,243.3972,179861.9,833883.1,855113.7,1100702.0,1224996.0
std,142.332,603858.2,2888243.0,2924049.0,3399180.0,3674129.0
min,1.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13389.57,0.0,0.0,0.0,0.0
50%,239.0,74871.94,14208.0,0.0,132705.7,214661.4
75%,335.0,208721.5,107315.2,144258.4,943036.7,1111909.0
max,743.0,92445520.0,59585040.0,49585040.0,356015900.0,356179300.0
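
In the summary above, the mean `amount` (179861.9) is well above the median (74871.94) and the maximum is orders of magnitude larger, which points to a long right tail. If a closer look at that tail is wanted, the `percentiles` argument of `describe()` accepts other cut points. The sketch below is illustrative only and uses a toy frame (`toy_amounts`, hypothetical values) rather than the real data.

```python
import pandas as pd

# Hypothetical amounts with one very large value to mimic a long right tail.
toy_amounts = pd.DataFrame({
    'amount': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 10000.0],
})

# Request the median plus two upper-tail percentiles instead of the quartiles.
tail_summary = toy_amounts.describe(percentiles=[.5, .9, .99])

print(tail_summary)
```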


2. Summary of categorical columns:<br><br>
For these columns, only the count, the number of unique values, and the mode (`top`, with its frequency `freq`) are meaningful.

In [91]:
desc_df_cat = df.describe(include="object")
display(desc_df_cat)

,type,nameOrig,nameDest,isFraud
count,6362620,6362620,6362620,6362620
unique,5,6353307,2722362,2
top,CASH_OUT,C1902386530,C1286084959,0
freq,2237500,3,113,6354407
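
The `isFraud` column of this summary already shows how imbalanced the classes are: there are two unique values and the most common one (0) accounts for 6354407 of the 6362620 rows. A short calculation, using only the counts copied from the table above and assuming every row carries a label (as the `count` row indicates), gives the number and share of fraudulent transactions.

```python
# Counts copied from the categorical summary above.
total_rows = 6362620
non_fraud_rows = 6354407  # freq of the most common isFraud value (0)

# With only two classes, the remaining rows are the fraud class.
fraud_rows = total_rows - non_fraud_rows
fraud_share = fraud_rows / total_rows

print(fraud_rows)              # 8213
print(f"{fraud_share:.4%}")    # 0.1291%
```

Roughly 13 in every 10,000 transactions are labelled as fraud, so any later modelling or evaluation has to account for this imbalance.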
