# Initial EDA

I aimed to gain a comprehensive understanding of the financial fraud dataset. Each row describes one transaction: its amount, type, origin and destination account balances, and two fraud indicators (`isFraud` and `isFlaggedFraud`). The 10,000-row sample loaded below has no missing values, and only 68 of its transactions (0.68%) are labeled as fraud. Through univariate analysis, I explored the distribution of transaction amounts, which is wide and right-skewed: amounts range from 2.39 to 10,000,000, with a median of 12,858.74 and a mean of 103,546.7.

Subsequent steps will involve a more in-depth examination of transaction types and their relation to fraud rates, as well as exploring feature engineering possibilities to enhance predictive capabilities.


In [20]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

file_path = '../data/PS_20174392719_1491204439457_log.csv'
# Load only the first 10,000 rows of the CSV file for a quick first look
df = pd.read_csv(file_path, nrows=10000)

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   step            10000 non-null  int64  
 1   type            10000 non-null  object 
 2   amount          10000 non-null  float64
 3   nameOrig        10000 non-null  object 
 4   oldbalanceOrg   10000 non-null  float64
 5   newbalanceOrig  10000 non-null  float64
 6   nameDest        10000 non-null  object 
 7   oldbalanceDest  10000 non-null  float64
 8   newbalanceDest  10000 non-null  float64
 9   isFraud         10000 non-null  int64  
 10  isFlaggedFraud  10000 non-null  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 859.5+ KB
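
The `Non-Null Count` column above already shows 10,000 non-null values for every column. As a minimal sketch of how the same check could be made explicit, the hypothetical helper `missing_report` below lists only the columns that contain missing values; the toy frame is invented for illustration and is not part of the dataset.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]

# Toy frame (hypothetical) with one missing amount, to show what a non-empty report looks like
toy = pd.DataFrame({
    "amount": [181.0, None, 9839.64],
    "type": ["TRANSFER", "PAYMENT", "PAYMENT"],
})
report = missing_report(toy)
print(report)
```

Calling `missing_report(df)` on the sample loaded above should return an empty series if the `df.info()` output holds.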


In [29]:
df.describe()

,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,4.1789,103546.7,893933.0,915274.1,934275.8,1096606.0,0.0068,0.0
std,2.479821,266307.2,2135683.0,2181428.0,2676340.0,3014496.0,0.082185,0.0
min,1.0,2.39,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,4397.53,127.6875,0.0,0.0,0.0,0.0,0.0
50%,5.0,12858.74,21375.56,10349.94,0.0,0.0,0.0,0.0
75%,7.0,114382.5,178271.9,176093.4,283106.7,252055.2,0.0,0.0
max,7.0,10000000.0,12930420.0,13010500.0,19516120.0,19169200.0,1.0,0.0


In [21]:
df.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')

In [22]:
df['step'].value_counts()

step
7    2836
1    2708
6    1660
2    1014
5     665
4     565
3     552
Name: count, dtype: int64

In [23]:
df['amount'].value_counts()

amount
25975.86     3
5580.15      2
60726.57     2
35063.63     2
963532.14    2
            ..
4189.91      1
6621.75      1
49367.62     1
162724.99    1
5096.16      1
Name: count, Length: 9954, dtype: int64
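
Because `amount` is nearly continuous (9,954 distinct values in 10,000 rows), `value_counts()` says little about its distribution. One possible alternative, sketched here under my own assumptions, is to count transactions per order of magnitude. The helper name `amount_magnitude_counts` is hypothetical, and the toy values are a handful of amounts copied from the outputs in this notebook.

```python
import numpy as np
import pandas as pd

def amount_magnitude_counts(amount: pd.Series) -> pd.Series:
    """Count values per power of ten (0 -> [1, 10), 3 -> [1000, 10000), ...)."""
    # Clip at 1 so that amounts below 1 do not produce negative or infinite logs
    magnitude = np.floor(np.log10(amount.clip(lower=1))).astype(int)
    return magnitude.value_counts().sort_index()

# A few amounts taken from the outputs in this notebook
toy_amounts = pd.Series([2.39, 181.0, 1864.28, 9839.64, 11668.14, 581294.26, 710544.77])
counts = amount_magnitude_counts(toy_amounts)
print(counts)
```

On the real data this would be called as `amount_magnitude_counts(df['amount'])`.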

In [24]:
df['type'].value_counts()

type
PAYMENT     5465
CASH_IN     1949
CASH_OUT    1321
TRANSFER     921
DEBIT        344
Name: count, dtype: int64

In [25]:
df['nameOrig'].value_counts()

nameOrig
C1231006815    1
C891268602     1
C328246293     1
C979049207     1
C1228068224    1
              ..
C1504912697    1
C1531409183    1
C1086508626    1
C1621615881    1
C299358529     1
Name: count, Length: 10000, dtype: int64

In [26]:
df['isFraud'].value_counts()

isFraud
0    9932
1      68
Name: count, dtype: int64
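
With 68 fraud cases out of 10,000 rows, the classes are heavily imbalanced. The sketch below rebuilds a label series with the same counts as the output above and shows why plain accuracy would be a weak metric here: a model that always predicts "not fraud" already scores very high. The helper name `fraud_rate` is my own.

```python
import pandas as pd

def fraud_rate(labels: pd.Series) -> float:
    """Share of rows labeled 1 (fraud) in a 0/1 series."""
    return float(labels.mean())

# Stand-in series with the same class counts as the output above (9932 vs 68)
is_fraud = pd.Series([0] * 9932 + [1] * 68, name="isFraud")

rate = fraud_rate(is_fraud)
baseline_accuracy = 1 - rate  # accuracy of always predicting "not fraud"
print(f"fraud rate: {rate:.4f}, majority-class accuracy: {baseline_accuracy:.4f}")
```

This is one reason to look at `f1_score` (imported above) rather than `accuracy_score` alone when evaluating a model on this data.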

In [27]:
df['isFlaggedFraud'].value_counts()  # call the method; without () pandas only prints the bound method

isFlaggedFraud
0    10000
Name: count, dtype: int64

In [None]:
df.head(100)

,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.0,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.0,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
95,1,TRANSFER,710544.77,C835773569,0.0,0.00,C1359044626,738531.50,16518.36,0,0
96,1,TRANSFER,581294.26,C843299092,0.0,0.00,C1590550415,5195482.15,19169204.93,0,0
97,1,TRANSFER,11996.58,C605982374,0.0,0.00,C1225616405,40255.00,0.00,0,0
98,1,PAYMENT,2875.10,C1412322831,15443.0,12567.90,M1651262695,0.00,0.00,0,0


In [None]:
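# isFlaggedFraud is constant (all zeros) in this sample, and nameOrig / nameDest are
# account identifiers, so these three columns carry no usable signal and can be dropped.
df['isFlaggedFraud'].value_counts()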

isFlaggedFraud
0    10000
Name: count, dtype: int64
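
A minimal sketch of that drop step, assuming the rule is "remove identifier columns and any column with a single distinct value". The helper `drop_uninformative` and its `id_columns` parameter are hypothetical names of mine; the toy rows reuse the first three rows of `df.head(100)`.

```python
import pandas as pd

def drop_uninformative(frame: pd.DataFrame, id_columns=("nameOrig", "nameDest")) -> pd.DataFrame:
    """Drop identifier columns and columns that hold a single distinct value."""
    constant = {col for col in frame.columns if frame[col].nunique(dropna=False) <= 1}
    to_drop = [col for col in frame.columns if col in constant or col in id_columns]
    return frame.drop(columns=to_drop)

# First three rows of the sample, restricted to a few columns
toy = pd.DataFrame({
    "amount": [9839.64, 1864.28, 181.0],
    "nameOrig": ["C1231006815", "C1666544295", "C1305486145"],
    "nameDest": ["M1979787155", "M2044282225", "C553264065"],
    "isFraud": [0, 0, 1],
    "isFlaggedFraud": [0, 0, 0],
})
reduced = drop_uninformative(toy)
print(list(reduced.columns))
```

Note that `isFlaggedFraud` is only known to be constant in the first 10,000 rows; the check should be repeated if more of the file is loaded.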

### Hypothesis:

The amount of a transaction (`amount`) might be a significant predictor of fraudulent activities (`isFraud`). Higher transaction amounts may be associated with an increased likelihood of fraud. Additionally, there may be a correlation between the transaction type (`type`) and fraudulent transactions, suggesting that certain types of transactions are more prone to fraudulent activities.
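
The second part of the hypothesis can be checked with a per-type fraud rate. The following is a sketch rather than the analysis actually run in this notebook; `fraud_rate_by_type` is a hypothetical helper, and the toy rows are six rows copied from the `df.head(100)` output.

```python
import pandas as pd

def fraud_rate_by_type(frame: pd.DataFrame) -> pd.DataFrame:
    """Transaction count, fraud count and fraud rate for each transaction type."""
    return (
        frame.groupby("type")["isFraud"]
        .agg(transactions="count", frauds="sum", fraud_rate="mean")
        .sort_values("fraud_rate", ascending=False)
    )

# Six rows taken from the df.head(100) output (rows 0-4 and 95)
toy = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER"],
    "isFraud": [0, 0, 1, 1, 0, 0],
})
by_type = fraud_rate_by_type(toy)
print(by_type)
```

Six rows are far too few to draw conclusions; the point is only the shape of the computation, which would be applied as `fraud_rate_by_type(df)`.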

### Rationale:

1. The univariate analysis indicates that transaction amounts vary widely (from 2.39 to 10,000,000 in this sample, with a median of 12,858.74), so `amount` has enough spread to separate fraudulent from legitimate transactions if the two groups differ.
2. The bivariate analysis uses a boxplot of `amount` grouped by `isFraud`, in which fraudulent transactions (`isFraud` = 1) tend to have higher amounts than non-fraudulent ones (`isFraud` = 0).
3. The correlation heatmap covers `amount` and `isFraud`; a positive coefficient there would support the hypothesis, although it only captures a linear relationship.
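
Points 2 and 3 can also be checked numerically. The sketch below computes the quartiles a boxplot would draw for each class, plus the Pearson correlation between `amount` and `isFraud`. Both helper names are hypothetical, and the toy amounts are invented purely to exercise the code; they are not results from the dataset.

```python
import pandas as pd

def amount_summary_by_class(frame: pd.DataFrame) -> pd.DataFrame:
    """Quartiles and maximum of `amount` for each value of `isFraud`."""
    summary = frame.groupby("isFraud")["amount"].describe()
    return summary[["count", "25%", "50%", "75%", "max"]]

def amount_fraud_correlation(frame: pd.DataFrame) -> float:
    """Pearson correlation between `amount` and the 0/1 `isFraud` label."""
    return float(frame["amount"].corr(frame["isFraud"]))

# Invented toy data (not from the dataset): fraud rows carry the larger amounts
toy = pd.DataFrame({
    "amount": [100.0, 200.0, 300.0, 400.0, 5000.0, 6000.0],
    "isFraud": [0, 0, 0, 0, 1, 1],
})
summary = amount_summary_by_class(toy)
corr = amount_fraud_correlation(toy)
print(summary)
print(f"correlation: {corr:.3f}")
```

Because `amount` is strongly right-skewed, comparing medians per class is likely to be more robust than relying on the correlation coefficient alone.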