# Fraud Transactions Analysis  
## Step 2 - Data Cleaning

Objectives:
- Load parquet dataset
- Check missing values
- Check duplicates
- Validate balances
- Create first derived features
- Save cleaned dataset

In [2]:
import polars as pl

In [3]:
df = pl.read_parquet("../data/transactions.parquet")
df.shape

(6362620, 11)

In [4]:
df.head()

step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
i64,str,f64,str,f64,f64,str,f64,f64,i64,i64
1,"PAYMENT",9839.64,"C1231006815",170136.0,160296.36,"M1979787155",0.0,0.0,0,0
1,"PAYMENT",1864.28,"C1666544295",21249.0,19384.72,"M2044282225",0.0,0.0,0,0
1,"TRANSFER",181.0,"C1305486145",181.0,0.0,"C553264065",0.0,0.0,1,0
1,"CASH_OUT",181.0,"C840083671",181.0,0.0,"C38997010",21182.0,0.0,1,0
1,"PAYMENT",11668.14,"C2048537720",41554.0,29885.86,"M1230701703",0.0,0.0,0,0


In [5]:
df.null_count()

step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0


Observation:
No missing values detected.

In [9]:
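# Row count for reference; pl.len() replaces pl.count(), deprecated since Polars 0.20.5.
df.select(pl.len()).item()
# Distinct values per column, in column order (only this last expression is displayed).
df.select(pl.all().n_unique()).row(0)
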

(743, 5, 5316900, 6353307, 1845844, 2682586, 2722362, 3614697, 3555499, 2, 2)

In [10]:
# Drop fully duplicated rows (row order is not preserved unless maintain_order=True).
df = df.unique()


In [11]:
numeric_cols = [
    "amount",
    "oldbalanceOrg",
    "newbalanceOrig",
    "oldbalanceDest",
    "newbalanceDest"
]

# Count rows with a negative value in each monetary column.
for col in numeric_cols:
    print(col, df.filter(pl.col(col) < 0).height)


amount 0
oldbalanceOrg 0
newbalanceOrig 0
oldbalanceDest 0
newbalanceDest 0


Observation:
No negative values detected.

In [12]:
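# Both differences are old balance minus new balance:
# positive = balance decreased, negative = balance increased.
df = df.with_columns([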
    (pl.col("oldbalanceOrg") - pl.col("newbalanceOrig")).alias("balance_diff_orig"),
    (pl.col("oldbalanceDest") - pl.col("newbalanceDest")).alias("balance_diff_dest")
])

In [13]:
df.select([
    pl.mean("balance_diff_orig"),
    pl.mean("balance_diff_dest")
])

balance_diff_orig,balance_diff_dest
f64,f64
-21230.564504,-124294.731682


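Observation:
Both features are computed as old balance minus new balance. A positive balance_diff_orig means money left the origin account, and a negative balance_diff_dest means the destination account received money. Both means are negative, so on average the recorded new balance is higher than the old one on both sides.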


In [14]:
# The fraud flags only take the values 0 and 1, so Int8 is enough and saves memory.
df = df.with_columns([
    pl.col("isFraud").cast(pl.Int8),
    pl.col("isFlaggedFraud").cast(pl.Int8)
])


In [15]:
df.write_parquet("../data/clean.parquet")
print("Clean dataset saved ✔️")


Clean dataset saved ✔️
