# Fraud Transactions Analysis  
## Step 1 - Data Loading

Objectives:
- Load the large CSV file
- Check its structure and column types
- Optimize memory usage
- Save the dataset in Parquet format

In [3]:
%pip install polars

Collecting polars
  Using cached polars-1.37.1-py3-none-any.whl.metadata (10 kB)
Collecting polars-runtime-32==1.37.1 (from polars)
  Using cached polars_runtime_32-1.37.1-cp310-abi3-macosx_11_0_arm64.whl.metadata (1.5 kB)
Using cached polars-1.37.1-py3-none-any.whl (805 kB)
Using cached polars_runtime_32-1.37.1-cp310-abi3-macosx_11_0_arm64.whl (39.7 MB)
Installing collected packages: polars-runtime-32, polars
Successfully installed polars-1.37.1 polars-runtime-32-1.37.1
Note: you may need to restart the kernel to use updated packages.


In [10]:
import polars as pl

In [11]:
import os

In [12]:
DATA_PATH = "../data/fraud_dataset.csv"

# Fail early, and report the exact path that was checked.
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"Dataset not found at {DATA_PATH}")

print("File found ✔️")

File found ✔️


In [13]:
# Eager read; column dtypes are inferred from the first 10,000 rows.
df = pl.read_csv(
    DATA_PATH,
    infer_schema_length=10000,
)

In [14]:
rows, cols = df.shape
print(f"Rows: {rows:,}")
print(f"Columns: {cols}")

Rows: 6,362,620
Columns: 11


In [15]:
df.head(10)

| step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | f64 | str | f64 | f64 | str | f64 | f64 | i64 | i64 |
| 1 | "PAYMENT" | 9839.64 | "C1231006815" | 170136.0 | 160296.36 | "M1979787155" | 0.0 | 0.0 | 0 | 0 |
| 1 | "PAYMENT" | 1864.28 | "C1666544295" | 21249.0 | 19384.72 | "M2044282225" | 0.0 | 0.0 | 0 | 0 |
| 1 | "TRANSFER" | 181.0 | "C1305486145" | 181.0 | 0.0 | "C553264065" | 0.0 | 0.0 | 1 | 0 |
| 1 | "CASH_OUT" | 181.0 | "C840083671" | 181.0 | 0.0 | "C38997010" | 21182.0 | 0.0 | 1 | 0 |
| 1 | "PAYMENT" | 11668.14 | "C2048537720" | 41554.0 | 29885.86 | "M1230701703" | 0.0 | 0.0 | 0 | 0 |
| 1 | "PAYMENT" | 7817.71 | "C90045638" | 53860.0 | 46042.29 | "M573487274" | 0.0 | 0.0 | 0 | 0 |
| 1 | "PAYMENT" | 7107.77 | "C154988899" | 183195.0 | 176087.23 | "M408069119" | 0.0 | 0.0 | 0 | 0 |
| 1 | "PAYMENT" | 7861.64 | "C1912850431" | 176087.23 | 168225.59 | "M633326333" | 0.0 | 0.0 | 0 | 0 |
| 1 | "PAYMENT" | 4024.36 | "C1265012928" | 2671.0 | 0.0 | "M1176932104" | 0.0 | 0.0 | 0 | 0 |
| 1 | "DEBIT" | 5337.77 | "C712410124" | 41720.0 | 36382.23 | "C195600860" | 41898.0 | 40348.79 | 0 | 0 |


In [16]:
df.schema


Schema([('step', Int64),
        ('type', String),
        ('amount', Float64),
        ('nameOrig', String),
        ('oldbalanceOrg', Float64),
        ('newbalanceOrig', Float64),
        ('nameDest', String),
        ('oldbalanceDest', Float64),
        ('newbalanceDest', Float64),
        ('isFraud', Int64),
        ('isFlaggedFraud', Int64)])

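Observations:
- The dataset contains 6,362,620 rows and 11 columns
- No obvious schema issues: every column was inferred as Int64, Float64, or String
- Data types match the values shown: amounts and balances are Float64, `type` and the account names are String, `step` and the two fraud flags are Int64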

In [17]:
df.estimated_size("mb")


560.588846206665

In [18]:
OUTPUT_PATH = "../data/transactions.parquet"

df.write_parquet(OUTPUT_PATH)

print("Parquet file saved ✔️")

Parquet file saved ✔️


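Why Parquet?
- Faster loading: the file is binary and stores its own schema, so reading it needs no text parsing or type inference
- Smaller size on disk: column data is encoded and compressed
- Column-based storage: a query can read only the columns it needs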

In [19]:
df_parquet = pl.read_parquet(OUTPUT_PATH)
df_parquet.shape

(6362620, 11)