## Data Cleaning & Preprocessing

This section outlines the steps taken to prepare the dataset for fraud classification. The goal was to ensure data quality, consistency, and suitability for statistical modelling.

---

## Dataset Information

The dataset consists of **transaction-level financial records**, representing simulated mobile money transactions. Each observation corresponds to a single transaction and includes information such as:

- **Transaction type** (e.g., CASH_OUT, TRANSFER)
- **Transaction amount**
- **Sender and recipient account balances** before and after the transaction
- **Fraud indicator**, identifying whether a transaction is fraudulent  

The dataset exhibits a **high class imbalance**, with fraudulent transactions accounting for only a small proportion of all observations. This characteristic is typical in real-world fraud detection problems and is considered during analysis and modelling.


In [None]:
import pandas as pd
import numpy as np

In [90]:
df = pd.read_csv("FinancialCrime.csv")
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


The dataset was first loaded into Python using `pandas`, and `df.info()` was used to inspect its structure: 6,362,620 rows, 11 columns, the data type of each variable, and overall memory usage. For a frame this large, `df.info()` omits the non-null counts by default, so missing values were checked separately with `df.isnull().sum()`. This step confirmed that all variables were imported with sensible data types prior to preprocessing.
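The non-null counts can be requested explicitly. The following is a minimal sketch on a small made-up frame (the values and the helper name `info_text` are illustrative assumptions, not part of the original notebook) showing how the `show_counts` argument of `DataFrame.info()` toggles that column:

```python
import io

import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [181.0, np.nan, 9839.64],
})


def info_text(frame, show_counts):
    """Return the text that frame.info() would print (hypothetical helper)."""
    buffer = io.StringIO()
    frame.info(buf=buffer, show_counts=show_counts)
    return buffer.getvalue()


with_counts = info_text(toy, show_counts=True)
without_counts = info_text(toy, show_counts=False)
print(with_counts)
```

On the full dataset, calling `df.info(show_counts=True)` should produce the same kind of report, at the cost of a pass over every column.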


In [92]:
df.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

No column contains missing values, so all observations can be retained. This preserves the integrity of the dataset and avoids the bias that removing rows or imputing values could introduce.
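When a check like this does find gaps, it usually helps to look at the share of missing values per column as well as the raw counts. A small sketch on made-up data (the frame `toy` is an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing amount.
toy = pd.DataFrame({
    "amount": [100.0, np.nan, 250.0, 75.0],
    "isFraud": [0, 0, 1, 0],
})

missing_counts = toy.isna().sum()    # number of missing cells per column
missing_share = toy.isna().mean()    # fraction of missing cells per column
rows_with_missing = int(toy.isna().any(axis=1).sum())  # rows with any gap

print(missing_counts)
print(missing_share)
print(rows_with_missing)
```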

In [93]:
def preprocess_data(df):
    # Drop the columns that are not used as predictors; modifies df in place.
    df.drop(columns=["step", "nameOrig", "isFlaggedFraud", "nameDest"], inplace=True)

preprocess_data(df)



Four columns were removed during preprocessing: the account identifiers `nameOrig` and `nameDest`, together with `step` and `isFlaggedFraud`, which were not used as predictors in this analysis. This reduces noise and retains only the features relevant for fraud classification.
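The column removal can also be written so that it returns a new frame instead of modifying its argument, which makes the cell safe to re-run. This is a sketch of that alternative, not the notebook's own code; `drop_unused_columns` and `UNUSED_COLUMNS` are hypothetical names and the toy rows are made up:

```python
import pandas as pd

# Columns removed in the preprocessing step described above.
UNUSED_COLUMNS = ["step", "nameOrig", "nameDest", "isFlaggedFraud"]


def drop_unused_columns(frame, columns=UNUSED_COLUMNS):
    """Return a new frame without `columns`; the input is left unchanged."""
    return frame.drop(columns=columns)


# Toy rows with the same column names as the dataset; values are made up.
toy = pd.DataFrame({
    "step": [1, 1],
    "type": ["PAYMENT", "TRANSFER"],
    "amount": [9839.64, 181.0],
    "nameOrig": ["C1", "C2"],
    "nameDest": ["M1", "C3"],
    "isFraud": [0, 1],
    "isFlaggedFraud": [0, 0],
})

cleaned = drop_unused_columns(toy)
print(list(cleaned.columns))
```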

In [94]:
print(df.columns)

Index(['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFraud'],
      dtype='object')


In [95]:
df['type'] = df['type'].astype('category')
df['isFraud'] = df['isFraud'].astype('category')



The variables `type` and `isFraud` were converted to the `category` data type so that they are treated as categorical variables, rather than as free text or numeric values, during classification modelling.
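A categorical column stores each distinct label once and represents the rows as integer codes. A brief sketch on made-up values (the frame `toy` is an assumption for illustration) shows what the conversion produces:

```python
import pandas as pd

# Toy column with three distinct transaction types.
toy = pd.DataFrame({"type": ["TRANSFER", "CASH_OUT", "TRANSFER", "PAYMENT"]})
toy["type"] = toy["type"].astype("category")

# The distinct labels, sorted, and the integer code assigned to each row.
categories = list(toy["type"].cat.categories)
codes = toy["type"].cat.codes.tolist()

print(categories)
print(codes)
```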

In [96]:
df["isFraud"].value_counts(normalize=True)


isFraud
0    0.998709
1    0.001291
Name: proportion, dtype: float64

The proportion of fraudulent and non-fraudulent transactions was examined using relative frequencies. Fraudulent transactions make up about 0.13% of the dataset (a proportion of 0.001291), which confirms a strong class imbalance.
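One common way to account for an imbalance of this size during modelling is to weight each class inversely to its frequency. The sketch below applies that widely used heuristic to the class totals reported in the contingency table of this notebook; it is an illustration under that assumption, not a step taken in the original analysis:

```python
# Class totals for the full dataset (non-fraud = 0, fraud = 1).
counts = {0: 6_354_407, 1: 8_213}
total = sum(counts.values())
n_classes = len(counts)

# Relative frequency of each class.
proportions = {label: n / total for label, n in counts.items()}

# "Balanced" heuristic: weight = total / (n_classes * class count),
# so the rare class receives the larger weight.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

# How many non-fraud transactions there are per fraud transaction.
imbalance_ratio = counts[0] / counts[1]

print(proportions)
print(weights)
print(imbalance_ratio)
```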

In [97]:
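# Balance changes: decrease in the sender's balance, increase in the recipient's.
df["deltaOrig"] = df["oldbalanceOrg"] - df["newbalanceOrig"]
df["deltaDest"] = df["newbalanceDest"] - df["oldbalanceDest"]
# log1p computes log(1 + amount), which is defined for zero amounts.
df["log_amount"] = np.log1p(df["amount"])

# Keep only the modelling columns; .copy() avoids SettingWithCopyWarning on later edits.
df = df[["type", "log_amount", "deltaOrig", "deltaDest", "isFraud"]].copy()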

Feature engineering created two balance-change variables, `deltaOrig` (the decrease in the sender's balance) and `deltaDest` (the increase in the recipient's balance), and applied a `log1p` transformation to the transaction amount (`log_amount`). The dataset was then reduced to these three features plus `type` and the target `isFraud`.
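The three engineered features can be checked by hand on a couple of rows. The sketch below uses made-up transactions (an assumption for illustration) and also shows why `log1p` is convenient: it maps a zero amount to zero and is inverted by `expm1`:

```python
import numpy as np
import pandas as pd

# Two made-up transactions: a zero-amount row and a full transfer of 181.0.
toy = pd.DataFrame({
    "amount": [0.0, 181.0],
    "oldbalanceOrg": [500.0, 181.0],
    "newbalanceOrig": [500.0, 0.0],
    "oldbalanceDest": [0.0, 0.0],
    "newbalanceDest": [0.0, 181.0],
})

# Same definitions as in the preprocessing cell.
toy["deltaOrig"] = toy["oldbalanceOrg"] - toy["newbalanceOrig"]
toy["deltaDest"] = toy["newbalanceDest"] - toy["oldbalanceDest"]
toy["log_amount"] = np.log1p(toy["amount"])

# expm1 undoes log1p, recovering the original amounts.
recovered = np.expm1(toy["log_amount"])

print(toy[["deltaOrig", "deltaDest", "log_amount"]])
```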

In [98]:
df.dtypes

type          category
log_amount     float64
deltaOrig      float64
deltaDest      float64
isFraud       category
dtype: object

In [99]:
pd.crosstab(df["type"],df["isFraud"], margins= True)

isFraud         0     1      All
type
CASH_IN   1399284     0  1399284
CASH_OUT  2233384  4116  2237500
DEBIT       41432     0    41432
PAYMENT   2151495     0  2151495
TRANSFER   528812  4097   532909
All       6354407  8213  6362620


In [100]:
df_stat = df[df["type"].isin(["TRANSFER", "CASH_OUT"])].copy()

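Transaction types were cross-tabulated against the fraud indicator to assess their association with fraudulent activity. All 8,213 fraudulent transactions occur in CASH_OUT (4,116) and TRANSFER (4,097) transactions; no fraud was recorded for CASH_IN, DEBIT, or PAYMENT. A subset restricted to these two types, `df_stat`, was therefore created for subsequent analysis.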
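Raw counts can hide how different the fraud rates are between types. As a sketch on made-up rows (the frame `toy` and its values are assumptions for illustration), `pd.crosstab` with `normalize="index"` gives the within-type fraud rate, which can then drive the same kind of subset selection:

```python
import pandas as pd

# Made-up transactions: fraud appears only in CASH_OUT and TRANSFER.
toy = pd.DataFrame({
    "type": ["CASH_OUT", "CASH_OUT", "CASH_OUT", "CASH_OUT",
             "TRANSFER", "TRANSFER", "PAYMENT", "PAYMENT"],
    "isFraud": [0, 0, 0, 1, 0, 1, 0, 0],
})

# Each row of the table sums to 1; column 1 is the fraud rate for that type.
rates = pd.crosstab(toy["type"], toy["isFraud"], normalize="index")
fraud_rate = rates[1]

# Keep only the transaction types in which fraud actually occurs.
types_with_fraud = fraud_rate[fraud_rate > 0].index.tolist()
subset = toy[toy["type"].isin(types_with_fraud)].copy()

print(fraud_rate)
print(types_with_fraud)
```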

In [102]:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
print(df.isnull().sum())


type          0
log_amount    0
deltaOrig     0
deltaDest     0
isFraud       0
dtype: int64


Before exporting the dataset to R, infinite values were replaced with `NaN` and the missing-value counts were rechecked. No missing values were found.
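The conversion matters because `isnull()` does not count infinite values as missing, so they would otherwise pass the check unnoticed. A minimal sketch on a made-up column (an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy column containing both positive and negative infinity.
toy = pd.DataFrame({"deltaOrig": [10.0, np.inf, -np.inf, 0.0]})

# isna() alone does not flag the infinite entries.
n_missing_before = int(toy["deltaOrig"].isna().sum())
n_infinite = int(np.isinf(toy["deltaOrig"]).sum())

# After the replacement they show up as ordinary missing values.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
n_missing_after = int(cleaned["deltaOrig"].isna().sum())

print(n_missing_before, n_infinite, n_missing_after)
```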

In [103]:
# index=False keeps the row index out of the file, so R does not read it as an extra column.
df.to_csv("cleaned_data.csv", index=False)
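A CSV file stores values but not data types, so the `category` types set earlier are not preserved in the export. The sketch below round-trips a made-up frame through an in-memory buffer (an assumption used here to avoid touching the file system) and re-casts the columns after reading; the same consideration applies in R, where these columns may need converting to factors:

```python
import io

import pandas as pd

# Toy frame with the same column types as the cleaned dataset; values are made up.
toy = pd.DataFrame({
    "type": pd.Categorical(["TRANSFER", "CASH_OUT"]),
    "log_amount": [5.2, 7.1],
    "isFraud": pd.Categorical([1, 0]),
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The categorical types are gone after the round trip.
dtypes_after_read = restored.dtypes.copy()

# Re-cast the columns that should be categorical.
restored["type"] = restored["type"].astype("category")
restored["isFraud"] = restored["isFraud"].astype("category")

print(dtypes_after_read)
print(restored.dtypes)
```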