
# Data Preprocessing and Graph Finalization
**Dataset**: IEEE-CIS Fraud Detection  
**Objective**: Clean the dataset, handle missing values, encode features, and create a PyTorch Geometric graph for GNN input.  
**Stored in**: /content/drive/My Drive/MSc Fraud Detection/data/ieee-fraud-detection

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
import pandas as pd

data_path = '/content/drive/My Drive/MSc Fraud Detection/data/ieee-fraud-detection'
transactions = pd.read_csv(f'{data_path}/train_transaction.csv')
# The identity table is optional: load it only if the file is present
identity_path = os.path.join(data_path, 'train_identity.csv')
identity = pd.read_csv(identity_path) if os.path.exists(identity_path) else None
print("Transactions Shape:", transactions.shape)
print("Identity Shape:", identity.shape if identity is not None else "Not loaded")

Transactions Shape: (590540, 394)
Identity Shape: (144233, 41)


In [10]:
# Drop columns with >50% missing values
missing = transactions.isnull().mean()
print("Columns with >50% missing:", missing[missing > 0.5].index.tolist())
transactions = transactions.loc[:, missing <= 0.5]

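# Define numerical and categorical columns
# Note: TransactionID and isFraud are integer columns, so they are part of
# numerical_cols and are affected by any step applied to numerical_cols
numerical_cols = transactions.select_dtypes(include=['float64', 'int64']).columns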
categorical_cols = transactions.select_dtypes(include=['object']).columns
print("Categorical Columns:", categorical_cols.tolist())

Columns with >50% missing: []
Categorical Columns: []
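
The column filter in the cell above relies on `isnull().mean()` returning the fraction of missing entries per column. A minimal sketch on a toy frame shows which columns survive a 0.5 threshold; the helper name `drop_sparse_columns` and the columns `a`, `b`, `c` are hypothetical and not part of the dataset:

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=0.5):
    """Return df without columns whose missing fraction exceeds threshold,
    plus the list of dropped column names. Illustrative helper only."""
    missing = df.isnull().mean()  # fraction of NaN per column
    dropped = missing[missing > threshold].index.tolist()
    return df.loc[:, missing <= threshold], dropped

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # 0% missing
    "b": [1.0, np.nan, np.nan, 4.0],     # 50% missing: kept, not strictly above 0.5
    "c": [np.nan, np.nan, np.nan, 4.0],  # 75% missing: dropped
})
kept, dropped = drop_sparse_columns(toy)
print(dropped)
print(kept.columns.tolist())
```

Because the cell overwrites `transactions`, running it again on an already filtered frame reports an empty list. That may explain the `[]` in the recorded output, given that the final frame has 220 of the original 394 columns.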


In [11]:
# Impute each categorical column with its mode; fall back to 'Unknown'
# when the column has no non-missing values (mode() is then empty)
for col in categorical_cols:
    mode_value = transactions[col].mode()
    fill_value = mode_value.iloc[0] if not mode_value.empty else 'Unknown'
    transactions[col] = transactions[col].fillna(fill_value)

# Verify imputation
print("Missing Values in Categorical Columns After Imputation:")
print(transactions[categorical_cols].isnull().sum())

Missing Values in Categorical Columns After Imputation:
Series([], dtype: float64)


In [12]:
# Encode categorical columns
from sklearn.preprocessing import LabelEncoder

for col in categorical_cols:
    transactions[col] = transactions[col].astype(str)  # Ensure string type for encoding
    le = LabelEncoder()
    transactions[col] = le.fit_transform(transactions[col])

# Impute numerical columns with median
transactions[numerical_cols] = transactions[numerical_cols].fillna(transactions[numerical_cols].median())

# Normalize numerical features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
transactions[numerical_cols] = scaler.fit_transform(transactions[numerical_cols])

print("Preprocessed Transactions Shape:", transactions.shape)
print("Sample Data:", transactions.head())

Preprocessed Transactions Shape: (590540, 220)
Sample Data:    TransactionID   isFraud  TransactionDT  TransactionAmt  ProductCD  \
0      -1.732048 -0.190417      -1.577987       -0.278167   0.547250   
1      -1.732042 -0.190417      -1.577986       -0.443327   0.547250   
2      -1.732036 -0.190417      -1.577972       -0.317889   0.547250   
3      -1.732030 -0.190417      -1.577965       -0.355521   0.547250   
4      -1.732024 -0.190417      -1.577964       -0.355521  -1.559603   

      card1     card2     card3     card4     card5  ...      V312      V313  \
0  0.821695 -0.009783 -0.281425 -2.753251 -1.396380  ... -0.227583 -0.222385   
1 -1.457558  0.264810 -0.281425 -1.048192 -2.368254  ... -0.227583 -0.222385   
2 -1.068263  0.813997 -0.281425  0.656866 -0.813255  ... -0.227583 -0.222385   
3  1.679858  1.305711 -0.281425 -1.048192 -2.003802  ...  0.556723 -0.222385   
4 -1.102133  0.967258 -0.281425 -1.048192 -2.368254  ... -0.227583 -0.222385   

       V314      V315     
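
The sample output shows `TransactionID` and `isFraud` with standardized values (for example `isFraud` at `-0.190417`), because both are numeric and were included in `numerical_cols`. If the label is meant to stay binary for training, one option is to exclude the identifier and target columns before scaling. A minimal sketch on toy data, assuming `TransactionID` is a row identifier and `isFraud` is the target; the toy values and the names `exclude` and `feature_cols` are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "TransactionID": [1001, 1002, 1003, 1004],
    "isFraud": [0, 0, 1, 0],
    "TransactionAmt": [68.5, 29.0, 59.0, 50.0],
    "card1": [13926, 2755, 4663, 18132],
})

# Columns that should not be standardized: the row identifier and the label
exclude = ["TransactionID", "isFraud"]
feature_cols = toy.select_dtypes(include="number").columns.difference(exclude)

# Cast features to float so the scaled values match the column dtype
toy = toy.astype({col: "float64" for col in feature_cols})
toy[feature_cols] = StandardScaler().fit_transform(toy[feature_cols])

print(toy["isFraud"].tolist())  # label values are left untouched
print(toy[feature_cols].describe().loc[["mean", "std"]])
```

The same pattern would apply to the median imputation step, which also runs over `numerical_cols`.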

In [13]:
print("Total Missing Values:", transactions.isnull().sum().sum())
print("Categorical Columns After Encoding:", transactions[categorical_cols].head())

Total Missing Values: 0
Categorical Columns After Encoding: Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]


In [15]:
transactions.to_csv(f'{data_path}/preprocessed_transactions_no_identity.csv', index=False)
print("Saved to:", f'{data_path}/preprocessed_transactions_no_identity.csv')

Saved to: /content/drive/My Drive/MSc Fraud Detection/data/ieee-fraud-detection/preprocessed_transactions_no_identity.csv


## Transactions Preprocessing Update
- **Issue**: IndexError during categorical imputation when a column had no mode to index (its `mode()` result was empty).
- **Fix**: Imputed each categorical column with its own mode, falling back to 'Unknown' when no mode exists.
- **Result**: No missing values remain in the categorical columns; the data is ready for encoding and normalization.
- **Output**: preprocessed_transactions_no_identity.csv
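
The fix described above can be sketched on a toy frame. `Series.mode()` ignores missing values by default and returns an empty Series when a column has no non-missing entries, so taking its first element fails; checking `.empty` first avoids that. The helper name `impute_mode_or_unknown` and the `all_missing` column are hypothetical:

```python
import pandas as pd

def impute_mode_or_unknown(df, cols, fallback="Unknown"):
    """Fill NaN in each listed column with its mode, or with `fallback`
    when the column has no non-missing values. Illustrative helper only."""
    out = df.copy()
    for col in cols:
        mode_value = out[col].mode()  # dropna=True by default
        fill = mode_value.iloc[0] if not mode_value.empty else fallback
        out[col] = out[col].fillna(fill)
    return out

toy = pd.DataFrame({
    "card4": ["visa", None, "visa", "mastercard"],
    "all_missing": pd.Series([None, None, None, None], dtype="object"),
})
filled = impute_mode_or_unknown(toy, ["card4", "all_missing"])
print(filled["card4"].tolist())
print(filled["all_missing"].tolist())
```

Working on a copy keeps the input frame unchanged, which makes the before and after states easy to compare while debugging.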