<a href="https://colab.research.google.com/github/bradkim1/ScaleMLPrototype/blob/main/ScaleMLPrototype.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scale ML prototype with Large Scale Data
In this notebook, I developed a scalable fraud detection workflow using a modular Python class capable of handling large datasets. The pipeline encapsulates preprocessing, feature engineering, and model training using RandomForestClassifier, GradientBoostingClassifier, or XGBClassifier, with support for outlier removal, PCA-based dimensionality reduction of high-dimensional V features, and aggregated statistical features.

To prevent memory overload during prediction on large datasets, the workflow supports chunked processing—reading and transforming data in smaller batches to avoid loading the entire dataset into memory at once. This chunk-based approach enables efficient training and inference even in memory-constrained environments like Colab.

For deployment readiness, the pipeline guarantees consistent preprocessing between training and inference and writes predictions—including probabilities—to disk. While this prototype processes training data as full DataFrames, it is structured to be extended easily for streaming or distributed prediction scenarios with minimal changes.

Original datasets source:
https://www.kaggle.com/competitions/ieee-fraud-detection/data

Data file used in this notebook: https://drive.google.com/drive/folders/1XVqZ1psVwXygFlhLsOjvrhm9XM0wKlQ2?dmr=1&ec=wgc-drive-globalnav-goto

In [1]:
import pandas as pd
import numpy as np
import gc
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score


In [2]:
class AdvancedMLPipeline:
    def __init__(self, model_type='rf', n_components=10, remove_outliers=True):
        # Initialize all components: model, imputers, encoder, PCA
        model_map = {
            'gb': GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
            'rf': RandomForestClassifier(n_estimators=100, random_state=42),
            'xgb': XGBClassifier(use_label_encoder=False, eval_metric='logloss')
        }
        self.model = model_map[model_type]
        #self.model = RandomForestClassifier(class_weight='balanced', random_state=42)
        self.scaler = StandardScaler()
        self.imputer_num = SimpleImputer(strategy='mean')
        self.imputer_cat = SimpleImputer(strategy='most_frequent')
        self.encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        self.pca = PCA(n_components=n_components)
        self.remove_outliers = remove_outliers
        self.cat_columns = []
        self.feature_columns = []

    def _add_features(self, df):
        # Create time features from TransactionDT
        if 'TransactionDT' in df.columns:
            df['hour'] = pd.to_datetime(df['TransactionDT'], unit='s', errors='coerce').dt.hour.astype('Int8')
            df['day'] = pd.to_datetime(df['TransactionDT'], unit='s', errors='coerce').dt.dayofweek.astype('Int8')
            df.drop(columns=['TransactionDT'], inplace=True)

        # Group-level aggregation features
        c_cols = [col for col in df.columns if col.startswith('C')]
        d_cols = [col for col in df.columns if col.startswith('D')]
        v_cols = [col for col in df.columns if col.startswith('V')]

        df['C_sum'] = df[c_cols].sum(axis=1)
        df['D_missing'] = df[d_cols].isnull().sum(axis=1)
        df['V_mean'] = df[v_cols].mean(axis=1)
        return df

    def _remove_outliers(self, df, col='TransactionAmt'):
        # Remove outliers from TransactionAmt using IQR
        if col in df.columns:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR
            before = df.shape[0]
            df = df[(df[col] >= lower) & (df[col] <= upper)]
            after = df.shape[0]
            print(f"Removed {before - after} outliers from {col}")
        return df

    def fit_model_on_chunk(self, df, label_column='isFraud'):
        df = self._add_features(df)

        if self.remove_outliers:
            df = self._remove_outliers(df)

        # Convert known categoricals to string
        known_cats = ['ProductCD', 'card4', 'card6', 'DeviceType', 'DeviceInfo',
                      'M1','M2','M3','M4','M5','M6','M7','M8','M9']
        for col in known_cats:
            if col in df.columns:
                df[col] = df[col].astype(str)

        # Identify categorical and numeric columns
        self.cat_columns = [col for col in df.columns if df[col].dtype == 'object' and col != label_column]
        num_cols = [col for col in df.columns if col not in self.cat_columns + [label_column, 'TransactionID']]

        # Impute missing values
        if self.cat_columns:
            df[self.cat_columns] = self.imputer_cat.fit_transform(df[self.cat_columns])
        if num_cols:
            df[num_cols] = self.imputer_num.fit_transform(df[num_cols])

        # One-hot encode categorical features
        encoded = self.encoder.fit_transform(df[self.cat_columns])
        encoded_df = pd.DataFrame(encoded, columns=self.encoder.get_feature_names_out(self.cat_columns), index=df.index)

        # Replace original categoricals with encoded version
        df = df.drop(columns=self.cat_columns)
        df = pd.concat([df, encoded_df], axis=1)

        # Scale numeric columns
        if num_cols:
            df[num_cols] = self.scaler.fit_transform(df[num_cols])

        # Apply PCA on V-columns
        v_cols = [col for col in df.columns if col.startswith('V')]
        if v_cols:
            pca_trans = self.pca.fit_transform(df[v_cols])
            pca_df = pd.DataFrame(pca_trans, columns=[f'V_PCA_{i}' for i in range(pca_trans.shape[1])], index=df.index)
            df = df.drop(columns=v_cols)
            df = pd.concat([df, pca_df], axis=1)

        # Save the final feature columns
        self.feature_columns = [col for col in df.columns if col not in [label_column, 'TransactionID']]

        # Train model
        X = df[self.feature_columns]
        y = df[label_column]
        self.model.fit(X, y)

    def transform_for_predict(self, df):
        df = self._add_features(df)

        # Fill missing categorical values
        for col in self.cat_columns:
            if col not in df.columns:
                df[col] = "missing"
        df[self.cat_columns] = self.imputer_cat.transform(df[self.cat_columns])

        # Impute numeric
        num_cols = [col for col in df.columns if col not in self.cat_columns + ['TransactionID']]
        df[num_cols] = self.imputer_num.transform(df[num_cols])

        # Encode categoricals
        encoded = self.encoder.transform(df[self.cat_columns])
        encoded_df = pd.DataFrame(encoded, columns=self.encoder.get_feature_names_out(self.cat_columns), index=df.index)

        df = df.drop(columns=self.cat_columns)
        df = pd.concat([df, encoded_df], axis=1)

        # Scale numeric
        if num_cols:
            df[num_cols] = self.scaler.transform(df[num_cols])

        # Apply PCA to V columns
        v_cols = [col for col in df.columns if col.startswith('V')]
        if v_cols:
            pca_trans = self.pca.transform(df[v_cols])
            pca_df = pd.DataFrame(pca_trans, columns=[f'V_PCA_{i}' for i in range(pca_trans.shape[1])], index=df.index)
            df = df.drop(columns=v_cols)
            df = pd.concat([df, pca_df], axis=1)

        # Align columns with training
        for col in self.feature_columns:
            if col not in df.columns:
                df[col] = 0
        df = df[self.feature_columns]
        return df

    def evaluate(self, X, y_true):
        y_pred = self.model.predict(X)
        y_proba = self.model.predict_proba(X)[:, 1]
        print(classification_report(y_true, y_pred))
        print("ROC AUC:", roc_auc_score(y_true, y_proba))
        print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))

    def plot_roc_curve(self, X, y_true):
        from sklearn.metrics import roc_curve, auc
        y_proba = self.model.predict_proba(X)[:, 1]
        fpr, tpr, _ = roc_curve(y_true, y_proba)
        roc_auc = auc(fpr, tpr)

        import matplotlib.pyplot as plt
        plt.figure()
        plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
        plt.plot([0, 1], [0, 1], linestyle='--')
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver Operating Characteristic')
        plt.legend(loc='lower right')
        plt.grid(True)
        plt.show()

    def plot_precision_recall(self, X, y_true):
        from sklearn.metrics import precision_recall_curve, average_precision_score
        y_proba = self.model.predict_proba(X)[:, 1]
        precision, recall, _ = precision_recall_curve(y_true, y_proba)
        avg_precision = average_precision_score(y_true, y_proba)
        import matplotlib.pyplot as plt
        plt.figure()
        plt.plot(recall, precision, label=f'Avg Precision = {avg_precision:.2f}')
        plt.xlabel('Recall')
        plt.ylabel('Precision')
        plt.title('Precision-Recall Curve')
        plt.legend()
        plt.grid(True)
        plt.show()

    def get_metrics(self, X, y_true):
        from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
        y_pred = self.model.predict(X)
        y_proba = self.model.predict_proba(X)[:, 1]
        return {
            'f1_score': f1_score(y_true, y_pred),
            'precision': precision_score(y_true, y_pred),
            'recall': recall_score(y_true, y_pred),
            'roc_auc': roc_auc_score(y_true, y_proba)
        }


In [3]:
# Reading datasets from google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Training and evaluation
# Load and clean training data
df = pd.read_csv("/content/drive/MyDrive/ML_Datasets/train_merged.csv")

# Optional: drop very high-null or low-variance columns
drop_cols = [col for col in df.columns if df[col].nunique(dropna=True) <= 1 or df[col].isnull().mean() > 0.95]
df.drop(columns=drop_cols, inplace=True)

# Train/validation split
df_train, df_val = train_test_split(df, test_size=0.2, stratify=df['isFraud'], random_state=42)

# Initialize and train pipeline
pipeline = AdvancedMLPipeline(model_type='rf', n_components=15, remove_outliers=True)
pipeline.fit_model_on_chunk(df_train.copy(), label_column='isFraud')

# Evaluate on validation set
X_val = pipeline.transform_for_predict(df_val.drop(columns=['isFraud']))
pipeline.evaluate(X_val, df_val['isFraud'])

# Extra metrics
#pipeline.plot_roc_curve(X_val, df_val['isFraud'])
#pipeline.plot_precision_recall(X_val, df_val['isFraud'])
metrics = pipeline.get_metrics(X_val, df_val['isFraud'])
print(metrics)

# Load test data and make predictions
# Prepare chunked test prediction
chunk_size = 100_000
top_n = 1000  # Change this to control how many top frauds to save
reader = pd.read_csv("/content/drive/MyDrive/ML_Datasets/test_merged.csv", chunksize=chunk_size)

results = []

for i, chunk in enumerate(reader):
    print(f" Processing chunk {i+1}")
    # Drop columns not used in training
    chunk = chunk.drop(columns=[col for col in chunk.columns if col not in df_train.columns and col != 'TransactionID'], errors='ignore')

    # Transform and predict
    X_chunk = pipeline.transform_for_predict(chunk.copy())
    y_prob = pipeline.model.predict_proba(X_chunk)[:, 1]
    y_pred = pipeline.model.predict(X_chunk)

    # Create prediction DataFrame
    result_chunk = pd.DataFrame({
        'TransactionID': chunk['TransactionID'],
        'isFraud': y_pred,
        'fraudProbability': y_prob
    })

    results.append(result_chunk)
    del chunk, X_chunk, y_prob, y_pred, result_chunk
    gc.collect()

# test data likely contains >1M+ rows
# After one-hot encoding, the column count can explode
# Then PCA transformation and alignment to >multiple_thousand+ features — done on >1M+ rows can explode RAM!!!

# Instead of transforming the whole test set at once, process it in parts and collect predictions gradually
# transform_for_predict() step is exploding RAM due to:
# One-hot encoding on many categorical columns
# Keeping hundreds of V columns + PCA transform on them
# No chunking — it's transforming 500,000+ rows at once

# Combine all chunked results and save
# Save predictions to file
final_predictions = pd.concat(results)

# Sort by highest fraud probability and keep top N
top_frauds = final_predictions.sort_values(by='fraudProbability', ascending=False).head(top_n)

final_predictions.to_csv("/content/drive/MyDrive/ML_Datasets/test_predictions.csv", index=False)
print("Top {top_n} predictions saved to: top_fraud_test_predictions.csv")

Removed 53211 outliers from TransactionAmt
              precision    recall  f1-score   support

           0       0.98      1.00      0.99    113975
           1       0.97      0.36      0.52      4133

    accuracy                           0.98    118108
   macro avg       0.97      0.68      0.76    118108
weighted avg       0.98      0.98      0.97    118108

ROC AUC: 0.9261526358685571
Confusion Matrix:
 [[113928     47]
 [  2652   1481]]
{'f1_score': 0.5232291114644055, 'precision': 0.9692408376963351, 'recall': 0.35833534962496977, 'roc_auc': np.float64(0.9261526358685571)}
 Processing chunk 1
 Processing chunk 2
 Processing chunk 3
 Processing chunk 4
 Processing chunk 5
 Processing chunk 6
Top {top_n} predictions saved to: top_fraud_test_predictions.csv
