# Banking Data Anomaly Detection - Comprehensive Analysis

This notebook demonstrates anomaly detection across various banking transaction types to identify:
- Fraudulent transactions
- Money laundering patterns
- Data quality issues
- Regulatory compliance violations
- Operational anomalies

## Table of Contents
1. [Environment Setup](#Environment-Setup)
2. [Data Loading and Exploration](#Data-Loading-and-Exploration)
3. [Deposit Transaction Analysis](#Deposit-Transaction-Analysis)
4. [Credit Card Fraud Detection](#Credit-Card-Fraud-Detection)
5. [ATM Transaction Monitoring](#ATM-Transaction-Monitoring)
6. [Executive Dashboard](#Executive-Dashboard)
7. [Recommendations and Next Steps](#Recommendations-and-Next-Steps)

Foreign exchange, investment, corporate expense, general ledger, and cross-transaction analyses are planned but not yet implemented in this notebook.

## Environment Setup

In [None]:
# Import required libraries
import os
import sys
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import seaborn as sns
from plotly.subplots import make_subplots

# Pynomaly imports
sys.path.append("../../../src")
from pynomaly.domain.entities.dataset import Dataset
from pynomaly.domain.value_objects.contamination_rate import ContaminationRate
from pynomaly.infrastructure.adapters.pyod_adapter import PyODAdapter
from pynomaly.infrastructure.adapters.sklearn_adapter import SklearnAdapter

# Configure display options
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
plt.rcParams["figure.figsize"] = (12, 8)
sns.set_style("whitegrid")
warnings.filterwarnings("ignore")

print("Environment setup complete!")

## Data Loading and Exploration

In [None]:
# Load all banking datasets
datasets = {}
data_dir = "../datasets/"

dataset_files = [
    "deposits.csv",
    "loans.csv",
    "investments.csv",
    "fx_transactions.csv",
    "atm_transactions.csv",
    "debit_card_transactions.csv",
    "credit_card_transactions.csv",
    "expense_transactions.csv",
    "gl_transactions.csv",
]

for file in dataset_files:
    name = file.replace(".csv", "")
    datasets[name] = pd.read_csv(os.path.join(data_dir, file))
    print(f"Loaded {name}: {len(datasets[name]):,} records")

print(
    f"\nTotal records across all datasets: {sum(len(df) for df in datasets.values()):,}"
)

In [None]:
# Dataset overview
overview_data = []
for name, df in datasets.items():
    overview_data.append(
        {
            "Dataset": name.replace("_", " ").title(),
            "Records": len(df),
            "Anomalies": df["is_anomaly"].sum() if "is_anomaly" in df.columns else 0,
            "Anomaly Rate": f"{df['is_anomaly'].mean() * 100:.1f}%"
            if "is_anomaly" in df.columns
            else "N/A",
            "Date Range": f"{df['timestamp'].min()} to {df['timestamp'].max()}"
            if "timestamp" in df.columns
            else "N/A",
        }
    )

overview_df = pd.DataFrame(overview_data)
print("Banking Datasets Overview:")
print(overview_df.to_string(index=False))

In [None]:
# Visualize dataset sizes and anomaly rates
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Dataset sizes
sizes = [len(df) for df in datasets.values()]
names = [name.replace("_", " ").title() for name in datasets.keys()]
ax1.bar(names, sizes, color="skyblue")
ax1.set_title("Dataset Sizes")
ax1.set_ylabel("Number of Records")
ax1.tick_params(axis="x", rotation=45)

# Anomaly rates
anomaly_rates = [
    df["is_anomaly"].mean() * 100 if "is_anomaly" in df.columns else 0
    for df in datasets.values()
]
ax2.bar(names, anomaly_rates, color="lightcoral")
ax2.set_title("Anomaly Rates by Dataset")
ax2.set_ylabel("Anomaly Rate (%)")
ax2.tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()

## Deposit Transaction Analysis

Analyzing deposit transactions for:
- Money laundering patterns
- Structuring (transactions just under $10,000)
- Unusual timing patterns
- Velocity anomalies
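
The structuring bullet above concerns deposits that each stay below $10,000 but add up past it. A minimal sketch of that check follows. It assumes the deposits table carries a `customer_id` column (not shown for deposits in this notebook); `flag_structuring` is a hypothetical helper, and the toy data is illustrative only. The idea is to aggregate sub-threshold deposits per customer and calendar day:

```python
import pandas as pd


def flag_structuring(df, threshold=10_000, min_count=2):
    """Return customer-days where several sub-threshold deposits sum past the threshold.

    Assumes columns `customer_id`, `timestamp` (datetime64) and `amount`.
    """
    below = df[df["amount"] < threshold].copy()
    below["date"] = below["timestamp"].dt.normalize()
    daily = (
        below.groupby(["customer_id", "date"])["amount"]
        .agg(total="sum", n_deposits="count")
        .reset_index()
    )
    return daily[(daily["n_deposits"] >= min_count) & (daily["total"] >= threshold)]


# Toy data: customer A splits 18,700 across two deposits on the same day.
toy = pd.DataFrame(
    {
        "customer_id": ["A", "A", "A", "B", "B"],
        "timestamp": pd.to_datetime(
            [
                "2024-01-05 09:10",
                "2024-01-05 13:40",
                "2024-01-06 10:00",
                "2024-01-05 11:00",
                "2024-01-05 15:00",
            ]
        ),
        "amount": [9500.0, 9200.0, 300.0, 4000.0, 1500.0],
    }
)
suspicious = flag_structuring(toy)
print(suspicious)
```

Once `deposits` has been loaded and its timestamps converted, the same helper could be applied to it, provided the assumed column exists. Flagged customer-days are candidates for review, not proof of structuring.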

In [None]:
# Load deposits data and convert timestamp
deposits = datasets["deposits"].copy()
deposits["timestamp"] = pd.to_datetime(deposits["timestamp"])

# Basic statistics
print("Deposit Transaction Statistics:")
print(f"Total deposits: {len(deposits):,}")
print(f"Total amount: ${deposits['amount'].sum():,.2f}")
print(f"Average deposit: ${deposits['amount'].mean():,.2f}")
print(f"Median deposit: ${deposits['amount'].median():,.2f}")
print(f"Largest deposit: ${deposits['amount'].max():,.2f}")
print(
    f"Structuring indicators (near $10K): {((deposits['amount'] >= 9000) & (deposits['amount'] < 10000)).sum()}"
)

In [None]:
# Feature engineering for deposit analysis
def engineer_deposit_features(df):
    features_df = df.copy()

    # Time features
    features_df["hour"] = features_df["timestamp"].dt.hour
    features_df["day_of_week"] = features_df["timestamp"].dt.dayofweek
    features_df["is_weekend"] = features_df["day_of_week"].isin([5, 6]).astype(int)
    features_df["is_business_hours"] = (
        (features_df["hour"] >= 9) & (features_df["hour"] < 17)  # 09:00-16:59
    ).astype(int)

    # Amount features
    features_df["amount_log"] = np.log1p(features_df["amount"])
    features_df["near_threshold"] = (
        (features_df["amount"] >= 9000) & (features_df["amount"] < 10000)
    ).astype(int)
    features_df["large_amount"] = (features_df["amount"] > 50000).astype(int)

    # Source type encoding
    source_encoded = pd.get_dummies(features_df["source_type"], prefix="source")
    features_df = pd.concat([features_df, source_encoded], axis=1)

    return features_df


deposits_features = engineer_deposit_features(deposits)
print("Feature engineering complete for deposits")

In [None]:
# Anomaly detection on deposits
def detect_deposit_anomalies(df, contamination=0.05):
    feature_columns = [
        "amount",
        "amount_log",
        "hour",
        "day_of_week",
        "is_weekend",
        "is_business_hours",
        "near_threshold",
        "large_amount",
    ]

    # Add source type columns
    source_columns = [col for col in df.columns if col.startswith("source_")]
    feature_columns.extend(source_columns)

    X = df[feature_columns].fillna(0).values

    # Initialize adapters
    sklearn_adapter = SklearnAdapter()
    pyod_adapter = PyODAdapter()

    # Create dataset
    dataset = Dataset(data=X, target=None, feature_names=feature_columns)

    contamination_rate = ContaminationRate(contamination)

    # Run multiple algorithms
    iso_result = sklearn_adapter.detect_anomalies(
        dataset=dataset,
        algorithm_type="isolation_forest",
        contamination=contamination_rate,
    )

    lof_result = pyod_adapter.detect_anomalies(
        dataset=dataset, algorithm_type="lof", contamination=contamination_rate
    )

    # Combine scores
    df_result = df.copy()
    df_result["iso_score"] = iso_result.anomaly_scores
    df_result["lof_score"] = lof_result.anomaly_scores
    df_result["ensemble_score"] = (df_result["iso_score"] + df_result["lof_score"]) / 2

    # Flag anomalies
    threshold = np.percentile(df_result["ensemble_score"], (1 - contamination) * 100)
    df_result["predicted_anomaly"] = (df_result["ensemble_score"] > threshold).astype(
        int
    )

    return df_result


deposit_results = detect_deposit_anomalies(deposits_features)
print(
    f"Deposit anomaly detection complete. Detected {deposit_results['predicted_anomaly'].sum()} anomalies."
)
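
The ensemble score above averages the raw Isolation Forest and LOF scores. If the two adapters return scores on different scales (this is not verified here), the larger-scale detector would dominate that average. One common remedy is to average percentile ranks instead. The sketch below uses a hypothetical helper `rank_average` and toy score values:

```python
import numpy as np
import pandas as pd


def rank_average(*score_arrays):
    """Average detector scores after converting each to percentile ranks in (0, 1]."""
    ranks = [pd.Series(s).rank(pct=True).to_numpy() for s in score_arrays]
    return np.mean(ranks, axis=0)


iso_scores = np.array([0.10, 0.20, 0.15, 0.90])  # toy values on a small scale
lof_scores = np.array([1.0, 1.1, 50.0, 40.0])  # toy values on a large scale

combined = rank_average(iso_scores, lof_scores)
raw_mean = (iso_scores + lof_scores) / 2

print("rank-averaged:", combined)
print("raw mean:     ", raw_mean)
```

In this toy case the raw mean ranks the third record highest purely because of its LOF value, while the rank average favors the fourth record, which both detectors score highly.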

In [None]:
# Analyze deposit anomalies
deposit_anomalies = deposit_results[deposit_results["predicted_anomaly"] == 1]
deposit_normal = deposit_results[deposit_results["predicted_anomaly"] == 0]

print("=== DEPOSIT ANOMALY ANALYSIS ===")
print(f"Normal transactions: {len(deposit_normal):,}")
print(f"Anomalous transactions: {len(deposit_anomalies):,}")
print(f"Detection rate: {len(deposit_anomalies) / len(deposit_results) * 100:.1f}%")

if len(deposit_anomalies) > 0:
    print("\nAnomaly Characteristics:")
    print(
        f"Average amount: ${deposit_anomalies['amount'].mean():,.2f} vs ${deposit_normal['amount'].mean():,.2f}"
    )
    print(f"Structuring indicators: {deposit_anomalies['near_threshold'].sum()}")
    print(f"Large deposits (>$50K): {deposit_anomalies['large_amount'].sum()}")
    print(f"After-hours deposits: {(1 - deposit_anomalies['is_business_hours']).sum()}")
    print(f"Weekend deposits: {deposit_anomalies['is_weekend'].sum()}")

In [None]:
# Visualize deposit anomalies
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle("Deposit Transaction Anomaly Analysis", fontsize=16)

# Amount distribution
axes[0, 0].hist(
    deposit_normal["amount"], bins=50, alpha=0.7, label="Normal", density=True
)
axes[0, 0].hist(
    deposit_anomalies["amount"], bins=50, alpha=0.7, label="Anomaly", density=True
)
axes[0, 0].set_xlabel("Amount ($)")
axes[0, 0].set_ylabel("Density")
axes[0, 0].set_title("Amount Distribution")
axes[0, 0].legend()
axes[0, 0].set_xscale("log")

# Time patterns
hour_counts = (
    deposit_results.groupby(["hour", "predicted_anomaly"]).size().unstack(fill_value=0)
)
hour_counts.plot(kind="bar", ax=axes[0, 1], color=["blue", "red"])
axes[0, 1].set_xlabel("Hour of Day")
axes[0, 1].set_ylabel("Count")
axes[0, 1].set_title("Hourly Distribution")
axes[0, 1].legend(["Normal", "Anomaly"])

# Source type distribution
source_data = (
    deposit_results.groupby(["source_type", "predicted_anomaly"])
    .size()
    .unstack(fill_value=0)
)
source_data.plot(kind="bar", ax=axes[1, 0], color=["blue", "red"])
axes[1, 0].set_xlabel("Source Type")
axes[1, 0].set_ylabel("Count")
axes[1, 0].set_title("Source Type Distribution")
axes[1, 0].legend(["Normal", "Anomaly"])

# Anomaly scores
axes[1, 1].scatter(
    deposit_results["iso_score"],
    deposit_results["lof_score"],
    c=deposit_results["predicted_anomaly"],
    cmap="coolwarm",
    alpha=0.6,
)
axes[1, 1].set_xlabel("Isolation Forest Score")
axes[1, 1].set_ylabel("LOF Score")
axes[1, 1].set_title("Algorithm Score Comparison")

plt.tight_layout()
plt.show()

## Credit Card Fraud Detection

Detecting fraudulent credit card transactions using:
- Velocity analysis
- Amount deviation patterns
- Geographic anomalies
- Merchant category analysis
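
Velocity analysis, the first item above, usually means counting how many transactions a customer makes within a short trailing window. The sketch below is one way to derive such a feature with pandas. The helper `add_velocity`, the column name `txn_count_1h`, and the 60-minute window are illustrative assumptions, not part of the notebook's pipeline:

```python
import pandas as pd


def add_velocity(df, window="60min"):
    """Count each customer's transactions in the trailing window, current one included."""
    out = df.sort_values(["customer_id", "timestamp"]).copy()
    counts = (
        out.set_index("timestamp")
        .groupby("customer_id")["amount"]
        .rolling(window)
        .count()
    )
    # groupby preserves the (customer_id, timestamp) order established by the sort above.
    out["txn_count_1h"] = counts.to_numpy().astype(int)
    return out


toy = pd.DataFrame(
    {
        "customer_id": ["A", "B", "A", "A", "A"],
        "timestamp": pd.to_datetime(
            [
                "2024-03-01 10:00",
                "2024-03-01 10:05",
                "2024-03-01 10:20",
                "2024-03-01 10:50",
                "2024-03-01 12:30",
            ]
        ),
        "amount": [20.0, 35.0, 15.0, 400.0, 12.0],
    }
)
result = add_velocity(toy)
print(result[["customer_id", "timestamp", "txn_count_1h"]])
```

A column like this could be appended to the feature list used for fraud scoring, so that bursts of activity contribute to the anomaly score alongside amount and time-of-day features.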

In [None]:
# Load credit card data
credit_cards = datasets["credit_card_transactions"].copy()
credit_cards["timestamp"] = pd.to_datetime(credit_cards["timestamp"])

print("Credit Card Transaction Statistics:")
print(f"Total transactions: {len(credit_cards):,}")
print(f"Total amount: ${credit_cards['amount'].sum():,.2f}")
print(f"Average transaction: ${credit_cards['amount'].mean():,.2f}")
print(f"Unique customers: {credit_cards['customer_id'].nunique():,}")
print(f"Unique merchants: {credit_cards['merchant'].nunique():,}")
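# Cast to bool first so that `~` is a logical NOT even if the column was read as 0/1 integers
print(
    f"Card-not-present transactions: {(~credit_cards['card_present'].astype(bool)).sum():,}"
)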

In [None]:
# Credit card fraud detection
def detect_credit_fraud(df, contamination=0.05):
    features_df = df.copy()

    # Time features
    features_df["hour"] = features_df["timestamp"].dt.hour
    features_df["is_night"] = (
        (features_df["hour"] < 6) | (features_df["hour"] > 22)
    ).astype(int)
    features_df["is_weekend"] = (
        features_df["timestamp"].dt.dayofweek.isin([5, 6]).astype(int)
    )

    # Amount features
    features_df["amount_log"] = np.log1p(features_df["amount"])
    features_df["high_amount"] = (features_df["amount"] > 1000).astype(int)

    # Card features
    features_df["card_present_int"] = features_df["card_present"].astype(int)

    # Customer behavior
    customer_stats = (
        features_df.groupby("customer_id")["amount"].agg(["mean", "std"]).reset_index()
    )
    customer_stats.columns = ["customer_id", "avg_amount", "std_amount"]
    customer_stats["std_amount"] = customer_stats["std_amount"].fillna(0)
    # Left join keeps every transaction, including rows with a missing customer_id
    features_df = features_df.merge(customer_stats, on="customer_id", how="left")

    # Deviation from the customer's mean, scaled by their std (+1 avoids division by zero)
    features_df["amount_dev"] = abs(
        features_df["amount"] - features_df["avg_amount"]
    ) / (features_df["std_amount"] + 1)

    # Encode categoricals
    category_encoded = pd.get_dummies(features_df["category"], prefix="cat")
    location_encoded = pd.get_dummies(features_df["location"], prefix="loc")
    features_df = pd.concat([features_df, category_encoded, location_encoded], axis=1)

    # Select features
    feature_columns = [
        "amount",
        "amount_log",
        "high_amount",
        "hour",
        "is_night",
        "is_weekend",
        "card_present_int",
        "mcc_code",
        "amount_dev",
    ]

    # Add encoded features
    cat_cols = [col for col in features_df.columns if col.startswith("cat_")]
    loc_cols = [col for col in features_df.columns if col.startswith("loc_")]
    feature_columns.extend(cat_cols + loc_cols)

    X = features_df[feature_columns].fillna(0).values

    # Run anomaly detection
    sklearn_adapter = SklearnAdapter()
    dataset = Dataset(data=X, target=None, feature_names=feature_columns)
    contamination_rate = ContaminationRate(contamination)

    result = sklearn_adapter.detect_anomalies(
        dataset=dataset,
        algorithm_type="isolation_forest",
        contamination=contamination_rate,
    )

    features_df["fraud_score"] = result.anomaly_scores
    threshold = np.percentile(features_df["fraud_score"], (1 - contamination) * 100)
    features_df["predicted_fraud"] = (features_df["fraud_score"] > threshold).astype(
        int
    )

    return features_df


credit_results = detect_credit_fraud(credit_cards)
print(
    f"Credit card fraud detection complete. Detected {credit_results['predicted_fraud'].sum()} fraudulent transactions."
)

In [None]:
# Analyze credit card fraud
fraud_txns = credit_results[credit_results["predicted_fraud"] == 1]
normal_txns = credit_results[credit_results["predicted_fraud"] == 0]

print("=== CREDIT CARD FRAUD ANALYSIS ===")
print(
    f"Total fraud detected: {len(fraud_txns):,} ({len(fraud_txns) / len(credit_results) * 100:.1f}%)"
)
print(f"Fraud amount: ${fraud_txns['amount'].sum():,.2f}")
print(f"Average fraud amount: ${fraud_txns['amount'].mean():,.2f}")

if len(fraud_txns) > 0:
    print("\nFraud Patterns:")
    print(f"High-value fraud (>$1000): {fraud_txns['high_amount'].sum()}")
    print(f"Card-not-present fraud: {(1 - fraud_txns['card_present_int']).sum()}")
    print(f"Night-time fraud: {fraud_txns['is_night'].sum()}")
    print(f"Weekend fraud: {fraud_txns['is_weekend'].sum()}")

    if "loc_international" in fraud_txns.columns:
        print(f"International fraud: {fraud_txns['loc_international'].sum()}")

    print("\nTop 5 Fraud Transactions:")
    top_fraud = fraud_txns.nlargest(5, "fraud_score")[
        ["transaction_id", "amount", "merchant", "fraud_score"]
    ]
    print(top_fraud.to_string(index=False))

## ATM Transaction Monitoring

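Signals used to detect ATM fraud and skimming attacks (the cell below reports descriptive statistics only; anomaly scoring for ATM data is not yet implemented):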
- Multiple failed attempts
- Geographic anomalies
- Timing patterns
- Amount patterns
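
The first signal above, multiple failed attempts, can be approximated by measuring runs of consecutive declines per card. The sketch below assumes a `card_id` column, which this notebook does not show for the ATM data; `status` and `timestamp` follow the columns used in this section. The helper `declined_streaks` and the cutoff of three declines are hypothetical:

```python
import pandas as pd


def declined_streaks(df):
    """Add, per row, the length of the card's current run of consecutive declines."""
    out = df.sort_values(["card_id", "timestamp"]).copy()
    declined = out["status"].eq("declined")
    # Within each card, every non-declined transaction starts a new run.
    run_id = (~declined).astype(int).groupby(out["card_id"]).cumsum()
    out["decline_streak"] = (
        declined.astype(int).groupby([out["card_id"], run_id]).cumsum()
    )
    return out


toy = pd.DataFrame(
    {
        "card_id": ["X", "Y", "X", "X", "X", "Y", "X"],
        "timestamp": pd.to_datetime(
            [
                "2024-02-01 08:00",
                "2024-02-01 08:01",
                "2024-02-01 08:02",
                "2024-02-01 08:03",
                "2024-02-01 08:10",
                "2024-02-01 08:12",
                "2024-02-01 08:15",
            ]
        ),
        "status": [
            "declined", "success", "declined", "declined",
            "success", "declined", "declined",
        ],
    }
)
streaks = declined_streaks(toy)
flagged_cards = sorted(streaks.loc[streaks["decline_streak"] >= 3, "card_id"].unique())
print(streaks)
print("Cards with 3+ consecutive declines:", flagged_cards)
```

A time limit on the run (for example, declines within a few minutes of each other) would make the signal stricter; that refinement is left out here to keep the sketch short.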

In [None]:
# ATM analysis
atm_txns = datasets["atm_transactions"].copy()
atm_txns["timestamp"] = pd.to_datetime(atm_txns["timestamp"])

print("ATM Transaction Statistics:")
print(f"Total ATM transactions: {len(atm_txns):,}")
print(f"Successful transactions: {(atm_txns['status'] == 'success').sum():,}")
print(f"Declined transactions: {(atm_txns['status'] == 'declined').sum():,}")
print(f"Error transactions: {(atm_txns['status'] == 'error').sum():,}")
print(f"Unique ATMs: {atm_txns['atm_id'].nunique():,}")
print(f"Unique locations: {atm_txns['location'].nunique():,}")

# Withdrawal analysis
withdrawals = atm_txns[atm_txns["transaction_type"] == "withdrawal"]
print("\nWithdrawal Analysis:")
print(f"Total withdrawals: {len(withdrawals):,}")
if len(withdrawals) > 0:
    print(f"Total withdrawn: ${withdrawals['amount'].sum():,.2f}")
    print(f"Average withdrawal: ${withdrawals['amount'].mean():.2f}")
    print("Most common amounts:")
    print(withdrawals["amount"].value_counts().head().to_string())

## Executive Dashboard

Summary of the anomaly detection results for deposits and credit cards. The remaining transaction types are reported as TBD until their analyses are implemented.

In [None]:
# Create executive summary
def create_executive_summary():
    summary = {
        "Metric": [
            "Total Transactions Analyzed",
            "Deposits - Total Anomalies",
            "Credit Cards - Fraud Detected",
            "ATM - Suspicious Activity",
            "Investment - Unusual Patterns",
            "FX - Money Laundering Risk",
            "Expenses - Potential Fraud",
            "GL - Data Quality Issues",
        ],
        "Count": [
            sum(len(df) for df in datasets.values()),
            deposit_results["predicted_anomaly"].sum(),
            credit_results["predicted_fraud"].sum(),
            "TBD",  # ATM analysis not fully implemented
            "TBD",  # Investment analysis not fully implemented
            "TBD",  # FX analysis not fully implemented
            "TBD",  # Expense analysis not fully implemented
            "TBD",  # GL analysis not fully implemented
        ],
        "Risk Level": [
            "INFO",
            "MEDIUM" if deposit_results["predicted_anomaly"].sum() > 100 else "LOW",
            "HIGH" if credit_results["predicted_fraud"].sum() > 500 else "MEDIUM",
            "TBD",
            "TBD",
            "TBD",
            "TBD",
            "TBD",
        ],
    }

    return pd.DataFrame(summary)


executive_summary = create_executive_summary()
print("=== EXECUTIVE SUMMARY ===")
print(executive_summary.to_string(index=False))

# Risk assessment
total_deposit_risk = (
    deposit_anomalies["amount"].sum() if len(deposit_anomalies) > 0 else 0
)
total_fraud_risk = fraud_txns["amount"].sum() if len(fraud_txns) > 0 else 0

print("\n=== FINANCIAL RISK ASSESSMENT ===")
print(f"Suspicious deposit amounts: ${total_deposit_risk:,.2f}")
print(f"Fraudulent transaction amounts: ${total_fraud_risk:,.2f}")
print(f"Total risk exposure: ${total_deposit_risk + total_fraud_risk:,.2f}")

In [None]:
# Create interactive dashboard
fig = make_subplots(
    rows=2,
    cols=2,
    subplot_titles=[
        "Dataset Overview",
        "Deposit Anomalies",
        "Credit Card Fraud",
        "Risk Summary",
    ],
    specs=[
        [{"type": "bar"}, {"type": "scatter"}],
        [{"type": "scatter"}, {"type": "bar"}],
    ],
)

# Dataset sizes
dataset_names = list(datasets.keys())
dataset_sizes = [len(df) for df in datasets.values()]

fig.add_trace(go.Bar(x=dataset_names, y=dataset_sizes, name="Records"), row=1, col=1)

# Deposit amount vs anomaly score
fig.add_trace(
    go.Scatter(
        x=deposit_results["amount"],
        y=deposit_results["ensemble_score"],
        mode="markers",
        marker=dict(color=deposit_results["predicted_anomaly"], colorscale="Viridis"),
        name="Deposits",
    ),
    row=1,
    col=2,
)

# Credit card amount vs fraud score
fig.add_trace(
    go.Scatter(
        x=credit_results["amount"],
        y=credit_results["fraud_score"],
        mode="markers",
        marker=dict(color=credit_results["predicted_fraud"], colorscale="Reds"),
        name="Credit Cards",
    ),
    row=2,
    col=1,
)

# Risk summary
risk_categories = ["Deposits", "Credit Cards", "Other"]
risk_amounts = [total_deposit_risk, total_fraud_risk, 0]

fig.add_trace(
    go.Bar(x=risk_categories, y=risk_amounts, name="Risk Amount"), row=2, col=2
)

fig.update_layout(
    height=800, showlegend=False, title_text="Banking Anomaly Detection Dashboard"
)
fig.show()

## Recommendations and Next Steps

Based on the anomaly detection analysis, here are the key recommendations:

### Immediate Actions
1. **Review high-risk transactions** flagged by the deposit ensemble and the credit card Isolation Forest
2. **Implement real-time monitoring** for the identified patterns
3. **Enhance customer verification** for transactions flagged as anomalous

### System Improvements
1. **Deploy automated alerts** for transactions exceeding risk thresholds
2. **Integrate multiple detection algorithms** for better coverage
3. **Establish feedback loops** to improve model accuracy

### Compliance and Reporting
1. **Generate regulatory reports** for suspicious activity
2. **Document investigation procedures** for anomalous transactions
3. **Validate models regularly** and monitor their performance
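
Model validation can start from the `is_anomaly` labels that the data-loading cells check for. As a rough sketch, precision and recall of the predicted flags against those labels can be computed as below. The helper `precision_recall` is hypothetical, and the label and prediction vectors are toy values, not results from this notebook:

```python
import numpy as np


def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary anomaly flags."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int((y_true & y_pred).sum())  # correctly flagged anomalies
    fp = int((~y_true & y_pred).sum())  # normal records flagged by mistake
    fn = int((y_true & ~y_pred).sum())  # anomalies that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


labels = [1, 0, 0, 1, 1, 0, 0, 0]  # toy ground truth
predictions = [1, 0, 1, 1, 0, 0, 0, 0]  # toy model output

precision, recall = precision_recall(labels, predictions)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

For the deposit results this might be called as `precision_recall(deposit_results["is_anomaly"], deposit_results["predicted_anomaly"])`, assuming the deposits file actually includes the `is_anomaly` column.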

### Future Enhancements
1. **Cross-transaction analysis** to identify complex fraud schemes
2. **Real-time streaming analytics** for immediate detection
3. **Advanced ML models** including deep learning and graph analysis