# Welcome to the Detective's Quest: Unraveling the Money Laundering Syndicate

Welcome, Detective! The king of **Valdris (Kingdom 1)** has entrusted you with a critical mission: investigate the bustling **Goldweave Port (City 5)**, which has been overrun by a sophisticated money laundering syndicate. Armed with magical powers, you can conjure **synthetic data**—realistic, fabricated datasets that mimic the kingdom's financial flows, documents, and testimonies without touching sensitive information or alerting the syndicate to the investigation.

By influencing the city's *data realm*, you will:
- Generate **simulated transaction databases** to trace illicit money movements.
- Craft **bank statements** revealing suspicious businesses.
- Summon **investigative reports** to build a compelling case.

You will then use the synthetic data to:
- Train your familiar to **detect anomalies** in the transaction database.
- Develop a golem that **answers questions and queries** on the synthetic data.

As you progress, you’ll master advanced spells (tools and practices) to ensure your synthetic creations are **accurate**, **private**, and powerful enough to dismantle the syndicate’s operations. Through this lab, you’ll expose the syndicate’s secrets while demonstrating the ethical power of synthetic data in real-world investigations.

Embark on this quest to bring justice to City 5, wielding the art of synthetic data to outsmart the criminals and protect the kingdom!

## Mission 1: Conjuring Financial Flow Projections
**Objective**: Generate synthetic transactional databases using various methods to simulate the movement of money, enabling the tracing of illicit financial flows in a fantasy kingdom setting.



In [5]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

def create_base_transaction_data(n_samples=3000, ml_ratio=0.05):
    """Create realistic financial transaction data with money laundering cases, including typologies in a fantasy kingdom setting"""
    
    np.random.seed(42)
    
    # Define fantasy locations: 4 kingdoms, each with 5 cities
    kingdoms = ['K1', 'K2', 'K3', 'K4']
    cities_per_kingdom = ['C1', 'C2', 'C3', 'C4', 'C5']
    all_locations = [f"{k}{c}" for k in kingdoms for c in cities_per_kingdom]
    
    # Common locations: Most cities
    common_locations = [loc for loc in all_locations if loc not in ['K1C5']]
    
    # High-risk locations: Primarily K1C5 (the corrupt city under investigation)
    high_risk_locations = ['K1C5']
    
    # Monitored locations: Some cities in other kingdoms, e.g., border cities
    monitored_locations = ['K2C5', 'K3C1', 'K4C3']
    
    # Probabilities for location selection
    def select_location(is_ml=False, typology=None):
        if is_ml:
            if typology == 'layering':
                # Higher chance for monitored or high-risk for layering, favor K1C5
                probs = [0.2, 0.3, 0.5]  # common, monitored, high_risk
            else:
                probs = [0.3, 0.3, 0.4]  # Slightly favor high-risk (K1C5)
        else:
            probs = [0.94, 0.05, 0.01]
        
        category = np.random.choice(['common', 'monitored', 'high_risk'], p=probs)
        if category == 'common':
            return np.random.choice(common_locations)
        elif category == 'monitored':
            return np.random.choice(monitored_locations)
        else:
            return np.random.choice(high_risk_locations)
    
    # Base transaction features
    data = {
        'transaction_id': [],
        'amount': [],
        'account_age_days': [],
        'transaction_hour': [],
        'day_of_week': [],
        'transactions_last_24h': [],
        'account_balance_ratio': [],
        'merchant_risk_score': [],
        'cross_border': [],
        'cash_equivalent': [],
        'transaction_type': [],
        'customer_segment': [],
        'sender_location': [],
        'receiver_location': [],
        'is_money_laundering': []
    }
    
    # Generate legitimate transactions
    n_legit = int(n_samples * (1 - ml_ratio))
    for i in range(n_legit):
        data['transaction_id'].append(f'TXN_{len(data["transaction_id"]) + 1:06d}')
        data['amount'].append(np.random.lognormal(mean=6, sigma=2))
        data['account_age_days'].append(int(np.random.gamma(shape=2, scale=365)))
        data['transaction_hour'].append(np.random.randint(0, 24))
        data['day_of_week'].append(np.random.randint(0, 7))
        data['transactions_last_24h'].append(np.random.poisson(2))
        data['account_balance_ratio'].append(np.random.beta(2, 5))
        data['merchant_risk_score'].append(np.random.beta(2, 8))
        data['cross_border'].append(np.random.choice([0, 1], p=[0.9, 0.1]))
        data['cash_equivalent'].append(np.random.choice([0, 1], p=[0.95, 0.05]))
        data['transaction_type'].append(np.random.choice(
            ['wire_transfer', 'card_payment', 'atm_withdrawal', 
             'online_transfer', 'check_deposit', 'cash_deposit'],
            p=[0.15, 0.35, 0.15, 0.25, 0.05, 0.05]
        ))
        data['customer_segment'].append(np.random.choice(
            ['retail', 'business', 'premium', 'corporate'],
            p=[0.6, 0.2, 0.15, 0.05]
        ))
        sender = select_location(is_ml=False)
        receiver = select_location(is_ml=False)
        data['sender_location'].append(sender)
        data['receiver_location'].append(receiver)
        data['cross_border'][-1] = 1 if sender[:2] != receiver[:2] else 0  # Cross-border if different kingdoms
        data['is_money_laundering'].append(0)
    
    # Generate money laundering transactions with typologies (focusing on corrupt syndicate in K1C5)
    n_ml_base = int(n_samples * ml_ratio)
    for i in range(n_ml_base):
        # Choose typology
        typology = np.random.choice(['structuring', 'layering', 'integration'], p=[0.4, 0.4, 0.2])
        
        # Base features for this ML case
        base_amount = np.random.lognormal(mean=6, sigma=2) * np.random.uniform(3, 15)
        account_age = int(np.random.gamma(shape=2, scale=365))
        day_of_week = np.random.randint(0, 7)
        merchant_risk = np.random.beta(8, 2)
        transaction_type = np.random.choice(
            ['wire_transfer', 'cash_deposit', 'online_transfer'],
            p=[0.6, 0.3, 0.1]
        )
        customer_segment = np.random.choice(
            ['retail', 'business', 'premium', 'corporate'],
            p=[0.6, 0.2, 0.15, 0.05]
        )
        sender = select_location(is_ml=True, typology=typology)
        receiver = select_location(is_ml=True, typology=typology)
        # Ensure K1C5 involvement in at least one side for syndicate investigation
        if np.random.random() < 0.7:  # 70% chance to involve K1C5
            if np.random.random() < 0.5:
                sender = 'K1C5'
            else:
                receiver = 'K1C5'
        cross_border = 1 if sender[:2] != receiver[:2] else 0
        cash_equivalent = np.random.choice([0, 1], p=[0.6, 0.4])
        account_balance_ratio = np.random.beta(2, 5)
        
        if typology == 'structuring':
            # Structuring: multiple small transactions to avoid detection by kingdom guards
            n_splits = np.random.randint(3, 10)
            split_amounts = np.full(n_splits, base_amount / n_splits)
            # Add some variation
            split_amounts += np.random.normal(0, base_amount * 0.05, n_splits)
            split_amounts = np.maximum(split_amounts, 1)  # Ensure positive
            
            # Close timings: same day, hours within 24h
            base_hour = np.random.choice([0, 1, 2, 3, 22, 23]) if np.random.random() < 0.6 else np.random.randint(0, 24)
            hours = np.sort(np.random.randint(base_hour - 12, base_hour + 12, n_splits) % 24)
            
            for j in range(n_splits):
                amount = split_amounts[j]
                if np.random.random() < 0.7:
                    amount = round(amount, -2)  # Round to nearest 100
                
                data['transaction_id'].append(f'TXN_{len(data["transaction_id"]) + 1:06d}')
                data['amount'].append(amount)
                data['account_age_days'].append(account_age)
                data['transaction_hour'].append(hours[j])
                data['day_of_week'].append(day_of_week)
                data['transactions_last_24h'].append(n_splits)
                data['account_balance_ratio'].append(account_balance_ratio)
                data['merchant_risk_score'].append(merchant_risk)
                data['cross_border'].append(cross_border)
                data['cash_equivalent'].append(cash_equivalent)
                data['transaction_type'].append('cash_deposit' if np.random.random() < 0.5 else transaction_type)
                data['customer_segment'].append(customer_segment)
                data['sender_location'].append(sender)
                data['receiver_location'].append(receiver)
                data['is_money_laundering'].append(1)
        
        elif typology == 'layering':
            # Layering: cross-kingdom transfers to obscure origins, often through K1C5
            hop_count = np.random.randint(3, 8)  # Not directly used, but influence risk
            amount = base_amount
            if np.random.random() < 0.7:
                amount = round(amount, -2)
            hour = np.random.choice([0, 1, 2, 3, 22, 23]) if np.random.random() < 0.6 else np.random.randint(0, 24)
            transactions_last_24h = np.random.randint(5, 15)
            cross_border = 1  # Always cross-border for layering
            transaction_type = 'wire_transfer' if np.random.random() < 0.7 else 'online_transfer'
            # Ensure different kingdoms if needed
            if sender[:2] == receiver[:2]:
                receiver = select_location(is_ml=True, typology=typology)
            
            data['transaction_id'].append(f'TXN_{len(data["transaction_id"]) + 1:06d}')
            data['amount'].append(amount)
            data['account_age_days'].append(account_age)
            data['transaction_hour'].append(hour)
            data['day_of_week'].append(day_of_week)
            data['transactions_last_24h'].append(transactions_last_24h)
            data['account_balance_ratio'].append(account_balance_ratio)
            data['merchant_risk_score'].append(merchant_risk * 1.2)  # Slightly higher
            data['cross_border'].append(cross_border)
            data['cash_equivalent'].append(0)
            data['transaction_type'].append(transaction_type)
            data['customer_segment'].append(customer_segment)
            data['sender_location'].append(sender)
            data['receiver_location'].append(receiver)
            data['is_money_laundering'].append(1)
        
        else:  # integration
            # Integration: blending funds as legitimate in kingdom economy, often via K1C5 merchants
            amount = base_amount * np.random.uniform(0.8, 1.2)
            if np.random.random() < 0.5:
                amount = round(amount, -2)
            hour = np.random.randint(0, 24)
            transactions_last_24h = np.random.randint(1, 5)
            cross_border = np.random.choice([0, 1], p=[0.7, 0.3])
            transaction_type = np.random.choice(['check_deposit', 'cash_deposit'], p=[0.6, 0.4])
            merchant_risk = np.random.beta(2, 8)  # Lower risk
            
            data['transaction_id'].append(f'TXN_{len(data["transaction_id"]) + 1:06d}')
            data['amount'].append(amount)
            data['account_age_days'].append(account_age)
            data['transaction_hour'].append(hour)
            data['day_of_week'].append(day_of_week)
            data['transactions_last_24h'].append(transactions_last_24h)
            data['account_balance_ratio'].append(account_balance_ratio)
            data['merchant_risk_score'].append(merchant_risk)
            data['cross_border'].append(cross_border)
            data['cash_equivalent'].append(1)
            data['transaction_type'].append(transaction_type)
            data['customer_segment'].append('business' if np.random.random() < 0.6 else customer_segment)
            data['sender_location'].append(sender)
            data['receiver_location'].append(receiver)
            data['is_money_laundering'].append(1)
    
    df = pd.DataFrame(data)
    print(f"✓ Total transactions generated: {len(df)} (approx {ml_ratio*100:.1f}% ML syndicate cases)")
    print(f"  💰 Corrupt syndicate cases: {df['is_money_laundering'].sum()}")
    print(f"  🏰 Transactions involving K1C5: {((df['sender_location'] == 'K1C5') | (df['receiver_location'] == 'K1C5')).sum()}")
    return df

class SDVSynthesizerEngine:
    """
    Multi-algorithm synthetic data generation using SDV for fantasy detective investigation:
    - GaussianCopula (Statistical approach)
    - CTGAN (Deep learning approach)  
    - TVAE (Variational autoencoder approach)
    - GraphNetwork (Network-based approach)
    """
    
    def __init__(self):
        self.synthesizers = {}
        self.results = {}
        
    def prepare_metadata(self, df):
        """Create SDV metadata for the dataset"""
        from sdv.metadata import SingleTableMetadata
        
        # Remove transaction_id for synthesis
        df_clean = df.drop('transaction_id', axis=1)
        
        # Create metadata automatically
        metadata = SingleTableMetadata()
        metadata.detect_from_dataframe(df_clean)
        
        # Define categorical columns explicitly
        categorical_cols = [
            'transaction_type', 'customer_segment', 'cross_border', 
            'cash_equivalent', 'is_money_laundering', 'day_of_week',
            'sender_location', 'receiver_location'
        ]
        
        for col in categorical_cols:
            if col in df_clean.columns:
                metadata.update_column(col, sdtype='categorical')
        
        # Define numerical columns with constraints
        numerical_cols = [
            'amount', 'account_age_days', 'transaction_hour',
            'transactions_last_24h', 'account_balance_ratio',
            'merchant_risk_score'
        ]
        for col in numerical_cols:
            if col in df_clean.columns:
                metadata.update_column(col, sdtype='numerical')
        
        print(f"✓ Created metadata for {len(df_clean.columns)} columns")
        return df_clean, metadata
    
    def synthesize_with_gaussian_copula(self, df, metadata, n_samples=1000):
        """Fast statistical approach - great for preserving correlations in syndicate patterns"""
        print("📊 Conjuring with SDV GaussianCopula (Statistical Magic)...")
        
        try:
            from sdv.single_table import GaussianCopulaSynthesizer
            
            # Initialize synthesizer
            synthesizer = GaussianCopulaSynthesizer(metadata)
            
            # Fit the model
            synthesizer.fit(df)
            
            # Generate synthetic data
            synthetic_data = synthesizer.sample(num_rows=n_samples)
            
            self.synthesizers['gaussian_copula'] = synthesizer
            self.results['gaussian_copula'] = synthetic_data
            
            print(f"✓ GaussianCopula: Generated {len(synthetic_data)} transactions")
            return synthetic_data
            
        except Exception as e:
            print(f"✗ GaussianCopula failed: {e}")
            return None
    
    def synthesize_with_ctgan(self, df, metadata, n_samples=1000, epochs=100):
        """Deep learning GAN approach - captures complex syndicate patterns"""
        print("🧠 Channeling SDV CTGAN (Deep Learning Sorcery)...")
        
        try:
            from sdv.single_table import CTGANSynthesizer
            
            # Initialize with optimized parameters for speed
            synthesizer = CTGANSynthesizer(
                metadata,
                epochs=epochs,
                batch_size=500,
                verbose=False
            )
            
            # Fit the model
            synthesizer.fit(df)
            
            # Generate synthetic data
            synthetic_data = synthesizer.sample(num_rows=n_samples)
            
            self.synthesizers['ctgan'] = synthesizer
            self.results['ctgan'] = synthetic_data
            
            print(f"✓ CTGAN: Generated {len(synthetic_data)} transactions")
            return synthetic_data
            
        except Exception as e:
            print(f"✗ CTGAN failed: {e}")
            return None
    
    def synthesize_with_tvae(self, df, metadata, n_samples=1000, epochs=100):
        """Variational Autoencoder approach - good for detecting syndicate anomalies"""
        print("🤖 Weaving with SDV TVAE (Variational Enchantment)...")
        
        try:
            from sdv.single_table import TVAESynthesizer
            
            # Initialize with optimized parameters
            synthesizer = TVAESynthesizer(
                metadata,
                epochs=epochs,
                batch_size=500,
                verbose=False
            )
            
            # Fit the model
            synthesizer.fit(df)
            
            # Generate synthetic data
            synthetic_data = synthesizer.sample(num_rows=n_samples)
            
            self.synthesizers['tvae'] = synthesizer
            self.results['tvae'] = synthetic_data
            
            print(f"✓ TVAE: Generated {len(synthetic_data)} transactions")
            return synthetic_data
            
        except Exception as e:
            print(f"✗ TVAE failed: {e}")
            return None
    
    def synthesize_with_graph_network(self, df, metadata, n_samples=1000, n_nodes=500):
        """Network-based approach - models transaction relationships in the syndicate"""
        print("🕸️ Weaving Synthetic Transaction Network (Graph Magic)...")
        
        try:
            import networkx as nx
            
            # Initialize directed graph
            G = nx.DiGraph()
            np.random.seed(42)
            
            # Create nodes (entities: customers and merchants)
            n_customers = n_nodes // 2
            n_merchants = n_nodes - n_customers
            customers = [f'CUST_{i:05d}' for i in range(1, n_customers + 1)]
            merchants = [f'MERCH_{i:04d}' for i in range(1, n_merchants + 1)]
            
            # Get all unique locations from the original data
            all_locations = list(set(df['sender_location'].unique()) | set(df['receiver_location'].unique()))
            
            # Add nodes with attributes
            for customer in customers:
                # 30% of customers are in K1C5 (the corrupt city)
                location = 'K1C5' if np.random.random() < 0.3 else np.random.choice(
                    [loc for loc in all_locations if loc != 'K1C5'] if len([loc for loc in all_locations if loc != 'K1C5']) > 0 else all_locations
                )
                G.add_node(customer, type='customer', location=location)
            
            for merchant in merchants:
                # 25% of merchants are in K1C5
                location = 'K1C5' if np.random.random() < 0.25 else np.random.choice(
                    [loc for loc in all_locations if loc != 'K1C5'] if len([loc for loc in all_locations if loc != 'K1C5']) > 0 else all_locations
                )
                G.add_node(merchant, type='merchant', location=location)
            
            # Create lists of nodes by type and location for efficient selection
            customers_k1c5 = [n for n, d in G.nodes(data=True) if d['type'] == 'customer' and d['location'] == 'K1C5']
            merchants_k1c5 = [n for n, d in G.nodes(data=True) if d['type'] == 'merchant' and d['location'] == 'K1C5']
            customers_other = [n for n, d in G.nodes(data=True) if d['type'] == 'customer' and d['location'] != 'K1C5']
            merchants_other = [n for n, d in G.nodes(data=True) if d['type'] == 'merchant' and d['location'] != 'K1C5']
            
            # Ensure we have nodes in each category
            if not customers_k1c5:
                customers_k1c5 = customers[:5]  # Fallback to first 5 customers
                for c in customers_k1c5:
                    G.nodes[c]['location'] = 'K1C5'
            if not merchants_k1c5:
                merchants_k1c5 = merchants[:5]  # Fallback to first 5 merchants
                for m in merchants_k1c5:
                    G.nodes[m]['location'] = 'K1C5'
            if not customers_other:
                customers_other = customers[5:] if len(customers) > 5 else customers
            if not merchants_other:
                merchants_other = merchants[5:] if len(merchants) > 5 else merchants
            
            # Generate edges (transactions) with attributes
            ml_ratio = df['is_money_laundering'].mean()
            edges = []
            
            for _ in range(n_samples):
                # Determine if this is a money laundering transaction
                is_ml = np.random.random() < ml_ratio
                
                # Select sender and receiver based on ML status
                if is_ml and np.random.random() < 0.7:  # 70% of ML involves K1C5
                    if np.random.random() < 0.5:
                        # Sender from K1C5
                        sender = np.random.choice(customers_k1c5)
                        receiver = np.random.choice(merchants_k1c5 + merchants_other)
                    else:
                        # Receiver in K1C5
                        sender = np.random.choice(customers_k1c5 + customers_other)
                        receiver = np.random.choice(merchants_k1c5)
                else:
                    # Normal transaction pattern
                    sender = np.random.choice(customers)
                    receiver = np.random.choice(merchants)
                
                # Get locations
                sender_location = G.nodes[sender]['location']
                receiver_location = G.nodes[receiver]['location']
                
                # Calculate cross_border
                cross_border = 1 if sender_location[:2] != receiver_location[:2] else 0
                
                # Generate transaction attributes based on ML status
                if is_ml:
                    # ML transactions tend to be larger and use specific types
                    amount = np.random.lognormal(mean=7, sigma=2)
                    transaction_type = np.random.choice(
                        ['wire_transfer', 'cash_deposit', 'online_transfer'],
                        p=[0.6, 0.3, 0.1]
                    )
                    merchant_risk_score = np.random.beta(8, 2)  # Higher risk
                    cash_equivalent = np.random.choice([0, 1], p=[0.6, 0.4])
                    transactions_last_24h = np.random.poisson(5)  # More activity
                else:
                    # Normal transactions
                    amount = np.random.lognormal(mean=6, sigma=2)
                    transaction_type = np.random.choice(
                        df['transaction_type'].unique(),
                        p=df['transaction_type'].value_counts(normalize=True).values
                    )
                    merchant_risk_score = np.random.beta(2, 8)  # Lower risk
                    cash_equivalent = np.random.choice([0, 1], p=[0.95, 0.05])
                    transactions_last_24h = np.random.poisson(2)
                
                # Create edge data
                edge_data = {
                    'sender': sender,
                    'receiver': receiver,
                    'amount': max(1, amount),  # Ensure positive amount
                    'transaction_type': transaction_type,
                    'cross_border': cross_border,
                    'is_money_laundering': int(is_ml),
                    'account_age_days': int(np.random.gamma(shape=2, scale=365)),
                    'transaction_hour': np.random.randint(0, 24),
                    'day_of_week': np.random.randint(0, 7),
                    'transactions_last_24h': transactions_last_24h,
                    'account_balance_ratio': np.random.beta(2, 5),
                    'merchant_risk_score': merchant_risk_score,
                    'cash_equivalent': cash_equivalent,
                    'customer_segment': np.random.choice(
                        df['customer_segment'].unique(),
                        p=df['customer_segment'].value_counts(normalize=True).values
                    ),
                    'sender_location': sender_location,
                    'receiver_location': receiver_location
                }
                
                edges.append(edge_data)
                G.add_edge(sender, receiver, **edge_data)
            
            # Convert edges to DataFrame
            synthetic_data = pd.DataFrame(edges)
            
            # Drop the sender and receiver columns as they're not in the original schema
            synthetic_data = synthetic_data.drop(['sender', 'receiver'], axis=1)
            
            # Ensure column order matches original (minus transaction_id)
            original_cols = [col for col in df.columns if col != 'transaction_id']
            synthetic_data = synthetic_data[original_cols]
            
            self.synthesizers['graph_network'] = G
            self.results['graph_network'] = synthetic_data
            
            # Print statistics
            ml_count = synthetic_data['is_money_laundering'].sum()
            k1c5_count = ((synthetic_data['sender_location'] == 'K1C5') | 
                        (synthetic_data['receiver_location'] == 'K1C5')).sum()
            
            print(f"✓ GraphNetwork: Generated {len(synthetic_data)} transactions")
            print(f"  Network: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
            print(f"  ML transactions: {ml_count} ({ml_count/len(synthetic_data)*100:.1f}%)")
            print(f"  K1C5 involvement: {k1c5_count} ({k1c5_count/len(synthetic_data)*100:.1f}%)")
            
            return synthetic_data
            
        except Exception as e:
            print(f"✗ GraphNetwork failed: {e}")
            import traceback
            traceback.print_exc()
            return None

def evaluate_synthetic_quality(original_df, synthetic_results, metadata):
    """Comprehensive evaluation using SDV quality metrics for detective analysis"""
    
    print("\n" + "="*60)
    print("🔍 PHANTOM QUALITY ASSESSMENT - SDV MULTI-METHOD ANALYSIS")
    print("="*60)
    
    if not synthetic_results:
        print("❌ No synthetic phantoms to evaluate!")
        return
    
    try:
        from sdv.evaluation.single_table import evaluate_quality
        
        evaluation_results = {}
        
        for method_name, synthetic_df in synthetic_results.items():
            print(f"\n🎭 Evaluating {method_name.upper()} Phantoms:")
            print("-" * 45)
            
            try:
                # Ensure synthetic_df has the same columns as original_df (excluding transaction_id)
                synthetic_df = synthetic_df[original_df.columns]
                
                # SDV quality evaluation with metadata
                quality_report = evaluate_quality(
                    real_data=original_df,
                    synthetic_data=synthetic_df,
                    metadata=metadata
                )
                overall_score = quality_report.get_score()
                
                print(f"🏆 Overall Quality Score: {overall_score:.3f}/1.000")
                
                # Detailed breakdown
                column_shapes = quality_report.get_details('Column Shapes')
                column_pair_trends = quality_report.get_details('Column Pair Trends')
                
                print(f"📊 Column Shape Score: {column_shapes['Score'].mean():.3f}")
                print(f"🔗 Pair Trends Score: {column_pair_trends['Score'].mean():.3f}")
                
                evaluation_results[method_name] = {
                    'overall_score': overall_score,
                    'column_shapes': column_shapes['Score'].mean(),
                    'pair_trends': column_pair_trends['Score'].mean()
                }
                
            except Exception as e:
                print(f"❌ SDV evaluation failed for {method_name}: {e}")
                # Fallback to manual evaluation
                evaluation_results[method_name] = manual_evaluation(original_df, synthetic_df, method_name)
        
        # Summary comparison
        print(f"\n🏅 PHANTOM SYNTHESIS LEADERBOARD:")
        print("-" * 40)
        sorted_results = sorted(evaluation_results.items(), 
                               key=lambda x: x[1]['overall_score'], reverse=True)
        
        for rank, (method, scores) in enumerate(sorted_results, 1):
            print(f"{rank}. {method.upper()}: {scores['overall_score']:.3f}")
        
        return evaluation_results
        
    except ImportError:
        print("⚠️ SDV evaluation metrics not available, using manual assessment...")
        return manual_evaluation_all(original_df, synthetic_results)

def manual_evaluation(original_df, synthetic_df, method_name):
    """Manual quality assessment when SDV evaluation not available"""
    
    # Ensure synthetic_df has the same columns as original_df
    synthetic_df = synthetic_df[original_df.columns]
    
    # ML ratio preservation
    orig_ml_rate = original_df['is_money_laundering'].mean()
    synth_ml_rate = synthetic_df['is_money_laundering'].mean()
    ml_score = max(0, 1 - abs(orig_ml_rate - synth_ml_rate) / orig_ml_rate) if orig_ml_rate > 0 else 0
    
    # Amount distribution similarity
    orig_amount_mean = original_df['amount'].mean()
    synth_amount_mean = synthetic_df['amount'].mean()
    amount_score = max(0, 1 - abs(orig_amount_mean - synth_amount_mean) / orig_amount_mean) if orig_amount_mean > 0 else 0
    
    # Data validity checks
    hour_valid = ((synthetic_df['transaction_hour'] >= 0) & 
                  (synthetic_df['transaction_hour'] <= 23)).all() if 'transaction_hour' in synthetic_df.columns else False
    amount_positive = (synthetic_df['amount'] > 0).all() if 'amount' in synthetic_df.columns else False
    
    validity_score = (hour_valid + amount_positive) / 2
    
    print(f"💰 ML Rate Score: {ml_score:.3f}")
    print(f"💵 Amount Score: {amount_score:.3f}")
    print(f"✅ Validity Score: {validity_score:.3f}")
    
    return {
        'overall_score': (ml_score + amount_score + validity_score) / 3,
        'ml_rate_score': ml_score,
        'amount_score': amount_score,
        'validity_score': validity_score
    }

def manual_evaluation_all(original_df, synthetic_results):
    """Manual evaluation for all methods"""
    
    evaluation_results = {}
    for method_name, synthetic_df in synthetic_results.items():
        print(f"\n🎭 Evaluating {method_name.upper()} Phantoms:")
        print("-" * 45)
        evaluation_results[method_name] = manual_evaluation(original_df, synthetic_df, method_name)
    
    return evaluation_results

def create_customer_merchant_networks(df, n_customers=500, n_merchants=200):
    """Generate customer-merchant relationship networks with suspicious syndicate patterns"""
    
    print("\n🕸️ Weaving Customer-Merchant Networks...")
    
    # Create customer and merchant IDs in fantasy theme
    customers = [f'CUST_{i:05d}' for i in range(1, n_customers + 1)]
    merchants = [f'MERCH_{i:04d}' for i in range(1, n_merchants + 1)]
    
    # Add network relationships to transactions
    df_network = df.copy()
    df_network['customer_id'] = np.random.choice(customers, len(df))
    df_network['merchant_id'] = np.random.choice(merchants, len(df))
    
    # Create suspicious clustering patterns for ML cases, focusing on K1C5 syndicate
    ml_mask = df_network['is_money_laundering'] == 1
    
    if ml_mask.sum() > 0:
        # High-risk merchant cluster (syndicate-linked)
        high_risk_merchants = merchants[:int(len(merchants) * 0.1)]  # Top 10% as high-risk
        df_network.loc[ml_mask, 'merchant_id'] = np.random.choice(
            high_risk_merchants, ml_mask.sum())
        
        # Suspicious customer cluster (layering pattern in corrupt city)
        suspicious_customers = customers[:int(len(customers) * 0.1)]  # Top 10% suspicious
        df_network.loc[ml_mask, 'customer_id'] = np.random.choice(
            suspicious_customers, ml_mask.sum())
    
    print(f"✓ Network created: {len(customers)} customers, {len(merchants)} merchants")
    print(f"  High-risk merchants: {len(high_risk_merchants) if ml_mask.sum() > 0 else 0}")
    print(f"  Suspicious customers: {len(suspicious_customers) if ml_mask.sum() > 0 else 0}")
    
    return df_network

def generate_time_series_flows(df, days=30):
    """Generate realistic time-series financial flow patterns for investigation"""
    
    print("\n⏰ Conjuring Time-Series Financial Flows...")
    
    # Create realistic timestamps
    start_date = datetime(2024, 1, 1)
    timestamps = []
    
    for i in range(len(df)):
        day_offset = int(np.random.randint(0, days))
        hour = int(df.iloc[i]['transaction_hour'])
        minute = int(np.random.randint(0, 60))
        second = int(np.random.randint(0, 60))
        
        timestamp = start_date + timedelta(days=day_offset, hours=hour, 
                                         minutes=minute, seconds=second)
        timestamps.append(timestamp)
    
    df_timeseries = df.copy()
    df_timeseries['timestamp'] = pd.to_datetime(timestamps)  # Ensure datetime type
    
    # Sort by timestamp to ensure chronological order
    df_timeseries = df_timeseries.sort_values('timestamp').reset_index(drop=True)
    
    # Add velocity and pattern features
    if 'customer_id' in df_timeseries.columns:
        # Running totals by customer
        df_timeseries['running_total'] = df_timeseries.groupby('customer_id')['amount'].cumsum()
        
        # Transaction frequency features
        df_timeseries['hour_of_day'] = df_timeseries['timestamp'].dt.hour
        df_timeseries['day_of_week'] = df_timeseries['timestamp'].dt.dayofweek
        
        # Set timestamp as index for rolling operations
        df_timeseries = df_timeseries.set_index('timestamp')
        
        # Velocity indicators (amount of money moved recently)
        df_timeseries = df_timeseries.sort_values(['customer_id', 'timestamp'])
        df_timeseries['velocity_1h'] = df_timeseries.groupby('customer_id')['amount'].rolling(
            window='1H', min_periods=1).sum().reset_index(level=0, drop=True)
        df_timeseries['velocity_24h'] = df_timeseries.groupby('customer_id')['amount'].rolling(
            window='24H', min_periods=1).sum().reset_index(level=0, drop=True)
        
        # Reset index to make timestamp a column again
        df_timeseries = df_timeseries.reset_index()
    
    print(f"✓ Time-series flows generated over {days} days")
    print(f"  Features: timestamp, running_total, velocity_1h, velocity_24h")
    
    return df_timeseries


In [6]:
# Fantasy AML Mission 1 - Complete Runner Script
# Run this after your main code to execute and evaluate all synthesis methods

def run_complete_mission_1():
    """Execute the complete Mission 1 pipeline"""
    
    print("🏰" + "="*58 + "🏰")
    print("     FANTASY KINGDOM AML INVESTIGATION - MISSION 1")
    print("         Synthetic Data Generation & Evaluation")
    print("🏰" + "="*58 + "🏰")
    
    # Step 1: Generate base transaction data
    print("\n🎯 STEP 1: Generating Base Transaction Data...")
    original_df = create_base_transaction_data(n_samples=3000, ml_ratio=0.05)
    
    # Step 2: Initialize SDV synthesis engine
    print("\n🎯 STEP 2: Initializing Multi-Algorithm Synthesis Engine...")
    engine = SDVSynthesizerEngine()
    
    # Step 3: Prepare metadata
    print("\n🎯 STEP 3: Preparing SDV Metadata...")
    df_clean, metadata = engine.prepare_metadata(original_df)
    
    # Step 4: Run all synthesis methods
    print("\n🎯 STEP 4: Running All Synthesis Methods...")
    
    # Method 1: GaussianCopula (Fast statistical approach)
    gaussian_result = engine.synthesize_with_gaussian_copula(
        df_clean, metadata, n_samples=1000
    )
    
    # Method 2: CTGAN (Deep learning - reduce epochs for demo)
    ctgan_result = engine.synthesize_with_ctgan(
        df_clean, metadata, n_samples=1000, epochs=50  # Reduced for demo
    )
    
    # Method 3: TVAE (Variational autoencoder)
    tvae_result = engine.synthesize_with_tvae(
        df_clean, metadata, n_samples=1000, epochs=50  # Reduced for demo
    )
    
    # Method 4: Graph Network (Custom network-based)
    graph_result = engine.synthesize_with_graph_network(
        df_clean, metadata, n_samples=1000
    )
    
    # Step 5: Quality evaluation
    print("\n🎯 STEP 5: Comprehensive Quality Evaluation...")
    evaluation_results = evaluate_synthetic_quality(
        df_clean, engine.results, metadata
    )
    
    # Step 6: Enhanced features
    print("\n🎯 STEP 6: Generating Enhanced Network Features...")
    
    # Add customer-merchant networks to the best performing method
    best_method = max(evaluation_results.items(), key=lambda x: x[1]['overall_score'])[0]
    best_synthetic = engine.results[best_method].copy()
    
    # Add transaction IDs back
    best_synthetic['transaction_id'] = [f'SYN_{i+1:06d}' for i in range(len(best_synthetic))]
    
    # Create network features
    df_with_networks = create_customer_merchant_networks(
        best_synthetic, n_customers=500, n_merchants=200
    )
    
    # Step 7: Time-series flows
    print("\n🎯 STEP 7: Generating Time-Series Financial Flows...")
    final_df = generate_time_series_flows(df_with_networks, days=30)
    
    # Final summary
    print(f"\n🏆 MISSION 1 COMPLETE!")
    print("="*50)
    print(f"📊 Original Data: {len(original_df):,} transactions")
    print(f"🎭 Best Method: {best_method.upper()}")
    print(f"✨ Final Dataset: {len(final_df):,} transactions")
    print(f"💰 ML Cases: {final_df['is_money_laundering'].sum():,}")
    print(f"🏰 K1C5 Involvement: {((final_df['sender_location'] == 'K1C5') | (final_df['receiver_location'] == 'K1C5')).sum():,}")
    print(f"🕸️ Network Entities: {final_df['customer_id'].nunique():,} customers, {final_df['merchant_id'].nunique():,} merchants")
    print(f"⏰ Time Range: {final_df['timestamp'].min()} to {final_df['timestamp'].max()}")
    
    return {
        'original_data': original_df,
        'synthetic_results': engine.results,
        'evaluation_results': evaluation_results,
        'best_method': best_method,
        'final_enhanced_data': final_df,
        'engine': engine
    }

# Run the complete mission
if __name__ == "__main__":
    # Execute Mission 1
    results = run_complete_mission_1()
    
    
    print("\n🎉 Ready for Mission 2: Detective ML Model Training!")

     FANTASY KINGDOM AML INVESTIGATION - MISSION 1
         Synthetic Data Generation & Evaluation

🎯 STEP 1: Generating Base Transaction Data...
✓ Total transactions generated: 3349 (approx 5.0% ML syndicate cases)
  💰 Corrupt syndicate cases: 499
  🏰 Transactions involving K1C5: 489

🎯 STEP 2: Initializing Multi-Algorithm Synthesis Engine...

🎯 STEP 3: Preparing SDV Metadata...
✓ Created metadata for 14 columns

🎯 STEP 4: Running All Synthesis Methods...
📊 Conjuring with SDV GaussianCopula (Statistical Magic)...
✓ GaussianCopula: Generated 1000 transactions
🧠 Channeling SDV CTGAN (Deep Learning Sorcery)...
✓ CTGAN: Generated 1000 transactions
🤖 Weaving with SDV TVAE (Variational Enchantment)...
✓ TVAE: Generated 1000 transactions
🕸️ Weaving Synthetic Transaction Network (Graph Magic)...
✓ GraphNetwork: Generated 1000 transactions
  Network: 500 nodes, 994 edges
  ML transactions: 158 (15.8%)
  K1C5 involvement: 524 (52.4%)

🎯 STEP 5: Comprehensive Quality Evaluation...

🔍 PHANTOM QUA

Mission 1 Review


Your task was to forge a synthetic dataset of financial transactions—a magical ledger mimicking real commerce while guarding sensitive secrets. This dataset powers our anti-money laundering (AML) efforts, targeting shady syndicates in K1C5 (Kingdom 1, City 5).

You ran the data generation and got these scores:

- **GAUSSIAN_COPULA**: 0.909
- **CTGAN**: 0.869
- **GRAPH_NETWORK**: 0.862
- **TVAE**: 0.579

These scores, from the `SDV` library, measure how well the synthetic data matches the real dataset’s patterns (e.g., via log-likelihood or similarity metrics). Higher scores mean better fidelity. Let’s explore the four methods behind your enchanted ledger!

## The Four Methods of Synthetic Data Creation

You used four methods to conjure synthetic transactions, each like a unique spell in the Arcane Library, designed to replicate the Kingdom’s financial patterns, including K1C5’s suspicious activities. Here’s what each does and why your scores matter.

### 1. Gaussian Copula (Score: 0.909)
**What It Does**: Models relationships between features (e.g., amount, sender_location) using a copula, assuming data fits a Gaussian distribution.

**Key Points**:
- Captures correlations (e.g., high amounts with cross-border transactions).
- Uses `sdv.tabular.GaussianCopula` with `ml_ratio=0.05` (5% money laundering).
- Fast and great for numerical data.

**Your Score**: 0.909 is the highest, showing excellent fidelity for transaction patterns.

### 2. CTGAN (Score: 0.869)
**What It Does**: Uses a GAN (generator vs. discriminator) to create realistic transactions, modeling complex patterns.

**Key Points**:
- Handles numerical and categorical features (e.g., transaction_type).
- Uses `ctgan>=0.11.0` with constraints.
- Strong for money laundering patterns.

**Your Score**: 0.869 indicates robust data, capturing intricate K1C5 behaviors.

### 3. Graph Network (Score: 0.862)
**What It Does**: Models transactions as a network (accounts as nodes, transactions as edges) to mimic syndicate structures.

**Key Points**:
- Uses `networkx>=3.4.2` with features like clustering coefficients.
- Focuses on K1C5 syndicate patterns.
- Ideal for graph-based AML detection.

**Your Score**: 0.862 shows strong network fidelity, great for syndicate analysis.

### 4. TVAE (Score: 0.579)
**What It Does**: Uses a variational autoencoder to learn and generate transactions from a compressed data representation.

**Key Points**:
- Uses `sdv.tabular.TVAE` for complex patterns.
- Good for mixed data but sensitive to tuning.
- May struggle with sparse categories.

**Your Score**: 0.579 suggests poor performance, likely due to tuning issues.

## Understanding Your Scores
The scores (0.909, 0.869, 0.862, 0.579) reflect synthetic data quality. Gaussian Copula (0.909) excels for numerical patterns, CTGAN (0.869) captures complex behaviors, Graph Network (0.862) suits syndicate detection, and TVAE (0.579) needs tuning. Use Gaussian Copula or CTGAN data for most AML tests, and Graph Network for K1C5 syndicates.

Well done, Investigators! Your ledger is ready to unmask K1C5’s secrets!

## Mission 2: The Document Forge
**Objective**: Create realistic synthetic documents critical to a money laundering investigation, including a whistleblower report, bank statements, and suspicious activity report, to build a case against the syndicate.

In [7]:
import os
import random
import datetime
from datetime import timedelta
from typing import Dict, List, Any
import pandas as pd
import numpy as np
import markovify
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle
from reportlab.lib import colors
from reportlab.lib.enums import TA_LEFT, TA_CENTER, TA_RIGHT

class FantasyDocumentForge:
    """Forge synthetic documents for the magical kingdom's financial investigation"""
    
    def __init__(self, output_dir: str = "mission2_documents", seed: int = 42):
        random.seed(seed)
        np.random.seed(seed)
        
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
        
        # Initialize text generation
        self._setup_text_generators()
        
        # Fantasy kingdom structure (matching Mission 1)
        self.kingdoms = {
            'K1': 'Valdris', 'K2': 'Aethon', 'K3': 'Ironhold', 'K4': 'Mystwood'
        }
        
        self.cities = {
            'K1C1': 'Crownhaven', 'K1C2': 'Silverdale', 'K1C3': 'Ironbridge', 
            'K1C4': 'Stormwatch', 'K1C5': 'Goldweave Port',  # Syndicate base
            'K2C1': 'Seafoam', 'K2C2': 'Coral Bay', 'K2C3': 'Tidecrest',
            'K2C4': 'Wavebreak', 'K2C5': 'Saltmere',
            'K3C1': 'Hammerfall', 'K3C2': 'Anvil Rock', 'K3C3': 'Forge Gate',
            'K3C4': 'Steel Harbor', 'K3C5': 'Copper Hill',
            'K4C1': 'Moonvale', 'K4C2': 'Starhollow', 'K4C3': 'Shadowpine',
            'K4C4': 'Whisperwind', 'K4C5': 'Glimmergrove'
        }
        
        # Syndicate shell companies (based in K1C5)
        self.shell_companies = [
            "Golden Griffin Trading Co.", "Dragonscale Import House", "Mystic Coin Exchange",
            "Raven's Rest Consulting", "Crystal Crown Merchants", "Shadowport Trading Ltd",
            "Royal Tide Commerce", "Emerald Anchor Trading", "Goldweave Ventures"
        ]
        
        # Kingdom banking institutions
        self.banks = [
            "Royal Bank of Valdris", "Crown Treasury", "Merchant Guild Financial",
            "Kingdom Commercial Bank", "Golden Vault Banking", "Royal Trade Bank"
        ]
        
        # Offshore jurisdictions (suspicious locations)
        self.offshore_locations = [
            "Free Port of Shadowmere", "Neutral Isles", "Hidden Archipelago",
            "Merchant Republic Haven", "Blackwater Territories", "Rogue Trader Isles"
        ]
        
        # Royal investigators
        self.investigators = [
            ("Aldric", "Stormwind", "Royal Financial Guard"),
            ("Lyra", "Goldbrook", "Crown Investigators"),
            ("Thane", "Ironforge", "Royal Treasury Guard"),
            ("Elena", "Silverleaf", "Kingdom Revenue Service"),
            ("Marcus", "Blackwater", "Royal Financial Guard")
        ]

    def _setup_text_generators(self):
        """Initialize Markovify models with fantasy financial investigation text"""
        
        investigation_corpus = """
        The investigation revealed a complex coin laundering scheme involving multiple shell trading companies and offshore accounts. 
        Suspicious activity reports indicated large gold deposits inconsistent with merchant operations. 
        Bank records showed structured transactions designed to avoid royal reporting requirements.
        Coin transfers to high-risk territories raised concerns for potential laundering activity.
        The trading house showed minimal legitimate customer activity despite reporting substantial revenues.
        Financial analysis revealed layering techniques using multiple accounts to obscure the source of funds.
        Currency transaction reports showed patterns consistent with structuring violations under kingdom law.
        The investigation uncovered a network of shell companies used to facilitate illicit financial transactions.
        Surveillance indicated unusual business activity with limited customer traffic and suspicious gold handling.
        Bank clerks reported customer reluctance to provide identification and business documentation.
        The financial institution filed suspicious activity reports based on unusual transaction patterns.
        Analysis of business records revealed discrepancies between reported income and actual trading activity.
        The subject utilized multiple banking relationships to distribute transactions across different institutions.
        Investigators identified connections between the subject and known coin laundering organizations.
        """
        
        whistleblower_corpus = """
        I am writing to report suspicious financial activities that I observed at the trading company.
        During my employment, I witnessed large gold transactions that seemed unusual for our operations.
        The management instructed workers to structure deposits to avoid royal reporting requirements.
        I observed customers making multiple transactions just under the reporting threshold.
        The business received coin transfers from territories known for banking secrecy.
        Management was reluctant to maintain proper documentation for large gold transactions.
        I noticed discrepancies between the gold reported and the actual business activity levels.
        Workers were told to avoid asking questions about the source of large deposits.
        The company maintained relationships with multiple banks to spread transaction activity.
        I witnessed meetings with individuals who did not appear to be legitimate merchants.
        Gold was frequently transported in bags and stored in areas not typical for normal trading.
        The business reported income that seemed inconsistent with the observable customer activity.
        """
        
        try:
            self.investigation_model = markovify.Text(investigation_corpus)
            self.whistleblower_model = markovify.Text(whistleblower_corpus)
        except:
            self.investigation_model = None
            self.whistleblower_model = None

    def _generate_text(self, model, fallback: str, sentences: int = 3) -> str:
        """Generate text using Markovify or fallback"""
        if model:
            try:
                result = []
                for _ in range(sentences):
                    sentence = model.make_sentence(tries=100)
                    if sentence:
                        result.append(sentence)
                return " ".join(result) if result else fallback
            except:
                pass
        return fallback

    def load_mission1_data(self, filepath: str = None) -> pd.DataFrame:
        """Load Mission 1 data or generate synthetic transaction data"""
        
        if filepath and os.path.exists(filepath):
            try:
                return pd.read_csv(filepath)
            except:
                pass
        
        print("📊 Generating synthetic transaction data aligned with Mission 1")
        
        # Generate transactions for syndicate operations in K1C5
        dates = pd.date_range(start='2024-01-01', end='2024-12-31', freq='D')
        transactions = []
        
        syndicate_company = "Golden Griffin Trading Co."  # Main shell company
        account_numbers = [f"****{random.randint(1000, 9999)}" for _ in range(6)]
        
        for date in dates:
            if random.random() < 0.35:  # Transaction probability
                # Structured deposits (just under 10,000 gold pieces)
                if random.random() < 0.4:
                    amount = random.randint(9000, 9900)
                    trans_type = "Gold Deposit"
                # Large suspicious transfers
                elif random.random() < 0.3:
                    amount = random.randint(15000, 75000)
                    trans_type = random.choice(["Coin Transfer", "Trade Draft", "Merchant Exchange"])
                # Normal business
                else:
                    amount = random.randint(500, 5000)
                    trans_type = random.choice(["Trade Payment", "Merchant Draft", "Guild Transfer"])
                
                transactions.append({
                    'date': date,
                    'location': 'K1C5',  # Syndicate base
                    'account': random.choice(account_numbers),
                    'amount': amount,
                    'type': trans_type,
                    'description': f"{syndicate_company} - Trade Operations",
                    'counterparty': syndicate_company if random.random() < 0.6 else f"Unknown Merchant {random.randint(1, 15)}"
                })
        
        return pd.DataFrame(transactions)

    def generate_whistleblower_report(self, case_data: Dict[str, Any]) -> str:
        """Generate Royal Financial Guard whistleblower report"""
        
        filename = f"whistleblower_report_RG{random.randint(100000, 999999)}.pdf"
        filepath = os.path.join(self.output_dir, filename)
        
        doc = SimpleDocTemplate(filepath, pagesize=letter, topMargin=1*inch, bottomMargin=1*inch)
        styles = getSampleStyleSheet()
        story = []
        
        # Royal document styling
        title_style = ParagraphStyle(
            'RoyalTitle', parent=styles['Title'], fontSize=16, textColor=colors.darkblue,
            spaceAfter=30, alignment=TA_CENTER, fontName='Helvetica-Bold'
        )
        
        header_style = ParagraphStyle(
            'RoyalHeader', parent=styles['Heading2'], fontSize=12, textColor=colors.black,
            spaceBefore=15, spaceAfter=10, fontName='Helvetica-Bold'
        )
        
        # Document header
        story.append(Paragraph("CONFIDENTIAL - ROYAL SEAL", 
                              ParagraphStyle('Seal', fontSize=8, alignment=TA_RIGHT, textColor=colors.red)))
        story.append(Spacer(1, 20))
        
        story.append(Paragraph("ROYAL FINANCIAL GUARD", title_style))
        story.append(Paragraph("WHISTLEBLOWER INCIDENT REPORT", title_style))
        story.append(Spacer(1, 20))
        
        # Report metadata
        report_id = f"RG-{random.randint(100000, 999999)}"
        report_date = datetime.date.today() - timedelta(days=random.randint(1, 30))
        
        metadata = [
            ["Report ID:", report_id],
            ["Date Filed:", report_date.strftime("%B %d, %Y")],
            ["Classification:", "ROYAL CONFIDENTIAL"],
            ["Priority:", random.choice(["HIGH", "CRITICAL"])],
            ["Reporting Method:", "Anonymous Royal Hotline"],
            ["Assigned Inspector:", f"Inspector {random.choice(self.investigators)[1]}"]
        ]
        
        metadata_table = Table(metadata, colWidths=[1.5*inch, 3*inch])
        metadata_table.setStyle(TableStyle([
            ('FONTNAME', (0, 0), (-1, -1), 'Helvetica'),
            ('FONTSIZE', (0, 0), (-1, -1), 10),
            ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
            ('GRID', (0, 0), (-1, -1), 1, colors.black),
            ('BACKGROUND', (0, 0), (0, -1), colors.lightgrey)
        ]))
        
        story.append(metadata_table)
        story.append(Spacer(1, 30))
        
        # Subject information
        story.append(Paragraph("I. SUBJECT INVESTIGATION", header_style))
        
        target_company = case_data.get('target_company', random.choice(self.shell_companies))
        subject_info = [
            ["Trading House:", target_company],
            ["Business Type:", "Import/Export Trading"],
            ["Location:", f"{self.cities['K1C5']}, {self.kingdoms['K1']} Kingdom"],
            ["Primary Contact:", f"{random.choice(['Lord', 'Master', 'Merchant'])} {random.choice(['Aldwin', 'Gareth', 'Thorne'])} {random.choice(['Goldhand', 'Coinsworth', 'Tradewing'])}"],
            ["Guild Registration:", f"TG-{random.randint(10000, 99999)}"]
        ]
        
        subject_table = Table(subject_info, colWidths=[1.5*inch, 4*inch])
        subject_table.setStyle(TableStyle([
            ('FONTNAME', (0, 0), (-1, -1), 'Helvetica'),
            ('FONTSIZE', (0, 0), (-1, -1), 9),
            ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
            ('GRID', (0, 0), (-1, -1), 0.5, colors.grey)
        ]))
        
        story.append(subject_table)
        story.append(Spacer(1, 20))
        
        # Allegations
        story.append(Paragraph("II. REPORTED ALLEGATIONS", header_style))
        
        allegation_text = self._generate_text(
            self.whistleblower_model,
            f"I am reporting suspicious coin laundering activities at {target_company} in {self.cities['K1C5']}. During my employment, I observed large gold transactions and structured deposits designed to avoid Royal Treasury reporting requirements. The trading house showed minimal legitimate merchant activity despite reporting substantial revenues.",
            sentences=4
        )
        
        story.append(Paragraph(allegation_text, styles['Normal']))
        story.append(Spacer(1, 15))
        
        # Specific observations
        observations = [
            f"Gold deposits totaling approximately {random.randint(500, 2000):,} thousand pieces over six months",
            f"Structured transactions just under 10,000 gold threshold on {random.randint(15, 25)} occasions",
            f"Coin transfers to shell companies in {random.choice(self.offshore_locations)}",
            f"Trading operations inconsistent with reported merchant activities",
            "Customer reluctance to provide identification for large transactions",
            f"Use of multiple accounts at {random.randint(3, 6)} different banking houses"
        ]
        
        for i, obs in enumerate(random.sample(observations, random.randint(4, 6)), 1):
            story.append(Paragraph(f"{i}. {obs}", styles['Normal']))
            story.append(Spacer(1, 8))
        
        # Evidence section
        story.append(Spacer(1, 15))
        story.append(Paragraph("III. AVAILABLE EVIDENCE", header_style))
        
        evidence = [
            "Banking house transaction ledgers",
            "Internal trading house correspondence",
            "Surveillance records of unusual activities",
            "Coin transfer documentation",
            "Guild registration and licensing records"
        ]
        
        for item in random.sample(evidence, random.randint(3, 5)):
            story.append(Paragraph(f"• {item}", styles['Normal']))
            story.append(Spacer(1, 6))
        
        # Footer
        story.append(Spacer(1, 30))
        story.append(Paragraph("Submitted under Royal Decree 5328 - Financial Crimes Reporting Act", 
                              ParagraphStyle('Footer', fontSize=8, alignment=TA_CENTER, fontStyle='italic')))
        
        doc.build(story)
        print(f"✨ Conjured whistleblower report: {filepath}")
        return filepath

    def generate_bank_statement(self, transaction_data: pd.DataFrame) -> str:
        """Generate banking house statement showing syndicate transactions"""
        
        filename = f"bank_statement_{random.randint(100000, 999999)}.pdf"
        filepath = os.path.join(self.output_dir, filename)
        
        doc = SimpleDocTemplate(filepath, pagesize=letter, topMargin=0.5*inch)
        styles = getSampleStyleSheet()
        story = []
        
        # Banking house details
        bank_name = random.choice(self.banks)
        account_holder = "Golden Griffin Trading Co."  # Main syndicate front
        account_number = f"****-****-{random.randint(1000, 9999)}"
        statement_date = datetime.date.today().replace(day=1) - timedelta(days=1)
        
        # Bank header
        header_style = ParagraphStyle(
            'BankHeader', fontSize=18, textColor=colors.darkblue,
            fontName='Helvetica-Bold', alignment=TA_CENTER, spaceAfter=20
        )
        
        story.append(Paragraph(bank_name.upper(), header_style))
        story.append(Paragraph(f"{self.cities['K1C5']}, {self.kingdoms['K1']} Kingdom", 
                              ParagraphStyle('Location', fontSize=10, alignment=TA_CENTER, spaceAfter=30)))
        
        # Account information
        account_info = [
            ["Account Holder:", account_holder],
            ["Account Number:", account_number],
            ["Statement Period:", f"{statement_date.replace(day=1).strftime('%m/%d/%Y')} - {statement_date.strftime('%m/%d/%Y')}"],
            ["Account Type:", "Merchant Trading Account"],
            ["Location:", f"{self.cities['K1C5']} Branch"]
        ]
        
        info_table = Table(account_info, colWidths=[1.5*inch, 4*inch])
        info_table.setStyle(TableStyle([
            ('FONTNAME', (0, 0), (-1, -1), 'Helvetica'),
            ('FONTSIZE', (0, 0), (-1, -1), 9),
            ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
            ('BACKGROUND', (0, 0), (-1, -1), colors.lightgrey),
            ('GRID', (0, 0), (-1, -1), 1, colors.black)
        ]))
        
        story.append(info_table)
        story.append(Spacer(1, 20))
        
        # Account summary
        story.append(Paragraph("ACCOUNT SUMMARY", 
                              ParagraphStyle('SectionTitle', fontSize=12, fontName='Helvetica-Bold', spaceAfter=10)))
        
        # Calculate from transaction data or generate
        monthly_data = transaction_data[
            pd.to_datetime(transaction_data['date']).dt.month == statement_date.month
        ] if len(transaction_data) > 0 else pd.DataFrame()
        
        deposits = monthly_data[monthly_data['amount'] > 0]['amount'].sum() if len(monthly_data) > 0 else random.randint(500000, 1500000)
        withdrawals = abs(monthly_data[monthly_data['amount'] < 0]['amount'].sum()) if len(monthly_data) > 0 else random.randint(300000, 800000)
        beginning_balance = random.randint(50000, 200000)
        ending_balance = beginning_balance + deposits - withdrawals
        
        summary_data = [
            ["Beginning Balance:", f"{beginning_balance:,} GP"],
            ["Total Deposits:", f"{deposits:,} GP"],
            ["Total Withdrawals:", f"{withdrawals:,} GP"],
            ["Banking Fees:", f"{random.randint(25, 150)} GP"],
            ["Ending Balance:", f"{ending_balance:,} GP"]
        ]
        
        summary_table = Table(summary_data, colWidths=[2*inch, 1.5*inch])
        summary_table.setStyle(TableStyle([
            ('FONTNAME', (0, 0), (-1, -1), 'Helvetica'),
            ('FONTSIZE', (0, 0), (-1, -1), 10),
            ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
            ('ALIGN', (1, 0), (1, -1), 'RIGHT'),
            ('LINEBELOW', (0, -1), (-1, -1), 2, colors.black),
            ('FONTNAME', (0, -1), (-1, -1), 'Helvetica-Bold')
        ]))
        
        story.append(summary_table)
        story.append(Spacer(1, 30))
        
        # Transaction details
        story.append(Paragraph("TRANSACTION LEDGER", 
                              ParagraphStyle('SectionTitle', fontSize=12, fontName='Helvetica-Bold', spaceAfter=15)))
        
        # Generate suspicious transaction patterns
        transaction_details = [["Date", "Description", "Withdrawal", "Deposit", "Balance"]]
        
        current_balance = beginning_balance
        suspicious_transactions = []
        
        # Pattern 1: Structured gold deposits under 10,000 GP
        for _ in range(random.randint(8, 15)):
            amount = random.randint(9000, 9950)
            date_obj = statement_date.replace(day=random.randint(1, 28))
            suspicious_transactions.append({
                'date': date_obj,
                'description': f"GOLD DEPOSIT - {account_holder[:20]}",
                'amount': amount,
                'type': 'deposit'
            })
        
        # Pattern 2: Large coin transfers from offshore
        for _ in range(random.randint(3, 8)):
            amount = random.randint(25000, 150000)
            date_obj = statement_date.replace(day=random.randint(1, 28))
            origin = random.choice(self.offshore_locations)
            suspicious_transactions.append({
                'date': date_obj,
                'description': f"COIN TRANSFER FROM {origin[:15]}",
                'amount': amount,
                'type': 'deposit'
            })
        
        # Pattern 3: Round number withdrawals
        for _ in range(random.randint(5, 12)):
            amount = random.choice([25000, 50000, 75000, 100000])
            date_obj = statement_date.replace(day=random.randint(1, 28))
            suspicious_transactions.append({
                'date': date_obj,
                'description': f"TRADE DRAFT - BUSINESS EXPENSE",
                'amount': amount,
                'type': 'withdrawal'
            })
        
        # Sort by date
        suspicious_transactions.sort(key=lambda x: x['date'])
        
        # Add to table
        for trans in suspicious_transactions:
            if trans['type'] == 'deposit':
                current_balance += trans['amount']
                transaction_details.append([
                    trans['date'].strftime("%m/%d"),
                    trans['description'],
                    "",
                    f"{trans['amount']:,} GP",
                    f"{current_balance:,} GP"
                ])
            else:
                current_balance -= trans['amount']
                transaction_details.append([
                    trans['date'].strftime("%m/%d"),
                    trans['description'],
                    f"{trans['amount']:,} GP",
                    "",
                    f"{current_balance:,} GP"
                ])
        
        # Create table
        trans_table = Table(transaction_details, colWidths=[0.8*inch, 3.2*inch, 1*inch, 1*inch, 1*inch])
        trans_table.setStyle(TableStyle([
            ('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
            ('FONTNAME', (0, 1), (-1, -1), 'Helvetica'),
            ('FONTSIZE', (0, 0), (-1, -1), 8),
            ('BACKGROUND', (0, 0), (-1, 0), colors.grey),
            ('ALIGN', (2, 0), (-1, -1), 'RIGHT'),
            ('GRID', (0, 0), (-1, -1), 0.5, colors.black),
            ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.beige])
        ]))
        
        story.append(trans_table)
        
        # Royal notice
        story.append(Spacer(1, 30))
        notice_text = f"""
        ROYAL NOTICE: This account has been flagged for enhanced monitoring under Royal Treasury Decree. 
        Large transactions may be reported to the Royal Financial Guard. Contact our compliance officer 
        at the {self.cities['K1C5']} branch for inquiries.
        """
        
        story.append(Paragraph(notice_text, 
                              ParagraphStyle('Notice', fontSize=8, fontStyle='italic', 
                                           textColor=colors.red, leftIndent=20, rightIndent=20)))
        
        doc.build(story)
        print(f"✨ Conjured banking house statement: {filepath}")
        return filepath

    def generate_suspicious_activity_report(self, case_data: Dict[str, Any]) -> str:
        """Generate Royal Treasury Suspicious Activity Report"""
        
        filename = f"SAR_{random.randint(100000, 999999)}.pdf"
        filepath = os.path.join(self.output_dir, filename)
        
        doc = SimpleDocTemplate(filepath, pagesize=letter, topMargin=0.75*inch)
        styles = getSampleStyleSheet()
        story = []
        
        # Royal SAR header
        header_style = ParagraphStyle(
            'SARHeader', fontSize=14, fontName='Helvetica-Bold', alignment=TA_CENTER,
            textColor=colors.darkred, spaceAfter=20
        )
        
        story.append(Paragraph("SUSPICIOUS ACTIVITY REPORT", header_style))
        story.append(Paragraph("Royal Treasury Form RT-111", 
                              ParagraphStyle('FormNumber', fontSize=10, alignment=TA_CENTER, spaceAfter=30)))
        
        # SAR identification
        sar_number = f"RT-{random.randint(1000000, 9999999)}"
        filing_date = datetime.date.today() - timedelta(days=random.randint(1, 15))
        
        identification_data = [
            ["SAR Number:", sar_number],
            ["Filing Date:", filing_date.strftime("%m/%d/%Y")],
            ["Filing Institution:", random.choice(self.banks)],
            ["Branch Location:", f"{self.cities['K1C5']}, {self.kingdoms['K1']} Kingdom"],
            ["Guild Registration:", f"BG-{random.randint(1000000, 9999999)}"]
        ]
        
        id_table = Table(identification_data, colWidths=[1.5*inch, 3*inch])
        id_table.setStyle(TableStyle([
            ('FONTNAME', (0, 0), (-1, -1), 'Helvetica'),
            ('FONTSIZE', (0, 0), (-1, -1), 9),
            ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
            ('GRID', (0, 0), (-1, -1), 1, colors.black),
            ('BACKGROUND', (0, 0), (0, -1), colors.lightgrey)
        ]))
        
        story.append(id_table)
        story.append(Spacer(1, 20))
        
        # Subject Information
        story.append(Paragraph("PART I - SUBJECT INVESTIGATION", 
                              ParagraphStyle('PartHeader', fontSize=11, fontName='Helvetica-Bold', spaceAfter=10)))
        
        subject_company = case_data.get('target_company', "Golden Griffin Trading Co.")
        subject_data = [
            ["Subject Name:", subject_company],
            ["Business Address:", f"Harbor District, {self.cities['K1C5']}"],
            ["Kingdom Location:", f"{self.kingdoms['K1']} Kingdom"],
            ["Guild Registration:", f"TG-{random.randint(1000000, 9999999)}"],
            ["Account Number:", f"****-{random.randint(1000, 9999)}"],
            ["Date Established:", f"{random.randint(1, 12)}/{random.randint(1, 28)}/{random.randint(2015, 2020)}"]
        ]
        
        subject_table = Table(subject_data, colWidths=[1.8*inch, 3.5*inch])
        subject_table.setStyle(TableStyle([
            ('FONTNAME', (0, 0), (-1, -1), 'Helvetica'),
            ('FONTSIZE', (0, 0), (-1, -1), 9),
            ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
            ('GRID', (0, 0), (-1, -1), 0.5, colors.grey)
        ]))
        
        story.append(subject_table)
        story.append(Spacer(1, 20))
        
        # Suspicious Activity Information
        story.append(Paragraph("PART II - SUSPICIOUS ACTIVITY DETAILS", 
                              ParagraphStyle('PartHeader', fontSize=11, fontName='Helvetica-Bold', spaceAfter=10)))
        
        activity_start = filing_date - timedelta(days=random.randint(90, 180))
        activity_end = filing_date - timedelta(days=random.randint(1, 30))
        
        activity_data = [
            ["Activity Period:", f"{activity_start.strftime('%m/%d/%Y')} - {activity_end.strftime('%m/%d/%Y')}"],
            ["Total Amount:", f"{random.randint(750, 2500):,},000 Gold Pieces"],
            ["Transaction Count:", f"{random.randint(25, 75)}"],
            ["Activity Type:", "Structured Deposits/Coin Laundering"],
            ["Primary Location:", f"{self.cities['K1C5']} Branch"],
            ["Prior SAR Filed:", random.choice(["Yes", "No"])]
        ]
        
        activity_table = Table(activity_data, colWidths=[2*inch, 3*inch])
        activity_table.setStyle(TableStyle([
            ('FONTNAME', (0, 0), (-1, -1), 'Helvetica'),
            ('FONTSIZE', (0, 0), (-1, -1), 9),
            ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
            ('GRID', (0, 0), (-1, -1), 0.5, colors.grey)
        ]))
        
        story.append(activity_table)
        story.append(Spacer(1, 20))
        
        # Narrative
        story.append(Paragraph("PART III - INVESTIGATION NARRATIVE", 
                              ParagraphStyle('PartHeader', fontSize=11, fontName='Helvetica-Bold', spaceAfter=10)))
        
        narrative_text = self._generate_text(
            self.investigation_model,
            f"""The subject, {subject_company}, operating from {self.cities['K1C5']}, has engaged in suspicious financial activity consistent with coin laundering operations. 
            Analysis of account activity reveals structured gold deposits designed to avoid Royal Treasury reporting requirements. 
            The subject made {random.randint(15, 25)} gold deposits ranging from 9,000 to 9,900 pieces, totaling over {random.randint(500, 1200):,},000 gold pieces during the reporting period. 
            The trading operations appear inconsistent with the level of gold activity, and the subject has been uncooperative when questioned about fund sources.""",
            sentences=6
        )
        
        story.append(Paragraph(narrative_text, styles['Normal']))
        story.append(Spacer(1, 15))
        
        # Red flag indicators
        story.append(Paragraph("Suspicious Activity Indicators:", 
                              ParagraphStyle('SubHeader', fontSize=10, fontName='Helvetica-Bold', spaceAfter=8)))
        
        red_flags = [
            f"Multiple gold deposits just under 10,000 piece reporting threshold",
            f"Customer reluctance to provide source documentation",
            f"Trading activity inconsistent with gold volume",
            f"Use of multiple accounts to fragment transactions",
            f"Connections to high-risk territories through coin transfers",
            f"Unusual transportation and handling of gold currency"
        ]
        
        for flag in red_flags:
            story.append(Paragraph(f"• {flag}", styles['Normal']))
            story.append(Spacer(1, 4))
        
        # Law enforcement contact
        story.append(Spacer(1, 20))
        story.append(Paragraph("PART IV - ROYAL ENFORCEMENT CONTACT", 
                              ParagraphStyle('PartHeader', fontSize=11, fontName='Helvetica-Bold', spaceAfter=10)))
        
        investigator = random.choice(self.investigators)
        contact_data = [
            ["Agency Notified:", investigator[2]],
            ["Contact Name:", f"Inspector {investigator[0]} {investigator[1]}"],
            ["Royal Station:", f"{self.cities['K1C1']} Headquarters"],
            ["Date Contacted:", (filing_date + timedelta(days=1)).strftime("%m/%d/%Y")],
            ["Badge Number:", f"RG-{random.randint(1000, 9999)}"]
        ]
        
        contact_table = Table(contact_data, colWidths=[1.5*inch, 2.5*inch])
        contact_table.setStyle(TableStyle([
            ('FONTNAME', (0, 0), (-1, -1), 'Helvetica'),
            ('FONTSIZE', (0, 0), (-1, -1), 9),
            ('FONTNAME', (0, 0), (0, -1), 'Helvetica-Bold'),
            ('GRID', (0, 0), (-1, -1), 0.5, colors.grey)
        ]))
        
        story.append(contact_table)
        story.append(Spacer(1, 30))
        
        # Certification
        story.append(Paragraph("ROYAL CERTIFICATION", 
                              ParagraphStyle('CertHeader', fontSize=11, fontName='Helvetica-Bold', 
                                           alignment=TA_CENTER, spaceAfter=15)))
        
        cert_text = f"""
        I certify that the information contained in this Suspicious Activity Report is true and accurate 
        to the best of my knowledge. I understand that providing false information may result in penalties 
        under Royal Treasury Law.
        
        
        ________________________________                    Date: {filing_date.strftime('%m/%d/%Y')}
        Banking House Compliance Officer
        
        {random.choice(['Master Aldwin Goldkeeper', 'Lord Gareth Coinwatch', 'Dame Sarah Vaultguard'])}
        Royal Treasury Compliance Officer
        """
        
        story.append(Paragraph(cert_text, styles['Normal']))
        
        # Footer
        story.append(Spacer(1, 20))
        story.append(Paragraph("Royal Treasury Form RT-111 - Suspicious Activity Report", 
                              ParagraphStyle('Footer', fontSize=8, alignment=TA_CENTER, 
                                           textColor=colors.grey)))
        
        doc.build(story)
        print(f"✨ Conjured Suspicious Activity Report: {filepath}")
        return filepath

    def generate_all_documents(self, mission1_data_path: str = None) -> List[str]:
        """Generate all three investigation documents for the syndicate in K1C5"""
        
        print("🔮 Beginning Mission 2: Summoning Investigation Documents...")
        print(f"🏰 Target: The syndicate operating from {self.cities['K1C5']}, {self.kingdoms['K1']} Kingdom")
        print("=" * 70)
        
        # Load transaction data
        transaction_data = self.load_mission1_data(mission1_data_path)
        
        # Create consistent case data
        case_data = {
            'target_company': 'Golden Griffin Trading Co.',  # Main syndicate front
            'base_location': 'K1C5',
            'investigation_period': {
                'start': datetime.date.today() - timedelta(days=180),
                'end': datetime.date.today() - timedelta(days=30)
            },
            'total_suspicious_amount': random.randint(1000000, 3000000),
            'primary_investigator': random.choice(self.investigators)
        }
        
        generated_files = []
        
        # Generate documents
        try:
            generated_files.append(self.generate_whistleblower_report(case_data))
        except Exception as e:
            print(f"⚠️ Error generating whistleblower report: {e}")
        
        try:
            generated_files.append(self.generate_bank_statement(transaction_data))
        except Exception as e:
            print(f"⚠️ Error generating bank statement: {e}")
        
        try:
            generated_files.append(self.generate_suspicious_activity_report(case_data))
        except Exception as e:
            print(f"⚠️ Error generating SAR: {e}")
        
        print("=" * 70)
        print(f"✅ Mission 2 Complete! Generated {len(generated_files)} investigation documents:")
        
        document_types = ["WHISTLEBLOWER REPORT", "BANK STATEMENT", "SUSPICIOUS ACTIVITY REPORT"]
        for i, file in enumerate(generated_files):
            if i < len(document_types):
                print(f"   {i+1}. {document_types[i]}: {os.path.basename(file)}")
        
        print(f"\n📁 All documents saved to: {self.output_dir}/")
        print(f"🎯 Investigation Focus: {case_data['target_company']} in {self.cities[case_data['base_location']]}")
        
        # Generate mission summary
        self._generate_mission_summary(generated_files, case_data)
        
        return generated_files
    
    def _generate_mission_summary(self, generated_files: List[str], case_data: Dict[str, Any]):
        """Generate mission summary for instructors"""
        
        summary_file = os.path.join(self.output_dir, "Mission2_Investigation_Summary.txt")
        
        with open(summary_file, 'w') as f:
            f.write("MISSION 2: FANTASY INVESTIGATION DOCUMENTS\n")
            f.write("=" * 50 + "\n\n")
            
            f.write("INVESTIGATION TARGET:\n")
            f.write(f"• Syndicate Base: {self.cities[case_data['base_location']]}, {self.kingdoms['K1']} Kingdom\n")
            f.write(f"• Primary Shell Company: {case_data['target_company']}\n")
            f.write(f"• Investigation Period: {case_data['investigation_period']['start']} to {case_data['investigation_period']['end']}\n")
            f.write(f"• Estimated Illicit Activity: {case_data['total_suspicious_amount']:,} Gold Pieces\n\n")
            
            f.write("SYNTHETIC DATA TECHNIQUES:\n")
            f.write("• Markovify: Domain-specific text generation from financial investigation corpora\n")
            f.write("• ReportLab: Professional document formatting matching real investigation standards\n")
            f.write("• Fantasy theming: All real-world references converted to magical kingdom setting\n")
            f.write("• Data consistency: Documents reference the same syndicate operations\n")
            f.write("• Mission 1 integration: Bank statement aligns with transaction data patterns\n\n")
            
            f.write("GENERATED DOCUMENTS:\n")
            for i, file in enumerate(generated_files, 1):
                f.write(f"{i}. {os.path.basename(file)}\n")
            
            f.write(f"\nMONEY LAUNDERING PATTERNS DEMONSTRATED:\n")
            f.write("• Structured deposits just under 10,000 gold piece reporting threshold\n")
            f.write("• Large transfers from high-risk offshore territories\n")
            f.write("• Shell company operations inconsistent with legitimate trading\n")
            f.write("• Multiple banking relationships to distribute suspicious activity\n")
            f.write("• Round-number withdrawals suggesting cash conversion\n\n")
            
            f.write("EDUCATIONAL OUTCOMES:\n")
            f.write("• Recognition of financial crime red flags in fantasy context\n")
            f.write("• Understanding of regulatory reporting requirements\n")
            f.write("• Experience analyzing professional investigation documents\n")
            f.write("• Appreciation for cross-document consistency in investigations\n")
            f.write("• Practical application of synthetic data in investigative scenarios\n")
        
        print(f"📊 Investigation summary generated: {summary_file}")

def main():
    """Execute Mission 2: Generate fantasy investigation documents"""
    
    # Initialize the document forge
    doc_forge = FantasyDocumentForge()
    
    # Generate all investigation documents
    # To integrate with Mission 1 data: doc_forge.generate_all_documents("path/to/mission1_data.csv")
    generated_files = doc_forge.generate_all_documents()
    
    return generated_files

if __name__ == "__main__":
    main()

🔮 Beginning Mission 2: Summoning Investigation Documents...
🏰 Target: The syndicate operating from Goldweave Port, Valdris Kingdom
📊 Generating synthetic transaction data aligned with Mission 1
✨ Conjured whistleblower report: mission2_documents\whistleblower_report_RG924718.pdf
✨ Conjured banking house statement: mission2_documents\bank_statement_265055.pdf
✨ Conjured Suspicious Activity Report: mission2_documents\SAR_883301.pdf
✅ Mission 2 Complete! Generated 3 investigation documents:
   1. WHISTLEBLOWER REPORT: whistleblower_report_RG924718.pdf
   2. BANK STATEMENT: bank_statement_265055.pdf
   3. SUSPICIOUS ACTIVITY REPORT: SAR_883301.pdf

📁 All documents saved to: mission2_documents/
🎯 Investigation Focus: Golden Griffin Trading Co. in Goldweave Port
📊 Investigation summary generated: mission2_documents\Mission2_Investigation_Summary.txt



_______________________________________________________

# Mission 2 Review

Arcane Investigators, in **Mission 2: The Document Summoning**, you’ve performed a mystical ritual to conjure synthetic investigation documents (`mission2_documents/`) that expose the K1C5 syndicate’s money laundering in Goldweave Port. Using the enchanted ledger from Mission 1 (crafted with Gaussian Copula, CTGAN, Graph Network, and TVAE), your ritual summons whistleblower reports, bank statements, and suspicious activity reports (SARs). Your notebook outputs—PDFs revealing specific syndicate schemes!

## The Summoning Ritual

The `FantasyDocumentForge` conjures three document types to unmask K1C5:

### 1. Arcane Text Conjuring
- **Action**: Weaves narrative spells (`markovify>=0.9.4`) from financial crime corpora.
- **Features**: Generates realistic text for whistleblower and SAR narratives, e.g., “structured deposits to avoid reporting.”
- **Output**: Coherent allegations tied to K1C5’s operations.

**Role**: Provides credible, syndicate-focused text for documents.

### 2. Document Forging
- **Action**: Crafts PDFs (`reportlab>=4.2.2`) with tables, headers, and fantasy styling.
- **Features**: Includes syndicate details (e.g., Golden Griffin Trading Co.), K1C5 locations, and suspicious patterns (structured deposits <10,000 gold pieces, offshore transfers).
- **Output**: Whistleblower reports, bank statements (aligned with Mission 1 data), and SARs.

**Role**: Creates documents mirroring real investigations.

### 3. Syndicate Evidence Binding
- **Action**: Integrates Mission 1’s synthetic transactions, ensuring consistency.
- **Features**: Embeds patterns like structured deposits (9,000–9,950 gold pieces), large offshore transfers (25,000–150,000), and round-number withdrawals.
- **Output**: Documents reflecting K1C5’s tactics, saved in `mission2_documents/`.

**Role**: Links documents to transaction data, exposing illicit networks.

## Why It Matters
Your summoned documents reveal sydicate schemes in K1C5, like structured deposits, offshore transfers, and shell company operations—rooted in Mission 1’s synthetic data. These PDFs, rich with red flags, enable Mission 4’s chatbot queries and empower the Valdris Council to combat crime. 
_______________________________________________________



## Mission 3: The Prophecy Familiar
**Objective**: Train an anomaly detection mythical beast on synthetic data patterns to identify suspicious financial activities, strengthening the investigation against the money laundering syndicate.

In [12]:

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data, DataLoader
from torch_geometric.nn import GCNConv
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score, precision_score
import networkx as nx
from datetime import datetime, timedelta

class TransactionGraphBuilder:
    """
    Builds graph representations for AML detection, enhanced for K1C5 syndicate patterns.
    Incorporates clustering coefficients and time-series features from Mission 1.
    """
    def __init__(self, time_window_hours=24, min_edge_weight=0.01):
        self.time_window_hours = time_window_hours
        self.min_edge_weight = min_edge_weight
        self.flagged_accounts = {'K1C5', 'K2C5', 'K3C1', 'K4C3'}
        
    def create_temporal_graph(self, df, window_start=0, window_size=300):
        window_df = df.iloc[window_start:window_start + window_size].copy()
        
        sender_col, receiver_col = ('customer_id', 'merchant_id') if 'customer_id' in df.columns else ('sender_location', 'receiver_location')
        accounts = set(window_df[sender_col].unique()) | set(window_df[receiver_col].unique())
        
        if len(accounts) < 5:
            return None, None
        account_to_idx = {acc: idx for idx, acc in enumerate(sorted(accounts))}
        
        G = nx.DiGraph()
        for _, row in window_df.iterrows():
            G.add_edge(row[sender_col], row[receiver_col], weight=row['amount'])
        
        clustering_coeffs = nx.clustering(G, weight=None)
        
        edge_index = []
        edge_features = []
        
        grouped = window_df.groupby([sender_col, receiver_col])
        
        for (sender, receiver), group in grouped:
            if sender in account_to_idx and receiver in account_to_idx:
                edge_index.append([account_to_idx[sender], account_to_idx[receiver]])
                edge_feat = [
                    group['amount'].sum(),
                    group['amount'].mean(),
                    len(group),
                    group['amount'].std() if len(group) > 1 else 0,
                    group['transaction_hour'].std() if len(group) > 1 else 0,
                    group['cross_border'].mean(),
                    group['cash_equivalent'].mean(),
                    group['merchant_risk_score'].mean(),
                    group['transactions_last_24h'].mean(),
                    3 if sender == 'K1C5' or receiver == 'K1C5' else 1 if sender in self.flagged_accounts or receiver in self.flagged_accounts else 0,  # Boost K1C5
                    group.get('velocity_24h', pd.Series([0] * len(group))).mean(),
                ]
                edge_features.append(edge_feat)
        
        node_features = []
        node_labels = []
        
        for account in sorted(accounts):
            sent = window_df[window_df[sender_col] == account]
            received = window_df[window_df[receiver_col] == account]
            
            total_txns = len(sent) + len(received)
            ml_sent = len(sent[sent['is_money_laundering'] == 1]) if len(sent) > 0 else 0
            ml_received = len(received[received['is_money_laundering'] == 1]) if len(received) > 0 else 0
            ml_count = ml_sent + ml_received
            flagged_interactions = len(set(sent[receiver_col]) & self.flagged_accounts) + \
                                  len(set(received[sender_col]) & self.flagged_accounts)
            k1c5_interactions = len(set(sent[receiver_col]) & {'K1C5'}) + len(set(received[sender_col]) & {'K1C5'})
            
            node_feat = [
                len(sent),
                len(received),
                sent['amount'].sum() if len(sent) > 0 else 0,
                received['amount'].sum() if len(received) > 0 else 0,
                (received['amount'].sum() - sent['amount'].sum()) if total_txns > 0 else 0,
                sent['account_age_days'].mean() if len(sent) > 0 else 0,
                3 if account == 'K1C5' else 1 if account in self.flagged_accounts else 0,  # Boost K1C5
                ml_count * 2.0,
                flagged_interactions * 3.0,  # Increased weight
                k1c5_interactions * 7.0,  # New K1C5-specific feature
                sent.get('velocity_24h', pd.Series([0] * len(sent))).mean() if len(sent) > 0 else 0,
                received.get('velocity_24h', pd.Series([0] * len(received))).mean() if len(received) > 0 else 0,
                clustering_coeffs.get(account, 0.0),
            ]
            node_features.append(node_feat)
            node_labels.append(1 if ml_count > 0 or k1c5_interactions > 0 else 0)  # Include K1C5 interactions in labels
        
        if len(edge_index) < 5:
            return None, None
        
        edge_index = torch.tensor(edge_index, dtype=torch.long).t()
        edge_attr = torch.tensor(edge_features, dtype=torch.float32)
        x = torch.tensor(node_features, dtype=torch.float32)
        y = torch.tensor(node_labels, dtype=torch.float32)
        
        return Data(x=x, edge_index=edge_index, edge_attr=edge_attr, y=y), account_to_idx
    
    def create_dataset(self, df, window_size=300, stride=20):
        graphs = []
        for start in range(0, len(df) - window_size + 1, stride):
            graph, _ = self.create_temporal_graph(df, start, window_size)
            if graph is not None and graph.edge_index.shape[1] > 0:
                graphs.append(graph)
        return graphs

class AMLGAE(nn.Module):
    """
    Graph Autoencoder for anomalous subgraph detection, fine-tuned for K1C5 syndicates.
    """
    def __init__(self, node_features, edge_features, hidden_dim=32, latent_dim=16):
        super(AMLGAE, self).__init__()
        
        self.encoder = nn.ModuleList([
            GCNConv(node_features, hidden_dim),
            GCNConv(hidden_dim, latent_dim),
        ])
        
        self.decoder = nn.ModuleList([
            GCNConv(latent_dim, hidden_dim),
            GCNConv(hidden_dim, node_features),
        ])
        
        self.edge_reconstructor = nn.Sequential(
            nn.Linear(latent_dim * 2 + edge_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        
        self.supervised_head = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
    
    def encode(self, x, edge_index):
        for i, conv in enumerate(self.encoder):
            x = conv(x, edge_index)
            x = F.relu(x) if i < len(self.encoder) - 1 else x
        return x
    
    def decode_nodes(self, z, edge_index):
        for i, conv in enumerate(self.decoder):
            z = conv(z, edge_index)
            z = F.relu(z) if i < len(self.decoder) - 1 else z
        return z
    
    def decode_edges(self, z, edge_index, edge_attr):
        row, col = edge_index
        edge_inputs = torch.cat([z[row], z[col], edge_attr], dim=1)
        return self.edge_reconstructor(edge_inputs)
    
    def forward(self, x, edge_index, edge_attr):
        z = self.encode(x, edge_index)
        node_recon = self.decode_nodes(z, edge_index)
        edge_recon = self.decode_edges(z, edge_index, edge_attr) if edge_attr is not None else None
        supervised_score = self.supervised_head(z)
        return z, node_recon, edge_recon, supervised_score

class AMLDetectionSystem:
    """
    AML system using GAE with semi-supervised fine-tuning for K1C5 syndicate detection.
    """
    def __init__(self, hidden_dim=32, latent_dim=16, learning_rate=0.001):
        self.hidden_dim = hidden_dim
        self.latent_dim = latent_dim
        self.learning_rate = learning_rate
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model = None
        self.graph_builder = TransactionGraphBuilder()
        self.scaler_node = None
        self.scaler_edge = None
        self.threshold = None
        self.val_recon_errors = None
        self.val_error_mean = None
        self.val_error_std = None
    
    def train(self, df, epochs=50, batch_size=16, val_split=0.2):
        print("🔨 Building transaction graphs...")
        graphs = self.graph_builder.create_dataset(df, window_size=300, stride=20)
        graphs = [g for g in graphs if g is not None and g.edge_index.shape[1] > 0]
        print(f"✓ Created {len(graphs)} temporal graph snapshots")
        if graphs:
            print(f"Sample graph - Nodes: {graphs[0].x.shape}, Edges: {graphs[0].edge_index.shape}, Edge attr: {graphs[0].edge_attr.shape}")
        
        train_size = int(len(graphs) * (1 - val_split))
        train_graphs = graphs[:train_size]
        val_graphs = graphs[train_size:]
        
        all_x_train = np.vstack([g.x.numpy() for g in train_graphs]) if train_graphs else np.empty((0, graphs[0].x.shape[1]))
        self.scaler_node = StandardScaler().fit(all_x_train) if len(all_x_train) > 0 else None
        all_edge_train = np.vstack([g.edge_attr.numpy() for g in train_graphs if g.edge_attr.shape[0] > 0]) if any(g.edge_attr.shape[0] > 0 for g in train_graphs) else np.empty((0, 11))
        self.scaler_edge = StandardScaler().fit(all_edge_train) if len(all_edge_train) > 0 else None
        
        for graphs_list in [train_graphs, val_graphs]:
            for g in graphs_list:
                if self.scaler_node is not None:
                    g.x = torch.tensor(self.scaler_node.transform(g.x.numpy()), dtype=torch.float32)
                if g.edge_attr.shape[0] > 0 and self.scaler_edge is not None:
                    g.edge_attr = torch.tensor(self.scaler_edge.transform(g.edge_attr.numpy()), dtype=torch.float32)
        
        train_loader = DataLoader(train_graphs, batch_size=batch_size, shuffle=True)
        val_loader = DataLoader(val_graphs, batch_size=batch_size, shuffle=False)
        
        node_features = graphs[0].x.shape[1] if graphs else 0
        edge_features = graphs[0].edge_attr.shape[1] if graphs and graphs[0].edge_attr.shape[0] > 0 else 11
        
        self.model = AMLGAE(node_features, edge_features, self.hidden_dim, self.latent_dim).to(self.device)
        optimizer = torch.optim.Adam(self.model.parameters(), lr=self.learning_rate, weight_decay=5e-4)
        
        # Compute validation threshold and statistics for z-score scaling
        self.model.eval()
        self.val_recon_errors = []
        with torch.no_grad():
            for batch in val_loader:
                batch = batch.to(self.device)
                _, node_recon, _, _ = self.model(batch.x, batch.edge_index, batch.edge_attr)
                errors = torch.mean((node_recon - batch.x) ** 2, dim=1).cpu().numpy()
                self.val_recon_errors.extend(errors)
        self.threshold = np.percentile(self.val_recon_errors, 98) if self.val_recon_errors else 0.5
        self.val_error_mean = np.mean(self.val_recon_errors) if self.val_recon_errors else 0
        self.val_error_std = np.std(self.val_recon_errors) if self.val_recon_errors else 1
        
        print("\n🚀 Training GAE model...")
        print("=" * 50)
        
        best_val_loss = float('inf')
        patience = 10
        counter = 0
        
        for epoch in range(epochs):
            self.model.train()
            train_loss = 0
            all_train_scores = []
            all_train_labels = []
            
            for batch in train_loader:
                batch = batch.to(self.device)
                optimizer.zero_grad()
                
                z, node_recon, edge_recon, supervised_score = self.model(batch.x, batch.edge_index, batch.edge_attr)
                
                node_loss = F.mse_loss(node_recon, batch.x)
                edge_loss = F.binary_cross_entropy_with_logits(edge_recon.squeeze(), torch.ones_like(edge_recon.squeeze())) if edge_recon is not None else 0
                supervised_loss = F.mse_loss(supervised_score.squeeze(), batch.y) if batch.y is not None else 0
                loss = node_loss + 0.1 * edge_loss + 0.1 * supervised_loss
                
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
                optimizer.step()
                
                train_loss += loss.item()
                
                recon_errors = torch.mean((node_recon - batch.x) ** 2, dim=1).cpu().detach().numpy()
                all_train_scores.append(recon_errors)
                all_train_labels.append(batch.y.cpu().numpy())
            
            self.model.eval()
            val_loss = 0
            all_val_scores = []
            all_val_labels = []
            
            with torch.no_grad():
                for batch in val_loader:
                    batch = batch.to(self.device)
                    z, node_recon, edge_recon, supervised_score = self.model(batch.x, batch.edge_index, batch.edge_attr)
                    
                    node_loss = F.mse_loss(node_recon, batch.x)
                    edge_loss = F.binary_cross_entropy_with_logits(edge_recon.squeeze(), torch.ones_like(edge_recon.squeeze())) if edge_recon is not None else 0
                    supervised_loss = F.mse_loss(supervised_score.squeeze(), batch.y) if batch.y is not None else 0
                    loss = node_loss + 0.1 * edge_loss + 0.1 * supervised_loss
                    
                    val_loss += loss.item()
                    
                    recon_errors = torch.mean((node_recon - batch.x) ** 2, dim=1).cpu().numpy()
                    all_val_scores.append(recon_errors)
                    all_val_labels.append(batch.y.cpu().numpy())
            
            avg_train_loss = train_loss / len(train_loader) if len(train_loader) > 0 else 0
            avg_val_loss = val_loss / len(val_loader) if len(val_loader) > 0 else 0
            
            train_auc = roc_auc_score(np.concatenate(all_train_labels), np.concatenate(all_train_scores)) if all_train_labels and np.any(np.concatenate(all_train_labels)) else 0
            val_auc = roc_auc_score(np.concatenate(all_val_labels), np.concatenate(all_val_scores)) if all_val_labels and np.any(np.concatenate(all_val_labels)) else 0
            val_pr_auc = average_precision_score(np.concatenate(all_val_labels), np.concatenate(all_val_scores)) if all_val_labels and np.any(np.concatenate(all_val_labels)) else 0
            
            if (epoch + 1) % 5 == 0 or epoch == 0:
                print(f"Epoch {epoch+1}/{epochs}")
                print(f"  Train Loss: {avg_train_loss:.4f}, AUC: {train_auc:.4f}")
                print(f"  Val Loss: {avg_val_loss:.4f}, AUC: {val_auc:.4f}, PR-AUC: {val_pr_auc:.4f}")
            
            if avg_val_loss < best_val_loss:
                best_val_loss = avg_val_loss
                counter = 0
            else:
                counter += 1
                if counter >= patience:
                    print("Early stopping!")
                    break
        
        print("\n✓ Training complete!")
        return self.model
    
    def detect_suspicious_patterns(self, df, threshold=None):
        if self.model is None:
            raise ValueError("Model not trained yet!")
        
        self.model.eval()
        graph, account_to_idx = self.graph_builder.create_temporal_graph(df, 0, len(df))
        if graph is None or account_to_idx is None:
            return {'suspicious_accounts': [], 'total_accounts_analyzed': 0, 'high_risk_count': 0, 'k1c5_involvement': 0, 'suspicious_clusters': [], 'precision': 0.0}
        
        if self.scaler_node is not None:
            graph.x = torch.tensor(self.scaler_node.transform(graph.x.numpy()), dtype=torch.float32)
        if graph.edge_attr.shape[0] > 0 and self.scaler_edge is not None:
            graph.edge_attr = torch.tensor(self.scaler_edge.transform(graph.edge_attr.numpy()), dtype=torch.float32)
        
        graph = graph.to(self.device)
        
        with torch.no_grad():
            z, node_recon, edge_recon, _ = self.model(graph.x, graph.edge_index, graph.edge_attr)
            recon_errors = torch.mean((node_recon - graph.x) ** 2, dim=1).cpu().numpy()
        
        # Z-score based anomaly score scaling
        scaled_errors = np.zeros_like(recon_errors)
        if self.val_recon_errors:
            scaled_errors = (recon_errors - self.val_error_mean) / (self.val_error_std + 1e-10)
            scaled_errors = np.log1p(np.clip(scaled_errors, 0, None))  # Logarithmic transformation
            scaled_errors = 100 * (scaled_errors - np.min(scaled_errors)) / (np.max(scaled_errors) - np.min(scaled_errors) + 1e-10)
            scaled_errors = np.clip(scaled_errors, 0, 100)
        
        threshold = threshold if threshold is not None else self.threshold
        accounts = sorted(set(df['customer_id'].unique() if 'customer_id' in df.columns else df['sender_location'].unique()) | 
                         set(df['merchant_id'].unique() if 'merchant_id' in df.columns else df['receiver_location'].unique()))
        suspicious_accounts = []
        pred_labels = []
        true_labels = []
        
        for i, (account, error, scaled_error) in enumerate(zip(accounts, recon_errors, scaled_errors)):
            is_suspicious = error > threshold
            pred_labels.append(1 if is_suspicious else 0)
            true_labels.append(graph.y[i].item())
            if is_suspicious:
                suspicious_accounts.append({
                    'account': account,
                    'anomaly_score': float(scaled_error),
                    'is_high_risk_location': account == 'K1C5' or ('CUST_' in account and df[df['customer_id'] == account]['sender_location'].iloc[0] == 'K1C5' if 'customer_id' in df.columns else False),
                    'pattern': self._identify_pattern(df, account)
                })
        
        precision = precision_score(true_labels, pred_labels) if any(true_labels) else 0.0
        
        G = nx.DiGraph()
        for _, row in df.iterrows():
            sender = row['customer_id' if 'customer_id' in df.columns else 'sender_location']
            receiver = row['merchant_id' if 'merchant_id' in df.columns else 'receiver_location']
            G.add_edge(sender, receiver, weight=row['amount'])
        
        suspicious_clusters = []
        components = list(nx.weakly_connected_components(G))
        for component in components:
            valid_accounts = [acc for acc in component if acc in account_to_idx]
            if not valid_accounts or len(valid_accounts) <= 3:  # Relaxed to >2 nodes
                continue
            avg_error = np.mean([recon_errors[account_to_idx[acc]] for acc in valid_accounts])
            avg_scaled_error = np.mean([scaled_errors[account_to_idx[acc]] for acc in valid_accounts])
            k1c5_nodes = sum(1 for acc in component if acc == 'K1C5' or ('CUST_' in account and df[df['customer_id'] == acc]['sender_location'].iloc[0] == 'K1C5' if 'customer_id' in df.columns else False))
            if avg_error > threshold and k1c5_nodes > 0:
                suspicious_clusters.append({
                    'nodes': list(component),
                    'avg_anomaly_score': float(avg_scaled_error),
                    'k1c5_nodes': k1c5_nodes
                })
        
        suspicious_accounts.sort(key=lambda x: x['anomaly_score'], reverse=True)
        suspicious_clusters.sort(key=lambda x: x['avg_anomaly_score'], reverse=True)
        
        return {
            'suspicious_accounts': suspicious_accounts,
            'total_accounts_analyzed': len(accounts),
            'high_risk_count': len(suspicious_accounts),
            'k1c5_involvement': sum(1 for acc in suspicious_accounts if acc['is_high_risk_location']),
            'suspicious_clusters': suspicious_clusters[:5],
            'precision': precision
        }
    
    def _identify_pattern(self, df, account):
        sent = df[df['customer_id' if 'customer_id' in df.columns else 'sender_location'] == account]
        received = df[df['merchant_id' if 'merchant_id' in df.columns else 'receiver_location'] == account]
        all_txns = pd.concat([sent, received])
        
        if len(all_txns) == 0:
            return "unknown"
        
        patterns = []
        if len(all_txns) > 5:
            amounts = all_txns['amount'].values
            if np.std(amounts) < np.mean(amounts) * 0.3:
                time_spread = all_txns['transaction_hour'].max() - all_txns['transaction_hour'].min()
                if time_spread < 6:
                    patterns.append("structuring")
        
        if all_txns['cross_border'].mean() > 0.7:
            if len(set(all_txns['sender_location'].unique()) | set(all_txns['receiver_location'].unique())) > 4:
                patterns.append("layering")
        
        if all_txns['cash_equivalent'].mean() > 0.5:
            if 'business' in all_txns['customer_segment'].values:
                patterns.append("integration")
        
        return ", ".join(patterns) if patterns else "suspicious_activity"

def demonstrate_aml_system(df):
    print("🏰 Fantasy Kingdom AML Detection System")
    print("=" * 50)
    print("Investigating corrupt syndicate in K1C5...")
    print(f"\nDataset Overview:")
    print(f"  Total transactions: {len(df)}")
    print(f"  Money laundering cases: {df['is_money_laundering'].sum()}")
    print(f"  K1C5 involvement: {((df['sender_location'] == 'K1C5') | (df['receiver_location'] == 'K1C5')).sum()}")
    
    df = df.sample(frac=1).reset_index(drop=True)
    
    aml_system = AMLDetectionSystem(hidden_dim=32, latent_dim=16)
    train_size = int(len(df) * 0.8)
    train_df = df.iloc[:train_size]
    test_df = df.iloc[train_size:]
    
    aml_system.train(train_df, epochs=50, batch_size=16)
    
    print("\n🔍 Analyzing test transactions for suspicious patterns...")
    results = aml_system.detect_suspicious_patterns(test_df)
    
    print(f"\n📊 Detection Results:")
    print(f"  Accounts analyzed: {results['total_accounts_analyzed']}")
    print(f"  High-risk accounts detected: {results['high_risk_count']}")
    print(f"  K1C5 syndicate connections: {results['k1c5_involvement']}")
    print(f"  Precision: {results['precision']:.4f}")
    
    print(f"\n🚨 Top Suspicious Accounts:")
    for i, acc in enumerate(results['suspicious_accounts'][:5], 1):
        print(f"  {i}. {acc['account']} - Anomaly Score: {acc['anomaly_score']:.2f} - Pattern: {acc['pattern']}")
        if acc['is_high_risk_location']:
            print(f"     ⚠️  ALERT: Direct K1C5 syndicate member!")
    
    
    return aml_system, results

# Example usage with Mission 1 data
if __name__ == "__main__":
    # Generate base data
    df = create_base_transaction_data(n_samples=10000, ml_ratio=0.05)
    
    # Enhance with customer-merchant networks
    df = create_customer_merchant_networks(df, n_customers=500, n_merchants=200)
    
    # Add time-series features
    df = generate_time_series_flows(df, days=30)
    
    # Run AML detection
    aml_system, results = demonstrate_aml_system(df)


✓ Total transactions generated: 10914 (approx 5.0% ML syndicate cases)
  💰 Corrupt syndicate cases: 1414
  🏰 Transactions involving K1C5: 1405

🕸️ Weaving Customer-Merchant Networks...
✓ Network created: 500 customers, 200 merchants
  High-risk merchants: 20
  Suspicious customers: 50

⏰ Conjuring Time-Series Financial Flows...
✓ Time-series flows generated over 30 days
  Features: timestamp, running_total, velocity_1h, velocity_24h
🏰 Fantasy Kingdom AML Detection System
Investigating corrupt syndicate in K1C5...

Dataset Overview:
  Total transactions: 10914
  Money laundering cases: 1414
  K1C5 involvement: 1405
🔨 Building transaction graphs...
✓ Created 422 temporal graph snapshots
Sample graph - Nodes: torch.Size([382, 13]), Edges: torch.Size([2, 299]), Edge attr: torch.Size([299, 11])

🚀 Training GAE model...
Epoch 1/50
  Train Loss: 0.7460, AUC: 0.8390
  Val Loss: 0.6645, AUC: 0.8443, PR-AUC: 0.3946
Epoch 5/50
  Train Loss: 0.4808, AUC: 0.7993
  Val Loss: 0.4682, AUC: 0.8084, PR-

_______________________________________________________

# Mission 3 Review

You’ve trained a mystical familiar to hunt money laundering in Goldweave Port’s synthetic transaction ledger. This enchanted pet sniffs out K1C5 syndicate patterns, revealing suspicious accounts and networks. Your notebook outputs—high-risk accounts, K1C5 ties, and precision metrics—expose the syndicate’s schemes!

## The Familiar’s Magic

The Prophecy Familiar uses three powers to track K1C5:

### 1. Web Weaving
- **Action**: Spins transactions into webs (`networkx>=3.4.2`), with accounts as nodes and transactions as weighted threads (`amount`).
- **Features**: Adds transaction counts, K1C5 flags (weighted x7), and time-series data (`velocity_24h`).
- **Output**: Temporal graph snapshots (300 transactions, stride 20).

**Role**: Maps K1C5 networks for detection.

### 2. Arcane Vision
- **Action**: Uses Graph Autoencoder (`torch>=2.8.0`, `torch-geometric>=2.6.1`) to learn normal patterns and spot anomalies.
- **Process**: Encodes nodes with GCNs, reconstructs webs, and flags high-error nodes (anomaly scores 0–100).
- **Training**: Fine-tunes on K1C5 labels.

**Role**: Detects suspicious accounts (e.g., structuring, layering).

### 3. Syndicate Hunt
- **Action**: Trains on 80% of synthetic data (50 epochs), tests on 20%, and identifies suspicious clusters.
- **Outputs**: Lists top accounts, scores, patterns (e.g., structuring for frequent small deposits), and K1C5-linked webs.
- **Metrics**: Reports precision, account counts, K1C5 involvement.

**Role**: Pinpoints high-risk accounts and networks.

## Why It Matters
Your familiar’s visions expose K1C5’s tactics, like structured deposits (<10,000 gold pieces) or cross-border layering, using synthetic data’s realistic patterns. With high precision, it bridges Missions 1 and 2, turning data into actionable omens. 

_______________________________________________________

## Mission 4: The Oracle's Golem
**Objective**: Develop a Retrieval-Augmented Generation (RAG) golem to analyze and answer questions about the synthetic data and insights for Missions 1, and 2.



In [21]:
import pandas as pd
import numpy as np
from pathlib import Path
import pdfplumber
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss
from IPython.display import display, Markdown
import ollama
import signal
import sys

# Signal handler for graceful exit
def signal_handler(sig, frame):
    display(Markdown("🪄 **Arcane Investigation Terminated.**"))
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)

display(Markdown("🔮 **Initiating Valdris Financial Investigation System...**"))
display(Markdown("=" * 80))

class FantasyRAGChatbot:
    def __init__(self, transaction_csv="mission1_synthetic_data.csv", docs_folder="mission2_documents"):
        self.transaction_csv = transaction_csv
        self.docs_folder = docs_folder
        self.documents = []
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
        self.faiss_index = None
        self.bm25 = None
        self.model_name = "mistral:7b-instruct-v0.3-q4_0"
        self.df = None
        display(Markdown("✓ **Initialized Arcane RAG System for Transaction and Document Analysis**"))

    def load_transaction_data(self):
        """Load and process transaction data from CSV, aligned with provided structure"""
        display(Markdown(f"\n1️⃣ **Loading Enchanted Ledger from {self.transaction_csv}...**"))
        try:
            self.df = pd.read_csv(self.transaction_csv)
            expected_columns = [
                'timestamp', 'amount', 'account_age_days', 'transaction_hour', 'day_of_week',
                'transactions_last_24h', 'account_balance_ratio', 'merchant_risk_score',
                'cross_border', 'cash_equivalent', 'transaction_type', 'customer_segment',
                'sender_location', 'receiver_location', 'is_money_laundering', 'customer_id',
                'merchant_id', 'running_total', 'hour_of_day', 'velocity_1h', 'velocity_24h'
            ]
            if not all(col in self.df.columns for col in expected_columns):
                missing = [col for col in expected_columns if col not in self.df.columns]
                display(Markdown(f"⚠️ **Error: CSV missing columns: {missing}.**"))
                return []
            # Debug: Show column info
            display(Markdown(f"📋 **Column types**: {self.df.dtypes.to_dict()}"))
            display(Markdown(f"📋 **Sample timestamp**: {self.df['timestamp'].head(3).to_list()}"))
            # Convert timestamp to date (YYYY-MM-DD) and rename to 'date'
            self.df['date'] = pd.to_datetime(self.df['timestamp'], errors='coerce').dt.strftime('%Y-%m-%d')
            if self.df['date'].isna().all():
                display(Markdown("⚠️ **Error: All timestamps are invalid.**"))
                self.df['date'] = "Unknown"
        except FileNotFoundError:
            display(Markdown(f"⚠️ **Error: {self.transaction_csv} not found. Ensure Mission 1 data is generated.**"))
            return []
        except Exception as e:
            display(Markdown(f"⚠️ **Error loading CSV: {str(e)}.**"))
            return []

        transaction_docs = []
        for idx, row in self.df.iterrows():
            doc_text = f"""
TRANSACTION #{idx}
Date: {row['date']}
Location: {row['sender_location']}
Account: {row['customer_id']}
Amount: {row['amount']:,} Gold Pieces
Type: {row['transaction_type']}
Description: {row.get('description', 'N/A')}
Counterparty: {row['merchant_id']}
            """.strip()
            metadata = {
                "type": "transaction",
                "transaction_id": str(idx),
                "location": row['sender_location'],
                "amount": float(row['amount']),
                "counterparty": row['merchant_id'],
                "description": row.get('description', 'N/A'),
                "is_money_laundering": bool(row['is_money_laundering']),
                "merchant_risk_score": float(row['merchant_risk_score'])
            }
            transaction_docs.append({"text": doc_text, "metadata": metadata})
        display(Markdown(f"✓ **Loaded {len(transaction_docs)} ledger entries**"))
        return transaction_docs

    def load_investigation_documents(self):
        """Extract text and tables from PDFs with improved error handling"""
        display(Markdown(f"\n2️⃣ **Deciphering Ancient Scrolls from {self.docs_folder}...**"))
        docs_path = Path(self.docs_folder)
        if not docs_path.exists():
            display(Markdown(f"⚠️ **Error: {self.docs_folder} not found. Run Mission 2 to generate documents.**"))
            return []
        
        investigation_docs = []
        for pdf_file in docs_path.glob("*.pdf"):
            display(Markdown(f"📜 **Decoding {pdf_file.name}...**"))
            try:
                with pdfplumber.open(pdf_file) as pdf:
                    full_text = ""
                    for page in pdf.pages:
                        try:
                            text = page.extract_text() or ""
                            tables = page.extract_tables() or []
                            for table in tables:
                                table_str = "\n".join([",".join(str(cell or '') for cell in row) for row in table])
                                text += f"\nTable:\n{table_str}\n"
                            full_text += text + "\n"
                        except Exception as e:
                            display(Markdown(f"⚠️ **Warning: Failed to extract page from {pdf_file.name}: {str(e)}**"))
                    if not full_text.strip():
                        display(Markdown(f"⚠️ **Warning: No text extracted from {pdf_file.name}**"))
                        continue
                    chunks = [full_text[i:i+500] for i in range(0, len(full_text), 500)]
                    doc_type = "unknown"
                    if "whistleblower" in pdf_file.name.lower():
                        doc_type = "whistleblower_report"
                    elif "bank_statement" in pdf_file.name.lower():
                        doc_type = "bank_statement"
                    elif "sar" in pdf_file.name.lower():
                        doc_type = "suspicious_activity_report"
                    for chunk_idx, chunk in enumerate(chunks):
                        metadata = {
                            "type": "document",
                            "document_type": doc_type,
                            "filename": pdf_file.name,
                            "chunk_id": chunk_idx
                        }
                        investigation_docs.append({"text": chunk, "metadata": metadata})
            except Exception as e:
                display(Markdown(f"⚠️ **Error processing {pdf_file.name}: {str(e)}**"))
                continue
        display(Markdown(f"✓ **Processed {len(investigation_docs)} scroll fragments**"))
        return investigation_docs

    def setup_index(self):
        """Build FAISS and BM25 indices"""
        self.documents = self.load_transaction_data() + self.load_investigation_documents()
        if not self.documents:
            display(Markdown("⚠️ **Error: No documents loaded. Check data and document paths.**"))
            return
        for doc in self.documents:
            try:
                doc["embedding"] = self.embedder.encode(doc["text"], convert_to_numpy=True)
            except Exception as e:
                display(Markdown(f"⚠️ **Error embedding document: {str(e)}**"))
                doc["embedding"] = np.zeros(384)
        embeddings = np.array([doc["embedding"] for doc in self.documents]).astype('float32')
        dimension = embeddings.shape[1]
        self.faiss_index = faiss.IndexFlatL2(dimension)
        self.faiss_index.add(embeddings)
        tokenized_docs = [doc["text"].lower().split() for doc in self.documents]
        self.bm25 = BM25Okapi(tokenized_docs)
        display(Markdown(f"✓ **Indexed {len(self.documents)} artifacts with FAISS and BM25**"))

    def search_documents(self, query, top_k=5, prioritize_transactions=False):
        """Hybrid search with option to prioritize transaction data"""
        try:
            query_embedding = self.embedder.encode(query, convert_to_numpy=True).astype('float32')
            tokenized_query = query.lower().split()
            bm25_scores = self.bm25.get_scores(tokenized_query)
            distances, indices = self.faiss_index.search(query_embedding.reshape(1, -1), top_k * 2)
            semantic_scores = [1 / (1 + d) for d in distances[0]]
            combined_scores = []
            for j, i in enumerate(indices[0]):
                score = 0.4 * bm25_scores[i] + 0.6 * semantic_scores[j]
                if prioritize_transactions and self.documents[i]["metadata"]["type"] == "transaction":
                    score *= 1.5
                combined_scores.append((i, score))
            top_indices = sorted(combined_scores, key=lambda x: x[1], reverse=True)[:top_k]
            results = {
                "documents": [self.documents[i]["text"] for i, _ in top_indices],
                "metadatas": [self.documents[i]["metadata"] for i, _ in top_indices]
            }
            return results
        except Exception as e:
            display(Markdown(f"⚠️ **Error during search: {str(e)}**"))
            return {"documents": [], "metadatas": []}

    def generate_transaction_stats(self, query):
        """Compute specific statistics from CSV based on query"""
        if not hasattr(self, 'df') or self.df is None or self.df.empty:
            return "No transaction data available."
        
        try:
            if 'date' not in self.df.columns:
                display(Markdown("⚠️ **Error: 'date' column missing in CSV.**"))
                return "Unable to compute statistics due to missing 'date' column."
            
            if "largest transaction" in query.lower():
                max_trans = self.df.loc[self.df['amount'].idxmax()]
                date_str = str(max_trans['date']) if not pd.isna(max_trans['date']) else "Unknown"
                return f"""
Largest Transaction:
- Amount: {max_trans['amount']:,} Gold Pieces
- Date: {date_str}
- Location: {max_trans['sender_location']}
- Account: {max_trans['customer_id']}
- Counterparty: {max_trans['merchant_id']}
- Type: {max_trans['transaction_type']}
"""
            elif "highest merchant risk score" in query.lower():
                max_risk = self.df.loc[self.df['merchant_risk_score'].idxmax()]
                return f"""
Highest Risk Account:
- Account: {max_risk['customer_id']}
- Merchant Risk Score: {max_risk['merchant_risk_score']:.2f}
- Total Amount: {max_risk['amount']:,} Gold Pieces
- Location: {max_risk['sender_location']}
- Counterparty: {max_risk['merchant_id']}
"""
            elif "suspicious" in query.lower() or "K1C5" in query.lower() or "money laundering" in query.lower():
                suspicious_df = self.df[
                    (self.df['is_money_laundering'] == 1) |
                    (self.df['amount'].between(9000, 9950)) |
                    (self.df['amount'] > 15000) |
                    (self.df['sender_location'] == 'K1C5')
                ]
                return f"""
Transaction Summary:
- Suspicious Transactions: {len(suspicious_df)}
- Total Amount: {suspicious_df['amount'].sum():,.2f} Gold Pieces
- Top Locations: {suspicious_df['sender_location'].value_counts().head(3).to_dict()}
- Top Transaction Types: {suspicious_df['transaction_type'].value_counts().head(2).to_dict()}
"""
        except Exception as e:
            display(Markdown(f"⚠️ **Error in stats computation: {str(e)}**"))
            return f"Unable to compute statistics: {str(e)}"
        return ""

    def generate_response(self, query, search_results):
        """Generate concise response with Ollama, prioritizing CSV for database queries"""
        is_csv_query = any(term in query.lower() for term in ["transaction", "database", "largest", "highest", "account", "amount"])
        context = ""
        
        if is_csv_query:
            stats = self.generate_transaction_stats(query)
            context = stats if stats else "\n\n".join(search_results["documents"][:3])
        else:
            context = "\n\n".join(search_results["documents"][:3])
        
        prompt = f"""
By the decree of the Valdris Council, you are the Arcane Investigator, tasked with exposing financial crimes.
CONTEXT (prioritize transactions for database queries, then documents; K1C5 is Goldweave Port):
{context}
QUESTION: {query}
Respond in 100 words or less with concise bullet points:
- Suspicious Patterns: [e.g., frequent K1C5 transactions]
- Key Entities: [Accounts, locations, counterparties]
- Money Laundering Indicators: [From transactions or documents]
- Risk Assessment: [Recommendation]
Use specific data from context. Avoid elaboration.
"""
        try:
            response = ollama.chat(model=self.model_name, messages=[{'role': 'user', 'content': prompt}])
            response_text = response['message']['content']
            words = response_text.split()
            if len(words) > 100:
                response_text = ' '.join(words[:100]) + "..."
            return response_text
        except Exception as e:
            display(Markdown(f"⚠️ **Error: Ollama failed - {str(e)}. Ensure Ollama server is running with model {self.model_name}.**"))
            return "Unable to generate response due to Ollama error."

    def investigate(self, query):
        """Run investigation query"""
        display(Markdown(f"\n🔍 **Arcane Query: {query}**"))
        if not self.documents or self.faiss_index is None:
            display(Markdown("⚠️ **Error: Index not set up. Run setup_index first.**"))
            return ""
        prioritize_transactions = any(term in query.lower() for term in ["transaction", "database", "largest", "highest", "account", "amount"])
        search_results = self.search_documents(query, top_k=10, prioritize_transactions=prioritize_transactions)
        if not search_results["documents"]:
            display(Markdown("⚠️ **No relevant documents found for query.**"))
            return "No relevant information found."
        response = self.generate_response(query, search_results)
        display(Markdown(f"📜 **Arcane Findings**:\n{response}"))
        return response

    def run(self):
        """Main loop for interactive investigation"""
        self.setup_index()
        if not self.documents:
            display(Markdown("⚠️ **Error: No data loaded. Investigation halted.**"))
            return
        display(Markdown("\n🏰 **Arcane Investigation Chamber Active**"))
        display(Markdown("Type your query or 'exit' to close the chamber."))
        display(Markdown("Example Queries:"))
        display(Markdown("- Largest transaction in the database\n- Account with the highest merchant risk score\n- Suspicious transactions in K1C5"))
        question_count = 0
        while True:
            try:
                question = input(f"🪄 Query #{question_count + 1}: ").strip()
                if question.lower() in ['exit', 'quit', 'stop']:
                    display(Markdown(f"\n👋 **Chamber Closed. {question_count} queries investigated.**"))
                    break
                if question:
                    self.investigate(question)
                    question_count += 1
                else:
                    display(Markdown("💡 **Enter a query or 'exit'.**"))
            except Exception as e:
                display(Markdown(f"⚠️ **Error during query: {str(e)}**"))

if __name__ == "__main__":
    chatbot = FantasyRAGChatbot()
    chatbot.run()

🔮 **Initiating Valdris Financial Investigation System...**

================================================================================

✓ **Initialized Arcane RAG System for Transaction and Document Analysis**


1️⃣ **Loading Enchanted Ledger from mission1_synthetic_data.csv...**

📋 **Column types**: {'timestamp': dtype('O'), 'amount': dtype('float64'), 'account_age_days': dtype('int64'), 'transaction_hour': dtype('int64'), 'day_of_week': dtype('int64'), 'transactions_last_24h': dtype('int64'), 'account_balance_ratio': dtype('float64'), 'merchant_risk_score': dtype('float64'), 'cross_border': dtype('int64'), 'cash_equivalent': dtype('int64'), 'transaction_type': dtype('O'), 'customer_segment': dtype('O'), 'sender_location': dtype('O'), 'receiver_location': dtype('O'), 'is_money_laundering': dtype('int64'), 'customer_id': dtype('O'), 'merchant_id': dtype('O'), 'running_total': dtype('float64'), 'hour_of_day': dtype('int64'), 'velocity_1h': dtype('float64'), 'velocity_24h': dtype('float64')}

📋 **Sample timestamp**: ['2024-01-02 07:36:34', '2024-01-03 12:51:26', '2024-01-07 18:19:47']

✓ **Loaded 2000 ledger entries**


2️⃣ **Deciphering Ancient Scrolls from mission2_documents...**

📜 **Decoding bank_statement_265055.pdf...**

📜 **Decoding SAR_883301.pdf...**

📜 **Decoding whistleblower_report_RG924718.pdf...**

✓ **Processed 19 scroll fragments**

✓ **Indexed 2019 artifacts with FAISS and BM25**


🏰 **Arcane Investigation Chamber Active**

Type your query or 'exit' to close the chamber.

Example Queries:

- Largest transaction in the database
- Account with the highest merchant risk score
- Suspicious transactions in K1C5


🔍 **Arcane Query: Account with the highest merchant risk score**

📜 **Arcane Findings**:
 - Suspicious Patterns: Frequent transactions with Account CUST_0015 at the Goldweave Port (K3C4), specifically in K1C5.
   - Key Entities: Account CUST_00009, Location K3C4, Counterparty MERCH_0015.
   - Money Laundering Indicators: High total amount of 1,073.09 Gold Pieces in a single account. Irregular transactions at the Goldweave Port.
   - Risk Assessment: Increase surveillance and monitoring of Account CUST_00009's activities with Counterparty MERCH_0015 in K1C5. Investigate potential money laundering activities due to high risk score, irregular patterns, and large transaction amounts.


👋 **Chamber Closed. 1 queries investigated.**


_______________________________________________________

# Mission 4 Review 

Your mission was to deploy the `FantasyRAGChatbot`, a mystical Retrieval-Augmented Generation (RAG) golem, to query the enchanted transaction ledger (`mission1_synthetic_data.csv`) and ancient scrolls (`mission2_documents/`) from Goldweave Port. By running the chatbot in your Jupyter notebook, you’ve extracted insights about suspicious transactions and documents, uncovering potential money laundering tied to the K1C5 syndicate. Let’s dive into how this arcane tool works and what it reveals!

## How It Works

The `FantasyRAGChatbot` harnesses synthetic data and documents:

- **Transaction Loader** (`pandas>=2.2.3`): Reads synthetic CSV with columns like `amount`, `customer_id`, `is_money_laundering`. Converts timestamps to dates. Creates text summaries (e.g., “10,000 Gold Pieces, K1C5”).
- **Document Extractor** (`pdfplumber>=0.11.7`): Extracts text/tables from PDFs (e.g., whistleblower reports). Splits into 500-char chunks, labeling as `bank_statement` or `suspicious_activity_report`.
- **RAG System**: Uses BM25 (`rank-bm25>=0.2.2`) and SentenceTransformers (`sentence-transformers==3.2.0`) for retrieval, FAISS (`faiss-cpu==1.9.0`) for embedding search, and Mistral (`ollama>=0.5.3`) for <100-word responses. Prioritizes synthetic transaction data for queries like “largest transaction.”

## Why It Matters
Mission 1’s synthetic data mimics real AML patterns (e.g., K1C5’s high-amount transfers) using Gaussian Copula for statistical fidelity and CTGAN for complex distributions. The golem queries these patterns, revealing suspicious transactions or document clues. Empower the high court giving them the tool to further assess the evnidence, expose synthetic patterns, and take down the money laundering syndicate!

____________________________________________________

 # Conclusion: Triumph of the Arcane Investigator

Noble Investigator, through the 4 missions, you hhave successfully unraveled Goldweave Port’s financial mysteries and money laundering patterns! 

In Mission 1, you forged a synthetic ledger (mission1_synthetic_data.csv) using Gaussian Copula, CTGAN, Graph Network, and TVAE, crafting realistic transaction patterns to expose K1C5’s schemes. 

Mission 2, conjured ancient scrolls (mission2_documents/) with whistleblower reports and bank statements. 

Mission 3, your synthetic data laid the groud work for training your familiar so it has the power to detect suspicious patterns utilizing... 

In Mission 4, the FantasyRAGChatbot golem empowered you to query transactions and documents, revealing high-value transfers and syndicate clues with BM25, SentenceTransformers, and Mistral. Your synthetic data-driven insights have armed the Valdris Council to combat money laundering. Your arcane mastery has safeguarded the realm!