# Notebook 04: LSTM Sequence Models for Fraud Detection

## Overview

This notebook trains three recurrent sequence models (a BiLSTM with attention, a residual GRU, and an LSTM-CNN hybrid) on per-user transaction sequences to detect fraud.

### Objectives

1. **Sequence Generation**: Create transaction sequences with temporal windowing
2. **Model Architectures**: Build 3 advanced LSTM/GRU models
3. **Training Pipeline**: Mixed precision training with class balancing
4. **Embedding Extraction**: Generate embeddings for fusion model (Notebook 05)
5. **Performance Analysis**: Comprehensive evaluation and visualization

### Model Architectures

| Model | Description | Key Features |
|-------|-------------|--------------|
| **BiLSTM + Attention** | Bidirectional LSTM with attention mechanism | Multi-layer, attention pooling, batch normalization |
| **Residual GRU** | Deep GRU with skip connections | 4 layers, layer normalization, temporal pooling |
| **LSTM-CNN Hybrid** | Parallel LSTM and CNN paths | Multi-scale convolutions, feature fusion |

### Expected Results

- **Accuracy**: 85-92%
- **F1 Score**: 0.65-0.80
- **AUC**: 0.85-0.92

### Integration

- **Input**: PaySim processed data from Notebook 01
- **Output**: Trained models + embeddings for Notebook 05 (Fusion Model)

---

## 1️⃣ Imports and Environment Setup

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import warnings
from typing import Dict, List, Tuple, Optional
import time
from tqdm.auto import tqdm
import pickle
import gc

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
from torch.cuda.amp import autocast, GradScaler

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix,
    classification_report, average_precision_score
)
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Configuration
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print(f'PyTorch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'CUDA version: {torch.version.cuda}')

## 2️⃣ Path Configuration and Device Setup

In [None]:
# Path configuration
BASE_PATH = Path(r'C:\Users\youss\Downloads\Flag_finance\data')
PROCESSED_PATH = BASE_PATH / 'processed'
MODELS_PATH = BASE_PATH / 'models'
RESULTS_PATH = BASE_PATH / 'results'

# Ensure directories exist
MODELS_PATH.mkdir(parents=True, exist_ok=True)
RESULTS_PATH.mkdir(parents=True, exist_ok=True)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True
    torch.cuda.empty_cache()

print(f'✅ Using device: {device}')
print(f'📁 Data path: {BASE_PATH}')
print(f'📁 Models path: {MODELS_PATH}')
print(f'📁 Results path: {RESULTS_PATH}')
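
The base path above is tied to one machine. A minimal sketch of a portable alternative, assuming a hypothetical environment variable `FLAG_FINANCE_DATA` (not defined anywhere in this notebook) with a relative `data` folder as fallback:

```python
import os
from pathlib import Path

# FLAG_FINANCE_DATA is a hypothetical environment variable, not part of the notebook.
base_path = Path(os.environ.get('FLAG_FINANCE_DATA', 'data')).expanduser()
processed_path = base_path / 'processed'
models_path = base_path / 'models'
results_path = base_path / 'results'

print(processed_path)
```

The sub-folder names match the ones used in this notebook; only the way the root is resolved differs.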

## 3️⃣ Transaction Sequence Generator

This class handles:
- **Temporal windowing** with configurable stride
- **Feature engineering** (17 numeric features plus an encoded transaction type per time step, when all columns are present)
- **User-level aggregation** for realistic sequences
- **Normalization** and encoding
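
The windowing rule can be illustrated with a small sketch. The helper name `window_starts` and the toy fraud flags are hypothetical; the start-index range and the "fraud if any transaction in the window is fraud" label follow the generator defined in this section.

```python
from typing import List


def window_starts(n_transactions: int, sequence_length: int, stride: int) -> List[int]:
    """Start indices of every full window that fits in one user's history."""
    return list(range(0, n_transactions - sequence_length + 1, stride))


# Hypothetical user with 23 transactions; only transaction 12 is fraudulent.
is_fraud = [0] * 23
is_fraud[12] = 1

starts = window_starts(len(is_fraud), sequence_length=10, stride=5)
# A window is labeled fraud if any transaction inside it is fraud.
window_labels = [max(is_fraud[s:s + 10]) for s in starts]

print(starts)         # [0, 5, 10]
print(window_labels)  # [0, 1, 1]
```

Two consequences are worth noting: with a stride smaller than the sequence length, a single fraudulent transaction can label several overlapping windows, and a user with fewer than `sequence_length` transactions produces no window at all.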

In [None]:
class TransactionSequenceGenerator:
    """
    Generate transaction sequences for LSTM training.
    
    Features:
    - Temporal windowing with sliding window
    - User-level aggregation
    - Advanced feature engineering
    - Feature scaling and transaction-type encoding
    """
    
    def __init__(self, 
                 sequence_length: int = 10,
                 stride: int = 5,
                 min_transactions: int = 3):
        self.sequence_length = sequence_length
        self.stride = stride
        self.min_transactions = min_transactions
        self.feature_scaler = StandardScaler()
        self.categorical_encoders = {}
        
    def engineer_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create advanced temporal and statistical features."""
        print('Engineering sequence features...')
        
        df = df.copy()
        
        # Temporal features
        if 'step' in df.columns:
            df['hour'] = df['step'] % 24
            df['day'] = df['step'] // 24
            df['is_weekend'] = (df['day'] % 7 >= 5).astype(int)
            df['is_night'] = ((df['hour'] >= 22) | (df['hour'] <= 6)).astype(int)
        
        # Amount features
        if 'amount' in df.columns:
            df['amount_log'] = np.log1p(df['amount'])
            df['amount_sqrt'] = np.sqrt(df['amount'])
        
        # Balance features (PaySim specific)
        if 'oldbalanceOrg' in df.columns:
            df['balance_ratio_orig'] = df['amount'] / (df['oldbalanceOrg'] + 1e-6)
            df['balance_change_orig'] = df['newbalanceOrig'] - df['oldbalanceOrg']
            df['balance_error_orig'] = df['balance_change_orig'] + df['amount']
        
        if 'oldbalanceDest' in df.columns:
            df['balance_ratio_dest'] = df['amount'] / (df['oldbalanceDest'] + 1e-6)
            df['balance_change_dest'] = df['newbalanceDest'] - df['oldbalanceDest']
            df['balance_error_dest'] = df['balance_change_dest'] - df['amount']
        
        return df
    
    def create_sequences_paysim(self, 
                                df: pd.DataFrame,
                                user_col: str = 'nameOrig') -> Tuple[List, List, List]:
        """Create sequences from PaySim data grouped by user."""
        print(f'\nCreating sequences (length={self.sequence_length}, stride={self.stride})...')
        
        # Engineer features
        df = self.engineer_features(df)
        
        # Sort by user and time
        df = df.sort_values([user_col, 'step'])
        
        # Select features for sequences
        feature_cols = [
            'amount', 'amount_log', 'amount_sqrt',
            'hour', 'day', 'is_weekend', 'is_night',
            'oldbalanceOrg', 'newbalanceOrig',
            'oldbalanceDest', 'newbalanceDest',
            'balance_ratio_orig', 'balance_change_orig', 'balance_error_orig',
            'balance_ratio_dest', 'balance_change_dest', 'balance_error_dest'
        ]
        
        # Encode transaction type
        if 'type' in df.columns:
            self.categorical_encoders['type'] = LabelEncoder()
            df['type_encoded'] = self.categorical_encoders['type'].fit_transform(df['type'])
            feature_cols.append('type_encoded')
        
        # Filter available features
        feature_cols = [col for col in feature_cols if col in df.columns]
        
        sequences = []
        labels = []
        metadata = []
        
        # Group by user
        grouped = df.groupby(user_col)
        
        for user_id, group in tqdm(grouped, desc='Processing users'):
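            # Note: users with fewer than sequence_length rows can pass this check
            # but produce no windows in the loop below
            if len(group) < self.min_transactions: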
                continue
            
            group = group.reset_index(drop=True)
            
            # Sliding window
            for i in range(0, len(group) - self.sequence_length + 1, self.stride):
                window = group.iloc[i:i + self.sequence_length]
                
                # Extract features
                seq_features = window[feature_cols].values
                
                # Label: fraud if any transaction in sequence is fraud
                seq_label = int(window['isFraud'].max())
                
                sequences.append(seq_features)
                labels.append(seq_label)
                metadata.append({
                    'user_id': user_id,
                    'start_idx': i,
                    'end_idx': i + self.sequence_length,
                    'fraud_count': int(window['isFraud'].sum())
                })
        
        print(f'Created {len(sequences):,} sequences')
        if labels:  # avoid dividing by zero when no user has a full window
            print(f'Fraud sequences: {sum(labels):,} ({sum(labels)/len(labels)*100:.2f}%)')
        
        return sequences, labels, metadata
    
    def normalize_sequences(self, sequences: List[np.ndarray], 
                           fit: bool = True) -> np.ndarray:
        """Normalize sequence features."""
        print('Normalizing sequences...')
        
        # Flatten all sequences for fitting
        all_features = np.vstack(sequences)
        
        if fit:
            self.feature_scaler.fit(all_features)
        
        # Normalize each sequence
        normalized = []
        for seq in sequences:
            normalized.append(self.feature_scaler.transform(seq))
        
        return np.array(normalized)
    
    def save(self, path: Path):
        """Save generator state."""
        state = {
            'sequence_length': self.sequence_length,
            'stride': self.stride,
            'min_transactions': self.min_transactions,
            'feature_scaler': self.feature_scaler,
            'categorical_encoders': self.categorical_encoders
        }
        with open(path, 'wb') as f:
            pickle.dump(state, f)
        print(f'Saved sequence generator to: {path}')
    
    @classmethod
    def load(cls, path: Path):
        """Load generator state."""
        with open(path, 'rb') as f:
            state = pickle.load(f)
        
        generator = cls(
            sequence_length=state['sequence_length'],
            stride=state['stride'],
            min_transactions=state['min_transactions']
        )
        generator.feature_scaler = state['feature_scaler']
        generator.categorical_encoders = state['categorical_encoders']
        
        print(f'Loaded sequence generator from: {path}')
        return generator

print('✅ TransactionSequenceGenerator class defined')

## 4️⃣ PyTorch Dataset Class

In [None]:
class SequenceDataset(Dataset):
    """PyTorch Dataset for transaction sequences."""
    
    def __init__(self, sequences: np.ndarray, labels: np.ndarray, 
                 metadata: Optional[List[Dict]] = None):
        self.sequences = torch.FloatTensor(sequences)
        self.labels = torch.LongTensor(labels)
        self.metadata = metadata
    
    def __len__(self):
        return len(self.sequences)
    
    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]
    
    def get_class_weights(self):
        """Compute class weights for balanced training."""
        labels_np = self.labels.numpy()
        class_counts = np.bincount(labels_np)
        weights = 1.0 / class_counts
        weights = weights / weights.sum() * len(weights)
        return torch.FloatTensor(weights)

print('✅ SequenceDataset class defined')
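
As a worked example of the inverse-frequency weighting in `get_class_weights`, the arithmetic can be reproduced on a hypothetical label vector (the 990/10 split is an assumption chosen for round numbers):

```python
import numpy as np

# Hypothetical label vector: 990 legitimate sequences, 10 fraudulent ones.
labels = np.array([0] * 990 + [1] * 10)

class_counts = np.bincount(labels, minlength=2)
weights = 1.0 / class_counts                      # inverse class frequency
weights = weights / weights.sum() * len(weights)  # rescale so the weights sum to the number of classes

print(class_counts.tolist())  # [990, 10]
print(weights.round(4).tolist())
```

Here the fraud class ends up with a weight of about 1.98 and the legitimate class about 0.02, so each fraud sequence counts roughly 99 times as much in the loss.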

## 5️⃣ Model Architectures

### 5.1 Attention Layer

In [None]:
class AttentionLayer(nn.Module):
    """Attention mechanism for LSTM outputs."""
    
    def __init__(self, hidden_size):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1)
        )
    
    def forward(self, lstm_output):
        # lstm_output: (batch, seq_len, hidden_size)
        attention_weights = self.attention(lstm_output)  # (batch, seq_len, 1)
        attention_weights = F.softmax(attention_weights, dim=1)
        
        # Weighted sum
        attended = torch.sum(attention_weights * lstm_output, dim=1)  # (batch, hidden_size)
        return attended, attention_weights

print('✅ AttentionLayer defined')
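
The pooling step can be checked in isolation. This is a minimal sketch with random tensors; the shapes are assumptions that match a bidirectional LSTM with `hidden_size=128`, and `scores` stands in for the output of the scoring MLP.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch: 4 sequences, 10 time steps, 256 features (2 * hidden_size).
lstm_output = torch.randn(4, 10, 256)
scores = torch.randn(4, 10, 1)   # stand-in for the scoring MLP's output

weights = F.softmax(scores, dim=1)              # normalize over the time axis
attended = (weights * lstm_output).sum(dim=1)   # weighted sum over time steps

print(weights.shape, attended.shape)
print(weights.sum(dim=1).squeeze(-1))  # each entry is 1 up to float error
```

Taking the softmax over `dim=1` is what makes the weights a distribution over time steps rather than over features.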

### 5.2 BiLSTM with Attention

**Architecture:**
- Multi-layer bidirectional LSTM
- Attention pooling over time steps
- Batch normalization and dropout
- Deep classifier head

In [None]:
class BiLSTMWithAttention(nn.Module):
    """
    Bidirectional LSTM with attention mechanism.
    
    Features:
    - Multi-layer bidirectional LSTM
    - Attention pooling over time steps
    - Dropout and batch normalization
    """
    
    def __init__(self, 
                 input_size: int,
                 hidden_size: int = 128,
                 num_layers: int = 3,
                 dropout: float = 0.3,
                 num_classes: int = 2):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Input projection
        self.input_proj = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout * 0.5)
        )
        
        # Bidirectional LSTM
        self.lstm = nn.LSTM(
            hidden_size,
            hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # Attention layer
        self.attention = AttentionLayer(hidden_size * 2)
        
        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.BatchNorm1d(hidden_size // 2),
            nn.ReLU(),
            nn.Dropout(dropout * 0.7),
            nn.Linear(hidden_size // 2, num_classes)
        )
        
        self._init_weights()
    
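    def _init_weights(self):
        for name, param in self.named_parameters():
            # Only matrices are re-initialized: xavier/orthogonal init raises on
            # 1-D tensors such as BatchNorm weights, which keep their defaults
            if 'weight' in name and param.dim() >= 2:
                if 'lstm' in name:
                    nn.init.orthogonal_(param)
                else:
                    nn.init.xavier_uniform_(param)
            elif 'bias' in name:
                nn.init.zeros_(param)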
    
    def forward(self, x):
        batch_size, seq_len, input_size = x.size()
        
        # Project each time step
        x_proj = self.input_proj(x.view(-1, input_size))
        x_proj = x_proj.view(batch_size, seq_len, -1)
        
        # LSTM
        lstm_out, _ = self.lstm(x_proj)
        
        # Attention pooling
        attended, attention_weights = self.attention(lstm_out)
        
        # Classification
        output = self.classifier(attended)
        
        return output
    
    def extract_embeddings(self, x):
        """Extract embeddings for fusion model."""
        batch_size, seq_len, input_size = x.size()
        
        x_proj = self.input_proj(x.view(-1, input_size))
        x_proj = x_proj.view(batch_size, seq_len, -1)
        
        lstm_out, _ = self.lstm(x_proj)
        attended, _ = self.attention(lstm_out)
        
        return attended

print('✅ BiLSTMWithAttention defined')

### 5.3 Residual GRU

**Architecture:**
- 4-layer GRU with skip connections
- Layer normalization
- Temporal max and mean pooling
- Residual connections for deep training

In [None]:
class ResidualGRU(nn.Module):
    """
    GRU with residual connections for deep networks.
    
    Features:
    - Multi-layer GRU with skip connections
    - Layer normalization
    - Temporal max and mean pooling
    """
    
    def __init__(self,
                 input_size: int,
                 hidden_size: int = 128,
                 num_layers: int = 4,
                 dropout: float = 0.3,
                 num_classes: int = 2):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Input projection
        self.input_proj = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.LayerNorm(hidden_size),
            nn.ReLU()
        )
        
        # Stacked GRU layers with residual connections
        self.gru_layers = nn.ModuleList()
        self.layer_norms = nn.ModuleList()
        
        for i in range(num_layers):
            self.gru_layers.append(
                nn.GRU(hidden_size, hidden_size, batch_first=True)
            )
            self.layer_norms.append(nn.LayerNorm(hidden_size))
        
        self.dropout = nn.Dropout(dropout)
        
        # Classifier with temporal features
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),  # *2 for max+mean pooling
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes)
        )
        
        self._init_weights()
    
    def _init_weights(self):
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.kaiming_normal_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
            elif isinstance(module, nn.GRU):
                for name, param in module.named_parameters():
                    if 'weight' in name:
                        nn.init.orthogonal_(param)
                    elif 'bias' in name:
                        nn.init.zeros_(param)
    
    def forward(self, x):
        batch_size, seq_len, _ = x.size()
        
        # Project input
        x = self.input_proj(x.view(batch_size * seq_len, -1))
        x = x.view(batch_size, seq_len, -1)
        
        # Stacked GRU with residuals
        for i, (gru, norm) in enumerate(zip(self.gru_layers, self.layer_norms)):
            residual = x
            x, _ = gru(x)
            x = norm(x)
            x = self.dropout(x)
            
            # Residual connection
            if i > 0:
                x = x + residual * 0.3
        
        # Temporal pooling
        max_pool = torch.max(x, dim=1)[0]
        mean_pool = torch.mean(x, dim=1)
        pooled = torch.cat([max_pool, mean_pool], dim=1)
        
        # Classification
        output = self.classifier(pooled)
        
        return output
    
    def extract_embeddings(self, x):
        """Extract embeddings for fusion model."""
        batch_size, seq_len, _ = x.size()
        
        x = self.input_proj(x.view(batch_size * seq_len, -1))
        x = x.view(batch_size, seq_len, -1)
        
        for i, (gru, norm) in enumerate(zip(self.gru_layers, self.layer_norms)):
            residual = x
            x, _ = gru(x)
            x = norm(x)
            if i > 0:
                x = x + residual * 0.3
        
        max_pool = torch.max(x, dim=1)[0]
        mean_pool = torch.mean(x, dim=1)
        return torch.cat([max_pool, mean_pool], dim=1)

print('✅ ResidualGRU defined')
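
One residual block from the forward pass can be sketched on its own. The tiny `hidden_size` and random input are assumptions for illustration; the 0.3 scaling and the max-plus-mean pooling follow `ResidualGRU`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 8
gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
norm = nn.LayerNorm(hidden_size)

x = torch.randn(2, 10, hidden_size)   # (batch, seq_len, hidden)
residual = x
out, _ = gru(x)
out = norm(out) + 0.3 * residual      # scaled skip connection

# Concatenate max and mean over the time axis
pooled = torch.cat([out.max(dim=1)[0], out.mean(dim=1)], dim=1)
print(out.shape, pooled.shape)
```

The skip connection only works because the GRU input and output widths are equal, which is why the model projects the raw features to `hidden_size` before the first GRU layer.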

### 5.4 LSTM-CNN Hybrid

**Architecture:**
- Parallel LSTM and 1D CNN paths
- Multi-scale temporal convolutions (kernel sizes 3, 5, 7)
- Feature fusion layer
- Combined temporal modeling

In [None]:
class LSTMCNN(nn.Module):
    """
    Hybrid LSTM-CNN architecture.
    
    Features:
    - Parallel LSTM and 1D CNN paths
    - Feature fusion
    - Multi-scale temporal modeling
    """
    
    def __init__(self,
                 input_size: int,
                 hidden_size: int = 128,
                 num_layers: int = 2,
                 dropout: float = 0.3,
                 num_classes: int = 2):
        super().__init__()
        
        # LSTM path
        self.lstm = nn.LSTM(
            input_size,
            hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        # CNN path (multi-scale)
        self.conv1 = nn.Conv1d(input_size, hidden_size, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(input_size, hidden_size, kernel_size=5, padding=2)
        self.conv3 = nn.Conv1d(input_size, hidden_size, kernel_size=7, padding=3)
        
        self.bn1 = nn.BatchNorm1d(hidden_size)
        self.bn2 = nn.BatchNorm1d(hidden_size)
        self.bn3 = nn.BatchNorm1d(hidden_size)
        
        # Fusion layer
        fusion_size = hidden_size * 2 + hidden_size * 3  # LSTM (bidir) + 3 CNNs
        self.fusion = nn.Sequential(
            nn.Linear(fusion_size, hidden_size * 2),
            nn.BatchNorm1d(hidden_size * 2),
            nn.ReLU(),
            nn.Dropout(dropout)
        )
        
        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes)
        )
        
        self._init_weights()
    
    def _init_weights(self):
        for module in self.modules():
            if isinstance(module, (nn.Linear, nn.Conv1d)):
                nn.init.kaiming_normal_(module.weight)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
    
    def forward(self, x):
        batch_size, seq_len, input_size = x.size()
        
        # LSTM path
        lstm_out, _ = self.lstm(x)
        lstm_max = torch.max(lstm_out, dim=1)[0]
        
        # CNN path (transpose for Conv1d)
        x_cnn = x.transpose(1, 2)  # (batch, input_size, seq_len)
        
        conv1_out = F.relu(self.bn1(self.conv1(x_cnn)))
        conv2_out = F.relu(self.bn2(self.conv2(x_cnn)))
        conv3_out = F.relu(self.bn3(self.conv3(x_cnn)))
        
        # Global max pooling for each conv
        conv1_pool = torch.max(conv1_out, dim=2)[0]
        conv2_pool = torch.max(conv2_out, dim=2)[0]
        conv3_pool = torch.max(conv3_out, dim=2)[0]
        
        # Fuse all features
        fused = torch.cat([lstm_max, conv1_pool, conv2_pool, conv3_pool], dim=1)
        fused = self.fusion(fused)
        
        # Classification
        output = self.classifier(fused)
        
        return output
    
    def extract_embeddings(self, x):
        """Extract embeddings for fusion model."""
        batch_size, seq_len, input_size = x.size()
        
        lstm_out, _ = self.lstm(x)
        lstm_max = torch.max(lstm_out, dim=1)[0]
        
        x_cnn = x.transpose(1, 2)
        conv1_out = F.relu(self.bn1(self.conv1(x_cnn)))
        conv2_out = F.relu(self.bn2(self.conv2(x_cnn)))
        conv3_out = F.relu(self.bn3(self.conv3(x_cnn)))
        
        conv1_pool = torch.max(conv1_out, dim=2)[0]
        conv2_pool = torch.max(conv2_out, dim=2)[0]
        conv3_pool = torch.max(conv3_out, dim=2)[0]
        
        fused = torch.cat([lstm_max, conv1_pool, conv2_pool, conv3_pool], dim=1)
        return self.fusion(fused)

print('✅ LSTMCNN defined')
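
The padding values in the three convolution branches are chosen so that each branch preserves the sequence length. A short arithmetic check, using the output-length formula documented for `nn.Conv1d` (the helper name `conv1d_out_len` is hypothetical):

```python
def conv1d_out_len(seq_len: int, kernel_size: int, padding: int,
                   stride: int = 1, dilation: int = 1) -> int:
    """Output length of a 1-D convolution."""
    return (seq_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


seq_len, hidden_size = 10, 128
out_lens = [conv1d_out_len(seq_len, k, p) for k, p in [(3, 1), (5, 2), (7, 3)]]
print(out_lens)  # [10, 10, 10]

# BiLSTM output (2 * hidden) plus three pooled conv branches (hidden each)
fusion_size = hidden_size * 2 + hidden_size * 3
print(fusion_size)  # 640
```

With `padding = (kernel_size - 1) // 2` every branch sees all 10 time steps, and the fused vector has 640 features before the fusion layer reduces it to 256.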

## 6️⃣ Training Utilities

### 6.1 Focal Loss for Class Imbalance

In [None]:
class FocalLoss(nn.Module):
    """Focal Loss for handling class imbalance."""
    
    def __init__(self, alpha=None, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
    
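    def forward(self, inputs, targets):
        # Unweighted cross-entropy, so that pt is the model's probability for the true class
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = ((1 - pt) ** self.gamma) * ce_loss
        # Apply per-class weights after the modulating factor
        if self.alpha is not None:
            focal_loss = self.alpha[targets] * focal_loss
        return focal_loss.mean()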

print('✅ FocalLoss defined')
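
Focal loss multiplies the cross-entropy of each sample by $(1 - p_t)^\gamma$, where $p_t$ is the predicted probability of the true class. A minimal numeric sketch, without class weights and with hypothetical probabilities, shows how strongly easy samples are down-weighted:

```python
import math


def focal_term(pt: float, gamma: float = 2.0) -> float:
    """Unweighted focal loss for one sample whose true-class probability is pt."""
    return (1.0 - pt) ** gamma * -math.log(pt)


for pt in (0.9, 0.5, 0.1):
    ce = -math.log(pt)
    print(f'pt={pt:.1f}  CE={ce:.4f}  focal={focal_term(pt):.4f}  ratio={focal_term(pt) / ce:.2f}')
```

With `gamma=2.0` the ratio of focal loss to cross-entropy is 0.01 for a confident correct prediction (`pt=0.9`) and 0.81 for a badly misclassified one (`pt=0.1`), so hard samples dominate the gradient.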

### 6.2 Training and Evaluation Functions

In [None]:
def train_epoch(model, loader, optimizer, criterion, scaler, device):
    """Train for one epoch with mixed precision."""
    model.train()
    total_loss = 0
    all_preds = []
    all_labels = []
    
    for sequences, labels in tqdm(loader, desc='Training', leave=False):
        sequences, labels = sequences.to(device), labels.to(device)
        
        optimizer.zero_grad(set_to_none=True)
        
        with autocast():
            outputs = model(sequences)
            loss = criterion(outputs, labels)
        
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        
        total_loss += loss.item() * sequences.size(0)
        
        preds = outputs.argmax(dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
    
    avg_loss = total_loss / len(loader.dataset)
    accuracy = accuracy_score(all_labels, all_preds)
    
    return avg_loss, accuracy


@torch.no_grad()
def evaluate(model, loader, criterion, device):
    """Evaluate model."""
    model.eval()
    total_loss = 0
    all_preds = []
    all_probs = []
    all_labels = []
    
    for sequences, labels in tqdm(loader, desc='Evaluating', leave=False):
        sequences, labels = sequences.to(device), labels.to(device)
        
        outputs = model(sequences)
        loss = criterion(outputs, labels)
        
        total_loss += loss.item() * sequences.size(0)
        
        probs = F.softmax(outputs, dim=1)
        preds = probs.argmax(dim=1)
        
        all_preds.extend(preds.cpu().numpy())
        all_probs.extend(probs[:, 1].cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
    
    avg_loss = total_loss / len(loader.dataset)
    
    metrics = {
        'loss': avg_loss,
        'accuracy': accuracy_score(all_labels, all_preds),
        'precision': precision_score(all_labels, all_preds, zero_division=0),
        'recall': recall_score(all_labels, all_preds, zero_division=0),
        'f1': f1_score(all_labels, all_preds, zero_division=0),
        'auc': roc_auc_score(all_labels, all_probs) if len(np.unique(all_labels)) > 1 else 0.0,
        'ap': average_precision_score(all_labels, all_probs) if len(np.unique(all_labels)) > 1 else 0.0
    }
    
    return metrics, all_preds, all_probs

print('✅ Training and evaluation functions defined')
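
Accuracy alone says little at fraud rates this low, which is why `evaluate` also reports precision, recall, F1, AUC and AP. A worked example with hypothetical counts, computed by hand with the same `zero_division=0` convention:

```python
# Hypothetical validation set: 990 legitimate sequences, 10 fraudulent ones.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000            # a degenerate model that never flags fraud

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp) if tp + fp else 0.0   # same convention as zero_division=0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy, f1)  # 0.99 0.0
```

A model that never predicts fraud reaches 99% accuracy here while its F1 is zero, so model selection on F1 (as in the training loop) is the safer choice.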

### 6.3 Complete Training Loop

In [None]:
def train_model(model, train_loader, val_loader, model_name, 
                epochs=100, lr=0.001, patience=15, device='cuda'):
    """Complete training loop with early stopping."""
    
    print(f'\n{"="*70}')
    print(f'Training {model_name}')
    print(f'{"="*70}')
    
    # Get class weights from training dataset
    class_weights = train_loader.dataset.get_class_weights().to(device)
    criterion = FocalLoss(alpha=class_weights, gamma=2.5)
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=20, T_mult=2
    )
    scaler = GradScaler()
    
    best_val_f1 = -1.0  # below any real F1, so the first epoch always writes a checkpoint
    patience_counter = 0
    history = {
        'train_loss': [], 'train_acc': [],
        'val_loss': [], 'val_acc': [], 'val_f1': [], 'val_auc': []
    }
    
    start_time = time.time()
    
    for epoch in range(1, epochs + 1):
        # Train
        train_loss, train_acc = train_epoch(
            model, train_loader, optimizer, criterion, scaler, device
        )
        
        # Validate
        val_metrics, _, _ = evaluate(model, val_loader, criterion, device)
        
        # Update scheduler
        scheduler.step()
        
        # Record history
        history['train_loss'].append(train_loss)
        history['train_acc'].append(train_acc)
        history['val_loss'].append(val_metrics['loss'])
        history['val_acc'].append(val_metrics['accuracy'])
        history['val_f1'].append(val_metrics['f1'])
        history['val_auc'].append(val_metrics['auc'])
        
        # Print progress
        if epoch % 5 == 0 or epoch == 1:
            print(f'Epoch {epoch:03d}/{epochs} | '
                  f'Loss: {train_loss:.4f} | '
                  f'Val Acc: {val_metrics["accuracy"]:.4f} | '
                  f'Val F1: {val_metrics["f1"]:.4f} | '
                  f'Val AUC: {val_metrics["auc"]:.4f}')
        
        # Save best model
        if val_metrics['f1'] > best_val_f1:
            best_val_f1 = val_metrics['f1']
            patience_counter = 0
            
            checkpoint_path = MODELS_PATH / f'{model_name}_best.pt'
            torch.save({
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'epoch': epoch,
                'val_metrics': val_metrics
            }, checkpoint_path)
        else:
            patience_counter += 1
        
        # Early stopping
        if patience_counter >= patience:
            print(f'\n⏹️ Early stopping at epoch {epoch}')
            break
    
    train_time = time.time() - start_time
    
    # Load best model and evaluate
    checkpoint = torch.load(MODELS_PATH / f'{model_name}_best.pt', weights_only=False)
    model.load_state_dict(checkpoint['model_state_dict'])
    
    print(f'\n✅ Training complete in {train_time:.2f}s')
    print(f'Best Val F1: {best_val_f1:.4f}')
    
    return {
        'model': model,
        'history': history,
        'train_time': train_time,
        'best_epoch': epoch - patience_counter,
        'best_val_f1': best_val_f1
    }

print('✅ Complete training function defined')
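
The early-stopping bookkeeping can be traced without training anything. This is a sketch on a hypothetical validation curve; the function name `early_stopping_trace` is an assumption, and the best score starts below zero so that the first epoch always counts as an improvement.

```python
from typing import List, Tuple


def early_stopping_trace(val_f1: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, last_epoch_run) for a sequence of validation F1 scores."""
    best_f1, best_epoch, wait = -1.0, 0, 0
    last_epoch = 0
    for epoch, f1 in enumerate(val_f1, start=1):
        last_epoch = epoch
        if f1 > best_f1:          # only a strict improvement resets the counter
            best_f1, best_epoch, wait = f1, epoch, 0
        else:
            wait += 1
        if wait >= patience:      # stop after `patience` epochs without improvement
            break
    return best_epoch, last_epoch


# Hypothetical validation curve that peaks at epoch 3.
scores = [0.10, 0.40, 0.55, 0.50, 0.54, 0.55, 0.30, 0.20]
print(early_stopping_trace(scores, patience=3))  # (3, 6)
```

The result agrees with the `epoch - patience_counter` expression used for `best_epoch` in `train_model`: stopping at epoch 6 with a counter of 3 points back to epoch 3. A tie at epoch 6 does not count as an improvement.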

## 7️⃣ Load and Prepare Data

Load the processed PaySim data and build per-user transaction sequences.

In [None]:
print('='*70)
print('LOADING PAYSIM DATA')
print('='*70)

# Load PaySim data
paysim_file = PROCESSED_PATH / 'paysim_sample_enhanced.csv'

if not paysim_file.exists():
    # Fail fast: every later cell depends on `df`
    raise FileNotFoundError(
        f'PaySim file not found: {paysim_file}. '
        'Please run notebook 01 (data exploration) first.'
    )

# Load data
df = pd.read_csv(paysim_file)
print(f'\n✅ Loaded PaySim data: {df.shape}')
print(f'Fraud rate: {df["isFraud"].mean()*100:.2f}%')
print(f'\nColumns: {list(df.columns)}')

### 7.1 Create Transaction Sequences

In [None]:
# Initialize sequence generator
seq_generator = TransactionSequenceGenerator(
    sequence_length=10,
    stride=5,
    min_transactions=5
)

# Create sequences
sequences, labels, metadata = seq_generator.create_sequences_paysim(df)

# Normalize sequences
sequences_normalized = seq_generator.normalize_sequences(sequences, fit=True)

# Save sequence generator
seq_generator.save(PROCESSED_PATH / 'sequence_generator.pkl')

print(f'\n📊 Sequence Statistics:')
print(f'   Total sequences: {len(sequences_normalized):,}')
print(f'   Sequence shape: {sequences_normalized[0].shape}')
print(f'   Fraud sequences: {sum(labels):,} ({sum(labels)/len(labels)*100:.2f}%)')

### 7.2 Create Train/Val/Test Splits

In [None]:
# Convert to numpy arrays
sequences_np = np.array(sequences_normalized)
labels_np = np.array(labels)

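# Ordered split (80/10/10). Sequences are ordered by user ID, then step (see
# create_sequences_paysim), so this is not a chronological split, and overlapping
# windows from one user can fall on both sides of a boundary.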
n_total = len(sequences_np)
n_train = int(0.8 * n_total)
n_val = int(0.9 * n_total)

train_sequences = sequences_np[:n_train]
train_labels = labels_np[:n_train]

val_sequences = sequences_np[n_train:n_val]
val_labels = labels_np[n_train:n_val]

test_sequences = sequences_np[n_val:]
test_labels = labels_np[n_val:]

print(f'Split sizes:')
print(f'   Train: {len(train_sequences):,} (fraud: {train_labels.sum():,}, {train_labels.mean()*100:.2f}%)')
print(f'   Val:   {len(val_sequences):,} (fraud: {val_labels.sum():,}, {val_labels.mean()*100:.2f}%)')
print(f'   Test:  {len(test_sequences):,} (fraud: {test_labels.sum():,}, {test_labels.mean()*100:.2f}%)')
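
In section 7.1 the scaler is fitted on all sequences before the split, so validation and test statistics influence the training features. A minimal sketch of the alternative, fitting on the training slice only, is shown here with plain NumPy on hypothetical random data (array sizes and the `1e-8` guard are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 sequences, 10 time steps, 4 features.
sequences = rng.normal(loc=5.0, scale=2.0, size=(100, 10, 4))
n_train = int(0.8 * len(sequences))
train, held_out = sequences[:n_train], sequences[n_train:]

# Fit the statistics on training time steps only.
flat_train = train.reshape(-1, train.shape[-1])
mean = flat_train.mean(axis=0)
std = flat_train.std(axis=0) + 1e-8   # guard against constant features

train_scaled = (train - mean) / std
held_out_scaled = (held_out - mean) / std   # reuse the training statistics

print(train_scaled.reshape(-1, 4).mean(axis=0).round(6))
```

With the generator from section 3, the same idea would amount to calling `normalize_sequences(..., fit=True)` on the training sequences and `normalize_sequences(..., fit=False)` on the validation and test sequences.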

### 7.3 Create PyTorch DataLoaders

In [None]:
# Create PyTorch datasets
train_dataset = SequenceDataset(train_sequences, train_labels)
val_dataset = SequenceDataset(val_sequences, val_labels)
test_dataset = SequenceDataset(test_sequences, test_labels)

# Create data loaders
BATCH_SIZE = 256

train_loader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=0,
    pin_memory=torch.cuda.is_available()
)

val_loader = DataLoader(
    val_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=0,
    pin_memory=torch.cuda.is_available()
)

test_loader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=0,
    pin_memory=torch.cuda.is_available()
)

print(f'✅ Data loaders created (batch_size={BATCH_SIZE})')
print(f'   Train batches: {len(train_loader)}')
print(f'   Val batches: {len(val_loader)}')
print(f'   Test batches: {len(test_loader)}')
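
`WeightedRandomSampler` is imported in section 1 but the loaders above rely on shuffling plus a class-weighted loss. As an optional alternative, not the approach this notebook takes, batches can be rebalanced at sampling time. The sketch below uses a hypothetical 990/10 toy dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced dataset: 990 legitimate, 10 fraudulent sequences.
labels = torch.cat([torch.zeros(990, dtype=torch.long), torch.ones(10, dtype=torch.long)])
features = torch.randn(1000, 10, 4)
dataset = TensorDataset(features, labels)

# Per-sample weight = inverse frequency of that sample's class.
class_counts = torch.bincount(labels, minlength=2).float()
sample_weights = (1.0 / class_counts)[labels]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(dataset, batch_size=256, sampler=sampler)  # a sampler cannot be combined with shuffle=True

drawn = torch.cat([batch_labels for _, batch_labels in loader])
print(f'Fraud share in one resampled epoch: {drawn.float().mean().item():.2f}')
```

The fraud share per epoch comes out close to one half. Combining this sampler with a class-weighted loss would correct for the imbalance twice, so in practice one of the two is usually enough.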

## 8️⃣ Train All Models

Train the three sequence architectures (two LSTM-based, one GRU-based) and compare their test metrics.

In [None]:
input_size = train_sequences.shape[2]  # Number of features
print(f'Input size (features per time step): {input_size}')

# Model configurations
models_config = [
    {
        'name': 'BiLSTM_Attention',
        'model': BiLSTMWithAttention(input_size, hidden_size=128, num_layers=3, dropout=0.3),
        'lr': 0.001,
        'epochs': 100
    },
    {
        'name': 'ResidualGRU',
        'model': ResidualGRU(input_size, hidden_size=128, num_layers=4, dropout=0.3),
        'lr': 0.001,
        'epochs': 100
    },
    {
        'name': 'LSTM_CNN_Hybrid',
        'model': LSTMCNN(input_size, hidden_size=128, num_layers=2, dropout=0.3),
        'lr': 0.001,
        'epochs': 100
    }
]

results = {}

for config in models_config:
    model_name = config['name']
    model = config['model'].to(device)
    
    # Count parameters
    num_params = sum(p.numel() for p in model.parameters())
    print(f'\n{model_name}: {num_params:,} parameters')
    
    # Train model
    result = train_model(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        model_name=model_name,
        epochs=config['epochs'],
        lr=config['lr'],
        patience=15,
        device=device
    )
    
    # Evaluate on test set
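    # map_location keeps loading robust if the checkpoint was saved on a different device
    checkpoint = torch.load(MODELS_PATH / f'{model_name}_best.pt', map_location=device, weights_only=False)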
    model.load_state_dict(checkpoint['model_state_dict'])
    
    test_metrics, test_preds, test_probs = evaluate(
        model, test_loader, 
        FocalLoss(alpha=train_dataset.get_class_weights().to(device), gamma=2.5),
        device
    )
    
    print(f'\n📊 {model_name} Test Results:')
    print(f'   Accuracy:  {test_metrics["accuracy"]:.4f}')
    print(f'   Precision: {test_metrics["precision"]:.4f}')
    print(f'   Recall:    {test_metrics["recall"]:.4f}')
    print(f'   F1 Score:  {test_metrics["f1"]:.4f}')
    print(f'   AUC:       {test_metrics["auc"]:.4f}')
    print(f'   AP:        {test_metrics["ap"]:.4f}')
    
    results[model_name] = {
        'test_metrics': test_metrics,
        'history': result['history'],
        'train_time': result['train_time'],
        'best_epoch': result['best_epoch'],
        'num_parameters': num_params,
        'predictions': {
            'preds': test_preds,
            'probs': test_probs
        }
    }
    
    # Clear memory
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()

print('\n' + '='*70)
print('✅ ALL MODELS TRAINED')
print('='*70)

## 9️⃣ Results Analysis and Visualization

### 9.1 Model Comparison Table

In [None]:
# Create results dataframe
results_data = []
for model_name, model_results in results.items():
    metrics = model_results['test_metrics']
    results_data.append({
        'Model': model_name,
        'Accuracy': metrics['accuracy'],
        'Precision': metrics['precision'],
        'Recall': metrics['recall'],
        'F1': metrics['f1'],
        'AUC': metrics['auc'],
        'AP': metrics['ap'],
        'Train Time (s)': model_results['train_time'],
        'Epochs': model_results['best_epoch'],
        'Parameters': model_results['num_parameters']
    })

results_df = pd.DataFrame(results_data)
results_df = results_df.sort_values('F1', ascending=False)

print('📊 Model Comparison:')
print(results_df.to_string(index=False))

# Best model
best_model = results_df.iloc[0]['Model']
best_f1 = results_df.iloc[0]['F1']
print(f'\n🏆 Best Model: {best_model} (F1 = {best_f1:.4f})')

### 9.2 Training Curves
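
The `patience=15` argument and the stored `best_epoch` suggest that `train_model` uses patience-based early stopping. Its implementation is not shown in this section, so the sketch below is a generic version of the technique, not the notebook's code. The monitored metric (validation F1 here) and the function name are assumptions.

```python
def early_stopping_epoch(val_scores, patience=15):
    """Return (best_epoch, stop_epoch) for a higher-is-better metric.

    Training stops once `patience` consecutive epochs fail to beat the
    best score seen so far. Epochs are 0-indexed. Generic sketch, not the
    notebook's train_model.
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_scores) - 1


# Made-up validation F1 curve: improves for four epochs, then plateaus.
curve = [0.50, 0.58, 0.63, 0.66, 0.65, 0.66, 0.64, 0.65]
best, stopped = early_stopping_epoch(curve, patience=3)
print(best, stopped)
```

If training stops early, the curves plotted below can have different lengths per model, and the best checkpoint sits `patience` epochs before the end of each line.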

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Training Loss
for model_name in results.keys():
    axes[0, 0].plot(results[model_name]['history']['train_loss'], 
                   label=model_name, linewidth=2)
axes[0, 0].set_title('Training Loss', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Validation Accuracy
for model_name in results.keys():
    axes[0, 1].plot(results[model_name]['history']['val_acc'], 
                   label=model_name, linewidth=2)
axes[0, 1].set_title('Validation Accuracy', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Accuracy')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Validation F1
for model_name in results.keys():
    axes[1, 0].plot(results[model_name]['history']['val_f1'], 
                   label=model_name, linewidth=2)
axes[1, 0].set_title('Validation F1 Score', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('F1 Score')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Validation AUC
for model_name in results.keys():
    axes[1, 1].plot(results[model_name]['history']['val_auc'], 
                   label=model_name, linewidth=2)
axes[1, 1].set_title('Validation AUC', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Epoch')
axes[1, 1].set_ylabel('AUC')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(RESULTS_PATH / 'lstm_training_curves.png', dpi=150, bbox_inches='tight')
plt.show()

print(f'✅ Training curves saved to: {RESULTS_PATH / "lstm_training_curves.png"}')

### 9.3 Performance Comparison Charts
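
The grouped bar chart in the next cell places bar `i` of each model at `x + i*width` and centers the tick label at `x + width * 2`. That constant is `(n_metrics - 1) / 2 * width` for the five plotted metrics. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def grouped_bar_positions(n_groups, n_bars, width):
    """Bar centers per group, plus the tick position under the middle
    of each group (illustrative helper)."""
    centers = [[g + i * width for i in range(n_bars)] for g in range(n_groups)]
    ticks = [g + width * (n_bars - 1) / 2 for g in range(n_groups)]
    return centers, ticks


centers, ticks = grouped_bar_positions(n_groups=3, n_bars=5, width=0.15)
print([round(t, 2) for t in ticks])
```

With an odd number of bars the tick lands on the middle bar. If metrics are added or removed from `metrics_to_plot`, the tick offset and `width` need to change together so that `n_bars * width` stays below 1 and neighbouring groups do not overlap.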

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Metrics comparison
metrics_to_plot = ['Accuracy', 'Precision', 'Recall', 'F1', 'AUC']
x = np.arange(len(results_df))
width = 0.15

for i, metric in enumerate(metrics_to_plot):
    axes[0].bar(x + i*width, results_df[metric], width, 
               label=metric, alpha=0.8)

axes[0].set_xlabel('Model', fontsize=12)
axes[0].set_ylabel('Score', fontsize=12)
axes[0].set_title('Test Set Performance Comparison', fontsize=14, fontweight='bold')
axes[0].set_xticks(x + width * 2)
axes[0].set_xticklabels(results_df['Model'], rotation=45, ha='right')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].set_ylim([0, 1])

# Training time
axes[1].bar(results_df['Model'], results_df['Train Time (s)'], 
           color='steelblue', alpha=0.8)
axes[1].set_xlabel('Model', fontsize=12)
axes[1].set_ylabel('Time (seconds)', fontsize=12)
axes[1].set_title('Training Time Comparison', fontsize=14, fontweight='bold')
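# Rotate the existing categorical tick labels; calling set_xticklabels without set_xticks can emit a warning
plt.setp(axes[1].get_xticklabels(), rotation=45, ha='right')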
axes[1].grid(True, alpha=0.3, axis='y')

for i, v in enumerate(results_df['Train Time (s)']):
    axes[1].text(i, v + 0.5, f'{v:.1f}s', ha='center', va='bottom')

plt.tight_layout()
plt.savefig(RESULTS_PATH / 'lstm_performance_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print(f'✅ Performance comparison saved to: {RESULTS_PATH / "lstm_performance_comparison.png"}')

### 9.4 Confusion Matrices
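
The heatmaps below show raw counts. As a reading aid, this sketch derives precision, recall, and F1 for the fraud class from a 2x2 matrix in the scikit-learn layout (rows are true labels, columns are predictions). The counts are made up and the helper name is hypothetical.

```python
def metrics_from_confusion(cm):
    """Derive precision, recall, and F1 for the positive (fraud) class.

    cm uses the scikit-learn layout for labels [0, 1]:
        [[TN, FP],
         [FN, TP]]
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only.
toy_cm = [[9500, 100],
          [40, 360]]
m = metrics_from_confusion(toy_cm)
print({k: round(v, 4) for k, v in m.items()})
```

Accuracy is dominated by the top-left cell when fraud is rare, which is why the comparison table is sorted by F1 rather than accuracy.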

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, model_name in enumerate(results.keys()):
    preds = results[model_name]['predictions']['preds']
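    # Fix the label order so the matrix is always 2x2, even if a model predicts a single class
    cm = confusion_matrix(test_labels, preds, labels=[0, 1])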
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
               xticklabels=['Legit', 'Fraud'],
               yticklabels=['Legit', 'Fraud'],
               ax=axes[idx])
    axes[idx].set_title(f'{model_name}\nConfusion Matrix', fontweight='bold')
    axes[idx].set_ylabel('True Label')
    axes[idx].set_xlabel('Predicted Label')

plt.tight_layout()
plt.savefig(RESULTS_PATH / 'lstm_confusion_matrices.png', dpi=150, bbox_inches='tight')
plt.show()

print(f'✅ Confusion matrices saved to: {RESULTS_PATH / "lstm_confusion_matrices.png"}')

## 🔟 Extract Embeddings for Fusion Model

This is a **critical step** for Notebook 05 where we'll combine GNN and LSTM embeddings.
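
The cell below stores embeddings in a nested dictionary, `embeddings_dict[model_name][split_name] = {'embeddings': ..., 'labels': ...}`, and pickles it. How Notebook 05 loads the file is not shown here, so the following is only a sketch of a round trip with a basic alignment check. The toy data and the file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in with the same nesting as embeddings_dict; real entries are
# NumPy arrays, plain lists are used here to keep the sketch dependency-free.
toy_embeddings = {
    "BiLSTM_Attention": {
        "train": {"embeddings": [[0.1, 0.2], [0.3, 0.4]], "labels": [0, 1]},
        "test": {"embeddings": [[0.5, 0.6]], "labels": [0]},
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_embeddings.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(toy_embeddings, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Every split should have exactly one embedding row per label.
for model_name, splits in loaded.items():
    for split_name, data in splits.items():
        assert len(data["embeddings"]) == len(data["labels"]), (model_name, split_name)
        print(model_name, split_name, len(data["labels"]))
```

A row-count check like this catches truncated splits but not reordered ones, so fusion with another model's embeddings also depends on both notebooks iterating the samples in the same order.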

In [None]:
embeddings_dict = {}

for model_name in results.keys():
    print(f'\nExtracting embeddings from {model_name}...')
    
    # Load best model
    if model_name == 'BiLSTM_Attention':
        model = BiLSTMWithAttention(input_size, hidden_size=128, num_layers=3).to(device)
    elif model_name == 'ResidualGRU':
        model = ResidualGRU(input_size, hidden_size=128, num_layers=4).to(device)
    else:  # LSTM_CNN_Hybrid
        model = LSTMCNN(input_size, hidden_size=128, num_layers=2).to(device)
    
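    # map_location keeps loading robust if the checkpoint was saved on a different device
    checkpoint = torch.load(MODELS_PATH / f'{model_name}_best.pt', map_location=device, weights_only=False)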
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()
    
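    # Extract embeddings for all splits. train_loader shuffles, so use an unshuffled
    # loader here to keep train embeddings in the original sequence order.
    train_eval_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=False)
    for split_name, loader in [('train', train_eval_loader), 
                               ('val', val_loader), 
                               ('test', test_loader)]: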
        embeddings_list = []
        labels_list = []
        
        with torch.no_grad():
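            # Use batch_* names so the notebook-level `sequences` and `labels` are not overwritten
            for batch_sequences, batch_labels in tqdm(loader, desc=f'Extracting {split_name}', leave=False):
                batch_sequences = batch_sequences.to(device)
                embeddings = model.extract_embeddings(batch_sequences)
                embeddings_list.append(embeddings.cpu())
                labels_list.append(batch_labels)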
        
        embeddings_all = torch.cat(embeddings_list).numpy()
        labels_all = torch.cat(labels_list).numpy()
        
        if model_name not in embeddings_dict:
            embeddings_dict[model_name] = {}
        
        embeddings_dict[model_name][split_name] = {
            'embeddings': embeddings_all,
            'labels': labels_all
        }
        
        print(f'   {split_name}: {embeddings_all.shape}')

# Save embeddings
embeddings_path = PROCESSED_PATH / 'lstm_embeddings.pkl'
with open(embeddings_path, 'wb') as f:
    pickle.dump(embeddings_dict, f)

print(f'\n✅ Embeddings saved to: {embeddings_path}')
print(f'   These embeddings will be used in Notebook 05 for fusion with GNN embeddings')

## 1️⃣1️⃣ Save Results and Metadata

In [None]:
# Save results summary
results_summary = {
    'models': results_df.to_dict('records'),
    'best_model': {
        'name': best_model,
        'f1_score': float(best_f1),
        'metrics': {k: float(v) for k, v in results[best_model]['test_metrics'].items()}
    },
    'sequence_config': {
        'sequence_length': seq_generator.sequence_length,
        'stride': seq_generator.stride,
        'input_size': input_size
    },
    'dataset_info': {
        'train_size': len(train_sequences),
        'val_size': len(val_sequences),
        'test_size': len(test_sequences),
        'fraud_rate_train': float(train_labels.mean()),
        'fraud_rate_val': float(val_labels.mean()),
        'fraud_rate_test': float(test_labels.mean())
    }
}

results_file = RESULTS_PATH / 'lstm_results.json'
with open(results_file, 'w') as f:
    json.dump(results_summary, f, indent=2)

print(f'✅ Results saved to: {results_file}')

# Save training histories
histories = {
    model_name: {
        k: [float(x) for x in v] 
        for k, v in result['history'].items()
    }
    for model_name, result in results.items()
}

history_file = RESULTS_PATH / 'lstm_training_histories.json'
with open(history_file, 'w') as f:
    json.dump(histories, f, indent=2)

print(f'✅ Training histories saved to: {history_file}')

## 📊 Final Summary and Next Steps

In [None]:
print('='*70)
print('🎉 LSTM SEQUENCE MODELS - COMPLETE SUMMARY')
print('='*70)

print(f'\n📊 Trained Models: {len(results)}')
for model_name in results.keys():
    metrics = results[model_name]['test_metrics']
    print(f'\n   {model_name}:')
    print(f'      Accuracy:  {metrics["accuracy"]:.4f}')
    print(f'      Precision: {metrics["precision"]:.4f}')
    print(f'      Recall:    {metrics["recall"]:.4f}')
    print(f'      F1 Score:  {metrics["f1"]:.4f}')
    print(f'      AUC:       {metrics["auc"]:.4f}')
    print(f'      AP:        {metrics["ap"]:.4f}')

print(f'\n🏆 Best Model: {best_model}')
print(f'   F1 Score: {best_f1:.4f}')

print(f'\n📁 Generated Outputs:')
print(f'   ✅ Trained models: {MODELS_PATH}')
print(f'   ✅ Training results: {RESULTS_PATH}')
print(f'   ✅ Embeddings for fusion: {embeddings_path}')
print(f'   ✅ Visualizations: {RESULTS_PATH}')

print('\n📝 Next Steps:')
print('   1️⃣ Run Notebook 05: Hybrid Fusion Model')
print('   2️⃣ Combine GNN embeddings (from Notebook 03) + LSTM embeddings')
print('   3️⃣ Train multi-modal fusion classifier')
print('   4️⃣ Compare fusion model with individual models')
print('   5️⃣ Report the fusion model\'s final fraud detection performance')

print('\n' + '='*70)
print('✅ NOTEBOOK 04 COMPLETE - READY FOR FUSION!')
print('='*70)