# Day 5: LSTM with Attention for Fraud Detection

**Sequence Modeling for Transaction Fraud Detection**

## Overview
- **Objective**: Detect fraud using sequential transaction patterns
- **Architecture**: LSTM with Attention Mechanism
- **Advantage**: Captures temporal dependencies in transactions

## What You'll Learn
1. **Sequence Preparation**: Organize transactions into sequences
2. **LSTM Architecture**: Long Short-Term Memory networks
3. **Attention Mechanism**: Focus on important transactions
4. **Training**: PyTorch implementation with variable-length sequences

---

## 1. Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print("✅ Libraries imported!")

## 2. Generate Sequential Transaction Data

In [None]:
# Generate sequential transaction data for multiple users
np.random.seed(42)

def generate_user_transactions(user_id, n_transactions=50, fraud_prob=0.1):
    """
    Generate a sequence of transactions for a single user.
    """
    # Base behavior
    amounts = np.random.lognormal(3, 0.5, n_transactions)
    hours = np.random.randint(0, 24, n_transactions)
    merchants = np.random.randint(0, 20, n_transactions)
    
    # Labels (0 = legitimate, 1 = fraud)
    labels = np.zeros(n_transactions)
    
    # Add fraud patterns
    n_fraud = int(n_transactions * fraud_prob)
    if n_fraud > 0:
        fraud_indices = np.random.choice(n_transactions, n_fraud, replace=False)
        labels[fraud_indices] = 1
        # Fraud characteristics: higher amounts, unusual hours
        amounts[fraud_indices] *= np.random.uniform(2, 5, n_fraud)
        hours[fraud_indices] = np.random.choice([1, 2, 3, 22, 23], n_fraud)
    
    # Create DataFrame
    df = pd.DataFrame({
        'user_id': user_id,
        'sequence_pos': range(n_transactions),
        'amount': amounts,
        'hour': hours,
        'merchant': merchants,
        'label': labels
    })
    
    return df

# Generate data for multiple users
n_users = 200
sequences = []
sequence_labels = []

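for user_id in range(n_users):
    seq_len = np.random.randint(20, 50)
    # Only some users experience fraud. With fraud_prob=0.15 for everyone, every
    # sequence of 20+ transactions contains fraud, so all labels would be 1 and
    # AUC would be undefined. The 30% share of fraud users is an arbitrary choice.
    user_fraud_prob = 0.15 if np.random.rand() < 0.3 else 0.0
    user_data = generate_user_transactions(user_id, seq_len, fraud_prob=user_fraud_prob)
    sequences.append(user_data[['amount', 'hour', 'merchant']].values)
    # Sequence label: 1 if any fraud in sequence
    sequence_labels.append(int(user_data['label'].max() == 1))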

print(f"Generated {len(sequences)} transaction sequences")
print(f"Sequence lengths: {[len(s) for s in sequences[:5]]}...")
print(f"\nSequence label distribution: {np.bincount(sequence_labels)}")
print(f"Fraud rate: {np.mean(sequence_labels)*100:.1f}%")
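
The synthetic generator above already yields one array per user. Real transaction data usually arrives as a flat log, so a grouping step is needed first. A minimal sketch using only the standard library, assuming hypothetical record fields `user_id`, `timestamp`, `amount`, `hour`, and `merchant` (the `timestamp` field and the `group_into_sequences` helper are illustrative and not part of this notebook):

```python
from collections import defaultdict

def group_into_sequences(records, feature_keys=("amount", "hour", "merchant")):
    """Group a flat transaction log into per-user, time-ordered feature lists.

    `records` is an iterable of dicts; the field names are illustrative assumptions.
    """
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec["user_id"]].append(rec)

    grouped = {}
    for user_id, recs in by_user.items():
        recs.sort(key=lambda r: r["timestamp"])  # oldest transaction first
        grouped[user_id] = [[r[k] for k in feature_keys] for r in recs]
    return grouped

# Toy log, deliberately out of order
log = [
    {"user_id": "a", "timestamp": 2, "amount": 30.0, "hour": 14, "merchant": 3},
    {"user_id": "b", "timestamp": 1, "amount": 12.5, "hour": 9, "merchant": 7},
    {"user_id": "a", "timestamp": 1, "amount": 18.0, "hour": 10, "merchant": 3},
]
toy_sequences = group_into_sequences(log)
print(toy_sequences["a"])  # [[18.0, 10, 3], [30.0, 14, 3]]
```

Each value in the result is a list of `[amount, hour, merchant]` rows, the same per-user layout that `sequences` holds above.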

## 3. LSTM with Attention Architecture

In [None]:
class LSTMAttentionClassifier(nn.Module):
    """
    LSTM with multi-head attention for sequence classification.
    
    Architecture:
    1. LSTM processes sequential transactions
    2. Attention weights important transactions
    3. Final prediction based on attended representation
    """
    
    def __init__(self, input_dim=3, hidden_dim=64, num_layers=2, 
                 num_heads=4, dropout=0.3, bidirectional=True):
        super().__init__()
        
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.bidirectional = bidirectional
        lstm_output_dim = hidden_dim * 2 if bidirectional else hidden_dim
        
        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=bidirectional,
            batch_first=True
        )
        
        # Multi-head attention
        self.attention = nn.MultiheadAttention(
            embed_dim=lstm_output_dim,
            num_heads=num_heads,
            dropout=dropout,
            batch_first=True
        )
        
        # Layer normalization
        self.layer_norm = nn.LayerNorm(lstm_output_dim)
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(lstm_output_dim, lstm_output_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(lstm_output_dim // 2, 1),
            nn.Sigmoid()
        )
    
    def forward(self, padded_sequences, lengths, return_attention=False):
        """
        Forward pass.
        
        Args:
            padded_sequences: (batch_size, max_seq_len, input_dim)
            lengths: (batch_size,) actual sequence lengths
            return_attention: If True, return attention weights
            
        Returns:
            predictions: (batch_size, 1) fraud probabilities
        """
        batch_size, max_seq_len, _ = padded_sequences.shape
        
        # Sort by length (descending) for pack_padded_sequence
        sorted_lengths, sorted_indices = torch.sort(lengths, descending=True)
        sorted_sequences = padded_sequences[sorted_indices]
        
        # Pack sequences
        packed_input = nn.utils.rnn.pack_padded_sequence(
            sorted_sequences, sorted_lengths.cpu(), batch_first=True
        )
        
        # LSTM forward
        packed_output, (hidden, cell) = self.lstm(packed_input)
        
        # Unpack
        lstm_output, _ = nn.utils.rnn.pad_packed_sequence(
            packed_output, batch_first=True, total_length=max_seq_len
        )
        
        # Create attention mask
        attention_mask = torch.arange(max_seq_len, device=lengths.device)[None, :] >= sorted_lengths[:, None]
        
        # Apply attention
        attended_output, attention_weights = self.attention(
            lstm_output, lstm_output, lstm_output,
            key_padding_mask=attention_mask, need_weights=True
        )
        
        # Layer norm and residual
        attended_output = self.layer_norm(attended_output + lstm_output)
        
        # Extract final representation (last valid transaction)
        last_indices = (sorted_lengths - 1).clamp(min=0)
        batch_indices = torch.arange(batch_size, device=lstm_output.device)
        final_output = attended_output[batch_indices, last_indices]
        
        # Unsort
        _, unsorted_indices = torch.sort(sorted_indices)
        final_output = final_output[unsorted_indices]
        
        # Classification
        predictions = self.classifier(final_output)
        
        if return_attention:
            attention_weights = attention_weights[unsorted_indices]
            return predictions, attention_weights
        
        return predictions

# Create model
model = LSTMAttentionClassifier(
    input_dim=3,
    hidden_dim=64,
    num_layers=2,
    num_heads=2,
    dropout=0.3
).to(device)

print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
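
The padding mask built in `forward` relies on broadcasting: a row of positions is compared against a column of lengths, and `True` marks padded positions that attention should ignore. A small sketch of the same expression, written with NumPy instead of PyTorch so it runs standalone (the lengths are made-up values):

```python
import numpy as np

lengths = np.array([3, 1])   # hypothetical valid lengths for a batch of two
max_seq_len = 4

# (1, max_seq_len) >= (batch, 1) broadcasts to (batch, max_seq_len)
padding_mask = np.arange(max_seq_len)[None, :] >= lengths[:, None]
print(padding_mask.tolist())
# [[False, False, False, True], [False, True, True, True]]
```

The first sequence has three real steps, so only position 3 is masked; the second has one real step, so positions 1 to 3 are masked.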

## 4. Dataset and DataLoader

In [None]:
class TransactionSequenceDataset(Dataset):
    """Dataset for variable-length transaction sequences."""
    
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels
    
    def __len__(self):
        return len(self.sequences)
    
    def __getitem__(self, idx):
        sequence = torch.tensor(self.sequences[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.float32)
        return sequence, label

def collate_fn(batch):
    """
    Custom collate function to handle variable-length sequences.
    
    Pads sequences to the max length in the batch.
    """
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    
    # Pad sequences
    max_len = max(lengths).item()
    padded = torch.zeros(len(sequences), max_len, sequences[0].shape[1])
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq
    
    labels = torch.stack(labels)  # tuple of 0-d tensors -> (batch_size,) tensor
    
    return padded, labels, lengths

# Split data
train_size = int(0.8 * len(sequences))
train_sequences, test_sequences = sequences[:train_size], sequences[train_size:]
train_labels, test_labels = sequence_labels[:train_size], sequence_labels[train_size:]

# Create datasets
train_dataset = TransactionSequenceDataset(train_sequences, train_labels)
test_dataset = TransactionSequenceDataset(test_sequences, test_labels)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False, collate_fn=collate_fn)

print(f"Train samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Batch size: 16")
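
The split above takes the first 80% of users for training. With few fraud sequences, a plain slice can leave the test set with very few positives, which makes AUC noisy. One option is a stratified split that keeps the class ratio similar on both sides. A minimal standard-library sketch, where `stratified_split` is a hypothetical helper (scikit-learn's `train_test_split` with its `stratify` argument is a library alternative):

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split by `test_fraction`."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))  # at least one per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [0] * 16 + [1] * 4              # 20% positives
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))         # 16 4
print(sum(toy_labels[i] for i in test_idx))  # 1
```

The returned index lists could then be used to select entries from `sequences` and `sequence_labels` before building the datasets.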

## 5. Training Loop

In [None]:
# Training setup
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

n_epochs = 20
train_losses = []
train_aucs = []

# Training loop
for epoch in range(n_epochs):
    model.train()
    epoch_loss = 0
    all_preds = []
    all_labels = []
    
    for padded_sequences, labels, lengths in train_loader:
        padded_sequences = padded_sequences.to(device)
        labels = labels.to(device).unsqueeze(1)
        lengths = lengths.to(device)
        
        # Forward
        predictions = model(padded_sequences, lengths)
        loss = criterion(predictions, labels)
        
        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        all_preds.extend(predictions.detach().cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
    
    avg_loss = epoch_loss / len(train_loader)
    auc = roc_auc_score(all_labels, all_preds)
    
    train_losses.append(avg_loss)
    train_aucs.append(auc)
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}/{n_epochs}: Loss={avg_loss:.4f}, AUC={auc:.3f}")

print("\nTraining completed!")

## 6. Training Progress Visualization

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(train_losses, 'o-', linewidth=2, markersize=4)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Binary Cross-Entropy Loss', fontsize=12)
axes[0].set_title('Training Loss', fontsize=14)
axes[0].grid(True, alpha=0.3)

# AUC
axes[1].plot(train_aucs, 'o-', linewidth=2, markersize=4, color='green')
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('AUC-ROC', fontsize=12)
axes[1].set_title('Training AUC', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final Training Loss: {train_losses[-1]:.4f}")
print(f"Final Training AUC: {train_aucs[-1]:.3f}")

## 7. Evaluation on Test Set

In [None]:
# Evaluate
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for padded_sequences, labels, lengths in test_loader:
        padded_sequences = padded_sequences.to(device)
        lengths = lengths.to(device)
        
        predictions = model(padded_sequences, lengths)
        all_preds.extend(predictions.cpu().numpy())
        all_labels.extend(labels.numpy())

all_preds = np.array(all_preds).flatten()
all_labels = np.array(all_labels).astype(int)  # integer labels to match predictions_binary

# Metrics
auc = roc_auc_score(all_labels, all_preds)
predictions_binary = (all_preds > 0.5).astype(int)

print("="*60)
print("LSTM ATTENTION MODEL - TEST RESULTS")
print("="*60)
print(f"\nAUC-ROC: {auc:.3f}")
print(f"Accuracy: {np.mean(predictions_binary == all_labels)*100:.1f}%")

print("\nClassification Report:")
print(classification_report(all_labels, predictions_binary, 
                          target_names=['Legitimate', 'Fraud']))

# Confusion Matrix
cm = confusion_matrix(all_labels, predictions_binary)
print("\nConfusion Matrix:")
print(cm)
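
scikit-learn's `confusion_matrix` puts true classes on rows and predicted classes on columns, so for binary labels the layout is `[[TN, FP], [FN, TP]]`. The fraud-class precision and recall in the classification report follow directly from those four counts. A worked sketch with made-up counts (not results from this notebook):

```python
# Hypothetical confusion matrix in scikit-learn's layout: [[TN, FP], [FN, TP]]
cm_example = [[25, 3],
              [2, 10]]
(tn, fp), (fn, tp) = cm_example

precision = tp / (tp + fp)   # of the sequences flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of the fraud sequences, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.769 recall=0.833 f1=0.800
```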

## 8. Attention Visualization

In [None]:
# Extract attention weights for a sample
model.eval()
sample_sequence, sample_label = test_dataset[0]  # dataset items are (sequence, label)
sample_len = torch.tensor([sample_sequence.shape[0]], device=device)
sample_sequence = sample_sequence.unsqueeze(0).to(device)

with torch.no_grad():
    pred, attention = model(sample_sequence, sample_len, return_attention=True)

# Plot attention
# nn.MultiheadAttention averages the weights over heads by default,
# so attention has shape (batch, seq_len, seq_len)
avg_attention = attention[0].cpu().numpy()

# Only plot actual transactions (not padding)
actual_len = sample_len.item()
plt.figure(figsize=(10, 8))
sns.heatmap(avg_attention[:actual_len, :actual_len], 
            cmap='YlOrRd', 
            xticklabels=range(1, actual_len+1),
            yticklabels=range(1, actual_len+1))
plt.xlabel('Key Position', fontsize=12)
plt.ylabel('Query Position', fontsize=12)
plt.title(f'Attention Map (Sample Sequence, Label={sample_label.item()})', fontsize=14)
plt.tight_layout()
plt.show()

print(f"\nPredicted fraud probability: {pred.item():.3f}")
print(f"Actual label: {sample_label.item()}")
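
The heatmap can be hard to read for long sequences. One way to condense it is to average each column, which gives the attention a transaction receives across all query positions, and then rank transactions by that score. A sketch with a small made-up attention matrix (rows are queries and sum to 1; the values are illustrative, not model output):

```python
import numpy as np

# Hypothetical (seq_len, seq_len) attention matrix; each row sums to 1
attn = np.array([
    [0.20, 0.70, 0.10],
    [0.30, 0.50, 0.20],
    [0.25, 0.60, 0.15],
])

received = attn.mean(axis=0)            # average attention each key position receives
ranking = np.argsort(received)[::-1]    # positions from most to least attended

print(received.round(2).tolist())  # [0.25, 0.6, 0.15]
print(ranking.tolist())            # [1, 0, 2]
```

A ranking like this is best treated as a rough signal of which transactions the model weighted, not as a complete explanation of the prediction.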

## 9. Summary

### LSTM + Attention for Sequence Classification:

1. **Sequential Modeling**: LSTM captures temporal dependencies
   - Processes transactions in order
   - Maintains hidden state across sequence

2. **Attention Mechanism**: Focuses on important transactions
   - Learns which transactions matter most
   - Provides interpretability via attention weights

3. **Variable-Length Sequences**: Handles different transaction counts
   - Padding to max length in batch
   - Packing for efficiency
   - Masking to ignore padding
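
The padding and packing steps can be seen in isolation on a toy batch. A minimal sketch using PyTorch's `pad_sequence`, `pack_padded_sequence`, and `pad_packed_sequence` (the toy tensors are made up, and `pad_sequence` stands in for the manual padding loop in `collate_fn`):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Two toy sequences with 3 and 1 time steps, 2 features each
seqs = [torch.ones(3, 2), torch.full((1, 2), 5.0)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)   # zero-pads to the longest sequence
print(padded.shape)                             # torch.Size([2, 3, 2])

# Packing drops the padded steps so an LSTM never processes them
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
print(packed.data.shape)                        # torch.Size([4, 2]): 3 + 1 real steps

# Unpacking restores the padded layout and returns the original lengths
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)
print(out_lengths.tolist())                     # [3, 1]
```

With `enforce_sorted=False`, PyTorch sorts and restores the batch order internally, so a manual sort and unsort like the one in `forward` is optional.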

### Key Takeaways:

- ✅ **LSTMs excel** at sequence modeling for fraud detection
- ✅ **Attention mechanism** improves interpretability
- ✅ **Variable-length handling** is crucial for real-world data
- ✅ **Training requires** careful handling of padding/masking

### Advantages:

- Captures temporal patterns (burst transactions, time-based fraud)
- Handles variable-length sequences naturally
- Attention weights provide interpretability

### Disadvantages:

- More complex than traditional models
- Requires more data to train effectively
- Slower inference than simple models

### Next Steps:
→ **Day 6**: Anomaly Detection (unsupervised methods)
→ **Day 7**: Model Explainability (SHAP, LIME)

---

**📁 Project Location**: `01_fraud_detection_core/lstm_fraud_detection/`