# EpiBERT Complete Workflow Example

This notebook walks through the complete EpiBERT workflow, from environment setup and data processing to model training, evaluation, and prediction.

## Overview

The EpiBERT workflow consists of:
1. **Environment Setup**: Install dependencies and validate tools
2. **Data Processing**: Convert BAM files to model-ready format
3. **Model Training**: Train pretraining or fine-tuning models
4. **Model Evaluation**: Comprehensive performance assessment

This example uses the **PyTorch Lightning implementation** which provides:
- Exact parameter matching with original TensorFlow models
- Modern training features (mixed precision, gradient clipping)
- Better hardware utilization
- Simplified training loops

## Step 1: Environment Setup

In [None]:
import sys
import os
from pathlib import Path

# Add repository root to path (this notebook is expected to live in example_usage/)
repo_root = Path.cwd().parent if Path.cwd().name == 'example_usage' else Path.cwd()
sys.path.insert(0, str(repo_root))

print(f"Repository root: {repo_root}")
print(f"Current working directory: {Path.cwd()}")

In [None]:
# Validate environment
!{repo_root}/setup_environment.sh --lightning --validate-only
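
The validation script is not shown in this notebook, so the following is only a minimal sketch of the kind of check it might perform: confirming that command-line tools are on `PATH`. MACS2 is named later in this notebook; `samtools` and `bedtools` are assumptions about what an ATAC-seq pipeline typically needs, and `find_missing_tools` is a hypothetical helper.

```python
import shutil

# Hypothetical tool list: MACS2 is named later in this notebook; the others
# are assumptions about what an ATAC-seq pipeline typically needs.
REQUIRED_TOOLS = ["samtools", "bedtools", "macs2"]

def find_missing_tools(tools):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print(f"Missing tools: {', '.join(missing)}")
else:
    print("All required tools found on PATH")
```

Running a check like this before the pipeline starts surfaces a missing binary immediately rather than midway through processing.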

## Step 2: Configuration Setup

EpiBERT uses YAML configuration files to manage all parameters. Let's create example configurations.

In [None]:
import yaml

# Data processing configuration
data_config = {
    'input': {
        'sample_name': 'demo_sample',
        'atac_bam': 'data/raw/demo.atac.bam',
        'rampage_bam': 'data/raw/demo.rampage.bam'  # optional for fine-tuning
    },
    'reference': {
        'genome_fasta': 'reference/hg38.fa',
        'chrom_sizes': 'reference/hg38.chrom.sizes',
        'blacklist': 'reference/hg38-blacklist.v2.bed',
        'motif_database': 'reference/JASPAR2022_CORE_vertebrates.meme'
    },
    'output': {
        'base_dir': 'data/processed'
    },
    'processing': {
        'peak_calling': {
            'qvalue': 0.01,
            'shift': -75,
            'extsize': 150
        },
        'signal_tracks': {
            'bin_size': 128,
            'normalize': True
        }
    }
}

# Save data configuration
with open(repo_root / 'demo_data_config.yaml', 'w') as f:
    yaml.dump(data_config, f, default_flow_style=False)

print("Data configuration saved to demo_data_config.yaml")
print(yaml.dump(data_config, default_flow_style=False))
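
Before handing a configuration like this to the pipeline, it can help to check it for missing keys and missing files. The sketch below is illustrative only: `REQUIRED_KEYS`, `lookup`, and `check_config` are hypothetical names, and which keys the real pipeline treats as mandatory is an assumption.

```python
from pathlib import Path

# Dotted key paths taken from the demo configuration above; whether the
# pipeline treats each of them as mandatory is an assumption.
REQUIRED_KEYS = [
    "input.sample_name",
    "input.atac_bam",
    "reference.genome_fasta",
    "reference.chrom_sizes",
    "output.base_dir",
]

def lookup(config, dotted_key):
    """Follow a dotted key path through nested dicts; return None if absent."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def check_config(config, base_dir="."):
    """Return a list of human-readable problems found in `config`."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS
                if lookup(config, key) is None]
    # Report input/reference files that do not exist under base_dir.
    for section in ("input", "reference"):
        for name, value in config.get(section, {}).items():
            if name == "sample_name":
                continue
            if not (Path(base_dir) / str(value)).exists():
                problems.append(f"file not found: {section}.{name} -> {value}")
    return problems

# Deliberately incomplete config to show what the check reports.
example = {"input": {"sample_name": "demo_sample"}, "output": {"base_dir": "out"}}
for problem in check_config(example):
    print(problem)
```

In this notebook the same check could be applied as `check_config(data_config, repo_root)`; with the demo paths it would report the BAM and reference files as not found until real data is in place.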

In [None]:
# Training configuration
training_config = {
    'data': {
        'train_data': 'data/processed/train',
        'valid_data': 'data/processed/valid',
        'test_data': 'data/processed/test'
    },
    'model': {
        'type': 'pretraining',  # or 'finetuning'
        'input_length': 524288,
        'output_length': 4096,
        'final_output_length': 4092
    },
    'training': {
        'batch_size': 4,
        'learning_rate': 0.0001,
        'max_epochs': 100,
        'patience': 10,
        'gradient_clip_val': 1.0
    },
    'logging': {
        'wandb_project': 'epibert_demo',
        'wandb_entity': 'your_username',
        'log_dir': 'logs'
    },
    'hardware': {
        'num_gpus': 1,
        'num_workers': 4,
        'precision': 'bf16'
    }
}

# Save training configuration
with open(repo_root / 'demo_training_config.yaml', 'w') as f:
    yaml.dump(training_config, f, default_flow_style=False)

print("Training configuration saved to demo_training_config.yaml")
print(yaml.dump(training_config, default_flow_style=False))
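
The three length settings in the model section are related by simple arithmetic, which the sketch below works through. It assumes that `final_output_length` results from cropping bins symmetrically from both edges of the output; the notebook does not state how the crop is applied.

```python
input_length = 524288        # bases per input window (from the config above)
output_length = 4096         # output bins before cropping
final_output_length = 4092   # output bins after cropping

# Bases covered by each output bin.
bin_size, remainder = divmod(input_length, output_length)
assert remainder == 0, "input_length should be a multiple of output_length"

# Assumption: bins are cropped symmetrically from both edges.
cropped_total = output_length - final_output_length
crop_per_side = cropped_total // 2

print(f"Resolution: {bin_size} bp per bin")
print(f"Bins cropped per side: {crop_per_side}")
print(f"Bases scored after cropping: {final_output_length * bin_size}")
```

The resulting resolution of 128 bp per bin matches the `bin_size` of 128 used for signal tracks in the data processing configuration.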

## Step 3: Reference Data Download

Download essential reference files needed for data processing.

In [None]:
# Download reference data (this may take several minutes)
print("Downloading reference data...")
print("Note: Genome FASTA download requires user confirmation due to size (3GB)")

!{repo_root}/scripts/download_references.sh --output-dir {repo_root}/reference

In [None]:
# List downloaded reference files and their sizes

ref_dir = repo_root / 'reference'
if ref_dir.exists():
    print("Downloaded reference files:")
    for file in sorted(ref_dir.iterdir()):
        if file.is_file():
            size = file.stat().st_size / (1024*1024)  # Size in MB
            print(f"  {file.name}: {size:.1f} MB")
else:
    print("Reference directory not found. Run the download script first.")
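
Once `hg38.chrom.sizes` is available, it gives a quick estimate of how many model-sized windows each chromosome can supply. The sketch below parses the standard two-column, tab-separated chrom.sizes format and counts whole windows of 524,288 bases. Non-overlapping tiling is an assumption made for illustration (the actual pipeline may use a different windowing scheme), and the chromosome names and lengths in the demo file are made up.

```python
import tempfile
from pathlib import Path

def read_chrom_sizes(path):
    """Parse a two-column, tab-separated chrom.sizes file into a dict."""
    sizes = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            name, length = line.split()[:2]
            sizes[name] = int(length)
    return sizes

def count_windows(sizes, window=524288):
    """Number of whole, non-overlapping windows that fit in each chromosome."""
    return {name: length // window for name, length in sizes.items()}

# Tiny made-up file so the sketch runs without the real reference download.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.chrom.sizes"
    demo_path.write_text("chrA\t2000000\nchrB\t500000\n")
    demo_sizes = read_chrom_sizes(demo_path)

print(count_windows(demo_sizes))
```

With the real download, `read_chrom_sizes(ref_dir / 'hg38.chrom.sizes')` would replace the temporary demo file.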

## Step 4: Data Processing (Example)

**Note**: This step requires actual BAM files. For demonstration, we'll show the commands without executing them.

In [None]:
# This is the command you would run with real data:
data_processing_cmd = f"{repo_root}/data_processing/run_pipeline.sh -c {repo_root}/demo_data_config.yaml"

print("Data processing command:")
print(data_processing_cmd)
print()
print("This command would:")
print("1. Process ATAC-seq BAM to fragments")
print("2. Generate signal tracks")
print("3. Call peaks with MACS2")
print("4. Compute motif enrichments")
print("5. Create training datasets")

# Uncomment to run with real data:
# !{data_processing_cmd}
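
To make the peak-calling step more concrete, the sketch below shows how the `peak_calling` settings from the data configuration could map onto a `macs2 callpeak` command line. How `run_pipeline.sh` actually invokes MACS2 is not shown in this notebook, so `build_macs2_command`, the `-f BAM` format, the `-g hs` genome size, and the output directory are all assumptions. The configured `shift` of -75 is minus half the `extsize` of 150, the pairing the MACS2 documentation suggests for centering signal on cut sites.

```python
import shlex

def build_macs2_command(bam_path, sample_name, out_dir, peak_cfg):
    """Assemble a `macs2 callpeak` argument list from the peak_calling settings."""
    return [
        "macs2", "callpeak",
        "-t", bam_path,
        "-f", "BAM",                 # assumption: reads treated as single-end
        "-g", "hs",                  # assumption: human genome (hg38 references)
        "-n", sample_name,
        "--outdir", out_dir,
        "-q", str(peak_cfg["qvalue"]),
        "--nomodel",                 # --extsize is only used when --nomodel is set
        "--shift", str(peak_cfg["shift"]),
        "--extsize", str(peak_cfg["extsize"]),
    ]

# Values copied from the demo data configuration; the output directory is hypothetical.
cmd = build_macs2_command(
    "data/raw/demo.atac.bam",
    "demo_sample",
    "data/processed/peaks",
    {"qvalue": 0.01, "shift": -75, "extsize": 150},
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps paths containing spaces intact if it is later passed to `subprocess.run`.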

## Step 5: Model Architecture Overview

Let's examine the EpiBERT Lightning model architecture and parameter configurations.

In [None]:
# Import Lightning model
try:
    from lightning_transfer.epibert_lightning import EpiBERTLightning
    
    # Create pretraining model
    print("=== EpiBERT Pretraining Model Architecture ===")
    model_pretrain = EpiBERTLightning(model_type="pretraining")
    print(f"Model type: {model_pretrain.model_type}")
    print(f"Number of attention heads: {model_pretrain.num_heads}")
    print(f"Number of transformer layers: {model_pretrain.num_transformer_layers}")
    print(f"Model dimension (d_model): {model_pretrain.d_model}")
    print(f"Sequence filters: {model_pretrain.filter_list_seq}")
    print(f"ATAC filters: {model_pretrain.filter_list_atac}")
    print(f"Dropout: {model_pretrain.dropout}")
    print(f"Pointwise dropout: {model_pretrain.pointwise_dropout}")
    
    print("\n=== EpiBERT Fine-tuning Model Architecture ===")
    model_finetune = EpiBERTLightning(model_type="finetuning")
    print(f"Model type: {model_finetune.model_type}")
    print(f"Number of attention heads: {model_finetune.num_heads}")
    print(f"Number of transformer layers: {model_finetune.num_transformer_layers}")
    print(f"Model dimension (d_model): {model_finetune.d_model}")
    print(f"Sequence filters: {model_finetune.filter_list_seq}")
    print(f"ATAC filters: {model_finetune.filter_list_atac}")
    print(f"Dropout: {model_finetune.dropout}")
    print(f"Pointwise dropout: {model_finetune.pointwise_dropout}")
    
    # Show parameter count
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    print(f"\nPretraining model parameters: {count_parameters(model_pretrain):,}")
    print(f"Fine-tuning model parameters: {count_parameters(model_finetune):,}")
    
except ImportError as e:
    print(f"Lightning dependencies not available: {e}")
    print("Install with: pip install -r lightning_transfer/requirements_lightning.txt")

## Step 6: Training (Example)

**Note**: Training requires processed data. We'll show the training command and its setup without executing them.

In [None]:
# Training command
training_cmd = f"{repo_root}/scripts/train_model.sh --config {repo_root}/demo_training_config.yaml --lightning"

print("Training command:")
print(training_cmd)
print()
print("This command would:")
print("1. Validate configuration and data paths")
print("2. Initialize Lightning trainer with callbacks")
print("3. Set up logging (W&B, TensorBoard)")
print("4. Train model with early stopping")
print("5. Save best checkpoint")

# Show what the training would look like
print("\nExample training output:")
print("""
Epoch 1/100: 100%|██████████| 1250/1250 [15:23<00:00, 1.35it/s, loss=0.234, v_num=0]
Epoch 2/100: 100%|██████████| 1250/1250 [15:18<00:00, 1.36it/s, loss=0.198, v_num=0]
...
Early stopping triggered. Best model saved to: models/checkpoints/epoch_15.ckpt
""")

# Uncomment to run actual training (requires processed data):
# !{training_cmd}
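
The `patience` setting in the training configuration controls early stopping. The sketch below illustrates the basic rule in plain Python: training stops once `patience` consecutive epochs pass without the validation loss improving. It is a simplified illustration, not the Lightning callback itself, which offers further options such as `min_delta` and `mode`; `early_stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does.

    Training stops once `patience` consecutive epochs pass without the
    validation loss improving on the best value seen so far.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses, for illustration only.
losses = [0.30, 0.25, 0.26, 0.27, 0.24, 0.25, 0.26, 0.27]
print(early_stopping_epoch(losses, patience=3))
```

A larger `patience` tolerates longer plateaus at the cost of extra epochs; the demo configuration uses 10.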

## Step 7: Model Evaluation (Example)

This step demonstrates the evaluation tools: correlation, regression, and peak-prediction metrics, plus diagnostic plots.

In [None]:
# Import evaluation tools
try:
    sys.path.append(str(repo_root / 'scripts'))
    from evaluate_model import EpiBERTEvaluator
    
    # Initialize evaluator
    evaluator = EpiBERTEvaluator(implementation="lightning")
    
    print("EpiBERT Evaluator initialized successfully")
    print("\nAvailable evaluation metrics:")
    print("- Correlation metrics: Pearson, Spearman (global and per-sample)")
    print("- Regression metrics: MSE, MAE, RMSE, explained variance")
    print("- Peak prediction: ROC-AUC, PR-AUC, precision, recall, F1")
    print("- Visualization: Scatter plots, residuals, distributions, Q-Q plots")
    
except ImportError as e:
    print(f"Evaluation dependencies not available: {e}")

In [None]:
# Example evaluation command (would run with real model and data)
evaluation_cmd = f"""python3 {repo_root}/scripts/evaluate_model.py \\
    --model_path models/checkpoints/best_model.ckpt \\
    --test_data data/processed/test.h5 \\
    --implementation lightning \\
    --model_type pretraining \\
    --output_dir results/evaluation"""

print("Evaluation command:")
print(evaluation_cmd)
print()
print("This would generate:")
print("1. Comprehensive metrics JSON file")
print("2. Evaluation plots (scatter, residuals, distributions)")
print("3. Per-sample correlation analysis")
print("4. Peak prediction performance assessment")

# Show example metrics output
print("\nExample evaluation metrics:")
example_metrics = {
    "global_pearson_r": 0.847,
    "global_spearman_r": 0.832,
    "mean_sample_pearson": 0.756,
    "std_sample_pearson": 0.089,
    "mse": 0.045,
    "mae": 0.163,
    "explained_variance": 0.718,
    "roc_auc": 0.923,
    "pr_auc": 0.845,
    "precision": 0.784,
    "recall": 0.692,
    "f1_score": 0.735
}

for metric, value in example_metrics.items():
    print(f"  {metric}: {value}")
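
For readers who want to see what several of these metrics measure, the sketch below computes a few of them from scratch in plain Python. It is not the `EpiBERTEvaluator` implementation; the function names are hypothetical, the input values are made up, and thresholding scores at 0.5 to obtain peak calls is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def regression_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error, and root mean squared error."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse)}

def peak_metrics(labels, scores, threshold=0.5):
    """Precision, recall, and F1 after thresholding scores into peak calls."""
    calls = [score >= threshold for score in scores]
    tp = sum(1 for l, c in zip(labels, calls) if l and c)
    fp = sum(1 for l, c in zip(labels, calls) if not l and c)
    fn = sum(1 for l, c in zip(labels, calls) if l and not c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1_score": f1}

# Made-up values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(f"pearson_r: {pearson_r(y_true, y_pred):.3f}")
print(regression_metrics(y_true, y_pred))
print(peak_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

For real evaluations, library implementations such as those in SciPy and scikit-learn are preferable; the hand-written versions here only make the definitions explicit.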

## Step 8: Complete Workflow Automation

The master workflow script can automate the entire pipeline.

In [None]:
# Complete workflow command
workflow_cmd = f"{repo_root}/run_complete_workflow.sh --lightning --dry-run"

print("Complete workflow command (dry run):")
print(workflow_cmd)
print()
print("To run the complete workflow with real data:")
print(f"{repo_root}/run_complete_workflow.sh --lightning")

# Run dry run to show workflow overview
!{workflow_cmd}
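
The internals of `run_complete_workflow.sh` are not shown here, but the dry-run idea can be sketched in a few lines of Python: list the planned commands in order and only execute them when asked. The commands below are the ones used earlier in this notebook, written relative to the repository root; `run_steps` and the step names are hypothetical.

```python
import subprocess

def run_steps(steps, dry_run=True):
    """Run (name, argv) steps in order; stop at the first failure.

    With dry_run=True nothing is executed; the planned commands are only
    collected and returned as strings.
    """
    planned = []
    for name, argv in steps:
        planned.append(f"[{name}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, check=True)  # raises CalledProcessError on failure
    return planned

# Commands taken from earlier cells, relative to the repository root.
steps = [
    ("validate", ["./setup_environment.sh", "--lightning", "--validate-only"]),
    ("references", ["./scripts/download_references.sh", "--output-dir", "reference"]),
    ("process", ["./data_processing/run_pipeline.sh", "-c", "demo_data_config.yaml"]),
    ("train", ["./scripts/train_model.sh", "--config", "demo_training_config.yaml",
               "--lightning"]),
]
for line in run_steps(steps, dry_run=True):
    print(line)
```

Because `check=True` raises on a non-zero exit code, a failed step stops the chain instead of letting later steps run on incomplete outputs.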

## Step 9: Model Usage for Predictions

This step shows how to load a trained checkpoint and use it to make predictions.

In [None]:
# Example prediction code (would work with trained model)
prediction_example = '''
# Load trained model
from lightning_transfer.epibert_lightning import EpiBERTLightning
import torch
import numpy as np

# Load model from checkpoint
model = EpiBERTLightning.load_from_checkpoint("models/checkpoints/best_model.ckpt")
model.eval()

# Prepare input data
# Input should be shape: (batch_size, input_length, num_features)
# For EpiBERT: (batch_size, 524288, 4) for sequence one-hot encoding
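# Random bases, one-hot encoded, as a stand-in for a real sequence
random_bases = torch.randint(0, 4, (2, 524288))
input_sequences = torch.nn.functional.one_hot(random_bases, num_classes=4).float()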

# Make predictions
with torch.no_grad():
    predictions = model(input_sequences)

print(f"Input shape: {input_sequences.shape}")
print(f"Output shape: {predictions.shape}")
print(f"Predictions range: [{predictions.min():.3f}, {predictions.max():.3f}]")
'''

print("Example prediction code:")
print(prediction_example)

# Try to import Lightning modules to check availability
try:
    import torch
    import pytorch_lightning as pl
    from lightning_transfer.epibert_lightning import EpiBERTLightning
    
    print("\n✅ Lightning modules available for prediction")
    print(f"PyTorch version: {torch.__version__}")
    print(f"Lightning version: {pl.__version__}")
    
except ImportError as e:
    print(f"\n❌ Lightning modules not available: {e}")
    print("Install with: pip install -r lightning_transfer/requirements_lightning.txt")
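
The prediction example expects a one-hot encoded sequence of shape `(batch_size, 524288, 4)`. As a minimal sketch of how a DNA string could be turned into such rows, the helper below encodes each base as a 4-element vector. The channel order A, C, G, T is an assumption (the order EpiBERT expects is not stated in this notebook), and `one_hot_encode` is a hypothetical helper.

```python
# Assumption: channels are ordered A, C, G, T; the order EpiBERT expects is
# not stated in this notebook.
BASES = "ACGT"

def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows.

    Bases outside A/C/G/T (for example N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(BASES)
        if base in index:
            row[index[base]] = 1.0
        rows.append(row)
    return rows

encoded = one_hot_encode("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

With PyTorch installed, `torch.tensor(one_hot_encode(seq)).unsqueeze(0)` would turn the result into a batch of one with shape `(1, len(seq), 4)`. For full-length windows, a vectorized NumPy or PyTorch encoding would be considerably faster than this pure-Python loop.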

## Summary

This notebook demonstrated the complete EpiBERT workflow:

### ✅ What we covered:
1. **Environment validation** - Checking dependencies and tools
2. **Configuration setup** - Creating YAML config files for data and training
3. **Reference data download** - Getting genome files and annotations
4. **Data processing** - Command for converting BAM files to model-ready datasets
5. **Model architecture** - Understanding Lightning implementation parameters
6. **Training workflow** - Command structure and expected outputs
7. **Evaluation system** - Comprehensive metrics and visualization
8. **Complete automation** - Master workflow script usage
9. **Model usage** - Making predictions with trained models

### 🚀 Next steps for real usage:
1. **Prepare your data**: Get ATAC-seq (and optionally RAMPAGE-seq) BAM files
2. **Run setup**: `./setup_environment.sh --lightning --install-deps`
3. **Download references**: `./scripts/download_references.sh`
4. **Configure workflow**: Edit the YAML config files with your data paths
5. **Run pipeline**: `./run_complete_workflow.sh --lightning`

### 📚 Additional resources:
- **Data processing documentation**: `data_processing/README.md`
- **Lightning implementation details**: `lightning_transfer/README.md`
- **Example notebooks**: Other notebooks in `example_usage/`
- **Complete documentation**: `README.md`

### 💡 Tips for success:
- Start with a small test dataset to validate the pipeline
- Monitor GPU memory usage and adjust batch size if needed
- Use Weights & Biases for experiment tracking
- Save intermediate results in case of interruptions
- Review evaluation metrics to ensure model quality

**Happy training! 🧬🤖**