# SV-Agent CWL Execution Demo

This notebook demonstrates how sv-agent executes CWL workflows for GATK-SV structural variant analysis.

## Key Concept

sv-agent's main purpose is to **execute CWL workflows** that run GATK-SV analysis on genomic data. It:
1. Converts GATK-SV WDL workflows to CWL format
2. Executes the CWL workflows using its integrated engine
3. Processes BAM/CRAM files to detect structural variants
4. Generates VCF outputs with SV calls

In [None]:
# Import required modules
from sv_agent import SVAgent
from sv_agent.chat import SVAgentChat
import yaml
import json
from pathlib import Path

## 1. Understanding sv-agent's Execution Capabilities

In [None]:
# Initialize chat to ask about execution
agent = SVAgent()
chat = SVAgentChat(agent, llm_provider="none")

# Ask about CWL execution
response = chat.chat("Can you execute CWL workflows?")
print("Q: Can you execute CWL workflows?")
print("\nA:", response)

## 2. Step 1: Convert WDL to CWL

Before we can execute, we need to convert GATK-SV WDL workflows to CWL:

In [None]:
# Example conversion command
print("Convert Module00a (GatherSampleEvidence) to CWL:")
print("\nCommand:")
print("sv-agent convert -o cwl_output -m Module00a")

print("\nThis will generate:")
print("- cwl_output/GatherSampleEvidence.cwl")
print("- cwl_output/tools/*.cwl (individual tool definitions)")
print("- cwl_output/types/*.cwl (type definitions)")
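
After a conversion finishes, it can help to confirm what was actually written. The following is a minimal sketch that groups the `.cwl` files under an output directory by subfolder, assuming the `cwl_output/` layout listed above. The helper name `list_cwl_outputs` is hypothetical and is not part of sv-agent.

```python
from pathlib import Path


def list_cwl_outputs(output_dir):
    """Group the .cwl files under output_dir by their folder, relative to output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return {}
    grouped = {}
    for cwl_file in sorted(root.rglob("*.cwl")):
        # Files directly in output_dir are reported under "." (top-level workflows).
        rel_parent = cwl_file.parent.relative_to(root).as_posix()
        grouped.setdefault(rel_parent, []).append(cwl_file.name)
    return grouped


# Prints nothing if cwl_output/ has not been created yet.
for folder, names in list_cwl_outputs("cwl_output").items():
    print(f"{folder}/: {len(names)} file(s)")
    for name in names:
        print(f"  - {name}")
```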

## 3. Step 2: Prepare Input Configuration

CWL workflows take their input parameters from a job file in YAML (or JSON) format:

In [None]:
# Example input configuration for Module00a
module00a_inputs = {
    "bam_file": {
        "class": "File",
        "path": "/data/samples/sample001.bam",
        "secondaryFiles": [
            {
                "class": "File",
                "path": "/data/samples/sample001.bam.bai"
            }
        ]
    },
    "reference": {
        "class": "File",
        "path": "/data/reference/hg38.fa",
        "secondaryFiles": [
            {
                "class": "File",
                "path": "/data/reference/hg38.fa.fai"
            },
            {
                "class": "File",
                "path": "/data/reference/hg38.dict"
            }
        ]
    },
    "sample_id": "SAMPLE001",
    "sex": "female",
    "primary_contigs": ["chr1", "chr2", "chr3", "chr4", "chr5"],
    "min_mapq": 20,
    "min_base_qual": 20
}

print("Module00a input configuration:")
print(yaml.dump(module00a_inputs, default_flow_style=False))
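
A missing BAM index or reference dictionary is a common reason for a run to fail early, so it can be worth checking paths before executing anything. The sketch below walks a configuration like the one above and reports every CWL `File` path (including `secondaryFiles`) that does not exist locally. The helper names are hypothetical, and `example_inputs` is a cut-down stand-in for `module00a_inputs` so the cell runs on its own.

```python
from pathlib import Path


def collect_file_paths(value):
    """Recursively yield the "path" of every CWL File object, including secondaryFiles."""
    if isinstance(value, dict):
        if value.get("class") == "File" and "path" in value:
            yield value["path"]
        for child in value.values():
            yield from collect_file_paths(child)
    elif isinstance(value, list):
        for item in value:
            yield from collect_file_paths(item)


def find_missing_files(inputs):
    """Return referenced paths that do not exist on the local filesystem."""
    return [p for p in collect_file_paths(inputs) if not Path(p).exists()]


# Small stand-in for module00a_inputs (assumption: same structure as above).
example_inputs = {
    "bam_file": {
        "class": "File",
        "path": "/data/samples/sample001.bam",
        "secondaryFiles": [{"class": "File", "path": "/data/samples/sample001.bam.bai"}],
    },
    "sample_id": "SAMPLE001",
}

for missing in find_missing_files(example_inputs):
    print(f"Missing input file: {missing}")
```

In the notebook itself, `find_missing_files(module00a_inputs)` would check the full configuration.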

In [None]:
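# Save the configuration so it can be passed to sv-agent
input_file = Path("module00a_inputs.yaml")
with input_file.open("w") as handle:
    yaml.safe_dump(module00a_inputs, handle, default_flow_style=False)

print(f"Saved configuration to: {input_file}")
print("\nThen execute with:")
print(f"sv-agent run cwl_output/GatherSampleEvidence.cwl {input_file}")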
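
When the run is launched from Python rather than typed by hand, building the command as an argument list avoids shell-quoting problems. This sketch only assembles and prints the `sv-agent run <workflow> <inputs>` form shown above; it assumes no additional flags, and `build_run_command` is a hypothetical helper, not an sv-agent function.

```python
import shlex
from pathlib import Path


def build_run_command(workflow, inputs_file, executable="sv-agent"):
    """Return the argv list for `sv-agent run`, after checking that both files exist."""
    for label, path in (("workflow", workflow), ("inputs", inputs_file)):
        if not Path(path).is_file():
            raise FileNotFoundError(f"{label} file not found: {path}")
    return [executable, "run", str(workflow), str(inputs_file)]


try:
    argv = build_run_command("cwl_output/GatherSampleEvidence.cwl", "module00a_inputs.yaml")
    print("Ready to run:", shlex.join(argv))
    # To launch it for real: subprocess.run(argv, check=True)
except FileNotFoundError as err:
    print("Not ready:", err)
```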

## 4. Step 3: Execute CWL Workflow

Now we can run the CWL workflow with sv-agent:

In [None]:
# Ask about execution
response = chat.chat("What happens when I run 'sv-agent run' command?")
print("Q: What happens when I run 'sv-agent run' command?")
print("\nA:", response[:600] + ("..." if len(response) > 600 else ""))

## 5. Batch Processing Example

For cohort analysis, we need to process multiple samples:

In [None]:
# Batch input configuration
batch_inputs = {
    "samples": [
        {
            "sample_id": "SAMPLE001",
            "bam": {"class": "File", "path": "/data/samples/sample001.bam"},
            "sex": "female"
        },
        {
            "sample_id": "SAMPLE002",
            "bam": {"class": "File", "path": "/data/samples/sample002.bam"},
            "sex": "male"
        },
        {
            "sample_id": "SAMPLE003",
            "bam": {"class": "File", "path": "/data/samples/sample003.bam"},
            "sex": "female"
        }
    ],
    "reference": {
        "class": "File",
        "path": "/data/reference/hg38.fa"
    },
    "batch_id": "BATCH001",
    "output_dir": "/results/batch001"
}

print("Batch processing configuration:")
batch_json = json.dumps(batch_inputs, indent=2)
print(batch_json[:500] + ("..." if len(batch_json) > 500 else ""))
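
Writing the `samples` list by hand does not scale beyond a few samples. As a sketch, the same structure can be generated from a sample sheet. The CSV column names (`sample_id`, `bam_path`, `sex`), the allowed sex values, and the helper `build_batch_inputs` are all assumptions made for illustration.

```python
import csv
import io


def build_batch_inputs(sample_sheet, reference_path, batch_id, output_dir):
    """Turn CSV text with columns sample_id,bam_path,sex into a batch input dict."""
    allowed_sex = {"male", "female"}  # assumption: the only values used in this notebook
    samples = []
    seen = set()
    for row in csv.DictReader(io.StringIO(sample_sheet)):
        sample_id = row["sample_id"].strip()
        sex = row["sex"].strip().lower()
        if sample_id in seen:
            raise ValueError(f"duplicate sample_id: {sample_id}")
        if sex not in allowed_sex:
            raise ValueError(f"unexpected sex for {sample_id}: {sex!r}")
        seen.add(sample_id)
        samples.append({
            "sample_id": sample_id,
            "bam": {"class": "File", "path": row["bam_path"].strip()},
            "sex": sex,
        })
    return {
        "samples": samples,
        "reference": {"class": "File", "path": reference_path},
        "batch_id": batch_id,
        "output_dir": output_dir,
    }


sheet = """sample_id,bam_path,sex
SAMPLE001,/data/samples/sample001.bam,female
SAMPLE002,/data/samples/sample002.bam,male
"""
batch = build_batch_inputs(sheet, "/data/reference/hg38.fa", "BATCH001", "/results/batch001")
print(f"{batch['batch_id']}: {len(batch['samples'])} samples")
```

Rejecting duplicate IDs and unexpected sex values at this stage is cheaper than discovering them partway through a cohort run.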

## 6. Complete Pipeline Execution

To run the full GATK-SV pipeline:

In [None]:
# Full pipeline workflow
print("Complete GATK-SV Pipeline Execution:")
print("=" * 50)
print()
print("1. Convert all modules:")
print("   sv-agent convert -o cwl_output")
print()
print("2. Prepare batch configuration (batch_inputs.yaml)")
print()
print("3. Run the complete pipeline:")
print("   sv-agent run cwl_output/GATKSVPipelineBatch.cwl batch_inputs.yaml")
print()
print("This will execute all modules in sequence:")
print("- Module00a-c: Evidence gathering")
print("- Module01: Clustering")
print("- Module02: Metrics generation")
print("- Module03: Filtering")
print("- Module04: Genotyping")
print("- Module05: Cohort VCF creation")
print("- Module06: Annotation")

## 7. Monitoring Execution

In [None]:
# Ask about monitoring
question = "How can I monitor the progress of my CWL workflow execution?"
response = chat.chat(question)
print("Q:", question)
print("\nA:", response)
print("\nExpected features:")
print("- Real-time progress updates")
print("- Log files for each step")
print("- Resource usage statistics")
print("- Error reporting")
print("- Intermediate file tracking")
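
Independently of whatever monitoring sv-agent provides, a long-running command can be followed from Python by streaming its output line by line and keeping a copy in a log file. This is a generic sketch using the standard `subprocess` module. The child process here is a stand-in Python one-liner, not a real workflow, and `run_and_stream` is a hypothetical helper.

```python
import os
import subprocess
import sys
import tempfile
import time


def run_and_stream(argv, log_path):
    """Run argv, echo each output line with elapsed seconds, and save it to log_path."""
    start = time.monotonic()
    with open(log_path, "w") as log, subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for line in proc.stdout:
            elapsed = time.monotonic() - start
            print(f"[{elapsed:7.1f}s] {line}", end="")
            log.write(line)
        return_code = proc.wait()
    return return_code


# Stand-in child process; swap in the real workflow command when running for real.
demo = [sys.executable, "-c", "print('step 1 done'); print('step 2 done')"]
log_file = os.path.join(tempfile.gettempdir(), "workflow_demo.log")
code = run_and_stream(demo, log_file)
print("exit code:", code)
```

A non-zero exit code is the simplest signal that a step failed; the saved log then shows where.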

## 8. Output Files and Results

In [None]:
# Expected outputs
print("Expected Output Files from GATK-SV Pipeline:")
print("=" * 50)

outputs = {
    "Module00a": [
        "manta.vcf.gz - Manta SV calls",
        "melt.vcf.gz - Mobile element insertions",
        "sample.PE.txt - Paired-end evidence",
        "sample.SR.txt - Split-read evidence",
        "sample.RD.txt - Read depth profile"
    ],
    "Module00c": [
        "batch.PE.txt - Merged PE evidence",
        "batch.SR.txt - Merged SR evidence",
        "cnmops.vcf.gz - CNV calls"
    ],
    "Module04": [
        "genotyped.vcf.gz - Genotyped variants",
        "genotype_qualities.txt - GQ scores"
    ],
    "Final Output": [
        "cohort.annotated.vcf.gz - Final annotated VCF",
        "cohort.stats.txt - Cohort statistics",
        "qc_report.html - QC summary"
    ]
}

for module, files in outputs.items():
    print(f"\n{module}:")
    for f in files:
        print(f"  - {f}")
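
A quick first look at any of the VCF outputs listed above is a tally of calls by SV type. The sketch below counts records by the standard `SVTYPE` key in the INFO column, using only the standard library. It writes a three-record demo VCF to a temporary directory so that it runs without pipeline output; the records are made up for illustration and are not real results.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_sv_types(vcf_path):
    """Count records per INFO/SVTYPE value in a plain or gzip-compressed VCF."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                continue  # not a complete VCF record
            sv_type = "UNKNOWN"
            for entry in fields[7].split(";"):
                if entry.startswith("SVTYPE="):
                    sv_type = entry[len("SVTYPE="):]
                    break
            counts[sv_type] += 1
    return counts


# Made-up demo records, for illustration only.
demo_vcf = os.path.join(tempfile.mkdtemp(), "demo.vcf")
with open(demo_vcf, "w") as out:
    out.write("##fileformat=VCFv4.2\n")
    out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    out.write("chr1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tEND=2000;SVTYPE=DEL\n")
    out.write("chr1\t5000\tsv2\tN\t<DUP>\t.\tPASS\tSVTYPE=DUP;END=9000\n")
    out.write("chr2\t7000\tsv3\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=7500\n")

print(dict(count_sv_types(demo_vcf)))
```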

## 9. Real-World Example: Trio Analysis

In [None]:
# Trio configuration
trio_config = {
    "samples": [
        {
            "sample_id": "CHILD001",
            "bam": {"class": "File", "path": "/data/trio/child.bam"},
            "sex": "male",
            "family_id": "FAM001",
            "father_id": "FATHER001",
            "mother_id": "MOTHER001"
        },
        {
            "sample_id": "FATHER001",
            "bam": {"class": "File", "path": "/data/trio/father.bam"},
            "sex": "male",
            "family_id": "FAM001"
        },
        {
            "sample_id": "MOTHER001",
            "bam": {"class": "File", "path": "/data/trio/mother.bam"},
            "sex": "female",
            "family_id": "FAM001"
        }
    ],
    "reference": {"class": "File", "path": "/data/reference/hg38.fa"},
    "enable_denovo_calling": True,
    "output_dir": "/results/trio_analysis"
}

print("Trio Analysis Configuration:")
print(yaml.dump(trio_config, default_flow_style=False))

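print("\nExpected de novo SV detection:")
print("- A small number of candidate de novo SVs per trio; true de novo SVs are rare, so review each candidate")
print("- Parental genotypes help separate inherited variants from de novo candidates")
print("- Inheritance pattern validation")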
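
Family relationships like those in `trio_config` are commonly exchanged as a PED file, with one row per individual: family ID, individual ID, father ID, mother ID, sex code, phenotype. The sketch below derives such rows from the sample entries. Whether sv-agent consumes a PED file is not stated here, so treat this as an illustration of the pedigree structure; `to_ped_lines` is a hypothetical helper and `trio_samples` repeats only the relevant fields from the configuration above.

```python
SEX_CODES = {"male": "1", "female": "2"}  # PED convention; "0" means unknown


def to_ped_lines(samples, phenotype="0"):
    """Build tab-separated PED rows: family, individual, father, mother, sex, phenotype."""
    known_ids = {s["sample_id"] for s in samples}
    lines = []
    for s in samples:
        father = s.get("father_id", "0")  # "0" marks a founder with no listed parent
        mother = s.get("mother_id", "0")
        for parent in (father, mother):
            if parent != "0" and parent not in known_ids:
                raise ValueError(f"{s['sample_id']}: parent {parent} is not in the sample list")
        sex_code = SEX_CODES.get(s.get("sex", ""), "0")
        lines.append("\t".join([s["family_id"], s["sample_id"], father, mother, sex_code, phenotype]))
    return lines


trio_samples = [
    {"sample_id": "CHILD001", "sex": "male", "family_id": "FAM001",
     "father_id": "FATHER001", "mother_id": "MOTHER001"},
    {"sample_id": "FATHER001", "sex": "male", "family_id": "FAM001"},
    {"sample_id": "MOTHER001", "sex": "female", "family_id": "FAM001"},
]
print("\n".join(to_ped_lines(trio_samples)))
```

The parent check catches a child whose `father_id` or `mother_id` points at a sample missing from the trio, which would otherwise make de novo calls meaningless.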

## 10. Performance Considerations

In [None]:
# Ask about performance
response = chat.chat("What are the computational requirements for running 100 samples?")
print("Q: What are the computational requirements for running 100 samples?")
print("\nA:", response)

# Performance tips
print("\n" + "=" * 50)
print("Performance Optimization Tips:")
print("- Use scatter-gather for parallelization")
print("- Allocate sufficient memory for Java tools")
print("- Use SSD storage for temporary files")
print("- Enable caching for reference files")
print("- Monitor resource usage during execution")
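
The scatter-gather tip can be illustrated in plain Python: split the work into shards, process the shards concurrently, then merge the results in order. This is a sketch of the idea only. In a CWL workflow the scattering is declared on workflow steps and handled by the engine, and `process_shard` here is a placeholder rather than a real SV-calling step.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(items, n_shards):
    """Split items into at most n_shards contiguous, near-equal chunks."""
    if not items:
        return []
    n_shards = max(1, min(n_shards, len(items)))
    size, extra = divmod(len(items), n_shards)
    shards, start = [], 0
    for i in range(n_shards):
        end = start + size + (1 if i < extra else 0)  # first `extra` shards get one more item
        shards.append(items[start:end])
        start = end
    return shards


def process_shard(contigs):
    """Placeholder per-shard work; a real step would analyze these contigs."""
    return [f"{c}:done" for c in contigs]


def scatter_gather(items, n_shards, worker=process_shard):
    """Run worker on each shard concurrently and concatenate results in shard order."""
    shards = scatter(items, n_shards)
    if not shards:
        return []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        results = pool.map(worker, shards)  # map preserves input order
        return [r for shard_result in results for r in shard_result]


contigs = ["chr1", "chr2", "chr3", "chr4", "chr5"]
print(scatter(contigs, 2))
print(scatter_gather(contigs, 2))
```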

## Summary

sv-agent provides a complete solution for running GATK-SV analysis:

1. **Converts** WDL workflows to CWL format
2. **Executes** CWL workflows with its integrated engine
3. **Processes** genomic data (BAM/CRAM files)
4. **Detects** structural variants across the genome
5. **Generates** annotated VCF files with SV calls

### Key Commands Summary

```bash
# Convert WDL to CWL
sv-agent convert -o cwl_output -m Module00a

# Run single module
sv-agent run cwl_output/GatherSampleEvidence.cwl inputs.yaml

# Run complete pipeline
sv-agent run cwl_output/GATKSVPipelineBatch.cwl batch_inputs.yaml

# Get help
sv-agent chat
```

### Next Steps

1. Prepare your BAM/CRAM files
2. Create input YAML configurations
3. Convert necessary modules to CWL
4. Execute the workflow with sv-agent
5. Analyze the output VCF files