# Quick Start: DeepBioP for Deep Learning

Welcome to DeepBioP! This notebook demonstrates the basics of loading and processing biological sequence data for deep learning.

## What You'll Learn

1. Loading FASTQ files
2. Filtering sequences by quality and length
3. Encoding sequences for neural networks
4. Combining filtering and encoding in one pipeline
5. Using the pipeline with a PyTorch DataLoader

Let's get started!

## Setup

First, import the necessary libraries:

In [None]:
from pathlib import Path

from deepbiop import fq
from deepbiop.transforms import FilterCompose, TransformDataset

## 1. Loading FASTQ Data

The simplest way to load FASTQ data is with `FastqStreamDataset`:

In [None]:
# Create dataset
test_file = Path("../tests/data/test.fastq")

if test_file.exists():
    dataset = fq.FastqStreamDataset(str(test_file))

    # Look at first record
    record = next(iter(dataset))

    print(f"Record ID: {record['id']}")
    print(f"Sequence length: {len(record['sequence'])} bases")
    print(f"First 20 bases: {record['sequence'][:20]}")
    print(f"Quality scores: {record['quality'][:20]}")
else:
    print(f"Test file not found at {test_file}")
    print("This is just a demo - in practice, use your own FASTQ files!")
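
For readers new to the format, the sketch below shows the four-line layout of a FASTQ record (header, sequence, `+` separator, quality string) using only the standard library. It is an illustration under stated assumptions, not DeepBioP's implementation: `iter_fastq` is a hypothetical helper, and it returns plain strings, whereas the library may return bytes or arrays.

```python
import io


def iter_fastq(handle):
    """Yield one dict per four-line FASTQ record (hypothetical helper)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (a blank line also stops parsing)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")

        if not header.startswith("@"):
            raise ValueError(f"Malformed FASTQ header: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")

        # The record ID is the header text up to the first space
        yield {
            "id": header[1:].split(" ", 1)[0],
            "sequence": sequence,
            "quality": quality,
        }


# Two in-memory records stand in for a real file
demo = io.StringIO("@read1 sample\nACGTACGT\n+\nIIIIHHHH\n@read2\nGGCC\n+\n!!II\n")
records = list(iter_fastq(demo))

for rec in records:
    print(rec["id"], rec["sequence"], rec["quality"])
```

The dict keys mirror the ones used in the cell above (`id`, `sequence`, `quality`), so the two can be compared side by side.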

## 2. Filtering Sequences

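Next, keep only reads whose mean quality is at least 25 and whose length is between 50 and 500 bases, then count how many of the first 100 records pass: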

In [None]:
# Create filters
quality_filter = fq.QualityFilter(min_mean_quality=25.0)
length_filter = fq.LengthFilter(min_length=50, max_length=500)

# Combine filters
filters = FilterCompose([quality_filter, length_filter])

# Apply filters
if test_file.exists():
    passed = 0
    filtered = 0

    for record in dataset:
        if filters.filter(record):
            passed += 1
        else:
            filtered += 1

        if passed + filtered >= 100:  # Check first 100
            break

    total = passed + filtered
    print(f"Passed filters: {passed}")
    print(f"Filtered out: {filtered}")
    if total:  # Avoid dividing by zero on an empty file
        print(f"Pass rate: {100 * passed / total:.1f}%")
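
To make the two criteria concrete, here is a minimal pure-Python sketch of a mean-quality and length check. It assumes Phred+33 ASCII quality strings and a simple arithmetic mean of the Phred scores; `mean_phred` and `passes` are hypothetical names, and DeepBioP's filters may compute quality differently.

```python
def mean_phred(quality, offset=33):
    """Arithmetic mean of Phred scores decoded from an ASCII quality string."""
    scores = [ord(ch) - offset for ch in quality]
    return sum(scores) / len(scores) if scores else 0.0


def passes(record, min_mean_quality=25.0, min_length=50, max_length=500):
    """Return True when the record meets both the length and quality limits."""
    length_ok = min_length <= len(record["sequence"]) <= max_length
    return length_ok and mean_phred(record["quality"]) >= min_mean_quality


# 'I' decodes to Phred 40 and '+' to Phred 10 under the Phred+33 assumption
good = {"sequence": "ACGT" * 20, "quality": "I" * 80}
short = {"sequence": "ACGT", "quality": "IIII"}
noisy = {"sequence": "ACGT" * 20, "quality": "+" * 80}

for name, rec in [("good", good), ("short", short), ("noisy", noisy)]:
    print(name, passes(rec))
```

The thresholds default to the same values as the cell above, so `short` fails on length and `noisy` fails on quality.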

## 3. Encoding Sequences

Neural networks need numerical data. Let's one-hot encode DNA sequences:

In [None]:
# Create one-hot encoder
encoder = fq.OneHotEncoder(encoding_type="dna", unknown_strategy="skip")

# Encode a sample sequence
sample = {"sequence": b"ACGT", "quality": None}

encoded = encoder(sample)

print("One-hot encoded sequence:")
print(encoded["sequence"])
print(f"\nShape: {encoded['sequence'].shape}")
print("Each row is [A, C, G, T]")
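
As a rough picture of what one-hot encoding produces, the sketch below builds the same kind of matrix with plain Python lists. It assumes the column order `[A, C, G, T]` stated above and assumes that "skip" means unknown bases such as `N` are dropped; `one_hot` is a hypothetical helper, not the library's encoder, which returns an array rather than nested lists.

```python
BASES = "ACGT"  # assumed column order: [A, C, G, T]


def one_hot(sequence):
    """Return one row of four 0/1 values per recognised base."""
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for base in sequence.upper():
        if base not in index:
            continue  # assumption: unknown bases (e.g. N) are skipped
        row = [0] * len(BASES)
        row[index[base]] = 1
        rows.append(row)
    return rows


encoded_rows = one_hot("ACGT")
for row in encoded_rows:
    print(row)
```

Each row contains exactly one `1`, so a sequence of length `n` with no unknown bases becomes an `n x 4` matrix.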

## 4. Complete Pipeline

Now let's combine filtering and encoding in one pipeline:

In [None]:
if test_file.exists():
    # Create base dataset
    dataset = fq.FastqStreamDataset(str(test_file))

    # Create transform pipeline (for encoding)
    transform = fq.OneHotEncoder(encoding_type="dna", unknown_strategy="skip")

    # Create filter pipeline
    filter_pipeline = FilterCompose(
        [fq.QualityFilter(min_mean_quality=25.0), fq.LengthFilter(min_length=50)]
    )

    # Wrap with TransformDataset
    processed = TransformDataset(
        dataset, transform=transform, filter_fn=filter_pipeline
    )

    # Process records
    count = 0
    for record in processed:
        if count == 0:
            print("First processed record:")
            print(f"  ID: {record['id']}")
            print(f"  Encoded shape: {record['sequence'].shape}")
            print("  First 5 bases (one-hot):")
            print(record["sequence"][:5])

        count += 1
        if count >= 5:  # Process first 5
            break

    # Counting explicitly avoids a NameError when no record passes the filters
    print(f"\nProcessed {count} records successfully!")
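
One way to think about a filter-plus-transform wrapper is as a lazy generator that drops rejected records and transforms the rest. The sketch below illustrates that idea with toy records; it assumes filtering happens before transformation and is not a description of how `TransformDataset` is implemented. All names in it are hypothetical.

```python
def filter_then_transform(records, transform, filter_fn):
    """Lazily yield transform(record) for each record that filter_fn accepts."""
    for record in records:
        if filter_fn(record):
            yield transform(record)


toy_records = [
    {"id": "r1", "sequence": "ACGTAC"},
    {"id": "r2", "sequence": "AC"},
    {"id": "r3", "sequence": "GGGTTT"},
]


def long_enough(record):
    """Toy filter: keep sequences of at least four bases."""
    return len(record["sequence"]) >= 4


def add_length(record):
    """Toy transform: return a copy of the record with its length attached."""
    return {**record, "length": len(record["sequence"])}


kept = list(filter_then_transform(toy_records, add_length, long_enough))
print([r["id"] for r in kept])
```

Because the generator is lazy, nothing is read or transformed until the consumer asks for the next record, which suits streaming over large files.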

## 5. Using with PyTorch DataLoader

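For training neural networks, wrap the dataset in a PyTorch DataLoader. Reads differ in length, so a custom `collate_fn` pads each batch to its longest sequence and returns the original lengths alongside it: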

In [None]:
try:
    import torch
    from torch.utils.data import DataLoader

    def collate_fn(batch):
        """Custom collate function with padding."""
        sequences = [torch.from_numpy(item["sequence"]).float() for item in batch]

        # Pad to max length
        max_len = max(seq.shape[0] for seq in sequences)
        padded = torch.zeros(len(sequences), max_len, 4)  # 4 for A,C,G,T

        for i, seq in enumerate(sequences):
            padded[i, : seq.shape[0]] = seq

        return {
            "sequences": padded,
            "lengths": torch.tensor([seq.shape[0] for seq in sequences]),
        }

    if test_file.exists():
        # Create DataLoader
        loader = DataLoader(
            processed, batch_size=4, collate_fn=collate_fn, num_workers=0
        )

        # Get one batch
        batch = next(iter(loader))

        print("DataLoader batch:")
        print(f"  Batch shape: {batch['sequences'].shape}")
        print(f"  Sequence lengths: {batch['lengths']}")
        print("\nReady for training!")

except ImportError:
    print("PyTorch not installed. Install with: pip install torch")
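
The `lengths` returned by the collate function are typically used to tell real positions from padding. The sketch below shows the idea in plain Python so it runs without PyTorch; `padding_mask` is a hypothetical helper and not part of DeepBioP or PyTorch.

```python
def padding_mask(lengths):
    """Return one row per sequence: True for real positions, False for padding."""
    max_len = max(lengths, default=0)
    return [[position < length for position in range(max_len)] for length in lengths]


# Two sequences of length 3 and 1, padded to a common length of 3
mask = padding_mask([3, 1])
for row in mask:
    print(row)
```

With tensors, the same comparison is commonly written as `torch.arange(max_len)[None, :] < lengths[:, None]`, and the resulting mask can be used to exclude padded positions from a loss or pooling step.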

## Summary

You've learned how to:

- ✅ Load FASTQ files with `FastqStreamDataset`
- ✅ Filter sequences by quality and length
- ✅ Encode sequences as one-hot arrays
- ✅ Combine filters and transforms in a pipeline
- ✅ Use with PyTorch DataLoader for training

## Next Steps

Check out these notebooks for more advanced usage:

- **[PyTorch Training](pytorch_training.ipynb)**: Complete PyTorch integration examples
- **[Lightning Module](lightning_module.ipynb)**: PyTorch Lightning workflows  
- **[Transformers DNA](transformers_dna.ipynb)**: Hugging Face Transformers integration

Happy coding! 🧬