# FASTQ Loading and Sequence Encoding

This notebook demonstrates how to:
1. Load FASTQ files
2. Encode sequences for machine learning (one-hot, k-mer, integer)
3. Export to NumPy arrays for ML frameworks

## Installation

```bash
pip install deepbiop
```

In [None]:
import matplotlib.pyplot as plt
import numpy as np

import deepbiop as dbp
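
The cells below use hard-coded byte strings rather than reading a file. As a rough illustration of the loading step, here is a minimal pure-Python FASTQ reader. It is a sketch, not DeepBioP's loader: the `read_fastq` helper is hypothetical, and it assumes strict four-line records (no wrapped sequence lines, no trailing blank lines).

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality) byte tuples from a binary file handle.

    Hypothetical helper: assumes every record spans exactly four lines.
    """
    while True:
        record = [line.rstrip(b"\r\n") for line in islice(handle, 4)]
        if not record:  # end of file
            return
        if (
            len(record) != 4
            or not record[0].startswith(b"@")
            or not record[2].startswith(b"+")
        ):
            raise ValueError("malformed FASTQ record")
        name, seq, _, qual = record
        yield name[1:], seq, qual


# In-memory stand-in for open("reads.fastq", "rb")
demo = io.BytesIO(
    b"@read1\nACGTACGT\n+\nIIIIIIII\n"
    b"@read2\nTTGGCCAA\n+\nIIIIIIII\n"
)
fastq_records = list(read_fastq(demo))
fastq_sequences = [seq for _, seq, _ in fastq_records]
print(fastq_sequences)
```

The resulting list of byte strings has the same form as the `sequences` list used in the encoding cells.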

## 1. One-Hot Encoding for CNNs/RNNs

One-hot encoding represents each base as a binary vector:
- A = [1, 0, 0, 0]
- C = [0, 1, 0, 0]
- G = [0, 0, 1, 0]
- T = [0, 0, 0, 1]
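
To make the mapping above concrete, here is a minimal NumPy sketch of one-hot encoding a single sequence. The `one_hot` helper is hypothetical and is not how DeepBioP implements it. Mapping unknown bases such as `N` to an all-zero row is an assumption of this sketch; the notebook does not show how the encoder's `"skip"` option treats them.

```python
import numpy as np

BASES = b"ACGT"  # column order matches the vectors listed above


def one_hot(seq: bytes) -> np.ndarray:
    """Return a (len(seq), 4) float32 matrix; unknown bases give all-zero rows."""
    encoded = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for pos, base in enumerate(seq):
        col = BASES.find(bytes([base]))  # -1 when the base is not A/C/G/T
        if col >= 0:
            encoded[pos, col] = 1.0
    return encoded


demo_onehot = one_hot(b"ACGN")
print(demo_onehot)
```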

In [None]:
# Create encoder
encoder = dbp.OneHotEncoder("dna", "skip")

# Example sequences
sequences = [b"ACGTACGT", b"TTGGCCAA", b"AAAACCCC"]

# Encode batch
encoded = encoder.encode_batch(sequences)

print(f"Encoded shape: {encoded.shape}")  # (3, 8, 4) = [batch, seq_len, alphabet_size]
print(f"First sequence encoded:\n{encoded[0]}")

In [None]:
# Visualize one-hot encoding
plt.figure(figsize=(10, 3))
plt.imshow(encoded[0].T, cmap="YlGnBu", aspect="auto")
plt.yticks([0, 1, 2, 3], ["A", "C", "G", "T"])
plt.xlabel("Position")
plt.ylabel("Base")
plt.title("One-Hot Encoding: ACGTACGT")
plt.colorbar(label="Value")
plt.show()

## 2. K-mer Encoding for Feature-Based Models

K-mer encoding counts every overlapping subsequence of length k, producing a fixed-length feature vector regardless of read length.
This makes it a good fit for traditional ML models (Random Forest, SVM).
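
As an illustration of the counting idea, here is a small pure-Python sketch. The helper names are hypothetical, and the column layout (all 4^k k-mers in lexicographic order) is an assumption of this sketch, not a statement about DeepBioP's output. In canonical mode, each k-mer is counted under the lexicographically smaller of itself and its reverse complement.

```python
from itertools import product

import numpy as np

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_form(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_counts(seq: str, k: int = 3, canonical: bool = False) -> np.ndarray:
    """Count overlapping k-mers into a vector of length 4**k (lexicographic order)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(4**k, dtype=np.int64)
    for start in range(len(seq) - k + 1):
        kmer = seq[start : start + k]
        if kmer not in index:  # window contains a non-ACGT character
            continue
        if canonical:
            kmer = canonical_form(kmer)
        counts[index[kmer]] += 1
    return counts


plain = kmer_counts("ACGTACGT")
merged = kmer_counts("ACGTACGT", canonical=True)
print(np.count_nonzero(plain), np.count_nonzero(merged))
```

In this sketch the canonical vector keeps length 4^k and leaves the non-canonical columns at zero; a library may instead drop those columns.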

In [None]:
# Create k-mer encoder (k=3, canonical)
kmer_encoder = dbp.KmerEncoder(k=3, canonical=True, encoding_type="dna")

# Encode sequences
kmer_features = kmer_encoder.encode_batch(sequences)

print(f"K-mer features shape: {kmer_features.shape}")  # (3, 64) with one column per 4^3 3-mer; canonical=True may merge reverse-complement pairs
print(
    f"Number of unique 3-mers in first sequence: {np.count_nonzero(kmer_features[0])}"
)

In [None]:
# Visualize k-mer counts
plt.figure(figsize=(12, 4))
plt.bar(range(len(kmer_features[0])), kmer_features[0])
plt.xlabel("K-mer Index")
plt.ylabel("Count")
plt.title("3-mer Frequency Distribution")
plt.show()

## 3. Integer Encoding for Transformers/Embeddings

Integer encoding maps each base to an integer:
- A = 0, C = 1, G = 2, T = 3

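Embedding layers consume this form directly: each integer indexes a row of a learned embedding table, which makes it the natural input for transformer models.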
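
The mapping can be sketched with a NumPy lookup table. The `integer_encode` helper is hypothetical. Sending bytes outside `ACGT` to `-1` is an assumption of this sketch, not documented DeepBioP behavior.

```python
import numpy as np

# 256-entry table indexed by byte value; -1 marks bytes outside ACGT
LOOKUP = np.full(256, -1, dtype=np.int64)
for code, base in enumerate(b"ACGT"):
    LOOKUP[base] = code


def integer_encode(seq: bytes) -> np.ndarray:
    """Map each base to its integer ID with a single vectorized lookup."""
    return LOOKUP[np.frombuffer(seq, dtype=np.uint8)]


demo_ids = integer_encode(b"ACGTACGT")
print(demo_ids)
```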

In [None]:
# Create integer encoder
int_encoder = dbp.IntegerEncoder("dna")

# Encode sequences
int_encoded = int_encoder.encode_batch(sequences)

print(f"Integer encoded shape: {int_encoded.shape}")  # (3, 8)
print(f"First sequence: {sequences[0]}")
print(f"Encoded: {int_encoded[0]}")

## 4. Integration with PyTorch

DeepBioP encoders return NumPy arrays, so `torch.from_numpy` can wrap them as tensors. Cast with `.float()` or `.long()` to match the dtype your model expects.

In [None]:
import torch

# Convert to PyTorch tensors
tensor_onehot = torch.from_numpy(encoded).float()
tensor_int = torch.from_numpy(int_encoded).long()

print(f"PyTorch one-hot tensor shape: {tensor_onehot.shape}")
print(f"PyTorch integer tensor shape: {tensor_int.shape}")
print(f"PyTorch tensor dtype: {tensor_int.dtype}")
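
One detail matters when feeding the one-hot tensor to a convolution: `torch.nn.Conv1d` expects input shaped (batch, channels, length), while the one-hot array here is (batch, seq_len, alphabet_size). A minimal NumPy sketch of the axis swap, using a stand-in array rather than DeepBioP output:

```python
import numpy as np

# Stand-in one-hot batch for the sequence ACGTACGT, shape (1, 8, 4)
ids = np.array([[0, 1, 2, 3, 0, 1, 2, 3]])
onehot_demo = np.eye(4, dtype=np.float32)[ids]

# Move the alphabet axis to the channel position: (batch, 4, seq_len)
channels_first = np.ascontiguousarray(onehot_demo.transpose(0, 2, 1))

print(onehot_demo.shape, "->", channels_first.shape)
```

The transposed array can then be passed to `torch.from_numpy` in the same way as in the cell above.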

## 5. Integration with HuggingFace Transformers

Use integer encoding with special tokens for transformer models.

In [None]:
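# Reserve IDs for special tokens; bases keep their IDs 0-3 (A, C, G, T),
# so no offset is needed
CLS_ID, SEP_ID, PAD_ID = 4, 5, 6

base_ids = torch.from_numpy(int_encoded).long()
batch_size = base_ids.shape[0]

# Wrap each sequence as [CLS] bases [SEP]
cls_col = torch.full((batch_size, 1), CLS_ID, dtype=torch.long)
sep_col = torch.full((batch_size, 1), SEP_ID, dtype=torch.long)
input_ids = torch.cat([cls_col, base_ids, sep_col], dim=1)

# Create attention mask (1 for real tokens, 0 for padding)
attention_mask = (input_ids != PAD_ID).long()

print(f"Input IDs: {input_ids[0]}")
print(f"Attention mask: {attention_mask[0]}")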
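
The example batch contains equal-length sequences, so no PAD tokens appear and the mask is all ones. Real reads usually vary in length. Below is a minimal NumPy sketch of padding such a batch, assuming the special-token IDs named above (CLS=4, SEP=5, PAD=6) and bases encoded as 0-3. The `pad_batch` helper is hypothetical.

```python
import numpy as np

CLS_ID, SEP_ID, PAD_ID = 4, 5, 6  # assumed special-token IDs; bases use 0-3


def pad_batch(encoded_seqs):
    """Wrap each sequence as [CLS] ... [SEP], then right-pad to a common length."""
    max_len = max(len(seq) for seq in encoded_seqs) + 2  # room for CLS and SEP
    input_ids = np.full((len(encoded_seqs), max_len), PAD_ID, dtype=np.int64)
    for row, seq in enumerate(encoded_seqs):
        tokens = [CLS_ID, *seq, SEP_ID]
        input_ids[row, : len(tokens)] = tokens
    attention_mask = (input_ids != PAD_ID).astype(np.int64)
    return input_ids, attention_mask


padded_ids, padded_mask = pad_batch([[0, 1, 2, 3], [3, 3], [0, 0, 0, 1, 1]])
print(padded_ids)
print(padded_mask)
```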

## 6. Export to Files

Save encoded sequences for later use.

In [None]:
# Export to NumPy files
dbp.utils.export_to_numpy_int("sequences_int.npy", sequences)
dbp.utils.export_to_numpy_onehot("sequences_onehot.npy", sequences)

# Load back
loaded_int = np.load("sequences_int.npy")
loaded_onehot = np.load("sequences_onehot.npy")

print(f"Loaded integer encoding: {loaded_int.shape}")
print(f"Loaded one-hot encoding: {loaded_onehot.shape}")
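
When the two encodings belong together, NumPy's own `np.savez_compressed` can bundle them into a single archive. The sketch below uses plain NumPy with stand-in arrays and is independent of the `dbp.utils` exporters. The file name and array keys are illustrative.

```python
import os
import tempfile

import numpy as np

# Stand-in arrays in place of real encoder output
int_demo = np.array([[0, 1, 2, 3], [3, 2, 1, 0]], dtype=np.int64)
onehot_demo = np.eye(4, dtype=np.float32)[int_demo]  # shape (2, 4, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sequences_encoded.npz")
    # Store both arrays under named keys in one compressed file
    np.savez_compressed(path, integer=int_demo, onehot=onehot_demo)
    with np.load(path) as archive:
        restored_int = archive["integer"]
        restored_onehot = archive["onehot"]

print(restored_int.shape, restored_onehot.shape)
```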

## Summary

DeepBioP provides three encoding schemes optimized for different ML architectures:

| Encoding | Shape | Use Case | Models |
|----------|-------|----------|--------|
| One-Hot | (N, L, 4) | Spatial features | CNN, RNN, LSTM |
| K-mer | (N, 4^k) | Feature vectors | Random Forest, SVM |
| Integer | (N, L) | Token sequences | Transformers, Embeddings |

All encodings:
- ✅ Zero-copy NumPy arrays
- ✅ Batch processing support
- ✅ PyTorch/TensorFlow compatible
- ✅ HuggingFace Transformers ready