# 🎯 Cancer Detection with Fragmentomics

This notebook demonstrates how cfDNA fragmentation patterns differ between healthy and cancer samples, and how to use FragMentor for classification.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from fragmentomics import analyze_sizes, plot_size_comparison
from fragmentomics.viz import plot_size_distribution

## The Biology: Why Fragment Sizes Matter

Cell-free DNA (cfDNA) in blood is released mainly by dying cells. Its fragment size distribution reflects:

1. **Nucleosome wrapping**: DNA wraps around histones in ~147bp units
2. **Linker DNA**: ~20bp between nucleosomes, giving ~167bp mononucleosome size
3. **Tissue of origin**: Different tissues have different chromatin accessibility

**Cancer cells have altered chromatin structure**, leading to:
- More short fragments (<150bp)
- Weaker nucleosomal pattern
- Different end motif profiles
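
The first two effects can be measured directly from an array of fragment lengths. The sketch below is a minimal NumPy illustration, not FragMentor's implementation: the helper `size_fractions` is hypothetical, and the cutoffs (<150bp for short fragments, 140-180bp for mononucleosomal fragments) are assumed from the metric labels used later in this notebook.

```python
import numpy as np

def size_fractions(sizes, short_max=150, mono_range=(140, 180)):
    """Return (short_fraction, mono_fraction) for fragment lengths in bp.

    Hypothetical helper for illustration; the library's own
    definitions of these ratios may differ.
    """
    sizes = np.asarray(sizes)
    if sizes.size == 0:
        raise ValueError("no fragments given")
    # Fraction of fragments strictly shorter than the cutoff
    short = np.mean(sizes < short_max)
    # Fraction of fragments inside the (inclusive) mononucleosome window
    mono = np.mean((sizes >= mono_range[0]) & (sizes <= mono_range[1]))
    return float(short), float(mono)

# Toy input: ten fragment lengths in bp
toy = np.array([80, 120, 145, 150, 160, 167, 170, 180, 200, 330])
short_frac, mono_frac = size_fractions(toy)
print(f"short: {short_frac:.2f}, mono: {mono_frac:.2f}")
```

Note that the two windows overlap (140-149bp counts toward both), so the fractions need not sum to at most 1.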

In [None]:
# Simulate healthy cfDNA
np.random.seed(42)
healthy_sizes = np.concatenate([
    np.random.normal(167, 15, size=8000),  # Strong mononucleosome peak
    np.random.normal(334, 20, size=1500),  # Dinucleosome
    np.random.normal(100, 20, size=500),   # Some short fragments
]).astype(np.int32)
healthy_sizes = healthy_sizes[(healthy_sizes >= 50) & (healthy_sizes <= 500)]

# Simulate cancer cfDNA (more short fragments, weaker pattern)
np.random.seed(123)
cancer_sizes = np.concatenate([
    np.random.normal(167, 25, size=5000),  # Weaker mononucleosome
    np.random.normal(334, 30, size=800),   # Weaker dinucleosome
    np.random.normal(100, 25, size=3000),  # More short fragments
    np.random.normal(70, 15, size=1200),   # Very short tumor-derived
]).astype(np.int32)
cancer_sizes = cancer_sizes[(cancer_sizes >= 50) & (cancer_sizes <= 500)]

print(f"Healthy: {len(healthy_sizes):,} fragments")
print(f"Cancer: {len(cancer_sizes):,} fragments")

In [None]:
# Analyze both samples
healthy_dist = analyze_sizes(healthy_sizes)
cancer_dist = analyze_sizes(cancer_sizes)

print("=" * 50)
print("HEALTHY SAMPLE")
print("=" * 50)
print(healthy_dist.summary())

print("\n" + "=" * 50)
print("CANCER SAMPLE")
print("=" * 50)
print(cancer_dist.summary())

In [None]:
# Compare the distributions visually
fig, ax = plot_size_comparison(
    [healthy_dist, cancer_dist],
    labels=["Healthy", "Cancer"],
    colors=["#2ecc71", "#e74c3c"],
)
plt.title("Fragment Size Distribution: Healthy vs Cancer")
plt.show()

## Key Features for Classification

The most discriminative features between healthy and cancer samples:

In [None]:
# Compare key metrics
metrics = {
    "Short fragment ratio (<150bp)": (healthy_dist.ratio_short, cancer_dist.ratio_short),
    "Mononucleosome ratio (140-180bp)": (healthy_dist.ratio_mono, cancer_dist.ratio_mono),
    "Mean fragment size": (healthy_dist.mean, cancer_dist.mean),
    "10bp periodicity score": (healthy_dist.periodicity_10bp, cancer_dist.periodicity_10bp),
}

print(f"{'Metric':<35} {'Healthy':>12} {'Cancer':>12} {'Diff':>10}")
print("=" * 75)
for metric, (h, c) in metrics.items():
    diff = c - h
    print(f"{metric:<35} {h:>12.3f} {c:>12.3f} {diff:>+10.3f}")
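
The 10bp periodicity score summarizes how strongly fragment counts ripple with a roughly 10bp spacing below the mononucleosome peak, a pattern commonly attributed to the helical pitch of DNA on the nucleosome surface. How FragMentor computes `periodicity_10bp` is not shown in this notebook. The sketch below is one plausible approach under stated assumptions: the helper `periodicity_score`, the 100-160bp window, and the quadratic detrending are all illustrative choices, not the library's method.

```python
import numpy as np

def periodicity_score(sizes, lo=100, hi=160, period=10.0):
    """Share of detrended size-histogram power at the given period (0 to 1).

    Hypothetical helper; not necessarily how `periodicity_10bp` is computed.
    """
    sizes = np.asarray(sizes, dtype=int)
    window = sizes[(sizes >= lo) & (sizes < hi)]
    counts = np.bincount(window - lo, minlength=hi - lo).astype(float)

    # Remove the slow trend with a quadratic fit so only the ripple remains
    x = np.arange(hi - lo)
    residual = counts - np.polyval(np.polyfit(x, counts, 2), x)

    power = np.abs(np.fft.rfft(residual)) ** 2
    freqs = np.fft.rfftfreq(hi - lo, d=1.0)       # cycles per bp
    k = np.argmin(np.abs(freqs - 1.0 / period))   # bin closest to 1/period
    total = power[1:].sum()                       # skip the DC term
    return float(power[k] / total) if total > 0 else 0.0

# Histogram with an explicit 10bp ripple, expanded back into fragment lengths
lengths = np.arange(100, 160)
ripple = np.round(200 + 80 * np.cos(2 * np.pi * (lengths - 100) / 10)).astype(int)
periodic_sizes = np.repeat(lengths, ripple)

# Smooth distribution with no built-in ripple
rng = np.random.default_rng(0)
smooth_sizes = rng.normal(130, 30, size=20000).astype(int)

score_periodic = periodicity_score(periodic_sizes)
score_smooth = periodicity_score(smooth_sizes)
print(f"periodic: {score_periodic:.2f}, smooth: {score_smooth:.2f}")
```

The rippled histogram should score close to 1 and the smooth one much lower. The Gaussian mixtures simulated earlier in this notebook contain no such ripple, so a score of this kind would mostly reflect sampling noise on them.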

## Building a Simple Classifier

Using scikit-learn with fragmentomics features:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Generate multiple samples for training
def generate_sample(is_cancer=False, n_frags=5000):
    # Reseed from OS entropy so each call draws a different sample (results are not reproducible)
    np.random.seed()
    if is_cancer:
        sizes = np.concatenate([
            np.random.normal(167, 25, size=int(n_frags * 0.5)),
            np.random.normal(100, 25, size=int(n_frags * 0.35)),
            np.random.normal(70, 15, size=int(n_frags * 0.15)),
        ]).astype(np.int32)
    else:
        sizes = np.concatenate([
            np.random.normal(167, 15, size=int(n_frags * 0.8)),
            np.random.normal(334, 20, size=int(n_frags * 0.15)),
            np.random.normal(100, 20, size=int(n_frags * 0.05)),
        ]).astype(np.int32)
    sizes = sizes[(sizes >= 50) & (sizes <= 500)]
    return analyze_sizes(sizes)

# Generate training data
n_samples = 50
X = []
y = []

for _ in range(n_samples):
    # Healthy sample
    dist = generate_sample(is_cancer=False)
    X.append([dist.ratio_short, dist.ratio_mono, dist.mean, dist.periodicity_10bp])
    y.append(0)
    
    # Cancer sample
    dist = generate_sample(is_cancer=True)
    X.append([dist.ratio_short, dist.ratio_mono, dist.mean, dist.periodicity_10bp])
    y.append(1)

X = np.array(X)
y = np.array(y)

print(f"Training set: {len(X)} samples, {X.shape[1]} features")

In [None]:
# Train and evaluate
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)

print(f"Cross-validation accuracy: {scores.mean():.1%} (+/- {scores.std() * 2:.1%})")

In [None]:
# Feature importance
clf.fit(X, y)
feature_names = ["Short ratio", "Mono ratio", "Mean size", "10bp periodicity"]

plt.figure(figsize=(8, 4))
plt.barh(feature_names, clf.feature_importances_, color="steelblue")
plt.xlabel("Feature Importance")
plt.title("Most Important Features for Cancer Detection")
plt.tight_layout()
plt.show()

## Next Steps

For real cancer detection:

1. **Use real data**: Apply to actual WGS cfDNA BAM files
2. **Add more features**: End motifs, GC correction, WPS
3. **Validate carefully**: Use proper train/test splits
4. **Consider confounders**: Age, sex, sample handling
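
As a sketch of the validation step, the example below holds out a stratified test set and reports ROC AUC on it with scikit-learn. The two feature columns are synthetic stand-ins with made-up means and spreads, used only so the block runs on its own. With real samples, the split should also keep all samples from one patient on the same side.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class = 100

# Synthetic stand-ins for (short-fragment ratio, mean size); values are invented
healthy = np.column_stack([rng.normal(0.15, 0.05, n_per_class),
                           rng.normal(175, 8, n_per_class)])
cancer = np.column_stack([rng.normal(0.30, 0.05, n_per_class),
                          rng.normal(160, 8, n_per_class)])
X_demo = np.vstack([healthy, cancer])
y_demo = np.array([0] * n_per_class + [1] * n_per_class)

# Hold out 30% of samples, keeping the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score on data the model never saw during fitting
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Held-out ROC AUC: {auc:.3f}")
```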

See the FragMentor documentation for more advanced analysis.