@trvrb trvrb commented Jan 20, 2026

Summary

  • Add scripts/train_test_split.py that marks entire clades as "test" data in Auspice JSON files
  • Add train_test_split rule to Snakefile workflow (runs after download_auspice_json)

Motivation

Root-to-tip trajectories share most of their evolutionary path with one another, so splitting at the level of individual trajectories would not produce truly out-of-sample test data. Marking entire clades as "test" ensures that test data consists of complete evolutionary lineages that are separate from the training data.

Algorithm

  1. Load Auspice JSON and build helper data structures (node map, parent map)
  2. Randomly select a seed tip from available (unmarked) tips
  3. Walk back from the seed tip toward the root, counting mutations on each branch
  4. When accumulated mutations >= mutations_back, use that ancestor as the clade root
  5. Check clade size: if the ancestor's clade contains more than max_clade_proportion of all tips, skip this seed tip and try another (this prevents selecting huge clades such as "all of Omicron")
  6. Mark all descendants of the ancestor as "test"
  7. Repeat steps 2-6 until target proportion of tips are marked as test
  8. Add train_test coloring to meta.colorings with "train" (blue) and "test" (red)
  9. Add train_test: {value: "train"|"test"} to each node's node_attrs
  10. Write modified Auspice JSON
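
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/train_test_split.py: the function names (split_by_clade, iter_nodes, is_tip, branch_mutation_count) are hypothetical, the hex colors are stand-ins for "blue" and "red", and drawing seed tips by shuffling the tip list once is an assumption about how "randomly select" might be done. It also assumes mutations live under branch_attrs.mutations[gene] as lists of strings.

```python
import random


def iter_nodes(node):
    """Yield every node in the subtree rooted at `node`."""
    stack = [node]
    while stack:
        current = stack.pop()
        yield current
        stack.extend(current.get("children", []))


def is_tip(node):
    return not node.get("children")


def branch_mutation_count(node, gene="nuc"):
    """Number of mutations on the branch leading to `node`."""
    mutations = node.get("branch_attrs", {}).get("mutations", {})
    return len(mutations.get(gene, []))


def split_by_clade(auspice, test_proportion=0.1, mutations_back=5,
                   max_clade_proportion=0.01, gene="nuc", seed=None):
    """Mark whole clades as "test" in place and return the modified dict."""
    tree = auspice["tree"]
    rng = random.Random(seed)

    # Step 1: parent map, keyed by id() so node names need not be unique.
    parent = {}
    for node in iter_nodes(tree):
        for child in node.get("children", []):
            parent[id(child)] = node

    tips = [n for n in iter_nodes(tree) if is_tip(n)]
    target = test_proportion * len(tips)
    max_clade_tips = max_clade_proportion * len(tips)
    test_ids = set()  # id() of every node marked as test
    n_test_tips = 0

    # Step 2: a shuffled pass draws seed tips at random without replacement
    # (an assumption of this sketch).
    rng.shuffle(tips)
    for seed_tip in tips:
        if n_test_tips >= target:  # Step 7: stop once the target is reached
            break
        if id(seed_tip) in test_ids:
            continue
        # Steps 3-4: walk toward the root until enough mutations accumulate.
        ancestor, walked = seed_tip, 0
        while walked < mutations_back and id(ancestor) in parent:
            walked += branch_mutation_count(ancestor, gene)
            ancestor = parent[id(ancestor)]
        # Step 5: skip seeds whose clade would be too large.
        clade = list(iter_nodes(ancestor))
        if sum(is_tip(n) for n in clade) > max_clade_tips:
            continue
        # Step 6: mark the ancestor and all of its descendants as test.
        for n in clade:
            if is_tip(n) and id(n) not in test_ids:
                n_test_tips += 1
            test_ids.add(id(n))

    # Step 9: label every node.
    for n in iter_nodes(tree):
        label = "test" if id(n) in test_ids else "train"
        n.setdefault("node_attrs", {})["train_test"] = {"value": label}

    # Step 8: register the coloring (hex values are illustrative).
    auspice.setdefault("meta", {}).setdefault("colorings", []).append({
        "key": "train_test",
        "title": "Train/Test Split",
        "type": "categorical",
        "scale": [["train", "#1f77b4"], ["test", "#d62728"]],
    })
    return auspice
```

In this sketch a seed tip that reaches the root before accumulating enough mutations ends up with the whole tree as its clade, so it is rejected by the size check on any tree of realistic size.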

CLI Interface

python scripts/train_test_split.py \
    --json INPUT.json \
    --output OUTPUT.json \
    --test-proportion 0.1 \
    --mutations-back 5 \
    --max-clade-proportion 0.01 \
    --gene nuc \
    --seed 42

| Argument | Default | Description |
| --- | --- | --- |
| --json | required | Input Auspice JSON file |
| --output | required | Output Auspice JSON file |
| --test-proportion | 0.1 | Target proportion of tips as test (0.0-1.0) |
| --mutations-back | 5 | Mutations to walk back from seed tip |
| --max-clade-proportion | 0.01 | Max size of any single test clade as proportion of total tips |
| --gene | "nuc" | Gene key for counting mutations |
| --seed | None | Random seed for reproducibility |
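
For reference, an argparse parser matching the arguments listed above might look like the sketch below. The flag names, defaults, and descriptions come from this PR description; the function name build_parser, the description string, and the choice of type= converters are assumptions and may differ from the actual script.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented CLI (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Mark entire clades as test data in an Auspice JSON file."
    )
    parser.add_argument("--json", required=True,
                        help="Input Auspice JSON file")
    parser.add_argument("--output", required=True,
                        help="Output Auspice JSON file")
    parser.add_argument("--test-proportion", type=float, default=0.1,
                        help="Target proportion of tips as test (0.0-1.0)")
    parser.add_argument("--mutations-back", type=int, default=5,
                        help="Mutations to walk back from seed tip")
    parser.add_argument("--max-clade-proportion", type=float, default=0.01,
                        help="Max size of any single test clade as "
                             "proportion of total tips")
    parser.add_argument("--gene", default="nuc",
                        help="Gene key for counting mutations")
    parser.add_argument("--seed", type=int, default=None,
                        help="Random seed for reproducibility")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

argparse converts dashes in flag names to underscores, so the values are read as args.test_proportion, args.mutations_back, and args.max_clade_proportion.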

Test plan

  • Run on test dataset and verify ~10% of tips marked as test
  • View in Auspice - color by "Train/Test Split" should show contiguous red (test) clades
  • Verify multiple distinct test clades exist (not just one giant clade)
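
The first and third checks could also be done without opening Auspice. The sketch below is one possible way, assuming every node carries node_attrs.train_test.value as described above; summarize_split is a hypothetical helper, not part of this PR. It counts a "test clade" as any test node whose parent is not test.

```python
def summarize_split(tree):
    """Return tip counts and the number of distinct test clades."""
    n_tips = n_test_tips = n_test_clades = 0
    stack = [(tree, "train")]  # (node, label of its parent)
    while stack:
        node, parent_label = stack.pop()
        label = (node.get("node_attrs", {})
                     .get("train_test", {})
                     .get("value", "train"))
        # A test node under a non-test parent is the root of a test clade.
        if label == "test" and parent_label != "test":
            n_test_clades += 1
        children = node.get("children", [])
        if not children:
            n_tips += 1
            if label == "test":
                n_test_tips += 1
        stack.extend((child, label) for child in children)
    return {
        "tips": n_tips,
        "test_tips": n_test_tips,
        "test_fraction": n_test_tips / n_tips if n_tips else 0.0,
        "test_clades": n_test_clades,
    }
```

Calling summarize_split(data["tree"]) on the loaded output JSON should report a test_fraction near the requested --test-proportion and a test_clades value greater than 1 if the split behaves as intended.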

🤖 Generated with Claude Code

New script marks entire clades as "test" data in Auspice JSON files,
enabling proper train/test splits where test data is truly out-of-sample.

Algorithm: randomly select seed tips, walk back N mutations to find
clade ancestors, mark all descendants as test. Skips oversized clades
to prevent selecting huge portions of the tree.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

trvrb commented Jan 20, 2026

With default parameters we're getting what looks like appropriate test clade sizes.

Here's spike-xs
Screenshot 2026-01-19 at 9 10 28 PM

Here's cytb-xs
Screenshot 2026-01-19 at 9 11 12 PM


trvrb commented Jan 20, 2026

Plan: Propagate Train/Test Split to Computed Trajectories

Key Insight

A trajectory is root-to-tip. For a test tip, the full path includes training ancestors:

root(train) → ... → internal(train) → boundary(test) → ... → tip(test)
  • Train trajectory: Full root-to-tip path
  • Test trajectory: Truncated to start at first test node (boundary-to-tip)

This ensures test trajectories don't "leak" training data.
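
As a rough illustration of the truncation rule, assuming a trajectory is available as a root-to-tip list of node names plus a name-to-label mapping (both hypothetical representations; trajectory.py may store paths differently):

```python
def truncate_to_test_boundary(path, labels):
    """Return the part of a root-to-tip path that belongs in the output.

    path:   list of node names ordered from root to tip
    labels: dict mapping node name to "train" or "test"
    """
    if labels[path[-1]] != "test":
        # Train tip: keep the full root-to-tip path.
        return list(path)
    # Test tip: drop training ancestors, start at the first test node.
    for index, name in enumerate(path):
        if labels[name] == "test":
            return list(path[index:])
    return list(path)  # unreachable when the tip itself is labeled test


labels = {"root": "train", "n1": "train", "n2": "test",
          "tipA": "test", "tipB": "train"}
print(truncate_to_test_boundary(["root", "n1", "n2", "tipA"], labels))
print(truncate_to_test_boundary(["root", "n1", "tipB"], labels))
```

Because whole clades are marked as test, every node after the boundary on a test tip's path is also test, so a single scan for the first test node is enough.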

Data Flow Change

auspice.json (with node_attrs.train_test.value)
    ↓
branches.py → branches.tsv (parent, child, hamming, train_test)  ← NEW column
    ↓
trajectory.py → results/{analysis}/train/*.fasta    ← full root-to-tip
              → results/{analysis}/test/*.fasta     ← truncated
    ↓
package.py → trajectories-train-*.tar.zst
           → trajectories-test-*.tar.zst

Files to Modify

  1. branches.py: Add train_test column to branches.tsv (extract from node.node_attrs)
  2. trajectory.py:
    • Parse train_test column
    • Create train/ and test/ subdirectories
    • For test tips: truncate path to start at test boundary
  3. package.py: Create separate train/test shards
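
A sketch of what the new branches.tsv column could look like is below. It is not the real branches.py: the helper names are hypothetical, the hamming value is approximated here by the number of mutations listed on each branch (the actual script may compute it from sequences), and a node without a label is treated as "train" by assumption.

```python
import csv

FIELDS = ["parent", "child", "hamming", "train_test"]


def branch_rows(tree, gene="nuc"):
    """Yield one row per parent-child branch of an Auspice tree."""
    stack = [tree]
    while stack:
        parent = stack.pop()
        for child in parent.get("children", []):
            mutations = (child.get("branch_attrs", {})
                              .get("mutations", {})
                              .get(gene, []))
            label = (child.get("node_attrs", {})
                          .get("train_test", {})
                          .get("value", "train"))  # assumed default
            yield {
                "parent": parent["name"],
                "child": child["name"],
                "hamming": len(mutations),  # stand-in for the real distance
                "train_test": label,
            }
            stack.append(child)


def write_branches_tsv(tree, handle, gene="nuc"):
    """Write the rows as tab-separated values with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(branch_rows(tree, gene))
```

The label is taken from the child node, so the first branch that enters a test clade is itself marked test, which is the boundary trajectory.py needs for truncation.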

trvrb and others added 2 commits January 19, 2026 23:02
When calculating walk-back distance for train/test splitting, only count
mutations within the specified trim region. This ensures the split uses
the same genomic region as the downstream analysis (e.g., S1 for spike).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
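
One way the region-restricted count described in this commit might work is sketched below. It assumes nucleotide mutations are strings such as "C241T" (reference state, 1-based position, new state) and that the trim region is given as an inclusive (start, end) pair; the function name and that representation are assumptions, not taken from the script.

```python
import re

# One non-digit state, a position, one non-digit state, e.g. "C241T".
MUTATION_PATTERN = re.compile(r"\D(\d+)\D")


def count_mutations_in_region(mutations, start, end):
    """Count mutations whose position lies within [start, end]."""
    count = 0
    for mutation in mutations:
        match = MUTATION_PATTERN.fullmatch(mutation)
        if match and start <= int(match.group(1)) <= end:
            count += 1
    return count


# Only the mutation at position 50 falls inside the region 10-100.
print(count_mutations_in_region(["A5T", "C50G", "G120A"], 10, 100))
```

Strings that do not match the pattern are ignored rather than raising, which is a design choice of this sketch.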
- branches.py: Extract train_test label from node_attrs and add to TSV output
- trajectory.py: Parse train_test column, create train/test subdirectories,
  truncate test trajectories to start from test boundary node
- package.py: Detect train/test subdirectories and create separate shards
  (trajectories-train-*.tar.zst, trajectories-test-*.tar.zst)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>