Add train/test split script for phylogenetic trees #3

trvrb · 2026-01-20T00:36:29Z

Summary

Add scripts/train_test_split.py that marks entire clades as "test" data in Auspice JSON files
Add train_test_split rule to Snakefile workflow (runs after download_auspice_json)

Motivation

Individual trajectories largely overlap (sharing evolutionary paths), so splitting individual trajectories wouldn't create truly out-of-sample test data. By marking entire clades as "test", we ensure test data represents complete evolutionary lineages that are separate from training data.

Algorithm

1. Load Auspice JSON and build helper data structures (node map, parent map)
2. Randomly select a seed tip from available (unmarked) tips
3. Walk back from the seed tip toward the root, counting mutations on each branch
4. When accumulated mutations >= mutations_back, use that ancestor as the clade root
5. Check clade size: If ancestor's clade contains > max_clade_proportion of total tips, skip this seed tip and try another (prevents selecting huge clades like "all of Omicron")
6. Mark all descendants of the ancestor as "test"
7. Repeat steps 2-6 until target proportion of tips are marked as test
8. Add train_test coloring to meta.colorings with "train" (blue) and "test" (red)
9. Add train_test: {value: "train"|"test"} to each node's node_attrs
10. Write modified Auspice JSON
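
The steps above can be sketched in a few functions. This is a minimal illustration, not the code in scripts/train_test_split.py: the function names (`iter_nodes`, `branch_mutations`, `mark_test_clades`, `add_train_test_coloring`) and the hex colors are assumptions, and it visits seed tips in one shuffled pass rather than re-sampling, which assumes a skipped seed would be skipped again on retry.

```python
import random


def iter_nodes(node):
    """Yield every node in the subtree rooted at `node`."""
    stack = [node]
    while stack:
        current = stack.pop()
        yield current
        stack.extend(current.get("children", []))


def branch_mutations(node, gene):
    """Number of mutations on the branch leading to `node` for one gene key."""
    return len(node.get("branch_attrs", {}).get("mutations", {}).get(gene, []))


def mark_test_clades(tree, test_proportion=0.1, mutations_back=5,
                     max_clade_proportion=0.01, gene="nuc", seed=None):
    """Label every node "train", then flip whole clades to "test"."""
    rng = random.Random(seed)
    nodes, parent = {}, {}
    for node in iter_nodes(tree):
        nodes[node["name"]] = node
        node.setdefault("node_attrs", {})["train_test"] = {"value": "train"}
        for child in node.get("children", []):
            parent[child["name"]] = node

    tip_names = [name for name, n in nodes.items() if not n.get("children")]
    n_tips = len(tip_names)
    rng.shuffle(tip_names)  # random seed-tip order
    n_test = 0
    for name in tip_names:
        if n_test >= test_proportion * n_tips:
            break
        if nodes[name]["node_attrs"]["train_test"]["value"] == "test":
            continue  # tip already sits inside a test clade
        # Walk toward the root until enough mutations have accumulated.
        ancestor, walked = nodes[name], 0
        while walked < mutations_back and ancestor["name"] in parent:
            walked += branch_mutations(ancestor, gene)
            ancestor = parent[ancestor["name"]]
        clade = list(iter_nodes(ancestor))
        clade_tips = [n for n in clade if not n.get("children")]
        if len(clade_tips) > max_clade_proportion * n_tips:
            continue  # oversized clade, try the next seed tip
        for n in clade:
            is_tip = not n.get("children")
            if is_tip and n["node_attrs"]["train_test"]["value"] == "train":
                n_test += 1
            n["node_attrs"]["train_test"] = {"value": "test"}
    return tree


def add_train_test_coloring(meta):
    """Register a categorical coloring; the hex values are placeholders."""
    meta.setdefault("colorings", []).append({
        "key": "train_test",
        "title": "Train/Test Split",
        "type": "categorical",
        "scale": [["train", "#4C78A8"], ["test", "#E45756"]],
    })
```

Because a clade is marked as a unit, the final test proportion can overshoot the target by up to one clade's worth of tips.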

CLI Interface

python scripts/train_test_split.py \
    --json INPUT.json \
    --output OUTPUT.json \
    --test-proportion 0.1 \
    --mutations-back 5 \
    --max-clade-proportion 0.01 \
    --gene nuc \
    --seed 42

| Argument | Default | Description |
|---|---|---|
| `--json` | required | Input Auspice JSON file |
| `--output` | required | Output Auspice JSON file |
| `--test-proportion` | 0.1 | Target proportion of tips as test (0.0-1.0) |
| `--mutations-back` | 5 | Mutations to walk back from seed tip |
| `--max-clade-proportion` | 0.01 | Max size of any single test clade as proportion of total tips |
| `--gene` | "nuc" | Gene key for counting mutations |
| `--seed` | None | Random seed for reproducibility |
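
For reference, the argument table maps onto `argparse` roughly as follows. The flag names and defaults come from the table; the `build_parser` name, the help strings, and the type choices (`float`, `int`) are assumptions about how the script might declare them.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Mark entire clades as test data in an Auspice JSON file."
    )
    parser.add_argument("--json", required=True, help="Input Auspice JSON file")
    parser.add_argument("--output", required=True, help="Output Auspice JSON file")
    parser.add_argument("--test-proportion", type=float, default=0.1,
                        help="Target proportion of tips as test (0.0-1.0)")
    parser.add_argument("--mutations-back", type=int, default=5,
                        help="Mutations to walk back from seed tip")
    parser.add_argument("--max-clade-proportion", type=float, default=0.01,
                        help="Max size of any single test clade")
    parser.add_argument("--gene", default="nuc",
                        help="Gene key for counting mutations")
    parser.add_argument("--seed", type=int, default=None,
                        help="Random seed for reproducibility")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.json, args.output, args.test_proportion)
```

`argparse` converts dashes in flag names to underscores, so `--mutations-back` is read as `args.mutations_back`.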

Test plan

Run on test dataset and verify ~10% of tips marked as test
View in Auspice - color by "Train/Test Split" should show contiguous red (test) clades
Verify multiple distinct test clades exist (not just one giant clade)
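
The first and third checks can also be scripted rather than eyeballed. The helper below is a hypothetical sketch (`summarize_split` is not part of the PR): it assumes every node carries `node_attrs.train_test.value`, counts a test clade wherever a test node hangs off a train parent, and flags any train node found under a test parent, which would indicate a non-contiguous clade.

```python
def summarize_split(tree):
    """Summarize a labelled Auspice tree (the dict under the "tree" key)."""
    n_tips = n_test_tips = n_test_clades = n_train_under_test = 0
    stack = [(tree, "train")]  # (node, label of its parent); root has none
    while stack:
        node, parent_label = stack.pop()
        label = node["node_attrs"]["train_test"]["value"]
        if label == "test" and parent_label == "train":
            n_test_clades += 1  # boundary node: a new test clade starts here
        if label == "train" and parent_label == "test":
            n_train_under_test += 1  # should never happen for whole clades
        children = node.get("children", [])
        if not children:
            n_tips += 1
            if label == "test":
                n_test_tips += 1
        stack.extend((child, label) for child in children)
    return {
        "test_proportion": n_test_tips / n_tips,
        "test_clades": n_test_clades,
        "train_under_test": n_train_under_test,
    }
```

On a correctly split file, `test_proportion` should be near 0.1 with default settings, `test_clades` should be greater than one, and `train_under_test` should be zero.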

🤖 Generated with Claude Code

New script marks entire clades as "test" data in Auspice JSON files, enabling proper train/test splits where test data is truly out-of-sample.

Algorithm: randomly select seed tips, walk back N mutations to find clade ancestors, mark all descendants as test. Skips oversized clades to prevent selecting huge portions of the tree.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

trvrb · 2026-01-20T05:13:01Z

With default parameters we're getting what looks like appropriate test clade sizes.

Here's spike-xs

Here's cytb-xs

trvrb · 2026-01-20T05:31:42Z

Plan: Propagate Train/Test Split to Computed Trajectories

Key Insight

A trajectory is root-to-tip. For a test tip, the full path includes training ancestors:

root(train) → ... → internal(train) → boundary(test) → ... → tip(test)

Train trajectory: Full root-to-tip path
Test trajectory: Truncated to start at first test node (boundary-to-tip)

This ensures test trajectories don't "leak" training data.
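
As a small illustration of the truncation rule, assuming a path is represented as a root-to-tip list of `(name, label)` pairs (a hypothetical representation, not necessarily what trajectory.py uses):

```python
def truncate_test_path(path):
    """Return the path unchanged for a train tip, or cut at the test boundary.

    `path` is a root-to-tip list of (node_name, label) pairs, where label is
    "train" or "test". The tip is the last element.
    """
    if path[-1][1] != "test":
        return path  # train trajectory: keep the full root-to-tip path
    for index, (_, label) in enumerate(path):
        if label == "test":
            return path[index:]  # boundary-to-tip
    return path


full = [("root", "train"), ("n1", "train"), ("n2", "test"), ("tip", "test")]
print(truncate_test_path(full))
```

Since whole clades are marked, every node after the boundary is also a test node, so the truncated path contains no training nodes.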

Data Flow Change

auspice.json (with node_attrs.train_test.value)
    ↓
branches.py → branches.tsv (parent, child, hamming, train_test)  ← NEW column
    ↓
trajectory.py → results/{analysis}/train/*.fasta    ← full root-to-tip
              → results/{analysis}/test/*.fasta     ← truncated
    ↓
package.py → trajectories-train-*.tar.zst
           → trajectories-test-*.tar.zst

Files to Modify

branches.py: Add train_test column to branches.tsv (extract from node.node_attrs)
trajectory.py:
- Parse train_test column
- Create train/ and test/ subdirectories
- For test tips: truncate path to start at test boundary
package.py: Create separate train/test shards
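
One way the trajectory.py side could consume the new column is sketched below. It assumes the `train_test` value on each row describes the child node, that the root has no row of its own and counts as "train", and that tips are children that never appear as a parent; `labelled_paths` is a hypothetical helper, and truncation of test paths is left out of this sketch.

```python
import csv
import io


def labelled_paths(tsv_text):
    """Group root-to-tip paths by the tip's train/test label.

    Expects a header with parent, child, hamming and train_test columns.
    Returns {"train": [...], "test": [...]}, each path being a list of
    (node_name, label) pairs ordered from root to tip.
    """
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    parent_of = {row["child"]: row["parent"] for row in rows}
    label_of = {row["child"]: row["train_test"] for row in rows}
    internal = set(parent_of.values())

    paths = {"train": [], "test": []}
    for row in rows:
        tip = row["child"]
        if tip in internal:
            continue  # not a tip
        path, node = [], tip
        while node is not None:
            # The root has no row, so it falls back to "train" (assumption).
            path.append((node, label_of.get(node, "train")))
            node = parent_of.get(node)
        path.reverse()
        paths[label_of[tip]].append(path)
    return paths
```

The dictionary keys line up with the planned `train/` and `test/` subdirectories, so each group can be written out under the matching directory.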

When calculating walk-back distance for train/test splitting, only count mutations within the specified trim region. This ensures the split uses the same genomic region as the downstream analysis (e.g., S1 for spike).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
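
A region-restricted count might look like the sketch below. It assumes mutation strings take the form found in Auspice `branch_attrs.mutations` (one character, a position, one character, such as `C241T`) and that the trim region is given as 1-based inclusive `start` and `end` coordinates; the function name and parameters are hypothetical.

```python
import re

# One non-digit character, a position, one non-digit character, e.g. "C241T".
MUTATION_PATTERN = re.compile(r"\D(\d+)\D")


def count_mutations_in_region(mutations, start, end):
    """Count mutations whose position lies in [start, end] (assumed 1-based)."""
    count = 0
    for mutation in mutations:
        match = MUTATION_PATTERN.fullmatch(mutation)
        if match and start <= int(match.group(1)) <= end:
            count += 1
    return count


print(count_mutations_in_region(["A5T", "C100G", "-12A"], 1, 50))
```

Strings that do not match the pattern are ignored rather than raising, which keeps the walk-back robust to unexpected entries.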

- branches.py: Extract train_test label from node_attrs and add to TSV output
- trajectory.py: Parse train_test column, create train/test subdirectories, truncate test trajectories to start from test boundary node
- package.py: Detect train/test subdirectories and create separate shards (trajectories-train-*.tar.zst, trajectories-test-*.tar.zst)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

trvrb and others added 2 commits January 19, 2026 23:02

Fix issue with root sequence side car file

84331c0

trvrb force-pushed the train-test branch from 690b665 to 84331c0 on January 20, 2026 07:03
