Add train/test split script for phylogenetic trees #3


Summary
- Add `scripts/train_test_split.py`, which marks entire clades as "test" data in Auspice JSON files
- Add a `train_test_split` rule to the Snakefile workflow (runs after `download_auspice_json`)

Motivation
Individual trajectories largely overlap (sharing evolutionary paths), so splitting individual trajectories wouldn't create truly out-of-sample test data. By marking entire clades as "test", we ensure test data represents complete evolutionary lineages that are separate from training data.
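To make the clade-based split concrete, here is a minimal Python sketch of the idea, not the code in `scripts/train_test_split.py`. It assumes the Auspice v2 layout (`children`, `node_attrs`, `branch_attrs.mutations`), assumes that "mutations back" means walking from a seed tip toward the root until that many mutations have been crossed, and uses hypothetical function names, a hypothetical coloring title, and assumed hex values for blue and red.

```python
import random


def iter_nodes(node):
    """Yield every node in the subtree rooted at `node`."""
    stack = [node]
    while stack:
        current = stack.pop()
        yield current
        stack.extend(current.get("children", []))


def count_tips(node):
    """Number of leaves below (and including) `node`."""
    return sum(1 for n in iter_nodes(node) if not n.get("children"))


def n_mutations(node, gene):
    """Mutations on the branch leading to `node` for one gene (e.g. "nuc")."""
    return len(node.get("branch_attrs", {}).get("mutations", {}).get(gene, []))


def clade_root_for(tip, parents, gene, mutations_back):
    """Walk toward the root until `mutations_back` mutations are crossed (assumed rule)."""
    node, seen = tip, 0
    while id(node) in parents:  # the tree root has no entry in `parents`
        seen += n_mutations(node, gene)
        node = parents[id(node)]
        if seen >= mutations_back:
            break
    return node


def mark_test_clades(tree, test_proportion, mutations_back,
                     max_clade_proportion, gene="nuc", seed=None):
    """Label every node "train", then flip whole clades to "test"; return test tip count."""
    rng = random.Random(seed)
    parents = {id(c): n for n in iter_nodes(tree) for c in n.get("children", [])}
    tips = [n for n in iter_nodes(tree) if not n.get("children")]
    for n in iter_nodes(tree):
        n.setdefault("node_attrs", {})["train_test"] = {"value": "train"}

    n_test = 0
    seeds = tips[:]
    rng.shuffle(seeds)
    for tip in seeds:
        if n_test >= test_proportion * len(tips):
            break  # enough tips are already marked as test
        if tip["node_attrs"]["train_test"]["value"] == "test":
            continue  # this tip already sits inside a test clade
        root = clade_root_for(tip, parents, gene, mutations_back)
        if count_tips(root) > max_clade_proportion * len(tips):
            continue  # clade too large: skip this seed tip and try another
        for n in iter_nodes(root):
            is_tip = not n.get("children")
            if is_tip and n["node_attrs"]["train_test"]["value"] != "test":
                n_test += 1
            n["node_attrs"]["train_test"] = {"value": "test"}
    return n_test


def add_train_test_coloring(meta):
    """Append a categorical train/test coloring to meta.colorings."""
    meta.setdefault("colorings", []).append({
        "key": "train_test",
        "title": "Train/test split",  # hypothetical title
        "type": "categorical",
        "scale": [["train", "#0000FF"], ["test", "#FF0000"]],  # assumed hex values
    })
```

Because whole subtrees are flipped at once, a test tip never shares its most recent branches with a training tip, which is the property individual-trajectory splitting cannot give.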
Algorithm
- [...] `mutations_back`, use that ancestor as the clade root
- [...] `max_clade_proportion` of total tips, skip this seed tip and try another (prevents selecting huge clades like "all of Omicron")
- [...] `train_test` coloring to `meta.colorings` with "train" (blue) and "test" (red)
- [...] `train_test: {value: "train"|"test"}` to each node's `node_attrs`

CLI Interface
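As a rough illustration of this interface, the flags used in the invocation that follows could be parsed with `argparse` along these lines. This is a sketch, not the script's actual parser: the types are inferred from the example values, the help strings are assumptions, and every flag is marked required only because the real defaults are not shown here.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags of scripts/train_test_split.py."""
    parser = argparse.ArgumentParser(
        description="Mark whole clades of an Auspice JSON tree as test data."
    )
    parser.add_argument("--json", required=True, help="input Auspice JSON")
    parser.add_argument("--output", required=True, help="output Auspice JSON")
    # Help texts below are assumed readings of the flag names.
    parser.add_argument("--test-proportion", type=float, required=True,
                        help="target fraction of tips to mark as test")
    parser.add_argument("--mutations-back", type=int, required=True,
                        help="mutations to walk back from a seed tip")
    parser.add_argument("--max-clade-proportion", type=float, required=True,
                        help="largest allowed clade, as a fraction of all tips")
    parser.add_argument("--gene", required=True,
                        help="gene whose mutations are counted, e.g. nuc")
    parser.add_argument("--seed", type=int, required=True,
                        help="random seed for reproducible splits")
    return parser
```

argparse converts dashes to underscores, so `--test-proportion` is read back as `args.test_proportion`.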
    python scripts/train_test_split.py \
        --json INPUT.json \
        --output OUTPUT.json \
        --test-proportion 0.1 \
        --mutations-back 5 \
        --max-clade-proportion 0.01 \
        --gene nuc \
        --seed 42

Arguments: `--json`, `--output`, `--test-proportion`, `--mutations-back`, `--max-clade-proportion`, `--gene`, `--seed`

Test plan
🤖 Generated with Claude Code