Submit tasks as job arrays and fix RNAs in distill summaries#472
Submit tasks as job arrays and fix RNAs in distill summaries#472tpall wants to merge 92 commits intoWrightonLabCSU:devfrom
Conversation
…nctionality - Update groupby_column default value to "input_fasta" in distill.py - Adjust input paths in distill.nf for consistency - Enhance argument handling in SUMMARIZE process
…rid of FASTA_COLUMN environment variable
|
Hey @tpall Thanks for this. The job array addition it nice. There is a larger planned update to batch a lot of the inputs into singular jobs to reduce the burden on the queue, since running DRAM with lots of inputs can overwhelm a SLURM scheduler, but adding in job arrays, which weren't supported on the version of Nextflow we initially developed DRAM2 on, but we recently moved to >=24 (we should lock in >=24.04.0 since there are early 24.* prereleases out there if we add in job arrays). I will have to do some testing on utilizing job arrays and their implication. Because from my initial testing it seems like it stops the next stop from proceeding until their are enough inputs to fill an array. Which might be ok. But if we are going to be doing batching anyway, it might not be that important and not worth it. Also thanks for some of the other QoL updates like updating some of the syntax to DSL2 (Channel -> channel, etc.). I will have to more fully review the code, which I can get to in a couple weeks. I have deadline for next week, and probably won't be able to review much before then. But I will leave just a couple of quick thoughts. Thanks again |
| } | ||
|
|
||
| withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' { | ||
| array = params.array_size |
There was a problem hiding this comment.
I would like to support people running DRAM2 with local executor (such as on their own computer if they want), which doesn't support array. So the array should only be used with executors that support it.
| maxRetries = 2 | ||
| } | ||
|
|
||
| withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' { |
There was a problem hiding this comment.
jobs under DRAM:ANNOTATE:QC:COLLECT.* could also have a job array
| withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' { | ||
| array = params.array_size | ||
| } | ||
|
|
There was a problem hiding this comment.
this code here in base.config for the job array should probably be in modules.config
Added COLLECT_RNA steps to array
Adds a DECOMPRESS_FASTA module (bbtools reformat.sh in the existing bbmap container) and routes only .gz inputs through it via a branch on the fasta channel. Basename stripping is unified so sample.fa and sample.fa.gz produce the same downstream name, keeping outputs identical regardless of input compression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
It was already defined in nextflow.config (default 10) and consumed by conf/base.config, but absent from nextflow_schema.json, so runs emitted a schema-validation warning. Added alongside queue_size under Process Options. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the never-included trees module and all scripts only it referenced (parse_annotations.py, update_annots_trees.py, color_labels.R), plus update_tree.py which had no references at all. Also removes assets/trees/ refpkgs (only consumed by the dropped module) and the DRAM-v1 standalone DB setup scripts under assets/internal/ which were never wired into the DSL2 pipeline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both are unreferenced DRAM-v1 helpers under bin/assets/forms/. They shell out to a DRAM-setup.py CLI that isn't part of the DSL2 pipeline, and upstream has already removed them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eight references used DB_CHANNEL_SETUP (all caps) but the workflow is defined as DB_channel_SETUP. Groovy is case-sensitive so the references failed at runtime with "No such variable: DB_CHANNEL_SETUP". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous attempt put params.each { ... } at script top level, but DSL2
rejects statements outside workflow/process/function. Moved the
coercion into a coerceCliParams(p) function defined at top level
(declarations are allowed) and call it from main.nf workflow {} as
the first step, before PIPELINE_INITIALISATION runs validation.
Iterates over a snapshot of the map (collectEntries) to avoid any
risk of concurrent modification while writing back coerced values.
Adds the v1 V flag to DRAM-v amg_flags. V fires when a gene's vogdb hits include any VOG whose VOGdb FunctionalCategory is Xr (viral replication) or Xs (virion structure). Semantics ported verbatim from DRAM v1 mag_annotator/annotate_vgfs.py:get_metabolic_flags (commit 6cd68f9). V is inserted between E and A in the v1 alphabet (M K E V A P T F B); the trailing N (Phase 1) is unchanged. Implementation: - build_viral_vog_ids parses vog_annotations_latest.tsv(.gz) and returns the set of GroupNames whose FunctionalCategory contains Xr or Xs (substring match handles concatenated codes like XrXs). - compute_flags accepts viral_vog_ids; default empty set means V never fires (backward compat with Phase 1 callers). - Phase 2a registered the wrong polars keys in _ID_EXPR_DICT (vog_id/vog_ids). hmm_parser actually emits {db_name}_id columns, i.e. vogdb_id/vogdb_ids. Renamed accordingly so explode_gene_ids picks them up. Plumbing: - modules/local/dramv/dramv_flags.nf takes vog_list as a third input and passes --vog_list through. - subworkflows/local/annotate.nf wires params.vog_list into the call. Safe because use_dramv force-enables use_vog upstream. Tests: - 4 new tests in test_dramv_flags.py covering Xr/Xs firing, rejection of Xh/Xp/Xu, v1 position (MKEVA), backward-compat default-off, and build_viral_vog_ids parsing. 16/16 pass.
feat(dramv): VOGdb plumbing for Phase 2 (V flag prerequisites)
feat(dramv): V flag in compute_flags (Phase 2b)
Adds the v1 auxiliary_score (1-5) per gene to the dramv_flags output. Algorithm ported verbatim from DRAM v1 mag_annotator/annotate_vgfs.py:calculate_auxiliary_scores (commit 6cd68f9): - First/last gene on a scaffold → 5 (no flank context). - Hallmark left and hallmark right → 1. - Hallmark on one side, viral_like on the other → 2. - Viral_like both sides → 3. - Any hallmark/viral_like on either side, OR self carries one → 4. - Otherwise → 5. VIRSORTER_HALLMARK_CATEGORIES = {"0","3"}; VIRAL_LIKE = {"1","4"}. After scoring, v1's downgrade rule fires: if a gene has B in amg_flags AND auxiliary_score < 4, the score is bumped to 4 (a stretch of three metabolic genes is suspicious in a viral region, so AMG calls there get demoted). compute_flags accepts virsorter_categories: dict[gene_id, cat] | None. When None or empty (pure-FASTA mode), every gene falls through to the default branch and scores 5 — matching v1 behaviour for genomes with no VirSorter input. The signature lets a caller plug in VirSorter affi-contigs.tab data later without touching this function; parsing that file is a separate concern (no CLI hook in this PR). Tests: - 4 new in test_dramv_flags.py covering pure-FASTA all-5s, every branch of the if-chain (1/2/3/4/4-self/5), and the B-flag downgrade. - 20/20 pass.
Without VirSorter input the v1 auxiliary_score collapses to 5 for every
gene. The user's pipeline runs geNomad+CheckV externally before DRAM,
so this adds an optional adapter that translates geNomad's gene-level
TSV into VirSorter category codes that the v1 algorithm understands.
Mapping (lossy but matches v1's category semantics for scoring):
virus_hallmark = TRUE → "0" (phage hallmark)
taxname starts with "Viruses" → "1" (phage viral-like)
USCG / plasmid hallmark / NA / Bacteria-like taxname → dropped
The phage / prophage distinction (v1 0 vs 3, 1 vs 4) is collapsed since
the algorithm treats {0,3} and {1,4} as equivalent for the score.
CLI / plumbing:
- bin/dramv_flags.py: --genomad_genes flag, parse_genomad_genes_tsv().
- modules/local/dramv/dramv_flags.nf: optional path input, NO_FILE
sentinel (assets/NO_FILE, empty file) when params.genomad_genes is
null. Script-level guard skips the --genomad_genes arg in that case.
- subworkflows/local/annotate.nf: pass params.genomad_genes through.
- nextflow.config + nextflow_schema.json: declare params.genomad_genes
(default null, type file-path with exists=true when set).
Tests:
- 1 new in test_dramv_flags.py covering hallmark / viral-like / USCG /
plasmid / NA / bacterial-marker drop. 21/21 pass.
DRAMV_FLAGS receives one combined annotations TSV across all samples,
so the geNomad adapter needs gene-level data from every sample to
populate auxiliary_score for the full table.
Changes:
- bin/dramv_flags.py: --genomad_genes is now repeatable. Each file is
parsed and the resulting {gene_id: category} dicts are merged. Log
line counts hallmark / viral-like across all files.
- modules/local/dramv/dramv_flags.nf: accept a list of paths (staged
to ./genomad/), build one --genomad_genes flag per non-sentinel
file, skip when only the NO_FILE sentinel is present.
- subworkflows/local/annotate.nf: collect Channel.fromPath glob into
a list, fall back to a single-element list with the sentinel.
params.genomad_genes can now be a path OR a glob (e.g.
"path/to/genomad/*/EV*_summary/EV*_virus_genes.tsv"). Nextflow handles
the glob expansion via Channel.fromPath.
Tests: existing 21/21 still pass — no semantics change for single-file
callers (a 1-element iterable behaves the same as the old single-path).
nf-schema's "format": "file-path" + "exists": true rejects glob patterns
("not a file, but a file path pattern"), but params.genomad_genes is
now intended to accept a glob so a single launch can match per-sample
*_virus_genes.tsv files. Relax the schema entry to just "type":
"string"; Channel.fromPath in annotate.nf already handles glob
expansion and validates that at least one file matches.
The previous module built each --genomad_genes arg as
"genomad/${it.name}", but ${it.name} on files staged via
`stageAs: "genomad/*"` already includes the staging prefix in the
relative path, so the resulting CLI was "--genomad_genes
genomad/genomad/EV01_virus_genes.tsv" → FileNotFoundError at runtime.
Drop the manual genomad/ prefix and just interpolate ${it} directly;
Nextflow gives us the correct relative path under the work dir.
…ffix)
The first test run produced "0 hallmark + 0 viral-like" across both
samples because the original adapter targeted the documented schema
shape (virus_hallmark TRUE/FALSE, taxname "Viruses;...") but modern
geNomad (>=1.5) emits:
- virus_hallmark as a 0/1 integer column.
- taxname as the leaf taxon only (e.g. "Caudoviricetes"), not a
lineage prefixed with "Viruses".
- marker names like GENOMAD.000123.VV / .Vv / .vV / .vv where the
suffix encodes the viral-confidence classification.
Adapter rewritten to:
- virus_hallmark truthy (1, "1", True, "TRUE") → "0"
- marker suffix in {VV, Vv}, hallmark flag not set → "0"
- marker suffix in {vV, vv}, hallmark flag not set → "1"
- everything else (NA marker, host/plasmid suffix, etc.) → dropped
The taxname check is removed (leaf taxon doesn't reliably encode
"viral"). Test rewritten with rows reflecting the actual schema.
Also documented in the docstring that DRAM provirus contigs
(`<contig>|provirus_*`) carry a CheckV-derived suffix that geNomad's
gene ids — called on the pre-trim assembly contigs — won't match.
For now those rows silently get no coverage; only whole-virus contigs
contribute. A position-overlap-based join is the proper fix and is
left for a follow-up.
feat(dramv): auxiliary_score column (Phase 2c)
geNomad calls genes on the assembly contigs *before* CheckV provirus trimming, so its gene ids look like `k141_99_42`. DRAM calls genes on the trimmed provirus sub-sequence and emits scaffold names like `k141_99|provirus_1000_5000` with gene positions 1-based *within* the provirus. Direct string-match on gene_id therefore produces no coverage for any provirus contig, leaving auxiliary_score stuck at 5 for them — which is what the user saw in their first PR #5 test (1216 of ~9k rows still scored 5 after the parser bugs were fixed). This commit replaces the gene-id-based join with a position-overlap join on the parent assembly contig: - parse_genomad_genes_tsv now returns a polars DataFrame with columns (contig, start, end, category). The contig is the parent assembly contig (gene id with trailing _<num> stripped). One row per kept gene; empty samples produce an empty frame. - build_virsorter_categories_from_genomad joins the geNomad frame onto DRAM annotations via: * _parse_dram_scaffold extracts (parent, offset) from a DRAM scaffold name. Provirus contigs use the suffix offset; whole- virus contigs map to (scaffold, 1) so translation is a no-op. * For each DRAM gene, parent_pos = offset + dram_pos - 1. * Pick the first geNomad gene on the parent contig whose interval [g_start, g_end] overlaps [orig_start, orig_end]. geNomad's same-strand gene calls don't overlap each other, so a single match is unambiguous. main() chains the two: read each --genomad_genes file into a DF, concat, then position-join against the loaded annotations to produce the {dram_gene_id: virsorter_category} dict that compute_flags already accepts. New log line reports coverage: "DRAM genes with mapped category (after provirus position-join): N of M" Tests: existing parser test rewritten for the new return shape; two new tests cover (a) a 5-row mix of provirus gene + parent overlap + whole-virus + non-overlapping + orphan contig, exercising both the provirus offset translation and the whole-virus passthrough, and (b) empty geNomad frame → empty dict. 23/23 pass.
Adds the v1 AMG-confidence cutoff to the distill step. When --amg_only
is enabled, candidate rows whose auxiliary_score exceeds
--max_auxiliary_score are dropped from the strict-AMG sheet. Default
3, matching DRAM v1's recommended cutoff: keep AMG calls with at
least one hallmark/viral-like neighbour OR self-hallmark (scores 1-3
under the v1 algorithm, 4 once the B-downgrade applies). Set to 5 to
disable.
Behaviour preserved when auxiliary_score is absent (older inputs or
non-DRAM-v runs) or when --amg_only is off — the filter is silently
skipped.
Plumbing:
- bin/distill.py: --max_auxiliary_score click option, applied
immediately after the existing --amg_only flag-string filter.
Logs before/after counts.
- modules/local/distill/distill.nf: pass
--max_auxiliary_score ${params.max_auxiliary_score} when
use_dramv is on.
- nextflow.config + nextflow_schema.json: declare
params.max_auxiliary_score (default 3, range 1-5).
- README.md: updated Quick Start blurb to call out the V flag,
auxiliary_score column, --genomad_genes input, position-join
semantics, and the new --max_auxiliary_score knob.
…ack) The earlier position-only join caused a regression: in geNomad >=1.5 provirus genes ARE named `<contig>|provirus_<start>_<end>_<num>` with provirus-local coords — the same convention DRAM uses on those scaffolds. Direct gene-id equality therefore already worked for provirus contigs in PR #5; my position-only refactor stripped the provirus suffix off the geNomad parent on one side but not the other, producing a key mismatch and *losing* matches that previously succeeded. The user's PR #6 test showed score-5 jumping from 13% (PR #5) to 43% as a result. Hybrid join: 1. Direct gene-id equality. Catches provirus genes (`k141_X|provirus_…_N`) and whole-virus genes (`k141_X_N`) when both tools share the same naming. 2. Position-overlap on the parent assembly contig (the previous position-join logic). Only consulted for DRAM genes that didn't get a direct match. This still helps the hypothetical case where naming diverges — DRAM run on a clustered catalog with per-sample geNomad, CheckV further-trimmed proviruses, etc. parse_genomad_genes_tsv now also keeps `gene` (the full id) alongside `contig` so the direct-id index can be built from the same DataFrame. Tests: - parse output schema updated to include the gene column. - new test for direct gene-id match path (covers both provirus and whole-virus naming). - new test for position-overlap fallback (when ids differ but parent contig + translated coords align). - new test confirming direct match takes precedence over a different- category overlap. - empty-input test updated for new schema. - 25/25 pass. Note about `k141_100971` from the user's data: it's a CheckV-trimmed provirus from a contig that geNomad classified as host (below threshold overall, but locally viral). geNomad emits `k141_100971|provirus_1_18064_…` gene ids in its TSV — confirmed by `grep "k141_100971"`. The hybrid join will pick those up via direct gene-id equality, recovering all the matches the position-only approach was missing.
feat(dramv): provirus position-join + --max_auxiliary_score filter (Phase 2d)
Nextflow's path.name returned the staged relative path
'genomad/NO_FILE' instead of just 'NO_FILE' on the user's NF 26.04
HPC, so the strict equality filter missed and the sentinel was passed
to dramv_flags.py as a real --genomad_genes input. polars.read_csv
then choked on the empty file with NoDataError.
Switch the filter to endsWith('NO_FILE'), which doesn't care whether
the path getter returned a basename or the staged relative path.
When --amg_only combined with --max_auxiliary_score filters every row out of annotations (e.g. catalog mode without --genomad_filename_prefix where every gene scores 5 and the default cutoff drops them all), polars infers rule_hits.query_id as null-type and the join against annotations.query_id (str) crashes with SchemaError. Cast both sides to Utf8 explicitly so the empty case yields an empty result rather than a type-mismatch error.
CLI overrides arrive as strings under nf-schema 2.6.x and lenientMode doesn't coerce numerics. Mirror the array_size / queue_size / threads treatment: 'type': ['integer','string'] with pattern '^[1-5]$' so both --max_auxiliary_score 5 and config integer 3 validate cleanly. Click on the dramv_flags.py side coerces to int via type=int.
… prefix
DRAM-on-catalog gene ids are `<sample>_<orig_contig>_<num>` because
nf-virome's CONCAT_FASTAS prepends a sample prefix when concatenating
per-sample fastas into the catalog. geNomad ran upstream of that
rename, so its per-sample `*_virus_genes.tsv` files emit gene ids
like `<orig_contig>_<num>` (no sample prefix). Direct gene-id match
and the position-overlap fallback both miss because the names differ
on the prefix.
Adds an opt-in catalog-mode adapter:
- `derive_sample_prefix_from_filename(path)` extracts a prefix from
each `--genomad_genes` file basename: strips a trailing
`_virus_genes.tsv` (or `_genes.tsv`, with optional .gz) and adds
an underscore. `EV01_virus_genes.tsv` → `"EV01_"`. Non-matching
filenames return `""` (per-sample mode falls through unchanged).
- `parse_genomad_genes_tsv` gains `sample_prefix=""` kwarg, prepended
to both `gene` and `contig` on every kept row so the existing
direct-id and position-overlap join logic line up against DRAM's
catalog-prefixed names.
- New `--genomad_filename_prefix` CLI flag (off by default) gates
the prefix derivation. Per-sample DRAM-v runs keep working as
before; catalog-mode runs opt in.
Plumbed through Nextflow as `params.genomad_filename_prefix` (default
false, schema entry constrains to true|false). The DRAMV_FLAGS module
passes the flag through when set.
README: new "Catalog vs per-sample mode" paragraph documenting the
rename and the flag.
Tests: 2 new in `test_dramv_flags.py`:
- `derive_sample_prefix_from_filename` covers the four basename
shapes plus the no-match fallback.
- `parse_genomad_genes_tsv(..., sample_prefix='EV01_')` confirms
both `gene` and `contig` are prepended on every kept row, and
positions/categories pass through unchanged.
27/27 pass.
feat(dramv): catalog-mode geNomad adapter via filename-derived sample prefix
Phase 2a-2d landed a lot of DRAM-v functionality (V flag, auxiliary_score, geNomad adapter, catalog vs per-sample modes) but the prose grew into a wall of text inside example #8 of the Quick Start list. Pull it out into a dedicated 'DRAM-v (Viral mode)' chapter with sub-sections: - What --use_dramv produces (the three new columns + filter) - Per-sample mode (launch + when to use) - Catalog mode (launch + --genomad_filename_prefix) - Flag reference (M/K/E/V/A/P/T/F/B/N table) - auxiliary_score reference (1-5 scale, B-downgrade, geNomad mapping, hybrid join) Example #8 in Quick Start shrinks to a one-line pointer at the new chapter. Quick Links anchor updated accordingly. No content lost, just reorganised.
DRAM-v Phase 2 (V flag, auxiliary_score, geNomad adapter, --max_auxiliary_score, --genomad_filename_prefix) hasn't been merged into WrightonLabCSU/DRAM yet, so the previous launch examples that read 'nextflow run WrightonLabCSU/DRAM ...' would fail with 'unknown parameter --use_dramv' (and friends). Switch the per-sample and catalog launches to 'nextflow run tpall/DRAM -r dev ...' and add a note at the top of the chapter explaining this is a fork-only path until upstreamed. The non-DRAM-v examples in 'Example Usage' (rename / call / annotate / etc.) keep pointing at WrightonLabCSU/DRAM since those features are upstream.
docs(readme): elevate DRAM-v guide to its own top-level chapter
|
Closing in favour of #503, which narrows this PR to just the array-support change with @madeline-scyphers' three review comments addressed in the design:
#503 is Thanks for the original review notes! |
Changes
array_size = 10parameter to nextflow.config andarrayto conf/base.config for more efficient cluster execution.Channeltochannel) for consistency across workflows and improve readability + usage of implicit variable within closures (e.g.it.nametoit -> it.name)Computing environment and command