Skip to content

Submit tasks as job arrays and fix RNAs in distill summaries#472

Closed
tpall wants to merge 92 commits intoWrightonLabCSU:devfrom
tpall:dev
Closed

Submit tasks as job arrays and fix RNAs in distill summaries#472
tpall wants to merge 92 commits intoWrightonLabCSU:devfrom
tpall:dev

Conversation

@tpall
Copy link
Copy Markdown

@tpall tpall commented Nov 27, 2025

Changes

  • Added array_size = 10 parameter to nextflow.config and array to conf/base.config for more efficient cluster execution.
  • Fix inclusion of rRNA, tRNA, and quast summaries to genome_stats.tsv and metabolism_summary.xlsx in bin/distill.py script.
  • Refactor channel usage (Channel to channel) for consistency across workflows and improve readability + usage of implicit variable within closures (e.g. it.name to it -> it.name)

Computing environment and command

  • nextflow version 25.10.0.10289
  • openjdk 22.0.1-internal 2024-04-16
  • singularity 3.8.5
  • slurm
  • x86_64 GNU/Linux
nextflow run tpall/DRAM -r dev --input_fasta ./DRAM/input_fasta --outdir ./DRAM/call-annotate-distill --threads 8 --summarize --qc --use_kofam --use_dbcan --use_merops --use_viral --use_methyl --use_sulfur -profile singularity --slurm --partition main -with-report -with-trace -with-timeline --array_size 10 --queue_size 10 -resume --annotate 

@madeline-scyphers madeline-scyphers self-requested a review December 1, 2025 21:14
@madeline-scyphers
Copy link
Copy Markdown
Member

madeline-scyphers commented Dec 1, 2025

Hey @tpall Thanks for this. The job array addition it nice. There is a larger planned update to batch a lot of the inputs into singular jobs to reduce the burden on the queue, since running DRAM with lots of inputs can overwhelm a SLURM scheduler, but adding in job arrays, which weren't supported on the version of Nextflow we initially developed DRAM2 on, but we recently moved to >=24 (we should lock in >=24.04.0 since there are early 24.* prereleases out there if we add in job arrays). I will have to do some testing on utilizing job arrays and their implication. Because from my initial testing it seems like it stops the next stop from proceeding until their are enough inputs to fill an array. Which might be ok. But if we are going to be doing batching anyway, it might not be that important and not worth it.

Also thanks for some of the other QoL updates like updating some of the syntax to DSL2 (Channel -> channel, etc.).

I will have to more fully review the code, which I can get to in a couple weeks. I have deadline for next week, and probably won't be able to review much before then.

But I will leave just a couple of quick thoughts.

Thanks again

Comment thread conf/base.config
}

withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' {
array = params.array_size
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to support people running DRAM2 with local executor (such as on their own computer if they want), which doesn't support array. So the array should only be used with executors that support it.

Comment thread conf/base.config Outdated
maxRetries = 2
}

withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jobs under DRAM:ANNOTATE:QC:COLLECT.* could also have a job array

Comment thread conf/base.config Outdated
Comment on lines +68 to +71
withName: 'DRAM:ANNOTATE:CALL:.*|DRAM:ANNOTATE:DB_SEARCH:.*' {
array = params.array_size
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code here in base.config for the job array should probably be in modules.config

tpall and others added 10 commits December 18, 2025 15:20
Adds a DECOMPRESS_FASTA module (bbtools reformat.sh in the existing
bbmap container) and routes only .gz inputs through it via a branch
on the fasta channel. Basename stripping is unified so sample.fa and
sample.fa.gz produce the same downstream name, keeping outputs
identical regardless of input compression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
It was already defined in nextflow.config (default 10) and consumed by
conf/base.config, but absent from nextflow_schema.json, so runs emitted
a schema-validation warning. Added alongside queue_size under Process
Options.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the never-included trees module and all scripts only it referenced
(parse_annotations.py, update_annots_trees.py, color_labels.R), plus
update_tree.py which had no references at all. Also removes
assets/trees/ refpkgs (only consumed by the dropped module) and the
DRAM-v1 standalone DB setup scripts under assets/internal/ which were
never wired into the DSL2 pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both are unreferenced DRAM-v1 helpers under bin/assets/forms/. They
shell out to a DRAM-setup.py CLI that isn't part of the DSL2 pipeline,
and upstream has already removed them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eight references used DB_CHANNEL_SETUP (all caps) but the workflow is
defined as DB_channel_SETUP. Groovy is case-sensitive so the references
failed at runtime with "No such variable: DB_CHANNEL_SETUP".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tpall added 24 commits May 7, 2026 18:47
Previous attempt put params.each { ... } at script top level, but DSL2
rejects statements outside workflow/process/function. Moved the
coercion into a coerceCliParams(p) function defined at top level
(declarations are allowed) and call it from main.nf workflow {} as
the first step, before PIPELINE_INITIALISATION runs validation.

Iterates over a snapshot of the map (collectEntries) to avoid any
risk of concurrent modification while writing back coerced values.
Adds the v1 V flag to DRAM-v amg_flags. V fires when a gene's vogdb
hits include any VOG whose VOGdb FunctionalCategory is Xr (viral
replication) or Xs (virion structure). Semantics ported verbatim from
DRAM v1 mag_annotator/annotate_vgfs.py:get_metabolic_flags (commit
6cd68f9). V is inserted between E and A in the v1 alphabet
(M K E V A P T F B); the trailing N (Phase 1) is unchanged.

Implementation:
- build_viral_vog_ids parses vog_annotations_latest.tsv(.gz) and
  returns the set of GroupNames whose FunctionalCategory contains
  Xr or Xs (substring match handles concatenated codes like XrXs).
- compute_flags accepts viral_vog_ids; default empty set means V
  never fires (backward compat with Phase 1 callers).
- Phase 2a registered the wrong polars keys in _ID_EXPR_DICT
  (vog_id/vog_ids). hmm_parser actually emits {db_name}_id columns,
  i.e. vogdb_id/vogdb_ids. Renamed accordingly so explode_gene_ids
  picks them up.

Plumbing:
- modules/local/dramv/dramv_flags.nf takes vog_list as a third input
  and passes --vog_list through.
- subworkflows/local/annotate.nf wires params.vog_list into the call.
  Safe because use_dramv force-enables use_vog upstream.

Tests:
- 4 new tests in test_dramv_flags.py covering Xr/Xs firing,
  rejection of Xh/Xp/Xu, v1 position (MKEVA), backward-compat
  default-off, and build_viral_vog_ids parsing. 16/16 pass.
feat(dramv): VOGdb plumbing for Phase 2 (V flag prerequisites)
feat(dramv): V flag in compute_flags (Phase 2b)
Adds the v1 auxiliary_score (1-5) per gene to the dramv_flags output.
Algorithm ported verbatim from DRAM v1
mag_annotator/annotate_vgfs.py:calculate_auxiliary_scores (commit
6cd68f9):

  - First/last gene on a scaffold → 5 (no flank context).
  - Hallmark left and hallmark right → 1.
  - Hallmark on one side, viral_like on the other → 2.
  - Viral_like both sides → 3.
  - Any hallmark/viral_like on either side, OR self carries one → 4.
  - Otherwise → 5.

VIRSORTER_HALLMARK_CATEGORIES = {"0","3"}; VIRAL_LIKE = {"1","4"}.

After scoring, v1's downgrade rule fires: if a gene has B in amg_flags
AND auxiliary_score < 4, the score is bumped to 4 (a stretch of three
metabolic genes is suspicious in a viral region, so AMG calls there
get demoted).

compute_flags accepts virsorter_categories: dict[gene_id, cat] | None.
When None or empty (pure-FASTA mode), every gene falls through to the
default branch and scores 5 — matching v1 behaviour for genomes with
no VirSorter input. The signature lets a caller plug in VirSorter
affi-contigs.tab data later without touching this function; parsing
that file is a separate concern (no CLI hook in this PR).

Tests:
- 4 new in test_dramv_flags.py covering pure-FASTA all-5s, every
  branch of the if-chain (1/2/3/4/4-self/5), and the B-flag downgrade.
- 20/20 pass.
Without VirSorter input the v1 auxiliary_score collapses to 5 for every
gene. The user's pipeline runs geNomad+CheckV externally before DRAM,
so this adds an optional adapter that translates geNomad's gene-level
TSV into VirSorter category codes that the v1 algorithm understands.

Mapping (lossy but matches v1's category semantics for scoring):
  virus_hallmark = TRUE          → "0" (phage hallmark)
  taxname starts with "Viruses"  → "1" (phage viral-like)
  USCG / plasmid hallmark / NA / Bacteria-like taxname → dropped

The phage / prophage distinction (v1 0 vs 3, 1 vs 4) is collapsed since
the algorithm treats {0,3} and {1,4} as equivalent for the score.

CLI / plumbing:
  - bin/dramv_flags.py: --genomad_genes flag, parse_genomad_genes_tsv().
  - modules/local/dramv/dramv_flags.nf: optional path input, NO_FILE
    sentinel (assets/NO_FILE, empty file) when params.genomad_genes is
    null. Script-level guard skips the --genomad_genes arg in that case.
  - subworkflows/local/annotate.nf: pass params.genomad_genes through.
  - nextflow.config + nextflow_schema.json: declare params.genomad_genes
    (default null, type file-path with exists=true when set).

Tests:
  - 1 new in test_dramv_flags.py covering hallmark / viral-like / USCG /
    plasmid / NA / bacterial-marker drop. 21/21 pass.
DRAMV_FLAGS receives one combined annotations TSV across all samples,
so the geNomad adapter needs gene-level data from every sample to
populate auxiliary_score for the full table.

Changes:
- bin/dramv_flags.py: --genomad_genes is now repeatable. Each file is
  parsed and the resulting {gene_id: category} dicts are merged. Log
  line counts hallmark / viral-like across all files.
- modules/local/dramv/dramv_flags.nf: accept a list of paths (staged
  to ./genomad/), build one --genomad_genes flag per non-sentinel
  file, skip when only the NO_FILE sentinel is present.
- subworkflows/local/annotate.nf: collect Channel.fromPath glob into
  a list, fall back to a single-element list with the sentinel.

params.genomad_genes can now be a path OR a glob (e.g.
"path/to/genomad/*/EV*_summary/EV*_virus_genes.tsv"). Nextflow handles
the glob expansion via Channel.fromPath.

Tests: existing 21/21 still pass — no semantics change for single-file
callers (a 1-element iterable behaves the same as the old single-path).
nf-schema's "format": "file-path" + "exists": true rejects glob patterns
("not a file, but a file path pattern"), but params.genomad_genes is
now intended to accept a glob so a single launch can match per-sample
*_virus_genes.tsv files. Relax the schema entry to just "type":
"string"; Channel.fromPath in annotate.nf already handles glob
expansion and validates that at least one file matches.
The previous module built each --genomad_genes arg as
"genomad/${it.name}", but ${it.name} on files staged via
`stageAs: "genomad/*"` already includes the staging prefix in the
relative path, so the resulting CLI was "--genomad_genes
genomad/genomad/EV01_virus_genes.tsv" → FileNotFoundError at runtime.

Drop the manual genomad/ prefix and just interpolate ${it} directly;
Nextflow gives us the correct relative path under the work dir.
…ffix)

The first test run produced "0 hallmark + 0 viral-like" across both
samples because the original adapter targeted the documented schema
shape (virus_hallmark TRUE/FALSE, taxname "Viruses;...") but modern
geNomad (>=1.5) emits:

  - virus_hallmark as a 0/1 integer column.
  - taxname as the leaf taxon only (e.g. "Caudoviricetes"), not a
    lineage prefixed with "Viruses".
  - marker names like GENOMAD.000123.VV / .Vv / .vV / .vv where the
    suffix encodes the viral-confidence classification.

Adapter rewritten to:
  - virus_hallmark truthy (1, "1", True, "TRUE")           → "0"
  - marker suffix in {VV, Vv}, hallmark flag not set       → "0"
  - marker suffix in {vV, vv}, hallmark flag not set       → "1"
  - everything else (NA marker, host/plasmid suffix, etc.) → dropped

The taxname check is removed (leaf taxon doesn't reliably encode
"viral"). Test rewritten with rows reflecting the actual schema.

Also documented in the docstring that DRAM provirus contigs
(`<contig>|provirus_*`) carry a CheckV-derived suffix that geNomad's
gene ids — called on the pre-trim assembly contigs — won't match.
For now those rows silently get no coverage; only whole-virus contigs
contribute. A position-overlap-based join is the proper fix and is
left for a follow-up.
feat(dramv): auxiliary_score column (Phase 2c)
geNomad calls genes on the assembly contigs *before* CheckV provirus
trimming, so its gene ids look like `k141_99_42`. DRAM calls genes on
the trimmed provirus sub-sequence and emits scaffold names like
`k141_99|provirus_1000_5000` with gene positions 1-based *within* the
provirus. Direct string-match on gene_id therefore produces no
coverage for any provirus contig, leaving auxiliary_score stuck at 5
for them — which is what the user saw in their first PR #5 test (1216
of ~9k rows still scored 5 after the parser bugs were fixed).

This commit replaces the gene-id-based join with a position-overlap
join on the parent assembly contig:

  - parse_genomad_genes_tsv now returns a polars DataFrame with
    columns (contig, start, end, category). The contig is the parent
    assembly contig (gene id with trailing _<num> stripped). One row
    per kept gene; empty samples produce an empty frame.
  - build_virsorter_categories_from_genomad joins the geNomad frame
    onto DRAM annotations via:
      * _parse_dram_scaffold extracts (parent, offset) from a DRAM
        scaffold name. Provirus contigs use the suffix offset; whole-
        virus contigs map to (scaffold, 1) so translation is a no-op.
      * For each DRAM gene, parent_pos = offset + dram_pos - 1.
      * Pick the first geNomad gene on the parent contig whose
        interval [g_start, g_end] overlaps [orig_start, orig_end].
        geNomad's same-strand gene calls don't overlap each other, so
        a single match is unambiguous.

main() chains the two: read each --genomad_genes file into a DF,
concat, then position-join against the loaded annotations to produce
the {dram_gene_id: virsorter_category} dict that compute_flags
already accepts. New log line reports coverage:
  "DRAM genes with mapped category (after provirus position-join):
   N of M"

Tests: existing parser test rewritten for the new return shape; two
new tests cover (a) a 5-row mix of provirus gene + parent overlap +
whole-virus + non-overlapping + orphan contig, exercising both the
provirus offset translation and the whole-virus passthrough, and
(b) empty geNomad frame → empty dict. 23/23 pass.
Adds the v1 AMG-confidence cutoff to the distill step. When --amg_only
is enabled, candidate rows whose auxiliary_score exceeds
--max_auxiliary_score are dropped from the strict-AMG sheet. Default
3, matching DRAM v1's recommended cutoff: keep AMG calls with at
least one hallmark/viral-like neighbour OR self-hallmark (scores 1-3
under the v1 algorithm, 4 once the B-downgrade applies). Set to 5 to
disable.

Behaviour preserved when auxiliary_score is absent (older inputs or
non-DRAM-v runs) or when --amg_only is off — the filter is silently
skipped.

Plumbing:
  - bin/distill.py: --max_auxiliary_score click option, applied
    immediately after the existing --amg_only flag-string filter.
    Logs before/after counts.
  - modules/local/distill/distill.nf: pass
    --max_auxiliary_score ${params.max_auxiliary_score} when
    use_dramv is on.
  - nextflow.config + nextflow_schema.json: declare
    params.max_auxiliary_score (default 3, range 1-5).
  - README.md: updated Quick Start blurb to call out the V flag,
    auxiliary_score column, --genomad_genes input, position-join
    semantics, and the new --max_auxiliary_score knob.
…ack)

The earlier position-only join caused a regression: in geNomad >=1.5
provirus genes ARE named `<contig>|provirus_<start>_<end>_<num>` with
provirus-local coords — the same convention DRAM uses on those scaffolds.
Direct gene-id equality therefore already worked for provirus contigs in
PR #5; my position-only refactor stripped the provirus suffix off the
geNomad parent on one side but not the other, producing a key mismatch
and *losing* matches that previously succeeded. The user's PR #6 test
showed score-5 jumping from 13% (PR #5) to 43% as a result.

Hybrid join:
  1. Direct gene-id equality. Catches provirus genes (`k141_X|provirus_…_N`)
     and whole-virus genes (`k141_X_N`) when both tools share the same
     naming.
  2. Position-overlap on the parent assembly contig (the previous
     position-join logic). Only consulted for DRAM genes that didn't get
     a direct match. This still helps the hypothetical case where naming
     diverges — DRAM run on a clustered catalog with per-sample geNomad,
     CheckV further-trimmed proviruses, etc.

parse_genomad_genes_tsv now also keeps `gene` (the full id) alongside
`contig` so the direct-id index can be built from the same DataFrame.

Tests:
  - parse output schema updated to include the gene column.
  - new test for direct gene-id match path (covers both provirus and
    whole-virus naming).
  - new test for position-overlap fallback (when ids differ but parent
    contig + translated coords align).
  - new test confirming direct match takes precedence over a different-
    category overlap.
  - empty-input test updated for new schema.
  - 25/25 pass.

Note about `k141_100971` from the user's data: it's a CheckV-trimmed
provirus from a contig that geNomad classified as host (below threshold
overall, but locally viral). geNomad emits `k141_100971|provirus_1_18064_…`
gene ids in its TSV — confirmed by `grep "k141_100971"`. The hybrid join
will pick those up via direct gene-id equality, recovering all the
matches the position-only approach was missing.
feat(dramv): provirus position-join + --max_auxiliary_score filter (Phase 2d)
Nextflow's path.name returned the staged relative path
'genomad/NO_FILE' instead of just 'NO_FILE' on the user's NF 26.04
HPC, so the strict equality filter missed and the sentinel was passed
to dramv_flags.py as a real --genomad_genes input. polars.read_csv
then choked on the empty file with NoDataError.

Switch the filter to endsWith('NO_FILE'), which doesn't care whether
the path getter returned a basename or the staged relative path.
When --amg_only combined with --max_auxiliary_score filters every row
out of annotations (e.g. catalog mode without --genomad_filename_prefix
where every gene scores 5 and the default cutoff drops them all),
polars infers rule_hits.query_id as null-type and the join against
annotations.query_id (str) crashes with SchemaError. Cast both sides
to Utf8 explicitly so the empty case yields an empty result rather
than a type-mismatch error.
CLI overrides arrive as strings under nf-schema 2.6.x and lenientMode
doesn't coerce numerics. Mirror the array_size / queue_size / threads
treatment: 'type': ['integer','string'] with pattern '^[1-5]$' so
both --max_auxiliary_score 5 and config integer 3 validate cleanly.
Click on the dramv_flags.py side coerces to int via type=int.
… prefix

DRAM-on-catalog gene ids are `<sample>_<orig_contig>_<num>` because
nf-virome's CONCAT_FASTAS prepends a sample prefix when concatenating
per-sample fastas into the catalog. geNomad ran upstream of that
rename, so its per-sample `*_virus_genes.tsv` files emit gene ids
like `<orig_contig>_<num>` (no sample prefix). Direct gene-id match
and the position-overlap fallback both miss because the names differ
on the prefix.

Adds an opt-in catalog-mode adapter:

  - `derive_sample_prefix_from_filename(path)` extracts a prefix from
    each `--genomad_genes` file basename: strips a trailing
    `_virus_genes.tsv` (or `_genes.tsv`, with optional .gz) and adds
    an underscore. `EV01_virus_genes.tsv` → `"EV01_"`. Non-matching
    filenames return `""` (per-sample mode falls through unchanged).
  - `parse_genomad_genes_tsv` gains `sample_prefix=""` kwarg, prepended
    to both `gene` and `contig` on every kept row so the existing
    direct-id and position-overlap join logic line up against DRAM's
    catalog-prefixed names.
  - New `--genomad_filename_prefix` CLI flag (off by default) gates
    the prefix derivation. Per-sample DRAM-v runs keep working as
    before; catalog-mode runs opt in.

Plumbed through Nextflow as `params.genomad_filename_prefix` (default
false, schema entry constrains to true|false). The DRAMV_FLAGS module
passes the flag through when set.

README: new "Catalog vs per-sample mode" paragraph documenting the
rename and the flag.

Tests: 2 new in `test_dramv_flags.py`:
  - `derive_sample_prefix_from_filename` covers the four basename
    shapes plus the no-match fallback.
  - `parse_genomad_genes_tsv(..., sample_prefix='EV01_')` confirms
    both `gene` and `contig` are prepended on every kept row, and
    positions/categories pass through unchanged.
27/27 pass.
feat(dramv): catalog-mode geNomad adapter via filename-derived sample prefix
Phase 2a-2d landed a lot of DRAM-v functionality (V flag,
auxiliary_score, geNomad adapter, catalog vs per-sample modes) but the
prose grew into a wall of text inside example #8 of the Quick Start
list. Pull it out into a dedicated 'DRAM-v (Viral mode)' chapter with
sub-sections:

  - What --use_dramv produces       (the three new columns + filter)
  - Per-sample mode                 (launch + when to use)
  - Catalog mode                    (launch + --genomad_filename_prefix)
  - Flag reference                  (M/K/E/V/A/P/T/F/B/N table)
  - auxiliary_score reference       (1-5 scale, B-downgrade, geNomad
                                     mapping, hybrid join)

Example #8 in Quick Start shrinks to a one-line pointer at the new
chapter. Quick Links anchor updated accordingly. No content lost,
just reorganised.
DRAM-v Phase 2 (V flag, auxiliary_score, geNomad adapter,
--max_auxiliary_score, --genomad_filename_prefix) hasn't been merged
into WrightonLabCSU/DRAM yet, so the previous launch examples that
read 'nextflow run WrightonLabCSU/DRAM ...' would fail with 'unknown
parameter --use_dramv' (and friends).

Switch the per-sample and catalog launches to 'nextflow run
tpall/DRAM -r dev ...' and add a note at the top of the chapter
explaining this is a fork-only path until upstreamed. The non-DRAM-v
examples in 'Example Usage' (rename / call / annotate / etc.) keep
pointing at WrightonLabCSU/DRAM since those features are upstream.
docs(readme): elevate DRAM-v guide to its own top-level chapter
@tpall
Copy link
Copy Markdown
Author

tpall commented May 8, 2026

Closing in favour of #503, which narrows this PR to just the array-support change with @madeline-scyphers' three review comments addressed in the design:

  1. Local-executor compat — directive gated on params.slurm so non-SLURM runs see array = 0 (a no-op).
  2. DRAM:ANNOTATE:QC:COLLECT.* — selector regex broadened.
  3. Moved from conf/base.config to conf/modules.config.

#503 is +19 lines / 3 files. The other improvements that were bundled into this PR (Channel→channel cleanup, rRNA/tRNA distill fix, fasta gzip support, schema fixes) will come up as separate focused PRs so each can be reviewed in isolation.

Thanks for the original review notes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants