Skip to content

feat: add --quantMode TranscriptomeSAM output#4

Closed
ewels wants to merge 17 commits into
feat/run-rng-seedfrom
feat/transcriptome-sam
Closed

feat: add --quantMode TranscriptomeSAM output#4
ewels wants to merge 17 commits into
feat/run-rng-seedfrom
feat/transcriptome-sam

Conversation

@ewels
Copy link
Copy Markdown
Owner

@ewels ewels commented Apr 17, 2026

Summary

Implements --quantMode TranscriptomeSAM + --quantTranscriptomeSAMoutput, producing Aligned.toTranscriptome.out.bam for downstream Salmon/RSEM quantification. Matches STAR's behavior from Transcriptome.cpp, Transcriptome_quantAlign.cpp, ReadAlign_quantTranscriptome.cpp.

Part 3 of 3 branches unblocking nf-core/rnaseq's STAR_ALIGN step. Stacked on #2 (feat/run-rng-seed) — needs that branch's RNG helpers (per_read_seed) for random primary-transcript selection. Please review and merge #2 first.

Changes (6 feature commits + 11 simplify commits + 1 fmt commit)

Feature commits:

  1. feat(quant): add TranscriptomeIndex built from GTF exon groupssrc/quant/transcriptome.rs NEW. Per-transcript metadata (strand, gene_id, exons with cumulative length, tr_length). Binary-searchable tr_starts_sorted/tr_end_max_sorted. 22 unit tests.
  2. feat(quant): project genome alignments onto transcripts (align_to_transcripts) — ports STAR's Transcriptome_quantAlign.cpp: binary search, per-block projection, junction matching, reverse-strand flip.
  3. feat(quant): filter modes + softclip extension for transcriptome SAM — three QuantTranscriptomeSAMoutput modes (BanSingleEnd_BanIndels_ExtendSoftclip default, BanSingleEnd, BanSingleEnd_ExtendSoftclip). Ports STAR's soft-clip extension with mismatch re-count.
  4. feat(params): --quantTranscriptomeSAMoutput + TranscriptomeSAM validation — CLI wiring, quant_transcriptome_sam() helper, validate requires --sjdbGTFfile.
  5. feat(io/bam,lib): wire Aligned.toTranscriptome.out.bam pipelineBamWriter::create_transcriptome() (per-transcript @SQ header), build_transcriptome_records{_se,_pe}, plumbed through AlignmentBatchResults, random primary via per_read_seed (from feat: add --runRNGseed flag with seeded primary tie-break #2).
  6. test: integration test for --quantMode TranscriptomeSAM — end-to-end mini test with synthetic genome + GTF + 20 reads.

Simplify commits (11): dead-code removal (Display/allow_single_end/BamWriter::header), shared helpers (rc_encode, pick_primary_and_mapq, BamWriter::with_header), one-build-per-mate PE emission (no sentinel hack), unread read_seq clone dropped, comment cleanup.

Plus: 1 cargo fmt commit. Rebased onto latest feat/run-rng-seed after #2's simplify + fmt commits landed.

  • +28 unit + 1 integration test (total 311 unit + 5 integration = 316 passing).
  • Files: src/quant/mod.rs, src/quant/transcriptome.rs NEW, src/params.rs, src/io/sam.rs, src/io/bam.rs, src/align/read_align.rs, src/lib.rs, tests/transcriptome_sam.rs NEW.

Divergences from STAR (documented in commit bodies)

  • StdRng vs mt19937 — inherited from feat: add --runRNGseed flag with seeded primary tie-break #2. Per-read deterministic, not bit-for-bit with STAR's mt19937 primary-pick order.
  • GTF-only transcript metadata — STAR persists transcriptInfo.tab/exonInfo.tab at genomeGenerate; ruSTAR builds equivalents on-the-fly from GTF. Semantically equivalent.
  • PE pair projection — uses per-mate intersection on transcript indices instead of STAR's combined 2-mate Transcript / EX_iFrag machinery. Equivalent for unique-per-transcript mates, may differ for complex multi-site cases within one transcript.
  • Single-end filter — enforced at caller-level (only both-mapped PairedAlignment mates invoke the projector), not via per-block EX_iFrag checks.
  • Pass-1 of two-pass mode does not emit transcriptome BAM (intentional; only pass 2 emits).

Test plan

  • cargo test — 316 passing
  • cargo clippy --lib -- -D warnings — clean
  • cargo fmt --check — clean
  • Integration test: synthetic genome → valid BAM with T1/T2 @SQ + expected records
  • Differential check vs STAR on bench/ yeast 10k (needs STAR installed + prebuilt index)
  • End-to-end through nf-core/rnaseq Salmon step

Notes

  • Commits are unsigned (1Password SSH agent unavailable during session).
  • Branch was force-pushed-with-lease after rebasing onto updated feat/run-rng-seed.

🤖 Generated with Claude Code

ewels and others added 17 commits April 17, 2026 15:48
Introduce `src/quant/transcriptome.rs` with a `TranscriptomeIndex`
struct that groups GTF exons by `transcript_id`, sorts exons by start,
and records absolute genome coords plus STAR's `exLenCum` (cumulative
transcript-space offset) per exon.

Also builds:
- `tr_start` / `tr_end` — transcript bounds in absolute genome coords
- `tr_order` / `tr_starts_sorted` / `tr_end_max_sorted` — sorted view
  used for STAR-style `binarySearch1a` + running-max early-exit in
  `quantAlign`
- per-transcript length, strand, gene_id, chr_idx

Transcripts with inconsistent chr/strand across exons or on unknown
chromosomes are skipped with `log::warn!`, matching STAR's tolerant
GTF handling at `sjdbInsertJunctions` time.

Note: STAR persists this table as `transcriptInfo.tab` + `exonInfo.tab`
at genomeGenerate; ruSTAR rebuilds it at alignment time from the GTF.
Semantically equivalent.

No signing — commit.gpgsign disabled locally (no key on this worktree).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nscripts)

Port STAR's `Transcriptome::quantAlign` + `alignToTranscript` (see
`source/Transcriptome_quantAlign.cpp:5-89, 91-114`).

`align_to_transcripts` binary-searches `tr_starts_sorted` for the
greatest tr_start <= align.genome_start, then walks back while
`tr_end_max_sorted[i] >= align.genome_end`, calling
`align_to_one_transcript` on every candidate whose [tr_start, tr_end]
fully contains the alignment.

`align_to_one_transcript` walks the alignment's exon blocks and:
  * locates the transcript exon containing the first block
    (binary-searched `find_containing_exon`),
  * rejects blocks that extend past a transcript-exon boundary,
  * on each splice boundary (ruSTAR's `is_splice_boundary_before`,
    which approximates STAR's `canonSJ >= 0` using read-side
    continuity), requires `prev.genome_end == tr_ex[ex].genome_end &&
    next.genome_start == tr_ex[ex+1].genome_start`,
  * translates each block start to t-space via
    `ex_len_cum[ex] + (g_pos - exon_start)`.

Reverse-strand transcripts flip all projected exons:
  `read_pos' = Lread - (read_pos + len)`,
  `tr_pos'  = tr_length - (tr_pos + len)`,
then reverse the exon vector.  `is_reverse` is XOR'd with
`tr_strand == 2`.

Projected CIGAR = original CIGAR with N operations dropped; reversed
for reverse-strand transcripts.  Splices collapse so `n_junction = 0`.

ruSTAR-specific notes:
  * `Transcript.exons` has no canonSJ array; splice vs indel boundaries
    are discriminated by read-side continuity (insertion → read gap,
    splice → read contiguous with genome gap).  Pure deletions rarely
    produce block splits because the stitch merge coalesces across Del.
  * GTF is 1-based inclusive; ruSTAR 0-based half-open.  Transcript
    length = sum(end - start).  STAR uses `exLenCum[last] + (end -
    start) + 1` which arrives at the same numeric value on 1-based
    inclusive inputs.

8 new unit tests cover: single-exon projection, two-exon with matching
junction, junction mismatch (rejected), reverse-strand flip, multi-exon
projection into a longer transcript, out-of-bounds cases, and
projection onto multiple overlapping transcripts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port STAR's `ReadAlign::quantTranscriptome` gatekeeping logic (see
`source/ReadAlign_quantTranscriptome.cpp:9-66`) as
`filter_and_project`:

1. If `!allow_indels && n_gap > 0` → reject.
2. Single-end filter — for PE, skip alignments where both ends came
   from the same mate.  ruSTAR's `Transcript` does not carry per-block
   mate tags; the caller enforces this by only invoking
   `filter_and_project` on both-mapped PE pair mates.  Documented as a
   no-op at the per-mate level.
3. Soft-clip extension (mode-dependent): extend leading / trailing
   soft-clipped bases back into matched bases, counting mismatches
   where both read and genome bases are < 4 and unequal.  Reject if
   `n_mismatch + extension_mismatches > min(outFilterMismatchNmax,
   outFilterMismatchNoverLmax*(Lread-1))`.
4. Call `align_to_transcripts` for projection.

New enum `QuantTranscriptomeSAMoutput` with three variants matching
STAR's CLI strings:
  * `BanSingleEnd_BanIndels_ExtendSoftclip` (default, RSEM-compatible)
  * `BanSingleEnd`
  * `BanSingleEnd_ExtendSoftclip`

`rebuild_cigar_without_softclips` strips the leading/trailing S ops
and folds their length into the adjacent M op.  Interior soft-clips
are left alone.

7 new unit tests: enum FromStr/Display, mode flags, indel rejection
under default mode, indel retention under BanSingleEnd, zero-mismatch
softclip extension folding 4S+40M → 44M, softclip extension exceeding
budget → reject, softclip preservation under BanSingleEnd mode.

Divergence from STAR: ruSTAR checks for genome/read buffer bounds via
`saturating_sub` + slice-length checks before indexing, to protect
against edge cases where extension would read past the genome end.
STAR trusts its upstream bounds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion

Add `quant_transcriptome_sam_output: QuantTranscriptomeSAMoutput`
field with STAR-compatible default
`BanSingleEnd_BanIndels_ExtendSoftclip`.  Accepts the three STAR
strings via the `FromStr` impl added in the previous commit.

Add `Parameters::quant_transcriptome_sam()` helper that scans
`quant_mode` for `"TranscriptomeSAM"`, mirroring the existing
`quant_gene_counts()` helper.

Extend `Parameters::validate()` to require `--sjdbGTFfile` when
`--quantMode TranscriptomeSAM` is active.  STAR also allows running
TranscriptomeSAM in SE (the "single-end" flag in the mode name refers
to per-mate, not per-run), so no PE-only gate.

5 new unit tests: default parse, flag enable, explicit mode override
(BanSingleEnd and BanSingleEnd_ExtendSoftclip), validate error without
GTF, validate success with GTF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Top-level orchestration for `--quantMode TranscriptomeSAM`.

src/io/sam.rs:
  * Factor `build_sam_header` into a generic
    `build_sam_header_from_refs(iter, params)` that takes any
    `(name, length)` iterator.  The original `build_sam_header` now
    delegates to it with the genome's chromosomes.
  * New `SamWriter::build_transcriptome_records(...)`: builds records
    with RNAME pointing at the transcriptome header (reference_sequence_id
    = transcript_idx), POS = t-space position + 1, and emits only the
    standard attribute set (NH/HI/AS/NM/nM).  Splice-aware tags (jM, jI,
    XS, MD) are dropped because splices collapse in t-space.
  * `primary_hit_idx` parameter selects the randomly-chosen primary
    alignment; all others get the SECONDARY (0x100) flag.

src/io/bam.rs:
  * New `BamWriter::create_transcriptome(path, tr_idx, params)` builds
    a BAM header with one @sq per transcript (name = transcript_id,
    length = t-space length).
  * New `BamWriter::header()` accessor.
  * Unit test verifying transcriptome BAM writer creation + @sq count.

src/align/read_align.rs:
  * `per_read_seed` changed from `fn` to `pub(crate) fn` so lib.rs can
    seed the transcriptome primary-picker with the same read-scoped
    seed used for genome-space tie-shuffle.

src/lib.rs:
  * Build an `Arc<TranscriptomeIndex>` alongside the gene-counts
    context when `--quantMode TranscriptomeSAM` is active, and thread
    it through run_single_pass / run_two_pass / align_reads_single_end
    / align_reads_paired_end.
  * Open `<prefix>Aligned.toTranscriptome.out.bam` via
    `BamWriter::create_transcriptome` at the start of the pipeline and
    flush/finish at the end.
  * Extend `AlignmentBatchResults` with `transcriptome_records:
    Vec<RecordBuf>`.  Per-read builder helpers:
      - `build_transcriptome_records_se`: projects every genome-space
        alignment via `filter_and_project`, seeds `StdRng` with
        `per_read_seed(run_rng_seed, read_name)`, picks a random
        projected index as primary, builds records.
      - `build_transcriptome_records_pe`: projects both mates
        independently, keeps transcripts where both project, emits two
        records per projected pair (mate1 FIRST_SEGMENT + mate2
        LAST_SEGMENT).  Primary pick is shared across both mates.
  * Transcriptome records are written serially in batch-merge order
    alongside the main SAM/BAM stream (normal and BySJout paths).

Known limitations vs STAR:
  * PE pairing is done per-transcript-idx set intersection between
    projected mates, not via STAR's combined 2-mate
    `Transcript`/`EX_iFrag` machinery.  Equivalent for the common case
    where each mate maps uniquely within a given transcript; may
    diverge when one mate has multiple valid positions within one
    transcript.
  * StdRng (inherited from `feat/run-rng-seed`) replaces STAR's
    std::mt19937 — no bit-for-bit parity on which transcriptome
    projection is picked as primary among ties.  Per-read
    determinism is preserved via `per_read_seed`.

No pass-1 wiring: two-pass pass 1 uses `None` for tr_idx (tr BAM is
only emitted from pass 2).

cargo fmt applied across the touched files plus incidental fmt
touch-ups on read_align.rs / stitch.rs / params.rs / quant/mod.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke test in `tests/transcriptome_sam.rs`:
  * Builds a 2000bp synthetic genome (single chromosome, pseudo-random
    sequence from an LCG).
  * Writes a 2-transcript GTF (T1: chr1:101-400 +, T2: chr1:601-900 +).
  * Generates 20 reads (30bp) drawn alternately from the T1 and T2
    exon regions.
  * Runs `ruSTAR --runMode genomeGenerate` then `alignReads
    --quantMode TranscriptomeSAM`.
  * Asserts `Aligned.toTranscriptome.out.bam` is created, > 100 bytes,
    parses as a valid BAM via noodles, and the header contains exactly
    two @sq lines named T1 and T2.

Uses the same `assert_cmd::Command::cargo_bin` pattern as the existing
`phase9_threading.rs` test (inherits the same deprecation warnings;
pre-existing parity).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iptomeSAMoutput

Remove `Display` impl (never formatted anywhere) and `allow_single_end()`
helper (always returns false, never called). Clean up the stale single-end
comment in `filter_and_project`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lders

`build_transcriptome_records_se` allocated a redundant `fwd` clone of
`read_seq` just to unify the reverse / forward branch, and both SE + PE
builders inlined the same 6-line base-complement match. Replace both
with a single `rc_encode()` helper delegating to the existing
`io::fastq::complement_base`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r mate

Previously `build_transcriptome_records_pe` called the per-record builder
twice *per projected pair*, passing a single-element slice and a
`primary_hit_idx` sentinel (0 / usize::MAX) to force the SECONDARY flag.
The builder already handles multi-alignment sets correctly, so split the
pairs into mate1/mate2 slices, call the builder once per mate with the
true `primary_hit`, then stamp FIRST_SEGMENT / LAST_SEGMENT on the
results and interleave. Same observable output, half the per-pair work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `build_transcriptome_records_se` and `_pe` duplicated the RNG seeding
(`per_read_seed` → `StdRng::seed_from_u64` → `gen_range`) and the MAPQ
calculation (`effective_n = n_alignments.max(n_for_mapq)`). Extract a
single `pick_primary_and_mapq` helper and hoist the `use` blocks out of
the function bodies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`BamWriter::create` and `BamWriter::create_transcriptome` both opened the
file, wrapped it in `BufWriter`/`bam::io::Writer`, and wrote the header.
Move that boilerplate into a private `with_header` helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Merge the two `actual_left` shadowed lets into one `.min(..).min(..)` chain.
- Drop the `actual_right` alias (right-clip never needs clamping).
- Replace `ext.genome_start = ext.exons.first().map(..).unwrap_or(..)` with
  a simple `if let Some(first) = ext.exons.first()` — the fallback branch
  was unreachable because extend_softclips always enters with a non-empty
  exon list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eq clone

- The 5-line single-end carve-out inside the `filter_and_project` doc said
  the same thing as the inline comment; fold both into a one-line mention.
- Drop redundant `// (1) ... (4) ...` step labels from the function body.
- `align_to_one_transcript` was cloning `align.read_seq` into the projected
  transcript but no consumer reads that field on a projected record
  (transcriptome SAM takes the read from the caller).  Use `Vec::new()` to
  avoid the per-projection allocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_one_transcript

The comment referenced a function name that never existed in ruSTAR (it was
presumably copied from an earlier draft); the real call is
`is_splice_boundary_before`. Collapse the now-out-of-date four-line block
to a single factual line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `header()` method was added "for serial-write code paths that build
records outside of `write_batch`" but no such caller exists; only the
unit test used it. The test lives inside the same module so it can read
the private `header` field directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caller had a `Vec<&Box<PairedAlignment>>` and was building a throwaway
`Vec<&PairedAlignment>` to match the function's `&[&PairedAlignment]`
signature. Change the signature to `IntoIterator<Item = &PairedAlignment>`
and call `.iter().map(|b| b.as_ref())` at the call-site, eliminating the
intermediate Vec allocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test module already imports `HashMap` via `use super::*` (the parent
module re-exports `std::collections::HashMap`). The `StdHashMap` alias
was defensive but unused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ewels ewels force-pushed the feat/transcriptome-sam branch from 23672b1 to f420aae Compare April 17, 2026 13:49
@ewels ewels changed the title feat: add --quantMode TranscriptomeSAM output feat: add --quantMode TranscriptomeSAM output Apr 17, 2026
@ewels ewels changed the title feat: add --quantMode TranscriptomeSAM output feat: add --quantMode TranscriptomeSAM output Apr 17, 2026
@ewels ewels closed this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant