feat: add `--quantMode TranscriptomeSAM` output by ewels · Pull Request #4 · ewels/rustar-aligner

ewels · 2026-04-17T12:55:54Z

Summary

Implements --quantMode TranscriptomeSAM + --quantTranscriptomeSAMoutput, producing Aligned.toTranscriptome.out.bam for downstream Salmon/RSEM quantification. Matches STAR's behavior from Transcriptome.cpp, Transcriptome_quantAlign.cpp, ReadAlign_quantTranscriptome.cpp.

Part 3 of 3 branches unblocking nf-core/rnaseq's STAR_ALIGN step. Stacked on #2 (feat/run-rng-seed) — needs that branch's RNG helpers (per_read_seed) for random primary-transcript selection. Please review and merge #2 first.

Changes (6 feature commits + 11 simplify commits + 1 fmt commit)

Feature commits:

feat(quant): add TranscriptomeIndex built from GTF exon groups — src/quant/transcriptome.rs NEW. Per-transcript metadata (strand, gene_id, exons with cumulative length, tr_length). Binary-searchable tr_starts_sorted/tr_end_max_sorted. 22 unit tests.
feat(quant): project genome alignments onto transcripts (align_to_transcripts) — ports STAR's Transcriptome_quantAlign.cpp: binary search, per-block projection, junction matching, reverse-strand flip.
feat(quant): filter modes + softclip extension for transcriptome SAM — three QuantTranscriptomeSAMoutput modes (BanSingleEnd_BanIndels_ExtendSoftclip default, BanSingleEnd, BanSingleEnd_ExtendSoftclip). Ports STAR's soft-clip extension with mismatch re-count.
feat(params): --quantTranscriptomeSAMoutput + TranscriptomeSAM validation — CLI wiring, quant_transcriptome_sam() helper, validate requires --sjdbGTFfile.
feat(io/bam,lib): wire Aligned.toTranscriptome.out.bam pipeline — BamWriter::create_transcriptome() (per-transcript @SQ header), build_transcriptome_records{_se,_pe}, plumbed through AlignmentBatchResults, random primary via per_read_seed (from feat: add --runRNGseed flag with seeded primary tie-break #2).
test: integration test for --quantMode TranscriptomeSAM — end-to-end mini test with synthetic genome + GTF + 20 reads.

Simplify commits (11): dead-code removal (Display/allow_single_end/BamWriter::header), shared helpers (rc_encode, pick_primary_and_mapq, BamWriter::with_header), one-build-per-mate PE emission (no sentinel hack), unread read_seq clone dropped, comment cleanup.

Plus: 1 cargo fmt commit. Rebased onto latest feat/run-rng-seed after #2's simplify + fmt commits landed.

+28 unit + 1 integration test (total 311 unit + 5 integration = 316 passing).
Files: src/quant/mod.rs, src/quant/transcriptome.rs NEW, src/params.rs, src/io/sam.rs, src/io/bam.rs, src/align/read_align.rs, src/lib.rs, tests/transcriptome_sam.rs NEW.

Divergences from STAR (documented in commit bodies)

StdRng vs mt19937 — inherited from feat: add --runRNGseed flag with seeded primary tie-break #2. Per-read deterministic, not bit-for-bit with STAR's mt19937 primary-pick order.
GTF-only transcript metadata — STAR persists transcriptInfo.tab/exonInfo.tab at genomeGenerate; ruSTAR builds equivalents on-the-fly from GTF. Semantically equivalent.
PE pair projection — uses per-mate intersection on transcript indices instead of STAR's combined 2-mate Transcript / EX_iFrag machinery. Equivalent for unique-per-transcript mates, may differ for complex multi-site cases within one transcript.
Single-end filter — enforced at caller-level (only both-mapped PairedAlignment mates invoke the projector), not via per-block EX_iFrag checks.
Pass-1 of two-pass mode does not emit transcriptome BAM (intentional; only pass 2 emits).

Test plan

cargo test — 316 passing
cargo clippy --lib -- -D warnings — clean
cargo fmt --check — clean
Integration test: synthetic genome → valid BAM with T1/T2 @SQ + expected records
Differential check vs STAR on bench/ yeast 10k (needs STAR installed + prebuilt index)
End-to-end through nf-core/rnaseq Salmon step

Notes

Commits are unsigned (1Password SSH agent unavailable during session).
Branch was force-pushed-with-lease after rebasing onto updated feat/run-rng-seed.

🤖 Generated with Claude Code

Introduce `src/quant/transcriptome.rs` with a `TranscriptomeIndex` struct that groups GTF exons by `transcript_id`, sorts exons by start, and records absolute genome coords plus STAR's `exLenCum` (cumulative transcript-space offset) per exon. Also builds: - `tr_start` / `tr_end` — transcript bounds in absolute genome coords - `tr_order` / `tr_starts_sorted` / `tr_end_max_sorted` — sorted view used for STAR-style `binarySearch1a` + running-max early-exit in `quantAlign` - per-transcript length, strand, gene_id, chr_idx Transcripts with inconsistent chr/strand across exons or on unknown chromosomes are skipped with `log::warn!`, matching STAR's tolerant GTF handling at `sjdbInsertJunctions` time. Note: STAR persists this table as `transcriptInfo.tab` + `exonInfo.tab` at genomeGenerate; ruSTAR rebuilds it at alignment time from the GTF. Semantically equivalent. No signing — commit.gpgsign disabled locally (no key on this worktree). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nscripts) Port STAR's `Transcriptome::quantAlign` + `alignToTranscript` (see `source/Transcriptome_quantAlign.cpp:5-89, 91-114`). `align_to_transcripts` binary-searches `tr_starts_sorted` for the greatest tr_start <= align.genome_start, then walks back while `tr_end_max_sorted[i] >= align.genome_end`, calling `align_to_one_transcript` on every candidate whose [tr_start, tr_end] fully contains the alignment. `align_to_one_transcript` walks the alignment's exon blocks and: * locates the transcript exon containing the first block (binary-searched `find_containing_exon`), * rejects blocks that extend past a transcript-exon boundary, * on each splice boundary (ruSTAR's `is_splice_boundary_before`, which approximates STAR's `canonSJ >= 0` using read-side continuity), requires `prev.genome_end == tr_ex[ex].genome_end && next.genome_start == tr_ex[ex+1].genome_start`, * translates each block start to t-space via `ex_len_cum[ex] + (g_pos - exon_start)`. Reverse-strand transcripts flip all projected exons: `read_pos' = Lread - (read_pos + len)`, `tr_pos' = tr_length - (tr_pos + len)`, then reverse the exon vector. `is_reverse` is XOR'd with `tr_strand == 2`. Projected CIGAR = original CIGAR with N operations dropped; reversed for reverse-strand transcripts. Splices collapse so `n_junction = 0`. ruSTAR-specific notes: * `Transcript.exons` has no canonSJ array; splice vs indel boundaries are discriminated by read-side continuity (insertion → read gap, splice → read contiguous with genome gap). Pure deletions rarely produce block splits because the stitch merge coalesces across Del. * GTF is 1-based inclusive; ruSTAR 0-based half-open. Transcript length = sum(end - start). STAR uses `exLenCum[last] + (end - start) + 1` which arrives at the same numeric value on 1-based inclusive inputs. 8 new unit tests cover: single-exon projection, two-exon with matching junction, junction mismatch (rejected), reverse-strand flip, multi-exon projection into a longer transcript, out-of-bounds cases, and projection onto multiple overlapping transcripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Port STAR's `ReadAlign::quantTranscriptome` gatekeeping logic (see `source/ReadAlign_quantTranscriptome.cpp:9-66`) as `filter_and_project`: 1. If `!allow_indels && n_gap > 0` → reject. 2. Single-end filter — for PE, skip alignments where both ends came from the same mate. ruSTAR's `Transcript` does not carry per-block mate tags; the caller enforces this by only invoking `filter_and_project` on both-mapped PE pair mates. Documented as a no-op at the per-mate level. 3. Soft-clip extension (mode-dependent): extend leading / trailing soft-clipped bases back into matched bases, counting mismatches where both read and genome bases are < 4 and unequal. Reject if `n_mismatch + extension_mismatches > min(outFilterMismatchNmax, outFilterMismatchNoverLmax*(Lread-1))`. 4. Call `align_to_transcripts` for projection. New enum `QuantTranscriptomeSAMoutput` with three variants matching STAR's CLI strings: * `BanSingleEnd_BanIndels_ExtendSoftclip` (default, RSEM-compatible) * `BanSingleEnd` * `BanSingleEnd_ExtendSoftclip` `rebuild_cigar_without_softclips` strips the leading/trailing S ops and folds their length into the adjacent M op. Interior soft-clips are left alone. 7 new unit tests: enum FromStr/Display, mode flags, indel rejection under default mode, indel retention under BanSingleEnd, zero-mismatch softclip extension folding 4S+40M → 44M, softclip extension exceeding budget → reject, softclip preservation under BanSingleEnd mode. Divergence from STAR: ruSTAR checks for genome/read buffer bounds via `saturating_sub` + slice-length checks before indexing, to protect against edge cases where extension would read past the genome end. STAR trusts its upstream bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tion Add `quant_transcriptome_sam_output: QuantTranscriptomeSAMoutput` field with STAR-compatible default `BanSingleEnd_BanIndels_ExtendSoftclip`. Accepts the three STAR strings via the `FromStr` impl added in the previous commit. Add `Parameters::quant_transcriptome_sam()` helper that scans `quant_mode` for `"TranscriptomeSAM"`, mirroring the existing `quant_gene_counts()` helper. Extend `Parameters::validate()` to require `--sjdbGTFfile` when `--quantMode TranscriptomeSAM` is active. STAR also allows running TranscriptomeSAM in SE (the "single-end" flag in the mode name refers to per-mate, not per-run), so no PE-only gate. 5 new unit tests: default parse, flag enable, explicit mode override (BanSingleEnd and BanSingleEnd_ExtendSoftclip), validate error without GTF, validate success with GTF. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Top-level orchestration for `--quantMode TranscriptomeSAM`. src/io/sam.rs: * Factor `build_sam_header` into a generic `build_sam_header_from_refs(iter, params)` that takes any `(name, length)` iterator. The original `build_sam_header` now delegates to it with the genome's chromosomes. * New `SamWriter::build_transcriptome_records(...)`: builds records with RNAME pointing at the transcriptome header (reference_sequence_id = transcript_idx), POS = t-space position + 1, and emits only the standard attribute set (NH/HI/AS/NM/nM). Splice-aware tags (jM, jI, XS, MD) are dropped because splices collapse in t-space. * `primary_hit_idx` parameter selects the randomly-chosen primary alignment; all others get the SECONDARY (0x100) flag. src/io/bam.rs: * New `BamWriter::create_transcriptome(path, tr_idx, params)` builds a BAM header with one @sq per transcript (name = transcript_id, length = t-space length). * New `BamWriter::header()` accessor. * Unit test verifying transcriptome BAM writer creation + @sq count. src/align/read_align.rs: * `per_read_seed` changed from `fn` to `pub(crate) fn` so lib.rs can seed the transcriptome primary-picker with the same read-scoped seed used for genome-space tie-shuffle. src/lib.rs: * Build an `Arc<TranscriptomeIndex>` alongside the gene-counts context when `--quantMode TranscriptomeSAM` is active, and thread it through run_single_pass / run_two_pass / align_reads_single_end / align_reads_paired_end. * Open `<prefix>Aligned.toTranscriptome.out.bam` via `BamWriter::create_transcriptome` at the start of the pipeline and flush/finish at the end. * Extend `AlignmentBatchResults` with `transcriptome_records: Vec<RecordBuf>`. Per-read builder helpers: - `build_transcriptome_records_se`: projects every genome-space alignment via `filter_and_project`, seeds `StdRng` with `per_read_seed(run_rng_seed, read_name)`, picks a random projected index as primary, builds records. - `build_transcriptome_records_pe`: projects both mates independently, keeps transcripts where both project, emits two records per projected pair (mate1 FIRST_SEGMENT + mate2 LAST_SEGMENT). Primary pick is shared across both mates. * Transcriptome records are written serially in batch-merge order alongside the main SAM/BAM stream (normal and BySJout paths). Known limitations vs STAR: * PE pairing is done per-transcript-idx set intersection between projected mates, not via STAR's combined 2-mate `Transcript`/`EX_iFrag` machinery. Equivalent for the common case where each mate maps uniquely within a given transcript; may diverge when one mate has multiple valid positions within one transcript. * StdRng (inherited from `feat/run-rng-seed`) replaces STAR's std::mt19937 — no bit-for-bit parity on which transcriptome projection is picked as primary among ties. Per-read determinism is preserved via `per_read_seed`. No pass-1 wiring: two-pass pass 1 uses `None` for tr_idx (tr BAM is only emitted from pass 2). cargo fmt applied across the touched files plus incidental fmt touch-ups on read_align.rs / stitch.rs / params.rs / quant/mod.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

End-to-end smoke test in `tests/transcriptome_sam.rs`: * Builds a 2000bp synthetic genome (single chromosome, pseudo-random sequence from an LCG). * Writes a 2-transcript GTF (T1: chr1:101-400 +, T2: chr1:601-900 +). * Generates 20 reads (30bp) drawn alternately from the T1 and T2 exon regions. * Runs `ruSTAR --runMode genomeGenerate` then `alignReads --quantMode TranscriptomeSAM`. * Asserts `Aligned.toTranscriptome.out.bam` is created, > 100 bytes, parses as a valid BAM via noodles, and the header contains exactly two @sq lines named T1 and T2. Uses the same `assert_cmd::Command::cargo_bin` pattern as the existing `phase9_threading.rs` test (inherits the same deprecation warnings; pre-existing parity). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…iptomeSAMoutput Remove `Display` impl (never formatted anywhere) and `allow_single_end()` helper (always returns false, never called). Clean up the stale single-end comment in `filter_and_project`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lders `build_transcriptome_records_se` allocated a redundant `fwd` clone of `read_seq` just to unify the reverse / forward branch, and both SE + PE builders inlined the same 6-line base-complement match. Replace both with a single `rc_encode()` helper delegating to the existing `io::fastq::complement_base`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…r mate Previously `build_transcriptome_records_pe` called the per-record builder twice *per projected pair*, passing a single-element slice and a `primary_hit_idx` sentinel (0 / usize::MAX) to force the SECONDARY flag. The builder already handles multi-alignment sets correctly, so split the pairs into mate1/mate2 slices, call the builder once per mate with the true `primary_hit`, then stamp FIRST_SEGMENT / LAST_SEGMENT on the results and interleave. Same observable output, half the per-pair work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both `build_transcriptome_records_se` and `_pe` duplicated the RNG seeding (`per_read_seed` → `StdRng::seed_from_u64` → `gen_range`) and the MAPQ calculation (`effective_n = n_alignments.max(n_for_mapq)`). Extract a single `pick_primary_and_mapq` helper and hoist the `use` blocks out of the function bodies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`BamWriter::create` and `BamWriter::create_transcriptome` both opened the file, wrapped it in `BufWriter`/`bam::io::Writer`, and wrote the header. Move that boilerplate into a private `with_header` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Merge the two `actual_left` shadowed lets into one `.min(..).min(..)` chain. - Drop the `actual_right` alias (right-clip never needs clamping). - Replace `ext.genome_start = ext.exons.first().map(..).unwrap_or(..)` with a simple `if let Some(first) = ext.exons.first()` — the fallback branch was unreachable because extend_softclips always enters with a non-empty exon list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eq clone - The 5-line single-end carve-out inside the `filter_and_project` doc said the same thing as the inline comment; fold both into a one-line mention. - Drop redundant `// (1) ... (4) ...` step labels from the function body. - `align_to_one_transcript` was cloning `align.read_seq` into the projected transcript but no consumer reads that field on a projected record (transcriptome SAM takes the read from the caller). Use `Vec::new()` to avoid the per-projection allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…o_one_transcript The comment referenced a function name that never existed in ruSTAR (it was presumably copied from an earlier draft); the real call is `is_splice_boundary_before`. Collapse the now-out-of-date four-line block to a single factual line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The `header()` method was added "for serial-write code paths that build records outside of `write_batch`" but no such caller exists; only the unit test used it. The test lives inside the same module so it can read the private `header` field directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Caller had a `Vec<&Box<PairedAlignment>>` and was building a throwaway `Vec<&PairedAlignment>` to match the function's `&[&PairedAlignment]` signature. Change the signature to `IntoIterator<Item = &PairedAlignment>` and call `.iter().map(|b| b.as_ref())` at the call-site, eliminating the intermediate Vec allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The test module already imports `HashMap` via `use super::*` (the parent module re-exports `std::collections::HashMap`). The `StdHashMap` alias was defensive but unused. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ewels and others added 17 commits April 17, 2026 15:48

ewels force-pushed the feat/transcriptome-sam branch from 23672b1 to f420aae Compare April 17, 2026 13:49

ewels changed the title ~~feat: add --quantMode TranscriptomeSAM output~~ feat: add --quantMode TranscriptomeSAM output Apr 17, 2026

ewels changed the title ~~feat: add --quantMode TranscriptomeSAM output~~ feat: add --quantMode TranscriptomeSAM output Apr 17, 2026

ewels closed this Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add `--quantMode TranscriptomeSAM` output#4

feat: add `--quantMode TranscriptomeSAM` output#4
ewels wants to merge 17 commits into
feat/run-rng-seedfrom
feat/transcriptome-sam

ewels commented Apr 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ewels commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes (6 feature commits + 11 simplify commits + 1 fmt commit)

Divergences from STAR (documented in commit bodies)

Test plan

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ewels commented Apr 17, 2026 •

edited

Loading