feat: add --quantMode TranscriptomeSAM output #4
Closed
ewels wants to merge 17 commits into
Introduce `src/quant/transcriptome.rs` with a `TranscriptomeIndex` struct
that groups GTF exons by `transcript_id`, sorts exons by start, and records
absolute genome coords plus STAR's `exLenCum` (cumulative transcript-space
offset) per exon. Also builds:
- `tr_start` / `tr_end`: transcript bounds in absolute genome coords
- `tr_order` / `tr_starts_sorted` / `tr_end_max_sorted`: sorted view used
  for STAR-style `binarySearch1a` + running-max early-exit in `quantAlign`
- per-transcript length, strand, gene_id, chr_idx
Transcripts with inconsistent chr/strand across exons or on unknown
chromosomes are skipped with `log::warn!`, matching STAR's tolerant GTF
handling at `sjdbInsertJunctions` time.
Note: STAR persists this table as `transcriptInfo.tab` + `exonInfo.tab` at
genomeGenerate; ruSTAR rebuilds it at alignment time from the GTF.
Semantically equivalent.
No signing: commit.gpgsign disabled locally (no key on this worktree).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
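A minimal sketch of the two derived tables described in this commit, not the PR's actual code. The `Exon` type, both function names, and the 0-based half-open coordinates are assumptions made for illustration.

```rust
/// One exon in 0-based half-open genome coordinates (assumed convention).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Exon {
    pub start: u64,
    pub end: u64,
}

/// Sort exons by start and compute the cumulative transcript-space offset
/// of each exon (the role played by STAR's `exLenCum`).
pub fn cumulative_offsets(mut exons: Vec<Exon>) -> (Vec<Exon>, Vec<u64>) {
    exons.sort_by_key(|e| e.start);
    let mut ex_len_cum = Vec::with_capacity(exons.len());
    let mut acc = 0u64;
    for e in &exons {
        ex_len_cum.push(acc); // offset of this exon's first base in t-space
        acc += e.end - e.start;
    }
    (exons, ex_len_cum)
}

/// Build the sorted view over transcript bounds `(tr_start, tr_end)`:
/// the sort order, the sorted starts, and the running maximum of ends.
pub fn sorted_view(bounds: &[(u64, u64)]) -> (Vec<usize>, Vec<u64>, Vec<u64>) {
    let mut order: Vec<usize> = (0..bounds.len()).collect();
    order.sort_by_key(|&i| bounds[i]);
    let starts: Vec<u64> = order.iter().map(|&i| bounds[i].0).collect();
    let mut end_max = Vec::with_capacity(order.len());
    let mut running = 0u64;
    for &i in &order {
        running = running.max(bounds[i].1);
        end_max.push(running); // max end over all transcripts sorted at or before i
    }
    (order, starts, end_max)
}
```

The running maximum is what allows a backwards walk to stop early: once it drops below an alignment's end, no earlier transcript can contain the alignment.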
…nscripts)
Port STAR's `Transcriptome::quantAlign` + `alignToTranscript` (see
`source/Transcriptome_quantAlign.cpp:5-89, 91-114`).
`align_to_transcripts` binary-searches `tr_starts_sorted` for the
greatest tr_start <= align.genome_start, then walks back while
`tr_end_max_sorted[i] >= align.genome_end`, calling
`align_to_one_transcript` on every candidate whose [tr_start, tr_end]
fully contains the alignment.
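The candidate search above could look roughly like the following. This is a sketch under assumptions: the function name and parameter layout are hypothetical, ends are treated as half-open, and `partition_point` stands in for STAR's `binarySearch1a`.

```rust
/// Return the indices of transcripts whose [tr_start, tr_end) span fully
/// contains the alignment [a_start, a_end).
pub fn candidate_transcripts(
    tr_order: &[usize],
    tr_starts_sorted: &[u64],
    tr_end_max_sorted: &[u64],
    tr_end: &[u64],
    a_start: u64,
    a_end: u64,
) -> Vec<usize> {
    // Number of sorted transcripts with tr_start <= a_start.
    let n = tr_starts_sorted.partition_point(|&s| s <= a_start);
    let mut hits = Vec::new();
    // Walk back from the greatest qualifying start.
    for i in (0..n).rev() {
        // Early exit: nothing at or before i reaches a_end.
        if tr_end_max_sorted[i] < a_end {
            break;
        }
        let tr = tr_order[i];
        if tr_end[tr] >= a_end {
            hits.push(tr);
        }
    }
    hits
}
```

With transcripts `(500, 600)`, `(100, 900)`, `(200, 300)`, an alignment at `[250, 280)` yields transcripts 2 and 1, in walk-back order.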
`align_to_one_transcript` walks the alignment's exon blocks and:
* locates the transcript exon containing the first block
(binary-searched `find_containing_exon`),
* rejects blocks that extend past a transcript-exon boundary,
* on each splice boundary (ruSTAR's `is_splice_boundary_before`,
which approximates STAR's `canonSJ >= 0` using read-side
continuity), requires `prev.genome_end == tr_ex[ex].genome_end &&
next.genome_start == tr_ex[ex+1].genome_start`,
* translates each block start to t-space via
`ex_len_cum[ex] + (g_pos - exon_start)`.
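The per-block steps in the list above can be sketched as follows. Only `find_containing_exon` and `ex_len_cum` are names from the commit message; the signatures, `project_block_start`, and `junction_matches` are assumptions, and coordinates are taken to be 0-based half-open.

```rust
/// Binary-search the exon that contains genome position `g_pos`.
pub fn find_containing_exon(starts: &[u64], ends: &[u64], g_pos: u64) -> Option<usize> {
    let n = starts.partition_point(|&s| s <= g_pos);
    if n == 0 {
        return None;
    }
    let ex = n - 1;
    if g_pos < ends[ex] { Some(ex) } else { None }
}

/// Translate a block start to transcript space, rejecting blocks that
/// extend past the end of the containing exon.
pub fn project_block_start(
    starts: &[u64],
    ends: &[u64],
    ex_len_cum: &[u64],
    g_start: u64,
    len: u64,
) -> Option<u64> {
    let ex = find_containing_exon(starts, ends, g_start)?;
    if g_start + len > ends[ex] {
        return None;
    }
    Some(ex_len_cum[ex] + (g_start - starts[ex]))
}

/// A splice between two blocks must land exactly on the annotated junction
/// between exon `ex` and exon `ex + 1`.
pub fn junction_matches(
    starts: &[u64],
    ends: &[u64],
    ex: usize,
    prev_block_end: u64,
    next_block_start: u64,
) -> bool {
    ex + 1 < starts.len() && prev_block_end == ends[ex] && next_block_start == starts[ex + 1]
}
```

For exons `[100, 150)` and `[300, 400)` with `ex_len_cum = [0, 50]`, a 20-base block starting at genome position 310 projects to t-space position 60.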
Reverse-strand transcripts flip all projected exons:
`read_pos' = Lread - (read_pos + len)`,
`tr_pos' = tr_length - (tr_pos + len)`,
then reverse the exon vector. `is_reverse` is XOR'd with
`tr_strand == 2`.
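A small sketch of the flip, applying the two formulas above to each projected block and then reversing the order. The `TBlock` type and function names are hypothetical.

```rust
/// One projected block: position in the read, position in the transcript, length.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TBlock {
    pub read_pos: u64,
    pub tr_pos: u64,
    pub len: u64,
}

/// Flip all blocks for a reverse-strand transcript, then reverse their order.
pub fn flip_for_reverse_strand(blocks: &mut [TBlock], l_read: u64, tr_length: u64) {
    for b in blocks.iter_mut() {
        b.read_pos = l_read - (b.read_pos + b.len);
        b.tr_pos = tr_length - (b.tr_pos + b.len);
    }
    blocks.reverse();
}

/// The projected strand flag is the alignment's flag XOR'd with
/// "transcript is on the reverse strand" (`tr_strand == 2`).
pub fn projected_is_reverse(align_is_reverse: bool, tr_strand: u8) -> bool {
    align_is_reverse ^ (tr_strand == 2)
}
```

For a 30-base read on a 100-base transcript, blocks `(read 0, tr 10, len 20)` and `(read 20, tr 50, len 10)` become `(read 0, tr 40, len 10)` and `(read 10, tr 70, len 20)`.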
Projected CIGAR = original CIGAR with N operations dropped; reversed
for reverse-strand transcripts. Splices collapse so `n_junction = 0`.
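As an illustration of the CIGAR rule, here is a sketch with a hypothetical `Op` enum (ruSTAR's real CIGAR type is not shown in this PR text). It drops `N` operations and reverses for reverse-strand transcripts; whether adjacent `M` operations are merged afterwards is not stated above, so this sketch leaves them unmerged.

```rust
/// Hypothetical CIGAR operation type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Match(u32),
    Ins(u32),
    Del(u32),
    Skip(u32), // the SAM `N` operation
    SoftClip(u32),
}

/// Drop every `N` operation; reverse the result for reverse-strand transcripts.
pub fn project_cigar(cigar: &[Op], reverse: bool) -> Vec<Op> {
    let mut out: Vec<Op> = cigar
        .iter()
        .copied()
        .filter(|op| !matches!(op, Op::Skip(_)))
        .collect();
    if reverse {
        out.reverse();
    }
    out
}
```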
ruSTAR-specific notes:
* `Transcript.exons` has no canonSJ array; splice vs indel boundaries
are discriminated by read-side continuity (insertion → read gap,
splice → read contiguous with genome gap). Pure deletions rarely
produce block splits because the stitch merge coalesces across Del.
* GTF is 1-based inclusive; ruSTAR 0-based half-open. Transcript
length = sum(end - start). STAR uses `exLenCum[last] + (end -
start) + 1` which arrives at the same numeric value on 1-based
inclusive inputs.
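The coordinate note can be checked with a little arithmetic. This sketch uses hypothetical helper names and assumes that `exLenCum` accumulates inclusive exon lengths; it shows that both conventions give the same transcript length.

```rust
/// Convert a GTF interval (1-based, inclusive) to 0-based half-open.
pub fn gtf_to_half_open(start: u64, end_inclusive: u64) -> (u64, u64) {
    (start - 1, end_inclusive)
}

/// Transcript length on 0-based half-open exons: sum(end - start).
pub fn length_half_open(exons: &[(u64, u64)]) -> u64 {
    exons.iter().map(|&(s, e)| e - s).sum()
}

/// Transcript length on 1-based inclusive exons, written the way the
/// commit message quotes STAR: exLenCum[last] + (end - start) + 1.
/// Returns 0 for an empty exon list.
pub fn length_inclusive(exons: &[(u64, u64)]) -> u64 {
    let Some((&(last_s, last_e), rest)) = exons.split_last() else {
        return 0;
    };
    // Cumulative length of all exons before the last one.
    let ex_len_cum_last: u64 = rest.iter().map(|&(s, e)| e - s + 1).sum();
    ex_len_cum_last + (last_e - last_s) + 1
}
```

For exons `101-400` and `601-900` in GTF coordinates, both functions return 600.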
8 new unit tests cover: single-exon projection, two-exon with matching
junction, junction mismatch (rejected), reverse-strand flip, multi-exon
projection into a longer transcript, out-of-bounds cases, and
projection onto multiple overlapping transcripts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Port STAR's `ReadAlign::quantTranscriptome` gatekeeping logic (see
`source/ReadAlign_quantTranscriptome.cpp:9-66`) as `filter_and_project`:
1. If `!allow_indels && n_gap > 0` → reject.
2. Single-end filter: for PE, skip alignments where both ends came from
   the same mate. ruSTAR's `Transcript` does not carry per-block mate
   tags; the caller enforces this by only invoking `filter_and_project`
   on both-mapped PE pair mates. Documented as a no-op at the per-mate
   level.
3. Soft-clip extension (mode-dependent): extend leading / trailing
   soft-clipped bases back into matched bases, counting mismatches where
   both read and genome bases are < 4 and unequal. Reject if
   `n_mismatch + extension_mismatches > min(outFilterMismatchNmax,
   outFilterMismatchNoverLmax*(Lread-1))`.
4. Call `align_to_transcripts` for projection.
New enum `QuantTranscriptomeSAMoutput` with three variants matching STAR's
CLI strings:
* `BanSingleEnd_BanIndels_ExtendSoftclip` (default, RSEM-compatible)
* `BanSingleEnd`
* `BanSingleEnd_ExtendSoftclip`
`rebuild_cigar_without_softclips` strips the leading/trailing S ops and
folds their length into the adjacent M op. Interior soft-clips are left
alone.
7 new unit tests: enum FromStr/Display, mode flags, indel rejection under
default mode, indel retention under BanSingleEnd, zero-mismatch softclip
extension folding 4S+40M → 44M, softclip extension exceeding budget →
reject, softclip preservation under BanSingleEnd mode.
Divergence from STAR: ruSTAR checks for genome/read buffer bounds via
`saturating_sub` + slice-length checks before indexing, to protect against
edge cases where extension would read past the genome end. STAR trusts its
upstream bounds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
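A sketch of the soft-clip pieces of this commit, under stated assumptions: the `Op` enum is hypothetical, `count_extension_mismatches` and `mismatch_budget` are illustrative names, and truncating the length-scaled budget to an integer is an assumption about rounding. Only `rebuild_cigar_without_softclips` is a name taken from the commit message.

```rust
/// Hypothetical CIGAR operation type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Match(u32),
    Ins(u32),
    Del(u32),
    SoftClip(u32),
}

/// Count mismatches over an extended region: a position counts only when
/// both encoded bases are < 4 (so N-like codes are ignored) and unequal.
pub fn count_extension_mismatches(read: &[u8], genome: &[u8]) -> u32 {
    read.iter()
        .zip(genome)
        .filter(|&(&r, &g)| r < 4 && g < 4 && r != g)
        .count() as u32
}

/// min(outFilterMismatchNmax, outFilterMismatchNoverLmax * (Lread - 1)).
pub fn mismatch_budget(n_max: u32, n_over_l_max: f64, l_read: u32) -> u32 {
    let by_len = (n_over_l_max * f64::from(l_read.saturating_sub(1))) as u32;
    n_max.min(by_len)
}

/// Fold a leading and/or trailing soft-clip into the adjacent match.
/// Interior operations are left untouched.
pub fn rebuild_cigar_without_softclips(cigar: &[Op]) -> Vec<Op> {
    let mut out = cigar.to_vec();
    if out.len() >= 2 {
        if let (Op::SoftClip(s), Op::Match(m)) = (out[0], out[1]) {
            out[1] = Op::Match(s + m);
            out.remove(0);
        }
    }
    if out.len() >= 2 {
        let n = out.len();
        if let (Op::Match(m), Op::SoftClip(s)) = (out[n - 2], out[n - 1]) {
            out[n - 2] = Op::Match(m + s);
            out.pop();
        }
    }
    out
}
```

This reproduces the `4S+40M → 44M` case named in the unit-test list; a caller would reject the alignment when the existing mismatch count plus `count_extension_mismatches` exceeds `mismatch_budget`.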
…tion
Add `quant_transcriptome_sam_output: QuantTranscriptomeSAMoutput` field
with STAR-compatible default `BanSingleEnd_BanIndels_ExtendSoftclip`.
Accepts the three STAR strings via the `FromStr` impl added in the previous
commit.
Add `Parameters::quant_transcriptome_sam()` helper that scans `quant_mode`
for `"TranscriptomeSAM"`, mirroring the existing `quant_gene_counts()`
helper.
Extend `Parameters::validate()` to require `--sjdbGTFfile` when
`--quantMode TranscriptomeSAM` is active. STAR also allows running
TranscriptomeSAM in SE (the "single-end" flag in the mode name refers to
per-mate, not per-run), so no PE-only gate.
5 new unit tests: default parse, flag enable, explicit mode override
(BanSingleEnd and BanSingleEnd_ExtendSoftclip), validate error without GTF,
validate success with GTF.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
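A sketch of how the enum, the helper, and the validation rule might fit together. The enum variants and the names `quant_transcriptome_sam`, `validate`, and `quant_mode` come from the commit messages; the `Parameters` field layout, `sjdb_gtf_file`, `extend_softclip`, and the `String` error type are assumptions.

```rust
use std::str::FromStr;

#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum QuantTranscriptomeSAMoutput {
    #[default]
    BanSingleEnd_BanIndels_ExtendSoftclip,
    BanSingleEnd,
    BanSingleEnd_ExtendSoftclip,
}

impl FromStr for QuantTranscriptomeSAMoutput {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "BanSingleEnd_BanIndels_ExtendSoftclip" => Ok(Self::BanSingleEnd_BanIndels_ExtendSoftclip),
            "BanSingleEnd" => Ok(Self::BanSingleEnd),
            "BanSingleEnd_ExtendSoftclip" => Ok(Self::BanSingleEnd_ExtendSoftclip),
            other => Err(format!("unknown --quantTranscriptomeSAMoutput value: {other}")),
        }
    }
}

impl QuantTranscriptomeSAMoutput {
    /// Indels are banned only in the default mode.
    pub fn allow_indels(self) -> bool {
        self != Self::BanSingleEnd_BanIndels_ExtendSoftclip
    }

    /// Soft-clip extension is on in both `*_ExtendSoftclip` modes.
    pub fn extend_softclip(self) -> bool {
        self != Self::BanSingleEnd
    }
}

/// Hypothetical subset of the parameter struct.
#[derive(Debug, Default)]
pub struct Parameters {
    pub quant_mode: Vec<String>,
    pub sjdb_gtf_file: Option<String>,
    pub quant_transcriptome_sam_output: QuantTranscriptomeSAMoutput,
}

impl Parameters {
    pub fn quant_transcriptome_sam(&self) -> bool {
        self.quant_mode.iter().any(|m| m == "TranscriptomeSAM")
    }

    pub fn validate(&self) -> Result<(), String> {
        if self.quant_transcriptome_sam() && self.sjdb_gtf_file.is_none() {
            return Err("--quantMode TranscriptomeSAM requires --sjdbGTFfile".to_string());
        }
        Ok(())
    }
}
```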
Top-level orchestration for `--quantMode TranscriptomeSAM`.
src/io/sam.rs:
* Factor `build_sam_header` into a generic
`build_sam_header_from_refs(iter, params)` that takes any
`(name, length)` iterator. The original `build_sam_header` now
delegates to it with the genome's chromosomes.
* New `SamWriter::build_transcriptome_records(...)`: builds records
with RNAME pointing at the transcriptome header (reference_sequence_id
= transcript_idx), POS = t-space position + 1, and emits only the
standard attribute set (NH/HI/AS/NM/nM). Splice-aware tags (jM, jI,
XS, MD) are dropped because splices collapse in t-space.
* `primary_hit_idx` parameter selects the randomly-chosen primary
alignment; all others get the SECONDARY (0x100) flag.
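To make the POS and flag rules concrete, here is a sketch with a hypothetical helper. The flag values are the standard SAM bits; setting the reverse-complement bit from `is_reverse` is an assumption about the builder, which is not shown here.

```rust
pub const FLAG_REVERSE: u16 = 0x10;
pub const FLAG_SECONDARY: u16 = 0x100;

/// Compute the SAM flag and 1-based POS for one projected hit.
/// `tr_pos0` is the 0-based transcript-space start.
pub fn transcriptome_flag_and_pos(
    hit_idx: usize,
    primary_hit_idx: usize,
    is_reverse: bool,
    tr_pos0: u64,
) -> (u16, u64) {
    let mut flag = 0u16;
    if is_reverse {
        flag |= FLAG_REVERSE;
    }
    // Every hit except the chosen primary is marked secondary.
    if hit_idx != primary_hit_idx {
        flag |= FLAG_SECONDARY;
    }
    (flag, tr_pos0 + 1)
}
```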
src/io/bam.rs:
* New `BamWriter::create_transcriptome(path, tr_idx, params)` builds
a BAM header with one @SQ per transcript (name = transcript_id,
length = t-space length).
* New `BamWriter::header()` accessor.
* Unit test verifying transcriptome BAM writer creation + @SQ count.
src/align/read_align.rs:
* `per_read_seed` changed from `fn` to `pub(crate) fn` so lib.rs can
seed the transcriptome primary-picker with the same read-scoped
seed used for genome-space tie-shuffle.
src/lib.rs:
* Build an `Arc<TranscriptomeIndex>` alongside the gene-counts
context when `--quantMode TranscriptomeSAM` is active, and thread
it through run_single_pass / run_two_pass / align_reads_single_end
/ align_reads_paired_end.
* Open `<prefix>Aligned.toTranscriptome.out.bam` via
`BamWriter::create_transcriptome` at the start of the pipeline and
flush/finish at the end.
* Extend `AlignmentBatchResults` with `transcriptome_records:
Vec<RecordBuf>`. Per-read builder helpers:
- `build_transcriptome_records_se`: projects every genome-space
alignment via `filter_and_project`, seeds `StdRng` with
`per_read_seed(run_rng_seed, read_name)`, picks a random
projected index as primary, builds records.
- `build_transcriptome_records_pe`: projects both mates
independently, keeps transcripts where both project, emits two
records per projected pair (mate1 FIRST_SEGMENT + mate2
LAST_SEGMENT). Primary pick is shared across both mates.
* Transcriptome records are written serially in batch-merge order
alongside the main SAM/BAM stream (normal and BySJout paths).
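The per-read seeded primary pick can be illustrated with the standard library alone. This is a stand-in, not the PR's code: the real implementation uses `StdRng` and the existing `per_read_seed`, whereas this sketch substitutes an FNV-1a hash and a SplitMix64 step, and all names here are hypothetical.

```rust
/// Stand-in for `per_read_seed`: mix the run seed with the read name
/// using FNV-1a (an assumption; the real hash is not shown in this PR text).
pub fn per_read_seed_sketch(run_seed: u64, read_name: &str) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325 ^ run_seed;
    for &b in read_name.as_bytes() {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// One SplitMix64 step, standing in for drawing from a seeded RNG.
fn splitmix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9e37_79b9_7f4a_7c15);
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

/// Pick the primary among `n_projected` hits. The result depends only on
/// the run seed and the read name, so it does not vary with thread
/// scheduling or batch order.
pub fn pick_primary(run_seed: u64, read_name: &str, n_projected: usize) -> Option<usize> {
    if n_projected == 0 {
        return None;
    }
    let draw = splitmix64(per_read_seed_sketch(run_seed, read_name));
    Some((draw % n_projected as u64) as usize)
}
```

Deriving the seed from the read rather than from a shared RNG stream is what keeps the choice reproducible across runs with different thread counts.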
Known limitations vs STAR:
* PE pairing is done per-transcript-idx set intersection between
projected mates, not via STAR's combined 2-mate
`Transcript`/`EX_iFrag` machinery. Equivalent for the common case
where each mate maps uniquely within a given transcript; may
diverge when one mate has multiple valid positions within one
transcript.
* StdRng (inherited from `feat/run-rng-seed`) replaces STAR's
std::mt19937 — no bit-for-bit parity on which transcriptome
projection is picked as primary among ties. Per-read
determinism is preserved via `per_read_seed`.
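The first limitation is easier to see in code. This sketch pairs mates by transcript index only, with hypothetical types and names; when mate 2 has several projections in one transcript it keeps just the first, which is where behavior could diverge from STAR's combined two-mate handling.

```rust
use std::collections::BTreeMap;

/// Hypothetical projected hit: transcript index plus t-space start.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Projection {
    pub tr_idx: usize,
    pub tr_pos: u64,
}

/// Keep only transcripts onto which both mates project, pairing by `tr_idx`.
pub fn pair_by_transcript(
    mate1: &[Projection],
    mate2: &[Projection],
) -> Vec<(Projection, Projection)> {
    let mut by_tr: BTreeMap<usize, Projection> = BTreeMap::new();
    for p in mate2 {
        // First projection per transcript wins (the simplification).
        by_tr.entry(p.tr_idx).or_insert(*p);
    }
    mate1
        .iter()
        .filter_map(|p1| by_tr.get(&p1.tr_idx).map(|p2| (*p1, *p2)))
        .collect()
}
```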
No pass-1 wiring: two-pass pass 1 uses `None` for tr_idx (tr BAM is
only emitted from pass 2).
cargo fmt applied across the touched files plus incidental fmt
touch-ups on read_align.rs / stitch.rs / params.rs / quant/mod.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke test in `tests/transcriptome_sam.rs`:
* Builds a 2000bp synthetic genome (single chromosome, pseudo-random
sequence from an LCG).
* Writes a 2-transcript GTF (T1: chr1:101-400 +, T2: chr1:601-900 +).
* Generates 20 reads (30bp) drawn alternately from the T1 and T2
exon regions.
* Runs `ruSTAR --runMode genomeGenerate` then `alignReads
--quantMode TranscriptomeSAM`.
* Asserts `Aligned.toTranscriptome.out.bam` is created, > 100 bytes,
parses as a valid BAM via noodles, and the header contains exactly
two @SQ lines named T1 and T2.
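The synthetic inputs for such a smoke test could be generated along these lines. The LCG constants, the offset stride, the `sketch` source column, and all function names are assumptions; the actual test's generator is not shown in this PR text.

```rust
/// Pseudo-random genome from a 64-bit LCG (constants are an assumption).
pub fn lcg_genome(len: usize, seed: u64) -> String {
    let mut state = seed;
    let mut out = String::with_capacity(len);
    for _ in 0..len {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Use high bits; the low bits of an LCG have short periods.
        out.push(match (state >> 33) & 3 {
            0 => 'A',
            1 => 'C',
            2 => 'G',
            _ => 'T',
        });
    }
    out
}

/// One GTF exon line (1-based inclusive coordinates, nine tab-separated fields).
pub fn gtf_exon_line(chr: &str, start: u64, end: u64, strand: char, tr: &str, gene: &str) -> String {
    format!("{chr}\tsketch\texon\t{start}\t{end}\t.\t{strand}\t.\tgene_id \"{gene}\"; transcript_id \"{tr}\";")
}

/// Draw reads alternately from the given 0-based half-open regions.
pub fn reads_from_regions(
    genome: &str,
    regions: &[(usize, usize)],
    n_reads: usize,
    read_len: usize,
) -> Vec<String> {
    (0..n_reads)
        .map(|i| {
            let (start, end) = regions[i % regions.len()];
            let span = end - start - read_len + 1; // number of valid offsets
            let offset = (i / regions.len()) * 7 % span;
            genome[start + offset..start + offset + read_len].to_string()
        })
        .collect()
}
```

With regions `(100, 400)` and `(600, 900)`, which correspond to T1 and T2 in 0-based half-open form, `reads_from_regions(&genome, &regions, 20, 30)` yields 20 reads of 30 bases each.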
Uses the same `assert_cmd::Command::cargo_bin` pattern as the existing
`phase9_threading.rs` test (inherits the same deprecation warnings;
pre-existing parity).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iptomeSAMoutput
Remove `Display` impl (never formatted anywhere) and `allow_single_end()`
helper (always returns false, never called). Clean up the stale single-end
comment in `filter_and_project`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lders
`build_transcriptome_records_se` allocated a redundant `fwd` clone of
`read_seq` just to unify the reverse / forward branch, and both SE + PE
builders inlined the same 6-line base-complement match. Replace both with a
single `rc_encode()` helper delegating to the existing
`io::fastq::complement_base`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
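A sketch of what such a helper might look like. The numeric encoding (A=0, C=1, G=2, T=3, anything else such as N left unchanged) is an assumption, and `complement_base` here is a local stand-in for the existing `io::fastq::complement_base`, whose real signature is not shown.

```rust
/// Stand-in complement on the assumed encoding: A<->T and C<->G map to
/// 3 - b; codes >= 4 (e.g. N) are returned unchanged.
pub fn complement_base(b: u8) -> u8 {
    if b < 4 { 3 - b } else { b }
}

/// Reverse-complement an encoded read in one pass.
pub fn rc_encode(seq: &[u8]) -> Vec<u8> {
    seq.iter().rev().map(|&b| complement_base(b)).collect()
}
```

Under this encoding, `ACGTN` (`[0, 1, 2, 3, 4]`) becomes `NACGT` (`[4, 0, 1, 2, 3]`).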
…r mate
Previously `build_transcriptome_records_pe` called the per-record builder
twice *per projected pair*, passing a single-element slice and a
`primary_hit_idx` sentinel (0 / usize::MAX) to force the SECONDARY flag.
The builder already handles multi-alignment sets correctly, so split the
pairs into mate1/mate2 slices, call the builder once per mate with the true
`primary_hit`, then stamp FIRST_SEGMENT / LAST_SEGMENT on the results and
interleave. Same observable output, half the per-pair work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both `build_transcriptome_records_se` and `_pe` duplicated the RNG seeding
(`per_read_seed` → `StdRng::seed_from_u64` → `gen_range`) and the MAPQ
calculation (`effective_n = n_alignments.max(n_for_mapq)`). Extract a
single `pick_primary_and_mapq` helper and hoist the `use` blocks out of the
function bodies.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
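The MAPQ half of that helper might reduce to something like this. Only the `effective_n` expression comes from the commit message; the mapping from multiplicity to MAPQ follows STAR's documented defaults (255 unique, 3 for two loci, 1 for three or four, 0 otherwise), and assuming ruSTAR uses the same table is a guess. The function name is hypothetical.

```rust
/// MAPQ from alignment multiplicity, using the larger of the projected
/// count and the count supplied for MAPQ purposes.
pub fn mapq_for(n_alignments: usize, n_for_mapq: usize) -> u8 {
    let effective_n = n_alignments.max(n_for_mapq);
    match effective_n {
        0 | 1 => 255, // unique
        2 => 3,
        3 | 4 => 1,
        _ => 0,
    }
}
```

Taking the maximum means a read that projects onto a single transcript but had several genome-space alignments still receives the lower multi-mapper MAPQ.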
`BamWriter::create` and `BamWriter::create_transcriptome` both opened the
file, wrapped it in `BufWriter`/`bam::io::Writer`, and wrote the header.
Move that boilerplate into a private `with_header` helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Merge the two `actual_left` shadowed lets into one `.min(..).min(..)`
  chain.
- Drop the `actual_right` alias (right-clip never needs clamping).
- Replace `ext.genome_start = ext.exons.first().map(..).unwrap_or(..)`
  with a simple `if let Some(first) = ext.exons.first()`: the fallback
  branch was unreachable because extend_softclips always enters with a
  non-empty exon list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eq clone
- The 5-line single-end carve-out inside the `filter_and_project` doc said
  the same thing as the inline comment; fold both into a one-line mention.
- Drop redundant `// (1) ... (4) ...` step labels from the function body.
- `align_to_one_transcript` was cloning `align.read_seq` into the projected
  transcript but no consumer reads that field on a projected record
  (transcriptome SAM takes the read from the caller). Use `Vec::new()` to
  avoid the per-projection allocation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_one_transcript
The comment referenced a function name that never existed in ruSTAR (it was
presumably copied from an earlier draft); the real call is
`is_splice_boundary_before`. Collapse the now-out-of-date four-line block
to a single factual line.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `header()` method was added "for serial-write code paths that build
records outside of `write_batch`" but no such caller exists; only the unit
test used it. The test lives inside the same module so it can read the
private `header` field directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caller had a `Vec<&Box<PairedAlignment>>` and was building a throwaway
`Vec<&PairedAlignment>` to match the function's `&[&PairedAlignment]`
signature. Change the signature to `IntoIterator<Item = &PairedAlignment>`
and call `.iter().map(|b| b.as_ref())` at the call-site, eliminating the
intermediate Vec allocation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
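The signature change can be sketched in isolation. `PairedAlignment` here is a one-field placeholder and `total_score` is a hypothetical function; only the `IntoIterator<Item = &PairedAlignment>` bound mirrors the commit.

```rust
/// Placeholder for the real struct.
#[derive(Debug)]
pub struct PairedAlignment {
    pub score: i32,
}

/// Accept any iterable of `&PairedAlignment` instead of `&[&PairedAlignment]`,
/// so callers holding boxed or otherwise wrapped values need no temporary Vec.
pub fn total_score<'a, I>(pairs: I) -> i32
where
    I: IntoIterator<Item = &'a PairedAlignment>,
{
    pairs.into_iter().map(|p| p.score).sum()
}
```

A caller holding `Vec<Box<PairedAlignment>>` can pass `owned.iter().map(|b| &**b)` directly, and a plain `&[PairedAlignment]` slice works unchanged because `&[T]` iterates over `&T`.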
The test module already imports `HashMap` via `use super::*` (the parent
module re-exports `std::collections::HashMap`). The `StdHashMap` alias was
defensive but unused.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
23672b1 to
f420aae
Summary
Implements `--quantMode TranscriptomeSAM` + `--quantTranscriptomeSAMoutput`, producing `Aligned.toTranscriptome.out.bam` for downstream Salmon/RSEM quantification. Matches STAR's behavior from `Transcriptome.cpp`, `Transcriptome_quantAlign.cpp`, `ReadAlign_quantTranscriptome.cpp`.
Part 3 of 3 branches unblocking nf-core/rnaseq's STAR_ALIGN step. Stacked on #2 (`feat/run-rng-seed`): needs that branch's RNG helpers (`per_read_seed`) for random primary-transcript selection. Please review and merge #2 first.
Changes (6 feature commits + 11 simplify commits + 1 fmt commit)
Feature commits:
* `feat(quant): add TranscriptomeIndex built from GTF exon groups`: `src/quant/transcriptome.rs` NEW. Per-transcript metadata (strand, gene_id, exons with cumulative length, tr_length). Binary-searchable `tr_starts_sorted` / `tr_end_max_sorted`. 22 unit tests.
* `feat(quant): project genome alignments onto transcripts (align_to_transcripts)`: ports STAR's `Transcriptome_quantAlign.cpp`: binary search, per-block projection, junction matching, reverse-strand flip.
* `feat(quant): filter modes + softclip extension for transcriptome SAM`: three `QuantTranscriptomeSAMoutput` modes (`BanSingleEnd_BanIndels_ExtendSoftclip` default, `BanSingleEnd`, `BanSingleEnd_ExtendSoftclip`). Ports STAR's soft-clip extension with mismatch re-count.
* `feat(params): --quantTranscriptomeSAMoutput + TranscriptomeSAM validation`: CLI wiring, `quant_transcriptome_sam()` helper, validate requires `--sjdbGTFfile`.
* `feat(io/bam,lib): wire Aligned.toTranscriptome.out.bam pipeline`: `BamWriter::create_transcriptome()` (per-transcript `@SQ` header), `build_transcriptome_records{_se,_pe}`, plumbed through `AlignmentBatchResults`, random primary via `per_read_seed` (from #2, "feat: add `--runRNGseed` flag with seeded primary tie-break").
* `test: integration test for --quantMode TranscriptomeSAM`: end-to-end mini test with synthetic genome + GTF + 20 reads.
Simplify commits (11): dead-code removal (`Display` / `allow_single_end` / `BamWriter::header`), shared helpers (`rc_encode`, `pick_primary_and_mapq`, `BamWriter::with_header`), one-build-per-mate PE emission (no sentinel hack), unread `read_seq` clone dropped, comment cleanup.
Plus: 1 `cargo fmt` commit. Rebased onto latest `feat/run-rng-seed` after #2's simplify + fmt commits landed.
`src/quant/mod.rs`, `src/quant/transcriptome.rs` NEW, `src/params.rs`, `src/io/sam.rs`, `src/io/bam.rs`, `src/align/read_align.rs`, `src/lib.rs`, `tests/transcriptome_sam.rs` NEW.
Divergences from STAR (documented in commit bodies)
* … `--runRNGseed` flag with seeded primary tie-break #2. Per-read deterministic, not bit-for-bit with STAR's mt19937 primary-pick order.
* … `transcriptInfo.tab` / `exonInfo.tab` at genomeGenerate; ruSTAR builds equivalents on-the-fly from GTF. Semantically equivalent.
* … `Transcript` / `EX_iFrag` machinery. Equivalent for unique-per-transcript mates, may differ for complex multi-site cases within one transcript.
* … `PairedAlignment` mates invoke the projector), not via per-block `EX_iFrag` checks.
Test plan
* `cargo test`: 316 passing
* `cargo clippy --lib -- -D warnings`: clean
* `cargo fmt --check`: clean
* … `@SQ` + expected records
Notes
* … `feat/run-rng-seed`.
🤖 Generated with Claude Code