feat(genome): GTF-derived splice-junction insertion into Genome + SA (Phase 17.X) by ewels · Pull Request #5 · ewels/rustar-aligner

ewels · 2026-04-22T05:53:16Z

Summary

Ports STAR's sjdbPrepare.cpp + sjdbBuildIndex.cpp to ruSTAR. At genomeGenerate, GTF-derived splice junctions are now baked into the Genome binary and indexed by the suffix array, matching STAR's layout byte-for-byte.

Stacked on scverse#7 (TranscriptomeSAM). Merge scverse#7 first, then rebase this branch onto main.

Validation — 15/15 files match STAR at genomeGenerate

test/diff_star_transcriptome.sh against STAR 2.7.11b on the synthetic 2-chr / 4-transcript / 3-junction test case:

✓ chrName.txt
✓ chrLength.txt
✓ chrStart.txt
✓ chrNameLength.txt
✓ transcriptInfo.tab
✓ exonInfo.tab
✓ geneInfo.tab
✓ exonGeTrInfo.tab
✓ sjdbList.fromGTF.out.tab
✓ sjdbInfo.txt
✓ sjdbList.out.tab
✓ Genome
✓ genomeParameters.txt
~ SA size-only match (51929 bytes) — content differs (expected, different SA algorithm)
~ SAindex size-only match (6027 bytes) — content differs (expected, different SA algorithm)
RESULT: 15/15 files identical (or size-matched) to STAR's output

SA/SAindex content divergence is expected: STAR's bucketed parallel SA construction breaks ties differently than ruSTAR's comparator-based sort. Size parity proves we indexed the same genome+Gsj length. Byte-for-byte SA parity would require reimplementing STAR's SA algorithm — out of scope for this PR.

Round-trip verified: ruSTAR can load the extended index it built and align reads end-to-end (load path correctly reads n_genome=524585 from genomeFileSizes, confirming Gsj bytes are included alongside the real genome bytes).

What changed

`src/junction/sjdb_insert.rs` (new module, 19 unit tests)

compute_shifts — STAR's sjdbShiftLeft / sjdbShiftRight (repeat length across intron boundary, capped at 255, stops at N).
PreparedJunction + prepare_junction — combines motif detection, shifts, strand derivation. stored_start/end (for sjdbInfo.txt) vs original_start/end (for Gsj extraction) follow STAR's convention where canonical motifs store the pre-shift ORIGINAL coord and non-canonical ones store the SHIFTED coord.
sort_and_dedup — STAR's post-sort + cross-strand dedup (sjdbPrepare.cpp:141-192).
build_gsj — concatenated flanking sequences, 2*overhang+1 bytes per junction (donor + acceptor + spacer=5).
write_sjdb_info_tab + write_sjdb_list_out_tab — byte-exact file writers (header <n>\t<overhang>\n + six tab-separated fields per row; chr-local 1-based coords + strand char).

`src/genome/mod.rs` — `Genome::append_sjdb()`

Extends the forward buffer with Gsj, rebuilds the RC over the extended range. chr_start[n_chr_real] stays pinned at the pre-sjdb forward total (matches STAR's chrStart.txt — verified against the 10k yeast docker output). n_genome grows to include Gsj.

`src/index/mod.rs` — `GenomeIndex::build` orchestration

Parse GTF once (shared between junction_db, transcriptome, and sjdb_insert — was parsed twice before).
Convert chr-local 1-based GTF intron coords to 0-based absolute genome offsets for prepare_junction.
Sort + dedup.
Build Gsj and call Genome::append_sjdb() BEFORE the SA is built (STAR indexes Gsj alongside the real genome in a single SA, sjdbBuildIndex.cpp:293).
Stash sorted prepared junctions in a new prepared_junctions field so write() can emit sjdbInfo.txt + sjdbList.out.tab.

`src/index/io.rs` — load path

Reads n_genome from genomeParameters.txt's genomeFileSizes line. The chr_start boundary is only the pre-sjdb forward total, so with sjdb-extended indices the two values diverge and chr_start cannot be used as the total genome size.

Other

src/junction/mod.rs — SpliceJunctionDb::from_raw_junctions helper (shared by from_gtf and the build path, avoiding the GTF-parse-twice problem).
test/diff_star_transcriptome.sh — extended from 10 files to 15 (added Genome, sjdbInfo.txt, sjdbList.out.tab, plus size-only check for SA/SAindex).

Pre-existing bugs flagged (separate PRs)

Documented during the transcriptome-BAM validation in scverse#7 — not introduced or resolved here:

QUAL encoding in transcriptome BAM — ruSTAR double-encodes Phred+33.
Default SAM attributes — ruSTAR emits AS nM in addition to STAR's default NH HI.
FASTQ parser — fails on some specific reads in the 50k yeast set.

Test plan

cargo test — 359 unit + 4 integration + 1 doc test, all passing
cargo clippy --all-targets — 0 new warnings (27 pre-existing)
cargo fmt --check — clean
test/diff_star_transcriptome.sh — 15/15 files identical/size-matched against STAR 2.7.11b
Round-trip: ruSTAR reads its own sjdb-extended index and aligns reads end-to-end

🤖 Generated with Claude Code

New module src/junction/sjdb_insert.rs with classify_motif() that returns the 0-6 motif code STAR uses (sjdbPrepare.cpp:37-50): 0 = non-canonical 1 = GT/AG(+) (donor bases GT, acceptor bases AG) 2 = CT/AC(-) 3 = GC/AG(+) 4 = CT/GC(-) 5 = AT/AC(+) 6 = GT/AT(-) First of several pieces of sjdbPrepare + sjdbBuildIndex needed to reach byte-parity on Genome / SA / SAindex / sjdbInfo.txt / sjdbList.out.tab. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ports STAR's sjdbPrepare.cpp:52-73 repeat-shift calculation: counts how many bases the intron boundary can shift left / right while preserving the donor/acceptor base identity. Caps at 255 per junction (STAR limit). Stops on genome bounds and on any N-base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New `PreparedJunction` struct carrying STAR's per-junction fields (chr_idx, shift-adjusted start/end, motif, shift_left, shift_right, strand) and a `prepare_junction` free function that pulls them together from raw db coordinates. Follows STAR's sjdbPrepare.cpp semantics: - Compute motif via classify_motif - Compute shifts via compute_shifts - Subtract shift_left from both coordinates (lines 71-72) to sit at the canonical left-most representation - Derive strand from motif when db_strand is unknown (lines 184-188: 2 - motif % 2, or 0 if motif is non-canonical) Three tests cover: canonical + strand, strand derivation fallback, shift-left applied to coords. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The /simplify review flagged my classify_motif + MOTIF_* consts as a duplicate of src/align/score.rs::detect_splice_motif (behind a SpliceMotif enum) + src/junction/sj_output.rs::encode_motif (0-6 code). Two independent motif truth tables is a drift risk. - Move detect_splice_motif from AlignmentScorer method (stateless) to a free fn in src/align/score.rs; the method is now a thin wrapper. - Drop classify_motif + MOTIF_* consts from sjdb_insert. - prepare_junction calls detect_splice_motif + encode_motif instead. - compute_shifts signature changed from &[u8] → &Genome so both helpers share the same input shape; takes n_genome_real to scope the forward-strand slice and avoids redundant bounds checks per iteration. - Tests updated to build Genome instances via a local helper. 343 unit tests pass (up from 335 — the test-coverage consolidation netted more tests via score.rs's existing detect_splice_motif coverage plus the new sjdb_insert tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds PreparedJunction::stored_start/end/original_start/original_end so callers can request either STAR's on-disk coordinate form (canonical → original; non-canonical → shifted) or the canonical-for-extraction form used when reading Gsj flanking bytes. sort_and_dedup() ports STAR's second-pass sort + cross-strand dedup from sjdbPrepare.cpp:141-192. ruSTAR's SpliceJunctionDb already deduplicates on (chr, start, end, strand), so the first-pass intra-strand dedup branches never fire for our inputs — kept second-pass cross-strand logic: - Undefined vs defined strand → keep defined - Both non-canonical → collapse to strand 0 (undefined) - One canonical → keep canonical - Both canonical → keep the one whose strand matches 2 - motif%2 6 new tests exercise each branch of merge_cross_strand plus the sort ordering; all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New build_gsj() in src/junction/sjdb_insert.rs reproduces STAR's sjdbPrepare.cpp:203-215 in-memory buffer: per junction, sjdb_overhang donor bases + sjdb_overhang acceptor bases + 1 spacer byte (value 5, STAR's GENOME_spacingChar). Extraction uses ORIGINAL (pre-shift) coordinates regardless of motif, relying on the new PreparedJunction::original_{start,end} helpers. 5 tests: canonical single junction, non-canonical junction (verifies shift reversal), multi-junction concatenation with independent spacers, plus two bounds-error cases (donor underflow, acceptor overrun). Next commit will wire the Gsj bytes onto the Genome binary + extend the SA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two STAR-faithful text writers matching `sjdbPrepare.cpp:196-222`: - `write_sjdb_info_tab`: header `<n>\t<overhang>\n`, then `stored_start\tstored_end\tmotif\tshiftL\tshiftR\tstrand\n` per junction. - `write_sjdb_list_out_tab`: `chr\t<1-based start>\t<1-based end>\t<strand_char>\n` with chromosome-local, pre-shift ("original") coordinates. Both use BufWriter + Error::io(path) matching the existing writer convention in sj_output.rs. Strand char table matches STAR's `{'.','+','-'}` for 0/1/2. Three byte-match tests covering the header, canonical vs non-canonical coord handling, and multi-chromosome chr_start offsets. 22/22 sjdb_insert tests pass; full suite 357/357, 0 new clippy warnings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Matches STAR's in-memory layout after `sjdbBuildIndex.cpp:293`: Gsj bytes live immediately after the forward real-genome bytes; chr_start[n_chr_real] stays pinned at the pre-sjdb forward total (verified against STAR's chrStart.txt for the 10k yeast test case: pre-sjdb boundary = 262144, n_genome post-sjdb = 263139 = 262144 + 5*199). n_genome grows to include Gsj. chr_start / chr_length / chr_name / n_chr_real are NOT updated — Gsj lives outside the chromosome accounting. Two unit tests: one verifies forward layout + RC regeneration; one verifies empty-Gsj is a no-op. 359/359 unit + 4/4 integration tests pass; 0 clippy warnings in genome/mod.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…iff harness With a GTF provided, GenomeIndex::build now: 1. Parses the GTF once (shared between junction_db, transcriptome, and sjdb_insert — was parsed twice before). 2. Converts chr-local 1-based GTF intron coords to 0-based absolute genome offsets for sjdb_insert::prepare_junction. 3. Sort + dedup prepared junctions via sjdb_insert::sort_and_dedup. 4. Build Gsj buffer and append to Genome (Genome::append_sjdb) BEFORE the suffix array is built — STAR indexes Gsj alongside the real genome in a single SA (sjdbBuildIndex.cpp:293). 5. Stash sorted prepared junctions in a new `prepared_junctions` field so GenomeIndex::write() can emit sjdbInfo.txt + sjdbList.out.tab. GenomeIndex::load() now reads n_genome from genomeParameters.txt's `genomeFileSizes` line (the chr_start boundary is only the real-genome forward total, not the post-sjdb total). Also adds SpliceJunctionDb::from_raw_junctions helper so from_gtf and the build path share the HashMap-building code. **Validation** (test/diff_star_transcriptome.sh, extended): 15/15 files byte-identical to STAR docker output, including: - Genome (real + Gsj appended; verified against STAR 2.7.11b) - sjdbInfo.txt - sjdbList.out.tab - genomeParameters.txt (genomeFileSizes reflects post-sjdb n_genome) SA / SAindex: size-matched (51929 / 6027 bytes), content differs as expected — STAR's bucketed-parallel SA construction breaks ties differently than ruSTAR's comparator-based sort. Size parity proves we indexed the same genome+Gsj length. Round-trip verified: ruSTAR can load the extended index it built and align reads end-to-end (load path reads `n_genome=524585` from the `genomeFileSizes` line, confirming Gsj bytes are included). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ewels · 2026-04-22T09:09:19Z

Moved to upstream scverse#8.

@rg

- shuffle_tied_prefix: deterministic per-read RNG for tied primary selection - --runRNGseed param (default 777, matches STAR) - --outSAMattrRGline: @rg header + RG:Z tags - Various code simplifications and formatting

ewels and others added 9 commits April 22, 2026 07:47

ewels closed this Apr 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(genome): GTF-derived splice-junction insertion into Genome + SA (Phase 17.X)#5

feat(genome): GTF-derived splice-junction insertion into Genome + SA (Phase 17.X)#5
ewels wants to merge 9 commits into
feat/transcriptome-samfrom
feat/transcriptome-sam-sj-insertion

ewels commented Apr 22, 2026 •

edited

Loading

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ewels commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation — 15/15 files match STAR at genomeGenerate

What changed

src/junction/sjdb_insert.rs (new module, 19 unit tests)

src/genome/mod.rs — Genome::append_sjdb()

src/index/mod.rs — GenomeIndex::build orchestration

src/index/io.rs — load path

Other

Pre-existing bugs flagged (separate PRs)

Test plan

Uh oh!

ewels commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ewels commented Apr 22, 2026 •

edited

Loading

`src/junction/sjdb_insert.rs` (new module, 19 unit tests)

`src/genome/mod.rs` — `Genome::append_sjdb()`

`src/index/mod.rs` — `GenomeIndex::build` orchestration

`src/index/io.rs` — load path