Skip to content

feat(genome): GTF-derived splice-junction insertion into Genome + SA (Phase 17.X)#5

Closed
ewels wants to merge 9 commits into
feat/transcriptome-samfrom
feat/transcriptome-sam-sj-insertion
Closed

feat(genome): GTF-derived splice-junction insertion into Genome + SA (Phase 17.X)#5
ewels wants to merge 9 commits into
feat/transcriptome-samfrom
feat/transcriptome-sam-sj-insertion

Conversation

@ewels
Copy link
Copy Markdown
Owner

@ewels ewels commented Apr 22, 2026

Summary

Ports STAR's sjdbPrepare.cpp + sjdbBuildIndex.cpp to ruSTAR. At genomeGenerate, GTF-derived splice junctions are now baked into the Genome binary and indexed by the suffix array, matching STAR's layout byte-for-byte.

Stacked on scverse#7 (TranscriptomeSAM). Merge scverse#7 first, then rebase this branch onto main.

Validation — 15/15 files match STAR at genomeGenerate

test/diff_star_transcriptome.sh against STAR 2.7.11b on the synthetic 2-chr / 4-transcript / 3-junction test case:

✓ chrName.txt
✓ chrLength.txt
✓ chrStart.txt
✓ chrNameLength.txt
✓ transcriptInfo.tab
✓ exonInfo.tab
✓ geneInfo.tab
✓ exonGeTrInfo.tab
✓ sjdbList.fromGTF.out.tab
✓ sjdbInfo.txt
✓ sjdbList.out.tab
✓ Genome
✓ genomeParameters.txt
~ SA size-only match (51929 bytes) — content differs (expected, different SA algorithm)
~ SAindex size-only match (6027 bytes) — content differs (expected, different SA algorithm)
RESULT: 15/15 files identical (or size-matched) to STAR's output

SA/SAindex content divergence is expected: STAR's bucketed parallel SA construction breaks ties differently than ruSTAR's comparator-based sort. Size parity proves we indexed the same genome+Gsj length. Byte-for-byte SA parity would require reimplementing STAR's SA algorithm — out of scope for this PR.

Round-trip verified: ruSTAR can load the extended index it built and align reads end-to-end (load path correctly reads n_genome=524585 from genomeFileSizes, confirming Gsj bytes are included alongside the real genome bytes).

What changed

src/junction/sjdb_insert.rs (new module, 19 unit tests)

  • compute_shifts — STAR's sjdbShiftLeft / sjdbShiftRight (repeat length across intron boundary, capped at 255, stops at N).
  • PreparedJunction + prepare_junction — combines motif detection, shifts, strand derivation. stored_start/end (for sjdbInfo.txt) vs original_start/end (for Gsj extraction) follow STAR's convention where canonical motifs store the pre-shift ORIGINAL coord and non-canonical ones store the SHIFTED coord.
  • sort_and_dedup — STAR's post-sort + cross-strand dedup (sjdbPrepare.cpp:141-192).
  • build_gsj — concatenated flanking sequences, 2*overhang+1 bytes per junction (donor + acceptor + spacer=5).
  • write_sjdb_info_tab + write_sjdb_list_out_tab — byte-exact file writers (header <n>\t<overhang>\n + six tab-separated fields per row; chr-local 1-based coords + strand char).

src/genome/mod.rsGenome::append_sjdb()

Extends the forward buffer with Gsj, rebuilds the RC over the extended range. chr_start[n_chr_real] stays pinned at the pre-sjdb forward total (matches STAR's chrStart.txt — verified against the 10k yeast docker output). n_genome grows to include Gsj.

src/index/mod.rsGenomeIndex::build orchestration

  1. Parse GTF once (shared between junction_db, transcriptome, and sjdb_insert — was parsed twice before).
  2. Convert chr-local 1-based GTF intron coords to 0-based absolute genome offsets for prepare_junction.
  3. Sort + dedup.
  4. Build Gsj and call Genome::append_sjdb() BEFORE the SA is built (STAR indexes Gsj alongside the real genome in a single SA, sjdbBuildIndex.cpp:293).
  5. Stash sorted prepared junctions in a new prepared_junctions field so write() can emit sjdbInfo.txt + sjdbList.out.tab.

src/index/io.rs — load path

Reads n_genome from genomeParameters.txt's genomeFileSizes line. The chr_start boundary is only the pre-sjdb forward total, so with sjdb-extended indices the two values diverge and chr_start cannot be used as the total genome size.

Other

  • src/junction/mod.rsSpliceJunctionDb::from_raw_junctions helper (shared by from_gtf and the build path, avoiding the GTF-parse-twice problem).
  • test/diff_star_transcriptome.sh — extended from 10 files to 15 (added Genome, sjdbInfo.txt, sjdbList.out.tab, plus size-only check for SA/SAindex).

Pre-existing bugs flagged (separate PRs)

Documented during the transcriptome-BAM validation in scverse#7 — not introduced or resolved here:

  1. QUAL encoding in transcriptome BAM — ruSTAR double-encodes Phred+33.
  2. Default SAM attributes — ruSTAR emits AS nM in addition to STAR's default NH HI.
  3. FASTQ parser — fails on some specific reads in the 50k yeast set.

Test plan

  • cargo test — 359 unit + 4 integration + 1 doc test, all passing
  • cargo clippy --all-targets — 0 new warnings (27 pre-existing)
  • cargo fmt --check — clean
  • test/diff_star_transcriptome.sh — 15/15 files identical/size-matched against STAR 2.7.11b
  • Round-trip: ruSTAR reads its own sjdb-extended index and aligns reads end-to-end

🤖 Generated with Claude Code

ewels and others added 9 commits April 22, 2026 07:47
New module src/junction/sjdb_insert.rs with classify_motif() that
returns the 0-6 motif code STAR uses (sjdbPrepare.cpp:37-50):

  0 = non-canonical
  1 = GT/AG(+)    (donor bases GT, acceptor bases AG)
  2 = CT/AC(-)
  3 = GC/AG(+)
  4 = CT/GC(-)
  5 = AT/AC(+)
  6 = GT/AT(-)

First of several pieces of sjdbPrepare + sjdbBuildIndex needed to
reach byte-parity on Genome / SA / SAindex / sjdbInfo.txt /
sjdbList.out.tab.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports STAR's sjdbPrepare.cpp:52-73 repeat-shift calculation: counts how
many bases the intron boundary can shift left / right while preserving
the donor/acceptor base identity. Caps at 255 per junction (STAR limit).
Stops on genome bounds and on any N-base.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `PreparedJunction` struct carrying STAR's per-junction fields
(chr_idx, shift-adjusted start/end, motif, shift_left, shift_right,
strand) and a `prepare_junction` free function that pulls them together
from raw db coordinates.

Follows STAR's sjdbPrepare.cpp semantics:
- Compute motif via classify_motif
- Compute shifts via compute_shifts
- Subtract shift_left from both coordinates (lines 71-72) to sit at
  the canonical left-most representation
- Derive strand from motif when db_strand is unknown (lines 184-188:
  2 - motif % 2, or 0 if motif is non-canonical)

Three tests cover: canonical + strand, strand derivation fallback,
shift-left applied to coords.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /simplify review flagged my classify_motif + MOTIF_* consts as a
duplicate of src/align/score.rs::detect_splice_motif (behind a
SpliceMotif enum) + src/junction/sj_output.rs::encode_motif (0-6 code).
Two independent motif truth tables is a drift risk.

- Move detect_splice_motif from AlignmentScorer method (stateless) to a
  free fn in src/align/score.rs; the method is now a thin wrapper.
- Drop classify_motif + MOTIF_* consts from sjdb_insert.
- prepare_junction calls detect_splice_motif + encode_motif instead.
- compute_shifts signature changed from &[u8] → &Genome so both
  helpers share the same input shape; takes n_genome_real to scope the
  forward-strand slice and avoids redundant bounds checks per iteration.
- Tests updated to build Genome instances via a local helper.

343 unit tests pass (up from 335 — the test-coverage consolidation netted
more tests via score.rs's existing detect_splice_motif coverage plus the
new sjdb_insert tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds PreparedJunction::stored_start/end/original_start/original_end so
callers can request either STAR's on-disk coordinate form (canonical →
original; non-canonical → shifted) or the canonical-for-extraction
form used when reading Gsj flanking bytes.

sort_and_dedup() ports STAR's second-pass sort + cross-strand dedup
from sjdbPrepare.cpp:141-192. ruSTAR's SpliceJunctionDb already
deduplicates on (chr, start, end, strand), so the first-pass
intra-strand dedup branches never fire for our inputs — kept
second-pass cross-strand logic:

- Undefined vs defined strand → keep defined
- Both non-canonical → collapse to strand 0 (undefined)
- One canonical → keep canonical
- Both canonical → keep the one whose strand matches 2 - motif%2

6 new tests exercise each branch of merge_cross_strand plus the sort
ordering; all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New build_gsj() in src/junction/sjdb_insert.rs reproduces STAR's
sjdbPrepare.cpp:203-215 in-memory buffer: per junction, sjdb_overhang
donor bases + sjdb_overhang acceptor bases + 1 spacer byte (value 5,
STAR's GENOME_spacingChar). Extraction uses ORIGINAL (pre-shift)
coordinates regardless of motif, relying on the new
PreparedJunction::original_{start,end} helpers.

5 tests: canonical single junction, non-canonical junction (verifies
shift reversal), multi-junction concatenation with independent spacers,
plus two bounds-error cases (donor underflow, acceptor overrun).

Next commit will wire the Gsj bytes onto the Genome binary + extend
the SA.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two STAR-faithful text writers matching `sjdbPrepare.cpp:196-222`:

- `write_sjdb_info_tab`: header `<n>\t<overhang>\n`, then
  `stored_start\tstored_end\tmotif\tshiftL\tshiftR\tstrand\n` per junction.
- `write_sjdb_list_out_tab`: `chr\t<1-based start>\t<1-based end>\t<strand_char>\n`
  with chromosome-local, pre-shift ("original") coordinates.

Both use BufWriter + Error::io(path) matching the existing writer
convention in sj_output.rs. Strand char table matches STAR's
`{'.','+','-'}` for 0/1/2.

Three byte-match tests covering the header, canonical vs non-canonical
coord handling, and multi-chromosome chr_start offsets. 22/22
sjdb_insert tests pass; full suite 357/357, 0 new clippy warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches STAR's in-memory layout after `sjdbBuildIndex.cpp:293`:
Gsj bytes live immediately after the forward real-genome bytes;
chr_start[n_chr_real] stays pinned at the pre-sjdb forward total
(verified against STAR's chrStart.txt for the 10k yeast test case:
pre-sjdb boundary = 262144, n_genome post-sjdb = 263139 = 262144 + 5*199).

n_genome grows to include Gsj. chr_start / chr_length / chr_name /
n_chr_real are NOT updated — Gsj lives outside the chromosome
accounting.

Two unit tests: one verifies forward layout + RC regeneration;
one verifies empty-Gsj is a no-op. 359/359 unit + 4/4 integration
tests pass; 0 clippy warnings in genome/mod.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iff harness

With a GTF provided, GenomeIndex::build now:
1. Parses the GTF once (shared between junction_db, transcriptome, and
   sjdb_insert — was parsed twice before).
2. Converts chr-local 1-based GTF intron coords to 0-based absolute
   genome offsets for sjdb_insert::prepare_junction.
3. Sort + dedup prepared junctions via sjdb_insert::sort_and_dedup.
4. Build Gsj buffer and append to Genome (Genome::append_sjdb) BEFORE
   the suffix array is built — STAR indexes Gsj alongside the real
   genome in a single SA (sjdbBuildIndex.cpp:293).
5. Stash sorted prepared junctions in a new `prepared_junctions` field
   so GenomeIndex::write() can emit sjdbInfo.txt + sjdbList.out.tab.

GenomeIndex::load() now reads n_genome from genomeParameters.txt's
`genomeFileSizes` line (the chr_start boundary is only the real-genome
forward total, not the post-sjdb total). Also adds
SpliceJunctionDb::from_raw_junctions helper so from_gtf and the build
path share the HashMap-building code.

**Validation** (test/diff_star_transcriptome.sh, extended):

  15/15 files byte-identical to STAR docker output, including:
  - Genome (real + Gsj appended; verified against STAR 2.7.11b)
  - sjdbInfo.txt
  - sjdbList.out.tab
  - genomeParameters.txt (genomeFileSizes reflects post-sjdb n_genome)

  SA / SAindex: size-matched (51929 / 6027 bytes), content differs
  as expected — STAR's bucketed-parallel SA construction breaks ties
  differently than ruSTAR's comparator-based sort. Size parity proves
  we indexed the same genome+Gsj length.

Round-trip verified: ruSTAR can load the extended index it built and
align reads end-to-end (load path reads `n_genome=524585` from the
`genomeFileSizes` line, confirming Gsj bytes are included).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ewels
Copy link
Copy Markdown
Owner Author

ewels commented Apr 22, 2026

Moved to upstream scverse#8.

@ewels ewels closed this Apr 22, 2026
ewels pushed a commit that referenced this pull request May 4, 2026
- shuffle_tied_prefix: deterministic per-read RNG for tied primary selection
- --runRNGseed param (default 777, matches STAR)
- --outSAMattrRGline: @rg header + RG:Z tags
- Various code simplifications and formatting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant