feat(genome): GTF-derived splice-junction insertion into Genome + SA (Phase 17.X)#5
Closed
ewels wants to merge 9 commits into
Closed
feat(genome): GTF-derived splice-junction insertion into Genome + SA (Phase 17.X)#5ewels wants to merge 9 commits into
ewels wants to merge 9 commits into
Conversation
New module src/junction/sjdb_insert.rs with classify_motif() that returns the 0-6 motif code STAR uses (sjdbPrepare.cpp:37-50): 0 = non-canonical 1 = GT/AG(+) (donor bases GT, acceptor bases AG) 2 = CT/AC(-) 3 = GC/AG(+) 4 = CT/GC(-) 5 = AT/AC(+) 6 = GT/AT(-) First of several pieces of sjdbPrepare + sjdbBuildIndex needed to reach byte-parity on Genome / SA / SAindex / sjdbInfo.txt / sjdbList.out.tab. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports STAR's sjdbPrepare.cpp:52-73 repeat-shift calculation: counts how many bases the intron boundary can shift left / right while preserving the donor/acceptor base identity. Caps at 255 per junction (STAR limit). Stops on genome bounds and on any N-base. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `PreparedJunction` struct carrying STAR's per-junction fields (chr_idx, shift-adjusted start/end, motif, shift_left, shift_right, strand) and a `prepare_junction` free function that pulls them together from raw db coordinates. Follows STAR's sjdbPrepare.cpp semantics: - Compute motif via classify_motif - Compute shifts via compute_shifts - Subtract shift_left from both coordinates (lines 71-72) to sit at the canonical left-most representation - Derive strand from motif when db_strand is unknown (lines 184-188: 2 - motif % 2, or 0 if motif is non-canonical) Three tests cover: canonical + strand, strand derivation fallback, shift-left applied to coords. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /simplify review flagged my classify_motif + MOTIF_* consts as a duplicate of src/align/score.rs::detect_splice_motif (behind a SpliceMotif enum) + src/junction/sj_output.rs::encode_motif (0-6 code). Two independent motif truth tables is a drift risk. - Move detect_splice_motif from AlignmentScorer method (stateless) to a free fn in src/align/score.rs; the method is now a thin wrapper. - Drop classify_motif + MOTIF_* consts from sjdb_insert. - prepare_junction calls detect_splice_motif + encode_motif instead. - compute_shifts signature changed from &[u8] → &Genome so both helpers share the same input shape; takes n_genome_real to scope the forward-strand slice and avoids redundant bounds checks per iteration. - Tests updated to build Genome instances via a local helper. 343 unit tests pass (up from 335 — the test-coverage consolidation netted more tests via score.rs's existing detect_splice_motif coverage plus the new sjdb_insert tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds PreparedJunction::stored_start/end/original_start/original_end so callers can request either STAR's on-disk coordinate form (canonical → original; non-canonical → shifted) or the canonical-for-extraction form used when reading Gsj flanking bytes. sort_and_dedup() ports STAR's second-pass sort + cross-strand dedup from sjdbPrepare.cpp:141-192. ruSTAR's SpliceJunctionDb already deduplicates on (chr, start, end, strand), so the first-pass intra-strand dedup branches never fire for our inputs — kept second-pass cross-strand logic: - Undefined vs defined strand → keep defined - Both non-canonical → collapse to strand 0 (undefined) - One canonical → keep canonical - Both canonical → keep the one whose strand matches 2 - motif%2 6 new tests exercise each branch of merge_cross_strand plus the sort ordering; all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New build_gsj() in src/junction/sjdb_insert.rs reproduces STAR's
sjdbPrepare.cpp:203-215 in-memory buffer: per junction, sjdb_overhang
donor bases + sjdb_overhang acceptor bases + 1 spacer byte (value 5,
STAR's GENOME_spacingChar). Extraction uses ORIGINAL (pre-shift)
coordinates regardless of motif, relying on the new
PreparedJunction::original_{start,end} helpers.
5 tests: canonical single junction, non-canonical junction (verifies
shift reversal), multi-junction concatenation with independent spacers,
plus two bounds-error cases (donor underflow, acceptor overrun).
Next commit will wire the Gsj bytes onto the Genome binary + extend
the SA.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two STAR-faithful text writers matching `sjdbPrepare.cpp:196-222`:
- `write_sjdb_info_tab`: header `<n>\t<overhang>\n`, then
`stored_start\tstored_end\tmotif\tshiftL\tshiftR\tstrand\n` per junction.
- `write_sjdb_list_out_tab`: `chr\t<1-based start>\t<1-based end>\t<strand_char>\n`
with chromosome-local, pre-shift ("original") coordinates.
Both use BufWriter + Error::io(path) matching the existing writer
convention in sj_output.rs. Strand char table matches STAR's
`{'.','+','-'}` for 0/1/2.
Three byte-match tests covering the header, canonical vs non-canonical
coord handling, and multi-chromosome chr_start offsets. 22/22
sjdb_insert tests pass; full suite 357/357, 0 new clippy warnings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches STAR's in-memory layout after `sjdbBuildIndex.cpp:293`: Gsj bytes live immediately after the forward real-genome bytes; chr_start[n_chr_real] stays pinned at the pre-sjdb forward total (verified against STAR's chrStart.txt for the 10k yeast test case: pre-sjdb boundary = 262144, n_genome post-sjdb = 263139 = 262144 + 5*199). n_genome grows to include Gsj. chr_start / chr_length / chr_name / n_chr_real are NOT updated — Gsj lives outside the chromosome accounting. Two unit tests: one verifies forward layout + RC regeneration; one verifies empty-Gsj is a no-op. 359/359 unit + 4/4 integration tests pass; 0 clippy warnings in genome/mod.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iff harness With a GTF provided, GenomeIndex::build now: 1. Parses the GTF once (shared between junction_db, transcriptome, and sjdb_insert — was parsed twice before). 2. Converts chr-local 1-based GTF intron coords to 0-based absolute genome offsets for sjdb_insert::prepare_junction. 3. Sort + dedup prepared junctions via sjdb_insert::sort_and_dedup. 4. Build Gsj buffer and append to Genome (Genome::append_sjdb) BEFORE the suffix array is built — STAR indexes Gsj alongside the real genome in a single SA (sjdbBuildIndex.cpp:293). 5. Stash sorted prepared junctions in a new `prepared_junctions` field so GenomeIndex::write() can emit sjdbInfo.txt + sjdbList.out.tab. GenomeIndex::load() now reads n_genome from genomeParameters.txt's `genomeFileSizes` line (the chr_start boundary is only the real-genome forward total, not the post-sjdb total). Also adds SpliceJunctionDb::from_raw_junctions helper so from_gtf and the build path share the HashMap-building code. **Validation** (test/diff_star_transcriptome.sh, extended): 15/15 files byte-identical to STAR docker output, including: - Genome (real + Gsj appended; verified against STAR 2.7.11b) - sjdbInfo.txt - sjdbList.out.tab - genomeParameters.txt (genomeFileSizes reflects post-sjdb n_genome) SA / SAindex: size-matched (51929 / 6027 bytes), content differs as expected — STAR's bucketed-parallel SA construction breaks ties differently than ruSTAR's comparator-based sort. Size parity proves we indexed the same genome+Gsj length. Round-trip verified: ruSTAR can load the extended index it built and align reads end-to-end (load path reads `n_genome=524585` from the `genomeFileSizes` line, confirming Gsj bytes are included). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Author
|
Moved to upstream scverse#8. |
ewels
pushed a commit
that referenced
this pull request
May 4, 2026
- shuffle_tied_prefix: deterministic per-read RNG for tied primary selection - --runRNGseed param (default 777, matches STAR) - --outSAMattrRGline: @rg header + RG:Z tags - Various code simplifications and formatting
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Ports STAR's
sjdbPrepare.cpp+sjdbBuildIndex.cppto ruSTAR. AtgenomeGenerate, GTF-derived splice junctions are now baked into theGenomebinary and indexed by the suffix array, matching STAR's layout byte-for-byte.Stacked on scverse#7 (TranscriptomeSAM). Merge scverse#7 first, then rebase this branch onto
main.Validation — 15/15 files match STAR at genomeGenerate
test/diff_star_transcriptome.shagainst STAR 2.7.11b on the synthetic 2-chr / 4-transcript / 3-junction test case:SA/SAindex content divergence is expected: STAR's bucketed parallel SA construction breaks ties differently than ruSTAR's comparator-based sort. Size parity proves we indexed the same genome+Gsj length. Byte-for-byte SA parity would require reimplementing STAR's SA algorithm — out of scope for this PR.
Round-trip verified: ruSTAR can load the extended index it built and align reads end-to-end (load path correctly reads
n_genome=524585fromgenomeFileSizes, confirming Gsj bytes are included alongside the real genome bytes).What changed
src/junction/sjdb_insert.rs(new module, 19 unit tests)compute_shifts— STAR'ssjdbShiftLeft/sjdbShiftRight(repeat length across intron boundary, capped at 255, stops at N).PreparedJunction+prepare_junction— combines motif detection, shifts, strand derivation.stored_start/end(forsjdbInfo.txt) vsoriginal_start/end(for Gsj extraction) follow STAR's convention where canonical motifs store the pre-shift ORIGINAL coord and non-canonical ones store the SHIFTED coord.sort_and_dedup— STAR's post-sort + cross-strand dedup (sjdbPrepare.cpp:141-192).build_gsj— concatenated flanking sequences,2*overhang+1bytes per junction (donor + acceptor + spacer=5).write_sjdb_info_tab+write_sjdb_list_out_tab— byte-exact file writers (header<n>\t<overhang>\n+ six tab-separated fields per row; chr-local 1-based coords + strand char).src/genome/mod.rs—Genome::append_sjdb()Extends the forward buffer with Gsj, rebuilds the RC over the extended range.
chr_start[n_chr_real]stays pinned at the pre-sjdb forward total (matches STAR'schrStart.txt— verified against the 10k yeast docker output).n_genomegrows to include Gsj.src/index/mod.rs—GenomeIndex::buildorchestrationprepare_junction.Genome::append_sjdb()BEFORE the SA is built (STAR indexes Gsj alongside the real genome in a single SA,sjdbBuildIndex.cpp:293).prepared_junctionsfield sowrite()can emitsjdbInfo.txt+sjdbList.out.tab.src/index/io.rs— load pathReads
n_genomefromgenomeParameters.txt'sgenomeFileSizesline. Thechr_startboundary is only the pre-sjdb forward total, so with sjdb-extended indices the two values diverge andchr_startcannot be used as the total genome size.Other
src/junction/mod.rs—SpliceJunctionDb::from_raw_junctionshelper (shared byfrom_gtfand the build path, avoiding the GTF-parse-twice problem).test/diff_star_transcriptome.sh— extended from 10 files to 15 (added Genome, sjdbInfo.txt, sjdbList.out.tab, plus size-only check for SA/SAindex).Pre-existing bugs flagged (separate PRs)
Documented during the transcriptome-BAM validation in scverse#7 — not introduced or resolved here:
AS nMin addition to STAR's defaultNH HI.Test plan
cargo test— 359 unit + 4 integration + 1 doc test, all passingcargo clippy --all-targets— 0 new warnings (27 pre-existing)cargo fmt --check— cleantest/diff_star_transcriptome.sh— 15/15 files identical/size-matched against STAR 2.7.11b🤖 Generated with Claude Code