Skip to content

Identify which human reference genome was used to align a BAM/SAM/CRAM file

License

Notifications You must be signed in to change notification settings

fulcrumgenomics/ref-solver

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ref-solver

Crates.io Documentation CI License: MIT Rust

Identify which human reference genome was used to align a BAM/SAM/CRAM file.

Fulcrum Genomics

Visit us at Fulcrum Genomics to learn more about how we can power your Bioinformatics with ref-solver and beyond.

The Problem

When working with BAM files from external sources (collaborators, public repositories, sequencing vendors), it's often unclear exactly which reference genome was used for alignment. While the reference might be labeled "GRCh38" or "hg19", there are dozens of variations:

  • Naming conventions: chr1 vs 1 vs NC_000001.11
  • Contig sets: With or without ALT contigs, decoys, HLA alleles
  • Mitochondrial sequences: rCRS (16,569 bp) vs old Cambridge (16,571 bp)
  • Sources: UCSC, NCBI, Broad, Illumina DRAGEN, 1000 Genomes

ref-solver solves this by matching the sequence dictionary from your BAM file against a catalog of known human reference genomes, providing:

  • Exact matches when possible
  • Detailed diagnostics when differences exist
  • Actionable suggestions for fixing mismatches

Quickstart

Installation

# From crates.io (when published)
cargo install ref-solver

# From source
git clone https://github.com/fulcrumgenomics/ref-solver
cd ref-solver
cargo build --release

Basic Usage

# Identify reference from a BAM file
ref-solver identify sample.bam

# From stdin (pipe samtools header)
samtools view -H sample.bam | ref-solver identify -

# JSON output for scripting
ref-solver identify sample.bam --format json

# Compare two files/references
ref-solver compare sample.bam grch38_ncbi --reference

# Score one file against another directly
ref-solver score query.bam reference.dict

# List all known references
ref-solver catalog list

# Start interactive web UI
ref-solver serve --port 8080 --open

Example Output

#1 hg38 (UCSC) (EXACT)
   ID: hg38_ucsc
   Assembly: GRCh38
   Source: UCSC
   Match Type: Exact
   Score: 100.0%

   Contigs: 25 exact, 0 renamed, 0 by name+length, 0 unmatched, 0 conflicts

   Suggestions:
   - Safe to use as-is

   Download: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

Features

  • Multiple input formats: BAM, SAM, CRAM, Picard .dict, TSV
  • MD5-based matching: Uses sequence checksums when available for exact identification
  • Fuzzy matching: Falls back to name+length matching when MD5s are missing
  • Rename detection: Identifies when files differ only in contig naming (chr1 vs 1)
  • Order detection: Detects when contigs are reordered vs. reference
  • Conflict detection: Identifies problematic differences (e.g., wrong mitochondrial sequence)
  • Actionable suggestions: Provides commands to fix issues using fgbio/Picard tools
  • Web interface: Interactive browser-based UI for pasting headers
  • Embedded catalog: 15+ common human reference genomes built-in

Supported References

The built-in catalog includes 15 human reference genomes:

GRCh38/hg38 Family

ID Name Source Notes
hg38_ucsc hg38 (UCSC) UCSC Standard UCSC reference, chr-prefixed naming
grch38_ncbi GRCh38 (NCBI) NCBI NCBI numeric naming (1, 2, ..., X, Y, MT)
grch38_broad_analysis_set GRCh38 Broad Analysis Set Broad GATK Best Practices reference
hs38 hs38 (no-ALT) lh3/ref-gen Recommended. Primary + unplaced/unlocalized (195 seq)
hs38DH hs38DH lh3/bwakit Full set: ALT (261) + decoy (2386) + HLA (525) = 3155 seq
grch38_1kg_analysis GRCh38 1KG Analysis Set 1000 Genomes 1000 Genomes Project analysis set with decoy/HLA
grch38_gdc GRCh38.d1.vd1 NCI GDC TCGA/TARGET reference with 10 viral genomes (2779 seq)
grch38_dragen GRCh38 DRAGEN Illumina Standard Illumina DRAGEN reference
grch38_dragen_altmasked GRCh38 DRAGEN ALT-masked Illumina DRAGEN v3.9+ with ALT regions N-masked

GRCh37/hg19 Family

ID Name Source Notes
hg19_ucsc hg19 (UCSC) UCSC ⚠️ Old Cambridge chrM (16571bp), not rCRS
grch37_ncbi GRCh37 (NCBI) NCBI NCBI naming with rCRS mitochondrial
hs37 hs37 (minimal) lh3/ref-gen Recommended. 25 primary sequences only
hs37d5 hs37d5 1000 Genomes With hs37d5 decoy sequence
b37_broad b37 (Broad) Broad Legacy GATK Best Practices

T2T-CHM13

ID Name Source Notes
chm13v2 T2T-CHM13v2.0 T2T Consortium Complete gapless assembly, all centromeres resolved

Quick Reference Selection Guide

Use Case GRCh38 GRCh37
Standard analysis (BWA-MEM, GATK) hs38 hs37
ALT-aware alignment (BWA-MEM2) hs38DH hs37d5
Illumina DRAGEN grch38_dragen_altmasked
TCGA/GDC compatibility grch38_gdc
Legacy pipelines hg38_ucsc hg19_ucsc

Commands

identify

Identify the reference genome from a BAM/SAM file.

ref-solver identify [OPTIONS] <INPUT>

Arguments:
  <INPUT>  Input file (BAM, SAM, CRAM, .dict, or TSV). Use '-' for stdin.

Options:
  -n, --max-matches <N>  Number of matches to show [default: 5]
      --exact-only       Only show exact matches
      --catalog <PATH>   Path to custom catalog file
      --input-format <FORMAT>  Override auto-detection [sam, bam, cram, dict, tsv, csv]

compare

Compare two headers or a header against a known reference.

ref-solver compare [OPTIONS] <INPUT_A> <INPUT_B>

Arguments:
  <INPUT_A>  First input file
  <INPUT_B>  Second input file, or reference ID from catalog

Options:
      --reference        Treat INPUT_B as a reference ID from the catalog
      --catalog <PATH>   Path to custom catalog file

catalog

Manage the reference catalog.

ref-solver catalog <COMMAND>

Commands:
  list    List all references in the catalog
  show    Show details of a specific reference
  export  Export the catalog to a JSON file

score

Compare two files directly without using the catalog. Useful for comparing arbitrary files. By default, scoring is asymmetric: it measures how well the query matches the reference.

ref-solver score [OPTIONS] <QUERY> <REFERENCE>

Arguments:
  <QUERY>      Query file (BAM, SAM, CRAM, FASTA, FAI, VCF, .dict, TSV, CSV)
  <REFERENCE>  Reference file to compare against

Options:
      --symmetric            Compute both directions (query→reference and reference→query)
      --weight-match <N>     Weight for contig match score (0-100) [default: 70]
      --weight-coverage <N>  Weight for coverage score (0-100) [default: 20]
      --weight-order <N>     Weight for order score (0-100) [default: 10]

Example:

# Compare a BAM against a reference FASTA index
ref-solver score sample.bam reference.fa.fai

# Compare in both directions
ref-solver score --symmetric file_a.dict file_b.dict

# Custom scoring weights (emphasize coverage)
ref-solver score --weight-match 50 --weight-coverage 40 --weight-order 10 query.bam ref.dict

serve

Start the web interface.

ref-solver serve [OPTIONS]

Options:
  -p, --port <PORT>      Port to listen on [default: 8080]
  -a, --address <ADDR>   Address to bind to [default: 127.0.0.1]
      --open             Open browser automatically

Output Formats

Use --format to control output:

  • text (default): Human-readable tabular output
  • json: Structured JSON for scripting
  • tsv: Tab-separated values

Understanding Results

Match Types

Type Meaning
Exact All contigs match exactly (name, length, MD5)
Renamed Same sequences, different naming convention
Reordered Same contigs, different order
Partial Most contigs match, some differences
Mixed Contigs appear to come from multiple references
NoMatch No good match found

Confidence Levels

Level Score Range Meaning
Exact 100% Perfect match
High ≥95% Very confident
Medium ≥80% Likely match
Low <80% Uncertain

Common Issues

Wrong Mitochondrial Sequence

The UCSC hg19 chrM is 16,571 bp (old Cambridge sequence), while most modern references use rCRS (16,569 bp). This is a real sequence difference, not just naming.

Chr Prefix Mismatch

UCSC uses chr1, NCBI/Ensembl use 1. Use fgbio to rename:

fgbio UpdateSequenceDictionary -i input.bam -o output.bam -s reference.dict

Contig Order Differences

Some tools are sensitive to contig order. Use Picard to reorder:

picard ReorderSam I=input.bam O=output.bam R=reference.fa

Custom Catalogs

Export and modify the built-in catalog:

# Export current catalog
ref-solver catalog export my_catalog.json

# Edit to add custom references...

# Use custom catalog
ref-solver identify --catalog my_catalog.json sample.bam

UCSC-Style Naming for Patches

ref-solver automatically generates UCSC-style names for fix-patches and novel-patches in GRCh38 assembly reports. This is particularly important for assembly reports prior to p13, where the UCSC-style-name column shows "na" for patches.

Naming Convention

UCSC uses the following format for patch contigs:

Patch Type Format Example
Fix patches chr{chr}_{accession}v{version}_fix chr1_KN196472v1_fix
Novel patches chr{chr}_{accession}v{version}_alt chr1_KQ458382v1_alt

The transformation converts NCBI GenBank accessions (e.g., KN196472.1) to UCSC-style names by:

  1. Replacing . with v in the accession
  2. Prepending chr{chromosome}_
  3. Appending _fix or _alt based on patch type

Examples

NCBI Accession Patch Type Chromosome UCSC Name
KN196472.1 fix-patch 1 chr1_KN196472v1_fix
KQ458382.1 novel-patch 1 chr1_KQ458382v1_alt
KN196487.1 fix-patch Y chrY_KN196487v1_fix
KV766199.1 novel-patch X chrX_KV766199v1_alt

Disabling UCSC Name Generation

When building custom catalogs, you can disable automatic UCSC name generation:

ref-solver catalog build \
  --id my_ref \
  --name "My Reference" \
  --input assembly_report.txt \
  --no-generate-ucsc-names

This is useful when you want strict adherence to names in the assembly report.

Official Documentation

Library Usage

use ref_solver::{ReferenceCatalog, MatchingEngine, MatchingConfig, QueryHeader};
use ref_solver::parsing::sam::parse_header_text;

// Load catalog
let catalog = ReferenceCatalog::load_embedded()?;

// Parse a header
let header_text = "@SQ\tSN:chr1\tLN:248956422\tM5:6aef897c3d6ff0c78aff06ac189178dd\n";
let query = parse_header_text(header_text)?;

// Find matches
let engine = MatchingEngine::new(&catalog, MatchingConfig::default());
let matches = engine.find_matches(&query, 5);

for m in matches {
    println!("{}: {:.1}%", m.reference.display_name, m.score.composite * 100.0);
}

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Adding New References

To add a new reference to the catalog:

  1. Obtain the sequence dictionary (run samtools dict reference.fa)
  2. Add the reference to catalogs/human_references.json
  3. Include MD5 checksums for all contigs
  4. Run tests to verify matching works correctly

License

MIT License. See LICENSE for details.

Acknowledgments

Citation

If you use ref-solver in your research, please cite:

ref-solver: A tool for identifying human reference genomes from BAM files
https://github.com/fulcrumgenomics/ref-solver