Identify which human reference genome was used to align a BAM/SAM/CRAM file.
Visit us at Fulcrum Genomics to learn more about how we can power your Bioinformatics with ref-solver and beyond.
When working with BAM files from external sources (collaborators, public repositories, sequencing vendors), it's often unclear exactly which reference genome was used for alignment. While the reference might be labeled "GRCh38" or "hg19", there are dozens of variations:
- Naming conventions:
chr1vs1vsNC_000001.11 - Contig sets: With or without ALT contigs, decoys, HLA alleles
- Mitochondrial sequences: rCRS (16,569 bp) vs old Cambridge (16,571 bp)
- Sources: UCSC, NCBI, Broad, Illumina DRAGEN, 1000 Genomes
ref-solver solves this by matching the sequence dictionary from your BAM file against a catalog of known human reference genomes, providing:
- Exact matches when possible
- Detailed diagnostics when differences exist
- Actionable suggestions for fixing mismatches
# From crates.io (when published)
cargo install ref-solver
# From source
git clone https://github.com/fulcrumgenomics/ref-solver
cd ref-solver
cargo build --release# Identify reference from a BAM file
ref-solver identify sample.bam
# From stdin (pipe samtools header)
samtools view -H sample.bam | ref-solver identify -
# JSON output for scripting
ref-solver identify sample.bam --format json
# Compare two files/references
ref-solver compare sample.bam grch38_ncbi --reference
# Score one file against another directly
ref-solver score query.bam reference.dict
# List all known references
ref-solver catalog list
# Start interactive web UI
ref-solver serve --port 8080 --open#1 hg38 (UCSC) (EXACT)
ID: hg38_ucsc
Assembly: GRCh38
Source: UCSC
Match Type: Exact
Score: 100.0%
Contigs: 25 exact, 0 renamed, 0 by name+length, 0 unmatched, 0 conflicts
Suggestions:
- Safe to use as-is
Download: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
- Multiple input formats: BAM, SAM, CRAM, Picard
.dict, TSV - MD5-based matching: Uses sequence checksums when available for exact identification
- Fuzzy matching: Falls back to name+length matching when MD5s are missing
- Rename detection: Identifies when files differ only in contig naming (chr1 vs 1)
- Order detection: Detects when contigs are reordered vs. reference
- Conflict detection: Identifies problematic differences (e.g., wrong mitochondrial sequence)
- Actionable suggestions: Provides commands to fix issues using fgbio/Picard tools
- Web interface: Interactive browser-based UI for pasting headers
- Embedded catalog: 15+ common human reference genomes built-in
The built-in catalog includes 15 human reference genomes:
| ID | Name | Source | Notes |
|---|---|---|---|
hg38_ucsc |
hg38 (UCSC) | UCSC | Standard UCSC reference, chr-prefixed naming |
grch38_ncbi |
GRCh38 (NCBI) | NCBI | NCBI numeric naming (1, 2, ..., X, Y, MT) |
grch38_broad_analysis_set |
GRCh38 Broad Analysis Set | Broad | GATK Best Practices reference |
hs38 |
hs38 (no-ALT) | lh3/ref-gen | Recommended. Primary + unplaced/unlocalized (195 seq) |
hs38DH |
hs38DH | lh3/bwakit | Full set: ALT (261) + decoy (2386) + HLA (525) = 3155 seq |
grch38_1kg_analysis |
GRCh38 1KG Analysis Set | 1000 Genomes | 1000 Genomes Project analysis set with decoy/HLA |
grch38_gdc |
GRCh38.d1.vd1 | NCI GDC | TCGA/TARGET reference with 10 viral genomes (2779 seq) |
grch38_dragen |
GRCh38 DRAGEN | Illumina | Standard Illumina DRAGEN reference |
grch38_dragen_altmasked |
GRCh38 DRAGEN ALT-masked | Illumina | DRAGEN v3.9+ with ALT regions N-masked |
| ID | Name | Source | Notes |
|---|---|---|---|
hg19_ucsc |
hg19 (UCSC) | UCSC | |
grch37_ncbi |
GRCh37 (NCBI) | NCBI | NCBI naming with rCRS mitochondrial |
hs37 |
hs37 (minimal) | lh3/ref-gen | Recommended. 25 primary sequences only |
hs37d5 |
hs37d5 | 1000 Genomes | With hs37d5 decoy sequence |
b37_broad |
b37 (Broad) | Broad | Legacy GATK Best Practices |
| ID | Name | Source | Notes |
|---|---|---|---|
chm13v2 |
T2T-CHM13v2.0 | T2T Consortium | Complete gapless assembly, all centromeres resolved |
| Use Case | GRCh38 | GRCh37 |
|---|---|---|
| Standard analysis (BWA-MEM, GATK) | hs38 |
hs37 |
| ALT-aware alignment (BWA-MEM2) | hs38DH |
hs37d5 |
| Illumina DRAGEN | grch38_dragen_altmasked |
— |
| TCGA/GDC compatibility | grch38_gdc |
— |
| Legacy pipelines | hg38_ucsc |
hg19_ucsc |
Identify the reference genome from a BAM/SAM file.
ref-solver identify [OPTIONS] <INPUT>
Arguments:
<INPUT> Input file (BAM, SAM, CRAM, .dict, or TSV). Use '-' for stdin.
Options:
-n, --max-matches <N> Number of matches to show [default: 5]
--exact-only Only show exact matches
--catalog <PATH> Path to custom catalog file
--input-format <FORMAT> Override auto-detection [sam, bam, cram, dict, tsv, csv]Compare two headers or a header against a known reference.
ref-solver compare [OPTIONS] <INPUT_A> <INPUT_B>
Arguments:
<INPUT_A> First input file
<INPUT_B> Second input file, or reference ID from catalog
Options:
--reference Treat INPUT_B as a reference ID from the catalog
--catalog <PATH> Path to custom catalog fileManage the reference catalog.
ref-solver catalog <COMMAND>
Commands:
list List all references in the catalog
show Show details of a specific reference
export Export the catalog to a JSON fileCompare two files directly without using the catalog. Useful for comparing arbitrary files. By default, scoring is asymmetric: it measures how well the query matches the reference.
ref-solver score [OPTIONS] <QUERY> <REFERENCE>
Arguments:
<QUERY> Query file (BAM, SAM, CRAM, FASTA, FAI, VCF, .dict, TSV, CSV)
<REFERENCE> Reference file to compare against
Options:
--symmetric Compute both directions (query→reference and reference→query)
--weight-match <N> Weight for contig match score (0-100) [default: 70]
--weight-coverage <N> Weight for coverage score (0-100) [default: 20]
--weight-order <N> Weight for order score (0-100) [default: 10]Example:
# Compare a BAM against a reference FASTA index
ref-solver score sample.bam reference.fa.fai
# Compare in both directions
ref-solver score --symmetric file_a.dict file_b.dict
# Custom scoring weights (emphasize coverage)
ref-solver score --weight-match 50 --weight-coverage 40 --weight-order 10 query.bam ref.dictStart the web interface.
ref-solver serve [OPTIONS]
Options:
-p, --port <PORT> Port to listen on [default: 8080]
-a, --address <ADDR> Address to bind to [default: 127.0.0.1]
--open Open browser automaticallyUse --format to control output:
text(default): Human-readable tabular outputjson: Structured JSON for scriptingtsv: Tab-separated values
| Type | Meaning |
|---|---|
Exact |
All contigs match exactly (name, length, MD5) |
Renamed |
Same sequences, different naming convention |
Reordered |
Same contigs, different order |
Partial |
Most contigs match, some differences |
Mixed |
Contigs appear to come from multiple references |
NoMatch |
No good match found |
| Level | Score Range | Meaning |
|---|---|---|
Exact |
100% | Perfect match |
High |
≥95% | Very confident |
Medium |
≥80% | Likely match |
Low |
<80% | Uncertain |
The UCSC hg19 chrM is 16,571 bp (old Cambridge sequence), while most modern references use rCRS (16,569 bp). This is a real sequence difference, not just naming.
UCSC uses chr1, NCBI/Ensembl use 1. Use fgbio to rename:
fgbio UpdateSequenceDictionary -i input.bam -o output.bam -s reference.dictSome tools are sensitive to contig order. Use Picard to reorder:
picard ReorderSam I=input.bam O=output.bam R=reference.faExport and modify the built-in catalog:
# Export current catalog
ref-solver catalog export my_catalog.json
# Edit to add custom references...
# Use custom catalog
ref-solver identify --catalog my_catalog.json sample.bamref-solver automatically generates UCSC-style names for fix-patches and novel-patches in GRCh38 assembly reports. This is particularly important for assembly reports prior to p13, where the UCSC-style-name column shows "na" for patches.
UCSC uses the following format for patch contigs:
| Patch Type | Format | Example |
|---|---|---|
| Fix patches | chr{chr}_{accession}v{version}_fix |
chr1_KN196472v1_fix |
| Novel patches | chr{chr}_{accession}v{version}_alt |
chr1_KQ458382v1_alt |
The transformation converts NCBI GenBank accessions (e.g., KN196472.1) to UCSC-style names by:
- Replacing
.withvin the accession - Prepending
chr{chromosome}_ - Appending
_fixor_altbased on patch type
| NCBI Accession | Patch Type | Chromosome | UCSC Name |
|---|---|---|---|
KN196472.1 |
fix-patch | 1 | chr1_KN196472v1_fix |
KQ458382.1 |
novel-patch | 1 | chr1_KQ458382v1_alt |
KN196487.1 |
fix-patch | Y | chrY_KN196487v1_fix |
KV766199.1 |
novel-patch | X | chrX_KV766199v1_alt |
When building custom catalogs, you can disable automatic UCSC name generation:
ref-solver catalog build \
--id my_ref \
--name "My Reference" \
--input assembly_report.txt \
--no-generate-ucsc-namesThis is useful when you want strict adherence to names in the assembly report.
use ref_solver::{ReferenceCatalog, MatchingEngine, MatchingConfig, QueryHeader};
use ref_solver::parsing::sam::parse_header_text;
// Load catalog
let catalog = ReferenceCatalog::load_embedded()?;
// Parse a header
let header_text = "@SQ\tSN:chr1\tLN:248956422\tM5:6aef897c3d6ff0c78aff06ac189178dd\n";
let query = parse_header_text(header_text)?;
// Find matches
let engine = MatchingEngine::new(&catalog, MatchingConfig::default());
let matches = engine.find_matches(&query, 5);
for m in matches {
println!("{}: {:.1}%", m.reference.display_name, m.score.composite * 100.0);
}Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
To add a new reference to the catalog:
- Obtain the sequence dictionary (run
samtools dict reference.fa) - Add the reference to
catalogs/human_references.json - Include MD5 checksums for all contigs
- Run tests to verify matching works correctly
MIT License. See LICENSE for details.
- Heng Li's ref-gen for reference genome recommendations (hs37, hs38, hs38DH)
- NCI GDC for GRCh38.d1.vd1 reference documentation
- Illumina DRAGEN for DRAGEN reference specifications
- noodles for SAM/BAM parsing
- GATK for reference genome documentation
- T2T Consortium for CHM13 resources
- 1000 Genomes Project for hs37d5 and analysis sets
If you use ref-solver in your research, please cite:
ref-solver: A tool for identifying human reference genomes from BAM files
https://github.com/fulcrumgenomics/ref-solver