🚧 KaryoScope is under active development in preparation for journal submission. The user-facing API and command set are being finalized. Expect breaking changes until v1.0.0. Watch releases for stable versions.
KaryoScope is an alignment-free annotation tool that assigns each k-mer in a query assembly or sequencing read to a feature drawn from one or more user-defined hierarchical feature sets, producing a base-pair resolution annotation in a single pass. Because a feature set is simply any tiling of a reference with labelled regions, KaryoScope is extensible to arbitrary annotation sources, from satellite catalogs and repeat libraries to cytobands, FISH-probe coordinates, and structural-variant breakpoints.
A pre-built database for the human genome is distributed alongside the tool, derived from T2T-CHM13v2.0 with six feature sets covering chromosome of origin, satellite composition, interspersed repeats, subtelomeric structure, gene boundaries, and acrocentric-specific features. From these annotations, KaryoScope produces karyotype visualizations and cytogenetic reports without ever performing read alignment. Additional databases can be built for any reference genome or community-curated annotation source.
- Pangenome-scale throughput. Annotates a single feature set on a complete human haplotype in ~8 minutes on a standard workstation, or the full six-feature-set pipeline for a diploid sample in ~30 minutes at 16 threads, which scales to cohorts of hundreds of phased assemblies. The in-progress migration to the HKS k-mer indexing backend will further reduce runtime and memory footprint.
- Base-pair resolution across the entire genome. Performs well in the satellite-dense centromeres, subtelomeres, and acrocentric short arms where alignment-based pipelines suffer from reference bias and ambiguous mappings.
- Multiple feature classes in a single pass. The same k-mer can carry labels across feature sets simultaneously, so a single position can be annotated as belonging to a specific chromosome, satellite family, repeat class, and gene at once.
- Extensible. Any annotation that tiles a reference of interest can serve as a feature set.
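The idea behind the last two points can be illustrated with a toy sketch. This is not KaryoScope's implementation (which relies on KMC and the `get_featureIDs` helper); the function names, the tiny reference, the labels, and the handling of ambiguous or boundary-spanning k-mers are all assumptions made for illustration. Each feature set is a tiling of the reference with labelled regions, and every query k-mer is looked up once per feature set:

```python
def build_index(reference, regions, k):
    """Map each reference k-mer to the label of the region it lies in.

    `regions` is a list of (start, end, label) half-open intervals that
    tile `reference`. A k-mer seen under two different labels is marked
    "ambiguous"; k-mers spanning a region boundary are skipped.
    """
    index = {}
    for start, end, label in regions:
        for i in range(start, end - k + 1):
            kmer = reference[i:i + k]
            # Keep the label only if every occurrence so far agrees.
            index[kmer] = label if index.get(kmer, label) == label else "ambiguous"
    return index


def annotate(query, indexes, k):
    """Look up every query k-mer and report one label per feature set."""
    calls = []
    for i in range(len(query) - k + 1):
        kmer = query[i:i + k]
        calls.append({name: idx.get(kmer, "unknown") for name, idx in indexes.items()})
    return calls


# Toy reference tiled by two feature sets (labels are illustrative only).
reference = "AAAACCCCGGGGTTTT"
k = 3
indexes = {
    "chromosome": build_index(reference, [(0, 8, "chr1"), (8, 16, "chr2")], k),
    "satellite": build_index(reference, [(0, 4, "satA"), (4, 16, "non_satellite")], k),
}

calls = annotate("AAACCC", indexes, k)
for position, labels in enumerate(calls):
    print(position, labels)
```

The first query k-mer (`AAA`) receives both a chromosome label and a satellite label from the same lookup pass, which is the sense in which one position carries several feature classes at once.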
Installation via Bioconda is planned. For now, install from source.
KaryoScope requires Python ≥3.10 and several external tools (KMC, bgzip, tabix, seqtk, and cairo for PDF/PNG karyotype output). The simplest setup is a dedicated conda environment:
git clone https://github.com/barthel-lab/KaryoScope.git
cd KaryoScope
# Create a dedicated environment with Python and the bioinformatics tools.
# `samtools` is only needed if you plan to annotate BAM inputs; drop it
# if you only work with FASTA or FASTQ.
conda create -n karyoscope -c conda-forge -c bioconda \
python=3.12 pip kmc htslib samtools seqtk cairo zlib compilers
conda activate karyoscope
# Install KaryoScope
pip install -e .
# Build the bundled C++ helper (`get_featureIDs`).
# `pip install` is Python-only and does NOT compile the C++ tree.
cd native/get_featureIDs && make && cd ../..

The build produces `native/get_featureIDs/build/get_featureIDs`; the Python wrapper finds it automatically. See native/README.md for build-system details (CXX selection, pkg-config-driven zlib lookup, and the macOS + conda `-Wl,-rpath,$CONDA_PREFIX/lib` shim).
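After installing, it can be useful to confirm that the external tools are actually on `PATH`. The following is a minimal sketch using only the Python standard library. The executable names are assumed to match the conda packages listed above (`kmc`, `bgzip`, `tabix`, `seqtk`, `samtools`); cairo is a library rather than an executable, so it is not checked here. `karyoscope info` is the tool's own way to inspect an installation.

```python
import shutil
import sys

# Executable names as assumed from the conda packages listed above.
REQUIRED_TOOLS = ["kmc", "bgzip", "tabix", "seqtk"]
OPTIONAL_TOOLS = ["samtools"]  # only needed for BAM input


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def check_environment():
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    problems += [f"missing required tool: {t}" for t in missing_tools(REQUIRED_TOOLS)]
    problems += [f"missing optional tool: {t}" for t in missing_tools(OPTIONAL_TOOLS)]
    return problems


for problem in check_environment():
    print(problem)
```

Run it inside the activated `karyoscope` environment; no output means every check passed.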
This walkthrough uses the HG002 v1.1 T2T diploid assembly as input, but any FASTA will work. Substitute your own with --input <path> throughout.
# 1. Download the recommended human reference database (~17 GB, one-time)
karyoscope download
# 2. Download the HG002 v1.1 diploid assembly (~3 GB, one-time)
# Skip if you already have your own assembly to annotate.
curl -O https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v1.1.fasta.gz
# 3. Annotate the assembly. Recommended: at least 16 threads and 50 GB
# of RAM for human-scale inputs (HG002 runs in ~30 min at -t 16).
# --no-bgzip keeps the per-feature-set BEDs as plain text for easy
# inspection; drop it to get the default bgzipped outputs.
# Accepts FASTA, FASTQ (plain or .gz), and BAM. For BAM, samtools
# must be on PATH (it's invoked as `samtools fasta` to stream into
# get_featureIDs). For read-level inputs also pass --no-preserve-order
# for substantially faster writes.
karyoscope annotate --input hg002v1.1.fasta.gz --outdir results/ --threads 16 --no-bgzip
# 4. Render the three primary karyotype views.
# --no-scaffolding skips the per-feature-set scaffolded BED rewrite
# (the expensive step of scaffolding); the scaffold map is still
# applied at bin time so the renders are equivalent.
# The first invocation runs the full scaffold + bin + render cascade;
# the next two reuse the cached intermediates and finish much faster.
COMMON="--input hg002v1.1.fasta.gz --outdir results/ --threads 16 --sex male --no-scaffolding --no-bgzip"
karyoscope karyotype $COMMON --mode genome --feature-set chromosome
karyoscope karyotype $COMMON --mode centromere --feature-set region
karyoscope karyotype $COMMON --mode subtelomere --feature-set subtelomeric

This produces three SVGs under results/:
| File | View | Feature set |
|---|---|---|
| `hg002v1.1.KS_human_CHM13_v2.genome.chromosome.smoothed.karyotype.svg` | Genome view | chromosome |
| `hg002v1.1.KS_human_CHM13_v2.centromere.region.smoothed.karyotype.svg` | Centromere view | region |
| `hg002v1.1.KS_human_CHM13_v2.subtelomere.subtelomeric.smoothed.karyotype.svg` | Subtelomere view | subtelomeric |
Pass --format pdf or --format png (repeatable) to additionally produce those formats from the SVG.
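To check for the rendered files from a script, the output names can be predicted from the inputs. The sketch below assumes the naming pattern `<input stem>.<database>.<mode>.<feature set>.smoothed.karyotype.svg`, which is inferred from the three quick-start file names and is not a documented guarantee. The helper names and the suffix-stripping rule are hypothetical.

```python
from pathlib import Path

# Suffixes stripped to obtain the input stem (an assumption, see above).
SEQUENCE_SUFFIXES = (".fasta", ".fa", ".fastq", ".fq", ".bam")


def input_stem(path):
    """Drop an optional .gz and one sequence-format suffix from the file name."""
    name = Path(path).name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    for suffix in SEQUENCE_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def expected_svg(input_path, mode, feature_set, database="KS_human_CHM13_v2"):
    """Build the SVG file name following the inferred naming pattern."""
    stem = input_stem(input_path)
    return f"{stem}.{database}.{mode}.{feature_set}.smoothed.karyotype.svg"


views = [("genome", "chromosome"), ("centromere", "region"), ("subtelomere", "subtelomeric")]
for mode, feature_set in views:
    svg = Path("results") / expected_svg("hg002v1.1.fasta.gz", mode, feature_set)
    print(svg, "found" if svg.exists() else "not found")
```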
| Command | Purpose |
|---|---|
| `karyoscope download` | Acquire pre-built databases |
| `karyoscope annotate` | Annotate sequences with k-mer features |
| `karyoscope scaffold` | Order, orient, and rename assembly contigs |
| `karyoscope bin` | Aggregate base-pair annotations into larger bins |
| `karyoscope centromeres` | Extract centromere coordinates |
| `karyoscope karyotype` | Render karyotype visualization |
| `karyoscope info` | Inspect databases, files, installation |
| `karyoscope version` | Print version and environment info |
Run karyoscope <command> --help for full options on any command.
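To give a feel for what aggregating base-pair annotations into larger bins means, here is a conceptual sketch. It is not the algorithm used by `karyoscope bin`: the majority-by-covered-bases rule, the function name, the bin size, and the labels are all assumptions chosen for illustration.

```python
from collections import defaultdict


def bin_majority(intervals, bin_size):
    """Assign each fixed-size bin the label covering the most bases in it.

    `intervals` holds (start, end, label) half-open intervals on one contig.
    Returns a dict mapping bin start coordinate to the winning label.
    """
    coverage = defaultdict(lambda: defaultdict(int))
    for start, end, label in intervals:
        pos = start
        while pos < end:
            bin_index = pos // bin_size
            # Split the interval at the bin boundary so each piece
            # is counted in exactly one bin.
            chunk_end = min(end, (bin_index + 1) * bin_size)
            coverage[bin_index][label] += chunk_end - pos
            pos = chunk_end
    return {
        bin_index * bin_size: max(counts, key=counts.get)
        for bin_index, counts in sorted(coverage.items())
    }


# Illustrative labels: 60 bp of one satellite followed by 140 bp of another.
intervals = [(0, 60, "HSat2"), (60, 200, "aSat")]
binned = bin_majority(intervals, bin_size=100)
print(binned)
```

In this example the first 100 bp bin is called `HSat2` (60 of its 100 bases) and the second is called `aSat`.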
Full documentation is being built. In the meantime, the --help output for each command is the authoritative reference.
KaryoScope works with pre-built databases distributed via the KaryoScope registry. The current default is KS_human_CHM13_v2 (~17 GB), built from the T2T-CHM13v2.0 reference.
Browse and download available databases:
karyoscope download --list

Building your own database will be supported via karyoscope build (coming in v1.0).
KaryoScope outputs for the HPRC Release 2 pangenome samples are hosted by the Human Pangenome Reference Consortium at the TGen_HPRCv2_KaryoScope S3 bucket. Use these to explore HPRC karyotypes without running the pipeline yourself, or as references for downstream analysis.
Note: the currently hosted annotations were generated against a previous version of the KaryoScope database. Updated annotations using the current release will be uploaded as they become available.
Per sample, the bucket contains:
| Path | Contents |
|---|---|
| `<sample>/bed/` | Per-feature-set presmoothed annotations (`<sample>.KaryoScope.v2.0.<feature_set>.bed.gz`) |
| `<sample>/igv/` | Per-feature-set, per-haplotype IGV-ready BEDs with tabix index (`<sample>.KaryoScope.v2.0.<feature_set>.hap<i>.IGV.bed.gz` + `.tbi`) |
| `<sample>/plots/` | Karyotype SVGs: genome view (chromosome feature set), centromere view (region), subtelomere view (subtelomeric) |
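Once a hosted BED has been downloaded, a first summary needs only the Python standard library, since bgzip output is readable as ordinary gzip. The sketch below assumes the files follow the usual BED layout, with the feature label in the fourth column; that layout is an assumption, so check a few lines of a real file first. The helper names, the sample ID, and the labels in the usage lines are illustrative.

```python
import gzip
from collections import Counter


def hosted_bed_name(sample, feature_set):
    """File name under <sample>/bed/, following the pattern in the table above."""
    return f"{sample}.KaryoScope.v2.0.{feature_set}.bed.gz"


def tally_bed_lines(lines):
    """Sum the bases covered by each label (column 4) over BED lines."""
    totals = Counter()
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        start, end, label = int(fields[1]), int(fields[2]), fields[3]
        totals[label] += end - start
    return totals


def tally_bed_gz(path):
    """Open a gzip- or bgzip-compressed BED and tally bases per label."""
    with gzip.open(path, "rt") as handle:
        return tally_bed_lines(handle)


# Usage with in-memory lines; replace with tally_bed_gz(<local path>) for a real file.
example = ["chr1\t0\t100\taSat\n", "chr1\t100\t150\tHSat2\n", "chr2\t0\t25\taSat\n"]
print(hosted_bed_name("HG002", "chromosome"))
print(tally_bed_lines(example).most_common())
```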
If you use KaryoScope in your work, please cite our preprint:
Ranallo-Benavidez TR, Chen YA, Potapova T, Alanko J, Loucks H, Lucas J, Human Pangenome Reference Consortium, Guarracino A, Puglisi SJ, Marchet C, Miga K, Gerton JL, Barthel FP. KaryoScope: rapid, alignment-free sequence annotation for the pangenome era. bioRxiv (2026). doi: 10.64898/2026.05.15.725544
A CITATION.cff file in this repository provides machine-readable citation metadata.
KaryoScope is licensed under GPL-3.0-or-later due to its dependency on the GPL-3.0 KMC library. A future release will switch to MIT once we migrate to HKS for k-mer indexing.
Contributions, bug reports, and feature requests are welcome. See CONTRIBUTING.md to get started, and our Code of Conduct for community norms.
Developed in the Barthel Lab at the Translational Genomics Research Institute (TGen), in collaboration with Jarno Alanko, Simon Puglisi, and Camille Marchet.