major refactor for scalability

@brentp brentp released this 29 May 20:14

v0.2.0

This release is a large rewrite of somalier. The command-line usage is backwards incompatible with
earlier versions (but should not change going forward). There is now a per-sample extract step:

somalier extract -d extracted/ -s $sites_vcf -f $fasta $sample.cram

followed by a relate step:

somalier relate --ped $ped extracted/*.somalier

This enables parallelization by sample across nodes. The resulting extracted binary "somalier"
files are only ~220KB per sample, so reading them is nearly instant, and the relate step
runs in 10 seconds for my 603-sample test case. That makes adjusting pedigree files, or removing samples
and re-running, a much faster process.
It also means we can add a single (n+1) sample and, once it's extracted, compare it to an entire cohort in a few seconds.
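
As a rough sketch of that workflow (not part of somalier itself), the two steps could be wrapped in small shell functions. The function names `extract_all` and `relate_all` and the `SOMALIER` and `JOBS` variables are assumptions made for illustration; only the `extract` and `relate` invocations shown above come from this release. This version parallelizes on a single machine with `xargs -P`; spreading samples across nodes would be left to your scheduler.

```shell
#!/usr/bin/env bash
# Illustrative wrapper around the two-step workflow described above.
# SOMALIER, JOBS and both function names are hypothetical helpers,
# not options or commands provided by somalier.
SOMALIER="${SOMALIER:-somalier}"   # path to the somalier binary
JOBS="${JOBS:-4}"                  # number of extract processes to run at once

# extract_all SITES_VCF FASTA OUTDIR CRAM [CRAM...]
# Runs one "somalier extract" per CRAM, up to $JOBS at a time.
extract_all() {
  local sites_vcf="$1" fasta="$2" outdir="$3"
  shift 3
  [ "$#" -gt 0 ] || return 0       # nothing to extract
  mkdir -p "$outdir"
  # NUL-delimit the file names so paths with spaces survive xargs
  printf '%s\0' "$@" |
    xargs -0 -n 1 -P "$JOBS" \
      "$SOMALIER" extract -d "$outdir" -s "$sites_vcf" -f "$fasta"
}

# relate_all PED OUTDIR
# Compares every extracted sample in OUTDIR in a single relate run.
relate_all() {
  local ped="$1" outdir="$2"
  "$SOMALIER" relate --ped "$ped" "$outdir"/*.somalier
}
```

With these definitions, something like `extract_all "$sites_vcf" "$fasta" extracted/ *.cram` followed by `relate_all "$ped" extracted` would cover a whole cohort. Adding an (n+1)th sample would then only need one more `extract_all` call for the new CRAM before re-running `relate_all`.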

somalier extract can also take a (multi-sample) VCF and create an identical "somalier" file
for cases when a VCF is available.

The sites files (linked below) are also greatly improved in this release (with fewer sites and better accuracy).
For example, here is the output from the previous version:
[image: somalier-before]
compared to this version:
[image: somalier-after]

Note how, in the bottom figure (this version), like colors (relationships indicated by a pedigree file) cluster more tightly than in the previous version.

This release also reports values for the X and Y chromosomes, which help to evaluate observed vs. expected sex and can help resolve sample swaps.

Install

This release comes with two Linux binaries:

  • somalier_static is a completely static binary and the recommended way to run somalier; just wget it, chmod +x it, get a sites file, and go.
  • somalier_shared requires htslib (and libhts.so). Use this binary if you need to access S3 or HTTPS files.

sites files

sites.hg38.vcf.gz
sites.GRCh37.vcf.gz