Releases: brentp/somalier

multi-sample GVCFs

05 Mar 18:56

v0.2.9

  • support multi-sample GVCF and fix some GVCF cases (thanks @ameynert for implementing)
  • also fixes some edge-cases with GVCFs

Installation

grab the static binary, or use docker via brentp/somalier:v0.2.9

sites files (unchanged from previous releases)

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz

pedigree inference and better handling of identical samples

24 Feb 17:39

v0.2.8

  • html output has a list of pre-sets to auto-select informative X, Y axes for the sample plot

  • add --infer flag to somalier relate to allow inferring relatedness.
    this accompanies a change in the .samples.tsv output so that it can be used as a pedigree file

  • add --sample-prefix option to extract and a corresponding (multi-)option to relate. So, given
    a cohort with DNA and RNA where samples have identical IDs (SM tags) in the DNA and RNA, one can use
    somalier as:

    somalier extract -d DNA --sample-prefix DNA- ...
    somalier extract -d RNA --sample-prefix RNA- ...
    
    somalier relate --sample-prefix DNA- --sample-prefix RNA- DNA/*.somalier RNA/*.somalier ...
    

    and it will show the samples that have
    matching IDs after stripping the prefixes as "identical".
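
A minimal sketch of the prefix-stripping idea described above, not somalier's actual code. The function names and the example sample IDs are hypothetical; the only behaviour taken from the notes is that samples whose IDs match after the prefixes are removed are treated as expected-identical.

```python
from collections import defaultdict


def strip_prefix(sample_id, prefixes):
    """Return sample_id with the first matching prefix removed (hypothetical helper)."""
    for prefix in prefixes:
        if sample_id.startswith(prefix):
            return sample_id[len(prefix):]
    return sample_id


def expected_identical(sample_ids, prefixes):
    """Group sample IDs that collapse to the same ID once prefixes are stripped."""
    groups = defaultdict(list)
    for sid in sample_ids:
        groups[strip_prefix(sid, prefixes)].append(sid)
    # Keep only base IDs that were seen under more than one prefixed name.
    return {base: ids for base, ids in groups.items() if len(ids) > 1}


# Made-up IDs, prefixed the way `--sample-prefix DNA-` / `--sample-prefix RNA-` would.
samples = ["DNA-NA12878", "RNA-NA12878", "DNA-NA12891"]
print(expected_identical(samples, ["DNA-", "RNA-"]))
# → {'NA12878': ['DNA-NA12878', 'RNA-NA12878']}
```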

Installation

grab the static binary, or use docker via brentp/somalier:v0.2.8

sites files (unchanged from previous releases)

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz

ancestry

27 Nov 16:49

this release adds an initial implementation of an ancestry sub-command that can use a set of labelled samples (with extracted somalier files) to train a small neural network which is then used to predict the ancestry of incoming samples.
the implementation is incomplete, but works for well-behaved data. Here is an example:
http://home.chpc.utah.edu/~u6000771/somalier/somalier-ancestry.n.html

This is possible thanks to a very fast randomized PCA implementation (along with a neural network framework) from @mratsim in Arraymancer.

There are also improvements for huge cohorts. See below for full change-set.

Installation

grab the static binary below, or use docker via brentp/somalier:v0.2.7

v0.2.7

  • new subcommand ancestry to predict ancestry using a simple neural network on the somalier
    sketches. creates an interactive html output and a text file
  • fix for "Argument list too long" on huge cohorts (#37)
  • sub-sample .pairs.tsv output for huge cohorts -- only for unrelated samples.
  • better sub-sampling of html output

sites files (unchanged from previous releases)

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz

html for huge cohorts

29 Oct 20:19

somalier was fast enough to use on large >5,000 sample cohorts, but the html output was not useful.
this fixes that by sub-sampling pairs of samples that are expected to be unrelated and also appear to be unrelated by the genotype information.

v0.2.6

  • for large cohorts (>1K samples) the html output is now usable.
    it randomly subsets samples that should be and are unrelated.
  • better error messages for bad input
  • inspect environment variable SOMALIER_ALLOWED_FILTERS so that users can
    give a comma-delimited list of FILTERs that should be allowed (by default only PASS and RefCall
    variants are considered). This is useful for some GVCF formats.
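
As a rough illustration of the SOMALIER_ALLOWED_FILTERS item above, the sketch below parses a comma-delimited list from the environment and checks a FILTER value against it. It is not somalier's implementation: the function names are hypothetical, and it assumes the listed FILTERs are added to the PASS/RefCall defaults rather than replacing them, which the notes do not specify.

```python
import os

# Defaults taken from the release note above.
DEFAULT_FILTERS = ("PASS", "RefCall")


def allowed_filters(environ=os.environ):
    """Defaults plus any FILTERs listed in SOMALIER_ALLOWED_FILTERS (assumed additive)."""
    extra = environ.get("SOMALIER_ALLOWED_FILTERS", "")
    allowed = set(DEFAULT_FILTERS)
    # Split on commas, ignoring surrounding whitespace and empty entries.
    allowed.update(name.strip() for name in extra.split(",") if name.strip())
    return allowed


def keep_variant(filter_value, allowed):
    """True if a variant with this FILTER value should be considered."""
    return filter_value in allowed


# "LowQual" is only an example FILTER name.
allowed = allowed_filters({"SOMALIER_ALLOWED_FILTERS": "LowQual"})
print(keep_variant("LowQual", allowed), keep_variant("Other", allowed))
# → True False
```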

sites files

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz

VCF+GVCF edge-cases

24 Oct 13:44

get started with a binary below, or with docker:brentp/somalier:v0.2.5

v0.2.5

  • handle more types of GVCF (#27, thanks @holtjma)
  • handle VCFs without depth (AD) information. this enables extracting VCFs with only genotypes such as
    files converted from array information (#31, thanks @asazonov)

sites files

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz

GVCF support and parameter changes

21 Oct 19:02
9bf0661

get started with a binary below, or with docker:brentp/somalier:v0.2.4

v0.2.4

  • unify genotyping between all code-paths (thanks Filipe)
  • if both groups and pedigree information are specified, they correctly share information (#26)
  • relax allele-balance cutoffs: hom-ref is now < 0.04 and hom-alt is > 0.96 (previously 0.02 and 0.98, respectively).
  • support for GVCF (#27)

sites files

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz

major performance improvements, bugfixes

10 Oct 18:05

The main change in this release is the use of bitvectors to calculate all-vs-all relatedness. This speeds up the relatedness step by about 100X such that we can calculate relatedness of all 4,825,171 possible pairwise combinations of the 2,504 thousand genomes samples in about 20 seconds.
It also fixes a bug in the a-allele/b-allele designation for VCF that caused problems when comparing samples extracted from VCF/BCF to those from CRAM/BAM.

The readme now includes instructions on how to estimate ancestry from somalier sketches.

v0.2.3

  • calculate relatedness correctly for samples with parent-ids specified
    when the parents are not actually in the pedigree file.
  • use bit-vectors to calculate relatedness. this gives up to a 250X speedup.
    with this code, I can now evaluate relatedness for 3756 samples in under 30 seconds on my laptop.
  • better scaling of X and Y depth
  • use final RG as the sample id in relate
  • output expected relatedness in .pairs.tsv file
  • fix ref/alt (a/b-allele) ordering for VCF. this was a bug that caused problems when comparing
    samples extracted from VCF files to other samples extracted from BAM/CRAM files. Thanks very
    much to Filipe and Sergio for finding this issue and providing several test-cases. (if you
    have previously downloaded the thousand genomes files from zenodo, please update to the latest.)
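
The bit-vector item above can be illustrated with a small sketch. This is not somalier's code or its relatedness formula: it only shows the general technique of packing per-site genotypes into bitmasks so that pairwise counts reduce to AND/OR plus a population count. The encoding (three masks per sample) and all names are assumptions made for illustration.

```python
def to_bitvectors(genotypes):
    """Pack genotypes (alt-allele counts 0, 1, 2, or None for unknown) into three bitmasks."""
    hom_ref = het = hom_alt = 0
    for i, g in enumerate(genotypes):
        bit = 1 << i  # one bit per site
        if g == 0:
            hom_ref |= bit
        elif g == 1:
            het |= bit
        elif g == 2:
            hom_alt |= bit
        # None (unknown) sets no bit in any mask
    return hom_ref, het, hom_alt


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pair_counts(a, b):
    """Count sites by genotype combination for two samples using bitwise operations."""
    a_ref, a_het, a_alt = a
    b_ref, b_het, b_alt = b
    return {
        # opposite homozygotes: no allele shared at the site
        "ibs0": popcount((a_ref & b_alt) | (a_alt & b_ref)),
        "shared_het": popcount(a_het & b_het),
        "shared_hom_alt": popcount(a_alt & b_alt),
    }


a = to_bitvectors([0, 1, 2, 1, None])
b = to_bitvectors([2, 1, 2, 0, 1])
print(pair_counts(a, b))
# → {'ibs0': 1, 'shared_het': 1, 'shared_hom_alt': 1}
```

Because each comparison touches whole machine words instead of individual sites, an all-vs-all pass over many samples does far less work per pair, which is the kind of speedup the item describes.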

sites files

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz

static build with curl. default output dir

20 Jul 02:25

v0.2.2

  • add a default output directory. previously, if no output directory was specified, it would try to write to / and give a non-informative error.
  • static build with libcurl. the static binary now supports bams/vcfs/crams over https/s3 etc.

Install

  • somalier_static is a completely static binary and the recommended way to run somalier; just wget, chmod+x (get a sites file) and go.

sites files

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz

v0.2.1

19 Jul 19:25

v0.2.1

  • fix hover in html
  • add --unknown flag for somalier relate to set unknown genotypes to hom-ref (useful when merging single-sample VCFs).
  • change sites to be alphabetical by allele so that they are the same between genome builds
  • add version to .somalier files created with extract -- these will not be compatible with those made with v0.2.0. I don't
    foresee a backwards-incompatible change like this one in the near future.
  • sites files for hg38 and GRCh37 are compatible. That is, we can extract sites from bams or vcfs from samples aligned to GRCh37
    reference and accurately calculate relatedness on files extracted from samples aligned to hg38.
  • better HTML performance for large numbers of samples by sub-sampling individuals that are expected to be unrelated and that
    have a calculated relatedness < 0.09.
  • add a depthview sub-command to plot the depth of each sample along each chromosome.
  • much nicer html and several fixes thanks to Joe Brown
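
The HTML sub-sampling item in the list above could look roughly like the sketch below. It is an assumption-laden illustration, not the actual implementation: the pair records, the `max_unrelated` cap, and modelling "expected to be unrelated" as an expected relatedness of 0 are all hypothetical; only the 0.09 cutoff comes from the notes.

```python
import random


def subsample_pairs(pairs, max_unrelated=1000, cutoff=0.09, seed=0):
    """Keep every interesting pair; randomly thin pairs that are unrelated as expected."""
    keep, unrelated = [], []
    for pair in pairs:
        # Expected unrelated AND observed relatedness below the cutoff: safe to thin.
        if pair["expected"] == 0 and pair["observed"] < cutoff:
            unrelated.append(pair)
        else:
            keep.append(pair)
    if len(unrelated) > max_unrelated:
        unrelated = random.Random(seed).sample(unrelated, max_unrelated)
    return keep + unrelated


# Made-up pairs: one true parent-child, one unexpected match, fifty unrelated.
pairs = [{"expected": 0.5, "observed": 0.49}, {"expected": 0, "observed": 0.5}]
pairs += [{"expected": 0, "observed": 0.01} for _ in range(50)]
print(len(subsample_pairs(pairs, max_unrelated=10)))
# → 12
```

Pairs that are related, or that look related when they should not be, are never dropped, so the plot stays informative for exactly the cases a user needs to inspect.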

Install

This release comes with 2 linux binaries:

  • somalier_static is a completely static binary and the recommended way to run somalier; just wget, chmod+x (get a sites file) and go.
  • somalier_shared requires htslib (and libhts.so). use this binary if you need to access S3 or https files.

sites files

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz

major refactor for scalability

29 May 20:14

v0.2.0

This was a large re-write of somalier. The command-line usage is backwards incompatible (but
should not change moving forward). There is now a per-sample extract step:

somalier extract -d extracted/ -s $sites_vcf -f $fasta $sample.cram

followed by a relate step:

somalier relate --ped $ped extracted/*.somalier

This enables parallelization by sample across nodes and the resulting, extracted, binary "somalier"
files are only ~220KB per sample so reading them is nearly instant and the relate step
runs in 10 seconds for my 603-sample test-case which makes adjusting pedigree files or removing samples
and re-running a much faster process.
This means we can add a single (n+1) sample and once it's extracted, we can compare it to an entire cohort in a few seconds.

somalier extract can also take a (multi-sample) VCF and create an identical "somalier" file
for cases when a VCF is available.

The sites files (linked below) are also greatly improved (with fewer sites, better accuracy) in this release.
For example, here is the output from previous version:
[figure: somalier-before]
compared to this version:
[figure: somalier-after]

Note how on the bottom figure for this version, like colors (relationships indicated from a pedigree file) cluster more tightly than in the previous version.

This release also reports values for X and Y chromosomes which help to evaluate observed vs expected sex, which can help resolve sample swaps.

Install

This release comes with 2 linux binaries:

  • somalier_static is a completely static binary and the recommended way to run somalier; just wget, chmod+x (get a sites file) and go.
  • somalier_shared requires htslib (and libhts.so). use this binary if you need to access S3 or https files.

sites files

sites.hg38.vcf.gz
sites.GRCh37.vcf.gz