ancestry
This release adds an initial implementation of an ancestry sub-command. It uses a set of labelled samples (with extracted somalier files) to train a small neural network, which is then used to predict the ancestry of incoming samples.
The implementation is incomplete, but it works for well-behaved data. Here is an example:
http://home.chpc.utah.edu/~u6000771/somalier/somalier-ancestry.n.html
This is possible thanks to a very fast randomized PCA implementation (along with a neural network framework) from @mratsim in Arraymancer.
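The general idea (randomized PCA to reduce dimensionality, then a small neural network trained on labelled samples) can be sketched in a few lines. This is only an illustration in Python with NumPy on synthetic data, not somalier's actual code, which is written in Nim on top of Arraymancer. The function names (`randomized_pca`, `train_mlp`, `predict`), the network size, and all hyperparameters are assumptions made for the example.

```python
import numpy as np

def randomized_pca(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top principal components of X (samples x features)."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean
    k = n_components + n_oversamples
    # Sketch the range of Xc with a random Gaussian test matrix.
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((Xc.shape[1], k)))
    # Power iterations (re-orthonormalized each time) sharpen the sketch.
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Z)
    # Exact SVD on the small projected matrix (k x features).
    _, s, Vt = np.linalg.svd(Q.T @ Xc, full_matrices=False)
    return mean, Vt[:n_components], s[:n_components]

def train_mlp(X, y, n_classes, hidden=16, lr=0.5, epochs=300, seed=0):
    """One-hidden-layer softmax classifier trained by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((X.shape[1], hidden)) * 0.5
    b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, n_classes)) * 0.5
    b2 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot labels
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)
        logits = H @ W2 + b2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / len(X)              # gradient of mean cross-entropy w.r.t. logits
        dH = (G @ W2.T) * (1.0 - H ** 2)  # back-propagate through tanh
        W2 -= lr * (H.T @ G)
        b2 -= lr * G.sum(axis=0)
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)
    return W1, b1, W2, b2

def predict(params, X):
    W1, b1, W2, b2 = params
    return np.argmax(np.tanh(X @ W1 + b1) @ W2 + b2, axis=1)

# Synthetic stand-in for per-sample sketches: 3 labelled groups, 200 features.
rng = np.random.default_rng(42)
n_groups, n_features = 3, 200
centers = rng.standard_normal((n_groups, n_features)) * 3.0

def simulate(n_per_group):
    y = np.repeat(np.arange(n_groups), n_per_group)
    X = centers[y] + rng.standard_normal((y.size, n_features))
    return X, y

X_train, y_train = simulate(100)  # labelled samples
X_query, y_query = simulate(20)   # "incoming" samples

# Fit the projection on labelled samples only, then reuse it for queries.
mean, components, s = randomized_pca(X_train, n_components=5)
Z_train = (X_train - mean) @ components.T
scale = Z_train.std(axis=0)
params = train_mlp(Z_train / scale, y_train, n_groups)

Z_query = (X_query - mean) @ components.T / scale
acc = float(np.mean(predict(params, Z_query) == y_query))
print(f"query accuracy: {acc:.2f}")
```

The point of the randomized step is that the expensive decomposition runs on a small projected matrix rather than on the full samples-by-sites matrix, which keeps the PCA fast even with many labelled samples.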
There are also improvements for huge cohorts. See below for full change-set.
Installation
Grab the static binary below, or use Docker via brentp/somalier:v0.2.7
v0.2.7
- new ancestry subcommand to predict ancestry using a simple neural network on the somalier sketches. Creates an interactive HTML output and a text file.
- fix for "Argument list too long" on huge cohorts (#37)
- sub-sample .pairs.tsv output for huge cohorts -- only for unrelated samples.
- better sub-sampling of html output
sites files (unchanged from previous releases)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz