html for huge cohorts
somalier was already fast enough to use on large cohorts (>5,000 samples), but the html output was not useful at that size.
This release fixes that by sub-sampling pairs of samples that are expected to be unrelated and that also appear to be unrelated according to the genotype information.
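The sub-sampling idea can be sketched roughly as follows. This is an illustration only, not somalier's actual implementation: the `Pair` type, the `subsample_pairs` function, the `keep_fraction` and `unrelated_cutoff` parameters and their default values are all hypothetical. The point is that every pair that is expected to be related, or that looks related, is always kept, and only the remaining pairs are thinned out at random.

```python
import random
from typing import List, NamedTuple


class Pair(NamedTuple):
    """One pair of samples (hypothetical structure, for illustration only)."""
    sample_a: str
    sample_b: str
    expected_relatedness: float  # from the pedigree; 0.0 means expected unrelated
    observed_relatedness: float  # estimated from the genotypes


def subsample_pairs(
    pairs: List[Pair],
    keep_fraction: float = 0.01,     # assumed value, not taken from somalier
    unrelated_cutoff: float = 0.05,  # assumed value, not taken from somalier
    seed: int = 42,
) -> List[Pair]:
    """Keep every informative pair; keep only a random subset of the rest.

    A pair is thinned only when it is expected to be unrelated AND the
    genotypes also say it is unrelated.
    """
    rng = random.Random(seed)
    kept = []
    for pair in pairs:
        uninformative = (
            pair.expected_relatedness <= 0.0
            and pair.observed_relatedness < unrelated_cutoff
        )
        # Informative pairs are always kept; the others survive at random.
        if not uninformative or rng.random() < keep_fraction:
            kept.append(pair)
    return kept
```

Because pairs that are unexpectedly related (or unexpectedly unrelated) are never dropped in this sketch, the points most worth inspecting in a plot would all survive the thinning.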
v0.2.6
- for large cohorts (>1K samples) the html output is now usable.
  it randomly subsets samples that should be and are unrelated.
- better error messages for bad input
- inspect environment variable SOMALIER_ALLOWED_FILTERS so that users can
  give a comma-delimited list of FILTERs that should be allowed (by default only PASS and RefCall
  variants are considered). This is useful for some GVCF formats.
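As a rough illustration of the SOMALIER_ALLOWED_FILTERS behavior described above, the sketch below parses a comma-delimited value and checks a variant's FILTER field against it. It is not somalier's code: the function names are hypothetical, the filter names `LowQual` and `LowGQX` are only examples, and it assumes that the listed FILTERs are added to the PASS and RefCall defaults rather than replacing them, which the notes do not state.

```python
import os
from typing import Mapping, Optional, Set

# Defaults named in the release notes.
DEFAULT_ALLOWED_FILTERS = ("PASS", "RefCall")


def allowed_filters(environ: Optional[Mapping[str, str]] = None) -> Set[str]:
    """Return the set of FILTER values to accept.

    Assumption: values from SOMALIER_ALLOWED_FILTERS extend the defaults.
    """
    env = os.environ if environ is None else environ
    raw = env.get("SOMALIER_ALLOWED_FILTERS", "")
    # Split on commas, tolerate stray whitespace and empty entries.
    extra = [name.strip() for name in raw.split(",") if name.strip()]
    return set(DEFAULT_ALLOWED_FILTERS) | set(extra)


def variant_is_considered(filter_field: str, allowed: Set[str]) -> bool:
    """Accept a variant only if every value in its FILTER field is allowed.

    The VCF FILTER column can hold several semicolon-separated values.
    A missing FILTER (".") is not treated specially in this sketch.
    """
    values = [value for value in filter_field.split(";") if value]
    return all(value in allowed for value in values)


# Example with hypothetical filter names; a plain dict stands in for os.environ.
allowed = allowed_filters({"SOMALIER_ALLOWED_FILTERS": "LowQual, LowGQX"})
print(variant_is_considered("LowGQX", allowed))
print(variant_is_considered("HighDP", allowed))
```

In a shell, the variable would be exported before running somalier so that the process inherits it.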
sites files
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz