gnotate
gnotate is a format and an annotation engine. The name is short for "genome annotation".
vcfanno is good at flexible annotation of VCFs with VCFs, BEDs, and other tabix-able files. It is very fast and parallelizes reasonably well. However, given a very dense file like the gnomad whole genomes, it must parse a lot of data (the whole genomes file is > 600GB!).
Often, a user requires only a small subset of the fields from that file.
gnotate facilitates the extraction and encoding of INFO fields from a VCF/BCF into a compressed, reduced format. It stores each chromosome in 2(*) separate files:
- a 64-bit integer that encodes:
  - position up to 2^28 (which is more than enough for the longest human chromosome)
  - encoded REF and ALT allele
  - FILTER (a boolean indicating a non-PASS filter)
- a 32-bit float that encodes a single value from the VCF.
With this encoding, we can store the popmax_AF from the union of gnomad genomes and exomes in 1.5GB, a > 400X reduction.
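The 64-bit key described above can be illustrated with a small sketch. The text only fixes the 28-bit position, the encoded REF and ALT, and the FILTER boolean; the remaining field widths, the 2-bits-per-base allele encoding, and the names `encode` and `decode` are assumptions made for illustration and are not gnotate's actual layout.

```python
# Assumed layout, high bits to low bits (63 bits used in total):
#   position (28) | REF length (4) | ALT length (4) | bases (26) | non-PASS (1)
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
MAX_BASES = 13  # 26 bits at 2 bits per base (assumed budget)


def encode(pos, ref, alt, non_pass):
    """Pack a variant into one integer, or return None if it does not fit."""
    bases = ref + alt
    if pos >= 1 << 28 or len(bases) > MAX_BASES:
        return None
    if any(b not in BASE_CODES for b in bases):
        return None
    packed = 0
    for b in bases:
        packed = (packed << 2) | BASE_CODES[b]
    packed <<= 2 * (MAX_BASES - len(bases))  # left-align within the 26 bits
    key = pos
    key = (key << 4) | len(ref)
    key = (key << 4) | len(alt)
    key = (key << 26) | packed
    key = (key << 1) | int(non_pass)
    return key


def decode(key):
    """Invert encode(): return (pos, ref, alt, non_pass)."""
    non_pass = bool(key & 1)
    key >>= 1
    packed = key & ((1 << 26) - 1)
    key >>= 26
    alt_len = key & 0xF
    key >>= 4
    ref_len = key & 0xF
    pos = key >> 4
    n = ref_len + alt_len
    packed >>= 2 * (MAX_BASES - n)  # undo the left-alignment
    bases = "".join("ACGT"[(packed >> (2 * (n - 1 - i))) & 3] for i in range(n))
    return pos, bases[:ref_len], bases[ref_len:], non_pass
```

Because the position occupies the highest bits in this sketch, sorting the integers also sorts the variants by position. A variant whose alleles do not fit returns `None`, which stands in for the long-variant case.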
In addition, this format allows for extremely rapid annotation. The encoded positions are sortable, so, given a query position, gnotate does a binary search and finds variants that match on position, REF, and ALT.
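A minimal sketch of such a lookup follows, using Python's `bisect` module over a sorted list of integer keys with a parallel list of values. The `toy_key` function is a simplified, hypothetical stand-in that handles only single-base alleles, and returning -1 for an absent variant is an assumption of this sketch. It is not gnotate's implementation.

```python
from bisect import bisect_left

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def toy_key(pos, ref, alt, non_pass=False):
    """Toy key: position in the high bits, then REF, ALT, and the filter bit."""
    return (pos << 5) | (CODES[ref] << 3) | (CODES[alt] << 1) | int(non_pass)


def lookup(keys, values, pos, ref, alt, missing=-1.0):
    """Binary-search sorted keys for a variant matching on pos, REF, and ALT."""
    query = toy_key(pos, ref, alt)  # filter bit 0: smallest key for this variant
    i = bisect_left(keys, query)
    # Compare everything except the lowest (filter) bit.
    if i < len(keys) and keys[i] >> 1 == query >> 1:
        return values[i]
    return missing


# Made-up records: (pos, ref, alt, non_pass, value)
records = [
    (250, "C", "T", False, 0.0005),
    (100, "A", "T", True, 0.2),
    (100, "A", "G", False, 0.01),
]
pairs = sorted((toy_key(p, r, a, f), v) for p, r, a, f, v in records)
keys = [k for k, _ in pairs]
values = [v for _, v in pairs]
```

Placing the filter flag in the lowest bit keeps it from affecting where a variant sorts, so a query with the flag cleared lands on the stored entry whether or not that entry passed its filters.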
In a small percentage of cases, the REF+ALT length is too long to store (along with the position) in a 64-bit integer. Those variants are stored in a text file, with a pointer in the encoded list indicating that there is a large variant at that position. gnotate handles all this internally.
A gnotate file is simply a zip file with this information encoded.
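Since the file is an ordinary zip archive, its members can be listed with any zip tool. The sketch below uses Python's standard `zipfile` module. The helper name `list_members` is hypothetical, and the text does not specify the names or layout of the members, so none are assumed here.

```python
import zipfile


def list_members(zip_path):
    """Return (name, uncompressed size in bytes) for each entry in a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```

For example, `list_members("gnomad-2.1.zip")` would return one tuple per stored member of the archive.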
The following command will make a gnotate zip file of the controls_nhomalt field (number of homozygous alternates in gnomad controls) and the maximum allele frequency in any subpopulation, using the exomes and genomes files.
slivar make-gnotate --prefix gnomad-2.1 \
    --field controls_nhomalt:gnomad_nhomalt \
    --field popmax_AF:gnomad_popmax_af \
    gnomad.exomes.r2.1.sites.vcf.bgz \
    gnomad.genomes.r2.1.sites.chr*.vcf.bgz
The output will be gnomad-2.1.zip. The format of the --field argument is: $info_field:$new_name. The $new_name will be used when annotating later files with the zip created here, so it should be descriptive.
To use these files, we can specify --gnotate for each zip file in any of the slivar sub-commands. slivar expr is generally used for filtering, but it can also be used for annotation:
# variants absent from gnomad-2.1.zip will have a value of -1 for gnomad_nhomalt.
slivar expr \
    --gnotate gnomad-2.1.zip \
    --info "INFO.gnomad_nhomalt < 2" \
    --out my.annotated-and-filtered.bcf \
    -v $my_cohort_vcf
To get a VCF without filtering, just don't specify --info.
Once this command has completed, my.annotated-and-filtered.bcf will contain INFO fields for gnomad_nhomalt and gnomad_popmax_af. The --info here is optional, but whenever --gnotate is used (for example, in the slivar expr sub-command), the "gnotation" takes place first so that the result (in this case gnomad_nhomalt) is available for use in the filter expression.
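The annotate-then-filter order can be sketched in a few lines of Python. It is a toy model, not slivar's code: the record layout, the dictionary-based `lookup`, and the function name `annotate_then_filter` are all assumptions for illustration. The only behaviour taken from the text is that absent variants receive -1 and that annotation happens before the expression is evaluated.

```python
MISSING = -1  # value given to variants absent from the gnotate zip


def annotate_then_filter(variants, lookup, predicate):
    """Annotate each variant first, then keep those whose INFO passes predicate."""
    kept = []
    for variant in variants:
        key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
        info = dict(variant["info"])
        info["gnomad_nhomalt"] = lookup.get(key, MISSING)  # annotation step
        if predicate(info):  # the filter sees the new field
            kept.append({**variant, "info": info})
    return kept


# Made-up data: two variants are in the lookup, one is absent.
lookup = {("1", 100, "A", "G"): 5, ("1", 200, "C", "T"): 0}
variants = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "info": {}},
    {"chrom": "1", "pos": 200, "ref": "C", "alt": "T", "info": {}},
    {"chrom": "1", "pos": 300, "ref": "G", "alt": "A", "info": {}},
]
kept = annotate_then_filter(
    variants, lookup, lambda info: info["gnomad_nhomalt"] < 2
)
```

In this model the variant at position 300 is absent from the lookup, receives -1, and therefore passes a `< 2` filter.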