gnotate

Brent Pedersen edited this page Jun 27, 2019 · 9 revisions

gnotate is a format and an annotation engine. Its name is short for "genome annotation".

motivation

vcfanno is good at flexibly annotating VCFs with VCFs, BEDs, and other tabix-able files. It is very fast and parallelizes reasonably well. However, given a very dense file such as the gnomad whole genomes, it must parse a lot of data (the whole genomes file is > 600GB!).

Often, a user only requires a small subset of fields from that file.

gnotate: a minimal representation of a single INFO field.

gnotate facilitates the extraction and encoding of INFO fields from a VCF/BCF into a compressed, reduced format. It stores each chromosome in 2 separate files (plus a text file for the rare long variants described further down), which hold, for each variant:

  • a 64 bit integer that encodes:
    • position up to 2^28 (which is more than enough for the longest human chromosome)
    • encoded REF and ALT allele
    • FILTER (a boolean indicating a non-PASS filter)
  • a 32 bit float that encodes a single value from the VCF.

With this encoding, we can store the popmax_AF from the union of gnomad genomes and exomes in 1.5GB, a > 400X reduction.
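The packing described above can be sketched in Python. This is only an illustration under stated assumptions: the text gives the 28-bit position and says that REF, ALT, and a FILTER flag share the same 64 bit integer, but the exact bit layout is not given here. The field widths, the 2-bits-per-base allele code, the "long variant" flag bit, and the names `encode` and `decode` are all hypothetical.

```python
# Assumed layout (high to low bits), 64 bits in total:
#   position (28) | REF length (4) | ALT length (4) | bases (26) | long flag (1) | non-PASS (1)
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
POS_BITS = 28            # from the text: positions up to 2^28
LEN_BITS = 4             # assumption
MAX_BASES = 13           # assumption: 13 bases * 2 bits = 26 bits
SEQ_BITS = 2 * MAX_BASES


def encode(pos, ref, alt, non_pass=False):
    """Pack one variant into an integer that fits in 64 bits."""
    if not 0 <= pos < (1 << POS_BITS):
        raise ValueError("position does not fit in 28 bits")
    seq = ref + alt
    # Alleles that are too long (or not plain ACGT) are only flagged here;
    # their sequence would have to be stored elsewhere.
    is_long = len(seq) > MAX_BASES or any(b not in BASE_CODES for b in seq)
    ref_len = alt_len = seq_bits = 0
    if not is_long:
        ref_len, alt_len = len(ref), len(alt)
        for base in seq:
            seq_bits = (seq_bits << 2) | BASE_CODES[base]
    key = pos  # position goes in the highest bits so keys sort by position
    key = (key << LEN_BITS) | ref_len
    key = (key << LEN_BITS) | alt_len
    key = (key << SEQ_BITS) | seq_bits
    key = (key << 1) | int(is_long)
    key = (key << 1) | int(non_pass)
    return key


def decode(key):
    """Unpack a key into (pos, ref, alt, non_pass, is_long)."""
    non_pass = bool(key & 1)
    is_long = bool((key >> 1) & 1)
    seq_bits = (key >> 2) & ((1 << SEQ_BITS) - 1)
    alt_len = (key >> (2 + SEQ_BITS)) & ((1 << LEN_BITS) - 1)
    ref_len = (key >> (2 + SEQ_BITS + LEN_BITS)) & ((1 << LEN_BITS) - 1)
    pos = key >> (2 + SEQ_BITS + 2 * LEN_BITS)
    n = ref_len + alt_len
    bases = "".join("ACGT"[(seq_bits >> (2 * (n - 1 - i))) & 3] for i in range(n))
    return pos, bases[:ref_len], bases[ref_len:], non_pass, is_long
```

Under this assumed layout each variant costs 8 bytes for the key plus 4 bytes for the float value, and sorting the keys numerically orders the variants by position.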

In addition, this format allows for extremely rapid annotation. The encoded positions are sortable, so, given a query position, gnotate does a binary search and finds variants that match on position, REF, and ALT.
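The lookup can be illustrated with a short Python sketch that uses the standard `bisect` and `array` modules. It is not gnotate's actual code: `make_key`, `lookup`, the 36-bit allele code, and the example values are hypothetical, and the key here deliberately holds only position and alleles. The -1 returned for missing variants mirrors the behavior described for slivar in the usage section.

```python
from array import array
from bisect import bisect_left

MISSING = -1.0  # value reported for variants absent from the annotation


def make_key(pos, allele_code):
    # Hypothetical key: position in the high bits, a 36-bit REF/ALT code below it.
    return (pos << 36) | allele_code


def lookup(keys, values, key):
    """Binary-search the sorted keys; return the paired value or MISSING."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return MISSING


# Two parallel arrays: sorted 64-bit keys and their 32-bit float values.
keys = array("Q", [make_key(100, 3), make_key(100, 7), make_key(250, 1)])
values = array("f", [0.5, 0.125, 0.25])

print(lookup(keys, values, make_key(100, 7)))  # same position and alleles: found
print(lookup(keys, values, make_key(100, 5)))  # same position, other alleles: MISSING
```

Because the position occupies the high bits, all variants at one position are adjacent in the sorted array, so a single binary search is enough to reach them.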

In a small percentage of cases, the REF+ALT length is too long to store (along with the position) in a 64 bit integer. For those, the variants are stored in a text file with a pointer in the encoded list that indicates there is a large variant at that position. gnotate handles all this internally.

A gnotate file is simply a zip file with this information encoded.
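Since the file is an ordinary zip archive, its members can be inspected with any zip tool. Here is a minimal sketch using Python's standard `zipfile` module; the helper name `list_members` is illustrative, and the names and layout of the members inside a gnotate zip are not documented here, so the sketch only lists whatever is present.

```python
import zipfile


def list_members(path):
    """Return (name, uncompressed size in bytes) for each member of a zip file."""
    with zipfile.ZipFile(path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]
```

Calling `list_members` on a gnotate zip would show which per-chromosome files it contains without unpacking them.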

Usage

Creating gnotate files.

The following command will make a gnotate zip file containing two fields drawn from the exomes and genomes files: controls_nhomalt (the number of homozygous alternates in gnomad controls) and popmax_AF (the maximum allele frequency in any subpopulation).

slivar make-gnotate --prefix gnomad-2.1 \
    --field controls_nhomalt:gnomad_nhomalt \
    --field popmax_AF:gnomad_popmax_af \
    gnomad.exomes.r2.1.sites.vcf.bgz \
    gnomad.genomes.r2.1.sites.chr*.vcf.bgz

The output will be gnomad-2.1.zip. The format of the --field argument is: $info_field:$new_name. The $new_name is the INFO field name that will be written when later files are annotated with this zip, so it should be descriptive.

Annotating with gnotate files.

To use these files, we can specify --gnotate for each zip file in any of the slivar sub-commands. slivar expr is generally used for filtering, but it can also be used for annotation:

# Variants absent from gnomad-2.1.zip will have a value of -1 for gnomad_nhomalt.
slivar expr \
    --gnotate gnomad-2.1.zip \
    --info "INFO.gnomad_nhomalt < 2" \
    --out my.annotated-and-filtered.bcf \
    -v "$my_cohort_vcf"

To get a VCF without filtering, just don't specify --info.

Once this command is completed, my.annotated-and-filtered.bcf will contain INFO fields for gnomad_nhomalt and gnomad_popmax_af. When --info is given together with --gnotate, as in the slivar expr example above, the "gnotation" takes place first, so the result (in this case gnomad_nhomalt) is available for use in the filter expression.