Annotation

cristinaetrv edited this page Sep 25, 2023 · 4 revisions

Bystro Annotator

The Bystro annotator is a program that annotates genetic variants with features drawn from sources such as refSeq, gnomAD, and CADD.

Modifying Annotation Features

The features that Bystro can annotate are defined in a YAML configuration file. This file is a static definition of the maximum set of features that Bystro can provide, and it is passed as an argument during annotation.

A somewhat simplified view of the YAML configuration is shown below:

---
  assembly: hg38
  database_dir: /path/to/embedded/database
  tracks:
    outputOrder:
    - ref
    - refSeq
    - cadd
    - gnomad.genomes
    tracks:
    - name: ref
      type: reference
    - name: cadd
      type: cadd
    - features:
      - name
      - name2
      - description
      - kgID
      - mRNA
      - spID
      - spDisplayID
      - protAcc
      - rfamAcc
      - tRnaName
      - ensemblID
      name: refSeq
      type: gene
    - features:
      - alt
      - id
      - af: number
      - an: number
      - an_afr: number
      - an_amr: number
      - an_asj: number
      - an_eas: number
      - an_fin: number
      - an_nfe: number
      - an_oth: number
      - an_sas: number
      - an_male: number
      - an_female: number
      - af_afr: number
      - af_amr: number
      - af_asj: number
      - af_eas: number
      - af_fin: number
      - af_nfe: number
      - af_oth: number
      - af_sas: number
      - af_male: number
      - af_female: number
      name: gnomad.genomes
      type: vcf

There are a number of moving pieces here, so let's focus on the piece related to adding or removing annotation features:

  1. There is a top-level tracks object, which has two keys: outputOrder and tracks.
  2. The inner tracks is an array of track definitions. You can think of a track as a set of features that come from one input source, e.g. CADD, dbSNP, or gnomAD.
  3. Each track must contain the following key properties:
    • name, which defines the track name
    • type, one of a set of track types (e.g. sparse, gene, vcf, cadd), which we'll describe in a separate section of this document
  4. Tracks that have more than one grouping of information include, in addition to name and type:
    • features, which lists each grouping of information as a separate field (for refSeq, features include the transcript labels in name, the gene labels in name2, the transcript descriptions in description, and so on)
  5. The outputOrder must contain every track name from the inner tracks array.
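
The rules above can be read off a trimmed-down configuration. The sketch below is a subset of the example shown earlier (two tracks only, with the refSeq feature list shortened for brevity); the comments map each key to the rule it illustrates. Whether this exact subset is sufficient for a real annotation run is an assumption made for illustration, not something this page states.

```yaml
---
assembly: hg38
database_dir: /path/to/embedded/database
tracks:                 # rule 1: top-level object with two keys
  outputOrder:          # rule 5: lists every track name defined below
  - ref
  - refSeq
  tracks:               # rule 2: array of track definitions
  - name: ref           # rule 3: every track has a name ...
    type: reference     # ... and a type
  - name: refSeq
    type: gene
    features:           # rule 4: one entry per grouping of information
    - name              # transcript labels
    - name2             # gene labels
    - description       # transcript descriptions
```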

YAML configuration files can be modified. Adding new annotations to them (defining new tracks or features) requires a build step that pre-compiles the track's input data into a fast embedded database, which enables millions of queries per minute even on modest machines.

Removing annotations is much simpler. Let's say you were using the above YAML and didn't need the description annotation in the refSeq track, which contains a long-form description of a transcript. To remove it, delete the line - description from the features array of the name: refSeq track.
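
For reference, this is how the refSeq track definition from the example configuration would look once the description line is gone. It is only a fragment of the inner tracks array; every other line of the file stays as it was.

```yaml
    - features:
      - name
      - name2
      - kgID
      - mRNA
      - spID
      - spDisplayID
      - protAcc
      - rfamAcc
      - tRnaName
      - ensemblID
      name: refSeq
      type: gene
```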

Similarly, entire tracks can be dropped. If we wanted to annotate our VCF without CADD scores, we would remove the following lines:

    - name: cadd
      type: cadd

from the inner tracks array, and also remove - cadd from outputOrder.
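
As a sketch of the result, the tracks object from the example configuration would then look roughly like this. The feature lists are abbreviated with comments to keep the fragment short; in the real file they remain exactly as before.

```yaml
  tracks:
    outputOrder:
    - ref
    - refSeq
    - gnomad.genomes
    tracks:
    - name: ref
      type: reference
    # the cadd track definition used to sit here
    - features:
      - name
      - name2
      # ... remaining refSeq features unchanged
      name: refSeq
      type: gene
    - features:
      - alt
      - id
      # ... remaining gnomad.genomes features unchanged
      name: gnomad.genomes
      type: vcf
```

Note that outputOrder and the inner tracks array still agree: each of the three remaining track names appears in both.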
