## AnyVar VCF processing and annotation

### Setup

First, we'll initialize AnyVar and the VCF registrar object. This assumes the required backing services are already running in the background.

In [1]:
from timeit import default_timer as timer

from anyvar.anyvar import AnyVar, create_storage, create_translator
from anyvar.extras.vcf import VcfRegistrar


Running the next command assumes that you have a local SeqRepo at `/usr/local/share/seqrepo/2024-12-20`. You can also point it at a SeqRepo REST API URI instead.

In [5]:
from os import environ
environ["SEQREPO_DATAPROXY_URI"] = "seqrepo+file:///usr/local/share/seqrepo/2024-12-20"

av = AnyVar(
    create_translator(),
    create_storage("postgresql://postgres:postgres@localhost:5432/anyvar"),
)
vcf_registrar = VcfRegistrar(data_proxy=av.translator.dp, av=av)
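
The data proxy URI can point at either a local SeqRepo directory or a SeqRepo REST service. The sketch below picks one depending on whether the local directory exists. The helper name `pick_seqrepo_uri` and the REST address `http://localhost:5000/seqrepo` are illustrative assumptions, not part of AnyVar; substitute whatever endpoint you actually run.

```python
from os import environ
from pathlib import Path


def pick_seqrepo_uri(local_root: str, rest_url: str) -> str:
    """Return a data proxy URI, preferring a local SeqRepo directory if present."""
    root = Path(local_root)
    if root.is_dir():
        # Local file-backed SeqRepo: "seqrepo+file://" followed by the absolute path
        return f"seqrepo+file://{root.resolve()}"
    # Otherwise fall back to a REST service: "seqrepo+" followed by the HTTP(S) URL
    return f"seqrepo+{rest_url}"


environ["SEQREPO_DATAPROXY_URI"] = pick_seqrepo_uri(
    "/usr/local/share/seqrepo/2024-12-20",  # path used in the cell above
    "http://localhost:5000/seqrepo",  # hypothetical REST endpoint
)
```

Set the variable before constructing the translator so that it is picked up at initialization.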

### Input

We have a sample file, `demo-input.vcf`, with about 1,000 rows consisting of simple SNPs and indels:

In [6]:
!wc -l ./demo-input.vcf

    1045 ./demo-input.vcf
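
Note that `wc -l` counts header lines as well as variant records. A minimal sketch that separates the two, relying only on the VCF convention that header lines begin with `#`; the function name `count_vcf_lines` is illustrative:

```python
from pathlib import Path


def count_vcf_lines(vcf_path: Path) -> tuple[int, int]:
    """Return (header_lines, record_lines) for an uncompressed VCF file."""
    header = records = 0
    with vcf_path.open() as handle:
        for line in handle:
            if line.startswith("#"):
                header += 1  # meta-information (##...) and the #CHROM column line
            elif line.strip():
                records += 1  # one variant record per non-empty data line
    return header, records


demo = Path("./demo-input.vcf")
if demo.exists():
    n_header, n_records = count_vcf_lines(demo)
    print(f"{n_header} header lines, {n_records} variant records")
```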


In [8]:
!bat --line-range=0:100 ./demo-input.vcf  # for example

File: ./demo-input.vcf
   1 │ ##fileformat=VCFv4.1
   2 │ ##fileDate=2025-05-21
   3 │ ##source=ClinVar
   4 │ ##reference=GRCh38
   5 │ ##ID=<Description="ClinVar Variation ID">
   6 │ ##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
   7 │ ##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies

### Ingestion and annotation

We'll run the `annotate()` method and track wall clock time:

In [9]:
from pathlib import Path

start = timer()
vcf_registrar.annotate(Path("./demo-input.vcf"), vcf_out=Path("./demo-output.vcf"))
end = timer()
print(f"processed all VCF rows in {end - start} seconds")

[W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)


processed all VCF rows in 3.6417058749993885 seconds
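
If you time several steps, the start/end bookkeeping can be wrapped in a small context manager. This is a sketch using only the standard library; the name `wall_clock` and the toy workload are illustrative stand-ins for the `annotate()` call above.

```python
from contextlib import contextmanager
from timeit import default_timer as timer


@contextmanager
def wall_clock(label: str):
    """Time the enclosed block and print the elapsed wall-clock seconds."""
    result = {"seconds": 0.0}
    start = timer()
    try:
        yield result
    finally:
        # Record the elapsed time even if the block raised an exception
        result["seconds"] = timer() - start
        print(f"{label} took {result['seconds']:.2f} seconds")


with wall_clock("toy workload") as timing:
    total = sum(range(100_000))  # stand-in for vcf_registrar.annotate(...)
```

After the block exits, `timing["seconds"]` holds the elapsed time, which can be divided into a record count to get a rough throughput figure.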


In [10]:
allele_count = av.object_store.get_variation_count("all")
print(f"Between references and alternates, this registers {allele_count} alleles.")

Between references and alternates, this registers 1872 alleles.


### Output

This process adds VRS allele IDs to the VCF's INFO field:

In [11]:
!bat --line-range=0:100 ./demo-output.vcf

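
To pull the IDs back out of an annotated file, the INFO column can be split on `;` and `=`. The sketch below assumes the IDs are written under an INFO key named `VRS_Allele_IDs` as a comma-separated list; confirm the key name against the `##INFO` lines in your output header. The sample record and its IDs are made-up placeholders, not real digests.

```python
def parse_info(info_field: str) -> dict[str, str]:
    """Split a VCF INFO column into a key -> raw value mapping."""
    parsed = {}
    for entry in info_field.split(";"):
        if not entry or entry == ".":
            continue  # "." marks an empty INFO column
        key, _, value = entry.partition("=")  # flag entries have no "=" and map to ""
        parsed[key] = value
    return parsed


def vrs_ids_from_record(line: str, key: str = "VRS_Allele_IDs") -> list[str]:
    """Return the IDs stored under `key` in one tab-delimited VCF record."""
    columns = line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the eighth fixed VCF column
    return info[key].split(",") if key in info else []


# Hypothetical record; key name and IDs are assumptions for illustration
record = (
    "1\t100\t.\tA\tG\t.\t.\t"
    "AF_ESP=0.01;VRS_Allele_IDs=ga4gh:VA.REF_PLACEHOLDER,ga4gh:VA.ALT_PLACEHOLDER"
)
print(vrs_ids_from_record(record))
```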

We can dereference those IDs to retrieve the complete VRS allele:

In [12]:
av.get_object("ga4gh:VA.tT2-U2WwLDM0r77vQwCu3amz8fCkuVw_", True).as_dict()

KeyError: 'ga4gh:VA.tT2-U2WwLDM0r77vQwCu3amz8fCkuVw_'

### Search

Currently, we support basic genomic region searches:

In [None]:
chr4 = av.translator.get_sequence_id("NCBI:NC_000004.12")
av.object_store.search_variations(chr4, 400000, 500000)
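
As a plain-Python illustration of what a region query does (not AnyVar's storage-backed implementation), the sketch below keeps alleles whose location lies entirely inside a query interval on a given sequence. The containment rule, the function name `alleles_in_region`, and the toy dictionaries are assumptions made for illustration; the dictionaries only mimic a simplified `location` portion of an allele.

```python
def alleles_in_region(alleles, sequence_id, start, end):
    """Keep alleles located on `sequence_id` and fully inside [start, end]."""
    hits = []
    for allele in alleles:
        loc = allele["location"]
        same_sequence = loc["sequence_id"] == sequence_id
        contained = start <= loc["start"] and loc["end"] <= end
        if same_sequence and contained:
            hits.append(allele)
    return hits


# Toy data with placeholder identifiers
toy_alleles = [
    {"id": "allele-a", "location": {"sequence_id": "seq-4", "start": 410000, "end": 410001}},
    {"id": "allele-b", "location": {"sequence_id": "seq-4", "start": 600000, "end": 600001}},
    {"id": "allele-c", "location": {"sequence_id": "seq-1", "start": 450000, "end": 450001}},
]
matches = alleles_in_region(toy_alleles, "seq-4", 400000, 500000)
print([allele["id"] for allele in matches])
```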