## AnyVar VCF processing and annotation

### Setup

First, we'll initialize AnyVar and the VCF registrar object. The services AnyVar depends on (a variation translator and a PostgreSQL database) are already running in the background.

In [1]:
from pathlib import Path
from timeit import default_timer as timer

from anyvar.anyvar import create_storage, create_translator, AnyVar
from anyvar.storage.postgres import PostgresBatchManager
from anyvar.extras.vcf import VcfRegistrar

Removing allOf attribute from CopyNumber to avoid python-jsonschema-objects error.
Removing allOf attribute from SequenceInterval to avoid python-jsonschema-objects error.
Removing allOf attribute from RepeatedSequenceExpression to avoid python-jsonschema-objects error.


In [2]:
av = AnyVar(
    create_translator("http://localhost:7999/variation/"),
    create_storage("postgresql://postgres@localhost:5432/anyvar")
)
vcf_registrar = VcfRegistrar(av)

### Input

We have a sample file, `vcf-100k-no-added-errors-01-20-23.vcf`, with about 100,000 rows consisting of simple SNPs and indels:

In [3]:
!wc -l ../vcf-100k-no-added-errors-01-20-23.vcf

  107139 ../vcf-100k-no-added-errors-01-20-23.vcf
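
Note that `wc -l` counts every line, including the VCF header, so the number of variant records is lower than 107139. A minimal sketch for separating the two, assuming a plain-text (uncompressed) VCF in which header lines start with `#`; `count_vcf_lines` is an illustrative helper, not part of AnyVar:

```python
from pathlib import Path


def count_vcf_lines(path):
    """Return (header_lines, record_lines) for an uncompressed VCF file."""
    header = records = 0
    with Path(path).open() as handle:
        for line in handle:
            if line.startswith("#"):
                header += 1  # meta-information (##...) and the #CHROM column header
            elif line.strip():
                records += 1  # any non-empty, non-header line is a variant record
    return header, records
```

Calling `count_vcf_lines("../vcf-100k-no-added-errors-01-20-23.vcf")` would then give the header and record counts separately.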


In [4]:
!bat --line-range=4000:4003 ../vcf-100k-no-added-errors-01-20-23.vcf  # for example

───────┬────────────────────────────────────────────────────────────────────────
       │ File: ../vcf-100k-no-added-errors-01-20-23.vcf
───────┼────────────────────────────────────────────────────────────────────────
4000   │ chr1    18476814    .   ATTCATCTCTCC    A   .   PASS    QUALapprox=1784
       │ ;SB=29,17,26,20;MQ=60.0000;MQRankSum=1.04600;VarDP=92;AS_ReadPosRankSum
       │ =-0.371000;AS_pab_max=0.867939;AS_QD=19.3913;AS_MQ=60.0000;QD=19.3913;A
       │ S_MQRankSum=1.04600;FS=1.73310;AS_FS=1.73310;ReadPosRankSum=-0.371000;A
       │ S_QUALapprox=1784;AS_SB_TABLE=29,17,26,20;AS_VarDP=92;AS_SOR=0.466938;S
       │ O

### Ingestion and annotation

We'll run the `annotate()` method and track wall-clock time:

In [5]:
start = timer()
vcf_registrar.annotate(
    "../vcf-100k-no-added-errors-01-20-23.vcf",
    vcf_out="out.vcf"
)
end = timer()
print(f"processed all VCF rows in {end - start} seconds")

processed all VCF rows in 1964.3764593320002 seconds


In [6]:
allele_count = av.object_store.get_variation_count('all')
print(f"Between references and alternates, this registers {allele_count} alleles.")

Between references and alternates, this registers 198098 alleles.
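
For a rough sense of throughput, the two printed figures can be combined. This is plain arithmetic on the numbers reported in this notebook, not a benchmark, and it assumes the object store was empty before the run so that every counted allele came from this file:

```python
elapsed_seconds = 1964.3764593320002  # wall-clock time printed by the annotate() cell
allele_count = 198098                 # allele count printed after registration

minutes = elapsed_seconds / 60
alleles_per_second = allele_count / elapsed_seconds

print(f"{minutes:.1f} minutes, about {alleles_per_second:.0f} alleles registered per second")
```

That works out to roughly 33 minutes and on the order of 100 alleles per second for this run; other hardware or service configurations will differ.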


### Output

This process adds VRS allele IDs to the VCF's INFO field:

In [7]:
!bat --line-range=4000:4003 out.vcf

───────┬────────────────────────────────────────────────────────────────────────
       │ File: out.vcf
───────┼────────────────────────────────────────────────────────────────────────
4000   │ chr1    18357472    .   GGGATGAGGTGGGGATGGGGATGGGAATGAAGTGGA    G   .  
       │  AS_VQSR QUALapprox=1111;SB=377,138,62,9;MQ=59.9291;MQRankSum=0.48;VarD
       │ P=586;AS_ReadPosRankSum=-1.271;AS_pab_max=0.375;AS_QD=1.89811;AS_MQ=59.
       │ 9236;QD=1.8959;AS_MQRankSum=0.48;FS=6.83343;AS_FS=15.2899;ReadPosRankSu
       │ m=-1.231;AS_QUALapprox=1006;AS_SB_TABLE=377,138,58,5;AS_VarDP=530;AS_SO
       │ R=2.64441;SOR=1.851;AS_VQSLOD=-2.4
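
To pull the added identifiers back out of an annotated record, the INFO column can be split into key/value pairs. The sketch below uses only the standard library. The key name `VRS_Allele` and the sample INFO string are placeholders chosen for illustration (the excerpt above is cut off before the annotation appears), so check the header of `out.vcf` for the key that is actually written.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate stray or trailing semicolons
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Placeholder record: both the key name and the IDs are made up for illustration.
sample_info = "QD=19.3913;VRS_Allele=ga4gh:VA.example_ref,ga4gh:VA.example_alt;DB"
info = parse_info(sample_info)
allele_ids = info["VRS_Allele"].split(",")  # one ID per comma-separated entry
```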

We can dereference those IDs to retrieve the complete VRS allele:

In [8]:
av.get_object("ga4gh:VA.tT2-U2WwLDM0r77vQwCu3amz8fCkuVw_", True).as_dict()

{'_id': 'ga4gh:VA.tT2-U2WwLDM0r77vQwCu3amz8fCkuVw_',
 'type': 'Allele',
 'location': {'type': 'SequenceLocation',
  'sequence_id': 'ga4gh:SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO',
  'interval': {'type': 'SequenceInterval',
   'start': {'type': 'Number', 'value': 18476813},
   'end': {'type': 'Number', 'value': 18476825}}},
 'state': {'type': 'LiteralSequenceExpression', 'sequence': 'ATTCATCTCTCC'}}
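
The interval in this allele lines up with the input record shown earlier (`chr1`, POS 18476814, REF `ATTCATCTCTCC`). VCF positions are 1-based, while VRS locations use 0-based inter-residue coordinates, so the start is POS minus 1 and the end is the start plus the length of the sequence. A small sketch of that arithmetic follows; `vcf_ref_to_interval` is an illustrative helper rather than an AnyVar function, and it covers only the reference allele, since alternate alleles may be normalized to different coordinates.

```python
def vcf_ref_to_interval(pos, ref):
    """Convert a 1-based VCF POS and REF string to a 0-based, end-exclusive interval."""
    start = pos - 1         # shift from 1-based position to inter-residue start
    end = start + len(ref)  # inter-residue end sits just past the last REF base
    return start, end


start, end = vcf_ref_to_interval(18476814, "ATTCATCTCTCC")
print(start, end)  # 18476813 18476825
```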

### Search

Currently, we support basic genomic region searches:

In [9]:
chr4 = av.translator.get_sequence_id("NCBI:NC_000004.12")
av.object_store.search_variations(chr4, 400000, 500000)

[{'_id': 'ga4gh:VA.Q19O8HhV1UnaYYAmcgmpcy1UDHkU4mdD',
  'type': 'Allele',
  'state': {'type': 'LiteralSequenceExpression', 'sequence': 'C'},
  'location': 'ga4gh:VSL.tWfR6n2aEy6patCt2DcWa7mf4UD6poT_'},
 {'_id': 'ga4gh:VA.SwdQzWZyRDzJSVDKZCaa1BDX-zjCP8GJ',
  'type': 'Allele',
  'state': {'type': 'LiteralSequenceExpression',
   'sequence': 'TTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT'},
  'location': 'ga4gh:VSL.PpwQTE2qCqjlCILek9uZTnfXycki__tX'},
 {'_id': 'ga4gh:VA.qihaf7S9gRb2fxvOA1OJ6ghcfr7OudaS',
  'type': 'Allele',
  'state': {'type': 'LiteralSequenceExpression', 'sequence': 'T'},
  'location': 'ga4gh:VSL.iJCaR2HHgifLaqbyK3CYik4XRKJUvwL8'},
 {'_id': 'ga4gh:VA.Kqa1gjWWWfiuc54Ze2J170k9t0WPCUQN',
  'type': 'Allele',
  'state': {'type': 'LiteralSequenceExpression', 'sequence': 'C'},
  'location': 'ga4gh:VSL.iJCaR2HHgifLaqbyK3CYik4XRKJUvwL8'}]
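
Search results reference locations by ID, so alleles that share a location can be grouped directly. The sketch below is plain Python over a list shaped like the output above (trimmed to the `_id` and `location` fields); it does not call AnyVar.

```python
from collections import defaultdict

# Search results copied from the output above, reduced to the fields used here.
results = [
    {"_id": "ga4gh:VA.Q19O8HhV1UnaYYAmcgmpcy1UDHkU4mdD",
     "location": "ga4gh:VSL.tWfR6n2aEy6patCt2DcWa7mf4UD6poT_"},
    {"_id": "ga4gh:VA.SwdQzWZyRDzJSVDKZCaa1BDX-zjCP8GJ",
     "location": "ga4gh:VSL.PpwQTE2qCqjlCILek9uZTnfXycki__tX"},
    {"_id": "ga4gh:VA.qihaf7S9gRb2fxvOA1OJ6ghcfr7OudaS",
     "location": "ga4gh:VSL.iJCaR2HHgifLaqbyK3CYik4XRKJUvwL8"},
    {"_id": "ga4gh:VA.Kqa1gjWWWfiuc54Ze2J170k9t0WPCUQN",
     "location": "ga4gh:VSL.iJCaR2HHgifLaqbyK3CYik4XRKJUvwL8"},
]

# Map each location ID to the allele IDs found at it.
by_location = defaultdict(list)
for allele in results:
    by_location[allele["location"]].append(allele["_id"])

# Keep only locations with more than one registered allele.
shared = {loc: ids for loc, ids in by_location.items() if len(ids) > 1}
```

Here the last two alleles (`T` and `C`) share one location, which is what one would expect for a reference and an alternate allele registered at the same site.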