# GA4GH variant API test

This notebook contains some tests of the Python GA4GH variant API, <https://github.com/ga4gh/vrs-python>. To run it, install the following packages:

```bash
python -m pip install 'ga4gh.vrs[extras]' seqrepo
```

You will also need to set up [seqrepo](https://github.com/biocommons/biocommons.seqrepo) with the reference genome [NC_045512.2](https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2/). Download the FASTA file, save it as `reference/NC_045512.2.fasta`, and run the following:

```bash
seqrepo --root-directory data init -i references
seqrepo --root-directory data load -i references -n NCBI reference/NC_045512.2.fasta
```

You can then open the seqrepo repository from Python as:

```python
SeqRepo("data/references")
```

Much of the testing here is drawn from the examples in <https://github.com/ga4gh/vrs-python>.

In [31]:
from ga4gh.vrs import __version__, models, normalize

print(__version__)

0.6.0rc0


In [2]:
location = models.SequenceLocation(
    sequence_id="refseq:NC_045512.2",
    interval=models.SimpleInterval(start=10, end=10))
allele = models.Allele(location=location,
                       state=models.SequenceState(sequence="ATC"))
allele.as_dict()

{'location': {'interval': {'end': 10, 'start': 10, 'type': 'SimpleInterval'},
  'sequence_id': 'refseq:NC_045512.2',
  'type': 'SequenceLocation'},
 'state': {'sequence': 'ATC', 'type': 'SequenceState'},
 'type': 'Allele'}

## Reference

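This section opens the reference genome through seqrepo and wraps it in a data proxy, which the later cells use for sequence lookups, normalization, and translation.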

In [3]:
from biocommons.seqrepo import SeqRepo
import ga4gh.vrs.dataproxy as dataproxy

sr = SeqRepo("data/references")
dp = dataproxy.SeqRepoDataProxy(sr)
dp

<ga4gh.vrs.dataproxy.SeqRepoDataProxy at 0x7fb309be7d60>

In [4]:
sr.translate_identifier('NC_045512.2')

['MD5:105c82802b67521950854a851fc6eefd',
 'NCBI:NC_045512.2',
 'refseq:NC_045512.2',
 'SEGUID:TGmvT2vKTXx8/+dSNEdefxwTreY',
 'SHA1:4c69af4f6bca4d7c7cffe75234475e7f1c13ade6',
 'VMC:GS_SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D',
 'sha512t24u:SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D',
 'ga4gh:SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D']

In [5]:
dp.get_sequence('NCBI:NC_045512.2', start=10, end=20)

'TATACCTTCC'
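
VRS intervals use interbase (0-based, half-open) coordinates, so `start=10, end=20` selects the same ten bases as the Python slice `[10:20]`, and an interval with `start == end`, like the one used for `allele` earlier, marks an insertion point between bases. The sketch below illustrates this on a plain string. `PREFIX` is assumed to be the first 20 bases of NC_045512.2 (only bases 10 to 20 are confirmed by the output above), and `apply_allele` is a hypothetical helper, not part of vrs-python.

```python
# Assumed first 20 bases of NC_045512.2; check against your own FASTA file.
PREFIX = "ATTAAAGGTT" + "TATACCTTCC"


def apply_allele(sequence: str, start: int, end: int, state: str) -> str:
    """Replace the interbase interval [start, end) of sequence with state."""
    return sequence[:start] + state + sequence[end:]


# Interbase coordinates line up with Python slicing.
print(PREFIX[10:20])

# start == end replaces nothing, so the state is inserted before index 10.
print(apply_allele(PREFIX, 10, 10, "ATC"))

# A one-base interval is a substitution of the base at index 10.
print(apply_allele(PREFIX, 10, 11, "A"))
```

The first call mirrors `dp.get_sequence(..., start=10, end=20)`; the second shows why an allele with `start=10, end=10` and state `ATC` lengthens the sequence by three bases.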

In [6]:
n = normalize(allele, dp)
n

<Allele _id=None location=<SequenceLocation _id=None interval=<SimpleInterval end=<Literal<int> 10> start=<Literal<int> 10> type=<Literal<str> SimpleInterval>> sequence_id=<Literal<str> refseq:NC_045512.2> type=<Literal<str> SequenceLocation>> state=<SequenceState sequence=<Literal<str> ATC> type=<Literal<str> SequenceState>> type=<Literal<str> Allele>>

## Hashing

Hashing is used to produce a unique, content-derived identifier for each variant.

In [7]:
from ga4gh.core import sha512t24u

sha512t24u(b"ACGT")

'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
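
The digest above can be reproduced with the standard library alone. As the name suggests, `sha512t24u` takes the SHA-512 digest of the input, truncates it to the first 24 bytes, and base64url-encodes the result. The function below is a minimal re-implementation for illustration, named `sha512t24u_sketch` so it does not shadow the import; real code should keep using `ga4gh.core.sha512t24u`.

```python
import base64
import hashlib


def sha512t24u_sketch(blob: bytes) -> str:
    """Truncated SHA-512 digest: first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    truncated = digest[:24]
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(truncated).decode("ascii")


print(sha512t24u_sketch(b"ACGT"))
```

For `b"ACGT"` this should print the same 32-character string as the cell above.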

In [13]:
from ga4gh.core import ga4gh_identify

location = models.SequenceLocation(
    sequence_id="refseq:NC_045512.2",
    interval=models.SimpleInterval(start=11, end=12))
allele2 = models.Allele(location=location,
                        state=models.SequenceState(sequence="G"))

ga4gh_identify(allele2)

'ga4gh:VA.tmFduBOOW5xeKfZUEwLybg88W490taiZ'

## Variation Set

In [18]:
import json
def ppo(o):
    """pretty print object as json"""
    print(json.dumps(o.as_dict(), indent=2))

a1 = models.Allele(location=models.SequenceLocation(
                        sequence_id="refseq:NC_045512.2",
                        interval=models.SimpleInterval(start=10, end=11)
                      ),
                   state=models.SequenceState(sequence="A"),
)

a2 = models.Allele(location=models.SequenceLocation(
                        sequence_id="refseq:NC_045512.2",
                        interval=models.SimpleInterval(start=20, end=21)
                      ),
                   state=models.SequenceState(sequence="T"),
)


vs = models.VariationSet(members=[a1,a2])
ppo(vs)

{
  "members": [
    {
      "location": {
        "interval": {
          "end": 11,
          "start": 10,
          "type": "SimpleInterval"
        },
        "sequence_id": "refseq:NC_045512.2",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "A",
        "type": "SequenceState"
      },
      "type": "Allele"
    },
    {
      "location": {
        "interval": {
          "end": 21,
          "start": 20,
          "type": "SimpleInterval"
        },
        "sequence_id": "refseq:NC_045512.2",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "T",
        "type": "SequenceState"
      },
      "type": "Allele"
    }
  ],
  "type": "VariationSet"
}


In [22]:
vs_referenced = models.VariationSet(members=[ga4gh_identify(a) for a in [a1,a2]])
vs_referenced.as_dict()

{'members': ['ga4gh:VA.rqHqpdFHmoZuAb5ENO0li9oN8NPTvfj6',
  'ga4gh:VA.x0nUy9gmIIQ9RARvi5rt6UpbdkBpkJf5'],
 'type': 'VariationSet'}

In [27]:
print(ga4gh_identify(vs_referenced))
print(ga4gh_identify(vs))

ga4gh:VS.Hfdd7bqexeBvX8QF_HyRYXzF1jePhnDu
ga4gh:VS.Hfdd7bqexeBvX8QF_HyRYXzF1jePhnDu
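
The two identifiers are identical, which suggests that inline members are reduced to their own digests before the set is hashed, so a set built from full alleles and a set built from their identifiers serialize to the same bytes. The sketch below illustrates that idea with a simplified scheme: canonical JSON with sorted keys, and members replaced by their sorted digests. The helpers `toy_digest` and `toy_identify_set` are hypothetical, and this scheme is not expected to reproduce the exact identifiers that `ga4gh_identify` returns.

```python
import base64
import hashlib
import json


def toy_digest(obj) -> str:
    """Digest an object as canonical JSON: sorted keys, no whitespace."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def toy_identify_set(members) -> str:
    """Identify a set whose members are inline dicts or 'ga4gh:VA.<digest>' strings."""
    digests = sorted(
        m.split(".", 1)[1] if isinstance(m, str) else toy_digest(m)
        for m in members
    )
    return "ga4gh:VS." + toy_digest({"members": digests, "type": "VariationSet"})


# Simplified stand-ins for a1 and a2, not real VRS serializations.
allele_a = {"start": 10, "end": 11, "state": "A"}
allele_b = {"start": 20, "end": 21, "state": "T"}

inline = toy_identify_set([allele_a, allele_b])
referenced = toy_identify_set(["ga4gh:VA." + toy_digest(a) for a in (allele_a, allele_b)])
print(inline == referenced)  # True: both forms hash the same member digests
```

Because the member digests are sorted before hashing, this toy scheme also ignores the order in which members are listed.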


## Haplotypes

In [39]:
ht = models.Haplotype(members=[a1, a2])
ppo(ht)

{
  "members": [
    {
      "location": {
        "interval": {
          "end": 11,
          "start": 10,
          "type": "SimpleInterval"
        },
        "sequence_id": "refseq:NC_045512.2",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "A",
        "type": "SequenceState"
      },
      "type": "Allele"
    },
    {
      "location": {
        "interval": {
          "end": 21,
          "start": 20,
          "type": "SimpleInterval"
        },
        "sequence_id": "refseq:NC_045512.2",
        "type": "SequenceLocation"
      },
      "state": {
        "sequence": "T",
        "type": "SequenceState"
      },
      "type": "Allele"
    }
  ],
  "type": "Haplotype"
}


In [40]:
ga4gh_identify(ht)

'ga4gh:VH.b7udRSiJbDJEwViJta97xqyutOusWvti'

## Variant translation

In [43]:
from ga4gh.vrs.extras.translator import Translator

tlr = Translator(data_proxy=dp,
                 translate_sequence_identifiers=True,
                 normalize=True,
                 identify=True)
tlr

<ga4gh.vrs.extras.translator.Translator at 0x7fb2f19398b0>

In [66]:
a = tlr.translate_from("NC_045512.2:10:11:A")
print(ga4gh_identify(a))
print(ga4gh_identify(a1))
a

ga4gh:VA.Ejez0NZaxrvPYgpDDEhi7GkrZTi5DdWE
ga4gh:VA.rqHqpdFHmoZuAb5ENO0li9oN8NPTvfj6


<Allele _id=<Literal<str> ga4gh:VA.Ejez0NZaxrvPYgpDDEhi7GkrZTi5DdWE> location=<SequenceLocation _id=None interval=<SimpleInterval end=<Literal<int> 21> start=<Literal<int> 10> type=<Literal<str> SimpleInterval>> sequence_id=<Literal<str> ga4gh:SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D> type=<Literal<str> SequenceLocation>> state=<SequenceState sequence=<Literal<str> A> type=<Literal<str> SequenceState>> type=<Literal<str> Allele>>
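
One visible reason the two identifiers differ is that the string is read as SPDI (`sequence:position:deletion:insertion`), where the third field is the deleted sequence or its length rather than an end coordinate. `NC_045512.2:10:11:A` therefore deletes 11 bases starting at position 10, giving the interval 10 to 21 seen in the output, whereas `a1` spans 10 to 11. The translated allele also carries a `ga4gh:SQ.` sequence identifier instead of `refseq:`, which may affect the digest as well. The sketch below illustrates the SPDI reading with a hypothetical helper, `spdi_to_interval`; it is not the `Translator` implementation and assumes a well-formed four-field string.

```python
def spdi_to_interval(spdi: str):
    """Split an SPDI string into (accession, start, end, inserted sequence).

    Hypothetical helper for illustration only: the deletion field may be
    either a length (digits) or the deleted sequence itself.
    """
    accession, position, deletion, insertion = spdi.split(":")
    start = int(position)
    # A numeric deletion field is a length; otherwise use the sequence length.
    deleted_length = int(deletion) if deletion.isdigit() else len(deletion)
    return accession, start, start + deleted_length, insertion


print(spdi_to_interval("NC_045512.2:10:11:A"))  # end is 10 + 11 = 21
print(spdi_to_interval("NC_045512.2:10:1:A"))   # end is 11, the same span as a1
```

To describe the same span as `a1`, the SPDI string would need a deletion length of 1, as in the second call.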

## Feature-based locations

In [34]:
#gl1 = models.GeneLocation(gene="ncbigene:672")
#gl1.as_dict()