# Query EVA data
In this notebook we try to replicate the API requests made by [EBI EVA](https://www.ebi.ac.uk/eva/). Start by importing the required modules and declaring a few objects:

In [1]:
import io
import difflib
import requests

import Bio.SeqIO

from src.data.common import AssemblyConf
from src.features.smarterdb import global_connection, VariantGoat

In [2]:
session = requests.session()
conn = global_connection()
dbSNP152 = AssemblyConf(version="CHI1.0", imported_from="dbSNP152")

## Getting info about the service

In [3]:
response = session.get("https://www.ebi.ac.uk/eva/webservices/release/v1/info/latest")

In [4]:
response.status_code

200

In [5]:
response.json()

{'releaseVersion': 4,
 'releaseDate': '2022-11-21T00:00:00',
 'releaseDescription': 'The fourth EVA RS Release is done in the continuation of release 3 updating 5 reference genomes including chicken, rainbow trout and marmoset and adding support for new species such as haplochromine. The released variants with genomic coordinates are available from our website, API, and downloadable from our FTP in VCF. Released variants without coordinates are only available through FTP in text format.',
 'releaseFtp': 'https://ftp.ebi.ac.uk/pub/databases/eva/rs_releases/release_4/'}

## Getting info by rsID
To retrieve information by `rsID`, pass its numeric part (without the `rs` prefix) to the `clustered-variants` endpoint:

In [6]:
rs_id = "rs268293069"
response = session.get(f"https://www.ebi.ac.uk/eva/webservices/identifiers/v1/clustered-variants/{rs_id.replace('rs', '')}")
response.status_code

200
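The endpoint takes the numeric accession rather than the full `rsID`. `str.replace` works for well-formed identifiers, but it does not validate the input. Below is a minimal sketch of a stricter conversion; `rs_to_accession` is a hypothetical helper name, not part of the EVA API or of this project:

```python
import re

# Hypothetical helper: accepts only "rs" followed by digits.
_RS_PATTERN = re.compile(r"rs(\d+)")


def rs_to_accession(rs_id: str) -> int:
    """Return the numeric accession of an rsID such as 'rs268293069'."""
    match = _RS_PATTERN.fullmatch(rs_id.strip())
    if match is None:
        raise ValueError(f"not a valid rsID: {rs_id!r}")
    return int(match.group(1))


accession = rs_to_accession("rs268293069")
print(accession)
```

The returned integer can then be interpolated into the `clustered-variants` URL used above.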

In [7]:
clustered = response.json()
clustered

[{'accession': 268293069,
  'version': 1,
  'data': {'assemblyAccession': 'GCA_000317765.1',
   'taxonomyAccession': 9925,
   'contig': 'CM001715.1',
   'start': 82680594,
   'type': 'SNV',
   'validated': False,
   'mapWeight': None,
   'createdDate': '2012-08-17T00:44:00'}},
 {'accession': 268293069,
  'version': 1,
  'data': {'assemblyAccession': 'GCA_001704415.1',
   'taxonomyAccession': 9925,
   'contig': 'CM004567.1',
   'start': 85981710,
   'type': 'SNV',
   'validated': False,
   'mapWeight': None,
   'createdDate': '2022-01-06T05:02:04.287'}}]

This response contains neither the species name, nor chromosome names, nor the assembly name (like *ARS1* or *CHI1.0*).

## Get information on assembly

In [8]:
response = session.get("https://www.ebi.ac.uk/eva/webservices/rest/v1/meta/species/list")
response.status_code

200

This returns information on all species. Keep only the goat assemblies by filtering on the taxonomy ID:

In [9]:
species_list = response.json()
goat_assemblies = list(filter(lambda assembly: assembly['taxonomyId'] == 9925, species_list['response'][0]['result']))
goat_assemblies

[{'assemblyAccession': 'GCA_000317765.1',
  'assemblyChain': 'GCA_000317765',
  'assemblyVersion': '1',
  'assemblyName': 'CHIR_1.0',
  'assemblyCode': '10',
  'taxonomyId': 9925,
  'taxonomyCommonName': 'Goat',
  'taxonomyScientificName': 'Capra hircus',
  'taxonomyCode': 'chircus',
  'taxonomyEvaName': 'goat'},
 {'assemblyAccession': 'GCA_001704415.1',
  'assemblyChain': 'GCA_001704415',
  'assemblyVersion': '1',
  'assemblyName': 'ARS1',
  'assemblyCode': 'ars1',
  'taxonomyId': 9925,
  'taxonomyCommonName': 'Goat',
  'taxonomyScientificName': 'Capra hircus',
  'taxonomyCode': 'chircus',
  'taxonomyEvaName': 'goat'}]
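Since the clustered-variant records only carry the assembly accession, it can help to index these assembly records by accession. The sketch below uses records trimmed from the output above. `by_accession` and `species_key` are hypothetical names; the `<taxonomyCode>_<assemblyCode>` format is the one used for the `species` parameter in the position query later in this notebook:

```python
# Records trimmed from the `species/list` output above (goat entries only).
goat_assemblies = [
    {"assemblyAccession": "GCA_000317765.1", "assemblyName": "CHIR_1.0",
     "assemblyCode": "10", "taxonomyCode": "chircus"},
    {"assemblyAccession": "GCA_001704415.1", "assemblyName": "ARS1",
     "assemblyCode": "ars1", "taxonomyCode": "chircus"},
]

# Index by NCBI accession so variant records can be labelled by assembly name.
by_accession = {a["assemblyAccession"]: a for a in goat_assemblies}


def species_key(assembly: dict) -> str:
    """Build the `<taxonomyCode>_<assemblyCode>` string (hypothetical helper)."""
    return f"{assembly['taxonomyCode']}_{assembly['assemblyCode']}"


for accession, assembly in by_accession.items():
    print(accession, assembly["assemblyName"], species_key(assembly))
```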

Here I can find the correspondence between the *NCBI Accession* and the *Assembly Name*. I can also retrieve the `assemblyCode` and `taxonomyCode`, which are both required to collect studies. There's another endpoint from which I can collect information about assemblies:

In [10]:
response = session.get("https://www.ebi.ac.uk/eva/webservices/rest/v1/meta/species/accessioned")
response.status_code

200

In [11]:
species_accessioned = response.json()
list(filter(lambda assembly: assembly['taxonomyId'] == 9925, species_list['response'][0]['result']))

[{'assemblyAccession': 'GCA_000317765.1',
  'assemblyChain': 'GCA_000317765',
  'assemblyVersion': '1',
  'assemblyName': 'CHIR_1.0',
  'assemblyCode': '10',
  'taxonomyId': 9925,
  'taxonomyCommonName': 'Goat',
  'taxonomyScientificName': 'Capra hircus',
  'taxonomyCode': 'chircus',
  'taxonomyEvaName': 'goat'},
 {'assemblyAccession': 'GCA_001704415.1',
  'assemblyChain': 'GCA_001704415',
  'assemblyVersion': '1',
  'assemblyName': 'ARS1',
  'assemblyCode': 'ars1',
  'taxonomyId': 9925,
  'taxonomyCommonName': 'Goat',
  'taxonomyScientificName': 'Capra hircus',
  'taxonomyCode': 'chircus',
  'taxonomyEvaName': 'goat'}]

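This looks identical to the information I get from `species/list`. Note, however, that the cell above filters `species_list` again rather than `species_accessioned`, so an identical output is expected and the new response is not actually inspected here.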

## Getting chromosome names
This type of query will be to a different endpoint. Let's concentrate on *CHI1.0* only:

In [12]:
chi1 = next(filter(lambda assembly: assembly['assemblyName'] == 'CHIR_1.0', goat_assemblies))

Now filter only the SNP I want:

In [13]:
snp = next(filter(lambda snp: snp['data']['assemblyAccession'] == chi1['assemblyAccession'], clustered))['data']
snp

{'assemblyAccession': 'GCA_000317765.1',
 'taxonomyAccession': 9925,
 'contig': 'CM001715.1',
 'start': 82680594,
 'type': 'SNV',
 'validated': False,
 'mapWeight': None,
 'createdDate': '2012-08-17T00:44:00'}

OK, time to resolve the contig name:

In [14]:
response = session.get(f"https://www.ebi.ac.uk/ena/browser/api/text/{snp['contig']}?lineLimit=1000&annotationOnly=true")
response.status_code

200

This response is not JSON: it is an EMBL record (without the sequence):

In [15]:
sequence = Bio.SeqIO.read(io.StringIO(response.text), "embl")
sequence

SeqRecord(seq=Seq(None, length=114334461), id='CM001715.1', name='CM001715', description='Capra hircus chromosome 6, whole genome shotgun sequence.', dbxrefs=['Project:PRJNA158393', 'MD5:6fdd43012bf7c06f97b8685101a9f810', 'ENA:AJPT010000000', 'ENA:AJPT000000000', 'BioSample:SAMN02953816', 'PubMed:23263233'])

I can get information on chromosome names by searching in features qualifiers:

In [16]:
sequence.features[0].qualifiers['chromosome'][0]

'6'
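Indexing `features[0]` assumes the first feature is the one carrying the `chromosome` qualifier. A slightly more defensive sketch looks for a feature of type `source`, on the assumption that this is where the EMBL record stores the qualifier. `chromosome_name` is a hypothetical helper, and the record below is a stand-in object so the example runs without Biopython or network access:

```python
from types import SimpleNamespace


def chromosome_name(record):
    """Return the `chromosome` qualifier of the first `source` feature, or None."""
    for feature in record.features:
        if feature.type == "source":
            values = feature.qualifiers.get("chromosome")
            if values:
                return values[0]
    return None


# Stand-in mimicking the `.features`, `.type` and `.qualifiers` attributes
# of a parsed record; in the notebook, pass `sequence` instead.
fake_record = SimpleNamespace(
    features=[SimpleNamespace(type="source", qualifiers={"chromosome": ["6"]})]
)
print(chromosome_name(fake_record))
```

Returning `None` instead of raising makes it easier to handle unplaced contigs, which may lack a `chromosome` qualifier.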

Get information from the database:

In [17]:
variants = VariantGoat.objects.filter(rs_id=rs_id)
variants

[<VariantGoat: name='snp59420-scaffold980-297169', rs_id='['rs268293069']', illumina_top='A/C'>, <VariantGoat: name='snp59995-CSN1S1-ex3-1', rs_id='['rs268293069']', illumina_top='A/C'>]

Well, this `rs_id` seems to be associated with two SNPs:

In [18]:
sequences = {}

for variant in variants:
    location = variant.get_location(**dbSNP152._asdict())
    sequence = variant.sequence["IlluminaGoatSNP50"]
    sequences[variant.name] = sequence
    print(variant.name, sequence, location)

snp59420-scaffold980-297169 TTGAGTGCTTTTGTTTTTACAGTTCTTGCATTTTTTTTTTAACAGAAACATCCAATCAAT[A/C]ACCAAGGACTCTCTCCAGTGAGTGTTCTATTCTGTTCCAAGAACTCGCTATAAATTGTGT (dbSNP152:CHI1.0) 6:82680594 [A/C]
snp59995-CSN1S1-ex3-1 TTGAGTGCTTTTGTTTTTACAGTTCTTGCATTTTTTTTTTAACAGAAACATCCTATCAAT[A/C]ACCAAGGACTCTCTCCAGTGAGTGTTCTTTTCTGTTCCAAGAACTCGCTATAAATTGTGT (dbSNP152:CHI1.0) 6:82680594 [A/C]


In [19]:
s1, s2 = sequences.values()
s1 == s2

False

They seem identical, but they aren't:

In [20]:
# https://towardsdatascience.com/side-by-side-comparison-of-strings-in-python-b9491ac858
matcher = difflib.SequenceMatcher(a=s1, b=s2)
print("Matching Sequences:")
for match in matcher.get_matching_blocks():
    print("Match             : {}".format(match))
    print("Matching Sequence : {}".format(s1[match.a:match.a+match.size]))

Matching Sequences:
Match             : Match(a=0, b=0, size=53)
Matching Sequence : TTGAGTGCTTTTGTTTTTACAGTTCTTGCATTTTTTTTTTAACAGAAACATCC
Match             : Match(a=54, b=54, size=39)
Matching Sequence : ATCAAT[A/C]ACCAAGGACTCTCTCCAGTGAGTGTTCT
Match             : Match(a=94, b=94, size=31)
Matching Sequence : TTCTGTTCCAAGAACTCGCTATAAATTGTGT
Match             : Match(a=125, b=125, size=0)
Matching Sequence : 


In [21]:
print(s1)
print(s2)

TTGAGTGCTTTTGTTTTTACAGTTCTTGCATTTTTTTTTTAACAGAAACATCCAATCAAT[A/C]ACCAAGGACTCTCTCCAGTGAGTGTTCTATTCTGTTCCAAGAACTCGCTATAAATTGTGT
TTGAGTGCTTTTGTTTTTACAGTTCTTGCATTTTTTTTTTAACAGAAACATCCTATCAAT[A/C]ACCAAGGACTCTCTCCAGTGAGTGTTCTTTTCTGTTCCAAGAACTCGCTATAAATTGTGT
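Because the two probe sequences have the same length, a position-by-position comparison is a simpler way to list the differences than reading the gaps between matching blocks. This is a minimal sketch with a hypothetical `mismatches` helper, shown on toy strings. For `s1` and `s2` above, the matching blocks imply differences at offsets 53 and 93:

```python
def mismatches(a: str, b: str):
    """Return (offset, char_a, char_b) for each position where a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Toy strings in the same flank[allele]flank layout (not the real probes).
print(mismatches("ACGT[A/C]TTGA", "ACCT[A/C]TTGT"))
```

In the notebook this would be called as `mismatches(s1, s2)`.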


So, as far as I understand, SNP `snp59420-scaffold980-297169` is correctly placed, while `snp59995-CSN1S1-ex3-1` is not.

## Search variant by position
The last type of query I can run is a search for a variant by position:

In [22]:
chrom = '6'
position = 82680594

response = session.get(f'https://www.ebi.ac.uk/eva/webservices/rest/v1/segments/{chrom}:{position}-{position}/variants?species={chi1["taxonomyCode"]}_{chi1["assemblyCode"]}&limit=10')
response.status_code

200
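The request above packs the region into the URL path as `chrom:start-end` and passes the assembly as `species=<taxonomyCode>_<assemblyCode>`. A small sketch that assembles the same URL from its parts may make this easier to reuse. `region_variants_url` and `EVA_REST` are hypothetical names introduced here, not part of any EVA client:

```python
from urllib.parse import urlencode

EVA_REST = "https://www.ebi.ac.uk/eva/webservices/rest/v1"


def region_variants_url(chrom, start, end, taxonomy_code, assembly_code, limit=10):
    """Build the segments/variants URL for a region (hypothetical helper)."""
    region = f"{chrom}:{start}-{end}"
    query = urlencode({"species": f"{taxonomy_code}_{assembly_code}", "limit": limit})
    return f"{EVA_REST}/segments/{region}/variants?{query}"


url = region_variants_url("6", 82680594, 82680594, "chircus", "10")
print(url)
```

The resulting string can be passed to `session.get` exactly as in the cell above.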

In [23]:
data = response.json()
print(f"Got {data['response'][0]['numTotalResults']} results")
results = data['response'][0]['result']
result = results[0]

Got 1 results


Get rid of `sourceEntries`, which are SNPs from projects submitted to EBI

In [24]:
del result['sourceEntries']
result

{'chromosome': '6',
 'start': 82680594,
 'end': 82680594,
 'reference': 'C',
 'alternate': 'A',
 'mainId': 'rs268293069',
 'ids': ['rs268293069', 'ss1282532123'],
 'hgvs': {'genomic': ['6:g.82680594C>A']},
 'length': 1,
 'type': 'SNV'}

## Get submitted variant information
Try to collect data on submitted variant:

In [25]:
ss_id = 'ss1282532123'
response = session.get(f"https://www.ebi.ac.uk/eva/webservices/identifiers/v1/submitted-variants/{ss_id.replace('ss', '')}")
response.status_code

200

In [26]:
response.json()

[{'accession': 1282532123,
  'version': 1,
  'data': {'referenceSequenceAccession': 'GCA_001704415.1',
   'taxonomyAccession': 9925,
   'projectAccession': 'PRJEB6057',
   'contig': 'CM004567.1',
   'start': 85981710,
   'referenceAllele': 'C',
   'alternateAllele': 'A',
   'clusteredVariantAccession': 268293069,
   'supportedByEvidence': True,
   'assemblyMatch': True,
   'allelesMatch': True,
   'validated': False,
   'mapWeight': None,
   'createdDate': '2014-08-15T23:46:00',
   'remappedFrom': 'GCA_000317765.1',
   'remappedDate': '2021-08-20T03:17:08.407',
   'remappingId': '3AE0EC08B25C825C709440295D17F5D3AC4CF5EA',
   'backPropagatedVariantAccession': None}},
 {'accession': 1282532123,
  'version': 1,
  'data': {'referenceSequenceAccession': 'GCA_000317765.1',
   'taxonomyAccession': 9925,
   'projectAccession': 'PRJEB6057',
   'contig': 'CM001715.1',
   'start': 82680594,
   'referenceAllele': 'C',
   'alternateAllele': 'A',
   'clusteredVariantAccession': 268293069,
   'suppor

Now I get two results: the second is the record for the requested `ss_id`, while the first is the same SNP remapped onto the newer *ARS1* assembly.
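To tell the two kinds of record apart programmatically, one option is to look at the `remappedFrom` field. The sketch below uses records trimmed to the fields visible in the output above. The second record is truncated in that output, so this assumes (without confirmation from the source) that an originally submitted record has no `remappedFrom` value. `split_by_origin` is a hypothetical helper:

```python
# Trimmed to fields visible in the (truncated) output above.
submitted = [
    {"accession": 1282532123,
     "data": {"referenceSequenceAccession": "GCA_001704415.1",
              "contig": "CM004567.1", "start": 85981710,
              "remappedFrom": "GCA_000317765.1"}},
    {"accession": 1282532123,
     "data": {"referenceSequenceAccession": "GCA_000317765.1",
              "contig": "CM001715.1", "start": 82680594}},
]


def split_by_origin(records):
    """Separate original records from remapped ones.

    Assumption: a missing or empty `remappedFrom` marks an original record.
    """
    original, remapped = [], []
    for record in records:
        data = record["data"]
        (remapped if data.get("remappedFrom") else original).append(data)
    return original, remapped


original, remapped = split_by_origin(submitted)
for data in remapped:
    print(data["remappedFrom"], "->", data["referenceSequenceAccession"], data["start"])
```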