# Demo: Using the New GraphQL API for SNP Data

## Setting Up API Connection

The notebook starts by defining where the API lives: the base URL is read from the project settings, and the GraphQL endpoint path is appended to it when a query is sent.

In [1]:
import requests
import json
import pandas as pd
from config.settings import settings

BASE_URL = settings.API_URL
GRAPHQL_ENDPOINT = 'graphql'
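
The requests later in this notebook build their URLs by plain concatenation (for example `f"{BASE_URL}annotations"`), which only works if `settings.API_URL` ends with a slash. The following is a minimal sketch of a helper that tolerates either form. The name `build_url` and the example base URL are hypothetical and are not part of the API or the project settings.

```python
def build_url(base_url: str, path: str) -> str:
    """Join a base URL and a relative path with exactly one slash between them."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


# Hypothetical base URL, used only for illustration
EXAMPLE_BASE_URL = "https://example.org/api"

print(build_url(EXAMPLE_BASE_URL, "graphql"))
print(build_url(EXAMPLE_BASE_URL + "/", "/annotations"))
```

Both calls produce a URL with a single slash before the path, whether or not the base URL carries a trailing slash.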

## Understanding Annotations in the API

A GET request to the `annotations` endpoint returns the list of annotations. Each annotation describes a data field or group of fields available through the API: its name, a description, and its place in a hierarchy expressed through `id` and `parent_id`. Together they act as a structured catalog of what you can query.

**api_field:** The field name to use in API requests, in particular when writing GraphQL queries. Using it verbatim ensures the query names fields the API recognizes.

In [2]:
# Fetch the annotation catalog and fail early on an HTTP error
response = requests.get(f"{BASE_URL}annotations")
response.raise_for_status()

annotations = response.json()
annotations

{'results': [{'id': '0',
   'leaf': False,
   'name': 'root',
   'label': 'Annotation',
   'sort': 0.0},
  {'id': '1',
   'parent_id': '0',
   'leaf': False,
   'name': 'Basic Info',
   'detail': 'Basic information about the variant, such as chromosome number, position, etc.',
   'sort': 1.0},
  {'id': '26',
   'parent_id': '0',
   'leaf': False,
   'name': 'ANNOVAR',
   'detail': 'Pre-computed ANNOVAR annotations for all alternative SNVs based on human reference genome hg19',
   'link': 'http://annovar.openbioinformatics.org/en/latest/user-guide/download/',
   'pmid': '20601685',
   'sort': 2.0},
  {'id': '208',
   'parent_id': '0',
   'leaf': False,
   'name': 'SnpEff',
   'detail': 'AnpEff is a program for annotating and predicting the effects of single nucleotide polymorphisms',
   'link': 'http://pcingola.github.io/SnpEff/',
   'pmid': '22728672',
   'sort': 3.0},
  {'id': '132',
   'parent_id': '0',
   'leaf': False,
   'name': 'VEP',
   'detail': 'Variant Effect Predictor (VEP) 
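
The `parent_id` links in the output above encode the hierarchy. As a minimal sketch, the records can be grouped by parent to list the direct children of each node. The helper name `children_by_parent` is hypothetical, and the sample records are trimmed copies of entries from the output above, keeping only the keys needed here.

```python
from collections import defaultdict


def children_by_parent(records):
    """Map each parent id to the names of its direct children, ordered by 'sort'."""
    grouped = defaultdict(list)
    for rec in sorted(records, key=lambda r: r.get("sort", 0.0)):
        parent = rec.get("parent_id")  # the root record has no parent_id
        if parent is not None:
            grouped[parent].append(rec["name"])
    return dict(grouped)


# Trimmed copies of records shown in the output above
sample = [
    {"id": "0", "leaf": False, "name": "root", "sort": 0.0},
    {"id": "26", "parent_id": "0", "leaf": False, "name": "ANNOVAR", "sort": 2.0},
    {"id": "1", "parent_id": "0", "leaf": False, "name": "Basic Info", "sort": 1.0},
    {"id": "208", "parent_id": "0", "leaf": False, "name": "SnpEff", "sort": 3.0},
]

tree = children_by_parent(sample)
print(tree["0"])  # ['Basic Info', 'ANNOVAR', 'SnpEff']
```

With the real response, pass `annotations['results']` instead of `sample`.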

## Extracting SNP Data Through a GraphQL Query

Next, the notebook builds a GraphQL query that fetches single nucleotide polymorphisms (SNPs) on a given chromosome within a position range. The query names exactly the fields it needs (here the alleles, chromosome, position, dbSNP identifier, and two ANNOVAR effect annotations), so the server returns only those fields. The response holds one record per SNP, ready for further processing or analysis.

In [3]:
query = """
query MyQuery {
  GetSNPsByChromosome(chr: "1", end: 1000000, start: 10) {
    alt {
      value
    }
    chr {
      value
    }
    pos {
      value
    }
    rs_dbSNP151 {
      value
    }
    ref {
      value
    }
    ANNOVAR_ensembl_Effect {
      value
    }
    ANNOVAR_refseq_Effect {
      value
    }
  }
}
"""

# POST the query as a JSON body to the GraphQL endpoint
response = requests.post(f"{BASE_URL}{GRAPHQL_ENDPOINT}", json={'query': query})
response.raise_for_status()

data = response.json()
snps_by_chromosome = data['data']['GetSNPsByChromosome']
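
The cell above indexes straight into `data['data']`, which gives an unhelpful error if the query itself failed. Under the GraphQL specification, a failed request reports its problems in a top-level `errors` list. The following is a minimal sketch of a checked extraction. The helper name `extract_field` and both sample payloads are hypothetical and are not real output from this API.

```python
def extract_field(payload: dict, field: str):
    """Return payload['data'][field], raising if the response reports errors."""
    errors = payload.get("errors")
    if errors:
        # Each error is normally an object with a 'message' key
        messages = "; ".join(
            str(e.get("message", e)) if isinstance(e, dict) else str(e)
            for e in errors
        )
        raise RuntimeError(f"GraphQL query failed: {messages}")
    data = payload.get("data") or {}
    if field not in data:
        raise KeyError(f"Field {field!r} missing from GraphQL response")
    return data[field]


# Hypothetical payloads shaped like GraphQL responses
ok = {"data": {"GetSNPsByChromosome": [{"pos": {"value": 54353}}]}}
failed = {"errors": [{"message": "Unknown argument"}]}

print(extract_field(ok, "GetSNPsByChromosome"))
try:
    extract_field(failed, "GetSNPsByChromosome")
except RuntimeError as exc:
    print(exc)
```

In the notebook, this could stand in for the direct indexing: `snps_by_chromosome = extract_field(response.json(), 'GetSNPsByChromosome')`.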

## Processing and Displaying the Data

Each field in the response is wrapped in an object with a single `value` key. To prepare the data for analysis, the notebook flattens every record into a plain dictionary that maps field name to value, then loads the result into a pandas DataFrame.

In [4]:
# Unwrap each {'value': ...} object; fields returned as null stay None
flattened_data = [
    {k: v['value'] if v is not None else None for k, v in record.items()}
    for record in snps_by_chromosome
]
snp_df = pd.DataFrame(flattened_data)
snp_df

index,alt,chr,pos,rs_dbSNP151,ref,ANNOVAR_ensembl_Effect,ANNOVAR_refseq_Effect
0,A,1,54353,rs140052487,C,ncRNA_intronic|downstream,intergenic
1,G,1,54763,rs548455890,T,ncRNA_intronic|downstream,intergenic
2,C,1,55427,rs183189405,T,downstream,intergenic
3,A,1,56586,rs541979596,G,downstream,intergenic
4,C,1,56644,rs143342222,A,downstream,intergenic
5,C,1,57033,rs2691311,T,downstream,intergenic
6,C,1,62055,rs559425327,G,upstream,intergenic
7,A,1,62162,rs140556834,G,upstream,intergenic
8,G,1,64670,rs545257650,A,upstream|downstream,upstream
9,G,1,64904,rs1452689085,T,upstream|downstream,upstream
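
The flattening step can also be written as a small function, which makes it easy to check on a sample before running it on a full response. The following is a minimal sketch. The function name `flatten_records` is hypothetical, and the sample record reuses values from the first row of the table above, with assumed value types and one `None` field added to show how nulls pass through.

```python
def flatten_records(records):
    """Unwrap {'value': ...} objects so each record maps field name to plain value."""
    return [
        {
            key: (wrapped["value"] if wrapped is not None else None)
            for key, wrapped in record.items()
        }
        for record in records
    ]


sample = [
    {
        "alt": {"value": "A"},
        "chr": {"value": "1"},
        "pos": {"value": 54353},
        "rs_dbSNP151": {"value": "rs140052487"},
        "ref": {"value": "C"},
        "ANNOVAR_refseq_Effect": None,  # illustrative null, not real API output
    }
]

flat = flatten_records(sample)
print(flat[0]["rs_dbSNP151"], flat[0]["ANNOVAR_refseq_Effect"])
```

The returned list can be passed to `pd.DataFrame(...)` exactly as `flattened_data` is in the cell above.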
