# Sample Queries on the 1000 Genomes, gnomAD and ClinVar data lake

In this notebook, we will demonstrate some sample genomics queries that are typically made by clinical geneticists and researchers on genomics variant data. We will use the parquet/ORC transformed variant data from the 3502 DRAGEN-reanalyzed 1000 Genomes dataset available in 3 different schemas at s3://1000genomes-DRAGEN-data-lake-ready/:

1. var_partby_samples - Dataset partitioned by sample ID
2. var_partby_chrom - Dataset partitioned by chromosome and bucketed by samples
3. var_nested - Nested schema consisting of variant sites with sample IDs and genotypes that contain the variant

We use the annotations from ClinVar that are available at https://registry.opendata.aws/clinvar/ to demonstrate how to make queries that use the raw variant data with annotations.

## Import Dependencies

In [57]:
import boto3, os
import pandas as pd

s3 = boto3.resource('s3')
glue = boto3.client('glue')
cfn = boto3.client('cloudformation')
import sys
!{sys.executable} -m pip install PyAthena


You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.[0m


In [58]:
session = boto3.session.Session()
region = session.region_name
print(region)

us-east-1


In [59]:
databasename="\"1kg_full\""
partitioned_chr = "\"var_sortby\""
partitioned_samples = "\"var_partby_samples\""
nested = "\"var_nested\""

### Connect to Athena

In [60]:
from pyathena import connect
from pyathena.pandas.util import as_pandas
from pyathena.async_cursor import AsyncCursor, AsyncDictCursor
from pyathena.error import NotSupportedError, ProgrammingError
from pyathena.model import AthenaQueryExecution
from pyathena.result_set import AthenaResultSet

conn = connect(s3_staging_dir='s3://athena-query-results-ss',region_name=region) #replace with your own bucket name
cursor = conn.cursor()

def execute_query_async(query):
    query_summary = '''Query execution summary:
        DataScanned: {}
        ExecutionTime(s): {}
        QueuingTime(s): {}'''

    df = None
    acursor = conn.cursor(AsyncCursor)
    query_id, future = acursor.execute(query)
    result_set = future.result()
    if result_set.state == AthenaQueryExecution.STATE_SUCCEEDED:
        print(query_summary.format(result_set.data_scanned_in_bytes, 
                                   result_set.engine_execution_time_in_millis/1000, 
                                   result_set.query_queue_time_in_millis/1000))    
        rows = result_set.fetchall()
        cols = [x[0] for x in result_set.description]
        df = pd.DataFrame(rows, columns=cols)
        
    acursor.close()
    return df

#### Let us explore the schema of the tables under the 1000 genomes transformed dataset

Here is the flat table schema for the data sorted by sample ID

In [61]:
cursor.execute("SELECT * from " + databasename + "." + partitioned_samples + "limit 10")
df = as_pandas(cursor)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 41 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   variant_id                     10 non-null     object 
 1   chrom                          10 non-null     object 
 2   pos                            10 non-null     int64  
 3   alleles                        10 non-null     object 
 4   rsid                           0 non-null      object 
 5   qual                           10 non-null     float64
 6   filters                        10 non-null     object 
 7   info.ac                        10 non-null     object 
 8   info.af                        10 non-null     object 
 9   info.an                        10 non-null     int64  
 10  info.db                        10 non-null     bool   
 11  info.dp                        10 non-null     int64  
 12  info.end                       0 non-null      object

Here is the schema for the flat table partitioned by chromosome and bucketed by samples. As you can see, the schema is very similar to the one partitioned by samples, except that "chrom" is the field that the data is partitioned by

In [30]:
cursor.execute("SELECT * from " + databasename + "." + partitioned_chr + "limit 10")
df = as_pandas(cursor)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 40 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   variant_id                     10 non-null     object 
 1   pos                            10 non-null     int64  
 2   ref                            10 non-null     object 
 3   alt                            10 non-null     object 
 4   sample_id                      10 non-null     object 
 5   alleles                        10 non-null     object 
 6   rsid                           0 non-null      object 
 7   qual                           10 non-null     float64
 8   filters                        10 non-null     object 
 9   info.ac                        0 non-null      object 
 10  info.af                        0 non-null      object 
 11  info.an                        0 non-null      object 
 12  info.db                        10 non-null     bool  

Here is the nested schema. In this schema, most of the FORMAT and INFO fields are not retained. Each row is a variant site with an array consisting of sample IDs and genotypes

In [31]:
cursor.execute("SELECT * from " + databasename +"." + nested + " limit 10")
df = as_pandas(cursor)
df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   variant_id  10 non-null     object
 1   pos         10 non-null     int64 
 2   ref         10 non-null     object
 3   alt         10 non-null     object
 4   samples     10 non-null     object
 5   chrom       10 non-null     object
dtypes: int64(1), object(5)
memory usage: 608.0+ bytes


Unnamed: 0,variant_id,pos,ref,alt,samples,chrom
0,chrY:G:C:2949388,2949388,G,C,"[{id=NA18504, gts=[1]}, {id=HG03812, gts=[1]},...",chrY
1,chrY:A:T:4634780,4634780,A,T,"[{id=NA18504, gts=[1]}, {id=HG01986, gts=[1]},...",chrY
2,chrY:C:T:4758611,4758611,C,T,"[{id=NA18504, gts=[1]}, {id=HG01986, gts=[1]},...",chrY
3,chrY:C:G:4806400,4806400,C,G,"[{id=NA18504, gts=[1]}, {id=HG01986, gts=[1]},...",chrY
4,chrY:C:CGTATATATATGTGTATATATATACGTGTATATATACAT...,4896733,C,CGTATATATATGTGTATATATATACGTGTATATATACAT,"[{id=NA18504, gts=[1]}, {id=HG01986, gts=[1]},...",chrY
5,chrY:C:CTA:5667108,5667108,C,CTA,"[{id=NA18504, gts=[1]}, {id=HG01986, gts=[1]},...",chrY
6,chrY:T:G:6792036,6792036,T,G,"[{id=NA18504, gts=[1]}, {id=HG01986, gts=[1]},...",chrY
7,chrY:T:TTG:7524147,7524147,T,TTG,"[{id=NA18504, gts=[1]}, {id=HG01986, gts=[1]},...",chrY
8,chrY:T:TG:7664244,7664244,T,TG,"[{id=NA18504, gts=[1]}, {id=HG01986, gts=[1]},...",chrY
9,chrY:T:TA:8067814,8067814,T,TA,"[{id=NA18504, gts=[1]}, {id=HG01986, gts=[1]},...",chrY


### Queries on variant data alone
Here are some examples of queries on just the variant data with no annotations

**Example 1:** Query for variants that are on a specific gene BRCA1 (chr17:43044295-43125364) in a specific sample HG02625.
This query will run faster and will need to scan only data within the specific sample partition, so we will use the dataset that is partitioned by sample

In [62]:
query = """SELECT * from {}.{}
  WHERE chrom='chr17'
  AND pos BETWEEN 43044295 AND 43125364 
  AND sample_id = 'HG02625' """.format(databasename,partitioned_samples)

df = execute_query_async(query)
df

Query execution summary:
        DataScanned: 47353926
        ExecutionTime(s): 5.781
        QueuingTime(s): 0.194


Unnamed: 0,variant_id,chrom,pos,alleles,rsid,qual,filters,info.ac,info.af,info.an,...,mb,pl,pri,ps,sb,sq,sample_id,ref,alt,partition_0
0,chr17:G:A:43044391,chr17,43044391,"[G, A]",,45.73,[],[1],[0.5],2,...,"[12, 17, 9, 10]","[81, 0, 50]","[0.0, 34.77, 37.77]",,"[13, 16, 13, 6]",,HG02625,G,A,HG02625
1,chr17:CTT:C:43044804,chr17,43044804,"[CTT, C, CTTTT]",,308.73,[],"[1, 1]","[0.5, 0.5]",2,...,"[0, 0, 17, 10]","[317, 263, 52, 391, 0, 54]","[0.0, 4.0, 7.0, 4.0, 8.0, 7.0]",,"[0, 0, 17, 10]",,HG02625,CTT,C,HG02625
2,chr17:C:A:43045257,chr17,43045257,"[C, A]",,49.94,[],[1],[0.5],2,...,"[9, 10, 11, 7]","[85, 0, 50]","[0.0, 34.77, 37.77]",,"[12, 7, 8, 10]",,HG02625,C,A,HG02625
3,chr17:A:G:43046604,chr17,43046604,"[A, G]",,48.94,[],[1],[0.5],2,...,"[9, 8, 5, 8]","[84, 0, 50]","[0.0, 34.77, 37.77]",,"[11, 6, 9, 4]",,HG02625,A,G,HG02625
4,chr17:C:CA:43046757,chr17,43046757,"[C, CA]",,50.00,[],[1],[0.5],2,...,"[8, 5, 9, 6]","[56, 0, 50]","[0.0, 6.0, 9.0]",,"[7, 6, 10, 5]",,HG02625,C,CA,HG02625
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
206,chr17:GTT:G:43123349,chr17,43123349,"[GTT, G]",,50.00,[],[1],[0.5],2,...,"[3, 5, 7, 3]","[52, 0, 50]","[0.0, 2.0, 5.0]",,"[5, 3, 4, 6]",,HG02625,GTT,G,HG02625
207,chr17:A:G:43123628,chr17,43123628,"[A, G]",,47.93,[],[1],[0.5],2,...,"[10, 12, 6, 10]","[83, 0, 50]","[0.0, 34.77, 37.77]",,"[12, 10, 8, 8]",,HG02625,A,G,HG02625
208,chr17:A:G:43124230,chr17,43124230,"[A, G]",,50.00,[],[1],[0.5],2,...,"[9, 7, 6, 12]","[85, 0, 50]","[0.0, 34.77, 37.77]",,"[8, 8, 13, 5]",,HG02625,A,G,HG02625
209,chr17:T:C:43124331,chr17,43124331,"[T, C]",,50.00,[],[1],[0.5],2,...,"[6, 5, 6, 6]","[85, 0, 50]","[0.0, 34.77, 37.77]",,"[10, 1, 8, 4]",,HG02625,T,C,HG02625


**Example 2:** Query to find samples that have variants in a specific region, in this case, the BRCA1 gene (chr17:43044295-43125364). 
This query can be run on the data that is partitioned by chromosome 

In [63]:
query = """SELECT DISTINCT(sample_id) from {}.{}
  where chrom='chr17' 
  and pos between 43044295 and 43125364 """.format(databasename, partitioned_chr)

df=execute_query_async(query)
df

Query execution summary:
        DataScanned: 915862188
        ExecutionTime(s): 2.794
        QueuingTime(s): 0.183


Unnamed: 0,sample_id
0,HG01200
1,NA18988
2,NA19651
3,NA18559
4,NA12335
...,...
3197,HG02602
3198,NA21110
3199,HG03763
3200,HG01162


**Example 2a:** Query 2 on the nested data

In [11]:
query = """SELECT DISTINCT(sample.id) from (
  SELECT samples from {}.{}
  where chrom='chr1'
  and pos between 9033567 and 9142000
  ) as f,
  unnest(f.samples) as s(sample)
  order by 1""".format(databasename,nested)
df=execute_query_async(query)
df

Query execution summary:
        DataScanned: 1482113335
        ExecutionTime(s): 6.756
        QueuingTime(s): 0.324


Unnamed: 0,id
0,HG00096
1,HG00097
2,HG00099
3,HG00100
4,HG00101
...,...
3197,NA21137
3198,NA21141
3199,NA21142
3200,NA21143


**Example 3:** Query to find only samples that have homozygous variants in a specific region, in this case, the BRCA1 gene (chr17:43044295-43125364). 

In [13]:
query = """SELECT DISTINCT(sample_id) from {}.{}
  where chrom='chr17'
  and pos between 43044295 and 43125364
  and array_join(gt.alleles, '|') in ('0|1')""".format(databasename,partitioned_chr)

df=execute_query_async(query)
df

### Queries joining variant data with ClinVar annotation data

**ClinVar has a number of tables, but the one that we will use here is the summary_variants table. This table has many fields, but we will only be using a subset of them. We also need to create a Variant ID that can be used to join with the raw variant tables, so to make this easier, we will create a "view" of the summary_variants table that we can use with subsequent queries**

In [93]:
cursor.execute("create or replace view clinvar as "
       "select concat('chr', chromosome, ':', referenceallelevcf, ':', alternateallelevcf, ':', cast(positionvcf as varchar)) as variant_id, "
       "chromosome as chrom, "
       "positionvcf as pos, "
       "referenceallelevcf as ref, "
       "alternateallelevcf as alt, "
       "genesymbol as genename, " 
       "clinicalsignificance as clinicalsignificance, "
       "numbersubmitters as num_submitters, "
       "reviewstatus as reviewstatus, "
       "split(phenotypeids, ',') as phenotypeids, "
       "split(phenotypelist, ',') as phenotypelist "
       "from \"clinvar_summary_variants_dl-awsroda\".\"variant_summary\" where assembly = 'GRCh38'"
        "and referenceallelevcf <> 'na' and alternateallelevcf <> 'na'")


<pyathena.cursor.Cursor at 0x7ffabeb462b0>

**Example 4:** Modify query 3 to find the number of samples that have "variants of Unknown Significance" in the BRCA1 gene

In [65]:
query = """SELECT count(DISTINCT(sample.id)) FROM (
       SELECT samples FROM {}.{} AS v 
       JOIN clinvar a 
       ON v.variant_id = a.variant_id 
       WHERE a.clinicalsignificance = 'Uncertain significance' 
       AND v.chrom='chr17' 
       AND v.pos BETWEEN 43044295 AND 43125364 
       ) AS f, 
       UNNEST(f.samples) AS s(sample) """.format(databasename, nested)
df=execute_query_async(query)
df

Query execution summary:
        DataScanned: 667942669
        ExecutionTime(s): 5.764
        QueuingTime(s): 0.17


Unnamed: 0,_col0
0,1550


**Example 4a:** To have a greater degree of confidence in the results, we want to only consider those entries in ClinVar that have more than 1 submission. This query will filter the results from Example 4 to only consider those that have > 1 submitter.

In [18]:
query = """SELECT count(DISTINCT(sample.id)) from (
       SELECT samples FROM {}.{} as v 
       JOIN clinvar a
       ON v.variant_id = a.variant_id
       WHERE a.clinicalsignificance = 'Uncertain significance' 
       AND v.chrom='chr17' 
       AND v.pos BETWEEN 43044295 AND 43125364 
       AND a.num_submitters > 1 
       ) AS f, 
       UNNEST(f.samples) AS s(sample)  """.format(databasename, nested)
df=execute_query_async(query)
df

Query execution summary:
        DataScanned: 668341959
        ExecutionTime(s): 5.507
        QueuingTime(s): 0.167


Unnamed: 0,_col0
0,41


**Example 5a:**  Find all variants in this dataset that have a pathogenic variant in the BRCA1 gene (using the gene annotation from ClinVar). Running the same query using the chromosome and position range for the BRCA1 gene is much faster due to the partitioning of the data.

In [19]:
query = """SELECT COUNT(DISTINCT sample_id) FROM {}.{} as v 
           JOIN clinvar AS a  
           ON v.variant_id = a.variant_id 
           WHERE a.genename='BRCA1' AND clinicalsignificance='Pathogenic' """.format(databasename,partitioned_chr)

df=execute_query_async(query)
df

Query execution summary:
        DataScanned: 92508510549
        ExecutionTime(s): 78.06
        QueuingTime(s): 0.201


Unnamed: 0,_col0
0,4


**Example 5b:** Run Query 5a using the nested dataset

In [66]:
query = """SELECT COUNT(DISTINCT sample.id) FROM 
           (SELECT samples from {}.{} AS v  
           JOIN clinvar AS a 
           ON v.variant_id = a.variant_id 
           WHERE a.genename='BRCA1' AND clinicalsignificance='Pathogenic') AS f, 
           UNNEST(f.samples) AS s(sample)""".format(databasename, nested)

df=execute_query_async(query)
df

Query execution summary:
        DataScanned: 23065300512
        ExecutionTime(s): 47.786
        QueuingTime(s): 0.136


Unnamed: 0,_col0
0,4


### Queries using 1000 genomes with gnomAD and ClinVar

The gnomAD sites data is available at s3://juayu-sampledata/geno_dataset/gnomad/sites/. We will first create a View of the gnomAD table with the fields we are going to be using in this set of queries to simplify things. 

In [87]:
cursor.execute("create or replace view gnomad as "
       " select concat(\"locus.contig\", ':', alleles[1], ':', alleles[2], ':', cast(\"locus.position\" as varchar)) as variant_id "
       ", \"locus.contig\" as chrom"
       ", \"locus.position\" as pos"
       ", alleles[1] as ref"
       ", alleles[2] as alt"
       ", rsid" 
       ", filters"
       ", \"info.ac\" as info_ac"
       ", \"info.an\" as info_an"
       ", \"info.af\" as info_af"
       ", \"info.popmax\" as info_popmax"
       ", partition_0 "
"from gnomad.sites " 
"where cardinality(filters) = 0")

<pyathena.cursor.Cursor at 0x7ffabeb462b0>

**gnomAD view**

In [88]:
query = "select * from gnomad limit 10"
df = execute_query_async(query)
df

Query execution summary:
        DataScanned: 79099723
        ExecutionTime(s): 1.826
        QueuingTime(s): 0.195


Unnamed: 0,variant_id,chrom,pos,ref,alt,rsid,filters,info_ac,info_an,info_af,info_popmax,partition_0
0,chr18:T:G:35554345,chr18,35554345,T,G,rs1263024,[],[4],92048,[4.34556E-5],[afr],chr18
1,chr18:T:G:35554346,chr18,35554346,T,G,rs1157614334,[],[1],148450,[6.73627E-6],[afr],chr18
2,chr18:T:G:35554347,chr18,35554347,T,G,rs546890215,[],[123],149644,[8.21951E-4],[afr],chr18
3,chr18:T:TG:35554347,chr18,35554347,T,TG,rs1258316118,[],[54],149644,[3.60856E-4],[amr],chr18
4,chr18:T:C:35554348,chr18,35554348,T,C,,[],[3],149782,[2.00291E-5],[sas],chr18
5,chr18:T:G:35554348,chr18,35554348,T,G,rs567107368,[],[103],149782,[6.87666E-4],[afr],chr18
6,chr18:T:TG:35554348,chr18,35554348,T,TG,rs1482316494,[],[62],149782,[4.13935E-4],[sas],chr18
7,chr18:T:G:35554349,chr18,35554349,T,G,rs943666262,[],[77],149914,[5.13628E-4],[nfe],chr18
8,chr18:T:TG:35554349,chr18,35554349,T,TG,rs1039625077,[],[1],149916,[6.6704E-6],[amr],chr18
9,chr18:T:G:35554350,chr18,35554350,T,G,rs1400847255,[],[6],149982,[4.00048E-5],[sas],chr18


**Example 6:** Find all rare variants (Minor Allele frequency < 0.01) in the BRCA1 gene from gnomAD. We will use the view we created above.

In [79]:
query = "SELECT COUNT(*) from gnomad \
         WHERE partition_0 = 'chr17' \
         AND (cardinality(info_af) > 0 and info_af[1] < 0.01) \
         AND pos BETWEEN 43044295 and 43125364"
df=execute_query_async(query)
df

Query execution summary:
        DataScanned: 343072
        ExecutionTime(s): 2.611
        QueuingTime(s): 0.153


Unnamed: 0,_col0
0,16938


**Example 7:** Let us now find how many samples in the 1000 genomes dataset have rare variants (at least 2 subjects with Minor allele frequency < 0.01) in the BRCA1 region.

In [91]:
query = """SELECT COUNT(DISTINCT sample_id) FROM {}.{} as v 
           JOIN gnomad AS a  
           ON v.variant_id = a.variant_id 
           WHERE v.chrom='chr17' AND
           v.pos BETWEEN 43044295 AND 43125364 AND
           cardinality(a.info_af) > 0 AND a.info_af[1] < 0.01 
           AND a.info_an > 1 """.format(databasename,partitioned_chr)
df=execute_query_async(query)
df

Query execution summary:
        DataScanned: 14558490366
        ExecutionTime(s): 28.527
        QueuingTime(s): 0.114


Unnamed: 0,_col0
0,3068


**Example 7a:** Find all samples in the 1000 genomes dataset have rare variants (at least 2 subjects with Minor allele frequency < 0.01) in the BRCA1 region and are labeled Pathogenic or Likely pathogenic in ClinVar.

In [97]:
query = """SELECT * FROM {}.{} as v 
           JOIN gnomad AS a  
           ON v.variant_id = a.variant_id 
           JOIN clinvar as clin
           on a.variant_id = clin.variant_id
           WHERE v.chrom='chr17' AND
           v.pos BETWEEN 43044295 AND 43125364 AND
           cardinality(a.info_af) > 0 AND a.info_af[1] < 0.01 
           AND a.info_an > 1 
           AND (clin.clinicalsignificance ='Pathogenic' OR
                clin.clinicalsignificance = 'Likely pathogenic')""".format(databasename,partitioned_chr)
df=execute_query_async(query)
df

Query execution summary:
        DataScanned: 40625906637
        ExecutionTime(s): 26.783
        QueuingTime(s): 0.168


Unnamed: 0,variant_id,pos,ref,alt,sample_id,alleles,rsid,qual,filters,info.ac,...,chrom,pos.1,ref.1,alt.1,genename,clinicalsignificance,num_submitters,reviewstatus,phenotypeids,phenotypelist
0,chr17:G:A:43092128,43092128,G,A,HG03929,"[G, A]",,49.67,[],[1],...,17,43092128,G,A,BRCA1,Pathogenic,11,reviewed by expert panel,"[MONDO:MONDO:0011450, MedGen:C2676676, OMIM:60...","[Breast-ovarian cancer, familial 1|not provid..."
1,chr17:C:T:43082403,43082403,C,T,HG00365,"[C, T]",,50.0,[],[1],...,17,43082403,C,T,BRCA1,Pathogenic,16,reviewed by expert panel,"[MONDO:MONDO:0011450, MedGen:C2676676, OMIM:60...","[Breast-ovarian cancer, familial 1|not provid..."
2,chr17:G:A:43124063,43124063,G,A,HG02164,"[G, A]",,50.0,[],[1],...,17,43124063,G,A,BRCA1,Pathogenic,14,reviewed by expert panel,"[MONDO:MONDO:0011450, MedGen:C2676676, OMIM:60...","[Breast-ovarian cancer, familial 1|not provid..."


In [None]:
cursor.close()
conn.close()