# Building a Drug Response Report

In this notebook, you focus on a single region of the genome, chromosome 22, for all 2,548 samples in the 1000 Genomes dataset. Chromosome 22 was the first chromosome to be sequenced as part of the Human Genome Project, so it is your first here as well.

The question to answer: which small molecules or drugs are most likely to affect a subpopulation of individuals (defined by ancestry, age, and so on), based on their genomic information?

For this query, assume that you have some phenotype data about your population. Specifically, assume that all samples whose IDs start with “NA12” are part of a specific demographic.

**NOTE: Declare the names of the "variant" and "annotation" tables in the "Define Variables" section so that they match the names given at the "Create Resource Link" stage of the solution.**

In this query, filter on sampleid so that the predicate can be pushed down to the data, which is partitioned by sample ID. The general steps are:

1. Filter to the samples in your subpopulation
2. Aggregate variant frequencies for the subpopulation of interest
3. Join with the ClinVar dataset
4. Keep only variants that have been implicated in drug response
5. Order the variants by frequency, highest first
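
The five steps above can be sketched in plain Python on a handful of made-up records. This is only an illustration of the logic, not the Athena query itself: the toy `variants` and `annotations` data, the field names, and the `drug_response_report` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical toy stand-ins for the variant and annotation tables.
variants = [
    {"sampleid": "NA12003", "start": 100, "ref": "G", "alt": "T", "calls": (0, 1)},
    {"sampleid": "NA12004", "start": 100, "ref": "G", "alt": "T", "calls": (0, 1)},
    {"sampleid": "NA12004", "start": 200, "ref": "C", "alt": "A", "calls": (1, 1)},
    {"sampleid": "HG00096", "start": 100, "ref": "G", "alt": "T", "calls": (0, 1)},
    {"sampleid": "NA12003", "start": 300, "ref": "A", "alt": "G", "calls": (0, 0)},
]
# Keyed by (start, ref, alt); the value plays the role of the CLNSIG attribute.
annotations = {
    (100, "G", "T"): "drug_response",
    (200, "C", "A"): "Pathogenic",
    (300, "A", "G"): "drug_response|_risk_factor",
}


def drug_response_report(variants, annotations, prefix="NA12"):
    # 1. Filter to the samples in the subpopulation.
    cohort = [v for v in variants if v["sampleid"].startswith(prefix)]
    num_samples = len({v["sampleid"] for v in cohort})

    # 2. Count how many cohort rows share each (site, genotype) combination.
    counts = Counter((v["start"], v["ref"], v["alt"], v["calls"]) for v in cohort)

    report = []
    for (start, ref, alt, calls), n in counts.items():
        # 3. Inner join with the annotations on the variant key.
        significance = annotations.get((start, ref, alt))
        # 4. Keep only annotations that mention a drug response.
        if significance is None or "response" not in significance:
            continue
        report.append({
            "start": start,
            "calls": calls,
            "clinical_significance": significance,
            "genotypefrequency": n / num_samples,
        })

    # 5. Order by frequency, highest first.
    report.sort(key=lambda row: row["genotypefrequency"], reverse=True)
    return report


report = drug_response_report(variants, annotations)
for row in report:
    print(row)
```

As in the SQL later in this notebook, the frequency is computed per genotype (`calls`) at a site, with the number of distinct samples in the subpopulation as the denominator.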

The raw ClinVar data and a Parquet version of chromosome 22 of 1000 Genomes, partitioned by sample ID, are in your data lake. The data lake also holds a VCF for chromosome 22 of 1000 Genomes.

### Import Dependencies

In [1]:
import os

import boto3

# AWS clients and resources used in this notebook
s3 = boto3.resource('s3')
glue = boto3.client('glue')
cfn = boto3.client('cloudformation')

In [29]:
import sys
!{sys.executable} -m pip install PyAthena --quiet

In [30]:
from pyathena import connect
import pandas as pd
from pyathena.pandas.util import as_pandas

### Define Variables

In [32]:
import jmespath

session = boto3.session.Session()
region = session.region_name
print(region)

project_name = os.environ['RESOURCE_PREFIX']  # raises KeyError early if the variable is unset
database_name = project_name.lower()
work_group_name = project_name.lower() + '-' + region
print(f'Project Name: {project_name}')
print(f'Database Name: {database_name}')
print(f'Workgroup Name: {work_group_name}')

# Look up the data lake bucket name in the outputs of the pipeline CloudFormation stack
resources = cfn.describe_stacks(StackName=f'{project_name}-Pipeline')
query = 'Stacks[].Outputs[?OutputKey==`DataLakeBucket`].OutputValue'
data_lake_bucket = jmespath.search(query, resources)[0][0]
print(f'Data lake bucket: {data_lake_bucket}')

variant_table_name = 'variants'
annotation_table_name = 'annotations'


us-east-1
Project Name: GenomicsAnalysis
Database Name: genomicsanalysis
Workgroup Name: genomicsanalysis-us-east-1
Data lake bucket: genomicsanalysis-pipeline-datalakebucket-mdr2tq03e5w4
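
The JMESPath expression in the cell above returns one list per stack, each holding the matching output values, which is why the result is indexed with `[0][0]`. A rough pure-Python equivalent on a mocked response may make that shape easier to see. The mock dictionary, its values, and the `find_output_values` helper are illustrative assumptions, and the mock contains only the keys the expression touches.

```python
# Mock of the relevant part of a describe_stacks response (values are made up).
resources = {
    "Stacks": [
        {
            "StackName": "example-Pipeline",
            "Outputs": [
                {"OutputKey": "DataLakeBucket", "OutputValue": "example-data-lake-bucket"},
                {"OutputKey": "OtherOutput", "OutputValue": "something-else"},
            ],
        }
    ]
}


def find_output_values(response, output_key):
    """Return one list per stack containing the OutputValue of matching outputs."""
    return [
        [
            output["OutputValue"]
            for output in stack.get("Outputs", [])
            if output.get("OutputKey") == output_key
        ]
        for stack in response["Stacks"]
    ]


matches = find_output_values(resources, "DataLakeBucket")
print(matches)

# First stack, first matching output: the same indexing the notebook uses.
data_lake_bucket = matches[0][0]
print(data_lake_bucket)
```

One difference to keep in mind: for a stack with no `Outputs` key, this helper returns an empty list, whereas a JMESPath projection would drop that stack from the result.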


### Create drug response result set

In [33]:
conn = connect(s3_staging_dir='s3://%s/results/drug_response' % data_lake_bucket, region_name=region, schema_name=database_name)
cursor = conn.cursor(work_group=work_group_name)
query = f"""
SELECT  count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency 
    ,cv.attributes['RS'] as rs_id
    ,cv.attributes['CLNDN'] as clinvar_disease_name
    ,cv.attributes['CLNSIG'] as clinical_significance
    ,sv.contigname
    ,sv.start
    ,sv."end"
    ,sv.referenceallele
    ,sv.alternatealleles
    ,sv.calls
        FROM {variant_table_name} sv 
        CROSS JOIN 
            (SELECT count(1) AS numsamples 
            FROM 
                (SELECT DISTINCT vs.sampleid 
                FROM {variant_table_name} vs
                WHERE vs.sampleid LIKE 'NA12%')) 
        JOIN {annotation_table_name} cv 
        ON sv.contigname = cv.contigname 
            AND sv.start = cv.start 
            AND sv."end" = cv."end" 
            AND sv.referenceallele = cv.referenceallele 
            AND sv.alternatealleles = cv.alternatealleles
            AND cv.attributes['CLNSIG'] LIKE '%response%' 
            AND sv.sampleid LIKE 'NA12%' 
        GROUP BY  sv.contigname 
                  ,sv.start 
                  ,sv."end" 
                  ,sv.referenceallele 
                  ,sv.alternatealleles
                  ,sv.calls
                  ,cv.attributes['RS']
                  ,cv.attributes['CLNDN']
                  ,cv.attributes['CLNSIG'] 
                  ,numsamples 
        ORDER BY genotypefrequency DESC LIMIT 50 
               """
cursor.execute(query)

df = as_pandas(cursor)
df

|  | genotypefrequency | rs_id | clinvar_disease_name | clinical_significance | contigname | start | end | referenceallele | alternatealleles | calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 6267 | Schizophrenia,_susceptibility_to\|Tramadol_resp... | drug_response\|_risk_factor | 22 | 19962739 | 19962740 | G | [T] | [0, 0] |
| 1 | 1.0 | 554056486 | Tramadol_response | drug_response | 22 | 19970008 | 19970009 | G | [A] | [0, 0] |
| 2 | 1.0 | 544846648 | Tramadol_response | drug_response | 22 | 19962831 | 19962832 | C | [T] | [0, 0] |
| 3 | 1.0 | 11569716 | Tramadol_response | drug_response | 22 | 19962223 | 19962224 | T | [C] | [0, 0] |
| 4 | 1.0 | 201225516 | Tramadol_response | drug_response | 22 | 19964186 | 19964187 | C | [T] | [0, 0] |
| 5 | 1.0 | 561536243 | Tramadol_response | drug_response | 22 | 19962540 | 19962541 | G | [A] | [0, 0] |
| 6 | 1.0 | 548235125 | Tramadol_response | drug_response | 22 | 19969904 | 19969905 | T | [C] | [0, 0] |
| 7 | 0.984615 | 35481270 | Tramadol_response | drug_response | 22 | 19969443 | 19969444 | C | [T] | [0, 0] |
| 8 | 0.969231 | 188159376 | Tramadol_response | drug_response | 22 | 19951706 | 19951707 | C | [T] | [0, 0] |
| 9 | 0.953846 | 35478083 | Tramadol_response | drug_response | 22 | 19969361 | 19969362 | T | [C] | [0, 0] |
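
Two details of the result set above are worth spelling out. First, `genotypefrequency` is the number of subpopulation rows with a given genotype at a site, divided by the number of distinct samples. The three non-unit frequencies shown match 64/65, 63/65, and 62/65 after rounding, which suggests a cohort of 65 samples; that size is inferred from the output, not stated anywhere in the notebook. Second, in the VCF genotype convention a call of `[0, 0]` is homozygous reference, so these top rows describe how common the reference genotype is at each site. The sketch below checks the arithmetic and shows one way to keep only non-reference genotypes. It assumes `calls` arrives as a string such as `"[0, 1]"`, and the helper names and the second sample row are hypothetical.

```python
import ast


def genotype_frequency(sample_count, num_samples):
    """Fraction of the subpopulation that carries a given genotype."""
    return sample_count / num_samples


def is_non_reference(calls):
    """True if any allele index in a string such as "[0, 1]" is above 0.

    Assumption: calls is a string representation of a list of integers.
    Missing calls encoded as -1 are treated as not carrying an alternate allele.
    """
    return any(allele > 0 for allele in ast.literal_eval(calls))


# 65 is an inferred cohort size (see the text above), not a documented value.
for count in (65, 64, 63, 62):
    print(count, round(genotype_frequency(count, 65), 6))

# The second row is invented to show a non-reference genotype.
rows = [
    {"rs_id": "6267", "calls": "[0, 0]"},
    {"rs_id": "hypothetical", "calls": "[0, 1]"},
]
non_reference = [row for row in rows if is_non_reference(row["calls"])]
print(non_reference)
```

If the goal is to rank variants that the subpopulation actually carries, a filter of this kind (or an equivalent condition in the SQL) would be applied before ordering by frequency.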


### Query annotation dataset

In [34]:
conn = connect(s3_staging_dir='s3://%s/results/annotation/clinvar' % data_lake_bucket, region_name=region, schema_name=database_name)
cursor = conn.cursor(work_group=work_group_name)
cursor.execute(f'SELECT * FROM {annotation_table_name} limit 10')
df = as_pandas(cursor)
df

|  | importjobid | contigname | start | end | names | referenceallele | alternatealleles | qual | filters | splitfrommultiallelic | ... | calls | genotypelikelihoods | phredlikelihoods | alleledepths | conditionalquality | spl | depth | ps | sampleid | information |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | efdbfdf9-81b8-43e7-bca8-60bc3834a2bc | 11 | 17442745 | 17442746 | [370657] | G | [GT] |  |  | False | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | efdbfdf9-81b8-43e7-bca8-60bc3834a2bc | 12 | 132638015 | 132638016 | [1008494] | G | [A] |  |  | False | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | efdbfdf9-81b8-43e7-bca8-60bc3834a2bc | 4 | 177439635 | 177439637 | [371375] | TA | [T] |  |  | False | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | efdbfdf9-81b8-43e7-bca8-60bc3834a2bc | 11 | 17442747 | 17442748 | [1107489] | C | [T] |  |  | False | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | efdbfdf9-81b8-43e7-bca8-60bc3834a2bc | 11 | 17442751 | 17442752 | [1502031] | A | [G] |  |  | False | ... |  |  |  |  |  |  |  |  |  |  |
| 5 | efdbfdf9-81b8-43e7-bca8-60bc3834a2bc | 15 | 89776452 | 89776453 | [1575155] | C | [T] |  |  | False | ... |  |  |  |  |  |  |  |  |  |  |
| 6 | efdbfdf9-81b8-43e7-bca8-60bc3834a2bc | 1 | 925951 | 925952 | [1019397] | G | [A] |  |  | False | ... |  |  |  |  |  |  |  |  |  |  |
| 7 | efdbfdf9-81b8-43e7-bca8-60bc3834a2bc | 12 | 132638016 | 132638020 | [484509] | CTGG | [C] |  |  | False | ... |  |  |  |  |  |  |  |  |  |  |
| 8 | efdbfdf9-81b8-43e7-bca8-60bc3834a2bc | 12 | 132638019 | 132638020 | [1405963] | G | [C] |  |  | False | ... |  |  |  |  |  |  |  |  |  |  |
| 9 | efdbfdf9-81b8-43e7-bca8-60bc3834a2bc | 19 | 42293924 | 42293925 | [717577] | G | [A] |  |  | False | ... |  |  |  |  |  |  |  |  |  |  |


### Query cohort dataset

In [35]:
conn = connect(s3_staging_dir='s3://%s/results/variants/' % data_lake_bucket, region_name=region, schema_name=database_name)
cursor = conn.cursor(work_group=work_group_name)
cursor.execute(f"SELECT * FROM {variant_table_name} WHERE sampleid LIKE 'NA12%' limit 10")
df = as_pandas(cursor)
df


|  | importjobid | contigname | start | end | names | referenceallele | alternatealleles | qual | filters | splitfrommultiallelic | ... | calls | genotypelikelihoods | phredlikelihoods | alleledepths | conditionalquality | spl | depth | ps | sampleid | information |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 204afc74-76ae-44db-9428-89cc546c3c7c | 22 | 41061122 | 41061123 |  | T | [C] |  | [PASS] | False | ... | [0, 0] |  |  |  |  |  |  |  | NA12003 |  |
| 1 | 204afc74-76ae-44db-9428-89cc546c3c7c | 22 | 41061122 | 41061123 |  | T | [C] |  | [PASS] | False | ... | [0, 0] |  |  |  |  |  |  |  | NA12004 |  |
| 2 | 204afc74-76ae-44db-9428-89cc546c3c7c | 22 | 41061122 | 41061123 |  | T | [C] |  | [PASS] | False | ... | [0, 0] |  |  |  |  |  |  |  | NA12005 |  |
| 3 | 204afc74-76ae-44db-9428-89cc546c3c7c | 22 | 41061122 | 41061123 |  | T | [C] |  | [PASS] | False | ... | [0, 0] |  |  |  |  |  |  |  | NA12006 |  |
| 4 | 204afc74-76ae-44db-9428-89cc546c3c7c | 22 | 41061122 | 41061123 |  | T | [C] |  | [PASS] | False | ... | [0, 0] |  |  |  |  |  |  |  | NA12043 |  |
| 5 | 204afc74-76ae-44db-9428-89cc546c3c7c | 22 | 41061122 | 41061123 |  | T | [C] |  | [PASS] | False | ... | [0, 0] |  |  |  |  |  |  |  | NA12044 |  |
| 6 | 204afc74-76ae-44db-9428-89cc546c3c7c | 22 | 41061122 | 41061123 |  | T | [C] |  | [PASS] | False | ... | [0, 0] |  |  |  |  |  |  |  | NA12045 |  |
| 7 | 204afc74-76ae-44db-9428-89cc546c3c7c | 22 | 41061122 | 41061123 |  | T | [C] |  | [PASS] | False | ... | [0, 0] |  |  |  |  |  |  |  | NA12046 |  |
| 8 | 204afc74-76ae-44db-9428-89cc546c3c7c | 22 | 41061122 | 41061123 |  | T | [C] |  | [PASS] | False | ... | [0, 0] |  |  |  |  |  |  |  | NA12058 |  |
| 9 | 204afc74-76ae-44db-9428-89cc546c3c7c | 22 | 41061122 | 41061123 |  | T | [C] |  | [PASS] | False | ... | [0, 0] |  |  |  |  |  |  |  | NA12144 |  |


### Query individual variant dataset

In [36]:
conn = connect(s3_staging_dir='s3://%s/results/vcf/' % data_lake_bucket, region_name=region, schema_name=database_name)
cursor = conn.cursor(work_group=work_group_name)
cursor.execute(f"SELECT * FROM {variant_table_name} where sampleid='default' limit 10")
df = as_pandas(cursor)
df


|  | importjobid | contigname | start | end | names | referenceallele | alternatealleles | qual | filters | splitfrommultiallelic | ... | calls | genotypelikelihoods | phredlikelihoods | alleledepths | conditionalquality | spl | depth | ps | sampleid | information |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 571ce61e-f0b5-40b1-993b-650ab2af8284 | chr6 | 118833153 | 118833154 |  | C | [T] | 75.0 | [LowGQX] | False | ... | [1, 1] |  | [111, 9, 0] | [0, 3] | 7 |  | 3 |  | default | {min_dp=null, adf=[0, 2], genotype_filters=[Lo... |
| 1 | 571ce61e-f0b5-40b1-993b-650ab2af8284 | chr6 | 118840547 | 118840548 |  | G | [T] | 11.0 | [LowDepth, LowGQX] | False | ... | [0, 1] |  | [46, 3, 0] | [0, 1] | 3 |  | 1 |  | default | {min_dp=null, adf=[0, 1], genotype_filters=[Lo... |
| 2 | 571ce61e-f0b5-40b1-993b-650ab2af8284 | chr6 | 118850771 | 118850772 |  | G | [T] | 10.0 | [LowDepth, LowGQX] | False | ... | [0, 1] |  | [44, 3, 0] | [0, 1] | 3 |  | 1 |  | default | {min_dp=null, adf=[0, 0], genotype_filters=[Lo... |
| 3 | 571ce61e-f0b5-40b1-993b-650ab2af8284 | chr6 | 118856568 | 118856569 |  | C | [T] | 11.0 | [LowDepth, LowGQX] | False | ... | [0, 1] |  | [46, 3, 0] | [0, 1] | 3 |  | 1 |  | default | {min_dp=null, adf=[0, 0], genotype_filters=[Lo... |
| 4 | 571ce61e-f0b5-40b1-993b-650ab2af8284 | chr6 | 118872336 | 118872337 |  | G | [A] | 11.0 | [LowDepth, LowGQX] | False | ... | [0, 1] |  | [46, 3, 0] | [0, 1] | 3 |  | 1 |  | default | {min_dp=null, adf=[0, 0], genotype_filters=[Lo... |
| 5 | 571ce61e-f0b5-40b1-993b-650ab2af8284 | chr6 | 118874817 | 118874818 |  | T | [C] | 7.0 | [LowDepth, LowGQX] | False | ... | [0, 1] |  | [41, 3, 0] | [0, 1] | 3 |  | 1 |  | default | {min_dp=null, adf=[0, 0], genotype_filters=[Lo... |
| 6 | 571ce61e-f0b5-40b1-993b-650ab2af8284 | chr6 | 118890083 | 118890084 |  | G | [T] | 11.0 | [LowDepth, LowGQX] | False | ... | [0, 1] |  | [46, 3, 0] | [0, 1] | 3 |  | 1 |  | default | {min_dp=null, adf=[0, 1], genotype_filters=[Lo... |
| 7 | 571ce61e-f0b5-40b1-993b-650ab2af8284 | chr6 | 118893858 | 118893859 |  | A | [C] | 4.0 | [LowDepth, LowGQX] | False | ... | [0, 1] |  | [37, 3, 0] | [0, 1] | 3 |  | 1 |  | default | {min_dp=null, adf=[0, 1], genotype_filters=[Lo... |
| 8 | 571ce61e-f0b5-40b1-993b-650ab2af8284 | chr6 | 118894541 | 118894542 |  | G | [A] | 1.0 | [LowGQX] | False | ... | [0, 1] |  | [27, 0, 81] | [4, 1] | 25 |  | 5 |  | default | {min_dp=null, adf=[0, 0], genotype_filters=[Lo... |
| 9 | 571ce61e-f0b5-40b1-993b-650ab2af8284 | chr6 | 118894557 | 118894558 |  | G | [A] | 26.0 | [LowGQX] | False | ... | [0, 1] |  | [60, 0, 82] | [5, 4] | 58 |  | 9 |  | default | {min_dp=null, adf=[0, 0], genotype_filters=[Lo... |
