# Finding ACMG Genes

Variants were annotated with the ACMG actionable gene table. Variants located in ACMG gene regions were then annotated with Kaviar allele frequency, ClinVar clinical significance ratings, DANN predicted pathogenicity scores for SNVs, and coding consequence predictions from SnpEff.

Variants were filtered for the appropriate mode of inheritance and labeled as predictive when mutation(s) were present with the appropriate inheritance (e.g., two bi-allelic pathogenic mutations for a recessive disorder).

Mutations were defined strictly as either:
- Annotated in ClinVar with a clinical significance of 4 or 5, but never 2 or 3, or labeled pathogenic in HGMD (to be added later)  
OR    
- Novel (absent from Kaviar, or present in Kaviar with a frequency below .03) and predicted to be disease-causing by either a SnpEff impact of 'high' or a DANN score above the cutoff specified below
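
The two rules above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name `is_predictive` and its argument names are hypothetical, ClinVar ratings are assumed to arrive as a collection of integer codes, a missing Kaviar entry is represented as `None`, and the HGMD check is left out because it has not been added yet.

```python
def is_predictive(clin_sig, kaviar_freq, snpeff_impact, dann,
                  kav_cutoff=0.03, dann_cutoff=0.96):
    """Return True if a variant meets either rule described above.

    clin_sig      -- iterable of ClinVar significance codes (may be empty)
    kaviar_freq   -- Kaviar allele frequency, or None if absent from Kaviar
    snpeff_impact -- SnpEff impact string such as 'HIGH', or None
    dann          -- DANN score, or None if unavailable
    """
    codes = set(clin_sig)

    # Rule 1: rated 4 or 5 in ClinVar and never rated 2 or 3
    clinvar_pathogenic = bool(codes & {4, 5}) and not (codes & {2, 3})

    # Rule 2: novel or rare in Kaviar AND predicted to be damaging
    novel = kaviar_freq is None or kaviar_freq < kav_cutoff
    high_impact = (snpeff_impact or '').upper() == 'HIGH'
    high_dann = dann is not None and dann > dann_cutoff

    return clinvar_pathogenic or (novel and (high_impact or high_dann))
```

The default cutoffs mirror the `kav_freq` and `dann_score` parameters set in the next section; pass different values to explore how sensitive the labeling is to them.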

## User Specified Parameters

In [3]:
# specify your impala host
impala_host = 'glados19'

#############
## kaviar ##
############
# enter kaviar frequency threshold
kav_freq = '.03'

###############
## family id ##
###############
# 'all' for every sample, or a tuple of sample ids to restrict the query
sample_list = 'all'

####################
## trio member(s) ##
####################
# 'all', or a list of member codes ('NB', 'M', 'F')
trio_member = 'all'

################
## dann_score ##
################
# enter minimum dann score for predicted NBS
dann_score = '0.96'

########################
## database locations ##
########################
# enter database.name for variant table
variant_table = 'training.illumina_vars'
# enter database.name of global variants table
gv_table = 'training.global_vars'
# enter user database to output tables
out_db = 'training'
# enter desired basename for output files
out_name = 'clarity_acmg_genes'

## Adding positional information to the ACMG gene list

Chromosome, start and stop positions were added to the ACMG Gene list by joining it with the ensembl_genes table using gene name.

    CREATE TABLE acmg_ensembl AS (
        SELECT acmg.*, ens.chrom, ens.start, ens.stop, ens.gene_name, 
        ens.gene_id, ens.transcript_id
        FROM p7_ref_grch37.acmg_genes acmg, p7_ref_grch37.ensembl_genes ens
        WHERE acmg.gene = ens.gene_name
     )

The results were saved as training.acmg_ensembl.
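
The join can be tried locally before running it on Impala. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Impala, with made-up toy rows (`GENE_A`, `GENE_B`, `GENE_C` and their coordinates are placeholders, and the `inheritance` column on the toy `acmg_genes` table is an assumption). It also shows one thing worth checking: an inner join silently drops any ACMG gene whose name has no match in the Ensembl table.

```python
import sqlite3

# In-memory stand-in for the Impala tables; all rows are made-up toy values.
db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE acmg_genes (gene TEXT, inheritance TEXT)")
db.execute("CREATE TABLE ensembl_genes "
           "(gene_name TEXT, chrom TEXT, start INTEGER, stop INTEGER)")
db.executemany("INSERT INTO acmg_genes VALUES (?, ?)",
               [('GENE_A', 'AD'), ('GENE_B', 'AR')])
db.executemany("INSERT INTO ensembl_genes VALUES (?, ?, ?, ?)",
               [('GENE_A', '1', 100, 200), ('GENE_C', '2', 300, 400)])

# Same join as above, written with explicit JOIN ... ON syntax.
rows = db.execute("""
    SELECT acmg.gene, acmg.inheritance, ens.chrom, ens.start, ens.stop
    FROM acmg_genes acmg
    JOIN ensembl_genes ens ON acmg.gene = ens.gene_name
""").fetchall()

# ACMG genes without an Ensembl match are dropped by the inner join;
# a LEFT JOIN filtered on NULL lists them so they can be reviewed.
unmatched = db.execute("""
    SELECT acmg.gene
    FROM acmg_genes acmg
    LEFT JOIN ensembl_genes ens ON acmg.gene = ens.gene_name
    WHERE ens.gene_name IS NULL
""").fetchall()

print(rows)
print(unmatched)
```

Gene symbols that differ between the two sources (aliases, retired names) would show up in `unmatched`, which is why the second query is useful as a sanity check.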

## Parse User Arguments

In [2]:
# format trio argument
# trio_member is expected to be 'all' or an iterable of member codes
member_list = []
if trio_member != 'all':
    for member in trio_member:
        if member == 'NB':
            member_list.append("bv.sample_id LIKE '%03'")
        if member == 'M':
            member_list.append("bv.sample_id LIKE '%01'")
        if member == 'F':
            member_list.append("bv.sample_id LIKE '%02'")

# if the member argument is not empty create statement
if len(member_list) > 0:
    member_arg = 'AND (' + ' OR '.join(member_list) + ')'
# otherwise statement is empty
else:
    member_arg = ''

# format sample id argument
# sample_list is either 'all' or a tuple of sample ids
if sample_list != 'all':
    subject_list = "AND bv.sample_id IN " + str(sample_list)
else:
    subject_list = ''

# join the non-empty user args into one statement (empty string if none)
arg_list = [subject_list, member_arg]
subject_statement = ' '.join(arg for arg in arg_list if arg)
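
For reference, the same argument parsing can be packaged as one function, which makes it easier to test with different inputs. This is a sketch, not the notebook's own code: `build_subject_statement` and `MEMBER_SUFFIX` are hypothetical names, the sample id in the usage line is made up, and the values are interpolated directly into the SQL text, so it assumes trusted input.

```python
# Sample id suffix for each trio member, as used in the LIKE patterns above.
MEMBER_SUFFIX = {'M': '01', 'F': '02', 'NB': '03'}

def build_subject_statement(sample_list='all', trio_member='all'):
    """Build the optional 'AND ...' filters appended to the variant query."""
    clauses = []

    # restrict to specific sample ids, quoting each one explicitly
    if sample_list != 'all':
        quoted = ', '.join("'{}'".format(s) for s in sample_list)
        clauses.append("AND bv.sample_id IN ({})".format(quoted))

    # restrict to specific trio members via their sample id suffix
    if trio_member != 'all':
        likes = ["bv.sample_id LIKE '%{}'".format(MEMBER_SUFFIX[m])
                 for m in trio_member]
        clauses.append('AND (' + ' OR '.join(likes) + ')')

    # empty string when no filter was requested
    return ' '.join(clauses)

# hypothetical usage: newborn and mother from one made-up sample id
print(build_subject_statement(sample_list=['101-03'], trio_member=['NB', 'M']))
```

Quoting each id explicitly avoids relying on how Python prints a tuple or list, which differs from SQL syntax for single-element tuples and for lists.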

## Connecting to Impala 

In [4]:
# connect to impala
from impala.dbapi import connect
from impala.util import as_pandas

# to connect to specific database, use the database argument
conn = connect(host=impala_host, port=21050)

In [None]:
acmg_query = '''


'''

# open cursor, run query, close cursor
def run_query(query_name, db_name):
    # note: db_name is accepted but not used to select a database
    cur = conn.cursor()
    # run query
    print('Running the following query on impala: \n' + query_name)
    cur.execute(query_name)
    print('Query finished. Please note that the 1/2 genotype was converted to 0/1 for downstream compatibility. \n Closing cursor.')
    cur.close()

# run ACMG annotation query
run_query(acmg_query, 'acmg_query')

## Examine MOI

In [None]:
# subset data frame by trio member
newborns = acmg_annot[acmg_annot['member'] == 'NB']
mothers = acmg_annot[acmg_annot['member'] == 'M']
fathers = acmg_annot[acmg_annot['member'] == 'F']

The newborn subset was split to report predictive variants for:  

- All variants in regions of dominant disorders  
- All homozygous recessive variants in regions of autosomal recessive disorders  
- All heterozygous variants in autosomal recessive regions for downstream analysis of compound heterozygosity
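
To make the three rules concrete, here is a pandas-free sketch that applies them to a handful of toy records. The function name `split_by_moi`, the bucket names, and the toy gene names are hypothetical; the keys `inheritance`, `gt`, and `predictive` follow the column names used in the cell below.

```python
def split_by_moi(variants):
    """Bucket predictive variants by inheritance mode and genotype."""
    buckets = {'dominant': [], 'hom_recessive': [], 'het_recessive': []}
    for v in variants:
        # only variants already labeled predictive are reported
        if not v['predictive']:
            continue
        if v['inheritance'] == 'AD':
            buckets['dominant'].append(v)
        elif v['inheritance'] == 'AR' and v['gt'] == '1/1':
            buckets['hom_recessive'].append(v)
        elif v['inheritance'] == 'AR' and v['gt'] == '0/1':
            # kept for the later compound heterozygosity analysis
            buckets['het_recessive'].append(v)
    return buckets

# made-up toy records, not real variants
toy = [
    {'gene': 'GENE_A', 'inheritance': 'AD', 'gt': '0/1', 'predictive': True},
    {'gene': 'GENE_B', 'inheritance': 'AR', 'gt': '1/1', 'predictive': True},
    {'gene': 'GENE_C', 'inheritance': 'AR', 'gt': '0/1', 'predictive': True},
    {'gene': 'GENE_C', 'inheritance': 'AR', 'gt': '0/1', 'predictive': False},
]
buckets = split_by_moi(toy)
```

Each predictive record lands in exactly one bucket, and the non-predictive `GENE_C` record is dropped.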

In [None]:
import pandas as pd

# silence the pandas chained-assignment (SettingWithCopy) warning for the subsets below
pd.options.mode.chained_assignment = None

# subset variants by variant MOI and/or zygosity
nb_dominant = newborns[((newborns['inheritance'] == 'AD') & (newborns['predictive'] == True))]
nb_dominant.name = 'dominant'

nb_hom_recessive = newborns[((newborns['inheritance'] == 'AR') & (newborns['gt'] == '1/1') & (newborns['predictive'] == True))]
nb_hom_recessive.name = 'hom_recessive'

nb_het = newborns[((newborns['inheritance'] == 'AR') & (newborns['gt'] == '0/1') & (newborns['predictive'] == True))]

### Locate Potential Compound Heterozygotes