# ClinVar exploratory analysis notebook - T2D demo 

### <font color=blue> Instructions for use:</font>  
    1. Copy and rename this Jupyter notebook
    2. In the new notebook: update the USER supplied variables in the first code cell  
    3. In the new notebook: run ALL cells  

##### USER supplied variables

In [None]:
## REQUIRED: path to clinvar_workflow package
CV_WKFLW_PKG_PATH = '..'

################## FileIO ##########################################################
## REQUIRED: input variant FILE - relative or absolute path
VAR_FILE = 'demo_input_variant_files/demo_variants_T2D_hg19.txt'

## REQUIRED: output file DIRECTORY - relative or absolute path
OUT_DIR = 'demo_output'

## REQUIRED: prefix for the outputs (default = '')
OUT_PREFIX = 'demo_T2D'

################## update values based on YOUR current input file ##################
## REQUIRED: Genome build: hg19 or hg38 (default = hg19)
BUILD = 'hg19'

## REQUIRED: 4 variant columns - the column order *must* remain the same, but the column names can change
COLS_VAR = ['CHR', 'POS', 'REF', 'ALT']

## optional: list of additional input columns to include in output DF
COLS_INPUT = ['dbSNP ID', 'Consequence', 'P-value', 'Odds ratio']
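
The order of `COLS_VAR` matters because the four columns are read by position: chromosome, position, reference allele, alternate allele. The following is a minimal sketch, assuming simple single-nucleotide substitutions, of how one input row could map to the HGVS-style ID shown in the `hgvs_id.hg19` output column (for example `chr1:g.17588689T>A`). `snv_hgvs_id` is a hypothetical helper for illustration and is not part of `clinvar_workflow`.

```python
def snv_hgvs_id(row, cols_var):
    """Build a genomic HGVS-style ID for a single-nucleotide substitution.

    Hypothetical helper: the names in cols_var are interpreted by position
    as chromosome, position, reference allele, alternate allele.
    """
    chrom_col, pos_col, ref_col, alt_col = cols_var
    chrom = str(row[chrom_col])
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return "{}:g.{}{}>{}".format(chrom, row[pos_col], row[ref_col], row[alt_col])


COLS_VAR = ['CHR', 'POS', 'REF', 'ALT']
row = {'CHR': 1, 'POS': 17588689, 'REF': 'T', 'ALT': 'A'}
print(snv_hgvs_id(row, COLS_VAR))  # chr1:g.17588689T>A
```

Renaming the columns is harmless as long as the positions are preserved; swapping, say, the REF and ALT entries would silently describe a different variant.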


##### Imports

In [2]:
from __future__ import print_function
import pandas as pd

import os, sys
sys.path.insert(0, os.path.abspath(CV_WKFLW_PKG_PATH))

## Custom ClinVar query workflow module
from clinvar_workflow.workflows import exploratory_analysis_workflow as cv
from clinvar_workflow.vizualization import viz_jupyter as nb

## Jupyter & ipywidgets
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from ipywidgets import HBox, VBox, AppLayout, HTML, Tab

Invoking __init__.py for clinvar_workflow
Invoking __init__.py for clinvar_workflow.helpers
Invoking __init__.py for clinvar_workflow.workflows
Invoking __init__.py for clinvar_workflow.query_clinvar
Invoking __init__.py for clinvar_workflow.vizualization


In [3]:
%%html
<style>.css_widgets {font-size:150%}</style>


## Run ClinVar Exploratory Analysis

In [4]:
results = cv.run_clinvar_exploratory_analysis(var_file=VAR_FILE, 
                                              out_dir=OUT_DIR, 
                                              out_prefix=OUT_PREFIX, 
                                              build=BUILD, 
                                              cols_var=COLS_VAR, 
                                              cols_input=COLS_INPUT)

results.keys()


Step 1: verify & process user inputs
	.. User specified variant input file exists
	.. Input variant file contains the specified variant columns
	.. User specified output directory exists & is writable


Step 2: run MyVariant ClinVar query
	.. run ClinVar query
querying 1-249...done.



pandas.io.json.json_normalize is deprecated, use pandas.json_normalize instead



	.. ClinVar query data wrangling


Step 3: process MyVariant ClinVar query results
	.. generating variant ClinVar summary (1 row per variant)
	.. adding input columns to variant summary DF --> update UNREPORTED variants
	.. adding aggregation stats, Boolean indicator columns & FLAG columns to CV variant summary DF
	.. adding variant summary DF columns to full ClinVar DF


Step 4: run Exploratory analysis
	.. converting DF for analyses & visualization
	.. generate dataset summary
	.. identify pathogenic variants
	.. add data viz & pathogenic variant DataFrames to result dictionary
	.. starting Clinical Significance exploratory analyses
		.. performing Variant Clinical Significance analysis
		.. performing Variant-Condition (RCV) Clinical Significance analysis
		.. performing Gene-based analysis
		.. performing Condition-based analysis
		.. assembling Clinical Significance analysis results


Step 5: write output files
	.. Writing ClinVar Variant summary
	.. Writing ClinVar Variant full d

dict_keys(['cv_var_summary_df', 'cv_full_df', 'input_df', 'viz_df', 'data_summary_df', 'patho_var_df', 'patho_var_detail_df', 'clinsig_var', 'clinsig_rcv', 'clinsig_var_gene', 'clinsig_var_cond'])
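
Step 1 of the log above reports three checks: the variant file exists, it contains the specified variant columns, and the output directory exists and is writable. The sketch below shows one way to run equivalent checks before starting a long query. It is not the package's own code: `check_inputs` is a hypothetical name, and the tab delimiter is an assumption about the input file format.

```python
import csv
import os


def check_inputs(var_file, out_dir, cols_var, delimiter="\t"):
    """Return a list of problems; an empty list means the inputs look usable.

    Hypothetical pre-flight check. The delimiter is an assumption.
    """
    problems = []
    if not os.path.isfile(var_file):
        problems.append("variant file not found: {}".format(var_file))
    else:
        # Only the header row is needed to confirm the variant columns.
        with open(var_file, newline="") as handle:
            header = next(csv.reader(handle, delimiter=delimiter), [])
        missing = [col for col in cols_var if col not in header]
        if missing:
            problems.append("missing variant columns: {}".format(missing))
    if not os.path.isdir(out_dir):
        problems.append("output directory not found: {}".format(out_dir))
    elif not os.access(out_dir, os.W_OK):
        problems.append("output directory not writable: {}".format(out_dir))
    return problems


print(check_inputs('demo_input_variant_files/demo_variants_T2D_hg19.txt',
                   'demo_output',
                   ['CHR', 'POS', 'REF', 'ALT']))
```

Returning a list of problems rather than raising on the first failure lets all input issues be reported in one pass.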

## Summary of current ClinVar query 

In [5]:
results['data_summary_df']

## Identify ClinVar pathogenic variants

In [6]:
results['patho_var_df']

Unnamed: 0,CHR,POS,REF,ALT,hgvs_id.hg19,rsid,gene.symbol,clinical_significance,clinical_significance.rcv.set,conditions.name,conditions.name.set,conditions.synonyms,patho_cond.set,patho_cond.nuniq,patho_cond.%,_.patho_ALL_cond,_.patho_ANY_cond,preferred_name,hg19.start,hg19.end,hg38.start,hg38.end
0,1,17588689,T,A,chr1:g.17588689T>A,rs142129409,PADI3,Pathogenic,{Pathogenic},Uncombable hair syndrome (UHS1),{Uncombable hair syndrome (UHS1)},,{Uncombable hair syndrome (UHS1)},1,1.0,True,True,NM_016233.2(PADI3):c.335T>A (p.Leu112His),17588689,17588689,17262194,17262194
1,1,23189553,G,T,chr1:g.23189553G>T,rs35882952,EPHB2,Conflicting interpretations of pathogenicity,"{Benign, Pathogenic}","[Prostate cancer/brain cancer susceptibility, ...","{Prostate cancer/brain cancer susceptibility, ...","['PROSTATE CANCER, PROGRESSION AND METASTASIS ...",{Prostate cancer/brain cancer susceptibility},1,0.5,False,True,NM_004442.7(EPHB2):c.835G>T (p.Ala279Ser),23189553,23189553,22863060,22863060
2,1,40557070,T,A,chr1:g.40557070T>A,rs137852695,PPT1,Pathogenic,{Pathogenic},"[Inborn genetic diseases, Neuronal ceroid lipo...","{Neuronal ceroid lipofuscinosis, Neuronal cero...","['CEROID LIPOFUSCINOSIS, NEURONAL, 1, VARIABLE...","{Neuronal ceroid lipofuscinosis, Neuronal cero...",4,1.0,True,True,NM_000310.3(PPT1):c.364A>T (p.Arg122Trp),40557070,40557070,40091398,40091398
3,1,52863445,C,T,chr1:g.52863445C>T,rs143141689,ORC1,Likely pathogenic,"{Likely pathogenic, Pathogenic}","[Meier-Gorlin syndrome (MGORS1), Meier-Gorlin ...","{Meier-Gorlin syndrome (MGORS1), Meier-Gorlin ...","['Ear, patella, short stature syndrome'], ['Mi...",{Meier-Gorlin syndrome (MGORS1)},1,0.5,False,True,NM_004153.4(ORC1):c.314G>A (p.Arg105Gln),52863445,52863445,52397773,52397773
4,1,159175494,C,T,chr1:g.159175494C>T,rs34599082,ACKR1,Pathogenic,{Pathogenic},"DUFFY BLOOD GROUP SYSTEM, FY(bwk) PHENOTYPE","{DUFFY BLOOD GROUP SYSTEM, FY(bwk) PHENOTYPE}",,"{DUFFY BLOOD GROUP SYSTEM, FY(bwk) PHENOTYPE}",1,1.0,True,True,NM_002036.4(ACKR1):c.265C>T (p.Arg89Cys),159175494,159175494,159205704,159205704
5,1,216143987,C,A,chr1:g.216143987C>A,rs199840367,USH2A,Pathogenic/Likely pathogenic,"{Likely pathogenic, Pathogenic/Likely pathogen...","[Retinal dystrophy, Retinitis pigmentosa (RP),...","{Retinitis pigmentosa 39 (RP39), not provided,...","['Tapetoretinal degeneration', 'Retinotapetal ...",{Retinal dystrophy},1,0.25,False,True,NM_206933.3(USH2A):c.6937G>T (p.Gly2313Cys),216143987,216143987,215970645,215970645
6,2,99779295,C,G,chr2:g.99779295C>G,rs137891647,LIPT1,Pathogenic,{Pathogenic},"[Lipoyltransferase 1 deficiency (LIPT1D), not ...","{Lipoyltransferase 1 deficiency (LIPT1D), not ...",,"{Lipoyltransferase 1 deficiency (LIPT1D), not ...",2,1.0,True,True,NM_145199.3(LIPT1):c.875C>G (p.Ser292Ter),99779295,99779295,99162832,99162832
7,3,50379904,A,G,chr3:g.50379904A>G,rs200913791,ZMYND10,Pathogenic,{Pathogenic},"[Ciliary dyskinesia, primary, 22 (CILD22), Kar...","{Kartagener syndrome (CILD1), Ciliary dyskines...","['CILIARY DYSKINESIA, PRIMARY, 1', 'CILIARY DY...","{Kartagener syndrome (CILD1), Ciliary dyskines...",3,1.0,True,True,NM_015896.4(ZMYND10):c.797T>C (p.Leu266Pro),50379904,50379904,50342473,50342473
8,3,81698005,A,G,chr3:g.81698005A>G,rs192044702,GBE1,Pathogenic,{Pathogenic},"[GBE1-Related Disorders, Glycogen storage dise...","{Glycogen storage disease IV, classic hepatic,...","['GBE1 DEFICIENCY', 'GLYCOGENOSIS IV', 'GSD IV...","{Glycogen storage disease IV, classic hepatic,...",5,1.0,True,True,NM_000158.4(GBE1):c.691+2T>C,81698005,81698005,81648854,81648854
9,3,132394747,G,A,chr3:g.132394747G>A,rs114925667,NPHP3-ACAD11,Pathogenic,{Pathogenic},[Early infantile epileptic encephalopathy (EIE...,{Early infantile epileptic encephalopathy (EIE...,['DEVELOPMENTAL AND EPILEPTIC ENCEPHALOPATHY 4...,{Early infantile epileptic encephalopathy (EIE...,4,1.0,True,True,NM_024818.4(UBA5):c.1111G>A (p.Ala371Thr),132394747,132394747,132675903,132675903
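
The `patho_cond.nuniq`, `patho_cond.%`, `_.patho_ALL_cond` and `_.patho_ANY_cond` columns in the table above appear to summarise how many of a variant's conditions carry a pathogenic classification. The sketch below is one plausible reading that is consistent with the rows shown (for instance, a variant classified `{Likely pathogenic, Pathogenic}` has `patho_cond.%` of 0.5). It assumes only the exact label `Pathogenic` counts; the actual rule in `clinvar_workflow` may differ. `pathogenic_flags` and the condition names are hypothetical.

```python
def pathogenic_flags(cond_clinsig):
    """Summarise pathogenic conditions for one variant.

    cond_clinsig maps condition name -> clinical significance label.
    Assumption: only the exact label "Pathogenic" is counted.
    """
    total = len(cond_clinsig)
    patho = {cond for cond, sig in cond_clinsig.items() if sig == "Pathogenic"}
    return {
        "patho_cond.nuniq": len(patho),
        "patho_cond.%": len(patho) / total if total else 0.0,
        "_.patho_ALL_cond": total > 0 and len(patho) == total,
        "_.patho_ANY_cond": len(patho) > 0,
    }


# Illustrative input with placeholder condition names
example = {"condition A": "Pathogenic", "condition B": "Likely pathogenic"}
print(pathogenic_flags(example))
```

Under this reading, `_.patho_ALL_cond` selects variants that are pathogenic for every reported condition, while `_.patho_ANY_cond` also keeps variants with mixed classifications.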


## Display ClinVar Clinical Significance exploratory analysis & visualization 

<div class="alert alert-info">


**_ClinVar Variant vs RCV:_**   
- The ClinVar accession (RCV) is based on a variant-condition(s) combination, not the variant alone   
- Some variants have more than one RCV because the variant has been reported for multiple distinct disorders 

For more information about ClinVar RCV, see ClinVar FAQ:  
https://www.ncbi.nlm.nih.gov/clinvar/docs/faq/#accs  
https://www.ncbi.nlm.nih.gov/clinvar/docs/faq/#var_rcv  


</div>



**Note - *RCV* clinical significance:**   
   - the variant counts are **_NOT DISTINCT_**!   
   - a variant with more than one unique RCV clinical significance (clinsig) classification is counted once under each classification
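
To illustrate why the RCV counts are not distinct, the sketch below tallies three variants from the pathogenic variant table by their `clinical_significance.rcv.set` values. It shows only the counting effect and is not the package's own aggregation code.

```python
from collections import Counter

# rsid -> set of unique RCV clinical significance classifications
rcv_clinsig = {
    "rs142129409": {"Pathogenic"},
    "rs35882952": {"Benign", "Pathogenic"},
    "rs143141689": {"Likely pathogenic", "Pathogenic"},
}

# Each variant contributes one count per classification it carries
counts = Counter(sig for sigs in rcv_clinsig.values() for sig in sigs)

print(len(rcv_clinsig))      # 3 distinct variants
print(sum(counts.values()))  # 5 variant-classification pairs
print(counts["Pathogenic"])  # 3
```

The per-classification counts sum to 5 even though only 3 variants are present, so RCV-level totals should not be read as a number of unique variants.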


In [7]:
results, display = nb.display_clinsig_exploratory_analysis(results)
display_results = nb.display_css(display, 'css_widgets')

display_results

		.. generating Variant Clinical Significance Plotly Table
		.. generating Variant-Condition (RCV) Clinical Significance Plotly Table
		.. generating Gene Variant Clinical Significance Plotly Table
		.. generating Condition Variant Clinical Significance Plotly Table
		.. generating containers to display widgets
		.. assembling results to display


VBox(children=(HBox(children=(Accordion(children=(HBox(children=(FigureWidget({
    'data': [{'hole': 0.45,
  …