# Jupyter Reviewer Tutorial

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from src.ReviewData import ReviewData, ReviewDataAnnotation
from src.ReviewDataApp import ReviewDataApp

import pandas as pd
import numpy as np
import functools
import time



# Reviewing Data 

The goals of this tool are to:
1. Enforce consistent and meaningful annotation
1. Consolidate multiple sources of data in a single place
1. Make the review dashboard flexible for different needs

The only constraints are:
1. Each row corresponds to a specific data item you want to annotate, and is independent of the other rows in the table
1. History cannot be "undone"

Below is an example of a table where each row corresponds to a sample. The columns hold all the data that I plan to use or look at while annotating:
- existing annotations (such as clinical data)
- paths to files I can plot as a graph or display as a table

Recommendations:
- Automate as much of the annotation as possible first. You can then use this tool to manually check and update those annotations.
- Preprocess your files so that each sample's data renders quickly and switching between samples takes less time.
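
As a minimal sketch of the table layout described above, the input could be assembled as follows, with one row per item and the item ID as the index. The sample IDs, column names, and file paths are made up for illustration.

```python
import pandas as pd

# Hypothetical sample IDs, annotations, and file paths, for illustration only.
records = [
    {"sample_id": "sample_A", "clinical_stage": "II", "maf_fn": "data/sample_A.maf.tsv"},
    {"sample_id": "sample_B", "clinical_stage": "IV", "maf_fn": "data/sample_B.maf.tsv"},
]

# One row per item to review; the item ID becomes the index.
review_df = pd.DataFrame.from_records(records).set_index("sample_id")

# Each row must be identifiable on its own, so the index should be unique.
assert review_df.index.is_unique
print(review_df)
```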

## ReviewData object

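1. data: the input table, with one row per item to review
2. annot: the annotation table, with one row per item and one column per annotation
3. history: a record of the annotation updates made during the session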


In [3]:
bucket_0c1_cchu_manual_purity_review_session_dir = 'gs://taml_vm_analysis/data/Full-Analysis/1_Full-Analysis-2022-02-22_pran3/0c1_Manual_Purity_Review_cchu'
cchu_purities_df = pd.read_csv(f'{bucket_0c1_cchu_manual_purity_review_session_dir}/manual_purity_review_table.tsv', sep='\t', index_col=0)
cchu_purities_df.head()


sample_id,BETA_FLAG_not_enough_drivers,BETA_annot_maf_fn,BETA_clonal_muts,BETA_clonal_muts_genes,BETA_half_purity,BETA_has_beta_solution,BETA_num_clonal_drivers,BETA_ploidy,BETA_purity,BETA_purity_lower,...,manual_purity,manual_purity_lower,manual_purity_upper,manual_ploidy,manual_confidence,manual_flags,last_manual_update,manual_method,MAFLITE,VCF
000725_ZS_2668,True,/home/cchu/cgaprojects_ibm_tAML_analysis/data/...,,,,False,,2.0,0.0,0.0,...,0.63,0.57,0.69,2.01,"No purity called, unsure",Post_Allo,2022-02-23 21:43:37.104717,Manual_Other,gs://fc-fed5ee4d-4de5-429a-b88e-681cde1f0558/a...,gs://fc-fed5ee4d-4de5-429a-b88e-681cde1f0558/a...
005982_GD_1875,False,/home/cchu/cgaprojects_ibm_tAML_analysis/data/...,[0],['RTEL1:p.A1062T'],0.488,True,1.0,2.0,0.976,0.86,...,0.91,0.86,0.96,1.95,Confident,,2022-02-23 21:44:00.961518,Keep_auto_call,gs://fc-fed5ee4d-4de5-429a-b88e-681cde1f0558/a...,gs://fc-fed5ee4d-4de5-429a-b88e-681cde1f0558/a...
012413_AT_1634,False,/home/cchu/cgaprojects_ibm_tAML_analysis/data/...,[0],['BRCA1:p.V772A'],0.512,True,2.0,2.0,1.024,0.8,...,1.024,0.8,1.244,2.0,"Purity called, unsure",,2022-02-23 21:44:51.038330,Manual_BETA,gs://fc-fed5ee4d-4de5-429a-b88e-681cde1f0558/a...,gs://fc-fed5ee4d-4de5-429a-b88e-681cde1f0558/a...
016198_VX_1736,False,/home/cchu/cgaprojects_ibm_tAML_analysis/data/...,[0],['TERT:p.R756H'],0.43,True,1.0,2.0,0.86,0.76,...,0.86,0.76,0.964,2.0,"Purity called, unsure",No CNA,2022-02-23 21:51:14.522862,Manual_BETA,gs://fc-fed5ee4d-4de5-429a-b88e-681cde1f0558/a...,gs://fc-fed5ee4d-4de5-429a-b88e-681cde1f0558/a...
022613_PU_3426,True,/home/cchu/cgaprojects_ibm_tAML_analysis/data/...,,,,False,,2.0,0.0,0.0,...,0.46,0.39,0.53,1.88,Confident,No_AML_drivers,2022-02-23 21:53:02.173752,Keep_auto_call,gs://fc-fed5ee4d-4de5-429a-b88e-681cde1f0558/a...,gs://fc-fed5ee4d-4de5-429a-b88e-681cde1f0558/a...


## Setting up your review session

`ReviewData` mirrors how one might annotate a spreadsheet: going row by row and filling in or editing the corresponding columns. Instantiate your `ReviewData` session by specifying:
1. A directory to store the metadata related to your review session
1. A pandas dataframe with all the information you need for each data point. The data point IDs must be the index
1. What you want to annotate, and how to validate it (in progress)

If the `ReviewData` session directory already exists and has the expected files, it will simply reload those existing files. Some caveats:
1. If you add items to `annotate_data`, the new columns are added to the annot table. Deleting an item from `annotate_data` does not remove its column from the table, but, as you will see later, you can no longer update that column in the app.
1. Any changes to `df` between re-runs will NOT change any of the values or paths in the data table. You will have to manually update the path/data.tsv file if that is what you want. Depending on why you want to update your input data, I generally recommend making a new session. (There is an option to "autofill" annotations, so you do not necessarily have to redo everything.)

In [4]:
test_rd_dir = '/home/cchu/cgaprojects_ibm_tAML_analysis/data/test_getzlab-JupyterReviewer/Reviewer_Tutorial'
test_rd = ReviewData(review_dir=test_rd_dir,
                     df = cchu_purities_df, # optional if directory above already exists. 
                     annotate_data = {'purity': ReviewDataAnnotation('number', validate_input=lambda x: x < 0.5),
                                      'rating': ReviewDataAnnotation('number', options=range(10)),
                                      'description': ReviewDataAnnotation('text'),
                                      'class': ReviewDataAnnotation('radioitem', options=[f'Option {n}' for n in range(4)])})
test_rd.annot.head()


sample_id,purity,class,rating,description
000725_ZS_2668,2.0,Option 3,2.0,asdf
005982_GD_1875,0.86,Option 1,,
012413_AT_1634,1.308,Option 0,,
016198_VX_1736,,,,
022613_PU_3426,0.0,Option 2,0.0,0
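
The `validate_input` argument in the cell above is given a callable that takes the entered value and returns a boolean. A named function can be easier to read and test than a lambda. The sketch below is plain Python: `is_valid_purity` is a hypothetical name, and exactly how `ReviewDataAnnotation` calls the validator is an assumption based on the lambda used above.

```python
def is_valid_purity(value):
    """Return True if value can be read as a number in the closed interval [0, 1]."""
    try:
        value = float(value)
    except (TypeError, ValueError):
        return False
    return 0.0 <= value <= 1.0


# Quick checks before handing the function to the annotation definition.
print(is_valid_purity(0.63))   # True
print(is_valid_purity(1.3))    # False
print(is_valid_purity("abc"))  # False
```

It could then be passed as `validate_input=is_valid_purity` in place of the lambda.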


# Interactive Review data with Plotly Dash

Plotly Dash is a package for building dashboards in Python. It has built-in objects and functions for assembling components, so you can display multiple things at once and make them interactive.


In [5]:
import plotly.express as px
from plotly.subplots import make_subplots
from jupyter_dash import JupyterDash
from dash import dcc
from dash import html
from dash.dependencies import Input, Output, State
from dash.exceptions import PreventUpdate
from dash import Dash, dash_table
import dash
import dash_bootstrap_components as dbc
import functools

## 1. Instantiate the App by passing in your `ReviewData` object

In [6]:
test_app = ReviewDataApp(test_rd)

To run, call `run_app()`

In [11]:
test_app.run_app(mode='external', port=8064)

Dash app running on http://0.0.0.0:8064/




This produces the baseline dashboard, which simply allows you to iterate through each row, add annotations, and view the history of annotations.

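Run app parameters:
- mode: `'inline'` to display the app inside the notebook, or `'external'` to open it in a separate window
- port: the port the app is served on
- host: the host address the app is served on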

## 2. Add simple components


One component is already implemented: a table loaded from a file path stored in the data table.


In [12]:
test_app.add_table_from_path('DFCI MAF file', 
                             'maf-component-id', 
                             'DFCI_local_sample_dfci_maf_fn', 
                             ['Hugo_Symbol', 'Chromosome', 't_alt_count', 't_ref_count', 'Tumor_Sample_Barcode'])



## 3. Custom components

Custom components are useful if you want to:
- display graphs
- implement interactive components
- combine multiple inputs to produce a plot (note that it's better to precompute as much as possible)

To add a custom component, you need to define:
1. A name for the component
1. A dash layout (see the Dash documentation on how to make these). Fill your components with empty data first.
1. `callback_output`: the components in your dash layout, and their properties, that the callback will update
1. `new_data_callback`: the function that computes the new values for those outputs from the current row


In [14]:

def gen_data_summary_table(r, cols):
    return [[html.H1(f'{r.name} Data Summary'), dbc.Table.from_dataframe(r[cols].to_frame().reset_index())]]

test_app.add_custom_component('sample-info-component', 
                              html.Div(children=[html.H1('Data Summary'), 
                                                 dbc.Table.from_dataframe(df=pd.DataFrame())],
                                       id='sample-info-component'
                                      ), 
                              callback_output=[Output('sample-info-component', 'children')],
                              new_data_callback=gen_data_summary_table, 
                              cols=['BETA_ploidy',
                                     'BETA_purity',
                                     'BETA_purity_lower',
                                     'BETA_purity_upper'])



**How to write a callback**

1. The first parameter must be a pandas series. `ReviewDataApp` automatically passes in the data associated with the current row as the first parameter. All data references are made via key access to the `ReviewData` data table.
1. Any additional arguments can be specified in `test_app.add_custom_component()` as `kwargs`. This way, you can reuse existing functions and customize their arguments.
1. Your output must match your `callback_output` argument, with the returned values in the same order as the outputs they are sent to.
    1. You may notice that in the above example, `gen_data_summary_table()` returns a nested list. This is because my `callback_output` parameter is a list with a single dash `Output()` object. This dash `Output()` object refers to the `children` of the component `'sample-info-component'`, which is the `html.Div` dash component.
    1. The outer brackets correspond to the outer brackets of my input to `callback_output`. The inner brackets hold the value sent to update the `children` attribute of the `html.Div` object, which consists of a list of two components.

Alternatively, you can pass a dictionary to `callback_output` specifying keywords to assign values to each specified dash `Output()` object.
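
To make the list form of the ordering rule concrete, here is a plain-Python sketch of how returned values line up with outputs by position. The strings stand in for dash `Output()` objects, and `pair_outputs` is a hypothetical helper written for this illustration, not part of `ReviewDataApp`.

```python
def pair_outputs(callback_output, returned_values):
    """Pair each declared output with the value returned for it, by position."""
    if len(callback_output) != len(returned_values):
        raise ValueError("callback must return one value per declared output")
    return list(zip(callback_output, returned_values))


# One declared output: the `children` of the summary component.
callback_output = ["sample-info-component.children"]

# The callback returns a list with ONE element (outer brackets); that element
# is itself a list of two children (inner brackets).
returned_values = [["<H1 heading>", "<summary table>"]]

pairs = pair_outputs(callback_output, returned_values)
print(pairs)
```

Returning the two children without the outer brackets would look like two values for a single declared output, which is why the nesting matters.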

## 4. Custom Interactive Components

Each component can be made of multiple sub-components. Components that need to interact with each other must be grouped into one large component: components added in one `add_custom_component()` call cannot interact with components added in separate calls.

Below is an example where an interactive table can be used to modify the graph and recalculate the purity based on the selected mutations. 


In [15]:
from scipy.stats import beta, kruskal
import plotly.graph_objects as go

tumor_f_bin_width = 1.0/500.0
tumor_f_bins = np.arange(0, 1, tumor_f_bin_width)
pval_threshold = 1.1E-4
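def plot_beta(maf_df, data_id):
    """Plot per-mutation beta pdfs of allele fraction and estimate purity.

    Returns (fig, purity, purity_lower_ci, purity_upper_ci).
    """
    if maf_df.empty:
        raise ValueError("There are no mutations in the maf dataframe.")

    maf_df = maf_df.copy()  # avoid modifying the caller's (possibly cached) dataframe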

    for idx, r in maf_df.iterrows():
        pdf = beta.pdf(tumor_f_bins, r['t_alt_count'] + 1, r['t_ref_count'] + 1)
        maf_df.loc[idx, tumor_f_bins] = pdf / (sum(pdf) * tumor_f_bin_width)

    sum_pdf = maf_df[tumor_f_bins].sum(axis=0)
    sum_pdf = sum_pdf / (sum_pdf.sum() * tumor_f_bin_width)
    if 'tumor_f' not in maf_df.columns:
        maf_df['tumor_f'] = maf_df['t_alt_count'].astype(float) / (maf_df['t_alt_count'] + maf_df['t_ref_count'])
    maf_df = maf_df.sort_values(by='tumor_f',
                                ascending=False).reset_index()

    clonal_muts = [maf_df.index[0]]  # Get the first one
    for j in np.arange(maf_df.shape[0], 1, -1):
        h_stat, pval = kruskal(*maf_df.iloc[:j].apply(lambda x: np.concatenate((np.ones(x['t_alt_count']),
                                                                                np.zeros(x['t_ref_count']))),
                                                      axis=1).tolist())
        if pval > pval_threshold:
            clonal_muts = maf_df.index[:j].tolist()
            break

    subclonal_muts = maf_df.index[clonal_muts[-1] + 1:].tolist() if clonal_muts[-1] < maf_df.shape[0] else []

    clonal_prod_pdf = maf_df.loc[clonal_muts, tumor_f_bins].product(axis=0)
    clonal_prod_pdf = clonal_prod_pdf / (clonal_prod_pdf.sum() * tumor_f_bin_width)
    half_purity = clonal_prod_pdf.argmax()
    purity = clonal_prod_pdf.index[half_purity] * 2

    log_clonal_prod_pdf = np.log10(clonal_prod_pdf)
    log_clonal_prod_pdf = log_clonal_prod_pdf - np.max(log_clonal_prod_pdf)
    cis = log_clonal_prod_pdf[log_clonal_prod_pdf >= -1].index.tolist()
    purity_lower_ci = cis[0] * 2
    purity_upper_ci = cis[-1] * 2
    
    # plotly plot
    # Step 1: make the figure
    maf_df['clonal_status'] = maf_df.index.map(lambda x: 'clonal' if x in clonal_muts else 'subclonal')
    maf_df['Mut_Label'] = maf_df['Hugo_Symbol'] + ':' + maf_df['Start_position'].astype(str) + ':' + maf_df['Protein_Change'].astype(str) + ':' + maf_df['Variant_Classification'].astype(str)
    to_plot_maf_df = maf_df.set_index('Mut_Label')[list(tumor_f_bins)].stack().reset_index()
    to_plot_maf_df['clonal_status'] = to_plot_maf_df['Mut_Label'].map(maf_df[['Mut_Label', 'clonal_status']].set_index('Mut_Label')['clonal_status'])
    to_plot_maf_df['pdf_log10'] = np.log10(to_plot_maf_df[0])
    fig = px.line(to_plot_maf_df, x='level_1', y='pdf_log10', color='clonal_status', 
                  hover_data=['Mut_Label'], title=f'{data_id}: purity = {round(purity, 2)} [{round(purity_lower_ci, 2)} - {round(purity_upper_ci, 2)}]')
    fig.add_trace(go.Scatter(x=tumor_f_bins, y=np.log10(clonal_prod_pdf),
                    mode='lines',
                    name='clonal product pdf'))
    fig.add_trace(go.Scatter(x=tumor_f_bins, y=np.log10(sum_pdf),
                    mode='lines',
                    name='all mutations sum pdf'))
    
    fig.add_vrect(x0=cis[0], x1=cis[-1], line_width=0, fillcolor="red", opacity=0.2)
    fig.add_vline(x=clonal_prod_pdf.index[half_purity], name='Half purity')
    
    ylim_min=10 ** (-4)
    ylim_max=10 ** 2
    fig.update_yaxes(range=[np.log10(ylim_min), np.log10(ylim_max)])

    return fig, purity, purity_lower_ci, purity_upper_ci
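

The core purity calculation in `plot_beta()` can be followed on a small example. The sketch below is not the function above: it uses only the standard library, treats every mutation as clonal, and works in log space, which avoids taking the logarithm of a density that has underflowed to zero. The read counts are made up, and `log_beta_pdf` and `estimate_purity` are hypothetical names.

```python
import math


def log_beta_pdf(x, a, b):
    """Natural log of the Beta(a, b) density at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def estimate_purity(counts, n_bins=500):
    """Estimate purity as twice the allele fraction that maximizes the joint density.

    counts: list of (t_alt_count, t_ref_count) pairs, all assumed to be
    clonal (the factor of 2 mirrors the half-purity step in plot_beta).
    """
    # Interior grid points only, so log(x) and log(1 - x) are always defined.
    grid = [i / n_bins for i in range(1, n_bins)]
    # The product of per-mutation densities equals the sum of their logs.
    joint = [sum(log_beta_pdf(x, alt + 1, ref + 1) for alt, ref in counts)
             for x in grid]
    best = max(range(len(grid)), key=joint.__getitem__)
    return 2 * grid[best]


# Made-up read counts for three mutations.
purity = estimate_purity([(30, 70), (28, 72), (33, 67)])
print(round(purity, 3))
```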


In [35]:
beta_table_cols = ['CHIP_mut_status', 
                  'aSCNA', 
                  'Hugo_Symbol', 
                  'Chromosome', 
                  'Start_position', 
                  'Variant_Classification', 
                  'Protein_Change', 
                  't_alt_count', 
                  't_ref_count', 
                  'total_count', 
                  'tumor_f', 
                  'gnomADg_AF']
blank_beta_df = pd.DataFrame(columns=beta_table_cols)
blank_beta_df.loc[0, beta_table_cols] = 'Test'

@functools.lru_cache(maxsize=32) # faster to reload
def read_maf(fn):
    return pd.read_csv(fn, sep='\t')

def beta_graph_callback(df, idx, 
                        reload_beta_graph_button, 
                        selected_rows, 
                        beta_table_fn_col, 
                        beta_table_display_col):
    r = df.loc[idx]
    maf_df = read_maf(r[beta_table_fn_col])
    # Keep the original row labels of the mutations that pass the driver filter
    selected_rows = maf_df[maf_df['pass_known_driver_filter']].index.tolist()
    fig, purity, purity_lower_ci, purity_upper_ci = plot_beta(maf_df.loc[selected_rows], r.name)
    return [fig, maf_df[beta_table_display_col].to_dict('records'), selected_rows, purity, 2]

def internal_beta_graph_callback(df, idx, 
                        reload_beta_graph_button, 
                        selected_rows, 
                        beta_table_fn_col, 
                        beta_table_display_col):
    r = df.loc[idx]
    maf_df = read_maf(r[beta_table_fn_col])
    fig, purity, purity_lower_ci, purity_upper_ci = plot_beta(maf_df.loc[selected_rows], r.name)
    return [fig, maf_df[beta_table_display_col].to_dict('records'), selected_rows, purity, 2]
    
test_app.add_custom_component('beta-graph', 
                              html.Div([html.H1("Beta MAF"), 
                                        html.Button('Reload Beta Plot', id='reload-beta-button', n_clicks=0),
                                        dash_table.DataTable(
                                                              id='beta-maf-table',
                                                              columns=[{"name": i, "id": i} for i in beta_table_cols],
                                                              data=blank_beta_df.to_dict('records'),
                                                              filter_action="native",
                                                              sort_action="native",
                                                              sort_mode="multi",
                                                              column_selectable="single",
                                                              row_selectable="multi",
                                                              selected_columns=[],
                                                              selected_rows=[0],
                                                              page_action="native",
                                                              page_current= 0,
                                                              page_size= 12,
                                         ), 
                                        html.Div([html.P('Purity: ', style={'display': 'inline'}), html.P(0, id='beta-graph-purity', style={'display': 'inline'})]), 
                                        html.Div([html.P('Ploidy: ', style={'display': 'inline'}), html.P(0, id='beta-graph-ploidy', style={'display': 'inline'})]), 
                                        dcc.Graph(id='beta-graph', figure={})]), # todo just make name the heading
                              callback_output=[Output('beta-graph', 'figure'), 
                                               Output('beta-maf-table', 'data'), 
                                               Output('beta-maf-table', 'selected_rows'),
                                               Output('beta-graph-purity', 'children'),
                                               Output('beta-graph-ploidy', 'children')
                                              ],
                              callback_input=[Input('reload-beta-button', 'n_clicks'), 
                                               State('beta-maf-table', 'selected_rows')],
                              new_data_callback=beta_graph_callback, 
                              internal_callback=internal_beta_graph_callback,
                              add_autofill=True,
                              autofill_dict={'purity': Input('beta-graph-purity', 'children')},
                              beta_table_fn_col='BETA_annot_maf_fn',
                              beta_table_display_col=beta_table_cols
                             )

This example also shows how the outputs of a component can be used to autofill annotations when you recalculate something. The requirements are:
1. The data you want to autofill is the value of one of your output components (which temporarily stores your data)
1. `autofill_dict` keys must correspond to the names of the columns in the annotation table of the `ReviewData` object you made
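
A mismatch between `autofill_dict` keys and annotation column names is easy to check before building the app. The sketch below is plain Python: `check_autofill_keys` is a hypothetical helper, and the strings stand in for the dash components that hold the values.

```python
def check_autofill_keys(autofill_dict, annot_columns):
    """Return the autofill keys that do not match any annotation column."""
    return sorted(set(autofill_dict) - set(annot_columns))


# Annotation columns defined when the review session was created.
annot_columns = ["purity", "rating", "description", "class"]

good = {"purity": "beta-graph-purity.children"}
bad = {"Purity": "beta-graph-purity.children"}  # wrong capitalization

print(check_autofill_keys(good, annot_columns))  # []
print(check_autofill_keys(bad, annot_columns))   # ['Purity']
```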



# Run the app

If you are running the notebook on a VM, you may need to specify a host and port. To view the app, you will need to forward the corresponding port.

You can run directly in the notebook with `mode='inline'`, or in a separate window with `mode='external'`.

In [36]:
test_app.run_app(mode='external', port=8055)

Dash app running on http://0.0.0.0:8055/



divide by zero encountered in log10

