# Review model results - Step 1 - Identify a sample to review

# Setup

<div class="alert alert-block alert-warning">
    This notebook assumes:
    <ul>
        <li><b>Terra</b> is running custom Docker image <kbd>gcr.io/uk-biobank-sek-data/ml4cvd_terra:20200226_122553</kbd>.</li>
        <li><b>ml4cvd</b> is running custom Docker image <kbd>gcr.io/broad-ml4cvd/deeplearning:tf2-latest-gpu</kbd>.</li>
    </ul>
</div>

<div class="alert alert-block alert-danger">
    Problems:
    <ul>
        <li><b>ml4cvd</b> notebooks do not currently have permission to access data in project <a href='https://console.cloud.google.com/bigquery?project=uk-biobank-sek-data'>uk-biobank-sek-data</a>.</li>
        <li><b>Terra</b> notebooks that use Facets for interactive data exploration are broken in the latest version of Chrome. The workaround described in <a href="https://github.com/PAIR-code/facets/issues/207">https://github.com/PAIR-code/facets/issues/207</a> is in place, but more work is needed per <a href="https://broadworkbench.atlassian.net/browse/IA-1684">https://broadworkbench.atlassian.net/browse/IA-1684</a>.</li>
    </ul>
</div>

In [None]:
# TODO(deflaux): remove this cell after gcr.io/broad-ml4cvd/deeplearning:tf2-latest-gpu has this preinstalled.
from ml4cvd.runtime_data_defines import determine_runtime
from ml4cvd.runtime_data_defines import Runtime

if Runtime.ML4CVD_VM == determine_runtime():
    !pip3 install --user facets-overview

In [None]:
from ml4cvd.visualization_tools.facets import FacetsOverview, FacetsDive  # Interactive data exploration of tabular data.
import numpy as np
import os
import pandas as pd
import re

In [None]:
%load_ext google.cloud.bigquery

# Identify a sample to review

<div class="alert alert-block alert-info">
    If you want to change the SQL below, you can view the available tables:
    <ul>
        <li><a href="https://storage.cloud.google.com/uk-biobank-sek-data-us-east1/ukb21481.html">phenotype descriptions</a></li>
        <li><a href="https://bigquery.cloud.google.com/table/uk-biobank-sek-data:raw_phenotypes.ukb9222_no_empty_strings_20181128">phenotype values</a></li>
        <li><a href="https://bigquery.cloud.google.com/dataset/uk-biobank-sek-data:a_ttl_one_week">available ML results</a></li>
    </ul>
</div>


In [None]:
%%bigquery sample_info

---[ EDIT THIS QUERY IF YOU LIKE ]---

SELECT
  sample_id,
  CASE u31_0_0
    WHEN 0 THEN 'Female'
    WHEN 1 THEN 'Male'
    ELSE 'Unknown' END AS sex_at_birth,
  u21003_0_0 AS age_at_assessment,
  u21001_0_0 AS bmi,
  CASE u1249_0_0
    WHEN 1 THEN 'Smoked on most or all days'
    WHEN 2 THEN 'Smoked occasionally'
    WHEN 3 THEN 'Just tried once or twice'
    WHEN 4 THEN 'I have never smoked'
    WHEN -3 THEN 'Prefer not to answer' END AS past_tobacco_smoking,
  ecg.* EXCEPT(sample_id)
FROM
  `uk-biobank-sek-data.raw_phenotypes.ukb9222_no_empty_strings_20181128`
INNER JOIN
  `uk-biobank-sek-data.ml_results.inference_ecg_rest_age_sex_autoencode_lvmass` AS ecg
ON
  eid = sample_id

In [None]:
%%bigquery sample_info

---[ Demonstrate that this notebook will work once ml4cvd permissions are sorted out. ]---

SELECT * FROM `bigquery-public-data.human_genome_variants.1000_genomes_sample_info`

In [None]:
sample_info.shape

In [None]:
# Compute the deltas between actual values and predicted value columns.
# Use a raw string so that \w is not treated as an invalid string escape.
actual_regexp = re.compile(r'^(\w+)_actual$')
for actual_col in sample_info.columns:
  if actual_col.endswith('_actual'):
    prediction_col = actual_regexp.sub(r'\1_prediction', actual_col)
    if prediction_col in sample_info.columns:
      delta_col = actual_regexp.sub(r'\1_delta', actual_col)
      print('Adding ' + delta_col)
      sample_info[delta_col] = (sample_info[actual_col].astype('float')
                                - sample_info[prediction_col].astype('float'))
        
sample_info.shape
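
To make the column naming convention concrete, here is a minimal, self-contained sketch of the same delta computation on a toy DataFrame. The column names (`LVM_actual`, `LVM_prediction`, `age_actual`) and all values are hypothetical and chosen only for illustration; they are not taken from the inference table.

```python
import re

import pandas as pd

# Hypothetical toy data; real column names come from the BigQuery results.
toy = pd.DataFrame({
    'sample_id': [1, 2, 3],
    'LVM_actual': [100.0, 120.0, 90.0],
    'LVM_prediction': [95.0, 125.0, 90.0],
    'age_actual': [50, 60, 70],  # No matching prediction column, so no delta.
})

actual_regexp = re.compile(r'^(\w+)_actual$')
# Iterate over a copy of the column list because columns are added in the loop.
for actual_col in list(toy.columns):
    if not actual_col.endswith('_actual'):
        continue
    prediction_col = actual_regexp.sub(r'\1_prediction', actual_col)
    if prediction_col in toy.columns:
        delta_col = actual_regexp.sub(r'\1_delta', actual_col)
        # delta = actual - prediction, so a positive delta means the model under-predicted.
        toy[delta_col] = (toy[actual_col].astype('float')
                          - toy[prediction_col].astype('float'))

print(toy.columns.tolist())
```

Only `LVM_delta` is added here, because `age_actual` has no `age_prediction` counterpart.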

## Facets Overview

Use this visualization to get an overview of the type and distribution of sample information available.

For detailed instructions, see [Facets Overview](https://pair-code.github.io/facets/).
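
If the interactive view does not render in your browser, a static summary from pandas can serve as a rough substitute. The sketch below is an assumption about what such a fallback might look like, not part of the `ml4cvd` tooling, and the DataFrame `toy` is a hypothetical stand-in for `sample_info`.

```python
import pandas as pd

# Hypothetical stand-in for `sample_info`.
toy = pd.DataFrame({
    'sex_at_birth': ['Female', 'Male', 'Female', None],
    'bmi': [22.5, 27.1, None, 30.2],
})

# One row per column: counts, unique values, and numeric distribution statistics.
summary = toy.describe(include='all').transpose()
# Add the number of missing values for each column.
summary['missing'] = toy.isna().sum()
print(summary)
```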

In [None]:
FacetsOverview(sample_info)

## Facets Dive

Use this visualization to get an overview of the distributions of values for *groups* of samples.

For detailed instructions, see [Facets Dive](https://pair-code.github.io/facets/).

**NOTE**:
* It might take a few seconds for the visualization to appear.
* If the table of contents pane is in the way of the column selector drop-down, click the button to turn the table of contents off.
* Try:
  * Binning | X-Axis: `sex_at_birth`
  * Binning | Y-Axis: `bmi`, use the 'count' drop-down to increase or decrease the number of categorical bins
  * Label By: `sample_id`
  * Color By: `age_at_assessment`
  * Scatter | X-Axis: `LVM_prediction_sentinel_actual`
  * Scatter | Y-Axis: `LVM_prediction_sentinel_prediction`
 
Zoom in and click on the sample(s) of interest. A pane on the right-hand side shows all the data for the sample, **including the sample_id**, which you should use for the next step.
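
As a non-interactive complement, samples with the largest prediction errors can also be listed directly from a `_delta` column such as those computed earlier in this notebook. This is a minimal sketch under stated assumptions: the helper `largest_errors`, the column name `LVM_delta`, and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for `sample_info` with a delta column.
toy = pd.DataFrame({
    'sample_id': [11, 12, 13, 14],
    'LVM_delta': [5.0, -42.0, 0.5, 17.0],
})


def largest_errors(df, delta_col, n=2):
    """Return the n rows with the largest absolute value in delta_col."""
    order = df[delta_col].abs().sort_values(ascending=False).index
    return df.loc[order].head(n)


top = largest_errors(toy, 'LVM_delta')
print(top['sample_id'].tolist())
```

Sorting on the absolute value treats over- and under-predictions alike; sort on the signed column instead if only one direction is of interest.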

In [None]:
FacetsDive(sample_info)

# Provenance

In [None]:
import datetime
print(datetime.datetime.now())

In [None]:
%%bash
pip3 freeze

Questions about these particular notebooks? Reach out to Puneet Batra (pbatra@broadinstitute.org), Paolo Di Achille (pdiachil@broadinstitute.org), and Nicole Deflaux (deflaux@verily.com).