# Introduction

This notebook demonstrates typical first steps in exploring phenotype distributions. It is written to be interactive, so you can make choices as you go.

## Data disclaimer
----

All data in this notebook (and this workspace) are publicly available thanks to the efforts of many dedicated individuals:

- Genotype and some phenotypic data were produced by the [1000 Genomes Project (phase 3)](https://www.internationalgenome.org/)

- Individual phenotypes were modeled using the [GCTA software](https://cnsgenomics.com/software/gcta) and variant-level summary statistics from [MAGIC](https://www.magicinvestigators.org/), the [GIANT Consortium](https://portals.broadinstitute.org/collaboration/giant/index.php/Main_Page), the [UK Biobank](https://www.ukbiobank.ac.uk/), and the [MVP](https://www.research.va.gov/mvp/)

Phenotypes were modeled to reflect the actual genetic architecture of these complex traits as closely as possible. Most single variant association results should correspond well to published GWAS, but others may not. **Results produced from these data should not be taken as representing real, replicable genetic associations. These data are provided for demonstration and training purposes only.**

## Load Python packages

In [None]:
%%capture 
!pip install -U terra_notebook_utils
import pandas as pd
import os
from analysis_utils import AnalysisUtils
from terra_notebook_utils import drs

## Set configuration

Obtain a JWT by going through the auth flow here:

https://tdrb2ctest.b2clogin.com/tdrb2ctest.onmicrosoft.com/oauth2/v2.0/authorize?p=B2C_1A_SIGNUP_SIGNIN&client_id=bc8119eb-e425-4ff7-945a-05f90a37fca7&nonce=defaultNonce&redirect_uri=https%3A%2F%2Fjwt.ms&scope=openid&response_type=id_token&prompt=login

In [None]:
# Paste in the JWT token obtained via Azure B2C
token = ""
# Paste snapshot ID generated from the AzureY1Demo notebook
snapshot_id = ""

# Load phenotypes 

Phenotypic data for each individual in the study are stored in Azure TDR. To analyze them in this notebook, we must explicitly load the data into the notebook environment.

## Retrieve TDR snapshot and copy parquet data to the notebook

Note: this duplicates some functionality from the `AzureY1Demo` notebook.

In [None]:
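# Authenticate with the JWT and fetch the snapshot, including its access details
utils = AnalysisUtils(token)
snapshot = utils.snapshots_api().retrieve_snapshot(snapshot_id, include=["ACCESS_INFORMATION"])
# Select the phenotype table by name; next(iter(...), default) would only return the first table
table = next(t for t in snapshot.access_information.parquet.tables if t.name == "demo_pheno_data")
local_parquet_dir = "/tmp/az"
# Remove any stale local copy (-f avoids an error on the first run), then download with azcopy
os.system("rm -rf %s/%s.parquet" % (local_parquet_dir, table.name))
os.system("azcopy cp '%s?%s' '%s' --recursive" % (table.url, table.sas_token, local_parquet_dir))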

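The cell above picks one entry out of the snapshot's parquet tables by its name. As a minimal sketch of that lookup, the example below wraps it in a helper that fails with a readable message when the table is absent. `find_table` is a hypothetical helper, not part of `AnalysisUtils`, and the `SimpleNamespace` objects are stand-ins that only mimic the `name` attribute used above.

```python
from types import SimpleNamespace


def find_table(tables, name):
    """Return the first table whose .name equals `name`, or raise a clear error."""
    match = next((t for t in tables if t.name == name), None)
    if match is None:
        available = ", ".join(t.name for t in tables)
        raise LookupError(f"No table named {name!r}; available tables: {available}")
    return match


# Hypothetical stand-ins for the snapshot's parquet table entries
tables = [
    SimpleNamespace(name="other_table"),
    SimpleNamespace(name="demo_pheno_data"),
]

selected = find_table(tables, "demo_pheno_data")
print(selected.name)
```

In the notebook itself, the same helper could be called as `find_table(snapshot.access_information.parquet.tables, "demo_pheno_data")`, assuming each table object exposes `name` as it does in the cell above.
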
## Load phenotype data

Load phenotype data into a pandas dataframe. The columns correspond to:  

* **sample:** a unique label for each individual sample in our dataset
* **age:** numerical age of the individual at the time of each phenotype measure
* **ancestry:** superpopulation group of each individual
  * AFR: African
  * AMR: Admixed American
  * EAS: East Asian
  * EUR: European
  * SAS: South Asian
* **bmi:** body mass index
* **fg:** fasting glucose
* **fi:** fasting insulin
* **hdl:** high density lipoprotein
* **height:** standing height
* **ldl:** low density lipoprotein
* **population:** population of each sample, see [1000 Genomes description](https://www.internationalgenome.org/category/population/)
* **sex:** biological sex
* **tc:** total cholesterol
* **tg:** total triglycerides
* **whr:** waist-to-hip ratio

In [None]:
samples = pd.read_parquet("%s/%s.parquet" % (local_parquet_dir, table.name))
samples = samples.drop(columns=["datarepo_row_id"])
samples.head(10)

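Before plotting, it can help to confirm that the table loaded as expected. The sketch below shows a few quick checks (dimensions, missing values per column, repeated sample labels) on a small toy dataframe. The toy values are invented for illustration; in the notebook the same calls would be made on `samples`.

```python
import pandas as pd

# Toy stand-in for the phenotype table; all values are invented for illustration
toy = pd.DataFrame(
    {
        "sample": ["S1", "S2", "S3", "S4"],
        "sex": ["F", "M", "F", "M"],
        "bmi": [22.5, 27.1, None, 24.8],
        "ldl": [110.0, 135.0, 98.0, None],
    }
)

n_rows, n_cols = toy.shape                        # number of samples and columns
missing_per_column = toy.isna().sum()             # count of missing values per column
duplicate_ids = toy["sample"].duplicated().sum()  # number of repeated sample labels

print(n_rows, n_cols)
print(missing_per_column)
print(duplicate_ids)
```
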
# Examine phenotype data
----

Let's take a look at the phenotype distributions. In a GWAS - and statistical genetics more generally - we should always be on the lookout for correlations within our dataset. Correlations between phenotypic values can confound our analysis, leading to results that may not represent true genetic associations with our traits. Exploring these relationships may help in choosing a reasonable set of covariates to model.    

We've included a number of plotting functions below to make this as easy as possible. Feel free to modify - or write your own functions - as you explore the data. 

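One quick way to look for correlations between phenotypes, alongside the plots, is a pairwise correlation matrix. The following is a minimal sketch using pandas `DataFrame.corr` on a toy dataframe whose values are invented so that the relationships are obvious; it is not the notebook's own method, and in practice you would select the continuous trait columns from `samples` instead.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "bmi": [20.0, 22.0, 24.0, 26.0, 28.0],
        "whr": [0.80, 0.84, 0.88, 0.92, 0.96],  # rises linearly with bmi
        "hdl": [70.0, 65.0, 60.0, 55.0, 50.0],  # falls linearly with bmi
    }
)

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr(method="pearson")
print(corr.round(2))
```

Strongly correlated traits found this way are natural candidates to inspect with the bivariate plots and to consider when choosing covariates.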

## Goals of this section
----
    
1. Visualize the distribution of phenotype values  
    - Within each continuous trait (using the kdPlot function)  
    - Between two continuous traits (with the bivariateDistributionPlot function)
2. Determine whether trait distributions follow patterns we might expect

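For the second goal, a grouped summary is a simple numeric complement to the plots: if a trait is expected to differ between groups (for example, height by sex), the group means should show it. The sketch below uses pandas `groupby` on a toy dataframe. The values and the `"F"`/`"M"` coding are assumptions made for illustration and may not match how `sex` is encoded in `samples`.

```python
import pandas as pd

# Toy data, invented for illustration; the sex coding is an assumption
toy = pd.DataFrame(
    {
        "sex": ["F", "F", "M", "M"],
        "ancestry": ["EUR", "AFR", "EUR", "AFR"],
        "height": [162.0, 164.0, 176.0, 178.0],
    }
)

# Sample count, mean, and standard deviation of height within each sex
height_by_sex = toy.groupby("sex")["height"].agg(["count", "mean", "std"])
print(height_by_sex)
```

The same pattern works for any grouping column in the table, such as `ancestry` or `population`.
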
## Generating distribution plots

<img src="https://raw.githubusercontent.com/tmajaria/ashg_2019_workshop/master/ldl_kdplot.png" align="left" width="20%">

***Univariate distributions*** are easily visualized as histograms or density plots. We provide a function (<font color='red'>kdPlot</font>) that generates both types of plot, overlaid in a single figure. Pass a continuous variable corresponding to a column in the phenotype dataframe as input (*ldl* in this example). The function is called with the following syntax:

```python
utils.kdPlot(samples, var = "ldl")
```

<img src="https://raw.githubusercontent.com/tmajaria/ashg_2019_workshop/master/whr_hdl_bivariateDistributionPlot.png" align="left" width="20%"> 

***Bivariate distributions*** can be visualized using a scatterplot. Use the function <font color='red'>bivariateDistributionPlot</font> to visualize two continuous variables. The *kind* argument determines the type of plot generated and can be one of: "scatter", "reg", "resid", "kde", and "hex".

```python
utils.bivariateDistributionPlot(samples, var1 = "hdl", var2 = "whr", kind = "scatter")
```

### Exercise: Univariate distributions

Use the code cells below to plot the distribution of single variables of your choice (such as ldl or bmi). You may need to refer to the "Load phenotype data" section above for the list of variables and to the "Generating distribution plots" section for the plotting syntax.

In [None]:
utils.kdPlot(samples, var = "ldl")

### Exercise: Bivariate distributions

Generate scatter plots with different combinations of variables. Think about what you would expect versus what you see in the plot. You may need to refer to the "Load phenotype data" section above for the list of variables and to the "Generating distribution plots" section for the plotting syntax.

In [None]:
utils.bivariateDistributionPlot(samples, var1 = "bmi", var2 = "ldl", kind = "scatter")

# Look at IGV
----

Just for fun, let's use the `igv-jupyter` notebook extension to look at genotype information for one sample.


In [None]:
# Get the DRS URLs from the sample table
bam_file = samples.at[0, 'bam_file']
bam_file_index = samples.at[0, 'bam_file_index']

print(f'bam_file: {bam_file}')
print(f'bam_file_index: {bam_file_index}')

In [None]:
# Use terra-notebook-utils to resolve the DRS and download the data
drs.copy_batch([bam_file, bam_file_index], ".")

In [None]:
# Load up IGV

import igv

b = igv.Browser({"genome": "hg38"})

b.load_track(
  {
    "name": "HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam",
    "url": "HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam",
    "indexURL": "HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20101123.bam.bai",
    "format": "bam",
    "type": "alignment"
  }
)

b.search('chr20:32,214,217-32,229,950')

b.show()