# Retrieving data using `dxdata` and plotting results

***
This notebook is delivered "As-Is". Notwithstanding anything to the contrary, DNAnexus will have no warranty, support, liability or other obligations with respect to Materials provided hereunder.



[MIT License](https://github.com/dnanexus/OpenBio/blob/master/LICENSE.md) applies to this notebook.
***

## Introduction
This notebook shows how to use basic `dxdata` functions in a UKB RAP dispensed project to explore the dataset and plot correlations between phenotypic traits.

## JupyterLab app details (launch configuration)

### Recommended configuration
- runtime: < 10 min
- cluster configuration: `Spark cluster`
- number of nodes: 2
- recommended instance: `mem1_ssd1_v2_x4`
- cost: < £0.05


### Performance comparison
- **mem1_ssd1_v2_x4, Spark cluster, 2 nodes**:    
    - runtime: < 10 min
    - cost: < £0.05
- mem1_ssd1_v2_x16, Spark cluster, 2 nodes:
    - runtime: < 10 min
    - cost: < £0.2

### Import and initialize some packages

In [None]:
import databricks.koalas as ks
import dxdata
import dxpy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

### Open UK Biobank dataset

In [None]:
dispensed_dataset = dxpy.find_one_data_object(
    typename='Dataset', 
    name='app*.dataset', 
    folder='/', 
    name_mode='glob')
dispensed_dataset_id = dispensed_dataset['id']
dataset = dxdata.load_dataset(id=dispensed_dataset_id)

## Phenotype data

### Before we begin, a little reminder about terminology...


<table style="float:left; text-align: center; border: 1px solid black">   
    <tr style="border-bottom: 1px solid black"><td style="border-right: 1px solid black">Field name/Column name</td><td style="border-right: 1px solid black">Field title</td></tr>
    <tr><td style="border-right: 1px solid black">eid</td><td style="border-right: 1px solid black">Participant ID</td></tr>
    <tr><td style="border-right: 1px solid black">p31</td><td style="border-right: 1px solid black">Sex</td></tr>
    <tr><td style="border-right: 1px solid black">p84_i0_a3</td><td style="border-right: 1px solid black">Cancer year/age first occurred | Instance 0 | Array 3</td></tr>
</table>
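
The column names in the table follow a visible pattern: `p` plus the field ID, optionally followed by `_i<instance>` and `_a<array index>`. The sketch below splits such names into their parts. The pattern is inferred only from the three examples above, and the helper `parse_field_name` is hypothetical, not part of `dxdata`.

```python
import re

# Pattern inferred from the example names in the table; an assumption, not a dxdata API.
FIELD_NAME_RE = re.compile(
    r'^p(?P<field>\d+)(?:_i(?P<instance>\d+))?(?:_a(?P<array>\d+))?$'
)


def parse_field_name(name):
    """Split a column name such as 'p84_i0_a3' into its numeric parts."""
    match = FIELD_NAME_RE.match(name)
    if match is None:
        return None  # e.g. 'eid', which carries no field ID
    # Parts that are absent from the name (no instance, no array) become None.
    return {
        key: int(value) if value is not None else None
        for key, value in match.groupdict().items()
    }


print(parse_field_name('p84_i0_a3'))  # {'field': 84, 'instance': 0, 'array': 3}
print(parse_field_name('p31'))        # {'field': 31, 'instance': None, 'array': None}
print(parse_field_name('eid'))        # None
```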



Select the fields to extract into a DataFrame.

In [None]:
pheno = dataset['participant']

# Find by field name
field_eid = pheno.find_field(name='eid')

# Find by exact title
field_sex = pheno.find_field(title='Sex')
field_age = pheno.find_field(title='Age at recruitment')
field_own_rent = pheno.find_field(title='Own or rent accommodation lived in | Instance 0')

# Find by title pattern
# Raw string: '\|' must reach the regex engine as an escaped, literal pipe
pattern = r'Length of time at current address \| Instance [0-2]'
fields_len = list(pheno.find_fields(title_regex=pattern))
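
To see what the title pattern selects, it can be tried against a few titles with Python's `re` module. This sketch assumes that `title_regex` uses ordinary regular-expression syntax and is matched against the whole title. The sample titles are illustrative and are not read from the dataset.

```python
import re

# Raw string so that '\|' is passed to the regex engine as a literal pipe.
pattern = r'Length of time at current address \| Instance [0-2]'

# Illustrative titles only; real titles come from the dataset.
sample_titles = [
    'Length of time at current address | Instance 0',
    'Length of time at current address | Instance 2',
    'Length of time at current address | Instance 3',
    'Age at recruitment',
]

# Keep the titles that the pattern matches in full.
matched = [title for title in sample_titles if re.fullmatch(pattern, title)]
print(matched)
```

Only the first two sample titles are kept: the character class `[0-2]` rejects instance 3, and the unrelated title does not match at all.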

### Extract phenotype data for selected fields

The `retrieve_fields` method of the `participant` entity (`pheno` above) constructs a Spark DataFrame of the given fields.

By default, this retrieves data as encoded by UK Biobank. For example, field `p31` (participant's sex) will be returned as an integer column with values of 0 and 1. To receive decoded values, supply the `coding_values='replace'` argument.

For more information, see [Tips for Retrieving Fields](https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/working-with-ukb-data#tips-for-retrieving-fields) in the documentation.
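
The difference between encoded and decoded values can be pictured with plain pandas. This is only a sketch of the idea, not how `dxdata` implements `coding_values='replace'`. The toy frame and the code-to-label mapping are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for retrieved data; the values are made up.
encoded = pd.DataFrame({'eid': [1, 2, 3], 'p31': [0, 1, 0]})

# Assumed code-to-label mapping for illustration; check the field's data coding.
sex_coding = {0: 'Female', 1: 'Male'}

# Replace each integer code with its label, leaving the original frame untouched.
decoded = encoded.assign(p31=encoded['p31'].map(sex_coding))
print(decoded)
```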

In [None]:
# Final list of fields
field_list = [field_eid, field_sex, field_own_rent, field_age] + fields_len

# Extract data
pheno_data = pheno.retrieve_fields(fields=field_list, engine=dxdata.connect()).to_koalas()

# See first five entries
pheno_data.head()

List the name and title of each selected column.

In [None]:
pd.DataFrame(
    {
        'Name': [f.name for f in field_list],
        'Title': [f.title for f in field_list]
    }
)

### Summarize data

In [None]:
pheno_data.describe()

### Get averages and group counts by sex

In [None]:
# Show average of numeric columns (age, own or rent accommodation lived in, length of time at current address) by sex
pheno_data.groupby('p31').mean()

In [None]:
# Show counts of type of accommodation lived in grouped by sex
pheno_data.groupby('p31')['p680_i0'].value_counts().unstack()
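
The `groupby(...).value_counts().unstack()` chain turns per-group counts into a table with one row per group and one column per category. Below is a small pandas sketch on made-up values. The koalas call above is assumed to behave the same way on the real columns.

```python
import pandas as pd

# Made-up codes: 'p31' is the grouping column, 'p680_i0' the counted category.
toy = pd.DataFrame({
    'p31': [0, 0, 0, 1, 1],
    'p680_i0': [1, 1, 2, 1, 2],
})

# Count each category within each group, then pivot categories into columns.
counts = toy.groupby('p31')['p680_i0'].value_counts().unstack()
print(counts)
```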

### Visually display correlation

In [None]:
len_address_inst0 = pheno_data.p699_i0.to_numpy()
len_address_inst1 = pheno_data.p699_i1.to_numpy()
age = pheno_data.p21022.to_numpy()

In [None]:
# Plot length of time at current address instance 0 against instance 1
ax = sns.jointplot(x=len_address_inst0, y=len_address_inst1, kind='scatter', space=0, color='black', alpha=0.1, s=4)
ax.set_axis_labels(fields_len[0].title, fields_len[1].title, fontsize=16)

In [None]:
# Plot age against length of time at current address
ax = sns.jointplot(x=age, y=len_address_inst0, kind='kde')
ax.set_axis_labels(field_age.title, fields_len[0].title, fontsize=16)
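
The plots show the relationship visually; a single number can complement them. Below is a minimal sketch of Pearson's correlation coefficient with NumPy, dropping pairs in which either value is missing. The arrays are made up; in this notebook they would be `age` and `len_address_inst0`. The helper name `pearson_r` is hypothetical.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation of two arrays, ignoring pairs that contain NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Keep only positions where both values are present.
    keep = ~(np.isnan(x) | np.isnan(y))
    return np.corrcoef(x[keep], y[keep])[0, 1]


# Made-up values with one missing entry in each array.
x = np.array([40.0, 50.0, np.nan, 60.0, 70.0])
y = np.array([5.0, 10.0, 12.0, np.nan, 20.0])
print(pearson_r(x, y))
```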