# Explore participant data

> This notebook explains how to explore the phenotypic data table and retrieve information about its fields

- runtime: 10min 
- recommended instance: mem1_ssd1_v2_x8
- cost: <£0.10

This notebook depends on:
* **A Spark instance**

In this notebook, we will dive deeper into the phenotypic data stored in the Spark database.
We will retrieve information about the fields and learn how to get each field's ID, title, and link to the UK Biobank Showcase, which provides more details and basic statistics about the field.

## Import `dxdata` package and initialize Spark engine
### Docs at: https://github.com/dnanexus/OpenBio/blob/master/dxdata/getting_started_with_dxdata.ipynb

In [1]:
import dxdata
import os

# Initialize dxdata engine
engine = dxdata.connect(dialect="hive+pyspark")

## Connect to the dataset

Next, we set a `DATASET_ID` variable, which takes a value of the form `[projectID]:[dataset ID]`.
We then pass it to the `dxdata.load_dataset` function to define the `dataset`.

**projectID** and **dataset ID** values are unique to your project.
Notebook example **101** explains how to get them.

In [2]:
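# Read the current project ID from the `dx` environment
project = os.popen("dx env | grep project- | awk -F '\t' '{print $2}'").read().rstrip()
# Take the first record ID reported for objects whose names end in "dataset"
record = os.popen("dx describe *dataset | grep  record- | awk -F ' ' '{print $2}'").read().rstrip().split('\n')[0]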
DATASET_ID = project + ":" + record
dataset = dxdata.load_dataset(id=DATASET_ID)
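
The shell pipelines above return an empty string when nothing matches, in which case `DATASET_ID` would be just `":"`. A minimal sketch of a check on the `[projectID]:[dataset ID]` format follows. The helper name `make_dataset_id` and the placeholder IDs are hypothetical and are not part of `dxdata`.

```python
def make_dataset_id(project_id: str, record_id: str) -> str:
    """Join a project ID and a dataset record ID as 'project-...:record-...'."""
    # The prefixes mirror the `grep project-` and `grep record-` filters used above.
    if not project_id.startswith("project-"):
        raise ValueError(f"expected a 'project-' ID, got {project_id!r}")
    if not record_id.startswith("record-"):
        raise ValueError(f"expected a 'record-' ID, got {record_id!r}")
    return f"{project_id}:{record_id}"


# Placeholder IDs for illustration only; real IDs are specific to your project.
example_id = make_dataset_id("project-XXXX", "record-YYYY")
print(example_id)  # project-XXXX:record-YYYY
```

In the notebook, the `project` and `record` variables could be passed through such a check before calling `dxdata.load_dataset`.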

## Retrieve data from the table

The following code selects the `participant` table.
Then we can define which field we are interested in using the `find_field` function.

There are three main ways to identify the field of interest:

- With the `name` argument: here we give the field ID. We can construct the field ID used by the `dxdata` package from the field ID defined by the UKB Showcase. The numeric Showcase ID is translated to the Spark DB column name by adding the letter `p` at the beginning: e.g. the *Standing height* Showcase ID is `50`, so the Spark ID is `p50`. Fields often have multiple instances. In that case, we add the `_i` suffix followed by the instance number, e.g. *Standing height | Instance 0* becomes `p50_i0`.
- With the `title` argument: here we define the field by its full title, including the ` | Instance N` suffix when the field has instances, e.g. `Age at recruitment` or `Standing height | Instance 0`.
- With the `title_regex` argument: here we define the field by a [regular expression](https://docs.python.org/3/howto/regex.html) matching part of the title. We can use a keyword here, e.g. `.*height.*` will return all columns with the word *height* in the title.
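
The naming rule for the `name` argument can be sketched as a small helper. The function `spark_field_name` is hypothetical (it is not part of `dxdata`) and covers only the two cases described above: a bare field ID and a field ID with an instance number.

```python
from typing import Optional


def spark_field_name(showcase_id: int, instance: Optional[int] = None) -> str:
    """Build a Spark column name from a UKB Showcase field ID."""
    name = f"p{showcase_id}"        # prefix the numeric Showcase ID with "p"
    if instance is not None:
        name += f"_i{instance}"     # append "_i" and the instance number
    return name


print(spark_field_name(50))     # p50
print(spark_field_name(50, 0))  # p50_i0
```

The result could then be passed as `name=` to `find_field`, assuming a column with that name exists in the `participant` table.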

In [3]:
pheno = dataset['participant']

# Find by field name
field_eid = pheno.find_field(name="eid")

# Find by exact title
field_sex = pheno.find_field(title="Sex")
field_age = pheno.find_field(title="Age at recruitment")
field_height = pheno.find_field(title="Standing height | Instance 0")

# Find by title pattern
pattern = ".*height.*"
fields_height = list(pheno.find_fields(title_regex=pattern))
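
To get a feel for what a `title_regex` pattern selects before querying the dataset, the same pattern can be tried on a few titles with Python's `re` module. This is only a sketch: the titles are ones that appear in this notebook, and how `dxdata` applies the pattern (anchoring, case sensitivity) is an assumption here. The example uses a case-sensitive full match.

```python
import re

# Sample titles taken from this notebook; not fetched from the dataset.
sample_titles = [
    "Sex",
    "Age at recruitment",
    "Standing height | Instance 0",
    "Seated height | Instance 0",
    "Comparative height size at age 10 | Instance 0",
]

height_re = re.compile(".*height.*")

# Keep only the titles that the whole pattern matches (case-sensitive).
matching_titles = [t for t in sample_titles if height_re.fullmatch(t)]

for title in matching_titles:
    print(title)
```

Under these assumptions, only the three titles containing the lowercase word "height" are printed. A title spelled with a capital "Height" would need a pattern such as `.*[Hh]eight.*`.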


## Explore the dataset

In the last line of the previous step, we stored all the fields with the word "height" in the title in the `fields_height` variable. Let's explore these fields now.

Here we use a for loop to iterate through the fields in `fields_height`, printing the field ID, title, and link to the UKB showcase. In this way, we can find fields that will be useful for further analyses.


In [4]:
for field in fields_height:
    # Print the Spark field name, the title, and the UKB Showcase link
    print(f"[{field.name}]\t{field.title} ({field.linkout})")
    

[p50_i0]	Standing height | Instance 0 (http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=50)
[p50_i1]	Standing height | Instance 1 (http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=50)
[p50_i2]	Standing height | Instance 2 (http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=50)
[p50_i3]	Standing height | Instance 3 (http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=50)
[p51_i0]	Seated height | Instance 0 (http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=51)
[p51_i1]	Seated height | Instance 1 (http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=51)
[p51_i2]	Seated height | Instance 2 (http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=51)
[p51_i3]	Seated height | Instance 3 (http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=51)
[p1697_i0]	Comparative height size at age 10 | Instance 0 (http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=1697)
[p1697_i1]	Comparative height size at age 10 | Instance 1 (http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=1697)
[p1697_i2]	Comparative height size