# Explore phenotype tables

> This notebook explains how to connect to the phenotypic database and retrieve information about the available tables using Python.

- runtime: 5min 
- recommended instance: mem1_ssd1_v2_x8
- cost: <£0.10

This notebook depends on:
* **A Spark instance**

This notebook describes the basics of connecting to phenotype databases and exploring tables and fields.
We will use the `dxdata.connect` function to initiate a connection to the database.
Next, we will learn how to obtain the project and dataset IDs required to load a dataset.
We will iterate through the tables in the dataset and obtain a short description of each table.
Finally, we will retrieve the data from one of these tables into local memory, inspect the content, and print the first few rows.


## Import the `dxdata` package for later use with the Spark engine
### Docs at: https://github.com/dnanexus/OpenBio/blob/master/dxdata/getting_started_with_dxdata.ipynb

In [1]:
import dxdata
import os

## Connect to the dataset

In the next step, we need your project ID and dataset ID.
Use the following shell commands, called from Python, to find the values specific to your project:

In [None]:
project = os.popen("dx env | grep project- | awk -F '\t' '{print $2}'").read().rstrip()
project

In [None]:
record = os.popen("dx describe *dataset | grep  record- | awk -F ' ' '{print $2}'").read().rstrip().split('\n')[0]
record
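
The two cells above rely on `grep` and `awk` to pull an ID out of the `dx` output. As a minimal sketch of the same idea in plain Python, the helper below extracts the first ID with a given prefix using a regular expression. The helper names `first_id` and `run_dx` are hypothetical, the sample text is a made-up placeholder rather than real `dx env` output, and the pattern assumes that IDs consist of a prefix, a hyphen, and alphanumeric characters.

```python
import re
import subprocess


def first_id(text, prefix):
    """Return the first ID of the form '<prefix>-<alphanumerics>' in text, or None."""
    match = re.search(rf"\b{re.escape(prefix)}-[0-9A-Za-z]+", text)
    return match.group(0) if match else None


def run_dx(*args):
    """Run a dx subcommand and return its standard output (requires the dx CLI)."""
    result = subprocess.run(["dx", *args], capture_output=True, text=True, check=True)
    return result.stdout


# Illustrative text only; the ID below is a made-up placeholder.
sample = "Current workspace\tproject-AAAABBBBCCCCDDDDEEEEFFFF\nCurrent folder\t/"
print(first_id(sample, "project"))
```

On an instance where the `dx` CLI is available, something like `first_id(run_dx("env"), "project")` should give the same value as the `project` cell above. A `None` result makes a missing ID easier to spot than an empty string.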

Next, we set a `DATASET_ID` variable in the form `[project ID]:[dataset ID]`.
We pass it to the `dxdata.load_dataset` function to define `dataset`.

In [None]:
DATASET_ID = project + ":" + record
DATASET_ID
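
If either shell command returns nothing, the concatenation above still succeeds and the problem only surfaces later, when the dataset is loaded. The sketch below shows one way to fail early. The function name `build_dataset_id` is hypothetical, and the check assumes only that project IDs start with `project-` and record IDs with `record-`, as in the commands above.

```python
def build_dataset_id(project, record):
    """Join a project ID and a record ID into 'project-...:record-...'."""
    if not project.startswith("project-"):
        raise ValueError(f"unexpected project ID: {project!r}")
    if not record.startswith("record-"):
        raise ValueError(f"unexpected record ID: {record!r}")
    return f"{project}:{record}"


# Placeholder IDs for illustration only.
print(build_dataset_id("project-AAAA", "record-BBBB"))
```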

In [5]:
#load dataset
dataset = dxdata.load_dataset(id=DATASET_ID)

## Explore the dataset

In this step, we iterate through the tables in `dataset`, printing the table ID, title, and short description.
For example, the `participant` table contains general UK Biobank participant data. 
Other tables contain specific information, like hospitalization records, 
death records, GP registration, and COVID-19 results. 
Different tables might be available in your project: you will see only the tables associated with fields approved in your application.
See more info in UK Biobank Docs [here](https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/working-with-ukb-data).

In [6]:
for _ in dataset.entities:
    print('-> ' + _.entity_label_singular + ' [' + _.name + ']' )
    print(_.entity_description)

-> Participant [participant]

-> Death Cause Record [death_cause]

-> Hospitalization Record [hesin]

-> Hospital Critical Care Record [hesin_critical]

-> Hospital Delivery Record [hesin_delivery]

-> Death Record [death]

-> Hospital Maternity Record [hesin_maternity]

-> Hospital Operation Record [hesin_oper]

-> Hospital Diagnosis Record [hesin_diag]

-> Hospital Psychiatric Detention Record [hesin_psych]

-> COVID19 Test Result Record (England) [covid19_result_england]

-> COVID19 Test Result Record (Scotland) [covid19_result_scotland]

-> COVID19 Test Result Record (Wales) [covid19_result_wales]

-> GP Clinical Event Record [gp_clinical]

-> GP Registration Record [gp_registrations]

-> GP Prescription Record [gp_scripts]
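
When the list of tables is long, it can be handier to collect the same attributes into a list of dictionaries than to print them. The sketch below uses only the attributes read in the loop above (`name`, `entity_label_singular`, `entity_description`). The function name `summarize_entities` is hypothetical, and the `SimpleNamespace` objects are stand-ins so the example runs without a Spark session.

```python
from types import SimpleNamespace


def summarize_entities(entities):
    """Return one dict per entity with its name, label, and description."""
    return [
        {
            "name": e.name,
            "label": e.entity_label_singular,
            "description": e.entity_description or "",
        }
        for e in entities
    ]


# Stand-in objects that mimic the attributes used in the loop above.
demo = [
    SimpleNamespace(name="participant", entity_label_singular="Participant",
                    entity_description=None),
    SimpleNamespace(name="hesin", entity_label_singular="Hospitalization Record",
                    entity_description=None),
]
for row in summarize_entities(demo):
    print(f"{row['name']:<12} {row['label']}")
```

In the notebook, `summarize_entities(dataset.entities)` would be the corresponding call, and the result can be filtered or passed to `pandas.DataFrame` for display.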



## Retrieve data from the table

The following steps select the `participant` table, retrieve data into local memory,
and convert it to a [pandas](https://pandas.pydata.org/) DataFrame.

In [7]:
participant = dataset['participant']

Here we select the first five fields of the `participant` table:

In [8]:
fields = participant.fields[0:5]
fields

[<Field "eid">,
 <Field "p3_i0">,
 <Field "p3_i1">,
 <Field "p3_i2">,
 <Field "p3_i3">]

Next, we retrieve the data for these fields; the conversion to a pandas DataFrame happens in a later step.

We can limit the number of rows retrieved. In the following example, we retrieve the first 100 rows.

In [None]:
participant_data = participant.retrieve_fields(engine=dxdata.connect(), fields=fields, coding_values="replace", limit=100)

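The step above returns a PySpark DataFrame, a distributed dataset.

We inspect its columns and row count with the following commands. In PySpark, `describe()` returns another DataFrame of summary statistics, which is why the output below lists only column names and types; call `.show()` on the result to print the statistics themselves.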

In [11]:
participant_data.describe()

                                                                                

DataFrame[summary: string, eid: string, p3_i0: string, p3_i1: string, p3_i2: string, p3_i3: string]

In [12]:
participant_data.count()

100

The command below collects distributed data from PySpark DataFrame and converts it to a pandas table:

In [12]:
participant_data_pd = participant_data.toPandas()

Finally, we rename the columns to reflect column titles rather than IDs.

In [13]:
colnames = [_.title for _ in fields]
participant_data_pd.columns = colnames
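
Assigning `colnames` by position works as long as the column order matches the order of `fields`. A slightly more defensive variant, sketched below, builds a name-to-title mapping and falls back to the field name when a title is missing or already used. The function name `title_mapping` is hypothetical, and the sketch assumes that each field exposes a `name` attribute matching the column ID, alongside the `title` attribute used above. The `SimpleNamespace` objects are stand-ins for real fields.

```python
from types import SimpleNamespace


def title_mapping(fields):
    """Map each field name to its title; keep the name if the title is missing or repeated."""
    mapping, seen = {}, set()
    for f in fields:
        title = f.title
        if not title or title in seen:
            title = f.name
        seen.add(title)
        mapping[f.name] = title
    return mapping


# Stand-in fields using the names and titles shown in this notebook.
demo_fields = [
    SimpleNamespace(name="eid", title="Participant ID"),
    SimpleNamespace(name="p3_i0", title="Verbal interview duration | Instance 0"),
]
print(title_mapping(demo_fields))
```

With such a mapping, `participant_data_pd.rename(columns=title_mapping(fields))` renames columns by ID rather than by position.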


## Inspect the content of the table

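Here we print some basic information about the data in the `participant_data_pd` table.
`count` shows the number of non-missing values in each column, `unique` shows the number of distinct values (here, distinct participant IDs),
`top` shows the most common value, and `freq` counts how many times that value occurs.
For numeric columns, `describe` reports the mean, standard deviation, minimum, and quartiles instead.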

In [14]:
participant_data_pd.describe(include='all')

| | Participant ID | Verbal interview duration \| Instance 0 | Verbal interview duration \| Instance 1 | Verbal interview duration \| Instance 2 | Verbal interview duration \| Instance 3 |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 4.0 | 12.0 | 1.0 |
| unique | 100.0 | | | | |
| top | 5982304.0 | | | | |
| freq | 1.0 | | | | |
| mean | | 492.93 | 589.75 | 596.916667 | 591.0 |
| std | | 251.554491 | 214.454929 | 228.428807 | |
| min | | 162.0 | 418.0 | 256.0 | 591.0 |
| 25% | | 330.25 | 484.0 | 485.0 | 591.0 |
| 50% | | 425.0 | 519.0 | 561.0 | 591.0 |
| 75% | | 552.25 | 624.75 | 730.5 | 591.0 |
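
In the output above, a `count` of 4.0 for Instance 1 means that only 4 of the 100 retrieved rows have a value in that column. To make the meaning of `count`, `unique`, `top`, and `freq` concrete, the sketch below recomputes them for a single column in plain Python. It illustrates the definitions only and is not how pandas implements `describe`; the function name `summarize_column` is hypothetical, and missing values are represented here by `None`.

```python
from collections import Counter


def summarize_column(values):
    """Return count, unique, top, and freq for a column, ignoring missing values (None)."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top, freq = counts.most_common(1)[0] if counts else (None, 0)
    return {"count": len(present), "unique": len(counts), "top": top, "freq": freq}


print(summarize_column(["a", "b", "a", None]))
# {'count': 3, 'unique': 2, 'top': 'a', 'freq': 2}
```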


## Print the top rows of the data

The `head` method displays the first five rows of the retrieved table. Here we drop the `Participant ID` column before displaying them:

In [15]:
participant_data_pd.drop(['Participant ID'], axis=1).head()

| | Verbal interview duration \| Instance 0 | Verbal interview duration \| Instance 1 | Verbal interview duration \| Instance 2 | Verbal interview duration \| Instance 3 |
|---|---|---|---|---|
| 0 | 577 | | | |
| 1 | 439 | | | |
| 2 | 1547 | | | |
| 3 | 349 | | | |
| 4 | 508 | | | |
