# Explore phenotype tables and data in R with the reticulate package

> Scope: This notebook shows how to explore participant table metadata in R.

Run info: 
- runtime: 15min 
- recommended instance: mem1_ssd1_v2_x8
- estimated cost: <£0.20

This notebook depends on:
* **NA**

This notebook describes the basics of connecting to phenotype databases and exploring tables and fields.
We will use the `reticulate` R package to connect to Python and call the `dxdata.load_dataset` function.
Next, we will learn how to convert Python objects to an R object (a tibble) and work with it using the `dplyr` package.
We will browse the available tables to get a short description of each one.
Finally, we will iterate across all field descriptors in the `participant` table, retrieve the field codes, and save this information to an `.Rdata` file.

## Install required packages

The `p_load` function from `pacman` loads packages into R.
If a requested package is missing, `p_load` installs it automatically. This can take a considerable amount of time for a package that needs C or Fortran code to be compiled.

The following packages are needed to run this notebook:

- `reticulate` - R-Python interface, required to use the `dxdata` package and retrieve phenotypic data
- `dplyr` - tabular data manipulation in R, required for pre-processing, encoding and filtering of phenotypic data
- `parallel` - parallel computation in R
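
The install-if-missing idea can be sketched in base R as follows. This is only an illustration of the concept, not how `pacman` is implemented; the helper name `missing_packages` is hypothetical.

```r
# Hypothetical helper: report which of the requested packages are not installed.
# requireNamespace() returns FALSE (instead of raising an error) when a package is absent.
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    unname(pkgs[!installed])
}

needed <- c("reticulate", "dplyr", "parallel")
to_install <- missing_packages(needed)

# Install only what is missing (commented out so the sketch has no side effects).
# if (length(to_install) > 0) install.packages(to_install)
print(to_install)
```

Checking first and installing only the missing packages avoids recompiling packages that are already available on the instance.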

In [None]:
message('Installing packages...')
if(!require(pacman)) install.packages("pacman")
pacman::p_load(reticulate, dplyr, parallel)

# Set Python environment explicitly
reticulate::use_python("/opt/conda/bin/python3", required = TRUE) 

## Import dxdata package

See the `dxdata` getting-started notebook: https://github.com/dnanexus/OpenBio/blob/master/dxdata/getting_started_with_dxdata.ipynb

In [2]:
dxdata <- import("dxdata")

## Connect to the dataset

Next, we set a `DATASET_ID` variable in the form `[projectID]:[dataset ID]`.
We pass it to the `dxdata.load_dataset` function to define the `dataset` object.

The **projectID** and **dataset ID** values are unique to your project.
Notebook example **101** explains how to get them.
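
Because the record ID comes from a shell pipeline, it can be empty when no dataset is found, and `paste0` would then quietly return a malformed ID such as `"project-XXXX:"`. A minimal sketch of a guard is shown below. The helper `make_dataset_id` is hypothetical, the IDs are placeholders, and the `project-` and `record-` prefixes are assumptions about the ID format.

```r
# Hypothetical helper: combine a project ID and a record ID into "project:record".
make_dataset_id <- function(project, record) {
    # Assumption: IDs start with "project-" and "record-" respectively.
    stopifnot(
        length(project) == 1, length(record) == 1,
        startsWith(project, "project-"),
        startsWith(record, "record-")
    )
    paste0(project, ":", record)
}

# Placeholder IDs, not real ones.
print(make_dataset_id("project-XXXX", "record-YYYY"))
```

Failing early here gives a clearer error than a failed dataset lookup later on.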

In [None]:
project <- Sys.getenv('DX_PROJECT_CONTEXT_ID')
record <- system("dx find data --type Dataset --delimiter ',' | sort -t ',' -k2,2r | head -n1 | awk -F ',' '{print $5}'", intern = TRUE)
DATASET_ID <- paste0(project, ":", record)
dataset <- dxdata$load_dataset(id=DATASET_ID)
DATASET_ID

## Explore the dataset

In this step, we iterate through the tables in the `dataset` with the `lapply` function.
We extract table names and short descriptions from the dataset metadata.
Then we construct a `tibble` object, which can be previewed or exported to a tabular format.

For example, the `participant` table contains general UK Biobank participant data.
Other tables contain specific information, such as hospitalization records,
death records, GP registration, and COVID-19 results.
Different tables might be available in your project: you will see only the tables associated with fields approved in your application.
See the [UK Biobank RAP documentation](https://dnanexus.gitbook.io/uk-biobank-rap/getting-started/working-with-ukb-data) for more information.
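
The `do.call(rbind, lapply(...))` pattern can be tried on a plain R list first. The sketch below uses a mock list in place of `dataset$entities_by_name`, with made-up descriptions. One detail worth knowing is that `c()` drops `NULL` elements, so a missing description would yield a shorter vector; the hypothetical helper `or_empty` guards against that.

```r
# Mock stand-in for dataset$entities_by_name (the real object comes from dxdata).
entities <- list(
    participant = list(name = "participant", entity_description = NULL),
    hesin = list(name = "hesin", entity_description = "mock description")
)

# Hypothetical helper: replace NULL with an empty string, since c() would drop it.
or_empty <- function(x) if (is.null(x)) "" else x

rows <- lapply(entities, function(x) {
    c(name = x$name, description = or_empty(x$entity_description))
})

# rbind stacks the equal-length named vectors into a character matrix,
# which is then converted to a data frame with one row per table.
tables_demo <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(tables_demo)
```

The sketch uses a base R data frame to stay dependency-free; `as_tibble` can be applied to the same matrix.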

In [4]:
tables <- as_tibble(do.call(rbind, lapply(dataset$entities_by_name, function(x) {
    return(c(
        name = x$name, 
        description = x$entity_description
    ))
}))) 

In [5]:
t(tables)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
name,participant,death_cause,hesin,hesin_critical,hesin_delivery,death,hesin_maternity,hesin_oper,hesin_diag,hesin_psych,covid19_result_england,covid19_result_scotland,covid19_result_wales,gp_clinical,gp_registrations,gp_scripts
description,,,,,,,,,,,,,,,,


## Retrieve table metadata

The following cells select the `participant` table and retrieve its metadata into local memory.
We iterate through the fields in the `participant` table with the `mclapply` function from the `parallel` package.
At each step, we extract the field name and the following field information:
- title
- type
- units
- path
- coding
- linkout

Next, we construct a `tibble` object, which can be previewed or exported to a tabular format.
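
The coding is the least obvious of these values: it is a dictionary of code/meaning pairs that has to be flattened into a single string per field. A minimal sketch with a mock dictionary is shown below; `flatten_codes` is a hypothetical helper, and it uses `sep` rather than a separate `' => '` argument so that no extra spaces are added around the arrow.

```r
# Mock coding dictionary; in the notebook this comes from x$coding$codes.
codes <- list("1" = "Yes", "0" = "No")

# Hypothetical helper: join "code => meaning" pairs, or return "" when there is no coding.
flatten_codes <- function(codes) {
    if (length(codes) == 0) return("")
    paste(names(codes), unlist(codes), sep = " => ", collapse = "; ")
}

print(flatten_codes(codes))   # "1 => Yes; 0 => No"
print(flatten_codes(list()))  # ""
```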



In [6]:
pheno <- dataset$entities_by_name[['participant']]

In [7]:
fields_table <- as_tibble(do.call(rbind, mclapply(pheno$fields, function(x) {
    
    codes <- x$coding$codes
    
    if(length(codes)) {
        coding <- paste(names(codes), ' => ', unlist(codes), collapse = '; ')
    } else {
        coding <- ''
    }
    
    return(c(
        name = x$name, 
        title = x$title,
        type = x$type,
        units = paste(x$units, collapse = '; '),
        path = paste(x$folder_path, collapse=' -> '),
        coding = coding,
        linkout = paste(x$linkout, collapse = '; ')
    ))
}, mc.cores = parallel::detectCores())))

In [8]:
head(fields_table)

name,title,type,units,path,coding,linkout
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
eid,Participant ID,string,,Participant Information,,
p3_i0,Verbal interview duration | Instance 0,integer,seconds,Assessment centre -> Procedural metrics -> Process durations,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=3
p3_i1,Verbal interview duration | Instance 1,integer,seconds,Assessment centre -> Procedural metrics -> Process durations,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=3
p3_i2,Verbal interview duration | Instance 2,integer,seconds,Assessment centre -> Procedural metrics -> Process durations,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=3
p3_i3,Verbal interview duration | Instance 3,integer,seconds,Assessment centre -> Procedural metrics -> Process durations,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=3
p4_i0,Biometrics duration | Instance 0,integer,seconds,Assessment centre -> Procedural metrics -> Process durations,,http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=4
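
The field names in this preview appear to follow a `p<field>_i<instance>` pattern, and the numeric field ID matches the `id=` value in the linkout. Under that assumption, the field ID and instance can be split out with regular expressions. `parse_field_name` is a hypothetical helper, and names that do not match the pattern (such as `eid`) yield `NA`.

```r
# Hypothetical helper: split names such as "p3_i0" into field ID and instance.
parse_field_name <- function(x) {
    # Return the digits captured by `pattern`, or NA where the name does not match.
    grab <- function(pattern) {
        hit <- grepl(pattern, x)
        out <- rep(NA_integer_, length(x))
        out[hit] <- as.integer(sub(pattern, "\\1", x[hit]))
        out
    }
    data.frame(
        name     = x,
        field_id = grab("^p([0-9]+)(_.*)?$"),
        instance = grab("^p[0-9]+_i([0-9]+)(_.*)?$"),
        stringsAsFactors = FALSE
    )
}

parsed <- parse_field_name(c("eid", "p3_i0", "p4_i0"))
print(parsed)
```

Splitting the name this way makes it possible to group all instances of one field together.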


## Enumerate coding types

The code below tallies the columns of each data type.
There are five data types in the `participant` table:

- `date` - used to store dates
- `datetime` - used to store date and time
- `double` - used to store real numbers, e.g. participant height
- `integer` - used to store whole numbers and categorical encoded values, e.g. participant ethnicity
- `string` - used for fields that cannot be expressed in any of the above types
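
As a small illustration on mock data, the same tally can be followed by a filter that keeps the fields of one type. The table below is made up for the example (`mock_height` is not a real field name); only the column names mirror the metadata table.

```r
# Mock field metadata; the real table is built from the participant entity.
fields_demo <- data.frame(
    name = c("eid", "p3_i0", "p4_i0", "mock_height"),
    type = c("string", "integer", "integer", "double"),
    stringsAsFactors = FALSE
)

# Count fields per data type.
type_counts <- table(fields_demo$type)
print(type_counts)

# Keep only the integer fields, e.g. to inspect their codings next.
integer_fields <- subset(fields_demo, type == "integer")
print(integer_fields$name)
```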


In [9]:
table(fields_table$type)


    date datetime   double  integer   string 
    2004      231     8570    13241     2082 

## Save participant table metadata as an R export file and upload

In [10]:
save(fields_table, file='field_info_tibble_17999x7.Rdata')

In [11]:
system('dx upload field_info_tibble_17999x7.Rdata --path pheno/')
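
If a plain-text format is preferred, the metadata could also be written to CSV. The sketch below uses a mock table and a temporary directory, and it derives the file name from the table dimensions rather than hard-coding them; the naming scheme is an assumption modelled on the file name above.

```r
# Mock table standing in for fields_table.
fields_demo <- data.frame(
    name = c("eid", "p3_i0"),
    type = c("string", "integer"),
    stringsAsFactors = FALSE
)

# Build the file name from the table dimensions instead of hard-coding them.
file_name <- sprintf("field_info_%dx%d.csv", nrow(fields_demo), ncol(fields_demo))
out_file <- file.path(tempdir(), file_name)
write.csv(fields_demo, out_file, row.names = FALSE)

# Read the file back to confirm the round trip.
restored <- read.csv(out_file, stringsAsFactors = FALSE)
print(identical(dim(restored), dim(fields_demo)))

# Uploading would then mirror the cell above (not run here):
# system(paste("dx upload", shQuote(out_file), "--path pheno/"))
```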