# Export participant data to R

> Scope: This notebook shows how to retrieve and export phenotypic data in R.

Run info: 
- runtime: 10min 
- recommended instance: mem1_ssd1_v2_x8
- estimated cost: <£0.15

This notebook depends on:
* **A Spark instance**

This notebook explains how to retrieve and save phenotypic data for further analyses, such as genome-wide association studies or epidemiological studies.
We will use the `reticulate` R package to connect to Python and call the `dxdata.connect` function, which connects to the Spark database.
Next, we will convert the Python object (a Spark data frame) to an R object (a tibble) via a temporary Parquet file and export the data to a tabular text file. This file can be used as input to external tools, such as PLINK or regenie.

## Install required packages

Function `p_load` from `pacman` loads packages into R.
If a given package is missing, `p_load` will automatically install it. This can take a considerable amount of time for a package that requires compilation of C or Fortran code.
The following packages are needed to run this notebook:

- `reticulate` - R-Python interface, required to use the `dxdata` package, which connects to the Spark database and allows retrieval of phenotypic data
- `dplyr` - tabular data manipulation in R, required to pre-process, encode and filter phenotypic data
- `parallel` - parallel computation in R
- `readr` - reading and writing delimited text files, used here to export the CSV and PLINK phenotype files
- `arrow` - input/output library for Apache Arrow and Parquet files
- `skimr` - provides summary statistics about variables in data frames, tibble objects, data tables and vectors

In [None]:
message('Installing packages...')
if(!require(pacman)) install.packages("pacman")
pacman::p_load(reticulate, dplyr, parallel, readr, skimr, arrow)

## Import the `dxdata` package

In [None]:
dxdata <- import("dxdata")

## Connect to the dataset

Next, we set a `DATASET_ID` variable, which takes a value of the form `[projectID]:[dataset ID]`.
We use it to load the `dataset` with the `dxdata.load_dataset` function.

**projectID** and **dataset ID** values are unique to your project.
Notebook example **101** explains how to get them.
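
As a rough illustration of the expected format, the sketch below builds an ID from made-up values and checks its shape. The example IDs and the helper `is_dataset_id` are hypothetical, and the pattern only assumes a `project-` prefix and a `record-` prefix, each followed by alphanumeric characters.

```r
# Hypothetical IDs for illustration only; real IDs are specific to your project.
example_project <- "project-AAAABBBBCCCCDDDDEEEEFFFF"
example_record  <- "record-GGGGHHHHJJJJKKKKLLLLMMMM"
example_id <- paste0(example_project, ":", example_record)

# Hypothetical helper: checks the assumed "[projectID]:[dataset ID]" shape.
is_dataset_id <- function(x) {
  grepl("^project-[A-Za-z0-9]+:record-[A-Za-z0-9]+$", x)
}

is_dataset_id(example_id)  # TRUE
is_dataset_id(":")         # FALSE: both parts are empty
```

A check of this kind can be applied to `DATASET_ID` before loading the dataset, since a shell command that finds nothing returns an empty result and `paste0` would then produce just `":"`.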

In [None]:
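# Project ID of the current DNAnexus environment
project <- system("dx env | grep project- | awk -F '\t' '{print $2}'", intern = TRUE)
# Record ID of the first object whose name ends in "dataset"
record <- system("dx describe *dataset | grep record- | awk -F ' ' '{print $2}' | head -n 1", intern = TRUE)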
DATASET_ID <- paste0(project, ":", record)
dataset <- dxdata$load_dataset(id=DATASET_ID)

## Select the `participant` table

The following code selects the `participant` table.

In [None]:
pheno <- dataset$entities_by_name[['participant']]

## Select fields from `participant` table


We can define which fields we are interested in using the `find_field` function.

There are three main ways to identify the field of interest:

- With the `name` argument: we give the field ID. The field ID used by the `dxdata` package can be constructed from the field ID defined in the UKB Showcase. The numeric Showcase ID is translated to a Spark DB column name by adding the letter `p` at the beginning: e.g. the Showcase ID of *Standing height* is `50`, so the Spark ID is `p50`. Usually, fields have multiple instances. In such cases, we add the `_i` suffix followed by the instance number, e.g. *Standing height | Instance 0* becomes `p50_i0`.
- With the `title` argument: we define the field by its full title, followed by the ` | Instance N` suffix when the field has instances, e.g. `Age at recruitment` or `Standing height | Instance 0`.
- With the `title_regex` argument: we define the field by a [regular expression](https://docs.python.org/3/howto/regex.html) matching part of the title. We can use a keyword here, e.g. `.*height.*` will return all columns with the word *height* in the title.
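
The naming rule described above can be sketched as a small helper. `showcase_to_name` is a hypothetical function written for this illustration and is not part of `dxdata`. The titles in the second half are a toy vector, used only to show which strings the pattern `.*height.*` matches.

```r
# Hypothetical helper (not part of dxdata): build a Spark column name
# from a UKB Showcase field ID and an optional instance number.
showcase_to_name <- function(field_id, instance = NULL) {
  name <- paste0("p", field_id)
  if (!is.null(instance)) {
    name <- paste0(name, "_i", instance)
  }
  name
}

showcase_to_name(50)     # "p50"
showcase_to_name(50, 0)  # "p50_i0"

# Toy titles, for illustration only.
titles <- c("Standing height | Instance 0",
            "Sitting height | Instance 0",
            "Age at recruitment")

# Keep the titles that contain the word "height".
titles[grepl(".*height.*", titles)]
```

The resulting string can then be passed as the `name` argument, e.g. `pheno$find_field(name = showcase_to_name(50, 0))`.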

In [5]:
fld <- list(
    pheno$find_field(name="eid"),
    pheno$find_field(title="Sex"),
    pheno$find_field(title="Age at recruitment"),
    pheno$find_field(title="Standing height | Instance 0")
)

## Define the Spark engine

In [6]:
engine <- dxdata$connect(dialect="hive+pyspark")
engine

Engine(hive+pyspark:///)

## Retrieve the fields defined in `fld` list

In [7]:
df <- pheno$retrieve_fields(engine=engine, fields=fld, coding_values="replace")

## Write the data to a temporary `parquet` file 

You can learn more about the _parquet_ file format here: [https://parquet.apache.org/](https://parquet.apache.org/)

In [8]:
system('hadoop fs -rm -r -f tmpdf.parquet', intern = TRUE)

In [9]:
df$write$parquet('tmpdf.parquet')

## Copy the temporary _parquet_ file from the distributed file system to the local file system

In [10]:
if(dir.exists('tmpdf.parquet')) unlink("tmpdf.parquet", recursive=TRUE)
system('hadoop fs -copyToLocal tmpdf.parquet', intern = TRUE)

## Read the dataset information into R using the Apache `arrow` package

In [14]:
ds <- arrow::open_dataset('tmpdf.parquet')

## Collect the data from the dataset to R memory

Once collected, the phenotypic data are available as a standard `tibble` object, which can be manipulated using functions from the `tidyverse`.

In [15]:
tbl <- ds %>% collect

In [None]:
skim(tbl)

At this point, you can inspect your data. For example, you can print the first five rows using the `head(tbl, 5)` command (`head(tbl)` prints six rows by default).
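
If only part of the data is needed, `dplyr` verbs can be applied to the Arrow dataset before `collect`, so that only the matching rows and columns are loaded into R memory. The sketch below shows this on a toy dataset written to a temporary directory. The values, the path and the 160 cm threshold are made up for illustration.

```r
library(arrow)
library(dplyr)

# Toy data standing in for the exported phenotypes; values are made up.
toy <- data.frame(eid = 1:4, p50_i0 = c(170.5, NA, 182.0, 158.2))

# Write the toy data as a parquet dataset in a temporary directory.
toy_path <- file.path(tempdir(), "toy.parquet")
arrow::write_dataset(toy, toy_path, format = "parquet")

# Open lazily, narrow down, and only then load the result into memory.
ds_toy <- arrow::open_dataset(toy_path)
tall <- ds_toy %>%
  select(eid, p50_i0) %>%
  filter(p50_i0 > 160) %>%
  collect()

nrow(tall)  # 2: the row with a missing height is dropped by the filter
```

The same pattern would apply to `ds`, for example by filtering on `p50_i0` before calling `collect`.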

## Export data to CSV format

In [17]:
readr::write_csv(tbl, 'pheno_height.csv')

## Export data in PLINK phenotype format

In [18]:
pheno_out <- tbl %>% 
    transmute(
        FID=eid, 
        IID=eid, 
        Y1=as.double(p50_i0)
)

pheno_out %>% write_delim(file = 'ukb_phenotypes_height.txt', delim = ' ')
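
To see what the exported layout looks like, the same steps can be run on a toy table and the file read back as raw lines. This is a minimal sketch: `toy_tbl` and its values are made up, and the file is written to a temporary location rather than the working directory.

```r
library(dplyr)
library(readr)

# Toy data standing in for `tbl`; the IDs and heights are made up.
toy_tbl <- tibble(eid = c("1000001", "1000002"), p50_i0 = c(170.5, NA))

# Same transformation as above: family ID, individual ID, phenotype.
toy_out <- toy_tbl %>%
  transmute(FID = eid, IID = eid, Y1 = as.double(p50_i0))

# Write a space-delimited file; readr writes missing values as "NA" by default.
toy_file <- tempfile(fileext = ".txt")
write_delim(toy_out, file = toy_file, delim = " ")

# The first line is the header "FID IID Y1", followed by one line per participant.
toy_lines <- readLines(toy_file)
toy_lines
```

Participants with a missing height are kept in the file with `NA` in the `Y1` column. Whether to keep or drop them depends on the downstream tool.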

In [None]:
system('dx upload ukb_phenotypes_height.txt --path pheno/')