# Access the Sleep Questionnaire data
This notebook explains how to explore and extract the phenotypic sleep data from the participant data entity on UK Biobank RAP.

runtime: 15min\
recommended instance: mem1_ssd1_v2_x8\
cost: <£0.30\
dependencies:\
The [Retrieve-a-single-specific-field-in-this-Notebook](#Retrieve-a-single-specific-field-in-this-Notebook) section of this notebook requires that your Jupyter notebook instance is started with **Spark** enabled.

In this notebook, we will dive into the phenotypic data stored in the Spark database and retrieve the values and metadata of the sleep questionnaire data fields.

### Import dxdata package and initialize Spark engine
Docs at: https://github.com/dnanexus/OpenBio/blob/master/dxdata/getting_started_with_dxdata.ipynb

In [None]:
import dxdata
import pandas as pd
import os
from datetime import datetime

### Connect to the dataset
Next, we set a `DATASET_ID` variable, which takes a value of the form `[projectID]:[dataset ID]`.\
We pass it to the `dxdata.load_dataset` function to load the dataset.

projectID and dataset ID values are unique to your project. Notebook example 101 explains how to get them.

In [None]:
project = os.getenv('DX_PROJECT_CONTEXT_ID')
record = os.popen("dx find data --type Dataset --delimiter ',' | awk -F ',' '{print $5}'").read().rstrip()
DATASET_ID = project + ":" + record
dataset = dxdata.load_dataset(id=DATASET_ID)
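
The `[projectID]:[dataset ID]` format can be checked before it is passed to `dxdata.load_dataset`. The sketch below is a minimal illustration, not part of the original notebook: the helper name `build_dataset_id` is hypothetical, the IDs are placeholders, and the assumed ID shape (a `project-` or `record-` prefix followed by 24 alphanumeric characters) should be confirmed against your own project. It also guards against the case where `dx find data` returns more than one dataset record.

```python
import re

# Assumed ID shape: "<class>-" followed by 24 alphanumeric characters.
_ID_PATTERN = re.compile(r"^(project|record)-[0-9A-Za-z]{24}$")


def build_dataset_id(project: str, record: str) -> str:
    """Join a project ID and a dataset record ID as '<project>:<record>'."""
    project, record = project.strip(), record.strip()
    # Several datasets in the project would give several lines of output.
    if "\n" in record:
        raise ValueError("More than one dataset record was found; pick one.")
    for value, prefix in ((project, "project-"), (record, "record-")):
        if not (value.startswith(prefix) and _ID_PATTERN.match(value)):
            raise ValueError(f"Unexpected ID format: {value!r}")
    return f"{project}:{record}"


# Placeholder IDs for illustration only.
example_id = build_dataset_id("project-" + "A" * 24, "record-" + "B" * 24)
print(example_id)
```

In the notebook, the same check could be applied to the `project` and `record` variables before building `DATASET_ID`.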

### Identify the relevant fields
The `field.tsv` file contains detailed metadata describing each field. This includes a field title that matches between the Data Showcase and the UK Biobank RAP.

In [None]:
field_file = os.popen("dx find data --name 'field.tsv' --brief").read().rstrip()
sys_output = os.popen("dx download " + field_file).read()

In [None]:
field_table = pd.read_csv('field.tsv', sep='\t')

The subcategories of category 205 contain the data from the sleep questionnaire.
We've listed the subcategory IDs here for simplicity, but they could alternatively be extracted from the `catbrowse.tsv` file in the Showcase metadata folder of every project.

In [None]:
sleep_category_fields = field_table[field_table['main_category'].isin([206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216])]
sleep_category_fields.head()
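
As an alternative to hard-coding the subcategory IDs, they could be derived from a parent/child category table such as `catbrowse.tsv`. The sketch below uses a tiny made-up table rather than the real file, and the column names `parent_id` and `child_id` are assumptions that should be checked against the file in your project. The helper `child_categories` is hypothetical and returns direct children only.

```python
import pandas as pd

# Tiny made-up stand-in for catbrowse.tsv; the column names are assumed.
catbrowse = pd.DataFrame({
    "parent_id": [205, 205, 205, 100],
    "child_id": [206, 207, 208, 101],
})


def child_categories(table: pd.DataFrame, parent: int) -> list:
    """Return the sorted direct child category IDs of `parent`."""
    children = table.loc[table["parent_id"] == parent, "child_id"]
    return sorted(children.astype(int).tolist())


sleep_categories = child_categories(catbrowse, 205)
print(sleep_categories)
```

The resulting list could then be passed to `isin` in place of the hard-coded IDs.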

We'll demonstrate two methods for extracting the relevant data.\
First, using the field titles.\
Second, using the DNAnexus field names. These are structured as p\<UKB field ID\>, so UK Biobank field ID 30427 becomes p30427. For more information on instancing, including the field name structure for fields with multiple instances, see [UKB RAP instancing](https://community.ukbiobank.ac.uk/hc/en-gb/articles/15955986227357-What-is-an-instance-index).

In [None]:
sleep_category_titles = sleep_category_fields['title']
print(sleep_category_titles.iloc[0])

sleep_category_field_ids = sleep_category_fields['field_id'].apply(lambda x: 'p' + str(x))
sleep_category_field_ids = pd.concat([sleep_category_field_ids, pd.Series(['eid'])], ignore_index=True)

print(sleep_category_field_ids.iloc[0])
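
The field name structure can be sketched as a small helper. The `p<field ID>` form comes from the text above; the `_i<instance>` and `_a<array>` suffixes are an assumption about the RAP naming convention for fields with multiple instances or array values, so check them against the linked article. The function name `rap_field_name` is hypothetical, and the instance and array indices shown are illustrative rather than real properties of field 30427.

```python
from typing import Optional


def rap_field_name(field_id: int,
                   instance: Optional[int] = None,
                   array: Optional[int] = None) -> str:
    """Build a field name such as 'p30427' or 'p30427_i0_a1'."""
    name = f"p{field_id}"
    if instance is not None:
        name += f"_i{instance}"  # assumed instance suffix
    if array is not None:
        name += f"_a{array}"  # assumed array suffix
    return name


print(rap_field_name(30427))
print(rap_field_name(30427, instance=0))
print(rap_field_name(30427, instance=0, array=1))
```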

### Retrieve data

#### Table exporter method

Save the list of relevant fields to a file and upload it to your project space so that it can be picked up by the table exporter application.

In [None]:
# Build one timestamp so the date and time parts come from the same instant
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
filename = f"sleep_fields_{timestamp}.txt"

# Write one field name per line, without index or header
sleep_category_field_ids.to_csv(filename, sep="\t", index=False, header=False)
# Upload only the file just written
!dx upload {filename}
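
Before uploading, it can be worth confirming that the file holds exactly one field name per line. The sketch below is self-contained and writes to a temporary directory; the two field names are illustrative only, and in the notebook you would use `sleep_category_field_ids` and `filename` instead.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd

# Illustrative field names only.
field_names = pd.Series(["p30427", "eid"])

stamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
path = os.path.join(tempfile.mkdtemp(), f"sleep_fields_{stamp}.txt")

# Same write settings as the notebook cell: no index, no header.
field_names.to_csv(path, sep="\t", index=False, header=False)

# Read the file back and keep the non-empty lines.
with open(path) as handle:
    written = [line.strip() for line in handle if line.strip()]

print(written)
```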

Further information on the table exporter application can be found here:\
[Table exporter documentation](https://ukbiobank.dnanexus.com/app/table-exporter)

In [None]:
os.popen(f"dx run table-exporter -idataset_or_cohort_or_dashboard={DATASET_ID} -ifield_names_file_txt={filename} -ientity=participant -ioutput='sleep_data' -iheader_style=FIELD-NAME").read()
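
The command string can also be assembled from a list of arguments and quoted for the shell, which makes it easier to inspect before running. The sketch below only builds and prints the command; it uses the same inputs as the cell above and nothing else. The helper name `table_exporter_command` and the example dataset ID and file name are placeholders.

```python
import shlex


def table_exporter_command(dataset_id: str, fields_file: str,
                           output: str = "sleep_data") -> str:
    """Return a shell-quoted `dx run table-exporter` command string."""
    args = [
        "dx", "run", "table-exporter",
        f"-idataset_or_cohort_or_dashboard={dataset_id}",
        f"-ifield_names_file_txt={fields_file}",
        "-ientity=participant",
        f"-ioutput={output}",
        "-iheader_style=FIELD-NAME",
    ]
    # Quote each argument so spaces or special characters stay intact.
    return " ".join(shlex.quote(arg) for arg in args)


# Placeholder values for illustration only.
cmd = table_exporter_command("project-xxxx:record-xxxx",
                             "sleep_fields_2024-01-01_12-00.txt")
print(cmd)
```

Printing the command first gives a chance to check the dataset ID and file name before the job is submitted.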

#### Retrieve a single specific field in this Notebook - Spark Instance dependent

The following code selects the participant table. We can then define which field we are interested in using the `find_field` function.
Here, we use the `title` filter in `find_field` to select the field that is relevant to us.

Notebook A103 describes alternative methods for finding relevant fields.

*Warning* - the sleep questionnaire data contains a field title that includes quotation marks. This is likely to cause issues when retrieving that field by title. The table exporter method is not affected by this issue.
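
One way to spot such titles in advance is to scan the title column for quotation marks. The sketch below uses made-up titles, not real UK Biobank field titles; in the notebook the same check could be applied to `sleep_category_titles`.

```python
import pandas as pd

# Made-up titles for illustration only.
titles = pd.Series([
    "Sleep duration",
    'Time of "lights out"',
    "Number of naps",
])

# Flag titles containing a single or double quotation mark.
has_quotes = titles.str.contains(r"[\"']", regex=True)
quoted_titles = titles[has_quotes].tolist()
print(quoted_titles)
```

Fields flagged this way could be looked up by name instead of by title.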

In [None]:
# Initialize dxdata engine
engine = dxdata.connect(dialect="hive+pyspark")
pheno = dataset['participant']

In [None]:
field_eid = pheno.find_field(name="eid")

field_sleep = pheno.find_field(title=sleep_category_titles.iloc[0])

field_list = [field_eid, field_sleep]
field_list

In [None]:
## this will only work if your instance is Spark enabled
pheno_data = pheno.retrieve_fields(engine=engine, fields=field_list, coding_values="replace")

#### Translate the output in order to print to file

In [None]:
data_tab = pheno_data.toPandas()

In [None]:
data_tab.head()

In [None]:
data_tab.to_csv('pheno_sleep_data.csv')