# Retrieve UK Biobank lipids phenotypes and covariates

In this notebook we retrieve lipid phenotypes and covariates from the UK Biobank tabular dataset using Spark SQL and store the extract as CSV files for downstream use. The lipid fields are documented in the UK Biobank Showcase:
  * Cholesterol: https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30690
  * HDL cholesterol: https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30760
  * LDL direct: https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30780
  * Triglycerides: https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30870

# Setup

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the UK Biobank Research Analysis Platform.
    <ul>
        <li>Use compute type 'Spark Cluster' with default CPU, RAM, and worker instances.</li>
        <li>This notebook runs quickly, but we still recommend running it in the background via <kbd>dx run dxjupyterlab_spark_cluster</kbd> to capture provenance.</li>
    </ul>
</div>

```
dx run dxjupyterlab_spark_cluster \
    -icmd="papermill 4_ukb_lipids_phenotypes_retrieval.ipynb 4_ukb_lipids_phenotypes_retrieval_$(date +%Y%m%d).ipynb" \
    -iin=4_ukb_lipids_phenotypes_retrieval.ipynb \
    --folder=outputs/spark-pheno-retrieval/$(date +%Y%m%d)/
```
See also https://platform.dnanexus.com/app/dxjupyterlab_spark_cluster

In [None]:
import dxdata
import pandas as pd
import re

## Define constants

In [None]:
# Papermill parameters. See https://papermill.readthedocs.io/en/latest/usage-parameterize.html

# Inputs
UKB_TABULAR_DATASET = 'app7089_202103231620.dataset'

# Outputs
DRUG_MAPPING_FILENAME = 'drug_mapping.csv'
PHENO_DATA_FILENAME = 'lipids.csv'

## Initialize the dxdata engine

In [None]:
engine = dxdata.connect(dialect='hive+pyspark')
pt = engine.execute('SET spark.sql.shuffle.partitions=50').to_pandas()

In [None]:
dataset = dxdata.load_dataset(UKB_TABULAR_DATASET)

In [None]:
participant = dataset['participant']

## Discover the lipid and covariate fields of interest 

In [None]:
def print_field_list(fields):
    for field in sorted(fields, key=lambda fld: fld.title):
        print(f'\n{field.column_name}: {field.title}')
        print(f'\t{field.units}')
        print(f'\t{field.type}')
        print(f'\t{field.coding}')
        if field.coding is not None and field.coding.name != 'data_coding_4':
            print(f'\t{field.coding.codes}')

In [None]:
fields_by_title = list(participant.find_fields(titles=['Sex', 'Date of birth']))
print_field_list(fields_by_title)

### Where is field 'Date of Birth'?

<div class="alert alert-block alert-danger">
The Date of Birth field does not appear to reside in the default UKB_TABULAR_DATASET.
</div>


In [None]:
print_field_list(list(participant.find_fields(names=['p33', 'p31'])))

In [None]:
print_field_list(list(participant.find_fields(name_regex='(?i)p33')))

In [None]:
print_field_list(list(participant.find_fields(title_regex='(?i)birth')))

In [None]:
fields_by_title_regex = list(participant.find_fields(title_regex='(?i)cholesterol|hdl|ldl|triglycerides|Age when attended assessment centre|treatment/medication code'))
len(fields_by_title_regex)
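
As a rough illustration of what the title pattern above would select, the sketch below applies the same regular expression to a few made-up titles with `re.search`. Whether `find_fields` uses search, match, or full-match semantics is an assumption here, and the sample titles are hypothetical rather than read from the dataset.

```python
import re

# The same pattern passed to title_regex in the cell above.
TITLE_PATTERN = re.compile(
    r'(?i)cholesterol|hdl|ldl|triglycerides'
    r'|Age when attended assessment centre|treatment/medication code'
)

# Hypothetical titles, for illustration only; not read from the dataset.
sample_titles = [
    'HDL cholesterol | Instance 0',
    'LDL direct | Instance 1',
    'Triglycerides | Instance 0',
    'Treatment/medication code | Instance 0',
    'Sex',
    'Standing height | Instance 0',
]

# Assumption: substring (search) semantics, case-insensitive via (?i).
matched = [title for title in sample_titles if TITLE_PATTERN.search(title)]
print(matched)
```

Because the alternatives are plain substrings, any title that merely contains one of them would also match under these semantics, so it is worth reviewing the printed field list rather than relying on the count alone.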

In [None]:
all_fields = fields_by_title + fields_by_title_regex
print_field_list(all_fields)

## Discover the coding for the statin drugs of interest

In [None]:
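# Treatment/medication codes (data_coding_4) for the statins of interest.
STATIN_CODES = [1140861958, 1140861970, 1140864592, 1140881748, 1140888594, 1140888648, 1140910632,
                1140910654, 1141146138, 1141146234, 1141192410, 1141192414, 1141200040]

# The coding dictionary is keyed by strings, so convert each integer code before the lookup.
drug_mapping = {code: field.coding.codes[str(code)]
                for field in all_fields
                if field.coding is not None and field.coding.name == 'data_coding_4'
                for code in STATIN_CODES}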

drug_mapping
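
The lookup above raises a `KeyError` if a code is absent from the coding. A minimal sketch of a more forgiving variant follows, assuming the coding is a plain dictionary keyed by strings (as the `str(...)` conversion above implies). The `mock_codes` dictionary, its placeholder drug names, and the code `1140999999` are hypothetical.

```python
# Hypothetical stand-in for field.coding.codes: keys are strings.
# The drug names are placeholders, not the real coding values.
mock_codes = {
    '1140861958': 'drug A',
    '1140888594': 'drug B',
    '1141146234': 'drug C',
}

# The last code is made up to show what happens when a code is missing.
codes_of_interest = [1140861958, 1141146234, 1140999999]

# Use .get so that a code missing from the coding is reported, not raised.
lookup = {code: mock_codes.get(str(code)) for code in codes_of_interest}
missing = [code for code, name in lookup.items() if name is None]

print(lookup)
print('missing:', missing)
```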

In [None]:
drug_mapping_df = pd.DataFrame.from_dict(drug_mapping, orient='index', columns=['drug_name']).rename_axis('drug_number').reset_index()

drug_mapping_df

## Retrieve the data

In [None]:
import time

start = time.time()
pheno_data = participant.retrieve_fields(engine=engine, fields=all_fields, coding_values='replace').toPandas()
end = time.time()
print(end - start)
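
If several steps need timing, the start/end pattern above can be wrapped in a small context manager. This is only a sketch: `timed` is a hypothetical helper, and the summation is a stand-in for the retrieval call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock seconds spent inside the with block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f'{label}: {elapsed:.1f} s')

# Stand-in workload; in the notebook this would wrap the retrieval call.
with timed('retrieval'):
    total = sum(range(1_000_000))
```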

In [None]:
pheno_data.shape

In [None]:
# Uncomment to see row level data.
#pheno_data.head()

In [None]:
pheno_data.columns

### Construct improved column names 

In [None]:
col_names = {'eid': 'eid'}
for field in sorted(all_fields, key=lambda fld: fld.name):
    # Raw string avoids invalid-escape warnings for \| and \d in the pattern.
    name = '_'.join([field.column_name, re.sub(r' \| Instance \d', '', field.title).replace(' ', '_').replace('/', '_')])
    if field.units is not None:
        name += f'_{field.units.replace(" ", "_").replace("/", "_")}'
    print(name)
    col_names[field.column_name] = name
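
To make the naming rule concrete without touching the dataset, the sketch below applies the same transformation to a few mock fields. `MockField`, `improved_name`, and the column names, titles, and units shown are hypothetical stand-ins for what dxdata returns.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a dxdata field; only the attributes used are modeled.
MockField = namedtuple('MockField', ['column_name', 'title', 'units'])

def improved_name(field):
    """Combine column name, instance-free title, and units into one label."""
    title = re.sub(r' \| Instance \d', '', field.title)
    name = '_'.join([field.column_name, title.replace(' ', '_').replace('/', '_')])
    if field.units is not None:
        name += '_' + field.units.replace(' ', '_').replace('/', '_')
    return name

fields = [
    MockField('p30760_i0', 'HDL cholesterol | Instance 0', 'mmol/L'),
    MockField('p31', 'Sex', None),
    MockField('p20003_i0_a0', 'Treatment/medication code | Instance 0 | Array 0', None),
]
names = [improved_name(f) for f in fields]
print(names)
```

Note that, under this rule, a title with an array suffix keeps its `|` separator in the resulting column name, which may matter for tools that treat `|` as a delimiter.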

In [None]:
pheno_data = pheno_data.rename(columns=col_names)

In [None]:
pheno_data.columns

## Write out the data extract to a CSV 

In [None]:
drug_mapping_df.to_csv(DRUG_MAPPING_FILENAME, index=False)

In [None]:
pheno_data.to_csv(PHENO_DATA_FILENAME, index=False)

# Provenance

In [None]:
import datetime
print(datetime.datetime.now())

In [None]:
!pip3 freeze
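
When the full `pip3 freeze` listing is more than needed, a shorter provenance record can be built with the standard library. The sketch below is one possible approach; `package_versions` is a hypothetical helper, and the package names passed to it are assumed to be the distribution names installed in this environment.

```python
import platform
import sys
from importlib import metadata

def package_versions(names):
    """Return {package: version or None} for the given distribution names."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            versions[name] = None
    return versions

print(sys.version)
print(platform.platform())
print(package_versions(['pandas', 'dxdata', 'papermill']))
```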