## Olink (viz. Normalized protein expression/NPX measures) full dataset

This notebook demonstrates how to extract all the Olink instance tables, relevant resources and data fields, and how to join them into a single dataset using R.

[Resource 4654](https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=4654) provides an overview of the proteomics data available in UK Biobank.

Note that the final dataset contains 15 columns and ~142 million rows.

##### Run info

- Runtime: 15 minutes
- Instance: mem1_hdd1_v2_x16
- Cost: £0.50

### This notebook depends on
- **A Spark instance**
- The [Encoding 143](https://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=143) file (`coding143.tsv`) uploaded to the project

## Install required packages 
The `p_load` function from `pacman` loads packages into R. If a package is missing, `p_load` installs it automatically; this can take a considerable amount of time for a package that needs C or Fortran code compilation. The following packages are needed to run this notebook:

- `sparklyr` – Connects R to Spark and lets you work with Spark data through familiar R interfaces such as dplyr
- `data.table` – Reads data into data.table format
- `dplyr` – Tabular data manipulation in R
- `tidyr` – `pivot_longer` function to reshape wide tables into long format
- `stringr` – Character manipulation
- `DBI` – Communication between R and relational database management systems
- `purrr` – `reduce` function to perform repeated joins
- `bit64` – Support for 64-bit integers
- `readr` – Writes results to CSV

In [None]:
# Load required packages
if (!require(pacman)) install.packages("pacman")
install.packages("sparklyr")
# tidyr provides pivot_longer(), which is used in the cells below
pacman::p_load(sparklyr, data.table, dplyr, tidyr, stringr, DBI, purrr, bit64, readr)

### Initiate Spark cluster

In [None]:
# Connect to the master node, which orchestrates the analysis in Spark
port <- Sys.getenv("SPARK_MASTER_PORT")
master <- paste("spark://master:", port, sep = "")
sc <- spark_connect(master)

# Paths to database
database_path <- system("dx find data --class database", intern = TRUE)
app_substring <- na.omit(str_extract(database_path, '(app\\d+_\\d+)'))
database_substring <- str_extract(database_path[str_detect(database_path, app_substring)], 'database-([A-Za-z0-9]+)') %>%
  tolower() %>%
  str_replace("database-", "database_")
database <- paste0(database_substring, "__", app_substring)
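
To make the string handling above concrete, the following sketch applies the same extraction to a single made-up line of `dx find data` output, using base R only. The app number, timestamp and database ID are hypothetical, and the exact layout of the real output may differ.

```r
# Hypothetical line, standing in for one element of `database_path`
line <- "closed  2023-01-01 00:00:00 /app12345_20230101000000 (database-AbC123xyZ)"

# Pull out the "app<digits>_<digits>" part and the "database-<id>" part
app_part <- regmatches(line, regexpr("app[0-9]+_[0-9]+", line))
db_part  <- regmatches(line, regexpr("database-[A-Za-z0-9]+", line))

# Lower-case the database ID and swap the hyphen for an underscore
db_part <- sub("database-", "database_", tolower(db_part), fixed = TRUE)

# Final Spark database name: <database id>__<app folder>
database_example <- paste0(db_part, "__", app_part)
print(database_example)
```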

### Define dataset ID
The dataset ID takes the form [projectID]:[dataset ID]. These values are unique to your project.

In [None]:
# Project_id
project_id <- Sys.getenv('DX_PROJECT_CONTEXT_ID')

# Record_id
record_id <- system("dx find data --type Dataset --delimiter ',' | awk -F ',' '{print $5}'" , intern = TRUE)

# Project_record_id
project_record_id <- paste0(project_id, ":", record_id)
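
The `awk` call above prints one value per Dataset record, so `record_id` becomes a vector if the project holds more than one dataset. A minimal sketch of a guard, assuming exactly one record is wanted; `make_project_record_id` is a hypothetical helper (not part of the dx toolkit) and the IDs in the usage line are placeholders.

```r
# Hypothetical helper: build "[projectID]:[dataset ID]" and fail early
# when the project does not contain exactly one Dataset record
make_project_record_id <- function(project_id, record_id) {
  record_id <- record_id[nzchar(record_id)]  # drop empty strings
  if (length(record_id) != 1) {
    stop("Expected exactly one Dataset record, found ", length(record_id))
  }
  paste0(project_id, ":", record_id)
}

# Placeholder IDs for illustration only
make_project_record_id("project-XXXX", "record-YYYY")
```

If several records are found, one option is to pick the required record by hand before building the ID.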

### Explore the dataset and filter for tables related to Olink 

In [None]:
# Olink tables within database
tables <- DBI::dbGetQuery(sc, paste0("SHOW TABLES IN ", database))
tables %>%
    filter(str_detect(tableName, "olink")) %>%
    pull(tableName)

### Retrieve data from the tables for all available instances
The Olink proteomic biomarkers dataset is instanced:

Instance 0 - Baseline assessment 

Instance 2 - Imaging assessment

Instance 3 - First repeat imaging visit  


In [None]:
# Instance 0
table_dataframes_i0 <- replicate(12, data.frame(matrix(ncol = 0, nrow = 0)), simplify = FALSE)

# Loop through each table name
for (i in 1:12) {
  # Construct the table name
  table_name <- paste0("olink_instance_0_00", sprintf("%02d", i))

  # Construct the SQL query
  query <- paste0("SELECT * FROM ", database, ".", table_name)

  # Execute the query and store the result in a dataframe
  table_dataframes_i0[[i]] <- sdf_sql(sc, query)
}

# Join the tables on eid, then pivot to long format (one row per participant and protein)
instance_0_sdf <- reduce(table_dataframes_i0, left_join, by = "eid") %>%
  mutate(ins_index = 0) %>%
  pivot_longer(cols = -c(eid, ins_index), names_to = "protein_id", values_to = "result") %>%
  na.omit()
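
As a toy illustration of the wide-to-long reshape above, the sketch below does the same thing on a tiny local data frame with base R instead of Spark. The column names `prot_a` and `prot_b` and all values are invented for the example.

```r
# Toy wide table: one row per participant, one column per protein
wide <- data.frame(eid = c(1, 2), prot_a = c(0.1, NA), prot_b = c(0.3, 0.4))
protein_cols <- setdiff(names(wide), "eid")

# Long table: one row per participant and protein
long <- data.frame(
  eid        = rep(wide$eid, times = length(protein_cols)),
  ins_index  = 0,
  protein_id = rep(protein_cols, each = nrow(wide)),
  result     = unlist(wide[protein_cols], use.names = FALSE)
)

# Drop missing results, mirroring na.omit() in the Spark pipeline
long <- long[!is.na(long$result), ]
print(long)
```

Each non-missing cell of the wide table becomes one row, which is why the long dataset has so many more rows than participants.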

In [None]:
# Instance 2
table_dataframes_i2 <- replicate(6, data.frame(matrix(ncol = 0, nrow = 0)), simplify = FALSE)


# Loop through each table name
for (i in 1:6) {
  # Construct the table name
  table_name <- paste0("olink_instance_2_00", sprintf("%02d", i))

  # Construct the SQL query
  query <- paste0("SELECT * FROM ", database, ".", table_name)

  # Execute the query and store the result in a dataframe
  table_dataframes_i2[[i]] <- sdf_sql(sc, query)
}

instance_2_sdf <- reduce(table_dataframes_i2, left_join, by = "eid") %>%
  mutate(ins_index = 2) %>%
  pivot_longer(cols = -c(eid, ins_index), names_to = "protein_id", values_to = "result") %>%
  na.omit()

In [None]:
# Instance 3
table_dataframes_i3 <- replicate(6, data.frame(matrix(ncol = 0, nrow = 0)), simplify = FALSE)

# Loop through each table name
for (i in 1:6) {
  # Construct the table name
  table_name <- paste0("olink_instance_3_00", sprintf("%02d", i))

  # Construct the SQL query
  query <- paste0("SELECT * FROM ", database, ".", table_name)

  # Execute the query and store the result in a dataframe
  table_dataframes_i3[[i]] <- sdf_sql(sc, query)
}

instance_3_sdf <- reduce(table_dataframes_i3, left_join, by = "eid") %>%
  mutate(ins_index = 3) %>%
  pivot_longer(cols = -c(eid, ins_index), names_to = "protein_id", values_to = "result") %>%
  na.omit()

In [None]:
# Combine all instances by stacking their rows
olink_sdf <- instance_0_sdf %>% sdf_bind_rows(instance_2_sdf, instance_3_sdf)
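
The three instance cells above differ only in the instance number and the number of tables. As a sketch of how the repetition could be reduced, the hypothetical helpers below build the table names and SQL queries for any instance; `olink_table_names`, `olink_queries` and the database name `my_database` are illustrative names, not part of the notebook.

```r
# Hypothetical helper: table names such as "olink_instance_0_0001"
olink_table_names <- function(instance, n_tables) {
  sprintf("olink_instance_%d_%04d", instance, seq_len(n_tables))
}

# Hypothetical helper: one "SELECT *" query per table
olink_queries <- function(database, instance, n_tables) {
  paste0("SELECT * FROM ", database, ".", olink_table_names(instance, n_tables))
}

# Placeholder database name for illustration
queries_i2 <- olink_queries("my_database", 2, 6)
print(queries_i2[1])
```

Each query could then be passed to `sdf_sql(sc, query)` as in the loops above, for example with `lapply`.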

### Load encoding
An encoding index ([encoding 143](https://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=143)) can be used to link the protein ID in the NPX data to the UniProt text description of the protein.

The downloaded encoding 143 file (`coding143.tsv`) must be uploaded to your RAP project.

In [None]:
# Find and download encoding 143 from the project
system("dx find data --name coding143.tsv --brief | xargs dx download")

# Alternatively use direct path to 'coding143.tsv' on your project
# system("dx download ./coding143.tsv")

# Keep only the text before the first ";", lower-case it and replace
# hyphens with underscores so that it matches the protein_id values
coding143 <- fread("coding143.tsv") %>%
  mutate(meaning = str_to_lower(str_replace(meaning, ";.*", "")),
         meaning = str_replace_all(meaning, "-", "_"))
coding143_spark <- sparklyr::copy_to(sc, coding143, overwrite = TRUE)
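
The clean-up of the `meaning` column can be illustrated on two example strings with base R. The strings below are illustrative and assume the "name;description" layout implied by the code above; they are not copied from the encoding file.

```r
# Illustrative "name;description" strings (assumed layout)
meaning <- c(
  "A1BG;Alpha-1B-glycoprotein",
  "HLA-DRA;HLA class II histocompatibility antigen, DR alpha chain"
)

# Drop everything from the first ";" onwards, lower-case the name,
# then replace hyphens with underscores
protein_name <- tolower(sub(";.*", "", meaning))
protein_name <- gsub("-", "_", protein_name, fixed = TRUE)
print(protein_name)
```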

### Select data fields

[Data field 30900](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30900) - Number of proteins measured 

In [None]:
project_record_id

In [None]:
system(paste0("dx extract_dataset ", project_record_id, " --fields participant.eid,participant.p30900_i0,participant.p30900_i1,participant.p30900_i2,participant.p30900_i3 --o 'field_30900.csv'"))

In [None]:
# Number of proteins
field_30900_df <- fread("field_30900.csv") %>%
    select(-participant.p30900_i1) %>%
    pivot_longer(cols = -c(participant.eid), names_to = "instance", values_to = "N_proteins") %>%
    filter(!is.na(N_proteins)) %>%
    mutate(instance = str_remove(instance, "participant.p30900_i"))
field_30900_sdf <- sparklyr::copy_to(sc, field_30900_df, overwrite = TRUE)

[Data field 30901](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30901) - Plate used for sample run

In [None]:
system(paste0("dx extract_dataset ", project_record_id, " --fields participant.eid,participant.p30901_i0,participant.p30901_i1,participant.p30901_i2,participant.p30901_i3 --o 'field_30901.csv'"))

In [None]:
# Plate ID
field_30901_df <- fread("field_30901.csv") %>%
    select(-participant.p30901_i1) %>%
    pivot_longer(cols = -c(participant.eid), names_to = "instance", values_to = "PlateID") %>%
    filter(!is.na(PlateID)) %>%
    mutate(instance = str_remove(instance, "participant.p30901_i"),
          PlateID = as.double(PlateID))
field_30901_sdf <- sparklyr::copy_to(sc, field_30901_df, overwrite = TRUE)

[Data field 30902](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=30902) - Well used for sample run

In [None]:
system(paste0("dx extract_dataset ", project_record_id, " --fields participant.eid,participant.p30902_i0,participant.p30902_i1,participant.p30902_i2,participant.p30902_i3 --o 'field_30902.csv'"))

In [None]:
# Well ID
field_30902_df <- fread("field_30902.csv") %>%
    select(-participant.p30902_i1) %>%
    pivot_longer(cols = -c(participant.eid), names_to = "instance", values_to = "WellID") %>%
    filter(WellID != "") %>%
    mutate(instance = str_remove(instance, "participant.p30902_i"))
field_30902_sdf <- sparklyr::copy_to(sc, field_30902_df, overwrite = TRUE)
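
The three `dx extract_dataset` commands above follow one pattern. A minimal sketch of a command builder is shown below; `build_extract_cmd` is a hypothetical helper and the project and record IDs are placeholders. It only assembles the command string in the same form used above and does not run it.

```r
# Hypothetical helper: assemble the dx extract_dataset command for one field
build_extract_cmd <- function(project_record_id, field_id, instances = 0:3) {
  fields <- c("participant.eid",
              sprintf("participant.p%d_i%d", field_id, instances))
  paste0("dx extract_dataset ", project_record_id,
         " --fields ", paste(fields, collapse = ","),
         " --o 'field_", field_id, ".csv'")
}

# Placeholder IDs for illustration only
cmd_30900 <- build_extract_cmd("project-XXXX:record-YYYY", 30900)
print(cmd_30900)
```

The resulting string could be passed to `system()` in the same way as the commands above.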

### Load resources
Additional assay-level data is provided as downloadable showcase resources.
These are generic tab-separated datasets, available via the
resources section in [Category 1839](https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=1839).



#### Assay 
Provides the lookup between an assay, its respective UniProt ID and the Olink Explore panel in which it is categorised.

In [None]:
# Assay
system(" wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/olink_assay.dat")
olink_assay <- fread("olink_assay.dat") %>% mutate(Assay = tolower(Assay))
olink_assay_sdf <- sparklyr::copy_to(sc, olink_assay, overwrite = TRUE)

#### Assay version
Provides the version number for each assay per panel lot number.

In [None]:
# Assay version
system(" wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/olink_assay_version.dat")
olink_assay_version <- fread("olink_assay_version.dat") %>% mutate(Assay = tolower(Assay))
olink_assay_version_sdf <- sparklyr::copy_to(sc, olink_assay_version, overwrite = TRUE)

#### Batch number 
Provides the shipment batch number for each plate ID, allowing for correction of potential batch processing effects.

In [None]:
# Batch number
system(" wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/olink_batch_number.dat")
olink_batch_number <- fread("olink_batch_number.dat") %>% mutate(PlateID = as.double(PlateID))
olink_batch_number_sdf <- sparklyr::copy_to(sc, olink_batch_number, overwrite = TRUE)

#### Limit of detection
Provides the instance-level limit of detection for each assay per shipment plate, allowing for filtering of sample results based on target protein detectability.

In [None]:
# Limit of detection
system(" wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/olink_limit_of_detection.dat")
olink_limit_of_detection <- fread("olink_limit_of_detection.dat") %>% 
                            mutate(Assay = tolower(Assay), PlateID = as.double(PlateID))
olink_limit_of_detection_sdf <- sparklyr::copy_to(sc, olink_limit_of_detection, overwrite = TRUE)
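
As a sketch of how the limit of detection might be used for filtering, the toy example below flags results that fall below their limit. The data are invented, and the column name `LOD` is an assumption about the resource file; check the actual column names after loading it.

```r
# Toy results table with an assumed limit-of-detection column `LOD`
toy <- data.frame(
  eid        = c(1, 2, 3),
  protein_id = c("dapk2", "dapk2", "ngfr"),
  result     = c(0.8, -1.5, 0.2),
  LOD        = c(-1.0, -1.0, 0.5)  # hypothetical column name and values
)

# Flag results below the limit of detection
toy$below_lod <- toy$result < toy$LOD

# Keep only results at or above the limit
detectable <- toy[!toy$below_lod, ]
print(detectable)
```

Whether to exclude or merely flag results below the limit is an analysis choice; this sketch only shows the comparison.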

#### Panel lot number 
Provides the processing lot number per assay panel within each shipment batch.

In [None]:
# Panel lot number
system(" wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/olink_panel_lot_number.dat")
olink_panel_lot_number <- fread("olink_panel_lot_number.dat")
olink_panel_lot_number_sdf <- sparklyr::copy_to(sc, olink_panel_lot_number, overwrite = TRUE)

#### Processing start date 
Provides the processing date for each shipment plate, broken down by assay panel.

In [None]:
# Processing start date
system("wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/olink_processing_start_date.dat")
olink_processing_start_date <- fread("olink_processing_start_date.dat")  %>% mutate(PlateID = as.double(PlateID))
olink_processing_start_date_sdf <- sparklyr::copy_to(sc, olink_processing_start_date, overwrite = TRUE)

### Join data
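Join the instance tables with the data fields, encoding and resources. Note that `copy_to` replaces the dots in column names with underscores, so `participant.eid` appears as `participant_eid` in the Spark tables.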

In [None]:
olink_full_dataset_sdf <- olink_sdf %>%
    left_join(field_30900_sdf, by = c("eid" = "participant_eid", "ins_index" = "instance")) %>%
    left_join(field_30901_sdf, by = c("eid" = "participant_eid", "ins_index" = "instance")) %>%
    left_join(field_30902_sdf, by = c("eid" = "participant_eid", "ins_index" = "instance")) %>%
    left_join(coding143_spark, by = c("protein_id" = "meaning")) %>%
    left_join(olink_limit_of_detection_sdf, by = c("protein_id" = "Assay", "ins_index" = "Instance", "PlateID" = "PlateID")) %>%
    left_join(olink_assay_sdf, by = c("protein_id" = "Assay")) %>%
    left_join(olink_processing_start_date_sdf, by = c("PlateID" = "PlateID", "Panel" = "Panel")) %>%
    left_join(olink_batch_number_sdf, by = c("PlateID" = "PlateID")) %>%
    left_join(olink_panel_lot_number_sdf, by = c("Batch" = "Batch", "Panel" = "Panel")) %>%
    left_join(olink_assay_version_sdf, by = c("Panel_Lot_Nr" = "Panel_Lot_Nr", "protein_id" = "Assay"))
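
The later joins form a lookup chain: the plate ID gives the batch, and the batch together with the panel gives the panel lot number. The sketch below reproduces that chain on invented local data with base R `merge` rather than Spark; only the column names `PlateID`, `Panel`, `Batch` and `Panel_Lot_Nr` come from the join above.

```r
# Invented toy tables
results <- data.frame(eid = c(1, 2, 3),
                      PlateID = c(10, 10, 20),
                      Panel = c("Neurology", "Oncology", "Neurology"),
                      stringsAsFactors = FALSE)
batch <- data.frame(PlateID = c(10, 20), Batch = c(1, 2))
lot <- data.frame(Batch = c(1, 1, 2),
                  Panel = c("Neurology", "Oncology", "Neurology"),
                  Panel_Lot_Nr = c("A", "B", "C"),
                  stringsAsFactors = FALSE)

# Left joins: keep every result row, add the lookup columns
step1 <- merge(results, batch, by = "PlateID", all.x = TRUE)
step2 <- merge(step1, lot, by = c("Batch", "Panel"), all.x = TRUE)
step2 <- step2[order(step2$eid), ]

# A left join against a unique lookup key should not change the row count
stopifnot(nrow(step2) == nrow(results))
print(step2)
```

Comparing row counts before and after each join is a cheap way to spot a lookup table with duplicated keys.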


### Filter Olink dataset 
The dataset can now be filtered. 

In the examples below, we filter by protein ID and by protein biomarker panel.

In [None]:
proteins <- c("dapk2", "ngfr", "zp4")
proteins_of_interest <- olink_full_dataset_sdf %>%
  filter(protein_id %in% proteins)

In [None]:
# You may write your results as a CSV using:
# collect() pulls the filtered Spark table into local R memory before writing
proteins_of_interest %>%
  collect() %>%
  readr::write_csv('proteins_of_interest.csv')

In [None]:
neurology_panel <- olink_full_dataset_sdf %>%
  filter(Panel == "Neurology")