# Finding, extracting and classifying OMOP hypertension participant data 

##### Run info

- Runtime: 10 mins 
- Instance: mem1_ssd1_v2_x16
- Cost: £0.10
- Tier required: tier 2 or above (2, 3, 11, 12)



This notebook performs a similar analysis to [notebook **201**](https://github.com/UK-Biobank/UKB-RAP-Notebooks-Access/blob/main/Notebooks/A106_Hypertension-data_R.ipynb) “Find, visualize, organize and export hypertension participant data”. In addition to analysing hypertension within OMOP data, we investigate how different OMOP tables interact and how to use Spark within R to analyse large tables.

The Observational Medical Outcomes Partnership (OMOP) Common Data Model is a standardised healthcare data model developed by the Observational Health Data Science and Informatics (OHDSI) community. It is designed to standardise the structure and content of observational data and to enable efficient analyses. The goal of this notebook is to demonstrate methods for finding, extracting, and classifying hypertension data from the OMOP data.

The data used to construct the OMOP dataset was provided by UK Biobank in 2018. Consequently, any data issues present in 2018 persist in the current OMOP dataset. Known issues are noted within the relevant OMOP field. All figures produced are accurate at the time of writing (Q1 2024).

Currently, the data dictionary does not contain the correct data types for the columns within the OMOP tables. However, the Columns tab within Data Showcase reports the correct column data types.

This notebook uses three sources:

- OMOP condition occurrence 
- OMOP Person
- OMOP concept resources 

### This notebook depends on
- **A Spark instance**

### Why use Spark?

"With Spark, only one-step is needed where data is read into memory, operations performed, and the results written back—resulting in a much faster execution. Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset" - [Amazon Web Services](https://aws.amazon.com/what-is/apache-spark/#:~:text=With%20Spark%2C%20only%20one%2Dstep,function%20on%20the%20same%20dataset.)

With large datasets such as OMOP, loading the data into cache results in much faster subsequent analysis.  



## Install required packages 
The `p_load` function from `pacman` loads packages into R. If a given package is missing, `p_load` installs it automatically. This can take a considerable amount of time for a package that needs C or FORTRAN code compilation. The following packages are needed to run this notebook:

- `sparklyr` – provides access to Spark data through familiar R interfaces such as `dplyr`
- `data.table` – reads data into data.table format
- `dplyr` – tabular data manipulation in R
- `ggplot2` – creates plots
- `scales` – formats the y axis labels so that counts are not shown in scientific notation
- `stringr` – character manipulation
- `glue` – used for SQL queries
- `readr` – writes results to CSV

In [None]:
# Load required packages 
if(!require(pacman)) install.packages("pacman")
install.packages("sparklyr")
pacman::p_load(sparklyr, data.table, dplyr, ggplot2, scales, stringr, glue, readr)

## Import OMOP resources 

The OMOP resources do not contain participant-specific information. They provide standardised keywords and similar reference data that can be used to add information to the OMOP tables by joining on the relevant columns - [OHDSI](https://github.com/OHDSI/Vocabulary-v5.0/wiki#--ohdsi-standardized-vocabularies)

For example, the care_site resource can be joined to the Person OMOP table using the care_site_id column, allowing the id to be linked to the care_site_name.

#### OMOP concept 
The Standardized Vocabularies contain records, or Concepts, that uniquely identify each fundamental unit of meaning used to express clinical information in all domain tables of the Common Data Model.

#### OMOP concept ancestor 
This table is designed to simplify observational analysis by providing the complete hierarchical relationships between Concepts (omop_concept).
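
To make the ancestor/descendant structure concrete, here is a minimal sketch on toy data. The two small data frames are hypothetical stand-ins for `omop_concept` and `omop_concept_ancestor`, and the concept 999001 is made up for illustration. Only the IDs 316866 and 320128 come from this notebook.

```r
library(dplyr)

# Toy stand-in for omop_concept (999001 is a made-up concept)
concept <- data.frame(
  concept_id   = c(316866, 320128, 999001),
  concept_name = c("Hypertensive disorder", "Essential hypertension",
                   "Hypothetical child concept"),
  stringsAsFactors = FALSE
)

# Toy stand-in for omop_concept_ancestor: one row per ancestor/descendant pair,
# including each concept paired with itself at 0 levels of separation
concept_ancestor <- data.frame(
  ancestor_concept_id      = c(316866, 316866, 316866, 320128, 320128, 999001),
  descendant_concept_id    = c(316866, 320128, 999001, 320128, 999001, 999001),
  min_levels_of_separation = c(0, 1, 2, 0, 1, 0)
)

# All descendants of "Hypertensive disorder", with readable names attached
descendants <- concept_ancestor %>%
  filter(ancestor_concept_id == 316866) %>%
  left_join(concept, by = c("descendant_concept_id" = "concept_id")) %>%
  select(descendant_concept_id,
         descendant_concept_name = concept_name,
         min_levels_of_separation) %>%
  arrange(min_levels_of_separation)

print(descendants)
```

Filtering on `ancestor_concept_id` returns the umbrella concept itself plus everything beneath it, which is the pattern used later in this notebook to build the list of hypertension concepts.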

In [None]:
# Load in the OMOP resources concept (1545) and concept ancestor (1541)
system("wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/omop_concept.tsv")
system("wget  -nd  biobank.ndph.ox.ac.uk/ukb/ukb/auxdata/omop_concept_ancestor.tsv")

In [None]:
# join concept_name to ancestor and descendant concept_id
omop_concept <- fread("omop_concept.tsv", sep = "\t")
omop_concept_ancestor <- fread("omop_concept_ancestor.tsv", sep = "\t") %>%
    left_join(select(omop_concept, concept_id, "ancestor_concept_name" = concept_name), by = c("ancestor_concept_id" = "concept_id")) %>%
    left_join(select(omop_concept, concept_id, "descendant_concept_name" = concept_name), by = c("descendant_concept_id" = "concept_id"))

### Search [Accepted Concepts](https://athena.ohdsi.org/search-terms/terms?domain=Condition&standardConcept=Standard&page=1&pageSize=15&query=hypertension&boosts) to identify concept ids of interest.

For example, searching "hypertension" returns Essential hypertension (concept id = 320128).

Once a concept id has been identified, searching for that id shows how the data is structured.
An ancestor of Essential hypertension is Hypertensive disorder, which encompasses a more complete list of hypertensive disorders.

Looking at the output below, ancestor_concept_id 316866 'Hypertensive disorder' seems a likely umbrella term for our interest in hypertension.

In [None]:
omop_concept_ancestor %>% 
    filter(descendant_concept_id == "320128") %>% 
    arrange(desc(min_levels_of_separation))

In [None]:
# There are 141 conditions that are included as a descendant of the Hypertensive disorder
hypertension_concept_ids <- omop_concept_ancestor %>%
    filter(ancestor_concept_id == "316866") %>% 
    distinct(descendant_concept_id, ancestor_concept_name, descendant_concept_name) %>%
    mutate(descendant_concept_id = as.character(descendant_concept_id))

### Now create spark connection and define database

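Next, we set an `sc` variable, which establishes a connection to the Spark cluster.

To connect to the database, we reformat the database path returned by `dx find data` into the name Spark expects: the lower-cased database ID (with `database-` replaced by `database_`), two underscores, then the app ID.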
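
The sketch below walks through that reformatting on a single example line. The line itself is an assumption about what `dx find data --class database` prints, and the IDs in it are made up. Base R functions are used here instead of `stringr` so that the sketch runs without extra packages.

```r
# Hypothetical output line (format and IDs are assumptions for illustration)
database_path <- "closed  2024-01-15 09:30:00 /app12345_20240115093000 (database-GABC123xyz)"

# Extract the app identifier, e.g. "app12345_20240115093000"
app_substring <- regmatches(database_path,
                            regexpr("app[0-9]+_[0-9]+", database_path))

# Extract the database ID, lower-case it, and swap the hyphen for an underscore
database_id <- regmatches(database_path,
                          regexpr("database-[A-Za-z0-9]+", database_path))
database_substring <- sub("database-", "database_", tolower(database_id),
                          fixed = TRUE)

# Combine into the database name used when referring to tables
database <- paste0(database_substring, "__", app_substring)
print(database)
```

Tables can then be referred to as `paste0(database, ".table_name")`, as done throughout this notebook.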

In [None]:
# Connect to the master node, which orchestrates the analysis in Spark
port <- Sys.getenv("SPARK_MASTER_PORT")
master <- paste0("spark://master:", port)
sc <- spark_connect(master = master)

# Derive the app ID and database name from the `dx find data` output
database_path <- system("dx find data --class database", intern = TRUE)
app_substring <- na.omit(str_extract(database_path, '(app\\d+_\\d+)'))
database_substring <- str_extract(database_path[str_detect(database_path, app_substring)], 'database-([A-Za-z0-9]+)') %>%
    tolower() %>%
    str_replace("database-", "database_")
database <- paste0(database_substring, "__", app_substring)

## Load [OMOP condition occurrence table](https://biobank.ctsu.ox.ac.uk/crystal/rectab.cgi?id=925)

The condition occurrence table contains records of events of a person suggesting the presence of a disease or medical condition, stated as a diagnosis, a sign, or a symptom.

In [None]:
# Loading omop_condition_occurrence data
tbl_cache(sc, paste0(database, '.omop_condition_occurrence'))
omop_condition_occurrence <- dplyr::tbl(sc, paste0(database, '.omop_condition_occurrence'))

## Filter for concepts of interest

We now filter the OMOP condition occurrence table to the concepts that have Hypertensive disorder as an ancestor:

* When working with Spark dataframes through `sparklyr`, you do not have access to all the functions available within R. You are restricted to a limited set, primarily from `dplyr`, `broom` and `DBI`. More information can be found on [sparklyr](https://spark.rstudio.com/).

* The majority of base R functions also do not work with Spark dataframes, e.g. `unique(hypertension_concept_ids_spark$descendant_concept_id)`

* Filtering with a list created from `hypertension_concept_ids_spark` does not work, e.g. `filter(condition_concept_id %in% c(hypertension_concept_ids_spark %>% pull(descendant_concept_id)))`

* Different joins have to be used depending on the need, and the dataframe being joined must also be a Spark `tbl`, as shown below
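
The join-based filtering described above can be tried out on small local data frames first, with no cluster required. The sketch below uses made-up rows. With Spark, the same `dplyr` verbs would be applied to Spark tables instead, for example ones created with `copy_to()`.

```r
library(dplyr)

# Toy stand-ins with hypothetical values (not UK Biobank data)
condition_occurrence <- data.frame(
  eid = c(1, 1, 2, 3),
  condition_concept_id = c("320128", "111", "999001", "222"),
  stringsAsFactors = FALSE
)
concepts_of_interest <- data.frame(
  descendant_concept_id = c("320128", "999001"),
  descendant_concept_name = c("Essential hypertension",
                              "Hypothetical child concept"),
  stringsAsFactors = FALSE
)

# inner_join: keeps matching rows AND brings across the concept name
with_names <- condition_occurrence %>%
  inner_join(concepts_of_interest,
             by = c("condition_concept_id" = "descendant_concept_id"))

# semi_join: keeps matching rows only, without adding any columns
filtered_only <- condition_occurrence %>%
  semi_join(concepts_of_interest,
            by = c("condition_concept_id" = "descendant_concept_id"))

print(with_names)
print(filtered_only)
```

Choose `inner_join()` when the extra columns (such as the concept name) are needed downstream, and `semi_join()` when the second table acts purely as a filter.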


In [None]:
# Making hypertension_concept_ids a spark df, enabling joins
hypertension_concept_ids_spark <- sparklyr::copy_to(sc, hypertension_concept_ids, overwrite = TRUE)

##### Removing condition_concept_ids that are not a descendant of Hypertensive disorder

Of the 141 possible conditions within Hypertensive disorder, 44 have records in the condition occurrence table, that is, events of a person suggesting the presence of a disease or medical condition stated as a diagnosis, a sign, or a symptom.

In [None]:
omop_condition_occurrence_filtered <- omop_condition_occurrence %>% 
    inner_join(hypertension_concept_ids_spark, by = c("condition_concept_id" = "descendant_concept_id")) %>%
    select(eid, condition_occurrence_id, condition_concept_id, descendant_concept_name)

omop_condition_occurrence_filtered %>% distinct(condition_concept_id) %>% count()

# Why use umbrella concepts

Utilising Hypertensive disorder as an umbrella concept allows a greater understanding of the effect hypertension has on the participants compared to a specific single term. 

This notebook investigates the effect using umbrella concepts has on the data captured.


### Effect on eids captured

There are 30157 additional eid x condition combinations that have been captured. An eid may have Essential hypertension and two other hypertensive disorders, which explains why the total count is higher than the total number of eids.
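
A small illustration of why the count of eid x condition combinations exceeds the number of eids. The rows and the disorder names "Hypothetical disorder A/B" are made up, and the code runs on a local data frame, not on the Spark table.

```r
library(dplyr)

# Made-up records: eid 1 has a repeated record, eid 3 has two other disorders
records <- data.frame(
  eid = c(1, 1, 1, 2, 3, 3),
  descendant_concept_name = c("Essential hypertension", "Essential hypertension",
                              "Hypothetical disorder A", "Essential hypertension",
                              "Hypothetical disorder A", "Hypothetical disorder B"),
  stringsAsFactors = FALSE
)

# One row per distinct eid x condition pair, then count pairs per group
pair_counts <- records %>%
  distinct(eid, descendant_concept_name) %>%
  mutate(condition = if_else(descendant_concept_name == "Essential hypertension",
                             "Essential hypertension",
                             "Other Hypertensive disorders")) %>%
  count(condition)

print(pair_counts)
print(n_distinct(records$eid))
```

Here three participants give five distinct eid x condition pairs, because a participant contributes one pair per distinct condition they have.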

In [None]:
omop_condition_occurrence_filtered %>%
  distinct(eid, condition_concept_id, descendant_concept_name) %>%
  count(eid, condition_concept_id, descendant_concept_name) %>%
  mutate(
    condition = case_when(
      descendant_concept_name == "Essential hypertension" ~ "Essential hypertension",
      TRUE ~ "Other Hypertensive disorders"
    )
  ) %>%
  count(condition, wt = n)

### Effect on Hypertensive conditions captured

Here we investigate the increase in conditions and participants captured among participants with Hypertensive disorder. There are three distinct participant groups:

* Essential hypertension only - These cases would have been recorded without utilising OMOP concept ancestors.
* Essential and other hypertensive disorders - Provides additional information about participants' other hypertensive conditions.
* Other hypertensive disorders - Additional participants and conditions captured due to using OMOP concept ancestor. 
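
The three groups can be illustrated with a simplified sketch on made-up local data. This is not the exact code used in the next cell. It is a variant that derives two flags per participant and combines them, and the disorder names "Hypothetical disorder A/B" are invented.

```r
library(dplyr)

# Made-up records: one row per eid x condition
records <- data.frame(
  eid = c(1, 2, 2, 3, 4, 4),
  descendant_concept_name = c("Essential hypertension", "Essential hypertension",
                              "Hypothetical disorder A", "Hypothetical disorder A",
                              "Hypothetical disorder A", "Hypothetical disorder B"),
  stringsAsFactors = FALSE
)

# Flag, per participant, whether they have essential and/or other disorders
groups <- records %>%
  group_by(eid) %>%
  summarise(
    has_essential = any(descendant_concept_name == "Essential hypertension"),
    has_other     = any(descendant_concept_name != "Essential hypertension"),
    .groups = "drop"
  ) %>%
  # The combined case must be tested first, because case_when uses the first match
  mutate(condition_combination = case_when(
    has_essential & has_other ~ "Has Essential and Other Hypertensive disorders",
    has_essential             ~ "Essential Hypertension",
    TRUE                      ~ "Other Hypertensive disorders"
  ))

print(groups)
```

Each participant lands in exactly one group, so counting `condition_combination` gives the sizes of the three groups.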

In [None]:
# Manipulating data to investigate the increase in conditions and participants captured

omop_condition_occurrence_filtered %>% 
    mutate(
    condition = case_when(
      descendant_concept_name == "Essential hypertension" ~ "Essential hypertension",
      TRUE ~ "Other Hypertensive disorders"
    )
  ) %>%
  distinct(eid, condition) %>% 
  add_count(eid) %>%
  group_by(eid) %>%
  summarise(
    has_essential = any(condition == "Essential hypertension" & n == 1),
    has_other = any(condition == "Other Hypertensive disorders" & n == 1),
    has_both = all(condition %in% c("Essential hypertension", "Other Hypertensive disorders") & n == 2)
  ) %>%
  ungroup() %>%
  mutate(
    condition_combination = case_when(
      has_essential ~ "Essential Hypertension",
      has_other ~ "Other Hypertensive disorders",
      has_both ~ "Has Essential and Other Hypertensive disorders",
      TRUE ~ "Unknown"
    ),
    x_var = ""
  ) %>%
    count(x_var, condition_combination) %>%
    ggplot(aes(x = x_var, y = n, fill = condition_combination)) +
    geom_bar(stat = "identity", position = position_stack(reverse = TRUE)) +
  geom_text(aes(label = n, y = n), position = position_stack(vjust = 0.5, reverse = TRUE), size = 3) +
  labs(title = "Condition Combinations",
       x = "",
       y = "Count") +
  theme_minimal() +
  scale_y_continuous(labels = label_number(scale = 1e0)) +
    scale_fill_manual(values = c("#006994", "#00a36f", "#ffa700"))
    

In [None]:
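# You may write your results as a CSV. collect() brings the Spark table into
# local R memory first, so make sure the filtered table is small enough.
omop_condition_occurrence_filtered %>%
    collect() %>%
    readr::write_csv('omop_condition_occurrence_filtered.csv')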

By including all descendant conditions, we gain additional information on nearly 29k participants (20.5%).

### Effect on total observations

There are 64,995 condition_occurrence_ids that do not have Essential hypertension but have a descendant condition within Hypertensive disorder

In [None]:
# count total condition occurrence ids
omop_condition_occurrence_filtered %>% distinct(condition_occurrence_id) %>% count() 
# count condition occurrence ids with just essential hypertension
omop_condition_occurrence_filtered %>% filter(descendant_concept_name == "Essential hypertension") %>% distinct(condition_occurrence_id) %>% count() 

# SQL within sparklyr

Using SQL queries within Spark can be very powerful, and selecting only the rows and columns you need reduces the size of the resulting Spark dataframes.

In [None]:
# eids with a Hypertensive disorder
hypertension_eid <- omop_condition_occurrence_filtered %>% distinct(eid) %>% pull(eid) 

In [None]:
# to know what columns are within a table
sdf_sql(sc, paste0("SHOW COLUMNS FROM ", database, ".omop_person")) %>% pull(col_name)

## Filter the OMOP person table

We filter the OMOP person table to include only eids that have a condition within our hypertension conditions of interest.

* When filtering within `sdf_sql`, building the query with `paste0()` and `toString(filter_values)` ran much faster than using `glue_sql()` with `{filter_values*}`
* The meaning of each gender_concept_id can be found within [Athena](https://athena.ohdsi.org/search-terms/terms?domain=Gender&standardConcept=Standard&page=1&pageSize=15&query=8532+8507&boosts)
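
To see what the `paste0()` and `toString()` approach produces, the sketch below builds the query string from made-up values. The database name and eids are hypothetical, and nothing is sent to Spark. Integer eids are used so that the numbers are written out in full and not in scientific notation.

```r
# Hypothetical database name and made-up participant ids
database <- "database_example__app12345_20240115093000"
hypertension_eid <- c(1000001L, 1000002L, 1000003L)

# toString() collapses the vector into a comma-separated list for the IN clause
query <- paste0(
  "SELECT eid, year_of_birth, gender_concept_id FROM ",
  database, ".omop_person WHERE eid IN (",
  toString(hypertension_eid), ")"
)

print(query)
```

The resulting string is plain SQL, so it can be inspected before being passed to `sdf_sql()`.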

In [None]:
omop_persons_query_filtered  <- paste0("SELECT eid, year_of_birth, gender_concept_id  FROM  ", database, ".omop_person WHERE eid  IN (", toString(hypertension_eid), ")")
omop_persons <- sdf_sql(sc, omop_persons_query_filtered) %>%
    mutate(gender = case_when(gender_concept_id == 8532 ~ "Female", gender_concept_id == 8507 ~ "Male"),
           decade = case_when(
               between(as.numeric(year_of_birth), 1930, 1939) ~ "1930s",
               between(as.numeric(year_of_birth), 1940, 1949) ~ "1940s",
               between(as.numeric(year_of_birth), 1950, 1959) ~ "1950s",
               between(as.numeric(year_of_birth), 1960, 1969) ~ "1960s",
               between(as.numeric(year_of_birth), 1970, 1979) ~ "1970s",
               between(as.numeric(year_of_birth), 1980, 1989) ~ "1980s")) %>%
    count(decade, gender)

##### Plot the decade of birth by sex for participants with a Hypertensive disorder

In [None]:
omop_persons %>%
    ggplot(aes(x = decade, y = n, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Count of Gender by Birth Decade",
       x = "Decade",
       y = "Count",
       fill = "Gender") +
  scale_fill_manual(values = c('#00429d', '#ffa83c')) +
  theme_minimal()

## Summary

The code in this notebook demonstrates how to use the OMOP resources and tables, and how to leverage Spark within R using `sparklyr`. It provides examples of connecting to Spark, running basic SQL, and interacting with Spark dataframes, which gives much faster computation for larger dataframes. The notebook shows only a small fraction of what is possible using the OMOP data. Multiple tables can be joined together to create a comprehensive data pool.

If you encounter any problems while running this notebook, please add issues to the [UKB-RAP-Notebooks github](https://github.com/UK-Biobank/UKB-RAP-Notebooks/issues).