# Participants with statin use

In this notebook we review and explore the data for participants with statin use.

<div class="alert alert-block alert-success">
This notebook was exported from dataset "Participants with statin use" and further modified.
</div>

See also:
* statin drug [concept set](https://preprod-workbench.researchallofus.org/workspaces/aou-rw-preprod-e2a365a8/pooledanalysisofallofusandukbiobankgenomicdata/data/concepts/sets/53)
* participants with statin use [dataset](https://preprod-workbench.researchallofus.org/workspaces/aou-rw-preprod-e2a365a8/pooledanalysisofallofusandukbiobankgenomicdata/data/data-sets/70)

TODOs
  * the current set of concept codes pulls in some non-statin medications, such as B vitamins (which include niacin)
  * determine whether any particular drug exposures should be discarded (e.g., those too short in duration)
  * many drug exposures have an NA for DRUG_EXPOSURE_END_DATETIME, consider filling that in for certain categories such as 

# Retrieve the dataset created by dataset builder 

In [None]:
library(bigrquery)

# This query represents dataset "Participants with statin use" for domain "drug" and was 
# generated for All of Us Dataset v5.
dataset_01268909_drug_sql <- paste("
    SELECT
        d_exposure.SIG,
        d_exposure.VERBATIM_END_DATE,
        d_exposure.ROUTE_SOURCE_VALUE,
        d_exposure.ROUTE_CONCEPT_ID,
        d_exposure.DRUG_CONCEPT_ID,
        d_exposure.DRUG_EXPOSURE_START_DATETIME,
        d_exposure.DRUG_TYPE_CONCEPT_ID,
        d_exposure.VISIT_OCCURRENCE_ID,
        d_exposure.DRUG_SOURCE_VALUE,
        d_exposure.DAYS_SUPPLY,
        d_exposure.QUANTITY,
        d_exposure.REFILLS,
        d_exposure.DOSE_UNIT_SOURCE_VALUE,
        d_exposure.LOT_NUMBER,
        d_exposure.DRUG_EXPOSURE_END_DATETIME,
        d_exposure.PERSON_ID,
        d_exposure.DRUG_SOURCE_CONCEPT_ID,
        d_exposure.STOP_REASON,
        d_route.concept_name as ROUTE_CONCEPT_NAME,
        d_type.concept_name as DRUG_TYPE_CONCEPT_NAME,
        d_standard_concept.concept_code as STANDARD_CONCEPT_CODE,
        d_standard_concept.vocabulary_id as STANDARD_VOCABULARY,
        d_standard_concept.concept_name as STANDARD_CONCEPT_NAME,
        d_source_concept.concept_code as SOURCE_CONCEPT_CODE,
        d_source_concept.concept_name as SOURCE_CONCEPT_NAME,
        d_source_concept.vocabulary_id as SOURCE_VOCABULARY,
        d_visit.concept_name as VISIT_OCCURRENCE_CONCEPT_NAME 
    from
        ( SELECT
            * 
        from
            `drug_exposure` d_exposure 
        WHERE
            (
                drug_concept_id in  (
                    select
                        distinct ca.descendant_id 
                    from
                        `cb_criteria_ancestor` ca 
                    join
                        (
                            select
                                distinct c.concept_id 
                            from
                                `cb_criteria` c 
                            join
                                (
                                    select
                                        cast(cr.id as string) as id 
                                    from
                                        `cb_criteria` cr 
                                    where
                                        domain_id = 'DRUG' 
                                        and is_standard = 1 
                                        and concept_id in (
                                            46287466, 1551860, 1549686, 1592085, 1592180, 1545958, 1526475, 1332418, 46275447, 1510813, 40165636, 1517824, 1539403
                                        ) 
                                        and is_selectable = 1 
                                        and full_text like '%[drug_rank1]%'
                                ) a 
                                    on (
                                        c.path like concat('%.',
                                    a.id,
                                    '.%') 
                                    or c.path like concat('%.',
                                    a.id)) 
                                where
                                    domain_id = 'DRUG' 
                                    and is_standard = 1 
                                    and is_selectable = 1
                                ) b 
                                    on (
                                        ca.ancestor_id = b.concept_id
                                    )
                            )
                        )
                ) d_exposure 
        LEFT JOIN
            `concept` d_route 
                on d_exposure.ROUTE_CONCEPT_ID = d_route.CONCEPT_ID 
        LEFT JOIN
            `concept` d_type 
                on d_exposure.drug_type_concept_id = d_type.CONCEPT_ID 
        left join
            `concept` d_standard_concept 
                on d_exposure.DRUG_CONCEPT_ID = d_standard_concept.CONCEPT_ID 
        LEFT JOIN
            `concept` d_source_concept 
                on d_exposure.DRUG_SOURCE_CONCEPT_ID = d_source_concept.CONCEPT_ID 
        left join
            `visit_occurrence` v 
                on d_exposure.VISIT_OCCURRENCE_ID = v.VISIT_OCCURRENCE_ID 
        LEFT JOIN
            `concept` d_visit 
                on v.VISIT_CONCEPT_ID = d_visit.CONCEPT_ID", sep="")

dataset_01268909_drug_df <- bq_table_download(bq_dataset_query(Sys.getenv("WORKSPACE_CDR"), dataset_01268909_drug_sql, billing=Sys.getenv("GOOGLE_PROJECT")), bigint="integer64")

In [None]:
dim(dataset_01268909_drug_df)

In [None]:
head(dataset_01268909_drug_df, 5)

# Setup 

<div class="alert alert-block alert-warning">
    <b>Cloud Environment</b>: This notebook was written for use on the All of Us Workbench. It runs fine on the default Cloud Environment. 
</div>

In [None]:
library(skimr)
library(tidyverse)
library(lubridate)

In [None]:
## BigQuery setup.
BILLING_PROJECT_ID <- Sys.getenv('GOOGLE_PROJECT')
# Get the BigQuery curated dataset for the current workspace context.
CDR <- Sys.getenv('WORKSPACE_CDR')

In [None]:
# Shorten the name.
drug_df <- dataset_01268909_drug_df

# Explore the data 

In [None]:
print(skim(drug_df))

## Examine the drugs

In [None]:
exposure_counts_by_drug <- drug_df %>%
    group_by(STANDARD_CONCEPT_NAME) %>%
    summarize(
        count = n()
    ) %>%
    arrange(desc(count))

dim(exposure_counts_by_drug)

In [None]:
exposure_counts_by_drug

In [None]:
exposure_counts_by_drug %>% filter(str_detect(STANDARD_CONCEPT_NAME, '(?i)niacin'))

<div class="alert alert-block alert-warning">
<p>In the above table, we see some drug concepts for B vitamin use, unrelated to statin use. This list of concept ids needs to be trimmed to just the relevant concepts.</p>

<p>Alternatively, when we have the list of the 1,000 participant ids and determine which of those use any of these medications, we can then "review" that subset of the cohort to check whether their medications really do indicate statin use.</p>
</div>
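
A minimal sketch of the review step described above, using a toy data frame in place of `drug_df`. It assumes that the eight statin generic names listed in the appendix are the ones of interest. The names `toy_drug_df`, `TRUE_STATINS`, and `needs_review_df` are illustrative and not part of the original notebook.

```r
library(dplyr)
library(stringr)

# Toy stand-in for drug_df; the concept names are illustrative only.
toy_drug_df <- data.frame(
    PERSON_ID = c(1, 1, 2, 3, 3),
    STANDARD_CONCEPT_NAME = c(
        'atorvastatin 20 MG Oral Tablet',
        'niacin 500 MG Oral Tablet',
        'niacin 500 MG Oral Tablet',
        'simvastatin 40 MG Oral Tablet',
        'ezetimibe 10 MG / simvastatin 40 MG Oral Tablet'
    ),
    stringsAsFactors = FALSE
)

# Assumption: an exposure counts as statin use only if its concept name
# contains one of these generic names (case-insensitive).
TRUE_STATINS <- c('atorvastatin', 'fluvastatin', 'lovastatin', 'pravastatin',
                  'rosuvastatin', 'simvastatin', 'pitavastatin', 'cerivastatin')
statin_pattern <- regex(str_c(TRUE_STATINS, collapse = '|'), ignore_case = TRUE)

# Flag participants who have no exposure that names an actual statin.
needs_review_df <- toy_drug_df %>%
    group_by(PERSON_ID) %>%
    summarize(
        exposure_count = n(),
        statin_exposure_count = sum(str_detect(STANDARD_CONCEPT_NAME, statin_pattern))
    ) %>%
    mutate(needs_review = statin_exposure_count == 0)

needs_review_df
```

An explicit list of generic names is used rather than a bare match on "statin" because unrelated drug names (for example nystatin) also contain that substring.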

## Examine the drug type concepts

In [None]:
drug_df %>%
    group_by(DRUG_TYPE_CONCEPT_NAME) %>%
    summarize(
        count = n()
    ) %>%
    arrange(desc(count))

<div class="alert alert-block alert-info">
<b>Note:</b> 'Medication list entry' and 'Patient Self-Reported Medication' appear to be strong indications of regular statin use.
</div>

## Examine the drug exposure durations 

For the individual drug exposures, what is the distribution of durations? Are these mostly one-month prescriptions, reports of long-term use, or both?

### All 

In [None]:
drug_df %>%
    mutate(
        duration = as.duration(DRUG_EXPOSURE_END_DATETIME - DRUG_EXPOSURE_START_DATETIME)
    ) %>%
    group_by(duration) %>%
    summarize(
        count = n()
    ) %>%
    arrange(desc(count))

<div class="alert alert-block alert-warning">
<b>Question:</b> discard all drug exposures with a short duration? (e.g. must be at least one week long)
</div>
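
If short exposures were to be discarded, one option might look like the sketch below. It uses a toy data frame in place of `drug_df`; the one-week threshold (`MIN_DURATION_DAYS`) and the decision to keep exposures with no end date are assumptions, not settled choices from this notebook.

```r
library(dplyr)

# Toy stand-in for drug_df with only the columns needed here.
toy_drug_df <- data.frame(
    PERSON_ID = c(1, 2, 3, 4),
    DRUG_EXPOSURE_START_DATETIME = as.POSIXct(
        c('2019-01-01 00:00:00', '2019-02-01 00:00:00',
          '2019-03-01 00:00:00', '2019-04-01 00:00:00'),
        tz = 'UTC'),
    DRUG_EXPOSURE_END_DATETIME = as.POSIXct(
        c('2019-01-31 00:00:00', '2019-02-01 00:00:00',
          NA, '2019-04-04 00:00:00'),
        tz = 'UTC')
)

# Assumed threshold: an exposure must last at least one week.
MIN_DURATION_DAYS <- 7

kept_df <- toy_drug_df %>%
    mutate(
        # Duration as a plain number of days so it is easy to threshold.
        duration_days = as.numeric(difftime(
            DRUG_EXPOSURE_END_DATETIME,
            DRUG_EXPOSURE_START_DATETIME,
            units = 'days'))
    ) %>%
    # Assumption: exposures with an unknown end (NA duration) are kept.
    filter(is.na(duration_days) | duration_days >= MIN_DURATION_DAYS)

kept_df
```

Keeping the NA rows avoids silently dropping the open-ended records, which may represent ongoing use.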

### Type 'Medication list entry'

In [None]:
drug_df %>%
    filter(DRUG_TYPE_CONCEPT_NAME %in% c('Medication list entry')) %>%
    mutate(
        duration = as.duration(DRUG_EXPOSURE_END_DATETIME - DRUG_EXPOSURE_START_DATETIME)
    ) %>%
    group_by(duration) %>%
    summarize(
        count = n()
    ) %>%
    arrange(desc(count))

<div class="alert alert-block alert-warning">
<b>Question:</b> fill NA DRUG_EXPOSURE_END_DATETIME with now()?
</div>

### Type 'Patient Self-Reported Medication'

In [None]:
drug_df %>%
    filter(DRUG_TYPE_CONCEPT_NAME %in% c('Patient Self-Reported Medication')) %>%
    mutate(
        duration = as.duration(DRUG_EXPOSURE_END_DATETIME - DRUG_EXPOSURE_START_DATETIME)
    ) %>%
    group_by(duration) %>%
    summarize(
        count = n()
    ) %>%
    arrange(desc(count))

<div class="alert alert-block alert-warning">
<b>Question:</b> fill NA DRUG_EXPOSURE_END_DATETIME with now()?
</div>
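
One possible way to fill missing end datetimes for the two drug types examined above is sketched here on toy data. The fixed `REFERENCE_TIME` (used instead of `now()` so that results are reproducible), the `end_was_imputed` flag, and the toy rows are all assumptions for illustration.

```r
library(dplyr)

# Toy stand-in for drug_df; 'Prescription written' is an illustrative third type.
toy_drug_df <- data.frame(
    DRUG_TYPE_CONCEPT_NAME = c(
        'Medication list entry',
        'Patient Self-Reported Medication',
        'Prescription written',
        'Medication list entry'
    ),
    DRUG_EXPOSURE_END_DATETIME = as.POSIXct(
        c(NA, NA, NA, '2019-06-01 00:00:00'),
        tz = 'UTC'),
    stringsAsFactors = FALSE
)

# Only these drug types get their missing end datetime filled in.
FILL_TYPES <- c('Medication list entry', 'Patient Self-Reported Medication')
# Assumed reference time; swap in lubridate::now() if "as of today" is wanted.
REFERENCE_TIME <- as.POSIXct('2020-01-01 00:00:00', tz = 'UTC')

filled_df <- toy_drug_df %>%
    mutate(
        # Record which rows were imputed so they can be told apart later.
        end_was_imputed = is.na(DRUG_EXPOSURE_END_DATETIME) &
            DRUG_TYPE_CONCEPT_NAME %in% FILL_TYPES,
        DRUG_EXPOSURE_END_DATETIME = if_else(
            end_was_imputed, REFERENCE_TIME, DRUG_EXPOSURE_END_DATETIME)
    )

filled_df
```

Rows of other drug types keep their NA, and the flag column makes it possible to exclude imputed values from any duration analysis.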

# Summarize the data by patient

In [None]:
drug_df %>%
    group_by(PERSON_ID) %>%
    summarize(
        # Note, this is simplistically assuming continuous use.
        first_use = min(DRUG_EXPOSURE_START_DATETIME),
        last_use_start = max(DRUG_EXPOSURE_START_DATETIME),
        last_use_end = max(DRUG_EXPOSURE_END_DATETIME),  # This is often NA.
        statin_drug_rx_count = n(),
        statin_drug_count = length(unique(STANDARD_CONCEPT_CODE)),
        statin_drugs = str_c(sort(unique(STANDARD_CONCEPT_NAME)), collapse = ', ')
    ) %>%
    arrange(desc(statin_drug_count))
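
Because `max()` returns NA whenever any value in the group is NA, `last_use_end` above is NA for every participant with at least one open-ended exposure. The sketch below, on toy data with shortened column names (`START`, `END`), shows one hedged alternative that reports the latest known end alongside a count of open-ended exposures. The column names `open_ended_rx_count` and `last_known_end` are hypothetical.

```r
library(dplyr)

# Toy stand-in for drug_df with shortened column names.
toy_drug_df <- data.frame(
    PERSON_ID = c(1, 1, 2, 2),
    START = as.POSIXct(
        c('2018-01-01 00:00:00', '2018-06-01 00:00:00',
          '2019-01-01 00:00:00', '2019-03-01 00:00:00'),
        tz = 'UTC'),
    END = as.POSIXct(
        c('2018-01-31 00:00:00', NA,
          '2019-01-31 00:00:00', '2019-03-31 00:00:00'),
        tz = 'UTC')
)

summary_df <- toy_drug_df %>%
    group_by(PERSON_ID) %>%
    summarize(
        first_use = min(START),
        # Number of exposures with no recorded end.
        open_ended_rx_count = sum(is.na(END)),
        # NA as soon as any exposure in the group lacks an end.
        last_use_end = max(END),
        # Latest recorded end, ignoring NAs; NA if no end is recorded at all.
        last_known_end = if (all(is.na(END))) {
            as.POSIXct(NA, tz = 'UTC')
        } else {
            max(END, na.rm = TRUE)
        }
    )

summary_df
```

The explicit `all(is.na(END))` guard avoids the `-Inf` value and warning that `max(..., na.rm = TRUE)` produces on an empty input.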

# Export to CSV

Export the individual drug exposures for now. Later, if the summarized information becomes more accurate, that could be exported instead.

In [None]:
# This snippet assumes that you run setup first

# This code saves your dataframe into a csv file in a "data" folder in Google Bucket

# Replace df with THE NAME OF YOUR DATAFRAME
my_dataframe <- drug_df

# Replace 'test.csv' with THE NAME of the file you're going to store in the bucket (don't delete the quotation marks)
destination_filename <- 'participants_with_statin_use.csv'

########################################################################
##
################# DON'T CHANGE FROM HERE ###############################
##
########################################################################

# store the dataframe in current workspace
write_excel_csv(my_dataframe, destination_filename)

# Get the bucket name
my_bucket <- Sys.getenv('WORKSPACE_BUCKET')

# Copy the file from current workspace to the bucket
system(paste0("gsutil cp ./", destination_filename, " ", my_bucket, "/data/"), intern = TRUE)

# Check if file is in the bucket
system(paste0("gsutil ls ", my_bucket, "/data/*.csv"), intern = TRUE)


# Appendix - alternate identification of statin concept codes

This section of the notebook attempts to identify relevant statin use drug concept ids. It is modelled after the code found in the featured workspace [Type 2 Diabetes Analysis](https://workbench.researchallofus.org/workspaces/aou-rw-c697f47e/phenotypetype2diabetes/notebooks/preview/Type%202%20Diabetes%20Analysis.ipynb).

In [None]:
## ---------------[ CHANGE THESE AS NEEDED] ---------------------------------------
STATIN_GENERICS <- c(
  'Atorvastatin',
  'Fluvastatin',
  'Lovastatin',
  'Pravastatin',
  'Rosuvastatin',
  'Simvastatin',
  'Pitavastatin',
  'Cerivastatin',
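  # The next three are not statins themselves; they are presumably listed
  # because they appear in combination products with statins.
  'amlodipine',
  'Niacin',
  'Ezetimibe')  # All of these were available in cohort builder.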

STATIN_RX_ADJUVANT_THERAPIES <- c(
  'ezetimibe',  # Also listed in STATIN_GENERICS above.
  'Zetia',  # In cohort builder, included under Ezetimibe.
  'alirocumab',
  'Praluent',  # In cohort builder, included under Alirocumab.
  'evolocumab',
  'Repatha') # In cohort builder, included under Evolocumab.

In [None]:
(STATIN_DRUGS <- str_c('LOWER(c.concept_name) LIKE "%',
      str_to_lower(c(STATIN_GENERICS, STATIN_RX_ADJUVANT_THERAPIES)),
     '%"',
     collapse = ' OR '))
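
To make the shape of the generated SQL fragment concrete, here is the same `str_c()` pattern applied to a two-element vector. The names `example_names` and `example_clause` are illustrative only.

```r
library(stringr)

# Two illustrative names instead of the full drug lists.
example_names <- c('Atorvastatin', 'Zetia')

# Each name becomes one case-insensitive substring test; the tests are OR-ed.
example_clause <- str_c(
    'LOWER(c.concept_name) LIKE "%',
    str_to_lower(example_names),
    '%"',
    collapse = ' OR ')

cat(example_clause, '\n')
# LOWER(c.concept_name) LIKE "%atorvastatin%" OR LOWER(c.concept_name) LIKE "%zetia%"
```

Because each term is a substring match, a search term such as `niacin` matches any concept whose name contains it, which is one likely reason unrelated products can appear in the results.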

In [None]:
# page_size controls how many rows bq_table_download() requests per chunk.
(statin_drugs_summary_df <- bq_table_download(
    bq_project_query(
        BILLING_PROJECT_ID,
        query = str_glue('
SELECT
    DISTINCT c2.concept_name,
    c2.concept_code,
    c2.concept_id
FROM
    `{CDR}.concept` c
    JOIN `{CDR}.concept_ancestor` ca
        ON c.concept_id = ca.ancestor_concept_id
    JOIN `{CDR}.concept` c2
        ON c2.concept_id = ca.descendant_concept_id
WHERE
    c.concept_class_id = "Ingredient"
    AND ({STATIN_DRUGS})')),
    page_size = 25000))

<div class="alert alert-block alert-warning">
<p>In the above table, we see some drug concepts for B vitamin use, unrelated to statin use. This list of concept ids needs to be trimmed to just the relevant concepts.</p>

<p>Alternatively, when we have the list of the 1,000 participant ids and determine which of those use any of these medications, we can then "review" that subset of the cohort to check whether their medications really do indicate statin use.</p>
</div>
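
Trimming the concept list could be sketched as follows, on a toy table shaped like `statin_drugs_summary_df`. The concept names and ids below are made up for illustration, and keeping only concepts that name one of the eight statin generics is an assumption about what "relevant" means here.

```r
library(dplyr)
library(stringr)

# Toy stand-in for statin_drugs_summary_df; ids are made up.
toy_summary_df <- data.frame(
    concept_name = c(
        'atorvastatin 10 MG Oral Tablet',
        'niacin 500 MG Extended Release Oral Tablet',
        'lovastatin 20 MG / niacin 500 MG Oral Tablet',
        'vitamin B complex'
    ),
    concept_id = c(1, 2, 3, 4),
    stringsAsFactors = FALSE
)

# Assumption: a concept is relevant only if its name contains a statin generic.
TRUE_STATINS <- c('atorvastatin', 'fluvastatin', 'lovastatin', 'pravastatin',
                  'rosuvastatin', 'simvastatin', 'pitavastatin', 'cerivastatin')

trimmed_df <- toy_summary_df %>%
    filter(str_detect(
        concept_name,
        regex(str_c(TRUE_STATINS, collapse = '|'), ignore_case = TRUE)))

# Concept ids that could replace the hard-coded list in the dataset query.
trimmed_df$concept_id
```

Combination products that include a statin are retained, while niacin-only and vitamin concepts drop out.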

In [None]:
# This snippet assumes that you run setup first

# This code saves your dataframe into a csv file in a "data" folder in Google Bucket

# Replace df with THE NAME OF YOUR DATAFRAME
my_dataframe <- statin_drugs_summary_df

# Replace 'test.csv' with THE NAME of the file you're going to store in the bucket (don't delete the quotation marks)
destination_filename <- 'statin_drugs_summary.csv'

########################################################################
##
################# DON'T CHANGE FROM HERE ###############################
##
########################################################################

# store the dataframe in current workspace
write_excel_csv(my_dataframe, destination_filename)

# Get the bucket name
my_bucket <- Sys.getenv('WORKSPACE_BUCKET')

# Copy the file from current workspace to the bucket
system(paste0("gsutil cp ./", destination_filename, " ", my_bucket, "/data/"), intern = TRUE)

# Check if file is in the bucket
system(paste0("gsutil ls ", my_bucket, "/data/*.csv"), intern = TRUE)

# Provenance 

In [None]:
devtools::session_info()