<a href="https://colab.research.google.com/github/cps41/health-informatics-portfolio/blob/main/assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 1



## Background
This notebook was created for an assignment in HINF-4220: Data Mining for Health Data Science at the University of Denver.

The purpose of this assignment is to:
- Develop advanced skills in data cleaning and preprocessing techniques to prepare datasets for sustainable analysis, specifically with clinical data.
- Manage large-scale clinical data sets using cloud-based tools such as a data warehouse and notebooks.
- Apply descriptive statistical methods to analyze health data with summary statistics, visualizations, and pattern/trend discovery.

To do so, we were allowed to choose a health database and answer three of the provided questions. This notebook walks through my process for answering my chosen questions, with a focus on the goals of the assignment.


### Database
I chose the **CMS Synthetic Patient Data OMOP Database**, which follows the OMOP CDM and is populated with synthetic data. More information can be found [here](https://redivis.com/datasets/ye2v-6skh7wdr7). This database is accessible via Google Cloud Platform for free. I chose it primarily to build on the knowledge of the OMOP CDM that I have been gaining in this course, as I believe it will be an important skill for my personal goals.

### General Approach
While I had background knowledge of the OMOP CDM, I lacked hands-on technical experience with it. To start, I reviewed the [OMOP Documentation](https://ohdsi.github.io/CommonDataModel/index.html) and previewed some of the sample queries for the CMS Synthetic Patient Data OMOP database. Then, I used the [OMOP CDM v5.4 Documentation](https://ohdsi.github.io/CommonDataModel/cdm54.html) to determine which tables and fields to use, and how.

## Setup
First, we must authenticate our Google account in Colab to access the database through BigQuery. Additionally, for displaying results, I've loaded the provided `google.colab.data_table` extension.

In [62]:
from google.colab import auth
from google.cloud import bigquery

auth.authenticate_user()
print('Authenticated')

%load_ext google.colab.data_table

Authenticated
The google.colab.data_table extension is already loaded. To reload it, use:
  %reload_ext google.colab.data_table


Next, create the BigQuery client. I created a project for assignment 1 and used its ID to create the client connection. To run the notebook yourself, replace this ID with your own project ID.

In [63]:
project_id = "assignment-1-448801"
client = bigquery.Client(project=project_id)

## Q1: How many prescriptions were for metformin each month?

### Assumptions
- [Drug exposures are prescription instances](https://ohdsi.github.io/CommonDataModel/cdm54.html#drug_exposure)
- We only want to retrieve instances of prescriptions that were filled
- A drug exposure with quantity of 0 can be left out
- A drug exposure with days supply of 0 can be left out
- A drug exposure with a null start date can be left out
- A drug exposure with a missing drug concept can be left out
- We only want to retrieve concepts under the `Drug` domain
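
The assumptions above translate into row-level filters. As a minimal sketch, assuming a small hypothetical extract of `drug_exposure` already joined to `concept` (the rows and IDs below are made up for illustration, not taken from the database), the same conditions can be expressed in pandas:

```python
import pandas as pd

# Hypothetical rows shaped like drug_exposure joined to concept (illustrative only)
exposures = pd.DataFrame({
    "drug_exposure_id": [1, 2, 3, 4, 5, 6],
    "drug_concept_id": [101, 101, None, 102, 103, 104],
    "domain_id": ["Drug", "Drug", "Drug", "Drug", "Device", "Drug"],
    "quantity": [30, 0, 60, 90, 30, 60],
    "days_supply": [30, 30, 30, 0, 30, 60],
    "drug_exposure_start_date": pd.to_datetime(
        ["2008-01-05", "2008-01-09", "2008-02-01", "2008-02-11", None, "2008-03-01"]
    ),
})

# One boolean mask per assumption, combined with & (logical AND)
keep = (
    (exposures["domain_id"] == "Drug")
    & (exposures["quantity"] > 0)
    & (exposures["days_supply"] > 0)
    & exposures["drug_exposure_start_date"].notna()
    & exposures["drug_concept_id"].notna()
)

filtered = exposures[keep]
print(filtered["drug_exposure_id"].tolist())
```

Only rows 1 and 6 satisfy every condition; each of the other rows violates exactly one or more of the assumptions.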

### Approach

To answer this question, we need to search the drug_exposure table for exposures involving metformin and obtain the total exposures per month.

First, let's confirm all of the concepts containing `metformin` in the name are valid.

In [64]:
metformin_concepts = client.query('''
  SELECT
    DISTINCT c.concept_name
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` as e
    -- join to drug concept
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c
    ON e.drug_concept_id = c.concept_id
    -- limit concept types to those referencing metformin
  WHERE LOWER(c.concept_name) LIKE "%metformin%"
    -- apply assumptions
    AND c.domain_id = 'Drug'
    AND e.quantity > 0
    AND days_supply > 0
    AND e.drug_exposure_start_date IS NOT NULL
    AND e.drug_concept_id IS NOT NULL
''').to_dataframe()

metformin_concepts

Unnamed: 0,concept_name
0,24 HR Metformin hydrochloride 750 MG Extended ...
1,24 HR Metformin hydrochloride 500 MG Extended ...
2,Metformin hydrochloride 500 MG Extended Releas...
3,Metformin hydrochloride 850 MG Oral Tablet
4,24 HR Metformin hydrochloride 500 MG Extended ...
5,Glipizide 2.5 MG / Metformin hydrochloride 250...
6,Osmotic 24 HR Metformin hydrochloride 500 MG E...
7,Glyburide 5 MG / Metformin hydrochloride 500 M...
8,Metformin hydrochloride 1000 MG Oral Tablet
9,Metformin hydrochloride 1000 MG / rosiglitazon...


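Every concept returned is a metformin product, either single-ingredient or a combination (for example, with glipizide or glyburide), so all of them are valid for this question.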

Now, let's see what types of exposures we are dealing with. To do so, we will join the drug_exposure table with the concept table for matching drug concepts and drug type concepts. This will allow us to limit the rows to just exposures of metformin, and then check what types of exposures are relevant.

> The prior query could also be stored in a dataframe and then joined with the drug type concept to avoid querying the data again, though with large datasets this may not be performant.
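
The dataframe-based alternative mentioned in the note could look roughly like the sketch below. It assumes two hypothetical dataframes, `metformin_exposures` and `type_concepts`, that have already been pulled from BigQuery; the exposure rows are made up, and the type concept ID and name match the drug type reported in this notebook.

```python
import pandas as pd

# Hypothetical extract of metformin exposures already held in memory
metformin_exposures = pd.DataFrame({
    "drug_exposure_id": [1, 2, 3],
    "drug_type_concept_id": [38000175, 38000175, 38000175],
})

# Hypothetical extract of the concept table, limited to drug type concepts
type_concepts = pd.DataFrame({
    "concept_id": [38000175],
    "concept_name": ["Prescription dispensed in pharmacy"],
})

# Equivalent of the INNER JOIN on drug_type_concept_id = concept_id
joined = metformin_exposures.merge(
    type_concepts,
    left_on="drug_type_concept_id",
    right_on="concept_id",
    how="inner",
)

# Equivalent of SELECT DISTINCT concept_name, concept_id
exposure_types = joined[["concept_name", "concept_id"]].drop_duplicates()
print(exposure_types)
```

This avoids a second query, at the cost of holding every matching exposure row in memory first.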



In [65]:
metformin_exposure_types = client.query('''
  SELECT DISTINCT c2.concept_name, c2.concept_id as type_concept_id
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` as e
    -- join to drug concept
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c1
    ON e.drug_concept_id = c1.concept_id
    -- join to drug type concept
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c2
    ON e.drug_type_concept_id = c2.concept_id
  -- limit concept types to reference metformin
  WHERE LOWER(c1.concept_name) LIKE "%metformin%"
    -- apply assumptions
    AND c1.domain_id = 'Drug'
    AND e.quantity > 0
    AND days_supply > 0
    AND e.drug_exposure_start_date IS NOT NULL
    AND e.drug_concept_id IS NOT NULL
''').to_dataframe()

metformin_exposure_types

Unnamed: 0,concept_name,type_concept_id
0,Prescription dispensed in pharmacy,38000175


Fortunately, the only exposure type present is `Prescription dispensed in pharmacy`. We will still filter on it in the final query to be thorough.

### Conclusion

We can now perform our final query to answer how many prescriptions were for metformin each month.

We will join the drug exposure table to the concept table again, matching on drug concept and keeping concepts that contain "metformin" in the name. We will also limit the query based on our assumptions. With these rows, we will group by the extracted year and month, and count the number of drug exposures in each group.

In [66]:
# store drug type concept id to limit on
prescription_type_id = metformin_exposure_types['type_concept_id'][0]

metformin_prescriptions_per_month = client.query(f'''
  SELECT
    -- retrieve month and year for grouping
    EXTRACT(MONTH FROM e.drug_exposure_start_date) as month,
    EXTRACT(YEAR FROM e.drug_exposure_start_date) as year,
    -- retrieve count of unique exposures
    COUNT(DISTINCT e.drug_exposure_id) as prescription_count
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` as e
    -- join on metformin concepts
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c
    ON e.drug_concept_id = c.concept_id
  WHERE LOWER(c.concept_name) LIKE "%metformin%"
    -- apply assumptions
    AND c.domain_id = 'Drug'
    AND e.quantity > 0
    AND days_supply > 0
    AND e.drug_exposure_start_date IS NOT NULL
    AND e.drug_concept_id IS NOT NULL
    AND e.drug_type_concept_id = {prescription_type_id}
  GROUP BY year, month
  ORDER BY year, month
''').to_dataframe()

metformin_prescriptions_per_month

Unnamed: 0,month,year,prescription_count
0,1,2008,25890
1,2,2008,37975
2,3,2008,48317
3,4,2008,51526
4,5,2008,54096
5,6,2008,52819
6,7,2008,54202
7,8,2008,54754
8,9,2008,53177
9,10,2008,55515


To visualize this data, I have chosen a simple bar graph, which makes it easy to read and to see the difference in metformin prescriptions from month to month.

For the x-axis, we need to combine the month and year into a single label that is unique for each row.

In [67]:
import plotly.express as px

# combine month and year for indexing
metformin_prescriptions_per_month['year_month'] = metformin_prescriptions_per_month['year'].astype(str) + '-' + metformin_prescriptions_per_month['month'].astype(str).str.zfill(2)

fig = px.bar(metformin_prescriptions_per_month, x='year_month', y="prescription_count", title = 'Metformin Prescriptions Per Month', height=600, text_auto=True)
fig.update_layout(xaxis=dict(title='Year, Month'), yaxis=dict(title='Total Prescriptions'))
fig.show()
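
As an alternative to concatenating strings, the label can be derived from a real date column. This is a minimal sketch, assuming a small dataframe named `monthly` (a hypothetical name) that holds the first three rows of the result above:

```python
import pandas as pd

# First three rows of the monthly metformin counts shown above
monthly = pd.DataFrame({
    "month": [1, 2, 3],
    "year": [2008, 2008, 2008],
    "prescription_count": [25890, 37975, 48317],
})

# pd.to_datetime accepts a frame with year, month, and day columns
monthly["month_start"] = pd.to_datetime(monthly[["year", "month"]].assign(day=1))

# Format as a zero-padded year-month label
monthly["year_month"] = monthly["month_start"].dt.strftime("%Y-%m")
print(monthly["year_month"].tolist())
```

Keeping a true datetime column such as `month_start` also lets a plotting library treat the x-axis as a time axis, should that be preferred over text labels.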

## Q2: How many patients were prescribed insulin month over month compared to last year?


### Assumptions
- Drug exposures are prescription instances
- We only want to retrieve instances of prescriptions that were filled
- A drug exposure with quantity of 0 can be left out
- A drug exposure with days supply of 0 can be left out
- A drug exposure with a null start date can be left out
- A drug exposure with a missing drug concept can be left out
- We only want to retrieve concepts under the `Drug` domain
- Due to the constraints of the database, we will use years 2009 and 2010 for comparison

### Approach

To answer this question, we will take a similar approach to Q1. We need to search the drug_exposure table for exposures involving insulin, then obtain the total exposures per month. Because each drug exposure is treated as a prescription, the counts below are prescriptions rather than distinct patients. The remaining calculations will be done with the pandas library.

First, let's confirm all of the concepts containing `insulin` in the name are valid.

In [68]:
insulin_concepts = client.query('''
  SELECT
    DISTINCT c.concept_name
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` as e
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c
    ON e.drug_concept_id = c.concept_id
  WHERE LOWER(c.concept_name) LIKE "%insulin%"
    AND c.domain_id = 'Drug'
    AND e.quantity > 0
    AND days_supply > 0
    AND e.drug_exposure_start_date IS NOT NULL
    AND e.drug_exposure_start_date >= '2009-01-01'
    AND e.drug_exposure_start_date <= '2010-12-31'
    AND e.drug_concept_id IS NOT NULL
''').to_dataframe()
insulin_concepts

Unnamed: 0,concept_name
0,Insulin Glargine 100 UNT/ML Injectable Solutio...
1,"insulin human, isophane 70 UNT/ML / Regular In..."
2,"Regular Insulin, Human 100 UNT/ML Injectable S..."
3,"3 ML Insulin, Aspart, Human 100 UNT/ML Cartrid..."
4,"Insulin, Aspart, Human 100 UNT/ML Injectable S..."
...,...
63,"1.5 ML insulin human, isophane 70 UNT/ML / Reg..."
64,"Regular Insulin, Human 100 UNT/ML Injectable S..."
65,"Ultralente Insulin, Beef 100 UNT/ML Injectable..."
66,"3 ML Insulin, Aspart, Human 100 UNT/ML Pen Inj..."


Here we can see all of the drug exposures mapping to insulin concepts are valid for consideration in answering our question. Let's confirm the exposure types are valid as well.

In [69]:
insulin_exposure_types = client.query('''
  SELECT distinct c2.concept_name, c2.concept_id as type_concept_id
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` as e
    -- join to drug concept
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c1
    ON e.drug_concept_id = c1.concept_id
    -- join to drug type concept
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c2
    ON e.drug_type_concept_id = c2.concept_id
  -- limit concept types to reference insulin
  WHERE LOWER(c1.concept_name) LIKE "%insulin%"
    -- apply assumptions
    AND c1.domain_id = 'Drug'
    AND e.quantity > 0
    AND days_supply > 0
    AND e.drug_exposure_start_date IS NOT NULL
    AND e.drug_concept_id IS NOT NULL
    AND e.drug_exposure_start_date >= '2009-01-01'
    AND e.drug_exposure_start_date <= '2010-12-31'
''').to_dataframe()

insulin_exposure_types

Unnamed: 0,concept_name,type_concept_id
0,Prescription dispensed in pharmacy,38000175


Once again, the only exposure type present is "Prescription dispensed in pharmacy". Let's retrieve the total insulin prescriptions per month, per year.

In [70]:
# store drug type concept id to limit on
prescription_type_id = insulin_exposure_types['type_concept_id'][0]

insulin_prescriptions_per_month = client.query(f'''
  SELECT
    EXTRACT(MONTH FROM e.drug_exposure_start_date) as month,
    EXTRACT(YEAR FROM e.drug_exposure_start_date) as year,
    COUNT(DISTINCT e.drug_exposure_id) as prescription_count
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` as e
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c
    ON e.drug_concept_id = c.concept_id
  WHERE LOWER(c.concept_name) LIKE "%insulin%"
    AND c.domain_id = 'Drug'
    AND e.quantity > 0
    AND days_supply > 0
    AND e.drug_exposure_start_date IS NOT NULL
    AND e.drug_exposure_start_date >= '2009-01-01'
    AND e.drug_exposure_start_date <= '2010-12-31'
    AND e.drug_concept_id IS NOT NULL
    AND e.drug_type_concept_id = {prescription_type_id}
  GROUP BY year, month
  ORDER BY year, month
''').to_dataframe()

insulin_prescriptions_per_month

Unnamed: 0,month,year,prescription_count
0,1,2009,6573
1,2,2009,5759
2,3,2009,6425
3,4,2009,6336
4,5,2009,6435
5,6,2009,6223
6,7,2009,6530
7,8,2009,6490
8,9,2009,6332
9,10,2009,6363


Now, let's calculate the MoM comparison. The formula is:

$$
\text{MoM Growth Rate (\%)} = \left( \frac{\text{Current Month Prescriptions} - \text{Previous Month Prescriptions}}{\text{Previous Month Prescriptions}} \right) \times 100
$$

So, we will need to use the pandas library to create a column for previous month prescriptions, then apply the formula.

In [71]:
import pandas as pd

# create new column for calculation
insulin_prescriptions_per_month['previous_month_prescriptions'] = insulin_prescriptions_per_month['prescription_count'].shift(1)
# calculate MoM growth rate %
insulin_prescriptions_per_month['mom_growth_rate'] = (
    (insulin_prescriptions_per_month['prescription_count'] - insulin_prescriptions_per_month['previous_month_prescriptions'])
    / insulin_prescriptions_per_month['previous_month_prescriptions']) * 100

insulin_prescriptions_per_month

Unnamed: 0,month,year,prescription_count,previous_month_prescriptions,mom_growth_rate
0,1,2009,6573,,
1,2,2009,5759,6573.0,-12.383995
2,3,2009,6425,5759.0,11.564508
3,4,2009,6336,6425.0,-1.385214
4,5,2009,6435,6336.0,1.5625
5,6,2009,6223,6435.0,-3.294483
6,7,2009,6530,6223.0,4.933312
7,8,2009,6490,6530.0,-0.612557
8,9,2009,6332,6490.0,-2.434515
9,10,2009,6363,6332.0,0.489577
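
The same growth rate can be cross-checked with pandas' built-in `pct_change`, which computes `(current - previous) / previous` for consecutive rows. A minimal sketch using the first four monthly counts from the table above:

```python
import pandas as pd

# January to April 2009 insulin prescription counts from the table above
counts = pd.Series([6573, 5759, 6425, 6336], name="prescription_count")

# pct_change returns a fraction, so multiply by 100 for a percentage
mom_growth_rate = counts.pct_change() * 100
print(mom_growth_rate.round(6).tolist())
```

The first value is `NaN` because the first row has no previous month; the remaining values should agree with the `mom_growth_rate` column above.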


### Conclusion

We can now see the MoM growth rates, but we still need to visualize the comparison by year. To do so, let's use a line chart with one line per year. This will allow us to see the growth rate by month, and also compare each month's growth rate to the same month in the other year.

In [72]:
insulin_prescriptions_fig = px.line(insulin_prescriptions_per_month, x="month", y="mom_growth_rate", color="year")
insulin_prescriptions_fig.update_layout(xaxis=dict(title='Month', dtick=1), yaxis=dict(title='MoM Growth Rate (%)'))
insulin_prescriptions_fig.show()

#### Note
2009 is the first year in the comparison, so its first month has no previous month and therefore no MoM rate.
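
For a tabular view of the same comparison, the monthly counts can be pivoted so that each year becomes a column. The sketch below is only illustrative: the 2009 counts come from the table above, while the 2010 counts are hypothetical placeholders, since those rows are not shown in this notebook.

```python
import pandas as pd

counts = pd.DataFrame({
    "month": [1, 2, 3, 1, 2, 3],
    "year": [2009, 2009, 2009, 2010, 2010, 2010],
    # 2009 values are from the table above; 2010 values are HYPOTHETICAL
    "prescription_count": [6573, 5759, 6425, 6600, 5800, 6400],
})

# One row per month, one column per year
wide = counts.pivot(index="month", columns="year", values="prescription_count")

# Percentage change of each month relative to the same month one year earlier
wide["yoy_change_pct"] = (wide[2010] - wide[2009]) / wide[2009] * 100
print(wide)
```

This places each month of 2010 next to the same month of 2009, which complements the MoM lines in the chart.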

## Q3: How many procedures were performed for hip replacement in each year?

### Assumptions
- Both total and partial hip replacement procedures will be included
- Revisions to hip replacement will be included as they are relevant procedures for hip replacement
- Procedure date must not be null
- Procedure quantity must not be zero
- [Procedures with a null quantity count as 1](https://ohdsi.github.io/CommonDataModel/cdm54.html#procedure_occurrence)
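
To illustrate the two quantity assumptions, here is a minimal pandas sketch over hypothetical rows (the IDs and quantities are made up). It keeps rows whose quantity is null or positive, mirroring the `quantity IS NULL OR quantity > 0` condition used in the queries in this section, and treats a null quantity as 1:

```python
import pandas as pd

# Hypothetical procedure rows (illustrative only)
procedures = pd.DataFrame({
    "procedure_occurrence_id": [1, 2, 3, 4],
    "quantity": [1, None, 0, 2],
})

# Keep rows where quantity is null or greater than zero
mask = procedures["quantity"].isna() | (procedures["quantity"] > 0)
kept = procedures[mask].copy()

# A null quantity is assumed to mean a single procedure
kept["effective_quantity"] = kept["quantity"].fillna(1)
print(kept)
```

Note that the queries in this notebook count distinct procedure occurrences rather than summing quantities, so the `effective_quantity` column is shown only to make the assumption concrete.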

### Approach

Let's see what kind of hip replacement procedures have been recorded in this database.

In [73]:
hip_replacement_procedures = client.query('''
  SELECT
    DISTINCT concept_name
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` as p
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c
    ON p.procedure_concept_id = c.concept_id
  WHERE c.concept_name LIKE '%hip replacement%'
''').to_dataframe()

hip_replacement_procedures

Unnamed: 0,concept_name
0,"Revision of hip replacement, not otherwise spe..."
1,"Revision of hip replacement, femoral component"
2,Partial hip replacement
3,"Revision of hip replacement, both acetabular a..."
4,"Revision of hip replacement, acetabular component"
5,Total hip replacement
6,"Revision of hip replacement, acetabular liner ..."


All of the matching procedures are relevant (total, partial, and revision hip replacements), so none of them need to be excluded. Let's confirm all of the procedure types are also relevant.

In [77]:
hip_replacement_procedure_types = client.query('''
  SELECT
    DISTINCT c2.concept_name
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` as p
    -- join on procedure concept
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c
    ON p.procedure_concept_id = c.concept_id
    -- join on procedure type concept
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c2
    ON p.procedure_type_concept_id = c2.concept_id
  -- limit to hip replacements
  WHERE c.concept_name LIKE '%hip replacement%'
    -- apply assumptions
    AND p.procedure_dat IS NOT NULL
    AND (p.quantity IS NULL OR p.quantity > 0)
''').to_dataframe()

hip_replacement_procedure_types

Unnamed: 0,concept_name
0,Inpatient header - 1st position
1,Outpatient header - 1st position


We can see that both procedure types are in fact valid for consideration. Now we can answer our question.

### Conclusion
With our final query, we can retrieve the total number of hip replacement procedures per year. To visualize the result, I have chosen a bar graph, as it is simple to read and makes year-to-year totals easy to compare.

In [78]:
annual_hip_replacements = client.query('''
  SELECT
    -- grab year and total procedures
    EXTRACT(YEAR FROM p.procedure_dat) as year,
    COUNT(distinct p.procedure_occurrence_id) as total_hip_replacement_procedures
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` as p
    -- join on procedure concept
    INNER JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` as c
    ON p.procedure_concept_id = c.concept_id
  -- limit to hip replacements
  WHERE c.concept_name LIKE '%hip replacement%'
    -- apply assumptions
    AND p.procedure_dat IS NOT NULL
    AND (p.quantity IS NULL OR p.quantity > 0)
  GROUP BY year
  ORDER BY year
''').to_dataframe()

annual_hip_replacements_fig = px.bar(annual_hip_replacements, x='year', y='total_hip_replacement_procedures', title='Annual Hip Replacement Procedures')
# increment year by whole step
annual_hip_replacements_fig.update_layout(xaxis=dict(dtick=1))
annual_hip_replacements_fig.show()

#### Note
While the database is described as containing data only from 2008 to 2010, the results include some procedures from 2007. For completeness, I have included them.
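
A quick way to flag such out-of-range years is to compare the result against the documented range. The totals below are hypothetical placeholders, not the actual query results:

```python
# HYPOTHETICAL yearly totals; the real values come from the query above
annual_totals = {2007: 3, 2008: 120, 2009: 135, 2010: 98}

# Documented coverage of the database: 2008 through 2010 inclusive
documented_years = range(2008, 2011)

# Collect any years that fall outside the documented range
out_of_range = {
    year: total
    for year, total in annual_totals.items()
    if year not in documented_years
}
print(out_of_range)
```

Whether to keep or drop the flagged years is a judgment call; here they are kept for completeness.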

## Outcome
This project was not as straightforward as I had originally expected. Finding all the necessary documentation and understanding which tables to use, and how, took more time than I had anticipated. My previous database experience was with databases that had much more precise mappings and consistently generated data. My biggest takeaway from this project is that standardization is increasingly important for the health industry in the face of ever-growing data. The project also gave me the experience I wanted with clinical data and the OMOP CDM, along with practice using BigQuery and choosing data visualizations, as intended.