# Using BiasAnalyzer for Cohort Concept Prevalence Exploration

This tutorial demonstrates how to use the `BiasAnalyzer` package to explore **concept prevalence** within a cohort, a key step in identifying potential biases during cohort selection. It complements the [Cohort Exploration Tutorial](./BiasAnalyzerCohortsTutorial.ipynb) by focusing specifically on which clinical concepts (e.g., diagnoses, procedures, medications) are most common in a selected cohort. In the OMOP (Observational Medical Outcomes Partnership) CDM (Common Data Model), a **concept** is a coded term from a standardized medical vocabulary, uniquely identified by a **concept ID**. All clinical events in OMOP, such as conditions, drug exposures, procedures, and measurements, are represented as concepts.

---

### Overview

**Objective**:  
Learn how to retrieve and analyze concept prevalence within a cohort using `BiasAnalyzer`.

**Before You Begin**:  
The `BiasAnalyzer` package is currently in active development and has not yet been officially released on PyPI.
You can install it in one of two ways:

- **Install from GitHub (recommended during development)**:
```bash
pip install git+https://github.com/vaclab/BiasAnalyzer.git
```
- **Install from PyPI (once the package is officially released)**:
```bash
pip install biasanalyzer
```

For full setup and usage instructions, refer to the [README](https://github.com/VACLab/BiasAnalyzer/blob/main/README.md).
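Before running the rest of the notebook, it can help to confirm that the package is importable in the current environment. The check below is a minimal sketch that relies only on the standard library; `is_installed` is a hypothetical helper name, and the module name `biasanalyzer` is taken from the import statement used later in this tutorial.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return find_spec(module_name) is not None


if is_installed("biasanalyzer"):
    print("biasanalyzer is importable")
else:
    print("biasanalyzer not found; install it with one of the pip commands above")
```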

---


### Preparation for cohort concept prevalence exploration
**Preparation step 1**: Import the `BIAS` class from the `api` module of the `BiasAnalyzer` package, create a `bias` object of the `BIAS` class, load the OMOP CDM database configuration with `set_config()`, and call `set_root_omop()` to connect to the OMOP CDM database. Refer to the [Cohort Exploration Tutorial](./BiasAnalyzerCohortsTutorial.ipynb) for more details.

In [1]:
from biasanalyzer.api import BIAS

bias = BIAS()

bias.set_config('../config.yaml')

bias.set_root_omop()

configuration specified in ../config.yaml loaded successfully
Connected to the OMOP CDM database (read-only).
Cohort Definition table created.
Cohort table created.


---

**Preparation step 2**: Create a cohort of young female COVID patients using the `create_cohort(cohort_name, cohort_description, query_or_yaml_file, created_by)` method on the `bias` object. The arguments are, in order: the cohort name, the cohort description, either a YAML file that specifies the cohort inclusion and exclusion criteria or a cohort selection SQL query, and the name of the person who creates or owns the cohort. After the cohort is created, you can call the `get_stats()` and `get_distributions()` methods on the returned `cohort_data` object to explore cohort statistics and distributions.

In [2]:
# create a cohort with young female COVID patients
cohort_data = bias.create_cohort('Young female COVID patients', 
                                  'A cohort of female COVID patients born between 2000 and 2020', 
                                  '../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_study.yaml', 
                                  'system')
# get stats of the cohort
cohort_stats = cohort_data.get_stats()
print(f'the cohort stats: {cohort_stats}')
cohort_age_stats = cohort_data.get_stats("age")
print(f'the cohort age stats: {cohort_age_stats}')
cohort_gender_stats = cohort_data.get_stats("gender")
print(f'the cohort gender stats: {cohort_gender_stats}')
cohort_race_stats = cohort_data.get_stats("race")
print(f'the cohort race stats: {cohort_race_stats}')
cohort_ethnicity_stats = cohort_data.get_stats("ethnicity")
print(f'the cohort ethnicity stats: {cohort_ethnicity_stats}')
# get discrete probability distribution of the age variable in the cohort
cohort_age_distr = cohort_data.get_distributions('age')
print(f'the cohort age discrete probability distribution: {cohort_age_distr}')

template_path: /home/hongyi/BiasAnalyzer/biasanalyzer/sql_templates, cohort_creation: True


Cohort creation:   0%|                                 | 0/3 [00:00<?, ?stage/s]

configuration specified in ../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_study.yaml loaded successfully
Cohort definition inserted successfully.
Cohort Young female COVID patients successfully created.
template_path: /home/hongyi/BiasAnalyzer/biasanalyzer/sql_templates, cohort_creation: False
cohort created successfully
the cohort stats: [{'total_count': 10208, 'earliest_start_date': datetime.date(2020, 1, 18), 'latest_start_date': datetime.date(2020, 3, 30), 'earliest_end_date': datetime.date(2020, 2, 7), 'latest_end_date': datetime.date(2020, 5, 3), 'min_duration_days': 8, 'max_duration_days': 37, 'avg_duration_days': 24.25, 'median_duration': 24, 'stddev_duration': 7.2}]
the cohort age stats: [{'total_count': 10208, 'min_age': 0, 'max_age': 20, 'avg_age': 10.94, 'median_age': 11, 'stddev_age': 5.92}]
the cohort gender stats: [{'gender': 'female', 'gender_count': 10208, 'probability': 1.0}]
the cohort race stats: [{'race': 'Other', 'race_count': 53,

**Now that you have connected to your OMOP CDM database and created the `cohort_data` cohort object, you are ready to explore cohort concept prevalence.** 

---

### Exploring cohort concept prevalence
You can retrieve concept prevalence statistics for a cohort using the `get_concept_stats(concept_type='condition_occurrence', filter_count=0, vocab=None, include_hierarchy=False)` method on the `cohort_data` object. Each input argument to this method has a default value, so you can call the method without specifying all parameters.
- The `concept_type` input argument specifies the OMOP domain to analyze. It must be one of the OMOP domain names: `condition_occurrence`, `drug_exposure`, `procedure_occurrence`, `visit_occurrence`, `measurement`, or `observation`.
- The `vocab` input argument specifies the OMOP vocabulary ID to filter concepts by. If set to `None`, a default vocabulary is used based on the domain: `RxNorm` for `drug_exposure`, `LOINC` for `measurement`, and `SNOMED` for all other domains.
- The `filter_count` input argument filters out concepts with fewer than this number of patients in the cohort. Set it to `0` to include all without filtering.
- The `include_hierarchy` input argument specifies whether to include concept hierarchical relationships. If set to `True`, ancestor concepts from the OMOP concept hierarchy are included when calculating prevalence.

This method helps identify the most prevalent clinical concepts in your cohort, which can reveal patterns or potential sources of selection bias in the cohort data.
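The parameter rules listed above can be restated as a small sketch. This is an illustration of the documented behavior only, not BiasAnalyzer's internal code: `resolve_vocab`, `VALID_CONCEPT_TYPES`, and `DEFAULT_VOCAB_OVERRIDES` are hypothetical names introduced here.

```python
# Illustrative only: mirrors the rules described above, not BiasAnalyzer's internals.
VALID_CONCEPT_TYPES = {
    "condition_occurrence", "drug_exposure", "procedure_occurrence",
    "visit_occurrence", "measurement", "observation",
}

# Domain-specific defaults stated in this tutorial; every other domain uses SNOMED.
DEFAULT_VOCAB_OVERRIDES = {"drug_exposure": "RxNorm", "measurement": "LOINC"}


def resolve_vocab(concept_type, vocab=None):
    """Return the vocabulary that applies to a domain (hypothetical helper)."""
    if concept_type not in VALID_CONCEPT_TYPES:
        raise ValueError(f"unsupported concept_type: {concept_type!r}")
    if vocab is not None:
        return vocab  # an explicit choice takes precedence over the default
    return DEFAULT_VOCAB_OVERRIDES.get(concept_type, "SNOMED")


print(resolve_vocab("drug_exposure"))
print(resolve_vocab("condition_occurrence"))
print(resolve_vocab("measurement", vocab="SNOMED"))
```

Keeping the defaults in one mapping makes it easy to see which vocabulary a call such as `get_concept_stats(concept_type='measurement')` would fall back to when `vocab` is left as `None`.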

**Cohort condition occurrence concept prevalence**: 
The code block below demonstrates how to use the default parameters of the `get_concept_stats()` method to retrieve concept prevalence for the `condition_occurrence` domain. By default, it uses the `SNOMED` vocabulary, excludes hierarchical relationships, and applies no filtering. The method returns a dictionary where the **key** is the `concept_type` (e.g., `condition_occurrence`) and the **value** is a list of concept dictionaries. Each concept dictionary in the list contains `concept_name`, `concept_code`, `count_in_cohort`, `prevalence`, `ancestor_concept_id`, and `descendant_concept_id`. These values allow you to explore which clinical concepts are most prevalent in your cohort and support deeper investigations into potential sources of selection bias.

**Note** that this prevalence computation may take some time, especially for large cohorts. A progress bar will appear to indicate the progress of the prevalence calculation.
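The returned structure (a dictionary keyed by `concept_type`, holding a list of concept dictionaries) can be post-processed with plain Python. The sketch below is a hedged example: the rows are copied from this tutorial's `condition_occurrence` output, `top_concepts` is a hypothetical helper, and storing `concept_code` as a string is an assumption, since the real type may differ.

```python
# Rows copied from this tutorial's condition_occurrence output.
# concept_code is stored as a string here; that is an assumption.
sample_result = {
    "condition_occurrence": [
        {"concept_name": "Clinical finding", "concept_code": "404684003",
         "count_in_cohort": 10208, "prevalence": 1.0,
         "ancestor_concept_id": 441840, "descendant_concept_id": 441840},
        {"concept_name": "Disease due to Coronaviridae", "concept_code": "27619001",
         "count_in_cohort": 10208, "prevalence": 1.0,
         "ancestor_concept_id": 4100065, "descendant_concept_id": 4100065},
        {"concept_name": "Viral disease", "concept_code": "34014006",
         "count_in_cohort": 10208, "prevalence": 1.0,
         "ancestor_concept_id": 440029, "descendant_concept_id": 440029},
    ]
}


def top_concepts(result, concept_type, n=3):
    """Rank concepts by patient count (descending), breaking ties by name."""
    rows = result.get(concept_type, [])
    ranked = sorted(rows, key=lambda r: (-r["count_in_cohort"], r["concept_name"]))
    return [(r["concept_name"], r["count_in_cohort"], r["prevalence"]) for r in ranked[:n]]


for name, count, prevalence in top_concepts(sample_result, "condition_occurrence", n=2):
    print(f"{name}: {count} patients ({prevalence:.1%})")
```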

In [3]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

import time

# get cohort concept prevalence
t1 = time.time()
cohort_concepts = cohort_data.get_concept_stats()
print(pd.DataFrame(cohort_concepts["condition_occurrence"]))
print(f'the time taken to get cohort concept stats for condition_occurrence is {time.time() - t1}s')

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

cohort concept hierarchy for condition_occurrence with root concept ids []:
                                          concept_name       concept_code  count_in_cohort  prevalence  ancestor_concept_id  descendant_concept_id
0                                     Clinical finding          404684003            10208    1.000000               441840                 441840
1                         Disease due to Coronaviridae           27619001            10208    1.000000              4100065                4100065
2                                        Viral disease           34014006            10208    1.000000               440029                 440029
3                                              Disease           64572001            10208    1.000000              4274025                4274025
4                                Coronavirus infection          186747009            10208    1.000000               439676                 439676
5                                         

---

**Cohort drug exposure concept prevalence**: 
The code block below demonstrates how to use the `get_concept_stats(concept_type='drug_exposure', filter_count=500, include_hierarchy=True)` method to retrieve concept prevalence for the `drug_exposure` domain. By default, this uses the `RxNorm` vocabulary. Concepts with fewer than 500 patients are excluded, and hierarchical relationships are included in the results. The method returns a dictionary where the **key** is the `concept_type` (in this case, `drug_exposure`) and the **value** is a list of concept dictionaries. Each concept dictionary in the list contains the following fields: `concept_name`, `concept_code`, `count_in_cohort`, `prevalence`, `ancestor_concept_id`, and `descendant_concept_id`. These values allow you to explore which clinical concepts are most prevalent in your cohort and support deeper investigations into potential sources of selection bias.

**Note**: Prevalence computation may take some time, especially for large cohorts or when hierarchical relationships are included. A progress bar will appear to indicate the progress of the computation. 

When `include_hierarchy=True`, the output also includes a text-based, indented representation of the concept hierarchy. Each concept is displayed along with its **concept code**, **patient count**, and **prevalence** in parentheses, providing a quick summary of both the structure and frequency of clinical concepts in the cohort.
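To make the indented format concrete, here is a minimal sketch that renders a small tree in the same style. The names, codes, and counts are copied from this tutorial's `drug_exposure` output, and the cohort size comes from the `get_stats()` output above. The nested layout with a `children` key and the `render` helper are assumptions for illustration, and computing prevalence as count divided by cohort size is an inference that matches the printed figures, not a confirmed description of the library's internals. The real hierarchy can also list a concept under more than one parent, which this sketch does not model.

```python
COHORT_SIZE = 10208  # total_count reported by get_stats() earlier in this tutorial

# Hypothetical nested layout; the "children" key is an assumption for illustration.
tree = {
    "name": "Pill", "code": "1151133", "count": 931,
    "children": [
        {"name": "acetaminophen Pill", "code": "1152843", "count": 638,
         "children": [
             {"name": "acetaminophen Oral Tablet", "code": "369097", "count": 609,
              "children": [
                  {"name": "acetaminophen 500 MG Oral Tablet", "code": "198440",
                   "count": 582, "children": []},
              ]},
         ]},
    ],
}


def render(node, cohort_size, depth=0):
    """Return indented lines for a node and its descendants (two spaces per level)."""
    prevalence = node["count"] / cohort_size  # assumed definition of prevalence
    indent = "  " * depth
    line = (f"{indent}{node['name']} (Code: {node['code']}, "
            f"Count: {node['count']}, Prevalence: {prevalence:.3%})")
    lines = [line]
    for child in node["children"]:
        lines.extend(render(child, cohort_size, depth + 1))
    return lines


print("\n".join(render(tree, COHORT_SIZE)))
```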

In [4]:
t1 = time.time()
cohort_de_concepts = cohort_data.get_concept_stats(concept_type='drug_exposure', filter_count=500, include_hierarchy=True)
print(pd.DataFrame(cohort_de_concepts["drug_exposure"]))
print(f'the time taken to get cohort concept stats for drug_exposure is {time.time() - t1}s')

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

cohort concept hierarchy for drug_exposure with root concept ids [36217216, 1301025, 1125315, 36217210, 36217214]:
Pill (Code: 1151133, Count: 931, Prevalence: 9.120%)
  acetaminophen Pill (Code: 1152843, Count: 638, Prevalence: 6.250%)
    acetaminophen Oral Tablet (Code: 369097, Count: 609, Prevalence: 5.966%)
      acetaminophen 500 MG Oral Tablet (Code: 198440, Count: 582, Prevalence: 5.701%)
    acetaminophen 500 MG Oral Tablet (Code: 198440, Count: 582, Prevalence: 5.701%)
enoxaparin (Code: 67108, Count: 582, Prevalence: 5.701%)
  enoxaparin Prefilled Syringe (Code: 727722, Count: 582, Prevalence: 5.701%)
    enoxaparin sodium 100 MG/ML Prefilled Syringe (Code: 1360019, Count: 582, Prevalence: 5.701%)
      0.4 ML enoxaparin sodium 100 MG/ML Prefilled Syringe (Code: 854235, Count: 582, Prevalence: 5.701%)
    0.4 ML enoxaparin sodium 100 MG/ML Prefilled Syringe (Code: 854235, Count: 582, Prevalence: 5.701%)
  enoxaparin sodium 100 MG/ML (Code: 854227, Count: 582, Prevalence: 5.70

---

### Final cleanup to ensure database connections are closed

In [5]:
bias.cleanup()

Connection to BiasDatabase closed.
Connection to the OMOP CDM database closed.
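If an analysis step raises an exception before `cleanup()` is reached, the connections may stay open. One way to guard against that is to wrap the work in a context manager, as in the sketch below. `StandInBias` and `managed` are hypothetical names: the stand-in class only imitates the `cleanup()` method so the example runs without a database, and in a real notebook you would pass the `bias` object instead.

```python
from contextlib import contextmanager


class StandInBias:
    """Stand-in for a BIAS object so the sketch runs without a database (hypothetical)."""

    def __init__(self):
        self.closed = False

    def cleanup(self):
        self.closed = True
        print("cleanup called")


@contextmanager
def managed(bias_obj):
    """Yield the object and always call its cleanup() afterwards, even on error."""
    try:
        yield bias_obj
    finally:
        bias_obj.cleanup()


demo = StandInBias()
try:
    with managed(demo) as b:
        raise RuntimeError("simulated failure during analysis")
except RuntimeError:
    pass  # the error is swallowed here only to keep the demo running

print(demo.closed)
```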


### ✅ Summary

In this tutorial, you learned how to use the BiasAnalyzer package to explore clinical concept prevalence in a cohort. Knowing which clinical concepts are most prevalent in your cohort supports deeper investigations into potential sources of cohort selection bias.
  
For more information, refer to the [BiasAnalyzer GitHub repo](https://github.com/VACLab/BiasAnalyzer) and the [README file](https://github.com/VACLab/BiasAnalyzer/blob/main/README.md).
