# Using BiasAnalyzer for Exploring Union of Cohort Concept Prevalences over Multiple Cohorts

This tutorial demonstrates how to use the `BiasAnalyzer` package to explore the **union of concept prevalences** across multiple cohorts to facilitate exploration of potential cohort selection bias. It complements the [Single Cohort Concept Prevalence Exploration Tutorial](./BiasAnalyzerCohortConceptTutorial.ipynb) by focusing on how to obtain a unioned view of the concept prevalence hierarchy across multiple cohorts, using either the `union` method of the `ConceptHierarchy` class or the BiasAnalyzer API method `get_cohorts_concept_stats()`.

In the OMOP (Observational Medical Outcomes Partnership) CDM (Common Data Model), a **concept** is a coded term from a standardized medical vocabulary, uniquely identified by a **concept ID**. All clinical events in OMOP, such as conditions, drug exposures, procedures, measurements, and events, are represented as concepts.

---

### Overview

**Objective**:  
Learn how to obtain concept prevalences across multiple cohorts for cohort selection bias analysis using `BiasAnalyzer`.

**Before You Begin**:  
The `BiasAnalyzer` package is currently in active development and has not yet been officially released on PyPI.
You can install it in one of two ways:

- **Install from GitHub (recommended during development)**:
```bash
pip install git+https://github.com/vaclab/BiasAnalyzerCore.git
```
- **Install from PyPI (once the package is officially released)**:
```bash
pip install biasanalyzer
```

For full setup and usage instructions, refer to the [README](https://github.com/VACLab/BiasAnalyzerCore/blob/main/README.md).

---


### Preparation for exploring union of multiple cohort concept prevalences
**Preparation step 1**: Import the `BIAS` class from the `api` module of the `BiasAnalyzer` package, create an object `bias` of the `BIAS` class, specify OMOP CDM database configurations on the `bias` object, and set OMOP CDM database to enable connection to the database. Refer to the [Cohort Exploration Tutorial](./BiasAnalyzerCohortsTutorial.ipynb) for more details.

In [1]:
from biasanalyzer.api import BIAS

bias = BIAS()

bias.set_config('../config.yaml')

bias.set_root_omop()

configuration specified in ../config.yaml loaded successfully
Connected to the OMOP CDM database (read-only).
Cohort Definition table created.
Cohort table created.


---

**Preparation step 2**: Create a baseline cohort of young female patients and a study cohort of young female COVID patients using the `create_cohort(cohort_name, cohort_description, query_or_yaml_file, created_by)` method on the `bias` object. These two cohorts are used in the rest of this tutorial to explore the union of concept prevalences. Pass the cohort name as the first argument, the cohort description as the second, either a YAML file specifying the cohort inclusion and exclusion criteria or a cohort selection SQL query as the third, and the name of the person who owns or created the cohort as the fourth.

In [2]:
# create a baseline cohort with young female patients
baseline_cohort_data = bias.create_cohort('Young female patients', 
                                          'A cohort of female patients born between 2000 and 2020', 
                                          '../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_baseline.yaml', 
                                          'system')
# get stats of the baseline cohort
baseline_cohort_stats = baseline_cohort_data.get_stats()
print(f'the baseline cohort stats: {baseline_cohort_stats}')

# create a study cohort with young female COVID patients
study_cohort_data = bias.create_cohort('Young female COVID patients', 
                                       'A cohort of female COVID patients born between 2000 and 2020', 
                                       '../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_study.yaml', 
                                       'system')
# get stats of the cohort
study_cohort_stats = study_cohort_data.get_stats()
print(f'the study cohort stats: {study_cohort_stats}')


template_path: /home/hongyi/BiasAnalyzer/biasanalyzer/sql_templates


Cohort creation:   0%|                                                                                        …

configuration specified in ../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_baseline.yaml loaded successfully
Cohort definition inserted successfully.
Cohort Young female patients successfully created.
template_path: /home/hongyi/BiasAnalyzer/biasanalyzer/sql_templates
cohort created successfully
the baseline cohort stats: [{'total_count': 12360, 'earliest_start_date': datetime.date(2000, 2, 19), 'latest_start_date': datetime.date(2020, 5, 26), 'earliest_end_date': datetime.date(2002, 7, 20), 'latest_end_date': datetime.date(2020, 5, 27), 'min_duration_days': 0, 'max_duration_days': 7379, 'avg_duration_days': 1192.32, 'median_duration': 296, 'stddev_duration': 1779.19}]


Cohort creation:   0%|                                                                                        …

configuration specified in ../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_study.yaml loaded successfully
Cohort definition inserted successfully.
Cohort Young female COVID patients successfully created.
template_path: /home/hongyi/BiasAnalyzer/biasanalyzer/sql_templates
cohort created successfully
the study cohort stats: [{'total_count': 10208, 'earliest_start_date': datetime.date(2020, 1, 18), 'latest_start_date': datetime.date(2020, 3, 30), 'earliest_end_date': datetime.date(2020, 2, 7), 'latest_end_date': datetime.date(2020, 5, 3), 'min_duration_days': 8, 'max_duration_days': 37, 'avg_duration_days': 24.25, 'median_duration': 24, 'stddev_duration': 7.2}]


**Now that you have connected to your OMOP CDM database and created two cohort objects, you are ready to explore the union of concept prevalences across these two cohorts.**

---

### Exploring cohort concept prevalence with concept hierarchy taken into account
You can retrieve concept prevalence statistics for each cohort using the `get_concept_stats(concept_type='condition_occurrence', filter_count=0, vocab=None)` method on the cohort object. This method identifies the most prevalent clinical concepts in your cohort while taking the concept hierarchy into account, which can reveal patterns or potential sources of selection bias in the cohort data. As detailed in the [Single Cohort Concept Exploration Tutorial Notebook](./BiasAnalyzerCohortConceptTutorial.ipynb), the method returns the concept hierarchical relationships for the cohort in a dictionary and a `ConceptHierarchy` object. The `ConceptHierarchy` class provides a `union()` method that combines one `ConceptHierarchy` object with another obtained for a different cohort, giving a unioned view of concept prevalences in both cohorts.
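
As a rough illustration of what a union over per-cohort results means, the sketch below merges per-cohort metrics into one structure keyed by concept and records which cohorts each concept came from. It is a toy model, not the `ConceptHierarchy.union()` implementation: the function name `union_concept_metrics`, the flat input layout, and the concept IDs and counts are all hypothetical. Only the `metrics` and `source_cohorts` field names are borrowed from the output shown later in this tutorial.

```python
# Toy model of a "union" over per-cohort concept metrics.
# NOT BiasAnalyzer code: names, layout, and numbers are hypothetical.

def union_concept_metrics(per_cohort_nodes):
    """Merge {cohort_id: {concept_id: metrics}} into one view per concept."""
    unioned = {}
    for cohort_id, nodes in per_cohort_nodes.items():
        for concept_id, metrics in nodes.items():
            entry = unioned.setdefault(
                concept_id, {"metrics": {}, "source_cohorts": []}
            )
            # Metrics are keyed by the cohort ID as a string,
            # mirroring the keys seen in the tutorial output.
            entry["metrics"][str(cohort_id)] = metrics
            entry["source_cohorts"].append(cohort_id)
    return unioned


# Made-up cohorts: cohort 1 has 100 patients, cohort 2 has 50.
per_cohort = {
    1: {
        101: {"count": 80, "prevalence": 80 / 100},
        102: {"count": 20, "prevalence": 20 / 100},
    },
    2: {
        101: {"count": 50, "prevalence": 50 / 50},
    },
}

unioned = union_concept_metrics(per_cohort)
for concept_id, entry in unioned.items():
    print(concept_id, entry["source_cohorts"], entry["metrics"])
```

A concept that occurs in only one cohort (concept 102 here) still appears in the unioned view, with a single entry in `metrics` and a single cohort in `source_cohorts`.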

Alternatively, you can directly call the BiasAnalyzer API method `get_cohorts_concept_stats(cohorts, concept_type='condition_occurrence', filter_count=0, vocab=None)`, passing in a list of cohort IDs, to obtain the unioned concept prevalences across all of those cohorts. The input arguments to this API method are detailed below:

- The `cohorts` input argument specifies the list of cohort IDs whose concept prevalence hierarchies are to be unioned.
- The `concept_type` input argument specifies the OMOP domain to analyze. It must be one of the OMOP domain names: condition_occurrence, drug_exposure, procedure_occurrence, visit_occurrence, measurement, or observation.
- The `vocab` input argument specifies the OMOP vocabulary ID to filter concepts by. If set to None, a default vocabulary is used based on the domain: RxNorm for drug_exposure, LOINC for measurement, and SNOMED for all other domains.
- The `filter_count` input argument filters out concepts with fewer than this number of patients in the cohort. Set it to 0 to include all concepts without filtering.
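
The argument rules in the list above can be restated as a small sketch. The helpers `resolve_vocab` and `passes_filter` are hypothetical and are not part of BiasAnalyzer; they only encode the documented behavior (allowed domain names, per-domain default vocabulary, and the `filter_count` threshold) so the semantics are easy to check.

```python
# Hypothetical helpers that restate the documented argument rules.
# They are NOT part of the BiasAnalyzer API.

VALID_CONCEPT_TYPES = {
    "condition_occurrence",
    "drug_exposure",
    "procedure_occurrence",
    "visit_occurrence",
    "measurement",
    "observation",
}

# Domains with a documented non-SNOMED default vocabulary.
DEFAULT_VOCAB = {"drug_exposure": "RxNorm", "measurement": "LOINC"}


def resolve_vocab(concept_type, vocab=None):
    """Return the vocabulary to use for a domain, applying the defaults."""
    if concept_type not in VALID_CONCEPT_TYPES:
        raise ValueError(f"unsupported concept_type: {concept_type!r}")
    if vocab is not None:
        return vocab
    return DEFAULT_VOCAB.get(concept_type, "SNOMED")


def passes_filter(patient_count, filter_count=0):
    """Keep a concept unless it has fewer than filter_count patients."""
    return patient_count >= filter_count


print(resolve_vocab("condition_occurrence"))
print(resolve_vocab("drug_exposure"))
print(passes_filter(5, filter_count=10))
```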

The code block below demonstrates how to use `get_cohorts_concept_stats()` to get the union of concept prevalences over all input cohorts. This helper method calls the `union()` method of the `ConceptHierarchy` class internally to aggregate the concept prevalence metrics, and it is the recommended entry point for downstream applications.

**Note** that this prevalence computation may take some time, especially for large cohorts. A progress bar will appear to indicate the progress of the prevalence calculation.

In [3]:
import time

# get the unioned concept prevalence hierarchy across both cohorts
t1 = time.time()
cohort_list = [baseline_cohort_data.cohort_id, study_cohort_data.cohort_id]
print(f'cohort_list is {cohort_list}')
unioned_cohort_concept_prevalence_dict = bias.get_cohorts_concept_stats(cohort_list)
print('Unioned concept prevalence hierarchy across the baseline and study cohorts are:')
print(unioned_cohort_concept_prevalence_dict)
print(f'the time taken to get unionized concept prevalence hierarchy across two cohorts for condition_occurrence is {time.time() - t1}s')

cohort_list is [1, 2]


FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

Unioned concept prevalence hierarchy across the baseline and study cohorts are:
{'hierarchy': [{'concept_id': 441840, 'concept_name': 'Clinical finding', 'concept_code': '404684003', 'metrics': {'1': {'count': 11607, 'prevalence': 0.9390776699029126}, '2': {'count': 10208, 'prevalence': 1.0}}, 'source_cohorts': [1, 2], 'parent_ids': [], 'children': [{'concept_id': 4274025, 'concept_name': 'Disease', 'concept_code': '64572001', 'metrics': {'1': {'count': 11359, 'prevalence': 0.9190129449838188}, '2': {'count': 10208, 'prevalence': 1.0}}, 'source_cohorts': [1, 2], 'parent_ids': [441840], 'children': [{'concept_id': 432250, 'concept_name': 'Disorder due to infection', 'concept_code': '40733004', 'metrics': {'1': {'count': 10606, 'prevalence': 0.8580906148867314}, '2': {'count': 10208, 'prevalence': 1.0}}, 'source_cohorts': [1, 2], 'parent_ids': [4274025], 'children': [{'concept_id': 440029, 'concept_name': 'Viral disease', 'concept_code': '34014006', 'metrics': {'1': {'count': 10537, 'pre
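
The returned dictionary nests each concept's descendants under `children`, so comparing cohorts usually starts with a walk over that tree. The sketch below is one possible way to do it, not a BiasAnalyzer utility: `flatten_hierarchy` is a hypothetical helper, and `sample` copies only the first three nodes visible in the output above (the output is truncated after that, so the last node's `children` list is left empty here). Treating a cohort with no metrics for a concept as prevalence 0.0 is also an assumption.

```python
# Sketch: walk the nested hierarchy and compare prevalence between cohorts.
# flatten_hierarchy is a hypothetical helper, not part of BiasAnalyzer.

def flatten_hierarchy(nodes, depth=0):
    """Yield (depth, node) for every node reachable through 'children'."""
    for node in nodes:
        yield depth, node
        yield from flatten_hierarchy(node.get("children", []), depth + 1)


# First three nodes from the tutorial output; deeper levels are omitted.
sample = {"hierarchy": [{
    "concept_id": 441840, "concept_name": "Clinical finding",
    "metrics": {"1": {"count": 11607, "prevalence": 0.9390776699029126},
                "2": {"count": 10208, "prevalence": 1.0}},
    "source_cohorts": [1, 2], "parent_ids": [],
    "children": [{
        "concept_id": 4274025, "concept_name": "Disease",
        "metrics": {"1": {"count": 11359, "prevalence": 0.9190129449838188},
                    "2": {"count": 10208, "prevalence": 1.0}},
        "source_cohorts": [1, 2], "parent_ids": [441840],
        "children": [{
            "concept_id": 432250, "concept_name": "Disorder due to infection",
            "metrics": {"1": {"count": 10606, "prevalence": 0.8580906148867314},
                        "2": {"count": 10208, "prevalence": 1.0}},
            "source_cohorts": [1, 2], "parent_ids": [4274025],
            "children": [],  # truncated in the tutorial output
        }],
    }],
}]}

rows = []
for depth, node in flatten_hierarchy(sample["hierarchy"]):
    metrics = node["metrics"]
    # Assumption: a cohort without metrics for a concept counts as 0.0.
    baseline = metrics.get("1", {}).get("prevalence", 0.0)
    study = metrics.get("2", {}).get("prevalence", 0.0)
    rows.append((node["concept_name"], depth, round(study - baseline, 4)))

for name, depth, diff in rows:
    print(f"{'  ' * depth}{name}: study - baseline = {diff}")
```

The prevalence values appear to be the concept's patient count divided by the cohort's `total_count`: for example, 11607 / 12360 is approximately 0.939 for "Clinical finding" in the baseline cohort. Because each node carries a `parent_ids` list, a concept with several parents may show up more than once in such a walk; deduplicating by `concept_id` is a reasonable follow-up if that matters for your analysis.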

---

### Final cleanup to ensure database connections are closed

In [4]:
bias.cleanup()

Connection to BiasDatabase closed.
Connection to the OMOP CDM database closed.


### ✅ Summary

In this tutorial, you learned how to use the BiasAnalyzer package to retrieve aggregated clinical concept prevalences across multiple cohorts. Exploring aggregated concept prevalences across cohorts, such as the baseline and study cohorts used here, lets you see which clinical concepts are most prevalent in each cohort and supports deeper investigation into potential sources of cohort selection bias.
  
For more information, refer to the [BiasAnalyzer GitHub repo](https://github.com/VACLab/BiasAnalyzerCore) and the [README file](https://github.com/VACLab/BiasAnalyzerCore/blob/main/README.md).
