# BiasAnalyzer Tutorial: Exploring Cohorts

This tutorial demonstrates how to use the `BiasAnalyzer` package to create and analyze cohorts by connecting to an [OMOP (Observational Medical Outcomes Partnership) CDM (Common Data Model)](https://www.ohdsi.org/data-standardization/) database. The currently supported database types are PostgreSQL and DuckDB.

---

### Overview

**Objective**:  
Guide users through creating, exploring, and comparing a baseline cohort and a study cohort with `BiasAnalyzer`.

**Before You Begin**:
The `BiasAnalyzer` package is currently in active development and has not yet been officially released on PyPI.
You can install it in one of two ways:

- **Install from GitHub (recommended during development)**:
```bash
pip install git+https://github.com/vaclab/BiasAnalyzerCore.git
```
- **Install from PyPI (once the package is officially released)**:
```bash
pip install biasanalyzer
```

For full setup and usage instructions, refer to the [README](https://github.com/VACLab/BiasAnalyzerCore/blob/main/README.md).

---


### Preparation for cohort creation
**Preparation step 1**: Import the `BIAS` class from the `api` module of the `BiasAnalyzer` package

In [1]:
from biasanalyzer.api import BIAS

**Preparation step 2**: Create an object of the `BIAS` class

In [2]:
bias = BIAS()

**Preparation step 3**: Specify the OMOP Common Data Model (CDM) database configuration on the `bias` object so that it can connect to the OMOP CDM database for cohort creation and selection bias analysis. The configuration file must include the `root_omop_cdm_database` key. An example configuration file is shown below:
```yaml
root_omop_cdm_database:
  database_type: duckdb   # set it to one of the two supported types: postgresql or duckdb
  username: test_username
  password: test_password
  hostname: test_db_hostname
  database: "shared_test_db.duckdb"    # use a shared name for an in-memory duckdb or database name for postgresql
  port: 5432
```
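
Before calling `set_config`, it can help to sanity-check the configuration once it has been parsed. The sketch below assumes the YAML file has already been loaded into a Python dictionary by a YAML parser of your choice. The helper `check_omop_config` and its choice of required fields are illustrative assumptions based on the example above; they are not part of the `BiasAnalyzer` API.

```python
# Hypothetical helper (not part of BiasAnalyzer): validates a parsed config dict.
# The supported types are the two named in this tutorial.
SUPPORTED_DB_TYPES = {"postgresql", "duckdb"}


def check_omop_config(config):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    root = config.get("root_omop_cdm_database")
    if not isinstance(root, dict):
        return ["missing required key: root_omop_cdm_database"]

    problems = []
    db_type = root.get("database_type")
    if db_type not in SUPPORTED_DB_TYPES:
        problems.append(f"unsupported database_type: {db_type!r}")
    # Assumption: a database name is needed for both supported types.
    if not root.get("database"):
        problems.append("missing database name")
    return problems


# The same values as the example configuration file, as a dictionary.
example = {
    "root_omop_cdm_database": {
        "database_type": "duckdb",
        "username": "test_username",
        "password": "test_password",
        "hostname": "test_db_hostname",
        "database": "shared_test_db.duckdb",
        "port": 5432,
    }
}
print(check_omop_config(example))
```

An empty list means the dictionary passed these basic checks; it does not guarantee that the database is reachable.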

In [3]:
bias.set_config('../config.yaml')

configuration specified in ../config.yaml loaded successfully


**Preparation step 4**: Call `set_root_omop()` on the `bias` object to connect to the OMOP CDM database specified in the configuration and to create the cohort definition metadata table and the cohort data table.

In [4]:
bias.set_root_omop()

Connected to the OMOP CDM database (read-only).
Cohort Definition table created.
Cohort table created.


**Now that you have connected to your OMOP CDM database, you can start to use the APIs to explore your data.** The rest of this notebook illustrates how to create and explore a baseline and a study cohort, and then compare them using the BiasAnalyzer APIs.

---

### Baseline cohort creation and exploration
**Baseline cohort creation**: To create a baseline cohort of young female patients, call the `create_cohort(cohort_name, cohort_description, query_or_yaml_file, created_by)` method on the `bias` object. The arguments are, in order: the cohort name, the cohort description, either a YAML file that specifies the cohort inclusion and exclusion criteria or a cohort selection SQL query, and the name of the person who created or owns the cohort. The method displays a progress bar that tracks cohort creation over three stages.

In [5]:
baseline_cohort = bias.create_cohort('Young female patients', 
                                     'A cohort of female patients born between 2000 and 2020', 
                                     '../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_baseline.yaml', 
                                     'system')

template_path: /home/hongyi/BiasAnalyzer/biasanalyzer/sql_templates


Cohort creation:   0%|                                 | 0/3 [00:00<?, ?stage/s]

configuration specified in ../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_baseline.yaml loaded successfully
Cohort definition inserted successfully.
Cohort Young female patients successfully created.
cohort created successfully


———————————————

**Baseline cohort basic exploration**: Check the baseline cohort metadata and data to confirm that the cohort was created successfully and to get a high-level view of the cohort. Note that the SQL query string generated from the input YAML file is stored in the `creation_info` field of the cohort definition table as part of the cohort metadata.

In [6]:
baseline_cohort_def = baseline_cohort.metadata
print(f'Baseline cohort definition metadata: {baseline_cohort_def}')
baseline_cohort_data = baseline_cohort.data
print(f'The total number of patients in the baseline cohort: {len(baseline_cohort_data)}')
print(f'The first five patients in the baseline cohort: {baseline_cohort_data[:5]}')

Baseline cohort definition metadata: {'id': 1, 'name': 'Young female patients', 'description': 'A cohort of female patients born between 2000 and 2020', 'created_date': datetime.date(2025, 6, 16), 'creation_info': 'WITH ranked_events_condition_occurrence AS ( SELECT person_id, condition_concept_id AS concept_id, condition_start_date AS event_start_date, condition_end_date AS event_end_date, ROW_NUMBER() OVER ( PARTITION BY person_id, condition_concept_id ORDER BY condition_start_date ASC ) AS event_instance FROM condition_occurrence ), ranked_events_drug_exposure AS ( SELECT person_id, drug_concept_id AS concept_id, drug_exposure_start_date AS event_start_date, drug_exposure_end_date AS event_end_date, ROW_NUMBER() OVER ( PARTITION BY person_id, drug_concept_id ORDER BY drug_exposure_start_date ASC ) AS event_instance FROM drug_exposure ), ranked_events_procedure_occurrence AS ( SELECT person_id, procedure_concept_id AS concept_id, procedure_date AS event_start_date, procedure_date AS 

———————————————

**Baseline cohort deeper exploration**: You can get statistics on the age, gender, race, and ethnicity of the baseline cohort by calling the `get_stats()` method on the baseline cohort object. You can also get the cohort's age and gender distributions by calling the `get_distributions()` method on the same object.

In [7]:
# get stats of the baseline cohort
cohort_stats = baseline_cohort.get_stats()
print(f'the baseline cohort stats: {cohort_stats}')
cohort_age_stats = baseline_cohort.get_stats("age")
print(f'the baseline cohort age stats: {cohort_age_stats}')
cohort_gender_stats = baseline_cohort.get_stats("gender")
print(f'the baseline cohort gender stats: {cohort_gender_stats}')
cohort_race_stats = baseline_cohort.get_stats("race")
print(f'the baseline cohort race stats: {cohort_race_stats}')
cohort_ethnicity_stats = baseline_cohort.get_stats("ethnicity")
print(f'the baseline cohort ethnicity stats: {cohort_ethnicity_stats}')

the baseline cohort stats: [{'total_count': 12360, 'earliest_start_date': datetime.date(2000, 2, 19), 'latest_start_date': datetime.date(2020, 5, 26), 'earliest_end_date': datetime.date(2002, 7, 20), 'latest_end_date': datetime.date(2020, 5, 27), 'min_duration_days': 0, 'max_duration_days': 7379, 'avg_duration_days': 1192.32, 'median_duration': 296, 'stddev_duration': 1779.19}]
the baseline cohort age stats: [{'total_count': 12360, 'min_age': 0, 'max_age': 25, 'avg_age': 7.24, 'median_age': 6, 'stddev_age': 6.01}]
the baseline cohort gender stats: [{'gender': 'female', 'gender_count': 12360, 'probability': 1.0}]
the baseline cohort race stats: [{'race': 'Other', 'race_count': 66, 'probability': 0.01}, {'race': 'Asian', 'race_count': 878, 'probability': 0.07}, {'race': 'Black or African American', 'race_count': 1056, 'probability': 0.09}, {'race': 'White', 'race_count': 10360, 'probability': 0.84}]
the baseline cohort ethnicity stats: [{'ethnicity': 'other', 'ethnicity_count': 12360, 'p

In [8]:
# get discrete probability distribution of the age variable in the baseline cohort
cohort_age_distr = baseline_cohort.get_distributions('age')
print(f'the baseline cohort age discrete probability distribution: {cohort_age_distr}')

the baseline cohort age discrete probability distribution: [{'age_bin': '0-10', 'bin_count': 8230, 'probability': 0.6659}, {'age_bin': '11-20', 'bin_count': 4129, 'probability': 0.3341}, {'age_bin': '21-30', 'bin_count': 1, 'probability': 0.0001}, {'age_bin': '31-40', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '41-50', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '51-60', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '61-70', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '71-80', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '81-90', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '91+', 'bin_count': 0, 'probability': 0.0}]
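
As a quick consistency check, the `probability` values in the output above can be recomputed from the `bin_count` values. The sketch below copies the three non-zero bins from that output; `to_probabilities` is a hypothetical helper written for this illustration, and rounding to four decimal places is an assumption inferred from the printed values.

```python
# Non-zero bins copied from the baseline age distribution output above.
age_bins = [
    {"age_bin": "0-10", "bin_count": 8230},
    {"age_bin": "11-20", "bin_count": 4129},
    {"age_bin": "21-30", "bin_count": 1},
]


def to_probabilities(bins, ndigits=4):
    """Map each bin label to its share of the total count, rounded to ndigits."""
    total = sum(b["bin_count"] for b in bins)
    if total == 0:
        # An empty cohort has no meaningful distribution; report zero everywhere.
        return {b["age_bin"]: 0.0 for b in bins}
    return {b["age_bin"]: round(b["bin_count"] / total, ndigits) for b in bins}


probs = to_probabilities(age_bins)
for label, p in probs.items():
    print(f"{label}: {p}")
```

The bin counts sum to 12360, which agrees with the `total_count` reported by `get_stats()` for the baseline cohort.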


---

### Study cohort creation and exploration
**Study cohort creation**: To create a study cohort of young female COVID patients, call the `create_cohort(cohort_name, cohort_description, query_or_yaml_file, created_by)` method on the `bias` object. The arguments are, in order: the cohort name, the cohort description, either a YAML file that specifies the cohort inclusion and exclusion criteria or a cohort selection SQL query, and the name of the person who created or owns the cohort. The method displays a progress bar that tracks cohort creation over three stages.


In [9]:
# create a user study cohort with young female COVID patients
study_cohort = bias.create_cohort('Young female COVID patients', 
                                  'A cohort of female COVID patients born between 2000 and 2020', 
                                  '../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_study.yaml', 
                                  'system')

Cohort creation:   0%|                                 | 0/3 [00:00<?, ?stage/s]

configuration specified in ../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_study.yaml loaded successfully
Cohort definition inserted successfully.
Cohort Young female COVID patients successfully created.
cohort created successfully


———————————————

**Study cohort basic exploration**: Check the study cohort metadata and data to confirm that the cohort was created successfully and to get a high-level view of the cohort. Note that the SQL query string generated from the input YAML file is stored in the `creation_info` field of the cohort definition table as part of the cohort metadata.

In [10]:
study_cohort_def = study_cohort.metadata
print(f'Young female COVID-19 patient cohort definition: {study_cohort_def}')
study_cohort_data = study_cohort.data
print(f'The total number of patients in the study cohort: {len(study_cohort_data)}')
print(f'The first five patients in the young female COVID-19 patient cohort: {study_cohort_data[:5]}')

Young female COVID-19 patient cohort definition: {'id': 2, 'name': 'Young female COVID patients', 'description': 'A cohort of female COVID patients born between 2000 and 2020', 'created_date': datetime.date(2025, 6, 16), 'creation_info': 'WITH ranked_events_condition_occurrence AS ( SELECT person_id, condition_concept_id AS concept_id, condition_start_date AS event_start_date, condition_end_date AS event_end_date, ROW_NUMBER() OVER ( PARTITION BY person_id, condition_concept_id ORDER BY condition_start_date ASC ) AS event_instance FROM condition_occurrence ), domain_qualifying_events AS ( (SELECT person_id, event_start_date, event_end_date FROM ranked_events_condition_occurrence WHERE concept_id = 37311061) ), filtered_cohort AS ( SELECT c.person_id, MIN(c.event_start_date) AS cohort_start_date, MAX(c.event_end_date) AS cohort_end_date FROM domain_qualifying_events c JOIN person p ON c.person_id = p.person_id WHERE 1=1 AND p.gender_concept_id = 8532 AND p.year_of_birth >= 2000 AND p.ye

———————————————

**Study cohort deeper exploration**: You can get statistics on the age, gender, race, and ethnicity of the study cohort by
calling the `get_stats()` method on the study cohort object. You can also get the cohort's age and gender distributions by
calling the `get_distributions()` method on the same object.

In [11]:
# get stats of the user study cohort
study_cohort_stats = study_cohort.get_stats()
print(f'the user study cohort stats: {study_cohort_stats}')
study_cohort_age_stats = study_cohort.get_stats("age")
print(f'the user study cohort age stats: {study_cohort_age_stats}')
study_cohort_gender_stats = study_cohort.get_stats("gender")
print(f'the user study gender stats: {study_cohort_gender_stats}')
study_cohort_race_stats = study_cohort.get_stats("race")
print(f'the user study cohort race stats: {study_cohort_race_stats}')
study_cohort_ethnicity_stats = study_cohort.get_stats("ethnicity")
print(f'the user study ethnicity stats: {study_cohort_ethnicity_stats}')

the user study cohort stats: [{'total_count': 10208, 'earliest_start_date': datetime.date(2020, 1, 18), 'latest_start_date': datetime.date(2020, 3, 30), 'earliest_end_date': datetime.date(2020, 2, 7), 'latest_end_date': datetime.date(2020, 5, 3), 'min_duration_days': 8, 'max_duration_days': 37, 'avg_duration_days': 24.25, 'median_duration': 24, 'stddev_duration': 7.2}]
the user study cohort age stats: [{'total_count': 10208, 'min_age': 0, 'max_age': 20, 'avg_age': 10.94, 'median_age': 11, 'stddev_age': 5.92}]
the user study gender stats: [{'gender': 'female', 'gender_count': 10208, 'probability': 1.0}]
the user study cohort race stats: [{'race': 'Other', 'race_count': 53, 'probability': 0.01}, {'race': 'Asian', 'race_count': 723, 'probability': 0.07}, {'race': 'Black or African American', 'race_count': 866, 'probability': 0.08}, {'race': 'White', 'race_count': 8566, 'probability': 0.84}]
the user study ethnicity stats: [{'ethnicity': 'other', 'ethnicity_count': 10208, 'probability': 1.
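
Because the two cohorts differ in size, raw counts are hard to compare directly; shares are easier to read. The sketch below copies the race counts from the two `get_stats("race")` outputs in this notebook and reports how each share shifts from the baseline cohort to the study cohort. The helpers `shares` and `share_differences` are hypothetical names used only for this illustration.

```python
# Race counts copied from the baseline and study cohort outputs in this notebook.
baseline_race = {"Other": 66, "Asian": 878, "Black or African American": 1056, "White": 10360}
study_race = {"Other": 53, "Asian": 723, "Black or African American": 866, "White": 8566}


def shares(counts):
    """Convert category counts into proportions of the cohort."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def share_differences(baseline, study):
    """Study share minus baseline share for every category seen in either cohort."""
    base_shares, study_shares = shares(baseline), shares(study)
    categories = sorted(set(base_shares) | set(study_shares))
    return {c: round(study_shares.get(c, 0.0) - base_shares.get(c, 0.0), 4) for c in categories}


diffs = share_differences(baseline_race, study_race)
for race, diff in diffs.items():
    print(f"{race}: {diff:+.4f}")
```

Differences near zero indicate that the study cohort has roughly the same racial composition as the baseline cohort.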

In [12]:
# get discrete probability distribution of the age variable in the study cohort
study_cohort_age_distr = study_cohort.get_distributions('age')
print(f'the user study cohort age discrete probability distribution: {study_cohort_age_distr}')

the user study cohort age discrete probability distribution: [{'age_bin': '0-10', 'bin_count': 4744, 'probability': 0.4647}, {'age_bin': '11-20', 'bin_count': 5464, 'probability': 0.5353}, {'age_bin': '21-30', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '31-40', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '41-50', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '51-60', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '61-70', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '71-80', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '81-90', 'bin_count': 0, 'probability': 0.0}, {'age_bin': '91+', 'bin_count': 0, 'probability': 0.0}]


---

### Baseline and study cohort comparison
You can compare the baseline and study cohorts by calling the `compare_cohorts(id1, id2)` method on the `bias` object. Note that currently only the Hellinger distances between the age distributions and between the gender distributions of the two cohorts are computed as comparison metrics. More comparison metrics will be added in the future.

In [13]:
# compare the baseline and user study cohorts
result = bias.compare_cohorts(baseline_cohort_def['id'], study_cohort_def['id'])
print(result)

[{'age_hellinger_distance': 0.14447523081257604}, {'gender_hellinger_distance': 0.0}]
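
For intuition about the numbers above, the Hellinger distance between two discrete distributions $P$ and $Q$ is commonly defined as $H(P, Q) = \sqrt{\tfrac{1}{2}\sum_i (\sqrt{p_i} - \sqrt{q_i})^2}$, which ranges from 0 (identical) to 1 (no overlap). The sketch below applies that textbook definition to the age bin counts printed earlier in this notebook. It is an independent illustration, not the package's implementation, so its result should land close to the reported value but may differ slightly because of binning or rounding details.

```python
import math


def normalize(counts):
    """Turn a list of bin counts into a list of probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions over the same bins."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    squared_sum = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared_sum / 2)


# Counts for the 0-10, 11-20, and 21-30 age bins; all remaining bins are zero in both cohorts.
baseline_age_counts = [8230, 4129, 1]
study_age_counts = [4744, 5464, 0]

age_distance = hellinger_distance(normalize(baseline_age_counts), normalize(study_age_counts))
# Both cohorts are entirely female, so the gender distributions are identical.
gender_distance = hellinger_distance([1.0], [1.0])

print(f"age: {age_distance:.4f}, gender: {gender_distance:.4f}")
```

A distance of 0 for gender reflects that both cohorts were restricted to female patients, while the nonzero age distance reflects the shift toward the 11-20 bin in the study cohort.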


---

### Final cleanup to ensure database connections are closed

In [14]:
bias.cleanup()

Connection to BiasDatabase closed.
Connection to the OMOP CDM database closed.
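
In a script, an exception raised partway through the analysis would skip the final `cleanup()` call. One way to guard against that is a small context manager that always calls `cleanup()` on exit. The sketch below is a generic pattern: `cleaned_up` is a hypothetical helper, and `FakeSession` is a stand-in object used so that the example runs without a database. Any object with a `cleanup()` method, such as the `bias` object in this tutorial, could be passed in its place.

```python
from contextlib import contextmanager


@contextmanager
def cleaned_up(resource):
    """Yield the resource and call its cleanup() method even if the body raises."""
    try:
        yield resource
    finally:
        resource.cleanup()


class FakeSession:
    """Stand-in for an object that holds database connections."""

    def __init__(self):
        self.closed = False

    def cleanup(self):
        self.closed = True


session = FakeSession()
try:
    with cleaned_up(session) as s:
        # Simulate an analysis step that fails partway through.
        raise RuntimeError("analysis failed")
except RuntimeError:
    pass

print(session.closed)
```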


### ✅ Summary

In this tutorial, you learned how to connect to an OMOP CDM database, create a baseline cohort and a study cohort, explore each cohort, and compare the two cohorts using the BiasAnalyzer Python package.

For more information, refer to the [BiasAnalyzer GitHub repo](https://github.com/VACLab/BiasAnalyzerCore) and the [README file](https://github.com/VACLab/BiasAnalyzerCore/blob/main/README.md).
