# Using BiasAnalyzer for Cohort Concept Prevalence Exploration

This tutorial demonstrates how to use the `BiasAnalyzer` package to explore **concept prevalence** within a cohort, a key step in identifying potential biases during cohort selection. It complements the [Cohort Exploration Tutorial](./BiasAnalyzerCohortsTutorial.ipynb) by focusing specifically on which clinical concepts (e.g., diagnoses, procedures, medications) are most common in a selected cohort. In the OMOP (Observational Medical Outcomes Partnership) CDM (Common Data Model), a **concept** is a coded term from a standardized medical vocabulary, uniquely identified by a **concept ID**. All clinical events in OMOP, such as conditions, drug exposures, procedures, measurements, and observations, are represented as concepts.

---

### Overview

**Objective**:  
Learn how to retrieve and analyze concept prevalence within a cohort using `BiasAnalyzer`.

**Before You Begin**:  
The `BiasAnalyzer` package is currently in active development and has not yet been officially released on PyPI.
You can install it in one of two ways:

- **Install from GitHub (recommended during development)**:
```bash
pip install git+https://github.com/vaclab/BiasAnalyzerCore.git
```
- **Install from PyPI (once the package is officially released)**:
```bash
pip install biasanalyzer
```

For full setup and usage instructions, refer to the [README](https://github.com/VACLab/BiasAnalyzerCore/blob/main/README.md).

---


### Preparation for cohort concept prevalence exploration
**Preparation step 1**: Import the `BIAS` class from the `api` module of the `BiasAnalyzer` package, create a `bias` object of the `BIAS` class, load the OMOP CDM database configuration with `set_config()`, and call `set_root_omop()` to connect to the OMOP CDM database. Refer to the [Cohort Exploration Tutorial](./BiasAnalyzerCohortsTutorial.ipynb) for more details.

In [1]:
from biasanalyzer.api import BIAS

bias = BIAS()

bias.set_config('../config.yaml')

bias.set_root_omop()

configuration specified in ../config.yaml loaded successfully
Connected to the OMOP CDM database (read-only).
Cohort Definition table created.
Cohort table created.


———————————————

**Preparation step 2**: Create a cohort of young female COVID patients by calling the `create_cohort(cohort_name, cohort_description, query_or_yaml_file, created_by)` method on the `bias` object. The arguments are, in order: the cohort name, the cohort description, either a YAML file that specifies cohort inclusion and exclusion criteria or a cohort selection SQL query, and the name of the person who owns or creates the cohort. After the cohort is created, you can call the `get_stats()` and `get_distributions()` methods on the returned `cohort_data` object to explore cohort statistics and distributions.

In [2]:
# create a cohort with young female COVID patients
cohort_data = bias.create_cohort('Young female COVID patients', 
                                  'A cohort of female COVID patients born between 2000 and 2020', 
                                  '../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_study.yaml', 
                                  'system')
# get stats of the cohort
cohort_stats = cohort_data.get_stats()
print(f'the cohort stats: {cohort_stats}')
cohort_age_stats = cohort_data.get_stats("age")
print(f'the cohort age stats: {cohort_age_stats}')
cohort_gender_stats = cohort_data.get_stats("gender")
print(f'the cohort gender stats: {cohort_gender_stats}')
cohort_race_stats = cohort_data.get_stats("race")
print(f'the cohort race stats: {cohort_race_stats}')
cohort_ethnicity_stats = cohort_data.get_stats("ethnicity")
print(f'the cohort ethnicity stats: {cohort_ethnicity_stats}')
# get discrete probability distribution of the age variable in the cohort
cohort_age_distr = cohort_data.get_distributions('age')
print(f'the cohort age discrete probability distribution: {cohort_age_distr}')

template_path: /home/hongyi/BiasAnalyzer/biasanalyzer/sql_templates


configuration specified in ../tests/assets/cohort_creation/test_cohort_creation_condition_occurrence_config_study.yaml loaded successfully
Cohort definition inserted successfully.
Cohort Young female COVID patients successfully created.
template_path: /home/hongyi/BiasAnalyzer/biasanalyzer/sql_templates
cohort created successfully
the cohort stats: [{'total_count': 10208, 'earliest_start_date': datetime.date(2020, 1, 18), 'latest_start_date': datetime.date(2020, 3, 30), 'earliest_end_date': datetime.date(2020, 2, 7), 'latest_end_date': datetime.date(2020, 5, 3), 'min_duration_days': 8, 'max_duration_days': 37, 'avg_duration_days': 24.25, 'median_duration': 24, 'stddev_duration': 7.2}]
the cohort age stats: [{'total_count': 10208, 'min_age': 0, 'max_age': 20, 'avg_age': 10.94, 'median_age': 11, 'stddev_age': 5.92}]
the cohort gender stats: [{'gender': 'female', 'gender_count': 10208, 'probability': 1.0}]
the cohort race stats: [{'race': 'Other', 'race_count': 53, 'probability': 0.01}, {

**Now that you have connected to your OMOP CDM database and created the `cohort_data` cohort object, you are ready to explore cohort concept prevalence.** 

---

### Exploring cohort concept prevalence with concept hierarchy taken into account
You can retrieve concept prevalence statistics for a cohort using the `get_concept_stats(concept_type='condition_occurrence', filter_count=0, vocab=None)` method on the `cohort_data` object. Each input argument to this method has a default value, so you can call the method without specifying all parameters.
- The `concept_type` input argument specifies the OMOP domain to analyze. It must be one of the OMOP domain names: `condition_occurrence`, `drug_exposure`, `procedure_occurrence`, `visit_occurrence`, `measurement`, or `observation`.
- The `vocab` input argument specifies the OMOP vocabulary ID to filter concepts by. If set to `None`, a default vocabulary is used based on the domain: `RxNorm` for `drug_exposure`, `LOINC` for `measurement`, and `SNOMED` for all other domains.
- The `filter_count` input argument filters out concepts with fewer than this number of patients in the cohort. Set it to `0` to include all without filtering.
This method helps identify the most prevalent clinical concepts in your cohort with concept hierarchy taken into account, which can reveal patterns or potential sources of selection bias in the cohort data.
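
The default-vocabulary rule described above can be pictured as a small lookup. The sketch below only illustrates that rule: `resolve_vocab`, `VALID_DOMAINS`, and `DEFAULT_VOCAB` are hypothetical names introduced here, not part of `BiasAnalyzer`, and the package's actual implementation may differ.

```python
from typing import Optional

# Domains this tutorial lists as valid `concept_type` values.
VALID_DOMAINS = {
    "condition_occurrence", "drug_exposure", "procedure_occurrence",
    "visit_occurrence", "measurement", "observation",
}

# Hypothetical lookup mirroring the documented defaults; every other
# domain falls back to SNOMED.
DEFAULT_VOCAB = {"drug_exposure": "RxNorm", "measurement": "LOINC"}


def resolve_vocab(concept_type: str = "condition_occurrence",
                  vocab: Optional[str] = None) -> str:
    """Return the vocabulary to use for a domain (illustrative only)."""
    if concept_type not in VALID_DOMAINS:
        raise ValueError(f"unsupported concept_type: {concept_type}")
    if vocab is not None:
        return vocab  # an explicit vocabulary always wins
    return DEFAULT_VOCAB.get(concept_type, "SNOMED")


print(resolve_vocab())                 # SNOMED
print(resolve_vocab("drug_exposure"))  # RxNorm
print(resolve_vocab("measurement"))    # LOINC
```

Passing `vocab` explicitly bypasses the lookup, which matches the behavior described for a non-`None` value.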

**Cohort condition occurrence concept prevalence**: 
The code block below demonstrates how to call `get_concept_stats(filter_count=5000)` with the default `concept_type` and `vocab` parameters to retrieve concept prevalence with concept hierarchical relationships taken into account. By default, it uses the `SNOMED` vocabulary for the `condition_occurrence` domain. Concepts with fewer than 5000 patients are excluded, as specified by the `filter_count` input parameter. The method returns a dictionary and a `ConceptHierarchy` object, as detailed below:
- The returned dictionary contains a key-value pair where the **key** is the `concept_type` (e.g., `condition_occurrence`) and the **value** is a list of concept dictionaries. Each concept dictionary in the list contains `concept_name`, `concept_code`, `count_in_cohort`, `prevalence`, `ancestor_concept_id`, and `descendant_concept_id` (i.e., the concept id that the item's concept name, code, count, and prevalence describe). These values let you explore which clinical concepts are most prevalent in your cohort with concept hierarchy taken into account, and they support deeper investigations into potential sources of selection bias.
- The returned `ConceptHierarchy` object stores concept hierarchical relationships, indexes concept nodes for quick retrieval, and provides traversal methods for navigating the hierarchy. It can be serialized into a dictionary via its `to_dict(root_id: Optional[int] = None)` method so that downstream apps can load it easily into a JSON object. The serialized hierarchy is rooted at the given concept id, or covers the entire hierarchy if no concept id is provided. The returned dictionary contains a single key, `hierarchy`, whose value is a list of concept dictionaries, one per root of the concept hierarchy. Each concept dictionary contains the keys `concept_id`, `concept_name`, `concept_code`, `metrics`, `parent_ids`, and `children`, where `parent_ids` is a list of parent concept ids, `children` is a list of serialized, potentially nested, child concept nodes, and `metrics` is a dictionary with the following two keys:
    - `cohorts`: a dictionary with a cohort id as key and a sub-dictionary as value that includes the count and prevalence for that cohort. For example, `'cohorts': {'1': {'count': 10208, 'prevalence': 1.0}}`. The `cohorts` dictionary may include metrics for multiple cohorts.
    - `union`: a dictionary with the unioned count and prevalence for the concept node over all cohorts included in the `cohorts` dictionary. For example, `'union': {'count': 10208, 'prevalence': 1.0}`. When only one cohort is included, the union metric dictionary is the same as the single cohort metric dictionary. Refer to the [Union of Cohort Concepts Over Multiple Cohorts Tutorial](./BiasAnalyzerMultipleCohortConceptUnionTutorial.ipynb) for how to use the `union` method of the `ConceptHierarchy` class to aggregate concept metrics over multiple cohorts.
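
To make the serialized structure concrete, the sketch below walks a small hand-written dictionary shaped like the `to_dict()` output described above. The sample data is a simplified assumption (two nodes, with `Disease` placed directly under `Clinical finding`), not real package output, and `flatten` is a hypothetical helper rather than a `BiasAnalyzer` function.

```python
import json

# Hand-written sample following the documented keys; not real output.
sample = {
    "hierarchy": [
        {
            "concept_id": 441840,
            "concept_name": "Clinical finding",
            "concept_code": "404684003",
            "metrics": {
                "cohorts": {"1": {"count": 10208, "prevalence": 1.0}},
                "union": {"count": 10208, "prevalence": 1.0},
            },
            "parent_ids": [],
            "children": [
                {
                    "concept_id": 4274025,
                    "concept_name": "Disease",
                    "concept_code": "64572001",
                    "metrics": {
                        "cohorts": {"1": {"count": 10208, "prevalence": 1.0}},
                        "union": {"count": 10208, "prevalence": 1.0},
                    },
                    "parent_ids": [441840],
                    "children": [],
                }
            ],
        }
    ]
}


def flatten(node, depth=0):
    """Yield (depth, concept_id, concept_name, union prevalence) for a
    node and all of its nested children."""
    yield (depth, node["concept_id"], node["concept_name"],
           node["metrics"]["union"]["prevalence"])
    for child in node["children"]:
        yield from flatten(child, depth + 1)


rows = [row for root in sample["hierarchy"] for row in flatten(root)]
for depth, concept_id, name, prevalence in rows:
    print("  " * depth + f"{name} ({concept_id}): {prevalence}")

# The structure holds only plain dicts, lists, strings, and numbers,
# so it round-trips through JSON unchanged.
as_json = json.dumps(sample)
```

Because the nesting lives entirely in `children`, a recursive generator like this is enough to turn the hierarchy into flat rows for a table or plot.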

**Note** that this prevalence computation may take some time, especially for large cohorts. A progress bar will appear to indicate the progress of the prevalence calculation.

In [3]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

import time

# get cohort concept prevalence
t1 = time.time()
cohort_concepts, cohort_concept_hierarchy = cohort_data.get_concept_stats(filter_count=5000)
print(pd.DataFrame(cohort_concepts["condition_occurrence"]))
print(f"returned cohort_concept_hierarchy object converted to dict: {cohort_concept_hierarchy.to_dict()}")
print(f'the time taken to get cohort concept stats for condition_occurrence is {time.time() - t1}s')

                                 concept_name concept_code  count_in_cohort  prevalence  ancestor_concept_id  descendant_concept_id
0                       Coronavirus infection    186747009            10208    1.000000               439676                 439676
1                                     Disease     64572001            10208    1.000000              4274025                4274025
2                            Clinical finding    404684003            10208    1.000000               441840                 441840
3                       Coronavirus infection    186747009            10208    1.000000              4100065                 439676
4                Disease due to Coronaviridae     27619001            10208    1.000000              4100065                4100065
5                Disease due to Coronaviridae     27619001            10208    1.000000               440029                4100065
6                   Disorder due to infection     40733004            10208 
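
In the table above, the same concept can appear on several rows, once for each ancestor it is paired with (for example, `Coronavirus infection` on rows 0 and 3). If you want one row per concept, one option is to keep the first row for each `descendant_concept_id`. The sketch below does this in plain Python on rows copied from the output above; treating `descendant_concept_id` as the de-duplication key is an assumption based on that output, not documented package behavior.

```python
# Rows 0-5 copied from the output above.
fields = ("concept_name", "concept_code", "count_in_cohort", "prevalence",
          "ancestor_concept_id", "descendant_concept_id")
raw = [
    ("Coronavirus infection", "186747009", 10208, 1.0, 439676, 439676),
    ("Disease", "64572001", 10208, 1.0, 4274025, 4274025),
    ("Clinical finding", "404684003", 10208, 1.0, 441840, 441840),
    ("Coronavirus infection", "186747009", 10208, 1.0, 4100065, 439676),
    ("Disease due to Coronaviridae", "27619001", 10208, 1.0, 4100065, 4100065),
    ("Disease due to Coronaviridae", "27619001", 10208, 1.0, 440029, 4100065),
]
rows = [dict(zip(fields, values)) for values in raw]

# Keep the first row seen for each descendant concept id.
first_by_concept = {}
for row in rows:
    first_by_concept.setdefault(row["descendant_concept_id"], row)

# Sort by count, most prevalent first (sorted() is stable, so ties keep
# their original order).
unique_concepts = sorted(first_by_concept.values(),
                         key=lambda r: r["count_in_cohort"], reverse=True)

for row in unique_concepts:
    print(row["concept_name"], row["count_in_cohort"], row["prevalence"])
```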

———————————————

**Cohort drug exposure concept prevalence**: 
The code block below demonstrates how to use the `get_concept_stats(concept_type='drug_exposure', filter_count=500)` method to retrieve concept prevalence for the `drug_exposure` domain with hierarchical relationships included in the results. By default, this uses the `RxNorm` vocabulary. Concepts with fewer than 500 patients are excluded, as specified by the `filter_count` input parameter. The method returns a dictionary and a `ConceptHierarchy` object. Refer to the `condition_occurrence` example above for a detailed description of both. These results let you explore which clinical concepts are most prevalent in your cohort and support deeper investigations into potential sources of selection bias.

**Note**: Prevalence computation may take some time, especially for large cohorts or when hierarchical relationships are included. A progress bar will appear to indicate the progress of the computation. 


In [4]:
t1 = time.time()
cohort_de_concepts, cohort_de_concept_hierarchy = cohort_data.get_concept_stats(concept_type='drug_exposure', filter_count=500)
print(pd.DataFrame(cohort_de_concepts["drug_exposure"]))
print(f"returned cohort_de_concept_hierarchy object converted to dict: {cohort_de_concept_hierarchy.to_dict()}")
print(f'the time taken to get cohort concept stats for drug_exposure is {time.time() - t1}s')

                                         concept_name concept_code  count_in_cohort  prevalence  ancestor_concept_id  descendant_concept_id
0                                        Oral Product      1151131              937    0.091791             36217214               36217214
1                                                Pill      1151133              931    0.091203             36217216               36217216
2                                       acetaminophen          161              676    0.066223              1125315                1125315
3                                  acetaminophen Pill      1152843              638    0.062500             36216999               36216999
4                                  acetaminophen Pill      1152843              638    0.062500             36217216               36216999
5                          acetaminophen Oral Product      1152842              638    0.062500              1125315               36216998
6                   
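
The prevalence values in the table above are consistent with dividing each concept's patient count by the cohort's `total_count` (10208) and rounding to six decimals. The sketch below checks that arithmetic on the first few rows and shows how a `filter_count` threshold would act on them. The division rule is inferred from the printed numbers rather than taken from the package source, and the threshold of 650 is a hypothetical value chosen only for illustration.

```python
cohort_size = 10208  # total_count reported by get_stats() earlier

# Patient counts copied from the drug_exposure output above.
counts = {
    "Oral Product": 937,
    "Pill": 931,
    "acetaminophen": 676,
    "acetaminophen Pill": 638,
}

# Prevalence as the fraction of cohort patients with the concept.
prevalence = {name: round(count / cohort_size, 6)
              for name, count in counts.items()}
for name, value in prevalence.items():
    print(f"{name}: {value}")

# Concepts with fewer patients than filter_count are dropped.
filter_count = 650  # hypothetical threshold for this sketch
kept = [name for name, count in counts.items() if count >= filter_count]
print(kept)
```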

———————————————

**Navigating cohort concept hierarchy**
The following methods in the `ConceptHierarchy` class enable concept hierarchy navigation:
- `get_root_nodes(serialization=False)`: If the input parameter `serialization` is False, it returns a list of root nodes of the `ConceptHierarchy` object where each root node is a `ConceptNode` object. If the input parameter `serialization` is True, it returns a list of dictionaries with each dict item representing a serialized root `ConceptNode` object ready to be converted into a JSON object for downstream apps.
- `get_node(concept_id, serialization=False)`: returns the `ConceptNode` object (when `serialization` is set to False) or a dict item representing a serialized `ConceptNode` object (when `serialization` is set to True) corresponding to the input `concept_id`.
- `get_leaf_nodes(serialization=False)`: If the input parameter `serialization` is False, it returns a list of leaf nodes of the `ConceptHierarchy` object where each leaf node is a `ConceptNode` object. If the input parameter `serialization` is True, it returns a list of dictionaries with each dict item representing a serialized leaf `ConceptNode` object ready to be converted into a JSON object for downstream apps.
- `iter_nodes(root_id, order='bfs', serialization=False)`: lets downstream apps iterate over the concept hierarchy starting at `root_id` in breadth-first (`bfs`) or depth-first (`dfs`) order, yielding `ConceptNode` objects (when `serialization` is False) or dict items that each represent a serialized `ConceptNode` object (when `serialization` is True).
- `union(other)`: merges the current `ConceptHierarchy` object with another `ConceptHierarchy` object (specified by the `other` input parameter). It returns a new `ConceptHierarchy` object that provides a unioned view of each concept across the two hierarchies, with aggregated metrics computed.
- `to_dict(root_id=None)`: converts the entire `ConceptHierarchy` object (if `root_id` input parameter is None) or a sub-hierarchy rooted at the `root_id` input concept id parameter in the concept hierarchy to a serialized and nested dict structure ready to be loaded into a JSON object by downstream apps. The returned dict is a key-value pair with `hierarchy` as the key and a list of serialized ConceptNode objects as the value.

When `serialization` is set to False, `ConceptHierarchy` methods return a `ConceptNode` object or a list of `ConceptNode` objects. Downstream apps can then access `ConceptNode` properties such as `name`, `code`, `parents`, and `children`, and call `ConceptNode` methods such as `get_metrics(cohort_id)`, which returns the concept's count and prevalence for the cohort identified by `cohort_id`, and `get_union_metrics()`, which returns the concept's unioned (aggregated) count and prevalence across multiple cohorts. Both sets of metrics are computed by the linked `ConceptHierarchy` object. `ConceptNode` also provides a `to_dict(include_children=True)` method that returns the serialized dict for the node, with nested children included when `include_children` is True and omitted when it is False.
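
To show what the `bfs` and `dfs` orders mean in practice, here is a minimal, self-contained sketch of both traversals on a toy tree. `Node` and `iter_tree` are hypothetical stand-ins written for this illustration; they are not the package's `ConceptNode` or `iter_nodes`, whose internals (for example, how nodes with multiple parents are handled) may differ.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Toy stand-in for a concept node: a name plus child nodes."""
    name: str
    children: List["Node"] = field(default_factory=list)


def iter_tree(root: Node, order: str = "bfs") -> Iterator[Node]:
    """Yield nodes breadth-first ('bfs') or depth-first pre-order ('dfs')."""
    if order not in ("bfs", "dfs"):
        raise ValueError(f"unknown order: {order}")
    pending = deque([root])
    while pending:
        # bfs takes the oldest pending node (queue);
        # dfs takes the newest pending node (stack).
        node = pending.popleft() if order == "bfs" else pending.pop()
        yield node
        if order == "bfs":
            pending.extend(node.children)
        else:
            # Reverse so the first child is popped, and visited, first.
            pending.extend(reversed(node.children))


# A -> (B -> D), (C -> E)
tree = Node("A", [Node("B", [Node("D")]), Node("C", [Node("E")])])

print([n.name for n in iter_tree(tree)])         # ['A', 'B', 'C', 'D', 'E']
print([n.name for n in iter_tree(tree, "dfs")])  # ['A', 'B', 'D', 'C', 'E']
```

Breadth-first order visits every concept at one level before moving deeper, which suits level-by-level summaries. Depth-first order follows each branch to its leaves first, which suits printing an indented outline.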

The code block below demonstrates how to navigate the cohort condition occurrence concept hierarchy using `ConceptHierarchy` and `ConceptNode` objects.

In [5]:
root_nodes = cohort_concept_hierarchy.get_root_nodes()
leaf_nodes = cohort_concept_hierarchy.get_leaf_nodes()
infection_node = cohort_concept_hierarchy.get_node(432250) # disorder due to infection
for index, root in enumerate(root_nodes):
    print(f'root node {index} info - name: {root.name}, code: {root.code}')
    print(f'root metric info - {root.get_metrics(1)}')
for index, leaf in enumerate(leaf_nodes):    
    print(f'leaf node {index} info - name: {leaf.name}, code: {leaf.code}')
    print(f'leaf metric info - {leaf.get_metrics(1)}')
print(f'infection_node info - name: {infection_node.name}, code: {infection_node.code}')
print(f'infection_node metric info - {infection_node.get_metrics(1)}')
print(f'infection_node union metric info - {infection_node.get_union_metrics()}')
print(f'serialized infection_node dict info - {infection_node.to_dict()}')

print('print all concept nodes being iterated in the concept hierarchy starting from the root node in breadth-first order:')
for n in cohort_concept_hierarchy.iter_nodes(441840):
    print(f'({n.id}, {n.name})')
print('print all concept nodes being iterated in the concept hierarchy starting from the root node in depth-first order:')
for n in cohort_concept_hierarchy.iter_nodes(441840, order="dfs"):
    print(f'({n.id}, {n.name})')
print('print serialized dict of all concept nodes being iterated in the concept hierarchy starting from the root node in depth-first order:')
print(list(cohort_concept_hierarchy.iter_nodes(441840, order="dfs", serialization=True)))

root node 0 info - name: Clinical finding, code: 404684003
root metric info - {'count': 10208, 'prevalence': 1.0}
leaf node 0 info - name: COVID-19, code: 840539006
leaf metric info - {'count': 10208, 'prevalence': 1.0}
leaf node 1 info - name: Fever, code: 386661006
leaf metric info - {'count': 8650, 'prevalence': 0.8473746081504702}
leaf node 2 info - name: Finding of sensation by site, code: 699697007
leaf metric info - {'count': 6657, 'prevalence': 0.6521355799373041}
leaf node 3 info - name: Cough, code: 49727002
leaf metric info - {'count': 6596, 'prevalence': 0.6461598746081505}
leaf node 4 info - name: Mouth and/or pharynx finding, code: 249376008
leaf metric info - {'count': 5619, 'prevalence': 0.5504506269592476}
leaf node 5 info - name: Finding of head region, code: 298364001
leaf metric info - {'count': 5194, 'prevalence': 0.5088166144200627}
infection_node info - name: Disorder due to infection, code: 40733004
infection_node metric info - {'count': 10208, 'prevalence': 1.0

---

### Final cleanup to ensure database connections are closed

In [6]:
bias.cleanup()

Connection to BiasDatabase closed.
Connection to the OMOP CDM database closed.
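
If an earlier cell raises an exception, the `cleanup()` call may never run. One way to guard against that in a script is to wrap the work in a context manager that always calls `cleanup()`. The sketch below shows the pattern with a stand-in object so that it runs without a database; `closing_bias` and `FakeBias` are hypothetical names introduced here, and only the `cleanup()` method is taken from this tutorial.

```python
from contextlib import contextmanager


@contextmanager
def closing_bias(bias):
    """Yield `bias` and call its cleanup() afterwards, even on error."""
    try:
        yield bias
    finally:
        bias.cleanup()


class FakeBias:
    """Stand-in for a BIAS object; it only records that cleanup() ran."""

    def __init__(self):
        self.closed = False

    def cleanup(self):
        self.closed = True


fake = FakeBias()
try:
    with closing_bias(fake) as b:
        # Any analysis code would go here; simulate a failure instead.
        raise RuntimeError("simulated failure during analysis")
except RuntimeError:
    pass

print(fake.closed)  # True: cleanup() ran despite the exception
```

With a real `BIAS` object, the same wrapper would be used as `with closing_bias(bias) as b:` around the analysis calls.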


### ✅ Summary

In this tutorial, you learned how to use the `BiasAnalyzer` package to explore clinical concept prevalence in a cohort. Knowing which clinical concepts are most prevalent in your cohort supports deeper investigations into potential sources of cohort selection bias.
  
For more information, refer to the [BiasAnalyzer GitHub repo](https://github.com/VACLab/BiasAnalyzerCore) and the [README file](https://github.com/VACLab/BiasAnalyzerCore/blob/main/README.md).
