PheKnowVec Manuscript: Analysis Evaluation Plan

Tiffany J. Callahan edited this page Mar 7, 2019 · 1 revision

PheKnowVec: A Novel Approach to Computational Phenotyping

Proposed Journal:

Documents Linked to Outline


Overview

Motivation

Computational phenotyping (CP) leverages predefined sets of clinical concepts (e.g., diagnoses, medications, procedures, and laboratory test codes) to identify patients with and without a condition. CP approaches have great potential to aid in diagnosis, prognosis, therapeutic decision-making, and identification of mechanisms or novel biomarkers.

Existing methods face three unsolved barriers

  1. CP definitions may have limited generalizability because they are tailored to specific source vocabularies or hospital systems.
  2. CP definitions may lack translational relevance because they primarily rely on clinical data requiring additional mapping to incorporate, for example, molecular or physiologic data.
  3. CP definitions may lack scalability because the current process for creating definitions is a time-consuming, iterative process requiring both domain expertise and external validation.

How can we solve these problems?
PheKnowVec is a novel method for deriving, implementing, and validating CPs. It addresses these barriers by building on two observations:

  • The mapping of standardized clinical terminology concepts to linked open data, such as biomedical ontologies, has been shown to significantly improve the process of integrating and incorporating sources of non-clinical data.
  • Embedding methods, which convert large complex heterogeneous data into scalable compressed vectors without semantic information loss, have successfully solved a wide range of problems in the biomedical domain.

Approach

Data Sources

We will use two independent datasets for our experiments:

  1. The first dataset contains pediatric data and was extracted from a de-identified database built using data from the Children’s Hospital Colorado (CHCO). The CHCO De-ID data conforms to the structure defined by the Pediatric Learning Health System (PEDSnet) Observational Medical Outcomes Partnership (OMOP) common data model version 5.
  2. The second dataset contains adult intensive care data from the MIMIC-III database, which has also been standardized to the OMOP common data model (version 5.3).

All diagnosis, medication, and laboratory test data in the current build of each dataset were considered for analysis. All patients whose record contained at least one code in the defined code sets were included in the analysis. Use of these data was approved by the Colorado Multiple Institutional Review Board (protocol # 15-0445).
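The inclusion criterion above (at least one code from the defined code sets) can be sketched as a query against an OMOP-style table. This is a minimal illustration, not the project's actual query: it uses an in-memory SQLite database, covers only the `condition_occurrence` table (with its `person_id` and `condition_concept_id` columns), and the concept IDs and the helper name `patients_with_any_code` are made up for the example.

```python
import sqlite3


def patients_with_any_code(conn, code_set):
    """Return sorted person_ids having at least one condition record in code_set."""
    if not code_set:
        return []
    # One "?" placeholder per code keeps the query parameterized.
    placeholders = ",".join("?" for _ in code_set)
    query = (
        "SELECT DISTINCT person_id FROM condition_occurrence "
        f"WHERE condition_concept_id IN ({placeholders}) "
        "ORDER BY person_id"
    )
    return [row[0] for row in conn.execute(query, sorted(code_set))]


# Toy data: the concept IDs below are placeholders, not real OMOP concepts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE condition_occurrence "
    "(person_id INTEGER, condition_concept_id INTEGER)"
)
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?)",
    [(1, 100), (1, 200), (2, 300), (3, 100), (4, 999)],
)

included = patients_with_any_code(conn, {100, 300})
print(included)
```

The same pattern would need to be repeated for the medication and laboratory test tables to cover all three data domains.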

Phenotype Definitions

We used all eMERGE phenotypes appropriate for implementation in pediatric and adult populations. Of these 13 phenotypes, four overlapped with the original set implemented in PMC6289550. The SQL queries used to retrieve patients for each code set can be found here (will add GitHub Gist link to query), and all code sets can be found here.

Experiments

To demonstrate the generalizability, translatability, and scalability of PheKnowVec, we will perform several experiments.

  • Generalizability
    • Task: derive and compare patient cohorts using different code sets (OMOP)
      • Use codes defined in eMERGE phenotype to construct patient cohorts (without phenotype definition rules)
      • Use codes defined in eMERGE phenotype to construct patient cohorts with respect to phenotype definition rules
    • Performance: count the number of patients inappropriately included in (false positive or FP) or missing from (false negative or FN) the gold standard patient cohort
    • GOAL: show what is lost or gained between the cohorts derived from the different code sets. Ideally, there is no significant information loss between the code sets
  • Translatability
    • Task: derive and compare patient cohorts using different code sets (OBOs)
      • Use codes defined in eMERGE phenotype to construct patient cohorts (without phenotype definition rules)
      • Use codes defined in eMERGE phenotype to construct patient cohorts with respect to phenotype definition rules
      • Consider adding data that is annotated to each ontology as a way to show how easy it is to add external data?
    • Performance: count the number of patients inappropriately included in (false positive or FP) or missing from (false negative or FN) the gold standard patient cohort
    • GOAL: show that the OBO code sets perform similarly to the OMOP code sets
  • Scalability
    • Task 1: derive and compare patient cohort embeddings
      • Use codes defined in eMERGE phenotype to construct patient cohorts (without phenotype definition rules)
      • Use codes defined in eMERGE phenotype to construct patient cohorts with respect to phenotype definition rules
    • Performance 1: perform leave-one-patient-out cross-validation (CV) with Youden index thresholding to measure how well the embeddings separate cases from controls within each phenotype
    • Task 2: same as Task 1, but compare each patient cohort embedding to an aggregated case/control vector for each phenotype
    • GOAL: show that the embeddings, especially those derived from phenotype definitions, separate cases from controls well, which would support the use of embeddings
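The scalability evaluation (leave-one-patient-out CV, comparison to aggregated case/control vectors, and Youden index thresholding) might look roughly like the sketch below. Several choices here are assumptions rather than part of the plan: the aggregated vector is taken to be the mean of the embeddings, similarity is cosine similarity, the score is the difference between similarity to the case and control vectors, and the embeddings are toy two-dimensional values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def leave_one_out_scores(embeddings, labels):
    """Score each patient against centroids built from all other patients."""
    scores = []
    for i, vector in enumerate(embeddings):
        rest = [(e, y) for j, (e, y) in enumerate(zip(embeddings, labels)) if j != i]
        case_vec = centroid([e for e, y in rest if y == 1])
        control_vec = centroid([e for e, y in rest if y == 0])
        scores.append(cosine(vector, case_vec) - cosine(vector, control_vec))
    return scores


def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy embeddings: three cases (label 1) and three controls (label 0).
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
labels = [1, 1, 1, 0, 0, 0]

scores = leave_one_out_scores(embeddings, labels)
threshold, j_statistic = youden_threshold(scores, labels)
print(threshold, j_statistic)
```

Each phenotype needs at least two cases and two controls for this sketch to work, since one patient is held out before the centroids are computed.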

The generalizability and translatability tasks will be performed using the CHCO OMOP De-ID and the MIMIC-III OMOP data. In the scalability task, the MIMIC-III OMOP data will be used only to validate the representations built from CHCO OMOP De-ID.
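For the generalizability and translatability tasks, the performance measure is a count of patients wrongly included in or missing from the gold standard cohort. A minimal sketch of that comparison, treating cohorts as sets of patient identifiers, is shown here; the cohort names and patient IDs are illustrative only.

```python
def compare_to_gold(cohort, gold):
    """Split a candidate cohort into FP, FN, and TP relative to the gold standard."""
    cohort, gold = set(cohort), set(gold)
    return {
        "false_positives": cohort - gold,  # included but not in gold standard
        "false_negatives": gold - cohort,  # in gold standard but missed
        "true_positives": cohort & gold,
    }


# Illustrative patient IDs; the gold standard would come from the raw baseline.
gold_cohort = {1, 2, 3, 4}
candidate_cohorts = {
    "omop_baseline": {1, 2, 3, 5},
    "obo_exact_match": {1, 2},
}

results = {name: compare_to_gold(c, gold_cohort) for name, c in candidate_cohorts.items()}
for name, result in results.items():
    print(name, len(result["false_positives"]), len(result["false_negatives"]))
```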

Qualitative Validation: Take one or two of the phenotypes and extend the ontology code sets to include additional open data sources that are already manually annotated to the ontologies (e.g., DOID and HPO contain hand-annotated gene list mappings). Perform basic clustering within each phenotype to derive subgroups. Identify the most important codes within each cluster and ask a domain expert to help interpret the results. Ideally, this will yield findings interesting enough to provide a foundation for the Med2Mech paper.

Code Sets

We define the following:

  • Definition Rules: the logic that underlies the definitions of a clinical phenotype (e.g., must have two occurrences of abnormally elevated eosinophil counts within 1 month).
  • Code Sets: the clinical concepts (i.e., diagnoses, medications, laboratory tests, and procedure codes) that are used in the definition rules of each phenotype. There are two types of code sets:
    • Clinical code sets: derived by mapping the clinical codes from eMERGE phenotypes to OMOP standardized terminologies.
    • Ontology code sets: derived by mapping the clinical codes from eMERGE phenotypes and the OMOP standardized terminologies to an Open Biomedical Ontology (OBO).
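To make the distinction between code sets and definition rules concrete, the example rule above (two occurrences of abnormally elevated eosinophil counts within 1 month) can be sketched as a small check over dated measurements. The cutoff value, the 30-day reading of "1 month", and the function name are assumptions made for illustration; they are not taken from any eMERGE phenotype.

```python
from datetime import date, timedelta


def meets_rule(measurements, threshold, window=timedelta(days=30)):
    """True if two measurements above threshold occur within the window.

    measurements is a list of (date, value) pairs. Checking adjacent pairs
    of the sorted abnormal dates is enough, because the closest pair of
    dates is always adjacent once sorted.
    """
    abnormal = sorted(day for day, value in measurements if value > threshold)
    return any(later - earlier <= window for earlier, later in zip(abnormal, abnormal[1:]))


# Hypothetical cutoff in arbitrary units, for illustration only.
CUTOFF = 1.0

patient_a = [(date(2019, 1, 1), 1.5), (date(2019, 1, 20), 1.8)]  # two high, 19 days apart
patient_b = [(date(2019, 1, 1), 1.5), (date(2019, 3, 15), 1.8)]  # two high, too far apart
patient_c = [(date(2019, 1, 1), 1.5), (date(2019, 1, 10), 0.4)]  # only one high value

print(meets_rule(patient_a, CUTOFF), meets_rule(patient_b, CUTOFF), meets_rule(patient_c, CUTOFF))
```

A cohort built from the code set alone would include all three patients, whereas applying the rule keeps only the first.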

| Method | Description |
|---|---|
| Raw Baseline | Original codes used in the eMERGE phenotypes. If data is provided as a string, it will be mapped to RxNorm. Only exact matches will be kept. These sets of codes are gold standards and reflect the basic fidelity of the translated query. |
| OMOP Baseline | Mappings from OMOP were used to map raw data from each clinical domain to a standard clinical vocabulary. Code sets will include the SNOMED mimic and optimize from the original study. |
| Exact Match | Code sets will use all OBO terms with an exact mapping (either by database cross-reference or exact string match) to a raw and OMOP baseline term. |
| Exact Match - Descendants | Code sets will use all OBO terms with an exact mapping (either by database cross-reference or exact string match) to a raw and OMOP baseline term, as well as all descendants of those mapped terms. |
| Manual Mapping | Code sets will use all OBO terms with an exact mapping (either by database cross-reference or exact string match) or a manual mapping to a raw and OMOP baseline term. |
| Manual Mapping - Descendants | Code sets will use all OBO terms with an exact mapping (either by database cross-reference or exact string match) or a manual mapping to a raw and OMOP baseline term, as well as all descendants of those mapped terms. |
| Constructed Maps | In addition to exact and manual mappings, an additional set of codes requiring knowledge engineering to match the baseline code sets was constructed. |
| Constructed Maps - Descendants | In addition to exact and manual mappings, an additional set of codes requiring knowledge engineering to match the baseline code sets was constructed. The descendants of each term in the constructed concept will be included. |
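The difference between a code set and its "Descendants" variant can be sketched as follows: start from the ontology terms that map exactly to baseline codes, then add every term reachable below them in the ontology hierarchy. The identifiers, the mapping dictionary, and the parent-to-children dictionary are toy placeholders, not real OBO content.

```python
def descendants(term, children):
    """Return all terms below `term` in a parent -> children hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:  # guards against revisiting shared subtrees
                seen.add(child)
                stack.append(child)
    return seen


def build_code_sets(baseline_codes, exact_map, children):
    """Return (exact-match terms, exact-match terms plus their descendants)."""
    exact = {exact_map[code] for code in baseline_codes if code in exact_map}
    with_descendants = set(exact)
    for term in exact:
        with_descendants |= descendants(term, children)
    return exact, with_descendants


# Placeholder identifiers for illustration only.
exact_map = {"CLIN:1": "ONT:A", "CLIN:2": "ONT:B"}
children = {"ONT:A": ["ONT:A1", "ONT:A2"], "ONT:A1": ["ONT:A1a"], "ONT:B": []}
baseline_codes = ["CLIN:1", "CLIN:3"]  # CLIN:3 has no exact mapping

exact, exact_with_descendants = build_code_sets(baseline_codes, exact_map, children)
print(sorted(exact), sorted(exact_with_descendants))
```

Codes without an exact mapping, such as `CLIN:3` here, are the ones that the manual and constructed mapping methods would need to handle.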

The following mappings between clinical concepts, OMOP, and OBOs will be created:

  • Conditions
    • ICD9-CM/ICD10-CM to SNOMED-CT
    • ICD9-CM/ICD10-CM/SNOMED-CT to HPO
    • ICD9-CM/ICD10-CM/SNOMED-CT to DOID
    • ICD9-CM/ICD10-CM/SNOMED-CT to HPO+DOID
  • Medications
    • STRING to RxNorm ingredients
    • STRING/RxNorm to ChEBI
    • STRING/RxNorm to VO
    • STRING/RxNorm to NCBITaxon
  • Laboratory Tests
    • STRING to LOINC, if LOINC code not provided
    • STRING/LOINC to HPO
  • Procedures
    • CPT to ??
    • CPT/?? to NCIT

Mappings between standardized clinical concepts and the ontologies are part of the BioLater project; separate mappings exist for conditions, medications, and laboratory tests.

Challenges here include:

  • When a string rather than a code is provided in the phenotype definition (RxNorm and LOINC), how do we evaluate the mapping to OMOP?
  • What are problem sets?
  • How do we account for procedure codes?
  • When should we use pediatric vs. adult data, and how should the evaluation be performed?
  • Do we need to include non-eMERGE phenotypes, for example a phenotype that is not well-defined, as a way to show the power of this method? Could this show that a clinician-defined set of clinical concepts can return relevant patients for a given phenotype?