# Electronic phenotyping

## Introduction to electronic phenotyping

Electronic phenotyping is the process of determining and analyzing patient characteristics or conditions from electronic medical records (EMRs), also called electronic health records (EHRs).

### What is Electronic Phenotyping?
- **Definition**: Electronic phenotyping involves computational methods to determine the presence or absence of a specific condition or characteristic in a patient based on their electronic medical record. It also helps identify when the condition started and ended.
  
### Why is Electronic Phenotyping Useful?
1. **Clinical Research**: Enables the use of observational data to explore health conditions and their impacts, facilitating research into disease patterns and treatment outcomes.
2. **Clinical Trials Recruitment**: Helps in identifying and recruiting patients who meet specific criteria for clinical trials, ensuring that studies target the right patient population.
3. **Quality Metrics Calculation**: Assists healthcare systems in calculating quality metrics, which are often reported publicly and used for performance evaluation and improvement.
4. **Finding Similar Patients**: Aids in finding patients with similar conditions or characteristics, which can be useful for comparative studies or personalized treatment plans.
5. **Cross-Site Research**: Provides standardized definitions of conditions that help in sharing data and facilitating research across different sites or institutions.

### Summary
Electronic phenotyping leverages EMRs to identify patient phenotypes and track conditions over time, supporting various aspects of clinical research and healthcare quality improvement.

## Challenges in electronic phenotyping

#### Overview
Electronic phenotyping involves using electronic health records (EHRs) to identify and characterize patient phenotypes. This process relies on the accurate extraction and interpretation of patient data from a timeline of features.

#### Challenges

1. **Accuracy of Disease Codes**:
   - **Initial Assignment**: Disease codes (ICD-9, ICD-10) may be assigned during the diagnostic workup, before a definitive diagnosis is made. If the suspected condition is later ruled out, the record contains an incorrect code.
   - **Timing of Codes**: Codes may be assigned at discharge or later, which might not reflect the true onset of the condition. For instance, a condition may develop during hospitalization but only be coded upon discharge.

2. **Feature vs. Phenotype**:
   - **Feature**: Directly measured data (e.g., blood pressure readings).
   - **Phenotype**: Inferred or interpreted condition derived from features (e.g., hypertension, which considers blood pressure measurements along with additional factors like medication and lab results).
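To make the distinction concrete, the sketch below derives a hypertension flag from raw blood pressure readings plus medication data. The 140/90 mmHg thresholds, the two-reading requirement, and all names are illustrative assumptions, not a validated phenotype definition.

```python
from dataclasses import dataclass

@dataclass
class BloodPressure:
    day: int        # days since the first encounter
    systolic: int   # mmHg
    diastolic: int  # mmHg

def is_elevated(bp, sys_limit=140, dia_limit=90):
    """Feature level: one raw reading compared against assumed thresholds."""
    return bp.systolic >= sys_limit or bp.diastolic >= dia_limit

def has_hypertension(readings, on_antihypertensive, min_elevated=2):
    """Phenotype level: inferred from several features, not read off one value."""
    elevated = sum(is_elevated(bp) for bp in readings)
    return elevated >= min_elevated or on_antihypertensive

readings = [
    BloodPressure(0, 150, 95),
    BloodPressure(30, 128, 82),
    BloodPressure(60, 146, 88),
]
hypertensive = has_hypertension(readings, on_antihypertensive=False)
```

A single elevated reading is a feature; only the combination of repeated readings or treatment is interpreted as the phenotype.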

#### Key Points

- **Phenotyping Process**: Requires careful selection of data and timing to accurately capture the occurrence and timing of events of interest.
- **Feature Interpretation**: Distinguishing between raw features and derived phenotypes is crucial for accurate analysis.
- **Temporal Considerations**: The timing of data entry and code assignment can impact the phenotyping outcome, making it important to align data accurately with clinical events.


## Specifying an electronic phenotype


#### Specifying an Electronic Phenotype

1. **Criteria for Specification**:
   - **Necessary and Sufficient Conditions**: Define the required features and their values to determine if an exposure or outcome of interest occurred.
   - **Start and End Times**: Identify when the condition or exposure begins and ends. 
     - **Start Time**: Often straightforward, based on the initial appearance of relevant features.
     - **End Time**: More challenging to determine, as symptoms or conditions may persist beyond the acute phase (e.g., pneumonia symptoms lingering after infection).

2. **Defining the Phenotype**:
   - **Intended Meaning**: Clarify whether the phenotype refers to a current condition or a historical one. Accurate specification is crucial.
   - **Specification Format**: Can be in plain language, formal computer-based specifications, or specialized languages.

#### Evaluating an Electronic Phenotype

1. **Validation**:
   - **Reference Standard**: Typically involves comparison with a complete review of patient charts by trained clinicians.
   - **Challenges**: Time-consuming and may involve disagreements among clinicians.

2. **Evaluation Considerations**:
   - **Effect Magnitude**: How much error in the phenotype definition can be tolerated depends on the research question and on the size of the effect being studied.
   - **Portability**: Assess how well a phenotype definition developed at one site applies to data from another site.
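Comparison against a chart-review reference standard is usually summarized with a few agreement metrics. The sketch below assumes both the phenotype output and the clinician labels are available as lists of booleans. The choice of sensitivity, specificity, and positive predictive value (PPV) is an assumption here, since the notes do not name specific metrics.

```python
def evaluate_phenotype(predicted, reference):
    """Compare phenotype output with chart-review labels (parallel lists of bools)."""
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)          # flagged, confirmed by review
    fp = sum(p and not r for p, r in pairs)      # flagged, not confirmed
    fn = sum(not p and r for p, r in pairs)      # missed by the phenotype
    tn = sum(not p and not r for p, r in pairs)  # correctly not flagged

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }

predicted = [True, True, True, False, False, False, False, True]
reference = [True, True, False, False, False, True, False, True]
metrics = evaluate_phenotype(predicted, reference)
```

Running the same function on labeled samples from two sites gives a simple, if rough, view of portability.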

#### Key Points

- **Accurate Definition**: Essential for ensuring the phenotype reflects the intended condition and is useful for analysis.
- **Evaluation Methods**: Important for validating the phenotype and understanding its applicability across different settings.


# Two approaches to phenotyping

## Two approaches to phenotyping

#### Categories of Approaches

1. **Rule-Based Approach**:
   - **Description**: Uses explicit inclusion and exclusion criteria to define phenotypes.
   - **Construction**: Criteria are created by experts who reach consensus through iterative processes. This often involves reviewing sample case records to refine the criteria.
   - **Process**: Typically involves manual definition of rules based on clinical knowledge and data review.

2. **Probabilistic Phenotyping**:
   - **Description**: Utilizes machine learning techniques to determine the likelihood of a patient having the exposure or outcome of interest.
   - **Method**: Machine learning algorithms learn from data to assign probabilities to patient records, indicating the likelihood of the condition or exposure.
   - **Process**: Involves training models on historical data to predict outcomes or exposures, rather than relying on predefined rules.

#### Key Points

- **Rule-Based Approach**: Relies on expert knowledge and consensus, and involves defining criteria explicitly.
- **Probabilistic Phenotyping**: Leverages machine learning to estimate probabilities, potentially capturing complex patterns in the data.


## Rule-based electronic phenotyping

#### Example from PheKB: Sickle Cell Disease

1. **Phenotype Knowledge Base (PheKB)**:
   - A publicly available repository of phenotype definitions that offers examples of rule-based electronic phenotyping.

2. **Sickle Cell Disease Phenotype**:
   - **Inclusion Criteria**:
     - Presence of **ICD-9 codes**: At least one relevant code must appear in the patient’s record.
     - **Hospitalization/Visits Requirement**: The patient must have either one hospitalization or two office visits for the condition.
   - **Exclusion Criteria**:
     - **Sickle Cell Trait**: If the patient has more diagnoses for sickle cell trait than for sickle cell disease, the patient is excluded from being classified as having sickle cell disease.
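The criteria above translate fairly directly into code. This is a minimal sketch, not the PheKB implementation: the code sets are placeholders (the actual definition lists specific ICD-9 codes), and the encounter representation is assumed.

```python
from collections import Counter

# Placeholder code sets; substitute the codes listed in the published definition.
SCD_CODES = {"SCD_CODE_A", "SCD_CODE_B"}
TRAIT_CODES = {"TRAIT_CODE"}

def has_sickle_cell_disease(encounters):
    """encounters: list of (visit_type, code), visit_type is "inpatient" or "office"."""
    counts = Counter()
    for visit_type, code in encounters:
        if code in SCD_CODES:
            counts["disease"] += 1
            counts[visit_type] += 1   # count disease-coded visits by setting
        elif code in TRAIT_CODES:
            counts["trait"] += 1

    # Inclusion: at least one disease code, with one hospitalization or two office visits.
    included = counts["disease"] >= 1 and (
        counts["inpatient"] >= 1 or counts["office"] >= 2
    )
    # Exclusion: more trait diagnoses than disease diagnoses.
    excluded = counts["trait"] > counts["disease"]
    return included and not excluded

example = [("office", "SCD_CODE_A"), ("office", "SCD_CODE_B")]
result = has_sickle_cell_disease(example)
```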

#### Key Points

- **Inclusion Criteria**: Define the necessary conditions (e.g., specific ICD-9 codes and healthcare visits) to classify a patient as having the phenotype.
- **Exclusion Criteria**: Specify conditions (e.g., diagnoses of sickle cell trait) that disqualify a patient from having the phenotype.


## Examples of rule-based electronic phenotype definitions

#### Overview

1. **Flowchart-Based Algorithm**:
   - The type 2 diabetes phenotype is expressed as a flowchart (algorithm) to systematically determine whether a patient has the condition.
   - Steps in the algorithm involve checking various patient records and medical data to make a decision.

2. **Process Overview**:
   - **Step 1**: Check for a diagnosis of **type 1 diabetes**. If present, the patient is excluded from having type 2 diabetes.
   - **Step 2**: If no type 1 diabetes diagnosis is found, check for **type 2 diabetes diagnosis codes**.
   - **Step 3**: If no type 2 diabetes diagnosis is present, evaluate other factors:
     - **Prescriptions for type 2 diabetes drugs**.
     - **Relevant abnormal lab values**.
   - If these factors are present, the patient is still classified as having type 2 diabetes, even in the absence of a formal diagnosis code.
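The three steps can be sketched as a short decision function. The record keys are hypothetical. Requiring both medication and lab evidence in step 3 is also an assumption, because the summary above does not say whether one of them alone is enough.

```python
def classify_type2_diabetes(record):
    """record: dict of boolean flags for one patient. Returns True for type 2 diabetes."""
    # Step 1: a type 1 diagnosis excludes the patient.
    if record.get("t1dm_diagnosis", False):
        return False
    # Step 2: a type 2 diagnosis code is accepted directly.
    if record.get("t2dm_diagnosis", False):
        return True
    # Step 3: no diagnosis code, so fall back on drugs and labs
    # (assumed here to require both).
    return record.get("t2dm_medication", False) and record.get("abnormal_lab", False)

patient = {"t1dm_diagnosis": False, "t2dm_diagnosis": False,
           "t2dm_medication": True, "abnormal_lab": True}
is_t2dm = classify_type2_diabetes(patient)
```

The function ignores the timing of events. A fuller version would also check the order and spacing of prescriptions and lab results on the patient timeline.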

#### Key Considerations

- **Augmentation Beyond Diagnosis Codes**: Diagnosis codes alone are insufficient for accurate phenotyping. The addition of prescription information, lab values, and the **timing** of these events provides a more reliable assessment.
- **Timeline View of Patient Record**: The sequence and timing of events in the patient's medical history play a crucial role in phenotype determination.

#### Data Sources for Phenotyping

- **Billing Codes**: Useful but not always fully accurate.
- **Clinical Text and Medical Records**: Contain additional features that, when combined with codes, improve phenotype accuracy.
- **Wei et al. (2015)**: Demonstrates the relative value of these different data sources and their utility in constructing phenotype definitions for various conditions.

#### Key Points

- **Comprehensive Assessment**: Combining multiple data sources (e.g., codes, prescriptions, lab results) enhances the accuracy of phenotyping.
- **Condition-Specific**: The relative importance of different data sources varies depending on the disease or phenotype of interest.


## Constructing a rule-based phenotype definition


#### 1. **Identify Relevant Data Elements**:
   - Determine which **data elements** (e.g., symptoms, lab results, diagnoses, medications) should appear in a patient’s record to infer the condition of interest.
   
#### 2. **Convert Data Elements to Specific Identifiers**:
   - Use a **knowledge graph** to map the identified data elements to specific standardized identifiers like:
     - **ICD-9 or ICD-10 codes** (diagnosis codes)
     - **Medication codes** (e.g., NDC codes)

#### 3. **Create the Phenotype Definition**:
   - Specify the following:
     - **Necessary and Sufficient Conditions**: Define which data elements must be present for the condition (necessary) and which combinations are enough to confirm it (sufficient).
     - **Presence or Absence**: Specify which data elements must be present or absent in the patient’s record.
     - **Frequency of Occurrence**: Determine how often the elements must appear (e.g., number of visits or prescriptions).
     - **Start and End Conditions**: Identify when the condition begins and ends based on the data elements.
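One way to keep such a definition explicit is to store it as data and apply it with a small generic function. The sketch below is an assumed representation, not a standard format: `PhenotypeDefinition`, `apply_definition`, and the codes are hypothetical. It covers presence, absence, frequency, and start time, and leaves the harder end-time question open.

```python
from dataclasses import dataclass, field

@dataclass
class PhenotypeDefinition:
    required: dict                               # code -> minimum number of occurrences
    excluded: set = field(default_factory=set)   # codes that must be absent

def apply_definition(definition, events):
    """events: list of (day, code). Returns (has_phenotype, start_day)."""
    codes = [code for _, code in events]
    # Absence: any excluded code disqualifies the patient.
    if any(code in definition.excluded for code in codes):
        return False, None
    # Presence and frequency: every required code must appear often enough.
    for code, min_count in definition.required.items():
        if codes.count(code) < min_count:
            return False, None
    # Start: first day on which any required code appears.
    start = min((day for day, code in events if code in definition.required),
                default=None)
    return True, start

definition = PhenotypeDefinition(required={"DX_X": 2, "RX_Y": 1}, excluded={"DX_Z"})
events = [(3, "DX_X"), (10, "RX_Y"), (40, "DX_X")]
result = apply_definition(definition, events)
```

Keeping the definition separate from the matching logic makes each refinement cycle a change to data rather than to code.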

#### 4. **Iterate and Validate**:
   - **Comparison to Reference Standard**: Validate the definition by comparing it against a clinician-reviewed full chart (gold standard).
   - **Iterative Process**: Refine the definition in repeated cycles, asking:
     - Are the **necessary and sufficient conditions** clearly defined?
     - Does the definition correctly identify the **start and end** of the condition?
     - Are the **data elements** comprehensive and accurate for identifying the phenotype?

#### 5. **Achieve Consensus**:
   - Repeat the process of refining the definition until consensus is reached among medical experts.

#### Key Points

- **Data Elements to Codes**: Use knowledge graphs to convert clinical features into standardized identifiers.
- **Necessary/Sufficient Conditions**: Define clear criteria for presence/absence and frequency of data elements.
- **Validation and Refinement**: Compare the phenotype definition to clinical reviews, refining it iteratively for accuracy.


## Probabilistic phenotyping


#### 1. **Thought Experiment: Chief Data Scientist Role**:
   - **Challenge**: Identify patients with one of 50 different conditions within one week.
   - **Limitation of Rule-Based Phenotyping**: One week is too short to define and refine 50 rule-based phenotypes using medical experts, unless existing work can be reused.

#### 2. **Solution: Automating Phenotype Construction**:
   - **Machine Learning Approach**: Automate the construction of phenotypes using **supervised machine learning** methods.
   - **Supervised Learning Process**:
     - Start with a **training set**, where each patient is labeled as having (or not having) the condition of interest and when in their timeline this occurred.
     - Use the training set to build a **computational model** capable of classifying new, unseen patients as having or not having the condition.
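As an illustration of the supervised step, the sketch below fits a small logistic regression by gradient descent using only the standard library. The toy data, the three binary features, and the hyperparameters are assumptions chosen for illustration. It also ignores when in the timeline the condition occurred. In practice an established machine learning library would be used instead.

```python
import math

def predict_proba(weights, x):
    """Probability of the condition; the last weight is the bias term."""
    z = sum(w * v for w, v in zip(weights, x)) + weights[-1]
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, labels, epochs=500, lr=0.5):
    """Fit logistic regression weights by batch gradient descent on the log loss."""
    n = len(rows[0])
    weights = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, y in zip(rows, labels):
            error = predict_proba(weights, x) - y
            for j, v in enumerate(x):
                grad[j] += error * v
            grad[-1] += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Toy training set. Columns: [diagnosis code present, drug prescribed, abnormal lab]
rows = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0],
        [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

weights = train(rows, labels)
p_high = predict_proba(weights, [1, 1, 1])  # unseen-style patient with all evidence
p_low = predict_proba(weights, [0, 0, 0])   # patient with no evidence
```

The model outputs a probability rather than a yes/no answer, which is what makes this a probabilistic phenotype.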

#### 3. **Labeling the Data**:
   - **Challenge of Manual Labeling**: Manually labeling patient records is time-intensive, and it is exactly the expert effort that automation is meant to avoid.
   - **Need for Efficient Labeling**: The goal is to find a cost-effective way to generate labels that are **good enough** for training a machine learning classifier.

#### Key Points

- **Machine Learning for Speed**: When time is limited, machine learning can automate the phenotype construction process.
- **Training Data**: Requires labeled examples of patients for training.
- **Labeling Challenge**: Manual labeling is time-consuming; efficient, alternative labeling methods are necessary to build models quickly.


## Approaches for creating a probabilistic phenotype definition

#### 1. **Billing Code Counts as Probabilities**:
   - **Method**: Count how often **billing codes** appear for each patient.
     - **Interpretation**: Higher counts suggest a higher **probability** of having the phenotype, while a patient with only one or two occurrences may have been coded spuriously.
     - **Clustering**: Use the computed probabilities to cluster patients into those likely or unlikely to have the condition.
   - **Key Insight**: It’s often easier to get **more imperfectly labeled data** than to increase the accuracy of the labels.
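   - **Compensating for Label Errors with Sample Size**: A larger training set can compensate for errors in the labels. If a fraction $t$ of the labels is wrong (with $t < 0.5$), the training set must grow by the following factor to match what correctly labeled data would achieve:  
     $$
     \text{Expansion Factor} = \frac{1}{(1 - 2t)^2}
     $$
     Example: With a label error rate of 21% ($t = 0.21$), the factor is $1 / 0.58^2 \approx 2.97$, so a classifier that would need 1,500 correctly labeled examples needs about 4,460 noisily labeled ones.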
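The expansion factor is easy to compute directly. The sketch below assumes the 1,500 figure is the number of correctly labeled examples that would otherwise be needed, which is an interpretation of the example rather than something the notes state outright. The function names are hypothetical.

```python
import math

def expansion_factor(t):
    """Multiplier on training-set size for label error rate t, valid for 0 <= t < 0.5."""
    if not 0 <= t < 0.5:
        raise ValueError("error rate must be in [0, 0.5)")
    return 1.0 / (1.0 - 2.0 * t) ** 2

def noisy_samples_needed(clean_samples, t):
    """Round up, since a fractional patient is not possible."""
    return math.ceil(clean_samples * expansion_factor(t))

factor = expansion_factor(0.21)            # roughly 2.97
needed = noisy_samples_needed(1500, 0.21)
```

The factor grows without bound as $t$ approaches 0.5, where the labels carry no information at all.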

#### 2. **Keyword-Based Phenotyping**:
   - **Method**:
     - Start with **keywords** related to the condition (e.g., myocardial infarction).
     - Use a **knowledge graph** to expand the set of related terms.
     - Include all patients with **non-negated mentions** of the keywords on their timeline.
     - Train a classifier using these features to classify the phenotype.
   - **Performance**: This method worked well for conditions like **diabetes mellitus**, **myocardial infarction**, and **familial hypercholesterolemia**, but less so for conditions like **celiac disease**.
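The first three steps can be sketched as follows. The `RELATED_TERMS` dictionary is a hypothetical stand-in for a knowledge graph lookup, and the negation check is a deliberately crude assumption (a cue word earlier in the same sentence). Clinical text pipelines use far more careful negation detection.

```python
import re

# Hypothetical stand-in for a knowledge graph lookup.
RELATED_TERMS = {"myocardial infarction": ["heart attack", "mi"]}
# Illustrative negation cues, not an exhaustive list.
NEGATION = re.compile(r"\b(no|not|denies|without|negative for)\b")

def expand_keywords(seed):
    """Return the seed keyword plus its related terms."""
    return [seed] + RELATED_TERMS.get(seed, [])

def count_non_negated(note, terms):
    """Count term mentions with no negation cue earlier in the same sentence."""
    count = 0
    for sentence in re.split(r"[.;\n]", note.lower()):
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, sentence):
                if not NEGATION.search(sentence[:match.start()]):
                    count += 1
    return count

terms = expand_keywords("myocardial infarction")
note = ("Patient denies chest pain. History of myocardial infarction in 2010. "
        "No evidence of MI today.")
mentions = count_non_negated(note, terms)
```

Patients with at least one non-negated mention would then serve as the noisily labeled positives for training the classifier.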

#### 3. **Anchor-Based Labeling**:
   - **Anchor Concept**:
     - An **anchor** is a reliable indicator of the presence of a phenotype.
     - Seeing an anchor means a patient **very likely** has the condition, but **absence of an anchor** does not imply the absence of the condition.
   - **Use in Labeling**: Anchors serve as a **labeling function** to generate large amounts of training data for a classifier.
   - **Performance Improvement**: Track how classifier performance improves by adding more anchors.
     - Example: For **myocardial infarction**, performance improves with the first few anchors but **plateaus** after a certain number.
   - **Effort-Precision Trade-off**: A **small human effort** yields large amounts of training data and captures most of the improvement in phenotype classification performance.
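An anchor-based labeling function can be very small. In the sketch below the anchor names and patient features are hypothetical. Removing the anchors from the feature set before training is an assumed design choice, made so that a classifier has to learn from the remaining evidence instead of relearning the anchors.

```python
def label_with_anchors(patients, anchors):
    """patients: dict of patient_id -> set of observed features.

    Returns (labels, features). A label of 1 means an anchor was seen.
    A label of 0 only means no anchor was seen, so it is a noisy negative.
    """
    labels = {}
    features = {}
    for patient_id, observed in patients.items():
        labels[patient_id] = 1 if observed & anchors else 0
        features[patient_id] = observed - anchors  # hide anchors from the classifier
    return labels, features

ANCHORS = {"ANCHOR_DRUG", "ANCHOR_PHRASE"}  # hypothetical anchors
patients = {
    "p1": {"ANCHOR_DRUG", "chest pain", "troponin high"},
    "p2": {"chest pain"},
    "p3": {"ANCHOR_PHRASE", "troponin high"},
}
labels, features = label_with_anchors(patients, ANCHORS)
```

Adding an anchor is a one-line change to `ANCHORS`, which is why a small amount of expert effort can relabel a very large cohort.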

#### Key Points

- **Billing Code Counts**: Higher counts increase confidence in the phenotype classification.
- **Imperfect Data**: Using large datasets and efficient labeling strategies (e.g., keyword expansion or anchors) can yield accurate classifiers despite imperfect labels.
- **Anchor-Based Approach**: Anchors provide reliable but incomplete labeling, striking a balance between **human effort** and **classifier accuracy**.


## Software for probabilistic phenotype definitions


**Aphrodite** is an open-source software package that automates phenotype definition and identification in healthcare data.

#### Key Features:
1. **Open Source**: Available freely on GitHub.
2. **Integration with Standardized Models**: 
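   - Uses the **OMOP Common Data Model (CDM) version 5**, maintained by the OHDSI community (Observational Health Data Sciences and Informatics, pronounced "Odyssey").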
   - Leverages **vocabulary version 5** for standardized medical terms and codes.
3. **Automated Phenotyping Workflow**: 
   - Handles the entire process from **definition** to **identification**, **training**, and **evaluation** of phenotypes.
   - Implements **labeling processes** (such as those discussed, including billing code counts, keyword expansion, and anchor-based labeling).

#### Functionality:
- **Automates** many of the labor-intensive tasks associated with phenotype construction.
- **Simplifies** the use of machine learning techniques to label and classify patient records based on phenotypes.
  
**Aphrodite** lets healthcare data scientists build phenotype classifiers with machine learning and standardized vocabularies while avoiding much of the manual labeling work.