
# Harnessing Electronic Health Records for Real World Evidence

This page provides the resources and tools, drawn from the available biomedical scientific literature, that are referenced in *Harnessing Electronic Health Records for Real World Evidence*.


## Background and Flowchart

In this study, we outline an integrated pipeline to improve the resolution of EHR data and enable researchers to perform robust analyses with high-quality EHR data for RWE generation. Our pipeline has four modules: 1) creating meta-data for harmonization, 2) cohort construction, 3) variable curation, and 4) validation and robust modeling (Figure 1). The methods and resources integrated into the pipeline are listed below for each module. The pipeline simultaneously contributes to the creation of digital twins.

Figure 1: The integrated data curation pipeline designed to enable researchers to extract high-quality data from electronic health records (EHRs) for RWE.

## Method

### Module One: Creating Meta-Data for Harmonization

The first step in our pipeline is to perform data harmonization by mapping clinical variables of interest to relevant sources of data within EHRs. To make this mapping more efficient and transparent, we propose an automated NLP-based method for data harmonization. This approach can streamline the process and improve accuracy in identifying the clinically relevant concepts.

#### Concept Identification

Identify the medical concepts associated with the clinical variables from the RCT documents using existing clinical NLP software.

| Use | Methods | Links | References |
| --- | --- | --- | --- |
| Identify medical concepts from RCT documents | MetaMap (Code, Ref) | Tools: MetaMap | MetaMap: Mapping Text to the UMLS Metathesaurus |
| | HPO (Code, Ref) | The Human Phenotype Ontology | The Human Phenotype Ontology in 2021 |
| | NILE (Code, Ref) | Narrative Information Linear Extraction (NILE) | NILE: Fast Natural Language Processing for Electronic Health Records |
| | cTAKES (Code, Ref) | Apache cTAKES | Entity Extraction for Clinical Notes, a Comparison Between MetaMap and Amazon Comprehend Medical |
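
As a minimal, hedged illustration of this step (it does not use the MetaMap, HPO, NILE, or cTAKES APIs), the sketch below matches eligibility-criteria text against a small local term-to-CUI dictionary; the dictionary entries, CUIs, and example sentence are hypothetical placeholders.

```python
import re

# Hypothetical mini-dictionary mapping surface terms to UMLS-style CUIs.
# In practice this lookup would come from a tool such as MetaMap, NILE, or cTAKES.
CONCEPT_DICT = {
    "type 2 diabetes": "C0011860",
    "myocardial infarction": "C0027051",
    "hemoglobin a1c": "C0019018",
    "metformin": "C0025598",
}

def identify_concepts(text: str) -> list[tuple[str, str]]:
    """Return (term, cui) pairs found in the input text via exact string matching."""
    text_lower = text.lower()
    hits = []
    for term, cui in CONCEPT_DICT.items():
        if re.search(r"\b" + re.escape(term) + r"\b", text_lower):
            hits.append((term, cui))
    return hits

criteria = "Adults with type 2 diabetes on metformin and no prior myocardial infarction."
print(identify_concepts(criteria))
# [('type 2 diabetes', 'C0011860'), ('myocardial infarction', 'C0027051'), ('metformin', 'C0025598')]
```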

#### Concept Matching

Match the identified medical concepts to both structured and unstructured EHR data elements.

| Use | Methods | Links | References |
| --- | --- | --- | --- |
| Grouping of structured EHR data | PheWAS catalog (Code, Ref) | Phenome Wide Association Studies | PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations |
| | CCS (Resource: ICD-9-CM, ICD-10-PCS; Ref), CPT-4/HCPCS (Resource), ICD-9-CM (Resource), ICD-10-PCS (Resource) | ICD-9-CM Diagnosis and Procedure Codes; 2023 ICD-10-PCS; List of CPT/HCPCS Codes; Clinical Classifications Software (CCS) for ICD-9-CM; Clinical Classifications Software (CCS) for ICD-10-PCS (beta version) | Clinical Classifications for Health Policy Research: Version 2: Software and User's Guide (U.S. Department of Health and Human Services, Public Health Service, Agency for Health Care Policy and Research) |
| | RxNorm (Resource, Ref) | RxNorm Files | RxNorm: Prescription for Electronic Drug Information Exchange |
| | LOINC (Resource, Ref) | Download LOINC | LOINC, a universal standard for identifying laboratory observations: a 5-year update |
| Expansion and selection of relevant features using knowledge sources or co-occurrence | Expert curation | UMLS, Wikidata | The Unified Medical Language System (UMLS): integrating biomedical terminology; Freebase (database) |
| | Knowledge sources | Distributional Semantics Resources, PubMed, Merck Manual, Medscape | Exploring the application of deep learning techniques on medical text corpora |
| Matching descriptions via language model | CODER++ (Code, Ref) | | |
| Embedding from co-occurrence in EHRs | KESER (Code, App, Ref) | | Clinical Knowledge Extraction via Sparse Embedding Regression (KESER) with Multi-Center Large Scale Electronic Health Record Data |
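
To illustrate the co-occurrence-based embedding idea referenced above (in the spirit of KESER, but not its actual implementation), the sketch below builds a positive pointwise mutual information (PPMI) matrix from toy code co-occurrence counts and factorizes it with a truncated SVD; all codes and counts are made up.

```python
import numpy as np

# Toy co-occurrence counts among four EHR codes (diagonal left at zero).
codes = ["ICD:E11", "RX:metformin", "LAB:HbA1c", "ICD:I21"]
C = np.array([
    [ 0, 80, 60,  5],
    [80,  0, 50,  3],
    [60, 50,  0,  2],
    [ 5,  3,  2,  0],
], dtype=float)

# Positive pointwise mutual information (PPMI) transform of the co-occurrence matrix.
total = C.sum()
row = C.sum(axis=1, keepdims=True)
col = C.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log(C * total / (row * col))
ppmi = np.maximum(pmi, 0.0)

# Low-rank code embeddings via truncated SVD of the PPMI matrix.
U, S, _ = np.linalg.svd(ppmi)
k = 2
emb = U[:, :k] * np.sqrt(S[:k])

# Pairwise cosine similarities between the embedded codes (same order as `codes`).
normed = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
for name, vec in zip(codes, emb):
    print(name, np.round(vec, 2))
print(np.round(normed @ normed.T, 2))
```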

### Module Two: Cohort Construction

Constructing the study cohort for RWE involves identifying the patients with the condition/disease of interest, their time window for the indication, and whether they underwent the interventions studied in the RCT. EHRs contain a large amount of data, only a subset of which is relevant to the study. To avoid including unnecessary personal health identifiers in the data for analysis, we recommend a 3-phase cohort construction strategy that gradually extracts the minimally necessary data from the EHR, starting from an inclusive data mart, then narrowing to the disease cohort, and finally to the treatment arms.

#### Data Mart

The data mart is designed to include all patients with any indication of the disease or condition of interest. To achieve the desired inclusiveness, researchers should summarize a broad list of EHR variables with high sensitivity and construct the data mart to capture patients with at least one occurrence of the listed variables.

| Use | Methods | Links | References |
| --- | --- | --- | --- |
| Filter patients with diagnosis codes relevant to the disease of interest | PheWAS catalog (Code, Ref) | Phenome Wide Association Studies | PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations |
| | HPO (Code, Ref) | The Human Phenotype Ontology | The Human Phenotype Ontology in 2021 |
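
A minimal sketch of this data-mart step, assuming structured EHR records in a pandas DataFrame (the column names, codes, and screening list are hypothetical): keep every patient with at least one occurrence of a code from the high-sensitivity screening list.

```python
import pandas as pd

# Hypothetical long-format table of structured EHR records.
ehr = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 4],
    "code":       ["ICD:E11.9", "RX:metformin", "ICD:I10", "ICD:E11.65", "LAB:HbA1c", "ICD:J45"],
})

# Broad, high-sensitivity screening list for the condition of interest (illustrative only).
screening_codes = {"ICD:E11.9", "ICD:E11.65", "RX:metformin", "LAB:HbA1c"}

# Data mart = all records of patients with at least one occurrence of a screening code.
datamart_ids = ehr.loc[ehr["code"].isin(screening_codes), "patient_id"].unique()
datamart = ehr[ehr["patient_id"].isin(datamart_ids)]
print(sorted(datamart_ids))  # [1, 3]
```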

#### Disease Cohort

After the data mart is created, the next step is to identify the disease cohort: the subset of patients within the data mart who have the disease of interest. Commonly used phenotyping tools can be roughly classified as either rule-based or machine learning-based. Machine learning approaches can be further classified as weakly supervised, semi-supervised, or supervised, depending on the availability of gold-standard labels for model training.

| Use | Methods | Links | References |
| --- | --- | --- | --- |
| Identify patients with the disease of interest through phenotyping | Unsupervised: Anchorexplorer (Code, Ref), Express (Code, Ref), Aphrodite (Code, Ref), PheNorm (Code, Ref), MAP (Code, Ref), sureLDA (Code, Ref) | Phenome Wide Association Studies, Anchorexplorer, Express, Aphrodite, PheNorm, MAP, sureLDA | Electronic medical record phenotyping using the anchor and learn framework; Learning statistical models of phenotypes using noisy labeled training data; Enabling phenotypic big data with PheNorm; High-throughput multimodal automated phenotyping (MAP) with application to PheWAS; A multidisease automated phenotyping method for the electronic health record |
| | Semi-supervised: AFEP (Code, Ref), SAFE (Code, Ref), PSST (Code, Ref), Likelihood approach (Code, Ref), PheCAP (Code, Ref) | SAFE, PheCAP | Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources; Surrogate-assisted feature extraction for high-throughput phenotyping; Phenotyping through Semi-Supervised Tensor Factorization (PSST); A maximum likelihood approach to electronic health record phenotyping using positive and unlabeled patients; High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP) |
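
As a rough, label-free illustration of the idea behind weakly supervised tools such as PheNorm (not their actual implementations), the sketch below normalizes a surrogate ICD count by healthcare utilization and fits a two-component Gaussian mixture to score patients; the simulated counts and the specific normalization are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Simulated surrogate data: ICD code counts and total note counts per patient.
n = 500
is_case = rng.random(n) < 0.3
icd_count = np.where(is_case, rng.poisson(8, n), rng.poisson(0.5, n))
note_count = rng.poisson(30, n) + 1

# Utilization-normalized surrogate on the log scale (the core normalization idea).
x = np.log1p(icd_count) - 0.5 * np.log1p(note_count)

# Two-component Gaussian mixture: one component for likely cases, one for non-cases.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x.reshape(-1, 1))
case_comp = int(np.argmax(gmm.means_.ravel()))
prob_case = gmm.predict_proba(x.reshape(-1, 1))[:, case_comp]

# Compare against the simulated truth (only possible here because the data are simulated).
print("mean predicted prob among true cases:", prob_case[is_case].mean().round(2))
print("mean predicted prob among controls:  ", prob_case[~is_case].mean().round(2))
```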

#### Treatment Arms and Timing

With a given disease cohort, one may proceed to identify patients who received the relevant treatments, which are typically medications or procedures.

| Use | Methods | Links | References |
| --- | --- | --- | --- |
| Identify indication conditions before treatment | Phenotyping with temporal input (Code: MSMR, TSPM, AgeMatters; Ref) | MSMR, TSPM, AgeMatters | High-throughput phenotyping with temporal sequences |
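
A small pandas sketch of this treatment-arm step (the tables, column names, codes, and drug name are hypothetical): take each patient's first exposure to the study drug and keep only patients whose earliest indication diagnosis precedes that treatment date.

```python
import pandas as pd

dx = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "code": ["ICD:E11", "ICD:E11", "ICD:I10", "ICD:E11"],
    "date": pd.to_datetime(["2019-01-10", "2020-06-01", "2020-01-15", "2021-03-01"]),
})
rx = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "drug": ["semaglutide", "semaglutide", "semaglutide"],
    "date": pd.to_datetime(["2019-05-02", "2020-03-20", "2021-01-05"]),
})

# First indication date per patient (diagnosis code for the disease of interest).
first_dx = dx[dx["code"] == "ICD:E11"].groupby("patient_id")["date"].min().rename("first_dx")

# First exposure to the study drug per patient.
first_rx = rx[rx["drug"] == "semaglutide"].groupby("patient_id")["date"].min().rename("first_rx")

# Treatment arm: patients whose indication is documented before treatment initiation.
arm = pd.concat([first_dx, first_rx], axis=1).dropna()
arm = arm[arm["first_dx"] < arm["first_rx"]]
print(arm)  # only patient 1 satisfies indication-before-treatment in this toy data
```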

### Module Three: Variable Extraction

RCT emulation with EHR data generally requires three categories of data elements: 1) the endpoints measuring the treatment effect; 2) the eligibility criteria used to match the RCT population; and 3) the confounding factors needed to correct for treatment-by-indication biases inherent in real-world data. Below, we describe the classification and extraction of the first two types; confounding is addressed in Module 4.

#### Extraction of Baseline Variables or Endpoints

| Use | Methods | Links | References |
| --- | --- | --- | --- |
| Extraction of binary variables through phenotyping | Same as "Identify patients with the disease of interest through phenotyping" above | Same as above | Same as above |
| Extraction of numerical variables through NLP | EXTEND (Code, Ref), NILE (Code, Ref) | EXTEND, NILE | EXTraction of EMR numerical data: an efficient and generalizable tool to EXTEND clinical research; Performance of a Machine Learning Algorithm Using Electronic Health Record Data to Identify and Estimate Survival in a Longitudinal Cohort of Patients With Lung Cancer |
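
As a hedged sketch of the kind of numerical extraction that tools such as EXTEND automate (this regex approach is far simpler than the actual tools), the snippet below pulls ejection-fraction values out of invented note text.

```python
import re

notes = [
    "Echo today: LVEF 35% with moderate MR.",
    "Ejection fraction estimated at 55 %. No wall motion abnormality.",
    "EF: 40-45%, improved from prior study.",
]

# Match common ejection-fraction phrasings followed by a number (optionally a range).
EF_PATTERN = re.compile(
    r"\b(?:LVEF|EF|ejection fraction)\b[^0-9]{0,20}(\d{1,2})(?:\s*-\s*\d{1,2})?\s*%",
    flags=re.IGNORECASE,
)

for note in notes:
    m = EF_PATTERN.search(note)
    print(int(m.group(1)) if m else None)
# prints 35, then 55, then 40 (the lower bound of the range in the last note)
```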

#### Extraction of Baseline Variables

| Use | Methods | Links | References |
| --- | --- | --- | --- |
| Extraction of radiological characteristics through medical AI | Same as "Identify patients with the disease of interest through phenotyping" | organs, blood vessels, neural system, CS-Net (Code, Ref), DeepLung (Code, Ref), nodule detection, cancer staging, fractional flow reserve | Abdominal multi-organ segmentation with organ-attention networks and statistical fusion; Blood vessel segmentation algorithms - Review of methods, datasets and evaluation metrics; Segmentation of Corneal Nerves Using a U-Net-Based Convolutional Neural Network; Channel and Spatial Attention Network for Curvilinear Structure Segmentation; Automated pulmonary nodule detection in CT images using deep convolutional neural networks; DeepLung: Deep 3D Dual Path Nets for Automated Pulmonary Nodule Detection and Classification; Diagnostic accuracy of a deep learning approach to calculate FFR from coronary CT angiography; Diagnostic accuracy of 3D deep-learning-based fully automated estimation of patient-level minimum fractional flow reserve from coronary computed tomography angiography |

#### Extraction of Endpoints

| Use | Methods | Links | References |
| --- | --- | --- | --- |
| Extraction of event time through incidence phenotyping | Unsupervised: AC_TPC (Code, Ref) | AC_TPC (Code, Ref) | Disease progression modeling using Hidden Markov Models; Temporal Phenotyping using Deep Predictive Clustering of Disease Progression |
| | Semi-supervised: SAMGEP (Code, Ref) | SAMGEP (Code, Ref) | SAMGEP: A novel method for prediction of phenotype event times using the electronic health record; Semi-supervised Approach to Event Time Annotation Using Longitudinal Electronic Health Records |
| | Supervised | | Determining the Time of Cancer Recurrence Using Claims or Electronic Medical Record Data; Detecting Lung and Colorectal Cancer Recurrence Using Structured Clinical/Administrative Data to Enable Outcomes Research and Population Health Management |
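
A simple unsupervised sketch of the incidence-phenotyping idea (not AC_TPC or SAMGEP themselves): given one patient's monthly counts of surrogate codes, take the first month at which a smoothed count reaches a threshold as the estimated event time; the counts, window, and threshold are illustrative.

```python
import numpy as np

# Monthly counts of surrogate codes for one patient (e.g., recurrence-related codes).
months = np.arange(1, 13)
monthly_counts = np.array([0, 0, 0, 0, 1, 0, 3, 4, 2, 5, 3, 4])

# Estimate the event month as the first month where a 3-month moving average reaches the threshold.
threshold = 2
window = 3
smoothed = np.convolve(monthly_counts, np.ones(window) / window, mode="same")
above = smoothed >= threshold
event_month = int(months[above.argmax()]) if above.any() else None

print("smoothed counts:", np.round(smoothed, 2))
print("estimated event month:", event_month)  # month 7 for this toy trajectory
```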

### Module Four: Validation and Robust Modeling

Confounding factors, i.e., variables that affect both the treatment assignment and the outcome, must be properly adjusted for. To minimize bias, the pipeline should include 1) validation for optimizing the medical informatics tools in Modules 2 and 3; 2) analyses robust to remaining data errors; and 3) comprehensive confounding adjustment.

#### Robust Analysis and Adjustment

| Use | Methods | Links | References |
| --- | --- | --- | --- |
| Efficient and robust estimation of treatment effects with partially annotated noisy data | SMMAL (Code, Ref) | | Efficient and Robust Semi-supervised Estimation of ATE with Partially Annotated Treatment and Response |
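
For the robust-analysis step, the sketch below shows a generic augmented inverse probability weighting (AIPW) estimator of the average treatment effect, a standard doubly robust adjustment; it is not the SMMAL method and ignores the semi-supervised, partially annotated aspect. The simulated data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)

# Simulated confounded observational data: X affects both treatment A and outcome Y.
n = 2000
X = rng.normal(size=(n, 3))
propensity = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
A = rng.binomial(1, propensity)
Y = 1.0 * A + X @ np.array([0.8, -0.4, 0.3]) + rng.normal(size=n)  # true ATE = 1.0

# Nuisance models: propensity score and outcome regressions fit within each arm.
ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
mu1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)
mu0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)

# AIPW (doubly robust) estimator of the average treatment effect.
aipw = (mu1 - mu0
        + A * (Y - mu1) / ps
        - (1 - A) * (Y - mu0) / (1 - ps))

print("naive difference in means:", round(Y[A == 1].mean() - Y[A == 0].mean(), 2))
print("AIPW estimate of ATE:     ", round(aipw.mean(), 2))
```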
