# The healthcare system

## Review of the healthcare system

### Data Sources in Healthcare

1. **Electronic Health Records (EHRs)**:
   - **Content**: Patient demographics, medical history, diagnoses, medications, lab results, and visit notes.
   - **Uses**: Can be used to study patterns of disease progression, treatment effectiveness, and patient outcomes.

2. **Claims Data**:
   - **Content**: Billing records, insurance claims, and reimbursement information.
   - **Uses**: Useful for analyzing healthcare utilization, costs, and outcomes across different populations.

3. **Clinical Trials Data**:
   - **Content**: Data collected during clinical trials, including patient responses to treatments and experimental conditions.
   - **Uses**: Ideal for evaluating the efficacy and safety of new treatments or interventions.

4. **Patient Registries**:
   - **Content**: Longitudinal data about patients with specific conditions or treatments.
   - **Uses**: Helpful for studying long-term outcomes and effectiveness of interventions in specific populations.

5. **Genomic Data**:
   - **Content**: Information about genetic variations, including single nucleotide polymorphisms (SNPs) and gene expression.
   - **Uses**: Can be used to investigate genetic predispositions to diseases and responses to treatments.

6. **Wearable Devices**:
   - **Content**: Continuous monitoring data, such as heart rate, activity levels, and sleep patterns.
   - **Uses**: Useful for studying real-time health metrics and their impact on long-term health outcomes.

### Types of Data and Their Accuracy Issues

1. **Missing Data**:
   - **Problem**: Data may be incomplete due to missed appointments, unrecorded symptoms, or lost information.
   - **Solution**: Use imputation methods to estimate missing values, or use statistical techniques that handle missing data.

2. **Data Entry Errors**:
   - **Problem**: Human errors during data entry can lead to inaccuracies.
   - **Solution**: Implement validation checks and cross-verify with other data sources.

3. **Systematic Bias**:
   - **Problem**: Certain groups may be underrepresented or overrepresented in the data due to healthcare access disparities.
   - **Solution**: Use stratified sampling or weighting techniques to adjust for these biases.

4. **Measurement Error**:
   - **Problem**: Instruments and methods used to collect data may not be perfectly accurate.
   - **Solution**: Calibrate instruments regularly and apply statistical adjustments to account for known biases.

5. **Data Integration Challenges**:
   - **Problem**: Combining data from different sources can be difficult due to differing formats and standards.
   - **Solution**: Use data standardization and integration tools to harmonize data across sources.
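
Two of the solutions above, validation checks and imputation, can be sketched in a few lines. This is a minimal illustration, not a recommended pipeline: the records, field names, and plausibility ranges are all hypothetical, and median imputation is just one simple choice among many.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
records = [
    {"patient_id": "p1", "age": 54, "systolic_bp": 128},
    {"patient_id": "p2", "age": 61, "systolic_bp": None},  # missing measurement
    {"patient_id": "p3", "age": 430, "systolic_bp": 141},  # likely data entry error
    {"patient_id": "p4", "age": 47, "systolic_bp": 119},
]

# Assumed plausibility ranges; real limits should come from clinical input.
VALID_RANGES = {"age": (0, 120), "systolic_bp": (50, 300)}


def validate(record):
    """Return the names of fields whose values fall outside the plausible range."""
    problems = []
    for field, (low, high) in VALID_RANGES.items():
        value = record[field]
        if value is not None and not (low <= value <= high):
            problems.append(field)
    return problems


def impute_median(rows, field):
    """Return copies of rows with missing `field` values replaced by the observed median."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


# Map each patient with at least one implausible value to the offending fields.
flagged = {r["patient_id"]: validate(r) for r in records if validate(r)}
imputed = impute_median(records, "systolic_bp")

print(flagged)
print(imputed[1])
```

Flagged values are reported rather than silently corrected, so that a reviewer can decide whether to fix, drop, or cross-verify them against another source.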

### Working with Inaccurate Data

1. **Data Cleaning**: 
   - Perform rigorous data cleaning to identify and correct errors. This might include removing duplicates, fixing inconsistencies, and normalizing data.

2. **Exploratory Data Analysis (EDA)**:
   - Conduct EDA to understand the data's characteristics and identify any obvious issues or patterns that could affect the analysis.

3. **Robust Statistical Methods**:
   - Use robust statistical methods that can handle imperfections in the data, such as non-parametric tests or models that are less sensitive to outliers.

4. **Sensitivity Analysis**:
   - Conduct sensitivity analysis to determine how changes in data accuracy impact the results of your research.

5. **Transparency**:
   - Clearly document any data limitations and potential sources of bias in your research findings.
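
A small sketch can make the points about robust methods and sensitivity analysis concrete. The length-of-stay values below are invented, and the cutoff used to drop the suspect value is an arbitrary assumption. The idea is simply to recompute a summary with and without a questionable record and see how far each statistic moves.

```python
from statistics import mean, median

# Hypothetical length-of-stay values in days; 300 stands in for a suspected entry error.
length_of_stay = [2, 3, 3, 4, 5, 300]
SUSPECT_CUTOFF = 100  # assumed threshold for "implausible" in this toy example


def summarize(values):
    """Return the mean and median of a list of numbers."""
    return {"mean": mean(values), "median": median(values)}


with_suspect = summarize(length_of_stay)
without_suspect = summarize([v for v in length_of_stay if v < SUSPECT_CUTOFF])

# How much does each statistic change when the suspect value is removed?
shift = {name: with_suspect[name] - without_suspect[name] for name in with_suspect}
print(with_suspect, without_suspect, shift)
```

In this toy case the mean moves by roughly 49 days while the median moves by half a day, which is why a median-based summary is considered more robust to a single bad record.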

By understanding the types of data available and their potential inaccuracies, you can design more effective studies and interpret results with a more critical eye.

## Review of key entities and the data they collect

The healthcare system involves many key players. The data each entity generates, and how that data is used, are summarized below:

### Key Entities and Their Data

1. **Patient**:
   - **Data Generated**: Health status, symptoms, search queries about health conditions, decisions to seek or avoid care.
   - **Data Use**: Patient data can provide insights into health behaviors, treatment-seeking patterns, and health literacy. Internet search data can reveal public health concerns and trends.

2. **Healthcare Providers**:
   - **Data Generated**: Patient diagnoses, treatment plans, laboratory test results, imaging results, prescriptions, and clinical notes.
   - **Data Use**: This data is used for patient care, longitudinal health tracking, and improving clinical outcomes. It also contributes to research and quality improvement initiatives.

3. **Pharmacies**:
   - **Data Generated**: Prescription records, medication dispensed, patient adherence to medications.
   - **Data Use**: Pharmacies track medication use and adherence, which can be crucial for understanding treatment efficacy and patient outcomes.

4. **Pharmaceutical Companies**:
   - **Data Generated**: Drug development data, clinical trial results, manufacturing data.
   - **Data Use**: This data helps in drug development, regulatory submissions, and understanding drug effectiveness and safety.

5. **Medical Device Companies**:
   - **Data Generated**: Device usage data, clinical outcomes associated with device use, manufacturing data.
   - **Data Use**: Information on device performance and patient outcomes helps in improving device design and ensuring safety.

6. **Pharmacy Benefits Management Companies**:
   - **Data Generated**: Claims data, medication utilization, cost data.
   - **Data Use**: They manage drug benefits, track medication costs, and work on optimizing drug formularies to balance cost and patient care.

7. **Insurance Companies**:
   - **Data Generated**: Claims data, patient coverage information, healthcare utilization data.
   - **Data Use**: Insurance data helps in understanding healthcare costs, patient access to care, and managing risk pools.

8. **Government Agencies**:
   - **Data Generated**: Public health data, disease surveillance data, healthcare utilization data, regulatory data.
   - **Data Use**: Governments use this data for policy making, health monitoring, and ensuring compliance with regulations. It also helps in identifying underserved populations and assessing healthcare access.

9. **Medical Researchers**:
   - **Data Generated**: Research findings, clinical trial data, patient outcomes data.
   - **Data Use**: Researchers analyze data to advance scientific knowledge, guide clinical practices, and influence policy decisions.

### Systemic Complexities

- **Conflicting Interests**: Different entities may have competing priorities (e.g., cost control vs. patient outcomes) that can impact data reporting and utilization.
- **Data Integration**: Combining data from diverse sources can be challenging due to differences in data formats, standards, and privacy concerns.
- **Regulation and Oversight**: Government agencies monitor compliance and regulate practices, but the diverse interests and interactions among entities can complicate regulatory efforts.

Understanding these interactions and the data each entity generates helps in forming a holistic view of the healthcare system and its challenges. This perspective is crucial for addressing systemic issues and improving healthcare delivery.

## Actors with different interests

Here’s a summary of the perspectives and how they influence research:

### Perspectives of Primary Actors

1. **Patients and Their Families**:
   - **Questions**: How can I stay healthy? What’s wrong with me? How can I get better?
   - **Interests**: Focus on personal health and well-being, often driven by immediate concerns and quality of life.
   - **Data Influence**: Patient-generated data (e.g., symptoms, treatment preferences) can be subjective and may not always align with clinical or cost-effectiveness perspectives.

2. **Healthcare Professionals**:
   - **Questions**: What is the best practice? How can I maximize benefit while minimizing harm?
   - **Interests**: Concerned with clinical effectiveness, patient safety, and evidence-based practices. They must balance individual patient needs with broader clinical guidelines.
   - **Data Influence**: Clinician decisions are guided by clinical data, treatment protocols, and patient history. Their data may focus on efficacy, safety, and patient outcomes.

3. **Payers (Insurers and Government)**:
   - **Questions**: What is best for society? What is cost-effective? What is equitable and just?
   - **Interests**: Focus on cost containment, maximizing value, and ensuring equitable access to care. They must balance individual patient needs with broader economic considerations.
   - **Data Influence**: Claims data, cost analyses, and utilization patterns are critical. Data may be influenced by cost constraints and coverage policies.

4. **Researchers**:
   - **Questions**: How can we improve future care? What does the data tell us about treatment effectiveness and safety?
   - **Interests**: Aim to advance scientific knowledge and improve patient care. They must ensure data privacy and ethical considerations while striving for comprehensive and accurate findings.
   - **Data Influence**: Research data must be detailed and robust. The focus is on understanding trends and outcomes, but there may be limitations in data completeness and generalizability.

### Addressing Different Interests

When defining research questions and selecting data sources, consider:

1. **Stakeholder Benefits**:
   - Identify who will benefit from the research (e.g., patients, clinicians, payers) and align research objectives with these benefits.
   - For example, studying treatment effectiveness might benefit patients and clinicians directly, while cost-effectiveness analyses serve payers.

2. **Potential Biases**:
   - Recognize that different data sources may reflect biases inherent to each stakeholder's interests. For instance, payer data might prioritize cost savings over patient outcomes.
   - Address these biases by using diverse data sources and triangulating findings.

3. **Multi-Audience Research**:
   - Design research questions that address multiple interests simultaneously. For example, a study on the cost-effectiveness of a new treatment can provide insights for patients, clinicians, and payers.
   - Consider how findings can inform policy, clinical practice, and patient care.

By acknowledging these varying interests and biases, you can design more balanced and impactful research. This approach ensures that research findings are relevant and useful across different facets of the healthcare system.

# Healthcare data types

## Common data types in Healthcare

The main types of healthcare data are summarized below:

### **Types of Healthcare Data**

1. **Structured Data**:
   - **Characteristics**: Organized in a consistent format, often in tables with rows and columns. Each row represents an instance (e.g., a patient), and each column represents a specific attribute (e.g., age, medical record number).
   - **Examples**: Patient demographics, lab results, and billing information.
   - **Handling Missing Data**: Missing values are often denoted by special codes like NA. Techniques like imputation or analysis methods designed to handle missing data are used to manage these gaps.

2. **Unstructured Data**:
   - **Characteristics**: Lacks a predefined format or structure. It can be more challenging to process and analyze due to its variability and complexity.
   - **Types**:
     - **Clinical Text**: Includes electronic health records (EHR) notes, discharge summaries, and progress notes. This text often uses medical jargon, abbreviations, and acronyms, making it distinct from general language. Text mining and natural language processing (NLP) are often used to extract useful information.
     - **Images**: Includes diagnostic imaging like X-rays, CT scans, and MRIs. These are large arrays of intensity values representing physical energy absorption or transmission. Image processing and machine learning techniques are applied for analysis.
     - **Signals**: Data from sensors capturing continuous measurements over time. Examples include ECGs (electrocardiograms) and EEGs (electroencephalograms). Signal processing techniques are used to analyze patterns and detect anomalies.
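
For structured data, the handling of special missing-value codes can be sketched as follows. The column names and the set of codes treated as missing (`NA`, `N/A`, `-999`, empty string) are assumptions for illustration; a real dataset's documentation should say which codes it uses.

```python
import csv
import io

# Assumed missing-value codes; check the dataset's documentation for the real ones.
NA_CODES = {"", "NA", "N/A", "-999"}

# Hypothetical structured table: one row per patient, one column per attribute.
raw = """patient_id,age,hba1c
p1,54,7.9
p2,NA,6.4
p3,61,-999
"""


def parse_value(text):
    """Convert a cell to float, mapping any missing-value code to None."""
    text = text.strip()
    if text in NA_CODES:
        return None
    return float(text)


rows = []
for row in csv.DictReader(io.StringIO(raw)):
    rows.append({
        "patient_id": row["patient_id"],
        "age": parse_value(row["age"]),
        "hba1c": parse_value(row["hba1c"]),
    })

# Count missing values per column before deciding how to handle them.
missing_counts = {field: sum(r[field] is None for r in rows) for field in ("age", "hba1c")}
print(missing_counts)
```

Converting every code to a single explicit marker up front prevents a sentinel such as `-999` from being averaged in as if it were a real measurement.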

### **Data Characteristics**

- **Timescale**:
  - **Structured Data**: Often recorded at specific intervals (e.g., lab test results).
  - **Unstructured Data**: Can be continuous or episodic, depending on the type (e.g., an ECG recording versus a clinical note).

- **Point in the Care Journey**:
  - **Structured Data**: May be collected during specific interactions (e.g., during a visit or a test).
  - **Unstructured Data**: Can be collected during various stages of care (e.g., initial assessment, ongoing monitoring).

- **Value Types**:
  - **Structured Data**: Often categorical or numerical.
  - **Unstructured Data**: Includes qualitative information, images, and time-series data.

- **Patterns of Missing Values**:
  - **Structured Data**: Missing values are typically identified and handled through specific techniques.
  - **Unstructured Data**: Missing or incomplete information can be harder to detect and interpret.

### **Timeline View**

Understanding the timeline of data collection is crucial because:

- **Temporal Context**: Helps in interpreting the sequence of events and changes in the patient’s condition over time.
- **Integration**: Facilitates the integration of data from different sources and types, providing a comprehensive view of the patient’s care journey.
- **Analysis**: Enables time-based analyses, such as assessing the progression of a condition or the impact of a treatment over time.
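
As a rough sketch of the timeline view, events from several sources can be merged and ordered by date. The three sources, the event descriptions, and the dictionary layout are all hypothetical.

```python
from datetime import date

# Hypothetical events for one patient, drawn from three assumed sources.
ehr_events = [
    {"date": date(2021, 3, 2), "source": "ehr", "event": "diagnosis: type 2 diabetes"},
    {"date": date(2021, 6, 10), "source": "ehr", "event": "lab: HbA1c 7.9%"},
]
pharmacy_events = [
    {"date": date(2021, 3, 5), "source": "pharmacy", "event": "fill: metformin, 30 days"},
]
claims_events = [
    {"date": date(2021, 3, 2), "source": "claims", "event": "office visit billed"},
]


def build_timeline(*sources):
    """Merge events from several sources into one list ordered by date, then source."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: (e["date"], e["source"]))


timeline = build_timeline(ehr_events, pharmacy_events, claims_events)
for entry in timeline:
    print(entry["date"].isoformat(), entry["source"], entry["event"])
```

Keeping the `source` field on every event preserves provenance, which matters later when two sources disagree about what happened on a given day.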

By considering these aspects, you can effectively leverage healthcare data to address research questions, improve patient care, and inform healthcare decisions.

## Strengths and weaknesses of observational data

Here's a concise summary of the strengths and weaknesses of observational data collected during routine care:

### **Strengths of Observational Data**

1. **Large Scale**:
   - **Advantage**: Often encompasses millions or even billions of records, allowing for the study of rare events and trends across large populations.

2. **Real-World Utilization**:
   - **Advantage**: Provides insights into actual treatment use and effectiveness in everyday clinical practice, reflecting real-world scenarios rather than controlled experimental conditions.

3. **Cost and Efficiency**:
   - **Advantage**: Available at relatively low additional cost and without long delays since they are collected during routine care. Updates can be straightforward and frequent.

4. **Standardization**:
   - **Advantage**: Often subject to standardized formats and protocols due to use by multiple parties, which can aid in consistency over time.

### **Weaknesses of Observational Data**

1. **Static Nature**:
   - **Limitation**: Data may not capture all relevant details or features needed for specific analyses. Additional context or features might be missing.

2. **Record Linkage Challenges**:
   - **Limitation**: Connecting data across multiple sources to form a comprehensive patient record can be complex, especially where there is no single personal identifier or integrated healthcare system.

3. **Design Limitations**:
   - **Limitation**: Originally collected for other purposes, so the data may not be ideal for research or secondary analysis. Imperfections and biases may be present.

4. **Bias and Quality Issues**:
   - **Limitation**: Data collection processes can introduce biases related to the location of care, incentives, and changes in medical practices. The data may not fully capture a patient’s health status over time, particularly for self-treatment or pre-symptom periods.

### **Considerations for Research**

- **Advantages**: Utilize the large scale and real-world applicability to gain insights into healthcare practices and outcomes.
- **Limitations**: Be aware of potential biases and data imperfections. Employ data cleaning, validation, and methodological techniques to address these issues.

Understanding these aspects helps in effectively leveraging observational data for research while being mindful of its limitations. This approach ensures that the insights gained are both valuable and reliable.

# Sources of biases and errors

## Bias and error from the healthcare system perspective


Here’s a summary of potential inaccuracies and biases in healthcare data from different entities:

### **Patient**
- **Selection Bias**: Not all health statuses are recorded, especially for those who self-treat, avoid care, or are treated outside the health system.
- **Influencing Factors**: Patient decisions are affected by their knowledge, financial status, and insurance coverage.

### **Healthcare Provider**
- **Financial Incentives**: Treatment decisions and documentation may be influenced by incentives.
- **Documentation Issues**: Records may be inaccurate or incomplete due to timing and potential biases in how treatment is recorded.

### **Coder**
- **Coding Errors**: Systematic errors or omissions in coding, mistakes by coders, or biases in assigning codes can affect the accuracy of records.
- **Billing Focus**: Medical bills may only include enough information to support reimbursement, not a complete record of treatment.

### **Linking Records**
- **Technical Challenges**: Combining data from different sources can be technically complex and may introduce additional biases.
- **Privacy Concerns**: Increasing detail in linked records raises issues with patient privacy and identification.
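
To illustrate why linking is technically fragile, here is a minimal sketch of deterministic linkage between two sources that share no patient identifier. The records, field names, and matching key (normalized surname, first initial, date of birth) are assumptions chosen for illustration; real linkage projects often use probabilistic methods and privacy-preserving identifiers instead.

```python
def link_key(record):
    """Build a match key from normalized surname, first initial, and date of birth."""
    surname = "".join(ch for ch in record["surname"].lower() if ch.isalpha())
    initial = record["first_name"].strip().lower()[:1]
    return (surname, initial, record["dob"])


# Hypothetical records from two sources with no shared patient identifier.
ehr = [
    {"ehr_id": "E1", "surname": "O'Neil", "first_name": "Mary", "dob": "1950-04-12"},
    {"ehr_id": "E2", "surname": "Smith", "first_name": "John", "dob": "1972-09-30"},
]
claims = [
    {"claim_id": "C7", "surname": "ONEIL", "first_name": "M.", "dob": "1950-04-12"},
    {"claim_id": "C8", "surname": "Smith", "first_name": "Jon", "dob": "1972-09-03"},
]

# Index one source by key, then look up each record from the other source.
ehr_index = {link_key(r): r["ehr_id"] for r in ehr}
links = [(ehr_index[link_key(c)], c["claim_id"]) for c in claims if link_key(c) in ehr_index]
unlinked = [c["claim_id"] for c in claims if link_key(c) not in ehr_index]

print(links)
print(unlinked)
```

The first pair links despite differences in punctuation and capitalization, but the second does not because two digits in the date of birth are transposed. Missed links like this are one way linkage can introduce bias, since patients with messier records drop out of the linked dataset.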

## Bias and error of exposures and outcomes

### **Exposures**
- **Definition**: Factors or events that can affect a patient's health, such as diseases, treatments, or medical procedures.
- **Potential Biases**: Biases may arise from how exposures are recorded or reported, which could affect the reliability of the data. For example, certain exposures might be underreported or inaccurately documented.

### **Outcomes**
- **Definition**: Conditions or results that are measured after the exposure, such as complications, laboratory test results, or healthcare costs.
- **Potential Biases**: Errors or biases can occur in measuring or recording outcomes. For instance, outcomes might be influenced by the timing of data collection or differences in how outcomes are defined and documented.

### **Framework for Identifying Biases and Errors**
- **Relating Exposures to Outcomes**: Helps in understanding the causal relationships or associations between exposures and outcomes.
- **Quantifying Rates**: Involves calculating the frequency of exposures and outcomes, which can reveal biases in how often they are recorded or observed.

This framework aids in pinpointing where biases and errors might occur, allowing for more accurate and reliable analysis of healthcare data.
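
As a worked example of quantifying rates and relating an exposure to an outcome, the sketch below computes the risk in each group and compares them. The counts are invented for illustration.

```python
def risk(events, total):
    """Proportion of people in a group who experienced the outcome."""
    return events / total


# Hypothetical counts from an observational dataset.
exposed_with_outcome, exposed_total = 30, 1000
unexposed_with_outcome, unexposed_total = 20, 2000

risk_exposed = risk(exposed_with_outcome, exposed_total)
risk_unexposed = risk(unexposed_with_outcome, unexposed_total)

risk_ratio = risk_exposed / risk_unexposed        # relative measure of association
risk_difference = risk_exposed - risk_unexposed   # absolute measure of association

print(f"risk in exposed:   {risk_exposed * 1000:.1f} per 1,000")
print(f"risk in unexposed: {risk_unexposed * 1000:.1f} per 1,000")
print(f"risk ratio:        {risk_ratio:.2f}")
print(f"risk difference:   {risk_difference * 1000:.1f} per 1,000")
```

Both the numerators and the denominators come from recorded data, so under-recording of either exposures or outcomes in one group changes these figures even when the true risks are unchanged.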

## How a patient's exposure might be misclassified

### **Misclassification of Medication Exposure**

1. **Free Samples**:
   - **Issue**: Free samples are not recorded in pharmacy claims data.
   - **Result**: Leads to gaps in the recorded medication use.

2. **Gaps in Data**:
   - **Issue**: Gaps can occur due to delays in refills or missed doses.
   - **Result**: Claims data may show discrepancies in the medication timeline.

3. **Incomplete Records**:
   - **Issue**: Medication usage might be recorded as longer than actual use if a patient stops taking the medication before it runs out.
   - **Result**: Claims data may inaccurately reflect the duration of exposure.

### **Cleanup Rules**

- **Purpose**: Aim to fill in gaps and produce more accurate estimates of medication exposure.
- **Example Rule**: Fill in gaps of less than seven days in the records.
- **Limitation**: Cleanup rules are not perfect and may not fully correct discrepancies.
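
The example rule can be sketched as code. This is one possible reading of the rule, under stated assumptions: each fill is a hypothetical `(fill_date, days_supply)` pair, supply is assumed to start on the fill date, and a gap is measured from the first day without supply to the next fill.

```python
from datetime import date, timedelta

MAX_GAP_DAYS = 7  # gaps shorter than this are filled, per the example rule


def exposure_episodes(fills, max_gap_days=MAX_GAP_DAYS):
    """Merge pharmacy fills into continuous exposure episodes.

    `fills` is a list of (fill_date, days_supply) tuples. Each episode is a
    (start, end) pair where `end` is the first day without supply. A new
    episode starts only when the gap before the next fill is at least
    `max_gap_days`.
    """
    episodes = []
    for fill_date, days_supply in sorted(fills):
        supply_end = fill_date + timedelta(days=days_supply)
        if episodes and (fill_date - episodes[-1][1]).days < max_gap_days:
            # Short gap (or early refill): extend the current episode.
            start, end = episodes[-1]
            episodes[-1] = (start, max(end, supply_end))
        else:
            episodes.append((fill_date, supply_end))
    return episodes


# Hypothetical fills: a 3-day gap (filled in) followed by a 27-day gap (kept).
fills = [
    (date(2022, 1, 1), 30),
    (date(2022, 2, 3), 30),
    (date(2022, 4, 1), 30),
]
episodes = exposure_episodes(fills)
for start, end in episodes:
    print(start.isoformat(), "to", end.isoformat())
```

The sketch shows the limitation noted above: it cannot see free samples, and it still counts a patient as exposed for the full supply even if they stopped taking the medication early.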

Understanding these issues helps in interpreting claims data more accurately and applying appropriate adjustments in research and analysis.

## How a patient's outcome might be misclassified

### **Misclassification of Outcomes**

1. **Issue**: Diagnosis codes might be used prematurely or remain in the record even if the patient does not have the condition.

2. **Strategies to Reduce Bias**:
   - **Multiple Mentions**: Require multiple occurrences of the diagnosis code to confirm the condition.
   - **Disease-Specific Procedure**: Require a procedure code associated with the condition.
   - **Disease-Specific Medication**: Require a specific medication to confirm the diagnosis.

3. **Validation Methods**:
   - **Manual Review**: Compare data with a clinician’s manual review of patient charts.
   - **Simulation**: Use computer simulations to model the impact of random misclassification.

4. **Electronic Phenotyping**: Combine the strategies above into explicit rules that identify, from the available data, which patients experienced the outcome.

5. **Combining Data Sources**: Use multiple sources to counteract weaknesses in individual sources and estimate error rates, with clinical expertise needed for accurate assessment.
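
A simple phenotyping rule and its validation against manual review might look like the sketch below. The rule, the patients, and the choice of metformin as a disease-specific drug are assumptions for illustration, and the two ICD-10-CM codes are only a small subset of the type 2 diabetes codes.

```python
# Illustrative subset of ICD-10-CM codes for type 2 diabetes.
DIAGNOSIS_CODES = {"E11.9", "E11.65"}
# Assumed disease-specific drug for this toy rule.
DISEASE_SPECIFIC_DRUGS = {"metformin"}


def has_condition(patient, min_mentions=2):
    """Rule: enough diagnosis mentions, or one mention plus a disease-specific drug."""
    mentions = sum(1 for code in patient["diagnosis_codes"] if code in DIAGNOSIS_CODES)
    on_drug = any(drug in DISEASE_SPECIFIC_DRUGS for drug in patient["medications"])
    return mentions >= min_mentions or (mentions >= 1 and on_drug)


def evaluate(patients):
    """Compare the rule with chart review and return PPV and sensitivity."""
    tp = sum(1 for p in patients if has_condition(p) and p["chart_review"])
    fp = sum(1 for p in patients if has_condition(p) and not p["chart_review"])
    fn = sum(1 for p in patients if not has_condition(p) and p["chart_review"])
    return {"ppv": tp / (tp + fp), "sensitivity": tp / (tp + fn)}


# Hypothetical patients; `chart_review` is the clinician's manual judgment.
patients = [
    {"id": "a", "diagnosis_codes": ["E11.9", "E11.9"], "medications": [], "chart_review": True},
    {"id": "b", "diagnosis_codes": ["E11.9"], "medications": ["metformin"], "chart_review": True},
    {"id": "c", "diagnosis_codes": ["E11.9"], "medications": [], "chart_review": False},
    {"id": "d", "diagnosis_codes": [], "medications": ["metformin"], "chart_review": False},
    {"id": "e", "diagnosis_codes": ["E11.9", "E11.65"], "medications": [], "chart_review": False},
    {"id": "f", "diagnosis_codes": ["E11.9"], "medications": [], "chart_review": True},
]

metrics = evaluate(patients)
print(metrics)
```

Patient `c` shows the rule screening out a single, possibly premature code, while `e` and `f` show that it still produces false positives and false negatives. Those errors are what the chart-review comparison is meant to quantify.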

These approaches help ensure more accurate outcomes and reduce biases in healthcare data analysis.
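
The simulation approach listed under validation methods can be sketched as a small Monte Carlo experiment. All parameters here (group size, true risks, sensitivity, specificity, random seed) are arbitrary assumptions, and the same error rates are applied to both groups, so the misclassification is random with respect to exposure.

```python
import random


def simulate(n_per_group, risk_exposed, risk_unexposed, sensitivity, specificity, seed=0):
    """Return (true_rr, observed_rr) after random outcome misclassification.

    Both groups have the same size, so the ratio of outcome counts equals
    the risk ratio.
    """
    rng = random.Random(seed)

    def count_outcomes(true_risk):
        true_count = observed_count = 0
        for _ in range(n_per_group):
            has_outcome = rng.random() < true_risk
            if has_outcome:
                recorded = rng.random() < sensitivity    # true case is captured
            else:
                recorded = rng.random() >= specificity   # false positive is recorded
            true_count += has_outcome
            observed_count += recorded
        return true_count, observed_count

    true_exp, obs_exp = count_outcomes(risk_exposed)
    true_unexp, obs_unexp = count_outcomes(risk_unexposed)
    return true_exp / true_unexp, obs_exp / obs_unexp


true_rr, observed_rr = simulate(
    n_per_group=20_000,
    risk_exposed=0.10,
    risk_unexposed=0.05,
    sensitivity=0.80,
    specificity=0.95,
)
print(f"true risk ratio:     {true_rr:.2f}")
print(f"observed risk ratio: {observed_rr:.2f}")
```

With these settings the observed risk ratio comes out closer to 1 than the true one, illustrating how random outcome misclassification can dilute a real association.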

# Healthcare data sources

## Electronic medical record data

### Sources of Clinical Data:

1. **Electronic Medical Records (EMR) / Electronic Health Records (EHR)**:
   - **Data Included**: Patient demographics, diagnosis and procedure codes, clinical notes, medication administration records, imaging, lab test results, genetic tests, and data from consumer devices.
   - **Examples**:
     - **MIMIC**: Data from critical care units at Beth Israel Deaconess Medical Center, with records for about 38,000 patients.
     - **Cerner Health Facts**: A commercially available database covering over 1.3 billion lab results, 84 million patient visits, and 151 million drug orders.

## Claims data

1. **Medical Claims / Billing Data**:
   - **Content**: Identifying patient information, insurance status, diagnosis and procedure codes, requested charges.
   - **Features**: Includes all charges covered by insurance and can follow a patient across multiple providers. However, vision and dental services are often recorded separately.
   - **Code Assignment**:
     - **US**: Codes assigned at the date of service.
     - **UK**: Codes assigned on the day of discharge.

2. **Large Data Sets**:
   - **Truven MarketScan Commercial Claims and Encounters**: Large dataset of commercial insurance claims.
   - **Optum Clinformatics Data Mart**: Another large dataset of claims data.
   - **CMS Data**: Data from Centers for Medicare & Medicaid Services, including:
     - **Medicare Fee-for-Service**: Covers traditional fee-for-service claims.
     - **Medicare Advantage**: Fixed (capitated) payments to private plans rather than per-service payments, leading to different coding incentives.

## Pharmacy

### Pharmacy Data:

1. **Content**:
   - **Prescription Records**: Details on prescriptions written, when they were filled, and payment information.
   - **Usage**: Indicates that a prescription was filled and picked up by the patient, though not necessarily taken.

2. **Types of Pharmacies**:
   - **Retail Pharmacies**: Physical locations like Walgreens or CVS.
   - **Mail-Order Pharmacies**: Deliver medications directly to the patient’s home.
   - **Online Pharmacies**: Provide prescriptions, often at lower cost, and may not have physical stores.

3. **Data Challenges**:
   - **Data Spread**: A patient's drug record may be dispersed across multiple pharmacy datasets if prescriptions are filled at different types of pharmacies.

## Surveillance datasets and Registries

### Post-Marketing Surveillance and Registries:

1. **Purpose**:
   - **Identify Adverse Events**: Detect serious side effects that occur after drug or device approval.
   - **Monitor and Respond**: Restrict use or recall products if significant issues are found.

2. **Databases and Agencies**:
   - **FDA Adverse Event Reporting System (FAERS)**: Monitors adverse events for drugs.
   - **Manufacturer and User Facility Device Experience (MAUDE)**: Tracks adverse events for medical devices.
   - **Centers for Disease Control and Prevention (CDC)**: Monitors for epidemics and emerging diseases.

3. **Professional and Disease-Specific Registries**:
   - **PRIME Registry**, **CancerLinQ**, **IRIS Registry**: Maintained by professional societies to improve service delivery and learn from patient experiences.
   - **SEER Program**: Government-managed cancer registry.
   - **American Joint Replacement Registry (AJRR)**: Privately managed registry for joint replacements.

## Population health data sets

1. **Population Health Data**:
   - **National Inpatient Sample (AHRQ)**: Hospitalization data, including resource utilization, costs, and patient outcomes.
   - **Medical Expenditure Panel Survey (MEPS)**: Survey data on medical usage and costs from patients, providers, and employers.
   - **National Health and Nutrition Examination Survey (NHANES)**: Demographic and health data from a wide sample, including nutrition information.

2. **Patient-Generated Data**:
   - **Mobile Devices & Wearables**: Health data from phones and wearables, which patients can share with doctors or researchers.
   - **Patient-Reported Outcomes**: Self-reported data on symptoms, diagnoses, treatments, and outcomes, often shared on social networks or online portals. These data sets may be used for research on patient experiences.

3. **Research and Clinical Trials**:
   - **Clinical Trials**: Systematic investigations using randomized controlled designs to answer questions about causality. Registered at **ClinicalTrials.gov** and crucial for advancing medical practice.
   - **Open Data for Reanalysis**: Projects like the **Yale University Open Data Access (YODA)** project provide access to clinical trial data for further research.

4. **Importance of Multiple Data Sources**:
   - No single data source gives a complete view of a patient's care timeline, so combining multiple sources helps generate more reliable research outcomes.

## A framework to assess if a data source is useful

When considering a dataset for answering a clinical question, here are key questions to ask:

1. **Is there a well-documented data model?** 
   - How are the data organized, and is the documentation accurate? Poor documentation can lead to significant additional work.

2. **Where are the data from? (Data provenance)**
   - Is the origin of the data known, and how were they collected and incorporated into the dataset?

3. **Are the data accessible?**
   - How do you obtain the data? Are there any restrictions, and is access prohibitively expensive?

4. **What are the known errors in the data?**
   - Are there missing values or other quality issues? Addressing these problems can be time-consuming.

Additional considerations include:
- **Do the data contain the patient characteristics you need to observe?**
  - If not, can you use a proxy?
  
- **Is the dataset large enough for your study, especially for rare conditions?**
  - A small dataset might not be sufficient to observe conditions with very low incidence rates.

These questions help determine the usability, quality, and appropriateness of a dataset for clinical research.
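
The dataset-size question can be made concrete with a quick calculation. This is a back-of-the-envelope sketch that treats incidence as an independent per-person probability over the observation period. The incidence and the dataset sizes are illustrative assumptions.

```python
def expected_cases(n_patients, incidence):
    """Expected number of cases in a dataset of `n_patients`."""
    return n_patients * incidence


def prob_at_least_one(n_patients, incidence):
    """Probability of observing at least one case, assuming independent patients."""
    return 1 - (1 - incidence) ** n_patients


incidence = 1 / 100_000  # assumed: one case per 100,000 people over the study period

small = {
    "expected": expected_cases(38_000, incidence),
    "p_any": prob_at_least_one(38_000, incidence),
}
large = {
    "expected": expected_cases(10_000_000, incidence),
    "p_any": prob_at_least_one(10_000_000, incidence),
}

print(f"38,000 patients:     {small['expected']:.2f} expected cases, "
      f"P(at least one) = {small['p_any']:.2f}")
print(f"10,000,000 patients: {large['expected']:.0f} expected cases, "
      f"P(at least one) = {large['p_any']:.2f}")
```

Under these assumptions a dataset of tens of thousands of patients is more likely than not to contain no cases at all, while one with millions of patients would be expected to contain around a hundred.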