# Welcome

#### **Course Overview**
- **Instructors**: Professor Nigam Shah, Dr. Steve Bagley, David Magnus
- **Focus**: Using health care system data to enhance patient care.

#### **Key Topics**

1. **Understanding Research Questions**
   - **Types of Research Questions**: Identifying relevant and impactful questions.
   - **Criteria for a Good Question**: Clear, actionable, and measurable.
   - **Data Mining Workflow**: Steps to extract answers from data.

2. **Data Sources in Health Care**
   - **Types of Data**: Textual data, images, lab results, measurements.
   - **Data Accuracy**: Systematic inaccuracies and strategies to address them.

3. **Data Representations**
   - **Patient Timeline View**: Reflects the inherent time dimension of patient care.
   - **Patient Feature Matrix**: Tabular format used for analysis; forms the basis of subsequent analyses.

4. **Data Transformation**
   - **From Timeline to Matrix**: Process of converting diverse data formats into an analysis-ready matrix.
   - **Knowledge Graphs**: Use of prior biomedical knowledge to aid in data transformation.

5. **Computational Procedures**
   - **Exposure and Outcome Determination**: Methods to assess whether patients experienced relevant exposures and outcomes.

6. **Ethical Issues**
   - **Ethical Considerations**: Implications of using health care data for decision-making and patient care.

#### **Key Takeaways**
- **Data Importance**: The quality and accuracy of data are crucial for effective algorithm performance.
- **Transformation Decisions**: The process of converting data into a usable format involves critical choices that impact the results.
- **Ethical Implications**: Handling health care data responsibly is essential for ethical decision-making and patient trust.

This summary captures the main components of the course and will guide your exam preparation by highlighting the critical areas of study.

# Introduction to the data mining workflow

#### **Course Goals**
- **Objective**: Use clinical data to answer research questions and improve patient and population health.

#### **Course Structure**

1. **Choosing a Research Question**
   - **Importance**: The starting point of the data mining workflow; should address a significant problem or gap.

2. **Understanding the Healthcare System**
   - **Data Generation**: Overview of how and what types of patient data are produced within the system.

3. **Types of Data**
   - **Variety**: Includes structured data (e.g., lab results) and unstructured data (e.g., clinical notes).
   - **Processing and Analysis**: Methods to handle different data types to answer research questions.

4. **Data Mining Workflow**
   - **Five Steps**:
     1. **Pose a Research Question**: Define what you aim to discover or solve.
     2. **Identify Data Sources**: Find relevant data that can address the research question.
     3. **Extract and Transform Data**: Convert data into a format suitable for analysis.
     4. **Conduct Analysis**: Use the transformed data to perform the analysis.
     5. **Validate and Iterate**: Check the results and refine the process if needed.

5. **Focus Areas**
   - **Steps One to Three**: 
     - **Research Question**: Define and refine the question.
     - **Data Sources**: Locate and understand relevant data sources.
     - **Data Transformation**: Prepare data for analysis.

6. **Key Data Representations**
   - **Patient Timeline**: Visual representation of patient care over time.
   - **Patient Feature Matrix**: Tabular format for data analysis.
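
The relationship between the two representations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's own implementation: the event tuples, the patient IDs, the `FEATURES` list, and the function names are all hypothetical, and each feature is reduced to a simple "ever observed" indicator.

```python
from datetime import date

# Hypothetical timelines: one list of (date, kind, name) events per patient.
timelines = {
    "p1": [
        (date(2015, 3, 1), "diagnosis", "SLE"),
        (date(2015, 6, 10), "lab", "proteinuria"),
        (date(2016, 1, 5), "diagnosis", "thrombosis"),
    ],
    "p2": [
        (date(2014, 9, 20), "diagnosis", "SLE"),
    ],
}

# Columns of the feature matrix (an illustrative choice).
FEATURES = ["SLE", "proteinuria", "thrombosis"]


def to_feature_row(events, features=FEATURES):
    """Collapse one patient's timeline into a row of 0/1 indicators."""
    seen = {name for _, _, name in events}
    return [1 if feature in seen else 0 for feature in features]


def to_feature_matrix(timelines, features=FEATURES):
    """Return (patient_ids, matrix): one row per patient, one column per feature."""
    ids = sorted(timelines)
    return ids, [to_feature_row(timelines[pid], features) for pid in ids]


ids, matrix = to_feature_matrix(timelines)
```

Note that the "ever observed" encoding discards the ordering of events; which time information to keep is one of the transformation decisions the workflow has to make.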

#### **Key Concepts**
- **Data Mining Workflow**: Essential process to convert raw data into actionable insights.
- **Patient Timeline vs. Patient Feature Matrix**: Fundamental representations used in the course for understanding and analyzing health care data.

This summary captures the essential elements of the course, focusing on the data mining workflow and key data representations to help guide your studies and preparation for the exam.

# Real-Life Example: Clinical Data Mining Workflow

#### **Case Study: Laura**
- **Patient**: Laura, a teenager with Systemic Lupus Erythematosus (SLE).
- **Clinical Findings**: Recent flare-up, proteinuria (protein in urine), pancreatitis, and antiphospholipid antibodies in blood.
- **Concern**: Risk of blood clot formation.

#### **Clinical Question**
- **Question**: Should Laura receive anticoagulant medication given her symptoms and condition?

#### **Data Mining Workflow Steps**

1. **Pose a Research Question**
   - **Initial Question**: Should a teenager with SLE, proteinuria, and antiphospholipid antibodies receive anticoagulant medication?
   - **Context**: Rare condition; no direct literature or expert consensus available.

2. **Identify Data Sources**
   - **Challenge**: Lack of specific literature and expert experience due to rarity of the condition.
   - **Approach**: Use electronic medical records (EMRs) from a large academic medical center to find similar past cases.

#### **Process Overview**
- **Step One**: Define the research question based on Laura's clinical scenario.
- **Step Two**: Identify and access relevant data sources (e.g., EMRs from similar past cases).
- **Next Steps**: 
  - **Data Extraction and Transformation**: Extract relevant patient data and transform it into a usable format for analysis.
  - **Analysis**: Analyze the data to determine if similar cases benefited from anticoagulant treatment.

This example illustrates how clinicians can use historical patient data to answer clinical questions when direct evidence is unavailable. The focus is on leveraging electronic health records to inform treatment decisions.

# Example: Finding similar patients

#### **Objective**
To transform electronic medical record (EMR) data into a usable format to answer the research question about anticoagulant treatment for a patient with SLE and related symptoms.

#### **Steps in Data Extraction and Transformation**

1. **Identify Patient Population**
   - **Query for Pediatric Patients**: Use age-based criteria to select all patients under 18 years old in the EMR system.
   - **Filter for SLE Diagnosis**: Retain only those with a diagnosis code for Systemic Lupus Erythematosus (SLE).

2. **Find Patients with Proteinuria**
   - **Urine Test Values**: Search for patients with positive results for proteinuria in urine tests.

3. **Detect Antiphospholipid Antibodies**
   - **Laboratory Test Results**: Ideally, look for numeric values for antiphospholipid antibodies.
   - **Alternative Methods**: If results are in text or scanned documents, use:
     - **Proxy Marker**: Identify patients treated with aspirin, as this is often prescribed for those with antiphospholipid antibodies.
     - **Validation**: Cross-check with laboratory test results if available in a searchable format.

4. **Identify Outcome of Interest**
   - **Find Blood Clot Cases**: Search for terms related to thrombosis or blood clot using clinical expertise to define relevant terms.
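
The selection criteria above might be expressed roughly as follows. This is a sketch only: the record layout and field names (`dx_codes`, `urine_protein_positive`, `meds`, `notes`) are hypothetical, the `M32` prefix assumes the site codes diagnoses in ICD-10, and the clot term list is illustrative and would need review by a clinician.

```python
# Hypothetical, simplified records; a real EMR schema will differ.
patients = [
    {"id": "p1", "age": 15, "dx_codes": ["M32.14"], "urine_protein_positive": True,
     "meds": ["Aspirin"], "notes": "Admitted with deep vein thrombosis."},
    {"id": "p2", "age": 16, "dx_codes": ["M32.9"], "urine_protein_positive": True,
     "meds": ["aspirin"], "notes": "Routine follow-up, stable."},
    {"id": "p3", "age": 34, "dx_codes": ["M32.9"], "urine_protein_positive": True,
     "meds": ["aspirin"], "notes": "Pulmonary embolism in 2012."},
    {"id": "p4", "age": 14, "dx_codes": ["J45.909"], "urine_protein_positive": False,
     "meds": [], "notes": "Asthma review."},
]

SLE_PREFIX = "M32"  # ICD-10 category for SLE (assumes ICD-10 coding)
CLOT_TERMS = ["thrombosis", "blood clot", "embolism"]  # illustrative, not exhaustive


def has_sle(patient):
    """True if any diagnosis code falls under the SLE category."""
    return any(code.startswith(SLE_PREFIX) for code in patient["dx_codes"])


def on_aspirin(patient):
    """Proxy marker for antiphospholipid antibodies: an aspirin order exists."""
    return any(med.lower() == "aspirin" for med in patient["meds"])


def mentions_clot(patient):
    """Naive term search over the note text; ignores negation and context."""
    text = patient["notes"].lower()
    return any(term in text for term in CLOT_TERMS)


def select_cohort(patients):
    """Pediatric SLE patients with proteinuria and the aspirin proxy."""
    return [
        p for p in patients
        if p["age"] < 18 and has_sle(p) and p["urine_protein_positive"] and on_aspirin(p)
    ]


cohort = select_cohort(patients)
outcomes = {p["id"]: mentions_clot(p) for p in cohort}
```

A plain substring search will also match phrases such as "no evidence of thrombosis", which is one reason clinical review of the search terms and results matters.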

#### **Considerations**
- **Medical Expertise**: Essential for identifying correct diagnosis codes, test results, and relevant terms.
- **Automation vs. Expertise**: While automation is possible, using clinical knowledge can streamline the process and provide faster results.

#### **Current Status**
- **Progress**: The criteria for identifying patients and outcomes are defined; data extraction and transformation are now under way.

This process highlights the importance of combining medical expertise with data extraction techniques to effectively use EMR data for clinical research.

# Example: Estimating risk

#### **Analysis Objective**
Compare the risk of clotting in teenagers with SLE who also have proteinuria and antiphospholipid antibodies to the baseline risk of clotting in teenagers with SLE alone.

#### **Analysis Steps**

1. **Define the Groups**
   - **Group 1**: Teenagers with SLE, proteinuria, and antiphospholipid antibodies.
   - **Group 2**: Teenagers with SLE only (baseline risk group).

2. **Data Extraction**
   - **Extract Data**: Retrieve patient data from the EMR system for both groups. This data should include information on whether patients developed blood clots.
   - **Format**: Ensure data is in a format suitable for comparative analysis.

3. **Comparative Analysis**
   - **Calculate Risk**:
     - **For Group 1**: Determine the proportion of patients with blood clots.
     - **For Group 2**: Determine the baseline proportion of patients with blood clots.
   - **Comparison**: Compare the proportion of blood clot cases in Group 1 to the baseline risk in Group 2.

4. **Statistical Methods**
   - **Statistical Tests**: Use appropriate statistical tests to determine if the risk of clotting in Group 1 is significantly different from the baseline risk in Group 2.
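
As one possible instance of the comparison and test described above, the sketch below computes each group's risk and a two-sided Fisher's exact test, which is a common choice for small 2x2 tables. The source does not name a specific test, so this choice is an assumption, and the counts are made up purely for illustration.

```python
from math import comb


def risk(events, total):
    """Proportion of patients in a group who had the outcome."""
    return events / total


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups; columns are (clot, no clot). The p-value sums the
    probabilities of all tables that are no more likely than the observed one,
    with row and column totals held fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x outcome cases in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # small tolerance for float ties
    )


# Made-up counts for illustration only.
group1_clots, group1_total = 3, 10   # SLE + proteinuria + antibodies
group2_clots, group2_total = 2, 40   # SLE only (baseline)

risk_group1 = risk(group1_clots, group1_total)
risk_group2 = risk(group2_clots, group2_total)
p_value = fisher_exact_two_sided(
    group1_clots, group1_total - group1_clots,
    group2_clots, group2_total - group2_clots,
)
```

With cohorts as small as those expected for a rare condition, an exact test avoids the large-sample approximations behind a chi-squared test.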

#### **Considerations**
- **Bottlenecks**: Data extraction from EMR systems can be challenging; for this example, we assume data is readily available for analysis.
- **Accuracy**: Ensure that the data is accurate and complete to draw valid conclusions.

By following these steps, you can assess whether teenagers with SLE and additional symptoms have a higher risk of blood clots compared to those with SLE alone, guiding treatment decisions.

# Putting patient data on a timeline

#### **Steps to Analyze Data**

1. **Data Collection**
   - **Sources**: Electronic medical records (EMRs) including:
     - Diagnosis codes
     - Lab results
     - Medication orders
     - Clinician notes

2. **Patient Timeline Creation**
   - **Timeline Layout**: Create a timeline for each patient to arrange:
     - **Diagnosis**: When SLE was diagnosed.
     - **Lab Results**: When proteinuria and antiphospholipid antibodies were detected.
     - **Medication Orders**: When treatments (e.g., aspirin) were prescribed.
     - **Outcomes**: When blood clots occurred.

3. **Identify Relevant Patients**
   - **Pediatric Patients**: Filter the timeline to include only those under 18.
   - **Flag SLE Cases**: Mark patients diagnosed with SLE.
   - **Comorbidities**: Identify those with proteinuria and antiphospholipid antibodies.

4. **Event Tracking**
   - **Clinical Conditions**: Track when each patient developed the relevant conditions (proteinuria, antiphospholipid antibodies).
   - **Outcome Monitoring**: Track the occurrence of blood clots in relation to each condition.

5. **Calculate Relative Risk**
   - **Fraction Calculation**: 
     - Compute the fraction of patients with proteinuria and antiphospholipid antibodies who developed blood clots.
     - Compare this fraction to the baseline risk of clotting in patients with SLE alone.
   - **Risk Assessment**: Determine the relative risk as the ratio of the first fraction (patients with the additional conditions) to the baseline fraction.
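
A minimal sketch of the timeline and relative-risk steps is shown below. The `(date, event_name)` layout, the event names, and the helper functions are assumptions made for illustration; the counts passed to `relative_risk` are made up.

```python
from datetime import date


def build_timeline(events):
    """Sort one patient's (date, event_name) pairs into chronological order."""
    return sorted(events, key=lambda event: event[0])


def first_date(timeline, name):
    """Date of the first occurrence of an event, or None if it never occurs."""
    for when, event in timeline:
        if event == name:
            return when
    return None


def clot_after(timeline, exposures):
    """True if a clot is recorded after every listed exposure was first observed."""
    exposure_dates = [first_date(timeline, name) for name in exposures]
    clot_date = first_date(timeline, "clot")
    if clot_date is None or any(d is None for d in exposure_dates):
        return False
    return clot_date > max(exposure_dates)


def relative_risk(exposed_events, exposed_total, baseline_events, baseline_total):
    """Ratio of the outcome fraction in the exposed group to the baseline fraction."""
    if baseline_events == 0:
        raise ValueError("baseline risk is zero; relative risk is undefined")
    return (exposed_events / exposed_total) / (baseline_events / baseline_total)


# Hypothetical events for one patient, deliberately out of order.
raw_events = [
    (date(2016, 1, 5), "clot"),
    (date(2015, 3, 1), "SLE"),
    (date(2015, 6, 10), "proteinuria"),
    (date(2015, 7, 2), "antiphospholipid_antibodies"),
]
timeline = build_timeline(raw_events)
exposed_then_clot = clot_after(timeline, ["proteinuria", "antiphospholipid_antibodies"])

# Made-up counts: 3 of 10 exposed patients vs. 2 of 40 baseline patients.
rr = relative_risk(3, 10, 2, 40)
```

Checking that the clot follows the exposures is the reason to keep the timeline: a clot recorded before proteinuria was detected should not count toward the exposed group's risk.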

#### **Visualization**
- **Patient Timeline**: Helps visualize the sequence of events and conditions for each patient, making it easier to track the development of comorbidities and outcomes.

By using a patient timeline, you can systematically arrange and analyze patient data to determine the relative risk of blood clots in those with SLE and additional symptoms. This method provides a clear view of how each clinical condition progresses over time and its association with outcomes.

# Revisit the data mining workflow steps

#### **Clinical Question**
- **Question**: Is the risk of clotting high enough in a teenager with SLE, proteinuria, and antiphospholipid antibodies to warrant treatment with an anticoagulant?

#### **Data Source**
- **Primary Source**: Electronic Medical Record (EMR).

#### **Data Extraction and Transformation Steps**

1. **Identify Patients**
   - **Criteria**: Find teenagers with SLE.
   - **Diagnosis Codes**: Use for identifying SLE.
   - **Subgroups**: Define based on additional conditions like proteinuria and antiphospholipid antibodies.

2. **Extraction and Transformation**
   - **Diagnosis Codes**: For SLE.
   - **Clinical Expertise**: Craft searches for conditions not easily captured by codes (e.g., antiphospholipid antibodies).
   - **Proxy Terms**: Use related terms like aspirin treatment to identify patients with antiphospholipid antibodies if direct results are not available.
   - **Confirmatory Steps**: Validate proxy results with directly searchable laboratory test results when available.

3. **Analysis**
   - **Calculate Risk**:
     - **Subgroups**: Determine the incidence of blood clots in patients with SLE and additional conditions.
     - **Baseline Risk**: Compare with the risk in patients with SLE alone.
   - **Guide Treatment**: Use the comparative risk to decide on anticoagulant treatment.
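
The confirmatory step can be made concrete by comparing the proxy flag with lab-confirmed status in the subset of patients whose lab results are searchable. The sketch below assumes that subset is available as pairs of booleans; the function name and the sample records are hypothetical.

```python
def validate_proxy(records):
    """Compare a proxy flag against lab-confirmed status.

    records: iterable of (proxy_positive, lab_positive) booleans, one pair per
    patient with a searchable lab result. Returns sensitivity and positive
    predictive value (PPV), or None where the denominator is zero.
    """
    records = list(records)
    true_pos = sum(1 for proxy, lab in records if proxy and lab)
    false_pos = sum(1 for proxy, lab in records if proxy and not lab)
    false_neg = sum(1 for proxy, lab in records if not proxy and lab)

    sensitivity = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else None
    ppv = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else None
    return {"sensitivity": sensitivity, "ppv": ppv}


# Made-up pairs: (on aspirin, antiphospholipid antibodies confirmed by lab).
sample = [(True, True), (True, True), (True, False), (False, True), (False, False)]
agreement = validate_proxy(sample)
```

A low PPV would mean many aspirin-treated patients do not have the antibodies, so the proxy would inflate the subgroup; a low sensitivity would mean the proxy misses true cases.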

#### **Considerations for Further Analysis**

- **Complexity**: The introductory example used a single data source (EMR) and a few data types (text mentions, lab tests, medication records, demographic variables).
- **Expanding Data Types**: Future analyses might include genomic markers, signals from wearable devices, voice markers, sleep records, etc.
- **Expert Involvement**: Expert human judgment remains important throughout the data mining process to contextualize findings and minimize bias; automated algorithms alone are not sufficient.

#### **Course Focus**
- **Complexity and Accuracy**: Future lessons will delve deeper into each step of the data mining workflow, including managing complexity and reducing bias.

This summary captures the workflow for the clinical example and highlights the importance of human expertise and the use of diverse data types in more complex analyses.

# Types of research questions

#### **1. Descriptive Questions**
- **Purpose**: Summarize the data.
- **Example**: What proportion of the population has familial hypercholesterolemia?
- **Focus**: Counts or proportions.

#### **2. Exploratory Questions**
- **Purpose**: Identify patterns or groups within the data.
- **Example**: What are the subtypes of autism using the data at your medical center?
- **Approach**: Use clustering or other analytical methods to discover patterns. Validation required across different institutions.

#### **3. Inferential Questions**
- **Purpose**: Discover patterns that can be generalized beyond the data set.
- **Example**: Is there a relation between gut bacteria and depression in a collection of patients?
- **Focus**: Reliable patterns without necessarily identifying mechanisms.

#### **4. Predictive Questions**
- **Purpose**: Quantify relationships between features and outcomes.
- **Example**: Which patients are expected to make extensive use of costly medical resources in the coming year?
- **Focus**: Reliable associations to predict future events or outcomes.

#### **5. Causal Questions**
- **Purpose**: Determine the effect of changes in one variable on another.
- **Example**: For a patient with hypertension and asthma whose type 2 diabetes is uncontrolled with Metformin, what is the best second-line drug?
- **Focus**: Cause-and-effect relationships.

#### **6. Mechanistic Questions**
- **Purpose**: Address the underlying mechanism directly.
- **Example**: How does androgen deprivation increase dementia risk?
- **Focus**: Mechanistic understanding.

### Summary
Understanding these types of questions helps in designing research and evaluating data analysis, as well as in organizing and structuring the approach to clinical data mining.

# Research questions suited for clinical data

1. **Types of Questions and Data Suitability**:
   - **Descriptive, Exploratory, Inferential, and Predictive**: Clinical data are generally well-suited for answering these types of questions.
   - **Causal and Mechanistic**: These questions are more challenging to address with clinical data alone and often require experimental design and new data collection.

2. **Primary Goals in Medicine**:
   - **Risk Stratification**: To decide whether to treat a patient based on their risk profile.
   - **Data-Driven Treatment Selection**: To determine the most appropriate treatment for a patient using data.

3. **Purpose of Research Questions**:
   - Understanding the purpose of a research question helps in evaluating its adequacy and the suitability of the analyses required to answer it. 

This approach ensures that research questions are aligned with the available data and analytical capabilities, optimizing the ability to derive meaningful insights and make informed medical decisions.

# Example: making a decision to treat

Laura's case shows how a descriptive analysis can feed directly into a practical treatment decision.

### Analysis for Laura

1. **Type of Question Addressed**:
   - **Descriptive**: The analysis was descriptive, focusing on counts and proportions of patients with similar conditions who developed blood clots.

2. **Risk Stratification**:
   - **Assumption**: Historical data are used to estimate Laura's risk, on the assumption that past patterns in similar patients apply to her.
   - **Outcome**: This provides a risk stratification that categorizes patients into high or low risk, guiding treatment decisions.

3. **Treatment Decision**:
   - **High-Risk Category**: If Laura is classified as high risk based on the analysis, the clear treatment recommendation would be to use anticoagulants.
   - **Considerations**: It’s essential to also evaluate the risks of potential adverse effects from the treatment itself. This involves a balance between the benefits of anticoagulation and the risks associated with it.

4. **Real-World Application**:
   - In real-world scenarios, you would need to perform additional analyses to assess the risks of adverse events from the treatment. This comprehensive approach ensures that treatment decisions are well-informed and balanced.
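
The stratification and the benefit-versus-harm balance can be sketched as below. This is a deliberately simplified illustration, not clinical guidance: the threshold, the relative risk reduction, and the bleeding risk are hypothetical inputs that would have to come from further analyses, and the net-benefit formula treats a clot and a bleed as equally severe.

```python
def stratify(estimated_risk, threshold):
    """Label a patient 'high' or 'low' risk relative to a chosen threshold."""
    return "high" if estimated_risk >= threshold else "low"


def net_benefit(clot_risk, relative_risk_reduction, bleed_risk):
    """Expected clots avoided minus expected bleeds caused, per patient treated.

    Simplification: both harms are weighted equally.
    """
    return clot_risk * relative_risk_reduction - bleed_risk


# All numbers below are made up for illustration.
category = stratify(estimated_risk=0.30, threshold=0.10)
benefit = net_benefit(clot_risk=0.30, relative_risk_reduction=0.5, bleed_risk=0.02)
```

A positive value under these assumptions would favor treatment; a real decision would weight the harms by severity and account for the uncertainty in each estimate.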

### Summary

The analysis provided for Laura is an example of how descriptive questions and risk stratification can inform clinical decision-making. However, it is crucial to incorporate considerations of potential treatment risks to make fully informed decisions.

# Properties that make answering a research question useful

1. **Disease Burden**
   - **Consideration**: Evaluate how many lives are affected by the condition or question being addressed.
   - **Impact**: A question that addresses a disease with a high burden can have significant public health implications.

2. **Beneficial Effects on the Target Community**
   - **Target**: Assess whether the results will benefit clinicians, patients, or both.
   - **Goal**: Aim to generate knowledge that can be applied to improve health outcomes or clinical practices.

3. **Consequences of Answering the Question**
   - **Impact on Health**: Determine if answering the question will lead to reductions in death, morbidity, other illnesses, or healthcare costs.
   - **Access to Care**: Consider if the results will improve access to healthcare services.
   - **Stakeholder Benefits**: Evaluate how the answer might benefit various groups, including patients, providers, professionals, and payers.

### Summary

When formulating and evaluating research questions, it is important to consider:
- The **scope of the disease burden** and the number of lives impacted.
- The **potential benefits** to the target community and whether the results will have practical applications.
- The **overall consequences** of the findings, including impacts on health outcomes, healthcare spending, and access to care.

These considerations help ensure that research questions are not only scientifically relevant but also practically valuable and impactful.