# Basic Info
## Title: Physical exam findings in DKA patients as predictors of GCS severity/cerebral edema.
## Members
- Nate Hayward - Nate.Hayward@hsc.utah.edu - u6031381
- Joshua Kawasaki - J.K.Kawasaki@utah.edu - u1424902
- Andrea Stofko - Andrea.Stofko@utah.edu - u6040357

# Background and Motivation
<span style="color:blue">Discuss your motivations and reasons for choosing this project, especially any background or research interests that may have influenced your decision.</span>
* Diabetic ketoacidosis (DKA) is a severe, potentially life-threatening complication arising from untreated or poorly managed diabetes. It occurs when insufficient insulin levels prevent glucose from entering cells, causing the liver to break down fats for energy. This process generates ketone bodies, which accumulate in the blood and lead to a patient’s blood becoming more acidic.
* In advanced cases, DKA can cause severe altered mental status and require intubation. Care teams rely on the Glasgow Coma Scale (GCS/GCS-P), which evaluates eye, motor, and verbal responses, to assess a patient’s mental status. Changes in GCS scores, coupled with other clinical indicators, are critical for gauging the severity of DKA and guiding treatment decisions. Without timely intervention, DKA can cause hypokalemia, cerebral edema, cardiac arrest, organ damage, coma, and death.
* The impact of DKA is particularly pronounced in children. In the United States and Canada, approximately 30 percent of children diagnosed with type 1 diabetes mellitus (T1DM) present in DKA. Because T1DM is a chronic illness, the risk of developing DKA is lifelong, with an annual incidence of 6 to 8 percent among children with an established T1DM diagnosis.

# Project Objectives
<span style="color:blue">Provide the primary questions you are trying to answer in your project. What would you like to learn and accomplish? List the benefits.  
This should include both questions about the data and any learning objectives you would like to fulfill. In other words, there are two kinds of benefits to address.</span>
* Our primary objective is to identify physical examination characteristics that predict severe altered mental status requiring intubation, urgent imaging, or escalation of care, in pediatric patients with DKA. We hope that finding these key predictors will help providers identify which patients may require more intensive care due to their risk of progression/decompensation.
  
* We also hope to gain experience working with large datasets. Specifically, we aim to:
    * Clean, transform, and organize a large volume of raw data to answer clinical questions. This includes handling missing values, reconciling inconsistent data formats, and structuring data for analysis.
    * Use appropriate descriptive statistics and data visualizations to identify patterns in the data and present them clearly so that others can benefit from the findings.
    * Use the findings to support clinical decision making by demonstrating the importance and potential significance of physical exam findings in pediatric DKA patients.
    * Promote, if the findings are significant, high-value patient care that can be applied in any health care setting without expensive resources (i.e., the physical exam costs nothing beyond the healthcare worker’s time and may prevent unnecessary imaging or escalation of care).

# Data Description and Acquisition
<span style="color:blue">What format is your data in? How many items are there? What attributes do those items have? Are there special structures in it (e.g., networks, geographical)? From where and how are you collecting your data? If appropriate, provide a link to your data sources.</span>

* The Pediatric Emergency Care Applied Research Network (PECARN; found at https://pecarn.org/datasets/) has data available for download from a study named “Fluid Therapy and Cerebral Injury in Pediatric Diabetic Ketoacidosis (Fluid Therapy in DKA)”. Our team has already requested access, and we currently have copies of the data. This dataset, provided in 40 CSV files, contains detailed physical examination findings, laboratory values, medical history, imaging results, and demographics, amongst other values. 

* The dataset includes quantitative readings (integers and floats) and descriptive fields (strings; formatted as two columns: a common term [Normal, Abnormal, Not Assessed] and a further description).

* Example columns and values from the Physical Exam file are shown below:

![image.png](attachment:4412fce8-423b-4aac-8704-03b6e97edcfd.png)

# Ethical Considerations
<span style="color:blue">Complete a stakeholder analysis for your project.  
Who may be affected by your project and its outcomes? How could your project be used for harm?</span>
* Stakeholder Analysis:
    * Primary Stakeholders:
        * Pediatric Patients: Pediatric patients, particularly those presenting in DKA, are directly affected by this study. The findings could inform better-targeted interventions for those who require intubation or escalation of care, which could reduce morbidity and mortality.
        * Parents and Guardians: Parents are invested in the care of their children and the study findings may help improve the quality and accuracy of pediatric patient care.
    * Secondary Stakeholders:
        * Healthcare Providers: Clinicians can use the findings of this project to guide treatment decisions, such as whether a child in DKA should be intubated, receive brain imaging, or have their care escalated. The project may affect the way they assess and intervene in cases of severe altered mental status in pediatric DKA patients.
        * Healthcare Institutions: Hospitals may use the project findings to refine clinical guidelines and protocols. Additionally, improved decision-making may cut costs by reducing unnecessary workup including labs and imaging.
    * Tertiary Stakeholders:
        * Researchers and Academics: Anyone working on related projects or within the field of pediatric critical care, emergency medicine, or endocrinology who may build on these findings or validate our predictions.
        * Public Health Organizations: Such as the American Academy of Pediatrics or other bodies interested in the improved management of DKA.
          

* Ethical considerations:
    * The potential benefit of our project is significant, as it may help identify children with DKA who are at high risk of requiring intubation, urgent imaging, and intensive care. However, any predictive model must also account for the risk of false positives (overestimating the need for intubation) and false negatives (missing those who do require intubation).
    * Consideration: Regularly evaluate the accuracy and fairness of the model to ensure that it does not inadvertently cause harm by leading to unnecessary interventions or missed diagnoses. Ensure that the model recommendations are used in conjunction with clinical judgment to prevent potential harm to patients.
    * Equitable Distribution of Burdens and Benefits: The benefits of the project could greatly impact pediatric care, particularly in identifying which children with DKA require the most intensive management. However, it is important to ensure that the benefits and burdens of the research are distributed equitably.
    * Consideration: It is important to consider the diversity of the data. It is possible that the dataset is not representative of the entire pediatric population (e.g., underrepresentation of certain ethnic or socio-economic groups), and therefore the results of the model might not generalize well across all populations and could lead to inequities in care.


# Data Cleaning and Processing
<span style="color:blue">Do you expect to do substantial data cleanup or data extraction? What quantities do you plan to derive from your data? How will data processing be implemented?</span>

Our analysis will involve categorizing physical exam findings by body system and extracting specific features related to common DKA presentations to build predictive models for disease trajectory.

Currently, the data is split among 40 CSV files. The files contain patient data, including demographics, laboratory values, imaging results, and exam findings. Although the files share a standard format and can be merged on the patient ID, each will require extensive cleaning before it can be used for exploratory analysis or in the predictive model.

For instance, each aspect of the physical exam is split into two columns: the first marks the exam as normal or abnormal, and the second gives a description when the exam is abnormal. The description is free text entered by the provider. For example, in the cardio section, the PECardio column indicates whether the results were normal or abnormal, and the following column, PECardioDesc, contains a detailed description such as "Tachycardic HR 120" or "capillary refill ~ 5 seconds, weak peripheral pulses." Because these descriptions vary in wording from entry to entry, they will need to be parsed.

We will compile this data by first creating a new binary column for each part of the physical exam. For example, if the cardio portion of the exam (PECardio) has abnormal findings, it will be coded as 1, and otherwise as 0. We will then use regular expressions (RegEx) to extract the relevant information from the corresponding exam description and classify the results into clinically relevant categories. We will repeat this process for each of the 38 columns (19 parts) of the physical exam dataset.

<div style="padding-bottom: 20px;">
<img src="attachment:b55094ee-d91d-4220-8ea3-97b2a882c491.png" width="750"/>
</div>
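
The binary coding and RegEx steps described above could be sketched roughly as follows. This is a minimal illustration on toy rows, not our final implementation: the rows are made up to mirror the example descriptions, and the derived column names (PECardioAbnormal, cardio_*) and the pattern list are hypothetical placeholders that would be replaced after clinical review of the real free text.

```python
import re
import pandas as pd

# Toy rows shaped like the physical exam file; values are illustrative only.
exam = pd.DataFrame({
    "PUDID": [1, 2, 3, 4],
    "PECardio": ["Abnormal", "Normal", "Abnormal", "Not Assessed"],
    "PECardioDesc": [
        "Tachycardic HR 120",
        None,
        "capillary refill ~ 5 seconds, weak peripheral pulses",
        None,
    ],
})

# 1 if the exam section was marked abnormal, otherwise 0.
exam["PECardioAbnormal"] = (
    exam["PECardio"].str.strip().str.lower().eq("abnormal").astype(int)
)

# Hypothetical category patterns (assumption); the real list would be
# built from a review of the actual descriptions.
CARDIO_PATTERNS = {
    "tachycardia": r"tachy",
    "delayed_cap_refill": r"cap(?:illary)?\.?\s*refill",
    "weak_pulses": r"weak\b.*\bpulses?",
}

# Missing descriptions become empty strings so they never match a pattern.
desc = exam["PECardioDesc"].fillna("")
for name, pattern in CARDIO_PATTERNS.items():
    exam[f"cardio_{name}"] = (
        desc.str.contains(pattern, flags=re.IGNORECASE, regex=True).astype(int)
    )

print(exam.drop(columns=["PECardioDesc"]))
```

Note that this sketch codes "Not Assessed" as 0, following the "otherwise 0" rule above. Treating those rows as missing instead may be more appropriate and is a decision we would make during cleaning.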


The imaging dataset will be merged using the Patient ID column (PUDID). This dataset contains imaging results for patients who received an MRI or CT. The original format is shown below:

<div style="padding-bottom: 20px;">
<img src="attachment:e52c6248-6608-49de-b686-21448c2aa3fc.png" width="750"/>
</div>

An ImageDateTime column will be created for easier comparison between the two datasets. The ImageTime column will first be converted to HH:MM format and then added to the ImageDay column using the timedelta function. This process will be repeated for the physical exam time, and we will calculate the time between the physical exam and imaging for each patient.
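
One possible way to build these columns with pandas is sketched below. It assumes that ImageDay is an integer study day and that ImageTime is already an HH:MM string; the physical exam column names (PEDay, PETime, PEDateTime), the HoursExamToImage column, and the toy values are hypothetical. Under these assumptions the combined column is an elapsed time since the study reference point rather than a calendar date.

```python
import pandas as pd

def to_elapsed(day, time_hhmm):
    """Combine a study-day number and an HH:MM string into one Timedelta Series."""
    days = pd.to_timedelta(day, unit="D")
    # Append seconds so every value has the unambiguous HH:MM:SS form;
    # unparseable or missing times become NaT.
    clock = pd.to_timedelta(
        time_hhmm.astype(str).str.strip() + ":00", errors="coerce"
    )
    return days + clock

# Toy data (assumption): column names other than PUDID, ImageDay, and
# ImageTime are placeholders.
imaging = pd.DataFrame(
    {"PUDID": [1, 2], "ImageDay": [0, 1], "ImageTime": ["14:35", "02:10"]}
)
exam = pd.DataFrame(
    {"PUDID": [1, 2, 3], "PEDay": [0, 0, 0], "PETime": ["13:05", "22:40", "09:00"]}
)

imaging["ImageDateTime"] = to_elapsed(imaging["ImageDay"], imaging["ImageTime"])
exam["PEDateTime"] = to_elapsed(exam["PEDay"], exam["PETime"])

# A left merge keeps patients who never received imaging (their times stay NaT).
merged = exam.merge(imaging, on="PUDID", how="left")
merged["HoursExamToImage"] = (
    merged["ImageDateTime"] - merged["PEDateTime"]
) / pd.Timedelta(hours=1)

print(merged[["PUDID", "HoursExamToImage"]])
```

If the day columns turn out to hold calendar dates rather than study-day offsets, the same idea applies with pd.to_datetime for the day and pd.to_timedelta for the clock time.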

# Exploratory Analysis
<span style="color:blue">Which methods and visualizations are you planning to use to look at your dataset?</span>

* We will begin exploratory analysis with descriptive statistics. Using the pandas library, we will call methods such as describe(), info(), and value_counts() to get a general picture of the data.

* We will use visualizations such as histograms to show the distribution of each variable and help us identify key factors in DKA.

* To understand potential relationships between our variables, we will use scatterplots and correlation matrices.
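
A rough sketch of these exploratory steps is shown below. The table is a synthetic stand-in generated at random, and the column names (GCS, HeartRate, PECardioAbnormal) are assumptions about what the cleaned, merged data might contain, so the numbers carry no clinical meaning.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned table; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "GCS": rng.integers(3, 16, size=n),               # integer scores 3..15
    "HeartRate": rng.normal(120, 20, size=n).round(),
    "PECardioAbnormal": rng.integers(0, 2, size=n),   # 0/1 flag
})

df.info()                                        # dtypes and non-null counts
summary = df.describe()                          # count, mean, std, quartiles
counts = df["PECardioAbnormal"].value_counts()   # frequency of each category
corr = df.corr(numeric_only=True)                # pairwise Pearson correlations

# Histogram bin counts computed without plotting: one bin per GCS value.
gcs_hist, gcs_edges = np.histogram(df["GCS"], bins=13, range=(3, 16))

print(summary)
print(counts)
print(corr.round(2))
```

For the actual figures, df.hist() and pandas.plotting.scatter_matrix(df) draw histograms and pairwise scatterplots directly from the same DataFrame when matplotlib is installed.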

# Analysis Methodology
<span style="color:blue">How are you planning to analyze your data?  
What specific questions do you hope to calculate?  
What methods (from class or otherwise) do you think you will use?</span>

* We plan to use logistic regression models to identify predictive variables for several patient outcomes, including severe altered mental status requiring intubation, need for urgent imaging, and escalation of care.

* We will assess the performance of our logistic regression models by evaluating the area under the receiver operating characteristic curve (AUC-ROC), sensitivity, specificity, and associated calibration plots.
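
The modeling and evaluation plan above might look something like the following sketch, which uses scikit-learn as one possible toolkit. Everything here is an assumption for illustration: the predictors and outcome are simulated, the 70/30 split and the 0.5 classification threshold are arbitrary choices, and the final model would be fit on the cleaned PECARN data instead.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Simulated data (assumption): three binary exam flags and a binary outcome
# whose log-odds depend on the first two flags.
rng = np.random.default_rng(42)
n = 1000
X = rng.integers(0, 2, size=(n, 3)).astype(float)
logit = -2.0 + 1.5 * X[:, 0] + 0.8 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Hold out 30% of patients for evaluation, preserving the outcome rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]   # predicted probability of the outcome

# Discrimination: area under the ROC curve.
auc = roc_auc_score(y_test, prob)

# Sensitivity and specificity at an arbitrary 0.5 threshold.
pred = (prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

# Calibration: observed outcome rate vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_test, prob, n_bins=5)

print(f"AUC={auc:.2f}  sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
```

Plotting mean_pred against frac_pos gives the calibration plot. In practice the threshold would be chosen with the clinical cost of false negatives in mind rather than fixed at 0.5.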


# Project Schedule
Make sure that you plan your work so that you can avoid a big rush right before the final project deadline, and delegate different modules and responsibilities among your team members. Write this in terms of weekly deadlines.

|      Week of   |                                       Task                                    |   Assigned/Notes                 |
| :--------------| :---------------------------------------------------------------------------: |:--------------------------------:|
|  March 9th     | Data cleaning and background research. Determine variables that will be used  | Import Data                      |
|  March 16th    | Data cleaning and exploratory analysis using visualizations                   | Handle missing data              |
|  March 23rd    | Finalize analysis plan                                                        | Finalize independent variables   |
|  March 30th    | Perform analysis                                                              | Develop, train, evaluate model   |
|  April 6th     | Finalize figures                                                              | Create visualizations and slides |
|  April 12th    | Create presentation                                                           | Record presentation              |
|  April 20th    | Submit final presentation                                                     | Submit                           |