# Dataset Discussion
This notebook was created for the Week 1 discussion, Visualization Project Part 1: Finding Your Data. I am interested in exploring clinical data and applying some of the visualization techniques learned in the course to it.
## Data Source
I sourced this data from Kaggle: [Heart Failure Clinical Records](https://www.kaggle.com/datasets/nimapourmoradi/heart-failure-clinical-records). The dataset is publicly available on Kaggle, and I am comfortable discussing it as part of the course. It contains key clinical data about patients with heart failure, along with their outcomes (whether they survived or not). The dataset is designed to help analyze which factors are most closely related to patient mortality and to enable predictive modeling for heart failure outcomes.

## Key attributes/dimensions of the data
This dataset contains the medical records of 299 patients with heart failure, collected during their follow-up period. Each patient record has 13 attributes (the columns listed below), for a total of 299 × 13 = 3,887 data points.
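
The data-point count is simply the number of rows times the number of columns. A minimal sketch of that arithmetic follows; the helper name `count_data_points` is hypothetical and only serves as a sanity check once the file is loaded.

```python
def count_data_points(n_rows: int, n_columns: int) -> int:
    """Total number of cells in a rectangular table: rows times columns."""
    return n_rows * n_columns


# 299 patient records, 13 columns each (figures taken from the description above)
total = count_data_points(299, 13)
print(total)  # 3887
```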

## Dataset Structure:
The dataset contains the following columns:
+ age: Age of the patient (in years)
+ anaemia: Presence of anaemia (0 = No, 1 = Yes)
+ creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
+ diabetes: Whether the patient has diabetes (0 = No, 1 = Yes)
+ ejection_fraction: Percentage of blood leaving the heart with each contraction
+ high_blood_pressure: Whether the patient has hypertension (0 = No, 1 = Yes)
+ platelets: Platelets in the blood (kiloplatelets/mL)
+ serum_creatinine: Level of serum creatinine in the blood (mg/dL)
+ serum_sodium: Level of serum sodium in the blood (mEq/L)
+ sex: Gender of the patient (1 = Male, 0 = Female)
+ smoking: Whether the patient smokes (0 = No, 1 = Yes)
+ time: Follow-up period (days)
+ DEATH_EVENT: If the patient died during the follow-up period (1 = Yes, 0 = No)
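
Several of the columns above are documented as 0/1 flags, so a quick check that they contain nothing else is a useful first step. The sketch below assumes the records are held as a list of dicts with integer values (for example, rows from `csv.DictReader` after conversion). The helper `find_invalid_binary_values` and the toy rows are invented for illustration and are not taken from the dataset.

```python
# Columns documented above as 0/1 flags
BINARY_COLUMNS = [
    "anaemia", "diabetes", "high_blood_pressure",
    "sex", "smoking", "DEATH_EVENT",
]


def find_invalid_binary_values(rows, columns=BINARY_COLUMNS):
    """Return {column: sorted unexpected values} for columns that should hold only 0 or 1."""
    problems = {}
    for column in columns:
        unexpected = {row[column] for row in rows if row[column] not in (0, 1)}
        if unexpected:
            problems[column] = sorted(unexpected)
    return problems


# Toy rows, invented for illustration (the second row has a deliberate bad value)
toy_rows = [
    {"anaemia": 0, "diabetes": 1, "high_blood_pressure": 0,
     "sex": 1, "smoking": 0, "DEATH_EVENT": 1},
    {"anaemia": 1, "diabetes": 0, "high_blood_pressure": 2,
     "sex": 0, "smoking": 1, "DEATH_EVENT": 0},
]
print(find_invalid_binary_values(toy_rows))  # {'high_blood_pressure': [2]}
```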

## Goals for working with that data
The main goal of analyzing this dataset is to identify and characterize the key factors that influence patient survival or death in heart failure cases, and to predict patient mortality. The aim is to derive actionable insights from the data that help predict future patient outcomes and potentially inform clinical decision-making. Below are the key questions to explore:
### a. Exploratory Questions
+ What is the **distribution** of various features in the dataset? For instance, exploring the distribution of patient age, ejection fraction, serum creatinine levels, etc.
+ Is there a significant difference between the survival and death groups based on clinical attributes? This can be explored through visualizations such as histograms, box plots, and scatter plots.
+ How do binary attributes like anaemia, high blood pressure, and smoking relate to the mortality rate? Is the presence of any of them associated with a higher death rate?
+ How do comorbid conditions like diabetes, high blood pressure, and smoking status affect mortality?
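
The questions about binary attributes come down to comparing death rates between the 0 and 1 groups of a column. A minimal sketch, assuming the records are a list of dicts with integer values; the helper `mortality_rate_by_flag` is hypothetical and the toy rows are invented for illustration, not drawn from the dataset.

```python
from collections import defaultdict


def mortality_rate_by_flag(rows, flag):
    """Share of patients with DEATH_EVENT == 1 within each value of a binary column."""
    deaths = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row[flag]] += 1
        deaths[row[flag]] += row["DEATH_EVENT"]
    return {value: deaths[value] / totals[value] for value in sorted(totals)}


# Toy rows, invented for illustration
toy_rows = [
    {"smoking": 1, "DEATH_EVENT": 1},
    {"smoking": 1, "DEATH_EVENT": 0},
    {"smoking": 0, "DEATH_EVENT": 0},
    {"smoking": 0, "DEATH_EVENT": 0},
    {"smoking": 0, "DEATH_EVENT": 1},
    {"smoking": 0, "DEATH_EVENT": 0},
]
rates = mortality_rate_by_flag(toy_rows, "smoking")
print(rates)  # {0: 0.25, 1: 0.5}
```

The same function works unchanged for `anaemia`, `diabetes`, `high_blood_pressure`, or `sex`. A difference in rates on its own does not show that the difference is statistically significant.
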
### b. Group Comparison
+ How do men and women compare in terms of survival rates? We can explore whether gender has any correlation with heart failure outcomes.
+ What are the key differences between patients who died and those who survived?
### c. Predictive Goals
+ What factors contribute most significantly to patient mortality?
+ How do features like age, ejection fraction, serum creatinine, and others impact the likelihood of death?
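
One simple first pass at these questions is to rank numeric features by the strength of their linear correlation with `DEATH_EVENT`. This is only an illustrative screening step, not the method used in the linked notebook. The helpers `pearson` and `rank_features` are hypothetical, and the toy rows are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_features(rows, features, target="DEATH_EVENT"):
    """Sort features by absolute correlation with the target, strongest first."""
    ys = [row[target] for row in rows]
    scores = {f: pearson([row[f] for row in rows], ys) for f in features}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy rows, invented for illustration
toy_rows = [
    {"age": 50, "ejection_fraction": 60, "DEATH_EVENT": 0},
    {"age": 60, "ejection_fraction": 20, "DEATH_EVENT": 1},
    {"age": 70, "ejection_fraction": 50, "DEATH_EVENT": 0},
    {"age": 80, "ejection_fraction": 30, "DEATH_EVENT": 1},
]
ranked = rank_features(toy_rows, ["age", "ejection_fraction"])
for name, score in ranked:
    print(f"{name}: {score:+.2f}")
```

Correlation with a binary outcome captures only linear association, so a feature with a threshold effect may rank lower than its clinical importance suggests.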

## Two Tasks Identified
### Task 1: Identifying Demographic Risk Factors (Age, Gender, and Smoking) for Heart Failure Mortality
* **Why (Goal)**: This task is pursued to explore how demographic factors like age, gender, and smoking status affect heart failure outcomes. Understanding demographic risk factors helps in targeted prevention and treatment strategies.
* **How (Means)**: The task is conducted through statistical analysis and visualizations like histograms. These tools help reveal demographic trends and compare survival rates across different subgroups.
* **What (Characteristics)**: This task seeks to learn whether certain demographic groups (e.g., older patients, smokers, males) are more prone to heart failure mortality. It also explores how these factors interact with clinical variables like blood pressure or serum creatinine.
* **Where (Target Data)**: The task operates on the age, sex, and smoking columns, and examines their relationship with the target outcome, DEATH_EVENT. Other clinical variables may be included to control for potential confounders.
* **When (Workflow)**: This task is performed as part of both the exploratory analysis and during the feature engineering phase, where demographic variables are analyzed for their importance and potential inclusion in predictive models.
* **Who (Roles)**:
	* Data Scientist: Responsible for statistical analysis and visualization of demographic factors.
	* Public Health Expert: Can help in understanding population-wide implications of the results and suggest areas for targeted interventions.
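
To compare survival across age subgroups, ages can be binned into bands and the death rate computed per band. A minimal sketch, assuming records are a list of dicts; the 10-year band width is an arbitrary choice, the helper `mortality_by_age_band` is hypothetical, and the toy rows are invented for illustration.

```python
def mortality_by_age_band(rows, width=10):
    """Death rate per age band, with bands ordered from youngest to oldest."""
    counts = {}  # lower bound of band -> (patients, deaths)
    for row in rows:
        lower = int(row["age"] // width) * width
        total, deaths = counts.get(lower, (0, 0))
        counts[lower] = (total + 1, deaths + row["DEATH_EVENT"])
    return {
        f"{lower}-{lower + width - 1}": deaths / total
        for lower, (total, deaths) in sorted(counts.items())
    }


# Toy rows, invented for illustration
toy_rows = [
    {"age": 45, "DEATH_EVENT": 0},
    {"age": 48, "DEATH_EVENT": 0},
    {"age": 63, "DEATH_EVENT": 1},
    {"age": 67, "DEATH_EVENT": 0},
    {"age": 72, "DEATH_EVENT": 1},
    {"age": 78, "DEATH_EVENT": 1},
]
rates = mortality_by_age_band(toy_rows)
print(rates)  # {'40-49': 0.0, '60-69': 0.5, '70-79': 1.0}
print(max(rates, key=rates.get))  # 70-79
```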

### Task 2: Exploring the Relationship Between Ejection Fraction and Mortality
* **Why (Goal)**: Ejection fraction is a well-known indicator of heart function. The goal of this task is to determine whether a low ejection fraction is strongly associated with patient mortality, which could be crucial for identifying patients at high risk.
* **How (Means)**: This task is conducted through exploratory data analysis techniques such as box plots and scatter plots. These visualizations help in examining the distribution of ejection fraction values and their relationship with mortality.
* **What (Characteristics)**: This task seeks to learn about the distribution of ejection fraction in the patient population and how it differs between patients who survived and those who did not (DEATH_EVENT). It also aims to identify threshold values below which the risk of death increases significantly.
* **Where (Target Data)**: The task focuses on the ejection fraction variable and its relationship with the binary outcome DEATH_EVENT. Additional variables like age, sex, and serum creatinine may also be included to understand if ejection fraction interacts with other factors.
* **When (Workflow)**: This task is usually part of the early exploratory analysis, where visualizations and statistical summaries are used to generate hypotheses about the data. Once relationships are identified, more complex modeling tasks may follow.
* **Who (Roles)**:
	* Data Analyst: Responsible for generating visualizations and summary statistics to uncover relationships.
	* Medical Researcher: Provides insights into the clinical relevance of observed relationships, helping to interpret the findings.
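
The numbers behind a box plot of ejection fraction by outcome, and a first look at candidate thresholds, can be sketched as below. This assumes records are a list of dicts and that both outcome groups are non-empty. The helpers `median_by_outcome` and `death_rate_below` are hypothetical, and the toy rows and threshold values are invented for illustration.

```python
from statistics import median


def median_by_outcome(rows, feature="ejection_fraction"):
    """Median of a feature for survivors (0) and for patients who died (1)."""
    groups = {0: [], 1: []}
    for row in rows:
        groups[row["DEATH_EVENT"]].append(row[feature])
    return {outcome: median(values) for outcome, values in groups.items()}


def death_rate_below(rows, threshold, feature="ejection_fraction"):
    """Death rate among patients whose feature value is below the threshold.

    Returns None when no patient falls below the threshold.
    """
    below = [row["DEATH_EVENT"] for row in rows if row[feature] < threshold]
    return sum(below) / len(below) if below else None


# Toy rows, invented for illustration
toy_rows = [
    {"ejection_fraction": 20, "DEATH_EVENT": 1},
    {"ejection_fraction": 25, "DEATH_EVENT": 1},
    {"ejection_fraction": 30, "DEATH_EVENT": 0},
    {"ejection_fraction": 35, "DEATH_EVENT": 1},
    {"ejection_fraction": 40, "DEATH_EVENT": 0},
    {"ejection_fraction": 50, "DEATH_EVENT": 0},
    {"ejection_fraction": 60, "DEATH_EVENT": 0},
]
print(median_by_outcome(toy_rows))     # {0: 45.0, 1: 25}
print(death_rate_below(toy_rows, 30))  # 1.0
print(death_rate_below(toy_rows, 40))  # 0.75
```

Scanning several thresholds this way shows where the death rate changes most sharply; any cut-off found this way still needs clinical interpretation.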


## Visualization
Here is a link to the visualization for this dataset: [Clinical Data Visualization](https://github.com/anantpad/clinical_data_visualization_example/blob/main/ClinicalDataVisualization.ipynb)
An HTML version of the visualization is also attached.

## Summary of key elements of design and justification

### Data Types and Visual Representations:
1. **Scatter Plots** for showing relationships between clinical features like ejection fraction and age with mortality.
Justification: Scatter plots are effective for visualizing correlations between numerical variables and are easy to interpret by both technical and non-technical users.
2. **Box Plots and Violin Plots** to visualize the distribution of variables like age and ejection fraction across mortality outcomes.
Justification: These plots summarize key distribution statistics (e.g., medians, quartiles) and provide a visual comparison between groups (e.g., survived vs died). Violin plots extend this by showing distribution shape, giving a fuller picture. 
3. **Histograms** to show frequency distributions, such as patient counts by age bin or by death event.
Justification: Histograms provide clear and straightforward visualizations of frequency counts, which help users understand how the data is spread across value ranges and across categories like death event.
### Interactive Selections:
Altair Selections for dynamically updating charts when users highlight particular ranges or data points.
Justification: Interactive selection tools enhance user engagement and make complex data more accessible by allowing users to focus on specific elements of interest.
### Color Encoding:
Color-Coding Based on Mortality Outcome for clearer differentiation between patients who survived and those who did not.
Justification: Color coding makes it easier for users to distinguish between groups (e.g., those who died and those who survived), enhancing quick pattern recognition and interpretation.
### Annotations and Tooltips:
Tooltips that provide detailed information (e.g., specific values for age, ejection fraction, and death event) when hovering over data points.
Justification: Tooltips offer additional context without cluttering the visual display, making it easier for users to understand specific data points without losing the big picture.
### Descriptive Titles and Axis Labels:
Clear Axis Labels that communicate what each axis represents (e.g., age, ejection fraction) and descriptive titles for each chart.
Justification: Well-labeled axes and descriptive titles make the charts more accessible, ensuring that users quickly understand what the visualization is showing.
### Comparative Visualizations:
Comparing Distributions of age, gender, and clinical features (e.g., serum sodium, ejection fraction) between patients who died and those who survived.
Justification: Comparative visualizations enable users to spot trends and differences between key groups, aiding in understanding of factors associated with mortality.

## Justification for the Design Choices:
1. Flexibility for Different Users: The design incorporates a variety of chart types (scatter plots, box plots, density plots), making it accessible to both medical professionals (who can interpret more complex visualizations) and laypersons (who can benefit from simple, comparative visualizations).
2. Ease of Interpretation: The use of color, annotations, and interactive filters simplifies the process of finding relevant patterns in the data, even for those without deep statistical knowledge.
3. Exploratory Depth: By incorporating interactive elements like tooltips, the design encourages users to actively explore the dataset, facilitating deeper insight into heart failure clinical outcomes.
4. Visual Clarity: Clean and simple visual encoding (with appropriate axis labels, titles, and color schemes) reduces cognitive load, making it easier for users to understand the information being presented without getting overwhelmed.

This design is aimed at achieving the core goal: to uncover clinical patterns related to heart failure mortality, in a way that is engaging, insightful, and usable for a broad range of users.

## Evaluation approach
To evaluate the design approach for visualizing the Heart Failure Clinical Records data, a structured evaluation process was employed. The evaluation focussed on ensuring that the visualization design effectively supported data interpretation and insights, helping to answer key questions related to patient mortality and clinical outcomes.

### The core questions the evaluation seeks to answer include:
* How well do the visualizations help users identify key clinical features that correlate with heart failure mortality?
* Do the visualizations provide a clear understanding of the distribution and impact of age, ejection fraction, and other clinical features in relation to mortality?
* How effective are the visualizations in helping users make predictions or detect patterns based on the data?

### I recruited:
* A computer scientist in the data analyst role: a PhD student working in human-robot interaction research.
* A non-technical user: a layperson, a middle school educator who is interested in healthcare but lacks technical expertise. This helped assess whether the visualizations are intuitive for non-experts.

### Measures Used to Answer the Evaluation Questions
+ Insight Depth:
	* Definition: How deeply participants can explore and understand the relationships between variables (e.g., age, ejection fraction, mortality).
	* What it tells you: Insight depth measures how well the visualizations enable users to uncover non-obvious insights or trends (e.g., subtle correlations between ejection fraction and mortality).
	* Evaluation: Participants were asked to explore the data and note down the patterns they observed. I measured how detailed and accurate their observations were.
+ Ease of Use (User Experience):
	* Definition: How easy it is for participants to interact with and interpret the visualizations.
	* What it tells you: Ease of use evaluates the usability and clarity of the visualizations. Are the color schemes, axes, legends, and tooltips easy to understand and use?
	* Evaluation: I ran a usability test in which participants completed two tasks, (a) find the age group with the highest mortality rate and (b) identify the continuous clinical variables most closely related to mortality, and reported the ease or difficulty of doing so.
+ Accuracy:
	* Definition: The correctness of the conclusions that participants draw from the visualizations.
	* What it tells you: Accuracy indicates whether the visualizations support correct interpretation of the data or cause confusion.
	* Evaluation: Participants were asked to answer questions about the dataset (e.g., "Does ejection fraction strongly predict mortality?"), and their answers were checked against the data.
+ Out of scope: user satisfaction with the visualization

### The Approach I Used to Answer Those Questions
+ Usability Testing: Observing participants as they use the visualizations and asking them to verbalize their thought process. This helps identify pain points or confusing aspects of the visualizations.

### Execution: Here’s what the participants did:
1. Introduction and Familiarization: Participants were introduced to the dataset, the visualization dashboard, and the core questions. They explored the visualizations (e.g., feature importance bar chart, scatter plot, box plot) to get comfortable with the tools.
2. Task Completion: Participants were given several tasks, such as:
	* Task 1: "Identify the top 3 clinical features most predictive of mortality."
	* Task 2: "Find the age group with the highest rate of heart failure mortality."
	* Task 3: "Is low ejection fraction more common among patients who died?"

	During this phase, I tracked their completion time and errors.
3. Observation and Feedback: While participants performed the tasks, I observed how they interacted with the visualizations and recorded areas where they seemed confused or took unexpected actions. After completing the tasks, participants gave feedback on how intuitive or difficult the tasks were.

### Criteria Used to Indicate That the Visualization Was Successful
* Insight Generation: Participants should consistently be able to generate correct and insightful answers to questions about the data.
* Ease of Use: Users complete tasks with minimal confusion, and non-experts can understand and interact with the visualizations.
* Accuracy of Interpretation: A high level of correct interpretation of the visualizations shows that the design is successful in conveying the intended insights.

### Observations
#### Participant 1
| Tasks | Insight Generation | Ease of Use | Accuracy of Interpretation |
| ----- | ----- | ----- | ----- |
| Task 1 | Y | N | Y |
| Task 2 | Y | Y | Y |
| Task 3 | Y | Y | N |

#### Participant 2
| Tasks | Insight Generation | Ease of Use | Accuracy of Interpretation |
| ----- | ----- | ----- | ----- |
| Task 1 | Y | N | Y |
| Task 2 | Y | Y | Y |
| Task 3 | Y | Y | N |
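
The Y/N marks in the two tables can be rolled up into a success rate per criterion. A minimal sketch follows; the marks are transcribed from the tables above, while the string encoding (one character per criterion, in column order) and the helper `success_rates` are assumptions made for illustration.

```python
CRITERIA = ("Insight Generation", "Ease of Use", "Accuracy of Interpretation")

# Y/N marks transcribed from the observation tables, one character per criterion
observations = {
    "Participant 1": {"Task 1": "YNY", "Task 2": "YYY", "Task 3": "YYN"},
    "Participant 2": {"Task 1": "YNY", "Task 2": "YYY", "Task 3": "YYN"},
}


def success_rates(observations):
    """Fraction of 'Y' marks per criterion across all participants and tasks."""
    totals = [0] * len(CRITERIA)
    passes = [0] * len(CRITERIA)
    for tasks in observations.values():
        for marks in tasks.values():
            for i, mark in enumerate(marks):
                totals[i] += 1
                passes[i] += mark == "Y"
    return {name: passes[i] / totals[i] for i, name in enumerate(CRITERIA)}


rates = success_rates(observations)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

With only two participants and three tasks, these rates are descriptive rather than statistically meaningful.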

### Conclusion:
By running a formal experiment combined with usability testing, I was able to evaluate how well the visualizations meet the goals for insight generation, ease of use, and accuracy. Success was indicated by participants' ability to draw accurate conclusions and feel comfortable with the interface. In future, I should also collect data on user satisfaction with the visualization.


## Synthesis of Findings
### What elements worked well
- Clear axis labels
- Color coding
- Tooltips
- Titles and legends

Effective use of elements such as clear labels, color coding, interactive tooltips, point sizing, and legends significantly enhanced the clarity and usability of the scatter plot.

### What elements need refinement in future
- Age distribution: While the histogram is effective, adding a density curve or shading based on survival status could enhance clarity.
- Use of Annotations: Adding annotations to key areas in visualizations, such as indicating significant medians, or highlighting outliers, can make the visualizations easier to interpret for non-experts.
- Faceting for Comparisons: Using facets (small multiples) is a great way to compare distributions across groups. For instance, faceting the age distribution by DEATH_EVENT (survived vs. died) could make it easier to compare the age ranges of the two groups.
- Interactive Visualizations: Introducing interactive features like filters, brushing and linking in Altair could allow users to explore the relationships between different subsets of the data dynamically. For example, selecting a range of ages could highlight corresponding ejection fractions and survival rates, enabling users to see connections more fluidly.
- Combine Multiple Visuals: Sometimes, combining multiple charts (e.g., placing a box plot beside a histogram or density plot) for a feature like age can provide both summary statistics and a detailed view of distribution simultaneously.