# Dataset Discussion
This notebook was created for the Week 1 discussion on Visualization Project Part 1: Finding Your Data. I am interested in exploring clinical data and applying some of the visualization techniques I have learned to it.
## Data's Source
I sourced this data from Kaggle - [Heart Failure Clinical Records](https://www.kaggle.com/datasets/nimapourmoradi/heart-failure-clinical-records). The dataset is openly available on Kaggle, and I am comfortable discussing it as part of the course. It contains key clinical data about patients with heart failure, along with their outcomes (whether they survived or not). The dataset is designed to help analyze which factors are most closely related to patient mortality and to enable predictive modeling for heart failure outcomes.

In [13]:
import pandas as pd

data = pd.read_csv("heart_failure_clinical_records_dataset.csv")
data.head()

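|  | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |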


## Key attributes/dimensions of the data
This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period. Each patient record has 13 columns (12 clinical features plus the outcome, DEATH_EVENT), i.e. 299 × 13 = 3887 data points.

In [14]:
row_count = len(data)
row_count

299

In [15]:
column_names = data.columns
print(column_names)
len(column_names)

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')


13
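
The total number of data points is simply rows times columns. A minimal sketch of that check, using only the five rows printed by `data.head()` as a stand-in for the full file (so the frame below is 5 × 13, not 299 × 13; the name `sample` is my own):

```python
import io
import pandas as pd

# The five rows shown by data.head(), used as a small stand-in
# for the full CSV (which has 299 rows).
sample_csv = """age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1
"""
sample = pd.read_csv(io.StringIO(sample_csv))

n_rows, n_cols = sample.shape   # (rows, columns)
n_values = sample.size          # rows * columns

print(n_rows, n_cols, n_values)  # 5 13 65
print(299 * 13)                  # 3887, the count for the full dataset
```

On the real `data` frame, `data.shape` and `data.size` would give the 299, 13, and 3887 figures directly.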

In [4]:
summary = data.describe()
print(summary)

              age     anaemia  creatinine_phosphokinase    diabetes  \
count  299.000000  299.000000                299.000000  299.000000   
mean    60.833893    0.431438                581.839465    0.418060   
std     11.894809    0.496107                970.287881    0.494067   
min     40.000000    0.000000                 23.000000    0.000000   
25%     51.000000    0.000000                116.500000    0.000000   
50%     60.000000    0.000000                250.000000    0.000000   
75%     70.000000    1.000000                582.000000    1.000000   
max     95.000000    1.000000               7861.000000    1.000000   

       ejection_fraction  high_blood_pressure      platelets  \
count         299.000000           299.000000     299.000000   
mean           38.083612             0.351171  263358.029264   
std            11.834841             0.478136   97804.236869   
min            14.000000             0.000000   25100.000000   
25%            30.000000             0.0

## Dataset Structure:
The dataset contains the following columns:
+ age: Age of the patient (in years)
+ anaemia: Presence of anaemia (0 = No, 1 = Yes)
+ creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
+ diabetes: Whether the patient has diabetes (0 = No, 1 = Yes)
+ ejection_fraction: Percentage of blood leaving the heart with each contraction
+ high_blood_pressure: Whether the patient has hypertension (0 = No, 1 = Yes)
+ platelets: Platelets in the blood (kiloplatelets/mL)
+ serum_creatinine: Level of serum creatinine in the blood (mg/dL)
+ serum_sodium: Level of serum sodium in the blood (mEq/L)
+ sex: Gender of the patient (1 = Male, 0 = Female)
+ smoking: Whether the patient smokes (0 = No, 1 = Yes)
+ time: Follow-up period (days)
+ DEATH_EVENT: If the patient died during the follow-up period (1 = Yes, 0 = No)
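
Six of these columns are 0/1 flags and seven are numeric measurements, and that split decides which chart types apply later. A minimal sketch of recording the split explicitly; the names `BINARY_COLS` and `NUMERIC_COLS` are my own, and the two-row frame is only a stand-in copied from the `data.head()` output:

```python
import pandas as pd

# Column groups, following the descriptions in the list above.
BINARY_COLS = ["anaemia", "diabetes", "high_blood_pressure",
               "sex", "smoking", "DEATH_EVENT"]
NUMERIC_COLS = ["age", "creatinine_phosphokinase", "ejection_fraction",
                "platelets", "serum_creatinine", "serum_sodium", "time"]

# Two rows copied from the data.head() output, standing in for the full data.
sample = pd.DataFrame(
    [[75.0, 0, 582, 0, 20, 1, 265000.0, 1.9, 130, 1, 0, 4, 1],
     [65.0, 1, 160, 1, 20, 0, 327000.0, 2.7, 116, 0, 0, 8, 1]],
    columns=["age", "anaemia", "creatinine_phosphokinase", "diabetes",
             "ejection_fraction", "high_blood_pressure", "platelets",
             "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
             "DEATH_EVENT"],
)

# Every column should fall into exactly one of the two groups.
covers_all = set(BINARY_COLS) | set(NUMERIC_COLS) == set(sample.columns)
no_overlap = not (set(BINARY_COLS) & set(NUMERIC_COLS))

# Binary flags should only ever contain 0 or 1.
flags_ok = bool(sample[BINARY_COLS].isin([0, 1]).all().all())

print(covers_all, no_overlap, flags_ok)
```

In Altair terms, the binary group maps naturally to the nominal (`:N`) type and the numeric group to the quantitative (`:Q`) type.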

## Goals for working with that data
The main goal of analyzing this dataset is to identify and characterize the key factors that influence patient survival or death in heart failure cases. We aim to derive actionable insights from the data to help predict future patient outcomes and potentially inform clinical decision-making. Below are the key questions to explore:
### a. Exploratory Questions
+ What is the **distribution** of various features in the dataset? For instance, exploring the distribution of patient age, ejection fraction, serum creatinine levels, etc.
+ Is there a significant difference between the survival and death groups based on clinical attributes? This can be explored through visualizations such as histograms, box plots, and scatter plots.
+ How do binary attributes such as anaemia, high blood pressure, and smoking relate to the mortality rate? Are any of them associated with higher death rates?
+ How do comorbid conditions like diabetes, high blood pressure, and smoking status affect mortality?
### b. Group Comparison
+ How do men and women compare in terms of survival rates? We can explore whether gender has any correlation with heart failure outcomes.
+ What are the key differences between patients who died and those who survived?
### c. Predictive Goals
+ What factors contribute most significantly to patient mortality?
+ How do features like age, ejection fraction, serum creatinine, and others impact the likelihood of death?
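
Several of the questions above come down to comparing the share of deaths between groups. Because `DEATH_EVENT` is coded 0/1, its mean within a group is that group's mortality rate. A minimal sketch on a small made-up frame (the values are illustrative only and are not taken from the dataset; the helper name `mortality_rate_by` is my own):

```python
import pandas as pd

# Hypothetical toy records; only the column names match the dataset.
toy = pd.DataFrame({
    "sex":         [1, 1, 1, 1, 0, 0, 0, 0],
    "smoking":     [1, 1, 0, 0, 0, 0, 1, 0],
    "DEATH_EVENT": [1, 0, 1, 0, 0, 0, 1, 0],
})

def mortality_rate_by(df, column):
    """Share of patients with DEATH_EVENT == 1 in each group of `column`."""
    return df.groupby(column)["DEATH_EVENT"].mean()

rate_by_sex = mortality_rate_by(toy, "sex")
rate_by_smoking = mortality_rate_by(toy, "smoking")

print(rate_by_sex)
print(rate_by_smoking)
```

The same call on the real `data` frame would give the per-group rates that the bar charts later in the notebook show as counts.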

## 2 Tasks identified
### Task 1: Identifying Demographic Risk Factors (Age, Gender, and Smoking) for Heart Failure Mortality
* **Why (Goal)**: This task is pursued to explore how demographic factors like age, gender, and smoking status affect heart failure outcomes. Understanding demographic risk factors helps in targeted prevention and treatment strategies.
* **How (Means)**: The task is conducted through statistical analysis and visualizations like histograms. These tools help reveal demographic trends and compare survival rates across different subgroups.
* **What (Characteristics)**: This task seeks to learn whether certain demographic groups (e.g., older patients, smokers, males) are more prone to heart failure mortality. It also explores how these factors interact with clinical variables like blood pressure or serum creatinine.
* **Where (Target Data)**: The task operates on the age, sex, and smoking columns, and examines their relationship with the target outcome, DEATH_EVENT. Other clinical variables may be included to control for potential confounders.
* **When (Workflow)**: This task is performed as part of both the exploratory analysis and during the feature engineering phase, where demographic variables are analyzed for their importance and potential inclusion in predictive models.
* **Who (Roles)**:
	* Data Scientist: Responsible for statistical analysis and visualization of demographic factors.
	* Public Health Expert: Can help in understanding population-wide implications of the results and suggest areas for targeted interventions.

### Task 2: Exploring the Relationship Between Ejection Fraction and Mortality
* **Why (Goal)**: Ejection fraction is a well-known indicator of heart function. The goal of this task is to determine whether a low ejection fraction is strongly associated with patient mortality, which could be crucial for identifying patients at high risk.
* **How (Means)**: This task is conducted through exploratory data analysis techniques such as box plots and scatter plots. These visualizations help in examining the distribution of ejection fraction values and their relationship with mortality.
* **What (Characteristics)**: This task seeks to learn about the distribution of ejection fraction in the patient population and how it differs between patients who survived and those who did not (DEATH_EVENT). It also aims to identify threshold values below which the risk of death increases significantly.
* **Where (Target Data)**: The task focuses on the ejection fraction variable and its relationship with the binary outcome DEATH_EVENT. Additional variables like age, sex, and serum creatinine may also be included to understand if ejection fraction interacts with other factors.
* **When (Workflow)**: This task is usually part of the early exploratory analysis, where visualizations and statistical summaries are used to generate hypotheses about the data. Once relationships are identified, more complex modeling tasks may follow.
* **Who (Roles)**:
	* Data Analyst: Responsible for generating visualizations and summary statistics to uncover relationships.
	* Medical Researcher: Provides insights into the clinical relevance of observed relationships, helping to interpret the findings.
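
One way to look for the threshold mentioned under Task 2 is to bin ejection fraction and compare the death rate per bin. A minimal sketch, assuming made-up values and arbitrary cut points (30 and 45 are placeholders chosen for illustration, not clinically validated thresholds):

```python
import pandas as pd

# Hypothetical toy records; only the column names match the dataset.
toy = pd.DataFrame({
    "ejection_fraction": [15, 20, 25, 30, 35, 38, 40, 45, 50, 60],
    "DEATH_EVENT":       [1,  1,  1,  0,  1,  0,  0,  0,  0,  0],
})

# Right-closed bins: (0, 30], (30, 45], (45, 100].
bins = [0, 30, 45, 100]
labels = ["<=30", "31-45", ">45"]
toy["ef_band"] = pd.cut(toy["ejection_fraction"], bins=bins, labels=labels)

# Number of patients and share of deaths in each band.
summary = toy.groupby("ef_band", observed=True)["DEATH_EVENT"].agg(
    patients="count", death_rate="mean"
)
print(summary)
```

A sharp drop in `death_rate` between adjacent bands would point to a candidate threshold worth examining more closely.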


## Visualizations
### Task 1: Explore Distributions by age and gender with Mortality
**Age distribution** histogram chart shows how the ages of patients are distributed across the dataset. A simple histogram of age can reveal whether heart failure affects certain age groups more frequently.  

**Gender distribution** bar chart shows the distribution of the gender variable in the dataset. 

**Box plot** of age grouped by gender (Male = 1, Female = 0) shows how age varies between male and female patients, which may matter for understanding demographics or age-related health outcomes. The medians are the same at 60, which suggests a similar central age for both genders. The box for females is narrower, so most female patients fall within a closer age range, while the wider interquartile range (IQR) for males indicates that male patients span a broader range of ages.

In [82]:
import altair as alt

# Create a container for our two different views
base =  alt.Chart(data).properties(width=200, height=200)

age_distr = base.mark_bar().encode(
    alt.X("age:Q", bin=True),
    y ='count()',
    color=alt.Color('DEATH_EVENT:N', title='Death Event'),
    tooltip = ['DEATH_EVENT:N']    
).properties(title='Age Mortality Distribution')

gendr_distr = base.mark_bar().encode(
    alt.X("sex:N"),
    y ='count()',
    color=alt.Color('DEATH_EVENT:N', title='Death Event'),
    tooltip = ['DEATH_EVENT:N']    
).properties(title='Gender Distribution')

age_gender = base.mark_boxplot().encode(
    x=alt.X('sex:N', title="Gender"),
    y=alt.Y('age:Q', title="Age"),
    color=alt.Color('DEATH_EVENT:N', title='Death Event'),
    tooltip = ['age','sex']
).properties(title="Age grouped by Gender")

age_distr | gendr_distr | age_gender

In [83]:
base =  alt.Chart(data).properties(width=200, height=200)

base.mark_rect().encode(
    x=alt.X('sex:N', title='Gender'),
    y=alt.Y('age:Q', bin=alt.Bin(maxbins=10), title='Age Group'),
    # color=alt.Color('count():Q', title='Number of Patients'),
    color = alt.Color('count():Q', scale = alt.Scale(scheme = 'spectral')),
    facet=alt.Facet('smoking:N', title='Smoking Status')
).properties(title='Age Smoking Gender distribution')

### Bar graph showing Smoking status by death event

In [92]:
base =  alt.Chart(data).properties(width=250, height=250)

base.mark_bar().encode(
    x='count()',
    y='smoking:N',
    color='DEATH_EVENT:N',
    tooltip=['smoking', 'count()', 'DEATH_EVENT']
).properties(title="Smoking Status by Death Event")


**Scatter Plot** to examine relationships between continuous variables
A scatter plot of serum creatinine against age, colored by gender. We do not see any clear-cut pattern.

In [33]:
alt.Chart(data).mark_circle().encode(
    x = 'age:Q',
    y = 'serum_creatinine',
    color="sex:N",
    # color = alt.Color('DEATH_EVENT', scale = alt.Scale(scheme = 'spectral')),
    tooltip = ["age", "serum_creatinine"]
)

**Smoker distribution** chart shows the distribution of smoking by age and gender. A stacked or grouped bar chart is well suited to comparing proportions and makes patterns or concentrations of smoking behavior across age and gender easy to see. Smokers are most concentrated among males aged 50-70, and there are very few among females.

### Task 2: Examine Relationships

In [6]:
# Build a SPLOM (scatter plot matrix)
alt.Chart(data).mark_circle().encode(
    alt.X(alt.repeat("column"), type="quantitative"),
    alt.Y(alt.repeat("row"), type="quantitative"),
    # color="DEATH_EVENT",
    color = alt.Color('DEATH_EVENT', scale = alt.Scale(scheme = 'spectral')),
    tooltip=["diabetes", "DEATH_EVENT"]
).properties(
    width=125,
    height=125
).repeat(
    column=['age', 'creatinine_phosphokinase', 
       'ejection_fraction', 'platelets',
       'serum_creatinine', 'serum_sodium', 'time'],
    row=['age', 'creatinine_phosphokinase', 
       'ejection_fraction', 'platelets',
       'serum_creatinine', 'serum_sodium', 'time']
)

**Scatter Plot**
Relationship between serum creatinine and ejection fraction

In [84]:
alt.Chart(data).mark_circle().encode(
    x = 'ejection_fraction',
    y = 'serum_creatinine',
    color="DEATH_EVENT",
    # color = alt.Color('DEATH_EVENT', scale = alt.Scale(scheme = 'spectral')),
    tooltip = ["ejection_fraction", "serum_creatinine","DEATH_EVENT"]
)

**Boxplot**
Boxplot to examine relationship between Ejection Fraction and Death Event

**Violin Plot**
Violin Plot to examine relationship between ejection fraction and death event

In [88]:
box = alt.Chart(data).mark_boxplot().encode(
    x=alt.X('DEATH_EVENT:N', title="Death Event"),
    y=alt.Y('ejection_fraction:Q', title="Ejection Fraction"),
    tooltip = ['ejection_fraction','DEATH_EVENT'],
    color='DEATH_EVENT:N'
).properties(title="Ejection Fraction grouped by Death Event", width=250, height=250)

violin = alt.Chart(data).transform_density(
    'ejection_fraction',
    as_=['ejection_fraction', 'density'],
    groupby=['DEATH_EVENT']
).mark_area(orient='horizontal').encode(
    x=alt.X('density:Q', stack='center', title=None),
    y=alt.Y('ejection_fraction:Q', title='Ejection Fraction'),
    color=alt.Color('DEATH_EVENT:N', title='Death Event'),
    tooltip=['ejection_fraction', 'DEATH_EVENT']
).properties(
    width=250,
    height=250
)

box | violin

**Heatmap** helps visualize correlations between clinical variables.
Correlation heatmaps are well suited to identifying linear relationships between continuous variables like age, ejection fraction, and serum creatinine. They provide an at-a-glance summary of which features might be most related to outcomes.
Correlation heatmaps alone do not capture nonlinear relationships, which can be important in clinical data. A pair plot or scatter plot matrix might provide more detailed insights into potential relationships. Scatter plots for key pairs of variables identified as having high correlations would provide a more nuanced view of these relationships. Interactions between continuous and categorical variables (e.g., how serum sodium affects survival differently in males vs. females) can’t be captured in simple correlation matrices.
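
The caveat about nonlinear relationships can be seen on a tiny synthetic example: a U-shaped dependence gives a Pearson correlation near zero even though one variable fully determines the other. The data here is invented purely for illustration.

```python
import pandas as pd

# A deterministic U-shaped relationship: y depends entirely on x.
x = pd.Series([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

# Pearson is the default method of Series.corr.
pearson_r = x.corr(y)
print(abs(pearson_r) < 1e-9)  # True: no linear association is detected
```

A scatter plot of `x` against `y` would show the pattern immediately, which is why the scatter plot matrix is a useful companion to the heatmap.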

In [7]:
corr = data.corr()

# Altair 5 replaces alt.selection(type='single') with alt.selection_point()
# and add_selection() with add_params().
selection = alt.selection_point(fields=['value'], bind='legend')

alt.Chart(corr.reset_index().melt(id_vars=['index'])).mark_rect().encode(
    x='variable:N',
    y='index:N',
    color='value:Q',
    tooltip=['variable', 'index', 'value'],
    opacity=alt.condition(selection, alt.value(1), alt.value(.2))
).properties(title="Correlation Heatmap").add_params(selection)


### Design choice, Interpretation and Justification of design choice

**Age distribution** histogram chart: Histograms show the frequency distribution of a continuous variable such as age once it is divided into bins. Grouping ages into meaningful bins (e.g., 40-50, 50-60, etc.) gives clearer insight into age groupings than looking at individual years. The peak of the distribution falls in the 60-70 bin, which suggests that heart failure primarily affects older patients in this dataset.  

**Gender distribution** bar chart: It gives insight into the balance between male and female patients.

**Box plot**: Grouping age by gender (Male = 1, Female = 0) summarizes the median and spread of age for each group in a compact form. The medians are the same at 60, which suggests a similar central age for both genders. The box for females is narrower, so most female patients fall within a closer age range, while the wider interquartile range (IQR) for males indicates that male patients span a broader range of ages.
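
Comparing box widths amounts to comparing interquartile ranges, which can also be computed directly. A minimal sketch on invented ages (chosen so that both groups share a median of 60 but differ in spread; these are not values from the dataset, and the helper name `iqr` is my own):

```python
import pandas as pd

# Hypothetical ages; sex follows the dataset coding (1 = Male, 0 = Female).
toy = pd.DataFrame({
    "sex": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "age": [42, 50, 60, 72, 85, 55, 58, 60, 63, 66],
})

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

spread = toy.groupby("sex")["age"].agg(median="median", iqr=iqr)
print(spread)
```

Here both groups have the same median while the IQR for `sex == 1` is several times larger, which is the pattern described for the box plot.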

### Compare categorical and continuous variable
**Box Plots** (for continuous variables, grouped by a categorical variable)

A box plot showing the relationship between ejection fraction and the death event provides a clear picture of how heart functionality (represented by ejection fraction) varies between patients who survived and those who did not.
The box plot reveals a strong association between ejection fraction and death event. Lower ejection fractions are clearly associated with a higher likelihood of mortality, while higher ejection fractions are correlated with survival. However, the variability in the ejection fraction values, especially the presence of outliers, suggests that while ejection fraction is a critical factor in survival, it is not the sole determinant, and other factors (like age, comorbidities, or treatment) may also play a role.
- the median ejection fraction is higher for patients who survived (death event = 0)
- the median ejection fraction is notably lower for patients who did not survive (death event = 1).

This implies that poorer heart function is associated with higher mortality risk and patients with better heart function tend to survive.
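
The outliers a box plot draws are, by the usual convention, points lying more than 1.5 × IQR beyond the quartiles. A minimal sketch of that rule on invented ejection fraction values (the helper name `tukey_outliers` is my own, and the numbers are not from the dataset):

```python
import pandas as pd

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return values[(values < lower) | (values > upper)]

# Hypothetical ejection fraction percentages with one unusually high value.
ef = pd.Series([25, 30, 30, 35, 38, 40, 40, 45, 80], name="ejection_fraction")
outliers = tukey_outliers(ef)
print(outliers.tolist())  # [80]
```

Applying the helper per `DEATH_EVENT` group would list the patients that appear as isolated points in each box.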

In [8]:
base =  alt.Chart(data).properties(width=250, height=250)

base.mark_boxplot().encode(
    x=alt.X('DEATH_EVENT:N', title="Death Event"),
    y=alt.Y('ejection_fraction:Q', title="Ejection Fraction (%)"),
    color='DEATH_EVENT:N'
).properties(title="Ejection Fraction by Death Event")

In [9]:
base =  alt.Chart(data).properties(width=250, height=250)

base.mark_bar().encode(
    x='count()',
    y='smoking:N',
    color='DEATH_EVENT:N',
    tooltip=['smoking', 'count()', 'DEATH_EVENT']
).properties(title="Smoking Status by Death Event")

### Scatter Plot Comparison of continuous variables
**Scatter Plot** helps visualize the relationship between serum creatinine and serum sodium, with color-coding for survival outcomes. Scatter plots are well suited to identifying relationships between two continuous variables. However, when many points overlap, using transparency or hexbin plots can be more effective. Including marginal histograms or density curves along the axes could give additional context on the distribution of each variable separately.

In [10]:
alt.Chart(data).mark_point().encode(
    x='serum_creatinine:Q',
    y='serum_sodium:Q',
    color='DEATH_EVENT:N',
    tooltip=['serum_creatinine', 'serum_sodium', 'DEATH_EVENT']
).properties(title="Serum Creatinine vs Serum Sodium")

In [11]:
# Implementing selection (click a point; shift-click adds to the selection)
# Altair 5 replaces alt.selection(type='multi') with alt.selection_point()
# and add_selection() with add_params().
selection = alt.selection_point(fields=['serum_creatinine'])

base = alt.Chart(data).properties(width=250, height=250)

base.mark_circle(opacity=0.5).encode(
    x='age:Q',
    y='sex:N',
    color='DEATH_EVENT:N',
    size="serum_creatinine",
    tooltip=['serum_creatinine', 'DEATH_EVENT'],
    opacity=alt.condition(selection, alt.value(1), alt.value(.2))
).properties(title="Serum Creatinine values").add_params(selection)


In [12]:
alt.Chart(data).mark_line().encode(
    x='time:Q',
    y='ejection_fraction:Q',
    color='DEATH_EVENT:N',
    tooltip=['time', 'ejection_fraction', 'DEATH_EVENT']
).properties(title="Follow-up Time vs Ejection Fraction")

## Summary of key elements of design and justification

### Data Types and Visual Representations:
1. **Scatter Plots** for showing relationships between clinical features like ejection fraction and age with mortality.
Justification: Scatter plots are effective for visualizing correlations between numerical variables and are easy to interpret by both technical and non-technical users.
2. **Box Plots and Violin Plots** to visualize the distribution of variables like age and ejection fraction across mortality outcomes.
Justification: These plots summarize key distribution statistics (e.g., medians, quartiles) and provide a visual comparison between groups (e.g., survived vs died). Violin plots extend this by showing distribution shape, giving a fuller picture. 
3. **Histograms and bar charts** to show frequency distributions of binned continuous variables like age and of categorical variables like death event.
Justification: Histograms and bar charts provide clear and straightforward visualizations of frequency counts, which help users understand the spread of data across age bins and across categories like death event.
### Interactive Selections:
Altair Selections for dynamically updating charts when users highlight particular ranges or data points.
Justification: Interactive selection tools enhance user engagement and make complex data more accessible by allowing users to focus on specific elements of interest.
### Color Encoding:
Color-Coding Based on Mortality Outcome for clearer differentiation between patients who survived and those who did not.
Justification: Color coding makes it easier for users to distinguish between groups (e.g., those who died and those who survived), enhancing quick pattern recognition and interpretation.
### Annotations and Tooltips:
Tooltips that provide detailed information (e.g., specific values for age, ejection fraction, and death event) when hovering over data points.
Justification: Tooltips offer additional context without cluttering the visual display, making it easier for users to understand specific data points without losing the big picture.
### Descriptive Titles and Axis Labels:
Clear Axis Labels that communicate what each axis represents (e.g., age, ejection fraction) and descriptive titles for each chart.
Justification: Well-labeled axes and descriptive titles make the charts more accessible, ensuring that users quickly understand what the visualization is showing.
### Comparative Visualizations:
Comparing Distributions of age, gender, and clinical features (e.g., serum sodium, ejection fraction) between patients who died and those who survived.
Justification: Comparative visualizations enable users to spot trends and differences between key groups, aiding in understanding of factors associated with mortality.

## Justification for the Design Choices:
1. Flexibility for Different Users: The design incorporates a variety of chart types (scatter plots, box plots, density plots), making it accessible to both medical professionals (who can interpret more complex visualizations) and laypersons (who can benefit from simple, comparative visualizations).
2. Ease of Interpretation: The use of color, annotations, and interactive filters simplifies the process of finding relevant patterns in the data, even for those without deep statistical knowledge.
3. Exploratory Depth: By incorporating interactive elements like tooltips, the design encourages users to actively explore the dataset, facilitating deeper insight into heart failure clinical outcomes.
4. Visual Clarity: Clean and simple visual encoding (with appropriate axis labels, titles, and color schemes) reduces cognitive load, making it easier for users to understand the information being presented without getting overwhelmed.

This design is aimed at achieving the core goal: to uncover clinical patterns related to heart failure mortality, in a way that is engaging, insightful, and usable for a broad range of users.

## Evaluation
Age distribution: While this type of chart is effective, adding a density curve or shading based on survival status could enhance clarity. 

To improve the quality and interpretability of visualizations, some enhancements could be made:

Use of Annotations: Adding annotations to key areas in visualizations, such as indicating significant medians, or highlighting outliers, can make the visualizations easier to interpret for non-experts.

Faceting for Comparisons: Using facets (small multiples) is a great way to compare distributions across groups. For instance, faceting the age distribution by DEATH_EVENT (survived vs. died) could make it easier to compare the age ranges of the two groups.

Interactive Visualizations: Introducing interactive features like brushing and linking in Altair could allow users to explore the relationships between different subsets of the data dynamically. For example, selecting a range of ages could highlight corresponding ejection fractions and survival rates, enabling users to see connections more fluidly.

Combine Multiple Visuals: Sometimes, combining multiple charts (e.g., placing a box plot beside a histogram or density plot) for a feature like age can provide both summary statistics and a detailed view of distribution simultaneously.

To evaluate the design approach for visualizing the Heart Failure Clinical Records data, a structured evaluation process is necessary. The evaluation will focus on ensuring that the visualization design effectively supports data interpretation and insights, helping to answer key questions related to patient mortality and clinical outcomes.

### 1. The core questions the evaluation seeks to answer
* How well do the visualizations help users identify key clinical features that correlate with heart failure mortality?
* Do the visualizations provide a clear understanding of the distribution and impact of age, ejection fraction, and other clinical features in relation to mortality?
* How effective are the visualizations in helping users make predictions or detect patterns based on the data?

### 2. The People You Would Recruit to Answer That Question
For a well-rounded evaluation, you would recruit:
* Healthcare Professionals: Cardiologists or medical researchers who work with heart failure patients and understand the clinical significance of the data.
* Data Analysts: People with experience in analyzing healthcare data but not necessarily experts in cardiology.
* Non-technical Users: Laypersons who may have an interest in healthcare but lack technical expertise. This group helps assess whether the visualizations are intuitive for non-experts.

### 3. The Kinds of Measures You Would Use to Answer Your Question
**a. Insight Depth**
* Definition: How deeply participants can explore and understand the relationships between variables (e.g., age, ejection fraction, mortality).
* What it Tells You: Insight depth helps measure how well the visualizations enable users to uncover non-obvious insights or trends (e.g., subtle correlations between ejection fraction and mortality).
* Evaluation: Participants could be asked to explore the data and note down patterns they observe. You would measure how detailed and accurate their observations are.

**b. Ease of Use (User Experience)**
* Definition: How easy it is for participants to interact with and interpret the visualizations.
* What it Tells You: Evaluates the usability and clarity of the visualizations. Are the color schemes, axes, legends, and tooltips easy to understand and use?
* Evaluation: You could run a usability test where participants complete tasks (e.g., finding the age group with the highest mortality rate) and report the ease or difficulty of doing so.

**c. Accuracy**
* Definition: The correctness of conclusions that participants draw from the visualizations.
* What it Tells You: Accuracy will tell whether the visualizations support the correct interpretation of the data or cause confusion.
* Evaluation: Ask participants to answer questions about the dataset (e.g., "Does ejection fraction strongly predict mortality?") and check their answers against the data.

**d. User Satisfaction**
* Definition: How satisfied users are with the visualization design and their overall experience.
* What it Tells You: Measures emotional engagement and perceived usefulness of the tool.
* Evaluation: A post-task survey could be used to capture satisfaction levels, asking participants to rate how useful and enjoyable the visualizations were.

### 4. The Approach You Will Use to Answer That Question
* Mixed-Method Approach: Use a combination of qualitative and quantitative methods to evaluate the visualizations.
	* Formal Experiment: For insight depth, accuracy, and ease of use, participants will perform specific tasks with the visualizations, and you will collect task performance metrics (time to complete tasks, errors made).
	* Usability Testing: Observing participants as they use the visualizations and asking them to verbalize their thought process. This helps identify pain points or confusing aspects of the visualizations.
	* Survey/Questionnaire: After the tasks, ask participants to complete a survey to measure satisfaction and ease of use.

### 5. How You Would Instantiate Those Methods
Here’s what the participants would do:
1. Introduction and Familiarization:
	* Participants would be introduced to the dataset, the visualization dashboard, and the core questions.
	* They would explore the visualizations (e.g., feature importance bar chart, scatter plot, box plot) to get comfortable with the tools.
2. Task Completion:
	* Participants would be given several tasks, such as:
		* Task 1: "Identify the top 3 clinical features most predictive of mortality."
		* Task 2: "Find the age group with the highest rate of heart failure mortality."
		* Task 3: "Is low ejection fraction more common among patients who died?"
	* During this phase, you will time their performance and track errors.
3. Observation and Feedback:
	* While participants perform the tasks, observe how they interact with the visualizations. Record areas where they seem confused or take unexpected actions.
	* After completing the tasks, ask participants to provide feedback on how intuitive or difficult the tasks were.
4. Post-Task Survey:
	* Ask participants to fill out a survey with questions about satisfaction, clarity, and perceived usefulness of the visualizations.
	* Include both Likert scale questions (e.g., "How easy was it to understand the relationship between ejection fraction and mortality?") and open-ended questions (e.g., "What changes would improve your experience?").
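
The task metrics and survey ratings collected this way could be summarized per task roughly as follows. This is a sketch on an entirely hypothetical study log; the column names (`correct`, `seconds`, `ease_likert`) and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical study log: one row per participant per task.
results = pd.DataFrame({
    "participant": ["P1", "P1", "P2", "P2", "P3", "P3"],
    "task":        ["T1", "T2", "T1", "T2", "T1", "T2"],
    "correct":     [1, 1, 0, 1, 1, 0],           # 1 = answered correctly
    "seconds":     [40.0, 65.0, 90.0, 70.0, 50.0, 120.0],
    "ease_likert": [4, 5, 2, 4, 4, 2],           # 1 = very hard, 5 = very easy
})

# Accuracy, mean completion time, and median ease rating for each task.
per_task = results.groupby("task").agg(
    accuracy=("correct", "mean"),
    mean_seconds=("seconds", "mean"),
    median_ease=("ease_likert", "median"),
)
print(per_task)
```

The median is used for the Likert item because the ratings are ordinal; open-ended answers would still need qualitative coding by hand.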

### 6. What Criteria Would You Use to Indicate that Your Visualization Was Successful
* Insight Generation: Participants should consistently be able to generate correct and insightful answers to questions about the data.
* Ease of Use: If users complete tasks with minimal confusion, and if non-experts can understand and interact with the visualizations, this would indicate success.
* Accuracy of Interpretation: A high level of correct interpretation of the visualizations shows that the design is successful in conveying the intended insights.
* User Satisfaction: Positive feedback in post-task surveys, particularly around clarity and usefulness, would suggest that the visualization meets the needs of the target audience.

### Conclusion
By running a formal experiment combined with usability testing and post-task surveys, you can comprehensively evaluate how well the visualizations meet your goals for insight generation, ease of use, and accuracy. Success would be indicated by participants' ability to draw accurate conclusions, feel comfortable with the interface, and report high levels of satisfaction with the visualization.
