# Dataset Discussion
This notebook supports the Week 1 discussion for Visualization Project Part 1: Finding Your Data. I want to explore clinical data and apply some of the visualization techniques learned in the course to it.
## Data Source
I sourced this data from Kaggle: [Heart Failure Clinical Records](https://www.kaggle.com/datasets/nimapourmoradi/heart-failure-clinical-records). The dataset is publicly available on Kaggle, and I am comfortable discussing it as part of the course. It contains key clinical data about patients with heart failure, along with their outcomes (whether they survived or not). The dataset is designed to help analyze which factors are most closely related to patient mortality and to enable predictive modeling of heart failure outcomes.

In [1]:
import pandas as pd

data = pd.read_csv("heart_failure_clinical_records_dataset.csv")
data.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


## Key attributes/dimensions of the data
This dataset contains the medical records of 299 heart failure patients, collected during their follow-up period. Each record has 13 columns (12 clinical features plus the DEATH_EVENT outcome), giving 299 × 13 = 3,887 data points.

In [2]:
row_count = len(data)
row_count

299

In [3]:
column_names = data.columns
print(column_names)
len(column_names)

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')


13

In [4]:
summary = data.describe()
print(summary)

              age     anaemia  creatinine_phosphokinase    diabetes  \
count  299.000000  299.000000                299.000000  299.000000   
mean    60.833893    0.431438                581.839465    0.418060   
std     11.894809    0.496107                970.287881    0.494067   
min     40.000000    0.000000                 23.000000    0.000000   
25%     51.000000    0.000000                116.500000    0.000000   
50%     60.000000    0.000000                250.000000    0.000000   
75%     70.000000    1.000000                582.000000    1.000000   
max     95.000000    1.000000               7861.000000    1.000000   

       ejection_fraction  high_blood_pressure      platelets  \
count         299.000000           299.000000     299.000000   
mean           38.083612             0.351171  263358.029264   
std            11.834841             0.478136   97804.236869   
min            14.000000             0.000000   25100.000000   
25%            30.000000             0.0

## Dataset Structure:
The dataset contains the following columns:
+ age: Age of the patient (in years)
+ anaemia: Presence of anaemia (0 = No, 1 = Yes)
+ creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)
+ diabetes: Whether the patient has diabetes (0 = No, 1 = Yes)
+ ejection_fraction: Percentage of blood leaving the heart with each contraction
+ high_blood_pressure: Whether the patient has hypertension (0 = No, 1 = Yes)
+ platelets: Platelets in the blood (kiloplatelets/mL)
+ serum_creatinine: Level of serum creatinine in the blood (mg/dL)
+ serum_sodium: Level of serum sodium in the blood (mEq/L)
+ sex: Gender of the patient (1 = Male, 0 = Female)
+ smoking: Whether the patient smokes (0 = No, 1 = Yes)
+ time: Follow-up period (days)
+ DEATH_EVENT: If the patient died during the follow-up period (1 = Yes, 0 = No)
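
The column list mixes 0/1 flags with measured quantities, and that split decides which chart type suits each column. The sketch below shows one way to separate them programmatically. The helper name `split_binary_and_continuous` is hypothetical, and the five rows are copied from the `data.head()` preview above only to exercise it; on the full dataset you would pass `data` instead.

```python
import pandas as pd

def split_binary_and_continuous(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Split column names into 0/1 flags and everything else."""
    binary, continuous = [], []
    for name in df.columns:
        values = set(df[name].dropna().unique())
        # A column counts as a flag when it holds nothing but 0 and/or 1
        if values <= {0, 1}:
            binary.append(name)
        else:
            continuous.append(name)
    return binary, continuous

# The five rows shown by data.head() earlier in this notebook
head_rows = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0],
    "anaemia": [0, 0, 0, 1, 1],
    "creatinine_phosphokinase": [582, 7861, 146, 111, 160],
    "diabetes": [0, 0, 0, 0, 1],
    "ejection_fraction": [20, 38, 20, 20, 20],
    "high_blood_pressure": [1, 0, 0, 0, 0],
    "platelets": [265000.0, 263358.03, 162000.0, 210000.0, 327000.0],
    "serum_creatinine": [1.9, 1.1, 1.3, 1.9, 2.7],
    "serum_sodium": [130, 136, 129, 137, 116],
    "sex": [1, 1, 1, 1, 0],
    "smoking": [0, 0, 1, 0, 0],
    "time": [4, 6, 7, 7, 8],
    "DEATH_EVENT": [1, 1, 1, 1, 1],
})

binary_cols, continuous_cols = split_binary_and_continuous(head_rows)
print("binary:", binary_cols)
print("continuous:", continuous_cols)
```

The binary columns are natural candidates for the `:N` (nominal) encodings used in the charts below, and the remaining ones for `:Q` (quantitative).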

## Visualizations
### Distributions by age and gender with Mortality
**Age distribution**: the histogram shows how patient ages are distributed across the dataset. A simple histogram of age can reveal whether heart failure affects certain age groups more frequently.

**Gender distribution**: the bar chart shows the distribution of the gender variable in the dataset.

**Age distribution grouped by gender**: a box plot of age grouped by gender (Male = 1, Female = 0) shows how age varies between male and female patients, which may matter for understanding demographics or age-related health outcomes. The medians are the same at 60, suggesting a similar central age for both genders. The box for females is narrower, so most female patients fall within a closer age range, while the wider interquartile range (IQR) for males indicates that male patients span a broader range of ages.
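
The medians and IQRs that a box plot draws can also be computed directly, which makes it easier to check a visual reading. This is a minimal sketch: `age_spread_by_sex` is a hypothetical helper, and the five rows come from the `data.head()` preview, so the numbers it prints describe those five rows only, not the full dataset.

```python
import pandas as pd

def age_spread_by_sex(df: pd.DataFrame) -> pd.DataFrame:
    """Median, quartiles and IQR of age for each value of sex."""
    grouped = df.groupby("sex")["age"]
    summary = pd.DataFrame({
        "median": grouped.median(),
        "q1": grouped.quantile(0.25),
        "q3": grouped.quantile(0.75),
    })
    # The IQR is the height of the box in a box plot
    summary["iqr"] = summary["q3"] - summary["q1"]
    return summary

# age and sex from the five preview rows (1 = Male, 0 = Female)
head_rows = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0],
    "sex": [1, 1, 1, 1, 0],
})

spread = age_spread_by_sex(head_rows)
print(spread)
```

Calling `age_spread_by_sex(data)` on the full frame would give the figures to compare against the plot.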

In [9]:
import altair as alt

# Create a container for our two different views
base =  alt.Chart(data).properties(width=200, height=200)

age_distr = base.mark_bar().encode(
    x = alt.X("age:Q", bin=True, title='Age Group'),
    y = alt.Y('count()', title='Count of Patients'),
    color=alt.Color('DEATH_EVENT:N', title='Death Event (0 = Alive, 1 = Death)'),
    tooltip = [alt.Tooltip("age:Q", bin=True, title='Age Group'),
        alt.Tooltip('sum(DEATH_EVENT):Q', title='Count of Deaths'),
        # alt.Tooltip('count() - sum(DEATH_EVENT):Q', title='Count of Alive'),
    ]    
).properties(title='Age Mortality Distribution')

gendr_distr = base.mark_bar().encode(
    alt.X("sex:N"),
    y ='count()',
    color=alt.Color('DEATH_EVENT:N', title='Death Event'),
    tooltip = ['DEATH_EVENT:N']    
).properties(title='Gender Distribution')

age_gender = base.mark_boxplot().encode(
    x=alt.X('sex:N', title="Gender"),
    y=alt.Y('age:Q', title="Age"),
    color=alt.Color('DEATH_EVENT:N', title='Death Event'),
    tooltip = ['age','sex']
).properties(title="Age grouped by Gender")

age_distr | gendr_distr | age_gender

### Distributions by age, smoking and gender
**Smoker distribution**: this faceted heatmap shows how smoking is distributed by age and gender. It helps answer questions such as (a) do certain age groups have a higher proportion of smokers? and (b) do males or females in certain age groups tend to smoke more? Combining these variables exposes relationships that a single-variable chart would hide. For example, smokers are most concentrated among males aged 50-70, while the number of female smokers is negligible.
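
Questions (a) and (b) are about proportions, which a count heatmap only shows indirectly. The sketch below computes the share of smokers per gender and age group. It assumes 10-year age bins from 40 to 100 (my choice, not necessarily the bins Altair picks), uses a hypothetical helper name, and runs on the five preview rows purely for illustration.

```python
import pandas as pd

def smoker_share_by_age_and_sex(df: pd.DataFrame) -> pd.Series:
    """Share of smokers within each (sex, 10-year age group) cell."""
    edges = list(range(40, 101, 10))                  # assumed edges: 40, 50, ..., 100
    labels = [f"{lo}-{lo + 9}" for lo in edges[:-1]]  # "40-49", "50-59", ...
    age_group = pd.cut(df["age"], bins=edges, labels=labels, right=False)
    age_group = age_group.astype(str).rename("age_group")
    # smoking is coded 0/1, so the mean of each cell is its share of smokers
    return df.groupby(["sex", age_group])["smoking"].mean()

# age, sex and smoking from the five preview rows
head_rows = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0],
    "sex": [1, 1, 1, 1, 0],
    "smoking": [0, 0, 1, 0, 0],
})

share = smoker_share_by_age_and_sex(head_rows)
print(share)
```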

In [10]:
base =  alt.Chart(data).properties(width=200, height=200)

base.mark_rect().encode(
    x=alt.X('sex:N', title='Gender'),
    y=alt.Y('age:Q', bin=alt.Bin(maxbins=10), title='Age Group'),
    # color=alt.Color('count():Q', title='Number of Patients'),
    color = alt.Color('count():Q'),
    tooltip = ['count():Q'],
    facet=alt.Facet('smoking:N', title='Smoking Status')
).properties(title='Age Smoking Gender distribution')

### Bar graph showing Smoking status by death event
The goal of this graph is to see whether smoking is associated with the death event (mortality) in the dataset, i.e. whether deaths make up a larger share of smokers or of non-smokers. In the chart, deaths appear to make up a somewhat higher proportion of smokers than of non-smokers. Because the bars show raw counts and the two groups differ in size, this reading should be confirmed against per-group death rates.
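
Death rates per group are easier to compare than stacked counts when the groups differ in size. A minimal sketch, assuming a hypothetical helper `death_rate_by`; the six rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

def death_rate_by(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Deaths, patients and death rate for each value of group_col."""
    rates = df.groupby(group_col)["DEATH_EVENT"].agg(["sum", "count", "mean"])
    return rates.rename(columns={"sum": "deaths", "count": "patients", "mean": "death_rate"})

# Made-up rows (NOT from the dataset): four non-smokers and two smokers
toy = pd.DataFrame({
    "smoking":     [0, 0, 0, 0, 1, 1],
    "DEATH_EVENT": [0, 1, 0, 1, 0, 1],
})

rates = death_rate_by(toy, "smoking")
print(rates)
```

In the toy rows the non-smokers have more deaths in absolute terms but the same death rate as the smokers, which is exactly the distinction a count-based bar chart can hide. On the real data the call would be `death_rate_by(data, "smoking")`.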

In [7]:
base =  alt.Chart(data).properties(width=250, height=250)

base.mark_bar().encode(
    x='count()',
    y='smoking:N',
    color='DEATH_EVENT:N',
    tooltip=[
        'smoking:N', 
        'count():Q', 
        'DEATH_EVENT:N'
    ]
).properties(title="Smoking Status by Death Event")


### Examine Relationships
**Scatter Plot** to examine relationships between continuous variables, colored by death event.

**Heatmap** to visualize correlations between the clinical variables
Correlation heatmaps are well suited to spotting relationships between continuous variables such as age, ejection fraction, and serum creatinine. They provide an at-a-glance summary of which features might be most related to outcomes.
Correlation heatmaps alone do not capture nonlinear relationships, which can be important in clinical data. A pair plot or scatter plot matrix might provide more detailed insight, and scatter plots for the pairs of variables with high correlations give a more nuanced view of those relationships. Interactions between continuous and categorical variables (e.g., how serum sodium affects survival differently in males vs. females) can’t be captured in a simple correlation matrix.
Age, anaemia, CPK, serum creatinine, and high blood pressure are positively correlated with the death event. Interestingly, smoking and gender do not appear to be directly correlated with the death event.
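
To illustrate why a linear correlation can understate a nonlinear relationship, the sketch below compares Pearson and Spearman (rank) correlation on a toy series where `y` grows as the cube of `x`. The values are made up for illustration and do not come from the dataset. Spearman is one possible complement to the Pearson heatmap, not something the original analysis used.

```python
import pandas as pd

# Made-up, perfectly monotonic but nonlinear relationship: y = x ** 3
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association; Spearman measures monotonic association
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"pearson:  {pearson:.3f}")
print(f"spearman: {spearman:.3f}")
```

Spearman reports a perfect monotonic relationship while Pearson falls short of 1. On the notebook's data, `data.corr(method="spearman")["DEATH_EVENT"]` would give the rank-based counterpart of the heatmap's outcome column.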

In [8]:
# Pairwise Pearson correlations between all columns (all are numeric)
corr = data.corr()

# Highlight the hovered column; selection_point replaces the deprecated alt.selection(type='single')
selection = alt.selection_point(fields=['variable'], on='mouseover')

alt.Chart(corr.reset_index().melt(id_vars=['index'])).mark_rect().encode(
    x='variable:N',
    y='index:N',
    color='value:Q',
    tooltip=['variable', 'index', 'value'],
    opacity=alt.condition(selection,alt.value(1),alt.value(.2))
).properties(title="Correlation Heatmap").add_params(selection)

**Matrix of scatter plots** for examining relationships among continuous variables.

In [None]:
# Build a SPLOM (scatter plot matrix)
alt.Chart(data).mark_circle().encode(
    alt.X(alt.repeat("column"), type="quantitative"),
    alt.Y(alt.repeat("row"), type="quantitative"),
    # color="DEATH_EVENT",
    color = alt.Color('DEATH_EVENT', scale = alt.Scale(scheme = 'spectral')),
    tooltip=["diabetes", "DEATH_EVENT"]
).properties(
    width=125,
    height=125
).repeat(
    column=['age', 'creatinine_phosphokinase', 
       'ejection_fraction', 'platelets',
       'serum_creatinine', 'serum_sodium', 'time'],
    row=['age', 'creatinine_phosphokinase', 
       'ejection_fraction', 'platelets',
       'serum_creatinine', 'serum_sodium', 'time']
)

**Scatter Plot**
Relationship between serum creatinine and ejection fraction
The scatter plot of serum creatinine against ejection fraction can offer insight into the relationship between kidney function and heart health. The distribution of points and their colors can suggest hypotheses that warrant further investigation or clinical attention. The highest serum creatinine values appear as outliers and are associated with the death event. In most cases, low to mid-level serum creatinine coincides with a lower ejection fraction.

In [None]:
alt.Chart(data).mark_circle().encode(
    x = 'ejection_fraction',
    y = 'serum_creatinine',
    color="DEATH_EVENT",
    # color = alt.Color('DEATH_EVENT', scale = alt.Scale(scheme = 'spectral')),
    tooltip = ["ejection_fraction", "serum_creatinine","DEATH_EVENT"]
)

**Boxplot**
Boxplot to examine relationship between Ejection Fraction and Death Event

**Violin Plot**
Violin plot to examine the relationship between ejection fraction and death event, as an alternative to the box plot.

In [None]:
box = alt.Chart(data).mark_boxplot().encode(
    x=alt.X('DEATH_EVENT:N', title="Death Event"),
    y=alt.Y('ejection_fraction:Q', title="Ejection Fraction"),
    tooltip = ['ejection_fraction','DEATH_EVENT'],
    color='DEATH_EVENT:N'
).properties(title="Ejection Fraction grouped by Death Event", width=250, height=250)

violin = alt.Chart(data).transform_density(
    'ejection_fraction',
    as_=['ejection_fraction', 'density'],
    groupby=['DEATH_EVENT']
).mark_area(orient='horizontal').encode(
    x=alt.X('density:Q', stack='center', title=None),
    y=alt.Y('ejection_fraction:Q', title='Ejection Fraction'),
    color=alt.Color('DEATH_EVENT:N', title='Death Event'),
    tooltip=['ejection_fraction', 'DEATH_EVENT']
).properties(
    width=250,
    height=250
)

box | violin

### Design choice, Interpretation and Justification of design choice

**Age distribution** histogram chart: Histograms show the frequency distribution of a continuous variable such as age. They give a clear view of frequency counts, which helps readers understand how patients are spread across ages. Grouping age into meaningful bins (e.g., 40-50, 50-60) gives clearer insight into age groupings than individual years would. The peak of the distribution marks 60-70 as the most common age group, which suggests that heart failure primarily affects older patients in this dataset.
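
The binning a histogram performs can be made explicit with `numpy.histogram`. The sketch below assumes 10-year bin edges from 40 to 100 (my choice for illustration, not necessarily the bins Altair picks with `bin=True`) and uses only the five ages from the `data.head()` preview.

```python
import numpy as np

# Ages from the five preview rows; on the full dataset use data["age"]
ages = np.array([75.0, 55.0, 65.0, 50.0, 65.0])

# Assumed 10-year bin edges: 40, 50, ..., 100
edges = np.arange(40, 101, 10)

# Each bin includes its left edge; only the last bin also includes its right edge
counts, _ = np.histogram(ages, bins=edges)

age_bins = {f"{lo}-{hi}": int(n) for lo, hi, n in zip(edges[:-1], edges[1:], counts)}
print(age_bins)
```

Changing `edges` shows how sensitive the apparent peak is to the bin width, which is worth checking before reading too much into a single peak.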

**Gender distribution** bar chart: It gives insight into the balance between male and female patients.

**Box plot**: a box plot of age grouped by gender (Male = 1, Female = 0) summarizes the median, spread, and outliers of each group in a compact form, which makes differences between male and female patients easy to compare. The medians are the same at 60, suggesting a similar central age for both genders. The box for females is narrower, so most female patients fall within a closer age range, while the wider interquartile range (IQR) for males indicates that male patients span a broader range of ages.

### Compare categorical and continuous variable
**Box Plots** (for continuous variables, grouped by a categorical variable)

A box plot showing the relationship between ejection fraction and the death event provides a clear picture of how heart functionality (represented by ejection fraction) varies between patients who survived and those who did not.
The box plot reveals a strong association between ejection fraction and death event. Lower ejection fractions are clearly associated with a higher likelihood of mortality, while higher ejection fractions are correlated with survival. However, the variability in the ejection fraction values, especially the presence of outliers, suggests that while ejection fraction is a critical factor in survival, it is not the sole determinant, and other factors (like age, comorbidities, or treatment) may also play a role.
- the median ejection fraction is higher for patients who survived (death event = 0)
- the median ejection fraction is notably lower for patients who did not survive (death event = 1).

This implies that poorer heart function is associated with higher mortality risk and patients with better heart function tend to survive.

In [None]:
base =  alt.Chart(data).properties(width=250, height=250)

base.mark_boxplot().encode(
    x=alt.X('DEATH_EVENT:N', title="Death Event"),
    y=alt.Y('ejection_fraction:Q', title="Ejection Fraction (%)"),
    color='DEATH_EVENT:N'
).properties(title="Ejection Fraction by Death Event")

In [None]:
base =  alt.Chart(data).properties(width=250, height=250)

base.mark_bar().encode(
    x='count()',
    y='smoking:N',
    color='DEATH_EVENT:N',
    tooltip=['smoking', 'count()', 'DEATH_EVENT']
).properties(title="Smoking Status by Death Event")

### Scatter Plot Comparison of continuous variables
**Scatter Plot** helps visualize the relationship between serum creatinine and serum sodium, with color-coding for survival outcomes. Scatter plots are well suited to identifying relationships between two continuous variables. However, when many points overlap, transparency or hexbin plots can be more effective. Marginal histograms or density curves along the axes could add context about the distribution of each variable on its own.

In [None]:
alt.Chart(data).mark_point().encode(
    x='serum_creatinine:Q',
    y='serum_sodium:Q',
    color='DEATH_EVENT:N',
    tooltip=['serum_creatinine', 'serum_sodium', 'DEATH_EVENT']
).properties(title="Serum Creatinine vs Serum Sodium")

In [None]:
# Implementing selection: click a point to highlight every patient with the same
# serum creatinine value (selection_point replaces the deprecated alt.selection(type='multi'))
selection = alt.selection_point(fields=['serum_creatinine'])

base =  alt.Chart(data).properties(width=250, height=250)

base.mark_circle(opacity = 0.5).encode(
    x='age:Q',
    y='sex:N',
    color='DEATH_EVENT:N',
    size="serum_creatinine",
    tooltip=['serum_creatinine', 'DEATH_EVENT'],
    opacity=alt.condition(selection,alt.value(1),alt.value(.2))
).properties(title="Serum Creatinine values").add_params(selection)

In [None]:
alt.Chart(data).mark_line().encode(
    x='time:Q',
    y='ejection_fraction:Q',
    color='DEATH_EVENT:N',
    tooltip=['time', 'ejection_fraction', 'DEATH_EVENT']
).properties(title="Follow-up Time vs Ejection Fraction")