## CardioPredict: Assessing Heart Disease Risk

_CardioPredict harnesses the power of logistic regression to analyze key health indicators and provide a predictive model for assessing the risk of coronary heart disease in individuals._

Data source: https://paulblanche.com/files/DataFramingham.html

by Joel Wu, Sandra Gross, He Ma and Doris Wang (DSCI 522 Group 10 Milestone 1)

2023/11/15

### Description
The data come from the famous "Framingham Heart Study", initially planned as a 20-year cohort study of residents aged 30-59 in the town of Framingham, Massachusetts, starting in 1948. The present data contain observations on n=1,363 persons.

| Variable | Explanation |
|----------|-------------|
| sex      | sex (Female/Male) |
| AGE      | Age in years |
| FRW      | "Framingham relative weight" (pct.) at baseline (52-222) |
| SBP      | systolic blood pressure at baseline mmHg (90-300) |
| DBP      | diastolic blood pressure at baseline mmHg (50-160) |
| CHOL     | cholesterol at baseline mg/100ml (96-430) |
| CIG      | cigarettes per day at baseline (0-60) |
| disease  | 1 if coronary heart disease occurred during the follow-up, 0 otherwise |

In [1]:
import numpy as np
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline


### Exploratory Data Analysis (EDA)

In [2]:
# 1. import data and split into train and test
df = pd.read_csv("data/framingham.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
train_df

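|      | AGE | FRW   | SBP | DBP | CHOL | CIG  | sex    | disease |
|------|-----|-------|-----|-----|------|------|--------|---------|
| 723  | 50  | 118.0 | 160 | 100 | 334  | 20.0 | Female | 0       |
| 749  | 52  | 95.0  | 135 | 85  | 296  | 20.0 | Male   | 0       |
| 66   | 48  | 90.0  | 120 | 80  | 200  | 15.0 | Male   | 0       |
| 240  | 55  | 118.0 | 190 | 110 | 220  | 20.0 | Female | 0       |
| 246  | 47  | 83.0  | 140 | 78  | 170  | 20.0 | Male   | 1       |
| ...  | ... | ...   | ... | ... | ...  | ...  | ...    | ...     |
| 1147 | 51  | 118.0 | 124 | 78  | 242  | 0.0  | Male   | 0       |
| 106  | 53  | 116.0 | 124 | 72  | 142  | 30.0 | Male   | 0       |
| 1041 | 51  | 76.0  | 96  | 68  | 265  | 20.0 | Male   | 0       |
| 1122 | 49  | 100.0 | 120 | 84  | 201  | 20.0 | Male   | 0       |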


In [3]:
train_df.describe()

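|       | AGE      | FRW        | SBP        | DBP       | CHOL       | CIG       | disease  |
|-------|----------|------------|------------|-----------|------------|-----------|----------|
| count | 1090.0   | 1083.0     | 1090.0     | 1090.0    | 1090.0     | 1089.0    | 1090.0   |
| mean  | 52.40367 | 104.779317 | 148.254128 | 90.294495 | 234.098165 | 8.019284  | 0.192661 |
| std   | 4.81431  | 17.389638  | 28.460831  | 14.224762 | 46.241893  | 11.575781 | 0.39457  |
| min   | 45.0     | 52.0       | 90.0       | 50.0      | 96.0       | 0.0       | 0.0      |
| 25%   | 48.0     | 94.0       | 130.0      | 80.0      | 200.0      | 0.0       | 0.0      |
| 50%   | 52.0     | 102.0      | 142.0      | 90.0      | 229.5      | 0.0       | 0.0      |
| 75%   | 56.0     | 113.0      | 160.0      | 99.5      | 263.0      | 20.0      | 0.0      |
| max   | 62.0     | 222.0      | 300.0      | 160.0     | 430.0      | 60.0      | 1.0      |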


In [4]:
# 2. check missing
print(train_df.info())
print(train_df.isna().sum())

<class 'pandas.core.frame.DataFrame'>
Index: 1090 entries, 723 to 1346
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   AGE      1090 non-null   int64  
 1   FRW      1083 non-null   float64
 2   SBP      1090 non-null   int64  
 3   DBP      1090 non-null   int64  
 4   CHOL     1090 non-null   int64  
 5   CIG      1089 non-null   float64
 6   sex      1090 non-null   object 
 7   disease  1090 non-null   int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 76.6+ KB
None
AGE        0
FRW        7
SBP        0
DBP        0
CHOL       0
CIG        1
sex        0
disease    0
dtype: int64


__In the training set, 'FRW' has 7 missing values and 'CIG' has 1. These gaps must be handled before modelling, either by imputing the missing values or by excluding the affected rows.__
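One way the imputation option could look is sketched below, using the `make_pipeline` and `make_column_transformer` helpers already imported in the first cell. This is a minimal sketch, not the notebook's actual preprocessing: the toy data frame is made up, and the choice of median imputation and of `OneHotEncoder(drop="if_binary")` for `sex` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for train_df (values are made up), with gaps in FRW and CIG.
toy_df = pd.DataFrame({
    "AGE": [50, 52, 48, 55],
    "FRW": [118.0, np.nan, 90.0, 118.0],
    "CIG": [20.0, 20.0, np.nan, 0.0],
    "sex": ["Female", "Male", "Male", "Female"],
})

numeric_cols = ["AGE", "FRW", "CIG"]
categorical_cols = ["sex"]

# Assumption: median imputation, followed by standardisation of numeric columns.
numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

preprocessor = make_column_transformer(
    (numeric_pipe, numeric_cols),
    (OneHotEncoder(drop="if_binary"), categorical_cols),
)

transformed = preprocessor.fit_transform(toy_df)
print(transformed.shape)             # three numeric columns plus one sex indicator
print(np.isnan(transformed).sum())   # no missing values remain
```

Fitting the imputer inside the pipeline means the medians are learned from the training split only, which avoids leaking information from the test split.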

In [5]:
# 3. distribution of target feature
target_chart = alt.Chart(df).mark_bar().encode(
    x='disease:O',
    y=alt.Y('count():Q', axis=alt.Axis(title='Count')),
    text=alt.Text('count():Q')
).properties(
    height=200,
    width=200,
    title="Distribution of Disease Occurrence"
)

target_chart = target_chart + target_chart.mark_text(
    align='center',
    baseline='bottom',
    dy=-5
)

target_chart

__In the full dataset (n = 1,363) there are 1,095 individuals without heart disease and 268 with heart disease, so only about 20% of the sample developed the disease and the classes are imbalanced.__
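The arithmetic behind this imbalance can be made explicit. The sketch below uses only the two counts quoted above; the "always predict no disease" classifier is a hypothetical reference point, not a model from this notebook.

```python
# Counts reported in the text above.
n_no_disease = 1095
n_disease = 268
n_total = n_no_disease + n_disease

# Share of individuals who developed coronary heart disease.
prevalence = n_disease / n_total

# Accuracy of a hypothetical classifier that always predicts "no disease".
majority_baseline_accuracy = n_no_disease / n_total

print(f"n = {n_total}")
print(f"prevalence = {prevalence:.3f}")
print(f"majority-class baseline accuracy = {majority_baseline_accuracy:.3f}")
```

Because a trivial classifier already reaches roughly 80% accuracy, accuracy alone would be a weak yardstick for a model on these data.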

In [6]:
# 4. distribution of numerical features
numerical_features = ['AGE', 'FRW', 'SBP', 'DBP', 'CHOL', 'CIG']

numerical_chart = alt.Chart(train_df).transform_calculate(
    disease_label="datum.disease == 1 ? '1: have heart disease' : '0: do not have heart disease'"
).mark_bar(opacity=0.8).encode(
    alt.X(alt.repeat('repeat'), type='quantitative', bin=alt.Bin(maxbins=20)),
    alt.Y('count()', stack=None),
    color=alt.Color('disease_label:N', legend=alt.Legend(title="Disease Status"))
).properties(
    width=200,
    height=200
).repeat(
    repeat=numerical_features, 
    columns=3
).properties(
    title='Figure 2: Distributions of Age and Health Indicators by Disease Status'
)

numerical_chart

In [7]:
def first_mode(series):
    return series.mode().iloc[0]

mean_df = train_df.groupby('disease')[numerical_features].mean()
median_df = train_df.groupby('disease')[numerical_features].median()
mode_df = train_df.groupby('disease')[numerical_features].agg(first_mode)

combined_df = pd.concat([mean_df, median_df, mode_df], axis=1, keys=['Mean', 'Median', 'Mode'])

combined_df.columns = combined_df.columns.swaplevel(0, 1)
combined_df.sort_index(axis=1, level=0, inplace=True)

combined_df

| Variable | Statistic | disease = 0 | disease = 1 |
|----------|-----------|-------------|-------------|
| AGE      | Mean      | 52.096591   | 53.690476   |
| AGE      | Median    | 52.0        | 54.0        |
| AGE      | Mode      | 45          | 54          |
| CHOL     | Mean      | 232.514773  | 240.733333  |
| CHOL     | Median    | 229.0       | 231.0       |
| CHOL     | Mode      | 200         | 200         |
| CIG      | Mean      | 7.657565    | 9.533333    |
| CIG      | Median    | 0.0         | 0.0         |
| CIG      | Mode      | 0.0         | 0.0         |
| DBP      | Mean      | 89.186364   | 94.938095   |
| DBP      | Median    | 88.0        | 94.0        |
| DBP      | Mode      | 80          | 90          |
| FRW      | Mean      | 104.291096  | 106.845411  |
| FRW      | Median    | 102.0       | 105.0       |
| FRW      | Mode      | 99.0        | 113.0       |
| SBP      | Mean      | 145.720455  | 158.871429  |
| SBP      | Median    | 140.0       | 155.0       |
| SBP      | Mode      | 140         | 150         |


__The histograms reflect the class imbalance between individuals with and without heart disease. The disease-free group skews slightly towards younger ages and lower values of cardiovascular risk factors such as blood pressure and cholesterol. Non-smokers and light smokers fall predominantly in the disease-free category, although this partly reflects the imbalance itself: the median cigarette count is 0 in both groups, while the mean is somewhat higher in the disease group (9.5 vs. 7.7 cigarettes per day).__

__The statistical summary shows that individuals with the disease have slightly higher mean and median age, blood pressure, cholesterol, and relative weight than their disease-free counterparts. The modes follow the same pattern for age (54 vs. 45), SBP (150 vs. 140), DBP (90 vs. 80), and FRW (113 vs. 99), while the modal cholesterol (200) and cigarette count (0) are identical in both groups.__
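The same mean/median/mode summary can also be produced in a single `groupby(...).agg(...)` call. The sketch below is an alternative formulation on made-up toy data, not a replacement for the cell above; the `first_mode` helper mirrors the one defined in the notebook.

```python
import pandas as pd

def first_mode(series):
    # Series.mode() returns all modes in sorted order; keep the smallest one.
    return series.mode().iloc[0]

# Toy stand-in for train_df (values are made up).
toy_df = pd.DataFrame({
    "disease": [0, 0, 0, 1, 1, 1],
    "SBP": [140, 140, 150, 150, 150, 180],
    "DBP": [80, 80, 90, 90, 90, 100],
})

# One pass over the groups; columns become a (variable, statistic) MultiIndex.
summary = toy_df.groupby("disease")[["SBP", "DBP"]].agg(["mean", "median", first_mode])
summary = summary.rename(columns={"first_mode": "mode"}, level=1)

print(summary)
```

Passing a list of aggregations keeps the three statistics for each variable side by side without the `concat`, `swaplevel`, and `sort_index` steps.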

In [8]:
# 5. correlation matrix
correlation_matrix = train_df.select_dtypes(include=['number', 'bool']).corr('spearman')
mask = np.eye(correlation_matrix.shape[0], dtype=bool)
correlation_matrix = correlation_matrix.where(~mask, np.nan)

correlation_matrix.style.background_gradient(cmap='coolwarm')

|         | AGE       | FRW       | SBP       | DBP       | CHOL      | CIG       | disease  |
|---------|-----------|-----------|-----------|-----------|-----------|-----------|----------|
| AGE     |           | 0.092876  | 0.179298  | 0.036477  | 0.093347  | -0.162217 | 0.130144 |
| FRW     | 0.092876  |           | 0.281357  | 0.322311  | 0.072872  | -0.231591 | 0.055237 |
| SBP     | 0.179298  | 0.281357  |           | 0.771926  | 0.131841  | -0.095482 | 0.183792 |
| DBP     | 0.036477  | 0.322311  | 0.771926  |           | 0.111562  | -0.077424 | 0.157544 |
| CHOL    | 0.093347  | 0.072872  | 0.131841  | 0.111562  |           | -0.059533 | 0.050861 |
| CIG     | -0.162217 | -0.231591 | -0.095482 | -0.077424 | -0.059533 |           | 0.053993 |
| disease | 0.130144  | 0.055237  | 0.183792  | 0.157544  | 0.050861  | 0.053993  |          |


__Systolic and diastolic blood pressure (SBP and DBP) are strongly positively correlated (Spearman coefficient about 0.77). Correlations with disease are positive but weak, the largest being SBP (0.18), DBP (0.16), and age (0.13). Cigarette smoking (CIG) is negatively correlated with Framingham relative weight (FRW, -0.23) and age (-0.16), and only negligibly with cholesterol (CHOL, -0.06).__
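For readers less familiar with the `'spearman'` option used above: Spearman's coefficient is Pearson's correlation computed on ranks, so it measures monotonic rather than strictly linear association. The sketch below illustrates this equivalence on synthetic data; the variables `x` and `y` are made up and unrelated to the Framingham columns.

```python
import numpy as np
import pandas as pd

# Synthetic, monotonic but non-linear relationship (not Framingham data).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy_df = pd.DataFrame({
    "x": x,
    "y": np.exp(x) + rng.normal(scale=0.1, size=200),
})

# Spearman correlation as computed by pandas.
spearman = toy_df.corr(method="spearman")

# The same quantity obtained by ranking first, then applying Pearson.
pearson_on_ranks = toy_df.rank().corr(method="pearson")

matches = np.allclose(spearman, pearson_on_ranks)
print(matches)
```

Since `disease` is binary, its ranks contain many ties, which is one reason its coefficients in the matrix above should be read as rough indications of direction rather than precise effect sizes.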

In [9]:
# 6. pairwise scatter plots
base = alt.Chart(train_df).transform_calculate(
    disease_label="datum.disease == 1 ? '1: have heart disease' : '0: do not have heart disease'"
).mark_point(opacity=0.5, size=10)

# Create pairwise scatter plot with independent scales
pairwise_chart = base.encode(
    x=alt.X(alt.repeat("row"), type='quantitative', scale=alt.Scale(zero=False)),
    y=alt.Y(alt.repeat("column"), type='quantitative', scale=alt.Scale(zero=False)),
    color=alt.Color('disease_label:N', legend=alt.Legend(title="Disease Status"))
).properties(
    width=150,
    height=150
).repeat(
    row=numerical_features,
    column=numerical_features
).resolve_scale(
    x='independent', 
    y='independent'
).properties(
    title='Figure 3: Pairwise Scatterplots'
)

pairwise_chart

__Systolic and diastolic blood pressure (SBP and DBP) show a strong linear relationship. Additionally, the color differentiation indicates potential trends between these variables and the presence of heart disease, with some variables like SBP exhibiting clusters that may correlate with higher instances of the disease.__
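The introduction names logistic regression as the modelling approach. As a rough illustration of how a feature such as SBP would enter such a model, the sketch below fits a scaled logistic regression on synthetic data. Everything in it is an assumption for illustration: the data are simulated so that risk rises with SBP by construction, and neither the feature set nor the coefficients are taken from the Framingham analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data only: risk rises with SBP by construction.
rng = np.random.default_rng(123)
n = 500
sbp = rng.normal(148, 28, size=n)
age = rng.integers(45, 63, size=n)
logit = -1.5 + 0.03 * (sbp - 148)
disease = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = pd.DataFrame({"AGE": age, "SBP": sbp})
X_train, X_test, y_train, y_test = train_test_split(
    X, disease, test_size=0.2, random_state=123
)

# Standardise the features, then fit the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Coefficients are on the standardised scale; a positive sign means higher risk.
coefs = dict(zip(X.columns, model[-1].coef_[0]))
probabilities = model.predict_proba(X_test)

print(coefs)
print(probabilities[:3])
```

Standardising first puts the coefficients on a common scale, which makes their relative sizes easier to compare across features measured in different units.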