
# 10-Year Cardiovascular Disease (CVD) Risk Prediction  
## Inference and Explainability Pipeline

This notebook implements the **inference, explainability, and risk interpretation pipeline**
for the trained cardiovascular disease risk prediction models.

The notebook is structured to:
- Load trained models and artifacts
- Perform dataset-level predictions
- Perform individual-level predictions
- Provide model explainability using SHAP
- Generate actionable risk-reduction insights using controlled what-if analysis
- Download model artifacts and datasets from GitHub repository links

**Note:** This system is intended for analytical and decision-support purposes only.



## 1. Required Libraries

The following libraries are used for:
- Numerical computation
- Data handling
- Model loading
- Explainability analysis


In [24]:

import numpy as np
import pandas as pd
import joblib
import shap
import os



## 2. Load Trained Models and Supporting Artifacts

This section loads:
- Two trained models (Model-A and Model-B)
- Corresponding feature lists
- Optimized decision thresholds
- SHAP explainers for tree-based models

Model-A uses the full medical feature set;  
Model-B uses a reduced feature set optimized for fast inference.


In [25]:



os.makedirs('/content/project', exist_ok=True)
%cd /content/project



urls = [
    "https://github.com/chetan-j123/Heart_decision_prediction_project/raw/master/model_A.pkl",
    "https://github.com/chetan-j123/Heart_decision_prediction_project/raw/master/model_B.pkl",
    "https://github.com/chetan-j123/Heart_decision_prediction_project/raw/master/model_A_threshold.pkl",
    "https://github.com/chetan-j123/Heart_decision_prediction_project/raw/master/model_B_threshold.pkl",
    "https://github.com/chetan-j123/Heart_decision_prediction_project/raw/master/model_a_features.pkl",
    "https://github.com/chetan-j123/Heart_decision_prediction_project/raw/master/model_b_features.pkl"
]

for url in urls:
    file_name = url.split('/')[-1]
    !wget -q {url} -O {file_name}


model_A = joblib.load("model_A.pkl")
model_B = joblib.load("model_B.pkl")

FEATURES_A = joblib.load("model_a_features.pkl")
FEATURES_B = joblib.load("model_b_features.pkl")

THRESHOLD_A = joblib.load("model_A_threshold.pkl")
THRESHOLD_B = joblib.load("model_B_threshold.pkl")

explainer_A = shap.TreeExplainer(model_A.estimator)
explainer_B = shap.TreeExplainer(model_B.estimator)

print("\n✅ All models, features, and explainers loaded successfully!")


/content/project

✅ All models, features, and explainers loaded successfully!
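
Because `wget -q` suppresses its own error messages, a failed download can leave a missing or zero-byte file, and the problem then only surfaces as a less obvious error inside `joblib.load`. A minimal sketch of a pre-load check follows; `missing_or_empty` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path

def missing_or_empty(file_names, directory="."):
    """Return the names that are absent or zero bytes long in `directory`."""
    base = Path(directory)
    problems = []
    for name in file_names:
        path = base / name
        # A failed quiet download can leave no file or an empty one.
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems
```

In the notebook this could run between the download loop and the first `joblib.load`, for example as `missing_or_empty([u.split('/')[-1] for u in urls])`, stopping early if the returned list is not empty.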





## 3. Hackathon Dataset Preprocessing

The hackathon dataset uses feature representations that differ from
the training dataset.

This preprocessing function:
- Aligns feature names with the training schema
- Converts categorical encodings to numerical form
- Derives clinical indicators such as BMI
- Ensures compatibility with Model-B


In [26]:

def preprocess_hackathon_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

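    # The hackathon data encodes gender as 1/2; 2 is treated as male here.
    df["male"] = (df["gender"] == 2).astype(int)
    # BMI = weight (kg) / height (m)^2; height is stored in centimetres.
    df["BMI"] = df["weight"] / ((df["height"] / 100) ** 2)

    # Rename blood-pressure columns to the training schema.
    df["sysBP"] = df["ap_hi"]
    df["diaBP"] = df["ap_lo"]

    # Replace ordinal levels (1-3) with representative numeric values.
    # Any code outside 1-3 becomes NaN.
    df["totChol"] = df["cholesterol"].map({1: 180, 2: 220, 3: 260})
    df["glucose"] = df["gluc"].map({1: 90, 2: 120, 3: 160})

    df["currentSmoker"] = df["smoke"]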

    return df[FEATURES_B]
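
One thing worth checking in this step is that `Series.map` with a dictionary returns `NaN` for any value the dictionary does not cover, so an unexpected category code would silently turn into a missing feature. The sketch below shows one way to surface such codes before prediction; `unmapped_codes` is a hypothetical helper and the toy rows are made up for illustration.

```python
import pandas as pd

# Same cholesterol mapping as in preprocess_hackathon_df.
CHOL_MAP = {1: 180, 2: 220, 3: 260}

def unmapped_codes(series: pd.Series, mapping: dict) -> list:
    """Return the sorted distinct values of `series` that `mapping` does not cover."""
    present = series.dropna().unique().tolist()
    return sorted(v for v in present if v not in mapping)

# Made-up rows; the code 4 is deliberately outside the mapping.
toy = pd.DataFrame({"cholesterol": [1, 3, 4, 2]})

print(unmapped_codes(toy["cholesterol"], CHOL_MAP))      # [4]
print(int(toy["cholesterol"].map(CHOL_MAP).isna().sum()))  # 1
```

The same check applies to the `gluc` column with its own mapping.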



## 4. Feature Constraints for What-If Analysis

Only modifiable clinical and lifestyle-related features are included
in what-if simulations.

Immutable attributes such as age and biological sex are excluded.


In [27]:

# Target values for the what-if simulation (single targets, not ranges).
FEATURE_BOUNDS = {
    "BMI": 18,
    "sysBP": 120,
    "diaBP": 80,
    "totChol": 180,
    "glucose": 100,
    "currentSmoker": 0
}
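
Since each entry is a single target value, a what-if only reads as a reduction when the person's current value is above that target. The following sketch, under that assumption, selects the modifiable features for which a reduction scenario makes sense; `features_above_target` is a hypothetical helper and the example person is made up.

```python
# Copy of the notebook's targets so the sketch runs on its own.
FEATURE_BOUNDS = {
    "BMI": 18, "sysBP": 120, "diaBP": 80,
    "totChol": 180, "glucose": 100, "currentSmoker": 0,
}

def features_above_target(values):
    """Return {feature: (current, target)} for modifiable features above target."""
    return {
        name: (values[name], target)
        for name, target in FEATURE_BOUNDS.items()
        if name in values and values[name] > target
    }

# Made-up values: age is immutable, BMI is already below its target.
person = {"age": 54, "BMI": 17.5, "sysBP": 142, "currentSmoker": 1}
print(features_above_target(person))
# → {'sysBP': (142, 120), 'currentSmoker': (1, 0)}
```

Immutable attributes such as `age` drop out automatically because they have no entry in `FEATURE_BOUNDS`.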



## 5. Risk Prediction and SHAP-Based Explainability

This function:
- Computes the predicted probability of 10-year CVD risk
- Applies the model-specific threshold
- Extracts SHAP values
- Identifies the most influential features contributing to risk


In [28]:

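def predict_with_explain(model, explainer, X_user, threshold, top_k=5):
    """Return (risk, label, top_contrib) for a single-row DataFrame.

    top_contrib maps the top_k features, ranked by absolute SHAP value,
    to their signed SHAP contributions.
    """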
    risk = model.predict_proba(X_user)[0, 1]
    label = int(risk >= threshold)

    shap_vals = explainer.shap_values(X_user)[0]
    contributions = dict(zip(X_user.columns, shap_vals))

    contributions = dict(
        sorted(contributions.items(),
               key=lambda x: abs(x[1]),
               reverse=True)
    )

    top_contrib = dict(list(contributions.items())[:top_k])

    return risk, label, top_contrib
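
Indexing `shap_values(X_user)[0]` assumes the explainer returns a 2-D array of shape `(n_samples, n_features)`. Depending on the shap version and the wrapped model, tree explainers for classifiers may instead return a list with one array per class, or a 3-D array with a trailing class axis. A minimal sketch of a normalizing helper follows; `positive_class_row` is a hypothetical name, not part of shap, and which layout the notebook's models actually produce is not stated here.

```python
import numpy as np

def positive_class_row(shap_output, row=0, positive_class=1):
    """Return a 1-D array of per-feature SHAP values for one sample."""
    if isinstance(shap_output, list):
        # Layout A: one (n_samples, n_features) array per class.
        values = np.asarray(shap_output[positive_class])
    else:
        values = np.asarray(shap_output)
        if values.ndim == 3:
            # Layout B: (n_samples, n_features, n_classes).
            values = values[:, :, positive_class]
        # Layout C: already (n_samples, n_features); nothing to do.
    return values[row]
```

With such a helper, the result can be zipped with `X_user.columns` in the same way as in `predict_with_explain`, regardless of which layout was returned.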



## 6. Controlled What-If Risk Simulation

This analysis estimates potential risk reduction by adjusting
a single modifiable feature to a clinically reasonable target value,
while keeping all other features constant.


In [29]:

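def what_if_analysis(model, X_user, feature):
    """Estimate the risk change when one modifiable feature is set to its target.

    Returns None for features without an entry in FEATURE_BOUNDS.
    """
    if feature not in FEATURE_BOUNDS:
        return None

    # Defensive guard: immutable attributes are never simulated,
    # even if they are added to FEATURE_BOUNDS by mistake.
    if feature in ["age", "male"]:
        return None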

    current = float(X_user[feature].values[0])
    target = FEATURE_BOUNDS[feature]

    X_new = X_user.copy()
    X_new[feature] = target

    old_risk = model.predict_proba(X_user)[0, 1]
    new_risk = model.predict_proba(X_new)[0, 1]

    return {
        "feature": feature,
        "current": current,
        "target": target,
        # Absolute difference in predicted probability, in percentage points
        "risk_reduction_%": round((old_risk - new_risk) * 100, 2)
    }
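
The value returned as `risk_reduction_%` is an absolute difference between two probabilities, expressed in percentage points. It differs from a relative reduction, which divides that difference by the original risk. The short sketch below shows both quantities side by side; `risk_change` is a hypothetical helper and the probabilities are made-up numbers.

```python
def risk_change(old_risk, new_risk):
    """Return (absolute reduction in percentage points, relative reduction in %)."""
    absolute_pp = (old_risk - new_risk) * 100
    # Relative reduction is undefined for a zero baseline; report 0.0 in that case.
    relative_pct = (old_risk - new_risk) / old_risk * 100 if old_risk > 0 else 0.0
    return round(absolute_pp, 2), round(relative_pct, 2)

# A drop from 20% to 15% predicted risk:
print(risk_change(0.20, 0.15))  # (5.0, 25.0)
```

Here the same change is 5 percentage points in absolute terms and 25% in relative terms, so the unit should be stated whenever the number is reported.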



## 7. Dataset-Level Prediction (Model-B)

This section generates predicted 10-year CVD risk scores
for the full hackathon dataset and exports the results.


In [30]:
# Download link for the hackathon dataset
csv_url = "https://github.com/chetan-j123/Heart_decision_prediction_project/raw/master/cardiac_failure_processed.csv"

print("Downloading processed dataset...")
!wget -q {csv_url} -O cardiac_failure_processed.csv

hackathon_df = pd.read_csv("cardiac_failure_processed.csv")


X_hack = preprocess_hackathon_df(hackathon_df)

hackathon_risk = model_B.predict_proba(X_hack)[:, 1]
hackathon_df["CVD_10yr_risk_%"] = (hackathon_risk * 100).round(2)

hackathon_df.to_csv("predicted_cvd_for_hackathon_dataset.csv", index=False)

print("predicted_cvd_for_hackathon_dataset.csv saved successfully")


Downloading processed dataset...
predicted_cvd_for_hackathon_dataset.csv saved successfully
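
The export above contains probabilities only. If a binary category is also wanted at dataset level, the thresholding rule from `predict_with_explain` (`risk >= threshold`) can be applied to the whole probability vector at once. This is a sketch under that assumption; `label_risks` is a hypothetical helper, and the threshold 0.35 is a placeholder, not the value stored in `model_B_threshold.pkl`.

```python
import numpy as np

def label_risks(probabilities, threshold):
    """Return 1 where probability >= threshold, else 0, as an integer array."""
    probabilities = np.asarray(probabilities, dtype=float)
    return (probabilities >= threshold).astype(int)

# Made-up probabilities and a placeholder threshold.
print(label_risks([0.10, 0.35, 0.80], 0.35).tolist())  # [0, 1, 1]
```

In the notebook, the call would take `hackathon_risk` and `THRESHOLD_B`, and the result could be stored in an additional column before `to_csv`.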



## 8. Individual-Level Risk Prediction

This section allows inference for a single individual by
collecting feature values, generating a risk prediction,
and providing interpretability insights.


In [None]:

def yes_no_input(prompt):
    while True:
        v = input(prompt + " (yes/no): ").strip().lower()
        if v in ["yes", "y"]:
            return 1
        if v in ["no", "n"]:
            return 0
        print("Please type yes or no.")

print("Select model:")
print("1 → Model-B (reduced feature set)")
print("2 → Model-A (full feature set)")

choice = input("Enter choice (1 or 2): ").strip()

if choice == "1":
    FEATURES = FEATURES_B
    model = model_B
    explainer = explainer_B
    threshold = THRESHOLD_B
else:
    FEATURES = FEATURES_A
    model = model_A
    explainer = explainer_A
    threshold = THRESHOLD_A

user_data = {}
for f in FEATURES:
    if f in ["male", "currentSmoker"]:
        user_data[f] = yes_no_input(f"Is {f}?")
    else:
        user_data[f] = float(input(f"Enter {f}: "))

X_user = pd.DataFrame([user_data])

risk, label, contrib = predict_with_explain(model, explainer, X_user, threshold)

print("\n10-Year CVD Risk (%):", round(risk * 100, 2))
print("Risk Category:", "High Risk" if label else "Low Risk")

print("\nPrimary contributing factors:")
for k, v in contrib.items():
    print(f"{k}: {'increases risk' if v > 0 else 'reduces risk'}")

print("\nPotential risk reduction scenarios:")
for feat, shap_val in contrib.items():
    if shap_val > 0:
        info = what_if_analysis(model, X_user, feat)
        if info and info["risk_reduction_%"] > 0:
            print(
                f"{feat}: moving from {info['current']} "
                f"to {info['target']} "
                f"may reduce predicted risk by approximately "
                f"{info['risk_reduction_%']} percentage points"
            )


Select model:
1 → Model-B (reduced feature set)
2 → Model-A (full feature set)
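
The numeric prompts in this cell call `float(input(...))` directly, so a typo raises `ValueError` and ends the cell. A validating parser in the spirit of `yes_no_input` could be sketched as follows; `parse_number` and its optional range arguments are hypothetical, and any clinical ranges passed to it would need to be chosen by the user.

```python
import math

def parse_number(text, low=None, high=None):
    """Return `text` as a finite float within [low, high], or None if invalid."""
    try:
        value = float(text.strip())
    except ValueError:
        return None
    if not math.isfinite(value):
        return None  # rejects "nan" and "inf", which float() accepts
    if low is not None and value < low:
        return None
    if high is not None and value > high:
        return None
    return value

print(parse_number(" 128 "))          # 128.0
print(parse_number("12o"))            # None
print(parse_number("500", high=300))  # None
```

A prompt loop can then repeat the question until `parse_number` returns something other than `None`.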
