# Heart Disease UCI

In this notebook, we illustrate black-box model explanation with the medical [Heart Disease UCI](https://www.kaggle.com/ronitf/heart-disease-uci) dataset. There are thirteen features:

* `age`
* `sex`
* `cp`: chest pain type (4 values)
* `trestbps`: resting blood pressure
* `chol`: serum cholesterol in mg/dl
* `fbs`: fasting blood sugar > 120 mg/dl
* `restecg`: resting electrocardiographic results (values 0,1,2)
* `thalach`: maximum heart rate achieved
* `exang`: exercise induced angina
* `oldpeak`: ST depression induced by exercise relative to rest
* `slope`: the slope of the peak exercise ST segment
* `ca`: number of major vessels (0-3) colored by fluoroscopy
* `thal`: 3 = normal; 6 = fixed defect; 7 = reversible defect

The target is the presence (`True`) or absence (`False`) of heart disease.

In [14]:
import ethik

X, y = ethik.datasets.load_heart_disease()
X.head()

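|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |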


In [15]:
y.head()

0    True
1    True
2    True
3    True
4    True
Name: has_heart_disease, dtype: bool

In [16]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, shuffle=True, random_state=42)
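
The call above does not pass `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows, rounded up. The sketch below reproduces that arithmetic. The row count of 303 is an assumption taken from the Kaggle version of the dataset, not something this notebook prints, and `default_split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def default_split_sizes(n_samples, test_fraction=0.25):
    """Return (n_train, n_test) when the test share is a fraction rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    n_train = n_samples - n_test
    return n_train, n_test


# Assumption: the dataset has 303 rows, as in the Kaggle file.
n_train, n_test = default_split_sizes(303)
print(f"train rows: {n_train}, test rows: {n_test}")
```

Under that assumption the test set would hold 76 patients, which would be consistent with the accuracy of 0.8684 reported further down (66 correct predictions out of 76).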

The goal of this notebook is to illustrate explainability, so the choice of model is arbitrary: we train a gradient-boosted tree classifier using [LightGBM](https://lightgbm.readthedocs.io/en/latest/).

In [17]:
import lightgbm as lgb
import pandas as pd

model = lgb.LGBMClassifier(random_state=42).fit(X_train, y_train)

y_pred = model.predict_proba(X_test)[:, 1]
# We use a named pandas series to make plot labels more explicit
y_pred = pd.Series(y_pred, name='has_heart_disease')
y_pred.head()

0    0.006718
1    0.050039
2    0.883217
3    0.072895
4    0.921450
Name: has_heart_disease, dtype: float64

In [18]:
from sklearn import metrics

# `y_test` holds class labels while `y_pred` holds probabilities, so we
# threshold `y_pred` at 0.5 before calling `metrics.accuracy_score`.
print(f'Accuracy score: {metrics.accuracy_score(y_test, y_pred > 0.5):.4f}')

Accuracy score: 0.8684
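
To make the thresholding step explicit, here is a minimal pure-Python sketch of what `metrics.accuracy_score(y_test, y_pred > 0.5)` measures. The probabilities are the first five values of `y_pred` shown earlier. The true labels are hypothetical, chosen only for illustration, and `accuracy_at_threshold` is an illustrative helper rather than a scikit-learn function.

```python
def accuracy_at_threshold(y_true, y_proba, threshold=0.5):
    """Share of samples whose thresholded probability matches the true label."""
    y_hat = [p > threshold for p in y_proba]
    n_correct = sum(t == h for t, h in zip(y_true, y_hat))
    return n_correct / len(y_true)


# First five predicted probabilities from the notebook output.
example_proba = [0.006718, 0.050039, 0.883217, 0.072895, 0.921450]
# Hypothetical labels (NOT the real `y_test` values).
example_true = [False, False, True, True, True]

print(accuracy_at_threshold(example_true, example_proba))  # 0.8
```

The fourth patient is predicted negative (0.072895 is below 0.5) but labelled positive in this made-up example, hence four correct predictions out of five.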


Let's plot the ten features that have the most impact on the predictions:

In [19]:
explainer = ethik.ClassificationExplainer()
explainer.plot_bias_ranking(
    X_test=X_test,
    y_pred=y_pred,
    n_features=10,
)

100%|██████████| 1150/1150 [00:00<00:00, 1179.08it/s]


The maximum heart rate achieved (`thalach`) is the feature with the most impact on the predicted probability of heart disease. Let's look at it in detail:

In [20]:
explainer.plot_bias(
    X_test=X_test["thalach"],
    y_pred=y_pred,
)

100%|██████████| 41/41 [00:00<00:00, 1173.46it/s]
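
One way to read such a curve: for each target mean of `thalach`, what would the average predicted probability be if the test set were reweighted so that `thalach` had that mean? The sketch below illustrates this idea with exponential weights proportional to `exp(xi * x)`, which is our understanding of the reweighting scheme described in the ethik documentation. It is not ethik's implementation: `tilt_weights` is a hypothetical helper, and the heart rates and probabilities are made-up values.

```python
import math


def tilt_weights(x, target_mean, tol=1e-10, max_iter=200):
    """Weights proportional to exp(xi * x_i) whose weighted mean of x is target_mean."""
    lo_x, hi_x = min(x), max(x)
    if not lo_x < target_mean < hi_x:
        raise ValueError("target_mean must lie strictly between min(x) and max(x)")
    mean = sum(x) / len(x)
    scale = hi_x - lo_x
    # Centre and scale the feature for numerical stability.
    z = [(v - mean) / scale for v in x]
    target_z = (target_mean - mean) / scale

    def weights(xi):
        shift = max(xi * v for v in z)  # avoids overflow in exp
        w = [math.exp(xi * v - shift) for v in z]
        total = sum(w)
        return [wi / total for wi in w]

    def tilted_mean(xi):
        return sum(wi * v for wi, v in zip(weights(xi), z))

    # The tilted mean increases with xi, so bracket the root, then bisect.
    lo, hi = -1.0, 1.0
    while tilted_mean(lo) > target_z:
        lo *= 2
    while tilted_mean(hi) < target_z:
        hi *= 2
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        if tilted_mean(mid) < target_z:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return weights((lo + hi) / 2)


# Hypothetical data: maximum heart rates and predicted probabilities.
heart_rates = [120, 135, 150, 160, 172, 187]
probas = [0.10, 0.25, 0.40, 0.60, 0.75, 0.90]

curve = {}
for target in (140, 154, 170):
    w = tilt_weights(heart_rates, target)
    curve[target] = sum(wi * p for wi, p in zip(w, probas))
    print(target, round(curve[target], 3))
```

In this toy example the probabilities rise with the heart rate, so the reweighted average prediction rises as the target mean moves from 140 to 170. When the target equals the observed mean (154 here), the weights are uniform and the plain average prediction is recovered.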


In [21]:
explainer.plot_bias(
    X_test=X_test["oldpeak"],
    y_pred=y_pred,
)

100%|██████████| 41/41 [00:00<00:00, 1168.21it/s]


In [22]:
explainer.plot_bias(
    X_test=X_test["thal"],
    y_pred=y_pred,
)

100%|██████████| 144/144 [00:00<00:00, 1130.84it/s]


In [23]:
explainer.plot_bias(
    X_test=X_test["cp"],
    y_pred=y_pred,
)

100%|██████████| 164/164 [00:00<00:00, 1164.86it/s]


In [24]:
explainer.plot_bias(
    X_test=X_test[["thalach", "oldpeak", "thal"]],
    y_pred=y_pred,
)

100%|██████████| 226/226 [00:00<00:00, 1184.47it/s]
