# Predict Heart Disease 

In [32]:
# import necessary libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

%matplotlib inline 

In [33]:
# get the data 
df_heart = pd.read_csv('../Data/heart_disease.csv')
df_heart.head()

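|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|-----|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |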


In [34]:
# get info
df_heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 297 entries, 0 to 296
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       297 non-null    int64  
 1   sex       297 non-null    int64  
 2   cp        297 non-null    int64  
 3   trestbps  297 non-null    int64  
 4   chol      297 non-null    int64  
 5   fbs       297 non-null    int64  
 6   restecg   297 non-null    int64  
 7   thalach   297 non-null    int64  
 8   exang     297 non-null    int64  
 9   oldpeak   297 non-null    float64
 10  slope     297 non-null    int64  
 11  ca        297 non-null    float64
 12  thal      297 non-null    float64
 13  num       297 non-null    int64  
dtypes: float64(3), int64(11)
memory usage: 32.6 KB




| Column Name | Description | Type | Values (for categorical variables) |
|-------------|-------------|------|-----------------------------------|
| **age** | Age in years | Numerical | - |
| **sex** | Sex | Categorical | 1 = Male, 0 = Female |
| **cp** | Chest pain type | Categorical | 1 = Typical angina<br>2 = Atypical angina<br>3 = Non-anginal pain<br>4 = Asymptomatic |
| **trestbps** | Resting blood pressure (mm Hg) | Numerical | - |
| **chol** | Serum cholesterol in mg/dl | Numerical | - |
| **fbs** | Fasting blood sugar > 120 mg/dl | Categorical | 1 = True<br>0 = False |
| **restecg** | Resting electrocardiographic results | Categorical | 0 = Normal<br>1 = ST-T wave abnormality<br>2 = Left ventricular hypertrophy |
| **thalach** | Maximum heart rate achieved | Numerical | - |
| **exang** | Exercise-induced angina | Categorical | 1 = Yes<br>0 = No |
| **oldpeak** | ST depression induced by exercise | Numerical | - |
| **slope** | Slope of the peak exercise ST segment | Categorical | 1 = Upsloping<br>2 = Flat<br>3 = Downsloping |
| **ca** | Number of major vessels colored by fluoroscopy | Categorical (Ordinal) | 0, 1, 2, 3 |
| **thal** | Thallium stress test result (often labeled "thalassemia" in copies of this dataset) | Categorical | 3 = Normal<br>6 = Fixed defect<br>7 = Reversible defect |
| **num** | Target variable (Presence of heart disease) | Categorical (Binary) | 0 = No disease<br>1 = Disease present |

---

In [35]:
# split the data into train and test set 
X = df_heart.iloc[:,:-1]
y = df_heart.iloc[:,-1]

from sklearn.model_selection import train_test_split

X_train , X_test, y_train , y_test = train_test_split(X, y, 
                                                      test_size=0.2,
                                                      random_state=42)

# note the shape 
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((237, 13), (60, 13), (237,), (60,))

In [36]:
# Baseline Model 
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=2)

# fit with cross_val_score 
scores = cross_val_score(estimator=clf,
                         X=X,
                         y=y,
                         scoring='accuracy',
                         cv=5)

print('Accuracy:', np.round(scores, 2))
print(f'Accuracy mean: {scores.mean():.2f}')

Accuracy: [0.75 0.85 0.71 0.68 0.73]
Accuracy mean: 0.74


### Hyperparameter Tuning using RandomizedSearchCV

`RandomizedSearchCV` is faster than `GridSearchCV` on a large hyperparameter space because it evaluates a fixed number of randomly sampled combinations (`n_iter`) rather than exhaustively trying every one.
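
To make the size difference concrete, here is a minimal sketch that counts the combinations in this notebook's first search space and compares the number of model fits an exhaustive grid search would need with the number the randomized search performs. It uses only the standard library; the names `n_combinations`, `grid_fits`, and `random_fits` are illustrative and not part of scikit-learn.

```python
from math import prod

# Same candidate values as the notebook's first search space.
param_dist = {
    'criterion': ['entropy', 'gini'],
    'splitter': ['random', 'best'],
    'min_weight_fraction_leaf': [0.0, 0.0025, 0.005, 0.0075, 0.01, 0.05],
    'min_samples_split': [2, 3, 4, 5, 6, 8, 10],
    'min_samples_leaf': [1, 0.01, 0.02, 0.03, 0.04],
    'min_impurity_decrease': [0.0, 0.0005, 0.005, 0.05, 0.10, 0.15, 0.2],
    'max_leaf_nodes': [10, 15, 20, 25, 30, 35, 40, 45, 50, None],
    'max_features': [None, 'sqrt', 'log2'],
    'max_depth': [None, 2, 4, 6, 8],
}

# A grid search tries every combination; a randomized search tries n_iter of them.
n_combinations = prod(len(values) for values in param_dist.values())
cv, n_iter = 5, 20

grid_fits = n_combinations * cv    # one fit per combination per fold
random_fits = n_iter * cv          # one fit per sampled candidate per fold

print(f"Combinations in the grid: {n_combinations:,}")
print(f"Fits for an exhaustive grid search: {grid_fits:,}")
print(f"Fits for the randomized search: {random_fits:,}")
```

With these lists the grid holds 882,000 combinations, so an exhaustive search would need 4,410,000 fits against 100 for the randomized search (not counting the final refit of the best candidate).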

In [None]:
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'criterion': ['entropy', 'gini'],
    'splitter': ['random', 'best'],
    'min_weight_fraction_leaf': [0.0, 0.0025, 0.005, 0.0075, 0.01, 0.05],
    'min_samples_split': [2, 3, 4, 5, 6, 8, 10],
    'min_samples_leaf': [1, 0.01, 0.02, 0.03, 0.04],  # 1 is a sample count; floats are fractions of the samples
    'min_impurity_decrease': [0.0, 0.0005, 0.005, 0.05, 0.10, 0.15, 0.2],
    'max_leaf_nodes': [10, 15, 20, 25, 30, 35, 40, 45, 50, None],
    'max_features': [None, 'sqrt', 'log2'],
    'max_depth': [None, 2, 4, 6, 8]
}

clf = DecisionTreeClassifier(random_state=2)

# Randomized search
random_search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=param_dist,
    n_iter=20,  # Number of parameter settings sampled
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',  # Change as per your task
    random_state=2,
    n_jobs=-1  # Use all processors
)

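# Fit on the full dataset
# Note: this includes the rows held out in X_test, so the test score below is optimistic
random_search.fit(X, y)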

In [38]:
# Best parameters and best cross-validated score
# (best_score_ is the mean accuracy over the CV folds, not a training score)
print("Best Parameters:", random_search.best_params_)
print(f"Best CV Score: {random_search.best_score_:.2f}")

Best Parameters: {'splitter': 'best', 'min_weight_fraction_leaf': 0.0075, 'min_samples_split': 2, 'min_samples_leaf': 0.02, 'min_impurity_decrease': 0.0, 'max_leaf_nodes': 10, 'max_features': 'sqrt', 'max_depth': 6, 'criterion': 'entropy'}
Best CV Score: 0.81


In [39]:
# let us now get the test score 
best_model = random_search.best_estimator_

# predict 
y_pred = best_model.predict(X_test)

# get the accuracy 
from sklearn.metrics import accuracy_score
print(f'Test Score: {accuracy_score(y_pred=y_pred, y_true=y_test):.2f}')

Test Score: 0.85


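The test accuracy (0.85) is well above the baseline's cross-validated accuracy (0.74), but it is optimistic: the search was fit on all of `X` and `y`, so the refit best estimator has already seen the rows in `X_test`.
Let's see how a narrower search space performs when the search is fit on the training set only.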


### 🔍 **Strategy Breakdown**

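We're doing three things here:
1. **Keeping defaults that work well**  
   - ✅ *No need to search `criterion`*: the default *gini* is typically slightly faster to compute than *entropy*, and the two usually produce similar trees.
   - ✅ *Skip `min_impurity_decrease`*: it helps only in specific cases, and the default of 0.0 is fine.
   - ✅ *Leave `splitter` at its default*, `'best'`.

2. **Focusing on the most impactful hyperparameters**  
   - 🎯 `max_depth`, `max_leaf_nodes`: control tree complexity.
   - 🎯 `min_samples_leaf`, `min_samples_split`: limit overfitting by requiring more samples per leaf or per split.
   - 🎯 `max_features`: controls how many features are considered at each split.
   - 🎯 `min_weight_fraction_leaf`: like `min_samples_leaf`, but expressed as a fraction of the total sample weight; it mainly helps with class imbalance when sample weights are supplied.

3. **Increasing the number of sampled candidates (`n_iter=100`)**
   - Randomized search benefits from more iterations because each one is a random draw from the search space.
   - 100 candidates is a reasonable budget for a dataset this small, since each tree fits in milliseconds.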

---
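The search space mixes an integer and fractions for `min_samples_leaf`. According to the scikit-learn documentation, a float is read as a fraction of the samples and rounded up. The sketch below converts each candidate into a sample count so the values can be compared. It assumes uniform sample weights and roughly 190 training rows per fold (237 training rows under 5-fold cross-validation); `leaf_minimum` is a hypothetical helper, not a scikit-learn function.

```python
import math

def leaf_minimum(n_samples, min_samples_leaf=1, min_weight_fraction_leaf=0.0):
    """Smallest leaf size (in samples) allowed, assuming every sample has weight 1."""
    if isinstance(min_samples_leaf, float):
        # A float is a fraction of the samples, rounded up.
        by_count = math.ceil(min_samples_leaf * n_samples)
    else:
        by_count = min_samples_leaf
    # With unit weights, a leaf needs at least this many samples to reach the weight fraction.
    by_weight = math.ceil(min_weight_fraction_leaf * n_samples)
    return max(1, by_count, by_weight)

n_fold = 190  # assumption: approximate training rows per CV fold

for value in [1, 0.01, 0.02, 0.03, 0.04]:
    print(f"min_samples_leaf={value}: at least {leaf_minimum(n_fold, value)} samples per leaf")

for fraction in [0.0, 0.0025, 0.005, 0.0075, 0.01, 0.05]:
    needed = leaf_minimum(n_fold, 1, fraction)
    print(f"min_weight_fraction_leaf={fraction}: at least {needed} samples per leaf")
```

Under these assumptions, `min_weight_fraction_leaf=0.05` requires about 10 samples per leaf, more than any `min_samples_leaf` candidate in the list, so whenever it is sampled it becomes the binding leaf-size constraint.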

In [54]:
param_dist = {
    'min_weight_fraction_leaf': [0.0, 0.0025, 0.005, 0.0075, 0.01, 0.05],
    'min_samples_split': [2, 3, 4, 5, 6, 8, 10],
    'min_samples_leaf': [1, 0.01, 0.02, 0.03, 0.04],  # used fractions
    'max_leaf_nodes': [10, 15, 20, 25, 30, 35, 40, 45, 50, None],
    'max_features': [None, 'sqrt', 'log2'],
    'max_depth': [None, 2, 4, 6, 8],
}

clf = DecisionTreeClassifier(random_state=42)

random_search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=param_dist,
    n_iter=100,
    cv=5,
    verbose=1,
    random_state=2,
    n_jobs=-1
)

random_search.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits



### 🌟 What is `n_iter` in `RandomizedSearchCV`?

When you use **RandomizedSearchCV**, instead of trying *every* combination of hyperparameters like `GridSearchCV`, it **randomly samples** a number of parameter combinations from the search space you give it.

The parameter `n_iter` controls **how many combinations** of parameters will be tried.

So:
- `n_iter=10` → Randomly pick 10 parameter sets and test them.
- `n_iter=100` → Randomly pick 100 parameter sets and test them.

The more iterations, the higher the chance of finding a really good combination, but the longer the search takes to run.

---

### 🧩 Example:

If your search space has, say, 300 possible combinations, but you set:
```python
n_iter = 20
```
→ RandomizedSearchCV will randomly sample **only 20 out of 300**.

If you set:
```python
n_iter = 100
```
→ It will sample 100 different sets, and you're more likely to find near-optimal hyperparameters.
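
To put numbers on this example, the sketch below computes the chance that a random sample includes at least one of the best combinations. It assumes sampling without replacement, which is what scikit-learn documents when every parameter is given as a list, and it assumes (purely for illustration) that 15 of the 300 combinations count as near-optimal. `chance_of_hitting_top` is a hypothetical helper.

```python
from math import comb

def chance_of_hitting_top(n_total, n_good, n_iter):
    """Probability that n_iter draws without replacement include at least one good combination."""
    n_iter = min(n_iter, n_total)  # cannot draw more candidates than exist
    # P(miss) = ways to draw only from the non-good combinations / all possible draws
    p_miss = comb(n_total - n_good, n_iter) / comb(n_total, n_iter)
    return 1 - p_miss

for n_iter in (20, 100):
    p = chance_of_hitting_top(n_total=300, n_good=15, n_iter=n_iter)
    print(f"n_iter={n_iter}: {p:.3f}")
```

Under these assumptions, 20 iterations find at least one near-optimal combination roughly two times in three, while 100 iterations do so more than 99% of the time.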

---

### 🎯 In short:

| `n_iter` Value | What Happens |
|---------------|--------------|
| Small (e.g., 10) | Quick, but might miss better hyperparams. |
| Medium (e.g., 50–100) | Good balance of speed and accuracy. |
| Large (e.g., 200+) | Higher chance of finding great hyperparams, but takes longer. |

---

### 💡 Tip:
If you have time and resources, increasing `n_iter` is usually worth it, especially for models like decision trees, which are cheap to fit compared to deep learning models.

In [55]:
results = pd.DataFrame(random_search.cv_results_)
results = results.sort_values(by='mean_test_score', ascending=False)
results


|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_min_weight_fraction_leaf | param_min_samples_split | param_min_samples_leaf | param_max_leaf_nodes | param_max_features | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | 0.003771 | 0.000159 | 0.004499 | 0.004363 | 0.0500 | 8 | 0.03 | 45 |  | 6 | {'min_weight_fraction_leaf': 0.05, 'min_sample... | 0.687500 | 0.8125 | 0.829787 | 0.723404 | 0.787234 | 0.768085 | 0.054105 | 1 |
| 90 | 0.003769 | 0.000210 | 0.002363 | 0.000183 | 0.0500 | 6 | 1.00 | 50 |  | 4 | {'min_weight_fraction_leaf': 0.05, 'min_sample... | 0.687500 | 0.8125 | 0.829787 | 0.723404 | 0.787234 | 0.768085 | 0.054105 | 1 |
| 88 | 0.004108 | 0.000411 | 0.002376 | 0.000206 | 0.0500 | 8 | 0.04 | 45 |  |  | {'min_weight_fraction_leaf': 0.05, 'min_sample... | 0.687500 | 0.8125 | 0.829787 | 0.723404 | 0.787234 | 0.768085 | 0.054105 | 1 |
| 52 | 0.004019 | 0.000430 | 0.002369 | 0.000136 | 0.0500 | 8 | 1.00 | 30 |  | 6 | {'min_weight_fraction_leaf': 0.05, 'min_sample... | 0.687500 | 0.8125 | 0.829787 | 0.723404 | 0.787234 | 0.768085 | 0.054105 | 1 |
| 70 | 0.004211 | 0.000476 | 0.002584 | 0.000188 | 0.0500 | 2 | 0.01 | 35 |  | 8 | {'min_weight_fraction_leaf': 0.05, 'min_sample... | 0.687500 | 0.8125 | 0.829787 | 0.723404 | 0.787234 | 0.768085 | 0.054105 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37 | 0.003285 | 0.001361 | 0.002000 | 0.000618 | 0.0025 | 3 | 0.02 | 35 | sqrt | 2 | {'min_weight_fraction_leaf': 0.0025, 'min_samp... | 0.645833 | 0.7500 | 0.595745 | 0.595745 | 0.702128 | 0.657890 | 0.060516 | 90 |
| 83 | 0.005498 | 0.003724 | 0.002272 | 0.000087 | 0.0100 | 6 | 0.02 | 45 | sqrt | 2 | {'min_weight_fraction_leaf': 0.01, 'min_sample... | 0.645833 | 0.7500 | 0.595745 | 0.595745 | 0.702128 | 0.657890 | 0.060516 | 90 |
| 68 | 0.003382 | 0.000115 | 0.002243 | 0.000057 | 0.0075 | 5 | 1.00 | 50 | sqrt | 2 | {'min_weight_fraction_leaf': 0.0075, 'min_samp... | 0.645833 | 0.7500 | 0.595745 | 0.595745 | 0.702128 | 0.657890 | 0.060516 | 90 |
| 85 | 0.003539 | 0.000305 | 0.003538 | 0.001527 | 0.0050 | 2 | 0.03 | 20 | log2 | 2 | {'min_weight_fraction_leaf': 0.005, 'min_sampl... | 0.645833 | 0.7500 | 0.595745 | 0.595745 | 0.702128 | 0.657890 | 0.060516 | 90 |


In [56]:
# Best parameters and best cross-validated score
# (best_score_ is the mean accuracy over the CV folds, not a training score)
print("Best Parameters:", random_search.best_params_)
print(f"Best CV Score: {random_search.best_score_:.2f}")

Best Parameters: {'min_weight_fraction_leaf': 0.05, 'min_samples_split': 8, 'min_samples_leaf': 0.03, 'max_leaf_nodes': 45, 'max_features': None, 'max_depth': 6}
Best CV Score: 0.77


In [57]:
# let us now get the test score 
best_model = random_search.best_estimator_

# predict 
y_pred = best_model.predict(X_test)

# get the accuracy 
from sklearn.metrics import accuracy_score
print(f'Test Score: {accuracy_score(y_pred=y_pred, y_true=y_test):.2f}')

Test Score: 0.80


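Both scores are lower than in the first search (0.77 vs. 0.81 cross-validated, 0.80 vs. 0.85 on the test set). The first search, however, was fit on all of `X`, so its best estimator had already seen the test rows. This search used only `X_train`, which makes 0.80 the more trustworthy estimate of performance on unseen data.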

In [58]:
# compare the baseline model with this new best found model 

clf = random_search.best_estimator_

# fit with cross_val_score 
scores = cross_val_score(estimator=clf,
                         X=X,
                         y=y,
                         scoring='accuracy',
                         cv=5)

print('Accuracy:', np.round(scores, 2))
print(f'Accuracy mean: {scores.mean():.2f}')

Accuracy: [0.82 0.92 0.78 0.75 0.8 ]
Accuracy mean: 0.81


This is about seven percentage points higher than the baseline model (0.81 vs. 0.74). When it comes to
predicting heart disease, more accuracy can save lives.
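
The gap can be recomputed from the rounded fold scores printed above. This minimal sketch uses only the standard library; the list names are illustrative, and because the inputs are rounded to two decimals the result is approximate.

```python
from statistics import mean

# Fold accuracies copied from the two cross_val_score outputs in this notebook.
baseline_scores = [0.75, 0.85, 0.71, 0.68, 0.73]
tuned_scores = [0.82, 0.92, 0.78, 0.75, 0.80]

# Difference of the mean accuracies, as a fraction (0.01 = one percentage point).
gap = mean(tuned_scores) - mean(baseline_scores)

# Fold-by-fold improvement, rounded to hide floating-point noise.
fold_gains = [round(t - b, 2) for t, b in zip(tuned_scores, baseline_scores)]

print(f"Baseline mean: {mean(baseline_scores):.3f}")
print(f"Tuned mean:    {mean(tuned_scores):.3f}")
print(f"Gap: {gap * 100:.0f} percentage points")
print("Per-fold gains:", fold_gains)
```

The tuned tree gains about seven points on every fold, so the improvement does not come from a single lucky split.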

### 🌳 What is feature_importances_?
In decision trees (and other tree-based models such as random forests and XGBoost), feature importance measures how much each feature contributes to improving the model’s splits.

- Features that reduce impurity (Gini or entropy) more are given higher importance.

- The importances are normalized to sum to 1, so each value reads as a share of the total impurity reduction.

- It basically tells you:
  - 👉 "Hey, this feature helped the tree the most in making decisions!"
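
As a minimal sketch of the idea, the code below computes the weighted Gini impurity decrease for a single split on made-up labels, then normalizes per-feature totals so they sum to 1. That mirrors how scikit-learn documents `feature_importances_` (the normalized total reduction of the criterion brought by each feature). The helpers `gini` and `impurity_decrease`, the toy labels, and the per-feature totals are all hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease of one split, scaled by the node's share of all samples."""
    n_node = len(parent)
    children = (len(left) / n_node) * gini(left) + (len(right) / n_node) * gini(right)
    return (n_node / n_total) * (gini(parent) - children)

# Toy split at the root: 8 samples, the right child ends up pure.
parent = [0, 0, 0, 1, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1, 1, 1]
decrease = impurity_decrease(parent, left, right, n_total=len(parent))
print(f"Impurity decrease of this split: {decrease:.5f}")

# Sum the decreases per feature over all splits, then normalize (toy totals).
totals = {"feature_a": decrease, "feature_b": 0.09375}
total = sum(totals.values())
importances = {name: value / total for name, value in totals.items()}
print(importances)
```

A feature that is never used in a split contributes no decrease, which is why several columns in the output below have an importance of exactly 0.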



In [59]:
best_model.feature_importances_

array([0.05957229, 0.        , 0.20598782, 0.00205867, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.10061929,
       0.        , 0.4866227 , 0.14513923])

In [60]:
# let us attach to each column its importance 

feature_dict = dict(
    zip(
        X.columns,
        best_model.feature_importances_
    )
)

feature_dict

{'age': np.float64(0.0595722916820607),
 'sex': np.float64(0.0),
 'cp': np.float64(0.20598782237690966),
 'trestbps': np.float64(0.0020586672430379094),
 'chol': np.float64(0.0),
 'fbs': np.float64(0.0),
 'restecg': np.float64(0.0),
 'thalach': np.float64(0.0),
 'exang': np.float64(0.0),
 'oldpeak': np.float64(0.10061928806373142),
 'slope': np.float64(0.0),
 'ca': np.float64(0.4866226982065329),
 'thal': np.float64(0.14513923242772733)}

In [61]:
# better visual
# Create a DataFrame for better visualization
importances = best_model.feature_importances_
feature_names = X.columns

feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

feature_importance_df

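|    | Feature | Importance |
|----|---------|------------|
| 11 | ca | 0.486623 |
| 2 | cp | 0.205988 |
| 12 | thal | 0.145139 |
| 9 | oldpeak | 0.100619 |
| 0 | age | 0.059572 |
| 3 | trestbps | 0.002059 |
| 1 | sex | 0.0 |
| 4 | chol | 0.0 |
| 5 | fbs | 0.0 |
| 8 | exang | 0.0 |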


In [65]:
# most important features (importance of at least 0.1)
feature_importance_df.loc[feature_importance_df['Importance'] >= 0.1, 'Feature']

11         ca
2          cp
12       thal
9     oldpeak
Name: Feature, dtype: object

Here are the top four features by importance:

| Rank | Feature  | Importance Score |
|------|-----------|-----------------|
| 1    | **ca** (number of major vessels colored by fluoroscopy) | 0.486623 |
| 2    | **cp** (chest pain type) | 0.205988 |
| 3    | **thal** (thallium stress test result) | 0.145139 |
| 4    | **oldpeak** (ST depression induced by exercise) | 0.100619 |

Each one is explained below, with both a **layman-friendly** explanation and a **technical** interpretation:

---

### 1. **ca**: Number of major vessels colored by fluoroscopy

- **What is it?**  
  This measures the number of blood vessels around the heart that show narrowing or blockages under imaging tests.
  
- **Why is it important?**  
  The more vessels that are narrowed, the higher the risk of heart disease.  
  It makes sense that this is the model's **#1 feature**, because clogged arteries are a direct indicator of heart problems.

- **Model perspective:**  
  High values of `ca` probably push the prediction strongly toward “presence of heart disease.”

---

### 2. **cp**: Chest pain type

- **What is it?**  
  Chest pain types:
  - **1:** Typical angina (most serious)
  - **2:** Atypical angina
  - **3:** Non-anginal pain
  - **4:** Asymptomatic (no pain)

- **Why is it important?**  
  The kind of chest pain gives an early signal of possible heart issues. Typical angina is strongly linked to heart disease, while asymptomatic might show up in some cases due to silent ischemia.

- **Model perspective:**  
  Different chest pain types help the tree split early, providing clarity in risk level.

---

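### 3. **thal**: Thallium stress test result

- **What is it?**  
  Although this column is often described as "thalassemia" (a blood disorder), its coded values describe the result of a thallium stress test, a nuclear imaging scan of blood flow to the heart muscle. In this dataset:
  - **3:** Normal
  - **6:** Fixed defect (no blood flow in some parts)
  - **7:** Reversible defect (reduced blood flow, but can improve)

- **Why is it important?**  
  A defect means part of the heart muscle receives too little blood. A fixed defect typically points to permanent damage, while a reversible defect typically points to reduced blood flow that appears under stress.

- **Model perspective:**  
  Abnormal values (6 or 7) give the tree additional evidence of impaired blood supply to the heart.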

---

### 4. **oldpeak**: ST depression induced by exercise

- **What is it?**  
  Difference between the resting and peak exercise ST segment.
  - Higher `oldpeak` means more stress-induced ischemia.

- **Why is it important?**  
  It reflects how the heart responds under stress. A higher value often indicates worse prognosis.

- **Model perspective:**  
  Crucial for assessing functional heart capacity under load.

---

### 📊 Summary Table:

| Feature | Meaning | Why Important |
|---------|----------|---------------|
| ca | Blocked blood vessels | Direct measure of blockage severity |
| cp | Chest pain type | Symptom severity indicator |
| thal | Blood flow defect | Detects abnormal blood supply |
| oldpeak | Stress test result | Indicates heart's response to exercise |

---

### 🔥 My take:
Our model is behaving as expected: it prioritizes **direct indicators of heart disease** (blocked vessels, pain, blood flow issues).  
This is a good sign! ✅