# Notebook 3: Data Cleaning & Train/Test Preparation

---

## 🎯 Objectives
In this notebook, we will take the raw dataset from **Notebook 1** and the insights from **Notebook 2** to prepare a **clean, encoded, and split dataset** ready for modelling.  
We will:
- Address missing values and invalid entries.
- Apply categorical encoding where necessary.
- Create train/test datasets for modelling.
- Save cleaned datasets for later use.

---

## 📂 Inputs & Outputs

**Inputs**
- `heart_disease_uci.csv` (raw dataset saved from Notebook 1)

**Outputs**
- `TrainSetCleaned.csv` (clean training data)
- `TestSetCleaned.csv` (clean test data)

---

## Load Raw Dataset

**Step Purpose**

We begin by loading the unmodified dataset saved in **Notebook 1**. This ensures that cleaning steps are reproducible and independent from exploratory work.

**Approach**

- Use `pd.read_csv()` to load the dataset.
- Display shape and first few rows to confirm load.

**Expected Outcome**

Dataset loaded with no changes, matching exactly the saved raw version.

In [4]:
import pandas as pd
from pathlib import Path

input_path = Path("/workspaces/Heart_disease_risk_predictor/inputs/datasets/raw/heart_disease_uci.csv")
df = pd.read_csv(input_path)

print(f"Dataset loaded: {df.shape[0]} rows × {df.shape[1]} columns")
df.head()

Dataset loaded: 920 rows × 16 columns


| | id | age | sex | dataset | cp | trestbps | chol | fbs | restecg | thalch | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | Male | Cleveland | typical angina | 145.0 | 233.0 | True | lv hypertrophy | 150.0 | False | 2.3 | downsloping | 0.0 | fixed defect | 0 |
| 1 | 2 | 67 | Male | Cleveland | asymptomatic | 160.0 | 286.0 | False | lv hypertrophy | 108.0 | True | 1.5 | flat | 3.0 | normal | 2 |
| 2 | 3 | 67 | Male | Cleveland | asymptomatic | 120.0 | 229.0 | False | lv hypertrophy | 129.0 | True | 2.6 | flat | 2.0 | reversable defect | 1 |
| 3 | 4 | 37 | Male | Cleveland | non-anginal | 130.0 | 250.0 | False | normal | 187.0 | False | 3.5 | downsloping | 0.0 | normal | 0 |
| 4 | 5 | 41 | Female | Cleveland | atypical angina | 130.0 | 204.0 | False | lv hypertrophy | 172.0 | False | 1.4 | upsloping | 0.0 | normal | 0 |


---

## Review Missing & Invalid Values

**Step Purpose**

Before cleaning, we need to know what needs fixing.
Some variables may use invalid placeholders, such as a 0 for blood pressure or cholesterol, which is physiologically impossible.

**Approach**

- Use `df.isna().sum()` to see missingness.
- Use `describe()` to spot impossible values (e.g., 0 in `trestbps`).
- Custom check: count 0 values in numerical columns where 0 is invalid.

**Expected Outcome**

We identify columns requiring:

- NaN replacement for invalid zeros.
- Imputation or dropping for missingness.

In [5]:
# Count missing values
print(df.isna().sum())

# Check for invalid zeros in key columns
for col in ['trestbps', 'chol']:
    zero_count = (df[col] == 0).sum()
    print(f"{col}: {zero_count} invalid zeros")

id            0
age           0
sex           0
dataset       0
cp            0
trestbps     59
chol         30
fbs          90
restecg       2
thalch       55
exang        55
oldpeak      62
slope       309
ca          611
thal        486
num           0
dtype: int64
trestbps: 1 invalid zeros
chol: 172 invalid zeros
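
The approach above also lists `describe()`, which the cell does not call. A minimal sketch of that check follows, assuming a small toy frame (`toy`, with made-up values) in place of the real dataset; the column names are borrowed from the notebook, everything else is illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
toy = pd.DataFrame({
    "trestbps": [145.0, 160.0, 0.0, np.nan],
    "chol": [233.0, 0.0, 0.0, 250.0],
    "thalch": [150.0, 108.0, 129.0, 187.0],
})

# describe() ignores NaN, so a minimum of 0 points to placeholder zeros
# rather than to missing entries.
summary = toy.describe().loc[["count", "min", "max"]]
print(summary)

# Columns whose minimum is exactly 0 deserve a closer look.
suspect_cols = [col for col in toy.columns if toy[col].min() == 0]
print(suspect_cols)
```

Whether a zero is actually invalid still depends on the variable: it is impossible for `trestbps` or `chol`, but it can be a legitimate reading for other measurements.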


---

## Replace Invalid Zeros with NaN

**Step Purpose**

Medical context:

- `trestbps` = resting blood pressure (mm Hg) → cannot be 0 in a living patient.
- `chol` = serum cholesterol (mg/dL) → cannot be 0.

Zeros are often used as placeholders for "missing" in datasets. We replace them with NaN so that pandas excludes them from statistics such as the median instead of treating them as real measurements.

**Approach**

- Use `df[col].replace(0, np.nan)` for the identified variables.

**Expected Outcome**

Invalid zeros replaced with NaN so they can be correctly imputed or handled.

In [7]:
import numpy as np

for col in ['trestbps', 'chol']:
    df[col] = df[col].replace(0, np.nan)

df[['trestbps', 'chol']].head()

| | trestbps | chol |
|---|---|---|
| 0 | 145.0 | 233.0 |
| 1 | 160.0 | 286.0 |
| 2 | 120.0 | 229.0 |
| 3 | 130.0 | 250.0 |
| 4 | 130.0 | 204.0 |
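
The first five rows contain no zeros, so `head()` alone does not confirm the replacement. One way to verify it is to compare zero and NaN counts before and after. This is a sketch on a toy frame with made-up values; `invalid_zero_cols` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
toy = pd.DataFrame({
    "trestbps": [145.0, 0.0, 120.0],
    "chol": [0.0, 286.0, 0.0],
})

invalid_zero_cols = ["trestbps", "chol"]  # hypothetical helper name

# Count zeros before the replacement.
zeros_before = (toy[invalid_zero_cols] == 0).sum()

# Same replacement as in the notebook, applied to both columns at once.
toy[invalid_zero_cols] = toy[invalid_zero_cols].replace(0, np.nan)

# After the replacement no zeros should remain, and every former zero
# should now show up as NaN.
zeros_after = (toy[invalid_zero_cols] == 0).sum()
new_nans = toy[invalid_zero_cols].isna().sum()

print(zeros_before.to_dict())
print(zeros_after.to_dict())
print(new_nans.to_dict())
```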


---

## Impute or Drop Missing Data

**Step Purpose**

Many estimators raise errors on NaN input, and missing values can also reduce model accuracy.
We must decide whether to:

- Impute (replace with median, mode, etc.)

- Drop variables (if too many missing values)

- Drop rows (if few and random)
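
The three options above can be turned into a simple per-column rule. The sketch below is one possible rule, not the one this notebook applies: the 50% cutoff (`DROP_THRESHOLD`), the helper `suggest_action`, and the toy values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 229.0, 250.0, 204.0],
    "ca": [0.0, np.nan, np.nan, np.nan, np.nan],
    "age": [63, 67, 67, 37, 41],
})

DROP_THRESHOLD = 50.0  # hypothetical cutoff, in percent missing

# Percentage of missing values per column.
missing_pct = toy.isna().mean() * 100


def suggest_action(pct: float) -> str:
    """Map a missing percentage to a suggested handling strategy."""
    if pct == 0:
        return "keep"
    if pct > DROP_THRESHOLD:
        return "consider dropping column"
    return "impute"


actions = missing_pct.map(suggest_action)
print(actions.to_dict())
```

A rule like this only flags candidates. Whether a heavily missing column is dropped or imputed remains a judgment call that depends on how informative the column is.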

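**Approach**

- Check the percentage missing for each column.
- For numeric columns (e.g., `trestbps`, `chol`), impute with the median. `ca` is stored as a float, so `select_dtypes(include='number')` places it in this group.
- For non-numeric columns (e.g., `slope`, `thal`), impute with the mode.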

**Expected Outcome**

A complete dataset with no missing values.

In [8]:
missing_percent = (df.isna().sum() / len(df)) * 100
print(missing_percent)

# Impute numerics with median
num_cols = df.select_dtypes(include='number').columns
for col in num_cols:
    df[col] = df[col].fillna(df[col].median())

# Impute categoricals with mode
cat_cols = df.select_dtypes(exclude='number').columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

id           0.000000
age          0.000000
sex          0.000000
dataset      0.000000
cp           0.000000
trestbps     6.521739
chol        21.956522
fbs          9.782609
restecg      0.217391
thalch       5.978261
exang        5.978261
oldpeak      6.739130
slope       33.586957
ca          66.413043
thal        52.826087
num          0.000000
dtype: float64
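
The percentages above are computed before imputation, so they do not show whether the expected outcome (no missing values) was reached. A minimal check is sketched below on a toy frame with made-up values; it follows the same median/mode strategy as the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 229.0, 250.0],
    "thal": ["normal", None, "normal", "fixed defect"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Median for numeric columns, mode for everything else.
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# Total number of missing cells left in the frame; 0 means complete.
remaining = int(toy.isna().sum().sum())
print(remaining)
```

On the real data the same check is `df.isna().sum().sum()` run after the imputation cell.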
