# Data Preprocessing

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
import yaml

In [2]:
# Load Data
df = pd.read_csv("../data/raw/heart.csv")
df.head()

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


## Handle Missing Values

In [3]:
df.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

In [4]:
df.dropna(inplace=True)

There are no missing values, so `dropna` leaves the DataFrame unchanged.

## Handle Duplicate Values

In [5]:
print(df.duplicated().sum())

0


In [6]:
df.drop_duplicates(inplace=True)

There are no duplicate rows, so `drop_duplicates` leaves the DataFrame unchanged.

## Handle Invalid Values

The EDA showed that the "Cholesterol" column contains some values of 0. These are invalid, because a serum cholesterol measurement cannot be 0; the zeros most likely stand in for missing readings.

In [7]:
# Count the invalid (zero) values in the Cholesterol column
num_invalid = (df['Cholesterol'] == 0).sum()
print(f"The total count of invalid data: {num_invalid}")
print(f"The percentage of invalid data: {num_invalid / len(df) * 100:.2f}%")

The total count of invalid data: 172
The percentage of invalid data: 18.74%


In [8]:
# Replace the invalid zeros with the median of the valid values
imputer = SimpleImputer(strategy='median')

# Mark the zeros as missing first, so the imputer ignores them when computing the median
df['Cholesterol'] = df['Cholesterol'].replace(0, np.nan)
df[['Cholesterol']] = imputer.fit_transform(df[['Cholesterol']])
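
Note that the cell above fits the imputer on the full dataset, so the median also reflects rows that later end up in the test set. A minimal sketch of one alternative is to split first and fit the imputer on the training rows only. The toy frame and its values below are made up for illustration and do not come from `heart.csv`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values); 0 marks an invalid cholesterol reading
toy = pd.DataFrame({
    "Cholesterol": [289, 0, 283, 214, 0, 195, 240, 260, 0, 230],
    "HeartDisease": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
toy["Cholesterol"] = toy["Cholesterol"].replace(0, np.nan)

X_toy = toy[["Cholesterol"]]
y_toy = toy["HeartDisease"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# Fit the median on the training rows only, then reuse it for the test rows
imputer = SimpleImputer(strategy="median")
X_tr_imp = imputer.fit_transform(X_tr)
X_te_imp = imputer.transform(X_te)

train_median = imputer.statistics_[0]
print(f"Median learned from the training split: {train_median}")
```

Whether this matters in practice depends on how much the median shifts between splits; the sketch only shows the ordering of the steps.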

In [9]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
Age,918.0,53.510893,9.432617,28.0,47.0,54.0,60.0,77.0
RestingBP,918.0,132.396514,18.514154,0.0,120.0,130.0,140.0,200.0
Cholesterol,918.0,243.204793,53.401297,85.0,214.0,237.0,267.0,603.0
FastingBS,918.0,0.233115,0.423046,0.0,0.0,0.0,0.0,1.0
MaxHR,918.0,136.809368,25.460334,60.0,120.0,138.0,156.0,202.0
Oldpeak,918.0,0.887364,1.06657,-2.6,0.0,0.6,1.5,6.2
HeartDisease,918.0,0.553377,0.497414,0.0,0.0,1.0,1.0,1.0
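
The summary above also shows a minimum of 0.0 for RestingBP, which is not a plausible blood pressure reading. A small sketch for counting zeros across several columns at once follows; the list of suspect columns and the toy values are assumptions for illustration:

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130, 138],
    "Cholesterol": [289, 180, 0, 0],
    "MaxHR": [172, 156, 98, 108],
})

# Columns where a zero is assumed to be a placeholder rather than a real reading
suspect_cols = ["RestingBP", "Cholesterol", "MaxHR"]

# Comparing to 0 gives a boolean frame; summing counts the True values per column
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)
```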


## Encoding Categorical Features

| Model Type              | Encoding Recommendation          | Why?                                                                          |
| ----------------------- | -------------------------------- | ----------------------------------------------------------------------------- |
| **Distance-Based Models** | One-Hot Encoding (`get_dummies`) | These models treat every feature as a numeric magnitude, so integer codes would impose a false order and spacing on the categories |
| **Tree-Based Models**   | Ordinal/Label Encoding           | Trees **split on thresholds**, so they can separate categories using the **encoded integer values** |


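Note that LightGBM and CatBoost can handle categorical features natively (for example, through the pandas `category` dtype in LightGBM or the `cat_features` argument in CatBoost), so manual encoding is optional for those libraries.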
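
The options above can be compared on a toy column. This is only a sketch: the frame is made up, though the labels and the integer mapping mirror the ones used for ChestPainType in this notebook.

```python
import pandas as pd

# Toy column using the category labels that appear in ChestPainType
toy = pd.DataFrame({"ChestPainType": ["ATA", "NAP", "ASY", "TA", "ASY"]})

# One-hot: one indicator column per category, no implied order
one_hot = pd.get_dummies(toy, columns=["ChestPainType"], dtype=int)

# Ordinal: a single integer column; the order is arbitrary for nominal data
ordinal = toy["ChestPainType"].map({"ASY": 0, "NAP": 1, "ATA": 2, "TA": 3})

# pandas category dtype, one way to pass categorical features to LightGBM
as_category = toy["ChestPainType"].astype("category")

print(one_hot)
print(ordinal.tolist())
print(as_category.dtype)
```

One-hot encoding widens the frame by one column per category, while the ordinal and `category` versions keep a single column.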

In [10]:
cat_cols = df.select_dtypes("object").columns.to_list()
print(cat_cols)

['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']


In [11]:
for col in cat_cols:
    print(f"{col}: {df[col].unique()}")

Sex: ['M' 'F']
ChestPainType: ['ATA' 'NAP' 'ASY' 'TA']
RestingECG: ['Normal' 'ST' 'LVH']
ExerciseAngina: ['N' 'Y']
ST_Slope: ['Up' 'Flat' 'Down']


In [12]:
# I plan to use a tree-based model, so integer (ordinal) encoding is sufficient
df['Sex'] = df['Sex'].map({'M': 1, 'F': 0})
df['ExerciseAngina'] = df['ExerciseAngina'].map({'Y': 1, 'N': 0})
df['ST_Slope'] = df['ST_Slope'].map({'Down': 0, 'Flat': 1, 'Up': 2})
df['ChestPainType'] = df['ChestPainType'].map({'ASY': 0, 'NAP': 1, 'ATA': 2, 'TA':3})
df['RestingECG'] = df['RestingECG'].map({'Normal': 0, 'ST': 1, 'LVH': 2})
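
`Series.map` returns NaN for any value that is missing from the mapping dictionary, so a typo in a key would silently introduce missing values. A minimal sketch of a guard against that is shown below; the helper name `map_strict` is hypothetical and not part of pandas:

```python
import pandas as pd

def map_strict(series: pd.Series, mapping: dict) -> pd.Series:
    """Map values and raise if any non-null value has no entry in the mapping."""
    unmapped = set(series.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"Unmapped values in {series.name}: {sorted(unmapped)}")
    return series.map(mapping)

# Toy series using the ST_Slope labels
toy = pd.Series(["Up", "Flat", "Down", "Up"], name="ST_Slope")
encoded = map_strict(toy, {"Down": 0, "Flat": 1, "Up": 2})
print(encoded.tolist())
```

A simpler alternative is to run `df.isnull().sum()` again after the mapping cell and confirm that every count is still zero.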

In [13]:
df.head()

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,1,2,140,289.0,0,0,172,0,0.0,2,0
1,49,0,1,160,180.0,0,0,156,0,1.0,1,1
2,37,1,2,130,283.0,0,1,98,0,0.0,2,0
3,48,0,0,138,214.0,0,0,108,1,1.5,1,1
4,54,1,1,150,195.0,0,0,122,0,0.0,2,0


## Split Data

In [14]:
X = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']  

In [15]:
# stratify is important to preserve class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
) 
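
To see what `stratify` does, the class balance of the two splits can be compared. The sketch below uses a made-up target with a 60/40 balance rather than the heart disease data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with a 60/40 class balance (hypothetical data)
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 60 + [0] * 40, name="HeartDisease")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratify, the positive rate should be close to equal in both splits
print(f"train positive rate: {y_tr.mean():.2f}")
print(f"test positive rate:  {y_te.mean():.2f}")
```

The same comparison can be run on `y_train` and `y_test` from the cell above.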

## Scaling Data

I asked Gemini whether data needs to be scaled for different machine learning algorithms, and this was the answer:

When building machine learning models, whether you need to scale your data depends on the type of algorithm you're using.

**Distance-Based and Gradient-Based Algorithms Need Scaling**

Algorithms like K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Logistic Regression are sensitive to the scale of your features. This is because they either rely on calculating distances between data points or use gradient descent for optimization.
Scaling ensures all features contribute equally, preventing features with larger numerical ranges from dominating the results. Similarly, for gradient-based methods like Logistic Regression, unscaled features can lead to an unstable and slow optimization process, as features with larger magnitudes will have larger gradients, causing uneven updates to the model's parameters.

**Tree-Based Algorithms Don't Need Scaling (and It's Often Avoided)**

In contrast, tree-based algorithms such as Decision Trees, Random Forests, and Gradient Boosting Machines (like XGBoost) are generally immune to feature scaling. These algorithms make decisions based on splitting data at specific thresholds for individual features (e.g., "Is age > 30?").
Scaling provides no performance benefit for these models and can even complicate the interpretability of the model, as the split points would then refer to scaled values rather than original, understandable units. Therefore, it's typically recommended not to scale data when using tree-based models, simplifying the preprocessing pipeline.

Because I plan to use a tree-based model, I won't scale the data.
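
The claim about tree-based models can be checked empirically. The sketch below trains the same decision tree on raw and standardized versions of a synthetic dataset and compares the predictions; the dataset and its parameters are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, not the heart disease dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training split only
scaler = StandardScaler().fit(X_tr)

# Same model and seed, once on raw features and once on standardized features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_tr), y_tr)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))

# Standardization preserves the order of values within each feature,
# so the two trees should agree on nearly all test rows
agreement = np.mean(pred_raw == pred_scaled)
print(f"Share of identical predictions: {agreement:.2f}")
```

Repeating the experiment with a distance-based model such as `KNeighborsClassifier` would typically show the predictions changing after scaling.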

## Save Preprocessed Data

In [16]:
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

# Save to CSV
train_df.to_csv("../data/processed/train.csv", index=False)
test_df.to_csv("../data/processed/test.csv", index=False)
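
`to_csv` raises an error if the target directory does not exist, so it can help to create the directory first and read the file back as a check. The sketch below does this in a temporary directory with a made-up frame; the `processed` folder name mirrors the path used above, and everything else is an assumption:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with hypothetical values
train_toy = pd.DataFrame({"Age": [40, 49], "HeartDisease": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "processed"
    # Create the directory (and any parents) if it does not exist yet
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "train.csv"
    train_toy.to_csv(out_path, index=False)

    # Read the file back to confirm the data survived the round trip
    reloaded = pd.read_csv(out_path)

round_trip_ok = reloaded.equals(train_toy)
print(f"Round trip preserved the data: {round_trip_ok}")
```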