# **Heart Disease (Cleveland) — Preprocessing notebook**

## **Steps Covered:**
 1. Import the dataset and explore basic info (nulls, data types).

 2. Handle missing values using mean, median, or mode imputation.

 3. Convert categorical features into numerical ones using encoding.

 4. Normalize/standardize the numerical features.

 5. Visualize outliers using boxplots and remove them.

In [35]:
!pip install ucimlrepo



In [36]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
from sklearn.impute import SimpleImputer
import numpy as np
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt

## **Load the dataset**
Use ucimlrepo to fetch the UCI Cleveland heart disease dataset, or pd.read_csv() if you already have the file.
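
If you go the pd.read_csv() route, the call could look like the sketch below. It assumes a headerless, comma-separated file in which missing values are written as '?'; check that against your copy of the file. The helper name load_heart_csv is hypothetical, and an in-memory string stands in for a file on disk. The column names are the ones shown in the df.head() output of this notebook.

```python
import io
import pandas as pd

# Column names as they appear in this notebook's df.head() output.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

def load_heart_csv(source):
    """Read a headerless CSV into a DataFrame (hypothetical helper).

    Assumption: missing values are encoded as '?'.
    """
    return pd.read_csv(source, header=None, names=COLUMNS, na_values="?")

# Stand-in for a file path. Row 1 matches the notebook output; row 2 has its
# ca value replaced by '?' purely to illustrate missing-value parsing.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,?,3.0,2\n"
)
df_csv = load_heart_csv(sample)
print(df_csv.isnull().sum()["ca"])
```

In real use, pass the path to your own file instead of the StringIO object.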

In [37]:

heart_disease = fetch_ucirepo(id=45)

In [38]:
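# Features and target are returned as separate pandas objects
X = heart_disease.data.features
y = heart_disease.data.targets

# Combine them column-wise so the target (num) sits next to the features
df = pd.concat([X, y], axis=1)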

## **Quick exploration**

df.info() — shows dtypes and non-null counts

df.describe() — numeric summaries (mean, std, min, max)

df.isnull().sum() — missing values per column

In [41]:
print(df.head(),"\n\n")
df.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()
print("\n")
print(df.describe(),"\n\n")
print(df.isnull().sum())

    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0  63.0  1.0  1.0     145.0  233.0  1.0      2.0    150.0    0.0      2.3   
1  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0      1.5   
2  67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0      2.6   
3  37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0      3.5   
4  41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0      1.4   

   slope   ca  thal  num  
0    3.0  0.0   6.0  0.0  
1    2.0  3.0   3.0  2.0  
2    2.0  2.0   7.0  1.0  
3    3.0  0.0   3.0  0.0  
4    1.0  0.0   3.0  0.0   


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    fl

## **Handle missing values (if any)**

Numeric columns: impute with the mean or the median (the median is more robust to skewed data and outliers).

Categorical columns: impute with the most frequent value (mode), or add a new category such as 'missing'.
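
To make the difference concrete, here is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset). It compares mean and median imputation on a skewed numeric column and fills a categorical column with its mode. Using pandas for the mode fill is a choice made here for brevity; SimpleImputer(strategy='most_frequent') is the scikit-learn equivalent.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up): one skewed numeric column, one categorical column.
toy = pd.DataFrame({
    "chol": [200.0, 210.0, 220.0, np.nan, 600.0],
    "thal": ["normal", "normal", np.nan, "fixed", "normal"],
})

# Value that each strategy writes into the missing chol cell (row 3).
mean_fill = SimpleImputer(strategy="mean").fit_transform(toy[["chol"]])[3, 0]
median_fill = SimpleImputer(strategy="median").fit_transform(toy[["chol"]])[3, 0]

toy_filled = toy.copy()
toy_filled["chol"] = toy["chol"].fillna(median_fill)
# Mode imputation for the categorical column, done with pandas.
toy_filled["thal"] = toy["thal"].fillna(toy["thal"].mode()[0])

print("mean fill:", mean_fill)      # pulled upward by the 600.0 value
print("median fill:", median_fill)  # stays near the bulk of the data
```

On this toy column the mean fill is 307.5 and the median fill is 215.0, which shows how a single extreme value drags the mean.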

In [50]:
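# Count missing values per column
print(df.isnull().sum())

num_cols = df.select_dtypes(include=np.number).columns
cat_cols = df.select_dtypes(include=['object']).columns

# Numeric columns: fill NaNs with the column mean.
# Note: every numeric column is treated as continuous here, including integer-coded categories.
if len(num_cols) > 0:
    df[num_cols] = SimpleImputer(strategy='mean').fit_transform(df[num_cols])

# Categorical (object) columns: fill NaNs with the most frequent value
if len(cat_cols) > 0:
    df[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[cat_cols])
else:
    print("No categorical columns found; skipping categorical imputation.")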

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
num         0
dtype: int64
No categorical columns found; skipping categorical imputation.


## **Encoding categorical features (if any)**

In [24]:
# One-hot encode object/category columns; numeric columns pass through unchanged
df = pd.get_dummies(df, drop_first=True)
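
Because every column in this frame is numeric, the call above leaves it unchanged. If you decide that some integer-coded columns are really nominal categories, you can name them explicitly. The sketch below assumes cp and thal are such columns (an assumption, not something this notebook establishes) and uses the five rows from the df.head() output.

```python
import pandas as pd

# Three columns from the first five rows of the df.head() output.
toy = pd.DataFrame({
    "age": [63.0, 67.0, 67.0, 37.0, 41.0],
    "cp": [1.0, 4.0, 4.0, 3.0, 2.0],
    "thal": [6.0, 3.0, 7.0, 3.0, 3.0],
})

# Without columns=, only object/category/string columns are encoded,
# so an all-numeric frame comes back with the same columns.
unchanged = pd.get_dummies(toy, drop_first=True)

# Assumption: cp and thal are nominal codes. Cast to int for cleaner
# dummy names, then list them in columns= to force one-hot encoding.
nominal_cols = ["cp", "thal"]
encoded = pd.get_dummies(
    toy.astype({c: int for c in nominal_cols}),
    columns=nominal_cols,
    drop_first=True,  # drops cp_1 and thal_3 as the reference levels
)
print(encoded.columns.tolist())
# → ['age', 'cp_2', 'cp_3', 'cp_4', 'thal_6', 'thal_7']
```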


## **Scale / normalize numeric features**

In [27]:
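# Standardize every numeric column to zero mean and unit variance.
# Note: this also rescales the target column (num) and the integer-coded categorical columns.
scaler = StandardScaler()
num_cols = df.select_dtypes(include='number').columns

df[num_cols] = scaler.fit_transform(df[num_cols])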
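
If the frame will later be used for modeling, you may prefer to scale only the continuous features and leave the target untouched. A minimal sketch, assuming age and chol are continuous features and num is the target; the list continuous_cols is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Three columns from the first five rows of the df.head() output.
toy = pd.DataFrame({
    "age": [63.0, 67.0, 67.0, 37.0, 41.0],
    "chol": [233.0, 286.0, 229.0, 250.0, 204.0],
    "num": [0.0, 2.0, 1.0, 0.0, 0.0],
})

# Assumption: these are the continuous features; num is the target.
continuous_cols = ["age", "chol"]

scaled = toy.copy()
scaled[continuous_cols] = StandardScaler().fit_transform(toy[continuous_cols])

print(scaled[continuous_cols].mean().round(6))       # approximately 0 per column
print(scaled[continuous_cols].std(ddof=0).round(6))  # approximately 1 per column
print(scaled["num"].tolist())                        # target left as-is
```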


## **Visualize outliers (boxplots) and remove them**

Use boxplots to see which numeric features have outliers. As a simple rule, remove any row that has a value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR in any numeric column.
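
One way to draw the boxplots is sketched below with plain matplotlib, one panel per column so that features on different scales stay readable. The helper plot_boxplots is hypothetical, and the toy frame holds two columns from the df.head() output; on the real data you would pass df and the columns you care about.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_boxplots(frame, cols):
    """Draw one boxplot per column, side by side, and return the figure."""
    fig, axes = plt.subplots(1, len(cols), figsize=(3 * len(cols), 4), squeeze=False)
    for ax, col in zip(axes[0], cols):
        # Matplotlib's default whiskers extend to 1.5 * IQR; points beyond
        # them are drawn individually as candidate outliers.
        ax.boxplot(frame[col].dropna().to_numpy())
        ax.set_title(col)
        ax.set_xticks([])
    fig.tight_layout()
    return fig

# Two columns from the first five rows of the df.head() output.
toy = pd.DataFrame({
    "trestbps": [145.0, 160.0, 120.0, 130.0, 130.0],
    "chol": [233.0, 286.0, 229.0, 250.0, 204.0],
})
fig = plot_boxplots(toy, ["trestbps", "chol"])
```

In a notebook the figure renders when the cell finishes; in a script, call plt.show() afterwards.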

In [31]:
num_cols = df.select_dtypes(include=np.number).columns

Q1 = df[num_cols].quantile(0.25)
Q3 = df[num_cols].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows with no value outside the IQR fences in any numeric column
condition = ~((df[num_cols] < (Q1 - 1.5 * IQR)) |
              (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)

df_clean = df[condition]

print("Original rows:", len(df))
print("Rows after outlier removal:", len(df_clean))
print("Removed:", len(df) - len(df_clean))


Original rows: 303
Rows after outlier removal: 215
Removed: 88
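
Dropping 88 of 303 rows is a large share, so it is worth checking which columns trigger the removals. One possible cause, which would need to be confirmed on the real data, is a binary column where one value dominates: Q1 and Q3 are then equal, the IQR is 0, and every minority value counts as an outlier. The sketch below counts flagged values per column on made-up toy data; iqr_outlier_counts is a hypothetical helper.

```python
import pandas as pd

def iqr_outlier_counts(frame):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - 1.5 * iqr) | (frame > q3 + 1.5 * iqr)
    return outside.sum()

# Made-up toy data: a mostly-zero binary flag and a continuous column
# with one extreme value.
toy = pd.DataFrame({
    "flag": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "value": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 100.0],
})
counts = iqr_outlier_counts(toy)
print(counts)
# flag: 2 (IQR is 0, so both 1s are flagged); value: 1 (only 100.0)
```

Running iqr_outlier_counts(df[num_cols]) on the real frame would show whether binary or integer-coded columns account for most of the removed rows, in which case restricting the rule to continuous columns may be the better choice.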
