# Data Cleaning in Python — Exercise

**Course:** Intro to Data Mining  
**Dataset:** Heart Failure Prediction dataset (loaded from a public URL — no file downloads needed!)  

---


# YOUR TURN: Practice Exercise

Now apply the same data cleaning pipeline from the Titanic exercise to a **new dataset**: the **Heart Failure Prediction** dataset.

This dataset combines 5 heart disease databases (918 patients total) and contains medical information about patients and whether or not they have heart disease. It has **text columns that need encoding**, potential outliers, and some tricky data quality issues — perfect for practicing everything we just learned.

**Your tasks** (follow the same steps from above):

1. Load the data and take a first look
2. Understand the data types and structure
3. Get descriptive statistics
4. Check for duplicates
5. Handle missing values (decide: drop column, drop row, or impute?)
6. Encode any categorical (`object`) columns
7. Detect and handle outliers
8. Scale the numerical features

The starter code below loads the dataset for you. The rest is up to you!

> **Hints:** Pay close attention to `.describe()` — not all missing values show up as NaN! Also, this dataset was made by combining multiple hospital databases, so check for duplicates.

### About the Heart Failure Prediction Dataset

| Column | Description | Type |
|--------|-------------|------|
| Age | Age in years | Numerical |
| Sex | M = Male, F = Female | Categorical |
| ChestPainType | TA = Typical Angina, ATA = Atypical Angina, NAP = Non-Anginal Pain, ASY = Asymptomatic | Categorical |
| RestingBP | Resting blood pressure (mm Hg) | Numerical |
| Cholesterol | Serum cholesterol (mg/dl) | Numerical |
| FastingBS | Fasting blood sugar > 120 mg/dl (1 = true, 0 = false) | Numerical |
| RestingECG | Normal, ST = ST-T wave abnormality, LVH = left ventricular hypertrophy | Categorical |
| MaxHR | Maximum heart rate achieved | Numerical |
| ExerciseAngina | Y = Yes, N = No | Categorical |
| Oldpeak | ST depression induced by exercise | Numerical |
| ST_Slope | Up = upsloping, Flat = flat, Down = downsloping | Categorical |
| HeartDisease | 1 = heart disease, 0 = no heart disease (TARGET) | Numerical |

### Step 1: Load the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# The Heart Failure Prediction dataset — loaded from a public GitHub URL
# (Originally from Kaggle, combining 5 UCI heart disease databases)
url = 'https://raw.githubusercontent.com/xpy-10/DataSet/main/heart.csv'

heart = pd.read_csv(url)

print(f'Dataset shape: {heart.shape[0]} rows x {heart.shape[1]} columns')
heart.head()

Dataset shape: 918 rows x 12 columns


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


### Step 2: Check data types and structure

Use `.info()` to see the data types and count non-null values.

In [3]:
heart.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


### Step 3: Descriptive statistics

Use `.describe()`. Look for anything surprising in the min, max, or mean values.

In [20]:
heart.describe()


Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


### Step 4: Check for duplicates

How many duplicate rows are there? If any, remove them.

In [15]:
heart_duplicates = heart.duplicated().sum()
print(f'Number of duplicate rows: {heart_duplicates}')

Number of duplicate rows: 0


In [17]:
# DROP DUPLICATES
heart = heart.drop_duplicates()

### Step 5: Handle missing values

Check the percentage of missing values per column. Then decide what to do:
- Drop columns that are mostly missing
- Drop rows if very few are missing
- Impute (fill) with mean or median if a moderate amount is missing

In [34]:
import numpy as np

cols = ['RestingBP', 'Cholesterol']

# 1. Check zeros
print("Zero count per column:\n", heart[cols].eq(0).sum())

# 2. Convert zeros -> NaN
heart[cols] = heart[cols].replace(0, np.nan)

# 3. Check missing %
missing_pct = (heart.isna().mean() * 100).round(2)
print("\nMissing %:\n", missing_pct[missing_pct > 0])

# 4. Impute by converting zeros to median value
heart[cols] = heart[cols].fillna(heart[cols].median())

Zero count per column:
 RestingBP      0
Cholesterol    0
dtype: int64

Missing %:
 RestingBP       0.11
Cholesterol    18.74
dtype: float64


### Step 6: Encode categorical variables

Check if any columns are `object` type. If so, decide whether to use label encoding or one-hot encoding.

Hint: Look at `.info()` output — how many unique values does each `object` column have?

In [36]:
# find categorical columns
print(heart.select_dtypes(include='object').columns)

# check for number of unique values
for col in heart.select_dtypes(include='object').columns:
    print(col, "->", heart[col].nunique())


Index(['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'], dtype='object')
Sex -> 2
ChestPainType -> 4
RestingECG -> 3
ExerciseAngina -> 2
ST_Slope -> 3


In [39]:
# Sex
print("Before encoding Sex:")
print(heart['Sex'].value_counts())

heart['Sex'] = heart['Sex'].map({'M': 1, 'F': 0})

print("\nAfter encoding Sex:")
print(heart['Sex'].value_counts())

# ExerciseAngina
print("\nBefore encoding ExerciseAngina:")
print(heart['ExerciseAngina'].value_counts())

heart['ExerciseAngina'] = heart['ExerciseAngina'].map({'Y': 1, 'N': 0})

print("\nAfter encoding ExerciseAngina:")
print(heart['ExerciseAngina'].value_counts())

Before encoding Sex:
Series([], Name: count, dtype: int64)

After encoding Sex:
Series([], Name: count, dtype: int64)

Before encoding ExerciseAngina:
ExerciseAngina
0    547
1    371
Name: count, dtype: int64

After encoding ExerciseAngina:
Series([], Name: count, dtype: int64)


### Step 7: Detect and handle outliers

Pick 2-3 numerical columns (e.g., `trestbps`, `chol`, `thalach`). Create box plots and use the IQR method.

Remember: think about whether outliers are real data or errors before removing them!

In [None]:
# YOUR CODE HERE
# Hint: You can reuse the detect_outliers_iqr function from above
# Then create box plots with plt.boxplot()


### Step 8: Scale the numerical features

Apply Min-Max scaling to the numerical columns (not the target column).

In [None]:
# YOUR CODE HERE
# Hint: Follow the same pattern from Section 8 of the tutorial


### Reflection Questions

After completing the exercise, answer these questions (add a text cell or answer here):

1. How did the Heart Disease dataset compare to the Titanic dataset in terms of data quality?
2. Which missing value strategy did you use, and why?
3. Did you find any outliers? Did you remove them? Why or why not?
4. If you were building a model to predict heart disease, which columns do you think would be most important?

---
*Great work! You now have hands-on experience with a complete data cleaning pipeline.* Dr. Thompson