# Heartbeat Classification - Data Exploration

## Project Overview
Exploratory data analysis on ECG heartbeat classification datasets.

**Datasets:**
- MIT-BIH Arrhythmia Dataset (5 classes: Normal, Supraventricular, Ventricular, Fusion, Unknown)
- PTB Diagnostic ECG Database (2 classes: Normal, Abnormal/MI)

**Objective:** Analyze data structure, class distributions, and identify preprocessing requirements.



In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy.stats import kruskal, chisquare, kstest

# Data handling
import warnings
warnings.filterwarnings('ignore')

# System libraries
import sys
# Add the project root directory to Python path
sys.path.append('..')

# Set random seed for reproducibility
np.random.seed(42)

# Custom utils
from src.utils import generate_data_audit_report, generate_summary_report
from src.visualization.visualization import plot_heartbeat, plot_multiple_heartbeats

## 1. Dataset Descriptions

### PTBDB_*.csv

- Derived from PTB Diagnostic ECG Database
- Binary classification: normal vs abnormal (myocardial infarction) heartbeats

### MITBIH_*.csv

- Derived from PhysioNet's MIT-BIH Arrhythmia Dataset
- Multiclass classification: 5 categories (N: Normal, S: Supraventricular, V: Ventricular, F: Fusion, Q: Unknown)

### Common Characteristics

- Each row is a single heartbeat segment sampled at 125 Hz; each signal column is one time point (187 samples, about 1.5 s)
- Values normalized between 0 and 1
- Zero-padded to a fixed dimension of 188 columns (187 signal columns plus the label)
- Column 187 = class label (target)
- MIT-BIH is pre-split into train/test partitions; PTBDB is split by class into normal/abnormal files
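
As a quick sanity check on the layout described above, the following sketch builds a tiny synthetic frame with the same 188-column shape and separates the signal from the label. The synthetic values and the names `SAMPLING_RATE_HZ`, `N_SIGNAL_COLUMNS`, and `split_signal_and_label` are illustrative assumptions, not part of the datasets or the project code.

```python
import numpy as np
import pandas as pd

SAMPLING_RATE_HZ = 125   # sampling rate stated in the dataset description
N_SIGNAL_COLUMNS = 187   # columns 0..186 hold the signal, column 187 the label

def split_signal_and_label(df):
    """Return (signal, label) for a frame with 187 signal columns and 1 label column."""
    signal = df.iloc[:, :N_SIGNAL_COLUMNS]
    label = df.iloc[:, N_SIGNAL_COLUMNS].astype(int)
    return signal, label

# Synthetic stand-in for one of the CSV files: 4 rows, values in [0, 1)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((4, N_SIGNAL_COLUMNS + 1)))
toy[N_SIGNAL_COLUMNS] = [0, 1, 0, 1]  # fake class labels

signal, label = split_signal_and_label(toy)
window_seconds = N_SIGNAL_COLUMNS / SAMPLING_RATE_HZ  # time span covered by one row
print(signal.shape, label.tolist(), f"{window_seconds:.3f} s")
```

The same split should apply to the real frames once they are loaded with `header=None`, since the columns are then numbered 0 to 187.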

## 2. Data Loading & Structure

In [None]:
ptbdb_normal = pd.read_csv("data/original/ptbdb_normal.csv", header=None)
display(ptbdb_normal.head())

print("Dataset shape:", ptbdb_normal.shape)
print("Data types:", ptbdb_normal.dtypes.value_counts())
print("Memory usage:", ptbdb_normal.memory_usage(deep=True).sum() / 1024**2, "MB")
print("Duplicate rows (dropped below):", ptbdb_normal.duplicated().sum())
print(ptbdb_normal[187].value_counts())

ptbdb_normal.drop_duplicates(inplace=True)

In [None]:
ptbdb_abnormal = pd.read_csv("data/original/ptbdb_abnormal.csv", header=None)
display(ptbdb_abnormal.head())

print("Dataset shape:", ptbdb_abnormal.shape)
print("Data types:", ptbdb_abnormal.dtypes.value_counts())
print("Memory usage:", ptbdb_abnormal.memory_usage(deep=True).sum() / 1024**2, "MB")
print("Duplicate rows (dropped below):", ptbdb_abnormal.duplicated().sum())

print(ptbdb_abnormal[187].value_counts())

ptbdb_abnormal.drop_duplicates(inplace=True)

In [None]:
mitbih_test = pd.read_csv("data/original/mitbih_test.csv", header=None)
display(mitbih_test.head())

print("Dataset shape:", mitbih_test.shape)
print("Data types:", mitbih_test.dtypes.value_counts())
print("Memory usage:", mitbih_test.memory_usage(deep=True).sum() / 1024**2, "MB")
print("Duplicate rows:", mitbih_test.duplicated().sum())

print(mitbih_test[187].value_counts())



In [None]:
mitbih_train = pd.read_csv("data/original/mitbih_train.csv", header=None)
display(mitbih_train.head())

print("Dataset shape:", mitbih_train.shape)
print("Data types:", mitbih_train.dtypes.value_counts())
print("Memory usage:", mitbih_train.memory_usage(deep=True).sum() / 1024**2, "MB")
print("Duplicate rows:", mitbih_train.duplicated().sum())

print(mitbih_train[187].value_counts())

In [None]:
# mitbih labels mapping
mitbih_labels_map = {0: 'N', 1: 'S', 2: 'V', 3: 'F', 4: 'Q'}
mitbih_labels_to_desc = {"N": "Normal", "S": "Supraventricular premature beat", "V": "Premature ventricular contraction", "F": "Fusion of V+N", "Q": "Unclassified"}

## 3. Dataset consistency

In [None]:
# Missing values analysis: report the total number of NaN cells per dataset
datasets = {
    "mitbih_train": mitbih_train,
    "mitbih_test": mitbih_test,
    "ptbdb_normal": ptbdb_normal,
    "ptbdb_abnormal": ptbdb_abnormal,
}
for name, dataset in datasets.items():
    total_missing = dataset.isnull().sum().sum()
    print(f"{name} - Total missing values: {total_missing}")

*No missing values in any of the four datasets.*

In [None]:
ptbdb = pd.concat([ptbdb_abnormal, ptbdb_normal], axis=0).reset_index(drop=True)
mitbih = pd.concat([mitbih_train, mitbih_test], axis=0).reset_index(drop=True)
for name, dataset in [("ptbdb", ptbdb), ("mitbih", mitbih)]:
    summary = dataset.describe()
    print(f"Dataset {name} - Shape: {dataset.shape}")
    # Values lie in [0, 1], so a zero mean means the column holds only zeros
    zero_mean_columns = summary.columns[summary.loc['mean'] == 0].tolist()
    print(f"Dataset {name} - Columns with zero mean: {zero_mean_columns}")
    display(summary)



## 4. Class Distribution & Imbalance Tests

### MITBIH

In [None]:
class_counts = mitbih.iloc[:, -1].value_counts().sort_index()
imbalance_ratio = class_counts.min() / class_counts.max()

# Goodness-of-fit test against a uniform class distribution.
# (chi2_contingency on a single-row table has 0 degrees of freedom and always returns p = 1.)
expected = [class_counts.mean()] * len(class_counts)
chi2, p_value = chisquare(class_counts, expected)

print(f"Chi-squared test statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {len(class_counts) - 1}")
print(f"Imbalance ratio: {imbalance_ratio:.4f}")

print(f"Class counts:\t{class_counts}") # pie chart

In [None]:

# Map class IDs -> short + description labels
labels = [
    f"{mitbih_labels_map[i]} - {mitbih_labels_to_desc[mitbih_labels_map[i]]}"
    for i in class_counts.index
]

# --- Colors ---
colors = sns.color_palette("pastel", len(class_counts))

# --- Plot setup ---
fig, ax = plt.subplots(figsize=(8, 8))

plt.pie(class_counts, labels=labels, colors=colors, autopct='%1.1f%%')

ax.set_title("Class Distribution in MIT-BIH Arrhythmia Dataset", fontsize=16, weight='bold', pad=20)

plt.tight_layout()

plt.show()

### PTBDB

In [None]:
class_counts = ptbdb.iloc[:, -1].value_counts().sort_index()
expected = [class_counts.mean()] * len(class_counts)
imbalance_ratio = class_counts.min() / class_counts.max()

chi2, p_value = chisquare(class_counts, expected)

print(f"Chi-squared test statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Imbalance ratio: {imbalance_ratio:.4f}")

print(f"Class counts:\t{class_counts}") # pie chart

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

class_counts = ptbdb[187].value_counts().sort_index()
class_counts.plot(kind='bar', color='skyblue', ax=ax)
ax.set_xlabel('Class')
ax.set_ylabel('Count')
ax.set_title('PTB: Class Distribution')
plt.show()

## 5. Plots

### PTBDB: Normal vs Abnormal


In [None]:
normal_heartbeat = ptbdb[ptbdb[187] == 0].sample(3)
abnormal_heartbeat = ptbdb[ptbdb[187] == 1].sample(3)
fig = plot_multiple_heartbeats(normal_heartbeat, title="PTBDB Normal")
fig = plot_multiple_heartbeats(abnormal_heartbeat, title="PTBDB Abnormal")

### MITBIH: Plots for all classes

In [None]:
classes = mitbih[187].unique()
colors = ['blue', 'green', 'red', 'orange', 'purple', 'brown', 'pink', 'cyan', 'magenta', 'lime']

for i, _class in enumerate(classes):
    heartbeat = mitbih[mitbih[187] == _class].sample(3)
    title = f"MITBIH Class {_class} - {mitbih_labels_map[_class]}: {mitbih_labels_to_desc[mitbih_labels_map[_class]]}"
    fig = plot_multiple_heartbeats(heartbeat, title=title, color=colors[i])

## 6. R-R Distance Analysis

Analysis of R-R distances (time between R peaks) to identify potential classification errors or unusual patterns.

Each row spans 1.2 times the R-R interval (1.2R) of the heartbeat and is zero-padded to a fixed length for deep learning compatibility.
The R-R distance is therefore estimated by:
- Finding the index where zero-padding starts
- Dividing that index by 1.2

This analysis enables:
- Comparison of R-R distances across classes
- Identification of outliers and extreme values
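
The two estimation steps listed above can be sketched on a toy row as follows. This is a minimal illustration under stated assumptions: the helper name `estimate_rr_samples`, the constant `SAMPLING_RATE_HZ`, and the toy values are hypothetical, and the conversion to seconds simply applies the 125 Hz sampling rate from the dataset description.

```python
import numpy as np

SAMPLING_RATE_HZ = 125  # sampling rate stated in the dataset description

def estimate_rr_samples(beat, factor=1.2):
    """Estimate the R-R distance (in samples) from one zero-padded row."""
    nonzero = np.flatnonzero(beat)
    if nonzero.size == 0:
        return 0.0  # row is entirely padding
    padding_start = nonzero[-1] + 1  # index where zero-padding starts
    return padding_start / factor

# Toy row: 90 non-zero samples followed by zero-padding up to 187 samples
toy_beat = np.concatenate([np.full(90, 0.5), np.zeros(97)])
rr_samples = estimate_rr_samples(toy_beat)
rr_seconds = rr_samples / SAMPLING_RATE_HZ
print(f"{rr_samples:.1f} samples, {rr_seconds:.2f} s")
```

One caveat of this approach: a genuine signal sample that is exactly 0 at the end of a beat cannot be distinguished from padding, so the estimate may be slightly short for such rows.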

In [None]:
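def find_first_nonzero_index(arr):
    """Estimate the R-R distance (in samples) of a zero-padded beat.

    Despite its name, this locates the last non-zero sample (where the
    zero-padding starts) and divides that length by 1.2.
    """
    nonzero = np.flatnonzero(arr)
    if nonzero.size == 0:
        return 0.0  # all zeros
    return (nonzero[-1] + 1) / 1.2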

### PTB

In [None]:
ptbdb_r = ptbdb.iloc[:, :-1].apply(lambda row: find_first_nonzero_index(row.values), axis=1)
result_ptbdb_r = pd.concat([ptbdb_r.rename('zero_pad_start'), ptbdb.iloc[:, -1].rename('target')], axis=1)

result_ptbdb_r.describe()

In [None]:
result_ptbdb_r[result_ptbdb_r["target"] == 1].describe()

In [None]:
result_ptbdb_r[result_ptbdb_r["target"] == 0].describe()

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(x='target', y='zero_pad_start', data=result_ptbdb_r)
plt.title('PTB Estimated R-R Distance vs Class', fontsize=16)
plt.xlabel('PTB Class', fontsize=12)
plt.ylabel('Estimated R-R Distance (samples)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

#### Extremes

In [None]:
# Function to get outlier indices for each class
def get_outliers_idx(df, value_col, class_col):
    outlier_indices = []
    
    for cls, group in df.groupby(class_col):
        q1 = group[value_col].quantile(0.25)
        q3 = group[value_col].quantile(0.75)
        iqr = q3 - q1
        lower_whisker = q1 - 1.5 * iqr
        upper_whisker = q3 + 1.5 * iqr
        
        # Identify outliers
        outliers = group[(group[value_col] < lower_whisker) | (group[value_col] > upper_whisker)]
        outlier_indices.extend(outliers.index.tolist())
    
    return outlier_indices

In [None]:
for c in [0, 1]:
    df = result_ptbdb_r[result_ptbdb_r["target"] == c]
    idx = get_outliers_idx(df, 'zero_pad_start', 'target')

    proportion = len(idx) / len(df) * 100
    n = "Normal" if c == 0 else "Abnormal"

    # idx holds index labels, so select with .loc; guard against fewer than 3 outliers
    sample = ptbdb.loc[idx]
    if len(sample) > 0:
        plot_multiple_heartbeats(sample.sample(min(len(sample), 3)), title=f"PTBDB Extremes in {n} Class. Proportion: {proportion:.2f}%")

### MITBIH

In [None]:
mitbih_r = mitbih.iloc[:, :-1].apply(lambda row: find_first_nonzero_index(row.values), axis=1)
result_mitbih_r = pd.concat([mitbih_r.rename('zero_pad_start'), mitbih.iloc[:, -1].rename('target')], axis=1)

# Only the last expression of a cell is rendered, so display both explicitly
display(result_mitbih_r.describe())
display(result_mitbih_r.head())

In [None]:
for c in mitbih[187].unique():
    display(result_mitbih_r[result_mitbih_r["target"] == c].describe())

In [None]:
plt.figure(figsize=(8, 6))
sns.boxplot(x='target', y='zero_pad_start', data=result_mitbih_r)
plt.title('MIT-BIH Estimated R-R Distance vs Class', fontsize=16)
plt.xlabel('MIT-BIH Class', fontsize=12)
plt.ylabel('Estimated R-R Distance (samples)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
for c in mitbih[187].unique():
    df = result_mitbih_r[result_mitbih_r["target"] == c]
    idx = get_outliers_idx(df, 'zero_pad_start', 'target')
    
    proportion = len(idx) / len(df) * 100

    sample = mitbih.loc[idx]  # idx holds index labels, not positions

    if len(sample) > 0:
        min_len = min(len(sample), 3)
        plot_multiple_heartbeats(sample.sample(min_len), title=f"MIT-BIH Extremes in {str(c)} Class. Proportion: {proportion:.2f}%")

### Statistical Tests for Class Differences

#### Kruskal-Wallis Test on Principal Components

Kruskal-Wallis is univariate, so a multivariate approach is needed:
1. Apply PCA to reduce dimensionality
2. Run Kruskal-Wallis test on each principal component

In [None]:
from sklearn.decomposition import PCA
from scipy.stats import kruskal
import pandas as pd

def kruskal_multivariate(df, target_col=None, n_components=5):
    if target_col is None:
        target_col = df.columns[-1]

    X = df.drop(columns=target_col)
    y = df[target_col]

    p_values = []
    # Reduce to n_components using PCA
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)

    for i in range(n_components):
        groups = [X_pca[y == cls, i] for cls in pd.unique(y)]
        stat, p = kruskal(*groups)
        p_values.append(p)

    return pd.DataFrame({
        'PC': [f'PC{i+1}' for i in range(n_components)],
        'p_value': p_values
    })

In [None]:
kruskal_results_mit = kruskal_multivariate(mitbih, target_col=187, n_components=5)
print(kruskal_results_mit)

In [None]:
kruskal_results_ptb = kruskal_multivariate(ptbdb, target_col=187, n_components=2)
print(kruskal_results_ptb)

A low p-value for a principal component indicates that at least one class differs significantly from the others along that component.
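
Because one test is run per principal component, it can be worth adjusting the p-values for multiple comparisons before reading them as significant. The sketch below applies a Bonferroni correction, which is a technique swapped in here for illustration and not something the notebook itself does; the helper `bonferroni_adjust` and the p-values are made up.

```python
import numpy as np

def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

# Made-up p-values for five components, for illustration only
raw_p = [0.001, 0.02, 0.04, 0.30, 0.008]
adjusted = bonferroni_adjust(raw_p)
significant = adjusted < 0.05
print(adjusted.round(3), significant)
```

In practice the `p_value` column returned by `kruskal_multivariate` could be passed in place of `raw_p`.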

#### Pairwise comparison of R-R distances

In [None]:
from itertools import combinations 

def kruskal_class_pairs(df, target_col=None, agg_func='mean'):
    if target_col is None:
        target_col = df.columns[-1]

    X = df.drop(columns=target_col)
    y = df[target_col]

    # Aggregate features into a single value per sample
    if agg_func == 'mean':
        X_agg = X.mean(axis=1)
    elif agg_func == 'sum':
        X_agg = X.sum(axis=1)
    else:
        raise ValueError("agg_func must be 'mean' or 'sum'")

    results = []

    classes = pd.unique(y)
    for class1, class2 in combinations(classes, 2):
        group1 = X_agg[y == class1]
        group2 = X_agg[y == class2]

        stat, p = kruskal(group1, group2)
        results.append({
            'class1': class1,
            'class2': class2,
            'p_value': p
        })

    return pd.DataFrame(results).sort_values('p_value')


In [None]:
mitbih_r_copy = result_mitbih_r.copy()
result_ptbdb_r_copy = result_ptbdb_r.copy()

mitbih_r_copy['target'] = 'MIT_' + mitbih_r_copy['target'].astype(str)
result_ptbdb_r_copy['target'] = 'PTB_' + result_ptbdb_r_copy['target'].astype(str)

combined_df = pd.concat([mitbih_r_copy, result_ptbdb_r_copy], ignore_index=True)

combined_df.head()


In [None]:
kruskal_pairwise_df = kruskal_class_pairs(combined_df, target_col='target', agg_func='mean')

In [None]:
# Class pairs whose R-R distance distributions do not differ significantly at the 1% level
kruskal_pairwise_df[kruskal_pairwise_df["p_value"] > 0.01]

In [None]:
kruskal_pairwise_df.sort_values(by='class1', ascending=True)

## Results and Findings
 
### PTB Dataset

1. Last column in PTBDB is empty (preserved for compatibility)
2. Class imbalance validated through chi-squared test → resampling required
3. Extreme values in R-R distances identified; proportion is low (0.2%)

### MIT-BIH Dataset

1. Severe class imbalance validated through chi-squared test → resampling required
2. Extreme values in R-R distances identified; counts for classes 3 and 4 are relatively high
3. MIT class 1 vs PTB class 1 may represent similar abnormal heartbeat patterns
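
Both sets of findings call for resampling. One simple option is random oversampling of the minority classes, sketched below on a toy frame. This is only one of several possible strategies and is an assumption here, not the project's chosen method; the helper `oversample_to_majority` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def oversample_to_majority(df, target_col, random_state=42):
    """Randomly oversample each minority class up to the majority class size."""
    majority_size = df[target_col].value_counts().max()
    parts = [
        # Sample with replacement only when the class is smaller than the majority
        group.sample(n=majority_size, replace=len(group) < majority_size,
                     random_state=random_state)
        for _, group in df.groupby(target_col)
    ]
    # Shuffle so classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy imbalanced frame: 8 rows of class 0, 2 rows of class 1
toy = pd.DataFrame({"feature": np.arange(10), "target": [0] * 8 + [1] * 2})
balanced = oversample_to_majority(toy, "target")
print(balanced["target"].value_counts().to_dict())
```

If an approach like this is used, it would normally be applied to the training partition only, after the train/test split, so that duplicated rows do not leak into the evaluation data.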

## Generate Audit Report

In [None]:
# Generate audit reports for all CSV files (specify the correct path)
generate_data_audit_report(data_dir="../data/original/", output_dir="../reports/data_audit/")

# Generate summary report (specify the correct path)
generate_summary_report(data_dir="../data/original/", output_file="../reports/data_audit/data_summary.txt")