# Intro to ECG Heartbeat Classification

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 📖 TABLE OF CONTENTS

- [1. Intro]()
- [2. Understanding Time Series Problems]()
  - [📌 What is Time Series Data?]()
  - [🔍 Characteristics of Time Series Data]()
  - [⚠️ Challenges in Time Series Analysis]()
- [3. Introduction to ECG Signals]()
  - [📊 What are ECG Signals?]()
  - [🏥 Components of an ECG Signal]()
  - [📈 How to read ECG paper?]()
  - [🔍 Why Classify ECG Signals?]()
- [4. Methods for ECG Heartbeat Classification]()
  - [1️⃣ Traditional Techniques]()
  - [2️⃣ Machine Learning (ML) Techniques]()
  - [3️⃣ Deep Learning (DL) Techniques]()
  - [4️⃣ Multimodal AI Techniques]()
  - [⚖️ Comparison of Different Methods]()
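- [5. The Kaggle ECG Heartbeat Categorization Dataset]()
  - [🗂️ Dataset Overview]()
  - [🧐 Which Method to Use for This Dataset?]()
  - [✅ Conclusion: Use Deep Learning]()
  - [📌 Suggested Approach]()
- [6. Next Steps 🚀]()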

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 1. Intro

Welcome to the **HeartBeatInsight** project! This first Jupyter Notebook is your entry point into understanding **ECG Heartbeat Classification** from the ground up. We'll explore time series data, ECG signals, and various methods to classify heartbeats, ultimately helping you determine the best approach for the [**Kaggle ECG Heartbeat Categorization Dataset**](https://www.kaggle.com/datasets/shayanfazeli/heartbeat).

Let's get started! 🩺❤️

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 2. Understanding Time Series Problems

## 📌 What is Time Series Data?

**Time Series Data** is a sequence of data points collected or recorded at successive time intervals. Each data point is associated with a timestamp, and the order of the data matters because an observation typically depends on the ones that came before it.

**Examples of Time Series Data:**

- **Weather Data:** Temperature recorded daily.

- **Stock Prices:** Prices recorded minute-by-minute.
    
- **Heart Rate:** Beats per minute recorded continuously.

## 🔍 Characteristics of Time Series Data

1. **Temporal Dependency:** Data points are related to each other over time.

2. **Trend:** A long-term upward or downward movement in the data.

3. **Seasonality:** Regular patterns that repeat over specific intervals.

4. **Noise:** Random variations or anomalies in the data.
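To make these four characteristics concrete, here is a minimal sketch that builds a synthetic series from a trend, a seasonal cycle, and noise, then smooths it with a moving average to recover the trend. The series, its parameters, and the helper `moving_average` are illustrative assumptions and not part of the project. Plain Python keeps the example dependency-free; NumPy or pandas would be the usual choice in practice.

```python
import math
import random

random.seed(0)  # reproducible noise

PERIOD = 12  # assumed seasonal period (samples per cycle)
N = 120      # number of time steps

trend = [0.05 * t for t in range(N)]                                  # slow upward drift
seasonality = [math.sin(2 * math.pi * t / PERIOD) for t in range(N)]  # repeating cycle
noise = [random.gauss(0, 0.1) for _ in range(N)]                      # random variation
series = [tr + s + n for tr, s, n in zip(trend, seasonality, noise)]


def moving_average(values, window):
    """Mean over each full window of `window` consecutive samples."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Averaging over exactly one period cancels the seasonal component,
# leaving (approximately) the trend plus a little residual noise.
smoothed = moving_average(series, PERIOD)
print(f"first smoothed value: {smoothed[0]:.2f}, last: {smoothed[-1]:.2f}")
```

Choosing the window equal to the seasonal period is what removes the cycle. A shorter window would leave part of the oscillation in the smoothed series.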

## ⚠️ Challenges in Time Series Analysis

1. **Handling Noise:** Random disturbances can obscure patterns.

2. **Capturing Trends:** Identifying underlying trends accurately.

3. **Dealing with Missing Data:** Gaps in data can disrupt analysis.

4. **Complex Patterns:** Time series data often has intricate relationships that are difficult to model.
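As one example of dealing with missing data, the sketch below fills gaps by linear interpolation between the nearest known neighbours. The function name `fill_gaps_linear` and the sample readings are hypothetical, and linear interpolation is only one of several reasonable strategies (forward-fill or model-based imputation are others).

```python
def fill_gaps_linear(values):
    """Replace None entries with linearly interpolated values.

    Leading or trailing gaps are filled with the nearest known value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no known values to interpolate from")

    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:        # gap at the start
            filled[i] = values[right]
        elif right is None:     # gap at the end
            filled[i] = values[left]
        else:                   # interior gap: straight line between neighbours
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


heart_rate = [72, None, None, 78, 80, None]  # hypothetical readings, beats per minute
print(fill_gaps_linear(heart_rate))
```

Here the two interior gaps land on the straight line between 72 and 78, and the trailing gap repeats the last known reading.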

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 3. Introduction to ECG Signals

## 📊 What are ECG Signals?

An **Electrocardiogram (ECG)** measures the electrical activity of the heart over time. It's a non-invasive test used to detect heart problems like **arrhythmias (irregular heartbeats)**. Each ECG signal records a sequence of heartbeats, and each heartbeat can be categorized as **normal or abnormal**.

## 🏥 Components of an ECG Signal

In [1]:
# Components of an ECG Signal

from IPython import display
display.Image("data/images/Intro_ECG_Heartbeat_Classification-01.jpg")

<IPython.core.display.Image object>

- **P Waves**
    - P waves represent atrial depolarisation $\implies$ contraction of the atria (upper chambers).
    - In healthy individuals, there should be a P wave preceding each QRS complex.

- **PR Interval**
    - The PR interval begins at the start of the P wave and ends at the beginning of the Q wave.
    - It represents the time for electrical activity to move between the atria and the ventricles.

- **QRS Complex**
    - The QRS complex represents the ventricular depolarisation $\implies$ contraction of the ventricles (lower chambers).
    - It appears as three closely related waves on the ECG (the Q, R and S waves).

- **ST Segment**
    - The ST segment starts at the end of the S wave and ends at the beginning of the T wave.
    - The ST segment is an isoelectric line representing the time between depolarisation and repolarisation of the ventricles (i.e. ventricular contraction).

- **T Wave**
    - The T wave represents ventricular repolarisation $\implies$ relaxation of the ventricles.
    - It appears as a small wave after the QRS complex.

- **RR Interval**
    - The RR interval begins at the peak of one R wave and ends at the peak of the next R wave.
    - It represents the time between two QRS complexes.

- **QT Interval**
    - The QT interval begins at the start of the QRS complex and finishes at the end of the T wave.
    - It represents the time taken for the ventricles to depolarise and then repolarise.
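The RR interval is the component most directly tied to heart rate. As a rough sketch, assuming the R-peak positions have already been located as sample indices, RR intervals and the mean heart rate can be computed as below. The function names, the 250 Hz sampling rate, and the peak positions are all hypothetical.

```python
def rr_intervals_seconds(r_peak_indices, sampling_rate_hz):
    """Time between consecutive R peaks, in seconds."""
    return [
        (later - earlier) / sampling_rate_hz
        for earlier, later in zip(r_peak_indices, r_peak_indices[1:])
    ]


def mean_heart_rate_bpm(rr_seconds):
    """Average heart rate implied by a list of RR intervals."""
    mean_rr = sum(rr_seconds) / len(rr_seconds)
    return 60.0 / mean_rr


# Hypothetical R-peak positions (sample indices) in a recording sampled at 250 Hz.
r_peaks = [100, 300, 500, 700]
rr = rr_intervals_seconds(r_peaks, sampling_rate_hz=250)
print(rr)                        # three intervals of 0.8 s each
print(mean_heart_rate_bpm(rr))   # about 75 beats per minute
```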

## 📈 How to read ECG paper?

In [2]:
# How to read ECG paper

from IPython import display
display.Image("data/images/Intro_ECG_Heartbeat_Classification-02.jpg")

<IPython.core.display.Image object>

The paper used to record ECGs is **standardised** across most hospitals. At the standard paper speed of **25 mm/s**, it has the following characteristics:

- Each **small square** represents **0.04 seconds**
    
- Each **large square** represents **0.2 seconds**

- **5 large squares = 1 second**

- **300 large squares = 1 minute**
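These grid values lead to a common shortcut: since 300 large squares span one minute, a regular rhythm with $n$ large squares between consecutive R waves has a heart rate of roughly $300 / n$ beats per minute. The sketch below does the same arithmetic from the square durations listed above. The function name `heart_rate_from_squares` is hypothetical.

```python
SECONDS_PER_SMALL_SQUARE = 0.04
SMALL_SQUARES_PER_LARGE = 5  # 5 x 0.04 s = 0.2 s per large square


def heart_rate_from_squares(large_squares=0, small_squares=0):
    """Heart rate (bpm) from the R-to-R distance measured on ECG paper."""
    total_small = large_squares * SMALL_SQUARES_PER_LARGE + small_squares
    if total_small <= 0:
        raise ValueError("R-to-R distance must be positive")
    rr_seconds = total_small * SECONDS_PER_SMALL_SQUARE
    return 60.0 / rr_seconds


# Four large squares between R waves: 4 x 0.2 s = 0.8 s, i.e. about 75 bpm.
print(round(heart_rate_from_squares(large_squares=4)))
# Three large squares: 0.6 s, i.e. about 100 bpm.
print(round(heart_rate_from_squares(large_squares=3)))
```

This shortcut assumes a regular rhythm; for irregular rhythms, the RR distance varies from beat to beat and a single measurement is not representative.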

## 🔍 Why Classify ECG Signals?

Accurately classifying ECG signals helps in diagnosing heart conditions. Individual heartbeats are typically assigned to categories such as:

- **Normal Beats (N)**
    
- **Abnormal or Unclassifiable Beats**: Supraventricular Ectopic Beats (S), Ventricular Ectopic Beats (V), Fusion Beats (F), and Unknown Beats (Q).

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 4. Methods for ECG Heartbeat Classification

ECG Heartbeat Classification can be done using different approaches. Each method has its advantages, limitations, and best-use scenarios. Let's explore the major techniques:

## 1️⃣ Traditional Techniques

📌 **Description:**

- These methods rely on **manual feature extraction** (e.g., measuring the distance between peaks, amplitude, or duration of specific ECG segments).

- Commonly used features include the **QRS complex duration**, **PR interval**, and **heart rate variability**.

🛠️ **Examples:**

- **Rule-Based Methods:** Apply predefined rules to classify heartbeats.
    
- **Signal-Processing Methods:** Use techniques like **Fourier Transform** or **Wavelet Transform** to analyze frequency characteristics.
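The frequency-domain idea behind the Fourier Transform can be sketched with a direct discrete Fourier transform that finds the strongest frequency in a signal. This is a teaching sketch on a toy sine wave, not real ECG. The helper names are hypothetical, and the direct $O(N^2)$ sum is used only for clarity; in practice a fast implementation such as `numpy.fft.rfft` would be used instead.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of the discrete Fourier transform for bins 0 .. N//2."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def dominant_frequency_hz(signal, sampling_rate_hz):
    """Frequency of the strongest non-DC component."""
    mags = dft_magnitudes(signal)
    k = max(range(1, len(mags)), key=lambda i: mags[i])  # skip bin 0 (the mean)
    return k * sampling_rate_hz / len(signal)


# Toy signal: a 5 Hz sine sampled at 100 Hz for one second (not real ECG).
fs = 100
toy = [math.sin(2 * math.pi * 5 * t / fs) for t in range(fs)]
print(dominant_frequency_hz(toy, fs))  # 5.0
```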

✅ **Advantages:**

- Simple to understand and implement.

- Effective for small datasets with clear patterns.

❌ **Limitations:**

- Requires **domain expertise** for feature extraction.
    
- Struggles with complex or noisy data.

🕒 **When to Use:**

- When you have **small datasets**.
    
- When the patterns are straightforward and domain expertise is available.
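A rule-based method can be as small as a single threshold on a hand-measured feature. The sketch below flags a beat as premature when its RR interval is much shorter than the average of the preceding intervals. The function name, the `0.8` ratio, and the sample intervals are illustrative assumptions, not clinical criteria.

```python
def flag_premature_beats(rr_seconds, ratio_threshold=0.8):
    """Label each interval 'premature' if it is much shorter than the
    average of the preceding intervals, else 'regular'.

    The 0.8 threshold is an illustrative assumption, not a clinical criterion.
    """
    labels = []
    for i, rr in enumerate(rr_seconds):
        history = rr_seconds[:i]
        if not history:  # nothing to compare the first interval with
            labels.append("regular")
            continue
        baseline = sum(history) / len(history)
        labels.append("premature" if rr < ratio_threshold * baseline else "regular")
    return labels


rr = [0.80, 0.82, 0.79, 0.50, 1.05, 0.81]  # hypothetical RR intervals in seconds
print(flag_premature_beats(rr))
```

Only the fourth interval (0.50 s) falls below 80% of the running average, so it is the only one flagged. The rule is easy to read and audit, but it has no way to adapt to noisy or unusual recordings, which is the limitation noted above.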

## 2️⃣ Machine Learning (ML) Techniques


📌 **Description:**

- In ML, we train models to learn a mapping from extracted features to heartbeat classes, instead of writing the classification rules by hand.
    
- These models require **preprocessing and feature extraction** before classification.

🛠️ **Examples:**

1. **Support Vector Machines (SVMs)**
    
2. **Random Forests**
    
3. **K-Nearest Neighbors (KNN)**
    
4. **Gradient Boosting (e.g., XGBoost)**

✅ **Advantages:**

- Good for **moderately sized datasets**.
    
- Offers a balance between simplicity and performance.

❌ **Limitations:**

- Requires **manual feature engineering**.
    
- Performance is capped by the quality of the hand-crafted features, so it may plateau on large, complex datasets.

🕒 **When to Use:**

- When you have **moderately sized datasets**.
    
- When you can extract meaningful features.
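The two-stage ML workflow (extract features, then classify) can be sketched with a tiny k-nearest-neighbours classifier. Everything here is an assumption for illustration: the two features, the made-up five-sample "beats", and the placeholder labels `"A"` and `"B"`. A real pipeline would more likely use a library such as scikit-learn and clinically motivated features.

```python
import math
from collections import Counter


def extract_features(beat):
    """Two simple hand-crafted features: peak amplitude and mean absolute slope."""
    peak = max(beat)
    slope = sum(abs(b - a) for a, b in zip(beat, beat[1:])) / (len(beat) - 1)
    return (peak, slope)


def knn_predict(train_features, train_labels, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        range(len(train_features)),
        key=lambda i: math.dist(train_features[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Tiny made-up "beats" (not real ECG): tall and sharp (A) vs. low and flat (B).
train_beats = [
    [0.0, 0.2, 1.0, 0.2, 0.0],
    [0.0, 0.3, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.95, 0.2, 0.0],
    [0.1, 0.2, 0.3, 0.2, 0.1],
    [0.1, 0.25, 0.35, 0.2, 0.1],
    [0.0, 0.2, 0.3, 0.25, 0.1],
]
train_labels = ["A", "A", "A", "B", "B", "B"]

train_features = [extract_features(b) for b in train_beats]
query = extract_features([0.0, 0.25, 0.92, 0.15, 0.0])
print(knn_predict(train_features, train_labels, query, k=3))  # A
```

The classifier never sees the raw signal, only the two numbers produced by `extract_features`, so its accuracy depends heavily on how well those features separate the classes.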

## 3️⃣ Deep Learning (DL) Techniques

📌 **Description:**

- DL models like **Convolutional Neural Networks (CNNs)** and **Recurrent Neural Networks (RNNs)** automatically learn features directly from raw ECG signals.
    
- **Hybrid Models** (CNN + LSTM) combine the strengths of both architectures.

🛠️ **Examples:**

1. **1D CNN:** Recognizes patterns in raw signals.
    
2. **LSTM:** Captures temporal dependencies.
    
3. **CNN + LSTM:** Combines pattern recognition and sequence modeling.

✅ **Advantages:**

- **Automatic feature extraction**.
    
- Handles large, complex datasets well.
    
- Can tolerate noisy data when trained on enough representative examples.

❌ **Limitations:**

- Requires **large datasets** for training.
    
- Computationally intensive (training usually benefits from GPUs).

🕒 **When to Use:**

- When you have **large datasets** with complex patterns.
    
- When you want to avoid manual feature extraction.
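To show what "recognizing patterns in raw signals" means for a 1D CNN, here is a minimal sketch of its three basic operations (convolution, ReLU, max pooling) on a toy signal. The kernel is hand-picked for illustration, whereas a real CNN learns its kernels during training. All names and values are assumptions, and a real model would be built with a deep learning framework.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` (no padding) and take dot products.

    Like most deep learning libraries, this is cross-correlation:
    the kernel is not flipped.
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Keep positive responses, zero out negative ones."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Toy signal with one sharp spike, loosely imitating an R peak (not real ECG).
signal = [0.0, 0.0, 0.1, 1.0, 0.1, 0.0, 0.0, 0.0]
# Hand-picked kernel that responds to "low, high, low"; a CNN would learn its kernels.
kernel = [-0.5, 1.0, -0.5]

feature_map = relu(conv1d_valid(signal, kernel))
pooled = max_pool1d(feature_map, size=2)
print(feature_map)
print(pooled)
```

The feature map is non-zero only where the kernel is centred on the spike, and pooling keeps that response while halving the length. Stacking many such learned filters is what lets a CNN build features without manual engineering.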

## 4️⃣ Multimodal AI Techniques

📌 **Description:**

- Combines multiple data sources (e.g., ECG signals + patient history) to improve classification accuracy.

✅ **Advantages:**

- More comprehensive analysis by considering additional context.
    
- Improved accuracy and robustness.

❌ **Limitations:**

- Requires integrating multiple types of data.
    
- Computationally complex.

🕒 **When to Use:**

- When you have **additional data sources** beyond ECG signals.

## ⚖️ Comparison of Different Methods

| Method | Dataset Size | Computational Complexity | Memory Requirement | Latency | Best Use Case |
| :----- | :----------- | :----------------------- | :----------------- | :------ | :------------ |
| **Traditional** | Small | Low | Low | Low | Simple datasets, clear patterns |
| **Machine Learning** | Moderate | Moderate | Moderate | Moderate | Moderate datasets, extracted features |
| **Deep Learning** | Large | High | High | High | Large datasets, automatic feature learning |
| **Multimodal AI** | Large + Additional Data | Very High | Very High | High | When combining ECG with other data sources |

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 5. The Kaggle ECG Heartbeat Categorization Dataset

## 🗂️ Dataset Overview

This dataset, available on [Kaggle](https://www.kaggle.com/datasets/shayanfazeli/heartbeat), is derived from the **MIT-BIH Arrhythmia Database** and has been preprocessed for ease of use in heartbeat classification tasks. Below is a detailed overview of the dataset:

1. **Data Description**

    - The MIT-BIH portion of the dataset is divided into two CSV files:
        
        - `mitbih_train.csv` (Training Data): Contains **87,554** heartbeat samples.
        
        - `mitbih_test.csv` (Testing Data): Contains **21,892** heartbeat samples.

    - Each row represents **a single heartbeat segment** described by **187 time steps (data points)**, followed by a final column that holds the class label.

2. **Features**

    - **ECG Signal Data:**
        
        - Each heartbeat is represented by **187 numerical values**.
        
        - These values capture the time series of the ECG signal segment.

3. **Labels**

    - The dataset includes 5 distinct classes of heartbeats:

        | Class Label | Class Description | Code |
        | :---------- | :---------------- | :--- |
        | 0 | Normal Beat | N |
        | 1 | Supraventricular Ectopic Beat | S |
        | 2 | Ventricular Ectopic Beat | V |
        | 3 | Fusion Beat | F |
        | 4 | Unknown Beat | Q |

    - These classes are based on the **AAMI (Association for the Advancement of Medical Instrumentation)** heartbeat classification standard.

4. **Dataset Summary**

    - **Total Samples:**
        
        - **Training Set:** 87,554 heartbeats
        - **Test Set:** 21,892 heartbeats
        - **Combined Total:** 109,446 heartbeats

    - **Input Shape:**
        - Each sample has a shape of **(187,)** representing the 187 time steps in the ECG signal.

    - **Output:**
        - Each sample is labeled with one of the 5 heartbeat classes.

5. **Distribution of Classes**

    The dataset is **imbalanced**: normal beats far outnumber every other class. Qualitatively, the classes compare as follows:

    - **Class 0 (Normal):** Most frequent (the large majority of samples).
    
    - **Class 1 (Supraventricular Ectopic):** Less frequent.
    
    - **Class 2 (Ventricular Ectopic):** Less frequent.
    
    - **Class 3 (Fusion):** Rare (the smallest class).
    
    - **Class 4 (Unknown):** Less frequent.

6. **Preprocessing Steps (Already Applied)**

    The dataset has been **segmented and normalized**, making it ready for model training.
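The row layout and the class imbalance described above can be explored with a short sketch. It assumes each CSV row holds 187 signal values followed by the label in the last column, and it uses three synthetic rows as a stand-in for `mitbih_train.csv`, so no real data appears here. The helpers `split_row` and `class_weights` are hypothetical; the weighting formula is the same inverse-frequency heuristic that scikit-learn uses for `class_weight="balanced"`.

```python
import csv
import io
from collections import Counter

N_TIME_STEPS = 187  # signal values per row; the label is assumed to be the final column


def split_row(row):
    """Split one CSV row into (signal, label)."""
    values = [float(v) for v in row]
    if len(values) != N_TIME_STEPS + 1:
        raise ValueError(f"expected {N_TIME_STEPS + 1} columns, got {len(values)}")
    return values[:N_TIME_STEPS], int(values[-1])


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Stand-in for `mitbih_train.csv`: three synthetic rows, NOT real data.
fake_rows = [
    [0.5] * N_TIME_STEPS + [0.0],
    [0.4] * N_TIME_STEPS + [0.0],
    [0.9] * N_TIME_STEPS + [2.0],
]
fake_csv = io.StringIO()
csv.writer(fake_csv).writerows(fake_rows)
fake_csv.seek(0)

signals, labels = zip(*(split_row(row) for row in csv.reader(fake_csv)))
print(Counter(labels))        # class counts
print(class_weights(labels))  # rarer classes get larger weights
```

On the real files, the same `Counter` call gives the exact class counts, and the resulting weights can be passed to a training loop so that rare classes are not ignored.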

## 🧐 Which Method to Use for This Dataset?

Given the characteristics of this dataset:

1. **Dataset Size:** Large (87,554 samples in training).
    
2. **Complexity:** ECG signals have intricate patterns.
    
3. **Noise:** Real-world ECG data may contain noise.

## ✅ Conclusion: Use Deep Learning

**Justification:**

- **Large Dataset:** Deep Learning models excel with large datasets.
    
- **Automatic Feature Extraction:** CNNs and LSTMs can learn features directly from raw signals.
    
- **Complex Patterns:** Deep Learning can capture subtle differences between heartbeat types.
    
- **Robustness to Noise:** Given enough representative training data, DL models tend to tolerate noisy signals better than hand-crafted rules.

Given the characteristics of this dataset (large sample size, complexity, and variability in heartbeats), **Deep Learning** is the most suitable approach for heartbeat classification.

## 📌 Suggested Approach

1. **Start with a 1D CNN** for feature extraction to identify local patterns in the ECG signals.

2. **Experiment with LSTM or Hybrid CNN + LSTM models** to capture temporal dependencies.

3. **Evaluate Performance Using the Following Metrics:**

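    To fully assess the model's effectiveness, use the metrics below. Because this is a 5-class problem, the per-class metrics are computed one-vs-rest (one class treated as positive, all others as negative) and then averaged across classes: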

    - **Accuracy:**
        
        - Measures the overall correctness of the model. On imbalanced data it can look misleadingly high when the majority class dominates.
        
        - $\text{Accuracy} = \frac {\text{Correct Predictions}}{\text{Total Predictions}}$
    
    - **Precision (Positive Predictive Value):**
        
        - Measures the proportion of correctly predicted positive instances out of all predicted positives.
        
        - Useful when false positives are costly.

        - $\text{Precision} = \frac {\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

    - **Recall (Sensitivity or True Positive Rate):**
    
        - Measures the proportion of correctly predicted positive instances out of all actual positives.
        
        - Important when missing positive cases is critical.

        - $\text{Recall} = \frac {\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

    - **F1-Score:**
    
        - The harmonic mean of Precision and Recall.
        
        - Useful when the dataset is imbalanced.

        - $\text{F1-Score} = 2 \times \frac {\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

    - **Confusion Matrix:**
    
        - Provides a detailed breakdown of correct and incorrect classifications for each class.
        
        - Helps visualize performance across different classes.

    - **ROC-AUC Score (Receiver Operating Characteristic - Area Under Curve):**
    
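        - Measures the model's ability to distinguish between classes across all decision thresholds. In a multi-class setting it is usually computed one-vs-rest per class and then averaged.

        - Higher scores indicate better performance: 1.0 is a perfect ranking and 0.5 is no better than chance.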

    - **Specificity (True Negative Rate):**
    
        - Measures the proportion of correctly predicted negative instances out of all actual negatives.

        - $\text{Specificity} = \frac {\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$

    - **Balanced Accuracy:**
    
        - The average of Sensitivity and Specificity. In the multi-class case it is commonly taken as the mean of the per-class recall values.
        
        - Useful for imbalanced datasets.

        - $\text{Balanced Accuracy} = \frac {\text{Sensitivity} + \text{Specificity}}{2}$
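The formulas above can all be derived from a confusion matrix. The sketch below computes them one-vs-rest for each class and then averages. The 3-class matrix is hypothetical and deliberately imbalanced, and the function name `per_class_metrics` is an assumption. In practice, `sklearn.metrics` offers equivalent, well-tested functions.

```python
def per_class_metrics(confusion, cls):
    """One-vs-rest metrics for class `cls`.

    `confusion[i][j]` = number of samples with true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                        # true cls, predicted other
    fp = sum(confusion[i][cls] for i in range(n)) - tp   # other class, predicted cls
    tn = total - tp - fn - fp

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
confusion = [
    [90, 5, 5],
    [10, 8, 2],
    [4, 1, 15],
]

accuracy = sum(confusion[i][i] for i in range(3)) / sum(map(sum, confusion))
metrics = [per_class_metrics(confusion, c) for c in range(3)]
macro_f1 = sum(m["f1"] for m in metrics) / len(metrics)
# Multi-class balanced accuracy: the mean of per-class recall. For two classes
# this equals (Sensitivity + Specificity) / 2.
balanced_accuracy = sum(m["recall"] for m in metrics) / len(metrics)

print(f"accuracy={accuracy:.3f}  balanced_accuracy={balanced_accuracy:.3f}  macro_f1={macro_f1:.3f}")
```

In this example the accuracy is noticeably higher than the balanced accuracy because the majority class is classified well while the second class has a recall of only 0.4. This is the pattern to watch for on an imbalanced dataset.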

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)

# 6. Next Steps 🚀

In the next Jupyter Notebooks, we will:

- **Preprocess the ECG dataset** (scaling, reshaping).
    
- **Build and train Deep Learning models** (CNN, LSTM, and hybrid models).
    
- **Analyze the results** using the metrics listed above.

- **Evaluate and compare model performance.**

- **Visualize ECG signals and model predictions** for better understanding.
    
- **Deploy the best model** for real-world use cases.

![rainbow](https://github.com/ancilcleetus/My-Learning-Journey/assets/25684256/839c3524-2a1d-4779-85a0-83c562e1e5e5)