# Task 1: Arrhythmia Classification Using CNN

## Objective
Classify arrhythmias from ECG signals using Convolutional Neural Networks (CNN).

## Dataset
- **Source**: Heartbeat Dataset from Google Drive
- **URL**: https://drive.google.com/file/d/1xAs-CjlpuDqUT2EJUVR5cPuqTUdw2uQg/view?usp=sharing
- **Task**: Multi-class classification of ECG heartbeat signals

## Approach
1. Download and load the dataset
2. Exploratory Data Analysis (EDA)
3. Data preprocessing and normalization
4. Build 1D CNN architecture
5. Train and validate the model
6. Comprehensive evaluation with metrics and visualizations

## 1. Environment Setup and Dependencies

In [None]:
# Install required packages
!pip install gdown torch pandas numpy scikit-learn matplotlib seaborn plotly tqdm

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Deep Learning
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset
import torch.nn.functional as F

# Sklearn for preprocessing and metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix, classification_report

# Utilities
import gdown
import os
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)  # seed every visible GPU, not just the current one

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

## 2. Dataset Download and Loading

In [None]:
# Download dataset from Google Drive
file_id = '1xAs-CjlpuDqUT2EJUVR5cPuqTUdw2uQg'
url = f'https://drive.google.com/uc?id={file_id}'
output_path = 'heartbeat_data.csv'

print("Downloading heartbeat dataset...")
gdown.download(url, output_path, quiet=False)

# Load the dataset
print("\nLoading dataset...")
df = pd.read_csv(output_path)

print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())

print(f"\nDataset info:")
df.info()  # info() prints directly and returns None, so it should not be wrapped in print()

print(f"\nColumn names:")
print(df.columns.tolist())

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Identify features and target
# Assuming the last column is the target and others are features
target_col = df.columns[-1]
feature_cols = df.columns[:-1]

print(f"Target column: {target_col}")
print(f"Number of features: {len(feature_cols)}")

# Check target distribution
print(f"\nTarget distribution:")
target_counts = df[target_col].value_counts().sort_index()
print(target_counts)

# Visualize target distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
target_counts.plot(kind='bar')
plt.title('Class Distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
plt.pie(target_counts.values, labels=target_counts.index, autopct='%1.1f%%')
plt.title('Class Distribution (Pie Chart)')

plt.tight_layout()
plt.show()

# Check for missing values
print(f"\nMissing values:")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0] if missing_values.sum() > 0 else "No missing values found")
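
If the class distribution above turns out to be imbalanced, one common option (not used in this notebook) is to weight the loss by inverse class frequency. The following is a minimal sketch under that assumption; `inverse_frequency_weights` is an illustrative helper and the label counts are hypothetical, not taken from the dataset.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Return (classes, weights) with weight_c = N / (K * count_c)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return classes, weights

# Hypothetical, deliberately imbalanced labels (not taken from the dataset)
example_labels = np.array([0] * 80 + [1] * 15 + [2] * 5)
classes, weights = inverse_frequency_weights(example_labels)
for c, w in zip(classes, weights):
    print(f"class {c}: weight {w:.3f}")
```

Weights like these could later be passed to the loss as `nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32).to(device))`; whether that helps here depends on how skewed the actual distribution is.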

In [None]:
# Visualize sample ECG signals for each class (sorted so the plots appear in class order)
unique_classes = np.sort(df[target_col].unique())
n_classes = len(unique_classes)

fig, axes = plt.subplots(n_classes, 1, figsize=(15, 3*n_classes))
if n_classes == 1:
    axes = [axes]

for i, class_label in enumerate(unique_classes):
    # Get a sample from this class
    class_data = df[df[target_col] == class_label]
    sample_idx = class_data.index[0]
    signal = df.loc[sample_idx, feature_cols].values
    
    axes[i].plot(signal)
    axes[i].set_title(f'Sample ECG Signal - Class {class_label}')
    axes[i].set_xlabel('Time Points')
    axes[i].set_ylabel('Amplitude')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Statistical analysis of features
print("Feature statistics:")
feature_data = df[feature_cols]
print(f"Feature data shape: {feature_data.shape}")
print(f"Feature range: [{feature_data.min().min():.3f}, {feature_data.max().max():.3f}]")
print(f"Feature mean: {feature_data.mean().mean():.3f}")
print(f"Feature std: {feature_data.std().mean():.3f}")

# Visualize feature distributions
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.hist(feature_data.mean(axis=1), bins=50, alpha=0.7)
plt.title('Distribution of Signal Means')
plt.xlabel('Mean Amplitude')
plt.ylabel('Frequency')

plt.subplot(1, 3, 2)
plt.hist(feature_data.std(axis=1), bins=50, alpha=0.7)
plt.title('Distribution of Signal Standard Deviations')
plt.xlabel('Standard Deviation')
plt.ylabel('Frequency')

plt.subplot(1, 3, 3)
plt.hist(feature_data.max(axis=1) - feature_data.min(axis=1), bins=50, alpha=0.7)
plt.title('Distribution of Signal Ranges')
plt.xlabel('Range (Max - Min)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

## 4. Data Preprocessing

In [None]:
# Prepare features and target
X = df[feature_cols].values
y = df[target_col].values

print(f"Original data shapes:")
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"Unique classes: {np.unique(y)}")

# Encode labels if they're not already numeric
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
n_classes = len(np.unique(y_encoded))

print(f"\nAfter encoding:")
print(f"Number of classes: {n_classes}")
print(f"Encoded classes: {np.unique(y_encoded)}")
print(f"Class mapping: {dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))}")

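# Split the data first so that scaling statistics come from the training set only
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y_encoded, test_size=0.3, random_state=42, stratify=y_encoded
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# Standardize features: fit on the training set, then apply to validation and test
# (fitting on the full dataset would leak validation/test statistics into training)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

print(f"\nAfter scaling (training set):")
print(f"Feature range: [{X_train.min():.3f}, {X_train.max():.3f}]")
print(f"Feature mean: {X_train.mean():.3f}")
print(f"Feature std: {X_train.std():.3f}")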

print(f"\nData split:")
print(f"Training set: {X_train.shape[0]} samples")
print(f"Validation set: {X_val.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train).unsqueeze(1)  # Add channel dimension
X_val_tensor = torch.FloatTensor(X_val).unsqueeze(1)
X_test_tensor = torch.FloatTensor(X_test).unsqueeze(1)

y_train_tensor = torch.LongTensor(y_train)
y_val_tensor = torch.LongTensor(y_val)
y_test_tensor = torch.LongTensor(y_test)

print(f"\nTensor shapes:")
print(f"X_train_tensor: {X_train_tensor.shape}")
print(f"y_train_tensor: {y_train_tensor.shape}")

## 5. CNN Model Architecture

In [None]:
class ECG_CNN(nn.Module):
    def __init__(self, input_size, n_classes, dropout_rate=0.5):
        super(ECG_CNN, self).__init__()
        
        # First convolutional block
        self.conv1 = nn.Conv1d(1, 32, kernel_size=5, padding=2)
        self.bn1 = nn.BatchNorm1d(32)
        self.pool1 = nn.MaxPool1d(2)
        
        # Second convolutional block
        self.conv2 = nn.Conv1d(32, 64, kernel_size=5, padding=2)
        self.bn2 = nn.BatchNorm1d(64)
        self.pool2 = nn.MaxPool1d(2)
        
        # Third convolutional block
        self.conv3 = nn.Conv1d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm1d(128)
        self.pool3 = nn.MaxPool1d(2)
        
        # Fourth convolutional block
        self.conv4 = nn.Conv1d(128, 256, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm1d(256)
        self.pool4 = nn.MaxPool1d(2)
        
        # Calculate the size after convolutions
        self.feature_size = self._get_conv_output_size(input_size)
        
        # Fully connected layers
        self.fc1 = nn.Linear(self.feature_size, 512)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(512, 256)
        self.dropout2 = nn.Dropout(dropout_rate)
        self.fc3 = nn.Linear(256, n_classes)
        
    def _get_conv_output_size(self, input_size):
        # Calculate output size after all conv and pooling layers
        size = input_size
        size = size // 2  # pool1
        size = size // 2  # pool2
        size = size // 2  # pool3
        size = size // 2  # pool4
        return size * 256  # 256 is the number of channels after conv4
        
    def forward(self, x):
        # First conv block
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.pool1(x)
        
        # Second conv block
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool2(x)
        
        # Third conv block
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool3(x)
        
        # Fourth conv block
        x = F.relu(self.bn4(self.conv4(x)))
        x = self.pool4(x)
        
        # Flatten
        x = x.view(x.size(0), -1)
        
        # Fully connected layers
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        
        return x

# Initialize model
input_size = X_train.shape[1]
model = ECG_CNN(input_size, n_classes).to(device)

print(f"Model architecture:")
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
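
The `_get_conv_output_size` method works because every convolution in `ECG_CNN` uses stride 1 with padding `(kernel_size - 1) / 2`, which preserves the sequence length, so only the four `MaxPool1d(2)` layers shrink it. The sketch below checks that arithmetic with the general output-length formula `floor((L + 2p - k) / s) + 1`. The helper names and the input length of 187 are assumptions for illustration; the real length is `X_train.shape[1]`.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1):
    """Output length of a 1D convolution or pooling layer (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

def flattened_size(input_len, channels=256):
    """Size of the flattened feature vector fed to fc1."""
    # (kernel_size, padding) for conv1..conv4, as defined in ECG_CNN
    conv_specs = [(5, 2), (5, 2), (3, 1), (3, 1)]
    length = input_len
    for kernel_size, padding in conv_specs:
        length = conv1d_out_len(length, kernel_size, padding)      # length-preserving conv
        length = conv1d_out_len(length, kernel_size=2, stride=2)   # MaxPool1d(2)
    return length * channels

# Hypothetical input length: 187 -> 93 -> 46 -> 23 -> 11 time steps
print(flattened_size(187))
```

For this hypothetical length the result matches four successive floor divisions by 2 followed by multiplication by the 256 output channels.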

## 6. Training Setup

In [None]:
# Create data loaders
batch_size = 64

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=5, factor=0.5)

print(f"Training setup:")
print(f"Batch size: {batch_size}")
print(f"Number of batches - Train: {len(train_loader)}, Val: {len(val_loader)}, Test: {len(test_loader)}")
print(f"Loss function: {criterion}")
print(f"Optimizer: {optimizer}")
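
To make the scheduler settings concrete, the sketch below simulates the core rule behind `ReduceLROnPlateau(mode='min', patience=5, factor=0.5)` in plain Python: the learning rate is multiplied by `factor` once the monitored loss has failed to improve for more than `patience` consecutive epochs. It is a simplification that ignores the scheduler's `threshold` and `cooldown` options; `simulate_plateau_schedule` and the loss values are hypothetical.

```python
def simulate_plateau_schedule(val_losses, lr=0.001, factor=0.5, patience=5):
    """Simplified plateau rule (mode='min'); ignores threshold and cooldown."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # Reduce only after MORE than `patience` epochs without improvement
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

# Hypothetical losses: three improving epochs, then eight stalled epochs
demo_losses = [1.0, 0.8, 0.7] + [0.7] * 8
lr_history = simulate_plateau_schedule(demo_losses)
print(lr_history)
```

With these values the rate stays at 0.001 through the fifth stalled epoch and is halved on the sixth.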

## 7. Model Training

In [None]:
def train_epoch(model, train_loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for batch_idx, (data, target) in enumerate(tqdm(train_loader, desc="Training")):
        data, target = data.to(device), target.to(device)
        
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()
        total += target.size(0)
    
    avg_loss = total_loss / len(train_loader)
    accuracy = 100. * correct / total
    return avg_loss, accuracy

def validate_epoch(model, val_loader, criterion, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for data, target in tqdm(val_loader, desc="Validation"):
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss = criterion(output, target)
            
            total_loss += loss.item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
            total += target.size(0)
    
    avg_loss = total_loss / len(val_loader)
    accuracy = 100. * correct / total
    return avg_loss, accuracy

# Training loop
num_epochs = 50
best_val_acc = 0
patience = 10
patience_counter = 0

train_losses = []
train_accuracies = []
val_losses = []
val_accuracies = []

print("Starting training...")
for epoch in range(num_epochs):
    print(f"\nEpoch {epoch+1}/{num_epochs}")
    
    # Training
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    
    # Validation
    val_loss, val_acc = validate_epoch(model, val_loader, criterion, device)
    
    # Update learning rate
    scheduler.step(val_loss)
    
    # Store metrics
    train_losses.append(train_loss)
    train_accuracies.append(train_acc)
    val_losses.append(val_loss)
    val_accuracies.append(val_acc)
    
    print(f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
    print(f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%")
    print(f"Learning Rate: {optimizer.param_groups[0]['lr']:.6f}")
    
    # Early stopping
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_ecg_model.pth')
        patience_counter = 0
        print(f"New best validation accuracy: {best_val_acc:.2f}%")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping triggered after {epoch+1} epochs")
            break

print(f"\nTraining completed!")
print(f"Best validation accuracy: {best_val_acc:.2f}%")
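
The early-stopping bookkeeping in the loop above (`best_val_acc`, `patience_counter`) can also be packaged as a small helper, which keeps the training loop shorter. This is a sketch of one possible refactoring, not the notebook's implementation; the `EarlyStopping` class and the accuracy values are hypothetical.

```python
class EarlyStopping:
    """Track the best score and signal when patience is exhausted."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_score = float("-inf")
        self.counter = 0

    def step(self, score):
        """Return (improved, should_stop) for a higher-is-better score."""
        if score > self.best_score:
            self.best_score = score
            self.counter = 0
            return True, False
        self.counter += 1
        return False, self.counter >= self.patience

# Hypothetical validation accuracies; patience shortened to 3 for the demo
stopper = EarlyStopping(patience=3)
stopped_at = None
for demo_epoch, demo_acc in enumerate([90.0, 92.5, 92.1, 92.4, 92.0, 93.0], start=1):
    improved, should_stop = stopper.step(demo_acc)
    if should_stop:
        stopped_at = demo_epoch
        print(f"Stopping at epoch {demo_epoch}, best accuracy {stopper.best_score}")
        break
```

In the real loop, `improved` would be the point at which to call `torch.save(model.state_dict(), 'best_ecg_model.pth')`.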

## 8. Training Visualization

In [None]:
# Plot training history
epochs_range = range(1, len(train_losses) + 1)

plt.figure(figsize=(15, 5))

# Loss plot
plt.subplot(1, 2, 1)
plt.plot(epochs_range, train_losses, 'b-', label='Training Loss')
plt.plot(epochs_range, val_losses, 'r-', label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True, alpha=0.3)

# Accuracy plot
plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_accuracies, 'b-', label='Training Accuracy')
plt.plot(epochs_range, val_accuracies, 'r-', label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy (%)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Interactive plot with Plotly
fig = make_subplots(rows=1, cols=2, subplot_titles=('Loss', 'Accuracy'))

fig.add_trace(
    go.Scatter(x=list(epochs_range), y=train_losses, mode='lines', name='Train Loss'),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=list(epochs_range), y=val_losses, mode='lines', name='Val Loss'),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=list(epochs_range), y=train_accuracies, mode='lines', name='Train Acc'),
    row=1, col=2
)
fig.add_trace(
    go.Scatter(x=list(epochs_range), y=val_accuracies, mode='lines', name='Val Acc'),
    row=1, col=2
)

fig.update_xaxes(title_text="Epoch", row=1, col=1)
fig.update_xaxes(title_text="Epoch", row=1, col=2)
fig.update_yaxes(title_text="Loss", row=1, col=1)
fig.update_yaxes(title_text="Accuracy (%)", row=1, col=2)

fig.update_layout(height=400, showlegend=True, title_text="Training History")
fig.show()

## 9. Model Evaluation

In [None]:
# Load best model
model.load_state_dict(torch.load('best_ecg_model.pth', map_location=device))
model.eval()

# Test the model
def evaluate_model(model, test_loader, device):
    model.eval()
    all_preds = []
    all_targets = []
    
    with torch.no_grad():
        for data, target in tqdm(test_loader, desc="Testing"):
            data, target = data.to(device), target.to(device)
            output = model(data)
            pred = output.argmax(dim=1)
            
            all_preds.extend(pred.cpu().numpy())
            all_targets.extend(target.cpu().numpy())
    
    return np.array(all_preds), np.array(all_targets)

# Get predictions
y_pred, y_true = evaluate_model(model, test_loader, device)

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
cm = confusion_matrix(y_true, y_pred)

print(f"Test Results:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

# Detailed classification report
class_names = [str(cls) for cls in label_encoder.classes_]
report = classification_report(y_true, y_pred, target_names=class_names)
print(f"\nDetailed Classification Report:")
print(report)
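
The summary metrics above use `average='weighted'`, which weights each class by its support. On an imbalanced test set this can mask poor performance on rare classes, so it may be worth comparing against the macro average. The sketch below computes both from scratch with NumPy on hypothetical labels; `per_class_f1` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes):
    """Return per-class F1 scores and supports."""
    f1 = np.zeros(n_classes)
    support = np.zeros(n_classes, dtype=int)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom > 0 else 0.0
        support[c] = np.sum(y_true == c)
    return f1, support

# Hypothetical labels: 90 samples of class 0, 10 samples of class 1
y_true_demo = np.array([0] * 90 + [1] * 10)
# All of class 0 is predicted correctly, but only 2 of the 10 class-1 samples
y_pred_demo = np.array([0] * 90 + [1] * 2 + [0] * 8)

f1_demo, support_demo = per_class_f1(y_true_demo, y_pred_demo, n_classes=2)
macro_f1 = f1_demo.mean()
weighted_f1 = np.average(f1_demo, weights=support_demo)
print(f"Per-class F1: {np.round(f1_demo, 3)}")
print(f"Macro F1: {macro_f1:.3f}, Weighted F1: {weighted_f1:.3f}")
```

The same comparison is available from scikit-learn by calling `precision_recall_fscore_support(y_true, y_pred, average='macro')` alongside the weighted call.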

## 10. Confusion Matrix Visualization

In [None]:
# Plot confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=class_names, yticklabels=class_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Normalized confusion matrix
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

plt.figure(figsize=(10, 8))
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.title('Normalized Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Interactive confusion matrix with Plotly
fig = go.Figure(data=go.Heatmap(
    z=cm,
    x=class_names,
    y=class_names,
    colorscale='Blues',
    text=cm,
    texttemplate="%{text}",
    textfont={"size":12}
))

fig.update_layout(
    title='Interactive Confusion Matrix',
    xaxis_title='Predicted Label',
    yaxis_title='True Label',
    width=600,
    height=600
)

fig.show()
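
Row normalization turns each row of the confusion matrix into the fraction of that true class assigned to each predicted class, so the diagonal holds per-class recall. Dividing by `cm.sum(axis=1)` produces NaN for any class with no test samples. That is unlikely with a stratified split, but a guarded version is easy to write. The sketch below is one way to do it; `normalize_rows` and the example matrix are hypothetical.

```python
import numpy as np

def normalize_rows(cm):
    """Divide each row by its sum; rows that sum to zero stay all-zero."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums != 0)

# Hypothetical confusion matrix in which class 2 has no true samples
cm_demo = np.array([[8, 2, 0],
                    [1, 9, 0],
                    [0, 0, 0]])
cm_demo_normalized = normalize_rows(cm_demo)
print(cm_demo_normalized)
```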

## 11. Per-Class Performance Analysis

In [None]:
# Calculate per-class metrics
precision_per_class, recall_per_class, f1_per_class, support_per_class = precision_recall_fscore_support(
    y_true, y_pred, average=None
)

# Create DataFrame for better visualization
metrics_df = pd.DataFrame({
    'Class': class_names,
    'Precision': precision_per_class,
    'Recall': recall_per_class,
    'F1-Score': f1_per_class,
    'Support': support_per_class
})

print("Per-class Performance:")
print(metrics_df.round(4))

# Visualize per-class metrics
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Precision
axes[0, 0].bar(class_names, precision_per_class)
axes[0, 0].set_title('Precision per Class')
axes[0, 0].set_ylabel('Precision')
axes[0, 0].tick_params(axis='x', rotation=45)

# Recall
axes[0, 1].bar(class_names, recall_per_class)
axes[0, 1].set_title('Recall per Class')
axes[0, 1].set_ylabel('Recall')
axes[0, 1].tick_params(axis='x', rotation=45)

# F1-Score
axes[1, 0].bar(class_names, f1_per_class)
axes[1, 0].set_title('F1-Score per Class')
axes[1, 0].set_ylabel('F1-Score')
axes[1, 0].tick_params(axis='x', rotation=45)

# Support
axes[1, 1].bar(class_names, support_per_class)
axes[1, 1].set_title('Support per Class')
axes[1, 1].set_ylabel('Number of Samples')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 12. Model Predictions Visualization

In [None]:
# Visualize some predictions
def visualize_predictions(model, test_loader, device, num_samples=8):
    model.eval()
    
    # Get a batch of test data
    data_iter = iter(test_loader)
    data, targets = next(data_iter)
    
    # Select the first num_samples (capped at the batch size)
    num_samples = min(num_samples, data.size(0))
    data = data[:num_samples].to(device)
    targets = targets[:num_samples]
    
    with torch.no_grad():
        outputs = model(data)
        predictions = outputs.argmax(dim=1).cpu().numpy()
    
    # Plot on a 4-column grid with as many rows as needed
    n_cols = 4
    n_rows = (num_samples + n_cols - 1) // n_cols
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 4 * n_rows))
    axes = np.atleast_1d(axes).ravel()
    
    for i in range(num_samples):
        signal = data[i].cpu().numpy().squeeze()
        true_label = label_encoder.classes_[targets[i].item()]
        pred_label = label_encoder.classes_[predictions[i]]
        
        axes[i].plot(signal)
        axes[i].set_title(f'True: {true_label}, Pred: {pred_label}')
        axes[i].set_xlabel('Time Points')
        axes[i].set_ylabel('Amplitude (standardized)')
        
        # Color coding: green if correct, red if incorrect
        color = 'green' if true_label == pred_label else 'red'
        axes[i].title.set_color(color)
        axes[i].grid(True, alpha=0.3)
    
    # Hide any unused axes in the last row
    for ax in axes[num_samples:]:
        ax.set_visible(False)
    
    plt.tight_layout()
    plt.show()

visualize_predictions(model, test_loader, device)

## 13. Summary and Conclusions

### Model Performance Summary
- **Architecture**: 1D CNN with 4 convolutional blocks and 3 fully connected layers
- **Training Strategy**: Adam optimizer with learning rate scheduling and early stopping
- **Data Preprocessing**: StandardScaler normalization and stratified train/val/test split

### Key Results
- Test Accuracy: [Will be filled after training]
- Weighted F1-Score: [Will be filled after training]
- The per-class metrics (Section 11) and confusion matrices (Section 10) show which arrhythmia types the model separates well and which it confuses

### Technical Highlights
1. **Data Preprocessing**: Feature standardization and stratified splitting, which preserves the class proportions across the train, validation, and test sets
2. **Model Architecture**: Deep 1D CNN with batch normalization and dropout for regularization
3. **Training Strategy**: Learning rate scheduling (`ReduceLROnPlateau` on validation loss) and early stopping on validation accuracy to limit overfitting
4. **Evaluation**: Comprehensive metrics including per-class analysis and confusion matrices

### Future Improvements
1. **Data Augmentation**: Could implement time-series specific augmentation techniques
2. **Ensemble Methods**: Combine multiple models for better performance
3. **Attention Mechanisms**: Add attention layers to focus on important signal regions
4. **Transfer Learning**: Pre-train on larger ECG datasets if available
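
As an illustration of the data augmentation idea above, the sketch below applies three simple time-series transformations to a single signal: amplitude scaling, a small time shift, and additive Gaussian noise. It is a starting point under stated assumptions, not a validated recipe. `augment_signal`, its parameter defaults, and the sine-wave stand-in for a heartbeat are all hypothetical, and whether such perturbations keep a beat's label physiologically valid would need to be checked for this dataset.

```python
import numpy as np

def augment_signal(signal, rng, noise_std=0.01, scale_range=(0.9, 1.1), max_shift=5):
    """Return a randomly scaled, shifted, and noise-perturbed copy of a 1D signal."""
    signal = np.asarray(signal, dtype=float)
    scale = rng.uniform(*scale_range)
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(signal, shift)
    # np.roll wraps around, so zero out the samples that wrapped
    if shift > 0:
        shifted[:shift] = 0.0
    elif shift < 0:
        shifted[shift:] = 0.0
    noise = rng.normal(0.0, noise_std, size=signal.shape)
    return scale * shifted + noise

rng = np.random.default_rng(42)
demo_beat = np.sin(np.linspace(0, 2 * np.pi, 187))  # hypothetical stand-in for one heartbeat
augmented_beat = augment_signal(demo_beat, rng)
print(augmented_beat.shape)
```

Augmentation of this kind would be applied to training samples only, for example inside a custom `Dataset.__getitem__`, so that validation and test data stay untouched.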

This CNN-based pipeline covers automated arrhythmia classification from ECG signals end to end, from data loading to per-class evaluation, and can serve as a starting point for work on clinical decision support systems.