# Deep Learning Approach for PPG Arrhythmia Detection

This notebook trains a 1D CNN end to end on PPG (photoplethysmography) segments, using only minimal preprocessing: bandpass filtering and z-score normalization.

**Signal Characteristics:**
- Sampling rate: 100 Hz
- Segment duration: 10 seconds
- Samples per segment: 1,000 values
- Task: Binary classification (Healthy vs Arrhythmic)
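
The segment length follows directly from the sampling rate and duration. A quick check, with variable names chosen here purely for illustration:

```python
# Illustrative names; the values come from the list above.
sampling_rate_hz = 100
segment_duration_s = 10

samples_per_segment = sampling_rate_hz * segment_duration_s
nyquist_hz = sampling_rate_hz / 2  # highest frequency representable at this rate

print(samples_per_segment)  # 1000
print(nyquist_hz)           # 50.0
```

The 50 Hz Nyquist limit leaves ample headroom above the 8 Hz upper cutoff used later in preprocessing.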

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

from deep_learning_pipeline import (
    PPGPreprocessor,
    DataAugmentation,
    build_1d_cnn_model,
    create_callbacks,
    plot_training_history,
    plot_confusion_matrix,
    plot_roc_curve
)

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

print('Libraries imported successfully!')

## 1. Load Data

In [None]:
# Set the base data directory
data_dir = 'data'

# Load training data
X_train = np.load(os.path.join(data_dir, 'train', 'train_segments.npy'))
y_train = np.load(os.path.join(data_dir, 'train', 'train_labels.npy'))
train_metadata = pd.read_csv(os.path.join(data_dir, 'train', 'train_metadata.csv'))

# Load test data
X_test = np.load(os.path.join(data_dir, 'test', 'test_segments.npy'))
y_test = np.load(os.path.join(data_dir, 'test', 'test_labels.npy'))
test_metadata = pd.read_csv(os.path.join(data_dir, 'test', 'test_metadata.csv'))

# Print dataset information
print('Dataset Information:')
print('=' * 50)
print(f'Training data shape: {X_train.shape}')
print(f'Training labels shape: {y_train.shape}')
print(f'Test data shape: {X_test.shape}')
print(f'Test labels shape: {y_test.shape}')
print(f'\nClass distribution in training set:')
print(pd.Series(y_train).value_counts(normalize=True))
print(f'\nClass distribution in test set:')
print(pd.Series(y_test).value_counts(normalize=True))

## 2. Visualize Raw Signals

In [None]:
# Visualize sample signals from both classes
fig, axes = plt.subplots(2, 2, figsize=(15, 8))

# Healthy samples
healthy_idx = np.where(y_train == 0)[0]
for i in range(2):
    sample_idx = healthy_idx[i]
    axes[0, i].plot(X_train[sample_idx], linewidth=0.8)
    axes[0, i].set_title(f'Healthy Sample {i+1}', fontsize=12, fontweight='bold')
    axes[0, i].set_xlabel('Time Steps')
    axes[0, i].set_ylabel('Amplitude')
    axes[0, i].grid(True, alpha=0.3)

# Arrhythmic samples
arrhythmic_idx = np.where(y_train == 1)[0]
for i in range(2):
    sample_idx = arrhythmic_idx[i]
    axes[1, i].plot(X_train[sample_idx], linewidth=0.8, color='red')
    axes[1, i].set_title(f'Arrhythmic Sample {i+1}', fontsize=12, fontweight='bold')
    axes[1, i].set_xlabel('Time Steps')
    axes[1, i].set_ylabel('Amplitude')
    axes[1, i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Preprocess Data

Minimal preprocessing:
1. Bandpass filter (0.5-8 Hz) - attenuates baseline drift and high-frequency noise outside the physiological range
2. Z-score normalization - puts signals on a consistent amplitude scale
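
The internals of `PPGPreprocessor` are not shown in this notebook. The two steps above could be sketched roughly as follows, assuming a 4th-order zero-phase Butterworth filter and per-segment normalization. The function names, the filter order, and the `eps` guard are all assumptions made for illustration, not the pipeline's actual implementation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass_filter(signal, fs=100.0, lowcut=0.5, highcut=8.0, order=4):
    """Zero-phase Butterworth bandpass (the filter order is an assumption)."""
    nyquist = 0.5 * fs
    sos = butter(order, [lowcut / nyquist, highcut / nyquist],
                 btype="band", output="sos")
    return sosfiltfilt(sos, signal)


def zscore(signal, eps=1e-8):
    """Per-segment z-score; eps guards against flat (zero-variance) segments."""
    return (signal - np.mean(signal)) / (np.std(signal) + eps)


# Synthetic 10 s segment at 100 Hz: 1.2 Hz pulse-like tone, slow drift, 30 Hz noise
t = np.arange(1000) / 100.0
raw = np.sin(2 * np.pi * 1.2 * t) + 0.5 * t + 0.3 * np.sin(2 * np.pi * 30 * t)

clean = zscore(bandpass_filter(raw))
print(clean.shape, float(clean.mean()), float(clean.std()))
```

Second-order sections (`output="sos"`) are used here because they tend to be numerically better behaved than transfer-function coefficients when the low cutoff is a small fraction of the Nyquist frequency.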

In [None]:
# Initialize preprocessor
preprocessor = PPGPreprocessor(sampling_rate=100, lowcut=0.5, highcut=8.0)

# Preprocess all signals
print('Preprocessing training data...')
X_train_processed = preprocessor.preprocess_batch(X_train)

print('Preprocessing test data...')
X_test_processed = preprocessor.preprocess_batch(X_test)

print('Preprocessing complete!')
print(f'Preprocessed training data shape: {X_train_processed.shape}')
print(f'Preprocessed test data shape: {X_test_processed.shape}')

## 4. Compare Raw vs Preprocessed Signals

In [None]:
# Compare raw and preprocessed signals
sample_idx = 0

fig, axes = plt.subplots(2, 1, figsize=(15, 8))

# Raw signal
axes[0].plot(X_train[sample_idx], linewidth=0.8)
axes[0].set_title('Raw PPG Signal', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Time Steps')
axes[0].set_ylabel('Amplitude')
axes[0].grid(True, alpha=0.3)

# Preprocessed signal
axes[1].plot(X_train_processed[sample_idx], linewidth=0.8, color='green')
axes[1].set_title('Preprocessed PPG Signal (Bandpass + Z-score Normalized)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Time Steps')
axes[1].set_ylabel('Normalized Amplitude')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Prepare Data for CNN

Reshape data to CNN input format: (samples, timesteps, channels)

In [None]:
# Reshape for CNN input: (samples, timesteps, channels)
X_train_cnn = X_train_processed.reshape(-1, 1000, 1)
X_test_cnn = X_test_processed.reshape(-1, 1000, 1)

# Create validation split
from sklearn.model_selection import train_test_split
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train_cnn, y_train, test_size=0.15, random_state=42, stratify=y_train
)

print(f'Training set: {X_train_split.shape}')
print(f'Validation set: {X_val_split.shape}')
print(f'Test set: {X_test_cnn.shape}')
print(f'\nTraining labels distribution:')
print(pd.Series(y_train_split).value_counts(normalize=True))
print(f'\nValidation labels distribution:')
print(pd.Series(y_val_split).value_counts(normalize=True))

## 6. Build 1D CNN Model

Architecture:
- 4 Convolutional blocks with increasing filters (64 → 128 → 256 → 512)
- Batch Normalization for stable training
- Dropout for regularization
- Global Average Pooling to reduce parameters
- Dense layers for classification
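
To illustrate why global average pooling reduces parameters, here is a small NumPy sketch. The 512 channels match the last block listed above; the 62 remaining timesteps and the 128-unit dense layer are assumed values, since the actual layer sizes live inside `build_1d_cnn_model`.

```python
import numpy as np

# Hypothetical feature map from the last conv block: (batch, timesteps, channels).
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 62, 512))

# Global average pooling: average over the time axis, one value per channel.
pooled = feature_map.mean(axis=1)
print(pooled.shape)  # (8, 512)

# Parameters (weights + biases) of a following dense layer with 128 units (assumed size).
dense_units = 128
params_after_flatten = 62 * 512 * dense_units + dense_units
params_after_gap = 512 * dense_units + dense_units
print(params_after_flatten, params_after_gap)  # 4063360 65664
```

Under these assumed sizes, pooling over time shrinks the first dense layer by a factor of about 62, the number of timesteps that flattening would otherwise keep.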

In [None]:
# Build 1D CNN model
model = build_1d_cnn_model(input_shape=(1000, 1), num_classes=1)

# Display model architecture
model.summary()

## 7. Train Model

Training strategy:
- Optimizer: Adam (lr=0.001)
- Loss: Binary Crossentropy
- Callbacks: ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
- Batch size: 64
- Max epochs: 100 (with early stopping)
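
For reference, the binary cross-entropy loss listed above can be written out in NumPy. This is a sketch of the formula only; the clipping constant `eps` is an assumption, and Keras applies its own internal clipping.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))


# Confident correct predictions give a small loss; confident wrong ones a large loss.
loss_good = binary_crossentropy([0, 1, 1], [0.1, 0.9, 0.8])
loss_bad = binary_crossentropy([0, 1, 1], [0.9, 0.1, 0.2])
print(loss_good, loss_bad)
```

A prediction of exactly 0.5 for every sample yields a loss of ln 2 (about 0.693), which is a useful baseline when reading the training curves.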

In [None]:
# Create callbacks
callback_list = create_callbacks(model_path='best_arrhythmia_model.keras')

# Train model
history = model.fit(
    X_train_split, y_train_split,
    validation_data=(X_val_split, y_val_split),
    epochs=100,
    batch_size=64,
    callbacks=callback_list,
    verbose=1
)

## 8. Plot Training History

In [None]:
plot_training_history(history)

## 9. Evaluate on Test Set

In [None]:
# Load best model
from tensorflow import keras
best_model = keras.models.load_model('best_arrhythmia_model.keras')

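# Evaluate. The unpacking order assumes the model was compiled with
# metrics in the order: accuracy, AUC, precision, recall.
test_loss, test_acc, test_auc, test_precision, test_recall = best_model.evaluate(
    X_test_cnn, y_test, verbose=1
)

# Calculate F1-score, guarding against division by zero when
# precision and recall are both 0
pr_sum = test_precision + test_recall
test_f1 = 2 * test_precision * test_recall / pr_sum if pr_sum > 0 else 0.0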

print(f'\nTest Results:')
print('=' * 50)
print(f'  Loss: {test_loss:.4f}')
print(f'  Accuracy: {test_acc:.4f}')
print(f'  AUC: {test_auc:.4f}')
print(f'  Precision: {test_precision:.4f}')
print(f'  Recall: {test_recall:.4f}')
print(f'  F1-Score: {test_f1:.4f}')
print('=' * 50)

## 10. Predictions and Confusion Matrix

In [None]:
# Make predictions
y_pred_proba = best_model.predict(X_test_cnn).flatten()
y_pred = (y_pred_proba > 0.5).astype(int)

# Plot confusion matrix
plot_confusion_matrix(y_test, y_pred)
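
The numbers behind the confusion matrix can also be computed by hand. The sketch below uses toy labels and probabilities rather than the notebook's data, and the helper names are hypothetical.

```python
import numpy as np


def binary_confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (tn, fp, fn, tp) for probabilities thresholded at `threshold`."""
    y_true = np.asarray(y_true).astype(int)
    y_hat = (np.asarray(y_prob, dtype=float) > threshold).astype(int)
    tn = int(np.sum((y_true == 0) & (y_hat == 0)))
    fp = int(np.sum((y_true == 0) & (y_hat == 1)))
    fn = int(np.sum((y_true == 1) & (y_hat == 0)))
    tp = int(np.sum((y_true == 1) & (y_hat == 1)))
    return tn, fp, fn, tp


def precision_recall_f1(tn, fp, fn, tp):
    """Precision, recall, and F1 for the positive class; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Toy example: 0 = Healthy, 1 = Arrhythmic
toy_true = [0, 0, 1, 1, 1, 0]
toy_prob = [0.2, 0.7, 0.9, 0.4, 0.8, 0.1]

counts = binary_confusion_counts(toy_true, toy_prob)
precision, recall, f1 = precision_recall_f1(*counts)
print(counts)  # (2, 1, 1, 2)
print(precision, recall, f1)
```

Lowering `threshold` trades false negatives for false positives, which matters when missing an arrhythmic segment is costlier than a false alarm.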

## 11. ROC Curve

In [None]:
plot_roc_curve(y_test, y_pred_proba)
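
The area under the ROC curve can be read as the probability that a randomly chosen arrhythmic segment receives a higher score than a randomly chosen healthy one, with ties counted as one half. Here is a minimal sketch of that pairwise definition on toy data. The function name is hypothetical, and this is not how `plot_roc_curve` is necessarily implemented.

```python
import numpy as np


def roc_auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true]
    neg = y_score[~y_true]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one sample from each class")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))


toy_true = [0, 0, 1, 1, 1, 0]
toy_prob = [0.2, 0.7, 0.9, 0.4, 0.8, 0.1]
auc = roc_auc_pairwise(toy_true, toy_prob)
print(auc)  # 8 of 9 pairs are ranked correctly
```

The pairwise comparison uses memory proportional to the number of positive-negative pairs, so it suits small evaluation sets. For large ones, a rank-based routine such as scikit-learn's `roc_auc_score` is the more practical choice.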

## 12. Per-Class Performance Analysis

In [None]:
from sklearn.metrics import classification_report, f1_score

# Detailed classification report
print('\nDetailed Classification Report:')
print('=' * 60)
print(classification_report(y_test, y_pred, target_names=['Healthy', 'Arrhythmic']))

# Per-class F1 scores
f1_healthy = f1_score(y_test, y_pred, pos_label=0)
f1_arrhythmic = f1_score(y_test, y_pred, pos_label=1)

print(f'\nPer-Class F1 Scores:')
print(f'  Healthy: {f1_healthy:.4f}')
print(f'  Arrhythmic: {f1_arrhythmic:.4f}')

## 13. Save Complete Pipeline for Deployment

In [None]:
import joblib

# Save preprocessor and model together
pipeline = {
    'preprocessor': preprocessor,
    'model_path': 'best_arrhythmia_model.keras',
    'sampling_rate': 100,
    'signal_length': 1000,
    'class_names': ['Healthy', 'Arrhythmic'],
    'performance': {
        'accuracy': test_acc,
        'auc': test_auc,
        'precision': test_precision,
        'recall': test_recall,
        'f1_score': test_f1
    }
}

joblib.dump(pipeline, 'arrhythmia_detection_pipeline.pkl')
print('Pipeline saved successfully!')
print(f'\nSaved files:')
print(f'  - best_arrhythmia_model.keras (trained model)')
print(f'  - arrhythmia_detection_pipeline.pkl (complete pipeline)')

## 14. Test Pipeline on Random Samples

In [None]:
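# Load pipeline (imports repeated so this cell also runs in a fresh session)
import joblib
from tensorflow import keras

pipeline = joblib.load('arrhythmia_detection_pipeline.pkl')
model = keras.models.load_model(pipeline['model_path'])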

# Test on 5 random samples
num_samples = 5
random_indices = np.random.choice(len(X_test), num_samples, replace=False)

fig, axes = plt.subplots(num_samples, 1, figsize=(15, 3*num_samples))

for i, sample_idx in enumerate(random_indices):
    # Get sample
    sample_signal = X_test[sample_idx]
    
    # Preprocess
    preprocessed = pipeline['preprocessor'].preprocess(sample_signal)
    preprocessed = preprocessed.reshape(1, 1000, 1)
    
    # Predict
    prediction_proba = model.predict(preprocessed, verbose=0)[0][0]
    prediction = int(prediction_proba > 0.5)
    
    # Get true label (cast to int so it can index class_names)
    true_label = int(y_test[sample_idx])
    
    # Plot
    ax = axes[i] if num_samples > 1 else axes
    ax.plot(sample_signal, linewidth=0.8)
    
    # Color based on correctness
    color = 'green' if prediction == true_label else 'red'
    correctness = '✓' if prediction == true_label else '✗'
    
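    # prediction_proba is the model's probability for the Arrhythmic class,
    # regardless of which class was predicted
    ax.set_title(
        f'{correctness} Sample {sample_idx} | '
        f'True: {pipeline["class_names"][true_label]} | '
        f'Predicted: {pipeline["class_names"][prediction]} '
        f'(P(Arrhythmic) = {prediction_proba:.2%})',
        fontsize=12, fontweight='bold', color=color
    )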
    ax.set_xlabel('Time Steps')
    ax.set_ylabel('Amplitude')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
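
The per-sample steps in the loop above (preprocess, reshape, predict, threshold) could be wrapped in a small helper for reuse. The sketch below is hypothetical: `classify_segment` is not part of `deep_learning_pipeline`, and the two `demo_*` functions are stand-ins so the example runs without the trained model.

```python
import numpy as np


def classify_segment(signal, preprocess_fn, predict_fn, class_names, threshold=0.5):
    """Preprocess one raw segment, run the model, return (label, P(positive class))."""
    x = np.asarray(preprocess_fn(signal), dtype=np.float32).reshape(1, -1, 1)
    prob = float(np.asarray(predict_fn(x)).reshape(-1)[0])
    return class_names[int(prob > threshold)], prob


# Stand-ins for the real preprocessor and model (assumptions for this demo only).
def demo_preprocess(signal):
    return (signal - signal.mean()) / (signal.std() + 1e-8)


def demo_predict(batch):
    # Always returns probability 0.8, shaped like a sigmoid output of size (1, 1)
    return np.array([[0.8]])


demo_signal = np.sin(np.linspace(0, 20 * np.pi, 1000))
label, prob = classify_segment(
    demo_signal, demo_preprocess, demo_predict, ['Healthy', 'Arrhythmic']
)
print(label, prob)
```

With the objects loaded in this notebook, the call would look like `classify_segment(X_test[0], pipeline['preprocessor'].preprocess, lambda x: model.predict(x, verbose=0), pipeline['class_names'])`.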

## Summary

This notebook demonstrated:
1. ✅ Minimal preprocessing (bandpass filter + normalization)
2. ✅ 1D CNN architecture for automatic feature learning
3. ✅ End-to-end training with callbacks and monitoring
4. ✅ Comprehensive evaluation (accuracy, AUC, F1, confusion matrix, ROC)
5. ✅ Deployment-ready pipeline

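**Advantages over feature engineering approach:**
- Simpler pipeline: 2 preprocessing steps instead of 43 hand-engineered features
- Features are learned automatically from the filtered signal
- Potential for better generalization, since the learned features are not limited to a fixed, hand-picked set
- Feature extraction and classification are optimized jointly, end to end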

**Next steps:**
- Experiment with data augmentation (`DataAugmentation` is imported above but not used in this notebook)
- Try other architectures (ResNet-style 1D CNNs, LSTM or CNN-LSTM hybrids)
- Ensemble multiple models
- Deploy to production