# Parkinson's Disease Detection Using Machine Learning on Gait Sensor Data

## Project Overview

This project develops a machine learning pipeline to detect Parkinson's Disease from gait (walking) patterns captured by wearable force sensors. Early detection of Parkinson's is critical for effective treatment, and gait analysis offers a non-invasive screening method.

---

## Key Highlights

- **Dataset**: PhysioNet Gait in Parkinson's Disease (165 subjects, 306 walking trials)
- **Features Engineered**: 38 biomechanical features from temporal, force, asymmetry, and frequency domains
- **Best Model**: Logistic Regression with **87% accuracy**
- **Clinical Optimization**: Random Forest with **100% recall** for screening applications
- **Tech Stack**: Python, scikit-learn, pandas, scipy, matplotlib

---

## Motivation

Parkinson's Disease (PD) affects over 10 million people worldwide. Gait abnormalities such as shuffling steps, reduced stride length, and irregular timing can be early indicators of PD. This project uses machine learning to detect these patterns automatically from wearable sensor data, enabling:

- **Early screening** in clinical settings
- **Non-invasive assessment** (no complex lab equipment)
- **Scalable monitoring** using low-cost wearable devices

---

## Table of Contents

1. [Dataset Description](#dataset)
2. [Feature Engineering](#features)
3. [Exploratory Data Analysis](#eda)
4. [Model Development](#models)
5. [Results & Evaluation](#results)
6. [Key Findings](#findings)
7. [Future Work](#future)
8. [How to Use This Project](#usage)

## 1. Dataset Description {#dataset}

### Source
**PhysioNet Gait in Parkinson's Disease Database**
- 93 Parkinson's patients
- 72 healthy control subjects
- Multiple walking trials per subject
- Total: 306 walking trials analyzed

### Data Collection
- **Sensors**: 8 force-sensitive resistors under each foot (16 total)
- **Sampling Rate**: 100 Hz
- **Measurement**: Vertical ground reaction force (in Newtons)
- **Walking Tasks**: Normal walking, walking with turns, activities of daily living

### Data Structure
Each trial contains:
- Time-series force measurements from 16 sensors
- Subject metadata (PD patient vs. control)
- Disease severity scores (for PD patients)

### Class Distribution
- **Parkinson's Disease**: 214 trials (69.9%)
- **Healthy Controls**: 92 trials (30.1%)

> **Note**: Class imbalance was addressed using class weighting in model training.
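
For reference, scikit-learn's `class_weight='balanced'` setting weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that arithmetic to the trial counts above; the helper name `balanced_class_weights` is illustrative and not taken from the project code.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Trial counts from the class distribution above.
weights = balanced_class_weights({"PD": 214, "Control": 92})
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these counts, each control trial carries roughly 2.3 times the weight of a PD trial, so both classes contribute equally to the training loss.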

## 2. Feature Engineering {#features}

Feature engineering is the **core** of this project. We extracted 38 features from raw time-series sensor data, organized into 4 categories:

---

### Batch 1: Temporal Features (10 features)
**Captures timing and rhythm of walking**

| Feature | Description | Clinical Relevance |
|---------|-------------|-------------------|
| Stride Time | Time between consecutive heel strikes of the same foot | PD patients have irregular timing |
| Step Time | Time between heel strikes of opposite feet | Measures gait rhythm |
| Cadence | Steps per minute | PD often shows higher cadence (shuffling) |
| **Stride Time CV** ⭐ | Coefficient of variation of stride time (std / mean) | **Top-ranked feature in this project** - high variability in PD |
| Double Support Time | Both feet on ground | Longer in PD (balance issues) |

**Why important**: Stride time variability ranked as the most predictive feature in this project's feature-importance analysis.
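
As a rough illustration of how these timing features relate to one another, the sketch below derives stride time, step time, cadence, and stride time CV from heel-strike timestamps. The function name, the example timestamps, and the assumption that the feet strictly alternate are all hypothetical; the project's own implementation may differ.

```python
import numpy as np

def temporal_features(left_strikes, right_strikes):
    """Compute basic timing features from heel-strike times (seconds)."""
    left = np.asarray(left_strikes, dtype=float)
    right = np.asarray(right_strikes, dtype=float)

    # Stride time: interval between consecutive heel strikes of the same foot.
    stride_times = np.concatenate([np.diff(left), np.diff(right)])

    # Step time: interval between successive heel strikes of either foot
    # (ASSUMPTION: left and right strikes strictly alternate).
    all_strikes = np.sort(np.concatenate([left, right]))
    step_times = np.diff(all_strikes)

    return {
        "stride_time_mean": stride_times.mean(),
        # Coefficient of variation in percent: std / mean * 100.
        "stride_time_cv": stride_times.std(ddof=1) / stride_times.mean() * 100,
        "step_time_mean": step_times.mean(),
        # Cadence: steps per minute.
        "cadence": 60.0 / step_times.mean(),
    }

# Hypothetical heel-strike times for a short, fairly regular walk.
left = [0.00, 1.10, 2.18, 3.30, 4.38]
right = [0.55, 1.64, 2.75, 3.83]
features = temporal_features(left, right)
print(features)
```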

---

### Batch 2: Force/Spatial Features (10 features)
**Captures force magnitude and distribution**

| Feature | Description | Clinical Relevance |
|---------|-------------|-------------------|
| Peak Force | Maximum force during stance | Lower in PD (reduced step length) |
| Mean Force | Average force during stance | Overall force application pattern |
| Heel/Toe/Midfoot Ratio | Force distribution across foot | PD shows altered foot strike patterns |
| Force Range | Dynamic range of force | Reduced in PD (less dynamic walking) |

**Why important**: PD patients show characteristic force distribution changes due to shuffling gait.
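
A minimal sketch of how such force features could be computed for one foot is shown below. The sensor-to-region mapping in `REGIONS`, the 20 N stance threshold, and the function name are placeholders chosen for illustration; the actual insole layout and thresholds are not given in this document.

```python
import numpy as np

# ASSUMPTION: column indices for each foot region are placeholders;
# the real mapping depends on the insole's sensor layout.
REGIONS = {"heel": [0, 1], "midfoot": [2, 3, 4], "toe": [5, 6, 7]}

def force_features(foot_forces, stance_threshold=20.0):
    """Summarise one foot's forces; foot_forces has shape (n_samples, 8), in N."""
    forces = np.asarray(foot_forces, dtype=float)
    total = forces.sum(axis=1)
    stance = total > stance_threshold  # samples where the foot bears load
    stance_total = total[stance]

    features = {
        "peak_force": stance_total.max(),
        "mean_force": stance_total.mean(),
        "force_range": stance_total.max() - stance_total.min(),
    }
    # Share of the stance-phase load carried by each foot region.
    load = forces[stance].sum()
    for name, columns in REGIONS.items():
        features[f"{name}_ratio"] = forces[stance][:, columns].sum() / load
    return features

# Synthetic example: random non-negative loads on 8 sensors for 3 seconds.
rng = np.random.default_rng(0)
demo = rng.uniform(0.0, 100.0, size=(300, 8))
features = force_features(demo)
print(features)
```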

---

### Batch 3: Asymmetry & Statistical Features (10 features)
**Captures left-right differences and signal characteristics**

| Feature | Description | Clinical Relevance |
|---------|-------------|-------------------|
| Force Asymmetry Index | Left vs. right force imbalance | PD often unilateral (affects one side) |
| Temporal Asymmetry | Stride time differences | Hemiparkinsonian patterns |
| Signal RMS | Root mean square energy | Signal power characteristics |
| Skewness/Kurtosis | Distribution shape | Abnormal force patterns in PD |

**Why important**: Parkinson's often presents asymmetrically, affecting one side of the body more than the other.
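
The sketch below shows one common way to express these quantities. The symmetry-index formula (absolute difference divided by the left-right mean) is an assumed definition, since the document does not state which variant was used, and the input values are made up.

```python
import numpy as np

def asymmetry_index(left_value, right_value):
    """Symmetry index in percent: 0 means identical left and right values."""
    mean_value = 0.5 * (left_value + right_value)
    return abs(left_value - right_value) / mean_value * 100.0

def signal_statistics(signal):
    """RMS, skewness, and excess kurtosis of a 1-D signal."""
    x = np.asarray(signal, dtype=float)
    centered = x - x.mean()
    std = centered.std()  # population standard deviation
    return {
        "rms": np.sqrt(np.mean(x ** 2)),
        "skewness": np.mean(centered ** 3) / std ** 3,
        "kurtosis": np.mean(centered ** 4) / std ** 4 - 3.0,
    }

# Hypothetical mean peak forces (N) for the left and right foot.
force_asymmetry = asymmetry_index(620.0, 560.0)
stats = signal_statistics([1.0, 2.0, 3.0, 4.0, 5.0])
print(force_asymmetry, stats)
```

The same index can be applied to left and right stride times to obtain a temporal asymmetry measure.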

---

### Batch 4: Frequency-Domain Features (8 features)
**Captures rhythm regularity using Fourier analysis**

| Feature | Description | Clinical Relevance |
|---------|-------------|-------------------|
| Dominant Frequency | Main walking frequency (Hz) | PD shows shifted frequency |
| **Spectral Entropy** ⭐ | Regularity of rhythm | High = chaotic (PD), Low = regular (healthy) |
| Harmonic Ratio | Stride/step frequency ratio | Gait smoothness measure |
| Spectral Centroid | Center of frequency spectrum | Pattern shifts in PD |

**Why important**: Healthy gait has rhythmic, predictable frequency patterns. PD gait shows irregular, high-entropy spectra.
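
To make the entropy idea concrete, the sketch below computes dominant frequency, spectral centroid, and a normalised spectral entropy with NumPy's FFT. The normalisation (dividing by the log of the number of frequency bins) and the synthetic signals are assumptions for illustration, not the project's exact definitions.

```python
import numpy as np

def spectral_features(signal, fs=100.0):
    """Dominant frequency, spectral centroid, and normalised spectral entropy."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                        # remove the DC offset
    power = np.abs(np.fft.rfft(x)) ** 2     # one-sided power spectrum
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)

    p = power / power.sum()                 # treat the spectrum as a distribution
    nonzero = p[p > 0]
    entropy = -np.sum(nonzero * np.log2(nonzero)) / np.log2(p.size)

    return {
        "dominant_frequency": freqs[np.argmax(power)],
        "spectral_centroid": np.sum(freqs * p),
        "spectral_entropy": entropy,        # near 0 = single tone, 1 = flat spectrum
    }

# Synthetic check: a clean 1 Hz rhythm versus the same rhythm with added noise.
t = np.arange(3000) / 100.0
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 1.0 * t)
regular = spectral_features(clean)
noisy = spectral_features(clean + rng.normal(0, 1.0, t.size))
print(regular["spectral_entropy"], noisy["spectral_entropy"])
```

The noisy signal keeps the same dominant frequency but yields a much higher entropy, which is the contrast the table describes.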

---

### Feature Extraction Pipeline
```text
Raw Sensor Data (100 Hz time-series)
    ↓
1. Detect Gait Events (heel strikes via peak detection)
    ↓
2. Segment into Individual Gait Cycles
    ↓
3. Extract Features for Each Cycle:
   - Temporal: timing calculations
   - Force: magnitude & distribution
   - Asymmetry: left-right comparisons
   - Frequency: FFT → spectral features
    ↓
Feature Matrix (306 walks × 38 features)
    ↓
Machine Learning Models
```
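
Steps 1 and 2 of the pipeline can be sketched as follows. This is a NumPy-only stand-in written for illustration: the project may well rely on `scipy.signal.find_peaks` (which accepts `height` and `distance` arguments) instead, and the thresholds and synthetic force trace here are assumptions.

```python
import numpy as np

def detect_peaks(signal, min_height, min_distance):
    """Return indices of local maxima above min_height, at least min_distance apart."""
    x = np.asarray(signal, dtype=float)
    # Candidate peaks: strictly greater than the previous sample, >= the next.
    candidates = np.flatnonzero(
        (x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:]) & (x[1:-1] >= min_height)
    ) + 1
    peaks = []
    # Keep the tallest candidates first, dropping any that crowd a kept peak.
    for idx in candidates[np.argsort(x[candidates])[::-1]]:
        if all(abs(idx - kept) >= min_distance for kept in peaks):
            peaks.append(idx)
    return np.array(sorted(peaks), dtype=int)

def segment_cycles(signal, peaks):
    """Split the signal into cycles running from one detected peak to the next."""
    return [signal[start:end] for start, end in zip(peaks[:-1], peaks[1:])]

# Synthetic total-force trace at 100 Hz with one load peak per 1.1 s stride.
fs = 100
t = np.arange(10 * fs) / fs
force = 400 + 300 * np.sin(2 * np.pi * t / 1.1)
peaks = detect_peaks(force, min_height=600, min_distance=int(0.5 * fs))
cycles = segment_cycles(force, peaks)
stride_times = np.diff(peaks) / fs
print(len(peaks), stride_times)
```

Per-cycle values such as `stride_times` would then be aggregated (mean, CV, and so on) to give one feature row per walk.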

> **Key Insight**: Domain-driven feature engineering (based on clinical gait analysis literature) outperformed raw signal inputs.

## 3. Exploratory Data Analysis {#eda}

### Gait Pattern Visualization: Control vs. Parkinson's

![Gait Comparison](../results/figures/gait_comparison.png)

**Key Observations:**
- **Control subjects** (top): Regular, consistent peaks with smooth transitions
- **Parkinson's patients** (bottom): Irregular peaks, variable amplitudes, more noise

---

### What the Peaks Mean

Each peak represents a **heel strike** (foot hitting the ground). 

**Healthy Gait:**
- ✅ Regular spacing between peaks (consistent stride time)
- ✅ Similar peak heights (consistent force application)
- ✅ Clear left-right alternation

**Parkinson's Gait:**
- ⚠️ Irregular spacing (stride time variability)
- ⚠️ Variable peak heights (inconsistent force)
- ⚠️ Possible overlapping (both feet on ground longer = double support)
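
The overlap mentioned in the last point can be quantified as the fraction of samples in which both feet are loaded. The sketch below assumes each foot's total force is available as a 1-D array and uses a hypothetical 20 N contact threshold.

```python
import numpy as np

def double_support_fraction(left_total, right_total, contact_threshold=20.0):
    """Fraction of samples in which both feet bear load above the threshold."""
    left_on = np.asarray(left_total, dtype=float) > contact_threshold
    right_on = np.asarray(right_total, dtype=float) > contact_threshold
    return float(np.mean(left_on & right_on))

# Synthetic alternating stance at 100 Hz: each foot is loaded for 60 of every
# 100 samples, with the right foot offset by half a cycle.
fs = 100
sample = np.arange(5 * fs) % fs
left = np.where(sample < 60, 500.0, 0.0)
right = np.where((sample + 50) % fs < 60, 500.0, 0.0)
fraction = double_support_fraction(left, right)
print(f"double support: {fraction:.0%} of the gait cycle")
```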

---

### Top 15 Most Important Features

![Feature Importance](../results/figures/feature_importance.png)

**Key Predictors:**
1. **Stride time variability** - Irregularity in walking rhythm
2. **Double support time** - Time with both feet on ground
3. **Spectral entropy** - Frequency domain regularity
4. **Force asymmetry** - Left-right imbalance
5. **Cadence** - Steps per minute

> These features align with clinical PD research, validating our feature engineering approach.

## 4. Model Development {#models}

### Models Evaluated

We systematically compared 4 machine learning algorithms:

| Model | Purpose | Hyperparameters |
|-------|---------|-----------------|
| **Logistic Regression** | Baseline, interpretable | C=1.0, class_weight='balanced' |
| **Random Forest** | Non-linear patterns | n_estimators=300, max_depth=8 |
| **SVM (RBF)** | Complex decision boundaries | C=1.0, gamma='scale' |
| **Gradient Boosting** | Ensemble method | n_estimators=200, learning_rate=0.05 |

---

### Training Methodology
```text
1. Data Split:
   - Training: 70% (214 samples)
   - Validation: 15% (46 samples)
   - Test: 15% (46 samples)
   - Stratified split to maintain class balance

2. Feature Scaling:
   - StandardScaler (mean=0, std=1)
   - Fitted on training data only

3. Hyperparameter Tuning:
   - GridSearchCV with 5-fold cross-validation
   - Scoring metric: Accuracy
   - Tested 54-162 parameter combinations per model

4. Class Imbalance Handling:
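   - class_weight='balanced' for models that accept it (Logistic Regression, Random Forest, SVM)
   - scikit-learn's GradientBoostingClassifier has no class_weight parameter; it needs sample_weight in fit() instead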
   - Adjusts for 70% PD / 30% Control distribution
```
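
The methodology above could be wired together roughly as follows. This is a sketch under stated assumptions: the feature matrix is synthetic random data standing in for the real 306 × 38 matrix, the parameter grid is illustrative, and only the Logistic Regression model is shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 306 x 38 feature matrix (NOT the project's data).
rng = np.random.default_rng(42)
y = np.array([1] * 214 + [0] * 92)            # 1 = PD, 0 = control
X = rng.normal(size=(306, 38)) + 0.5 * y[:, None]

# 70 / 15 / 15 stratified split: carve off 30%, then halve it.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=42)

# Putting the scaler inside the pipeline means each CV fold fits it
# on that fold's training portion only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_val, y_val))
```

This document does not say whether trials from the same subject were kept in a single partition; if they were not, a group-aware splitter such as scikit-learn's `GroupShuffleSplit` is one way to prevent a subject's trials from appearing in both training and test data.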

---

### Why Multiple Models?

Different algorithms have different strengths:

- **Logistic Regression**: Best for linearly separable data, highly interpretable
- **Random Forest**: Captures non-linear relationships, handles feature interactions
- **SVM**: Effective in high-dimensional spaces
- **Gradient Boosting**: Builds trees sequentially, each correcting the errors of the previous ones; often a strong performer on tabular data

**Our finding**: The simplest model (Logistic Regression) reached the highest test accuracy, most likely due to:
- Small dataset size (306 samples)
- Strong linear relationships in features
- Risk of overfitting with complex models

## 5. Results & Evaluation {#results}

### Model Performance Comparison

![Model Comparison](../results/figures/model_comparison.png)

| Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|-------|----------|-----------|--------|----------|---------|
| **Logistic Regression** ⭐ | **86.96%** | **93.33%** | 87.50% | 90.32% | 0.89 |
| Random Forest | 84.78% | 82.05% | **100.00%** | 90.14% | 0.92 |
| SVM (RBF) | 84.78% | 85.71% | 93.75% | 89.55% | 0.93 |
| Gradient Boosting | 84.78% | 82.05% | **100.00%** | 90.14% | **0.94** |

---

### Confusion Matrices

![Confusion Matrices](../results/figures/confusion_matrices.png)

**Logistic Regression:**
- True Negatives: 12 | False Positives: 2
- False Negatives: 4 | True Positives: 28

**Random Forest:**
- True Negatives: 7 | False Positives: 7
- False Negatives: 0 | True Positives: 32
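
The table metrics follow directly from confusion-matrix counts. As a worked check, the sketch below recomputes the Logistic Regression row from its four counts; the helper name `classification_metrics` is illustrative.

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive the headline metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Logistic Regression counts on the 46-trial test set.
lr = classification_metrics(tn=12, fp=2, fn=4, tp=28)
for name, value in lr.items():
    print(f"{name}: {value:.2%}")
```

This prints 86.96% accuracy, 93.33% precision, 87.50% recall, and 90.32% F1, matching the Logistic Regression row of the comparison table.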

---

### ROC Curves

![ROC Curves](../results/figures/roc_curves.png)

**AUC-ROC Interpretation:**
- AUC values of 0.89 to 0.94 indicate **excellent discrimination** between classes
- All models score well above a random classifier (0.50)
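
One way to read an AUC value is as the probability that a randomly chosen PD trial receives a higher score than a randomly chosen control trial. The sketch below computes AUC from that definition; the labels and scores are invented for illustration.

```python
import numpy as np

def auc_roc(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    # Compare every positive with every negative; ties count as half a win.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Hypothetical predicted PD probabilities for six trials (1 = PD, 0 = control).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auc_roc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

Here 8 of the 9 PD-control pairs are ranked correctly, giving an AUC of about 0.889.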

---

### Model Selection: Precision vs. Recall Trade-off

**For Clinical Screening Applications:**

**Option 1: Logistic Regression (High Precision)**
- **Use when**: False positives are costly (unnecessary follow-up tests)
- **87% accuracy, 93% precision**
- Fewer false alarms, but misses some PD cases

**Option 2: Random Forest (High Recall)**
- **Use when**: False negatives are critical (missing PD diagnosis)
- **100% recall** - flagged all 32 Parkinson's trials in the test set
- More false positives, but no missed PD trials in the test set

> **Our recommendation**: Random Forest for initial screening (catch all cases), followed by clinical confirmation to filter false positives.
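
The same trade-off can also be explored within a single model by moving its decision threshold. The sketch below uses made-up scores to show how lowering the threshold raises recall at the cost of precision; it is an illustration of the principle, not an analysis performed in this project.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting PD for scores >= threshold."""
    y = np.asarray(y_true)
    predicted = np.asarray(scores, dtype=float) >= threshold
    tp = np.sum(predicted & (y == 1))
    fp = np.sum(predicted & (y == 0))
    fn = np.sum(~predicted & (y == 1))
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return precision, tp / (tp + fn)

# Hypothetical predicted PD probabilities (1 = PD, 0 = control).
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.55, 0.35, 0.60, 0.30, 0.10]
for threshold in (0.5, 0.3):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With these scores, dropping the threshold from 0.5 to 0.3 lifts recall from 0.80 to 1.00 while precision falls from 0.80 to about 0.71.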

## 6. Key Findings {#findings}

### 1. Feature Engineering is Critical ⭐⭐⭐

**Domain-driven feature engineering outperformed raw signals:**
- 38 carefully designed features based on clinical gait research
- Top features aligned with known PD biomarkers
- Feature importance analysis validated medical literature

**Key Takeaway**: Understanding the problem domain (biomechanics, clinical PD symptoms) was essential for effective feature design.

---

### 2. Simpler Models Can Outperform Complex Ones

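**Logistic Regression achieved the highest accuracy (86.96%) despite being the simplest model**, although its margin over the other three models (84.78%) is a single test trial (40 vs. 39 of 46 correct).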

**Why?**
- Small dataset (306 samples)
- High-dimensional feature space (38 features)
- Strong linear relationships in engineered features
- Complex models (RF, GB) showed overfitting

**Lesson**: Always start with simple baselines. Complexity doesn't guarantee better performance.

---

### 3. Medical AI Requires Different Metrics

**Accuracy alone is insufficient for healthcare applications.**

In medical screening:
- **High Recall > High Precision** (don't miss diseases)
- **False Negatives are costlier** than False Positives
- Need to optimize for clinical workflow

**Example**: Random Forest's 100% recall is more valuable for screening than Logistic Regression's 87.5% recall, despite lower overall accuracy.

---

### 4. Top Predictive Features Match Clinical Research

**Most Important Features:**
1. **Stride Time Variability** - Known PD biomarker in literature
2. **Double Support Time** - Reflects balance/confidence issues
3. **Spectral Entropy** - Captures gait rhythm irregularity
4. **Force Asymmetry** - Hemiparkinsonian presentations

**Validation**: Our data-driven feature ranking aligns with decades of clinical gait research.

---

### 5. Class Imbalance Must Be Addressed

**Dataset had 70% PD, 30% Control imbalance.**

**Solutions applied:**
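- `class_weight='balanced'` in the models that accept it (scikit-learn's `GradientBoostingClassifier` has no `class_weight` parameter and needs `sample_weight` instead)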
- Stratified splits (maintain balance in train/val/test)
- Evaluation focused on balanced metrics (precision, recall, F1)

**Impact**: Without balancing, a model could reach about 70% accuracy by simply predicting "PD" for every trial.
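
The arithmetic behind that baseline is short enough to spell out. The sketch below uses the trial counts from the class distribution and compares plain accuracy with balanced accuracy (the mean of per-class recall) for an always-"PD" classifier.

```python
# Trial counts from the class distribution in this dataset.
n_pd, n_control = 214, 92
n_total = n_pd + n_control

# A classifier that labels every trial "PD":
accuracy = n_pd / n_total                  # all PD trials right, all controls wrong
recall_pd = 1.0                            # every PD trial is flagged
recall_control = 0.0                       # no control is ever recognised
balanced_accuracy = (recall_pd + recall_control) / 2

print(f"accuracy={accuracy:.1%}, balanced accuracy={balanced_accuracy:.1%}")
```

Plain accuracy comes out at 69.9% while balanced accuracy is 50.0%, which is why accuracy alone can flatter a trivial model on this dataset.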

---

### Limitations & Considerations

1. **Small Test Set**: 46 samples - results sensitive to individual cases
2. **Dataset Scope**: Lab-based walking, not real-world conditions
3. **Sensor Placement**: Requires proper foot sensor positioning
4. **External Validation**: Needs testing on independent datasets
5. **Generalization**: Performance on different PD subtypes unknown
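
To put the first limitation in numbers, the sketch below computes a Wilson score interval for the Logistic Regression test accuracy (40 of 46 trials correct). The choice of the Wilson interval is an assumption made here for illustration, and it treats trials as independent, which may not hold if one subject contributes several trials.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width

# Logistic Regression: 40 of 46 test trials classified correctly.
low, high = wilson_interval(40, 46)
print(f"accuracy 95% interval: {low:.1%} to {high:.1%}")
```

The interval spans roughly 74% to 94%, and each individual test trial moves accuracy by about 2.2 percentage points (1/46).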

---

### Clinical Implications

**Potential Applications:**
- ✅ Pre-clinical screening tool in primary care
- ✅ Monitoring disease progression over time
- ✅ Assessing treatment effectiveness
- ✅ Home-based monitoring with wearable devices

**NOT suitable for:**
- ❌ Definitive diagnosis (requires neurologist confirmation)
- ❌ Replacing comprehensive clinical assessment
- ❌ Standalone decision-making

**Dataset Source**: 
PhysioNet - Gait in Parkinson's Disease
https://physionet.org/content/gaitpdb/1.0.0/

**References**:
- Hausdorff, J.M. et al. (2007). Rhythmic auditory stimulation modulates gait variability in Parkinson's disease. Eur. J. Neurosci.
- Yogev, G. et al. (2005). Dual tasking, gait rhythmicity, and Parkinson's disease. Eur. J. Neurosci.
