# Automatic Piano Fingering Detection from Video

**Computer Vision Final Project, Sapienza University of Rome**

---

## Project Goal

Given a video of a piano performance with synchronized MIDI data, automatically determine the finger assignment (1–5, thumb to pinky) for each played note using **only computer vision techniques**; no manual annotations are used in the detection pipeline itself.

**Input**: Video + MIDI → **Output**: Per-note finger labels (L1–L5 for left hand, R1–R5 for right hand)

## Primary Reference

> **Moryossef et al. (2023)**, *"At Your Fingertips: Extracting Piano Fingering Instructions from Videos"*, [arXiv:2303.03745](https://arxiv.org/abs/2303.03745)

Our pipeline follows the methodology proposed in this paper:

| Paper Methodology | Our Implementation |
|---|---|
| Video-based pipeline (Video → Keyboard → Hands → Assignment) | Same 4-stage architecture |
| Keyboard detection from video via edge/line analysis | Canny + Hough + line clustering + black-key analysis (`AutoKeyboardDetector`) |
| MediaPipe 21-keypoint hand pose estimation | Live detection on raw video (model_complexity=1, conf=0.3, video mode) |
| **Gaussian probability assignment using x-distance only** | `P(finger→key) = exp(−dx²/(2σ²))` with auto-scaled σ |
| Max-distance gate (reject when hand is far) | 4σ rejection threshold |
| Both-hands evaluation per note | Try L & R, pick higher confidence |
| Temporal smoothing of landmarks | Hampel + interpolation + Savitzky-Golay |

## Pipeline Architecture

```
Video ──► Keyboard Detection ──► Hand Processing ──► Finger-Key Assignment ──► Neural Refinement ──► Fingering Labels
           (Canny/Hough/           (MediaPipe +        (Gaussian x-only          (BiLSTM +
            Clustering)             Temporal Filter)     Probability)              Viterbi)
```

| Stage | Method | Input | Output |
|-------|--------|-------|--------|
| 1. Keyboard Detection | Canny + Hough + Clustering + Black-Key Analysis | Video frames | 88 key bounding boxes (pixel space) |
| 2. Hand Processing | MediaPipe (live) + Hampel + SavGol | Raw video frames | Filtered landmarks (T × 21 × 3) |
| 3. Finger Assignment | Gaussian probability (x-only) | MIDI + fingertips + keys | FingerAssignment per note |
| 4. Neural Refinement | BiLSTM + Attention + Viterbi | Initial assignments | Refined predictions |

> **Full-CV approach**: The keyboard is detected automatically from raw video; no dataset annotations are used in the pipeline. Corner annotations from PianoVAM are used **only for evaluation** (IoU metric).
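
The IoU metric mentioned above can be illustrated for two axis-aligned boxes. This is a minimal sketch, assuming boxes in `(x1, y1, x2, y2)` pixel format; the helper name `bbox_iou` is hypothetical and is not the project's `evaluate_against_corners` implementation, which may handle non-rectangular corner annotations differently.

```python
def bbox_iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width/height of the overlap rectangle (zero when the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Illustrative boxes (made-up numbers): a detection shifted by half its width
detected = (0, 0, 10, 10)
reference = (5, 0, 15, 10)
print(f'IoU = {bbox_iou(detected, reference):.3f}')
```

An IoU of 1.0 means the detected and reference boxes coincide; 0.0 means they do not overlap at all.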

## Dataset

**PianoVAM** (KAIST): 107 piano performances with synchronized video, audio, MIDI, and pre-extracted hand skeletons.

## Table of Contents

0. [Environment Setup](#0)
1. [Data Exploration](#1)
2. [Stage 1: Keyboard Detection - Automatic CV Pipeline](#2)
3. [Stage 2: Hand Pose Estimation (MediaPipe)](#3)
4. [Stage 3: Temporal Filtering](#4)
5. [Stage 4: Finger-Key Assignment (Gaussian)](#5)
6. [Baseline Pipeline on Multiple Samples](#6)
7. [Stage 5: Neural Refinement (BiLSTM)](#7)
8. [Evaluation & Results](#8)
9. [Extended Evaluation & Validation](#9)
    - 9.1 Baseline Comparisons
    - 9.2 Hand Detection Validation (Live MP vs. Dataset Skeletons)
    - 9.3 Ablation Study
    - 9.4 Qualitative Analysis

---
<a id='0'></a>
## 0. Environment Setup

In [None]:
import os, sys, subprocess

IN_COLAB = 'google.colab' in str(get_ipython()) if 'get_ipython' in dir() else False

if IN_COLAB:
    REPO_URL = 'https://github.com/esnylmz/computer-vision.git'
    BRANCH = 'v4'
    if not os.path.exists('computer-vision'):
        subprocess.run(['git', 'clone', '--branch', BRANCH, '--single-branch', REPO_URL], check=True)
    os.chdir('computer-vision')
    subprocess.run(['git', 'fetch', 'origin', BRANCH], check=True)
    subprocess.run(['git', 'checkout', BRANCH], check=True)
    subprocess.run(['git', 'pull', '--ff-only', 'origin', BRANCH], check=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', '-e', '.'], check=True)
    # mediapipe-numpy2 keeps mp.solutions API and works with numpy 2.x on Colab
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'mediapipe-numpy2'], check=True)
    print('\nColab environment ready')
else:
    PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
    if PROJECT_ROOT not in sys.path:
        sys.path.insert(0, PROJECT_ROOT)

    # make sure we have a compatible mediapipe (solutions API removed in 0.10.31+)
    try:
        import mediapipe as _mp
        if not hasattr(_mp, 'solutions'):
            print('WARNING: mediapipe version too new, reinstalling compatible version...')
            subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'mediapipe-numpy2'], check=True)
    except ImportError:
        subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', 'mediapipe-numpy2'], check=True)

    print('Local environment ready')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import json, time, warnings
from pathlib import Path
from tqdm.notebook import tqdm

warnings.filterwarnings('ignore', category=UserWarning)
sns.set_style('whitegrid')

print(f'NumPy  : {np.__version__}')
print(f'Pandas : {pd.__version__}')
print(f'OpenCV : {cv2.__version__}')

import torch
print(f'PyTorch: {torch.__version__}')
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device : {DEVICE}')

In [None]:
from src.data.dataset import PianoVAMDataset, PianoVAMSample
from src.data.midi_utils import MidiProcessor, MidiEvent
from src.data.video_utils import VideoProcessor
from src.utils.config import load_config, Config

from src.keyboard.detector import KeyboardDetector, KeyboardRegion
from src.keyboard.auto_detector import AutoKeyboardDetector, AutoDetectionResult
from src.keyboard.homography import HomographyComputer
from src.keyboard.key_localization import KeyLocalizer

from src.hand.skeleton_loader import SkeletonLoader, HandLandmarks
from src.hand.temporal_filter import TemporalFilter
from src.hand.fingertip_extractor import FingertipExtractor, FingertipData
from src.hand.live_detector import LiveHandDetector, LiveDetectionConfig

from src.assignment.gaussian_assignment import GaussianFingerAssigner, FingerAssignment
from src.assignment.midi_sync import MidiVideoSync
from src.assignment.hand_separation import HandSeparator

from src.refinement.model import FingeringRefiner, FeatureExtractor, SequenceDataset
from src.refinement.constraints import BiomechanicalConstraints
from src.refinement.decoding import constrained_viterbi_decode
from src.refinement.train import train_refiner, collate_fn

from src.evaluation.metrics import FingeringMetrics, EvaluationResult, aggregate_results
from src.evaluation.visualization import ResultVisualizer

from src.pipeline import FingeringPipeline

config_path = 'configs/colab.yaml' if IN_COLAB else 'configs/default.yaml'
config = load_config(config_path)
print(f'All modules imported | Config: {config_path}')
print(f'Project: {config.project_name} v{config.version}')

---
<a id='1'></a>
## 1. Data Exploration

In [None]:
MAX_EXPLORE = 20

print('Loading PianoVAM dataset splits ...\n')
train_dataset = PianoVAMDataset(split='train', streaming=True, max_samples=MAX_EXPLORE)
val_dataset = PianoVAMDataset(split='validation', streaming=True, max_samples=MAX_EXPLORE)
test_dataset = PianoVAMDataset(split='test', streaming=True, max_samples=MAX_EXPLORE)

sample = next(iter(train_dataset))
print(f'\nSample ID      : {sample.id}')
print(f'Composer       : {sample.metadata["composer"]}')
print(f'Piece          : {sample.metadata["piece"]}')
print(f'Skill Level    : {sample.metadata["skill_level"]}')
print(f'Keyboard Corners: {sample.metadata["keyboard_corners"]}')

In [None]:
print('Collecting dataset statistics ...')
stats_ds = PianoVAMDataset(split='train', max_samples=None)

composers, skill_levels = [], []
for s in stats_ds:
    composers.append(s.metadata['composer'])
    skill_levels.append(s.metadata['skill_level'])

fig, axes = plt.subplots(1, 2, figsize=(14, 4))
pd.Series(skill_levels).value_counts().plot.bar(ax=axes[0], color='steelblue')
axes[0].set_title(f'Skill Level Distribution (n={len(skill_levels)})')
pd.Series(composers).value_counts().head(10).plot.barh(ax=axes[1], color='darkorange')
axes[1].set_title('Top 10 Composers')
plt.tight_layout()
plt.show()

print(f'Train samples    : {len(skill_levels)}')
print(f'Unique composers : {len(set(composers))}')
print(f'Skill levels     : {dict(pd.Series(skill_levels).value_counts())}')

---
<a id='2'></a>
## 2. Stage 1 - Keyboard Detection (Automatic CV Pipeline)

We detect the piano keyboard from raw video using **only classical computer vision**; no dataset annotations are used in the detection itself. This follows the video-based detection approach from Moryossef et al. (2023).

### Detection Pipeline
1. **Preprocessing**: grayscale → CLAHE contrast enhancement → Gaussian blur
2. **Canny edge detection**: Otsu-adaptive thresholds merged with fixed thresholds
3. **Morphological closing**: horizontal kernel to connect fragmented edges
4. **Hough line transform**: extract horizontal and vertical line segments
5. **Line clustering**: group horizontal lines by y-coordinate, select top/bottom keyboard edges
6. **Black-key refinement**: contour analysis to tighten x-boundaries
7. **Multi-frame consensus**: sample N frames, take the median bbox for robustness
8. **88-key layout**: divide the detected region into 52 white + 36 black keys in pixel space

> **Evaluation only**: PianoVAM corner annotations are used solely to compute IoU (Intersection-over-Union) as a detection quality metric.
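
Steps 7 and 8 above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names `consensus_bbox` and `white_key_boxes` are hypothetical, the keyboard is assumed to be roughly fronto-parallel so that white keys have equal width in the image, and black keys are omitted. The project's `AutoKeyboardDetector` may do this differently.

```python
import numpy as np


def consensus_bbox(per_frame_bboxes):
    """Median of per-frame (x1, y1, x2, y2) boxes; failed frames are passed as None."""
    valid = np.array([b for b in per_frame_bboxes if b is not None], dtype=float)
    if valid.size == 0:
        return None
    # The coordinate-wise median is robust to a minority of bad detections
    return tuple(np.median(valid, axis=0))


def white_key_boxes(bbox, n_white=52):
    """Split a keyboard bbox into n_white equal-width white-key boxes."""
    x1, y1, x2, y2 = bbox
    edges = np.linspace(x1, x2, n_white + 1)
    return [(edges[i], y1, edges[i + 1], y2) for i in range(n_white)]


# Illustrative input (made-up numbers): one failed frame and one gross outlier
frames = [(100, 400, 1820, 600), None, (104, 402, 1816, 598), (300, 100, 900, 200)]
bbox = consensus_bbox(frames)
keys = white_key_boxes(bbox)
print(f'consensus bbox: {bbox}')
print(f'white keys: {len(keys)}, width = {keys[0][2] - keys[0][0]:.1f} px')
```

The median keeps the consensus close to the two consistent detections even though one sampled frame produced a very different box.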

In [None]:
# Download a video frame from PianoVAM
print(f'Downloading video for sample {sample.id} ...')
video_path = train_dataset.download_file(sample.video_path)
print(f'Video saved to: {video_path}')

vp = VideoProcessor()
vp.open(video_path)
print(f'Resolution: {vp.info.width}x{vp.info.height}')
print(f'FPS: {vp.info.fps}')
print(f'Total frames: {vp.info.frame_count}')
print(f'Duration: {vp.info.duration:.1f}s')

# grab a frame from the middle of the video
mid_frame_idx = vp.info.frame_count // 2
frame_bgr = vp.get_frame(mid_frame_idx)
frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

plt.figure(figsize=(14, 6))
plt.imshow(frame_rgb)
plt.title(f'Raw Video Frame (frame {mid_frame_idx})')
plt.axis('off')
plt.show()

vp.close()

In [None]:
# ── Image Preprocessing Pipeline ──
gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)

# CLAHE (Contrast-Limited Adaptive Histogram Equalisation) for lighting normalisation
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)
enhanced_blur = cv2.GaussianBlur(enhanced, (5, 5), 0)

# Canny edge detection: fixed thresholds for comparison
edges_low = cv2.Canny(blurred, 30, 100)
edges_mid = cv2.Canny(blurred, 50, 150)
edges_high = cv2.Canny(blurred, 100, 200)

# Otsu-based automatic threshold on CLAHE-enhanced image
otsu_thresh, _ = cv2.threshold(enhanced_blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
otsu_low = max(10, int(otsu_thresh * 0.5))
otsu_high = min(255, int(otsu_thresh * 1.0))
edges_otsu = cv2.Canny(enhanced_blur, otsu_low, otsu_high)

# Merge fixed + Otsu edges for robustness
edges_merged = cv2.bitwise_or(edges_mid, edges_otsu)

# Morphological closing with horizontal kernel to connect fragmented edges
kernel_h = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 1))
edges_closed = cv2.morphologyEx(edges_merged, cv2.MORPH_CLOSE, kernel_h)

fig, axes = plt.subplots(3, 3, figsize=(18, 14))

axes[0, 0].imshow(frame_rgb);               axes[0, 0].set_title('Original (RGB)')
axes[0, 1].imshow(gray, cmap='gray');        axes[0, 1].set_title('Grayscale')
axes[0, 2].imshow(enhanced, cmap='gray');    axes[0, 2].set_title('CLAHE Enhanced')

axes[1, 0].imshow(edges_low, cmap='gray');   axes[1, 0].set_title('Canny (30, 100)')
axes[1, 1].imshow(edges_mid, cmap='gray');   axes[1, 1].set_title('Canny (50, 150)')
axes[1, 2].imshow(edges_high, cmap='gray');  axes[1, 2].set_title('Canny (100, 200)')

axes[2, 0].imshow(edges_otsu, cmap='gray');  axes[2, 0].set_title(f'Otsu-adaptive ({otsu_low}, {otsu_high})')
axes[2, 1].imshow(edges_merged, cmap='gray');axes[2, 1].set_title('Merged (fixed + Otsu)')
axes[2, 2].imshow(edges_closed, cmap='gray');axes[2, 2].set_title('After Morphological Close')

for ax in axes.flat:
    ax.axis('off')

plt.suptitle('Image Preprocessing & Edge Detection Pipeline', fontsize=14)
plt.tight_layout()
plt.show()

print(f'Otsu threshold: {otsu_thresh:.0f}  →  Canny range: ({otsu_low}, {otsu_high})')

In [None]:
# Hough Line Transform on the edge map
edges = edges_mid

lines = cv2.HoughLinesP(
    edges, rho=1, theta=np.pi/180, threshold=100,
    minLineLength=100, maxLineGap=10
)

print(f'Total lines detected: {len(lines) if lines is not None else 0}')

# separate horizontal and vertical lines
line_vis = frame_rgb.copy()
horizontal_lines = []
vertical_lines = []

if lines is not None:
    for line in lines:
        x1, y1, x2, y2 = line[0]
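        # Fold the angle into [0, pi/2] so segment direction does not matter;
        # otherwise right-to-left horizontal segments (angle near pi) count as vertical.
        angle = np.arctan2(abs(y2 - y1), abs(x2 - x1))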
        if angle < np.pi / 18:  # within 10 degrees of horizontal
            horizontal_lines.append((x1, y1, x2, y2))
            cv2.line(line_vis, (x1, y1), (x2, y2), (0, 255, 0), 2)
        elif angle > np.pi / 2 - np.pi / 18:  # within 10 degrees of vertical
            vertical_lines.append((x1, y1, x2, y2))
            cv2.line(line_vis, (x1, y1), (x2, y2), (255, 0, 0), 1)

print(f'Horizontal lines: {len(horizontal_lines)}')
print(f'Vertical lines  : {len(vertical_lines)}')

fig, axes = plt.subplots(1, 2, figsize=(18, 6))
axes[0].imshow(edges, cmap='gray')
axes[0].set_title('Canny Edge Map')
axes[1].imshow(line_vis)
axes[1].set_title('Hough Lines (green=horizontal, red=vertical)')
for ax in axes:
    ax.axis('off')
plt.tight_layout()
plt.show()

In [None]:
# ── Automatic Keyboard Detection (PRIMARY, no annotations used) ──
auto_detector = AutoKeyboardDetector({
    'canny_low': config.keyboard.canny_low,
    'canny_high': config.keyboard.canny_high,
    'hough_threshold': config.keyboard.hough_threshold,
})

# Run auto-detection on a single frame
auto_result = auto_detector.detect_single_frame(frame_bgr, return_intermediates=True)

if auto_result.success:
    keyboard_region = auto_result.keyboard_region
    print('✅ Auto-detection succeeded')
    print(f'  Bounding box    : {auto_result.consensus_bbox}')
    print(f'  Keys detected   : {len(keyboard_region.key_boundaries)}')
    print(f'  White key width : {keyboard_region.white_key_width:.1f} px')
    print(f'  Horiz. lines    : {len(auto_result.horizontal_lines or [])}')
    print(f'  Line clusters   : {len(auto_result.line_clusters or [])}')
    print(f'  Black key cands : {len(auto_result.black_key_contours or [])}')
else:
    print('⚠️  Single-frame detection failed, falling back to multi-frame consensus')
    auto_result = auto_detector.detect_from_video(video_path, return_intermediates=True)
    if auto_result.success:
        keyboard_region = auto_result.keyboard_region
        print(f'✅ Multi-frame consensus succeeded: bbox={auto_result.consensus_bbox}')
    else:
        raise RuntimeError('Keyboard auto-detection failed on this video')

# ── Evaluation: IoU against corner annotations (annotations used ONLY here) ──
corners = sample.metadata['keyboard_corners']
iou = auto_detector.evaluate_against_corners(auto_result, corners)
print(f'\n📐 IoU vs corner annotations (evaluation only): {iou:.3f}')

In [None]:
# ── Visualize Auto-Detection Intermediates ──
fig, axes = plt.subplots(2, 2, figsize=(18, 12))

# 1) Hough lines on original frame
line_vis = auto_detector.visualize_lines(frame_bgr, auto_result)
axes[0, 0].imshow(cv2.cvtColor(line_vis, cv2.COLOR_BGR2RGB))
axes[0, 0].set_title(f'Hough Lines (green=horiz, red=vert)')

# 2) Line clusters with selected top/bottom
cluster_vis = auto_detector.visualize_clusters(frame_bgr, auto_result)
axes[0, 1].imshow(cv2.cvtColor(cluster_vis, cv2.COLOR_BGR2RGB))
axes[0, 1].set_title('Line Clusters & Selected Edges (cyan)')

# 3) Black key contours
bk_vis = auto_detector.visualize_black_keys(frame_bgr, auto_result)
axes[1, 0].imshow(cv2.cvtColor(bk_vis, cv2.COLOR_BGR2RGB))
axes[1, 0].set_title('Black-Key Contours (boundary refinement)')

# 4) Final detection vs corner ground truth
corner_det = KeyboardDetector()
corner_region = corner_det.detect_from_corners(corners)
det_vis = auto_detector.visualize_detection(frame_bgr, auto_result, corner_bbox=corner_region.bbox)
axes[1, 1].imshow(cv2.cvtColor(det_vis, cv2.COLOR_BGR2RGB))
axes[1, 1].set_title(f'Auto (green) vs Corner GT (red) | IoU={iou:.3f}')

for ax in axes.flat:
    ax.axis('off')
plt.suptitle('Automatic Keyboard Detection Pipeline: Intermediates', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# ── Multi-Frame Consensus Detection ──
# Sample multiple frames across the video and take the median bbox.
# This compensates for temporary occlusions (hands, page turns, etc.)

multi_result = auto_detector.detect_from_video(video_path, return_intermediates=False)

if multi_result.success:
    print(f'‚úÖ Multi-frame consensus bbox: {multi_result.consensus_bbox}')
    valid_bboxes = [b for b in (multi_result.per_frame_bboxes or []) if b is not None]
    print(f'   Frames sampled: {len(multi_result.per_frame_bboxes or [])}')
    print(f'   Successful detections: {len(valid_bboxes)}')

    # Update keyboard_region to use multi-frame consensus (more robust)
    keyboard_region = multi_result.keyboard_region

    multi_iou = auto_detector.evaluate_against_corners(multi_result, corners)
    print(f'   IoU vs corner GT: {multi_iou:.3f}')

    # Show per-frame bboxes
    if multi_result.per_frame_bboxes:
        print('\n   Per-frame detections:')
        for i, bb in enumerate(multi_result.per_frame_bboxes):
            status = f'bbox={bb}' if bb is not None else 'FAILED'
            print(f'     Frame {i}: {status}')
else:
    print('Multi-frame consensus failed, keeping single-frame result')

In [None]:
# ── Homography & Perspective Correction ──
# The auto-detected bbox defines the keyboard region in pixel space.
# We compute a homography to warp it into a normalised rectangle for visualisation.

H = keyboard_region.homography
x1, y1, x2, y2 = keyboard_region.bbox
kb_width = x2 - x1
kb_height = y2 - y1

warped = cv2.warpPerspective(frame_bgr, H, (kb_width, kb_height))
warped_rgb = cv2.cvtColor(warped, cv2.COLOR_BGR2RGB)

# Draw auto-detected keyboard boundary on the original frame
bbox_vis = frame_rgb.copy()
if keyboard_region.corners:
    pts = [keyboard_region.corners[k] for k in ['LT', 'RT', 'RB', 'LB']]
    for i in range(4):
        p1 = pts[i]
        p2 = pts[(i + 1) % 4]
        cv2.line(bbox_vis, p1, p2, (0, 255, 0), 3)
        cv2.circle(bbox_vis, p1, 8, (255, 0, 0), -1)

fig, axes = plt.subplots(2, 1, figsize=(16, 8))
axes[0].imshow(bbox_vis)
axes[0].set_title('Auto-Detected Keyboard Region')
axes[1].imshow(warped_rgb)
axes[1].set_title(f'Perspective-Corrected Keyboard ({kb_width}×{kb_height} px)')
for ax in axes:
    ax.axis('off')
plt.tight_layout()
plt.show()

print(f'Keyboard bbox: ({x1}, {y1}) → ({x2}, {y2})  |  {kb_width}×{kb_height} px')

In [None]:
# ── Visualize 88-key layout in pixel space (from auto-detection) ──
localizer = KeyLocalizer(keyboard_region.key_boundaries)
white_keys = localizer.get_white_keys()
black_keys = localizer.get_black_keys()

print(f'White keys: {len(white_keys)}  |  Black keys: {len(black_keys)}')

fig, ax = plt.subplots(figsize=(18, 3))
for ki in white_keys:
    kx1, ky1, kx2, ky2 = ki.bbox
    ax.add_patch(plt.Rectangle((kx1, ky1), kx2 - kx1, ky2 - ky1, linewidth=0.8,
                               edgecolor='black', facecolor='white'))
for ki in black_keys:
    kx1, ky1, kx2, ky2 = ki.bbox
    ax.add_patch(plt.Rectangle((kx1, ky1), kx2 - kx1, ky2 - ky1, linewidth=0.5,
                               edgecolor='black', facecolor='#333'))

for note_name in ['A0', 'C4', 'C8']:
    ki = localizer.get_key_by_name(note_name)
    if ki:
        ax.annotate(ki.note_name, xy=ki.center, fontsize=7, color='red', ha='center', va='bottom')

bx1, by1, bx2, by2 = keyboard_region.bbox
ax.set_xlim(bx1 - 10, bx2 + 10)
ax.set_ylim(by2 + 10, by1 - 10)
ax.set_aspect('equal')
ax.set_title('88-Key Layout in Pixel Space (Auto-Detected)')
plt.tight_layout()
plt.show()

---
<a id='3'></a>
## 3. Stage 2 - Hand Pose Estimation (Live MediaPipe)

We run **MediaPipe Hands directly on the raw video** to detect 21 hand landmarks per hand; no pre-extracted skeleton data from the dataset is used.

Key parameters for robust detection:
- **`model_complexity=1`**: full model for higher accuracy
- **`min_detection_confidence=0.3`**: a lower threshold catches partially-occluded and fast-moving hands
- **`static_image_mode=False`** (video mode): enables temporal tracking across consecutive frames, which helps keep detections stable while the hands are in motion
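
To make the landmark format concrete, the sketch below shows how a `(T, 21, 3)` array of normalised landmarks could be summarised and converted to fingertip pixel positions. Fingertips sit at MediaPipe landmark indices 4, 8, 12, 16 and 20. The helper names are hypothetical, and the use of NaN to mark frames without a detection is an assumption inferred from the `np.isnan` checks used elsewhere in this notebook.

```python
import numpy as np

# MediaPipe Hands landmark indices of the five fingertips (finger 1 = thumb ... 5 = pinky)
FINGERTIP_IDS = {1: 4, 2: 8, 3: 12, 4: 16, 5: 20}


def detection_rate_sketch(landmarks):
    """Fraction of frames in a (T, 21, 3) array that contain no NaN values."""
    landmarks = np.asarray(landmarks, dtype=float)
    if landmarks.shape[0] == 0:
        return 0.0
    detected = ~np.isnan(landmarks).any(axis=(1, 2))
    return float(detected.mean())


def fingertips_px(frame_landmarks, frame_w, frame_h):
    """Map one frame of normalised (21, 3) landmarks to {finger: (x_px, y_px)}."""
    frame_landmarks = np.asarray(frame_landmarks, dtype=float)
    tips = {}
    for finger, idx in FINGERTIP_IDS.items():
        x, y = frame_landmarks[idx, 0], frame_landmarks[idx, 1]
        if np.isnan(x) or np.isnan(y):
            continue  # hand not detected in this frame
        tips[finger] = (x * frame_w, y * frame_h)
    return tips


# Illustrative track (made-up values): 4 frames, hand detected in frames 0 and 2 only
track = np.full((4, 21, 3), np.nan)
track[0] = 0.5
track[2] = 0.25
print(f'detection rate: {detection_rate_sketch(track):.0%}')
print(f'index fingertip in frame 0: {fingertips_px(track[0], 1920, 1080)[2]}')
```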

In [None]:
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles

print(f'MediaPipe version: {mp.__version__}')

In [None]:
# ── Run MediaPipe hand detection on video frames ──
# Detector settings:
#   - model_complexity=1 → full model (more accurate keypoints)
#   - min_detection_confidence=0.3 → detect partially-occluded / fast-moving hands
#   - min_tracking_confidence=0.3 → maintain tracking through motion blur
#     (only takes effect in video mode; ignored when static_image_mode=True)

vp = VideoProcessor()
vp.open(video_path)
total_frames = int(vp.info.frame_count)

# Pick frames spread evenly (skip first/last 5%)
n_demo = 5
margin = max(1, total_frames // 20)
sample_frame_indices = np.linspace(margin, total_frames - margin, n_demo, dtype=int).tolist()
sample_frames = []
for idx in sample_frame_indices:
    f = vp.get_frame(idx)
    if f is not None:
        sample_frames.append((idx, f))

vp.close()

hands_detector = mp_hands.Hands(
    static_image_mode=True,       # True for individual frames
    max_num_hands=2,
    model_complexity=1,           # full model for higher accuracy
    min_detection_confidence=0.3, # lower → catches more hands
    min_tracking_confidence=0.3
)

fig, axes = plt.subplots(1, len(sample_frames), figsize=(5 * len(sample_frames), 6))
if len(sample_frames) == 1:
    axes = [axes]

for i, (fidx, frame) in enumerate(sample_frames):
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = hands_detector.process(frame_rgb)

    annotated = frame_rgb.copy()
    n_hands = 0
    if results.multi_hand_landmarks:
        n_hands = len(results.multi_hand_landmarks)
        for hand_lm in results.multi_hand_landmarks:
            mp_drawing.draw_landmarks(
                annotated, hand_lm, mp_hands.HAND_CONNECTIONS,
                mp_drawing_styles.get_default_hand_landmarks_style(),
                mp_drawing_styles.get_default_hand_connections_style()
            )

    axes[i].imshow(annotated)
    axes[i].set_title(f'Frame {fidx} ({n_hands} hands)')
    axes[i].axis('off')

plt.suptitle('MediaPipe Hand Detection (model_complexity=1, conf=0.3)', fontsize=14)
plt.tight_layout()
plt.show()

hands_detector.close()

In [None]:
# Extract landmarks from MediaPipe and visualize
# Focus on one frame to show the 21-keypoint structure

demo_frame_bgr = sample_frames[2][1] if len(sample_frames) > 2 else sample_frames[0][1]
demo_frame_rgb = cv2.cvtColor(demo_frame_bgr, cv2.COLOR_BGR2RGB)

hands_detector = mp_hands.Hands(static_image_mode=True, max_num_hands=2, model_complexity=1,
                                min_detection_confidence=0.3, min_tracking_confidence=0.3)
results = hands_detector.process(demo_frame_rgb)
hands_detector.close()

h, w = demo_frame_rgb.shape[:2]
annotated = demo_frame_rgb.copy()

fingertip_indices = [4, 8, 12, 16, 20]
finger_names = {4: 'Thumb', 8: 'Index', 12: 'Middle', 16: 'Ring', 20: 'Pinky'}

if results.multi_hand_landmarks:
    for hand_lm, hand_info in zip(results.multi_hand_landmarks, results.multi_handedness):
        label = hand_info.classification[0].label
        mp_drawing.draw_landmarks(annotated, hand_lm, mp_hands.HAND_CONNECTIONS)

        # mark fingertips
        for tip_idx in fingertip_indices:
            lm = hand_lm.landmark[tip_idx]
            px, py = int(lm.x * w), int(lm.y * h)
            cv2.circle(annotated, (px, py), 8, (255, 0, 0), -1)
            cv2.putText(annotated, finger_names[tip_idx], (px + 10, py - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 0), 1)

        print(f'{label} hand detected:')
        for tip_idx in fingertip_indices:
            lm = hand_lm.landmark[tip_idx]
            print(f'  {finger_names[tip_idx]:6s} tip: ({lm.x:.4f}, {lm.y:.4f}, {lm.z:.4f})')

plt.figure(figsize=(14, 8))
plt.imshow(annotated)
plt.title('MediaPipe 21-Keypoint Hand Skeleton with Fingertip Labels')
plt.axis('off')
plt.show()

In [None]:
# ── Run LIVE MediaPipe detection on the full video ──
# This is our primary hand-detection method: no pre-extracted skeletons.
# We use video mode (static_image_mode=False) so that MediaPipe tracks hands
# across consecutive frames instead of re-detecting them in every frame.

print(f'Running live MediaPipe hand detection on {video_path} ...')
print(f'  model_complexity=1, min_detection_conf=0.3, stride=2')

live_cfg = LiveDetectionConfig(
    model_complexity=1,
    min_detection_confidence=0.3,
    min_tracking_confidence=0.3,
    frame_stride=2,            # process every 2nd frame for speed
    static_image_mode=False,   # video mode → temporal tracking
)
live_det = LiveHandDetector(config=live_cfg)
left_raw, right_raw = live_det.detect_from_video(
    video_path,
    progress_callback=lambda cur, tot: print(f'  frame {cur}/{tot}', end='\r')
)

left_rate = LiveHandDetector.detection_rate(left_raw)
right_rate = LiveHandDetector.detection_rate(right_raw)

print(f'\nLive MediaPipe results:')
print(f'  Left  hand: {left_raw.shape}, detection rate = {left_rate:.1%}')
print(f'  Right hand: {right_raw.shape}, detection rate = {right_rate:.1%}')
print('  Coordinates are normalised [0, 1], same format as SkeletonLoader')

In [None]:
# ── Overlay live-detected landmarks on a demo frame ──
demo_fidx = sample_frames[2][0] if len(sample_frames) > 2 else sample_frames[0][0]
demo_frame_bgr2 = sample_frames[2][1] if len(sample_frames) > 2 else sample_frames[0][1]
comparison = cv2.cvtColor(demo_frame_bgr2, cv2.COLOR_BGR2RGB).copy()
h, w = comparison.shape[:2]

for hand_key, color in [('right', (0, 255, 0)), ('left', (0, 200, 255))]:
    arr = right_raw if hand_key == 'right' else left_raw
    if demo_fidx < len(arr) and not np.any(np.isnan(arr[demo_fidx])):
        lm = arr[demo_fidx]
        for j in range(21):
            px = int(lm[j, 0] * w)
            py = int(lm[j, 1] * h)
            cv2.circle(comparison, (px, py), 4, color, -1)
        connections = [(0,1),(1,2),(2,3),(3,4),(0,5),(5,6),(6,7),(7,8),
                       (5,9),(9,10),(10,11),(11,12),(9,13),(13,14),(14,15),(15,16),
                       (13,17),(17,18),(18,19),(19,20),(0,17)]
        for c1, c2 in connections:
            p1 = (int(lm[c1, 0] * w), int(lm[c1, 1] * h))
            p2 = (int(lm[c2, 0] * w), int(lm[c2, 1] * h))
            cv2.line(comparison, p1, p2, color, 2)

plt.figure(figsize=(14, 8))
plt.imshow(comparison)
plt.title(f'Live MediaPipe Detection Overlay (green=right, cyan=left) | Frame {demo_fidx}')
plt.axis('off')
plt.show()

---
<a id='4'></a>
## 4. Stage 3 - Temporal Filtering

MediaPipe landmarks are noisy. We apply a 3-stage filtering pipeline:
1. Hampel filter (outlier detection via Median Absolute Deviation)
2. Linear interpolation (fill gaps < 30 frames)
3. Savitzky-Golay filter (smoothing)
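
The first two stages can be sketched for a single 1-D coordinate track. This is a simplified illustration, not the project's `TemporalFilter`: the function names, the window size and the 3-sigma threshold are assumptions chosen for the example, and missing detections are assumed to be stored as NaN.

```python
import numpy as np


def hampel_mask(x, window=5, n_sigmas=3.0):
    """Boolean mask of outliers: samples far from the local median in MAD units."""
    x = np.asarray(x, dtype=float)
    mask = np.zeros(len(x), dtype=bool)
    k = 1.4826  # scales the MAD to a standard deviation for Gaussian data
    for i in range(len(x)):
        lo, hi = max(0, i - window), min(len(x), i + window + 1)
        win = x[lo:hi]
        win = win[~np.isnan(win)]
        if np.isnan(x[i]) or len(win) < 3:
            continue  # nothing to test, or too few neighbours
        med = np.median(win)
        mad = k * np.median(np.abs(win - med))
        if abs(x[i] - med) > n_sigmas * mad:
            mask[i] = True
    return mask


def interpolate_short_gaps(x, max_gap=30):
    """Linearly fill interior NaN runs shorter than max_gap; longer runs stay NaN."""
    x = np.asarray(x, dtype=float).copy()
    isnan = np.isnan(x)
    valid_idx = np.flatnonzero(~isnan)
    if len(valid_idx) < 2:
        return x
    filled = np.interp(np.arange(len(x)), valid_idx, x[valid_idx])
    i, n = 0, len(x)
    while i < n:
        if not isnan[i]:
            i += 1
            continue
        j = i
        while j < n and isnan[j]:
            j += 1
        # Only fill runs bounded by valid samples on both sides
        if i > 0 and j < n and (j - i) < max_gap:
            x[i:j] = filled[i:j]
        i = j
    return x


# Illustrative signal (made-up values): a flat track with one spike
signal = np.array([1.0, 1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0, 1.0])
cleaned = signal.copy()
cleaned[hampel_mask(signal, window=3)] = np.nan   # stage 1: drop outliers
cleaned = interpolate_short_gaps(cleaned)          # stage 2: fill short gaps
print(cleaned)
```

The third stage would then smooth each gap-free segment, for example with `scipy.signal.savgol_filter(segment, window_length, polyorder)`, using the window and order from the configuration.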

In [None]:
tf = TemporalFilter(
    hampel_window=config.hand.hampel_window,
    hampel_threshold=config.hand.hampel_threshold,
    max_interpolation_gap=config.hand.interpolation_max_gap,
    savgol_window=config.hand.savgol_window,
    savgol_order=config.hand.savgol_order
)

left_filtered = tf.process(left_raw) if left_raw.size > 0 else left_raw
right_filtered = tf.process(right_raw) if right_raw.size > 0 else right_raw

print('Filtering complete')
print(f'Left  filtered shape: {left_filtered.shape}')
print(f'Right filtered shape: {right_filtered.shape}')

In [None]:
# Visualize filtering effect: index fingertip x-coordinate
hand_arr_raw = right_raw if right_raw.size > 0 else left_raw
hand_arr_filt = right_filtered if right_filtered.size > 0 else left_filtered
hand_label = 'Right' if right_raw.size > 0 else 'Left'

lm_idx = 8  # index fingertip
T = min(3000, len(hand_arr_raw))

raw_signal = hand_arr_raw[:T, lm_idx, 0]
filt_signal = hand_arr_filt[:T, lm_idx, 0]

fig, ax = plt.subplots(figsize=(16, 4))
ax.plot(raw_signal, alpha=0.5, label='Raw', linewidth=0.5)
ax.plot(filt_signal, label='Filtered', linewidth=1)
ax.set_title(f'{hand_label} Hand - Index Fingertip X-Coordinate')
ax.set_xlabel('Frame')
ax.set_ylabel('X (normalized)')
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
# Dense optical flow: visualise hand motion between two frames that are 5 frames apart
vp = VideoProcessor()
vp.open(video_path)

flow_start = 1000
frame1_bgr = vp.get_frame(flow_start)
frame2_bgr = vp.get_frame(flow_start + 5)
vp.close()

if frame1_bgr is not None and frame2_bgr is not None:
    gray1 = cv2.cvtColor(frame1_bgr, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(frame2_bgr, cv2.COLOR_BGR2GRAY)

    # crop to keyboard region for clearer visualization
    y1, y2 = keyboard_region.bbox[1], keyboard_region.bbox[3]
    x1, x2 = keyboard_region.bbox[0], keyboard_region.bbox[2]
    gray1_crop = gray1[y1:y2, x1:x2]
    gray2_crop = gray2[y1:y2, x1:x2]

    flow = cv2.calcOpticalFlowFarneback(
        gray1_crop, gray2_crop, None,
        pyr_scale=0.5, levels=3, winsize=15, iterations=3, poly_n=5, poly_sigma=1.2, flags=0
    )

    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])

    # HSV visualization
    hsv = np.zeros((*gray1_crop.shape, 3), dtype=np.uint8)
    hsv[..., 0] = angle * 180 / np.pi / 2
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX)
    flow_vis = cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    axes[0].imshow(cv2.cvtColor(frame1_bgr[y1:y2, x1:x2], cv2.COLOR_BGR2RGB))
    axes[0].set_title(f'Frame {flow_start}')
    axes[1].imshow(cv2.cvtColor(frame2_bgr[y1:y2, x1:x2], cv2.COLOR_BGR2RGB))
    axes[1].set_title(f'Frame {flow_start + 5}')
    axes[2].imshow(flow_vis)
    axes[2].set_title('Dense Optical Flow (Farneback)')
    for ax in axes:
        ax.axis('off')
    plt.suptitle('Optical Flow: Hand Motion Between Frames', fontsize=14)
    plt.tight_layout()
    plt.show()

    # The two frames are 5 frames apart, so the displacement is per 5-frame step
    print(f'Mean flow magnitude: {np.mean(magnitude):.2f} px per 5 frames')
    print(f'Max flow magnitude : {np.max(magnitude):.2f} px per 5 frames')
else:
    print('Could not read frames for optical flow')

In [None]:
# Fingertip extraction
extractor = FingertipExtractor()

sample_fidx = 500
if sample_fidx < len(right_filtered) and not np.any(np.isnan(right_filtered[sample_fidx])):
    ftips = extractor.extract(right_filtered[sample_fidx], frame_idx=sample_fidx, hand_type='right')
    print(f'Frame {sample_fidx} - Right hand fingertips:')
    for f_num in range(1, 6):
        pos = ftips.get_position_2d(f_num)
        if pos:
            print(f'  {extractor.FINGER_NAMES[f_num]:6s}: ({pos[0]:.4f}, {pos[1]:.4f})')

    span = extractor.compute_hand_span(ftips)
    print(f'  Hand span: {span:.4f}')

---
<a id='5'></a>
## 5. Stage 4 - Finger-Key Assignment

Notes are assigned to fingers with a Gaussian probability model evaluated in image-pixel space (not homography-warped space).
Only the x-distance between fingertip and key centre is used, which avoids the y-bias introduced by different finger lengths.
Both hands are tried for each key, and the higher-confidence assignment is kept.
A max-distance gate rejects the assignment when the hand is too far from the key.
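
The four rules above can be illustrated with a small standalone sketch. It is not the project's `GaussianFingerAssigner`: the function names, the dictionary layout of the fingertips, and the choice of the raw Gaussian value as the confidence are all assumptions made for illustration. Only the formula `exp(-dx² / 2σ²)`, the x-only distance, the 4σ gate, and the both-hands comparison come from the description.

```python
import math


def gaussian_score(finger_x, key_center_x, sigma):
    """P(finger -> key) = exp(-dx^2 / (2 * sigma^2)), using the x-distance only."""
    dx = finger_x - key_center_x
    return math.exp(-(dx * dx) / (2.0 * sigma * sigma))


def assign_finger(fingertips_x, key_center_x, sigma, max_sigma=4.0):
    """Pick the fingertip closest in x to the key centre.

    fingertips_x: dict mapping finger number (1-5) to x position in pixels
                  (hypothetical layout, chosen for this sketch).
    Returns (finger, confidence), or None when the max-distance gate rejects it.
    """
    best = min(fingertips_x, key=lambda f: abs(fingertips_x[f] - key_center_x))
    if abs(fingertips_x[best] - key_center_x) > max_sigma * sigma:
        return None  # hand is too far from the key
    return best, gaussian_score(fingertips_x[best], key_center_x, sigma)


def assign_both_hands(left_tips, right_tips, key_center_x, sigma):
    """Try both hands and keep the higher-confidence candidate, if any."""
    candidates = []
    for hand, tips in (('L', left_tips), ('R', right_tips)):
        if tips:
            res = assign_finger(tips, key_center_x, sigma)
            if res is not None:
                candidates.append((hand, res[0], res[1]))
    return max(candidates, key=lambda c: c[2]) if candidates else None


# Toy numbers: sigma of about one white-key width at 1920 px frame width.
left = {1: 700, 2: 660, 3: 620, 4: 580, 5: 540}
right = {1: 900, 2: 940, 3: 980, 4: 1020, 5: 1060}
result = assign_both_hands(left, right, key_center_x=985, sigma=22)
far_result = assign_both_hands(left, right, key_center_x=2000, sigma=22)
print(result)      # right-hand middle finger wins
print(far_result)  # None: both hands fail the 4-sigma gate
```

With these toy values the left hand is rejected by the gate (its nearest fingertip is 285 px away, well beyond 4σ = 88 px), so only the right hand produces a candidate.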

In [None]:
# Load MIDI/TSV annotations
print(f'Downloading TSV annotations for sample {sample.id} ...')
tsv_df = train_dataset.load_tsv_annotations(sample)

midi_events = []
for _, row in tsv_df.iterrows():
    midi_events.append({
        'onset': float(row['onset']),
        'offset': float(row['onset']) + 0.3,
        'pitch': int(row['note']),
        'velocity': int(row['velocity']) if 'velocity' in row and pd.notna(row['velocity']) else 64
    })

print(f'Total MIDI events: {len(midi_events)}')
print(f'Pitch range: {min(e["pitch"] for e in midi_events)} - {max(e["pitch"] for e in midi_events)}')

In [None]:
# Synchronize MIDI events with video frames
midi_sync = MidiVideoSync(fps=config.video_fps)
synced_events = midi_sync.sync_events(midi_events)
print(f'Synced events: {len(synced_events)}')

In [None]:
FRAME_W, FRAME_H = 1920, 1080

# Auto-detected key boundaries are already in pixel space (no projection needed).
# This is a key advantage of our full-CV approach:  the auto-detector computes
# key positions directly in frame coordinates, matching the hand landmark space.
key_boundaries_px = keyboard_region.key_boundaries

# Scale hand landmarks from normalised [0,1] to pixel coordinates
left_px = left_filtered.copy()
left_px[:, :, 0] *= FRAME_W
left_px[:, :, 1] *= FRAME_H

right_px = right_filtered.copy()
right_px[:, :, 0] *= FRAME_W
right_px[:, :, 1] *= FRAME_H

# Gaussian finger assigner (Moryossef et al. 2023, x-distance only)
assigner = GaussianFingerAssigner(
    key_boundaries=key_boundaries_px,
    sigma=config.assignment.sigma,
    candidate_range=config.assignment.candidate_keys
)

print(f'Key boundaries: {len(key_boundaries_px)} keys in pixel space')
print(f'Sigma (auto): {assigner.sigma:.1f} px  (≈ 1 white-key width)')
print(f'Max distance: {assigner.max_distance_px:.0f} px ({assigner.max_distance_sigma}σ gate)')

In [None]:
# Run assignment: try BOTH hands for every key, pick higher confidence
assignments = []
skipped = 0

for event in synced_events:
    frame_idx = event.frame_idx
    key_idx = event.key_idx

    if key_idx not in assigner.key_centers:
        skipped += 1
        continue

    asgn_right = None
    if frame_idx < len(right_px):
        lm = right_px[frame_idx]
        if not np.any(np.isnan(lm)):
            asgn_right = assigner.assign_from_landmarks(lm, key_idx, 'right', frame_idx, event.onset_time)

    asgn_left = None
    if frame_idx < len(left_px):
        lm = left_px[frame_idx]
        if not np.any(np.isnan(lm)):
            asgn_left = assigner.assign_from_landmarks(lm, key_idx, 'left', frame_idx, event.onset_time)

    candidates = [a for a in (asgn_right, asgn_left) if a is not None]
    if candidates:
        assignments.append(max(candidates, key=lambda a: a.confidence))
    else:
        skipped += 1

print(f'Total events  : {len(synced_events)}')
print(f'Assigned      : {len(assignments)}')
print(f'Skipped       : {skipped}')
print(f'Coverage      : {len(assignments)/max(1,len(synced_events))*100:.1f}%')

In [None]:
# Assignment statistics
if assignments:
    fingers = [a.assigned_finger for a in assignments]
    hands_list = [a.hand for a in assignments]
    confs = [a.confidence for a in assignments]

    fig, axes = plt.subplots(1, 3, figsize=(16, 4))

    finger_names_map = {1: 'Thumb', 2: 'Index', 3: 'Middle', 4: 'Ring', 5: 'Pinky'}
    colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00']
    fc = pd.Series(fingers).value_counts().sort_index()
    fc.plot.bar(ax=axes[0], color=[colors[i-1] for i in fc.index])
    axes[0].set_xticklabels([finger_names_map[i] for i in fc.index], rotation=45)
    axes[0].set_title('Finger Distribution')

    pd.Series(hands_list).value_counts().plot.bar(ax=axes[1], color=['coral', 'skyblue'])
    axes[1].set_title('Hand Distribution')

    axes[2].hist(confs, bins=30, color='mediumseagreen', edgecolor='white')
    axes[2].set_title('Assignment Confidence')

    plt.tight_layout()
    plt.show()

    print('\nSample assignments:')
    for a in assignments[:10]:
        print(f'  Frame {a.frame_idx:>5d} | {a.label} | MIDI {a.midi_pitch} ({a.finger_name:6s}) | conf={a.confidence:.3f}')

---
<a id='6'></a>
## 6. Baseline Pipeline on Multiple Samples

In [None]:
_SAMPLE_CACHE = {'video': {}, 'keyboard': {}, 'tsv': {},
                 'filtered_landmarks': {}, 'keys_px': {}}

def process_sample_baseline(sample, dataset, config, max_duration_sec=60, cache=_SAMPLE_CACHE):
    """Full-CV baseline: auto-detect keyboard + live MediaPipe hands + Gaussian assignment.

    NO pre-extracted skeletons from the dataset are used, only raw video.
    Corner annotations are used ONLY for IoU evaluation of keyboard detection.
    """
    result = {'sample_id': sample.id, 'assignments': [], 'error': None, 'iou': None}
    try:
        # ── Download video (shared by all stages) ──
        if sample.id not in cache['video']:
            cache['video'][sample.id] = dataset.download_file(sample.video_path)
        vid_path = cache['video'][sample.id]

        # ── Stage 1: Automatic keyboard detection (NO annotations) ──
        if sample.id not in cache['keyboard']:
            det = AutoKeyboardDetector({
                'canny_low': config.keyboard.canny_low,
                'canny_high': config.keyboard.canny_high,
                'hough_threshold': config.keyboard.hough_threshold,
            })
            auto_res = det.detect_from_video(vid_path)
            if not auto_res.success:
                result['error'] = 'Auto-detection failed'
                return result
            cache['keyboard'][sample.id] = (auto_res, det)

        auto_res, det = cache['keyboard'][sample.id]
        kb = auto_res.keyboard_region

        if sample.id not in cache['keys_px']:
            cache['keys_px'][sample.id] = kb.key_boundaries
        kb_px = cache['keys_px'][sample.id]

        # IoU evaluation against corner annotations (eval only)
        corners = sample.metadata.get('keyboard_corners')
        if corners:
            result['iou'] = det.evaluate_against_corners(auto_res, corners)

        # ── Stage 2: Live hand detection (MediaPipe on raw video) ──
        if sample.id not in cache['filtered_landmarks']:
            live_cfg = LiveDetectionConfig(
                model_complexity=1,
                min_detection_confidence=0.3,
                min_tracking_confidence=0.3,
                frame_stride=2,
                static_image_mode=False,
            )
            live_det = LiveHandDetector(config=live_cfg)
            max_vid_frames = int(max_duration_sec * config.video_fps) if max_duration_sec else None
            la, ra = live_det.detect_from_video(vid_path, max_frames=max_vid_frames)

            # Temporal filtering (Hampel → interpolation → Savitzky-Golay)
            t = TemporalFilter(
                hampel_window=config.hand.hampel_window,
                hampel_threshold=config.hand.hampel_threshold,
                max_interpolation_gap=config.hand.interpolation_max_gap,
                savgol_window=config.hand.savgol_window,
                savgol_order=config.hand.savgol_order
            )
            if la.size > 0: la = t.process(la)
            if ra.size > 0: ra = t.process(ra)

            # Scale from [0,1] to pixel space
            if la.size > 0:
                la = la.copy()
                la[:, :, 0] *= FRAME_W
                la[:, :, 1] *= FRAME_H
            if ra.size > 0:
                ra = ra.copy()
                ra[:, :, 0] *= FRAME_W
                ra[:, :, 1] *= FRAME_H
            cache['filtered_landmarks'][sample.id] = (la, ra)

        la, ra = cache['filtered_landmarks'][sample.id]
        if max_duration_sec:
            mf = int(max_duration_sec * config.video_fps)
            if la.size > 0: la = la[:mf]
            if ra.size > 0: ra = ra[:mf]

        # ── Stage 3: MIDI sync + Gaussian assignment ──
        if sample.id not in cache['tsv']:
            cache['tsv'][sample.id] = dataset.load_tsv_annotations(sample)
        tsv = cache['tsv'][sample.id]
        if max_duration_sec:
            tsv = tsv[tsv['onset'] <= float(max_duration_sec)].copy()

        midi_evts = [{'onset': float(r['onset']), 'offset': float(r['onset'])+0.3,
                      'pitch': int(r['note']),
                      'velocity': int(r['velocity']) if 'velocity' in r and pd.notna(r['velocity']) else 64}
                     for _, r in tsv.iterrows()]

        sync = MidiVideoSync(fps=config.video_fps)
        synced = sync.sync_events(midi_evts)

        asgn = GaussianFingerAssigner(key_boundaries=kb_px, sigma=config.assignment.sigma,
                                      candidate_range=config.assignment.candidate_keys)

        for ev in synced:
            fidx, kidx = ev.frame_idx, ev.key_idx
            if kidx not in asgn.key_centers: continue
            ar = None
            if fidx < len(ra):
                lm = ra[fidx]
                if not np.any(np.isnan(lm)):
                    ar = asgn.assign_from_landmarks(lm, kidx, 'right', fidx, ev.onset_time)
            al = None
            if fidx < len(la):
                lm = la[fidx]
                if not np.any(np.isnan(lm)):
                    al = asgn.assign_from_landmarks(lm, kidx, 'left', fidx, ev.onset_time)
            cands = [a for a in (ar, al) if a is not None]
            if cands:
                result['assignments'].append(max(cands, key=lambda a: a.confidence))
    except Exception as e:
        result['error'] = str(e)
    return result

In [None]:
NUM_SAMPLES = 5          # Reduced for class project (faster training)
MAX_DURATION_SEC = 60    # Reduced from 120s to 60s per sample

all_results = []
for i, samp in enumerate(train_dataset):
    if i >= NUM_SAMPLES: break
    print(f'Processing {i+1}/{NUM_SAMPLES}: {samp.id} - {samp.metadata["piece"][:40]}')
    res = process_sample_baseline(samp, train_dataset, config, max_duration_sec=MAX_DURATION_SEC)
    if res['error']:
        print(f'  Error: {res["error"][:100]}')
    else:
        iou_str = f'IoU={res["iou"]:.3f}' if res['iou'] is not None else 'IoU=N/A'
        print(f'  Assigned {len(res["assignments"])} notes  |  {iou_str}')
    all_results.append(res)

total_assigned = sum(len(r['assignments']) for r in all_results)
ious = [r['iou'] for r in all_results if r['iou'] is not None]
print(f'\nTotal assignments: {total_assigned}')
if ious:
    print(f'Keyboard detection IoU: mean={np.mean(ious):.3f}, min={np.min(ious):.3f}, max={np.max(ious):.3f}')

In [None]:
all_fingers = [a.assigned_finger for r in all_results for a in r['assignments']]
all_hands = [a.hand for r in all_results for a in r['assignments']]
all_confs = [a.confidence for r in all_results for a in r['assignments']]

if all_fingers:
    fig, axes = plt.subplots(1, 3, figsize=(16, 4))
    colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00']
    fc = pd.Series(all_fingers).value_counts().sort_index()
    fc.plot.bar(ax=axes[0], color=[colors[i-1] for i in fc.index])
    axes[0].set_title(f'Finger Distribution (n={len(all_fingers)})')
    pd.Series(all_hands).value_counts().plot.bar(ax=axes[1], color=['coral', 'skyblue'])
    axes[1].set_title('Hand Distribution')
    axes[2].hist(all_confs, bins=30, color='mediumseagreen', edgecolor='white')
    axes[2].axvline(np.mean(all_confs), color='red', ls='--', label=f'mean={np.mean(all_confs):.3f}')
    axes[2].set_title('Confidence Distribution')
    axes[2].legend()
    plt.tight_layout()
    plt.show()

In [None]:
# ── Keyboard Detection IoU Across Samples ──
ious = [(r['sample_id'], r['iou']) for r in all_results if r['iou'] is not None]

if ious:
    labels, values = zip(*ious)
    short_labels = [l[:12] for l in labels]

    fig, ax = plt.subplots(figsize=(14, 4))
    bars = ax.bar(range(len(values)), values, color='steelblue', edgecolor='white')
    ax.axhline(np.mean(values), color='red', ls='--', label=f'Mean IoU = {np.mean(values):.3f}')
    ax.set_xticks(range(len(values)))
    ax.set_xticklabels(short_labels, rotation=45, ha='right', fontsize=8)
    ax.set_ylabel('IoU')
    ax.set_title('Keyboard Auto-Detection IoU vs Corner Annotations')
    ax.set_ylim(0, 1.05)
    ax.legend()
    plt.tight_layout()
    plt.show()
else:
    print('No IoU data available')

---
<a id='7'></a>
## 7. Stage 5 - Neural Refinement (BiLSTM)

Architecture: Input(20) -> Linear(128) -> BiLSTM(128 x 2 layers) -> Self-Attention -> Linear(128) -> Linear(5)

In [None]:
print('Preparing training sequences from baseline assignments ...')

MAX_TRAIN_SAMPLES = 20
train_sequences = []
train_ds_full = PianoVAMDataset(split='train', streaming=True, max_samples=MAX_TRAIN_SAMPLES)

for i, samp in enumerate(train_ds_full):
    if i >= MAX_TRAIN_SAMPLES: break
    res = process_sample_baseline(samp, train_ds_full, config, max_duration_sec=60)
    asgns = res['assignments']
    if len(asgns) < 10: continue
    seq = {
        'pitches': [a.midi_pitch for a in asgns],
        'fingers': [a.assigned_finger for a in asgns],
        'onsets': [a.note_onset for a in asgns],
        'hands': [a.hand for a in asgns],
        'labels': [a.assigned_finger for a in asgns],  # pseudo-labels: baseline output, not ground-truth fingering
    }
    train_sequences.append(seq)

print(f'Training sequences: {len(train_sequences)}')
print(f'Total notes: {sum(len(s["pitches"]) for s in train_sequences)}')

In [None]:
feature_extractor = FeatureExtractor(normalize_pitch=True)
input_size = feature_extractor.get_input_size()

trained_model = None
if len(train_sequences) > 2:
    split_idx = max(1, int(0.8 * len(train_sequences)))
    train_seqs = train_sequences[:split_idx]
    val_seqs = train_sequences[split_idx:]

    train_torch_ds = SequenceDataset(train_seqs, feature_extractor, max_len=256)
    val_torch_ds = SequenceDataset(val_seqs, feature_extractor, max_len=256)

    model = FingeringRefiner(
        input_size=input_size,
        hidden_size=config.refinement.hidden_size,
        num_layers=config.refinement.num_layers,
        dropout=config.refinement.dropout,
        bidirectional=config.refinement.bidirectional
    ).to(DEVICE)

    print(f'Model parameters: {sum(p.numel() for p in model.parameters()):,}')
    print(model)

    training_config = {
        'hidden_size': config.refinement.hidden_size,
        'num_layers': config.refinement.num_layers,
        'dropout': config.refinement.dropout,
        'batch_size': min(config.refinement.batch_size, len(train_torch_ds)),
        'learning_rate': config.refinement.learning_rate,
        'epochs': config.refinement.epochs,
        'early_stopping_patience': config.refinement.early_stopping_patience,
        'device': DEVICE,
        'checkpoint_dir': '/content/checkpoints' if IN_COLAB else './outputs/checkpoints'
    }

    print('\nTraining BiLSTM refinement model ...')
    trained_model = train_refiner(
        train_dataset=train_torch_ds,
        val_dataset=val_torch_ds if len(val_torch_ds) > 0 else None,
        config=training_config
    )
    print('Training complete')
else:
    print('Not enough data for training')

In [None]:
def refine_assignments(model, assignments, feature_extractor, device='cpu', use_constraints=True):
    if not assignments or model is None:
        return assignments

    pitches = [a.midi_pitch for a in assignments]
    fingers = [a.assigned_finger for a in assignments]
    onsets = [a.note_onset for a in assignments]
    hands = [a.hand for a in assignments]

    x = feature_extractor.extract(pitches, fingers, onsets, hands)
    x = x.unsqueeze(0).to(device)

    model.eval()
    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=-1).squeeze(0).cpu().numpy()

    if use_constraints:
        decoded = constrained_viterbi_decode(
            probs=probs, pitches=pitches, hands=hands,
            constraints=BiomechanicalConstraints(strict=False)
        )
        pred_fingers = decoded.fingers
    else:
        pred_fingers = (np.argmax(probs, axis=-1) + 1).tolist()

    confs = [float(probs[i, f - 1]) for i, f in enumerate(pred_fingers)]

    return [FingerAssignment(
        note_onset=a.note_onset, frame_idx=a.frame_idx, midi_pitch=a.midi_pitch,
        key_idx=a.key_idx, assigned_finger=int(pred_fingers[i]), hand=a.hand,
        confidence=float(confs[i]), fingertip_position=a.fingertip_position
    ) for i, a in enumerate(assignments)]


if trained_model is not None and all_results:
    print('Refining baseline predictions ...')
    for res in all_results:
        if res['assignments']:
            original = res['assignments']
            refined = refine_assignments(trained_model, original, feature_extractor, DEVICE)
            res['refined_assignments'] = refined
            changed = sum(1 for o, r in zip(original, refined) if o.assigned_finger != r.assigned_finger)
            print(f'  {res["sample_id"]}: {changed}/{len(original)} changed')
    print('Refinement done')

---
<a id='8'></a>
## 8. Evaluation & Results

In [None]:
metrics = FingeringMetrics()
constraints = BiomechanicalConstraints()

print('=' * 70)
print('EVALUATION RESULTS')
print('=' * 70)

baseline_ifrs = []
refined_ifrs = []

for res in all_results:
    if not res['assignments']: continue
    asgns = res['assignments']
    pitches = [a.midi_pitch for a in asgns]
    fingers = [a.assigned_finger for a in asgns]
    hl = [a.hand for a in asgns]

    violations = constraints.validate_sequence(fingers, pitches, hl)
    ifr = len(violations) / max(1, len(asgns) - 1)
    baseline_ifrs.append(ifr)
    mc = np.mean([a.confidence for a in asgns])

    msg = f'  {res["sample_id"]} - {len(asgns)} notes | Baseline IFR={ifr:.3f} | conf={mc:.3f}'

    if 'refined_assignments' in res:
        ref = res['refined_assignments']
        rf = [a.assigned_finger for a in ref]
        rv = constraints.validate_sequence(rf, pitches, hl)
        ri = len(rv) / max(1, len(ref) - 1)
        refined_ifrs.append(ri)
        msg += f' | Refined IFR={ri:.3f}'

    print(msg)

print('\n' + '=' * 70)
if baseline_ifrs:
    print(f'BASELINE Mean IFR: {np.mean(baseline_ifrs):.3f} +/- {np.std(baseline_ifrs):.3f}')
if refined_ifrs:
    print(f'REFINED  Mean IFR: {np.mean(refined_ifrs):.3f} +/- {np.std(refined_ifrs):.3f}')
    imp = np.mean(baseline_ifrs) - np.mean(refined_ifrs)
    print(f'Improvement: {imp:+.3f}')
print('=' * 70)

In [None]:
# Test set evaluation (full-CV: auto-detection for each test sample)
print('Processing test split ...\n')
test_ds_eval = PianoVAMDataset(split='test', streaming=True, max_samples=5)
test_results = []

for i, samp in enumerate(test_ds_eval):
    print(f'  Test {i+1}: {samp.id}')
    res = process_sample_baseline(samp, test_ds_eval, config)
    if res['error']:
        print(f'    Error: {res["error"][:80]}')
    else:
        n = len(res['assignments'])
        iou_str = f'IoU={res["iou"]:.3f}' if res['iou'] is not None else ''
        if trained_model is not None and n > 0:
            res['refined_assignments'] = refine_assignments(
                trained_model, res['assignments'], feature_extractor, DEVICE)
        print(f'    {n} notes assigned  {iou_str}')
    test_results.append(res)

print('\n' + '=' * 70)
print('TEST SET RESULTS')
print('=' * 70)

test_baseline_ifrs = []
test_refined_ifrs = []

for res in test_results:
    if not res['assignments']: continue
    asgns = res['assignments']
    pitches = [a.midi_pitch for a in asgns]
    fingers = [a.assigned_finger for a in asgns]
    hl = [a.hand for a in asgns]

    viols = constraints.validate_sequence(fingers, pitches, hl)
    ifr = len(viols) / max(1, len(asgns) - 1)
    test_baseline_ifrs.append(ifr)

    msg = f'  {res["sample_id"]} - {len(asgns)} notes | Baseline IFR={ifr:.3f}'

    if 'refined_assignments' in res:
        ref = res['refined_assignments']
        rf = [a.assigned_finger for a in ref]
        rv = constraints.validate_sequence(rf, pitches, hl)
        ri = len(rv) / max(1, len(ref) - 1)
        test_refined_ifrs.append(ri)
        msg += f' | Refined IFR={ri:.3f}'
    print(msg)

if test_baseline_ifrs:
    print(f'\nTEST Baseline Mean IFR: {np.mean(test_baseline_ifrs):.3f}')
if test_refined_ifrs:
    print(f'TEST Refined  Mean IFR: {np.mean(test_refined_ifrs):.3f}')

In [None]:
# Final summary figure
if baseline_ifrs:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    x = np.arange(len(baseline_ifrs))
    w = 0.35
    axes[0].bar(x - w/2, baseline_ifrs, w, label='Baseline', color='steelblue')
    if refined_ifrs:
        axes[0].bar(x + w/2, refined_ifrs, w, label='Refined', color='coral')
    axes[0].set_xlabel('Sample')
    axes[0].set_ylabel('IFR (lower = better)')
    axes[0].set_title('IFR Comparison (Train Samples)')
    axes[0].legend()

    if all_confs:
        axes[1].hist(all_confs, bins=30, color='mediumseagreen', edgecolor='white')
        axes[1].axvline(np.mean(all_confs), color='red', ls='--', label=f'mean={np.mean(all_confs):.3f}')
        axes[1].set_title('Confidence Distribution')
        axes[1].legend()

    plt.suptitle('Piano Fingering Detection - Results Summary', fontsize=14)
    plt.tight_layout()
    plt.show()

In [None]:
# Save results
output_dir = Path('/content/outputs' if IN_COLAB else './outputs')
output_dir.mkdir(parents=True, exist_ok=True)

results_summary = {
    'pipeline': 'piano-fingering-detection',
    'baseline_method': 'Gaussian Assignment (x-only, both hands, max-distance gate)',
    'refinement_method': 'BiLSTM + Attention + Constrained Viterbi',
    'test_results': []
}

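# The IFR lists only hold entries for samples that had assignments, so they are
# consumed in order rather than indexed by the position in test_results.
_b_iter, _r_iter = iter(test_baseline_ifrs), iter(test_refined_ifrs)
for res in test_results:
    entry = {'sample_id': res['sample_id'], 'num_assignments': len(res.get('assignments', []))}
    if res.get('assignments'):
        b_val = next(_b_iter, None)
        if b_val is not None:
            entry['baseline_ifr'] = float(b_val)
        if 'refined_assignments' in res:
            r_val = next(_r_iter, None)
            if r_val is not None:
                entry['refined_ifr'] = float(r_val)
    results_summary['test_results'].append(entry)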

if test_baseline_ifrs:
    results_summary['mean_baseline_ifr'] = float(np.mean(test_baseline_ifrs))
if test_refined_ifrs:
    results_summary['mean_refined_ifr'] = float(np.mean(test_refined_ifrs))

with open(output_dir / 'evaluation_results.json', 'w') as f:
    json.dump(results_summary, f, indent=2)

if trained_model is not None:
    torch.save(trained_model.state_dict(), output_dir / 'refinement_model.pt')

print(f'Results saved to {output_dir}')

---
<a id='9'></a>
## 9. Extended Evaluation & Validation

### 9.1 Baseline Comparisons

To validate that our pipeline produces **meaningful** predictions, we compare against trivial baselines.
If our IFR metric is useful, there should be clear separation between intelligent methods and naive strategies.

| Baseline | Strategy |
|---|---|
| **Random** | Assign a random finger (1–5) to every note |
| **Always Finger 3** | Assign middle finger to every note |
| **Pitch-Proportional** | Map the pitch range linearly to fingers 1–5 |
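
As a rough illustration of how IFR separates these strategies, the sketch below computes the rate as violations divided by the number of transitions (`n - 1`), matching the formula used in the evaluation cells. The violation rule `toy_violation` is a hypothetical stand-in invented for this example; it is not the project's `BiomechanicalConstraints`, and it ignores which hand plays the note.

```python
import random


def toy_violation(f1, f2, p1, p2):
    """Hypothetical stretch rule (an assumption, not the project's constraints):
    flag a transition whose pitch gap in semitones exceeds a limit that grows
    with the distance between the two fingers."""
    max_gap = 2 + 4 * abs(f2 - f1)
    return abs(p2 - p1) > max_gap


def ifr(fingers, pitches, violates=toy_violation):
    """Fraction of consecutive note transitions flagged as violations."""
    n = len(fingers)
    if n < 2:
        return 0.0
    bad = sum(
        violates(fingers[i], fingers[i + 1], pitches[i], pitches[i + 1])
        for i in range(n - 1)
    )
    return bad / (n - 1)


def pitch_proportional(pitches):
    """Map the pitch range linearly onto fingers 1-5."""
    lo, hi = min(pitches), max(pitches)
    span = max(1, hi - lo)
    return [max(1, min(5, round(1 + 4 * (p - lo) / span))) for p in pitches]


pitches = [60, 64, 67, 72, 76]           # rising arpeggio-like figure
natural = [1, 2, 3, 4, 5]                # one finger per note
always_3 = [3] * len(pitches)            # 'Always Finger 3' baseline
rng = random.Random(42)
random_fingers = [rng.randint(1, 5) for _ in pitches]

print('natural           :', ifr(natural, pitches))
print('always finger 3   :', ifr(always_3, pitches))
print('pitch-proportional:', ifr(pitch_proportional(pitches), pitches))
print('random            :', ifr(random_fingers, pitches))
```

Under this toy rule the one-finger-per-note fingering has no violations, while repeating the middle finger across gaps of three to five semitones violates every transition.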

In [None]:
# ── 9.1  Baseline Comparisons ──
import random as _random

print('=' * 70)
print('BASELINE COMPARISONS: IFR  (lower = fewer impossible transitions = better)')
print('=' * 70)

method_names = ['Random', 'Always Finger 3', 'Pitch-Proportional',
                'Gaussian Baseline', 'Refined (BiLSTM+Viterbi)']
method_ifrs = {m: [] for m in method_names}

eval_pool = [r for r in (all_results + test_results)
             if r.get('assignments') and len(r['assignments']) >= 2]

for res in eval_pool:
    asgns = res['assignments']
    pitches = [a.midi_pitch for a in asgns]
    hl      = [a.hand       for a in asgns]
    n = len(asgns)

    # 1 - Random
    _random.seed(42)
    rf = [_random.randint(1, 5) for _ in range(n)]
    v = constraints.validate_sequence(rf, pitches, hl)
    method_ifrs['Random'].append(len(v) / (n - 1))

    # 2 - Always finger 3 (middle)
    v = constraints.validate_sequence([3] * n, pitches, hl)
    method_ifrs['Always Finger 3'].append(len(v) / (n - 1))

    # 3 - Pitch-proportional  (map pitch range → fingers 1-5)
    pmin, pmax = min(pitches), max(pitches)
    prange = max(1, pmax - pmin)
    pf = [max(1, min(5, round(1 + 4 * (p - pmin) / prange))) for p in pitches]
    v = constraints.validate_sequence(pf, pitches, hl)
    method_ifrs['Pitch-Proportional'].append(len(v) / (n - 1))

    # 4 - Our Gaussian baseline
    bf = [a.assigned_finger for a in asgns]
    v = constraints.validate_sequence(bf, pitches, hl)
    method_ifrs['Gaussian Baseline'].append(len(v) / (n - 1))

    # 5 - Refined (if available)
    if 'refined_assignments' in res and res['refined_assignments']:
        ref_f = [a.assigned_finger for a in res['refined_assignments']]
        v = constraints.validate_sequence(ref_f, pitches, hl)
        method_ifrs['Refined (BiLSTM+Viterbi)'].append(len(v) / (n - 1))

# ── Print table ──
print(f'\n  {"Method":32s}  {"Mean IFR":>10s}  {"Std":>8s}  {"n":>4s}')
print(f'  {"─"*32}  {"─"*10}  {"─"*8}  {"─"*4}')
for m in method_names:
    vals = method_ifrs[m]
    if vals:
        print(f'  {m:32s}  {np.mean(vals):>10.3f}  {np.std(vals):>8.3f}  {len(vals):>4d}')
    else:
        print(f'  {m:32s}  {"N/A":>10s}')

# ── Bar chart ──
fig, ax = plt.subplots(figsize=(10, 5))
active = [m for m in method_names if method_ifrs[m]]
means  = [np.mean(method_ifrs[m]) for m in active]
stds   = [np.std(method_ifrs[m])  for m in active]
palette = ['#d62728', '#ff7f0e', '#9467bd', '#2ca02c', '#1f77b4']

bars = ax.bar(range(len(active)), means, yerr=stds, capsize=5,
              color=palette[:len(active)], edgecolor='white', linewidth=1.5)
ax.set_xticks(range(len(active)))
ax.set_xticklabels(active, rotation=20, ha='right', fontsize=10)
ax.set_ylabel('IFR  (lower = better)')
ax.set_title('Irrational Fingering Rate: Our Pipeline vs. Trivial Baselines')
ax.set_ylim(0, None)

for bar, m_val in zip(bars, means):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
            f'{m_val:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

# ── Relative improvements ──
if method_ifrs['Gaussian Baseline'] and method_ifrs['Random']:
    base_m = np.mean(method_ifrs['Gaussian Baseline'])
    rand_m = np.mean(method_ifrs['Random'])
    print(f'\nGaussian baseline reduces IFR by {(rand_m - base_m) / rand_m * 100:.1f}% vs random')
if method_ifrs['Refined (BiLSTM+Viterbi)'] and method_ifrs['Gaussian Baseline']:
    ref_m  = np.mean(method_ifrs['Refined (BiLSTM+Viterbi)'])
    base_m = np.mean(method_ifrs['Gaussian Baseline'])
    print(f'BiLSTM+Viterbi further reduces IFR by {(base_m - ref_m) / max(base_m, 1e-6) * 100:.1f}% vs Gaussian baseline')

### 9.2 Hand Detection Validation: Live MediaPipe vs. Dataset Skeletons

PianoVAM provides **pre-extracted hand skeletons** for every video. Our pipeline deliberately uses live MediaPipe detection on the raw video. Here we validate our detections against the dataset's reference skeletons using standard pose-estimation metrics.

| Metric | What It Measures |
|--------|-----------------|
| **Detection Rate** | % of frames with a valid hand detection |
| **PCK** (Percentage of Correct Keypoints) | % of keypoints whose distance to the reference falls below a threshold given in normalised image coordinates (standard pose metric) |
| **Trajectory Correlation** | Pearson *r* of fingertip x-coordinates over time (captures motion-pattern similarity) |
| **Key Agreement** | % of frames where both methods place a fingertip on the **same piano key** |

> **PCK** is a standard evaluation metric in pose estimation (Andriluka et al., 2014). A keypoint with a small spatial offset still counts as *correct* as long as it falls within the threshold.

> **Note**: Differences are expected: the dataset skeletons were extracted with a different MediaPipe version, different detection parameters, and potentially different hand labelling conventions. What matters is that both methods capture the same underlying hand motion.
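
To make the PCK definition concrete, here is a minimal standard-library sketch. The data layout (a list of frames, each a list of `(x, y)` keypoints in normalised coordinates, with `None` marking a missed detection) is an assumption chosen for readability; the validation cell itself works on NumPy arrays with NaN for missing frames. As in that cell, only frames where both sources have a detection are scored.

```python
import math


def pck(pred, ref, threshold):
    """Percentage of Correct Keypoints over mutually detected frames.

    pred, ref: lists of frames; a frame is a list of (x, y) tuples in
               normalised [0, 1] coordinates, or None if the hand was missed.
    A keypoint is correct when its Euclidean distance to the reference
    is below `threshold`.
    """
    correct = total = 0
    for p_frame, r_frame in zip(pred, ref):
        if p_frame is None or r_frame is None:
            continue  # skip frames that are not detected by both methods
        for (px, py), (rx, ry) in zip(p_frame, r_frame):
            total += 1
            if math.hypot(px - rx, py - ry) < threshold:
                correct += 1
    return correct / total if total else float('nan')


# Two mutually detected frames with two keypoints each, plus one missed frame.
ref = [[(0.50, 0.50), (0.50, 0.50)], [(0.50, 0.50), (0.50, 0.50)], None]
pred = [[(0.51, 0.50), (0.53, 0.50)], [(0.56, 0.50), (0.70, 0.50)], [(0.1, 0.1), (0.2, 0.2)]]

for t in (0.02, 0.05, 0.10, 0.15):
    print(f'PCK@{t:.2f} = {pck(pred, ref, t):.2f}')
```

The four scored keypoints are offset by 0.01, 0.03, 0.06 and 0.20, so each looser threshold admits one more keypoint until the 0.20 outlier, which no listed threshold accepts.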

In [None]:
# ── 9.2  Hand Detection Validation ──
print(f'Downloading pre-extracted skeleton for sample {sample.id} ...\n')

try:
    skeleton_data = train_dataset.load_skeleton(sample)
    loader = SkeletonLoader(normalize=False)
    parsed = loader._parse_json(skeleton_data)

    T_live = max(len(left_raw), len(right_raw))
    ds_left_arr  = loader.to_array(parsed.get('left',  []), fill_missing=True, total_frames=T_live)
    ds_right_arr = loader.to_array(parsed.get('right', []), fill_missing=True, total_frames=T_live)

    # ── Normalise coordinates to [0, 1] if stored in pixel space ──
    for tag, arr in [('ds_left', ds_left_arr), ('ds_right', ds_right_arr)]:
        valid = ~np.isnan(arr[:, 0, 0])
        if valid.any():
            xmax = np.nanmax(arr[valid, :, 0])
            ymax = np.nanmax(arr[valid, :, 1])
            if xmax > 2.0 or ymax > 2.0:          # pixel coordinates
                print(f'  {tag}: pixel coords detected (max x={xmax:.0f}), normalising ...')
                arr[:, :, 0] /= FRAME_W
                arr[:, :, 1] /= FRAME_H

    print(f'Shapes  - Dataset  L:{ds_left_arr.shape}  R:{ds_right_arr.shape}')
    print(f'          Live MP  L:{left_raw.shape}  R:{right_raw.shape}')

    # ── Detect hand-label swap ──
    # MediaPipe sometimes mirrors L/R labels. Try both matchings, keep the better one.
    def _corr_tip(a, b, tip=8, min_n=30):
        mn = min(len(a), len(b))
        v = ~np.isnan(a[:mn, tip, 0]) & ~np.isnan(b[:mn, tip, 0])
        if v.sum() < min_n:
            return -1.0
        x1, x2 = a[:mn, tip, 0][v], b[:mn, tip, 0][v]
        if np.std(x1) < 1e-6 or np.std(x2) < 1e-6:
            return -1.0
        return float(np.corrcoef(x1, x2)[0, 1])

    r_normal  = _corr_tip(right_raw, ds_right_arr) + _corr_tip(left_raw, ds_left_arr)
    r_swapped = _corr_tip(right_raw, ds_left_arr)  + _corr_tip(left_raw, ds_right_arr)

    if r_swapped > r_normal + 0.05:
        print(f'\n  Hand labels are SWAPPED between methods; correcting '
              f'(normal r={r_normal:.2f}, swapped r={r_swapped:.2f})')
        ds_left_arr, ds_right_arr = ds_right_arr, ds_left_arr
    else:
        print(f'\n  Hand labels consistent (normal r={r_normal:.2f}, swapped r={r_swapped:.2f})')

    # ── Temporal Alignment ──
    # Find the best frame offset by maximising index-fingertip correlation.
    live_r_rate = LiveHandDetector.detection_rate(right_raw)
    live_l_rate = LiveHandDetector.detection_rate(left_raw)
    best_live = right_raw if live_r_rate >= live_l_rate else left_raw
    best_ds   = ds_right_arr if live_r_rate >= live_l_rate else ds_left_arr
    best_hand_label = 'Right' if live_r_rate >= live_l_rate else 'Left'

    best_offset, best_r = 0, -1.0
    for offset in range(-20, 21):
        if offset >= 0:
            lv = best_live[offset:]
            ds = best_ds[:len(lv)]
        else:
            ds = best_ds[-offset:]
            lv = best_live[:len(ds)]
        mn = min(len(lv), len(ds))
        lv, ds = lv[:mn], ds[:mn]
        v = ~np.isnan(lv[:, 8, 0]) & ~np.isnan(ds[:, 8, 0])
        if v.sum() < 30:
            continue
        x1, x2 = lv[v, 8, 0], ds[v, 8, 0]
        if np.std(x1) < 1e-6 or np.std(x2) < 1e-6:
            continue
        r = float(np.corrcoef(x1, x2)[0, 1])
        if r > best_r:
            best_r, best_offset = r, offset

    print(f'  Temporal alignment: offset = {best_offset} frames  (peak r = {best_r:.3f})')

    def _align(a, b, off):
        if off > 0:    a2, b2 = a[off:], b[:max(0, len(a)-off)]
        elif off < 0:  b2, a2 = b[-off:], a[:max(0, len(b)+off)]
        else:           a2, b2 = a, b
        mn = min(len(a2), len(b2))
        return a2[:mn], b2[:mn]

    live_L_al, ds_L_al = _align(left_raw,  ds_left_arr,  best_offset)
    live_R_al, ds_R_al = _align(right_raw, ds_right_arr, best_offset)

    # ── Detection Rate ──
    print(f'\n{"─"*55}')
    print(f'  Detection Rates (after alignment)')
    print(f'  {"Hand":8s}  {"Dataset":>10s}  {"Live MP":>10s}')
    for label, ds_a, lv_a in [('Left', ds_L_al, live_L_al), ('Right', ds_R_al, live_R_al)]:
        dr_ds = float(np.mean(~np.isnan(ds_a[:, 0, 0]))) if ds_a.size else 0
        dr_lv = float(np.mean(~np.isnan(lv_a[:, 0, 0]))) if lv_a.size else 0
        print(f'  {label:8s}  {dr_ds:>9.1%}  {dr_lv:>9.1%}')

    # ── PCK (Percentage of Correct Keypoints) ──
    # Thresholds are distances in normalised image coordinates (fraction of frame size).
    thresholds = [0.02, 0.05, 0.10, 0.15]
    print(f'\n{"─"*55}')
    print(f'  PCK: Percentage of Correct Keypoints')
    for label, lv_a, ds_a in [('Right', live_R_al, ds_R_al), ('Left', live_L_al, ds_L_al)]:
        mn = min(len(lv_a), len(ds_a))
        valid = ~np.isnan(lv_a[:mn, 0, 0]) & ~np.isnan(ds_a[:mn, 0, 0])
        n_mutual = int(valid.sum())
        if n_mutual < 5:
            print(f'  {label}: insufficient mutual frames ({n_mutual})')
            continue
        p, g = lv_a[:mn][valid], ds_a[:mn][valid]
        dist = np.sqrt(np.sum((p[:, :, :2] - g[:, :, :2]) ** 2, axis=-1))
        print(f'  {label} hand  ({n_mutual} mutual frames):')
        for t in thresholds:
            pck = float((dist < t).mean())
            bar = '█' * int(pck * 40)
            print(f'    PCK@{t:.2f} = {pck:6.1%}  {bar}')

    # ── Fingertip Trajectory Correlation ──
    tip_map = {4: 'Thumb', 8: 'Index', 12: 'Middle', 16: 'Ring', 20: 'Pinky'}
    print(f'\n{"─"*55}')
    print(f'  Fingertip X-Trajectory Pearson Correlation')
    all_corrs = []
    for label, lv_a, ds_a in [('Right', live_R_al, ds_R_al), ('Left', live_L_al, ds_L_al)]:
        mn = min(len(lv_a), len(ds_a))
        valid = ~np.isnan(lv_a[:mn, 0, 0]) & ~np.isnan(ds_a[:mn, 0, 0])
        if valid.sum() < 10:
            continue
        for tidx, tname in tip_map.items():
            x1 = lv_a[:mn, tidx, 0][valid]
            x2 = ds_a[:mn, tidx, 0][valid]
            if np.std(x1) < 1e-6 or np.std(x2) < 1e-6:
                continue
            r = float(np.corrcoef(x1, x2)[0, 1])
            all_corrs.append(r)
            print(f'    {label:5s} {tname:7s}:  r = {r:.3f}')

    if all_corrs:
        mean_r = np.mean(all_corrs)
        strength = 'strong' if mean_r > 0.7 else 'moderate' if mean_r > 0.4 else 'weak'
        print(f'\n    Mean correlation:  r = {mean_r:.3f}  ({strength} agreement)')

    # ── Key Agreement ──
    # Do both methods place each fingertip over the same piano key?
    kc_items = sorted(assigner.key_centers.items(), key=lambda kv: kv[1][0])
    kc_ids = [k for k, _ in kc_items]
    kc_xs  = np.array([cx for _, (cx, _) in kc_items])

    def _nearest_key(x_norm, fw=FRAME_W):
        return kc_ids[np.argmin(np.abs(kc_xs - x_norm * fw))]

    print(f'\n{"─"*55}')
    print(f'  Key Agreement (same piano key?)')
    for label, lv_a, ds_a in [('Right', live_R_al, ds_R_al), ('Left', live_L_al, ds_L_al)]:
        mn = min(len(lv_a), len(ds_a))
        valid = ~np.isnan(lv_a[:mn, 0, 0]) & ~np.isnan(ds_a[:mn, 0, 0])
        if valid.sum() < 5:
            continue
        total_t, agree_t = 0, 0
        for tidx in [4, 8, 12, 16, 20]:
            idxs = np.where(valid)[0]
            k_lv = [_nearest_key(lv_a[f, tidx, 0]) for f in idxs]
            k_ds = [_nearest_key(ds_a[f, tidx, 0]) for f in idxs]
            matches = sum(a == b for a, b in zip(k_lv, k_ds))
            total_t += len(k_lv)
            agree_t += matches
        pct = agree_t / total_t * 100 if total_t else 0
        print(f'    {label} hand: {pct:.1f}% key agreement  ({agree_t}/{total_t} fingertip-frames)')

    # ── Trajectory Visualisation ──
    live_vis = live_R_al if best_hand_label == 'Right' else live_L_al
    ds_vis   = ds_R_al   if best_hand_label == 'Right' else ds_L_al
    T_vis    = min(3000, min(len(live_vis), len(ds_vis)))

    fig, axes = plt.subplots(3, 1, figsize=(16, 10), sharex=True)
    for ax_i, (tidx, tname) in enumerate([(8, 'Index'), (4, 'Thumb'), (12, 'Middle')]):
        axes[ax_i].plot(ds_vis[:T_vis, tidx, 0],   alpha=0.7, lw=0.8,
                        color='tab:blue',   label='Dataset skeleton')
        axes[ax_i].plot(live_vis[:T_vis, tidx, 0], alpha=0.7, lw=0.8,
                        color='tab:orange', label='Live MediaPipe')
        axes[ax_i].set_ylabel('X (norm)')
        axes[ax_i].set_title(f'{best_hand_label} Hand: {tname} Fingertip')
        axes[ax_i].legend(loc='upper right', fontsize=8)

    axes[-1].set_xlabel('Frame')
    plt.suptitle('Trajectory Comparison: Live MediaPipe vs. Dataset Skeletons (aligned)',
                 fontsize=14)
    plt.tight_layout()
    plt.show()

    # ── Summary PCK bar chart ──
    fig, ax = plt.subplots(figsize=(8, 4))
    pck_vals = []
    for label, lv_a, ds_a in [('Right', live_R_al, ds_R_al), ('Left', live_L_al, ds_L_al)]:
        mn = min(len(lv_a), len(ds_a))
        v = ~np.isnan(lv_a[:mn, 0, 0]) & ~np.isnan(ds_a[:mn, 0, 0])
        if v.sum() < 5:
            continue
        p2, g2 = lv_a[:mn][v], ds_a[:mn][v]
        d2 = np.sqrt(np.sum((p2[:, :, :2] - g2[:, :, :2]) ** 2, axis=-1))
        for t in thresholds:
            pck_vals.append({'Hand': label, 'Threshold': f'@{t:.2f}', 'PCK': float((d2 < t).mean())})

    if pck_vals:
        pck_df = pd.DataFrame(pck_vals)
        for hi, hand in enumerate(pck_df['Hand'].unique()):
            sub = pck_df[pck_df['Hand'] == hand]
            x_pos = np.arange(len(sub)) + hi * 0.35
            ax.bar(x_pos, sub['PCK'], 0.3, label=f'{hand} hand',
                   color=['steelblue', 'coral'][hi], edgecolor='white')
            for xp, yp in zip(x_pos, sub['PCK']):
                ax.text(xp, yp + 0.01, f'{yp:.0%}', ha='center', fontsize=9, fontweight='bold')
        ax.set_xticks(np.arange(len(thresholds)) + 0.15)
        ax.set_xticklabels([f'PCK@{t:.2f}' for t in thresholds])
        ax.set_ylabel('PCK')
        ax.set_ylim(0, 1.1)
        ax.set_title('Percentage of Correct Keypoints: Live MP vs. Dataset Skeletons')
        ax.legend()
        plt.tight_layout()
        plt.show()
    else:
        plt.close(fig)   # no PCK data: discard the empty figure instead of displaying it

except Exception as e:
    import traceback
    print(f'⚠️  Skeleton comparison failed: {e}')
    traceback.print_exc()
    print('\nThe pipeline does NOT depend on pre-extracted skeletons; this is validation only.')
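
The temporal-alignment step in the cell above amounts to a brute-force lag search. The sketch below shows the same idea on synthetic 1-D trajectories. `best_lag`, its parameters, and the random-walk test signal are illustrative assumptions and not part of the pipeline.

```python
import numpy as np

def best_lag(a, b, max_lag=20, min_overlap=30):
    """Return (lag, r) such that a[t + lag] correlates best with b[t].

    a, b: 1-D trajectories (e.g. fingertip x over time); NaN marks missing frames.
    """
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        # Shift one series against the other, then trim to the common length.
        if lag >= 0:
            x, y = a[lag:], b
        else:
            x, y = a, b[-lag:]
        n = min(len(x), len(y))
        x, y = x[:n], y[:n]
        ok = ~np.isnan(x) & ~np.isnan(y)          # frames valid in both series
        if ok.sum() < min_overlap:
            continue
        if np.std(x[ok]) < 1e-6 or np.std(y[ok]) < 1e-6:
            continue                              # correlation undefined for flat signals
        r = float(np.corrcoef(x[ok], y[ok])[0, 1])
        if r > best[1]:
            best = (lag, r)
    return best

# Synthetic check: b is a copy of a that starts 7 frames later.
rng = np.random.default_rng(0)
s = np.cumsum(rng.normal(size=320))               # random-walk "trajectory"
a, b = s[:300].copy(), s[7:307].copy()
a[50:55] = np.nan                                 # simulate a few missed detections
lag, r = best_lag(a, b)
print(lag, round(r, 3))
```

Because `b[t]` equals `a[t + 7]` by construction, the search should recover a lag of 7 with a correlation very close to 1.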

### 9.3 Ablation Study

We evaluate the contribution of **each pipeline component** by enabling the stages one at a time, in the order listed below, and measuring the IFR after each addition. The study runs on the demo sample, whose landmarks are already cached.

| Config | Active Components |
|--------|------------------|
| **A** | Raw MediaPipe landmarks + Gaussian assignment |
| **B** | + Temporal filtering (Hampel / Interp / SavGol) |
| **C** | + Max-distance gate (4σ rejection) |
| **D** | + BiLSTM neural refinement |
| **E** | + Constrained Viterbi decoding |
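
To make the difference between configurations B and C concrete, the sketch below applies the Gaussian x-distance rule `exp(−dx²/2σ²)` with an optional 4σ gate. It is a simplified stand-in, not `GaussianFingerAssigner` itself: the function name, the pixel values, and the choice to gate on the nearest fingertip are assumptions made for illustration.

```python
import numpy as np

def assign_finger(tip_xs, key_x, sigma, gate_sigmas=4.0):
    """Pick the finger (1..5) whose tip is horizontally closest to the key.

    tip_xs: x-coordinates in pixels of the five fingertips, thumb to pinky.
    key_x:  x-coordinate in pixels of the pressed key's centre.
    Returns (finger, confidence), or None if the gate rejects the hand.
    Pass gate_sigmas=None to disable the gate (configs A and B).
    """
    dx = np.asarray(tip_xs, dtype=float) - key_x
    # Assumed gate: reject when even the nearest fingertip is too far away.
    if gate_sigmas is not None and np.min(np.abs(dx)) > gate_sigmas * sigma:
        return None
    probs = np.exp(-dx ** 2 / (2.0 * sigma ** 2))   # Gaussian on x-distance only
    best = int(np.argmax(probs))
    return best + 1, float(probs[best])

tips = [100, 120, 140, 160, 180]                    # hypothetical fingertip positions
near = assign_finger(tips, key_x=141, sigma=10)     # key under the middle finger
far_gated = assign_finger(tips, key_x=400, sigma=10)
far_open = assign_finger(tips, key_x=400, sigma=10, gate_sigmas=None)
print(near, far_gated, far_open)
```

Without the gate, the far-away key is still assigned to the nearest finger, only with a vanishingly small confidence. With the gate, that note is dropped, which is why configuration C can report fewer notes than B.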

In [None]:
# ── 9.3  Ablation Study ──
print('Ablation Study: Contribution of Each Pipeline Component')
print(f'Sample: {sample.id}\n')

# Scale raw (unfiltered) landmarks from normalised coordinates to pixel space
left_raw_px = left_raw.copy()
left_raw_px[:, :, 0] *= FRAME_W
left_raw_px[:, :, 1] *= FRAME_H
right_raw_px = right_raw.copy()
right_raw_px[:, :, 0] *= FRAME_W
right_raw_px[:, :, 1] *= FRAME_H

# right_px / left_px were defined earlier (filtered + scaled to pixel space)

ablation_configs = [
    # (name, right_arr, left_arr, disable_gate, do_refine, use_viterbi)
    ('A: Raw + Gaussian',                     right_raw_px, left_raw_px, True,  False, False),
    ('B: + Temporal filtering',               right_px,     left_px,     True,  False, False),
    ('C: + Max-distance gate (4sig)',         right_px,     left_px,     False, False, False),
    ('D: + BiLSTM refinement',                right_px,     left_px,     False, True,  False),
    ('E: + Constrained Viterbi',              right_px,     left_px,     False, True,  True),
]

ablation_results = []

for name, r_arr, l_arr, disable_gate, do_refine, use_viterbi in ablation_configs:
    # Build assigner
    a = GaussianFingerAssigner(
        key_boundaries=key_boundaries_px,
        sigma=config.assignment.sigma,
        candidate_range=config.assignment.candidate_keys
    )
    if disable_gate:
        a.max_distance_px = 1e9        # effectively no rejection

    # Assignment pass
    asgns_abl = []
    for ev in synced_events:
        fidx, kidx = ev.frame_idx, ev.key_idx
        if kidx not in a.key_centers:
            continue
        ar, al = None, None
        if fidx < len(r_arr):
            lm = r_arr[fidx]
            if not np.any(np.isnan(lm)):
                ar = a.assign_from_landmarks(lm, kidx, 'right', fidx, ev.onset_time)
        if fidx < len(l_arr):
            lm = l_arr[fidx]
            if not np.any(np.isnan(lm)):
                al = a.assign_from_landmarks(lm, kidx, 'left', fidx, ev.onset_time)
        cands = [x for x in (ar, al) if x is not None]
        if cands:
            asgns_abl.append(max(cands, key=lambda x: x.confidence))

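    # Optionally refine (configs D and E); warn if no model is available so that
    # unrefined numbers are not mistaken for refined ones
    if do_refine and trained_model is None:
        print(f'  [{name}] trained_model is None: reporting unrefined assignments')
    elif do_refine and len(asgns_abl) > 0:
        asgns_abl = refine_assignments(
            trained_model, asgns_abl, feature_extractor, DEVICE,
            use_constraints=use_viterbi
        )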

    # IFR = constraint violations / consecutive-note transitions (N - 1)
    if len(asgns_abl) >= 2:
        ps  = [x.midi_pitch      for x in asgns_abl]
        fs  = [x.assigned_finger for x in asgns_abl]
        hs  = [x.hand            for x in asgns_abl]
        viols = constraints.validate_sequence(fs, ps, hs)
        ifr = len(viols) / (len(asgns_abl) - 1)
    else:
        ifr = float('nan')

    ablation_results.append((name, len(asgns_abl), ifr))
    print(f'  {name:42s}  notes={len(asgns_abl):>5d}  IFR={ifr:.4f}')

# ── Ablation bar chart ──
fig, ax = plt.subplots(figsize=(11, 5))
labels_abl = [r[0] for r in ablation_results]
ifrs_abl   = [r[2] for r in ablation_results]
colors_abl = ['#d62728', '#ff7f0e', '#2ca02c', '#1f77b4', '#9467bd']

bars = ax.bar(range(len(ifrs_abl)), ifrs_abl,
              color=colors_abl[:len(ifrs_abl)], edgecolor='white', linewidth=1.5)
ax.set_xticks(range(len(ifrs_abl)))
ax.set_xticklabels(labels_abl, rotation=30, ha='right', fontsize=9)
ax.set_ylabel('IFR  (lower = better)')
ax.set_title('Ablation Study: Effect of Each Pipeline Component')

for bar, v in zip(bars, ifrs_abl):
    if not np.isnan(v):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.002,
                f'{v:.4f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

# ── Summary ──
if len(ablation_results) >= 3:
    a_ifr, b_ifr, c_ifr = [r[2] for r in ablation_results[:3]]
    print(f'\nComponent contributions (IFR reduction):')
    if a_ifr > 0:
        print(f'  Temporal filtering:     {a_ifr:.4f} → {b_ifr:.4f}  '
              f'({(a_ifr - b_ifr) / a_ifr * 100:+.1f}%)')
        print(f'  Max-distance gate:      {b_ifr:.4f} → {c_ifr:.4f}  '
              f'({(b_ifr - c_ifr) / max(b_ifr, 1e-6) * 100:+.1f}%)')
    if len(ablation_results) >= 5:
        e_ifr = ablation_results[4][2]
        if not np.isnan(e_ifr) and a_ifr > 0:
            print(f'  Full pipeline (A→E):    {a_ifr:.4f} → {e_ifr:.4f}  '
                  f'({(a_ifr - e_ifr) / a_ifr * 100:+.1f}% total)')
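    # Coverage: compare B and C, which differ only in the gate
    # (A uses raw landmarks, so A vs. C would also mix in the effect of filtering).
    b_n, c_n = ablation_results[1][1], ablation_results[2][1]
    if b_n > 0 and b_n != c_n:
        print(f'\n  Note: Max-distance gate reduced coverage from {b_n} to {c_n} notes '
              f'({(b_n - c_n) / b_n * 100:.1f}% rejected as too far)')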

### 9.4 Qualitative Analysis

We overlay fingertip positions and predicted finger labels on up to four video frames, evenly spaced among the frames that contain at least one assignment, to assess the system visually.

- **Coloured dots** mark each fingertip (1=red, 2=green, 3=blue, 4=purple, 5=cyan).
- **Green rectangle** shows the auto-detected keyboard boundary.
- Each panel title lists the hand-and-finger label (e.g. `R2`) of up to four notes assigned at that frame, and each assigned key is outlined in the colour of its finger.
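
The overlay relies on two small conversions: normalised landmark coordinates to pixel positions, and OpenCV's BGR colour order. The sketch below illustrates both with NumPy only. The helper names and the frame size are hypothetical; the fingertip indices (4, 8, 12, 16, 20) are the standard MediaPipe hand landmarks.

```python
import numpy as np

# MediaPipe landmark index of each fingertip, keyed by finger number (1 = thumb).
TIP_INDEX = {1: 4, 2: 8, 3: 12, 4: 16, 5: 20}

# Marker colours in OpenCV's BGR order, chosen to match the legend above:
# 1 = red, 2 = green, 3 = blue, 4 = purple, 5 = cyan.
FINGER_BGR = {1: (0, 0, 255), 2: (0, 200, 0), 3: (255, 0, 0),
              4: (200, 0, 200), 5: (200, 200, 0)}

def fingertip_pixels(landmarks, width, height):
    """Map a (21, 3) array of normalised landmarks to {finger: (px, py)}.

    Fingertips whose coordinates are NaN (not detected) are skipped.
    """
    pixels = {}
    for finger, idx in TIP_INDEX.items():
        x, y = landmarks[idx, 0], landmarks[idx, 1]
        if np.isnan(x) or np.isnan(y):
            continue
        pixels[finger] = (int(x * width), int(y * height))
    return pixels

def bgr_to_rgb(colour):
    """Reverse channel order, e.g. to check a BGR tuple against an RGB colour name."""
    return tuple(colour[::-1])

lm = np.full((21, 3), np.nan)          # a hand with only the index tip detected
lm[8] = [0.5, 0.25, 0.0]
tips = fingertip_pixels(lm, 1920, 1080)
print(tips, bgr_to_rgb(FINGER_BGR[5]))
```

Checking colours through `bgr_to_rgb` is a cheap way to catch channel-order mistakes before drawing, since a tuple that looks cyan in RGB renders as yellow when passed to OpenCV unchanged.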

In [None]:
# ── 9.4  Qualitative Analysis ──
vp = VideoProcessor()
vp.open(video_path)

# OpenCV colours are BGR: cyan is (200,200,0); (0,200,200) would render as yellow
finger_colors_bgr = {1: (0,0,255), 2: (0,200,0), 3: (255,0,0), 4: (200,0,200), 5: (200,200,0)}
finger_names_q     = {1: 'Thumb', 2: 'Index', 3: 'Middle', 4: 'Ring', 5: 'Pinky'}

# Select 4 frames that have active assignments
assigned_frames = sorted(set(a.frame_idx for a in assignments))
n_qual = min(4, len(assigned_frames))
qual_idxs = [assigned_frames[int(i)]
             for i in np.linspace(0, len(assigned_frames) - 1, n_qual)]

fig, axes = plt.subplots(2, 2, figsize=(18, 12))
axes = axes.ravel()
for ax in axes:
    ax.axis('off')   # hide panels that end up without a frame

for i, fidx in enumerate(qual_idxs):
    frame_bgr = vp.get_frame(int(fidx))
    if frame_bgr is None:
        continue
    vis = frame_bgr.copy()
    h, w = vis.shape[:2]

    # Draw keyboard boundary
    bx1, by1, bx2, by2 = map(int, keyboard_region.bbox)   # cv2 needs integer pixel coordinates
    cv2.rectangle(vis, (bx1, by1), (bx2, by2), (0, 255, 0), 2)

    # Draw hand skeletons + fingertip markers
    connections = [
        (0,1),(1,2),(2,3),(3,4),(0,5),(5,6),(6,7),(7,8),
        (5,9),(9,10),(10,11),(11,12),(9,13),(13,14),(14,15),(15,16),
        (13,17),(17,18),(18,19),(19,20),(0,17)
    ]
    for hand_key, arr, col_base in [('right', right_filtered, (0,255,0)),
                                     ('left',  left_filtered,  (255,200,0))]:
        fidx_int = int(fidx)
        if fidx_int >= len(arr) or np.any(np.isnan(arr[fidx_int])):
            continue
        lm = arr[fidx_int]
        for c1, c2 in connections:
            p1 = (int(lm[c1, 0] * w), int(lm[c1, 1] * h))
            p2 = (int(lm[c2, 0] * w), int(lm[c2, 1] * h))
            cv2.line(vis, p1, p2, col_base, 1)
        for f_num, tip_idx in [(1,4),(2,8),(3,12),(4,16),(5,20)]:
            px = int(lm[tip_idx, 0] * w)
            py = int(lm[tip_idx, 1] * h)
            cv2.circle(vis, (px, py), 7, finger_colors_bgr[f_num], -1)
            cv2.circle(vis, (px, py), 7, (255,255,255), 1)
            cv2.putText(vis, str(f_num), (px+9, py-4),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.45, (255,255,255), 1)

    # Highlight assigned keys
    frame_asgns = [a for a in assignments if a.frame_idx == int(fidx)]
    for a in frame_asgns:
        kb = key_boundaries_px.get(a.key_idx)
        if kb:
            kx1, ky1, kx2, ky2 = kb
            cv2.rectangle(vis, (int(kx1), int(ky1)), (int(kx2), int(ky2)),
                          finger_colors_bgr.get(a.assigned_finger, (255,255,0)), 2)

    vis_rgb = cv2.cvtColor(vis, cv2.COLOR_BGR2RGB)
    axes[i].imshow(vis_rgb)
    title = f'Frame {int(fidx)}: {len(frame_asgns)} note(s)'
    if frame_asgns:
        labels = [f'{a.hand[0].upper()}{a.assigned_finger}' for a in frame_asgns[:4]]
        title += '\n' + '  '.join(labels) + ('...' if len(frame_asgns) > 4 else '')
    axes[i].set_title(title, fontsize=10)
    axes[i].axis('off')

vp.close()
plt.suptitle('Qualitative: Finger-Key Assignments on Video Frames\n'
             '(dots: 1=red 2=green 3=blue 4=purple 5=cyan  |  '
             'green rect = keyboard  |  coloured key rect = assigned key)',
             fontsize=12)
plt.tight_layout()
plt.show()

# ── Confidence Analysis ──
if assignments:
    low_conf  = [a for a in assignments if a.confidence < 0.3]
    high_conf = [a for a in assignments if a.confidence > 0.8]
    print(f'Confidence breakdown ({len(assignments)} total assignments):')
    print(f'  High confidence  (>0.8) : {len(high_conf):>5d}  '
          f'({len(high_conf)/len(assignments)*100:.1f}%)')
    print(f'  Low confidence   (<0.3) : {len(low_conf):>5d}  '
          f'({len(low_conf)/len(assignments)*100:.1f}%)')

    # Error analysis: when does the system struggle?
    if low_conf:
        print(f'\n  Low-confidence examples (potential failure cases):')
        for a in low_conf[:5]:
            print(f'    Frame {a.frame_idx:>5d}  MIDI {a.midi_pitch}  '
                  f'{a.hand:5s} {finger_names_q[a.assigned_finger]:6s}  '
                  f'conf={a.confidence:.3f}')
        print(f'\n  Common causes of low confidence:')
        print(f'    - Hand far from key (weak Gaussian signal)')
        print(f'    - Fast motion / motion blur (noisy landmarks)')
        print(f'    - Hand occlusion (fingers overlapping)')