# 📊 H5 Dataset Inspector & Visualization Tool

## Overview

This notebook provides an interactive interface to explore the preprocessed datasets stored in **HDF5 format**. It is designed to help developers and researchers quickly verify the integrity of the data, visualize time-series patterns, and analyze statistical properties of individual samples.

### Key Features

  * **Automatic File Discovery**: Scans the `data/cache` directory for available `.h5` splits.
  * **Time-Series Visualization**: Displays both a global heatmap and detailed line charts for the five highest-variance features.
  * **Padding Handling**: Masks out time steps where every feature is zero, so that zero-padding does not distort per-sample statistics (e.g., mean, variance).
  * **Interactive UI**: Powered by `ipywidgets` for real-time navigation without restarting the kernel.


In [None]:
# 1. Environment Setup & Imports
# ------------------------------------------------------------------
import h5py
import json
import random
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from IPython.display import display, clear_output, HTML

# Configure Visualization Style
sns.set_theme(style="whitegrid")
plt.rcParams['axes.unicode_minus'] = False  # Fix for minus sign display

# 📂 Path Configuration
# Adjust ROOT_DIR if your project structure differs.
ROOT_DIR = Path.cwd().parent
CACHE_DIR = ROOT_DIR / "data" / "cache"

print("✅ Environment Setup Complete.")
print(f"📂 Target Data Directory: {CACHE_DIR.absolute()}")

## 🛠️ Data Loader Class (`H5Inspector`)

The `H5Inspector` class is the core component responsible for interacting with the file system. It handles:
1.  **Metadata Loading**: Reads `feature_info.json` to map numerical IDs to human-readable feature names.
2.  **File Management**: Keeps a single `.h5` file open at a time, closing the previous split before opening the next.
3.  **Data Extraction**: Retrieves specific samples by index.
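
For reference, the layout the class expects can be inferred from the keys it reads: datasets `X_num`, `y`, and `sid` in each split file, plus `feat_names_numeric` and `class_names` in `feature_info.json`. The sketch below writes a tiny fixture in that shape and reads one sample back. The array shapes, the integer `sid` values, and the feature and class names are illustrative assumptions, not taken from the real dataset.

```python
import json
import tempfile
from pathlib import Path

import h5py
import numpy as np

# Hypothetical fixture directory standing in for data/cache
cache_dir = Path(tempfile.mkdtemp())

# Assumed shape: (samples, features, time steps), stored as float16
n_samples, n_features, n_steps = 4, 3, 10
X = np.zeros((n_samples, n_features, n_steps), dtype=np.float16)
X[:, :, 6:] = 1.5  # last 4 steps hold data, the rest is zero-padding

with h5py.File(cache_dir / "train.h5", "w") as f:
    f.create_dataset("X_num", data=X)
    f.create_dataset("y", data=np.array([0, 1, 0, 1], dtype=np.int64))
    f.create_dataset("sid", data=np.arange(100, 104, dtype=np.int64))  # assumed integer IDs

# Placeholder names, for illustration only
meta = {"feat_names_numeric": ["feat_a", "feat_b", "feat_c"],
        "class_names": ["negative", "positive"]}
(cache_dir / "feature_info.json").write_text(json.dumps(meta), encoding="utf-8")

# Read one sample back the same way get_sample does
with h5py.File(cache_dir / "train.h5", "r") as f:
    sample = f["X_num"][0].astype(np.float32)
    label = int(f["y"][0])

print(sample.shape, sample.dtype, meta["class_names"][label])
```

Pointing `H5Inspector` at such a directory is a quick way to exercise the notebook without the real cache.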

> **⚠️ Important Note on Data Types:**
> The raw data is often stored as `float16` to save space. However, `float16` cannot represent values above 65504, so intermediate results (such as the squared deviations in a variance calculation) can easily **overflow to `inf`**, and precision is limited to roughly three decimal digits. In the `get_sample` method, we explicitly cast the data to `float32` (`.astype(np.float32)`) to keep these statistics numerically stable.
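
A minimal illustration of the issue, using made-up values rather than real samples: squaring moderately large `float16` numbers already exceeds the type's range, while the same operation after a `float32` cast stays finite.

```python
import numpy as np

# Illustrative values; each is exactly representable in float16
x16 = np.array([300.0, 400.0, 500.0], dtype=np.float16)

# 300 * 300 = 90000 exceeds the float16 maximum (65504), so the result is inf
with np.errstate(over="ignore"):
    sq16 = x16 * x16

# Casting first keeps the intermediate values finite
sq32 = x16.astype(np.float32) ** 2

print(sq16)
print(sq32)
```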

In [None]:
class H5Inspector:
    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.meta = self._load_meta()
        self.h5_file = None
        self.current_split = None
        
        # Load metadata mappings (Feature names, Class names)
        self.feat_names_num = self.meta.get("feat_names_numeric", [])
        self.class_names = self.meta.get("class_names", [])

    def _load_meta(self):
        """Loads feature descriptions from feature_info.json"""
        meta_path = self.cache_dir / "feature_info.json"
        if not meta_path.exists(): 
            print("⚠️ Warning: feature_info.json not found.")
            return {} 
        with open(meta_path, "r", encoding='utf-8') as f:
            return json.load(f)

    def get_available_splits(self):
        """Scans the directory for .h5 files automatically."""
        files = list(self.cache_dir.glob("*.h5"))
        return sorted([f.stem for f in files])

    def load_split(self, split_name):
        """Loads a specific dataset split (e.g., train, test)."""
        if self.h5_file is not None:
            self.h5_file.close()
        
        self.current_split = split_name
        h5_path = self.cache_dir / f"{split_name}.h5"
        
        if not h5_path.exists():
            return False, f"❌ File not found: {h5_path}"
        
        self.h5_file = h5py.File(h5_path, "r")
        return True, f"📂 Loaded: {split_name}.h5 (Total Samples: {self.h5_file['y'].shape[0]})"

    def get_sample(self, idx):
        """Retrieves a single sample and its metadata by index."""
        if self.h5_file is None:
            return None
        f = self.h5_file
        if not (0 <= idx < f["y"].shape[0]):
            return None

        # ⭐ CRITICAL FIX: Cast to float32 to prevent overflow during stats calculation
        x_num = f["X_num"][idx].astype(np.float32)
        y = int(f["y"][idx])  # Plain int so it can safely index class_names
        sid = f["sid"][idx]
        if isinstance(sid, bytes):  # h5py returns string datasets as bytes
            sid = sid.decode("utf-8")

        # Resolve Label Name (fall back to the raw label if it has no name)
        label_name = self.class_names[y] if 0 <= y < len(self.class_names) else str(y)

        return {"sid": sid, "y": y, "label_name": label_name, "x_num": x_num, "idx": idx}

## 🧠 Helper Functions: Handling Time-Series Padding

In sequence modeling (like LSTM/Transformer), input data is often **padded with zeros** to ensure all samples have the same length.
* **Problem:** If we include these zeros in our statistical analysis, the mean is pulled towards zero and the variance is distorted.
* **Solution:** The `get_valid_mask` function identifies the "Active Region" (time steps where at least one feature is non-zero) and masks out the rest as padding. A genuine time step whose features are all exactly zero is therefore also treated as padding.
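
To make the distortion concrete, here is a small sketch on a made-up `Features x Time` array with two padded time steps. It applies the same "any feature non-zero" rule inline, so it does not depend on the helper defined in the next cell.

```python
import numpy as np

# Toy sample, shape (features, time); the first two steps are zero-padding
x = np.array([
    [0.0, 0.0, 2.0, 4.0],
    [0.0, 0.0, 1.0, 3.0],
], dtype=np.float32)

# A time step counts as active if at least one feature is non-zero
mask = np.any(x != 0, axis=0)

naive_mean = x.mean(axis=1)            # padding included
masked_mean = x[:, mask].mean(axis=1)  # active region only

print(mask)         # which time steps are kept
print(naive_mean)   # halved by the two padded steps
print(masked_mean)  # mean of the real observations
```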

In [None]:
# Create the global inspector instance
inspector = H5Inspector(CACHE_DIR)

def get_valid_mask(x_data):
    """
    Identifies the 'Active Region' of a time-series.
    Returns a boolean mask where True indicates valid data and False indicates padding.
    """
    # Check if ANY feature at a specific time step is non-zero.
    # Axis 0 = Features, Axis 1 = Time Steps
    is_active = np.any(x_data != 0, axis=0)
    
    # Edge Case: If the entire sequence is 0 (missing data), return all True to visualize it as is.
    if not np.any(is_active):
        return np.ones(x_data.shape[1], dtype=bool)
    return is_active

## 🎛️ UI Initialization

Here we define the interactive widgets using `ipywidgets`.
* **Dataset Split:** A dropdown menu populated by scanning the directory.
* **Sample Index:** An input box to jump to a specific data point.
* **Random Button:** Quickly samples a random index for exploration.
* **Load Button:** Opens the selected H5 file for reading; individual samples are read from disk on demand.

In [None]:
# 1. Initialize Options
split_options = inspector.get_available_splits()
initial_value = split_options[0] if split_options else None

# 2. Define Widgets
style = {'description_width': 'initial'}

split_dropdown = widgets.Dropdown(
    options=split_options, 
    value=initial_value, 
    description='Dataset Split:', 
    style=style,
    layout=widgets.Layout(width='300px')
)

idx_input = widgets.BoundedIntText(
    value=0, min=0, max=9999999, step=1, 
    description='Sample Index:', 
    style=style
)

btn_random = widgets.Button(description='🎲 Random Sample', button_style='info')
btn_load = widgets.Button(description='📂 Load Split', button_style='warning')

# 3. Output Areas
# out_status: Shows loading messages (Success/Fail)
# out_vis: Shows the actual graphs and data tables
out_status = widgets.Output()
out_vis = widgets.Output()

## 📊 Visualization Logic

This is the main rendering function. It performs the following steps:
1.  **Fetch Data:** Gets the raw matrix (`Features x Time`).
2.  **Calculate Mask:** Determines which part of the sequence is real data vs. padding.
3.  **Compute Stats:** Calculates the standard deviation of each feature **only on the active region** and uses it to pick the five most dynamic features.
4.  **Plot:**
    * **Heatmap (Top):** Shows the global pattern of all features over time.
    * **Line Plot (Bottom):** Draws one line for each of the five most variable features. The "Padding Region" is shaded in gray.
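
The ranking in steps 2 and 3 can be sketched on a toy array as follows. The data and the choice of `k = 2` are illustrative assumptions; the notebook itself selects the top five.

```python
import numpy as np

# Toy sample, shape (features, time); the first two steps are zero-padding
x = np.array([
    [0, 0, 1, 1, 1, 1],    # constant once active
    [0, 0, 1, 2, 3, 4],    # slow ramp
    [0, 0, 5, -5, 5, -5],  # large oscillation
], dtype=np.float32)

mask = np.any(x != 0, axis=0)       # active time steps
stds = np.std(x[:, mask], axis=1)   # per-feature spread on active steps only

k = 2
top_k = np.argsort(stds)[::-1][:k]  # indices of the k most variable features

print(stds)
print(top_k)
```

Computed on the active region, the constant feature has zero spread. Including the padded steps would give it a spurious standard deviation of about 0.47, which is the kind of artifact the mask is meant to avoid.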

In [None]:
def refresh_view(idx):
    """Updates the visualization based on the selected sample index."""
    data = inspector.get_sample(idx)
    if not data: return
    
    x_num = data['x_num']
    
    # --- Step 1: Detect Active Region ---
    valid_mask = get_valid_mask(x_num)
    valid_x = x_num[:, valid_mask] # Slice data to exclude padding
    
    # --- Step 2: Calculate Statistics (on valid data only) ---
    # We look for high variance features to show what is changing in this patient/sample
    stds = np.std(valid_x, axis=1) 
    top_indices = np.argsort(stds)[::-1][:5] # Get Top 5 indices

    out_vis.clear_output(wait=True)
    with out_vis:
        # --- Header Information ---
        display(HTML(f"""
        <div style="background-color: #f0f2f6; padding: 10px; border-radius: 5px; margin-bottom: 10px;">
            <span style="font-size: 1.1em; font-weight: bold;">🆔 ID: {data['sid']} (Idx: {data['idx']})</span> &nbsp;|&nbsp; 
            <span style="color: #d63384; font-weight: bold;">Label: {data['y']} ({data['label_name']})</span> &nbsp;|&nbsp; 
            <span>Active Time Steps: {np.sum(valid_mask)} / {len(valid_mask)}</span>
        </div>
        """))

        # --- Plotting ---
        fig, axes = plt.subplots(2, 1, figsize=(12, 8), gridspec_kw={'height_ratios': [1, 1.5]})
        
        # Plot A: Global Heatmap
        sns.heatmap(x_num, ax=axes[0], cmap="RdBu_r", center=0, cbar=True, xticklabels=10, yticklabels=False)
        axes[0].set_title("Global Time-Series Heatmap (All Features)", fontsize=11)
        axes[0].set_ylabel("Features")

        # Plot B: Top 5 Dynamic Features
        for i in top_indices:
            feat_name = inspector.feat_names_num[i] if i < len(inspector.feat_names_num) else f"Feature {i}"
            axes[1].plot(x_num[i, :], label=f"{feat_name}", linewidth=1.5, alpha=0.9)
        
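        # Visual Aid: Shade leading and trailing padding (inactive steps before/after the data)
        active_steps = np.flatnonzero(valid_mask)
        first_valid, last_valid = active_steps[0], active_steps[-1]
        if first_valid > 0:
            axes[1].axvspan(0, first_valid, color='gray', alpha=0.15, label='Padding (Ignored)')
        if last_valid < len(valid_mask) - 1:
            axes[1].axvspan(last_valid + 1, len(valid_mask), color='gray', alpha=0.15,
                            label='Padding (Ignored)' if first_valid == 0 else None)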

        axes[1].set_title(f"Top 5 Dynamic Features (Calculated from Active Region)", fontsize=11)
        axes[1].legend(loc='upper right', fontsize='small', frameon=True)
        axes[1].set_xlim(0, x_num.shape[1])
        axes[1].set_xlabel("Time Step")
        
        plt.tight_layout()
        plt.show()

# --- Event Handlers ---
def on_load_click(b):
    out_status.clear_output()
    with out_status:
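        # Re-scan for files (in case new files were added).
        # Assigning `options` can reset the dropdown to its first entry,
        # so remember the current selection and restore it afterwards.
        selected = split_dropdown.value
        current_opts = inspector.get_available_splits()
        split_dropdown.options = current_opts
        if selected in current_opts:
            split_dropdown.value = selected
        
        if split_dropdown.value:
            success, msg = inspector.load_split(split_dropdown.value)
            print(msg)
            if success:
                idx_input.max = inspector.h5_file["y"].shape[0] - 1
                idx_input.value = 0
                refresh_view(0)
        else:
            print("❌ No h5 file selected or found.")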

def on_random_click(b):
    if inspector.h5_file:
        idx_input.value = random.randint(0, inspector.h5_file["y"].shape[0] - 1)

## 🚀 Launch Application

Run the cell below to display the inspector tool.
1.  **Select** a dataset split from the dropdown.
2.  Click **"📂 Load Split"**.
3.  Use **"🎲 Random Sample"** or type an index to explore the data.

In [None]:
# Link Events
btn_load.on_click(on_load_click)
btn_random.on_click(on_random_click)
idx_input.observe(lambda change: refresh_view(change['new']), names='value')

# Auto-load the first dataset if available
on_load_click(None)

# Display the UI
display(widgets.VBox([
    widgets.HBox([split_dropdown, btn_load, idx_input, btn_random]),
    out_status,
    out_vis
]))