# SHMTools Dataset Management

## Overview

This notebook documents the SHMTools example datasets and provides utilities for managing them. It covers:

1. **Dataset Overview**: Physical descriptions and experimental setups
2. **Data Loading**: Standardized loading procedures and validation
3. **Dataset Exploration**: Structure analysis and visualization
4. **Usage Examples**: Common data access patterns for different examples
5. **Integrity Validation**: Automated checking of dataset completeness

This notebook serves as both documentation and a practical utility for working with SHMTools data.

## Setup

Import the required modules and set up the plotting environment.

In [None]:
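import numpy as np
import matplotlib.pyplot as plt

# Import shmtools (installed package); the names below are the ones used in later cells
from examples.data import (
    check_data_availability, get_available_datasets, get_data_dir,
    load_3story_data, load_active_sensing_data, load_cbm_data,
    load_example_data, load_modal_osp_data, load_sensor_diagnostic_data,
    print_dataset_summary, validate_dataset_integrity,
)

# Set up plotting
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10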

## Dataset Availability Check

Check which datasets are currently available and validate their integrity.

In [None]:
# Check dataset availability
print("Dataset Availability:")
print("=" * 60)
check_data_availability()

In [None]:
# Comprehensive dataset summary
print_dataset_summary()

## Dataset 1: 3-Story Structure (data3SS.mat)

### Physical Description

The primary dataset contains measurements from a 3-story aluminum frame structure designed for structural health monitoring research at Los Alamos National Laboratory.

**Physical Structure:**
- Aluminum columns (17.7×2.5×0.6 cm) and plates (30.5×30.5×2.5 cm)
- 4-column frame design per floor (essentially 4-DOF system)
- Sliding rails constraining motion to x-direction only
- Suspended center column with adjustable bumper for damage simulation
- Base isolation using rigid foam

**Instrumentation:**
- Electrodynamic shaker for base excitation (band-limited random 20-150 Hz)
- Load cell measuring input force (2.2 mV/N sensitivity)
- 4 accelerometers at floor centerlines (1000 mV/g sensitivity)
- National Instruments PXI data acquisition system
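
The sensitivities listed above imply a simple scaling from sensor voltage to engineering units. The following is a minimal sketch of that arithmetic, assuming the raw readings are in volts; whether `data3SS.mat` stores volts or already-scaled values is not stated here, and both helper names are hypothetical.

```python
import numpy as np

# Sensitivities quoted in the instrumentation list above.
LOAD_CELL_MV_PER_N = 2.2   # load cell: 2.2 mV/N
ACCEL_MV_PER_G = 1000.0    # accelerometers: 1000 mV/g


def volts_to_newtons(volts):
    """Convert load-cell output in volts to force in newtons (hypothetical helper)."""
    return np.asarray(volts, dtype=float) * 1000.0 / LOAD_CELL_MV_PER_N


def volts_to_g(volts):
    """Convert accelerometer output in volts to acceleration in g (hypothetical helper)."""
    return np.asarray(volts, dtype=float) * 1000.0 / ACCEL_MV_PER_G


# Assumed example readings in volts, not taken from the dataset.
force_n = volts_to_newtons([0.011, 0.022])
accel_g = volts_to_g([0.5, -0.25])
print("Force (N):", force_n)
print("Acceleration (g):", accel_g)
```

With a sensitivity of 1000 mV/g, one volt corresponds to one g, so the accelerometer conversion is numerically the identity when the input is in volts.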

In [None]:
# Load and examine 3-story structure data
try:
    data_3story = load_3story_data()
    
    print("3-Story Structure Dataset:")
    print("=" * 50)
    print(f"Dataset shape: {data_3story['dataset'].shape}")
    print(f"Sampling frequency: {data_3story['fs']} Hz")
    print(f"Channels: {data_3story['channels']}")
    print(f"Total conditions: {len(data_3story['conditions'])}")
    print(f"Damage states: {len(data_3story['state_descriptions'])}")
    print(f"Description: {data_3story['description']}")
    
    # Show damage state descriptions
    print("\nDamage State Descriptions:")
    print("-" * 50)
    for state, desc in data_3story['state_descriptions'].items():
        print(f"State {state:2d}: {desc}")
        
except FileNotFoundError as e:
    print(f"3-Story dataset not available: {e}")
    print("Please download data3SS.mat and place it in the data directory.")

### Data Structure Analysis

In [None]:
# Analyze data structure if available
if 'data_3story' in locals():
    dataset = data_3story['dataset']
    damage_states = data_3story['damage_states']
    
    # Plot damage state distribution
    plt.figure(figsize=(12, 6))
    
    plt.subplot(1, 2, 1)
    unique_states, counts = np.unique(damage_states, return_counts=True)
    plt.bar(unique_states, counts)
    plt.xlabel('Damage State')
    plt.ylabel('Number of Tests')
    plt.title('Distribution of Test Conditions by Damage State')
    plt.grid(True, alpha=0.3)
    
    # Plot example time series from different states
    plt.subplot(1, 2, 2)
    t = np.arange(dataset.shape[0]) / data_3story['fs']
    
    # Plot baseline condition (state 1, test 1) - channel 2
    baseline_signal = dataset[:1000, 1, 0]  # First 1000 points, channel 2, condition 1
    plt.plot(t[:1000], baseline_signal, 'b-', label='Baseline (State 1)', alpha=0.7)
    
    # Plot damaged condition (state 10, test 1) - channel 2  
    damage_signal = dataset[:1000, 1, 90]  # First 1000 points, channel 2, condition 91 (state 10)
    plt.plot(t[:1000], damage_signal, 'r-', label='Damaged (State 10)', alpha=0.7)
    
    plt.xlabel('Time (s)')
    plt.ylabel('Acceleration')
    plt.title('Sample Time Series Comparison')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Statistical summary
    print("\nStatistical Summary (Channel 2):")
    print("-" * 40)
    
    # Compare baseline vs damaged conditions
    baseline_data = dataset[:, 1, :90].flatten()  # All baseline conditions, channel 2
    damaged_data = dataset[:, 1, 90:].flatten()   # All damaged conditions, channel 2
    
    print(f"Baseline - Mean: {np.mean(baseline_data):.4f}, Std: {np.std(baseline_data):.4f}")
    print(f"Damaged  - Mean: {np.mean(damaged_data):.4f}, Std: {np.std(damaged_data):.4f}")
    print(f"Std Ratio (Damaged/Baseline): {np.std(damaged_data)/np.std(baseline_data):.3f}")
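
The cell above hard-codes the indices `0`, `90`, and `:90`, which only works if conditions are stored in state order with the same number of tests per state. A minimal sketch of a helper that makes this mapping explicit is shown below. It assumes 10 tests per state, and a synthetic array stands in for the real dataset; `state_slice` is a hypothetical name, not part of the package.

```python
import numpy as np

TESTS_PER_STATE = 10  # assumption: 10 tests per state, stored in state order


def state_slice(state, tests_per_state=TESTS_PER_STATE):
    """Return the 0-based condition slice for a 1-based damage state (hypothetical helper)."""
    if state < 1:
        raise ValueError("damage states are numbered from 1")
    start = (state - 1) * tests_per_state
    return slice(start, start + tests_per_state)


# Synthetic stand-in using the (time, channel, condition) layout from the cell above.
rng = np.random.default_rng(0)
dataset = rng.standard_normal((256, 5, 170))

baseline = dataset[:, 1, state_slice(1)]   # channel 2, all tests of state 1
damaged = dataset[:, 1, state_slice(10)]   # channel 2, all tests of state 10
print(baseline.shape, damaged.shape)
print(state_slice(10))
```

When the `damage_states` array is available, selecting with `damage_states == state` is more robust than index arithmetic, because it does not depend on the storage order.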

## Dataset 2: Condition-Based Monitoring (data_CBM.mat)

### Physical Description

Rotating machinery vibration data collected from the SpectraQuest Magnum Machinery Fault Simulator for bearing and gearbox fault analysis.

**Test Setup:**
- Main shaft: 3/4" diameter steel, 28.5" center-to-center bearing support
- Gearbox: Hub City M2, 1.5:1 ratio, 18/27 teeth (pinion/gear)
- Belt drive: ~1:3.71 ratio, 13" span, 3.7 lbs tension
- Magnetic brake: 1.9 lbs-in torsional load
- Shaft speed: ~1000 rpm nominally constant

**Fault Conditions:**
- Ball bearing faults (roller spin)
- Gearbox worn tooth faults
- Baseline conditions with ball and fluid bearings
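
The setup values above determine several characteristic frequencies that are useful when reading spectra. The sketch below works through the arithmetic for the shaft rotation and gear mesh frequencies. It assumes the shaft runs at exactly its nominal 1000 rpm and that the 18-tooth pinion turns at main shaft speed; the drivetrain arrangement is not spelled out here, so treat that as an assumption.

```python
SHAFT_RPM = 1000.0   # nominal main shaft speed from the setup list
PINION_TEETH = 18    # gearbox pinion
GEAR_TEETH = 27      # gearbox gear

shaft_hz = SHAFT_RPM / 60.0

# Assumption: the pinion rotates at main shaft speed.
gear_mesh_hz = shaft_hz * PINION_TEETH
gear_ratio = GEAR_TEETH / PINION_TEETH
output_shaft_hz = shaft_hz / gear_ratio

print(f"Shaft rotation:  {shaft_hz:.2f} Hz")
print(f"Gear ratio:      {gear_ratio:.2f}:1")
print(f"Gear mesh:       {gear_mesh_hz:.1f} Hz")
print(f"Gearbox output:  {output_shaft_hz:.2f} Hz")
```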

In [None]:
# Load and examine CBM data
try:
    data_cbm = load_cbm_data()
    
    print("Condition-Based Monitoring Dataset:")
    print("=" * 50)
    
    # Show available variables
    print("Available variables:")
    for key, value in data_cbm.items():
        if isinstance(value, np.ndarray):
            print(f"  {key}: {value.shape} ({value.dtype})")
        else:
            print(f"  {key}: {value}")
    
    # Show fault state descriptions
    if 'fault_states' in data_cbm:
        print("\nFault State Descriptions:")
        print("-" * 50)
        for state, desc in data_cbm['fault_states'].items():
            print(f"State {state}: {desc}")
    
    # Show bearing fault frequencies if the shaft speed is available
    if 'shaft_speed_rpm' in data_cbm:
        shaft_freq = data_cbm['shaft_speed_rpm'] / 60.0  # Convert RPM to Hz
        print(f"\nBearing Fault Frequencies (Shaft = {shaft_freq:.1f} Hz):")
        print("-" * 50)
        # Multipliers are per shaft revolution. The cage value repeats the
        # outer-race value as given; verify it against the dataset documentation.
        print(f"Cage Speed: {3.048 * shaft_freq:.1f} Hz")
        print(f"Outer Race: {3.048 * shaft_freq:.1f} Hz")
        print(f"Inner Race: {4.95 * shaft_freq:.1f} Hz")
        print(f"Ball Spin: {1.992 * shaft_freq:.1f} Hz")
        
except FileNotFoundError as e:
    print(f"CBM dataset not available: {e}")
    print("Please download data_CBM.mat and place it in the data directory.")

### CBM Data Visualization

In [None]:
# Visualize CBM data if available
if 'data_cbm' in locals() and 'dataset' in data_cbm:
    cbm_dataset = data_cbm['dataset']
    fs = data_cbm['fs']
    channels = data_cbm['channels']
    
    print(f"CBM Dataset shape: {cbm_dataset.shape}")
    
    # Plot example signals from different channels and conditions
    plt.figure(figsize=(14, 10))
    
    # Time vector
    t = np.arange(1000) / fs  # First 1000 points for visualization
    
    # Plot signals from each channel
    for i, channel in enumerate(channels):
        plt.subplot(2, 2, i+1)
        
        # Plot baseline condition (assuming condition 0)
        if cbm_dataset.shape[2] > 0:
            baseline_signal = cbm_dataset[:1000, i, 0]
            plt.plot(t, baseline_signal, 'b-', label='Baseline', alpha=0.7)
        
        # Plot fault condition (assuming later condition exists) 
        if cbm_dataset.shape[2] > 64:  # If we have fault conditions
            fault_signal = cbm_dataset[:1000, i, 64]
            plt.plot(t, fault_signal, 'r-', label='Fault', alpha=0.7)
        
        plt.xlabel('Time (s)')
        plt.ylabel('Amplitude')
        plt.title(f'{channel}')
        plt.legend()
        plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
elif 'data_cbm' in locals():
    print("CBM data loaded but 'dataset' variable not found.")
    print("Available variables:", list(data_cbm.keys()))
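
Time-domain plots rarely show bearing faults directly; the characteristic frequencies printed in the CBM loading cell are usually looked for in a spectrum. The sketch below illustrates the idea on a synthetic signal using `np.fft.rfft`. The sampling rate, record length, and noise level are assumptions chosen for illustration and are not properties of `data_CBM.mat`.

```python
import numpy as np

fs = 2048.0                 # assumed sampling rate in Hz, not the dataset's
n = 4096                    # number of samples (2 s of data)
t = np.arange(n) / fs

shaft_hz = 1000.0 / 60.0    # nominal 1000 rpm shaft
tone_hz = 3.048 * shaft_hz  # outer-race multiplier quoted in the CBM loading cell

# Synthetic vibration: one tone at the fault frequency plus white noise.
rng = np.random.default_rng(1)
signal = np.sin(2 * np.pi * tone_hz * t) + 0.2 * rng.standard_normal(n)

# One-sided magnitude spectrum with a Hann window to limit leakage.
window = np.hanning(n)
spectrum = np.abs(np.fft.rfft(signal * window))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)

# Locate the largest peak, skipping the DC bin.
peak_hz = freqs[np.argmax(spectrum[1:]) + 1]
print(f"Expected {tone_hz:.2f} Hz, spectral peak at {peak_hz:.2f} Hz")
```

The frequency resolution is `fs / n`, here 0.5 Hz, so the detected peak can differ from the expected frequency by up to one bin.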

## Other Datasets

Brief exploration of the remaining datasets used in specialized examples.

In [None]:
# Load other datasets if available
datasets_to_check = [
    ('Active Sensing', load_active_sensing_data),
    ('Sensor Diagnostic', load_sensor_diagnostic_data),
    ('Modal OSP', load_modal_osp_data)
]

loaded_datasets = {}

for name, loader_func in datasets_to_check:
    try:
        data = loader_func()
        loaded_datasets[name] = data
        
        print(f"\n{name} Dataset:")
        print("=" * (len(name) + 10))
        
        # Show dataset structure
        total_size = 0
        for key, value in data.items():
            if isinstance(value, np.ndarray):
                size_mb = value.nbytes / (1024**2)
                total_size += size_mb
                print(f"  {key}: {value.shape} ({value.dtype}) - {size_mb:.2f} MB")
            elif isinstance(value, (list, dict)):
                print(f"  {key}: {type(value).__name__} (length: {len(value)})")
            else:
                print(f"  {key}: {value}")
        
        print(f"Total estimated size: {total_size:.2f} MB")
        
    except FileNotFoundError:
        print(f"\n{name} dataset not available.")
    except Exception as e:
        print(f"\nError loading {name} dataset: {e}")

## Dataset Usage Examples

Demonstrate common data access patterns for different types of SHMTools examples.

In [None]:
# Example 1: Loading data for outlier detection examples (PCA, Mahalanobis, SVD)
print("Example 1: Outlier Detection Data Loading")
print("=" * 50)

try:
    # This convenience function preprocesses the 3-story data for outlier detection
    pca_data = load_example_data('pca')
    
    print(f"Preprocessed signals shape: {pca_data['signals'].shape}")
    print(f"Channels included: {pca_data['channels']}")
    print(f"Time points (t): {pca_data['t']}")
    print(f"Channels (m): {pca_data['m']}")
    print(f"Conditions (n): {pca_data['n']}")
    
    # Show how to split into baseline and damaged conditions
    signals = pca_data['signals']
    damage_states = pca_data['damage_states']
    
    # Extract baseline conditions (states 1-9)
    baseline_indices = np.where(damage_states <= 9)[0]
    damaged_indices = np.where(damage_states >= 10)[0]
    
    baseline_signals = signals[:, :, baseline_indices]
    damaged_signals = signals[:, :, damaged_indices]
    
    print(f"Baseline conditions: {baseline_signals.shape[2]} tests")
    print(f"Damaged conditions: {damaged_signals.shape[2]} tests")
    
except FileNotFoundError:
    print("3-story dataset required for outlier detection examples not available.")

In [None]:
# Example 2: Accessing specific damage states
print("\nExample 2: Accessing Specific Damage States")
print("=" * 50)

if 'pca_data' in locals():
    damage_states = pca_data['damage_states']
    state_descriptions = pca_data['state_descriptions']
    signals = pca_data['signals']
    
    # Access specific states
    target_states = [1, 10, 14]  # Baseline, first damage, severe damage
    
    for state in target_states:
        state_indices = np.where(damage_states == state)[0]
        state_signals = signals[:, :, state_indices]
        
        # Calculate RMS for each test in this state
        rms_values = np.sqrt(np.mean(state_signals**2, axis=0))  # RMS over time
        mean_rms = np.mean(rms_values, axis=1)  # Mean RMS across tests
        
        print(f"State {state}: {state_descriptions[state]}")
        print(f"  Tests: {len(state_indices)}")
        print(f"  Mean RMS per channel: {mean_rms}")
        print()
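
The RMS calculation in Example 2 reduces a `(time, channel, test)` array in two steps. The sketch below traces the shapes on synthetic data in which each channel has a known scale, so the result can be checked by eye. The array sizes and channel scales are assumptions for illustration.

```python
import numpy as np

t, m, n_tests = 1000, 4, 10   # assumed sizes for illustration
rng = np.random.default_rng(2)

# Give each channel a known standard deviation so the RMS is predictable.
channel_scale = np.array([1.0, 2.0, 3.0, 4.0])
state_signals = rng.standard_normal((t, m, n_tests)) * channel_scale[None, :, None]

# Step 1: RMS over the time axis leaves one value per (channel, test).
rms_values = np.sqrt(np.mean(state_signals**2, axis=0))
# Step 2: averaging over tests leaves one value per channel.
mean_rms = np.mean(rms_values, axis=1)

print("rms_values shape:", rms_values.shape)
print("mean_rms shape:  ", mean_rms.shape)
print("mean_rms:        ", np.round(mean_rms, 2))
```

For zero-mean signals the RMS and the standard deviation coincide, so `mean_rms` should land close to `channel_scale`.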

In [None]:
# Example 3: Training/Testing splits commonly used in examples
print("Example 3: Common Training/Testing Splits")
print("=" * 50)

if 'pca_data' in locals():
    signals = pca_data['signals']
    damage_states = pca_data['damage_states']
    
    # Common split: Use baseline conditions for training
    baseline_indices = np.where(damage_states <= 9)[0]  # States 1-9
    damaged_indices = np.where(damage_states >= 10)[0]  # States 10-17
    
    training_signals = signals[:, :, baseline_indices]
    testing_signals = signals[:, :, np.concatenate([baseline_indices, damaged_indices])]
    
    # Create binary labels for testing (0=undamaged, 1=damaged)
    test_damage_states = damage_states[np.concatenate([baseline_indices, damaged_indices])]
    binary_labels = (test_damage_states >= 10).astype(int)
    
    print(f"Training set: {training_signals.shape[2]} undamaged conditions")
    print(f"Testing set: {testing_signals.shape[2]} total conditions")
    print(f"  - Undamaged: {np.sum(binary_labels == 0)} tests")
    print(f"  - Damaged: {np.sum(binary_labels == 1)} tests")
    
    # Alternative split: Use subset of each state for training
    print("\nAlternative split (subset training):")
    train_indices = []
    test_indices = []
    
    for state in range(1, 18):  # States 1-17
        state_indices = np.where(damage_states == state)[0]
        # Use first 7 tests for training, last 3 for testing
        train_indices.extend(state_indices[:7])
        test_indices.extend(state_indices[7:])
    
    train_indices = np.array(train_indices)
    test_indices = np.array(test_indices)
    
    print(f"Training set: {len(train_indices)} conditions from all states")
    print(f"Testing set: {len(test_indices)} conditions from all states")
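
The alternative split in Example 3 can be wrapped in a small function so the train/test sizes are easy to vary. This is a minimal sketch using synthetic labels with 17 states and 10 tests per state, matching the loop above; `split_per_state` is a hypothetical helper, not part of the package.

```python
import numpy as np


def split_per_state(damage_states, n_train=7):
    """Send the first n_train tests of each state to training and the rest to testing."""
    train, test = [], []
    for state in np.unique(damage_states):
        idx = np.flatnonzero(damage_states == state)
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.array(train), np.array(test)


# Synthetic labels: 17 states with 10 tests each (assumed layout).
damage_states = np.repeat(np.arange(1, 18), 10)
train_idx, test_idx = split_per_state(damage_states)

print(f"Training conditions: {len(train_idx)}")
print(f"Testing conditions:  {len(test_idx)}")
```

Because the split is taken per state, every damage state appears in both sets, which differs from the baseline-only training split shown first.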

## Dataset Integrity Validation

Automated validation of all datasets to ensure they're properly loaded and structured.

In [None]:
# Run comprehensive dataset validation
print("Dataset Integrity Validation")
print("=" * 50)

validation_results = validate_dataset_integrity()

# Create summary table manually (without pandas)
print(f"{'Dataset':<30} {'Size (MB)':<10} {'Available':<10} {'Valid':<8} {'Errors':<8} {'Warnings':<8}")
print("-" * 80)

for dataset_name, result in validation_results.items():
    dataset_info = get_available_datasets()[dataset_name]
    
    dataset_file = dataset_info['file']
    size_mb = dataset_info['size_mb']
    available = '✓' if result['available'] else '✗'
    valid = '✓' if result['valid'] else '✗'
    errors = len(result['errors'])
    warnings = len(result['warnings'])
    
    print(f"{dataset_file:<30} {size_mb:<10} {available:<10} {valid:<8} {errors:<8} {warnings:<8}")

# Show detailed errors/warnings if any
print("\nDetailed Issues:")
print("-" * 30)

issues_found = False
for dataset_name, result in validation_results.items():
    if result['errors'] or result['warnings']:
        issues_found = True
        dataset_info = get_available_datasets()[dataset_name]
        print(f"\n{dataset_info['file']}:")
        
        for error in result['errors']:
            print(f"  ERROR: {error}")
        for warning in result['warnings']:
            print(f"  WARNING: {warning}")

if not issues_found:
    print("No issues found. All available datasets are valid.")
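
The checks that `validate_dataset_integrity()` performs are not shown in this notebook. As an illustration of the kind of checks such a validator might run, the sketch below tests a single array and returns a dictionary with the same `valid`, `errors`, and `warnings` keys used above. `check_array` is hypothetical and is not the SHMTools implementation.

```python
import numpy as np


def check_array(dataset, expected_ndim=3):
    """Hypothetical integrity check for one array; not the SHMTools implementation."""
    result = {"valid": True, "errors": [], "warnings": []}
    if dataset.ndim != expected_ndim:
        result["errors"].append(
            f"expected {expected_ndim} dimensions, got {dataset.ndim}"
        )
    if not np.all(np.isfinite(dataset)):
        result["errors"].append("array contains NaN or infinite values")
    if dataset.size and np.all(dataset == dataset.flat[0]):
        result["warnings"].append("array is constant")
    result["valid"] = not result["errors"]
    return result


# A well-formed synthetic array and a deliberately broken one.
good = check_array(np.random.default_rng(3).standard_normal((100, 5, 4)))
bad = check_array(np.full((100, 5), np.nan))

print("good:", good)
print("bad errors:", bad["errors"])
```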

## Dataset File Information

Detailed file information and download guidance.

In [None]:
# Show data directory and file information
data_dir = get_data_dir()
print(f"Data Directory: {data_dir}")
print(f"Directory exists: {data_dir.exists()}")
print()

if data_dir.exists():
    print("Files in data directory:")
    print("-" * 40)
    
    # List all .mat files
    mat_files = list(data_dir.glob('*.mat'))
    
    if mat_files:
        for mat_file in sorted(mat_files):
            size_mb = mat_file.stat().st_size / (1024**2)
            print(f"  {mat_file.name:30} ({size_mb:6.2f} MB)")
    else:
        print("  No .mat files found")
    
    # List other files
    other_files = [f for f in data_dir.iterdir() if f.is_file() and not f.name.endswith('.mat')]
    if other_files:
        print("\nOther files:")
        for other_file in sorted(other_files):
            print(f"  {other_file.name}")
else:
    print(f"Data directory does not exist: {data_dir}")
    print("Please create the directory and download the dataset files.")

print("\nDataset Download Information:")
print("-" * 40)
print("All datasets are from the original SHMTools library (LA-CC-14-046)")
print("developed by Los Alamos National Laboratory.")
print("")
print("To obtain the datasets:")
print("1. Download from the original MATLAB SHMTools distribution")
print("2. Extract the .mat files from the Examples/ExampleData/ directory")
print(f"3. Place them in: {data_dir}")
print("")
print("See the README.md file in the data directory for detailed instructions.")
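
Before running the examples, it can help to know which files still need to be downloaded. The sketch below compares a directory against the five file names listed in this notebook's summary. `missing_files` is a hypothetical helper, and a temporary directory stands in for the real data directory.

```python
import tempfile
from pathlib import Path

# File names as listed in this notebook's summary.
EXPECTED_FILES = [
    "data3SS.mat",
    "data_CBM.mat",
    "data_example_ActiveSense.mat",
    "dataSensorDiagnostic.mat",
    "data_OSPExampleModal.mat",
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


# Demonstrate on a temporary directory that holds only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data3SS.mat").touch()
    missing = missing_files(tmp)

print("Missing files:")
for name in missing:
    print(f"  {name}")
```

In the notebook itself, the path returned by `get_data_dir()` could be passed in place of the temporary directory.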

## Summary

This notebook provides comprehensive dataset management utilities for SHMTools Python. Key takeaways:

### Available Datasets
1. **data3SS.mat**: Primary 3-story structure dataset (25 MB)
2. **data_CBM.mat**: Condition-based monitoring rotating machinery (54 MB)
3. **data_example_ActiveSense.mat**: Guided wave measurements (32 MB)
4. **dataSensorDiagnostic.mat**: Sensor health monitoring (63 KB)
5. **data_OSPExampleModal.mat**: Modal analysis and sensor placement (50 KB)

### Key Functions
- `load_3story_data()`: Primary structural dataset with detailed metadata
- `load_cbm_data()`: Rotating machinery with fault information
- `load_example_data(type)`: Convenient preprocessing for specific examples
- `validate_dataset_integrity()`: Automated validation and checking
- `check_data_availability()`: Quick availability status

### Usage Patterns
- **Outlier Detection**: Use `load_example_data('pca')` for preprocessed 3-story data
- **Training/Testing**: Split by damage states or use subset sampling
- **Validation**: Run integrity checks before starting analysis
- **Exploration**: Use metadata and descriptions for understanding structure

The data loading utilities bundle documentation, metadata, and validation for the SHMTools datasets while remaining compatible with the original MATLAB examples.