# AR Model Order Selection

## Introduction

This example demonstrates several methods for selecting the order of an autoregressive (AR) model. Order selection is a critical step in time series analysis and feature extraction for structural health monitoring (SHM), because the model order directly affects the quality of the extracted features and the performance of subsequent damage detection.

Five methods are covered:

1. **AIC (Akaike Information Criterion)** - Balances model fit vs. complexity
2. **BIC (Bayesian Information Criterion)** - More conservative than AIC
3. **PAF (Partial Autocorrelation Function)** - Tests the statistical significance of the last AR parameter at each order
4. **SVD (Singular Value Decomposition)** - Analyzes rank structure of data matrix
5. **RMS (Root Mean Square Error)** - Based on prediction error improvement

Data from **Channel 5** of the base-excited three-story structure is used to demonstrate the different order selection approaches.

**Key Applications:**
- Optimal feature extraction for damage detection
- Time series modeling and prediction
- Signal compression and denoising
- System identification in structural dynamics

**SHMTools functions used:**
- `ar_model_order`
- `ar_model`

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import sys

# Add shmtools to the import path, handling different execution contexts
current_dir = Path.cwd()
notebook_dir = Path(__file__).parent if '__file__' in globals() else current_dir

# Try different relative paths to find shmtools
possible_paths = [
    notebook_dir.parent.parent.parent,  # From examples/notebooks/basic/
    current_dir.parent.parent,          # From examples/notebooks/
    current_dir,                        # From project root
    Path('/Users/eric/repo/shm/shmtools-python')  # Absolute fallback (machine-specific; adjust for your checkout)
]

shmtools_found = False
for path in possible_paths:
    if (path / 'shmtools').exists():
        if str(path) not in sys.path:
            sys.path.insert(0, str(path))
        shmtools_found = True
        print(f"Found shmtools at: {path}")
        break

if not shmtools_found:
    print("Warning: Could not find shmtools module")

from shmtools.utils.data_loading import load_3story_data
from shmtools.features.time_series import ar_model_order, ar_model

# Set up plotting
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

## Load and Prepare Data

Load the 3-story structure dataset and extract Channel 5 data from a baseline (undamaged) condition for AR model order analysis.

In [None]:
# Load data set
data_dict = load_3story_data()
dataset = data_dict['dataset']
fs = data_dict['fs']
channels = data_dict['channels']

print(f"Dataset shape: {dataset.shape}")
print(f"Sampling frequency: {fs} Hz")
print(f"Channels: {channels}")

# Extract Channel 5 data from first baseline condition (index 4 for channel, 0 for condition)
channel_5_baseline = dataset[:, 4, 0]  # Shape: (8192,)
t = len(channel_5_baseline)

print(f"\nChannel 5 baseline data:")
print(f"Time points: {t}")
print(f"Mean: {np.mean(channel_5_baseline):.6f}")
print(f"Std: {np.std(channel_5_baseline):.6f}")
print(f"Min: {np.min(channel_5_baseline):.6f}")
print(f"Max: {np.max(channel_5_baseline):.6f}")

### Plot Time Series Data

Visualize the baseline acceleration time history from Channel 5 to understand the signal characteristics.

In [None]:
# Plot time series
plt.figure(figsize=(14, 6))

time_points = np.arange(1, t + 1)
plt.plot(time_points, channel_5_baseline, 'k-', linewidth=0.8)
plt.title('Channel 5 Baseline Acceleration Time History', fontsize=14, fontweight='bold')
plt.xlabel('Time Points')
plt.ylabel('Acceleration (g)')
plt.grid(True, alpha=0.3)
plt.xlim([1, t])

plt.tight_layout()
plt.show()

# Show a zoomed view of the first 1000 points
plt.figure(figsize=(14, 6))
zoom_points = 1000
plt.plot(time_points[:zoom_points], channel_5_baseline[:zoom_points], 'k-', linewidth=1.0)
plt.title(f'Channel 5 Baseline - First {zoom_points} Points (Zoomed View)', fontsize=14, fontweight='bold')
plt.xlabel('Time Points')
plt.ylabel('Acceleration (g)')
plt.grid(True, alpha=0.3)
plt.xlim([1, zoom_points])

plt.tight_layout()
plt.show()

## AR Model Order Selection Methods

Apply different methods to determine the optimal AR model order. Each method has different theoretical foundations and practical advantages.
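
To make the information criteria concrete, the following is a minimal sketch of AIC and BIC for least-squares AR fits. It assumes the common Gaussian-likelihood forms `N*ln(sigma^2) + 2p` and `N*ln(sigma^2) + p*ln(N)`; `ar_model_order` may use a different estimator or constants. The helper names and the synthetic AR(2) signal are hypothetical stand-ins, not part of SHMTools.

```python
import numpy as np


def ar_residuals(x, order, first_target):
    """Least-squares AR(order) fit; residuals are computed for x[first_target:]."""
    n = len(x)
    # Column k holds the signal delayed by k + 1 samples.
    lagged = np.column_stack(
        [x[first_target - k - 1 : n - k - 1] for k in range(order)]
    )
    target = x[first_target:]
    coeffs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    return coeffs, target - lagged @ coeffs


def aic_bic(x, max_order):
    """Return (aic, bic) arrays for orders 1..max_order (assumed textbook forms)."""
    x = np.asarray(x, dtype=float)
    n_eff = len(x) - max_order  # every order is scored on the same samples
    aic = np.empty(max_order)
    bic = np.empty(max_order)
    for p in range(1, max_order + 1):
        _, res = ar_residuals(x, p, first_target=max_order)
        fit_term = n_eff * np.log(np.mean(res**2))
        aic[p - 1] = fit_term + 2 * p
        bic[p - 1] = fit_term + p * np.log(n_eff)
    return aic, bic


# Synthetic AR(2) signal (hypothetical stand-in for the measured channel).
rng = np.random.default_rng(0)
noise = rng.standard_normal(4200)
signal = np.zeros(4200)
for i in range(2, 4200):
    signal[i] = 1.5 * signal[i - 1] - 0.75 * signal[i - 2] + noise[i]
signal = signal[200:]  # drop the start-up transient

aic, bic = aic_bic(signal, max_order=10)
aic_order = int(np.argmin(aic)) + 1
bic_order = int(np.argmin(bic)) + 1
print(f"AIC order: {aic_order}, BIC order: {bic_order}")
```

Because both criteria share the same fit term and the BIC penalty grows faster with the order whenever `ln(N) > 2`, the BIC choice in this sketch can never exceed the AIC choice.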

In [None]:
# Define parameters
max_order = 30
methods = ['aic', 'bic', 'paf', 'svd', 'rms']
tolerance = 0.078  # Tolerance for PAF and RMS methods

# Store results
results = {}
criterion_values = {}

print(f"Testing AR model orders from 1 to {max_order}...\n")

# Apply each method
for method in methods:
    print(f"Applying {method.upper()} method...")
    
    optimal_order, values = ar_model_order(
        channel_5_baseline, 
        max_order=max_order, 
        method=method, 
        tolerance=tolerance
    )
    
    results[method] = optimal_order
    criterion_values[method] = values
    
    print(f"  Optimal order: {optimal_order}")
    print()

# Summary of results
print("=" * 50)
print("AR MODEL ORDER SELECTION RESULTS")
print("=" * 50)
for method in methods:
    print(f"{method.upper():>4}: Optimal order = {results[method]:2d}")
print("=" * 50)
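
The PAF idea can be sketched independently of SHMTools. The example below estimates the partial autocorrelation with the Levinson-Durbin recursion, where the value at lag `p` equals the last coefficient of an AR(`p`) fit, and compares it with the approximate white-noise bound `2/sqrt(N)`. The selection rule (count the leading lags that exceed the bound) is an assumption made for illustration; `ar_model_order` may decide differently. All names and the synthetic signal are hypothetical.

```python
import numpy as np


def sample_autocovariance(x, max_lag):
    """Biased sample autocovariance at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    return np.array([np.dot(x[: n - k], x[k:]) / n for k in range(max_lag + 1)])


def partial_autocorrelation(x, max_order):
    """PACF at lags 1..max_order via the Levinson-Durbin recursion."""
    r = sample_autocovariance(x, max_order)
    pacf = np.empty(max_order)
    phi = np.zeros(0)  # AR coefficients of the previous order
    err = r[0]         # prediction error variance of the previous order
    for p in range(1, max_order + 1):
        # Reflection coefficient = last coefficient of the AR(p) model.
        kappa = (r[p] - np.dot(phi, r[p - 1 : 0 : -1])) / err
        phi = np.concatenate([phi - kappa * phi[::-1], [kappa]])
        err *= 1.0 - kappa**2
        pacf[p - 1] = kappa
    return pacf


def leading_significant_lags(pacf, bound):
    """Assumed rule: order = number of leading lags with |PACF| >= bound."""
    for lag, value in enumerate(pacf, start=1):
        if abs(value) < bound:
            return lag - 1
    return len(pacf)


# Synthetic AR(2) signal (hypothetical stand-in for the measured channel).
rng = np.random.default_rng(1)
noise = rng.standard_normal(4200)
signal = np.zeros(4200)
for i in range(2, 4200):
    signal[i] = 1.5 * signal[i - 1] - 0.75 * signal[i - 2] + noise[i]
signal = signal[200:]

pacf = partial_autocorrelation(signal, max_order=10)
bound = 2.0 / np.sqrt(len(signal))
paf_order = leading_significant_lags(pacf, bound)
print(f"PACF (first 4 lags): {np.round(pacf[:4], 3)}, selected order: {paf_order}")
```

For a true AR(2) process the PACF is large at lags 1 and 2 and close to zero afterwards, although an individual higher lag can still cross the bound by chance roughly 5% of the time.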

### Visualize Selection Criteria

Plot the criterion values for each method to understand how they select the optimal order.

In [None]:
# Plot selection criteria
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

orders = np.arange(1, max_order + 1)

for i, method in enumerate(methods):
    ax = axes[i]
    values = criterion_values[method]
    optimal_order = results[method]
    
    # Plot criterion values
    ax.plot(orders, values, 'o-', linewidth=2, markersize=4, label=f'{method.upper()} values')
    
    # Mark optimal order
    ax.axvline(x=optimal_order, color='r', linestyle='--', alpha=0.7, 
               label=f'Optimal order: {optimal_order}')
    
    # Special handling for different methods
    if method == 'paf':
        # Add approximate 95% confidence bounds for white noise (±2/sqrt(N))
        k = 2.0 / np.sqrt(t)
        ax.axhline(y=k, color='g', linestyle=':', alpha=0.7, label=f'Upper limit: +{k:.4f}')
        ax.axhline(y=-k, color='g', linestyle=':', alpha=0.7, label=f'Lower limit: -{k:.4f}')
        ax.set_ylabel('Last AR Parameter')
        ax.set_title(f'{method.upper()} Method\n(Partial Autocorrelation Function)')
        
    elif method == 'svd':
        ax.set_ylabel('Singular Values')
        ax.set_title(f'{method.upper()} Method\n(Singular Value Decomposition)')
        ax.set_yscale('log')  # Log scale for better visualization
        
    elif method in ['aic', 'bic']:
        # Mark minimum for information criteria
        min_idx = np.argmin(values)
        ax.plot(orders[min_idx], values[min_idx], 'ro', markersize=8, 
                label=f'Minimum: {values[min_idx]:.2f}')
        ax.set_ylabel(f'{method.upper()} Value')
        ax.set_title(f'{method.upper()} Method\n({"Akaike" if method == "aic" else "Bayesian"} Information Criterion)')
        
    else:  # rms
        ax.set_ylabel('RMS Error')
        ax.set_title(f'{method.upper()} Method\n(Root Mean Square Error)')
    
    ax.set_xlabel('AR Model Order')
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.set_xlim([1, max_order])

# Remove empty subplot
axes[-1].remove()

plt.tight_layout()
plt.show()
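
As a rough illustration of what the singular value plot measures, the sketch below builds a lagged (Hankel) data matrix and counts the singular values that exceed a tolerance relative to the largest one. It uses a sum of two sinusoids with light noise, for which the noise-free lagged matrix has rank 4 (two per sinusoid). The matrix construction, the thresholding rule, and the tolerance value are assumptions for illustration and may differ from what `ar_model_order` does internally.

```python
import numpy as np


def lagged_matrix(x, n_cols):
    """Stack overlapping windows of length n_cols as rows (a Hankel matrix)."""
    x = np.asarray(x, dtype=float)
    n_rows = len(x) - n_cols + 1
    return np.column_stack([x[k : k + n_rows] for k in range(n_cols)])


def svd_order(x, max_order, tol):
    """Assumed rule: count normalized singular values larger than tol."""
    singular_values = np.linalg.svd(lagged_matrix(x, max_order), compute_uv=False)
    normalized = singular_values / singular_values[0]
    return int(np.sum(normalized > tol)), normalized


# Two sinusoids plus small measurement noise (hypothetical test signal).
rng = np.random.default_rng(2)
n = np.arange(2000)
signal = (
    np.sin(2 * np.pi * 0.08 * n)
    + 0.5 * np.sin(2 * np.pi * 0.21 * n)
    + 0.01 * rng.standard_normal(n.size)
)

order, normalized = svd_order(signal, max_order=12, tol=0.05)
print(f"Estimated rank: {order}")
print(f"Normalized singular values: {np.round(normalized, 4)}")
```

On measured data the drop between "signal" and "noise" singular values is usually less sharp than in this synthetic case, which is why a log-scaled plot helps.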

### Compare AR Models at Different Orders

Fit AR models at the orders suggested by different methods and compare their prediction performance.

In [None]:
# Select a few representative orders for comparison
comparison_orders = [results['aic'], results['bic'], results['paf']]
comparison_orders = sorted(set(comparison_orders))  # Remove duplicates and sort

print(f"Comparing AR models at orders: {comparison_orders}")

# Prepare data for AR modeling (need 3D format)
data_3d = channel_5_baseline.reshape(-1, 1, 1)  # (TIME, CHANNELS, INSTANCES)

# Fit AR models and compute performance metrics
model_performance = {}

for order in comparison_orders:
    print(f"\nFitting AR({order}) model...")
    
    # Fit AR model
    ar_params_fv, rms_fv, ar_params, ar_residuals, ar_prediction = ar_model(data_3d, order)
    
    # Compute performance metrics
    prediction_error = np.mean(ar_residuals[:, 0, 0]**2)  # MSE
    prediction_rmse = np.sqrt(prediction_error)
    
    # R-squared (coefficient of determination)
    total_variance = np.var(channel_5_baseline)
    residual_variance = np.var(ar_residuals[:, 0, 0])
    r_squared = 1 - (residual_variance / total_variance)
    
    model_performance[order] = {
        'mse': prediction_error,
        'rmse': prediction_rmse,
        'r_squared': r_squared,
        'residuals': ar_residuals[:, 0, 0],
        'prediction': ar_prediction[:, 0, 0]
    }
    
    print(f"  MSE: {prediction_error:.6f}")
    print(f"  RMSE: {prediction_rmse:.6f}")
    print(f"  R-squared: {r_squared:.6f}")

# Display comparison table
print("\n" + "="*60)
print("AR MODEL PERFORMANCE COMPARISON")
print("="*60)
print(f"{'Order':<8} {'MSE':<12} {'RMSE':<12} {'R-squared':<12}")
print("-" * 60)
for order in comparison_orders:
    perf = model_performance[order]
    print(f"{order:<8} {perf['mse']:<12.6f} {perf['rmse']:<12.6f} {perf['r_squared']:<12.6f}")
print("="*60)
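
The RMS approach can be sketched the same way: in-sample RMS error never increases as the order grows, so a usable rule needs a stopping threshold rather than a plain minimum. The example below assumes one such rule (stop at the smallest order whose successor improves the RMS error by less than a relative tolerance). The rule, the tolerance, and the helper names are hypothetical and not taken from `ar_model_order`.

```python
import numpy as np


def rms_by_order(x, max_order):
    """In-sample RMS one-step prediction error for AR orders 1..max_order."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    target = x[max_order:]  # same targets for every order
    rms = np.empty(max_order)
    for p in range(1, max_order + 1):
        lagged = np.column_stack(
            [x[max_order - k - 1 : n - k - 1] for k in range(p)]
        )
        coeffs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
        rms[p - 1] = np.sqrt(np.mean((target - lagged @ coeffs) ** 2))
    return rms


def rms_order(rms, tol):
    """Assumed rule: smallest order whose successor improves RMS by < tol (relative)."""
    gains = (rms[:-1] - rms[1:]) / rms[:-1]
    below = np.flatnonzero(gains < tol)
    return int(below[0]) + 1 if below.size else len(rms)


# Synthetic AR(2) signal (hypothetical stand-in for the measured channel).
rng = np.random.default_rng(3)
noise = rng.standard_normal(4200)
signal = np.zeros(4200)
for i in range(2, 4200):
    signal[i] = 1.5 * signal[i - 1] - 0.75 * signal[i - 2] + noise[i]
signal = signal[200:]

rms = rms_by_order(signal, max_order=10)
selected = rms_order(rms, tol=0.05)
print(f"RMS by order: {np.round(rms, 4)}")
print(f"Selected order: {selected}")
```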

### Visualize Model Predictions and Residuals

Compare the prediction accuracy and residual patterns for different AR model orders.

In [None]:
# Plot model predictions (zoomed view)
zoom_start = 1000
zoom_end = 1500
zoom_range = slice(zoom_start, zoom_end)
time_zoom = np.arange(zoom_start + 1, zoom_end + 1)

fig, axes = plt.subplots(len(comparison_orders), 1, figsize=(14, 4*len(comparison_orders)))
if len(comparison_orders) == 1:
    axes = [axes]

for i, order in enumerate(comparison_orders):
    ax = axes[i]
    
    # Plot original signal and prediction
    ax.plot(time_zoom, channel_5_baseline[zoom_range], 'k-', linewidth=1.5, 
            label='Original Signal', alpha=0.8)
    ax.plot(time_zoom, model_performance[order]['prediction'][zoom_range], 'r--', 
            linewidth=1.5, label=f'AR({order}) Prediction', alpha=0.8)
    
    ax.set_title(f'AR({order}) Model Prediction (Points {zoom_start+1}-{zoom_end})', 
                 fontweight='bold')
    ax.set_xlabel('Time Points')
    ax.set_ylabel('Acceleration (g)')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # Add performance metrics as text
    perf = model_performance[order]
    ax.text(0.02, 0.98, f"R² = {perf['r_squared']:.4f}\nRMSE = {perf['rmse']:.6f}", 
            transform=ax.transAxes, verticalalignment='top',
            bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

In [None]:
# Plot residuals analysis
fig, axes = plt.subplots(len(comparison_orders), 2, figsize=(16, 4*len(comparison_orders)))
if len(comparison_orders) == 1:
    axes = axes.reshape(1, -1)

for i, order in enumerate(comparison_orders):
    residuals = model_performance[order]['residuals']
    
    # Time series plot of residuals
    axes[i, 0].plot(time_zoom, residuals[zoom_range], 'b-', linewidth=1.0, alpha=0.7)
    axes[i, 0].axhline(y=0, color='k', linestyle='--', alpha=0.5)
    axes[i, 0].set_title(f'AR({order}) Residuals (Points {zoom_start+1}-{zoom_end})')
    axes[i, 0].set_xlabel('Time Points')
    axes[i, 0].set_ylabel('Residual')
    axes[i, 0].grid(True, alpha=0.3)
    
    # Histogram of residuals
    axes[i, 1].hist(residuals, bins=50, alpha=0.7, color='blue', density=True)
    axes[i, 1].set_title(f'AR({order}) Residuals Distribution')
    axes[i, 1].set_xlabel('Residual Value')
    axes[i, 1].set_ylabel('Density')
    axes[i, 1].grid(True, alpha=0.3)
    
    # Add statistics
    mean_res = np.mean(residuals)
    std_res = np.std(residuals)
    axes[i, 1].text(0.02, 0.98, f"Mean = {mean_res:.6f}\nStd = {std_res:.6f}", 
                    transform=axes[i, 1].transAxes, verticalalignment='top',
                    bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()
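
Beyond the histogram, a quick numerical check is whether the residuals look like white noise. The sketch below computes the residual autocorrelation and reports the fraction of lags outside the approximate 95% band `±2/sqrt(N)`; a well-specified model should leave only a few lags outside it. The function names are hypothetical, and the two synthetic series merely stand in for "good" and "underfitted" residuals.

```python
import numpy as np


def autocorrelation(x, max_lag):
    """Sample autocorrelation at lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])


def fraction_outside_bounds(residuals, max_lag=20):
    """Fraction of lags whose autocorrelation exceeds 2/sqrt(N) in magnitude."""
    bound = 2.0 / np.sqrt(len(residuals))
    acf = autocorrelation(residuals, max_lag)
    return float(np.mean(np.abs(acf) > bound)), acf


rng = np.random.default_rng(4)
white = rng.standard_normal(4000)  # stand-in for residuals of an adequate model

# Correlated series: stand-in for residuals of a model with too low an order.
colored = np.zeros(4000)
for i in range(1, 4000):
    colored[i] = 0.8 * colored[i - 1] + white[i]

frac_white, acf_white = fraction_outside_bounds(white)
frac_colored, acf_colored = fraction_outside_bounds(colored)
print(f"White residuals:   {frac_white:.0%} of lags outside the band")
print(f"Colored residuals: {frac_colored:.0%} of lags outside the band")
```

Applied to this notebook, `model_performance[order]['residuals']` could be passed to `fraction_outside_bounds` for each compared order.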

### Method Comparison and Recommendations

Analyze the results and provide guidance on method selection.

In [None]:
# Create summary comparison
print("\n" + "="*80)
print("AR MODEL ORDER SELECTION METHOD COMPARISON")
print("="*80)

print("\nMethod Characteristics:")
print("-" * 40)
print("AIC (Akaike Information Criterion):")
print(f"  • Suggested order: {results['aic']}")
print("  • Balances model fit vs. complexity")
print("  • Tends to select higher orders than BIC")
print("  • Good for prediction applications\n")

print("BIC (Bayesian Information Criterion):")
print(f"  • Suggested order: {results['bic']}")
print("  • More conservative than AIC")
print("  • Stronger penalty for model complexity")
print("  • Good for model selection and interpretation\n")

print("PAF (Partial Autocorrelation Function):")
print(f"  • Suggested order: {results['paf']}")
print("  • Based on statistical significance testing")
print("  • Uses confidence intervals for white noise")
print("  • Traditional time series approach\n")

print("SVD (Singular Value Decomposition):")
print(f"  • Suggested order: {results['svd']}")
print("  • Analyzes rank structure of data matrix")
print("  • Good for identifying dominant patterns")
print("  • Robust to noise\n")

print("RMS (Root Mean Square Error):")
print(f"  • Suggested order: {results['rms']}")
print("  • Based on prediction error improvement")
print("  • Simple and intuitive approach")
print("  • May overfit with high orders\n")

# Consensus analysis
orders_list = list(results.values())
unique_orders = list(set(orders_list))
order_counts = {order: orders_list.count(order) for order in unique_orders}

print("Order Selection Consensus:")
print("-" * 30)
for order in sorted(unique_orders):
    count = order_counts[order]
    methods_for_order = [method for method, selected_order in results.items() if selected_order == order]
    print(f"Order {order:2d}: Selected by {count} method(s) - {', '.join(methods_for_order).upper()}")

print("\nRecommendations:")
print("-" * 20)
most_common_order = max(order_counts, key=order_counts.get)
if order_counts[most_common_order] > 1:
    print(f"• Consensus choice: AR({most_common_order}) - selected by multiple methods")
else:
    print("• No clear consensus - consider application-specific requirements")

print(f"• Conservative choice: AR({results['bic']}) (BIC method)")
print(f"• Prediction-focused: AR({results['aic']}) (AIC method)")
print(f"• Traditional approach: AR({results['paf']}) (PAF method)")

print("\n• For SHM applications: Lower orders (5-15) often sufficient for damage detection")
print("• Consider computational cost vs. performance trade-offs")
print("• Validate with actual damage detection performance if possible")

print("="*80)

## Summary

This example demonstrated AR model order selection using five methods:

1. **AIC**: Information criterion that balances model fit against complexity
2. **BIC**: Information criterion with a stronger complexity penalty than AIC
3. **PAF**: Statistical significance testing of AR parameters
4. **SVD**: Rank analysis of the data matrix
5. **RMS**: Improvement in prediction error as the order increases

**Key Insights:**

- **No Universal "Best" Method**: Different methods can suggest different optimal orders
- **Application-Dependent Choice**: The optimal method depends on your specific use case:
  - **Damage Detection**: Lower orders often sufficient (5-15)
  - **Prediction**: AIC often performs well
  - **Model Interpretation**: BIC provides more conservative estimates
  - **Noise Robustness**: SVD can be more robust to measurement noise

- **Practical Considerations**:
  - Higher orders increase computational cost
  - Overfitting risk with very high orders
  - Cross-validation can help validate order choice
  - Consider multiple methods and look for consensus
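
One simple form of the validation mentioned above is a hold-out split: fit each candidate order on the first part of the record and score one-step predictions on the remainder. The following is a minimal sketch under that assumption; the 70/30 split, the helper names, and the synthetic AR(2) signal are hypothetical choices, not part of SHMTools.

```python
import numpy as np


def lagged_design(x, order, first_target):
    """Return (lagged regressors, targets) for predicting x[first_target:]."""
    n = len(x)
    lagged = np.column_stack(
        [x[first_target - k - 1 : n - k - 1] for k in range(order)]
    )
    return lagged, x[first_target:]


def holdout_rmse(x, max_order, train_fraction=0.7):
    """Validation RMSE of one-step predictions for AR orders 1..max_order."""
    x = np.asarray(x, dtype=float)
    split = int(train_fraction * len(x))
    train, valid = x[:split], x[split:]
    scores = np.empty(max_order)
    for p in range(1, max_order + 1):
        a_train, y_train = lagged_design(train, p, first_target=max_order)
        coeffs, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
        # Score on held-out samples using coefficients from the training part.
        a_valid, y_valid = lagged_design(valid, p, first_target=max_order)
        scores[p - 1] = np.sqrt(np.mean((y_valid - a_valid @ coeffs) ** 2))
    return scores


# Synthetic AR(2) signal (hypothetical stand-in for the measured channel).
rng = np.random.default_rng(5)
noise = rng.standard_normal(4200)
signal = np.zeros(4200)
for i in range(2, 4200):
    signal[i] = 1.5 * signal[i - 1] - 0.75 * signal[i - 2] + noise[i]
signal = signal[200:]

scores = holdout_rmse(signal, max_order=8)
best_order = int(np.argmin(scores)) + 1
print(f"Validation RMSE by order: {np.round(scores, 4)}")
print(f"Order with lowest validation RMSE: {best_order}")
```

Unlike in-sample error, validation error can rise again at high orders, so its minimum gives a direct check on overfitting.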

**For Structural Health Monitoring:**

The choice of AR order affects the quality of damage-sensitive features. This example provides a systematic approach to selecting an appropriate order based on data characteristics and application requirements. Order selection can serve as a preprocessing step before applying outlier detection algorithms such as those demonstrated in the PCA, Mahalanobis, SVD, and Factor Analysis examples.

**See also:**
- [Outlier Detection based on Principal Component Analysis](pca_outlier_detection.ipynb)
- [Outlier Detection based on Mahalanobis Distance](mahalanobis_outlier_detection.ipynb)
- [Outlier Detection based on Singular Value Decomposition](svd_outlier_detection.ipynb)
- [Outlier Detection based on Factor Analysis](../intermediate/factor_analysis_outlier_detection.ipynb)