# 🔬 W02 - BI Fusion Pipeline Validation
**Objective**: Test the complete preprocessing pipeline with BI integration (Section III-B of the methodology).

**Pipeline under test:**
1. Load C-MAPSS sensor data → `data_loader.py`
2. Normalize sensor features (MinMax) → `preprocessing.py`
3. Fuse with BI data (forward-fill, one-hot encoding) → `bi_fusion.py` (Section III-B3)
4. BI-aware feature selection (variance + correlation) → `feature_selection.py` (Section III-B4)
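
Step 3 can be pictured with a toy example. The sketch below is not the `bi_fusion.py` implementation. It assumes BI records arrive only every few cycles, are aligned to the per-cycle sensor grid with a left merge, forward-filled within each unit, and then one-hot encoded with `pd.get_dummies`. All column names and values are invented for illustration.

```python
import pandas as pd

# Toy per-cycle sensor table for one unit (hypothetical values)
sensors = pd.DataFrame({
    "unit": [1] * 6,
    "cycle": [1, 2, 3, 4, 5, 6],
    "s2": [0.10, 0.12, 0.15, 0.19, 0.24, 0.30],
})

# Toy BI table: one record every 3 cycles (hypothetical update period)
bi = pd.DataFrame({
    "unit": [1, 1],
    "cycle": [1, 4],
    "shift_pattern": ["day", "night"],
    "pm_cost": [1000.0, 1200.0],
})

# Align BI rows to the sensor grid, then carry the last known value forward
fused = sensors.merge(bi, on=["unit", "cycle"], how="left")
fused = fused.sort_values(["unit", "cycle"])
bi_cols = ["shift_pattern", "pm_cost"]
fused[bi_cols] = fused.groupby("unit")[bi_cols].ffill()

# One-hot encode the categorical BI variable
fused = pd.get_dummies(fused, columns=["shift_pattern"], dtype=int)
print(fused)
```

Forward-filling inside `groupby("unit")` keeps one unit's last BI value from leaking into the first cycles of the next unit.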

**Author**: Fatima Khadija Benzine  
**Date**: February 2026

---
## 0. Setup

In [None]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Navigate up to project root so we can import from src/
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))
sys.path.insert(0, str(project_root / 'src'))

print(f"Project root: {project_root}")
print(f"Expected src/: {project_root / 'src'}")
print(f"Expected data/: {project_root / 'data'}")

# Verify files exist
for f in ['src/data_loader.py', 'src/bi_fusion.py', 'src/feature_selection.py', 'src/preprocessing.py']:
    p = project_root / f
    status = '✓' if p.exists() else '✗ MISSING'
    print(f"  {status}  {f}")

plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
%matplotlib inline

In [None]:
from data_loader import MultiDatasetLoader
from preprocessing import PreprocessingPipelineBI
from bi_fusion import BIFusionPipeline, BIDataLoader, BI_DELTA_CONFIG
from feature_selection import BIAwareFeatureSelector

print("All modules imported successfully ✓")

---
## 1. Load Raw Data

In [None]:
loader = MultiDatasetLoader()

print("Loading FD001...")
fd001 = loader.load_fd001()
train_raw = fd001['train']
test_raw = fd001['test']

print(f"\nTrain: {train_raw.shape}")
print(f"Test:  {test_raw.shape}")
print(f"Units: {train_raw['unit'].nunique()} train, {test_raw['unit'].nunique()} test")
print(f"Columns: {list(train_raw.columns[:8])}...")
train_raw.head()

---
## 2. Inspect BI Data (before fusion)

In [None]:
bi_loader = BIDataLoader()
bi_df = bi_loader.load_bi('FD001')

print(f"BI data shape: {bi_df.shape}")
print(f"\nColumns: {list(bi_df.columns)}")
print(f"\nUpdate frequencies (Delta_k):")
print(bi_loader.get_delta_summary().to_string(index=False))

bi_df.head(10)

In [None]:
# Visualize Delta behavior: BI variables stay constant within each Delta_k period
unit1_bi = bi_df[bi_df['unit_id'] == 1].set_index('cycle')

fig, axes = plt.subplots(3, 1, figsize=(14, 8), sharex=True)

# Delta=10: production_priority
axes[0].step(unit1_bi.index, unit1_bi['production_priority'], where='post', linewidth=1.5)
axes[0].set_ylabel('Production\nPriority')
axes[0].set_title('Δ_k = 10 cycles (MES - per shift)', fontsize=11, loc='left')
axes[0].set_yticks([0, 1, 2])
axes[0].set_yticklabels(['Low', 'Med', 'High'])

# Delta=25: spare_parts_available
axes[1].step(unit1_bi.index, unit1_bi['spare_parts_available'], where='post', 
             linewidth=1.5, color='tab:orange')
axes[1].set_ylabel('Spare Parts\nAvailable')
axes[1].set_title('Δ_k = 25 cycles (ERP/Inventory - daily/weekly)', fontsize=11, loc='left')
axes[1].set_yticks([0, 1])
axes[1].set_yticklabels(['No', 'Yes'])

# Delta=50: pm_cost
axes[2].step(unit1_bi.index, unit1_bi['pm_cost'], where='post', 
             linewidth=1.5, color='tab:green')
axes[2].set_ylabel('PM Cost ($)')
axes[2].set_title('Δ_k = 50 cycles (ERP - monthly)', fontsize=11, loc='left')
axes[2].set_xlabel('Cycle (= flight)')

fig.suptitle('Multi-Rate BI Variables - Unit 1 (FD001)', fontsize=13, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
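
The step plots give a visual check; the same property can also be tested numerically. The helper below is a hypothetical sketch and is not part of `bi_fusion.py`. It assumes cycles start at 1 within each unit and that a variable with update period Δ_k may change only where `(cycle - 1) % delta == 0`; the real alignment convention may differ.

```python
import pandas as pd

def changes_only_on_grid(df, col, delta, unit_col="unit_id", cycle_col="cycle"):
    """Return True if `col` changes value only at update boundaries."""
    df = df.sort_values([unit_col, cycle_col])
    # Previous value within the same unit (NaN on each unit's first row)
    prev = df.groupby(unit_col)[col].shift(1)
    changed = prev.notna() & (df[col] != prev)
    on_grid = (df[cycle_col] - 1) % delta == 0
    # Every change must coincide with a grid point
    return bool((~changed | on_grid).all())

toy = pd.DataFrame({
    "unit_id": [1] * 6,
    "cycle": [1, 2, 3, 4, 5, 6],
    "pm_cost": [100, 100, 100, 120, 120, 120],
})
print(changes_only_on_grid(toy, "pm_cost", delta=3))  # True
print(changes_only_on_grid(toy, "pm_cost", delta=4))  # False
```

On the real data, the same call could be repeated for each variable with its configured period, provided the column names match.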

---
## 3. Validate Degradation Correlation
Per the mixed correlation design: `production_priority` and `downtime_penalty` should increase as the engine approaches failure.
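
One way to quantify "increases as the engine approaches failure" is a rank correlation between the health index and the BI variable: a clearly negative Spearman coefficient means the variable rises as health falls. The sketch below runs on synthetic data, with the health index defined as `1 - cycle / max_cycle`. The drift and noise levels are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic unit: 100 cycles, penalty drifts upward as the unit degrades
cycle = np.arange(1, 101)
health_index = 1.0 - cycle / cycle.max()
downtime_penalty = 500 + 300 * (1 - health_index) + rng.normal(0, 20, size=cycle.size)

toy = pd.DataFrame({"health_index": health_index, "downtime_penalty": downtime_penalty})

# Spearman correlation = Pearson correlation of the ranks.
# Being rank-based, it also suits ordinal variables such as a priority level.
rho = toy["health_index"].rank().corr(toy["downtime_penalty"].rank())
print(f"Spearman rho: {rho:.2f}")
```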

In [None]:
# Add health index to BI data for analysis
bi_with_hi = bi_df.copy()
max_cycles = bi_with_hi.groupby('unit_id')['cycle'].transform('max')
bi_with_hi['health_index'] = 1.0 - (bi_with_hi['cycle'] / max_cycles)
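# include_lowest=True keeps the final cycle (health_index == 0) in 'Near Failure'
# instead of leaving it as NaN
bi_with_hi['life_phase'] = pd.cut(bi_with_hi['health_index'],
                                   bins=[0, 0.3, 0.7, 1.0],
                                   labels=['Near Failure', 'Mid Life', 'Healthy'],
                                   include_lowest=True)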

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Production priority distribution by life phase
phase_priority = bi_with_hi.groupby('life_phase')['production_priority'].value_counts(normalize=True)
phase_priority = phase_priority.unstack(fill_value=0)
phase_priority.columns = ['Low', 'Medium', 'High']
phase_priority.loc[['Healthy', 'Mid Life', 'Near Failure']].plot(
    kind='bar', stacked=True, ax=axes[0], color=['#66c2a5', '#fc8d62', '#e63946'])
axes[0].set_title('Production Priority by Life Phase', fontweight='bold')
axes[0].set_ylabel('Proportion')
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)
axes[0].legend(title='Priority')

# Downtime penalty by life phase
order = ['Healthy', 'Mid Life', 'Near Failure']
bi_with_hi['life_phase'] = pd.Categorical(bi_with_hi['life_phase'], categories=order, ordered=True)
sns.boxplot(data=bi_with_hi, x='life_phase', y='downtime_penalty', ax=axes[1],
            order=order, palette=['#66c2a5', '#fc8d62', '#e63946'])
axes[1].set_title('Downtime Penalty by Life Phase', fontweight='bold')
axes[1].set_ylabel('Downtime Penalty ($/hr)')
axes[1].set_xlabel('')

fig.suptitle('Degradation-Correlated BI Variables - FD001', fontsize=13, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("Mean production_priority by phase:")
print(bi_with_hi.groupby('life_phase')['production_priority'].mean())
print("\nMean downtime_penalty by phase:")
print(bi_with_hi.groupby('life_phase')['downtime_penalty'].mean())

---
## 4. Run Full Pipeline

In [None]:
pipeline = PreprocessingPipelineBI(
    normalization_method='minmax',
    variance_threshold=0.01,
    correlation_threshold=0.95,
    rul_max=125,
)

train_processed = pipeline.fit_transform(train_raw, 'FD001')
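
The `rul_max=125` argument corresponds to the piecewise-linear RUL target (III-B2): RUL is held at the cap early in life and then decreases linearly. A minimal sketch of that labelling follows. It assumes each training unit runs to failure, so RUL reaches zero at the last observed cycle, and it uses the hypothetical column names `unit`, `cycle` and `rul`. The actual logic lives in `preprocessing.py` and may differ in detail.

```python
import pandas as pd

def add_clipped_rul(df, rul_max=125):
    """Label each row with min(max_cycle - cycle, rul_max), computed per unit."""
    out = df.copy()
    max_cycle = out.groupby("unit")["cycle"].transform("max")
    out["rul"] = (max_cycle - out["cycle"]).clip(upper=rul_max)
    return out

# One toy unit that fails after 200 cycles
toy = pd.DataFrame({"unit": [1] * 200, "cycle": range(1, 201)})
labelled = add_clipped_rul(toy)

# Cycles 1, 75, 76 and 200
print(labelled["rul"].iloc[[0, 74, 75, 199]].tolist())  # [125, 125, 124, 0]
```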

In [None]:
# Apply to test data
test_processed = pipeline.transform(test_raw, 'FD001')

print(f"Train processed: {train_processed.shape}")
print(f"Test processed:  {test_processed.shape}")
train_processed.head()
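
Calling `fit_transform` on the training split and `transform` on the test split matters for normalization: the minimum and maximum should come from the training data only, otherwise test statistics leak into the features. The sketch below shows the idea in plain pandas on invented numbers. How `DataNormalizer` implements it internally is not shown here.

```python
import pandas as pd

train = pd.DataFrame({"s2": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"s2": [15.0, 35.0]})

# "fit": learn the range from the training split only
lo, hi = train.min(), train.max()

# "transform": apply the same range to both splits
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled["s2"].tolist())  # [0.0, 0.5, 1.0]
print(test_scaled["s2"].tolist())   # [0.25, 1.25]
```

Test values can fall outside [0, 1] when they exceed the training range, as the second test row does.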

In [None]:
# Feature selection report
report = pipeline.get_selection_report()
groups = pipeline.get_feature_groups()

print("=== Feature Selection Report ===")
print(f"Sensor/setting features selected: {len(groups['sensor']) + len(groups['setting'])}")
print(f"  Sensors: {groups['sensor']}")
print(f"  Settings: {groups['setting']}")
print(f"\nBI features selected: {len(groups['bi'])}")
print(f"  {groups['bi']}")
print(f"\nRemoved (low variance): {report.get('variance_removed', [])}")
print(f"Removed (high correlation): {report.get('correlation_removed', [])}")

---
## 5. Visualize Fused Data

In [None]:
# Show one unit's fused features over its lifetime
unit_id = 1
unit_data = train_processed[train_processed['unit'] == unit_id].set_index('cycle')

# Pick a few representative features from each group
sensor_to_plot = [c for c in groups['sensor'] if c in unit_data.columns][:3]
bi_to_plot = [c for c in groups['bi'] if c in unit_data.columns 
              and not any(c.startswith(p) for p in ['production_priority_', 'shift_pattern_'])][:3]

fig, axes = plt.subplots(3, 1, figsize=(14, 9), sharex=True)

# Sensor features (normalized)
for col in sensor_to_plot:
    axes[0].plot(unit_data.index, unit_data[col], label=col, alpha=0.8)
axes[0].set_ylabel('Normalized Value')
axes[0].set_title('Sensor Features (normalized, change every cycle)', fontsize=11, loc='left')
axes[0].legend(loc='upper left', fontsize=9)

# BI features (original scale)
for i, col in enumerate(bi_to_plot):
    axes[1].step(unit_data.index, unit_data[col], where='post', 
                 label=col, alpha=0.8, linewidth=1.5)
axes[1].set_ylabel('BI Value')
axes[1].set_title('BI Features (multi-rate, step function)', fontsize=11, loc='left')
axes[1].legend(loc='upper left', fontsize=9)

# RUL
axes[2].plot(unit_data.index, unit_data['rul'], color='red', linewidth=2)
axes[2].axhline(y=125, color='gray', linestyle='--', alpha=0.5, label='RUL_max=125')
axes[2].set_ylabel('RUL (cycles)')
axes[2].set_xlabel('Cycle (= flight)')
axes[2].set_title('Remaining Useful Life (clipped at 125)', fontsize=11, loc='left')
axes[2].legend(loc='upper right', fontsize=9)

fig.suptitle(f'Fused Sensor + BI Data - Unit {unit_id} (FD001)', 
             fontsize=13, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap: sensor vs BI features
all_features = groups['sensor'] + groups['setting'] + groups['bi']
# Keep only numeric columns that exist
plot_cols = [c for c in all_features if c in train_processed.columns]

corr = train_processed[plot_cols].corr()

fig, ax = plt.subplots(figsize=(16, 12))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, cmap='RdBu_r', center=0, 
            vmin=-1, vmax=1, ax=ax, square=True,
            linewidths=0.5, cbar_kws={'shrink': 0.6})
ax.set_title('Feature Correlation Matrix (Sensor + BI)', fontweight='bold', fontsize=13)
plt.tight_layout()
plt.show()

---
## 6. Cost Realism Check

In [None]:
bi_full = bi_loader.load_bi('FD001')

print("=== BI Data Statistics (FD001) ===")
cost_cols = ['pm_cost', 'cm_cost', 'downtime_penalty', 'revenue_per_hour']
print(bi_full[cost_cols].describe().round(1))

cm_pm_ratio = bi_full['cm_cost'].mean() / bi_full['pm_cost'].mean()
print(f"\nCM/PM cost ratio: {cm_pm_ratio:.1f}x (industry standard: 5-10x)")

---
## 7. Multi-Dataset Validation (FD001-FD004)

In [None]:
results = {}

for ds_name in ['FD001', 'FD002', 'FD003', 'FD004']:
    print(f"\n{'='*60}")
    try:
        ds = loader.load_cmapss_dataset(ds_name)
        p = PreprocessingPipelineBI(rul_max=125)
        train_p = p.fit_transform(ds['train'], ds_name)
        test_p = p.transform(ds['test'], ds_name)
        g = p.get_feature_groups()
        results[ds_name] = {
            'train_shape': train_p.shape,
            'test_shape': test_p.shape,
            'n_sensor': len(g['sensor']),
            'n_setting': len(g['setting']),
            'n_bi': len(g['bi']),
            'total_features': len(g['all']),
            'status': '✓',
        }
    except Exception as e:
        print(f"  ✗ Failed: {e}")
        results[ds_name] = {'status': f'✗ {e}'}

print(f"\n{'='*60}")
print("\n=== Summary ===")
summary_df = pd.DataFrame(results).T
summary_df

---
## 8. Methodology Alignment Summary

This section maps each pipeline step to the corresponding section in the thesis methodology.

| Pipeline Step | Code Module | Methodology Section | Equation(s) | Status |
|:---|:---|:---|:---|:---|
| Data loading (C-MAPSS) | `data_loader.py` | III-B1 (Data Acquisition) | - | ✓ |
| RUL calculation & clipping | `preprocessing.py` | III-B2 (Piecewise Linear RUL) | Eq. 1-2 | ✓ |
| BI data temporal alignment | `bi_fusion.py` | III-B3 (BI Data Fusion) | Eq. 3-4 | ✓ |
| Source-driven Δ_k per variable | `bi_fusion.py` (`BI_DELTA_CONFIG`) | III-B3 (Update Frequencies) | - | ✓ |
| One-hot encoding of categoricals | `bi_fusion.py` (`_encode_categoricals`) | III-B3 (Eq. 5) | Eq. 5 | ✓ |
| Feature-level fusion (concat) | `bi_fusion.py` (`fuse`) | III-B3 (Eq. 6) | Eq. 6 | ✓ |
| Min-Max normalization | `preprocessing.py` (`DataNormalizer`) | III-B4 (Normalization) | Eq. 7-8 | ✓ |
| Variance-based filtering (sensor only) | `feature_selection.py` | III-B4 (Eq. 10) | Eq. 10 | ✓ |
| BI exemption from variance filter | `feature_selection.py` | III-B4 (BI-aware selection) | - | ✓ |
| Correlation-based filtering | `feature_selection.py` | III-B4 (Eq. 12) | Eq. 12 | ✓ |
| BI prioritization in corr. removal | `feature_selection.py` | III-B4 (BI retention) | - | ✓ |
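
The four feature-selection rows can be illustrated with a toy sketch: apply the variance threshold to sensor columns only, then, for each highly correlated pair, drop the sensor column in preference to the BI column. This is an illustrative reimplementation on synthetic data, not the code in `feature_selection.py`. The function name, the rule of dropping the later column in a sensor-sensor pair, and the choice never to drop BI columns are all assumptions.

```python
import numpy as np
import pandas as pd

def select_features(df, sensor_cols, bi_cols, var_thr=0.01, corr_thr=0.95):
    # 1) Variance filter on sensors only; BI columns are exempt
    kept = [c for c in sensor_cols if df[c].var() > var_thr] + list(bi_cols)

    # 2) Correlation filter; in a sensor-BI pair the sensor is dropped
    corr = df[kept].corr().abs()
    dropped = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if not corr.loc[a, b] > corr_thr:  # also skips NaN (constant columns)
                continue
            if a in bi_cols and b in bi_cols:
                continue  # BI features are never dropped in this sketch
            dropped.add(b if b not in bi_cols else a)
    return [c for c in kept if c not in dropped]

rng = np.random.default_rng(0)
n = 200
a, b = rng.normal(size=n), rng.normal(size=n)
toy = pd.DataFrame({
    "s1": a,
    "s2": a + rng.normal(scale=0.01, size=n),           # near-duplicate of s1
    "s3": np.full(n, 0.5),                              # constant sensor
    "s4": b,
    "bi_cost": 2 * b + rng.normal(scale=0.01, size=n),  # tracks s4
    "bi_flag": np.ones(n),                              # constant BI column
})
selected = select_features(toy, ["s1", "s2", "s3", "s4"], ["bi_cost", "bi_flag"])
print(selected)  # ['s1', 'bi_cost', 'bi_flag']
```

Here `s3` fails the variance filter, `s2` is removed as a duplicate of `s1`, `s4` gives way to `bi_cost`, and the constant `bi_flag` survives because BI columns skip the variance filter.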

### What remains for the next steps:

| Next Step | Methodology Section | Status |
|:---|:---|:---|
| Sliding window generation | III-B5 (Sequence Construction) | ⬜ To do |
| GA hyperparameter optimization | III-B4 (Eq. 14-15) | ⬜ To do |
| Attention mechanism (BI-sensor weighting) | III-C2 (Eq. 16-18) | ⬜ To do |
| Hybrid ML+DL model (XGBoost + CNN-LSTM) | III-C3-C4 (Eq. 19-32) | ⬜ To do |
| SHAP explainability | III-D2 | ⬜ To do |
| Cost-sensitive decision support | III-D3-D4 (Eq. 45-49) | ⬜ To do |