# VSB Power Line Fault Detection (Kaggle Case Study)

Link to the Dataset: https://www.kaggle.com/competitions/vsb-power-line-fault-detection/overview

### Problem Description

Faults in electric transmission lines can lead to a destructive phenomenon called partial discharge.
If left alone, partial discharges can damage equipment to the point that it stops functioning entirely. Your challenge is to detect partial discharges so that repairs can be made before any lasting harm occurs.

Each signal contains 800,000 measurements of a power line's voltage, taken over 20 milliseconds. As the underlying electric grid operates at 50 Hz, this means each signal covers a single complete grid cycle. The grid itself operates on a 3-phase power scheme, and all three phases are measured simultaneously.
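
The figures in the paragraph above fix the sampling parameters of every signal. As a quick sanity check, the arithmetic can be sketched as follows; the variable names are illustrative and not part of the dataset.

```python
# Values taken from the problem description above.
n_samples = 800_000      # measurements per signal
duration_s = 0.020       # each signal spans 20 milliseconds
grid_hz = 50             # grid frequency

sampling_rate_hz = n_samples / duration_s    # about 40 MHz
nyquist_hz = sampling_rate_hz / 2            # about 20 MHz
freq_resolution_hz = 1 / duration_s          # FFT bin spacing, about 50 Hz
cycles_per_record = grid_hz * duration_s     # about one grid cycle per signal

print(f"Sampling rate:   {sampling_rate_hz / 1e6:.1f} MHz")
print(f"Nyquist limit:   {nyquist_hz / 1e6:.1f} MHz")
print(f"FFT bin spacing: {freq_resolution_hz:.1f} Hz")
print(f"Grid cycles:     {cycles_per_record:.1f}")
```

Because one record is exactly one grid cycle, the first non-zero FFT bin of any signal sits at the 50 Hz fundamental.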

### File Descriptions
1. metadata_[train/test].csv

* id_measurement: the ID code for a trio of signals recorded at the same time.
* signal_id: the foreign key for the signal data. Each signal ID is unique across both train and test, so the first ID in train is '0' but the first ID in test is '8712'.
* phase: the phase ID code within the signal trio. The phases may or may not all be impacted by a fault on the line.
* target: 0 if the power line is undamaged, 1 if there is a fault.
2. [train/test].parquet - The signal data. Each column contains one signal; 800,000 int8 measurements as exported with pyarrow.parquet version 0.11. 

### Objective
Detect partial discharge patterns in power line signals using ML classifiers, with the aim of reducing maintenance costs and preventing outages through automated monitoring.

## Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import zscore
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pyarrow.parquet as pq
from scipy import signal
from scipy.fft import fft, fftfreq
import plotly.graph_objects as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

In [None]:
DATA_PATH = r'./data/vsb-power-line-fault-detection'
CHUNK_SIZE = 1000
SAMPLE_SIZE = 1000

In [None]:
metadata_df  = pd.read_csv(f"{DATA_PATH}/metadata_train.csv")

In [None]:
metadata_df.shape

#### Visualizing Target Distribution

In [None]:
%matplotlib inline
plt.figure(figsize=(8, 6))
ax = sns.countplot(data = metadata_df, x='target', palette="viridis")
total = len(metadata_df)

# Add percentage labels to the bars
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height() / total)
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    ax.annotate(percentage, (x, y), ha='center', va='bottom')
    
plt.title('Target Distribution')
plt.xlabel('Target')
plt.ylabel('Count')
plt.show()

In [None]:
metadata_df['target'].value_counts()

#### We have an imbalanced dataset with 6% positive targets (faults)
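
With roughly 6% positives, a classifier can score high accuracy by predicting "no fault" everywhere, so some form of class weighting is usually worth considering. The sketch below shows one common option, scikit-learn's `balanced` class weights, on a hypothetical label vector with the same imbalance; it is an illustration, not a step taken in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with about the imbalance seen above (6 faults in 100 signals).
y = np.array([0] * 94 + [1] * 6)

fault_rate = y.mean()

# 'balanced' weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))

print(f"Fault rate: {fault_rate:.2%}")
print(f"Class weights: {class_weight}")
```

The resulting dictionary can be passed as the `class_weight` argument of scikit-learn estimators that support it.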

In [None]:
train_path = f"{DATA_PATH}/train.parquet"
test_path = f"{DATA_PATH}/test.parquet"

In [None]:
%%time 
signals_df = pd.read_parquet(train_path, engine='fastparquet')

In [None]:
signals_df.shape

In [None]:
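# Number of samples per signal (800,000). The cells below use this as a sampling rate in Hz,
# which treats each 20 ms record as 1 second; the physical rate is 800,000 / 0.02 s = 40 MHz,
# so plotted frequencies are in cycles per record (multiply by 50 to get Hz).
sampling_rate = signals_df.shape[0]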

### Missing Data

In [None]:
signal_missing = signals_df.isnull().sum()
signal_missing_pct = (signal_missing / len(signals_df)) * 100

total_missing_signals = signal_missing.sum()
total_cells_signals = signals_df.shape[0] * signals_df.shape[1]
overall_missing_pct_signals = (total_missing_signals / total_cells_signals) * 100

print(f"Signal Data Shape: {signals_df.shape}")
print(f"Total Missing Values: {total_missing_signals:,}")
print(f"Total Cells: {total_cells_signals:,}")
print(f"Overall Missing Percentage: {overall_missing_pct_signals:.4f}%")

In [None]:
signals_df.head()

### Visualizing Sample Signals

In [None]:
n_signals = 10
signal_ids = np.random.choice(signals_df.columns, n_signals, replace=False)
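# Normalized time axis: 0 to 1 spans the full 20 ms record, so axis values are fractions of the record
time_axis = np.linspace(0, 1, len(signals_df))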
fig, axes = plt.subplots(n_signals//2, 2, figsize=(15, 12))
axes = axes.flatten() if n_signals > 2 else [axes]

for i, sig_id in enumerate(signal_ids):
    if i >= len(axes):
        break
        
    signal_data = signals_df[str(sig_id)]
    
    # Get target label if available
    target_label = "Unknown"
    if metadata_df is not None and str(sig_id) in metadata_df['signal_id'].astype(str).values:
        target = metadata_df[metadata_df['signal_id'].astype(str) == str(sig_id)]['target'].iloc[0]
        target_label = "Fault" if target == 1 else "Normal"
    
    axes[i].plot(time_axis, signal_data, linewidth=0.8)
    axes[i].set_title(f'Signal {sig_id} - {target_label}')
    axes[i].set_xlabel('Time (seconds)')
    axes[i].set_ylabel('Amplitude')
    axes[i].grid(True, alpha=0.3)
    
    # Add basic stats to plot
    axes[i].text(0.02, 0.98, f'Mean: {signal_data.mean():.4f}\nStd: {signal_data.std():.4f}', 
                transform=axes[i].transAxes, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.suptitle('Sample Power Line Signals', y=1.02, fontsize=16)
plt.show()

#### Visualizing Fault Signals

In [None]:
positive_signal_ids = metadata_df.loc[metadata_df['target']==1].signal_id.unique()[:10]

In [None]:
signal_ids = positive_signal_ids
fig, axes = plt.subplots(n_signals//2, 2, figsize=(15, 12))
axes = axes.flatten() if n_signals > 2 else [axes]

for i, sig_id in enumerate(signal_ids):
    if i >= len(axes):
        break
        
    signal_data = signals_df[str(sig_id)]
    
    # Get target label if available
    target_label = "Unknown"
    if metadata_df is not None and str(sig_id) in metadata_df['signal_id'].astype(str).values:
        target = metadata_df[metadata_df['signal_id'].astype(str) == str(sig_id)]['target'].iloc[0]
        target_label = "Fault" if target == 1 else "Normal"
    
    axes[i].plot(time_axis, signal_data, linewidth=0.8)
    axes[i].set_title(f'Signal {sig_id} - {target_label}')
    axes[i].set_xlabel('Time (seconds)')
    axes[i].set_ylabel('Amplitude')
    axes[i].grid(True, alpha=0.3)
    
    # Add basic stats to plot
    axes[i].text(0.02, 0.98, f'Mean: {signal_data.mean():.4f}\nStd: {signal_data.std():.4f}', 
                transform=axes[i].transAxes, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.suptitle('Sample Fault Signals', y=1.02, fontsize=16)
plt.show()

* Fault signals show high variability and diverse characteristics, from high-frequency noise (Signals 3 and 5) to dramatic spikes (Signal 228, with ±100 amplitude excursions) to gradual trends (Signal 202), indicating different types of electrical faults.
* Standard deviation is consistently high for these fault signals, between 12 and 15.
* Extreme amplitude events with sharp transient spikes are common in fault signals (Signal 4 near 0.2 on the plotted time axis, Signal 228's massive spikes).
* __Peak detection and amplitude threshold features could be strong discriminative features__
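
One way to turn the peak-detection idea into a feature is sketched below on a synthetic signal. The helper name `find_transient_peaks`, the moving-average baseline, and all thresholds are assumptions chosen for illustration; real signals are int8 and should be cast to float first, as done here.

```python
import numpy as np
from scipy.signal import find_peaks

def find_transient_peaks(x, baseline_window=101, min_height=10.0, min_distance=50):
    """Return indices of sharp spikes after removing a moving-average baseline."""
    x = np.asarray(x, dtype=np.float64)          # avoid int8 overflow
    kernel = np.ones(baseline_window) / baseline_window
    baseline = np.convolve(x, kernel, mode="same")
    residual = np.abs(x - baseline)              # what is left is fast content
    peaks, _ = find_peaks(residual, height=min_height, distance=min_distance)
    return peaks

# Synthetic example: one slow sine cycle with three injected spikes.
n = 8000
t = np.arange(n) / n
synthetic = 20 * np.sin(2 * np.pi * t)
spike_positions = [1000, 3000, 5000]
synthetic[spike_positions] += 40

peaks = find_transient_peaks(synthetic)
print(f"Detected {len(peaks)} peaks at {peaks.tolist()}")
```

The peak count and the largest residual amplitude per signal are two simple candidate features that follow from this.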

### Normal and Fault Signal Comparison

In [None]:
normal_signals = metadata_df[metadata_df['target'] == 0]['signal_id'].sample(3).tolist()
fault_signals = metadata_df[metadata_df['target'] == 1]['signal_id'].sample(3).tolist()

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Plot normal signals
for i, sig_id in enumerate(normal_signals[:3]):
    if str(sig_id) in signals_df.columns:
        signal_data = signals_df[str(sig_id)]
        axes[0, i].plot(time_axis, signal_data, color='green', alpha=0.8)
        axes[0, i].set_title(f'Normal Signal {sig_id}')
        axes[0, i].set_ylabel('Amplitude')
        axes[0, i].grid(True, alpha=0.3)

# Plot fault signals
for i, sig_id in enumerate(fault_signals[:3]):
    if str(sig_id) in signals_df.columns:
        signal_data = signals_df[str(sig_id)]
        axes[1, i].plot(time_axis, signal_data, color='red', alpha=0.8)
        axes[1, i].set_title(f'Fault Signal {sig_id}')
        axes[1, i].set_xlabel('Time (seconds)')
        axes[1, i].set_ylabel('Amplitude')
        axes[1, i].grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Normal vs Fault Signal Comparison', y=1.02, fontsize=16)
plt.show()

* Normal signals show smooth, predictable waveforms: clean sinusoidal patterns with consistent amplitude ranges (±20 to ±30). Some contain occasional sharp spikes but remain consistent overall.

* Fault signals exhibit high-frequency noise, irregular amplitude variations, and heavily disrupted waveforms.

* Amplitude range clearly distinguishes the two classes in this sample. Normal signals stay within bounds, while fault signals show much wider amplitude excursions (up to ±80 in some cases) and high-frequency noise throughout the entire signal duration.

* __Noise level and amplitude variance could be highly discriminative features__
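
A rough way to quantify "noise level" separately from the slow waveform is the spread of sample-to-sample differences, which the smooth 50 Hz component barely affects. The sketch below uses synthetic data; the helper names `noise_level` and `amplitude_range` are hypothetical.

```python
import numpy as np

def noise_level(x):
    """Standard deviation of first differences; small for smooth waveforms."""
    x = np.asarray(x, dtype=np.float64)
    return float(np.std(np.diff(x)))

def amplitude_range(x):
    """Peak-to-peak amplitude of the signal."""
    x = np.asarray(x, dtype=np.float64)
    return float(x.max() - x.min())

# Synthetic example: a clean sine cycle and the same cycle with added noise.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000
clean = 20 * np.sin(2 * np.pi * t)
noisy = clean + rng.normal(0, 5, size=t.size)

for name, sig in [("clean", clean), ("noisy", noisy)]:
    print(f"{name}: noise level = {noise_level(sig):.3f}, range = {amplitude_range(sig):.1f}")
```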

### Frequency Domain Analysis

In [None]:
n_signals = 4
signal_ids = np.random.choice(signals_df.columns, n_signals, replace=False)
targets = (
    metadata_df.set_index("signal_id")
    .loc[list(map(int, signal_ids)), "target"]
    .map({1: "Fault", 0: "Normal"})
    .tolist()
)
fig, axes = plt.subplots(2, 2, figsize=(15,10))
        
for i, sig_id in enumerate(signal_ids):
    if i >= 4:
        break
        
    row, col = i // 2, i % 2
    signal_data = signals_df[str(sig_id)]
    
    # Compute FFT
    fft_vals = fft(signal_data)
    fft_freq = fftfreq(len(signal_data), 1/sampling_rate)
    
    # Plot only positive frequencies
    positive_freq_mask = fft_freq > 0
    frequencies = fft_freq[positive_freq_mask]
    magnitudes = np.abs(fft_vals[positive_freq_mask])
    
    axes[row, col].loglog(frequencies, magnitudes,color = 'green')
    axes[row, col].set_title(f'Frequency Spectrum - {targets[i]} Signal {sig_id} ')
    axes[row, col].set_xlabel('Frequency (Hz)')
    axes[row, col].set_ylabel('Magnitude')
    axes[row, col].grid(True, alpha=0.3)
    
    # Mark dominant frequencies
    dominant_freq_idx = np.argmax(magnitudes[frequencies < sampling_rate/4])  # Search only the lower half of the positive spectrum
    dominant_freq = frequencies[dominant_freq_idx]
    axes[row, col].axvline(dominant_freq, color='red', linestyle='--', alpha=0.7, 
                         label=f'Peak: {dominant_freq:.1f} Hz')
    axes[row, col].legend()

plt.tight_layout()
plt.suptitle('Frequency Domain Analysis', y=1.02, fontsize=16)
plt.show()

#### Fault Signal Frequency Domain

In [None]:
n_signals = 4
positive_signal_ids = metadata_df.loc[metadata_df['target']==1].signal_id.unique()[:10]
signal_ids = positive_signal_ids

targets = (
    metadata_df.set_index("signal_id")
    .loc[list(map(int, signal_ids)), "target"]
    .map({1: "Fault", 0: "Normal"})
    .tolist()
)

fig, axes = plt.subplots(2, 2, figsize=(15,10))
        
for i, sig_id in enumerate(signal_ids):
    if i >= 4:
        break
        
    row, col = i // 2, i % 2
    signal_data = signals_df[str(sig_id)]
    
    # Compute FFT
    fft_vals = fft(signal_data)
    fft_freq = fftfreq(len(signal_data), 1/sampling_rate)
    
    # Plot only positive frequencies
    positive_freq_mask = fft_freq > 0
    frequencies = fft_freq[positive_freq_mask]
    magnitudes = np.abs(fft_vals[positive_freq_mask])
    
    axes[row, col].loglog(frequencies, magnitudes, color='red')
    axes[row, col].set_title(f'Frequency Spectrum - {targets[i]} Signal {sig_id}')
    axes[row, col].set_xlabel('Frequency (Hz)')
    axes[row, col].set_ylabel('Magnitude')
    axes[row, col].grid(True, alpha=0.3)
    
    # Mark dominant frequencies
    dominant_freq_idx = np.argmax(magnitudes[frequencies < sampling_rate/4])  # Search only the lower half of the positive spectrum
    dominant_freq = frequencies[dominant_freq_idx]
    axes[row, col].axvline(dominant_freq, color='blue', linestyle='--', alpha=0.7, 
                         label=f'Peak: {dominant_freq:.1f} Hz')
    axes[row, col].legend()

plt.tight_layout()
plt.suptitle('Frequency Domain Analysis', y=1.02, fontsize=16)
plt.show()

* All signals show a dominant low-frequency peak at about 1.0 Hz on the plotted axis. Because the code treats each 20 ms record as if it lasted one second, this peak is one cycle per record, which corresponds to the 50 Hz grid fundamental in physical units.
* Fault signal (2906) exhibits noticeably lower overall magnitude and a flatter frequency response than the three normal signals plotted alongside it.
* Normal signals maintain higher energy across most frequency bands and show more pronounced spectral peaks, while the fault signal appears attenuated.
* Spectral energy ratios, dominant-frequency magnitudes, and frequency band power distributions are candidate features for distinguishing between normal and fault conditions.
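
A spectral energy ratio of the kind mentioned above can be sketched as the share of signal power below a cutoff frequency. The example is synthetic and downscaled (8,000 samples over 20 ms, so 400 kHz rather than the dataset's 40 MHz); the function name and the 1 kHz cutoff are assumptions for illustration.

```python
import numpy as np

def low_band_energy_ratio(x, fs, cutoff_hz):
    """Fraction of (mean-removed) signal power at or below cutoff_hz."""
    x = np.asarray(x, dtype=np.float64)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1 / fs)
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[freqs <= cutoff_hz].sum() / total)

# Synthetic example: one 50 Hz cycle over 20 ms, with and without broadband noise.
duration_s = 0.02
n = 8000
fs = n / duration_s
t = np.arange(n) / fs
rng = np.random.default_rng(0)
clean = 20 * np.sin(2 * np.pi * 50 * t)
noisy = clean + rng.normal(0, 5, size=n)

ratio_clean = low_band_energy_ratio(clean, fs, cutoff_hz=1000)
ratio_noisy = low_band_energy_ratio(noisy, fs, cutoff_hz=1000)
print(f"Low-band ratio, clean: {ratio_clean:.3f}")
print(f"Low-band ratio, noisy: {ratio_noisy:.3f}")
```

Passing the physical sampling rate keeps the cutoff in real Hz, which makes the feature comparable across datasets.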

### Statistical Analysis and Visualization of Sample Signals

In [None]:
# Calculate statistics for a 500-signal sample (columns 500-999)
sample_df = signals_df.iloc[:,500:1000]
signal_stats = pd.DataFrame({
    'mean': sample_df.mean(),
    'std': sample_df.std(),
    'min': sample_df.min(),
    'max': sample_df.max(),
    'skewness': sample_df.skew(),
    'kurtosis': sample_df.kurtosis()
})

# Add target labels if available
signal_ids = signal_stats.index.astype(int)
targets = []
for sig_id in signal_ids:
    if sig_id in metadata_df['signal_id'].values:
        target = metadata_df[metadata_df['signal_id'] == sig_id]['target'].iloc[0]
        targets.append('Fault' if target == 1 else 'Normal')
    else:
        targets.append('Unknown')
signal_stats['target'] = targets

# Create subplots
fig, axes = plt.subplots(2, 3, figsize=(15,12))
axes = axes.flatten()

stats_to_plot = ['mean', 'std', 'min', 'max', 'skewness', 'kurtosis']

for i, stat in enumerate(stats_to_plot):
    if metadata_df is not None and 'target' in signal_stats.columns:
        # Plot distributions by target class
        for target_class in signal_stats['target'].unique():
            if target_class != 'Unknown':
                data = signal_stats[signal_stats['target'] == target_class][stat]
                axes[i].hist(data, alpha=0.7, label=target_class, bins=30)
        axes[i].legend()
    else:
        axes[i].hist(signal_stats[stat], bins=30, alpha=0.7)
    
    axes[i].set_title(f'Distribution of {stat.capitalize()}')
    axes[i].set_xlabel(stat.capitalize())
    axes[i].set_ylabel('Frequency')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Statistical Distributions of Signal Properties for a Sample of 500 Signals', y=1.02, fontsize=16)
plt.show()

* Mean Distribution: Normal signals cluster tightly around -1.5 to -0.5, while fault signals scatter across wider ranges as outliers. This indicates faults cause baseline shifts in electrical signals, making mean values a useful but moderate discriminator.
  
* Standard Deviation Distribution: Normal signals form a sharp peak around 15, while fault signals spread wider with different variability patterns. This shows the highest discrimination potential, as normal operation has very consistent signal variability.
  
* Minimum Values Distribution: Normal signals show two distinct clusters (-20 and -60), while fault signals appear scattered throughout these ranges. Fault signals can cause unusual minimum excursions beyond normal operational boundaries.

* Maximum Values Distribution: Normal signals concentrate sharply around 20-30, while fault signals extend to extreme values (120+). This reveals clear fault signatures in the form of dramatic voltage spikes.
  
* Skewness Distribution: Normal signals cluster around 0 (symmetric), while fault signals scatter with higher skewness values.

* Kurtosis Distribution: Normal signals cluster around 0, while fault signals show extreme outliers with high kurtosis (40+). 
  
__Inference__

* Standard deviation and maximum values are the most discriminative features, showing the clearest separation between normal and fault classes with minimal overlap
* Combination of multiple statistical measures will be highly effective, since normal signals cluster tightly in ALL metrics while fault signals deviate across multiple measures simultaneously
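* Simple threshold-based detection could work as a first pass where the histograms suggest separation points (e.g., std > 16, max > 50). These cutoffs come from a 500-signal sample, so they need checking against the full training set. Such rules suit real-time monitoring because of their low computational overhead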
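
The threshold idea in the last bullet can be sketched as a simple rule over per-signal statistics. The table below is made up for illustration; only the two cutoffs (std > 16, max > 50) come from the text above.

```python
import pandas as pd

# Hypothetical per-signal statistics; the values are invented for this example.
stats_df = pd.DataFrame(
    {"std": [14.8, 15.1, 17.3, 15.0], "max": [25, 28, 40, 120]},
    index=["sig_a", "sig_b", "sig_c", "sig_d"],
)

STD_THRESHOLD = 16   # example cutoff quoted in the inference above
MAX_THRESHOLD = 50   # example cutoff quoted in the inference above

# Flag a signal when either statistic exceeds its cutoff.
stats_df["flagged"] = (stats_df["std"] > STD_THRESHOLD) | (stats_df["max"] > MAX_THRESHOLD)
print(stats_df)
```

Comparing the `flagged` column with the `target` labels would show how many faults such a rule catches and how many false alarms it raises.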

### Outlier Analysis

In [None]:
# Sample signals for analysis
sample_size=100
sample_signals = signals_df.sample(n=min(sample_size, len(signals_df.columns)), axis=1)

# Calculate basic statistics for each signal
signal_stats = pd.DataFrame({
    'signal_id': sample_signals.columns,
    'mean': sample_signals.mean(),
    'std': sample_signals.std(), 
    'min': sample_signals.min(),
    'max': sample_signals.max(),
    'range': sample_signals.max() - sample_signals.min()
})
metadata_df_copy=metadata_df.copy()
metadata_df_copy['signal_id']=metadata_df_copy['signal_id'].astype(str)
# Add target labels 
signal_stats = signal_stats.merge(
            metadata_df_copy[['signal_id', 'target']], 
            left_on='signal_id', right_on='signal_id', 
            how='left'
        )
colors = ['red' if t == 1 else 'green' if t == 0 else 'gray' for t in signal_stats['target']]
labels = ['Fault' if t == 1 else 'Normal' if t == 0 else 'Unknown' for t in signal_stats['target']]

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

# 1. Box plot for standard deviations
axes[0].boxplot(signal_stats['std'], vert=True)
axes[0].set_title('Signal Standard Deviation\nBox Plot')
axes[0].set_ylabel('Standard Deviation')
axes[0].grid(True, alpha=0.3)

# Add outlier threshold lines
Q1 = signal_stats['std'].quantile(0.25)
Q3 = signal_stats['std'].quantile(0.75)
IQR = Q3 - Q1
upper_outlier = Q3 + 1.5 * IQR
lower_outlier = Q1 - 1.5 * IQR
axes[0].axhline(y=upper_outlier, color='red', linestyle='--', alpha=0.7, label=f'Upper threshold: {upper_outlier:.2f}')
axes[0].axhline(y=lower_outlier, color='red', linestyle='--', alpha=0.7, label=f'Lower threshold: {lower_outlier:.2f}')
axes[0].legend(fontsize=8)

# 2. Scatter plot: Mean vs Std
axes[1].scatter(signal_stats['mean'], signal_stats['std'], c=colors, alpha=0.6, s=30)
axes[1].set_xlabel('Signal Mean')
axes[1].set_ylabel('Signal Standard Deviation')
axes[1].set_title('Mean vs Standard Deviation')
axes[1].grid(True, alpha=0.3)

# 3. Scatter plot: Min vs Max
axes[2].scatter(signal_stats['min'], signal_stats['max'], c=colors, alpha=0.6, s=30)
axes[2].set_xlabel('Signal Minimum')
axes[2].set_ylabel('Signal Maximum') 
axes[2].set_title('Min vs Max Values')
axes[2].grid(True, alpha=0.3)

# 4. Z-score outlier detection
z_scores = np.abs(zscore(signal_stats[['mean', 'std', 'range']]))
outlier_mask = (z_scores > 3).any(axis=1)

outlier_counts = [
    np.sum(~outlier_mask),  # Normal
    np.sum(outlier_mask)    # Outliers
]

axes[3].pie(outlier_counts, labels=['Normal', 'Outliers'], autopct='%1.1f%%', 
           colors=['green', 'salmon'])
axes[3].set_title(f'Z-Score Outliers\n(threshold = 3)')

# 5. Isolation Forest outlier detection
iso_forest = IsolationForest(contamination=0.1, random_state=42)
features = signal_stats[['mean', 'std', 'min', 'max', 'range']].fillna(0)
outlier_preds = iso_forest.fit_predict(features)
outlier_scores = iso_forest.decision_function(features)

outlier_colors = ['red' if pred == -1 else 'green' for pred in outlier_preds]
axes[4].scatter(range(len(outlier_scores)), outlier_scores, c=outlier_colors, alpha=0.6, s=30)
axes[4].axhline(y=0, color='black', linestyle='--', alpha=0.5, label='Decision boundary')
axes[4].set_xlabel('Signal Index')
axes[4].set_ylabel('Outlier Score')
axes[4].set_title('Isolation Forest Outlier Scores')
axes[4].grid(True, alpha=0.3)
axes[4].legend()
# 6. Summary statistics
axes[5].axis('off')

# Calculate summary stats
n_zscore_outliers = np.sum(outlier_mask)
n_isolation_outliers = np.sum(outlier_preds == -1)

summary_text = f"""OUTLIER DETECTION SUMMARY

Dataset Info:
• Total signals analyzed: {len(signal_stats)}
• Sample from {len(signals_df.columns)} total signals

Z-Score Method (threshold=3):
• Outliers detected: {n_zscore_outliers}
• Percentage: {n_zscore_outliers/len(signal_stats)*100:.1f}%

Isolation Forest Method:
• Outliers detected: {n_isolation_outliers} 
• Percentage: {n_isolation_outliers/len(signal_stats)*100:.1f}%

Signal Statistics:
• Mean std dev: {signal_stats['std'].mean():.2f}
• Mean range: {signal_stats['range'].mean():.2f}
"""
    
if metadata_df is not None:
    fault_signals = signal_stats[signal_stats['target'] == 1]
    normal_signals = signal_stats[signal_stats['target'] == 0]
    
    if len(fault_signals) > 0 and len(normal_signals) > 0:
        summary_text += f"""
Target Distribution:
• Normal signals: {len(normal_signals)}
• Fault signals: {len(fault_signals)}
• Fault rate: {len(fault_signals)/len(signal_stats)*100:.1f}%

Fault vs Normal Stats:
• Normal avg std: {normal_signals['std'].mean():.2f}
• Fault avg std: {fault_signals['std'].mean():.2f}
"""

axes[5].text(0.05, 0.95, summary_text, transform=axes[5].transAxes, 
            fontsize=10, verticalalignment='top', fontfamily='monospace')

plt.tight_layout()
plt.suptitle('Power Line Signal Outlier Detection', y=1.02, fontsize=16)
plt.show()


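##### Z-Score method: flagged only 1% of signals as outliers; Isolation Forest: flagged 10%; actual fault rate in the sample: 5%

* Avoid Z-Score alone: with a threshold of 3 it is too conservative for fault detection in power systems
* The Isolation Forest's 10% is set by `contamination=0.1` rather than discovered from the data, so it does not by itself show better fault detection; compare the flagged signals against the target labels before relying on it
* Isolation Forest is still worth considering because it scores all features jointly and can capture multivariate anomalies in signal behavior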
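
When reading the Isolation Forest percentage, it helps to remember that the `contamination` parameter sets the share of training points that get labelled as outliers. The sketch below demonstrates this on synthetic Gaussian features that contain no real anomalies; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))   # synthetic features, no injected anomalies

flagged_fraction = {}
for contamination in (0.05, 0.10):
    model = IsolationForest(contamination=contamination, random_state=42)
    preds = model.fit_predict(X)             # -1 marks an outlier, 1 an inlier
    flagged_fraction[contamination] = float(np.mean(preds == -1))

for contamination, fraction in flagged_fraction.items():
    print(f"contamination={contamination:.2f} -> flagged {fraction:.1%}")
```

The flagged share tracks the setting even on clean data, so the outlier percentage alone says little about detection quality.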

### Visualizing Outlier Signals Captured with Z-Score and Isolation Forest

In [None]:
# Return outlier information
outlier_info = {
    'zscore_outliers': signal_stats[outlier_mask]['signal_id'].tolist(),
    'isolation_outliers': signal_stats[outlier_preds == -1]['signal_id'].tolist(),
    'outlier_scores': outlier_scores
}
# outlier_info

#### Z-Score Outliers

In [None]:
outlier_signal_ids =outlier_info['zscore_outliers']

fig, axes = plt.subplots(2, 1, figsize=(15,8))
    
# Plot outlier signals
n_outliers_to_plot = min(5, len(outlier_signal_ids))
normal_signal_ids = normal_signals['signal_id'].astype(str).tolist()
time_axis = np.linspace(0, 1, len(signals_df))

for i in range(n_outliers_to_plot):
    signal_id = outlier_signal_ids[i]
    if str(signal_id) in signals_df.columns:
        signal_data = signals_df[str(signal_id)]
        axes[0].plot(time_axis, signal_data, alpha=0.7, linewidth=1, 
                    label=f'Outlier {signal_id}')

axes[0].set_title('Outlier Signals (Time Domain)')
axes[0].set_xlabel('Time (seconds)')
axes[0].set_ylabel('Amplitude')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot normal signals for comparison

n_normal_to_plot = min(5, len(normal_signal_ids))
for i in range(n_normal_to_plot):
    signal_id = normal_signal_ids[i]
    if str(signal_id) in signals_df.columns:
        signal_data = signals_df[str(signal_id)]
        axes[1].plot(time_axis, signal_data, alpha=0.7, linewidth=1,
                   label=f'Normal {signal_id}')

axes[1].set_title('Normal Signals (Time Domain)')
axes[1].set_xlabel('Time (seconds)')
axes[1].set_ylabel('Amplitude') 
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Outlier vs Normal Signal Comparison', y=1.02, fontsize=14)
plt.show()

#### Isolation Forest Outliers

In [None]:
outlier_signal_ids =outlier_info['isolation_outliers']

fig, axes = plt.subplots(2, 1, figsize=(15,8))
    
# Plot outlier signals
n_outliers_to_plot = min(5, len(outlier_signal_ids))

time_axis = np.linspace(0, 1, len(signals_df))

for i in range(n_outliers_to_plot):
    signal_id = outlier_signal_ids[i]
    if str(signal_id) in signals_df.columns:
        signal_data = signals_df[str(signal_id)]
        axes[0].plot(time_axis, signal_data, alpha=0.7, linewidth=1, 
                    label=f'Outlier {signal_id}')

axes[0].set_title('Outlier Signals (Time Domain)')
axes[0].set_xlabel('Time (seconds)')
axes[0].set_ylabel('Amplitude')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot normal signals for comparison
normal_signal_ids = normal_signals['signal_id'].astype(str).tolist()
n_normal_to_plot = min(5, len(normal_signal_ids))
for i in range(n_normal_to_plot):
    signal_id = normal_signal_ids[i]
    if str(signal_id) in signals_df.columns:
        signal_data = signals_df[str(signal_id)]
        axes[1].plot(time_axis, signal_data, alpha=0.7, linewidth=1,
                   label=f'Normal {signal_id}')

axes[1].set_title('Normal Signals (Time Domain)')
axes[1].set_xlabel('Time (seconds)')
axes[1].set_ylabel('Amplitude') 
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Outlier vs Normal Signal Comparison', y=1.02, fontsize=14)
plt.show()

#### Outlier Signals (Top Panel):

* Show persistent high-amplitude transients throughout the entire signal duration
* Exhibit chaotic, unpredictable spike patterns with amplitudes reaching ±100
* Display no underlying structural consistency - the baseline waveform is completely disrupted
* Demonstrate sustained abnormal behavior rather than isolated events

#### Normal Signals (Bottom Panel):

* Maintain consistent baseline patterns with occasional sharp spikes
* Show predictable underlying structure despite some transient events
* Keep controlled amplitude ranges mostly within ±50
* Display recoverable behavior - return to baseline after transients

#### Inferences 

* Use median-based scaling instead of mean-based due to extreme outliers
* Decide whether to remove extreme outliers or use them as additional fault examples
* Cap extreme values to prevent single outliers from dominating features
* True fault detection requires distinguishing between measurement anomalies and actual electrical faults
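
The first and third inferences (median-based scaling and capping) can be combined in a small helper. This is a minimal sketch: the function name `robust_scale_and_clip` and the clip limit of 3 are assumptions, not choices made in this notebook.

```python
import numpy as np

def robust_scale_and_clip(values, clip=3.0):
    """Center on the median, scale by the IQR, then cap extreme values."""
    values = np.asarray(values, dtype=np.float64)
    median = np.median(values)
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    if iqr == 0:                       # constant feature: nothing to scale
        return np.zeros_like(values)
    return np.clip((values - median) / iqr, -clip, clip)

# A single extreme value (100) no longer dominates the scaled feature.
scaled = robust_scale_and_clip([1, 2, 3, 4, 100])
print(scaled)
```

scikit-learn's `RobustScaler` applies the same median and IQR scaling to whole feature matrices, without the clipping step.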


### Correlation Analysis

##### Plot correlation heatmap between signals

In [None]:
metadata_df["signal_id"] = metadata_df["signal_id"].astype(str)

In [None]:
sample_signals = signals_df.sample(n=min(100, len(signals_df.columns)), axis=1)
signal_ids = sample_signals.columns
targets = (
    metadata_df.set_index("signal_id")
    .loc[signal_ids, "target"]
    .map({1: "Fault", 0: "Normal"})
    .tolist()
)
        
# Calculate correlation matrix
corr_matrix = sample_signals.corr()

# Plot heatmap
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # Mask upper triangle
sns.heatmap(corr_matrix, mask=mask, cmap='coolwarm', center=0,
            square=True, cbar_kws={"shrink": .8})
plt.title(f'Signal Correlation Heatmap (Sample of {len(sample_signals.columns)} signals)')
plt.tight_layout()
plt.show()

* The correlation heatmap shows correlations ranging from about -0.75 to +0.75 across the 100 sampled signals, with most pairs only weakly correlated, indicating that power line signals behave largely independently of each other.
* __Individual signal features could be more valuable for fault detection than cross-signal relationships__

### Time Series Patterns 

##### Analyze time series patterns in a specific signal using sliding windows

In [None]:
window_size=1000
negative_df=metadata_df.loc[metadata_df.target==0]
sample_df = negative_df.sample(n=1)  # pick one random normal signal
signal_id = str(sample_df['signal_id'].unique()[0])
signal_data = signals_df[signal_id]

# Split the signal into consecutive, non-overlapping windows
n_windows = len(signal_data) // window_size
windows = []
window_means = []
window_stds = []
target_val = metadata_df.loc[metadata_df['signal_id']==signal_id]['target'].unique()[0]

for i in range(n_windows):
    start_idx = i * window_size
    end_idx = start_idx + window_size
    window = signal_data.iloc[start_idx:end_idx]
    windows.append(window)
    window_means.append(window.mean())
    window_stds.append(window.std())

# Plot results
fig, axes = plt.subplots(3, 1, figsize=(15,8))

# Original signal
axes[0].plot(time_axis, signal_data, color = 'green')
axes[0].set_title(f'Original Signal {signal_id} with target {target_val}')
axes[0].set_ylabel('Amplitude')
axes[0].grid(True, alpha=0.3)

# Window means
window_time = np.arange(n_windows) * window_size / sampling_rate
axes[1].plot(window_time, window_means, marker='o', linewidth=2, color = 'green')
axes[1].set_title('Window Means Over Time')
axes[1].set_ylabel('Mean Amplitude')
axes[1].grid(True, alpha=0.3)

# Window standard deviations
axes[2].plot(window_time, window_stds, marker='s', linewidth=2, color='orange')
axes[2].set_title('Window Standard Deviations Over Time')
axes[2].set_xlabel('Time (seconds)')
axes[2].set_ylabel('Standard Deviation')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

##### Time Series analysis for Fault Signal

In [None]:
window_size=1000
positive_df=metadata_df.loc[metadata_df.target==1]
sample_df = positive_df.sample(n=1)  # pick one random fault signal
signal_id = str(sample_df['signal_id'].unique()[0])
signal_data = signals_df[signal_id]

# Split the signal into consecutive, non-overlapping windows
n_windows = len(signal_data) // window_size
windows = []
window_means = []
window_stds = []
target_val = metadata_df.loc[metadata_df['signal_id']==signal_id]['target'].unique()[0]
for i in range(n_windows):
    start_idx = i * window_size
    end_idx = start_idx + window_size
    window = signal_data.iloc[start_idx:end_idx]
    windows.append(window)
    window_means.append(window.mean())
    window_stds.append(window.std())

# Plot results
fig, axes = plt.subplots(3, 1, figsize=(15,8))

# Original signal
axes[0].plot(time_axis, signal_data, color = 'red')
axes[0].set_title(f'Original Signal {signal_id} with target {target_val}')
axes[0].set_ylabel('Amplitude')
axes[0].grid(True, alpha=0.3)

# Window means
window_time = np.arange(n_windows) * window_size / sampling_rate
axes[1].plot(window_time, window_means, marker='o', linewidth=2, color = 'red')
axes[1].set_title('Window Means Over Time')
axes[1].set_ylabel('Mean Amplitude')
axes[1].grid(True, alpha=0.3)

# Window standard deviations
axes[2].plot(window_time, window_stds, marker='s', linewidth=2, color='orange')
axes[2].set_title('Window Standard Deviations Over Time')
axes[2].set_xlabel('Time (seconds)')
axes[2].set_ylabel('Standard Deviation')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

#### The sliding window analysis shows clear differences between fault and normal signals over time.

* Normal signals preserve the underlying waveform structure, while fault signals lose coherent patterns and show erratic, broadband noise characteristics
* Normal-signal window means follow a smooth, predictable curve with no sudden jumps, while fault-signal window means are highly variable and scattered, with no smooth transitions
* Normal signals maintain low, consistent standard deviation (~1.0) with isolated spikes, while fault signals show persistently elevated variability (5-15 range) throughout the entire duration

#### Window-based statistical features (variance over time, peak detection in sliding windows, and transient burst identification) will be powerful discriminators for fault detection algorithms.
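
Window statistics like these can be computed without a Python loop by reshaping the signal into a 2-D array. The sketch below uses a synthetic signal with one noisy burst; the helper name `window_std` and the feature names are hypothetical.

```python
import numpy as np

def window_std(x, window_size):
    """Standard deviation of each consecutive, non-overlapping window."""
    x = np.asarray(x, dtype=np.float64)
    n_windows = x.size // window_size
    trimmed = x[: n_windows * window_size]   # drop any incomplete final window
    return trimmed.reshape(n_windows, window_size).std(axis=1)

# Synthetic example: a flat signal with a burst in the fourth window.
x = np.zeros(8000)
x[3000:4000] = np.tile([5.0, -5.0], 500)

stds = window_std(x, window_size=1000)
features = {
    "max_window_std": float(stds.max()),
    "mean_window_std": float(stds.mean()),
    "burst_window": int(np.argmax(stds)),
}
print(features)
```

Summaries such as the maximum and mean window deviation compress the 800 windows of a real signal into a few numbers a classifier can use.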

### Interactive Plotly visualization for signal exploration

In [None]:
# Two Normal and Two Fault Signals
signal_ids = ['2871','6457','6760','1027']
targets=[]
for sig_id in signal_ids:
    target = metadata_df[metadata_df['signal_id'] == sig_id]['target'].iloc[0]
    targets.append('Fault' if target == 1 else 'Normal')
targets

In [None]:
fig = make_subplots(
    rows=len(signal_ids), cols=2,
    subplot_titles=
    [f'{target} Signal {sid} - {domain} Domain' for sid, target in zip(signal_ids, targets) for domain in ["Time", "Frequency"]],
    specs=[[{"secondary_y": False}, {"secondary_y": False}] for _ in signal_ids]
)

for i, sig_id in enumerate(signal_ids):
    signal_data = signals_df[str(sig_id)]
    
    # Time domain plot
    fig.add_trace(
        go.Scatter(x=time_axis, y=signal_data, name=f'Signal {sig_id}',
                  line=dict(width=1)),
        row=i+1, col=1
    )
    
    # Frequency domain plot
    fft_vals = fft(signal_data)
    fft_freq = fftfreq(len(signal_data), 1/sampling_rate)
    positive_mask = fft_freq > 0
    
    fig.add_trace(
        go.Scatter(x=fft_freq[positive_mask], y=np.abs(fft_vals[positive_mask]),
                  name=f'FFT {sig_id}', line=dict(width=1)),
        row=i+1, col=2
    )
    
    # Update axis labels
    fig.update_xaxes(title_text="Time (s)", row=i+1, col=1)
    fig.update_yaxes(title_text="Amplitude", row=i+1, col=1)
    fig.update_xaxes(title_text="Frequency (Hz)", row=i+1, col=2, type="log")
    fig.update_yaxes(title_text="Magnitude", row=i+1, col=2, type="log")

fig.update_layout(height=300*len(signal_ids), title_text="Interactive Signal Explorer")
fig.show()

* Normal signals show smooth, regular patterns with most energy in low frequencies, while fault signals display chaotic behavior with sudden spikes and energy spread across all frequencies.
* The fault signals have much higher amplitude variations and irregular patterns compared to the stable, predictable normal signals.
* __Amplitude-based features (like peak detection and signal variability) and frequency features (like high-frequency content) could effectively distinguish between healthy and faulty power line conditions__

## Key EDA Insights Summary:

* Standard deviation was the most discriminative single feature
* Peak amplitude values clearly separated normal vs fault signals
* High-frequency content was characteristic of faults
* Statistical shape measures (skewness, kurtosis) showed good separation
* Time-based variability patterns were distinctive
* Simple threshold-based detection showed promise for real-time applications

### End of Exploratory Data Analysis