# PART B: Machine Learning Fault Classification
## Vibration-Based Bearing Fault Detection System

**Author:** Ayush Anand  
**Date:** November 2024  
**Dataset:** CWRU Bearing Features (from Part A)

---

### Project Overview

This notebook implements machine learning models to automatically classify bearing faults based on vibration signal features.

### Objectives:
1. ✅ Prepare dataset with extracted features
2. ✅ Train **Random Forest** classifier
3. ✅ Train **MLP Neural Network** classifier
4. ✅ Compare model performance
5. ✅ Evaluate using multiple metrics

### Models:
- **Random Forest** - Ensemble decision tree classifier
- **MLP (Multi-Layer Perceptron)** - Neural network classifier

### Evaluation Metrics:
- Accuracy, Precision, Recall, F1-Score
- Confusion Matrices
- ROC Curves & AUC
- Feature Importance Analysis

---

In [1]:
# Import Required Libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import warnings
from datetime import datetime

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, 
    roc_curve, auc, roc_auc_score
)
from sklearn.preprocessing import label_binarize

# Configure
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

# Create directories
os.makedirs('models', exist_ok=True)
os.makedirs('../reports', exist_ok=True)

print('='*70)
print('✅ LIBRARIES IMPORTED SUCCESSFULLY')
print('='*70)
print(f'📦 NumPy version: {np.__version__}')
print(f'📦 Pandas version: {pd.__version__}')
print(f'📦 Scikit-learn ready')
print(f'📦 Matplotlib ready')
print(f'⏰ Timestamp: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}')
print('='*70)

✅ LIBRARIES IMPORTED SUCCESSFULLY
📦 NumPy version: 2.3.5
📦 Pandas version: 2.3.3
📦 Scikit-learn ready
📦 Matplotlib ready
⏰ Timestamp: 2025-11-30 11:59:07


---
## 1️⃣ Dataset Preparation

We load the feature dataset created in Part A, which contains 14 engineered features extracted from vibration signals.

### Features Include:
- **Time-Domain (7):** RMS, Peak, Peak-to-Peak, Crest Factor, Kurtosis, Skewness, Std Dev
- **Frequency-Domain (7):** Dominant Frequency, Peak FFT Magnitude, Top 3 Frequencies, Spectral Entropy, Frequency Centroid
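
The time-domain features can be computed from a raw vibration segment roughly as follows. This is a minimal sketch, not the Part A implementation: the function name `time_domain_features`, the 12 kHz sampling rate, and the synthetic 30 Hz sine are assumptions for illustration, and kurtosis is taken as excess kurtosis (0 for a Gaussian signal).

```python
import numpy as np

def time_domain_features(signal):
    """Compute time-domain features for one vibration segment (illustrative helper)."""
    x = np.asarray(signal, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))
    std = np.std(x)
    centered = x - np.mean(x)
    return {
        "rms": rms,
        "peak": peak,
        "peak_to_peak": np.max(x) - np.min(x),
        "crest_factor": peak / rms,
        # Excess kurtosis (assumed convention): fourth standardized moment minus 3
        "kurtosis": np.mean(centered ** 4) / std ** 4 - 3.0,
        # Third standardized moment
        "skewness": np.mean(centered ** 3) / std ** 3,
        "std_dev": std,
    }

# Synthetic stand-in for a bearing signal: one second of a 30 Hz sine at 12 kHz
t = np.arange(12000) / 12000
segment = np.sin(2 * np.pi * 30 * t)

features = time_domain_features(segment)
for name, value in features.items():
    print(f"{name:13s}: {value: .4f}")
```

For a pure sine the crest factor comes out at about 1.414 and the excess kurtosis at about -1.5, which makes this a convenient sanity check before running the function on real segments.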

### Classes:
- **Normal** - Healthy bearing
- **Inner Race** - Inner race fault
- **Outer Race** - Outer race fault  
- **Ball** - Ball bearing fault

---

In [4]:
# Load preprocessed features from Part A
print('='*70)
print('LOADING DATASET')
print('='*70)

df = pd.read_csv('data/cwru_features.csv')

print(f"\n✅ Dataset loaded successfully!")
print(f"📊 Dataset shape: {df.shape}")
print(f"📋 Number of features: {df.shape[1] - 1}")
print(f"🏷️  Number of classes: {df['fault_type'].nunique()}")

print(f"\n{'='*70}")
print('CLASS DISTRIBUTION')
print('='*70)
class_counts = df['fault_type'].value_counts()
for fault, count in class_counts.items():
    print(f"  {fault:15s}: {count:3d} samples ({count/len(df)*100:.1f}%)")
print('='*70)

print(f"\n{'='*70}")
print('FEATURE COLUMNS')
print('='*70)
print("\nTime-Domain Features:")
time_features = ['rms', 'peak', 'crest_factor', 'kurtosis', 'skewness', 'std_dev', 'peak_to_peak']
for idx, col in enumerate([c for c in df.columns if c in time_features], 1):
    print(f"  {idx}. {col}")

print("\nFrequency-Domain Features:")
freq_features = ['dominant_frequency', 'peak_fft_magnitude', 'top_freq_1', 'top_freq_2', 
                 'top_freq_3', 'spectral_entropy', 'frequency_centroid']
for idx, col in enumerate([c for c in df.columns if c in freq_features], 1):
    print(f"  {idx}. {col}")
print('='*70)

# Display sample data
print("\n📋 Sample Data:")
display(df)

# Statistical summary
print("\n📊 Statistical Summary:")
display(df.describe().T)

LOADING DATASET

✅ Dataset loaded successfully!
📊 Dataset shape: (4, 15)
📋 Number of features: 14
🏷️  Number of classes: 4

CLASS DISTRIBUTION
  normal         :   1 samples (25.0%)
  inner_race     :   1 samples (25.0%)
  outer_race     :   1 samples (25.0%)
  ball           :   1 samples (25.0%)

FEATURE COLUMNS

Time-Domain Features:
  1. rms
  2. peak
  3. peak_to_peak
  4. crest_factor
  5. kurtosis
  6. skewness
  7. std_dev

Frequency-Domain Features:
  1. dominant_frequency
  2. peak_fft_magnitude
  3. top_freq_1
  4. top_freq_2
  5. top_freq_3
  6. spectral_entropy
  7. frequency_centroid

📋 Sample Data:


Unnamed: 0,rms,peak,peak_to_peak,crest_factor,kurtosis,skewness,std_dev,dominant_frequency,peak_fft_magnitude,top_freq_1,top_freq_2,top_freq_3,spectral_entropy,frequency_centroid,fault_type
0,1.0,2.437762,4.867983,2.437762,-0.076018,-0.005374,1.0,30.0,1.0,30.0,30.5,29.5,1.321167,31.186708,normal
1,1.0,2.556979,5.059222,2.556979,-0.061326,-0.007121,1.0,297.0,1.0,297.0,296.5,297.5,1.311411,299.266346,inner_race
2,1.0,2.417903,4.809594,2.417903,-0.077072,0.002775,1.0,250.0,1.0,250.0,249.5,250.5,1.268307,250.56715,outer_race
3,1.0,2.515899,5.025675,2.515899,-0.071225,0.000556,1.0,400.0,1.0,400.0,399.5,400.5,1.316902,401.860623,ball



📊 Statistical Summary:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rms,4.0,1.0,1.648033e-10,1.0,1.0,1.0,1.0,1.0
peak,4.0,2.482136,0.06541226,2.417903,2.432798,2.476831,2.526169,2.556979
peak_to_peak,4.0,4.940618,0.120754,4.809594,4.853386,4.946829,5.034061,5.059222
crest_factor,4.0,2.482136,0.06541226,2.417903,2.432798,2.476831,2.526169,2.556979
kurtosis,4.0,-0.07141,0.00718834,-0.077072,-0.076281,-0.073621,-0.06875,-0.061326
skewness,4.0,-0.002291,0.004711836,-0.007121,-0.00581,-0.002409,0.001111,0.002775
std_dev,4.0,1.0,1.648033e-10,1.0,1.0,1.0,1.0,1.0
dominant_frequency,4.0,244.25,155.9666,30.0,195.0,273.5,322.75,400.0
peak_fft_magnitude,4.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
top_freq_1,4.0,244.25,155.9666,30.0,195.0,273.5,322.75,400.0


In [7]:
print('='*70)
print('DATA PREPROCESSING & AUGMENTATION')
print('='*70)

# Check current data size
print(f"\n⚠️  Original dataset: {len(df)} samples")
print(f"   └─ Too small for train-test split!")

# Separate features and target
X_original = df.drop('fault_type', axis=1)
y_original = df['fault_type']

print(f"\n🔄 Generating synthetic samples using data augmentation...")

# Generate more samples by adding small random noise
np.random.seed(42)
augmented_data = []

for idx, row in df.iterrows():
    fault_type = row['fault_type']
    features = row.drop('fault_type')
    
    # Keep original sample
    augmented_data.append(row.to_dict())
    
    # Generate 24 more samples with small random variations
    for _ in range(24):
        # Add zero-mean Gaussian noise with a fixed sigma of 0.05 to every feature
        # (absolute noise, not scaled to each feature's standard deviation)
        noise = np.random.normal(0, 0.05, len(features))
        augmented_features = features + noise
        
        # Create new sample
        new_sample = augmented_features.to_dict()
        new_sample['fault_type'] = fault_type
        augmented_data.append(new_sample)

# Create augmented dataframe
df_augmented = pd.DataFrame(augmented_data)

print(f"✅ Augmented dataset: {len(df_augmented)} samples")
print(f"   └─ {len(df_augmented) // 4} samples per class")

# Now separate features and target from augmented data
X = df_augmented.drop('fault_type', axis=1)
y = df_augmented['fault_type']

print(f"\n✅ Features (X): {X.shape}")
print(f"✅ Labels (y): {y.shape}")

# Encode labels to numeric values
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print(f"\n{'='*70}")
print('LABEL ENCODING')
print('='*70)
for idx, fault in enumerate(label_encoder.classes_):
    count = np.sum(y_encoded == idx)
    print(f"  {fault:15s} → {idx}  ({count} samples)")
print('='*70)

# Now we can split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, 
    test_size=0.20, 
    random_state=42, 
    stratify=y_encoded
)

print(f"\n{'='*70}")
print('TRAIN-TEST SPLIT (80%-20%)')
print('='*70)
print(f"📦 Training set:   {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"📦 Testing set:    {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"📊 Features:       {X_train.shape[1]}")

print(f"\n🔍 Training set class distribution:")
for idx, fault in enumerate(label_encoder.classes_):
    count = np.sum(y_train == idx)
    print(f"   {fault:15s}: {count} samples")

print(f"\n🔍 Testing set class distribution:")
for idx, fault in enumerate(label_encoder.classes_):
    count = np.sum(y_test == idx)
    print(f"   {fault:15s}: {count} samples")

print('='*70)

# Feature names for later use
feature_names = X.columns.tolist()

print(f"\n✅ Data preprocessing complete!")
print(f"✅ Dataset augmented to {len(df_augmented)} samples")
print(f"✅ Ready for model training!\n")

DATA PREPROCESSING & AUGMENTATION

⚠️  Original dataset: 4 samples
   └─ Too small for train-test split!

🔄 Generating synthetic samples using data augmentation...
✅ Augmented dataset: 100 samples
   └─ 25 samples per class

✅ Features (X): (100, 14)
✅ Labels (y): (100,)

LABEL ENCODING
  ball            → 0  (25 samples)
  inner_race      → 1  (25 samples)
  normal          → 2  (25 samples)
  outer_race      → 3  (25 samples)

TRAIN-TEST SPLIT (80%-20%)
📦 Training set:   80 samples (80.0%)
📦 Testing set:    20 samples (20.0%)
📊 Features:       14

🔍 Training set class distribution:
   ball           : 20 samples
   inner_race     : 20 samples
   normal         : 20 samples
   outer_race     : 20 samples

🔍 Testing set class distribution:
   ball           : 5 samples
   inner_race     : 5 samples
   normal         : 5 samples
   outer_race     : 5 samples

✅ Data preprocessing complete!
✅ Dataset augmented to 100 samples
✅ Ready for model training!
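
The augmentation cell above adds noise with the same absolute sigma (0.05) to every column. That is negligible for `dominant_frequency` (30 to 400) but comparable in size to `kurtosis` (about -0.07). One possible alternative, sketched here under the assumption that noise proportional to each value's magnitude is acceptable, is shown below. The helper `augment_relative`, its `rel_sigma` parameter, and the toy two-column array are hypothetical and not part of the notebook.

```python
import numpy as np

def augment_relative(X, y, copies=24, rel_sigma=0.05, seed=42):
    """Return the originals plus `copies` noisy replicas of every row.

    Noise sigma is rel_sigma times the absolute value of each entry,
    so entries that are exactly zero receive no noise.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    blocks = [X]  # keep the original rows as the first block
    for _ in range(copies):
        noise = rng.normal(0.0, rel_sigma, size=X.shape) * np.abs(X)
        blocks.append(X + noise)
    X_aug = np.vstack(blocks)
    y_aug = np.tile(y, copies + 1)  # labels repeat once per block
    return X_aug, y_aug

# Toy stand-in: (kurtosis, dominant_frequency) for two of the four classes
X_small = np.array([[-0.076018, 30.0],
                    [-0.061326, 297.0]])
y_small = np.array(["normal", "inner_race"])

X_aug, y_aug = augment_relative(X_small, y_small)
print(X_aug.shape, y_aug.shape)
```

With the default arguments each original row yields 25 rows in total, matching the per-class count produced by the notebook cell.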

---
## 2️⃣ Model Training

### Random Forest Classifier

**Algorithm:** Ensemble of decision trees  
**Key Hyperparameters:**
- `n_estimators=100` - Number of trees in the forest
- `max_depth=10` - Maximum depth of each tree
- `min_samples_split=2` - Minimum samples to split a node
- `random_state=42` - For reproducibility

**Advantages:**
- ✅ Handles non-linear relationships
- ✅ Less prone to overfitting than a single decision tree, since predictions are averaged over many trees
- ✅ Provides feature importance scores
- ✅ Works well with small datasets
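
As a rough illustration of the feature-importance point, a fitted forest exposes impurity-based scores through `feature_importances_`. The sketch below uses synthetic data (one informative column, one pure-noise column) as a stand-in; the names `X_demo`, `y_demo`, and `forest` are hypothetical and the data is not from the CWRU features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 200
informative = rng.normal(size=n_samples)    # determines the label
noise_feature = rng.normal(size=n_samples)  # unrelated to the label
X_demo = np.column_stack([informative, noise_feature])
y_demo = (informative > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_demo, y_demo)

# Importances are non-negative and sum to 1 across features
importances = dict(zip(["informative", "noise"], forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s}: {score:.3f}")
```

In the notebook the same attribute would presumably be read from `rf_model` and paired with `feature_names`.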

---

In [8]:
print('='*70)
print('TRAINING RANDOM FOREST CLASSIFIER')
print('='*70)

# Initialize Random Forest
rf_model = RandomForestClassifier(
    n_estimators=100,        # 100 decision trees
    max_depth=10,            # Maximum tree depth
    min_samples_split=2,     # Minimum samples to split node
    min_samples_leaf=1,      # Minimum samples in leaf
    random_state=42,
    n_jobs=-1,               # Use all CPU cores
    verbose=0
)

print("\n🔄 Training Random Forest...")
print("   └─ Building 100 decision trees...")

# Train the model
rf_model.fit(X_train, y_train)

print("✅ Training complete!\n")

# Make predictions
print("🔄 Making predictions on test set...")
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)
print("✅ Predictions complete!\n")

# Evaluate
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf, average='weighted', zero_division=0)
recall_rf = recall_score(y_test, y_pred_rf, average='weighted', zero_division=0)
f1_rf = f1_score(y_test, y_pred_rf, average='weighted', zero_division=0)

print('='*70)
print('RANDOM FOREST PERFORMANCE')
print('='*70)
print(f"🎯 Accuracy:  {accuracy_rf*100:.2f}%")
print(f"🎯 Precision: {precision_rf*100:.2f}%")
print(f"🎯 Recall:    {recall_rf*100:.2f}%")
print(f"🎯 F1-Score:  {f1_rf*100:.2f}%")
print('='*70)

# Detailed classification report
print("\n📊 Detailed Classification Report:")
print("="*70)
print(classification_report(
    y_test, y_pred_rf, 
    target_names=label_encoder.classes_,
    zero_division=0
))
print("="*70)

# Save model
with open('models/random_forest_model.pkl', 'wb') as f:
    pickle.dump(rf_model, f)
print("\n💾 Model saved: models/random_forest_model.pkl")

TRAINING RANDOM FOREST CLASSIFIER

🔄 Training Random Forest...
   └─ Building 100 decision trees...
✅ Training complete!

🔄 Making predictions on test set...
✅ Predictions complete!

RANDOM FOREST PERFORMANCE
🎯 Accuracy:  100.00%
🎯 Precision: 100.00%
🎯 Recall:    100.00%
🎯 F1-Score:  100.00%

📊 Detailed Classification Report:
              precision    recall  f1-score   support

        ball       1.00      1.00      1.00         5
  inner_race       1.00      1.00      1.00         5
      normal       1.00      1.00      1.00         5
  outer_race       1.00      1.00      1.00         5

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20


💾 Model saved: models/random_forest_model.pkl


---
## 3️⃣ MLP Neural Network Classifier

**Algorithm:** Multi-Layer Perceptron (Feedforward Neural Network)  
**Architecture:**
- **Input Layer:** 14 neurons (one per feature)
- **Hidden Layer 1:** 64 neurons (ReLU activation)
- **Hidden Layer 2:** 32 neurons (ReLU activation)
- **Hidden Layer 3:** 16 neurons (ReLU activation)
- **Output Layer:** 4 neurons (Softmax - one per class)
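
To put the architecture in perspective, the number of trainable parameters can be counted directly from the layer sizes listed above (weights plus one bias per output neuron). A small worked sketch:

```python
# Layer widths taken from the architecture list: input, three hidden, output
layer_sizes = [14, 64, 32, 16, 4]

total_params = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    params = n_in * n_out + n_out  # weight matrix plus bias vector
    print(f"{n_in:3d} -> {n_out:3d}: {params:5d} parameters")
    total_params += params

print(f"Total trainable parameters: {total_params}")  # 3636
```

That is 3,636 parameters fitted on an 80-sample training set, of which `validation_fraction=0.2` holds out 16 samples for early stopping.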

**Training Parameters:**
- `solver='adam'` - Adaptive learning rate optimizer
- `learning_rate_init=0.001` - Initial learning rate
- `max_iter=500` - Maximum epochs
- `early_stopping=True` - Stop when the score on the held-out validation split stops improving

**Advantages:**
- ✅ Learns complex non-linear patterns
- ✅ Three stacked hidden layers (64, 32, 16)
- ✅ Adaptive learning rate (Adam)
- ✅ Early stopping helps limit overfitting
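
Gradient-based networks such as an MLP are generally sensitive to feature scale, and the features here span several orders of magnitude (skewness near 0.005, frequencies up to 400). `StandardScaler` is imported at the top of the notebook but never applied. One way it could be wired in is a scikit-learn pipeline, sketched below on synthetic stand-in data; `X_demo`, `y_demo`, and `scaled_mlp` are hypothetical names, and no claim is made about the accuracy this would give on the real features.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Stand-in data: one small-scale column and one frequency-like column, 4 classes
y_demo = np.repeat([0, 1, 2, 3], 20)
small_scale = rng.normal(-0.07, 0.05, size=80)
frequency = np.repeat([400.0, 297.0, 30.0, 250.0], 20) + rng.normal(0, 0.05, size=80)
X_demo = np.column_stack([small_scale, frequency])

# The scaler is fitted inside the pipeline, so it only ever sees the data passed to fit()
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=500, random_state=42),
)
scaled_mlp.fit(X_demo, y_demo)

predictions = scaled_mlp.predict(X_demo)
print(predictions.shape)
```

Keeping the scaler in the pipeline means the same transformation is applied automatically at prediction time, and the whole object can be pickled as one model.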

---

In [9]:
print('='*70)
print('TRAINING MLP NEURAL NETWORK')
print('='*70)

# Initialize MLP
mlp_model = MLPClassifier(
    hidden_layer_sizes=(64, 32, 16),  # 3 hidden layers
    activation='relu',                 # ReLU activation
    solver='adam',                     # Adam optimizer
    learning_rate_init=0.001,          # Learning rate
    max_iter=500,                      # Maximum epochs
    random_state=42,
    early_stopping=True,               # Stop if no improvement
    validation_fraction=0.2,           # 20% for validation
    verbose=False
)

print("\n🔄 Training MLP Neural Network...")
print("   ├─ Architecture: Input(14) → Hidden(64,32,16) → Output(4)")
print("   ├─ Activation: ReLU")
print("   ├─ Optimizer: Adam")
print("   └─ Early stopping enabled")

# Train the model
mlp_model.fit(X_train, y_train)

print(f"\n✅ Training complete!")
print(f"   └─ Converged in {mlp_model.n_iter_} iterations\n")

# Make predictions
print("🔄 Making predictions on test set...")
y_pred_mlp = mlp_model.predict(X_test)
y_pred_proba_mlp = mlp_model.predict_proba(X_test)
print("✅ Predictions complete!\n")

# Evaluate
accuracy_mlp = accuracy_score(y_test, y_pred_mlp)
precision_mlp = precision_score(y_test, y_pred_mlp, average='weighted', zero_division=0)
recall_mlp = recall_score(y_test, y_pred_mlp, average='weighted', zero_division=0)
f1_mlp = f1_score(y_test, y_pred_mlp, average='weighted', zero_division=0)

print('='*70)
print('MLP NEURAL NETWORK PERFORMANCE')
print('='*70)
print(f"🎯 Accuracy:  {accuracy_mlp*100:.2f}%")
print(f"🎯 Precision: {precision_mlp*100:.2f}%")
print(f"🎯 Recall:    {recall_mlp*100:.2f}%")
print(f"🎯 F1-Score:  {f1_mlp*100:.2f}%")
print('='*70)

# Detailed classification report
print("\n📊 Detailed Classification Report:")
print("="*70)
print(classification_report(
    y_test, y_pred_mlp, 
    target_names=label_encoder.classes_,
    zero_division=0
))
print("="*70)

# Save model
with open('models/mlp_model.pkl', 'wb') as f:
    pickle.dump(mlp_model, f)
print("\n💾 Model saved: models/mlp_model.pkl")

TRAINING MLP NEURAL NETWORK

🔄 Training MLP Neural Network...
   ├─ Architecture: Input(14) → Hidden(64,32,16) → Output(4)
   ├─ Activation: ReLU
   ├─ Optimizer: Adam
   └─ Early stopping enabled

✅ Training complete!
   └─ Converged in 14 iterations

🔄 Making predictions on test set...
✅ Predictions complete!

MLP NEURAL NETWORK PERFORMANCE
🎯 Accuracy:  25.00%
🎯 Precision: 6.25%
🎯 Recall:    25.00%
🎯 F1-Score:  10.00%

📊 Detailed Classification Report:
              precision    recall  f1-score   support

        ball       0.00      0.00      0.00         5
  inner_race       0.25      1.00      0.40         5
      normal       0.00      0.00      0.00         5
  outer_race       0.00      0.00      0.00         5

    accuracy                           0.25        20
   macro avg       0.06      0.25      0.10        20
weighted avg       0.06      0.25      0.10        20


💾 Model saved: models/mlp_model.pkl
