<a href="https://colab.research.google.com/github/cwhitz/ts-trove/blob/master/notebooks/classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Time Series Classification

This notebook explores various time series classification techniques. It makes much fuller use of the bearings dataset also explored in the signal analysis notebook.

## Overview

Time series classification involves assigning time series instances to predefined categories. This notebook will cover:

### Table of Contents

> 1 [Data Preparation](#Data-Preparation)

1.1. [Data Download](##Data-Download)

1.2 [Data Organization](##Data-Organization)

> 2 [Utility Functions](#Training-Functions)

2.1 [Data Loader](##Data-Loader)

> 3 [Time Series Classification with SciKit](#Scikit-Functions)

3.1 [Scikit Trainer and Evaluator](##Scikit-Trainer-and-Evaluator)

> 4 [Deep Learning Models](#Deep-Learning-Models)

4.1 [PyTorch Trainer and Evaluator](##-Trainer-and-Evaluator)

4.2 [Fully Connected Neural Networks]()

4.3 [Recurrent Neural Networks]()

4.3.1 [Classic Recurrent Neural Network]()

4.3.2 [Long Short Term Memory (LSTM) Neural Network]()

4.3.3 [Gated Recurrent Neural Network]()

4.4 [Convolutional Neural Networks]()

4.4.1 [1D Convolutional Neural Network]()

4.4.2 [Temporal Convolutional Network]()

4.5 [Attention Based Models]()

4.5.1 [LSTM with Attention]()

4.5.2 [Time Series Transformer]()



In [1]:
import pandas as pd
import numpy as np
import os
import pathlib
import matplotlib.pyplot as plt
import json
import pathlib
import shutil
import kagglehub

import tqdm

In [2]:
!pip install cesium

Collecting cesium
  Downloading cesium-0.12.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Collecting gatspy>=0.3.0 (from cesium)
  Downloading gatspy-0.3.tar.gz (554 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m554.5/554.5 kB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Downloading cesium-0.12.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (806 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m806.5/806.5 kB[0m [31m21.2 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: gatspy
  Building wheel for gatspy (setup.py) ... [?25l[?25hdone
  Created wheel for gatspy: filename=gatspy-0.3-py3-none-any.whl size=43804 sha256=630bee8924cd22299491fbe20895aef47f95bb0aadd1414e5cc0cb4904b4cfdd
  Stored in directory: /root/.cache/pip/wheels/b5/56/88/04643e9be584a6018e10aae5789d98225995da3e89513c3f30
Successfully built gatspy
Installi

# Data Preparation

## Data Overview

**What is this dataset?**

This is a collection of vibration data from electric motor bearings. Bearings those small spinning parts that let machinery rotate smoothly. Think of a bearing like the axle in a wheel: it's got little metal balls inside that roll around, letting a shaft spin with barely any friction.

**Why was this dataset created?**

The researchers at Case Western Reserve University deliberately damaged bearings in different ways, then recorded how the motor vibrated as a result. They made tiny cracks of various sizes (ranging from 7 to 40 thousandths of an inch) in the bearings, then attached vibration sensors to measure what happened.

Cracks of different mm sizes were introduced on the outer race, inner race, and the balls themselves.

![ball_bearing_diagram](https://www.globalspec.com/ImageRepository/LearnMore/20133/ball%20bearing5364b00280ef4db7b85dfba113f04556.png)

The goal was to understand the relationship between bearing damage and vibration patterns, creating a reference library that shows what different types of bearing failure look like.

**Why is it useful?**

This data is incredibly useful for real-world maintenance and diagnostics. In factories and power plants, you can use vibration patterns to detect bearing problems before they cause catastrophic failures. By comparing vibrations from a running machine to patterns in this dataset, maintenance teams can identify early signs of wear, predict when a bearing will fail, and schedule repairs before expensive downtime happens. It's basically like a fingerprint database for bearing damage—once you know what a damaged bearing "sounds like," you can spot trouble coming.

**So what are we actually trying to predict?**

Good question. We will try to train machine learning models to predict three things: 1) Is the bearing in normal operation? 2) If not, where is the crack? 3) And what size is it?

2 and 3 of course become irrelevant if the bearing is in normal operation, but they allow us to go a step beyond simple detection of irregular operation.


## Data Download

The raw data can be downloaded directly from Kaggle.

In [3]:
kagglepath = "sufian79/cwru-mat-full-dataset"
path = kagglehub.dataset_download(kagglepath)


pathlib.Path(f"./{kagglepath.split('/')[-1]}").mkdir(parents=True, exist_ok=True)
shutil.copytree(path, f"./{kagglepath.split('/')[-1]}", dirs_exist_ok=True)

Using Colab cache for faster access to the 'cwru-mat-full-dataset' dataset.


'./cwru-mat-full-dataset'

## Data Organization

The raw data is a collection of numbered mat files and requires reference back to the [original website](https://engineering.case.edu/bearingdatacenter/48k-drive-end-bearing-fault-data) to make sense of. I've gone ahead and done that with the JSON structure below.

The data is organized at top-level describing the type of fault, or lack thereof with "normal" sample files are the motor operating without faults. The next level down is the sampling rate, followed by the location where the crack was introduced (IR being inner race, B being ball, OR being outer race) and then finally, the size of the cracks ranging from 7 to 21 mm.

The code below this cell moves the individual samples into folders matching the structure below, which aligns with how PyTorch's DataSet and DataLoader work (we will make it work for scikit too).

In [4]:
folder_structure = {
  "normal": {
    "48k": ["97", "98", "99", "100"]
  },
  "drive_end_fault": {
    "12k": {
      "IR": {
        "007": ["105", "106", "107", "108"],
        "014": ["169", "170", "171", "172"],
        "021": ["209", "210", "211", "212"]
      },

      "B": {
        "007": ["118", "119", "120", "121"],
        "014": ["185", "186", "187", "188"],
        "021": ["222", "223", "224", "225"]
      },

      "OR": {
        "007": ["130", "131", "132", "133"],
        "014": ["197", "198", "199", "200"],
        "021": ["234", "235", "236", "237"]
      }
    },

    "48k": {
      "IR": {
        "007": ["109", "110", "111", "112"],
        "014": ["174", "175", "176", "177"],
        "021": ["213", "214", "215", "217"]
      },

      "B": {
        "007": ["122", "123", "124", "125"],
        "014": ["189", "190", "191", "192"],
        "021": ["226", "227", "228", "229"]
      },

      "OR": {
        "007": ["135", "136", "137", "138"],
        "014": ["201", "202", "203", "204"],
        "021": ["238", "239", "240", "241"]
      }
    }
  },

  "fan_end_fault": {
    "12k": {
      "IR": {
        "007": ["278", "279", "280", "281"],
        "014": ["274", "275", "276", "277"],
        "021": ["270", "271", "272", "273"]
      },

      "B": {
        "007": ["282", "283", "284", "285"],
        "014": ["286", "287", "288", "289"],
        "021": ["290", "291", "292", "293"]
      },

      "OR": {
        "007": ["298", "299", "300", "301"],
        "014": ["309", "310", "311", "312"],
        "021": ["315", "316", "317", "318"]
      }
    }
  }
}

In [5]:
SOURCE_DIR = "cwru-mat-full-dataset/"
TARGET_DIR = "classification-cwru-mat-organized"
FILE_EXTENSION = ".mat"

def ensure_dir(path):
    os.makedirs(path, exist_ok=True)

def move_file(file_id, dest_dir):
    filename = file_id + FILE_EXTENSION
    src_path = os.path.join(SOURCE_DIR, filename)
    dst_path = os.path.join(dest_dir, filename)

    if not os.path.exists(src_path):
        print(f"⚠️ Missing file: {src_path}")
        return

    ensure_dir(dest_dir)
    shutil.move(src_path, dst_path)

def walk_structure(node, current_path):
    if isinstance(node, list):
        for file_id in node:
            move_file(file_id, current_path)
    elif isinstance(node, dict):
        for key, child in node.items():
            walk_structure(child, os.path.join(current_path, key))
    else:
        raise ValueError("Unexpected structure type")


walk_structure(folder_structure, TARGET_DIR)
print("Done.")

Done.


# Utility Functions

## Data Loader

Before diving into modeling, we first need a consistent way to load and represent our time-series data. Since later sections will experiment with both deep learning and traditional classifiers, we define a reusable dataset structure that keeps preprocessing, sampling rate handling, and labels consistent across all methods.

In [6]:
from torch.utils.data import Dataset
from torch.nn import Module
import scipy.io
import enum

# samplng rate enum
class SamplingRate(enum.Enum):
    sr12K = "12k"
    sr48K = "48k"

class FaultLocation(enum.Enum):
    DE = "drive_end_fault"
    FE = "front_end_fault"


class BearingDataset(Dataset):
    def __init__(self, file_paths, sampling_rate, fault_location, chunk_length, unified_label=True, transform=None):
        self.file_paths = file_paths
        self.sampling_rate = sampling_rate
        self.fault_location = fault_location
        self.chunk_length = chunk_length
        self.transform = transform
        self.unified_label = unified_label

        self.data = []
        self.labels = []

        self._organize_data()

    def _organize_data(self):
        for fp in self.file_paths:
            if not pathlib.Path(fp).exists():
                raise FileNotFoundError(f"File not found: {fp}")

            mat_data = scipy.io.loadmat(fp)

            key_to_match = f"_{str(self.fault_location)[-2:]}_time"
            sensor_key = [key for key in mat_data.keys() if key_to_match in key][0]

            signal = mat_data[sensor_key].squeeze()

            n_chunks = len(signal) // self.chunk_length
            truncated = signal[:n_chunks * self.chunk_length]

            windows = truncated.reshape(n_chunks, self.chunk_length)

            label_parts = fp.parent.parts
            if label_parts[-2] == 'normal':
                label_dict = {
                    'normal': True,
                    'fault_location': 'NA',
                    'crack_size': 'NA'
                }
            else:
                label_dict = {
                    'normal': False,
                    'fault_location': label_parts[-2],
                    'crack_size': label_parts[-1]
                }


            for window in windows:
              self.data.append(window)

              if self.unified_label:
                self.labels.append(f"{label_dict['fault_location']}_{label_dict['crack_size']}" if label_dict['normal'] == False else "normal")
              else:
                self.labels.append(label_dict)


    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        window = self.data[idx]
        label = self.labels[idx]

        if self.transform:
            window = self.transform(window).astype('float32')

        return window, label

The dataset class above seeks to make the most of the data available in the bearings dataset by splitting each sample in the file into multiple overlapping windows. This increases the effective number of training samples and helps models learn more robust patterns. However, care must be taken to avoid data leakage between training and test sets when using overlapping windows - if we were to pull all the data and then split into train/test, windows from the same original sample could end up in both sets.

To prevent this, we ensure that all windows derived from a given file are assigned to either the training or test set exclusively by splitting into train/test at the file level.

In [7]:
from sklearn.model_selection import train_test_split
from pathlib import Path
from collections import Counter

all_files = list(Path("classification-cwru-mat-organized").rglob("*.mat"))

# derive one label per file
file_labels = [
    '_'.join(f.parent.parts[-2:])
    for f in all_files
]

train_files, test_files = train_test_split(
    all_files,
    test_size=.2,
    shuffle=True,
    stratify=file_labels
)

##

We want to set up a class for testing different classification techniques on the bearings dataset. The class will accept a dataset object and classification model, and be able to train and evaluate the model consistently for metrics like accuracy, precision, recall, and F1-score as well as time for training and inference.

In [8]:
from abc import ABC, abstractmethod
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from tqdm import tqdm
import copy


class ClassificationTrainTestEvaluate(ABC):
    def __init__(self, train_dataset: Dataset, test_dataset: Dataset):
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset

        self.model = None

    def load_model(self, model):
        self.model = model

    def classification_report(self):
      """
      Creates a Plotly figure with three tabs, each showing:
      - Confusion matrix heatmap
      - Metrics summary table

      One tab per task: Fault Detection, Fault Location, Crack Size
      """
      from plotly.subplots import make_subplots
      import plotly.graph_objects as go
      from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

      # Define task names and their corresponding predictions/labels
      tasks = {
          'Fault Detection': {
              'predictions': self.predictions_fault_detection,
              'labels': self.test_y_fault_detection,
              'class_names': ['Normal', 'Fault']
          },
          'Fault Location': {
              'predictions': self.predictions_fault_location,
              'labels': self.test_y_fault_location,
              'class_names': ['B', 'IR', 'OR']
          },
          'Crack Size': {
              'predictions': self.predictions_crack_size,
              'labels': self.test_y_crack_size,
              'class_names': ['007', '014', '021']
          }
      }

      # Create subplots for each task
      figs = []

      for task_name, task_data in tasks.items():
          predictions = task_data['predictions']
          labels = task_data['labels']
          class_names = task_data['class_names']

          # Compute confusion matrix
          cm = confusion_matrix(labels, predictions)

          # Compute metrics
          accuracy = accuracy_score(labels, predictions)
          precision = precision_score(labels, predictions, average='weighted', zero_division=0)
          recall = recall_score(labels, predictions, average='weighted', zero_division=0)
          f1 = f1_score(labels, predictions, average='weighted', zero_division=0)

          # Create subplot layout
          fig = make_subplots(
              rows=1, cols=2,
              column_widths=[0.6, 0.4],
              specs=[[{"type": "heatmap"}, {"type": "table"}]],
              subplot_titles=("Confusion Matrix", "Model Performance Metrics")
          )

          # --- Confusion Matrix Heatmap ---
          fig.add_trace(
              go.Heatmap(
                  z=cm,
                  x=class_names,
                  y=class_names,
                  text=cm,
                  texttemplate="%{text}",
                  colorscale="Blues",
                  showscale=False
              ),
              row=1, col=1
          )

          fig.update_xaxes(title_text="Predicted Label", row=1, col=1)
          fig.update_yaxes(title_text="True Label", row=1, col=1)

          # --- Metrics Table ---
          fig.add_trace(
              go.Table(
                  header=dict(
                      values=["Metric", "Value"],
                      fill_color="lightgrey",
                      align="center"
                  ),
                  cells=dict(
                      values=[
                          ["Accuracy", "Precision", "Recall", "F1 Score"],
                          [f"{accuracy:.4f}", f"{precision:.4f}", f"{recall:.4f}", f"{f1:.4f}"]
                      ],
                      align="center"
                  )
              ),
              row=1, col=2
          )

          fig.update_layout(
              title=f"{task_name} - Evaluation Summary",
              height=500,
              width=900,
              showlegend=False
          )

          figs.append((task_name, fig))

      # Display each figure
      for task_name, fig in figs:
          fig.show()

class SciKitCTTE(ClassificationTrainTestEvaluate):
    def prepare_data(self):
        self.train_X, self.train_y = pd.DataFrame(), pd.Series()
        print("Preparing training data...")
        for i in tqdm(range(len(self.train_dataset))):
            X_chunk, label = self.train_dataset[i]

            self.train_X = pd.concat([self.train_X, X_chunk], ignore_index=True)
            self.train_y = pd.concat([self.train_y, pd.Series(label)], ignore_index=True)

        self.test_X, self.test_y = pd.DataFrame(), pd.Series()
        print("Preparing test data...")
        for i in tqdm(range(len(self.test_dataset))):
            X_chunk, labels = self.test_dataset[i]

            self.test_X = pd.concat([self.test_X, X_chunk], ignore_index=True)
            self.test_y = pd.concat([self.test_y, pd.Series(labels)], ignore_index=True)

    def train(self, train_X, train_y):
        self.model.fit(train_X, train_y)
        self.class_names = sorted(self.train_y.unique())

    def evaluate(self, test_X, test_y):
        self.predictions = self.model.predict(test_X)


# Feature Extraction + Feature Based Classification

With a dataset abstraction in place, we can now explore different families of time-series classification techniques. The goal here is not only to compare performance, but also to understand how different representation choices affect model behavior on sensor-like signals.

We begin with feature-based methods, which transform raw time-series into fixed-length statistical representations. These approaches are often strong baselines, easier to interpret, and computationally efficient compared to end-to-end deep learning models.

### Feature Extraction

We will implement a custom transformer class for the PyTorch dataset to extract statistical features using the `cesium` library.

In [9]:
from cesium import featurize

class FeatureExtractionTransform(Module):
    def forward(self, window):
        features_to_use = [
            "amplitude",
            "percent_beyond_1_std",
            "maximum",
            "max_slope",
            "median",
            "median_absolute_deviation",
            "percent_close_to_median",
            "minimum",
            "period_fast",
            "skew",
            "std",
        ]

        fset = featurize.featurize_time_series(
            times=np.arange(len(window)),
            values=window,
            errors=None,
            features_to_use=features_to_use,
        )

        fset = fset.stack(future_stack=True)

        return fset


In [10]:
train_dataset = BearingDataset(
    train_files,
    sampling_rate=SamplingRate.sr48K,
    fault_location=FaultLocation.DE,
    chunk_length=1200,
    unified_label=True,
    transform=FeatureExtractionTransform()
)

test_dataset = BearingDataset(
    test_files,
    sampling_rate=SamplingRate.sr48K,
    fault_location=FaultLocation.DE,
    chunk_length=1200,
    unified_label=True,
    transform=FeatureExtractionTransform()
)

In [11]:
# from sklearn.ensemble import RandomForestClassifier

# sk_ctte = SciKitCTTE(
#     train_dataset,
#     test_dataset)

# sk_ctte.prepare_data()


In [12]:
# rfc = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)

# sk_trainer = sk_ctte
# sk_trainer.load_model(rfc)

# sk_trainer.train(sk_trainer.train_X, sk_trainer.train_y)
# sk_trainer.evaluate(sk_trainer.test_X, sk_trainer.test_y)
# sk_trainer.classification_report()

In [13]:
# from sklearn.svm import SVC

# svm = SVC(kernel='linear', C=.1, random_state=42)

# sk_trainer = sk_ctte
# sk_trainer.load_model(svm)

# sk_trainer.train(sk_trainer.train_X, sk_trainer.train_y)
# sk_trainer.evaluate(sk_trainer.test_X, sk_trainer.test_y)
# sk_trainer.classification_report()

# Deep Learning Models

In this section, I will explore a wide variety of neural network models to find which can perform the best at what is essentially a many-to-one problem, where we are giving the model a dataset of many measurements of vibrational movement where ordering matters, because those measurements unfolded across time.

4.1 [PyTorch Trainer and Evaluator](##
PyTorch-Trainer-and-Evaluator)

4.2 [Fully Connected Neural Networks]()

4.3 [Recurrent Neural Networks]()

4.3.1 [Classic Recurrent Neural Network]()

4.3.2 [Long Short Term Memory (LSTM) Neural Network]()

4.3.3 [Gated Recurrent Neural Network]()

4.4 [Convolutional Neural Networks]()

4.4.1 [1D Convolutional Neural Network]()

4.4.2 [Temporal Convolutional Network]()

4.5 [Attention Based Models]()

4.5.1 [LSTM with Attention]()

4.5.2 [Time Series Transformer]()

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

In [14]:
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

CUDA available: True
GPU count: 1
GPU name: Tesla T4


##PyTorch Trainer and Evaluator

In [15]:
from torch.utils.data import DataLoader
from torch import Tensor, float32, LongTensor
import torch
from tqdm import tqdm

class PyTorchCTTE(ClassificationTrainTestEvaluate):
    def __init__(self, train_dataset: Dataset, test_dataset: Dataset, device='cpu', criterion=None, detection_weighting=.5):
        super().__init__(train_dataset, test_dataset)
        self.device = device
        self.criterion = criterion

        self.target_mapping = {
          'fault_location': {'B': 0, 'IR': 1, 'OR': 2, 'NA': 3},
          'crack_size': {'007': 0, '014': 1, '021': 2, 'NA': 3}
          }

        self.train_dataset_mean = None
        self.train_dataset_std = None
        self.detection_weighting = detection_weighting

    def __deepcopy__(self, memo):
        """Deep copy - recursively copies nested objects"""
        return PyTorchCTTE(
            copy.deepcopy(self.train_dataset, memo),
            copy.deepcopy(self.test_dataset, memo),
            copy.deepcopy(self.device, memo),
            copy.deepcopy(self.criterion, memo)
        )

    def load_model(self, model):
        self.model = model

    def load_optimizer(self, optimizer):
        self.optimizer = optimizer

    def prepare_data(self):
        self.train_dataloader = DataLoader(self.train_dataset, batch_size=64, shuffle=True)
        self.test_dataloader = DataLoader(self.test_dataset, batch_size=64, shuffle=False)

    def train(self, epochs: int, batch_size: int):
        self.model.to(self.device)

        # self.train_dataset_mean = np.mean(np.concatenate(self.train_dataset.data))
        # self.train_dataset_std = np.std(np.concatenate(self.train_dataset.data))

        for epoch in range(epochs):
            self.model.train()
            epoch_loss = 0.0

            progress_bar = tqdm(self.train_dataloader, desc=f"Epoch {epoch+1}/{epochs}")

            for batch_X, batch_y in progress_bar:
                # X
                batch_X = Tensor(batch_X.to(float32))
                batch_X = (batch_X - batch_X.mean(dim=1, keepdim=True)) / (batch_X.std(dim=1, keepdim=True) + 1e-6)
                batch_X = batch_X.to(self.device)

                # ys
                batch_y_fault_detection = [l for l in batch_y['normal']]
                batch_y_fault_location = [self.target_mapping['fault_location'].get(l, 2) for l in batch_y['fault_location']]
                batch_y_crack_size = [self.target_mapping['crack_size'].get(l, 2) for l in batch_y['crack_size']]

                # move to GPU
                batch_y_fault_detection = LongTensor(batch_y_fault_detection).to(self.device)
                batch_y_fault_location = LongTensor(batch_y_fault_location).to(self.device)
                batch_y_crack_size = LongTensor(batch_y_crack_size).to(self.device)

                # clear gradients before training
                self.optimizer.zero_grad()

                # run the inputs through the network
                fault_detection, fault_location, crack_size = self.model(batch_X)

                # calculate the loss
                loss_fault_detection = self.criterion(fault_detection, batch_y_fault_detection)
                loss_fault_location = self.criterion(fault_location, batch_y_fault_location)
                loss_crack_size = self.criterion(crack_size, batch_y_crack_size)

                # sum to total loss
                total_loss = (self.detection_weighting * loss_fault_detection) + loss_fault_location + loss_crack_size

                # backpropagate
                total_loss.backward()

                self.optimizer.step()

                epoch_loss += total_loss.item()
                progress_bar.set_postfix(loss=total_loss.item())

            print(f"Epoch {epoch+1} avg loss: {epoch_loss/len(self.train_dataloader):.4f}")

    def evaluate(self):
        self.model.eval()
        self.predictions_fault_detection = []
        self.predictions_fault_location = []
        self.predictions_crack_size = []
        self.test_y_fault_detection = []
        self.test_y_fault_location = []
        self.test_y_crack_size = []

        with torch.no_grad():
            for batch_X, batch_y in self.test_dataloader:
                batch_X = batch_X = Tensor(batch_X.to(float32))
                batch_X = (batch_X - batch_X.mean(dim=1, keepdim=True)) / (batch_X.std(dim=1, keepdim=True) + 1e-6)
                batch_X = batch_X.to(self.device)

                fault_detection, fault_location, crack_size = self.model(batch_X)

                # Get predictions for each task
                _, pred_fd = torch.max(fault_detection, 1)
                _, pred_fl = torch.max(fault_location, 1)
                _, pred_cs = torch.max(crack_size, 1)

                self.predictions_fault_detection.extend(pred_fd.cpu().numpy().tolist())
                self.predictions_fault_location.extend(pred_fl.cpu().numpy().tolist())
                self.predictions_crack_size.extend(pred_cs.cpu().numpy().tolist())

                # Store true labels
                self.test_y_fault_detection.extend([int(l) for l in batch_y['normal']])
                self.test_y_fault_location.extend([self.target_mapping['fault_location'].get(l, 0) for l in batch_y['fault_location']])
                self.test_y_crack_size.extend([self.target_mapping['crack_size'].get(l, 0) for l in batch_y['crack_size']])

### Datasets for Deep Learning

In [16]:
from torch.nn import CrossEntropyLoss

train_dataset = BearingDataset(
    train_files,
    sampling_rate=SamplingRate.sr48K,
    fault_location=FaultLocation.DE,
    unified_label=False,
    chunk_length=1200
)

test_dataset = BearingDataset(
    test_files,
    sampling_rate=SamplingRate.sr48K,
    fault_location=FaultLocation.DE,
    unified_label=False,
    chunk_length=1200
)

pytorch_ctte = PyTorchCTTE(
    train_dataset,
    test_dataset,
    device=device,
    criterion=CrossEntropyLoss()
)

##Fully Connected Neural Network

### Model Intuition

A fully connected neural network treats each vibration sample as a fixed-length vector, learning relationships between all points in the signal simultaneously.

Unlike sequential models, it makes no assumptions about temporal ordering — every input element is connected to every neuron in the next layer, allowing it to discover arbitrary correlations across the entire 1200-point reading. The first layer projects the raw signal into a 512-dimensional space, expanding the representation to capture a rich set of features, while the second layer compresses to 256 dimensions, acting as a bottleneck that forces the network to distill the most discriminative patterns. ReLU activations between layers introduce nonlinearity, enabling the network to learn complex decision boundaries that a simple linear classifier could not. This architecture is well suited for vibration classification when the signal length is fixed and the spatial relationships between measurement points carry meaningful information about fault characteristics, as it is in this project.


### Model Definition

In [17]:
from torch import nn
import torch
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss

class FCNN(nn.Module):
    def __init__(self, input_dim=1200, num_fault_locations=4, num_crack_sizes=4):
        super(FCNN, self).__init__()
        self.shared = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
        )

        self.fault_detection_output = nn.Linear(256, 2)
        self.fault_location_output = nn.Linear(256, num_fault_locations)
        self.crack_size_output = nn.Linear(256, num_crack_sizes)

        # Xavier initialization
        nn.init.xavier_uniform_(self.fault_detection_output.weight)
        nn.init.xavier_uniform_(self.fault_location_output.weight)
        nn.init.xavier_uniform_(self.crack_size_output.weight)

    def forward(self, x):
        x = self.shared(x)

        return (
            torch.sigmoid(self.fault_detection_output(x)),
            self.fault_location_output(x),
            self.crack_size_output(x)
        )

In [18]:
fcnn_ctte = copy.deepcopy(pytorch_ctte)

fcnn_model = FCNN(
    input_dim=1200
)

fcnn_ctte.load_model(fcnn_model)

fcnn_ctte.load_optimizer(
    torch.optim.Adam(fcnn_model.parameters(), lr=1e-3)
)


### Training

In [19]:
fcnn_ctte.prepare_data()
fcnn_ctte.train(epochs=20, batch_size=64)
fcnn_ctte.evaluate()

Epoch 1/20: 100%|██████████| 254/254 [00:01<00:00, 130.06it/s, loss=1.46]


Epoch 1 avg loss: 1.7341


Epoch 2/20: 100%|██████████| 254/254 [00:01<00:00, 183.71it/s, loss=1.01]


Epoch 2 avg loss: 0.9752


Epoch 3/20: 100%|██████████| 254/254 [00:01<00:00, 157.90it/s, loss=0.78]


Epoch 3 avg loss: 0.6372


Epoch 4/20: 100%|██████████| 254/254 [00:01<00:00, 155.82it/s, loss=0.601]


Epoch 4 avg loss: 0.4479


Epoch 5/20: 100%|██████████| 254/254 [00:01<00:00, 186.87it/s, loss=0.278]


Epoch 5 avg loss: 0.3589


Epoch 6/20: 100%|██████████| 254/254 [00:01<00:00, 187.59it/s, loss=0.307]


Epoch 6 avg loss: 0.3219


Epoch 7/20: 100%|██████████| 254/254 [00:01<00:00, 186.61it/s, loss=0.208]


Epoch 7 avg loss: 0.2950


Epoch 8/20: 100%|██████████| 254/254 [00:01<00:00, 186.93it/s, loss=0.331]


Epoch 8 avg loss: 0.2722


Epoch 9/20: 100%|██████████| 254/254 [00:01<00:00, 185.37it/s, loss=0.186]


Epoch 9 avg loss: 0.2489


Epoch 10/20: 100%|██████████| 254/254 [00:01<00:00, 183.96it/s, loss=0.2]


Epoch 10 avg loss: 0.2531


Epoch 11/20: 100%|██████████| 254/254 [00:01<00:00, 179.65it/s, loss=0.264]


Epoch 11 avg loss: 0.2653


Epoch 12/20: 100%|██████████| 254/254 [00:01<00:00, 162.37it/s, loss=0.258]


Epoch 12 avg loss: 0.2744


Epoch 13/20: 100%|██████████| 254/254 [00:01<00:00, 159.02it/s, loss=0.298]


Epoch 13 avg loss: 0.2732


Epoch 14/20: 100%|██████████| 254/254 [00:01<00:00, 184.68it/s, loss=0.316]


Epoch 14 avg loss: 0.2474


Epoch 15/20: 100%|██████████| 254/254 [00:01<00:00, 182.82it/s, loss=0.202]


Epoch 15 avg loss: 0.2307


Epoch 16/20: 100%|██████████| 254/254 [00:01<00:00, 183.87it/s, loss=0.47]


Epoch 16 avg loss: 0.2123


Epoch 17/20: 100%|██████████| 254/254 [00:01<00:00, 185.27it/s, loss=0.192]


Epoch 17 avg loss: 0.2235


Epoch 18/20: 100%|██████████| 254/254 [00:01<00:00, 179.35it/s, loss=0.206]


Epoch 18 avg loss: 0.2600


Epoch 19/20: 100%|██████████| 254/254 [00:01<00:00, 187.16it/s, loss=0.183]


Epoch 19 avg loss: 0.2423


Epoch 20/20: 100%|██████████| 254/254 [00:01<00:00, 173.49it/s, loss=0.213]


Epoch 20 avg loss: 0.2176


### Classification Report

In [20]:
fcnn_ctte.classification_report()

## ResNet

### Model Intuition


### Model Anatomy

**Residual Block Internals**

In the architecture, residual blocks are defined as a separate module class outside the main ResNet1D class and then used in aggregate within the network definition block. Let's look at the definition of the internal components of *ResidualBlock1D* to start. It operates in two modes, depending if the output shape differs from the input shape - then it is downsampling - or not. Looking at the components as they are defined in the forward pass:

* `shortcut` is defined differently if the block is performing downsampling or not. If it is not, then the shortcut is simply a passthrough. If it is performing downsampling, then the input is downsampled to the output dimensions using conv1d with a stride of two to halve the input dimension to the needed output dimension. Note that the identity is used to cache the input identity for later, it is not passed further along the network (yet).
* `conv1` A convolutional layer is applied, but critically only to every other input (stride=2) when downsampling. When every other input is skipped, it halves the length of the output. Batch norm, ReLu and dropout are applied after to normalize, incorporate non-linearity, and prevent overfitting respectively.
* `conv2` is a second convolution to help refine the features again without changing dimensions. Batch norm and dropout applied afterward, but not ReLu.
* `out += identity` is the key move that makes this 'res' net. The input that we cached at the start is added in at the end of the convolutional layers. The final ReLU activates the combined result before passing to the next block.

> This addition of the input layers at the end of this block is the core idea of ResNet. Instead of the network learning the full transformation F(x), it only needs to learn the difference from the input: F(x) = H(x) - x, so the output is x + F(x), the residuals in res net. This means if the optimal transformation is close to doing nothing, the network just learns to push F(x) toward zero — which is much easier than learning a full identity mapping from scratch. This is why deep ResNets can train where plain deep networks collapse.

Now that we understand what is going on in a Residual Block, we will look at the whole architecture of the network.

**Initial Convolutional Layer**

To start, 32 one-dimensional [convolutional filters](https://developers.google.com/machine-learning/glossary#convolutional_filter) are applied to the vibration signal. The convolution uses a wide 1-dimensional kernel (16 samples) to learn filter parameters capable of capturing broad structural features like impulse responses, periodic oscillations, and transient events at different scales from the raw input before passing them into the residual stages. This produces a new value from the filter for each part of the original time series, for each of the 32 filters that are learned - the output shape is 1199 (samples, one lost due to padding) by 32 (learned filters). BatchNorm normalizes each of the 32 channels independently to zero mean and unit variance across the batch, and then ReLu zeros out all negative values.

**Residual Blocks Layers**

A series of eight residual blocks forms the heart of the network. They are arranged into four sequential groups, where each group halves the temporal dimension through stride-2 downsampling while progressively widening the channel count (32→32→64→128→128). The kernel size also steadily shrinks across groups (7→7→5→3), allowing later layers to learn increasingly fine-grained features now that earlier layers have already built up broad contextual awareness.

> Each time a group downsamples, the temporal resolution is cut in half (e.g. 1199→600→300→150), but each remaining position now represents a wider window of the original signal. Combined with the residual connections carrying forward earlier representations, this means deeper blocks operate with an increasingly large receptive field. Each value in the compressed network is influenced by a broader stretch of the original vibration signal, allowing the network to pick up on longer-range structural patterns that wouldn't be visible at finer temporal scales.

**Final Learning Layers**

`AvgPool1D` is the adaptive average pooling layer collapses whatever temporal dimension remains (e.g. 150 time steps) down to a single value per channel by averaging across the entire length. This produces a fixed-size vector of 128 values — one summary statistic per learned feature channel — regardless of the original input length. This is what allows the network to transition from convolutional feature extraction into the fully connected classification heads.

`fc_shared` is a fully connected layer that takes the 128-dimensional pooled vector and maps it to another 128-dimensional representation, followed by ReLU and dropout. It acts as a shared bottleneck that gives the network a chance to learn a final combined representation before branching into the three separate classification heads. Without it, each head would be working directly from the pooled convolutional features — this extra layer lets the network learn a task-aware remixing of those features that benefits all three outputs jointly.

From the outputs of the `fc_shared` layer, the multiple classification heads are able to learn.

### Model Definition

In [21]:

class ResidualBlock1D(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=7, downsample=False, dropout=0.1):
        super().__init__()
        stride = 2 if downsample else 1
        padding = kernel_size // 2

        # residual block layer internals - definition
        self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size,
                               stride=stride, padding=padding)
        self.bn1 = nn.BatchNorm1d(out_channels)
        self.dropout1 = nn.Dropout(dropout)

        self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size,
                               stride=1, padding=padding)
        self.bn2 = nn.BatchNorm1d(out_channels)
        self.dropout2 = nn.Dropout(dropout)

        self.shortcut = nn.Identity()
        if downsample or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv1d(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm1d(out_channels)
            )

    def forward(self, x):
        # residual block layer internals - implementation
        identity = self.shortcut(x)
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        out = self.dropout1(out)
        out = self.conv2(out)
        out = self.bn2(out)
        out += identity
        return F.relu(out)


class ResNet1D_pt(nn.Module):
    def __init__(self,
                 input_channels=1,
                 num_location_classes=4,
                 num_size_classes=4):
        super().__init__()

        # Convolutional Layer - definition
        self.stem = nn.Sequential(
            nn.Conv1d(input_channels, 32, kernel_size=16, stride=1, padding=7),
            nn.BatchNorm1d(32),
            nn.ReLU(),
        )

        # residual blocks layers - definition
        self.layer1a = ResidualBlock1D(32, 32, kernel_size=7, downsample=False)
        self.layer1b = ResidualBlock1D(32, 32, kernel_size=7, downsample=False)

        self.layer2a = ResidualBlock1D(32, 64, kernel_size=7, downsample=True)   # 1200 -> 600
        self.layer2b = ResidualBlock1D(64, 64, kernel_size=7, downsample=False)

        self.layer3a = ResidualBlock1D(64, 128, kernel_size=5, downsample=True)  # 600 -> 300
        self.layer3b = ResidualBlock1D(128, 128, kernel_size=5, downsample=False)

        self.layer4a = ResidualBlock1D(128, 128, kernel_size=3, downsample=True) # 300 -> 150
        self.layer4b = ResidualBlock1D(128, 128, kernel_size=3, downsample=False)

        self.pool = nn.AdaptiveAvgPool1d(1)
        self.dropout = nn.Dropout(0.3)

        # Shared representation
        self.fc_shared = nn.Linear(128, 128)

        # Three classification heads
        self.fc_fault_detection = nn.Linear(128, 2)
        self.fc_fault_location = nn.Linear(128, num_location_classes)
        self.fc_crack_size = nn.Linear(128, num_size_classes)

        self._init_weights()

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv1d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm1d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        if x.dim() == 2:
            x = x.unsqueeze(1)

        # convolutional layer - implementation
        x = self.stem(x)

        # residual blocks layer - implementation
        x = self.layer1a(x)
        x = self.layer1b(x)
        x = self.layer2a(x)
        x = self.layer2b(x)
        x = self.layer3a(x)
        x = self.layer3b(x)
        x = self.layer4a(x)
        x = self.layer4b(x)

        # pooling and final layer
        x = self.pool(x).squeeze(-1)
        x = self.fc_shared(x)
        x = F.relu(x)
        x = self.dropout(x)

        fault_detection = self.fc_fault_detection(x)
        fault_location = self.fc_fault_location(x)
        crack_size = self.fc_crack_size(x)

        return fault_detection, fault_location, crack_size

### Training Preparation

In [22]:
resnet_ctte = copy.deepcopy(pytorch_ctte)

resnet_model = ResNet1D_pt()

resnet_ctte.load_model(resnet_model)

resnet_ctte.load_optimizer(
    torch.optim.Adam(resnet_model.parameters(), lr=1e-3)
)

### Training

In [23]:
resnet_ctte.prepare_data()
resnet_ctte.train(epochs=20, batch_size=64)
resnet_ctte.evaluate()

Epoch 1/20: 100%|██████████| 254/254 [00:12<00:00, 20.00it/s, loss=0.52]


Epoch 1 avg loss: 0.9416


Epoch 2/20: 100%|██████████| 254/254 [00:12<00:00, 20.59it/s, loss=0.491]


Epoch 2 avg loss: 0.3946


Epoch 3/20: 100%|██████████| 254/254 [00:12<00:00, 20.54it/s, loss=0.244]


Epoch 3 avg loss: 0.2527


Epoch 4/20: 100%|██████████| 254/254 [00:12<00:00, 20.67it/s, loss=0.33]


Epoch 4 avg loss: 0.1933


Epoch 5/20: 100%|██████████| 254/254 [00:12<00:00, 20.44it/s, loss=0.334]


Epoch 5 avg loss: 0.1639


Epoch 6/20: 100%|██████████| 254/254 [00:12<00:00, 20.20it/s, loss=0.041]


Epoch 6 avg loss: 0.1502


Epoch 7/20: 100%|██████████| 254/254 [00:12<00:00, 20.06it/s, loss=0.344]


Epoch 7 avg loss: 0.1293


Epoch 8/20: 100%|██████████| 254/254 [00:12<00:00, 19.71it/s, loss=0.0521]


Epoch 8 avg loss: 0.1107


Epoch 9/20: 100%|██████████| 254/254 [00:13<00:00, 19.17it/s, loss=0.149]


Epoch 9 avg loss: 0.1172


Epoch 10/20: 100%|██████████| 254/254 [00:13<00:00, 18.78it/s, loss=0.0229]


Epoch 10 avg loss: 0.0896


Epoch 11/20: 100%|██████████| 254/254 [00:13<00:00, 18.58it/s, loss=0.0103]


Epoch 11 avg loss: 0.0831


Epoch 12/20: 100%|██████████| 254/254 [00:13<00:00, 18.99it/s, loss=0.0872]


Epoch 12 avg loss: 0.0973


Epoch 13/20: 100%|██████████| 254/254 [00:13<00:00, 19.10it/s, loss=0.0833]


Epoch 13 avg loss: 0.0774


Epoch 14/20: 100%|██████████| 254/254 [00:13<00:00, 19.05it/s, loss=0.0611]


Epoch 14 avg loss: 0.1130


Epoch 15/20: 100%|██████████| 254/254 [00:13<00:00, 18.89it/s, loss=0.0432]


Epoch 15 avg loss: 0.0785


Epoch 16/20: 100%|██████████| 254/254 [00:13<00:00, 18.73it/s, loss=0.0186]


Epoch 16 avg loss: 0.0827


Epoch 17/20: 100%|██████████| 254/254 [00:13<00:00, 18.68it/s, loss=0.00634]


Epoch 17 avg loss: 0.0560


Epoch 18/20: 100%|██████████| 254/254 [00:13<00:00, 18.72it/s, loss=0.124]


Epoch 18 avg loss: 0.0746


Epoch 19/20: 100%|██████████| 254/254 [00:13<00:00, 18.79it/s, loss=0.0151]


Epoch 19 avg loss: 0.0576


Epoch 20/20: 100%|██████████| 254/254 [00:13<00:00, 18.84it/s, loss=0.0702]


Epoch 20 avg loss: 0.0659


### Classification Report

In [24]:
resnet_ctte.classification_report()


### Results Interpretation

ResNet's good performance makes sense due to the architecture's natural fit for vibration signal classification. Vibration signals contain diagnostic information at multiple scales — high-frequency transients from crack impacts, medium-frequency resonance patterns, and longer-range periodic structures from rotating components. The progressive downsampling through residual groups means the network builds up representations at each of these scales, from fine-grained waveform features in early layers to broad structural patterns in deeper ones.

The residual connections also help here specifically. Subtle fault signatures can be small perturbations on top of dominant healthy vibration patterns — which is essentially what a residual is. By learning differences from the input rather than full transformations, the network is well-suited to detecting these small but diagnostically meaningful deviations. A healthy signal passes through with near-zero residuals, while a fault introduces learnable differences that propagate through to the classification heads.

The multi-head design also plays a role. Because detection, location, and size classification share the same learned feature backbone, the network can exploit correlations between tasks — for instance, certain frequency signatures that indicate a crack also carry information about where it is. This shared learning likely gives better results than training three separate models, especially when data is limited.

## Gated Recurrent Unit (GRU)

### Model Intuition

The ResNet1D architecture described above processes a signal by sliding learned filters across it. Each convolutional layer looks at a fixed-width window of the input at a time, and the network builds up longer-range awareness by stacking many such layers and progressively compressing the temporal dimension. The key insight is that every position in the signal is treated somewhat independently; context is gathered implicitly through depth and receptive field growth.

Recurrent networks like the GRU we build in this section, and the LSTM in the next, take a fundamentally different approach. Rather than scanning a signal with fixed filters, a series of cells read it sequentially and pass a hidden state that acts as a running memory of everything seen so far.


![Recurrent Neural Network GIF](https://miro.medium.com/v2/resize:fit:720/format:webp/1*AQ52bwW55GsJt6HTxPDuMA.gif)

Note: *With a univariate time series, the 3x1 input vectors above are actually just 1x1 vectors representing the reading at that point in time.*

This "hidden state" sounds a lot more mysterious and shadowy than it really is. Like most things in deep learning, it is just a vector. Each cell receives the hidden state passed from the previous cell and updates it to pass to the cell after it. This means later time steps have direct access to a sort of learned compressed memory of steps that came before it - that is the principle that unites recurrent architectures.  However, it is in the updating of a new hidden state where vanilla RNNs, GRUs, and LSTMs differ from each other.

* A **vanilla RNN** updates the hidden state a simple linear combination of the hidden state and the input value(s). The same set of weights and biases are shared across all cells, so the network learns which values are best to shape the hidden state through the sequence. While this is conceptually clean, it breaks down in practice because the same update rule is applied at every step and gradients decay rapidly as they are propagated back through many time steps, making it very difficult for the network to learn dependencies that span long stretches of the sequence. This is known as the vanishing gradient problem.

* A **GRU** adds complexity by introducing two learned gating mechanisms that allow for long term dependencies. Both gates are learned linear transformations of the current input and the hidden state received from the previous cell, passed through a sigmoid to produce a value between 0 and 1, essentially acting as a learned dial between ignoring and fully passing on a given piece of information.

  * The **reset gate** scales down the previous hidden state before it is used to compute the candidate new hidden state — a low reset value lets the network effectively ignore prior memory and write something fresh.
  * The **update gate** then controls the blend between that candidate state and the old hidden state — determining how much of the previous memory to carry forward unchanged versus how much to replace with the newly computed candidate.

![GRU](https://towardsdatascience.com/wp-content/uploads/2022/02/13a8HnDUlzhhKcSpQzOyiCQ.png)

If the above seems complicated, you're right, it is (but just wait for LSTMs!). We can make more sense of it going step by step though:
1.

* An **LSTM** takes a similar approach but with more machinery. It introduces a separate *cell state* that runs alongside the hidden state as a second memory channel, and uses three gates rather than two to manage it. This additional structure gives the LSTM more expressive control over long-range memory, at the cost of more parameters and slower training. In practice the GRU and LSTM perform similarly on many tasks, and the GRU is often preferred when training efficiency matters.

The practical tradeoff between recurrent and convolutional architectures is real. Recurrent networks capture temporal order explicitly and handle variable-length sequences naturally, but the sequential dependency between cells makes them harder to parallelize during training. Convolutional networks are more efficient but require careful architectural design — stacking, downsampling, widening — to build up the same contextual reach that a recurrent network gets more directly.
```

### Model Anatomy

The Gated Recurrent Unit is the specific recurrent architecture used here. A vanilla recurrent network updates its hidden state at each step with a simple function of the current input and previous state, but in practice this causes the gradient signal to decay rapidly over long sequences, making it difficult to learn dependencies that span many time steps. The GRU addresses this with two learned gating mechanisms — a reset gate that controls how much of the previous state to forget, and an update gate that controls how much of the new candidate state to actually adopt. This allows the network to selectively preserve information across many steps when needed, or discard it quickly when the sequence moves on to something new.

### Model Definition

In [25]:
class GRU1D_pt(nn.Module):
    def __init__(self, input_size=1, seq_len=1200, hidden_size=128, num_location_classes=4, num_size_classes=4):
        super(GRU1D_pt, self).__init__()

        self.seq_len = seq_len
        self.hidden_size = hidden_size

        # GRU Layer
        self.gru = nn.GRU(input_size=input_size, hidden_size=hidden_size, batch_first=True)

        # Fully Connected Layers
        self.fc1 = nn.Linear(hidden_size, 64)
        self.fc2 = nn.Linear(64, 32)

        # Output heads
        self.fc_fault_detection = nn.Linear(32, 2)
        self.fc_fault_location = nn.Linear(32, num_location_classes)
        self.fc_crack_size = nn.Linear(32, num_size_classes)

        self._init_weights()

    def _init_weights(self):
        for layer in [self.fc_fault_detection, self.fc_fault_location, self.fc_crack_size]:
            nn.init.xavier_uniform_(layer.weight)
            if layer.bias is not None:
                nn.init.zeros_(layer.bias)

    def forward(self, x):
        # Input shape normalization
        if x.ndim == 2:
            x = x.unsqueeze(-1)  # (B, T) -> (B, T, 1)
        elif x.ndim == 3 and x.shape[1] == 1:
            x = x.permute(0, 2, 1)  # (B, 1, T) -> (B, T, 1)

        x = x[:, :self.seq_len, :]

        # GRU forward
        gru_out, _ = self.gru(x)
        out = gru_out[:, -1, :]  # Take last time step

        # Fully connected layers
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))

        # Multi-head outputs
        fault_detection = self.fc_fault_detection(out)
        fault_location = self.fc_fault_location(out)
        crack_size = self.fc_crack_size(out)

        return fault_detection, fault_location, crack_size

In [26]:
gru_ctte = copy.deepcopy(pytorch_ctte)

gru_model = GRU1D_pt()

gru_ctte.load_model(gru_model)

gru_ctte.load_optimizer(
    torch.optim.Adam(gru_model.parameters(), lr=1e-3)
)

In [27]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

In [28]:
gru_ctte.prepare_data()
gru_ctte.train(epochs=100, batch_size=64)
gru_ctte.evaluate()

Epoch 1/100: 100%|██████████| 254/254 [00:04<00:00, 56.51it/s, loss=2.7]


Epoch 1 avg loss: 2.6507


Epoch 2/100: 100%|██████████| 254/254 [00:04<00:00, 56.20it/s, loss=2.49]


Epoch 2 avg loss: 2.5997


Epoch 3/100: 100%|██████████| 254/254 [00:04<00:00, 57.04it/s, loss=2.86]


Epoch 3 avg loss: 2.5935


Epoch 4/100: 100%|██████████| 254/254 [00:04<00:00, 56.73it/s, loss=2.8]


Epoch 4 avg loss: 2.6000


Epoch 5/100: 100%|██████████| 254/254 [00:04<00:00, 56.60it/s, loss=2.43]


Epoch 5 avg loss: 2.5692


Epoch 6/100: 100%|██████████| 254/254 [00:04<00:00, 57.01it/s, loss=1.78]


Epoch 6 avg loss: 2.1357


Epoch 7/100: 100%|██████████| 254/254 [00:04<00:00, 56.48it/s, loss=1.75]


Epoch 7 avg loss: 1.8866


Epoch 8/100: 100%|██████████| 254/254 [00:04<00:00, 56.88it/s, loss=1.5]


Epoch 8 avg loss: 1.7316


Epoch 9/100: 100%|██████████| 254/254 [00:04<00:00, 56.90it/s, loss=1.51]


Epoch 9 avg loss: 1.5851


Epoch 10/100: 100%|██████████| 254/254 [00:04<00:00, 56.44it/s, loss=1.21]


Epoch 10 avg loss: 1.4150


Epoch 11/100: 100%|██████████| 254/254 [00:04<00:00, 57.16it/s, loss=1.48]


Epoch 11 avg loss: 1.1844


Epoch 12/100: 100%|██████████| 254/254 [00:04<00:00, 56.62it/s, loss=1.17]


Epoch 12 avg loss: 1.0213


Epoch 13/100: 100%|██████████| 254/254 [00:04<00:00, 56.45it/s, loss=0.713]


Epoch 13 avg loss: 0.9164


Epoch 14/100: 100%|██████████| 254/254 [00:04<00:00, 57.12it/s, loss=0.84]


Epoch 14 avg loss: 0.8162


Epoch 15/100: 100%|██████████| 254/254 [00:04<00:00, 56.32it/s, loss=0.896]


Epoch 15 avg loss: 0.7444


Epoch 16/100: 100%|██████████| 254/254 [00:04<00:00, 57.17it/s, loss=0.42]


Epoch 16 avg loss: 0.6634


Epoch 17/100: 100%|██████████| 254/254 [00:04<00:00, 57.40it/s, loss=0.517]


Epoch 17 avg loss: 0.6193


Epoch 18/100: 100%|██████████| 254/254 [00:04<00:00, 56.75it/s, loss=0.588]


Epoch 18 avg loss: 0.5599


Epoch 19/100: 100%|██████████| 254/254 [00:04<00:00, 57.40it/s, loss=0.684]


Epoch 19 avg loss: 0.5379


Epoch 20/100: 100%|██████████| 254/254 [00:04<00:00, 57.03it/s, loss=0.393]


Epoch 20 avg loss: 0.4984


Epoch 21/100: 100%|██████████| 254/254 [00:04<00:00, 56.86it/s, loss=0.253]


Epoch 21 avg loss: 0.4661


Epoch 22/100: 100%|██████████| 254/254 [00:04<00:00, 57.52it/s, loss=0.322]


Epoch 22 avg loss: 0.4363


Epoch 23/100: 100%|██████████| 254/254 [00:04<00:00, 56.49it/s, loss=0.744]


Epoch 23 avg loss: 0.4184


Epoch 24/100: 100%|██████████| 254/254 [00:04<00:00, 57.37it/s, loss=0.408]


Epoch 24 avg loss: 0.3917


Epoch 25/100: 100%|██████████| 254/254 [00:04<00:00, 57.05it/s, loss=0.197]


Epoch 25 avg loss: 0.3800


Epoch 26/100: 100%|██████████| 254/254 [00:04<00:00, 56.66it/s, loss=0.292]


Epoch 26 avg loss: 0.3667


Epoch 27/100: 100%|██████████| 254/254 [00:04<00:00, 57.39it/s, loss=0.256]


Epoch 27 avg loss: 0.3538


Epoch 28/100: 100%|██████████| 254/254 [00:04<00:00, 56.33it/s, loss=0.139]


Epoch 28 avg loss: 0.3699


Epoch 29/100: 100%|██████████| 254/254 [00:04<00:00, 56.74it/s, loss=0.52]


Epoch 29 avg loss: 0.3195


Epoch 30/100: 100%|██████████| 254/254 [00:04<00:00, 57.41it/s, loss=0.494]


Epoch 30 avg loss: 0.3082


Epoch 31/100: 100%|██████████| 254/254 [00:04<00:00, 56.60it/s, loss=0.316]


Epoch 31 avg loss: 0.3026


Epoch 32/100: 100%|██████████| 254/254 [00:04<00:00, 57.28it/s, loss=0.165]


Epoch 32 avg loss: 0.2699


Epoch 33/100: 100%|██████████| 254/254 [00:04<00:00, 57.09it/s, loss=0.209]


Epoch 33 avg loss: 0.2617


Epoch 34/100: 100%|██████████| 254/254 [00:04<00:00, 56.52it/s, loss=0.161]


Epoch 34 avg loss: 0.2414


Epoch 35/100: 100%|██████████| 254/254 [00:04<00:00, 57.27it/s, loss=0.362]


Epoch 35 avg loss: 0.2186


Epoch 36/100: 100%|██████████| 254/254 [00:04<00:00, 56.72it/s, loss=0.06]


Epoch 36 avg loss: 0.2205


Epoch 37/100: 100%|██████████| 254/254 [00:04<00:00, 56.73it/s, loss=0.0532]


Epoch 37 avg loss: 0.1938


Epoch 38/100: 100%|██████████| 254/254 [00:04<00:00, 57.26it/s, loss=0.0981]


Epoch 38 avg loss: 0.1894


Epoch 39/100: 100%|██████████| 254/254 [00:04<00:00, 56.36it/s, loss=0.0954]


Epoch 39 avg loss: 0.1867


Epoch 40/100: 100%|██████████| 254/254 [00:04<00:00, 57.14it/s, loss=0.233]


Epoch 40 avg loss: 0.1780


Epoch 41/100: 100%|██████████| 254/254 [00:04<00:00, 57.08it/s, loss=0.0601]


Epoch 41 avg loss: 0.1578


Epoch 42/100: 100%|██████████| 254/254 [00:04<00:00, 56.06it/s, loss=0.102]


Epoch 42 avg loss: 0.1533


Epoch 43/100: 100%|██████████| 254/254 [00:04<00:00, 57.03it/s, loss=0.0158]


Epoch 43 avg loss: 0.1577


Epoch 44/100: 100%|██████████| 254/254 [00:04<00:00, 56.97it/s, loss=0.191]


Epoch 44 avg loss: 0.1636


Epoch 45/100: 100%|██████████| 254/254 [00:04<00:00, 56.74it/s, loss=0.218]


Epoch 45 avg loss: 0.2579


Epoch 46/100: 100%|██████████| 254/254 [00:04<00:00, 57.08it/s, loss=0.173]


Epoch 46 avg loss: 0.1437


Epoch 47/100: 100%|██████████| 254/254 [00:04<00:00, 56.58it/s, loss=0.102]


Epoch 47 avg loss: 0.1293


Epoch 48/100: 100%|██████████| 254/254 [00:04<00:00, 56.86it/s, loss=0.0457]


Epoch 48 avg loss: 0.1170


Epoch 49/100: 100%|██████████| 254/254 [00:04<00:00, 56.86it/s, loss=0.0261]


Epoch 49 avg loss: 0.1116


Epoch 50/100: 100%|██████████| 254/254 [00:04<00:00, 55.86it/s, loss=0.185]


Epoch 50 avg loss: 0.0970


Epoch 51/100: 100%|██████████| 254/254 [00:04<00:00, 56.66it/s, loss=0.0991]


Epoch 51 avg loss: 0.1444


Epoch 52/100: 100%|██████████| 254/254 [00:04<00:00, 56.41it/s, loss=0.17]


Epoch 52 avg loss: 0.1003


Epoch 53/100: 100%|██████████| 254/254 [00:04<00:00, 56.64it/s, loss=0.0263]


Epoch 53 avg loss: 0.0913


Epoch 54/100: 100%|██████████| 254/254 [00:04<00:00, 56.95it/s, loss=0.0202]


Epoch 54 avg loss: 0.0820


Epoch 55/100: 100%|██████████| 254/254 [00:04<00:00, 55.93it/s, loss=0.0204]


Epoch 55 avg loss: 0.0990


Epoch 56/100: 100%|██████████| 254/254 [00:04<00:00, 56.99it/s, loss=0.0669]


Epoch 56 avg loss: 0.1212


Epoch 57/100: 100%|██████████| 254/254 [00:04<00:00, 57.10it/s, loss=0.286]


Epoch 57 avg loss: 0.0775


Epoch 58/100: 100%|██████████| 254/254 [00:04<00:00, 56.53it/s, loss=0.00774]


Epoch 58 avg loss: 0.0679


Epoch 59/100: 100%|██████████| 254/254 [00:04<00:00, 56.27it/s, loss=0.0308]


Epoch 59 avg loss: 0.0665


Epoch 60/100: 100%|██████████| 254/254 [00:04<00:00, 56.33it/s, loss=0.154]


Epoch 60 avg loss: 0.0991


Epoch 61/100: 100%|██████████| 254/254 [00:04<00:00, 56.55it/s, loss=0.044]


Epoch 61 avg loss: 0.0868


Epoch 62/100: 100%|██████████| 254/254 [00:04<00:00, 57.19it/s, loss=0.0194]


Epoch 62 avg loss: 0.0581


Epoch 63/100: 100%|██████████| 254/254 [00:04<00:00, 56.51it/s, loss=0.0336]


Epoch 63 avg loss: 0.0665


Epoch 64/100: 100%|██████████| 254/254 [00:04<00:00, 57.09it/s, loss=0.117]


Epoch 64 avg loss: 0.0916


Epoch 65/100: 100%|██████████| 254/254 [00:04<00:00, 57.05it/s, loss=0.125]


Epoch 65 avg loss: 0.0714


Epoch 66/100: 100%|██████████| 254/254 [00:04<00:00, 56.34it/s, loss=0.0151]


Epoch 66 avg loss: 0.0613


Epoch 67/100: 100%|██████████| 254/254 [00:04<00:00, 57.11it/s, loss=0.0665]


Epoch 67 avg loss: 0.0727


Epoch 68/100: 100%|██████████| 254/254 [00:04<00:00, 56.52it/s, loss=0.183]


Epoch 68 avg loss: 0.0701


Epoch 69/100: 100%|██████████| 254/254 [00:04<00:00, 56.74it/s, loss=0.148]


Epoch 69 avg loss: 0.0495


Epoch 70/100: 100%|██████████| 254/254 [00:04<00:00, 57.09it/s, loss=0.121]


Epoch 70 avg loss: 0.0355


Epoch 71/100: 100%|██████████| 254/254 [00:04<00:00, 56.28it/s, loss=0.201]


Epoch 71 avg loss: 0.1057


Epoch 72/100: 100%|██████████| 254/254 [00:04<00:00, 57.06it/s, loss=0.0405]


Epoch 72 avg loss: 0.1537


Epoch 73/100: 100%|██████████| 254/254 [00:04<00:00, 57.07it/s, loss=0.069]


Epoch 73 avg loss: 0.0732


Epoch 74/100: 100%|██████████| 254/254 [00:04<00:00, 56.08it/s, loss=0.103]


Epoch 74 avg loss: 0.0433


Epoch 75/100: 100%|██████████| 254/254 [00:04<00:00, 56.90it/s, loss=0.101]


Epoch 75 avg loss: 0.0476


Epoch 76/100: 100%|██████████| 254/254 [00:04<00:00, 56.73it/s, loss=0.0076]


Epoch 76 avg loss: 0.0423


Epoch 77/100: 100%|██████████| 254/254 [00:04<00:00, 56.65it/s, loss=0.0242]


Epoch 77 avg loss: 0.0404


Epoch 78/100: 100%|██████████| 254/254 [00:04<00:00, 57.03it/s, loss=0.021]


Epoch 78 avg loss: 0.0299


Epoch 79/100: 100%|██████████| 254/254 [00:04<00:00, 56.18it/s, loss=0.0677]


Epoch 79 avg loss: 0.1096


Epoch 80/100: 100%|██████████| 254/254 [00:04<00:00, 56.85it/s, loss=0.0649]


Epoch 80 avg loss: 0.0339


Epoch 81/100: 100%|██████████| 254/254 [00:04<00:00, 56.70it/s, loss=0.00101]


Epoch 81 avg loss: 0.0373


Epoch 82/100: 100%|██████████| 254/254 [00:04<00:00, 55.99it/s, loss=0.00642]


Epoch 82 avg loss: 0.0349


Epoch 83/100: 100%|██████████| 254/254 [00:04<00:00, 56.85it/s, loss=0.011]


Epoch 83 avg loss: 0.0539


Epoch 84/100: 100%|██████████| 254/254 [00:04<00:00, 56.31it/s, loss=0.137]


Epoch 84 avg loss: 0.0429


Epoch 85/100: 100%|██████████| 254/254 [00:04<00:00, 56.24it/s, loss=0.00432]


Epoch 85 avg loss: 0.0489


Epoch 86/100: 100%|██████████| 254/254 [00:04<00:00, 56.36it/s, loss=0.0447]


Epoch 86 avg loss: 0.0319


Epoch 87/100: 100%|██████████| 254/254 [00:04<00:00, 55.84it/s, loss=0.0266]


Epoch 87 avg loss: 0.0634


Epoch 88/100: 100%|██████████| 254/254 [00:04<00:00, 56.92it/s, loss=0.00855]


Epoch 88 avg loss: 0.0314


Epoch 89/100: 100%|██████████| 254/254 [00:04<00:00, 56.92it/s, loss=0.0225]


Epoch 89 avg loss: 0.0470


Epoch 90/100: 100%|██████████| 254/254 [00:04<00:00, 56.19it/s, loss=0.00589]


Epoch 90 avg loss: 0.0265


Epoch 91/100: 100%|██████████| 254/254 [00:04<00:00, 57.20it/s, loss=0.00782]


Epoch 91 avg loss: 0.0628


Epoch 92/100: 100%|██████████| 254/254 [00:04<00:00, 56.75it/s, loss=0.01]


Epoch 92 avg loss: 0.0343


Epoch 93/100: 100%|██████████| 254/254 [00:04<00:00, 56.77it/s, loss=0.00745]


Epoch 93 avg loss: 0.0202


Epoch 94/100: 100%|██████████| 254/254 [00:04<00:00, 57.07it/s, loss=0.00464]


Epoch 94 avg loss: 0.0158


Epoch 95/100: 100%|██████████| 254/254 [00:04<00:00, 56.17it/s, loss=0.0272]


Epoch 95 avg loss: 0.0283


Epoch 96/100: 100%|██████████| 254/254 [00:04<00:00, 57.24it/s, loss=0.0643]


Epoch 96 avg loss: 0.0736


Epoch 97/100: 100%|██████████| 254/254 [00:04<00:00, 57.15it/s, loss=0.0632]


Epoch 97 avg loss: 0.0640


Epoch 98/100: 100%|██████████| 254/254 [00:04<00:00, 56.40it/s, loss=0.0196]


Epoch 98 avg loss: 0.0352


Epoch 99/100: 100%|██████████| 254/254 [00:04<00:00, 57.06it/s, loss=0.0227]


Epoch 99 avg loss: 0.0149


Epoch 100/100: 100%|██████████| 254/254 [00:04<00:00, 56.72it/s, loss=0.0229]


Epoch 100 avg loss: 0.0048


In [29]:
gru_ctte.classification_report()

## Long Short Term Memory (LSTM) Neural Network

### Model Intuition

LSTMs are a subtype of recurrent neural networks.

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

### Model Definition

In [30]:
class LSTM1D_pt(nn.Module):
    def __init__(self,
                 sequence_length=1200,
                 chunk_size=10,
                 hidden_size=128,
                 num_layers=3,
                 dropout_rate=0.3,
                 num_fault_locations=4,
                 num_crack_sizes=4):
        super().__init__()

        self.sequence_length = sequence_length
        self.chunk_size = chunk_size
        self.num_steps = sequence_length // chunk_size  # 1200/10 = 120 steps

        # Input projection: transform each chunk into a richer representation
        self.input_proj = nn.Sequential(
            nn.Linear(chunk_size, hidden_size),
            nn.ReLU(),
            nn.LayerNorm(hidden_size),
        )

        # Bidirectional LSTM
        self.lstm = nn.LSTM(
            input_size=hidden_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout_rate,
            bidirectional=True,
        )

        # Attention pooling over timesteps
        self.attention = nn.Sequential(
            nn.Linear(hidden_size * 2, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
        )

        self.layer_norm = nn.LayerNorm(hidden_size * 2)
        self.dropout = nn.Dropout(dropout_rate)

        # Shared FC
        self.fc_shared = nn.Linear(hidden_size * 2, 256)

        # Output heads — all raw logits
        self.fault_detection_output = nn.Linear(256, 2)
        self.fault_location_output = nn.Linear(256, num_fault_locations)
        self.crack_size_output = nn.Linear(256, num_crack_sizes)

        self._init_weights()

    def _init_weights(self):
        # Orthogonal init for LSTM (proven to help with long sequences)
        for name, param in self.lstm.named_parameters():
            if 'weight_ih' in name:
                nn.init.xavier_uniform_(param)
            elif 'weight_hh' in name:
                nn.init.orthogonal_(param)
            elif 'bias' in name:
                nn.init.zeros_(param)
                # Set forget gate bias to 1 to encourage remembering
                hidden = self.lstm.hidden_size
                param.data[hidden:2*hidden].fill_(1.0)

        for layer in [self.fc_shared, self.fault_detection_output,
                      self.fault_location_output, self.crack_size_output]:
            nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
            if layer.bias is not None:
                nn.init.zeros_(layer.bias)

    def forward(self, x):
        # x: [batch, seq_len] or [batch, 1, seq_len]
        if x.dim() == 3:
            x = x.squeeze(1)

        batch_size = x.size(0)

        # Chunk the signal: [batch, 1200] -> [batch, 120, 10]
        x = x.view(batch_size, self.num_steps, self.chunk_size)

        # Project each chunk: [batch, 120, 10] -> [batch, 120, hidden_size]
        x = self.input_proj(x)

        # Bidirectional LSTM: [batch, 120, hidden_size*2]
        lstm_out, _ = self.lstm(x)

        # Attention pooling: learn which timesteps matter
        attn_weights = self.attention(lstm_out)        # [batch, 120, 1]
        attn_weights = F.softmax(attn_weights, dim=1)  # [batch, 120, 1]
        features = (lstm_out * attn_weights).sum(dim=1) # [batch, hidden_size*2]

        features = self.layer_norm(features)
        features = self.dropout(F.relu(self.fc_shared(features)))

        fault_detection = self.fault_detection_output(features)
        fault_location = self.fault_location_output(features)
        crack_size = self.crack_size_output(features)

        return fault_detection, fault_location, crack_size

In [31]:
lstm_ctte = copy.deepcopy(pytorch_ctte)

lstm_model = LSTM1D_pt(
    sequence_length=1200,
    hidden_size=128,
    num_layers=2,
    dropout_rate=0.1
)

lstm_ctte.load_model(lstm_model)

lstm_ctte.load_optimizer(
    torch.optim.Adam(lstm_model.parameters(), lr=1e-3)
)

In [32]:
lstm_ctte.prepare_data()
lstm_ctte.train(epochs=20, batch_size=64)
lstm_ctte.evaluate()

Epoch 1/20: 100%|██████████| 254/254 [00:05<00:00, 48.60it/s, loss=0.338]


Epoch 1 avg loss: 1.1087


Epoch 2/20: 100%|██████████| 254/254 [00:05<00:00, 49.46it/s, loss=0.509]


Epoch 2 avg loss: 0.4467


Epoch 3/20: 100%|██████████| 254/254 [00:05<00:00, 49.57it/s, loss=0.224]


Epoch 3 avg loss: 0.3390


Epoch 4/20: 100%|██████████| 254/254 [00:05<00:00, 49.48it/s, loss=0.271]


Epoch 4 avg loss: 0.2605


Epoch 5/20: 100%|██████████| 254/254 [00:05<00:00, 49.54it/s, loss=0.268]


Epoch 5 avg loss: 0.2177


Epoch 6/20: 100%|██████████| 254/254 [00:05<00:00, 49.53it/s, loss=0.0714]


Epoch 6 avg loss: 0.2083


Epoch 7/20: 100%|██████████| 254/254 [00:05<00:00, 49.53it/s, loss=0.0322]


Epoch 7 avg loss: 0.1772


Epoch 8/20: 100%|██████████| 254/254 [00:05<00:00, 49.50it/s, loss=0.258]


Epoch 8 avg loss: 0.1425


Epoch 9/20: 100%|██████████| 254/254 [00:05<00:00, 49.38it/s, loss=0.163]


Epoch 9 avg loss: 0.1430


Epoch 10/20: 100%|██████████| 254/254 [00:05<00:00, 49.51it/s, loss=0.0814]


Epoch 10 avg loss: 0.1077


Epoch 11/20: 100%|██████████| 254/254 [00:05<00:00, 49.40it/s, loss=0.0729]


Epoch 11 avg loss: 0.0894


Epoch 12/20: 100%|██████████| 254/254 [00:05<00:00, 49.49it/s, loss=0.00816]


Epoch 12 avg loss: 0.0900


Epoch 13/20: 100%|██████████| 254/254 [00:05<00:00, 49.50it/s, loss=0.147]


Epoch 13 avg loss: 0.0881


Epoch 14/20: 100%|██████████| 254/254 [00:05<00:00, 49.49it/s, loss=0.0503]


Epoch 14 avg loss: 0.0714


Epoch 15/20: 100%|██████████| 254/254 [00:05<00:00, 49.50it/s, loss=0.0313]


Epoch 15 avg loss: 0.0633


Epoch 16/20: 100%|██████████| 254/254 [00:05<00:00, 49.49it/s, loss=0.0495]


Epoch 16 avg loss: 0.0749


Epoch 17/20: 100%|██████████| 254/254 [00:05<00:00, 49.55it/s, loss=0.0632]


Epoch 17 avg loss: 0.0453


Epoch 18/20: 100%|██████████| 254/254 [00:05<00:00, 49.51it/s, loss=0.0759]


Epoch 18 avg loss: 0.0283


Epoch 19/20: 100%|██████████| 254/254 [00:05<00:00, 49.49it/s, loss=0.00676]


Epoch 19 avg loss: 0.0519


Epoch 20/20: 100%|██████████| 254/254 [00:05<00:00, 49.52it/s, loss=0.0266]


Epoch 20 avg loss: 0.0567


In [33]:
lstm_ctte.classification_report()
