# (00) Data Preprocessing

This notebook prepares the entire ProphetLSTM-GAN project dataset.  
We begin by configuring paths, verifying the repository structure, and ensuring that the raw dataset is correctly placed inside the `data/raw/train/` and `data/raw/test/` directories of the repository.

### Objectives of this notebook:
- Set up project directory paths
- Scan raw training and testing CSV files
- Validate required feature columns
- Load and concatenate all valid training CSVs
- Fit normalization scaler
- Build sliding-window time-series tensors
- Save processed numpy arrays for training, validation, and later test evaluation

This step is critical because every later notebook (01_training and 02_evaluation) will rely on the processed files generated here.

In [1]:
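import torch
# Report the GPU name, or fall back to "CPU" instead of raising when no CUDA device is visible
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")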

Device: NVIDIA GeForce RTX 4060 Laptop GPU


In [2]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

## Dataset Setup — Spacecraft Thruster Firing Tests (Kaggle)

This project uses the **Spacecraft Thruster Firing Tests Dataset**.

**Kaggle Link:** https://www.kaggle.com/datasets/patrickfleith/spacecraft-thruster-firing-tests-dataset

---

This dataset contains **2,612 CSV files** (the local copy scanned in Step 2 yields 2,611: 1,267 train and 1,344 test), each representing a thruster firing sequence with signals such as:

- `ton`  
- `thrust`  
- `mfr`  
- `vl`  
- `anomaly_code` (anomaly indicator)

It is large (**~3 GB**), structured, and well suited to unsupervised anomaly detection with **LSTM-GANs**.

To set up the data:

1. Open the Kaggle link and download the dataset. You will receive a zip file named something like `spacecraft-thruster-firing-tests-dataset.zip`.
2. Unzip it. Inside are two folders, `train` and `test`, each containing more than a thousand CSV files, for example at `/path/to/downloaded/data/train/` and `/path/to/downloaded/data/test/`.
3. Move both folders into the project repository so that the structure becomes `ProphetLSTM-GAN/data/raw/train/` (around 1,267 CSVs) and `ProphetLSTM-GAN/data/raw/test/` (about 1,344 CSVs).

The `processed/` folder inside `data/` does not need to exist yet; Step 1 of this notebook creates it automatically.
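
The move into `data/raw/` can also be scripted. The following is a minimal sketch, not part of the original notebook: the helper name `place_raw_data` and both path arguments are hypothetical, and it assumes the zip has already been extracted into a folder that contains `train/` and `test/`.

```python
import shutil
from pathlib import Path


def place_raw_data(extracted_dir: Path, repo_root: Path) -> Path:
    """Move extracted `train/` and `test/` into `<repo_root>/data/raw/` (hypothetical helper)."""
    raw_dir = repo_root / "data" / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    for split in ("train", "test"):
        src = extracted_dir / split
        dst = raw_dir / split
        if dst.exists():
            # Leave an existing copy alone instead of overwriting it.
            continue
        if not src.is_dir():
            raise FileNotFoundError(f"Expected folder not found: {src}")
        shutil.move(str(src), str(dst))
    return raw_dir
```

A call would look like `place_raw_data(Path("/path/to/downloaded/data"), Path("ProphetLSTM-GAN"))`, with both paths adjusted to the local machine.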

## Step 1: Project Directory Setup

The following code:
- Takes the current working directory as the project root (this is the repository root when the notebook is launched from there)
- Creates references to all important directories:
  - `data/raw/train/`
  - `data/raw/test/`
  - `data/processed/`
- Ensures the processed directory exists
- Prints the paths so we can visually confirm the structure before proceeding

This keeps the paths correct on any machine, provided the notebook is run with the repository root as its working directory and the repo structure is preserved.
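
Because `Path.cwd()` depends on where the kernel was started, a slightly more defensive variant is to walk upward until a folder containing `data/` is found. This is only a sketch under that assumption; `find_project_root` and its `marker` parameter are hypothetical names, not part of the notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains a `marker` folder."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")
```

With this helper, `PROJECT_ROOT = find_project_root(Path.cwd())` would also work when the notebook is started from a subfolder of the repository.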

In [3]:
from pathlib import Path

# Use the current working directory as the repository root (assumes the notebook is run from there)
PROJECT_ROOT = Path.cwd()

# Data directories
DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
TRAIN_RAW_DIR = RAW_DIR / "train"
TEST_RAW_DIR = RAW_DIR / "test"
PROCESSED_DIR = DATA_DIR / "processed"

# Create processed directory if missing
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print("Project root      :", PROJECT_ROOT)
print("Raw train dir     :", TRAIN_RAW_DIR)
print("Raw test dir      :", TEST_RAW_DIR)
print("Processed dir     :", PROCESSED_DIR)

Project root      : C:\Users\aymisxx\Documents\GitHub\ProphetLSTM-GAN
Raw train dir     : C:\Users\aymisxx\Documents\GitHub\ProphetLSTM-GAN\data\raw\train
Raw test dir      : C:\Users\aymisxx\Documents\GitHub\ProphetLSTM-GAN\data\raw\test
Processed dir     : C:\Users\aymisxx\Documents\GitHub\ProphetLSTM-GAN\data\processed


## Step 2: Scan Raw Training and Test CSV Files

Before we touch any preprocessing logic, we first verify that the raw dataset is correctly placed in:

- `data/raw/train/`  → training CSV files  
- `data/raw/test/`   → testing CSV files  

In this step, we will:

- Search for all `.csv` files directly inside the train and test folders (the `glob("*.csv")` call does not descend into subfolders)  
- Count how many files are found in each split  
- Print a short preview of the first few filenames  

This acts as a sanity check to confirm:
- The dataset is actually present
- The train/test separation is in the expected directories
- There are no obvious path mistakes before we proceed to loading and validating columns.
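
The scan cell uses `Path.glob("*.csv")`, which matches only files directly inside a folder, whereas `Path.rglob("*.csv")` also descends into subfolders. The sketch below illustrates the difference on a throwaway directory; the file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.csv").write_text("x\n1\n")
    (root / "nested").mkdir()
    (root / "nested" / "b.csv").write_text("x\n2\n")

    # glob: top level only; rglob: top level plus all subfolders
    top_level = sorted(p.name for p in root.glob("*.csv"))
    recursive = sorted(p.name for p in root.rglob("*.csv"))

print(top_level)   # ['a.csv']
print(recursive)   # ['a.csv', 'b.csv']
```

If the extracted dataset ever ends up one level deeper than expected, switching the scan to `rglob` is one way to still pick the files up.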

In [4]:
from pathlib import Path

# Collect all CSV files in train and test directories
train_csv_files = sorted(TRAIN_RAW_DIR.glob("*.csv"))
test_csv_files = sorted(TEST_RAW_DIR.glob("*.csv"))

print(f"Found {len(train_csv_files)} train CSV files.")
print(f"Found {len(test_csv_files)} test CSV files.\n")

# Preview a few train files
if train_csv_files:
    print("Sample train files:")
    for p in train_csv_files[:5]:
        print("  -", p.name)
else:
    print("WARNING: No train CSV files found in", TRAIN_RAW_DIR)

print()

# Preview a few test files
if test_csv_files:
    print("Sample test files:")
    for p in test_csv_files[:5]:
        print("  -", p.name)
else:
    print("WARNING: No test CSV files found in", TEST_RAW_DIR)

Found 1267 train CSV files.
Found 1344 test CSV files.

Sample train files:
  - 00001_001_SN01_24bars_ssf.csv
  - 00002_002_SN01_21bars_ssf.csv
  - 00003_003_SN01_18bars_ssf.csv
  - 00004_004_SN01_15bars_ssf.csv
  - 00005_005_SN01_12bars_ssf.csv

Sample test files:
  - 01269_001_SN13_24bars_ssf.csv
  - 01270_002_SN13_21bars_ssf.csv
  - 01271_003_SN13_18bars_ssf.csv
  - 01272_004_SN13_15bars_ssf.csv
  - 01273_005_SN13_12bars_ssf.csv


## Step 3: Inspect a Sample Training File & Define Feature / Label Columns

Before we build sliding windows or normalize anything, we need to clearly understand **what signals** each CSV file contains and decide:

- Which columns are **input features** to the model
- Which column is the **label** (anomaly indicator)

From the dataset, each CSV file includes:

- `time`
- `ton`
- `thrust`
- `mfr`
- `vl`
- `anomaly_code`

For this project, we will use:

- **Input feature columns (model inputs)**:
  - `time`
  - `ton`
  - `thrust`
  - `mfr`
  - `vl`

- **Label column (target)**:
  - `anomaly_code`

In this step, we will:

1. Load a single sample training CSV file.
2. Print:
   - Its shape (rows × columns)
   - The list of column names
3. Define:
   - A list of feature columns: `FEATURE_COLS`
   - A label column: `LABEL_COL`
4. Verify that all chosen feature columns are present in the sample file.

This ensures that we lock in a consistent feature set and can later enforce this across all training and test files.

The next code cell will:

1. Import `pandas` as `pd` for CSV loading.
2. Select the **first** training CSV file from `train_csv_files`
   (which we already collected earlier from `data/raw/train/`).
3. Load that file fully with `pd.read_csv`.
4. Print:
   - The file name
   - The DataFrame shape
   - The list of available columns
5. Define:
   - `FEATURE_COLS = ["time", "ton", "thrust", "mfr", "vl"]`
   - `LABEL_COL = "anomaly_code"`
6. Check:
   - Whether each column in `FEATURE_COLS` exists in the DataFrame.
   - If any are missing, print a warning; otherwise, confirm success.

This step **does not** modify or save anything yet — it only inspects the schema and locks in the feature/label definitions we will use throughout preprocessing, training, and evaluation.

In [5]:
import pandas as pd

# Safety check: make sure we actually have train files
if not train_csv_files:
    raise RuntimeError(f"No training CSV files found in {TRAIN_RAW_DIR}")

# Pick one sample train file to inspect (first in sorted order)
sample_train_path = train_csv_files[0]
print("Sample training file:", sample_train_path.name)

# Load the CSV
df_sample = pd.read_csv(sample_train_path)

print("\nData shape (rows, columns):", df_sample.shape)
print("\nAvailable columns:")
for col in df_sample.columns:
    print("  -", col)


# Define feature columns (model inputs) and label column (target)

FEATURE_COLS = ["time", "ton", "thrust", "mfr", "vl"]
LABEL_COL = "anomaly_code"

print("\nFeature columns (model inputs):")
for c in FEATURE_COLS:
    print("  -", c)

print("\nLabel column (target):")
print("  -", LABEL_COL)

# Check presence of all feature columns in this sample file
missing_features = [c for c in FEATURE_COLS if c not in df_sample.columns]

if missing_features:
    print("\nWARNING: The following feature columns are MISSING in the sample file:")
    for c in missing_features:
        print("  -", c)
else:
    print("\nAll feature columns are present in the sample file.")

Sample training file: 00001_001_SN01_24bars_ssf.csv

Data shape (rows, columns): (30600, 6)

Available columns:
  - time
  - ton
  - thrust
  - mfr
  - vl
  - anomaly_code

Feature columns (model inputs):
  - time
  - ton
  - thrust
  - mfr
  - vl

Label column (target):
  - anomaly_code

All feature columns are present in the sample file.


## Step 4: Validate All Training Files Against the Feature Column Schema

Now that we have identified the correct set of feature columns:

- `time`
- `ton`
- `thrust`
- `mfr`
- `vl`

we need to check all **1267 training CSV files** and verify that each one contains all
required input signals.

This step will:

- Loop through every CSV inside `data/raw/train/`
- Attempt to read the first few rows of each file to obtain its column names
- Check whether all `FEATURE_COLS` are present
- Build two lists:
  - `valid_train_files`   → files that contain all required features  
  - `skipped_train_files` → files missing any required feature, or unreadable

At the end, we will print:
- Total number of train files
- Number of valid files
- Number of skipped files
- Examples of skipped files

This ensures that only consistent, readable files enter the preprocessing pipeline, so that a missing column surfaces here rather than as an error partway through scaling or sliding-window generation.
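
A header-only variant of this check can be sketched with `pd.read_csv(..., nrows=0)`, which parses the column names without loading any data rows. The helper `missing_columns` and the `REQUIRED_COLS` list (which also includes the label column) are assumptions made for illustration, and the two in-memory CSVs are toy data.

```python
import io

import pandas as pd

# Assumed requirement list for this sketch: four signals plus the label column
REQUIRED_COLS = ["ton", "thrust", "mfr", "vl", "anomaly_code"]


def missing_columns(csv_source, required=REQUIRED_COLS):
    """Return the required columns absent from a CSV, reading only its header."""
    header = pd.read_csv(csv_source, nrows=0)
    return [c for c in required if c not in header.columns]


good = io.StringIO("time,ton,thrust,mfr,vl,anomaly_code\n0.0,0,0.0,0.0,24.0,0\n")
bad = io.StringIO("time,ton,thrust\n0.0,0,0.0\n")
print(missing_columns(good))  # []
print(missing_columns(bad))   # ['mfr', 'vl', 'anomaly_code']
```

Checking the label column at the same time means a file without `anomaly_code` is flagged during validation instead of raising a `KeyError` when labels are extracted later.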

In [6]:
valid_train_files = []
skipped_train_files = []

for path in train_csv_files:
    try:
        # Read just the first few rows to get column names
        df_head = pd.read_csv(path, nrows=5)
    except Exception as e:
        skipped_train_files.append({
            "file": path.name,
            "reason": f"read_error: {e}"
        })
        continue
    
    # Check required feature presence
    missing = [c for c in FEATURE_COLS if c not in df_head.columns]
    
    if missing:
        skipped_train_files.append({
            "file": path.name,
            "reason": f"missing_columns: {missing}"
        })
    else:
        valid_train_files.append(path)

print(f"Total train CSV files      : {len(train_csv_files)}")
print(f"Valid train files (usable) : {len(valid_train_files)}")
print(f"Skipped train files        : {len(skipped_train_files)}\n")

# Show first few skipped examples
if skipped_train_files:
    print("Examples of skipped train files:")
    for entry in skipped_train_files[:10]:
        print(f"  - {entry['file']}  →  {entry['reason']}")
else:
    print("No train files were skipped.")

Total train CSV files      : 1267
Valid train files (usable) : 1267
Skipped train files        : 0

No train files were skipped.


## Step 5: Load and Concatenate All Valid Training CSV Files

Now that we have verified that **all 1267 training files** contain the required feature
columns, we can safely load them into memory.

In this step, we will:

1. Loop through each file in `valid_train_files`
2. Load the full CSV using `pandas.read_csv`
3. Extract:
   - Only the numeric `FEATURE_COLS` (`ton`, `thrust`, `mfr`, `vl`); `time` is dropped from the feature set at this point
   - The label column (`anomaly_code`), binarized to 0 (normal) / 1 (anomaly) and stored separately
4. Accumulate:
   - `all_features_raw` — list of numpy arrays containing feature data from each file  
   - `all_labels_raw`   — list of numpy arrays containing label data from each file

After processing every file, we will:

- Concatenate all feature arrays into `train_features_raw`  
- Concatenate all label arrays into `train_labels_raw`  
- Print the final shapes for verification

At this stage:
- No scaling is applied yet  
- No sliding windows are created  
- This step only assembles the raw, unified dataset into clean numpy arrays

This will form the foundation for normalization and windowing in later steps.
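
The loading cell for this step also collapses `anomaly_code` into a binary label: missing, `0`, `"0"`, and blank entries count as normal, and everything else counts as an anomaly. The sketch below applies the same rule to a toy Series; the example values are invented and only illustrate the mapping.

```python
import numpy as np
import pandas as pd

# Toy labels mixing the cases the rule distinguishes (values are illustrative only)
raw_labels = pd.Series([0, "0", np.nan, "", " ", 3, "A1"], dtype=object)

is_normal = (
    raw_labels.isna()
    | (raw_labels == 0)
    | (raw_labels == "0")
    | (raw_labels == "")
    | (raw_labels == " ")
)
binary = np.where(is_normal, 0, 1).astype(int)
print(binary.tolist())  # [0, 0, 0, 0, 0, 1, 1]
```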

In [7]:
import numpy as np
import pandas as pd
import warnings

# Redefine the feature set: `time` is dropped, leaving the four numeric signals
FEATURE_COLS = ["ton", "thrust", "mfr", "vl"]   # numeric only
LABEL_COL = "anomaly_code"                      # raw label → we'll binarize

print("Using feature columns:", FEATURE_COLS, "\n")

all_features_raw = []
all_labels_raw = []

# Optional: suppress annoying runtime warnings
warnings.filterwarnings("ignore")

print("Reloading training CSV files using corrected feature and label handling...\n")

for path in valid_train_files:
    df = pd.read_csv(path)

    # FEATURES (always numeric)
    feat = df[FEATURE_COLS].astype(float).values

    # LABELS (binarize anomaly_code)
    raw_labels = df[LABEL_COL]

    # Binary mapping:
    #   NaN / 0 / "0" / empty → 0 (normal)
    #   anything else        → 1 (anomaly)
    is_normal = (
        raw_labels.isna()
        | (raw_labels == 0)
        | (raw_labels == "0")
        | (raw_labels == "")
        | (raw_labels == " ")
    )
    clean_labels = np.where(is_normal, 0, 1).astype(int)

    all_features_raw.append(feat)
    all_labels_raw.append(clean_labels)

# CONCATENATE EVERYTHING
train_features_raw = np.concatenate(all_features_raw, axis=0)
train_labels_raw   = np.concatenate(all_labels_raw, axis=0)

print("Finished loading all train files with corrected features and labels.\n")
print("Total training samples (rows):", train_features_raw.shape[0])
print("Feature matrix shape        :", train_features_raw.shape)
print("Label array shape           :", train_labels_raw.shape)
print("Unique labels               :", np.unique(train_labels_raw))

Using feature columns: ['ton', 'thrust', 'mfr', 'vl'] 

Reloading training CSV files using corrected feature and label handling...

Finished loading all train files with corrected features and labels.

Total training samples (rows): 81180322
Feature matrix shape        : (81180322, 4)
Label array shape           : (81180322,)
Unique labels               : [0 1]


## Step 6: Fit StandardScaler on Normal Data and Normalize Features

To stabilize LSTM-GAN training, the input features should be normalized to a common scale.

We will now:

1. Use the unified raw arrays from the previous step:
   - `train_features_raw` — shape `(N, 4)`
   - `train_labels_raw`   — shape `(N,)`, values in `{0, 1}`

2. Select only **normal samples** (label = 0) to fit the scaler.  
   This is a common choice in anomaly detection, so that the normalization statistics
   reflect healthy operation only.

3. Fit a `StandardScaler` (zero mean, unit variance) on these normal samples.

4. Transform **all** training samples (normal + anomalous) using this scaler,
   producing `train_features_scaled`.

5. Save the fitted scaler and feature column names together (a dict with keys `scaler` and `feature_cols`) as:
   - `data/processed/scaler.pkl`

Later notebooks (`01_training.ipynb` and `02_evaluation.ipynb`) will load this same scaler to keep normalization consistent across train and test data.
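
To make the "fit on normal, transform everything" idea concrete, here is a small NumPy-only sketch of the same computation on toy data. It is an illustration rather than the notebook's implementation: the arrays are invented, and `np.std` with its default `ddof=0` is used because `StandardScaler` also uses the population standard deviation.

```python
import numpy as np

# Toy data: three normal rows and one anomalous row (values are illustrative only)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, -5.0]])
y = np.array([0, 0, 0, 1])

# Statistics come from normal rows only
normal = X[y == 0]
mean = normal.mean(axis=0)
std = normal.std(axis=0)  # population std (ddof=0)

# The same statistics are then applied to every row, normal or not
X_scaled = (X - mean) / std

print(X_scaled[y == 0].mean(axis=0))  # approximately [0, 0]
print(X_scaled[y == 0].std(axis=0))   # approximately [1, 1]
```

Because the anomalous row does not influence `mean` or `std`, it lands far from zero after scaling, which is the property a reconstruction-based detector can exploit.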

In [8]:
from sklearn.preprocessing import StandardScaler
import joblib

# Sanity check
print("Raw feature matrix shape :", train_features_raw.shape)
print("Raw label array shape    :", train_labels_raw.shape)

# 1) Select only normal samples (label = 0)
normal_mask = (train_labels_raw == 0)
X_normal = train_features_raw[normal_mask]

print("\nFitting StandardScaler on normal samples only...")
print("Normal sample count      :", X_normal.shape[0])

# 2) Fit scaler
scaler = StandardScaler()
scaler.fit(X_normal)

# 3) Transform all training features
train_features_scaled = scaler.transform(train_features_raw)

print("\nScaling complete.")
print("Scaled feature matrix shape:", train_features_scaled.shape)

# 4) Save scaler + metadata
scaler_path = PROCESSED_DIR / "scaler.pkl"
joblib.dump(
    {
        "scaler": scaler,
        "feature_cols": FEATURE_COLS,
    },
    scaler_path,
)

print("\nSaved scaler to:", scaler_path)

Raw feature matrix shape : (81180322, 4)
Raw label array shape    : (81180322,)

Fitting StandardScaler on normal samples only...
Normal sample count      : 80687802

Scaling complete.
Scaled feature matrix shape: (81180322, 4)

Saved scaler to: C:\Users\aymisxx\Documents\GitHub\ProphetLSTM-GAN\data\processed\scaler.pkl


## Step 7: Build Sliding Windows, Split into Train / Validation, and Save

Right now we have:

- `train_features_scaled` with shape `(N_timesteps, 4)`  
- `valid_train_files` listing all 1267 train CSVs  
- A fitted `StandardScaler` and saved `scaler.pkl`

LSTM-GAN training does not operate directly on arbitrarily long time series.
Instead, each training sample is a **fixed-length window**:

- shape `(window_size, n_features)` per sample  
- e.g. `(128, 4)` for this project

In this step, we will:

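1. Define sliding-window hyperparameters:
   - `WINDOW_SIZE = 128` timesteps
   - `WINDOW_STRIDE = 5` between window starts
   - `MAX_WINDOWS_PER_FILE = 300` to cap memory (only the first 300 windows of each file are kept)

2. Implement a helper `make_sliding_windows(sequence_array, ...)` that:
   - Takes one scaled sequence for a single CSV: shape `(T, 4)`
   - Returns windows of shape `(num_windows, 128, 4)`
   - Uses stride and max cap to keep things tractable

3. Re-read each valid training file, scale it with the fitted scaler, and window it.

4. Concatenate all windows and split them 80/20 in file order (no shuffling) into `X_train_final` and `X_val`.

5. Save both arrays as `.npy` files in `data/processed/`.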
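
The effect of the stride and the cap can be checked with a little arithmetic. The sketch below uses two hypothetical helpers, `count_windows` and `covered_steps`, with the hyperparameters listed above as defaults, and evaluates them for the 30,600-row sample file inspected in Step 3.

```python
def count_windows(n_steps: int, window: int = 128, stride: int = 5, cap: int = 300) -> int:
    """Number of windows produced for one sequence, after applying the cap."""
    if n_steps < window:
        return 0
    uncapped = (n_steps - window) // stride + 1
    return min(uncapped, cap)


def covered_steps(n_windows: int, window: int = 128, stride: int = 5) -> int:
    """Timesteps spanned from the first window's start to the last window's end."""
    if n_windows == 0:
        return 0
    return (n_windows - 1) * stride + window


print(count_windows(30600, cap=10**9))  # 6095 windows without a cap
n = count_windows(30600)
print(n)                                # 300
print(covered_steps(n))                 # 1623
```

In other words, with these settings a 30,600-row file would yield 6,095 windows, but the cap keeps only the first 300, which together span the first 1,623 timesteps of the file. If all 1,267 training files reach the cap, the total is 1267 × 300 = 380,100 windows.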

In [9]:
import numpy as np
import pandas as pd

# Config
WINDOW_SIZE = 128
WINDOW_STRIDE = 5
MAX_WINDOWS_PER_FILE = 300
FEATURE_COLS = ["ton", "thrust", "mfr", "vl"]

def make_sliding_windows(seq, wsize=WINDOW_SIZE, stride=WINDOW_STRIDE, max_w=MAX_WINDOWS_PER_FILE):
    T, F = seq.shape
    if T < wsize:
        return np.empty((0, wsize, F), dtype=np.float32)
    starts = np.arange(0, T - wsize + 1, stride)
    if len(starts) > max_w:
        starts = starts[:max_w]
    return np.stack([seq[s:s+wsize] for s in starts], axis=0).astype(np.float32)

X_train_windows = []

for path in valid_train_files:
    df = pd.read_csv(path)
    feats = df[FEATURE_COLS].astype(float).values
    scaled = scaler.transform(feats)
    w = make_sliding_windows(scaled)
    if w.shape[0] > 0:
        X_train_windows.append(w)

# Concatenate all windows
X_all = np.concatenate(X_train_windows, axis=0)

# Split 80/20 in file order (no shuffling): validation windows come from the last files
split = int(0.8 * X_all.shape[0])
X_train_final = X_all[:split]
X_val = X_all[split:]

# Save
np.save(PROCESSED_DIR / "X_train_final.npy", X_train_final)
np.save(PROCESSED_DIR / "X_val.npy", X_val)

print("Total windows:", X_all.shape)
print("X_train_final:", X_train_final.shape)
print("X_val:", X_val.shape)
print("Saved to:", PROCESSED_DIR)

Total windows: (380100, 128, 4)
X_train_final: (304080, 128, 4)
X_val: (76020, 128, 4)
Saved to: C:\Users\aymisxx\Documents\GitHub\ProphetLSTM-GAN\data\processed


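## Step 8: Build and Save Test Windows

The test CSVs go through the same pipeline as the training files: the same feature columns, the scaler fitted in Step 6, and the same windowing settings. The resulting array is saved as `X_test.npy` for later evaluation.

In [10]: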
# Reuses WINDOW_SIZE, WINDOW_STRIDE, MAX_WINDOWS_PER_FILE, FEATURE_COLS,
# make_sliding_windows, and the fitted scaler from the cells above,
# so train and test windows are guaranteed to share one configuration.

X_test_windows = []

print("Generating test windows...")

for path in test_csv_files:
    df = pd.read_csv(path)

    # Extract features
    feats = df[FEATURE_COLS].astype(float).values

    # Scale using the SAME scaler from training
    scaled = scaler.transform(feats)

    # Generate windows
    w = make_sliding_windows(scaled)
    if w.shape[0] > 0:
        X_test_windows.append(w)

# Concatenate all windows
X_test = np.concatenate(X_test_windows, axis=0)

# Save
np.save(PROCESSED_DIR / "X_test.npy", X_test)

print("X_test shape:", X_test.shape)
print("Saved:", PROCESSED_DIR / "X_test.npy")

Generating test windows...
X_test shape: (403200, 128, 4)
Saved: C:\Users\aymisxx\Documents\GitHub\ProphetLSTM-GAN\data\processed\X_test.npy


### Notebook "00_data_preprocessing.ipynb" complete.