# Notebook 02: Model Development (Supervised & Unsupervised)

**Scope.** Build, tune, and evaluate supervised and unsupervised baselines for the FL-IDS (IIoT surveillance). Results here feed later federated experiments and the thesis comparison tables. This notebook reads the processed datasets created earlier and saves reproducible artifacts (metrics, models). It does not upload any data to the repository.


## Objectives and Structure

**Objectives**
- Train and evaluate supervised baselines (Logistic Regression, SGD Classifier, Random Forest).
- Train and evaluate unsupervised baselines (Isolation Forest, Autoencoder).
- Record accuracy, precision, recall, F1, FP/FN (rates and counts), model sizes, and timing.
- Save artifacts for later use (metrics CSVs, model binaries, thresholds).

**Structure**
1) Supervised model development (with and without SMOTE)  
2) Unsupervised model development (Isolation Forest tuning, Autoencoder + threshold tuning)  
3) Final summary and export of a combined comparison table


## Reproducibility and Output Folders

- All experiments use a fixed `SEED` for `random`, `numpy`, and model initializers when supported.
- Output folders (created automatically) keep models and metrics separate for clarity:
  - `results/models/supervised/{no_smote|with_smote}/`
  - `results/models/unsupervised/`
  - `results/*.csv` (experiment metrics and summaries)
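
A minimal sketch of the setup described above, assuming the folder layout listed in this section. The helper names `set_global_seed` and `prepare_output_dirs`, and the `base_dir` parameter, are illustrative and are not part of the notebook's own code. The demo writes into a temporary directory so it does not touch the real `results/` folder.

```python
import os
import random
import tempfile

import numpy as np

SEED = 42  # same value the notebook uses


def set_global_seed(seed: int) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


def prepare_output_dirs(base_dir: str = "results") -> list:
    """Create the model output folders listed above and return their paths."""
    subdirs = [
        os.path.join("models", "supervised", "no_smote"),
        os.path.join("models", "supervised", "with_smote"),
        os.path.join("models", "unsupervised"),
    ]
    created = []
    for sub in subdirs:
        path = os.path.join(base_dir, sub)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created


set_global_seed(SEED)

# Demo only: build the layout under a temporary directory (hypothetical location).
demo_base = os.path.join(tempfile.mkdtemp(), "results")
output_dirs = prepare_output_dirs(demo_base)
for path in output_dirs:
    print(path)
```

Estimators still need `random_state=SEED` passed explicitly; seeding the global generators alone does not fix their internal randomness.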


## 1. Supervised Model Development & Evaluation

We evaluate three classifiers on two data variants:
- **No-SMOTE**: original 80/20 stratified split  
- **With-SMOTE**: same split, then SMOTE applied **only on the training set**

Each variant gets its own `StandardScaler`, fit on that variant's training split and then applied to its test split, so no test-set statistics leak into training.
Metrics are computed on the untouched test set. Models are saved for size measurement and reuse.
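
To make the leakage point concrete, here is a small sketch of the arithmetic that standardisation performs when the statistics come from the training split only. It uses plain NumPy on synthetic data (the arrays and their sizes are made up for illustration) rather than the notebook's datasets.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for a train/test split (illustrative only).
X_train = rng.normal(loc=5.0, scale=2.0, size=(800, 3))
X_test = rng.normal(loc=5.5, scale=2.5, size=(200, 3))

# Statistics are estimated on the training split only.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)       # population std (ddof=0), as StandardScaler uses
train_std[train_std == 0.0] = 1.0     # guard against constant features

# The same training statistics are reused for the test split.
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std

print("train mean after scaling:", X_train_scaled.mean(axis=0).round(6))
print("test mean after scaling: ", X_test_scaled.mean(axis=0).round(3))
```

The scaled training features have zero mean and unit variance by construction, while the test features generally do not. That is expected: the test set is treated as unseen data.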


In [4]:
# Import the required libraries; the global random seed is set below for reproducibility
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import time
import os 
import joblib
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, confusion_matrix,classification_report)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Set the random seed for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Create the base output directory for the supervised models
os.makedirs('results/models/supervised', exist_ok=True)
print("Libraries imported, Random Seed set, all good")


Libraries imported, Random Seed set, all good


### 1.1 Data Loading & Preprocessing (No-SMOTE and With-SMOTE)

- Read `no_smote/train.csv` and `test.csv`, then standardise features with a scaler **fit on the training data only**.  
- Read `with_smote/train.csv` and `test.csv`, then standardise features again (separate scaler).  
- Target column is `Attack_label`. Non-numeric artifacts from earlier steps have already been removed or encoded.
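
The loading steps above could be wrapped in a small helper so the same code serves both variants and is not tied to one machine's path separator. This is only a sketch: `load_variant` and the toy CSV files are hypothetical, and the column check is an extra safeguard that the notebook itself does not perform.

```python
from pathlib import Path
import tempfile

import pandas as pd

TARGET = "Attack_label"


def load_variant(base_dir, variant):
    """Load train/test CSVs for one variant and split features from the target."""
    variant_dir = Path(base_dir) / variant
    train = pd.read_csv(variant_dir / "train.csv", low_memory=False)
    test = pd.read_csv(variant_dir / "test.csv", low_memory=False)

    # Fail early if the two files do not share the same columns in the same order.
    if list(train.columns) != list(test.columns):
        raise ValueError(f"Column mismatch between train and test in {variant_dir}")

    X_train, y_train = train.drop(columns=[TARGET]), train[TARGET]
    X_test, y_test = test.drop(columns=[TARGET]), test[TARGET]
    return X_train, y_train, X_test, y_test


# Demo on a tiny made-up dataset written to a temporary folder.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "no_smote").mkdir()
toy = pd.DataFrame({"f1": [0.1, 0.2, 0.3, 0.4], "f2": [1, 0, 1, 0], TARGET: [0, 1, 0, 1]})
toy.iloc[:3].to_csv(demo_root / "no_smote" / "train.csv", index=False)
toy.iloc[3:].to_csv(demo_root / "no_smote" / "test.csv", index=False)

X_tr, y_tr, X_te, y_te = load_variant(demo_root, "no_smote")
print(X_tr.shape, X_te.shape)
```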


In [6]:
# Load the preprocessed data (No-SMOTE variant)
no_smote_path = r"D:\August-Thesis\FL-IDS-Surveillance\data\processed\surv_supervised\80_20\no_smote"
train_no_smote = pd.read_csv(f"{no_smote_path}\\train.csv", low_memory= False)
test_no_smote = pd.read_csv(f"{no_smote_path}\\test.csv", low_memory = False)

X_train_ns = train_no_smote.drop(columns = ['Attack_label'])
y_train_ns = train_no_smote['Attack_label']

X_test_ns = test_no_smote.drop(columns = ['Attack_label'])
y_test_ns = test_no_smote['Attack_label']

#Checking
print(f"No SMote - Train: {X_train_ns.shape}, and for Testing: {X_test_ns.shape}")


No SMote - Train: (1775067, 42), and for Testing: (443767, 42)


In [7]:
# Load the preprocessed data (With-SMOTE variant)
with_smote_path = r"D:\August-Thesis\FL-IDS-Surveillance\data\processed\surv_supervised\80_20\with_smote"
train_with_smote = pd.read_csv(f"{with_smote_path}\\train.csv", low_memory = False)
test_with_smote = pd.read_csv(f"{with_smote_path}\\test.csv", low_memory = False)

X_train_ws = train_with_smote.drop(columns=['Attack_label'])
y_train_ws = train_with_smote["Attack_label"]

X_test_ws = test_with_smote.drop(columns=["Attack_label"])
y_test_ws = test_with_smote["Attack_label"]

#Checking
print(f"With SMOTE- Train: {X_train_ws.shape}, Test: {X_test_ws.shape}")

With SMOTE- Train: (2585028, 42), Test: (443767, 42)
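
The two printed training shapes allow a rough back-of-the-envelope check of the class balance. The sketch below assumes a binary target and assumes that SMOTE oversampled the minority class until both classes had the same size. The oversampling step lives in an earlier notebook, so this is an inference rather than something stated here.

```python
rows_train_no_smote = 1_775_067    # printed by the No-SMOTE loading cell
rows_train_with_smote = 2_585_028  # printed by the With-SMOTE loading cell

# ASSUMPTION: after SMOTE both classes have the same number of rows,
# so the majority class accounts for exactly half of the resampled set.
majority_rows = rows_train_with_smote // 2
minority_rows_before = rows_train_no_smote - majority_rows
synthetic_rows = rows_train_with_smote - rows_train_no_smote

minority_share_pct = 100 * minority_rows_before / rows_train_no_smote

print(f"majority rows:            {majority_rows}")
print(f"minority rows (original): {minority_rows_before}")
print(f"synthetic rows added:     {synthetic_rows}")
print(f"minority share before SMOTE: {minority_share_pct:.1f}%")
```

Under that assumption, the minority class makes up a little over a quarter of the original training rows, and every added row is a synthetic minority sample.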


In [9]:
#Using the StandardScaler for both datasets
scaler_ns = StandardScaler()
X_train_ns_scaled = scaler_ns.fit_transform(X_train_ns)
X_test_ns_scaled = scaler_ns.transform(X_test_ns)

scaler_ws = StandardScaler()
X_train_ws_scaled = scaler_ws.fit_transform(X_train_ws)
X_test_ws_scaled = scaler_ws.transform(X_test_ws)

print('Feature Scaling : Applied successfully')

Feature Scaling : Applied successfully


### 1.2 Models and Training Setup

We train the following baselines with scikit-learn defaults, apart from the parameters shown and a fixed `random_state=SEED`:
- **Logistic Regression** (`max_iter=1000`)
- **SGD Classifier** (linear baseline)
- **Random Forest** (parallel, `n_jobs=-1`)

For each model and data variant:
- Fit on the training split, predict on the test split.
- Record metrics, FP/FN counts, FP/FN rates, model size (MB), train time (s), and inference time (ms/sample).
- Save the fitted model under `results/models/supervised/{no_smote|with_smote}/`.
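
For reference, the recorded metrics can all be derived from the four confusion-matrix counts. The sketch below spells out those formulas in plain Python; `confusion_rates` is an illustrative helper (not part of the notebook), and the counts in the example are made up.

```python
def confusion_rates(tn, fp, fn, tp):
    """Return accuracy, precision, recall, F1, and FP/FN rates (in %) from confusion counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of true negatives wrongly flagged as positive.
        "fp_rate_pct": 100 * fp / (fp + tn) if (fp + tn) else 0.0,
        # Share of true positives that were missed.
        "fn_rate_pct": 100 * fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Toy example: 100 negatives (10 misclassified) and 50 positives (5 missed).
toy_metrics = confusion_rates(tn=90, fp=10, fn=5, tp=45)
for key, value in toy_metrics.items():
    print(f"{key}: {value:.4f}")
```

Note that the FN rate is the complement of recall, so `fn_rate_pct` equals `100 * (1 - recall)`.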


In [12]:
def train_and_evaluate(model, model_name, X_train, y_train, X_test, y_test, save_dir):
    """Fit a model, score it on the test set, save it, and return one row of metrics.

    Assumes a binary target whose positive class is labelled 1.
    """
    # Training phase
    start_train = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start_train

    # Prediction phase (timed over the whole test set, then averaged per sample)
    start_test = time.perf_counter()
    y_pred = model.predict(X_test)
    test_time = time.perf_counter() - start_test
    inference_time_per_sample = test_time / len(X_test)

    # Classification metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # ravel() on a 2x2 confusion matrix yields TN, FP, FN, TP in that order
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    fp_rate = 100 * fp / (fp + tn)  # % of negatives flagged as positive
    fn_rate = 100 * fn / (fn + tp)  # % of positives that were missed

    # Save the model to measure its on-disk size and to reuse it later
    os.makedirs(save_dir, exist_ok=True)
    save_path = os.path.join(save_dir, f"{model_name}.pkl")
    joblib.dump(model, save_path)
    model_size_mb = os.path.getsize(save_path) / (1024 * 1024)

    # Collect the results for the thesis tables
    return {
        'Model': model_name,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1 Score': f1,
        'False Positives': fp,
        'False Negatives': fn,
        'FP Rate (%)': fp_rate,
        'FN Rate (%)': fn_rate,
        'Model Size (MB)': model_size_mb,
        'Train Time (s)': train_time,
        'Inference Time (ms/sample)': inference_time_per_sample * 1000
    }
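
The function above times a single `predict` call, which can be noisy for fast models. One possible refinement, shown here only as a sketch, is to repeat the call and report the median. The helper `time_inference` and the stand-in predictor are hypothetical and are not used elsewhere in the notebook.

```python
import time

import numpy as np


def time_inference(predict_fn, X, repeats=5):
    """Return the median per-sample latency in milliseconds over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(X)
        durations.append(time.perf_counter() - start)
    return 1000 * float(np.median(durations)) / len(X)


def threshold_predict(X):
    """Stand-in for model.predict: flags rows whose feature sum is positive."""
    return (X.sum(axis=1) > 0).astype(int)


# Synthetic batch with 42 features, matching the feature count printed earlier.
X_demo = np.random.default_rng(42).normal(size=(10_000, 42))
latency_ms = time_inference(threshold_predict, X_demo)
print(f"median latency: {latency_ms:.6f} ms/sample")
```

With a fitted estimator, `model.predict` could be passed in place of `threshold_predict`.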

In [13]:
def run_expirenments(models_dict, X_train, y_train, X_test, y_test, save_dir):
    """Train and evaluate every model in models_dict; return one DataFrame row per model."""
    all_results = []
    for name, model in models_dict.items():
        print(f"Training {name} in progress . . .")
        result = train_and_evaluate(model, name, X_train, y_train, X_test, y_test, save_dir)
        all_results.append(result)
    return pd.DataFrame(all_results)

In [14]:
models_to_train = {
    'Logistic_Regressin' : LogisticRegression(max_iter = 1000, random_state = SEED),
    'SGD Classifier' : SGDClassifier(random_state = SEED),
    'Random_Forest' : RandomForestClassifier(n_jobs = -1, random_state = SEED)
}

results_no_smote = run_expirenments(
    models_to_train,
    X_train_ns_scaled, y_train_ns,
    X_test_ns_scaled, y_test_ns,
    "results/models/supervised/no_smote"
)

results_no_smote.to_csv("results/supervised_results_no_smote.csv", index = False)
results_no_smote.style.highlight_max(subset=["Accuracy", "Precision", "Recall", "F1 Score"], color="lightgreen", axis=0)

Training Logistic_Regressin in progress . . .
Training SGD Classifier in progress . . .
Training Random_Forest in progress . . .


| | Model | Accuracy | Precision | Recall | F1 Score | False Positives | False Negatives | FP Rate (%) | FN Rate (%) | Model Size (MB) | Train Time (s) | Inference Time (ms/sample) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic_Regressin | 0.982667 | 0.996414 | 0.939621 | 0.967184 | 408 | 7284 | 0.126265 | 6.037899 | 0.001143 | 9.686551 | 9.8e-05 |
| 1 | SGD Classifier | 0.975415 | 0.985634 | 0.923018 | 0.953299 | 1623 | 9287 | 0.502276 | 7.698238 | 0.001386 | 5.191004 | 7.8e-05 |
| 2 | Random_Forest | 0.997821 | 0.999449 | 0.992531 | 0.995978 | 66 | 901 | 0.020425 | 0.746863 | 1.474755 | 50.454203 | 0.001066 |
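
As a quick consistency check on the No-SMOTE results, accuracy can be recomputed from the error counts alone, since accuracy = 1 − (FP + FN) / N, where N is the number of test rows. The sketch below uses only numbers that appear in this notebook's outputs; the dictionary layout is just for illustration.

```python
TEST_ROWS = 443_767  # test-set size printed when the data was loaded

# (false positives, false negatives, reported accuracy) copied from the No-SMOTE results
reported = {
    "Logistic_Regressin": (408, 7284, 0.982667),
    "SGD Classifier": (1623, 9287, 0.975415),
    "Random_Forest": (66, 901, 0.997821),
}

recomputed = {}
for name, (fp, fn, acc_reported) in reported.items():
    # Every misclassified row is either a false positive or a false negative.
    recomputed[name] = 1 - (fp + fn) / TEST_ROWS
    print(f"{name}: reported={acc_reported:.6f}, recomputed={recomputed[name]:.6f}")
```

The recomputed values agree with the reported accuracies to the six decimals shown, which suggests the counts and the accuracy column describe the same predictions.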


In [15]:
#With SMOTE Applied now
models_to_train_smote = {
    'Logistic_Regression_SMOTE' : LogisticRegression(max_iter = 1000, random_state = SEED),
    'SGD Classifier_SMOTE' : SGDClassifier(random_state = SEED),
    'Random_Forest_SMOTE' : RandomForestClassifier(n_jobs = -1, random_state = SEED)
}

results_ws_smote = run_expirenments(
    models_to_train_smote,
    X_train_ws_scaled, y_train_ws,
    X_test_ws_scaled, y_test_ws,
    "results/models/supervised/with_smote"
)

results_ws_smote.to_csv("results/supervised_results_with_smote.csv", index = False)
results_ws_smote.style.highlight_max(
    subset=["Accuracy", "Precision", "Recall", "F1 Score"],
    color="lightgreen",
    axis=0,
).highlight_min(
    subset=["FP Rate (%)", "FN Rate (%)", "Train Time (s)", "Model Size (MB)", "Inference Time (ms/sample)"],
    color="lightblue",
    axis=0,
)

Training Logistic_Regression_SMOTE in progress . . .
Training SGD Classifier_SMOTE in progress . . .
Training Random_Forest_SMOTE in progress . . .


| | Model | Accuracy | Precision | Recall | F1 Score | False Positives | False Negatives | FP Rate (%) | FN Rate (%) | Model Size (MB) | Train Time (s) | Inference Time (ms/sample) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic_Regression_SMOTE | 0.970248 | 0.9361 | 0.955802 | 0.945848 | 7871 | 5332 | 2.435869 | 4.419835 | 0.001143 | 10.738022 | 0.000158 |
| 1 | SGD Classifier_SMOTE | 0.966852 | 0.934644 | 0.944081 | 0.939339 | 7964 | 6746 | 2.46465 | 5.591936 | 0.001386 | 5.864219 | 7.6e-05 |
| 2 | Random_Forest_SMOTE | 0.995906 | 0.985162 | 1.0 | 0.992525 | 1817 | 0 | 0.562314 | 0.0 | 2.156518 | 69.797611 | 0.001177 |
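
Comparing the two result sets side by side makes the effect of SMOTE easier to read. The sketch below computes the change in false positives and false negatives per model, using only the counts reported in the two result tables. The model labels are shortened for readability and the dictionary layout is illustrative.

```python
# (false positives, false negatives) copied from the No-SMOTE and With-SMOTE results
no_smote = {
    "Logistic Regression": (408, 7284),
    "SGD Classifier": (1623, 9287),
    "Random Forest": (66, 901),
}
with_smote = {
    "Logistic Regression": (7871, 5332),
    "SGD Classifier": (7964, 6746),
    "Random Forest": (1817, 0),
}

deltas = {}
for name in no_smote:
    fp0, fn0 = no_smote[name]
    fp1, fn1 = with_smote[name]
    deltas[name] = {
        "delta_fp": fp1 - fp0,                              # positive = more false alarms
        "delta_fn": fn1 - fn0,                              # negative = fewer missed attacks
        "delta_total_errors": (fp1 + fn1) - (fp0 + fn0),
    }
    print(name, deltas[name])
```

On this test set, each model misses fewer positives after SMOTE but raises more false alarms, and the total number of errors goes up. Whether that trade is worthwhile depends on the relative cost of a missed attack versus a false alert.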
