# Notebook 03: Federated Unsupervised Baselines (IF & AE)

**Goal.** Build and evaluate unsupervised baselines in a federated setting for IIoT surveillance data. We compare (a) centralized/global models, (b) client-local models, and (c) a simple Federated Averaging (FedAvg) Autoencoder, all evaluated on a shared mixed test set.

**What this notebook does**
- Load each client’s **normal-only** training data and a centralized **mixed** test set.
- Evaluate a **pre-trained global Isolation Forest** on each client’s data (diagnostic baseline).
- Train **local Isolation Forests** per client and evaluate all of them on the same mixed test set.
- Train **local Autoencoders** per client (with MinMax scaling); pick a **threshold** on reconstruction errors to convert to binary labels; evaluate on the mixed test set.
- Construct a **federated Autoencoder (FedAvg)** by averaging client weights; evaluate it on the mixed test set.
- Record metrics: Accuracy, Precision, Recall, F1, FP/FN counts, FP/FN rates, model size, and inference time.

**Scope / assumptions**
- Data paths are repo-relative; raw/processed data remain local and git-ignored.
- This notebook **does not** alter raw data; it trains/evaluates models and writes metrics/models to `results/`.
- Reproducibility: fixed random seeds; same preprocessing per client (StandardScaler for IF, MinMaxScaler for AE).
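- Threshold selection for AEs: the evaluation cells sweep the 80th to 99.9th percentiles of reconstruction error on the mixed test set and keep the threshold with the highest F1. Because the test labels are used for tuning, the reported AE metrics should be read as optimistic.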

**Outputs (for later use)**
- CSVs with per-client and federated results (unsupervised IF and AE).
- Saved models (local IF/AE, federated AE) and any thresholds used for AE decisions.


In [1]:
import pandas as pd
import os

# Base path for the federated client data
base_path = r"D:\August-Thesis\FL-IDS-Surveillance\data\processed\federated\unsupervised"

# Load each client's training set
client_data = {}
for i in range(1, 6):
    client_id = f"client_{i}"
    file_path = os.path.join(base_path, client_id, "train.csv")
    df = pd.read_csv(file_path)
    client_data[client_id] = df
    print(f"{client_id}'s shape is: {df.shape}")

client_1's shape is: (323128, 42)
client_2's shape is: (323128, 42)
client_3's shape is: (323128, 42)
client_4's shape is: (323128, 42)
client_5's shape is: (323131, 42)


In [4]:
from sklearn.preprocessing import StandardScaler
import joblib

# Load the centralized normal-only training data
train_path = r"D:\August-Thesis\FL-IDS-Surveillance\data\processed\surv_unsupervised\train_normal_only.csv"
train_df = pd.read_csv(train_path, low_memory=False)

# Drop the label and the http.request.method feature, then keep numeric columns only
X_train = train_df.drop(columns=['Attack_label', 'http.request.method'], errors="ignore")
X_train = X_train.select_dtypes(include="number")

# Fit the scaler
scaler = StandardScaler()
scaler.fit(X_train)

# Save the fitted scaler, creating the target directory if it does not exist yet
scaler_save_path = r"results\scalers\std_scaler.pkl"
os.makedirs(os.path.dirname(scaler_save_path), exist_ok=True)
joblib.dump(scaler, scaler_save_path)
print("Standard Scaler Saved Successfully")

Standard Scaler Saved Successfully


## Part1: Evaluate Global Isolation Forest Model on Each Client

In this step, we evaluate the pre-trained **global Isolation Forest model** on each client’s local dataset.  
This shows **how well the centralized model generalizes across the data distributions** observed by the clients.

### Purpose:
- Establish baseline performance for the global model on decentralized data.
- Measure how often the global model flags anomalies in each client's dataset.

### Evaluation Plan:
- Use the previously trained global model: **`isolation_forest_best.pkl`**.
- For each client:
  - Drop the `Attack_label` and `http.request.method` columns before inference.
  - Map the model output to binary labels:
    - `-1` => anomaly (mapped to **1 = attack**)  
    - `1` => normal (mapped to **0 = no attack**)  
  - Compare predictions against the true labels to compute:
    - **Accuracy**, **Precision**, **Recall**, **F1 Score**
    - **False Positives (FP)**, **False Negatives (FN)**, **FP Rate**, **FN Rate**
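The label mapping and error counts in the plan above can be sketched with NumPy alone. The helper name `summarize_predictions` and the toy arrays are illustrative assumptions, not part of the notebook's code. The sketch also shows why precision and recall collapse to 0 when a dataset contains no attack rows.

```python
import numpy as np

def summarize_predictions(y_true, raw_pred):
    """Map Isolation Forest output (-1/1) to attack labels (1/0) and count errors."""
    y_true = np.asarray(y_true)
    y_pred = np.where(np.asarray(raw_pred) == -1, 1, 0)  # -1 => attack, 1 => normal

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))

    # Guard against division by zero when a class is absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "fp": fp,
        "fn": fn,
    }

# Toy normal-only dataset: every flagged row is necessarily a false positive.
normal_only = summarize_predictions([0, 0, 0, 0, 0], [1, -1, 1, 1, 1])
print(normal_only)
```

On normal-only data the accuracy is simply one minus the flagged fraction, which is the pattern to expect in the per-client table for this step.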


In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the pre-trained model and saved scaler
model_path = r"results\models\unsupervised\isolation_forest_best.pkl"
scaler_path = r"results\scalers\std_scaler.pkl"

global_model = joblib.load(model_path)
scaler = joblib.load(scaler_path)

#Load training data to get expected columns, and drop problematic columns
train_df = pd.read_csv(
    r"D:\August-Thesis\FL-IDS-Surveillance\data\processed\surv_unsupervised\train_normal_only.csv",
    low_memory=False
)

X_train = train_df.drop(columns=["Attack_label", "http.request.method"], errors="ignore").select_dtypes(include="number")
expected_columns = X_train.columns  

# Evaluate on each client
client_results = {}
for client_id, df in client_data.items():
    # Drop label and problematic columns from client data
    X = df.drop(columns=["Attack_label", "http.request.method"], errors="ignore").select_dtypes(include="number")
    
    # Align columns with training columns
    X = X[expected_columns]
    y_true = df["Attack_label"]

    # Scale features
    X_scaled = scaler.transform(X)

    # Predict using global Isolation Forest model
    y_pred = global_model.predict(X_scaled)

    # Convert predictions: -1 = anomaly=>1 (attack), 1 = normal => 0
    y_pred_mapped = [1 if pred == -1 else 0 for pred in y_pred]

    # Calculate the metrics for evaluation
    acc = accuracy_score(y_true, y_pred_mapped)
    prec = precision_score(y_true, y_pred_mapped, zero_division=0)
    rec = recall_score(y_true, y_pred_mapped, zero_division=0)
    f1 = f1_score(y_true, y_pred_mapped, zero_division=0)
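    # Build the prediction Series on y_true's index so the comparison cannot misalign
    y_pred_series = pd.Series(y_pred_mapped, index=y_true.index)
    fp = int(((y_true == 0) & (y_pred_series == 1)).sum())
    fn = int(((y_true == 1) & (y_pred_series == 0)).sum())
    # Rates are expressed as a share of all samples in the client dataset
    fp_rate = 100 * fp / len(y_true)
    fn_rate = 100 * fn / len(y_true)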

    # Storing the results
    client_results[client_id] = {
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1 Score": f1,
        "False Positives": fp,
        "False Negatives": fn,
        "FP Rate (%)": fp_rate,
        "FN Rate (%)": fn_rate
    }

# Convert results to a DataFrame and display
results_df = pd.DataFrame(client_results).T
results_df = results_df.round(4)
results_df


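| Client   | Accuracy | Precision | Recall | F1 Score | False Positives | False Negatives | FP Rate (%) | FN Rate (%) |
|----------|----------|-----------|--------|----------|-----------------|-----------------|-------------|-------------|
| client_1 | 0.9601   | 0.0       | 0.0    | 0.0      | 12904.0         | 0.0             | 3.9935      | 0.0         |
| client_2 | 0.9597   | 0.0       | 0.0    | 0.0      | 13016.0         | 0.0             | 4.0281      | 0.0         |
| client_3 | 0.9601   | 0.0       | 0.0    | 0.0      | 12881.0         | 0.0             | 3.9863      | 0.0         |
| client_4 | 0.9603   | 0.0       | 0.0    | 0.0      | 12833.0         | 0.0             | 3.9715      | 0.0         |
| client_5 | 0.9598   | 0.0       | 0.0    | 0.0      | 12992.0         | 0.0             | 4.0207      | 0.0         |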


## Part2: Evaluation of Pre-trained Global Isolation Forest on Client Data

In this step, we evaluated the pre-trained **global Isolation Forest model** (trained on mixed normal and attack traffic) on each client’s local dataset.

### Purpose:
To test how well the centralized model generalizes to distributed, normal-only client data.

### Setup:
- Each client dataset contained only **normal traffic**.
- The model was trained on a mix of normal and attack traffic.
- Features were scaled using a **StandardScaler** fitted on the original training data.
- Predictions were made using the Isolation Forest, where `-1 = anomaly`, `1 = normal`.

### Results:
| Metric              | Observation                                                   |
|---------------------|---------------------------------------------------------------|
| Accuracy            | ~96%, since most data was correctly labeled as normal.        |
| Precision / Recall / F1 | All 0.0: with no attack samples, no true positives are possible. |
| False Positives     | ~4% of normal samples were flagged as anomalies.              |
| False Negatives     | 0, as expected, since there were no attack samples.           |

### Why This Result Is Expected:
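- With a numeric `contamination` setting, Isolation Forest flags roughly that fraction of its training data as anomalies. The ~4% flagged here suggests the saved global model was fitted with a value near **0.04** rather than 0.1 (scikit-learn's current default is `"auto"`).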
- Since the client datasets only contain normal traffic, any flagged anomaly is a **false positive**.
- This highlights a limitation of applying a globally trained unsupervised model to unseen client distributions.
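To make the link between `contamination` and the false-positive rate concrete: for a numeric `contamination`, scikit-learn places the decision threshold at that percentile of the training anomaly scores. The sketch below mimics this with synthetic scores. The score distribution and the value 0.04 are assumptions chosen to mirror the ~4% observed above; they are not read from the saved model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in anomaly scores for normal-only data (lower = more anomalous).
scores = rng.normal(size=100_000)

contamination = 0.04  # assumed value, for illustration only
threshold = np.percentile(scores, 100 * contamination)

# Everything below the percentile threshold is flagged as an anomaly.
flagged = scores < threshold
flagged_fraction = flagged.mean()

print(f"Flagged fraction: {flagged_fraction:.4f}")
```

Because all rows are normal, the flagged fraction is the false-positive rate, regardless of how well the model separates real attacks.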

### Conclusion
This experiment served as a **diagnostic baseline**, but it does not represent true federated learning.  
In a proper FL setup:
- Each client must train its own local model on its private data.
- The global model should be formed from **aggregating or analyzing these local models**.

We now proceed to train local Isolation Forest models on each client to simulate an actual federated learning scenario.


In [7]:
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Store models and scalers by client ID
local_models = {}
local_scalers = {}

# Define the model hyperparameters (same for all clients for fairness)
isoforest_params = {
    "n_estimators": 100,
    "contamination": 0.1,
    "random_state": 42
}

# Train a local Isolation Forest for each client
for client_id, df in client_data.items():
    print(f"Training model for {client_id}...")
    
    # Drop label and non-numeric/problematic columns
    X = df.drop(columns=["Attack_label", "http.request.method"], errors="ignore").select_dtypes(include="number")
    
    # Fitting local StandardScaler
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Train the isolation Forest
    model = IsolationForest(**isoforest_params)
    model.fit(X_scaled)
    
    # Store the model and scaler
    local_models[client_id] = model
    local_scalers[client_id] = scaler

print("Local training complete for all clients.")


Training model for client_1...
Training model for client_2...
Training model for client_3...
Training model for client_4...
Training model for client_5...
Local training complete for all clients.


## Part3: Evaluate Each Local Model on a Common Centralized Test Set

In this step, we evaluate how well each locally trained Isolation Forest model generalizes to a shared, centralized test dataset that contains both normal and attack traffic.

### Purpose:
This evaluation helps us understand:
- Whether local models can detect attacks they have never seen  
- How generalizable each model is to a broader data distribution  
- Which clients produce stronger or weaker models in terms of anomaly detection  

### Evaluation Plan:
- Use the same test set (`test_mixed.csv`) for all clients  
- Apply each client’s local **StandardScaler** to preprocess the test data  
- Use each client’s **Isolation Forest** model to make predictions  
- Convert Isolation Forest output to binary format:  
  - `-1 → anomaly (mapped to 1 = attack)`  
  - `1 → normal (mapped to 0 = no attack)`  
- Calculate standard classification metrics:  
  - **Accuracy, Precision, Recall, F1 Score**  
  - **False Positives (FP), False Negatives (FN), FP Rate, FN Rate**  

This will provide a side-by-side comparison of model performance across all clients.
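One detail matters when reading the FP/FN rate columns: the Isolation Forest cells divide by the total number of test samples, while the Autoencoder cells later in this notebook divide by the number of negatives (for FP) or positives (for FN). A small sketch of the two conventions follows; the function name `error_rates` and the confusion counts are made up for illustration.

```python
def error_rates(tp, fp, fn, tn):
    """Return FP/FN rates under both conventions, in percent."""
    total = tp + fp + fn + tn
    return {
        # Share of ALL samples (convention used in the Isolation Forest cells)
        "fp_share_of_all": 100 * fp / total,
        "fn_share_of_all": 100 * fn / total,
        # Per-class rates (convention used in the Autoencoder cells)
        "fpr": 100 * fp / (fp + tn),
        "fnr": 100 * fn / (fn + tp),
    }

# Hypothetical counts: 100 attacks, 900 normal samples.
rates = error_rates(tp=20, fp=10, fn=80, tn=890)
print(rates)
```

With these counts the FN share of all samples is 8%, while the per-class FN rate is 80%, so the two columns are not directly comparable across model families.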


In [9]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#Loading the centralized dataset
test_data_path = r"D:\August-Thesis\FL-IDS-Surveillance\data\processed\surv_unsupervised\test_mixed.csv"
df_test = pd.read_csv(test_data_path, low_memory=False)

#Drop the non-numeric/problematic columns 
df_test = df_test.drop(columns=["http.request.method"], errors="ignore")

# make the features and labels
X_test_full = df_test.drop(columns=["Attack_label"]).select_dtypes(include="number")
y_true = df_test["Attack_label"]

#Store the evaluation results
evaluation_results = {}

for client_id in local_models.keys():
    model = local_models[client_id]
    scaler = local_scalers[client_id]

    #Align test set columns to the client's expected features 
    expected_columns = scaler.feature_names_in_
    X_test = X_test_full[expected_columns]

    # Scale test data using client's scaler
    X_scaled = scaler.transform(X_test)

    #Predict using client's model
    y_pred_raw = model.predict(X_scaled)
    y_pred = np.where(y_pred_raw == -1, 1, 0)  # -1 => 1 (attack), 1 => 0 (normal)

    #Compute metrics
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    # Note: these rates are a share of ALL test samples, not per-class FPR/FNR
    fp_rate = 100 * fp / len(y_true)
    fn_rate = 100 * fn / len(y_true)

    #Store results
    evaluation_results[client_id] = {
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1 Score": f1,
        "False Positives": fp,
        "False Negatives": fn,
        "FP Rate (%)": fp_rate,
        "FN Rate (%)": fn_rate
    }

#Convert to DataFrame and save
results_df_local_eval = pd.DataFrame(evaluation_results).T
results_df_local_eval = results_df_local_eval.round(4)

# Save to CSV
results_save_path = r"results\local_model_evaluation_iForest.csv"
results_df_local_eval.to_csv(results_save_path)

results_df_local_eval


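| Client   | Accuracy | Precision | Recall | F1 Score | False Positives | False Negatives | FP Rate (%) | FN Rate (%) |
|----------|----------|-----------|--------|----------|-----------------|-----------------|-------------|-------------|
| client_1 | 0.7231   | 0.482     | 0.2492 | 0.3286   | 161560.0        | 452847.0        | 7.2813      | 20.4092     |
| client_2 | 0.6769   | 0.2284    | 0.0792 | 0.1177   | 161446.0        | 555399.0        | 7.2762      | 25.0311     |
| client_3 | 0.6784   | 0.238     | 0.0831 | 0.1232   | 160536.0        | 553061.0        | 7.2352      | 24.9257     |
| client_4 | 0.6798   | 0.2537    | 0.0916 | 0.1346   | 162618.0        | 547918.0        | 7.329       | 24.694      |
| client_5 | 0.6741   | 0.2052    | 0.0691 | 0.1034   | 161538.0        | 561486.0        | 7.2803      | 25.3055     |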


## Part4: Reflection on Why Local Models Performed Worse than the Global Model

Although both the global and local Isolation Forest models were trained on data from the same source (`train_normal_only.csv`), the global model significantly outperformed the local models when evaluated on a mixed (attack + normal) test set.

### Key Insight
The only difference between the global and local models is the **amount of training data** they were exposed to.

| Factor               | Global Model                           | Local Models                                  |
|-----------------------|----------------------------------------|-----------------------------------------------|
| Training data size    | Entire dataset (~1.6 million samples)  | ~320,000 samples per client                   |
| View of “normal”      | Full variety across all clients        | Narrower, client-specific normal behavior      |
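| Isolation Forest trees| Per-tree subsamples drawn from the full dataset | Per-tree subsamples drawn from one client's data (tree size is capped by `max_samples`, 256 by default, in both cases) |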
| Contamination (0.1)   | Spread over more data → balanced       | Compressed into smaller sample → more misclassification |

### Why This Matters
- Isolation Forest relies on randomly partitioning the feature space.
- With more data, the model can better separate dense vs. sparse regions (i.e., normal vs. anomalous).
- Local models, trained on smaller samples, had a narrower statistical view of what “normal” is, which led to:
  - More false positives (flagging unseen normal as anomalies)  
  - More false negatives (failing to detect subtle anomalies near local patterns)
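The random-partitioning idea can be illustrated in one dimension. This is a toy sketch, not scikit-learn's implementation: the function `isolation_depth`, the cluster in `[0, 1]`, and the outlier at `10.0` are all assumptions for illustration. It counts how many random splits are needed before a point sits alone in its partition.

```python
import random

def isolation_depth(values, target, rng, max_depth=64):
    """Count random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # remaining points are identical and cannot be split
        split = rng.uniform(lo, hi)
        # Keep only the points that fall on the same side of the split as the target
        current = [v for v in current if (v < split) == (target < split)]
        depth += 1
    return depth

rng = random.Random(42)
cluster = [rng.uniform(0.0, 1.0) for _ in range(200)]  # dense "normal" region
inlier, outlier = cluster[0], 10.0
data = cluster + [outlier]

trials = 200
mean_inlier = sum(isolation_depth(data, inlier, rng) for _ in range(trials)) / trials
mean_outlier = sum(isolation_depth(data, outlier, rng) for _ in range(trials)) / trials

print(f"mean depth, inlier:  {mean_inlier:.2f}")
print(f"mean depth, outlier: {mean_outlier:.2f}")
```

Points in sparse regions are isolated after very few splits, which is what the anomaly score captures. What counts as "sparse" depends on the data the trees were built from, which is where a client's narrower view of normal traffic comes in.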

### Conclusion
Even though the data distribution was the same, the limited data size per client resulted in less generalizable models.  
This highlights a core trade-off in federated unsupervised learning: **decentralization may reduce data diversity and model robustness unless compensated by smarter aggregation or adaptation strategies.**

---

## Train Local Autoencoder Models on Each Client

In this step, we extend the federated learning setup to use **Autoencoders** — a neural network–based unsupervised method for anomaly detection.

### Why Autoencoders?
Autoencoders are well-suited for detecting anomalies by learning to reconstruct normal input patterns.  
If the reconstruction error for a new sample is high, it is likely anomalous.

Unlike Isolation Forests, Autoencoders are:
- Parametric (learn weights via backpropagation)
- Sensitive to input scale and distribution
- More expressive in modeling nonlinear patterns
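The reconstruction-error principle can be shown without a neural network. The sketch below uses a one-component PCA projection as a linear stand-in for an autoencoder; this is a deliberate simplification and not the Keras model trained in this notebook. The synthetic data, the direction vector, and the anomalous point are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" data lies close to a single line in 3-D space.
direction = np.array([1.0, 2.0, 3.0])
t = rng.normal(size=(500, 1))
normal = t * direction + 0.01 * rng.normal(size=(500, 3))

# Fit: the top principal component plays the role of encoder and decoder.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
component = vt[:1]  # shape (1, 3)

def reconstruction_error(x):
    """Per-sample mean squared error after projecting onto the learned component."""
    centered = np.atleast_2d(x) - mean
    reconstructed = centered @ component.T @ component
    return np.mean((centered - reconstructed) ** 2, axis=1)

normal_errors = reconstruction_error(normal)
anomaly_error = reconstruction_error(np.array([3.0, -2.0, 1.0]))[0]

print(f"max error on normal data: {normal_errors.max():.6f}")
print(f"error on off-pattern point: {anomaly_error:.6f}")
```

A point that does not follow the learned pattern cannot be reconstructed well, so its error stands out against the errors of the normal samples.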

### Key Differences in Preprocessing
- We use `MinMaxScaler()` instead of `StandardScaler`, because the Autoencoder's sigmoid output layer can only produce values in `(0, 1)`, so the inputs it must reconstruct need to be scaled to the `[0, 1]` range.
- We still drop the non-numeric or irrelevant column: `http.request.method`.
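A short sketch of what min-max scaling does when a scaler fitted on one dataset is applied to another. It uses the same formula as `MinMaxScaler` with its default range, written out in NumPy; the arrays are made-up values for illustration.

```python
import numpy as np

train = np.array([[10.0, 0.0], [20.0, 5.0], [30.0, 10.0]])
test = np.array([[25.0, 5.0], [45.0, -5.0]])  # second row lies outside the training range

# "Fit": remember per-feature minimum and range from the training data only.
data_min = train.min(axis=0)
data_range = train.max(axis=0) - data_min
data_range[data_range == 0] = 1.0  # avoid division by zero for constant columns

def minmax_transform(x):
    """Scale features with the statistics learned from the training data."""
    return (x - data_min) / data_range

train_scaled = minmax_transform(train)
test_scaled = minmax_transform(test)

print(train_scaled)
print(test_scaled)
```

Training data lands exactly in `[0, 1]`, but unseen data can fall outside it. A sigmoid output cannot reproduce such values, so they raise the reconstruction error, which is relevant when a client's scaler is applied to the mixed test set.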

### Local Training Plan
Each client will:
- Drop the `Attack_label` and `http.request.method` columns (if present)
- Fit a `MinMaxScaler` on its local features
- Train a deep Autoencoder on the scaled normal-only data
- Store the model and scaler separately for evaluation or comparison

This step mirrors the Isolation Forest setup but uses a neural architecture and preprocessing tailored for deep learning.


In [10]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam

def build_autoencoder(input_dim):
    # Input layer
    input_layer = Input(shape=(input_dim,))
    
    # Encoder
    encoded = Dense(32, activation='relu')(input_layer)
    encoded = Dense(16, activation='relu')(encoded)
    encoded = Dense(8, activation='relu')(encoded)
    
    # Decoder
    decoded = Dense(16, activation='relu')(encoded)
    decoded = Dense(32, activation='relu')(decoded)
    output_layer = Dense(input_dim, activation='sigmoid')(decoded)
    
    # Autoencoder model
    autoencoder = Model(inputs=input_layer, outputs=output_layer)
    autoencoder.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
    
    return autoencoder


In [11]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.callbacks import EarlyStopping

#Store the trained model and scaler for each client
local_autoencoders = {}
local_minmax_scalers = {}

#Train parameters
epochs = 50
batch_size = 256
patience = 5

for client_id, df in client_data.items():
    print(f"Training the Autoencode for {client_id} . . .")
    X = df.drop(columns=["Attack_label", "http.request.method"],errors="ignore").select_dtypes(include="number")

    #Fit the scaler
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Build the Autoencoder based on input shape
    input_dim = X_scaled.shape[1]
    model = build_autoencoder(input_dim)

    #Early stopping 
    early_stop = EarlyStopping(
        monitor = 'loss',
        patience = patience,
        restore_best_weights = True,
        verbose = 1
    )

    #train Model
    model.fit(
        X_scaled, X_scaled, epochs = epochs,
        batch_size = batch_size, shuffle = True,
        callbacks = [early_stop], verbose = 1
    )
    model.save(f"results/models/unsupervised/clients/autoencoder_client_{client_id}.h5")
    local_autoencoders[client_id] = model
    local_minmax_scalers[client_id] = scaler
print("Training complete for all clients.")

Training the Autoencode for client_1 . . .
Epoch 1/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0556
Epoch 2/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 993us/step - loss: 0.0120
Epoch 3/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 983us/step - loss: 0.0118
Epoch 4/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 944us/step - loss: 0.0116
Epoch 5/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 959us/step - loss: 0.0116
Epoch 6/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 985us/step - loss: 0.0115
Epoch 7/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0115
Epoch 8/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0116
Epoch 9/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0116
Epoch 10/50
...



Training the Autoencode for client_2 . . .
Epoch 1/50
[1m1263/1263[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 1ms/step - loss: 0.0589
Epoch 2/50
[1m1263/1263[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - loss: 0.0079
Epoch 3/50
[1m1263/1263[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - loss: 0.0078
Epoch 4/50
[1m1263/1263[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - loss: 0.0078
Epoch 5/50
[1m1263/1263[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - loss: 0.0077
Epoch 6/50
[1m1263/1263[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - loss: 0.0074
Epoch 7/50
[1m1263/1263[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - loss: 0.0051
Epoch 8/50
[1m1263/1263[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - loss: 0.0052
Epoch 9/50
[1m1263/1263[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - loss: 0.0052
Epoch 10/50
[1m1263/1263[0m 



Training the Autoencode for client_3 . . .
Epoch 1/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0602
Epoch 2/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0169
Epoch 3/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0166
Epoch 4/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0166
Epoch 5/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0165
Epoch 6/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0166
Epoch 7/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 995us/step - loss: 0.0164
Epoch 8/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 997us/step - loss: 0.0166
Epoch 9/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 986us/step - loss: 0.0165
Epoch 10/50
...



Training the Autoencode for client_4 . . .
Epoch 1/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 2s 980us/step - loss: 0.0477
Epoch 2/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 979us/step - loss: 0.0146
Epoch 3/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 965us/step - loss: 0.0144
Epoch 4/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 964us/step - loss: 0.0144
Epoch 5/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0143
Epoch 6/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0143
Epoch 7/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 963us/step - loss: 0.0144
Epoch 8/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0143
Epoch 9/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0142
Epoch 10/50
...



Training the Autoencode for client_5 . . .
Epoch 1/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 0.0573
Epoch 2/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0168
Epoch 3/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0167
Epoch 4/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0157
Epoch 5/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0157
Epoch 6/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0156
Epoch 7/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - loss: 0.0134
Epoch 8/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 996us/step - loss: 0.0130
Epoch 9/50
1263/1263 ━━━━━━━━━━━━━━━━━━━━ 1s 995us/step - loss: 0.0129
Epoch 10/50
...



Training complete for all clients.


## Part5: Evaluate Each Local Autoencoder on a Common Centralized Test Set

In this step, we evaluate the performance of each locally trained Autoencoder on a shared test dataset (`test_mixed.csv`) that contains both normal and attack traffic.

### Why This Matters
This evaluation allows us to assess how well each client’s model, trained only on its local view of normal traffic, generalizes to unseen and more diverse data. Specifically, we measure the model’s ability to:  

- Reconstruct normal inputs accurately  
- Flag abnormal inputs based on reconstruction error  

### Evaluation Procedure
For each client:  

1. Load the centralized test set  
2. Drop the `http.request.method` column  
3. Extract and preprocess features using the client’s own `MinMaxScaler`  
4. Run the test data through the client’s Autoencoder to reconstruct inputs  
5. Compute reconstruction error (Mean Squared Error)  
6. Tune a threshold on reconstruction error to maximize F1 Score  
7. Classify anomalies: reconstruction error > threshold → anomaly  
8. Compare predicted labels to true `Attack_label` and compute:  
   - Accuracy  
   - Precision  
   - Recall  
   - F1 Score  
   - False Positives (FP)  
   - False Negatives (FN)  
   - FP Rate and FN Rate  

This provides insight into which clients produce more generalizable anomaly detectors, despite only seeing a limited view of the system during training.
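Step 6 of the procedure (the threshold sweep) can be sketched as follows. Unlike the evaluation cell in this notebook, which tunes on the test set itself, this sketch tunes on a separate validation split and then applies the fixed threshold to held-out data; that split is an assumed variation, shown because it avoids using test labels for tuning. The synthetic error values and helper names are illustrative.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for binary labels, returning 0.0 when there are no positives at all."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(errors, y_true, percentiles=np.arange(80, 100, 0.1)):
    """Sweep percentile-based thresholds and keep the one with the highest F1."""
    best_f1, best_threshold = 0.0, None
    for threshold in np.percentile(errors, percentiles):
        f1 = f1_binary(y_true, (errors > threshold).astype(int))
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_threshold, best_f1

# Synthetic reconstruction errors: 900 normal samples (low), 100 attacks (high).
rng = np.random.default_rng(1)
errors = np.concatenate([rng.uniform(0.00, 0.02, 900), rng.uniform(0.05, 0.10, 100)])
labels = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])

# Split once: tune on the validation half, report on the held-out half.
order = rng.permutation(len(errors))
val_idx, test_idx = order[:500], order[500:]

threshold, val_f1 = best_f1_threshold(errors[val_idx], labels[val_idx])
test_pred = (errors[test_idx] > threshold).astype(int)
test_f1 = f1_binary(labels[test_idx], test_pred)

print(f"threshold={threshold:.4f}  val F1={val_f1:.3f}  held-out F1={test_f1:.3f}")
```

Note that the sweep only covers the 80th to 99.9th percentiles, so at most 20% of samples can ever be flagged. If attacks make up a larger share of the data, recall is capped by that range.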


In [13]:
import time
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

#Load centralized test set
test_data_path = r"D:\August-Thesis\FL-IDS-Surveillance\data\processed\surv_unsupervised\test_mixed.csv"
df_test = pd.read_csv(test_data_path, low_memory=False)

# Drop the non-numeric/problematic columns
df_test = df_test.drop(columns=["http.request.method"], errors="ignore")

#Take the labels
y_true = df_test["Attack_label"].copy()

ae_local_results = {}

for client_id in local_autoencoders.keys():
    model = local_autoencoders[client_id]
    scaler = local_minmax_scalers[client_id]

    #  Extract and align test features
    test_columns = scaler.feature_names_in_
    X_test = df_test[test_columns].copy()

    #Clean and fill any missing values just in case 
    for col in X_test.columns:
        if X_test[col].dtype == 'object' or X_test[col].dtype.name == 'category':
            X_test[col] = pd.to_numeric(X_test[col], errors='coerce')
        if X_test[col].dtype in ['float64', 'int64']:
            X_test[col] = X_test[col].fillna(X_test[col].mean())

    # Scale test data using client's scaler 
    X_test_scaled = scaler.transform(X_test)

    # Run reconstruction and compute reconstruction errors
    start_time = time.time()
    X_reconstructed = model.predict(X_test_scaled, verbose=0)
    end_time = time.time()
    reconstruction_errors = np.mean(np.square(X_test_scaled - X_reconstructed), axis=1)

    # Tune threshold for best F1 score (uses test-set labels, so the resulting metrics are optimistic)
    thresholds = np.percentile(reconstruction_errors, np.arange(80, 100, 0.1))
    best_f1 = 0
    best_threshold = None
    best_pred = None
    for thresh in thresholds:
        preds = (reconstruction_errors > thresh).astype(int)
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = thresh
            best_pred = preds

    # Final metrics
    accuracy = accuracy_score(y_true, best_pred)
    precision = precision_score(y_true, best_pred)
    recall = recall_score(y_true, best_pred)
    f1 = best_f1
    cm = confusion_matrix(y_true, best_pred)
    tn, fp, fn, tp = cm.ravel()
    fp_rate = fp / (fp + tn) * 100
    fn_rate = fn / (fn + tp) * 100
    total_time = end_time - start_time
    inference_time_per_sample = (total_time / len(X_test_scaled)) * 1000

    #Store the metrics 
    ae_local_results[client_id] = {
        "Best Threshold": best_threshold,
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
        "False Positives": fp,
        "False Negatives": fn,
        "FP Rate (%)": fp_rate,
        "FN Rate (%)": fn_rate,
        "Inference Time (s)": total_time,
        "Time per Sample (ms)": inference_time_per_sample
    }

#Convert results to DataFrame and save
ae_local_results_df = pd.DataFrame(ae_local_results).T.round(4)
save_path = r"results\local_model_evaluation_autoencoder.csv"
ae_local_results_df.to_csv(save_path, index=True)
ae_local_results_df


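| Client   | Best Threshold | Accuracy | Precision | Recall | F1 Score | False Positives | False Negatives | FP Rate (%) | FN Rate (%) | Inference Time (s) | Time per Sample (ms) |
|----------|----------------|----------|-----------|--------|----------|-----------------|-----------------|-------------|-------------|--------------------|----------------------|
| client_1 | 0.1002         | 0.8529   | 0.9989    | 0.4593 | 0.6292   | 318.0           | 326154.0        | 0.0197      | 54.0714     | 48.1185            | 0.0217               |
| client_2 | 0.0246         | 0.8876   | 0.986     | 0.5948 | 0.742    | 5089.0          | 244391.0        | 0.315       | 40.5164     | 50.6459            | 0.0228               |
| client_3 | 0.1002         | 0.8524   | 0.9968    | 0.4584 | 0.628    | 875.0           | 326711.0        | 0.0542      | 54.1638     | 50.1541            | 0.0226               |
| client_4 | 0.0489         | 0.8681   | 0.9961    | 0.5166 | 0.6804   | 1220.0          | 291555.0        | 0.0755      | 48.3354     | 50.7439            | 0.0229               |
| client_5 | 0.0738         | 0.8535   | 0.9975    | 0.4623 | 0.6318   | 693.0           | 324311.0        | 0.0429      | 53.7659     | 49.8854            | 0.0225               |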


In [14]:
import tensorflow as tf
import numpy as np
import os
from tensorflow.keras.models import load_model

# Define the architecture
def build_autoencoder(input_dim):
    input_layer = tf.keras.Input(shape=(input_dim,))
    # Encoder
    encoded = tf.keras.layers.Dense(32, activation='relu')(input_layer)
    encoded = tf.keras.layers.Dense(16, activation='relu')(encoded)
    encoded = tf.keras.layers.Dense(8, activation='relu')(encoded)
    # Decoder
    decoded = tf.keras.layers.Dense(16, activation='relu')(encoded)
    decoded = tf.keras.layers.Dense(32, activation='relu')(decoded)
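    # Output activation must match the client autoencoders (sigmoid, see In [10]);
    # a linear output would evaluate the averaged weights on a different output scale.
    output_layer = tf.keras.layers.Dense(input_dim, activation='sigmoid')(decoded)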
    autoencoder = tf.keras.Model(inputs=input_layer, outputs=output_layer)
    autoencoder.compile(optimizer='adam', loss='mse')
    return autoencoder

# Load all 5 client models 
client_model_paths = [
    f"D:/August-Thesis/FL-IDS-Surveillance/notebooks/results/models/unsupervised/clients/autoencoder_client_client_{i}.h5"
    for i in range(1, 6)
]

# Perform Federated Averaging (unweighted mean; the five clients have near-equal sample counts)
client_models = [load_model(path, compile=False) for path in client_model_paths]
client_weights = [model.get_weights() for model in client_models]

avg_weights = []
for weights_list_tuple in zip(*client_weights):
    avg_layer_weights = np.mean(np.array(weights_list_tuple), axis=0)
    avg_weights.append(avg_layer_weights)

#Create a new model and set averaged weights
input_dim = client_models[0].input_shape[1]
federated_autoencoder = build_autoencoder(input_dim)
federated_autoencoder.set_weights(avg_weights)

#Save the federated model
fed_model_path = r"results/models/unsupervised/federated/federated_autoencoder.h5"
os.makedirs(os.path.dirname(fed_model_path), exist_ok=True)
federated_autoencoder.save(fed_model_path)
print(f"Federated Autoencoder saved to:\n{fed_model_path}")




Federated Autoencoder saved to:
results/models/unsupervised/federated/federated_autoencoder.h5
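The averaging cell above gives every client the same weight. FedAvg is usually formulated with each client weighted by its number of local samples; with 323,128 to 323,131 rows per client here, the two variants are nearly identical. A minimal sketch of the weighted form follows, operating on lists of NumPy arrays such as those returned by Keras `get_weights()`. The function name `fedavg` and the toy weights are assumptions for illustration.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Average per-layer weight arrays, weighting each client by its sample count."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()  # one coefficient per client, summing to 1

    averaged = []
    for layer_group in zip(*client_weights):      # same layer across all clients
        stacked = np.stack(layer_group)           # (n_clients, *layer_shape)
        averaged.append(np.tensordot(coeffs, stacked, axes=1))
    return averaged

# Two toy "models", each with one kernel and one bias array.
client_a = [np.array([[1.0, 2.0]]), np.array([0.0])]
client_b = [np.array([[3.0, 6.0]]), np.array([4.0])]

merged = fedavg([client_a, client_b], client_sizes=[100, 300])
print(merged)
```

The result could be passed to `set_weights` on a model with the same architecture. This is still a single aggregation of independently trained models, not iterative federated training with a shared starting point.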


In [15]:
import pandas as pd
import numpy as np
import time
from tensorflow.keras.models import load_model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load federated autoencoder model
fed_model_path = r"results/models/unsupervised/federated/federated_autoencoder.h5"
federated_autoencoder = load_model(fed_model_path, compile=False)

# Load test dataset
test_path = r"D:/August-Thesis/FL-IDS-Surveillance/data/processed/surv_unsupervised/test_mixed.csv"
df_test = pd.read_csv(test_path, low_memory=False)

# Drop label and problematic/non-numeric columns
X_test = df_test.drop(columns=["Attack_label", "http.request.method"], errors="ignore").select_dtypes(include="number")
y_true = df_test["Attack_label"]

# Scale test data using the MinMaxScaler from client_1 (this setup has no federated scaler),
# after aligning the columns to the order that scaler was fitted on
scaler = local_minmax_scalers["client_1"]
X_test = X_test[scaler.feature_names_in_]
X_test_scaled = scaler.transform(X_test)

# Run reconstruction
start_time = time.time()
X_reconstructed = federated_autoencoder.predict(X_test_scaled)
end_time = time.time()

# Compute reconstruction error
reconstruction_errors = np.mean(np.square(X_test_scaled - X_reconstructed), axis=1)

# Find the threshold that maximizes F1 (tuned on the test labels, so the reported scores are optimistic)
thresholds = np.percentile(reconstruction_errors, np.arange(80, 100, 0.1))
best_f1 = 0
best_threshold = None
best_pred = None
for thresh in thresholds:
    preds = (reconstruction_errors > thresh).astype(int)
    f1 = f1_score(y_true, preds)
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = thresh
        best_pred = preds

# Compute final metrics
accuracy = accuracy_score(y_true, best_pred)
precision = precision_score(y_true, best_pred)
recall = recall_score(y_true, best_pred)
f1 = best_f1
tn, fp, fn, tp = confusion_matrix(y_true, best_pred).ravel()
fp_rate = 100 * fp / (fp + tn)
fn_rate = 100 * fn / (fn + tp)
total_time = end_time - start_time
per_sample_time = (total_time / len(X_test)) * 1000

# Output results
print("Federated Autoencoder Evaluation")
print(f"Best Threshold: {best_threshold:.6f}")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"False Positives: {fp} ({fp_rate:.2f}%)")
print(f"False Negatives: {fn} ({fn_rate:.2f}%)")
print(f"Inference Time: {total_time:.2f} s")
print(f"Inference Time per Sample: {per_sample_time:.6f} ms")


[1m69339/69339[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m38s[0m 553us/step
Federated Autoencoder Evaluation
Best Threshold: 23.134763
Accuracy: 0.8381
Precision: 0.9910
Recall: 0.4083
F1 Score: 0.5783
False Positives: 2227 (0.14%)
False Negatives: 356908 (59.17%)
Inference Time: 51.45 s
Inference Time per Sample: 0.023186 ms
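
The sweep above selects the threshold using the test labels, so the reported F1 is a best case. A label-free alternative is to set the threshold from reconstruction errors on held-out normal data, for example at a high percentile. The sketch below uses synthetic errors. `threshold_from_normal_errors`, the 99th percentile, and the synthetic distributions are assumptions for illustration, not the notebook's procedure.

```python
import numpy as np

def threshold_from_normal_errors(normal_errors, percentile=99.0):
    """Return the error value below which `percentile`% of normal samples fall."""
    return float(np.percentile(normal_errors, percentile))

rng = np.random.default_rng(42)
# Synthetic stand-ins for reconstruction errors (not real results)
val_normal_errors = rng.exponential(scale=0.001, size=10_000)  # held-out normal traffic
test_errors = np.concatenate([
    rng.exponential(scale=0.001, size=5_000),  # normal test samples
    rng.exponential(scale=0.02, size=1_000),   # attack test samples with larger errors
])

threshold = threshold_from_normal_errors(val_normal_errors, percentile=99.0)
preds = (test_errors > threshold).astype(int)  # 1 = flagged as anomaly

# By construction, about 1% of held-out normal samples exceed the threshold
print(f"threshold={threshold:.6f}, flagged={preds.sum()} of {preds.size}")
```

The percentile then acts as an explicit false-positive budget on normal traffic rather than a value fitted to the test set.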


### Part 7: Multi-round federated training

In [16]:
# Global Setup & Constants
import numpy as np
import random
import tensorflow as tf
import os

# Function for setting the random seed for reproducibility
def set_random_seed(seed=42):
    np.random.seed(seed)
    random.seed(seed)
    tf.random.set_seed(seed)


set_random_seed(42)

# Paths & constants
BASE_PATH = "D:/August-Thesis/FL-IDS-Surveillance"
MODEL_DIR = os.path.join(BASE_PATH, "notebooks", "results", "models", "unsupervised", "federated")
DATA_DIR = os.path.join(BASE_PATH, "data", "processed", "federated", "unsupervised")

# Client identifiers
CLIENT_IDS = [f"client_{i}" for i in range(1, 6)]
NUM_CLIENTS = len(CLIENT_IDS)

# Training hyperparameters
BATCH_SIZE = 256
EPOCHS_PER_ROUND = 1

os.makedirs(MODEL_DIR, exist_ok=True)


In [17]:
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense

def build_autoencoder(input_dim, latent_dims=(32, 16, 32), activation='relu', output_activation='sigmoid'):
    """
    Build a deep autoencoder model with a configurable hidden-layer structure.

    Parameters:
    - input_dim: int, number of input features
    - latent_dims: sequence of ints, hidden layer sizes from the first encoder
      layer to the last decoder hidden layer (default: 32 -> 16 -> 32)
    - activation: activation function for hidden layers
    - output_activation: activation function for the output layer

    Returns:
    - compiled Keras Model
    """
    
    # Input layer
    input_layer = Input(shape=(input_dim,))
    x = input_layer

    # Encoder layers
    for units in latent_dims[:-1]:
        x = Dense(units, activation=activation)(x)

    # Decoder last hidden layer
    x = Dense(latent_dims[-1], activation=activation)(x)

    # Output layer
    output_layer = Dense(input_dim, activation=output_activation)(x)

    # Build and compile model
    model = Model(inputs=input_layer, outputs=output_layer)
    model.compile(optimizer='adam', loss='mse')

    return model
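
With the default `latent_dims`, the network is a stack of four Dense layers, so its size follows directly from the layer widths. The helper below counts the trainable parameters for a given input width. `dense_param_count` is a hypothetical helper, and the input width of 10 is only an illustrative value because the real feature count depends on the dataset.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected stack.

    layer_sizes: widths from input to output, e.g. [input_dim, 32, 16, 32, input_dim].
    Each Dense layer contributes (fan_in * fan_out) kernel weights plus fan_out biases.
    """
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]))

input_dim = 10  # illustrative value, not the notebook's actual feature count
n_params = dense_param_count([input_dim, 32, 16, 32, input_dim])
print(n_params)  # 1754
print(n_params * 4 / 1024, "KiB as float32")
```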


In [18]:
import os
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def load_and_preprocess_client_data(client_id, columns_to_drop=["Attack_label", "http.request.method"]):
    """
    Load a client's training data, drop unwanted columns, and scale numeric features.

    Parameters:
    - client_id: str, the client identifier
    - columns_to_drop: list of str, columns to exclude from the features

    Returns:
    - X_scaled: np.ndarray, scaled features
    - scaler: fitted MinMaxScaler instance
    - feature_names: list of str, original feature column names
    """
    # Construct file path for the client
    path = os.path.join(DATA_DIR, client_id, "train.csv")
    
    # Load the CSV file
    df = pd.read_csv(path, low_memory=False)
    
    # Drop unwanted columns
    X = df.drop(columns=columns_to_drop, errors="ignore")
    X = X.select_dtypes(include="number")  # Keep only numeric features
    
    # Fit MinMaxScaler on features
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)
    
    return X_scaled, scaler, X.columns.tolist()
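
Because each client fits its own scaler, the same raw value can map to different scaled values on different clients. One possible alternative, not used in this notebook, is to share only per-feature minima and maxima and combine them into a single global range. The function names below are hypothetical and the arrays are toy data.

```python
import numpy as np

def global_minmax_from_clients(client_mins, client_maxs):
    """Combine per-client feature minima/maxima into one global range."""
    g_min = np.min(np.stack(client_mins), axis=0)
    g_max = np.max(np.stack(client_maxs), axis=0)
    return g_min, g_max

def minmax_transform(X, g_min, g_max):
    """Scale X to [0, 1] per feature; constant features map to 0."""
    span = g_max - g_min
    span = np.where(span == 0, 1.0, span)  # avoid division by zero
    return (X - g_min) / span

# Toy data for two clients with two features each
X_a = np.array([[0.0, 10.0], [5.0, 20.0]])
X_b = np.array([[2.0, 0.0], [10.0, 15.0]])

g_min, g_max = global_minmax_from_clients(
    [X_a.min(axis=0), X_b.min(axis=0)],
    [X_a.max(axis=0), X_b.max(axis=0)],
)
print(g_min, g_max)
print(minmax_transform(X_a, g_min, g_max))
```

Only two vectors per client leave the device in this scheme, and every client and the test set would then share one feature scale.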


In [19]:
from tensorflow.keras.models import clone_model
import numpy as np

def federated_training_round(global_model, client_ids=CLIENT_IDS, batch_size=BATCH_SIZE, epochs=EPOCHS_PER_ROUND):
    """
    Perform one round of federated training by training the global model locally
    on each client and averaging the weights.

    Parameters:
    - global_model: Keras Model, the current global model
    - client_ids: list of str, client identifiers
    - batch_size: int, batch size for local training
    - epochs: int, number of local epochs per client

    Returns:
    - averaged_weights: list of numpy arrays, the federated-averaged weights
    """
    local_weights = []

    for client_id in client_ids:
        print(f"Training on {client_id}...")

        # Load & preprocess client data
        X_client, _, _ = load_and_preprocess_client_data(client_id)

        # Clone the global model for local training
        local_model = clone_model(global_model)
        local_model.set_weights(global_model.get_weights())
        local_model.compile(optimizer='adam', loss='mse')

        # Train the local model
        local_model.fit(X_client, X_client, batch_size=batch_size, epochs=epochs, verbose=0)

        # Collect local weights
        local_weights.append(local_model.get_weights())

    # Perform Federated Averaging
    averaged_weights = [np.mean(layer_weights, axis=0) for layer_weights in zip(*local_weights)]

    return averaged_weights


In [22]:
from tensorflow.keras.models import clone_model
import os
import numpy as np

def run_federated_training(input_dim, num_rounds, save_name, client_ids=CLIENT_IDS):
    """
    Perform federated training over multiple rounds on all clients.

    Parameters:
    - input_dim: int, number of features for input
    - num_rounds: int, total number of federated training rounds
    - save_name: str, filename for saving the global model
    - client_ids: list of str, identifiers of clients

    Returns:
    - model_save_path: str, path where the global model is saved
    - global_model: trained Keras model after federated training
    - scaler_client1: scaler fitted on client_1's data (first round)
    """
    # Initialize global model
    global_model = build_autoencoder(input_dim)
    scaler_client1 = None

    for round_num in range(1, num_rounds + 1):
        print(f"\n--- Federated Round {round_num}/{num_rounds} ---")
        local_weights = []

        for client_id in client_ids:
            print(f"Training on {client_id}...")

            # Load & preprocess client data
            X_client, scaler, _ = load_and_preprocess_client_data(client_id)

            # Save the scaler for client_1 (first round) for later use
            if client_id == "client_1" and round_num == 1:
                scaler_client1 = scaler

            # Clone global model for local training
            local_model = clone_model(global_model)
            local_model.set_weights(global_model.get_weights())
            local_model.compile(optimizer='adam', loss='mse')

            # Train locally
            local_model.fit(X_client, X_client, epochs=EPOCHS_PER_ROUND,
                            batch_size=BATCH_SIZE, verbose=0)

            # Collect local weights
            local_weights.append(local_model.get_weights())

        # Federated Averaging
        averaged_weights = [np.mean(weights, axis=0) for weights in zip(*local_weights)]
        global_model.set_weights(averaged_weights)

        # Save the updated global model after each round
        model_save_path = os.path.join(MODEL_DIR, f"{save_name}.h5")
        global_model.save(model_save_path)
        print(f"Model saved to: {model_save_path}")

    return model_save_path, global_model, scaler_client1
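
The round structure in `run_federated_training` (broadcast the global weights, train locally, average, repeat) can be seen in isolation on a toy problem. The sketch below replaces the autoencoder with a linear model trained by gradient descent in NumPy. All names, the learning rate, and the synthetic data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Three toy clients, each with its own noiseless samples of y = X @ true_w
clients = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    clients.append((X, X @ true_w))

def local_update(w, X, y, lr=0.1, steps=5):
    """Run a few full-batch gradient steps on the MSE loss, starting from w."""
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

global_w = np.zeros(2)
for round_num in range(1, 21):
    # Every client starts from the current global weights
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    # Unweighted FedAvg, as in the notebook
    global_w = np.mean(local_ws, axis=0)

print(global_w)  # approaches true_w as rounds accumulate
```

Each round only moves the global weights part of the way, which is consistent with a single round giving a much weaker model than 10 or 20 rounds.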


### Part 8: One round

In [24]:
# Load client_1 data to determine input dimension
X_client1, _, feature_names = load_and_preprocess_client_data("client_1")
input_dim = X_client1.shape[1]

# Run 1 round of federated learning training
model_path_1round, model_1round, scaler_client1 = run_federated_training(
    input_dim=input_dim,
    num_rounds=1,
    save_name="federated_autoencoder_1round"
)



--- Federated Round 1/1 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_1round.h5


In [25]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import numpy as np

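def evaluate_autoencoder(model, X_test_scaled, y_true, thresholds=np.arange(80, 100, 0.1)):
    """
    Sweep reconstruction-error thresholds and return the metrics at the best F1.

    `thresholds` holds percentiles in [80, 100) of the test reconstruction errors,
    not raw error values. Returns an empty dict if no threshold yields F1 > 0.
    """
    # Compute reconstruction errors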
    X_reconstructed = model.predict(X_test_scaled, verbose=0)
    reconstruction_errors = np.mean(np.square(X_test_scaled - X_reconstructed), axis=1)
    
    best_f1 = 0
    best_threshold = None
    best_metrics = {}

    # Sweep thresholds to find best F1
    for thresh in np.percentile(reconstruction_errors, thresholds):
        preds = (reconstruction_errors > thresh).astype(int)
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = thresh
            tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
            best_metrics = {
                "Best Threshold": best_threshold,
                "Accuracy": accuracy_score(y_true, preds),
                "Precision": precision_score(y_true, preds, zero_division=0),
                "Recall": recall_score(y_true, preds, zero_division=0),
                "F1 Score": f1,
                "FP": fp,
                "FN": fn,
                "FP Rate (%)": 100 * fp / (fp + tn),
                "FN Rate (%)": 100 * fn / (fn + tp)
            }

    return best_metrics
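
The rates stored in `best_metrics` follow directly from the confusion-matrix counts. In particular, the FN rate is the complement of recall. The helper below (a hypothetical name, with toy counts) makes those relations explicit.

```python
def rates_from_counts(tn, fp, fn, tp):
    """Derive the reported metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "Accuracy": (tp + tn) / (tn + fp + fn + tp),
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1,
        "FP Rate (%)": 100 * fp / (fp + tn),  # share of normal samples flagged
        "FN Rate (%)": 100 * fn / (fn + tp),  # share of attacks missed
    }

# Toy counts, not taken from the notebook's results
m = rates_from_counts(tn=90, fp=10, fn=20, tp=80)
print(m["Recall"], m["FN Rate (%)"])  # 0.8 20.0
```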


In [26]:
import os
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the centralized test dataset
test_path = os.path.join(BASE_PATH, "data", "processed", "surv_unsupervised", "test_mixed.csv")
df_test = pd.read_csv(test_path, low_memory=False)

X_test = df_test.drop(columns=["Attack_label", "http.request.method"], errors="ignore")
X_test = X_test.select_dtypes(include="number")

# Align test columns with training features
X_test = X_test[feature_names]

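# Scale test data with the MinMaxScaler fitted on client_1's raw training data
# (returned by run_federated_training). X_client1 is already scaled to [0, 1],
# so refitting a scaler on it would leave the test data effectively unscaled.
X_test_scaled = scaler_client1.transform(X_test)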

# Extract true labels
y_true = df_test["Attack_label"].values

# Evaluate the federated Autoencoder (1-round model)
metrics_1round = evaluate_autoencoder(model_1round, X_test_scaled, y_true)
print(metrics_1round)




{'Best Threshold': np.float64(1.9747537200311827e+17), 'Accuracy': 0.5972911898772058, 'Precision': 0.17285422304948317, 'Recall': 0.12716867459892472, 'F1 Score': 0.14653309874894696, 'FP': np.int64(367060), 'FN': np.int64(526484), 'FP Rate (%)': np.float64(22.719127926156954), 'FN Rate (%)': np.float64(87.28313254010753)}
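
A best threshold near 2e17 is many orders of magnitude above what reconstruction errors on [0, 1]-scaled inputs would give, which suggests the test features reached the model on their raw scale. One way this can happen is when a min-max scaler is fit on data that is already scaled: its learned range is [0, 1], so it acts as the identity. The NumPy sketch below illustrates the effect with toy values. `fit_minmax` and `apply_minmax` are hypothetical stand-ins for `MinMaxScaler.fit` and `MinMaxScaler.transform`.

```python
import numpy as np

def fit_minmax(X):
    """Return per-feature (min, max), the statistics a min-max scaler learns."""
    return X.min(axis=0), X.max(axis=0)

def apply_minmax(X, stats):
    """Scale X with previously learned statistics."""
    lo, hi = stats
    return (X - lo) / (hi - lo)

X_train_raw = np.array([[0.0, 100.0], [50.0, 300.0], [100.0, 500.0]])
X_test_raw = np.array([[25.0, 200.0], [1e6, 400.0]])  # second row holds an extreme value

stats_raw = fit_minmax(X_train_raw)
X_train_scaled = apply_minmax(X_train_raw, stats_raw)

# Correct: reuse the statistics learned from the raw training data
print(apply_minmax(X_test_raw, stats_raw))

# Problematic: statistics refit on already-scaled data are min=0 and max=1,
# so the raw test values pass through unchanged
stats_refit = fit_minmax(X_train_scaled)
print(apply_minmax(X_test_raw, stats_refit))
```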


### Part 9: 10 rounds

In [27]:
import time
from sklearn.preprocessing import MinMaxScaler
import os

def train_and_evaluate_federated_model(num_rounds, save_name):
    print(f"\n=== Starting Federated Training with {num_rounds} Rounds ===")

    # Train the federated model and keep the scaler fitted on client_1's raw data
    model_path, trained_model, scaler_client1 = run_federated_training(input_dim, num_rounds, save_name)

    # Prepare centralized test data
    test_path = os.path.join(BASE_PATH, "data", "processed", "surv_unsupervised", "test_mixed.csv")
    df_test = pd.read_csv(test_path, low_memory=False)

    # Drop label and problematic columns, select numeric features
    X_test = df_test.drop(columns=["Attack_label", "http.request.method"], errors="ignore")
    X_test = X_test.select_dtypes(include="number")
    X_test = X_test[feature_names]  # Align columns with training data

    # Scale test data using client_1's scaler. Do not refit on X_client1:
    # it is already scaled to [0, 1], so a scaler fitted on it is effectively the identity.
    X_test_scaled = scaler_client1.transform(X_test)

    # True labels
    y_true = df_test["Attack_label"].values

    # Evaluate model; the timed span covers prediction and the threshold sweep
    start_time = time.time()
    metrics = evaluate_autoencoder(trained_model, X_test_scaled, y_true)
    end_time = time.time()

    # Compute model size and per-sample inference time
    model_size_mb = os.path.getsize(model_path) / (1024 ** 2)
    inference_time = end_time - start_time
    inference_time_per_sample = (inference_time / len(X_test_scaled)) * 1000

    # Print report
    print(f"\n=== Final Federated Autoencoder (FedAvg, {num_rounds} Rounds) ===")
    print(f"Best Threshold: {metrics['Best Threshold']:.6f}")
    print(f"Accuracy: {metrics['Accuracy']:.4f}")
    print(f"Precision: {metrics['Precision']:.4f}")
    print(f"Recall: {metrics['Recall']:.4f}")
    print(f"F1 Score: {metrics['F1 Score']:.4f}")
    print(f"False Positives: {metrics['FP']} ({metrics['FP Rate (%)']:.2f}%)")
    print(f"False Negatives: {metrics['FN']} ({metrics['FN Rate (%)']:.2f}%)")
    print(f"Model Size: {model_size_mb:.2f} MB")
    print(f"Inference Time: {inference_time:.2f} seconds")
    print(f"Inference Time per Sample: {inference_time_per_sample:.6f} ms")

    return metrics


In [28]:
import os
import time
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import numpy as np

def evaluate_and_report(model, model_path, X_test_scaled, y_true):
    start_time = time.time()

    # Compute reconstruction errors
    X_reconstructed = model.predict(X_test_scaled, verbose=0)
    reconstruction_errors = np.mean(np.square(X_test_scaled - X_reconstructed), axis=1)

    end_time = time.time()

    # Tune threshold for best F1 score
    thresholds = np.percentile(reconstruction_errors, np.arange(80, 100, 0.1))
    best_f1 = 0
    best_metrics = {}
    best_threshold = None

    for thresh in thresholds:
        preds = (reconstruction_errors > thresh).astype(int)
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = thresh
            tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
            best_metrics = {
                "Best Threshold": best_threshold,
                "Accuracy": accuracy_score(y_true, preds),
                "Precision": precision_score(y_true, preds),
                "Recall": recall_score(y_true, preds),
                "F1 Score": f1,
                "FP": fp,
                "FN": fn,
                "FP Rate (%)": 100 * fp / (fp + tn),
                "FN Rate (%)": 100 * fn / (fn + tp)
            }

    # Model size and inference timing
    model_size_mb = os.path.getsize(model_path) / (1024 ** 2)
    total_time = end_time - start_time
    per_sample_time = (total_time / len(X_test_scaled)) * 1000

    # Print report
    print("\n=== Final Federated Autoencoder Evaluation ===")
    print(f"Best Threshold: {best_metrics['Best Threshold']:.6f}")
    print(f"Accuracy: {best_metrics['Accuracy']:.4f}")
    print(f"Precision: {best_metrics['Precision']:.4f}")
    print(f"Recall: {best_metrics['Recall']:.4f}")
    print(f"F1 Score: {best_metrics['F1 Score']:.4f}")
    print(f"False Positives: {best_metrics['FP']} ({best_metrics['FP Rate (%)']:.2f}%)")
    print(f"False Negatives: {best_metrics['FN']} ({best_metrics['FN Rate (%)']:.2f}%)")
    print(f"Model Size: {model_size_mb:.2f} MB")
    print(f"Inference Time: {total_time:.2f} seconds")
    print(f"Inference Time per Sample: {per_sample_time:.6f} ms")

    return best_metrics


In [29]:
# Train the federated autoencoder for 10 rounds and get the scaler
model_path_10, model_10, scaler_client1 = run_federated_training(
    input_dim,
    num_rounds=10,
    save_name="federated_autoencoder_10rounds"
)

# Prepare centralized test data
df_test = pd.read_csv(
    os.path.join(BASE_PATH, "data", "processed", "surv_unsupervised", "test_mixed.csv"),
    low_memory=False
)
X_test = df_test.drop(columns=["Attack_label", "http.request.method"], errors="ignore")
X_test = X_test.select_dtypes(include="number")
X_test = X_test[feature_names]  # Align columns with training features
X_test_scaled = scaler_client1.transform(X_test)
y_true = df_test["Attack_label"].values

# Evaluate the trained model with a formatted report
metrics_10 = evaluate_and_report(model_10, model_path_10, X_test_scaled, y_true)



--- Federated Round 1/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_10rounds.h5

--- Federated Round 2/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_10rounds.h5

--- Federated Round 3/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_10rounds.h5

--- Federated Round 4/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_10rounds.h5

--- Federated Round 5/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_10rounds.h5

--- Federated Round 6/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_10rounds.h5

--- Federated Round 7/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_10rounds.h5

--- Federated Round 8/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_10rounds.h5

--- Federated Round 9/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_10rounds.h5

--- Federated Round 10/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...




Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_10rounds.h5

=== Final Federated Autoencoder Evaluation ===
Best Threshold: 0.025937
Accuracy: 0.8735
Precision: 0.9909
Recall: 0.5395
F1 Score: 0.6986
False Positives: 2992 (0.19%)
False Negatives: 277795 (46.05%)
Model Size: 0.04 MB
Inference Time: 44.96 seconds
Inference Time per Sample: 0.020262 ms


### Testing: 10 and 20 rounds

In [30]:
import numpy as np
import random
import tensorflow as tf
import pandas as pd
import os
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import clone_model

def set_random_seed(seed=42):
    np.random.seed(seed)
    random.seed(seed)
    tf.random.set_seed(seed)

set_random_seed(42)

# === Constants & Paths ===
BASE_PATH = "D:/August-Thesis/FL-IDS-Surveillance"
MODEL_SAVE_PATH = os.path.join(BASE_PATH,"notebooks", "results","models" ,"unsupervised","federated", "federated_autoencoder_fedavg_rounds.h5")
DATA_DIR = os.path.join(BASE_PATH, "data", "processed", "federated", "unsupervised")
BATCH_SIZE = 256
EPOCHS_PER_ROUND = 1
NUM_ROUNDS = 10
CLIENT_IDS = [f"client_{i}" for i in range(1, 6)]

os.makedirs(os.path.dirname(MODEL_SAVE_PATH), exist_ok=True)


In [31]:
def build_autoencoder(input_dim):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(input_dim, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='mse')
    return model


In [32]:
def run_federated_training_old_style(input_dim, num_rounds=NUM_ROUNDS, model_save_path=MODEL_SAVE_PATH):
    global_model = build_autoencoder(input_dim)
    scaler_client1 = None

    for round_num in range(1, num_rounds + 1):
        print(f"\n--- Federated Round {round_num}/{num_rounds} ---")
        local_weights = []

        for client_id in CLIENT_IDS:
            print(f"Training on {client_id}...")
            df = pd.read_csv(os.path.join(DATA_DIR, client_id, "train.csv"), low_memory=False)
            X = df.drop(columns=["Attack_label", "http.request.method"], errors="ignore").select_dtypes(include="number")

            scaler = MinMaxScaler()
            X_scaled = scaler.fit_transform(X)

            if client_id == "client_1" and round_num == 1:
                scaler_client1 = scaler  # Save client 1's scaler after first round

            local_model = clone_model(global_model)
            local_model.set_weights(global_model.get_weights())
            local_model.compile(optimizer='adam', loss='mse')
            local_model.fit(X_scaled, X_scaled, batch_size=BATCH_SIZE, epochs=EPOCHS_PER_ROUND, verbose=0)

            local_weights.append(local_model.get_weights())

        averaged_weights = [np.mean(w, axis=0) for w in zip(*local_weights)]
        global_model.set_weights(averaged_weights)

    global_model.save(model_save_path)
    print(f"Model saved to: {model_save_path}")

    return global_model, scaler_client1


In [33]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import time

def evaluate_and_report(model, model_path, X_test_scaled, y_true, fixed_threshold=None):
    start_time = time.time()
    reconstruction_errors = np.mean(np.square(X_test_scaled - model.predict(X_test_scaled)), axis=1)
    end_time = time.time()

    if fixed_threshold is None:
        thresholds = np.percentile(reconstruction_errors, np.arange(80, 100, 0.1))
        best_f1 = 0
        best_metrics = {}
        for thresh in thresholds:
            preds = (reconstruction_errors > thresh).astype(int)
            f1 = f1_score(y_true, preds)
            if f1 > best_f1:
                best_f1 = f1
                tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
                best_metrics = {
                    "Best Threshold": thresh,
                    "Accuracy": accuracy_score(y_true, preds),
                    "Precision": precision_score(y_true, preds),
                    "Recall": recall_score(y_true, preds),
                    "F1 Score": f1,
                    "FP": fp,
                    "FN": fn,
                    "FP Rate (%)": 100 * fp / (fp + tn),
                    "FN Rate (%)": 100 * fn / (fn + tp)
                }
    else:
        preds = (reconstruction_errors > fixed_threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
        best_metrics = {
            "Best Threshold": fixed_threshold,
            "Accuracy": accuracy_score(y_true, preds),
            "Precision": precision_score(y_true, preds),
            "Recall": recall_score(y_true, preds),
            "F1 Score": f1_score(y_true, preds),
            "FP": fp,
            "FN": fn,
            "FP Rate (%)": 100 * fp / (fp + tn),
            "FN Rate (%)": 100 * fn / (fn + tp)
        }

    model_size_mb = os.path.getsize(model_path) / (1024 ** 2)
    total_time = end_time - start_time
    per_sample_time = (total_time / len(X_test_scaled)) * 1000

    print("\n=== Final Federated Autoencoder Evaluation ===")
    for k, v in best_metrics.items():
        if isinstance(v, float):
            print(f"{k}: {v:.6f}" if "Threshold" in k else f"{k}: {v:.4f}")
        else:
            print(f"{k}: {v}")
    print(f"Model Size: {model_size_mb:.2f} MB")
    print(f"Inference Time: {total_time:.2f} seconds")
    print(f"Inference Time per Sample: {per_sample_time:.6f} ms")

    return best_metrics


In [34]:
# Get input_dim from client_1 (once at start)
df_sample = pd.read_csv(os.path.join(DATA_DIR, "client_1", "train.csv"), low_memory=False)
X_sample = df_sample.drop(columns=["Attack_label", "http.request.method"], errors="ignore").select_dtypes(include="number")
input_dim = X_sample.shape[1]

# Train Model & Get Client 1 Scaler
trained_model, scaler_client1 = run_federated_training_old_style(input_dim, num_rounds=10)

# Prepare Test Data using Client 1 Scaler
df_test = pd.read_csv(os.path.join(BASE_PATH, "data", "processed", "surv_unsupervised", "test_mixed.csv"), low_memory=False)
X_test = df_test.drop(columns=["Attack_label", "http.request.method"], errors="ignore").select_dtypes(include="number")
X_test = X_test[X_sample.columns]  # Ensure column alignment
X_test_scaled = scaler_client1.transform(X_test)
y_true = df_test["Attack_label"].values

# Evaluate & Print Report
evaluate_and_report(trained_model, MODEL_SAVE_PATH, X_test_scaled, y_true)





--- Federated Round 1/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 2/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 3/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 4/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 5/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 6/10 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 7/10 ---
Training on client_1...
Training on client_2...
Training on client_3..



Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_fedavg_rounds.h5
[1m69339/69339[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m32s[0m 464us/step

=== Final Federated Autoencoder Evaluation ===
Best Threshold: 0.000958
Accuracy: 0.9039
Precision: 0.9394
Recall: 0.6911
F1 Score: 0.7964
FP: 26876
FN: 186300
FP Rate (%): 1.6635
FN Rate (%): 30.8857
Model Size: 0.04 MB
Inference Time: 52.60 seconds
Inference Time per Sample: 0.023708 ms


{'Best Threshold': np.float64(0.0009578974776206702),
 'Accuracy': 0.9039243134006419,
 'Precision': 0.9394366863691983,
 'Recall': 0.6911426065707214,
 'F1 Score': 0.7963853373296732,
 'FP': np.int64(26876),
 'FN': np.int64(186300),
 'FP Rate (%)': np.float64(1.6634863023576372),
 'FN Rate (%)': np.float64(30.885739342927863)}

In [35]:
# Train model for 20 rounds & get scaler
# (this run saves to MODEL_SAVE_PATH again, overwriting the 10-round model)
trained_model_20, scaler_client1_20 = run_federated_training_old_style(input_dim, num_rounds=20)

# Prepare test data using client 1's scaler
df_test = pd.read_csv(os.path.join(BASE_PATH, "data", "processed", "surv_unsupervised", "test_mixed.csv"), low_memory=False)
X_test = df_test.drop(columns=["Attack_label", "http.request.method"], errors="ignore").select_dtypes(include="number")
X_test = X_test[X_sample.columns]  
X_test_scaled_20 = scaler_client1_20.transform(X_test)
y_true_20 = df_test["Attack_label"].values

# Evaluate & Print Report
evaluate_and_report(trained_model_20, MODEL_SAVE_PATH, X_test_scaled_20, y_true_20)





--- Federated Round 1/20 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 2/20 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 3/20 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 4/20 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 5/20 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 6/20 ---
Training on client_1...
Training on client_2...
Training on client_3...
Training on client_4...
Training on client_5...

--- Federated Round 7/20 ---
Training on client_1...
Training on client_2...
Training on client_3..



Model saved to: D:/August-Thesis/FL-IDS-Surveillance\notebooks\results\models\unsupervised\federated\federated_autoencoder_fedavg_rounds.h5
[1m69339/69339[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m31s[0m 442us/step

=== Final Federated Autoencoder Evaluation ===
Best Threshold: 0.000639
Accuracy: 0.9232
Precision: 0.9877
Recall: 0.7267
F1 Score: 0.8373
FP: 5453
FN: 164877
FP Rate (%): 0.3375
FN Rate (%): 27.3341
Model Size: 0.04 MB
Inference Time: 43.92 seconds
Inference Time per Sample: 0.019794 ms


{'Best Threshold': np.float64(0.0006388418611237216),
 'Accuracy': 0.9232344555744143,
 'Precision': 0.987712020046556,
 'Recall': 0.7266587200405842,
 'F1 Score': 0.8373096150943973,
 'FP': np.int64(5453),
 'FN': np.int64(164877),
 'FP Rate (%)': np.float64(0.3375126807097855),
 'FN Rate (%)': np.float64(27.334127995941586)}
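
The scores printed in this section can be gathered into one table for the CSV outputs mentioned at the top of the notebook. The sketch below uses only values shown in the outputs above, rounded to four decimals. The column names and the in-memory buffer are assumptions, and a real run would write to a file under `results/`.

```python
import csv
import io

# (run label, F1, recall) as printed in the evaluation outputs of this section
rows = [
    ("One-shot average of local AEs", 0.5783, 0.4083),
    ("Part 8, 1 round", 0.1465, 0.1272),
    ("Part 9, 10 rounds", 0.6986, 0.5395),
    ("Testing, 10 rounds", 0.7964, 0.6911),
    ("Testing, 20 rounds", 0.8373, 0.7267),
]

# Write the table to an in-memory CSV buffer (stand-in for a file on disk)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["run", "f1", "recall"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Pick the run with the highest F1
best_run = max(rows, key=lambda r: r[1])
print(best_run[0])  # Testing, 20 rounds
```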