# Federated Learning with TrustRank-Adjusted Learning

This project uses a TrustRank-like system to detect botnet-like behavior. The aim is to implement a simplified federated learning network that identifies devices potentially affected by malware or a botnet, and then restricts each device according to a trust rating derived from the scores it achieves during federated training.

The TrustRank score reflects how often the model classified each device as infected, measured with several metrics that are explained and implemented later.
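One way to turn "how often a device was flagged" into a score is an exponential moving average over training rounds. The sketch below is only an illustration under stated assumptions: the device names, the per-round verdicts, the starting score of 1.0, and the `decay` value are all hypothetical and do not come from this notebook.

```python
def update_trust(trust, flagged, decay=0.8):
    """Blend the previous trust with the latest verdict (0.0 if flagged, 1.0 if clean)."""
    observation = 0.0 if flagged else 1.0
    return decay * trust + (1.0 - decay) * observation

# Hypothetical per-round verdicts: True means the model flagged the device as infected.
round_verdicts = {
    "camera_1": [False, False, False, False],
    "doorbell_2": [True, True, False, True],
}

trust_scores = {}
for device, verdicts in round_verdicts.items():
    score = 1.0  # assumption: every device starts fully trusted
    for flagged in verdicts:
        score = update_trust(score, flagged)
    trust_scores[device] = score

print(trust_scores)
```

A device that is never flagged keeps a score of 1.0, while repeated flags pull the score toward 0. A larger `decay` makes the score react more slowly to a single bad round.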

In a network setting, these TrustRank scores would determine the level of restriction applied to each device's allowed traffic throughput, packet queue retention, and firewall access.
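The mapping from score to restriction could be a simple tier lookup. This is a sketch only: the thresholds, throughput limits, queue retention times, and firewall profile names below are placeholder assumptions, not values defined anywhere in this project.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Restriction:
    max_throughput_mbps: float  # allowed traffic throughput
    queue_retention_ms: int     # how long packets may wait in the queue
    firewall_profile: str       # name of a hypothetical firewall rule set

# Hypothetical tiers, ordered from most to least trusted: (minimum trust, restriction).
TIERS = [
    (0.8, Restriction(100.0, 500, "open")),
    (0.5, Restriction(10.0, 200, "limited")),
    (0.0, Restriction(1.0, 50, "quarantine")),
]

def restriction_for(trust):
    """Return the restriction of the first tier whose threshold the trust score meets."""
    for min_trust, restriction in TIERS:
        if trust >= min_trust:
            return restriction
    return TIERS[-1][1]  # scores below 0 fall back to the strictest tier

print(restriction_for(0.57))
```

Keeping the tiers in one ordered table means the thresholds can be tuned without touching the lookup logic.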

**Adversarial clients are out of scope: we assume all clients contribute to the model in good faith.**

An essential part of this project is showing the hypothetical impact of restricting botnet-affected devices on the network. In addition, a proof-of-concept alert sent to the network switch would demonstrate the concept effectively.

# TODO

Restart the search for a dataset that can simulate different nodes:<br>
  The dataset must distinguish individual nodes so that nodes that possess malware can be detected.<br>
    Possible datasets:<br>
      IoT-23 (already investigated; would work, but it is a very large amount of data)<br>
      Bot-IoT<br>
      **N-BaIoT** (currently implemented but not working well)<br>
      TON_IoT<br>
      CICIDS2017 (suitability uncertain)<br>

Implement a simple federated learning algorithm in which all nodes analyze each other to determine which nodes show the most botnet-like activity.
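A common starting point for this step is federated averaging (FedAvg), where the server combines client parameters weighted by each client's local sample count. The sketch below uses plain Python lists to stay self-contained; the three clients, their parameter vectors, and their sample counts are made up for illustration. In this notebook the same idea would presumably be applied to the tensors of a PyTorch model, which is an assumption rather than something implemented here.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Hypothetical round: three devices report a 2-parameter model and their sample counts.
client_weights = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]
client_sizes = [100, 100, 200]

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)  # [1.25, 2.5]
```

Weighting by sample count keeps a device with very little data from pulling the global model as strongly as a device with a large local dataset.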

Implement a TrustRank system that uses the machine learning results to determine the trust level of each node. Implement this with a DQN test case.
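For reference, classic TrustRank is a biased PageRank: trust starts at a set of known-good seed nodes and flows along links, with the restart step returning only to the seeds. The sketch below shows that propagation on a tiny graph. The notebook does not define how devices are linked or which nodes are seeds, so the graph, the seed choice, and the `alpha` value are all assumptions made for illustration.

```python
def trustrank(out_links, seeds, alpha=0.85, iterations=50):
    """Biased PageRank: trust flows along out-links and restarts only at seed nodes."""
    nodes = list(out_links)
    seed_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(seed_mass)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * seed_mass[n] for n in nodes}
        for src in nodes:
            targets = out_links[src]
            if not targets:
                continue  # dangling node: its trust is not redistributed in this sketch
            share = alpha * trust[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        trust = nxt
    return trust

# Hypothetical device graph: an edge means "sends traffic to".
graph = {
    "gateway": ["camera", "thermostat"],
    "camera": ["gateway"],
    "thermostat": ["gateway"],
    "unknown": ["camera"],
}

scores = trustrank(graph, seeds={"gateway"})
print(scores)
```

A node that no trusted node links to, such as `unknown` here, receives no trust at all. In this project the classifier's per-device verdicts could inform the seed set or the edge weights, but that design is left open.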

Implement the hyperparameter pipeline. This section will most likely be run locally on my computer rather than here, so it will probably be commented out.
  Include a TXT section explaining the parameters and a file that allows the results to be imported. Use a pipeline for this.
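One lightweight way to make locally tuned results importable is to write them to a small file and load that file in the notebook. The sketch below uses JSON rather than plain TXT because it round-trips Python dictionaries directly; the parameter names, values, and the `best_params.json` file name are placeholders, not tuned results from this project.

```python
import json
import os
import tempfile

# Hypothetical hyperparameters; names and values are placeholders.
params = {"learning_rate": 0.001, "local_epochs": 2, "rounds": 10, "trust_decay": 0.8}

def save_params(path, params):
    """Write the hyperparameters to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(params, fh, indent=2, sort_keys=True)

def load_params(path):
    """Read hyperparameters back from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

# A temporary directory is used here only so the sketch runs anywhere.
path = os.path.join(tempfile.gettempdir(), "best_params.json")
save_params(path, params)
restored = load_params(path)
print(restored)
```

With this layout, the search itself can stay commented out in the notebook while the training loop reads the saved file.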

Implement the actual training loop using the training + all other classes and definitions.

Create the plots and graphs for this.

In [1]:
# Dataset cloning + importing of relevant coding libraries

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tqdm import tqdm
import pandas as pd

## Data Preprocessing for Supervised Learning
- Skip during demonstration

In [6]:
import pandas as pd
import glob
import os
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# Path to your N-BaIoT CSV files
DATA_DIR = "N-BaIoT"
OUTPUT_FILES = [
    "combined_train.csv",
    "combined_test.csv"
] + [f"{i}_{suffix}.csv" for i in range(1, 10) for suffix in ["train", "test"]]  # adjust range if needed

# Clean up old files
for f in OUTPUT_FILES:
    if os.path.exists(f):
        os.remove(f)

tqdm.pandas()
csv_files = glob.glob(os.path.join(DATA_DIR, "*.csv"))
device_data = {}

# Label extractor
def extract_label_from_filename(filename):
    name = os.path.basename(filename).lower()
    if "benign" in name:
        return "benign"
    else:
        return "malicious"  # assumes everything else is an attack

print("Reading and grouping CSV files by device...")
for file in tqdm(csv_files):
    base = os.path.basename(file)
    device_id = base.split('.')[0]
    df = pd.read_csv(file)

    # Add label from filename
    df["label"] = extract_label_from_filename(base)

    # Optional: Drop metadata columns
    df = df.drop(columns=[col for col in df.columns if "device" in col.lower() or "feature" in col.lower() or "file" in col.lower()], errors='ignore')

    if device_id not in device_data:
        device_data[device_id] = []
    device_data[device_id].append(df)

combined_train = []
combined_test = []

print("Splitting data and saving per-device CSVs...")
for device_id in tqdm(device_data):
    device_df = pd.concat(device_data[device_id], ignore_index=True)
    
    # Stratified split by label
    train_df, test_df = train_test_split(device_df, test_size=0.2, random_state=42, shuffle=True, stratify=device_df["label"])

    train_df.to_csv(f"{device_id}_train.csv", index=False)
    test_df.to_csv(f"{device_id}_test.csv", index=False)

    combined_train.append(train_df)
    combined_test.append(test_df)

print("Saving combined train/test files...")
pd.concat(combined_train, ignore_index=True).to_csv("combined_train.csv", index=False)
pd.concat(combined_test, ignore_index=True).to_csv("combined_test.csv", index=False)

print("All done.")

Reading and grouping CSV files by device...


100%|█████████████████████████████████████████████████████████████████████████████████| 92/92 [00:52<00:00,  1.77it/s]


Splitting data and saving per-device CSVs...


100%|█████████████████████████████████████████████████████████████████████████████████| 12/12 [06:49<00:00, 34.16s/it]


Saving combined train/test files...
All done.


In [2]:
# SupLearn Imports
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.impute import SimpleImputer

In [5]:
# Load data
print("Loading training and test data...")
train_df = pd.read_csv("combined_train.csv", low_memory=False)
train_df["label"] = train_df["label"].map({"benign": 0, "malicious": 1})
train_df = train_df.loc[:, ~train_df.columns.str.contains('^Unnamed')]
test_df = pd.read_csv("combined_test.csv",low_memory=False)
test_df["label"] = test_df["label"].map({"benign": 0, "malicious": 1})
test_df = test_df.loc[:, ~test_df.columns.str.contains('^Unnamed')]



# Separate features and labels
X_train = train_df.drop("label", axis=1)
t_train = train_df["label"]
X_test = test_df.drop("label", axis=1)
t_test = test_df["label"]

print("Handling missing values...")
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Feature scaling
print("Scaling features...")
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define baseline models
models = {
    #"Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1),
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=500)
}

# Train and evaluate
print("Training and evaluating models...")
for name in tqdm(models):
    model = models[name]
    model.fit(X_train, t_train)
    preds = model.predict(X_test)
    print(f"\n--- {name} ---")
    print(classification_report(t_test, preds, digits=4))

Loading training and test data...
Handling missing values...


ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- HH_L0.01_std
Feature names seen at fit time, yet now missing:
- label


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from tqdm import tqdm

# Load data
print("Loading training and test data...")
train_df = pd.read_csv("combined_train.csv", low_memory=False)
train_df["label"] = train_df["label"].map({"benign": 0, "malicious": 1})
train_df = train_df.loc[:, ~train_df.columns.str.contains('^Unnamed')]

test_df = pd.read_csv("combined_test.csv", low_memory=False)
test_df["label"] = test_df["label"].map({"benign": 0, "malicious": 1})
test_df = test_df.loc[:, ~test_df.columns.str.contains('^Unnamed')]

# ---- Top 10 feature selection ----
correlations = train_df.corr(numeric_only=True)['label'].abs().sort_values(ascending=False)
top_features = correlations.drop("label").head(10).index.tolist()
print("Top 10 correlated features:", top_features)

# Separate features and labels
X_train = train_df[top_features]
t_train = train_df["label"]
X_test = test_df[top_features]
t_test = test_df["label"]

# Impute missing values
print("Handling missing values...")
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Feature scaling
print("Scaling features...")
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define models
models = {
    # "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1),
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=500)
}

# Train and evaluate
print("Training and evaluating models...")
for name in tqdm(models):
    model = models[name]
    model.fit(X_train, t_train)
    preds = model.predict(X_test)
    print(f"\n--- {name} ---")
    print(classification_report(t_test, preds, digits=4))

Loading training and test data...
Top 10 correlated features: ['HH_L0.01_std', 'HH_L0.1_std', 'HpHp_L0.01_std', 'MI_dir_L0.1_weight', 'H_L0.1_weight', 'MI_dir_L1_weight', 'H_L1_weight', 'MI_dir_L3_weight', 'H_L3_weight', 'MI_dir_L5_weight']
Handling missing values...
Scaling features...
Training and evaluating models...


 33%|███████████████████████████▋                                                       | 1/3 [00:57<01:55, 57.73s/it]


--- Random Forest ---
              precision    recall  f1-score   support

           0     1.0000    0.9995    0.9998    111188
           1     1.0000    1.0000    1.0000   1301381

    accuracy                         1.0000   1412569
   macro avg     1.0000    0.9998    0.9999   1412569
weighted avg     1.0000    1.0000    1.0000   1412569



In [None]:
# Definitions of Buffer Class

In [None]:
print(t_train)

In [None]:
# Definitions of classes for Trustrank calculation

In [None]:
# Hyperparameter pipeline for testing

In [None]:
# Actual Training loop


In [None]:
# Trustrank results

In [None]:
# Graphs creation/Plots/Data Analysis


In [None]:
# Code for model saving, creation of non matplotlib graphs, and gifs

In [None]:
# Display graphs