# CIDS-Sim with Non-IID data

Federated Learning involves collaboration among multiple clients to learn from decentralized data. In the context of CIDS, each client acts as a detector unit distributed across various networks, while the central server, responsible for aggregating the models, functions as the correlation unit. This research employs a non-IID (Non-Independent and Identically Distributed) data setting to distribute data across different clients. The Federated Averaging (FedAvg) algorithm is used to aggregate models from multiple clients.

## CIDS Architecture

<p align="center">
    <img width="699" alt="image" src="/images/arch_CIDS-Sim_Non-IID.png">
</p>


## Dataset

This simulator uses the Coordinated Attack dataset with [NetFlow (NF) features](https://drive.google.com/file/d/1xioZGRQKYbrpiBhkd56sHxFtdn4rEiTt/view) and [CICFlowMeter (CIC) features](https://www.unb.ca/cic/datasets/ids-2018.html). Download the NF dataset from [here](https://www.kaggle.com/luminardata) and the CIC dataset from [here](https://www.kaggle.com/luminardata).

## Other Information

The simulator performs binary classification, so each traffic record is labeled as normal (0) or anomaly (1).

---

First, import libraries

---

In [None]:
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.metrics import Recall, Precision

from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import matplotlib.pyplot as plt

---

Load the Coordinated Attack dataset. Using a `.parquet` file is recommended because it is faster to read than CSV.

Reference:
1. https://parquet.apache.org/docs/overview/
2. https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html
3. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html 

Use `CoAt-Set_NF.parquet` or `CoAt-Set_NF.csv` for the dataset with NF features.

Use `CoAt-Set_CIC.parquet` or `CoAt-Set_CIC.csv` for the dataset with CIC features.

---

In [None]:
# Use this to read dataset using parquet file (default)
df = pd.read_parquet('./dataset/CoAt-Set_NF.parquet', engine='pyarrow')

# Use this if you want to read dataset using CSV file
# df = pd.read_csv('./dataset/CoAt-Set_NF.csv')

---

View the dataset information

---

In [None]:
df.info()

---

Check the distribution of normal (0) and anomaly (1) traffic.

---

In [None]:
df['Label'].value_counts()

---

X and y are used to represent the input features and the corresponding target labels, respectively.

---

In [None]:
X_df = df.drop(columns=['Label'])
y_df = df['Label']

---

Scale the data so that all features have values in a comparable range.

---

In [None]:
# Use this scaler for NF dataset
scaler = QuantileTransformer(output_distribution='normal')

# Use this scaler for CIC dataset
#scaler = StandardScaler()

In [None]:
X_df_scl = scaler.fit_transform(X_df)
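
To make the effect of scaling concrete, here is a minimal NumPy sketch of what standardization (the `StandardScaler` option) does to two features with very different ranges. The toy array and its column meanings are made up for illustration. The quantile-based scaler used for the NF dataset works differently and is not reproduced here.

```python
import numpy as np

# Toy feature matrix (hypothetical values): column 0 is a byte count,
# column 1 is a duration in seconds, so the raw ranges differ widely.
X_toy = np.array([
    [1200.0, 0.5],
    [56000.0, 2.0],
    [300.0, 0.1],
    [98000.0, 7.5],
])

# Standardization: subtract the column mean and divide by the column
# standard deviation (population std, ddof=0, as StandardScaler does).
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)
X_std = (X_toy - mean) / std

print(X_std.mean(axis=0))  # close to 0 for every column
print(X_std.std(axis=0))   # close to 1 for every column
```

After this transformation both columns contribute on a similar scale, so the large byte counts no longer dominate the small durations.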

---

This Python function, `load_data(client_id)`, is designed to load a portion of data for a specific client in a Federated Learning setting where non-IID data is distributed among different clients. Here's a step-by-step explanation of how the code works:

### Function Purpose:
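- The function returns a small random subset of the dataset (`X_df_scl` and `y_df`) for the given `client_id`. The shuffle is seeded with the `client_id`, so each client gets its own subset and the same `client_id` always yields the same subset. This is how the simulator gives every client its own local data in the non-IID setting. Because each client samples independently from the full dataset, the subsets of different clients can share some rows.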

### Line-by-Line Explanation:
1. **`def load_data(client_id):`**
   - The function `load_data` takes a single argument, `client_id`, which is a unique identifier for each client in a distributed learning setting.

2. **`np.random.seed(client_id)`**
   - This sets the random seed using the `client_id`, ensuring that every client receives a unique, reproducible subset of the data. Using the same `client_id` will result in the same split each time.

3. **`indices = np.arange(len(X_df_scl))`**
   - This creates an array of indices that represent the positions of data samples in the dataset `X_df_scl`. `np.arange(len(X_df_scl))` generates indices from 0 to the number of rows in `X_df_scl`.

4. **`np.random.shuffle(indices)`**
   - This shuffles the array of indices randomly, ensuring that data is selected in a non-sequential manner, making the data split for each client random and unique.

5. **`fraction = 0.02`**
   - This sets a fraction (2%) of the total dataset that will be assigned to the client. In this case, each client receives 2% of the total dataset.

6. **`client_data_size = int(fraction * len(X_df_scl))`**
   - This calculates the number of data samples the client will receive by multiplying the fraction (2%) by the total number of samples in `X_df_scl`.

7. **`client_indices = indices[:client_data_size]`**
   - This selects the first `client_data_size` indices from the shuffled `indices` array, which correspond to the data samples that will be assigned to the client.

8. **`X_client = X_df_scl[client_indices]`**
   - This selects the features (input data) for the client using the previously selected indices from the `X_df_scl` dataset, resulting in `X_client`, the feature data for the client.

9. **`y_client = y_df.iloc[client_indices]`**
   - This selects the corresponding labels (target data) from `y_df` using the same indices, resulting in `y_client`, the label data for the client.

10. **`return X_client, y_client`**
    - Finally, the function returns the subset of features (`X_client`) and labels (`y_client`) for the given client.

---

In [None]:
def load_data(client_id):
    
    # Create non-IID splits based on client_id
    np.random.seed(client_id)
    indices = np.arange(len(X_df_scl))
    np.random.shuffle(indices)

    # Choose a fraction of the data for this client
    fraction = 0.02
    client_data_size = int(fraction * len(X_df_scl))
    client_indices = indices[:client_data_size]

    X_client = X_df_scl[client_indices]
    y_client = y_df.iloc[client_indices]

    return X_client, y_client
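
A quick way to see what this sampling scheme produces is to replay it on synthetic indices and labels. The sketch below is an illustration under assumptions, not part of the simulator: `sample_client_indices` is a hypothetical helper that mirrors the seeding and slicing in `load_data`, and the labels are randomly generated with a made-up anomaly ratio. It checks that a client's subset is reproducible and reports how much two clients overlap and how their anomaly ratios compare.

```python
import numpy as np

def sample_client_indices(client_id, n_rows, fraction=0.02):
    """Hypothetical helper mirroring load_data: seed, shuffle, take a slice."""
    rng = np.random.RandomState(client_id)  # local RNG instead of the global seed
    indices = np.arange(n_rows)
    rng.shuffle(indices)
    return indices[: int(fraction * n_rows)]

n_rows = 10_000
# Synthetic binary labels with roughly 30% anomalies (made-up ratio)
labels = (np.random.RandomState(123).rand(n_rows) < 0.3).astype(int)

idx_a = sample_client_indices(0, n_rows)
idx_a_again = sample_client_indices(0, n_rows)
idx_b = sample_client_indices(1, n_rows)

print("same seed, same subset:", np.array_equal(idx_a, idx_a_again))
print("rows shared by client 0 and 1:", len(np.intersect1d(idx_a, idx_b)))
print("anomaly ratio client 0:", labels[idx_a].mean())
print("anomaly ratio client 1:", labels[idx_b].mean())
```

Running a check like this on the real labels is a simple way to inspect how different the clients' class distributions actually are before starting a simulation.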

The `create_model()` function defines a simple neural network model using Keras, following the pyramid method for the number of layers, where each successive layer has fewer neurons than the previous one. Here’s a breakdown:

### Key Components:

1. **Function Definition:**
   - `create_model(input_shape)` takes `input_shape` as an argument, representing the shape of the input data. It defines a neural network model with a series of fully connected (dense) layers.

2. **Keras Sequential Model:**
   - `keras.Sequential()` is a Keras model where layers are stacked linearly, one after the other.

3. **Layers (Pyramid Structure):**
   - The pyramid method refers to gradually reducing the number of neurons (or units) as the layers progress deeper in the network, resembling a pyramid shape. This can help the model learn hierarchical representations of the data, reducing the dimensionality step by step.
   
   Here, the layers follow this pattern:
   
   - **Input Layer:** 
     - `layers.Dense(20, activation='relu', input_shape=(input_shape,))` — The first layer has 20 neurons and expects the input shape specified by `input_shape`. The activation function is ReLU (Rectified Linear Unit), commonly used to introduce non-linearity.
   
   - **Hidden Layers:**
     - `layers.Dense(10, activation='relu')` — The second layer has 10 neurons and uses ReLU.
     - `layers.Dense(5, activation='relu')` — The third layer has 5 neurons and uses ReLU.
     - `layers.Dense(3, activation='relu')` — The fourth layer has 3 neurons and uses ReLU.

   - **Output Layer:**
     - `layers.Dense(1, activation='sigmoid')` — The final output layer has 1 neuron and uses the sigmoid activation function, which is typically used for binary classification (as the output will be between 0 and 1).

4. **Model Compilation:**
   - The model is compiled with the following parameters:
     - **Loss Function:** `mean_squared_error`, the mean of the squared differences between predicted and actual values. For classification tasks, `binary_crossentropy` is the more common choice.
     - **Optimizer:** `sgd` (Stochastic Gradient Descent), used for updating the weights during training.
     - **Metrics:** `accuracy`, `Precision()`, and `Recall()`. Keras reports them after the loss in the order they are listed, so this order has to match how the results of `model.evaluate` are unpacked. Accuracy is the share of correct predictions. Precision is the share of predicted anomalies that are real anomalies, and recall is the share of real anomalies that are detected, which makes both more informative than accuracy on imbalanced datasets.
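
To illustrate the remark about the loss function, the following sketch compares mean squared error with binary cross-entropy on a few hand-picked predictions. The labels and probabilities are invented for the example, and the formulas are the standard definitions written out with NumPy rather than taken from Keras.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between labels and predicted probabilities."""
    return float(np.mean((y_true - y_pred) ** 2))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

y_true = np.array([1.0, 1.0, 0.0, 0.0])
confident_right = np.array([0.9, 0.9, 0.1, 0.1])
confident_wrong = np.array([0.1, 0.1, 0.9, 0.9])

for name, y_pred in [("right", confident_right), ("wrong", confident_wrong)]:
    print(name,
          "MSE:", round(mse(y_true, y_pred), 4),
          "BCE:", round(binary_crossentropy(y_true, y_pred), 4))
```

For confidently wrong predictions the squared error stays below 1, while the cross-entropy grows much larger, which is one reason cross-entropy is usually preferred for training classifiers.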

### Pyramid Layer Structure:
- The number of neurons decreases from 20 to 1 across the layers, following a decreasing trend like a pyramid. This structure often helps the model reduce dimensionality and focus on the most important features as it processes data through the layers.
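
As a small worked example of how compact this pyramid is, the sketch below counts the trainable parameters of the dense stack. The input width of 10 features is an assumed placeholder, since the real value depends on which dataset is loaded, and `dense_param_count` is a hypothetical helper written for this illustration.

```python
def dense_param_count(input_dim, layer_units):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    fan_in = input_dim
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total

layer_units = [20, 10, 5, 3, 1]  # layer widths used in create_model
assumed_input_dim = 10           # placeholder; the notebook uses X_df_scl.shape[1]

print(dense_param_count(assumed_input_dim, layer_units))  # 507
```

With these widths the total is `20 * d + 307` parameters for `d` input features, so the model stays small enough to exchange cheaply between clients and server.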

### Overall:
This is a basic feed-forward neural network with a pyramid structure, which is useful for gradually condensing the feature space, moving from broader representations to more specific ones, ultimately leading to a single output. The model is compiled for binary classification with additional performance metrics (recall and precision).
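
For readers who want to see the data flow without Keras, here is a rough NumPy forward pass through the same layer widths. The weights are random and untrained, and the input width of 10 is an assumption, so the outputs only demonstrate the shapes and the effect of the ReLU and sigmoid activations.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, params):
    """Apply ReLU on every hidden layer and sigmoid on the final layer."""
    activation = X
    for i, (W, b) in enumerate(params):
        z = activation @ W + b
        is_last = i == len(params) - 1
        activation = sigmoid(z) if is_last else relu(z)
    return activation

rng = np.random.default_rng(0)
input_dim = 10  # assumed feature count
widths = [input_dim, 20, 10, 5, 3, 1]
# One (weights, bias) pair per dense layer, randomly initialized
params = [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
          for m, n in zip(widths[:-1], widths[1:])]

X_batch = rng.normal(size=(4, input_dim))  # four synthetic samples
probs = forward(X_batch, params)
print(probs.shape)                         # (4, 1)
print((probs > 0.5).astype(int).ravel())   # thresholded 0/1 predictions
```

Each sample ends up as a single value between 0 and 1, which is thresholded at 0.5 to obtain the normal (0) or anomaly (1) label.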

In [None]:
# Define a simple neural network model
def create_model(input_shape):
    model = keras.Sequential([
        layers.Dense(20, activation='relu', input_shape=(input_shape,)),
        layers.Dense(10, activation='relu'),
        layers.Dense(5, activation='relu'),
        layers.Dense(3, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ])
    # Metric order matters: evaluate() returns [loss, accuracy, precision, recall]
    model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy', Precision(), Recall()])
    return model

The `cids_federated_training()` function implements the **training process for a Collaborative Intrusion Detection System (CIDS) using Federated Learning**. The goal is to train a global model based on the local training of models across multiple distributed nodes, without sharing raw data. Here’s a step-by-step explanation of how it works:

### 1. **Function Definition:**
   - **`cids_federated_training(num_nodes=5, num_rounds=5)`**:
     - This function performs federated learning for intrusion detection across `num_nodes` (clients or devices) over `num_rounds` (iterations of federated learning).
     - Each node trains a local model, and the weights from all nodes are averaged to update a global model in each round.

### 2. **Global Model Initialization:**
   - **`input_shape_glob = X_df_scl.shape[1]`**: Gets the input shape for the global model from the dataset (`X_df_scl`).
   - **`global_model = create_model(input_shape=input_shape_glob)`**: Creates a global model using the input shape of the dataset.
   - **`global_weights = global_model.get_weights()`**: Extracts the initial weights of the global model. These will be updated during the training rounds based on local model updates from the nodes.
   - **Performance metrics initialization:** Several lists (`global_accuracies`, `global_precisions`, `global_recalls`, `global_f1s`) are created to store metrics (accuracy, precision, recall, F1-score) for each round of training.

### 3. **Initial Evaluation:**
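   - **`X_test, Y_test = load_data(num_nodes + 1)`**: Loads a test subset for evaluating the global model at each round. It uses the seed `num_nodes + 1`, which no training node uses. Because `load_data` samples from the full dataset, this subset can still share some rows with the nodes' data.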

### 4. **Training Rounds Loop:**
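   - **`for round in range(num_rounds + 1):`**: The loop runs `num_rounds + 1` training rounds, numbered 0 through `num_rounds`. Round 0 is a full training round like the others, so the default `num_rounds=5` produces six rounds.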

### 5. **Local Training Loop (for each node):**
   - **`for node in range(num_nodes):`**: Inside each round, this loop iterates over all nodes (clients). Each node performs local training using its own data.

   - **Local Data Loading and Splitting:**
     - **`X, Y = load_data(node)`**: Loads the local data for the given node.
     - **`X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)`**: Splits the local data into training and validation sets (80% for training and 20% for validation).
   
   - **Model Creation and Training:**
     - **`model = create_model(input_shape=input_shape)`**: A new local model is created with the input shape of the local training data.
     - **`model.set_weights(global_weights)`**: The local model is initialized with the current global model weights.
     - **`model.fit(X_train, Y_train, epochs=1, verbose=0)`**: The local model is trained on the node's training data for one epoch.

   - **Local Model Evaluation:**
     - **`loss, accuracy, precision, recall = model.evaluate(X_val, Y_val, verbose=0)`**: After training, the local model is evaluated on the validation set. Accuracy, precision, and recall are computed.
     - **`f1_score = 2 * (precision * recall) / (precision + recall + 1e-10)`**: The F1-score is calculated using the precision and recall (a harmonic mean of precision and recall).

   - **Local Weights Collection:**
     - **`local_weights.append(model.get_weights())`**: The trained weights of the local model are collected and stored.

### 6. **Global Model Weight Aggregation:**
   - **`new_weights = [np.mean([weight[layer] for weight in local_weights], axis=0) for layer in range(len(global_weights))]`**:
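     - The weights of the local models from all nodes are averaged layer by layer. This is the core step of federated learning, where local updates are combined into a new global model. Because every node receives the same number of samples from `load_data`, this unweighted mean matches FedAvg's sample-weighted average.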
   - **`global_weights = new_weights`**: The global weights are updated with the aggregated (averaged) weights.
   - **`global_model.set_weights(global_weights)`**: The global model is updated with the new aggregated weights.
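
The aggregation step can be sketched in isolation, with plain NumPy arrays standing in for Keras weight lists. `fedavg` below is a hypothetical helper, not a function from the simulator. It adds the optional sample-count weighting from the FedAvg formulation, which reduces to a plain mean when all clients hold the same number of samples.

```python
import numpy as np

def fedavg(local_weights, sample_counts=None):
    """Average per-layer weights across clients, optionally weighted by data size."""
    n_clients = len(local_weights)
    if sample_counts is None:
        sample_counts = [1] * n_clients
    coeffs = np.asarray(sample_counts, dtype=float)
    coeffs = coeffs / coeffs.sum()
    n_layers = len(local_weights[0])
    return [
        sum(c * client[layer] for c, client in zip(coeffs, local_weights))
        for layer in range(n_layers)
    ]

# Two toy clients, each with one 2x2 kernel and one bias vector
client_a = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.0, 1.0])]
client_b = [np.array([[3.0, 4.0], [5.0, 6.0]]), np.array([2.0, 3.0])]

equal = fedavg([client_a, client_b])
weighted = fedavg([client_a, client_b], sample_counts=[300, 100])

print(equal[0])     # element-wise mean of the two kernels
print(weighted[1])  # bias pulled toward client_a, which has more samples
```

The weighted variant only matters when clients contribute different amounts of data. With equal client sizes both calls give the same result as the list comprehension in the training function.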

### 7. **Global Model Evaluation:**
   - After each round, the global model is evaluated on the test set:
     - **`loss, accuracy, precision, recall = global_model.evaluate(X_test, Y_test, verbose=0)`**: The global model, which now holds the aggregated weights, is evaluated on the test data.
     - **`f1_score = 2 * (precision * recall) / (precision + recall + 1e-10)`**: The F1-score for the global model is calculated.
     - **The metrics (`accuracy`, `precision`, `recall`, and `f1_score`) are stored** in their respective lists (`global_accuracies`, `global_precisions`, `global_recalls`, `global_f1s`).
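
The metric arithmetic can be checked by hand with a small confusion-matrix example. The counts below are invented, and `precision_recall_f1` is a hypothetical helper. Unlike the simulator, which takes precision and recall from Keras and guards only the F1 division, this sketch applies the same `1e-10` guard to all three divisions.

```python
def precision_recall_f1(tp, fp, fn, eps=1e-10):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * (precision * recall) / (precision + recall + eps)
    return precision, recall, f1

# Hypothetical round: 80 attacks caught, 20 false alarms, 40 attacks missed
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=0.8000 recall=0.6667 f1=0.7273

# With no positive predictions at all, the guard keeps the result finite
print(precision_recall_f1(tp=0, fp=0, fn=50))  # (0.0, 0.0, 0.0)
```

The second call shows why the guard is useful in early rounds, when an undertrained model may predict no anomalies at all.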

### 8. **Return Statement:**
   - **`return global_model, global_accuracies, global_precisions, global_recalls, global_f1s`**: After the final round, the function returns the final global model and the lists containing the performance metrics for each round.

### Summary:
This function implements a federated learning process for CIDS, where:
- Each node trains a local model on its data, starting from the global model's weights.
- The weights from all nodes are aggregated after each round, updating the global model.
- The global model's performance is tracked over the rounds using accuracy, precision, recall, and F1-score, which shows whether it improves from round to round.
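
To see the whole loop in miniature without TensorFlow, the sketch below replaces the Keras model with a tiny logistic regression and runs a few federated rounds on synthetic, linearly separable data. Everything in it (the data generator, `local_update`, the learning rate, the client sizes) is an assumption chosen for illustration. It mirrors the structure of the training function, not its exact behavior.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_client_data(seed, n=200):
    """Synthetic stand-in for load_data: label is 1 when x0 + x1 > 0."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    return X, y

def local_update(w, X, y, lr=0.5, steps=5):
    """A few full-batch gradient steps of logistic regression starting from w."""
    w = w.copy()
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

def accuracy(w, X, y):
    return float(np.mean((sigmoid(X @ w) > 0.5) == (y == 1.0)))

num_nodes, num_rounds = 5, 5
global_w = np.zeros(2)
X_test, y_test = make_client_data(seed=num_nodes + 1)  # seed unused by any node
history = []

for rnd in range(num_rounds):
    # Each node starts from the current global weights and trains locally
    local_ws = [local_update(global_w, *make_client_data(seed=node))
                for node in range(num_nodes)]
    # Equal client sizes, so the plain mean is the FedAvg aggregate
    global_w = np.mean(local_ws, axis=0)
    history.append(accuracy(global_w, X_test, y_test))

print(history)
```

The list `history` plays the role of `global_accuracies`: one test accuracy per round for the aggregated model.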

In [None]:
# CIDS with federated learning training process

# change num_nodes and num_rounds for your simulation scenario

def cids_federated_training(num_nodes=5, num_rounds=5): 
    input_shape_glob = X_df_scl.shape[1]
    global_model = create_model(input_shape=input_shape_glob)
    global_weights = global_model.get_weights()
    global_accuracies = []
    global_precisions = []
    global_recalls = []
    global_f1s = []

    # Test data used to evaluate the global model after every round
    X_test, Y_test = load_data(num_nodes+1)

    # Training rounds
    for round in range(num_rounds + 1):
        local_weights = []

        for node in range(num_nodes):
            X, Y = load_data(node)
                    
            X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2)
            
            input_shape = X_train.shape[1]

            model = create_model(input_shape=input_shape)
            model.set_weights(global_weights)
            model.fit(X_train, Y_train, epochs=1, verbose=0)

            # Validation
            loss, accuracy, precision, recall = model.evaluate(X_val, Y_val, verbose=0)
            f1_score = 2 * (precision * recall) / (precision + recall + 1e-10)
            print(f"Node {node + 1}: Accuracy {accuracy:.4f} - Precision {precision:.4f} - Recall {recall:.4f} - F1-Score {f1_score:.4f}")

            local_weights.append(model.get_weights())

        # Aggregate weights
        new_weights = [np.mean([weight[layer] for weight in local_weights], axis=0) for layer in range(len(global_weights))]
        global_weights = new_weights

        global_model.set_weights(global_weights)

        # Evaluate the aggregated global model (not the last local model)
        loss, accuracy, precision, recall = global_model.evaluate(X_test, Y_test, verbose=0)
        f1_score = 2 * (precision * recall) / (precision + recall + 1e-10)
        global_accuracies.append(accuracy)
        global_precisions.append(precision)
        global_recalls.append(recall)
        global_f1s.append(f1_score)
        
        print(f"\nRound {round}: Accuracy {accuracy:.4f} - Precision {precision:.4f} - Recall {recall:.4f} - F1-Score {f1_score:.4f}\n")

    return global_model, global_accuracies, global_precisions, global_recalls, global_f1s

---

Run the simulation, then get the global model and the performance metrics for each round.

---

In [None]:
# Run CIDS simulator with Non-IID data from single dataset
print("Simulation for CIDS with Non-IID Data\n")
fl_model, fl_global_accuracies, fl_global_precisions, fl_global_recalls, fl_global_f1s = cids_federated_training()

---

Plot the performance metrics for each round.

---

In [None]:
# Plotting results
plt.figure(figsize=(14, 10))

plt.subplot(4, 2, 1)
plt.plot(fl_global_accuracies, label='Accuracy')
plt.plot(fl_global_precisions, label='Precision')
plt.plot(fl_global_recalls, label='Recall')
plt.plot(fl_global_f1s, label='F1-Score')
plt.xlabel('Round')
plt.ylabel('Value')
plt.title('CIDS Non-IID')
plt.legend()