# Intrusion Detection System (IDS) Public Datasets Benchmarking

In cybersecurity, the design, development, and implementation of effective Intrusion Detection Systems (IDS) are important for safeguarding IT&C infrastructures from unauthorized access, data breaches, and various forms of malicious activity. The selection of an appropriate ML/DL algorithm plays an essential role in ensuring the security and integrity of protected systems.

Before developing a cutting-edge algorithm, however, we need appropriate data, which must be studied and analysed in order to understand the reality and challenges of our ML problem. In line with this approach, we chose to study the early datasets designed for IDS in order to derive lessons learned for future dataset development.

This experiment aims to comprehensively evaluate the performance of different ML and DL algorithms on a variety of datasets, encompassing a wide range of network traffic scenarios. The datasets used for this analysis include well-known benchmark datasets such as KDD, NSL-KDD, CTU-13, ISCXIDS2012, CIC-IDS2017, CSE-CIC-IDS2018, CIDDS-001/CIDDS-002, and Kyoto 2006+. Each dataset represents a distinct set of challenges and characteristics, making this evaluation both diverse and insightful.

The experiment is divided into three main phases:

1. **Data Acquisition and Preprocessing**:
 - In this phase, we acquire the selected datasets from reputable sources, ensuring the integrity and accuracy of the data.
 - Data preprocessing tasks include handling missing values, selecting the most relevant features using feature selection techniques, normalizing the data, and, if necessary, performing feature engineering to enhance the dataset's suitability for machine learning.

2. **Algorithm Evaluation**:
 - We evaluate the performance of a range of ML/DL algorithms on each dataset. The chosen algorithms include baseline methods like ZeroRule and OneRule, traditional machine learning approaches like Naive Bayes and Random Forest, as well as some of the most used anomaly detection deep learning algorithms.
 - Cross-validation is applied to ensure the robustness of our results. Performance metrics such as precision, variance, and Mean Absolute Error (MAE) are calculated for each algorithm and dataset.

3. **Results and Insights**:
 - The results of this evaluation provide valuable insights into the strengths and weaknesses of different IDS algorithms under various conditions.
 - We analyze the performance of algorithms on both the original datasets and balanced datasets to address the challenge of class imbalance in intrusion detection.
 - Observations and additional details regarding the algorithms' performance are documented, providing a comprehensive overview of their behavior.
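
The balancing step mentioned in the third phase is not shown in this part of the notebook. As a rough illustration only, one common option is random undersampling of the majority class. The sketch below uses plain Python to stay dependency-free; the helper name `undersample` and the toy records are hypothetical and are not taken from the experiment.

```python
import random
from collections import defaultdict

def undersample(records, label_key="label", seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    # The smallest class determines how many records each class keeps
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 95 "normal" flows and 5 "attacker" flows
toy = [{"bytes": i, "label": "normal"} for i in range(95)]
toy += [{"bytes": i, "label": "attacker"} for i in range(5)]

balanced = undersample(toy)
n_normal = sum(1 for r in balanced if r["label"] == "normal")
n_attacker = sum(1 for r in balanced if r["label"] == "attacker")
print(n_normal, n_attacker)
```

Undersampling discards most of the majority class, so on a dataset as skewed as CIDDS-002 it trades a large amount of normal traffic for class balance; oversampling or class weights are alternatives with different trade-offs.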

By conducting this experiment, we aim to contribute to the understanding of cyber domain dataset generation. The findings will assist in making informed decisions when developing a cybersecurity AI application, by deriving the necessary steps and procedures for selecting appropriate training data.

The following sections of this Jupyter notebook will provide a detailed walkthrough of the experiment, including code snippets, visualizations, and discussions of the results.

In [2]:
# Mount your Google Drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
import warnings
from google.colab import files

# Suppress all warning messages
warnings.filterwarnings("ignore")

# Check if the Kaggle API credentials file already exists
kaggle_credentials_path = os.path.expanduser("~/.kaggle/kaggle.json")

if not os.path.exists(kaggle_credentials_path):

    if not os.path.exists(os.path.join("/content/drive/MyDrive/.kaggle/", "kaggle.json")):

      # Upload your Kaggle API credentials file (kaggle.json)
      files.upload()

      !mkdir "/content/drive/MyDrive/.kaggle/"
      !mv kaggle.json "/content/drive/MyDrive/.kaggle/"
      !chmod 600 "/content/drive/MyDrive/.kaggle/kaggle.json"

    # Move the Kaggle API Credentials File
    !mkdir -p ~/.kaggle
    !cp '/content/drive/MyDrive/.kaggle/kaggle.json' ~/.kaggle/

else:

    print("Kaggle API credentials file already exists.")

In [4]:
import tensorflow as tf
print("GPU available:", tf.test.is_gpu_available())
print("GPU device name:", tf.test.gpu_device_name())

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


GPU available: True
GPU device name: /device:GPU:0


In [5]:
import os
from psutil import virtual_memory, cpu_count
from tabulate import tabulate

# Function to get CPU information
def get_cpu_info():
    cpu_info = os.popen('lscpu').read()
    return cpu_info

# Function to get RAM information
def get_ram_info():
    ram = virtual_memory()
    total_ram = f"{ram.total / 1e9:.2f} GB"
    available_ram = f"{ram.available / 1e9:.2f} GB"
    return total_ram, available_ram

# Function to get GPU information
def get_gpu_info():
    # Execute nvidia-smi and get its output
    gpu_info = os.popen('nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv,noheader,nounits').read().strip()

    # Split the output to get individual GPU details
    details = gpu_info.split(", ")

    # Return GPU name, total, used, and free memory
    return details[0], f"{details[1]} MB", f"{details[2]} MB", f"{details[3]} MB"

# Collect system information
cpu_info = get_cpu_info()
total_ram, available_ram = get_ram_info()
try:
    gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = get_gpu_info()
except Exception:
    # nvidia-smi is missing or returned unexpected output: report placeholder values
    gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = 'null', 0, 0, 0

# Extract relevant CPU information
cpu_type = ""
cpu_architecture = ""

for line in cpu_info.splitlines():
    if "Model name:" in line:
        cpu_type = line.split(":")[1].strip()
    elif "Architecture:" in line:
        cpu_architecture = line.split(":")[1].strip()

# Get the number of CPU cores
num_cpu_cores = cpu_count(logical=False)

# Create a table
table = [
    ["CPU Type", cpu_type],
    ["CPU Architecture", cpu_architecture],
    ["Number of CPU Cores", num_cpu_cores],
    ["Total RAM", total_ram],
    ["Available RAM", available_ram],
    ["GPU Name", gpu_name],
    ["GPU Total Memory", gpu_total_memory],
    ["GPU Used Memory", gpu_used_memory],
    ["GPU Free Memory", gpu_free_memory]
]

# Display the table
print(tabulate(table, headers=["Characteristic", "Value"], tablefmt="pretty"))


+---------------------+--------------------------------+
|   Characteristic    |             Value              |
+---------------------+--------------------------------+
|      CPU Type       | Intel(R) Xeon(R) CPU @ 2.00GHz |
|  CPU Architecture   |             x86_64             |
| Number of CPU Cores |               4                |
|      Total RAM      |            54.76 GB            |
|    Available RAM    |            52.15 GB            |
|      GPU Name       |            Tesla T4            |
|  GPU Total Memory   |            15360 MB            |
|   GPU Used Memory   |             359 MB             |
|   GPU Free Memory   |            14742 MB            |
+---------------------+--------------------------------+


## 1. Data Acquisition and Preprocessing

In this section, we focus on acquiring the above-mentioned datasets.

### 1.8. CIDDS-002 dataset
An updated version of the CIDDS-001 dataset, CIDDS-002 differs in data size, preprocessing, and other factors. Researchers and practitioners can choose between the two versions based on their research objectives and requirements in network intrusion detection.

### Download and Unzip CIDDS-002 dataset

In [6]:
import os
import pandas as pd
import zipfile

# Specify the dataset name
dataset_name = "dhoogla/cidds002"

# Specify the destination folder in your Google Drive
destination_folder = "/content/drive/MyDrive/CIDDS-002-BM"

# Check if the dataset file already exists in your Google Drive
dataset_file_path = os.path.join(destination_folder, "cidds002.zip")

if not os.path.exists(dataset_file_path):

  # Download the dataset and save it to your Google Drive
  !kaggle datasets download -d $dataset_name -p $destination_folder

  print("Download complete.")

else:

  print("Dataset already exists. Skipping download.")

dest_file = f"{destination_folder}/cidds002.zip"

# Check if the dataset was downloaded and has not been unzipped yet
if os.path.exists(dest_file) and len(os.listdir(destination_folder))==1:

  # Unzip the downloaded dataset
  with zipfile.ZipFile(dest_file, "r") as zip_ref:
      zip_ref.extractall(destination_folder)

  print("Unzip complete.")

else:

  print("Dataset already exists. Skipping unzip.")

Downloading cidds002.zip to /content/drive/MyDrive/CIDDS-002-BM
 80% 11.0M/13.7M [00:00<00:00, 19.6MB/s]
100% 13.7M/13.7M [00:00<00:00, 15.3MB/s]
Download complete.
Unzip complete.


In [7]:
!ls -ahl '/content/drive/MyDrive/CIDDS-002-BM'

total 30M
-rw------- 1 root root 16M Oct 12 19:52 cidds-002.parquet
-rw------- 1 root root 14M Aug 12  2022 cidds002.zip


In [8]:
import pandas as pd
import os

# Specify the destination folder in your Google Drive
destination_folder = "/content/drive/MyDrive/CIDDS-002-BM"

# Check if the Dataset is saved
df_file_path = os.path.join(destination_folder, "cidds002.csv")

if os.path.exists(df_file_path):

  df = pd.read_csv(df_file_path)

else:

  df = pd.read_parquet("/content/drive/MyDrive/CIDDS-002-BM/cidds-002.parquet")

In [9]:
if not os.path.exists(df_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df.to_csv(df_file_path, index=False)

In [10]:
# Information about the starting CIDDS-002 DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2644135 entries, 0 to 2644134
Data columns (total 15 columns):
 #   Column       Dtype   
---  ------       -----   
 0   duration     float32 
 1   proto        category
 2   packets      int32   
 3   bytes        float32 
 4   flows        int8    
 5   tcp_urg      int8    
 6   tcp_ack      int8    
 7   tcp_psh      int8    
 8   tcp_rst      int8    
 9   tcp_syn      int8    
 10  tcp_fin      int8    
 11  tos          int16   
 12  label        category
 13  attack_type  category
 14  attack_id    int8    
dtypes: category(3), float32(2), int16(1), int32(1), int8(8)
memory usage: 63.0 MB


In [11]:
# Some basic statistical details like percentile, mean, std, etc. of the starting CIDDS-002 DataFrame
df.describe()

Unnamed: 0,duration,packets,bytes,flows,tcp_urg,tcp_ack,tcp_psh,tcp_rst,tcp_syn,tcp_fin,tos,attack_id
count,2644135.0,2644135.0,2644135.0,2644135.0,2644135.0,2644135.0,2644135.0,2644135.0,2644135.0,2644135.0,2644135.0,2644135.0
mean,1.162021,123.4826,175273.2,1.0,0.0,0.9867083,0.9716395,0.05546805,0.6332283,0.1996827,0.1974181,0.03543503
std,7.084023,2871.142,6036762.0,0.0,0.0,0.1145208,0.1660006,0.2288916,0.4819236,0.3997619,6.137102,1.052658
min,0.0,1.0,42.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.107,5.0,1166.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,0.276,8.0,2363.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0
75%,0.828,16.0,6238.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0
max,334.421,205049.0,509900000.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,192.0,43.0


In [12]:
# Shape and columns
df.shape, df.columns

((2644135, 15),
 Index(['duration', 'proto', 'packets', 'bytes', 'flows', 'tcp_urg', 'tcp_ack',
        'tcp_psh', 'tcp_rst', 'tcp_syn', 'tcp_fin', 'tos', 'label',
        'attack_type', 'attack_id'],
       dtype='object'))

In [13]:
label_counts_df = df["label"].value_counts()

# Display the counts with labels for df
print("\nLabel counts for df:")
print(label_counts_df)


Label counts for df:
normal      2640306
victim         3048
attacker        781
Name: label, dtype: int64


In [14]:
# Check duplicate records

df_copy = df.copy()

# Print the shape of the DataFrame 'df_copy' before removing duplicates
print(df_copy.shape)

# Remove duplicate rows from the DataFrame 'df_copy' while resetting the index
df_copy = df_copy.drop_duplicates()
df_copy.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df_copy' after removing duplicates and resetting the index
print(df_copy.shape)

(2644135, 15)
(2644135, 15)


In [15]:
# Find identical feature vectors but with different "label"

# Create a subset DataFrame with only feature columns
feature_columns = [col for col in df.columns if col != "label"]

feature_df = df[feature_columns]

# Find duplicate rows based on feature vectors
duplicate_rows = feature_df.duplicated(keep="first")

# Filter the DataFrame to show only duplicate rows
duplicate_records = df[duplicate_rows]

print("Duplicate Records:")
duplicate_records

Duplicate Records:


Unnamed: 0,duration,proto,packets,bytes,flows,tcp_urg,tcp_ack,tcp_psh,tcp_rst,tcp_syn,tcp_fin,tos,label,attack_type,attack_id
163143,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,1
413822,0.0,UDP,1,54.0,1,0,0,0,0,0,0,0,victim,scan,1
413839,0.0,ICMP,1,82.0,1,0,0,0,0,0,0,192,victim,scan,1
446832,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,2
467552,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,3
467596,2.681,ICMP,3,126.0,1,0,0,0,0,0,0,0,victim,scan,3
473212,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,4
534607,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,5
540950,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,6
568216,0.0,UDP,1,54.0,1,0,0,0,0,0,0,0,victim,scan,6


In [16]:
# Filter the DataFrame to show only duplicate rows with labels other than "normal"
filtered_duplicate_records = df[duplicate_rows]
filtered_duplicate_records = filtered_duplicate_records[filtered_duplicate_records['label'] != 'normal']

print("Duplicate Records with Labels Other than 'normal':")
filtered_duplicate_records

Duplicate Records with Labels Other than 'normal':


Unnamed: 0,duration,proto,packets,bytes,flows,tcp_urg,tcp_ack,tcp_psh,tcp_rst,tcp_syn,tcp_fin,tos,label,attack_type,attack_id
163143,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,1
413822,0.0,UDP,1,54.0,1,0,0,0,0,0,0,0,victim,scan,1
413839,0.0,ICMP,1,82.0,1,0,0,0,0,0,0,192,victim,scan,1
446832,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,2
467552,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,3
467596,2.681,ICMP,3,126.0,1,0,0,0,0,0,0,0,victim,scan,3
473212,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,4
534607,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,5
540950,0.0,ICMP,1,42.0,1,0,0,0,0,0,0,0,victim,scan,6
568216,0.0,UDP,1,54.0,1,0,0,0,0,0,0,0,victim,scan,6


In [17]:
# Identify identical feature vectors, list the different labels associated with each vector, and provide the indices for each label

# Create a subset DataFrame with only feature columns
feature_columns = [col for col in df.columns if col not in ["label", "attack_type", "attack_id"]]
feature_df = df[feature_columns]

# Find duplicate rows based on feature vectors
duplicate_rows = feature_df.duplicated(keep="first")

# Get the duplicated feature vectors
duplicated_feature_vectors = feature_df[duplicate_rows]

# Initialize a dictionary to store the different labels and their indices
label_indices = {}

# Initialize a list to store the indexes to drop
indexes_to_drop = []

# Iterate through the duplicated feature vectors
for idx, row in duplicated_feature_vectors.iterrows():
    feature_vector = row.tolist()
    label = df.loc[idx, 'label']

    if tuple(feature_vector) in label_indices:
        label_indices[tuple(feature_vector)].append((label, idx))
    else:
        label_indices[tuple(feature_vector)] = [(label, idx)]

# Print feature vectors with different labels in groups
for feature_vector, labels_indices in label_indices.items():
    if len(labels_indices) > 1:
        unique_labels = set(label for label, _ in labels_indices)
        if len(unique_labels) > 1:
            print("Identical Feature Vector:", feature_vector)
            for label, idx in labels_indices:
                print(f"Label: {label}, Index: {idx}")
                indexes_to_drop.append(idx)
            print()

Identical Feature Vector: (0.0, 'ICMP ', 1, 42.0, 1, 0, 0, 0, 0, 0, 0, 0)
Label: victim, Index: 163143
Label: attacker, Index: 446217
Label: victim, Index: 446832
Label: attacker, Index: 467545
Label: victim, Index: 467552
Label: attacker, Index: 473142
Label: victim, Index: 473212
Label: attacker, Index: 534591
Label: victim, Index: 534607
Label: attacker, Index: 540939
Label: victim, Index: 540950
Label: attacker, Index: 680563
Label: victim, Index: 680595
Label: attacker, Index: 856996
Label: victim, Index: 858118
Label: victim, Index: 993470
Label: attacker, Index: 993471
Label: attacker, Index: 1032897
Label: victim, Index: 1033856
Label: attacker, Index: 1162132
Label: victim, Index: 1162142
Label: attacker, Index: 1179995
Label: victim, Index: 1180013
Label: attacker, Index: 1331151
Label: victim, Index: 1331153
Label: attacker, Index: 1399190
Label: victim, Index: 1399202
Label: attacker, Index: 1422701
Label: victim, Index: 1422710
Label: attacker, Index: 1496375
Label: victim
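
The cell above inspects only the rows flagged by `duplicated(keep="first")`, so the first occurrence of each feature vector is not part of the comparison. A minimal alternative sketch, assuming the goal is to collect every row whose feature vector appears with more than one label, groups all occurrences before comparing labels. The helper name `find_label_conflicts` and the toy rows are hypothetical, and plain Python is used instead of pandas to keep the example self-contained.

```python
from collections import defaultdict

def find_label_conflicts(rows, feature_keys, label_key="label"):
    """Return {feature_vector: [(label, index), ...]} for vectors seen with more than one label."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        vector = tuple(row[k] for k in feature_keys)
        groups[vector].append((row[label_key], idx))

    # Keep only the vectors that carry at least two distinct labels
    return {
        vector: members
        for vector, members in groups.items()
        if len({label for label, _ in members}) > 1
    }

# Hypothetical toy rows: the ICMP vector appears with two labels, the TCP vector with one
rows = [
    {"proto": "ICMP", "packets": 1, "bytes": 42.0, "label": "attacker"},
    {"proto": "ICMP", "packets": 1, "bytes": 42.0, "label": "victim"},
    {"proto": "TCP", "packets": 5, "bytes": 1166.0, "label": "normal"},
    {"proto": "TCP", "packets": 5, "bytes": 1166.0, "label": "normal"},
]

conflicts = find_label_conflicts(rows, ["proto", "packets", "bytes"])
indexes_to_drop = sorted(idx for members in conflicts.values() for _, idx in members)
print(conflicts)
print(indexes_to_drop)
```

Because every occurrence is grouped, this variant would also mark the first row of each conflicting group for removal, which may yield a different row count than the cell above.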

In [18]:
# Drop the rows with different labels for the same feature vector from the original DataFrame as inconsistency affects learning
df.drop(indexes_to_drop, inplace=True)

In [19]:
df.shape

(2643843, 15)

### Preprocessing of the CIDDS-002 dataset

In [20]:
# If the dataset has not been preprocessed yet, do the following:
  # 1 # Handling Missing Values
  # 2 # Encode Categorical Features and Label
  # 3 # Normalization (Min-Max Scaling)
  # 4 # Removing duplicate records

import numpy as np
from sklearn.impute import SimpleImputer

df_encoded_file_path = os.path.join(destination_folder, "cidds002_encoded.csv")
if not os.path.exists(df_encoded_file_path):

  # Step 1: Handling Missing Values

  # Count NaN values across the whole DataFrame
  check_nan = df.isna().sum().sum()

  # Check if missing values are represented as empty values (",,")
  missing_values_as_empty = df.applymap(lambda x: x == '')

  # Count the number of empty values in each column
  missing_values_count = missing_values_as_empty.sum()

  # True if at least one column contains empty values
  check_null = (missing_values_count != 0).any()

  # Replace empty values with NaN
  if check_null:
    df.replace("", np.nan, inplace=True)

  # Impute missing values with the most frequent value for non-numeric columns and the mean for numerical columns
  if check_null or check_nan != 0:
    imputer = SimpleImputer(strategy='most_frequent', missing_values=pd.NA)
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            # fit_transform returns a 2-D array; flatten it before assigning it to the column
            df[col] = imputer.fit_transform(df[[col]]).ravel()
        else:
            df[col] = df[col].fillna(df[col].mean())

In [21]:
# Check again for missing values, NAN
display(df.isna().sum(axis=0))

duration       0
proto          0
packets        0
bytes          0
flows          0
tcp_urg        0
tcp_ack        0
tcp_psh        0
tcp_rst        0
tcp_syn        0
tcp_fin        0
tos            0
label          0
attack_type    0
attack_id      0
dtype: int64

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643843 entries, 0 to 2644134
Data columns (total 15 columns):
 #   Column       Dtype   
---  ------       -----   
 0   duration     float32 
 1   proto        category
 2   packets      int32   
 3   bytes        float32 
 4   flows        int8    
 5   tcp_urg      int8    
 6   tcp_ack      int8    
 7   tcp_psh      int8    
 8   tcp_rst      int8    
 9   tcp_syn      int8    
 10  tcp_fin      int8    
 11  tos          int16   
 12  label        category
 13  attack_type  category
 14  attack_id    int8    
dtypes: category(3), float32(2), int16(1), int32(1), int8(8)
memory usage: 83.2 MB


In [23]:
# 2 # Encode Categorical Features and Label

import numpy as np

df['proto'] = df['proto'].astype('category').cat.codes
df['proto'] = df['proto'].astype(np.int32)

In [24]:
# Display the top 10 most frequent values and their counts in the 'label' column of CIDDS-002
print(df.label.value_counts().head(10))

# Change the data type of the 'label' column to 'object' (string)
df['label'] = df['label'].astype(dtype='object')

# Binarize the label: True for 'normal', False for 'attacker' and 'victim'
df['label'] = df['label'].str.startswith('normal', na=False)

# Change the data type of the 'label' column to 'float32' (1.0 = normal, 0.0 = attack-related)
df['label'] = df['label'].astype(dtype='float32', copy=False)

# Display again the most frequent values and their counts in the 'label' column of CIDDS-002 after the modifications
print(df.label.value_counts().head(10))

normal      2640288
victim         2919
attacker        636
Name: label, dtype: int64
1.0    2640288
0.0       3555
Name: label, dtype: int64


In [25]:
# Dropping 'attack_type', 'attack_id'

df = df.drop(columns=['attack_type', 'attack_id'])
df.shape

(2643843, 13)

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2643843 entries, 0 to 2644134
Data columns (total 13 columns):
 #   Column    Dtype  
---  ------    -----  
 0   duration  float32
 1   proto     int32  
 2   packets   int32  
 3   bytes     float32
 4   flows     int8   
 5   tcp_urg   int8   
 6   tcp_ack   int8   
 7   tcp_psh   int8   
 8   tcp_rst   int8   
 9   tcp_syn   int8   
 10  tcp_fin   int8   
 11  tos       int16  
 12  label     float32
dtypes: float32(3), int16(1), int32(2), int8(7)
memory usage: 93.3 MB


In [27]:
  # 3 # Normalization (Min-Max Scaling)

from sklearn.preprocessing import MinMaxScaler

# Check if the Dataset was not preprocessed:
if not os.path.exists(df_encoded_file_path):
    min_max_scaler = MinMaxScaler().fit(df)  # Fit the scaler to the data in 'df'
    df_scaled = pd.DataFrame(data=min_max_scaler.transform(df), columns=df.columns)  # Create a new DataFrame with scaled data
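
For reference, min-max scaling maps each column to [0, 1] with x' = (x - min) / (max - min). The following dependency-free sketch reproduces that formula on the `packets` quantiles reported by `df.describe()` earlier (1, 5, 8, 16, 205049); the helper name `min_max_scale` is hypothetical, and mapping constant columns to 0.0 is an assumption chosen to mirror how `MinMaxScaler` is understood to treat zero-range features.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1]; constant columns map to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # No spread in the data: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

# min, 25%, 50%, 75% and max of 'packets' from the describe() output above
packets = [1, 5, 8, 16, 205049]
scaled = min_max_scale(packets)
print(scaled)

# 'flows' is constant (always 1) in this dataset, so it carries no information after scaling
print(min_max_scale([1, 1, 1]))
```

The example also shows a side effect worth keeping in mind: a single extreme maximum compresses the bulk of the values close to 0, so heavy-tailed features such as `packets` and `bytes` lose most of their resolution under plain min-max scaling.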

In [28]:
  # 4 # Removing duplicate records

# Print the shape of the DataFrame 'df_scaled' before removing duplicates
print(df_scaled.shape)

# Remove duplicate rows from the DataFrame 'df_scaled' while resetting the index
df_scaled = df_scaled.drop_duplicates()
df_scaled.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df_scaled' after removing duplicates and resetting the index
print(df_scaled.shape)

(2643843, 13)
(2642714, 13)


In [29]:
# Print out the DataFrames loaded in the memory
%whos DataFrame

Variable                     Type         Data/Info
---------------------------------------------------
df                           DataFrame             duration  proto <...>643843 rows x 13 columns]
df_copy                      DataFrame             duration  proto <...>644135 rows x 15 columns]
df_scaled                    DataFrame             duration     pro<...>642714 rows x 13 columns]
duplicate_records            DataFrame             duration  proto <...>        scan         41  
duplicated_feature_vectors   DataFrame             duration  proto <...>n[1582 rows x 12 columns]
feature_df                   DataFrame             duration  proto <...>644135 rows x 12 columns]
filtered_duplicate_records   DataFrame             duration  proto <...>        scan         41  
missing_values_as_empty      DataFrame             duration  proto <...>643843 rows x 15 columns]


In [30]:
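# Free the intermediate DataFrames that are no longer needed.
# Each name is removed independently, so one missing variable does not skip the others.
for name in ["df_copy", "duplicate_records", "duplicated_feature_vectors",
             "feature_df", "filtered_duplicate_records", "missing_values_as_empty"]:
  globals().pop(name, None)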

In [31]:
# Check if the Dataset is saved
if not os.path.exists(df_encoded_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df_scaled.to_csv(df_encoded_file_path, index=False)

## 2. Algorithm Evaluation

In this section, we assess the performance of various machine learning algorithms on the above-mentioned datasets.

### 2.8. CIDDS-002 dataset evaluation with baseline and traditional ML algorithms

In this section, we use precision and F1 score as the main metrics for evaluating classification quality. We evaluate various machine learning algorithms, including baseline classifiers like Zero Rule and One Rule, statistical techniques like Naive Bayes, and more advanced models such as Random Forest, using 10-fold cross-validation. Our evaluation covers both the 10 selected features and all 12 features of the CIDDS-002 dataset.
These results offer valuable insights into the optimal dataset generation strategy, guiding the selection of effective feature extraction methods from raw data and helping determine the most suitable methodology for the specific dataset.
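
When reading the scores below, note that the label was encoded as 1.0 for normal traffic and 0.0 for attack-related traffic, and that scikit-learn's `'precision'` and `'f1'` scoring strings treat label 1 as the positive class by default. The reported values therefore describe how well normal traffic is recognized. The sketch below, using hypothetical toy labels and a hypothetical helper `precision_recall_f1`, illustrates how a majority-class baseline can score close to 1.0 under this convention while detecting no attacks at all.

```python
def precision_recall_f1(y_true, y_pred, positive=1.0):
    """Compute precision, recall and F1 for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical toy data: 999 normal flows (1.0) and 1 attack flow (0.0)
y_true = [1.0] * 999 + [0.0]
y_pred = [1.0] * 1000  # majority-class baseline: always predict "normal"

p_normal, r_normal, f1_normal = precision_recall_f1(y_true, y_pred, positive=1.0)
p_attack, r_attack, f1_attack = precision_recall_f1(y_true, y_pred, positive=0.0)

print("normal as positive:", p_normal, r_normal, f1_normal)
print("attack as positive:", p_attack, r_attack, f1_attack)
```

Reporting the metrics with the attack class as positive, or using macro-averaged scores, is one way to make the baseline's blind spot visible.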

In [32]:
del df
df = df_scaled.copy()
del df_scaled

In [None]:
"""import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('/content/drive/MyDrive/CIDDS-002-BM/cidds002_encoded.csv')"""

In [34]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

# Separate features (X) and labels (y)
X = df.drop('label', axis=1)  # Exclude the label column
y = df['label']

# Create a pipeline for feature selection on the preprocessed data
pipeline_10_features = Pipeline([
    ('selector_10', SelectKBest(score_func=f_classif, k=10))
])

# Fit and transform the data for 10 features
X_selected_10 = pipeline_10_features.fit_transform(X, y)

# Check the shape of the selected 10 features
print(X_selected_10.shape)

# Display the selected features
print("Selected 10 features:")
selected_feature_indices_10 = pipeline_10_features.named_steps['selector_10'].get_support(indices=True)
selected_features_10 = X.columns[selected_feature_indices_10]
print(selected_features_10)

(2642714, 10)
Selected 10 features:
Index(['duration', 'proto', 'packets', 'bytes', 'tcp_ack', 'tcp_psh',
       'tcp_rst', 'tcp_syn', 'tcp_fin', 'tos'],
      dtype='object')
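
`SelectKBest` with `f_classif` ranks features by the one-way ANOVA F statistic, i.e. the ratio of between-class variance to within-class variance of each feature. As a rough illustration of that score (a hand-written sketch, not scikit-learn's implementation), the hypothetical helper `anova_f` below computes it for a single feature on made-up values.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)

    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}

    # Variation of the class means around the grand mean
    ss_between = sum(len(g) * (means[y] - grand_mean) ** 2 for y, g in groups.items())
    # Variation of the samples around their own class mean
    ss_within = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical toy data: one feature separates the classes, the other does not
labels = [1, 1, 1, 0, 0, 0]
informative = [0.9, 1.0, 1.1, 0.1, 0.0, 0.2]
noisy = [0.4, 0.6, 0.5, 0.5, 0.6, 0.4]

f_informative = anova_f(informative, labels)
f_noisy = anova_f(noisy, labels)
print(f_informative, f_noisy)
```

A feature with zero within-class variance would make the denominator zero in this sketch. That case is relevant here because `flows` and `tcp_urg` are constant in this dataset, which is consistent with these two columns being the ones left out of the selected 10.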


In [35]:
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, mean_absolute_error, f1_score
from sklearn.dummy import DummyClassifier
from tabulate import tabulate
import time
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Define the number of desired folds for Cross-Validation (e.g., 10)
num_folds = 10

# Initialize performance metrics lists for 10 and all features
results_10_features = []
results_all_features = []

In [37]:
import os

# Define a file name for saving the results
results_file_name = os.path.join('/content/drive/MyDrive/CIDDS-002-BM/', "cidds002_results.pkl")

# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define ZeroRule classifier
  zero_rule = DummyClassifier(strategy="most_frequent")

  # Evaluate ZeroRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(zero_rule, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(zero_rule, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(zero_rule, X, y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(zero_rule, X, y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(zero_rule, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(zero_rule, X, y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display ZeroRule results for 10 features
  print("ZeroRule Precision (10 features):", np.mean(precision_scores_10))
  print("ZeroRule F1 Score (10 features):", np.mean(f1_scores_10))
  print("ZeroRule Variance (10 features):", variance_10)
  print("ZeroRule MAE (10 features):", mae_10)
  print("ZeroRule Execution Time:", elapsed_time_10)

  # Display ZeroRule results for all features
  print("ZeroRule Precision (all features):", np.mean(precision_scores_20))
  print("ZeroRule F1 Score (all features):", np.mean(f1_scores_20))
  print("ZeroRule Variance (all features):", variance_20)
  print("ZeroRule MAE (all features):", mae_20)
  print("ZeroRule Execution Time:", elapsed_time_20)

  results_10_features.append(["ZeroRule", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_all_features.append(["ZeroRule", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

ZeroRule Precision (10 features): 0.9990820043357116
ZeroRule F1 Score (10 features): 0.9995407913912411
ZeroRule Variance (10 features): 3.432257141571462e-12
ZeroRule MAE (10 features): 0.0009179956665761032
ZeroRule Execution Time: 7.76543402671814
ZeroRule Precision (all features): 0.9990820043357116
ZeroRule F1 Score (all features): 0.9995407913912411
ZeroRule Variance (all features): 3.432257141571462e-12
ZeroRule MAE (all features): 0.0009179956665761032
ZeroRule Execution Time: 8.051806211471558
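
With labels encoded as 0.0 and 1.0, the MAE between true and predicted labels equals the misclassification rate, which is why the ZeroRule MAE above (about 0.000918) is roughly 1 - 0.999082. The small sketch below checks this equivalence on hypothetical toy labels; the function names are illustrative and do not refer to the scikit-learn implementations.

```python
def mean_absolute_error_01(y_true, y_pred):
    """Mean absolute difference between true and predicted labels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)

# Hypothetical toy labels: 6 normal (1.0) and 2 attack (0.0) samples
y_true = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
y_pred = [1.0] * 8  # majority-class prediction

mae = mean_absolute_error_01(y_true, y_pred)
err = error_rate(y_true, y_pred)
print(mae, err)
```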


In [38]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

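  # Define the "OneRule" baseline.
  # Note: DummyClassifier(strategy="stratified") draws random predictions that follow the
  # training class distribution; it is used here as a stand-in and does not learn a one-feature rule.
  one_rule = DummyClassifier(strategy="stratified")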

  # Evaluate OneRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(one_rule, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(one_rule, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(one_rule, X, y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(one_rule, X, y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(one_rule, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(one_rule, X, y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display OneRule results for 10 features
  print("OneRule Precision (10 features):", np.mean(precision_scores_10))
  print("OneRule F1 Score (10 features):", np.mean(f1_scores_10))
  print("OneRule Variance (10 features):", variance_10)
  print("OneRule MAE (10 features):", mae_10)
  print("OneRule Execution Time:", elapsed_time_10)

  # Display OneRule results for all features
  print("OneRule Precision (all features):", np.mean(precision_scores_20))
  print("OneRule F1 Score (all features):", np.mean(f1_scores_20))
  print("OneRule Variance (all features):", variance_20)
  print("OneRule MAE (all features):", mae_20)
  print("OneRule Execution Time:", elapsed_time_20)

  results_10_features.append(["OneRule", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_all_features.append(["OneRule", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

OneRule Precision (10 features): 0.9990818963973034
OneRule F1 Score (10 features): 0.9990807807186062
OneRule Variance (10 features): 6.369223916299419e-12
OneRule MAE (10 features): 0.0018303153500530136
OneRule Execution Time: 7.572270393371582
OneRule Precision (all features): 0.9990815319486218
OneRule F1 Score (all features): 0.9990777471249151
OneRule Variance (all features): 3.5620273362099803e-12
OneRule MAE (all features): 0.0018185849849813487
OneRule Execution Time: 7.493360996246338
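
The cell above uses `DummyClassifier(strategy="stratified")`, which predicts at random according to the class distribution and never looks at the features. For comparison, a classic OneR classifier selects the single feature whose value-to-majority-label rule makes the fewest training errors. The sketch below is a minimal illustration for discrete features only (continuous features such as `duration` would need binning first); the function names and toy rows are hypothetical and are not part of the experiment.

```python
from collections import Counter, defaultdict

def fit_one_rule(rows, labels):
    """Pick the feature whose value -> majority-label rule has the fewest training errors."""
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    best = None
    for feature in rows[0]:
        # Count the labels observed for each value of this feature
        table = defaultdict(Counter)
        for row, y in zip(rows, labels):
            table[row[feature]][y] += 1
        rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        errors = sum(1 for row, y in zip(rows, labels) if rule[row[feature]] != y)
        if best is None or errors < best[0]:
            best = (errors, feature, rule)
    _, feature, rule = best
    return feature, rule, default

def predict_one_rule(model, row):
    """Apply the learned rule; unseen feature values fall back to the majority label."""
    feature, rule, default = model
    return rule.get(row[feature], default)

# Hypothetical toy flows with two discrete features
rows = [
    {"proto": "TCP", "tcp_syn": 1},
    {"proto": "TCP", "tcp_syn": 1},
    {"proto": "ICMP", "tcp_syn": 0},
    {"proto": "ICMP", "tcp_syn": 0},
    {"proto": "UDP", "tcp_syn": 0},
]
labels = ["normal", "normal", "attack", "attack", "normal"]

model = fit_one_rule(rows, labels)
print(model[0], model[1])
print(predict_one_rule(model, {"proto": "ICMP", "tcp_syn": 0}))
```

Within scikit-learn, a depth-1 decision tree is a commonly used approximation of the same idea for numeric features.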


In [39]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define Naive Bayes classifier
  naive_bayes = GaussianNB()

  # Evaluate Naive Bayes on the 10 selected features.
  # The timer covers two full cross-validation passes (precision, then F1).
  start_time = time.time()
  precision_scores_10 = cross_val_score(naive_bayes, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(naive_bayes, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Elapsed seconds

  # Repeat on the full feature set (variables with the _20 suffix)
  start_time = time.time()
  precision_scores_20 = cross_val_score(naive_bayes, X, y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(naive_bayes, X, y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Elapsed seconds

  # Variance of the per-fold precision scores (spread across folds)
  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  # MAE is computed on out-of-fold predictions; for 0/1 labels it equals the misclassification rate
  predictions_10 = cross_val_predict(naive_bayes, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(naive_bayes, X, y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display Naive Bayes results for 10 features
  print("Naive Bayes Precision (10 features):", np.mean(precision_scores_10))
  print("Naive Bayes F1 Score (10 features):", np.mean(f1_scores_10))
  print("Naive Bayes Variance (10 features):", variance_10)
  print("Naive Bayes MAE (10 features):", mae_10)
  print("Naive Bayes Execution Time:", elapsed_time_10)

  # Display Naive Bayes results for all features
  print("Naive Bayes Precision (all features):", np.mean(precision_scores_20))
  print("Naive Bayes F1 Score (all features):", np.mean(f1_scores_20))
  print("Naive Bayes Variance (all features):", variance_20)
  print("Naive Bayes MAE (all features):", mae_20)
  print("Naive Bayes Execution Time:", elapsed_time_20)

  results_10_features.append(["Naive Bayes", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_all_features.append(["Naive Bayes", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

Naive Bayes Precision (10 features): 0.9998023905157524
Naive Bayes F1 Score (10 features): 0.991570605014851
Naive Bayes Variance (10 features): 1.9865065199076738e-08
Naive Bayes MAE (10 features): 0.01663100887950796
Naive Bayes Execution Time: 18.78332757949829
Naive Bayes Precision (all features): 0.9998023905157524
Naive Bayes F1 Score (all features): 0.991570605014851
Naive Bayes Variance (all features): 1.9865065199076738e-08
Naive Bayes MAE (all features): 0.01663100887950796
Naive Bayes Execution Time: 19.702195405960083
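
The MAE values reported in this section are computed on class labels rather than on continuous targets. Assuming the labels are encoded as 0 and 1, every wrong prediction contributes an absolute error of exactly 1, so the MAE is the misclassification rate (one minus accuracy). The short check below illustrates this with hypothetical labels that are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Hypothetical 0/1 labels and predictions, not taken from the dataset
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])

mae = mean_absolute_error(y_true, y_pred)
error_rate = 1.0 - accuracy_score(y_true, y_pred)

# Two of the eight predictions are wrong, so both values are 0.25
print("MAE:", mae)
print("Error rate:", error_rate)
```

Read this way, a MAE of about 0.0166 corresponds to roughly 1.7% of samples being misclassified, provided the 0/1 encoding assumption holds.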


In [40]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Create a Random Forest classifier with fixed hyperparameters (not tuned here):
  # 100 trees, depth capped at 10, trained on all available CPU cores
  rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)

  # Evaluate Random Forest classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(rf_classifier, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(rf_classifier, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(rf_classifier, X, y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(rf_classifier, X, y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(rf_classifier, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(rf_classifier, X, y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display Random Forest results for 10 features
  print("Random Forest Precision (10 features):", np.mean(precision_scores_10))
  print("Random Forest F1 Score (10 features):", np.mean(f1_scores_10))
  print("Random Forest Variance (10 features):", variance_10)
  print("Random Forest MAE (10 features):", mae_10)
  print("Random Forest Execution Time:", elapsed_time_10)

  # Display Random Forest results for all features
  print("Random Forest Precision (all features):", np.mean(precision_scores_20))
  print("Random Forest F1 Score (all features):", np.mean(f1_scores_20))
  print("Random Forest Variance (all features):", variance_20)
  print("Random Forest MAE (all features):", mae_20)
  print("Random Forest Execution Time:", elapsed_time_20)

  results_10_features.append(["Random Forest", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_all_features.append(["Random Forest", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

Random Forest Precision (10 features): 0.9997746924085155
Random Forest F1 Score (10 features): 0.9998458617385355
Random Forest Variance (10 features): 2.5497527179143414e-08
Random Forest MAE (10 features): 0.0003068814862296866
Random Forest Execution Time: 869.8524992465973
Random Forest Precision (all features): 0.9997750715705503
Random Forest F1 Score (all features): 0.9998477556557137
Random Forest Variance (all features): 2.5697755547220984e-08
Random Forest MAE (all features): 0.00030574628960984806
Random Forest Execution Time: 794.2905073165894


In [41]:
import pickle
import os

if not os.path.exists(results_file_name):

  # Save the results lists to a file
  with open(results_file_name, 'wb') as file:
      results_dict = {
          'results_10_features': results_10_features,
          'results_all_features': results_all_features
      }
      pickle.dump(results_dict, file)


In [42]:
# Load the results from the file
with open(results_file_name, 'rb') as file:
    loaded_results = pickle.load(file)

# Access the loaded results lists
results_10_features = loaded_results['results_10_features']
results_all_features = loaded_results['results_all_features']
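
The save and load cells above, together with the `os.path.exists` guards on the evaluation cells, form a simple compute-once cache. One way to keep that logic in a single place is sketched below. The helper `load_or_compute`, the function `expensive_evaluation`, and the temporary file path are hypothetical names introduced for illustration.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return cached results from `path`, or run `compute()` and cache them."""
    if os.path.exists(path):
        with open(path, "rb") as file:
            return pickle.load(file)
    results = compute()
    with open(path, "wb") as file:
        pickle.dump(results, file)
    return results


calls = []  # Records how often the expensive step actually runs


def expensive_evaluation():
    """Placeholder for the cross-validation cells; returns dummy rows."""
    calls.append(1)
    return {
        "results_10_features": [["Demo", 0.9]],
        "results_all_features": [["Demo", 0.8]],
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "results.pkl")
    first = load_or_compute(cache_path, expensive_evaluation)   # Computes and saves
    second = load_or_compute(cache_path, expensive_evaluation)  # Loads from disk

print("Evaluations run:", len(calls))
```

Pickle files can execute arbitrary code when loaded, so this pattern is only appropriate for result files created locally.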


In [43]:
# Print the results in tabular format
headers_10 = ["Algorithm", "Precision (10 Features)", "F1 Score (10 Features)", "Variance (10 Features)", "MAE (10 Features)", "Execution Time"]
headers_all = ["Algorithm", "Precision (All Features)", "F1 Score (All Features)", "Variance (All Features)", "MAE (All Features)", "Execution Time"]

print(tabulate(results_10_features, headers_10, tablefmt="pretty"))
print(tabulate(results_all_features, headers_all, tablefmt="pretty"))

+---------------+-------------------------+------------------------+------------------------+-----------------------+-------------------+
|   Algorithm   | Precision (10 Features) | F1 Score (10 Features) | Variance (10 Features) |   MAE (10 Features)   |  Execution Time   |
+---------------+-------------------------+------------------------+------------------------+-----------------------+-------------------+
|   ZeroRule    |   0.9990820043357116    |   0.9995407913912411   | 3.432257141571462e-12  | 0.0009179956665761032 | 7.76543402671814  |
|    OneRule    |   0.9990818963973034    |   0.9990807807186062   | 6.369223916299419e-12  | 0.0018303153500530136 | 7.572270393371582 |
|  Naive Bayes  |   0.9998023905157524    |   0.991570605014851    | 1.9865065199076738e-08 |  0.01663100887950796  | 18.78332757949829 |
| Random Forest |   0.9997746924085155    |   0.9998458617385355   | 2.5497527179143414e-08 | 0.0003068814862296866 | 869.8524992465973 |
+---------------+-----------------