# Intrusion Detection System (IDS) Public Datasets Benchmarking

In cybersecurity, the design, development, and implementation of effective Intrusion Detection Systems (IDS) are important for safeguarding IT&C infrastructures from unauthorized access, data breaches, and other malicious activity. Selecting an appropriate ML/DL algorithm plays an essential role in ensuring the security and integrity of protected systems.

Before we can dive into developing a cutting-edge algorithm, we need appropriate data, which must be studied and analysed to understand the reality and challenges of our ML problem. Following this approach, we chose to study earlier datasets designed for IDS in order to derive lessons learned for future dataset development.

This experiment aims to comprehensively evaluate the performance of different ML and DL algorithms on a variety of datasets, encompassing a wide range of network traffic scenarios. The datasets used for this analysis include well-known benchmark datasets such as KDD, NSL-KDD, CTU-13, ISCXIDS2012, CIC-IDS2017, CSE-CIC-IDS2018, CIDDS-001/CIDDS-002, and Kyoto 2006+. Each dataset represents a distinct set of challenges and characteristics, making this evaluation both diverse and insightful.

The experiment is divided into three main phases:

1. **Data Acquisition and Preprocessing**:
 - In this phase, we acquire the selected datasets from reputable sources, ensuring the integrity and accuracy of the data.
 - Data preprocessing tasks include handling missing values, selecting the most relevant features using feature selection techniques, normalizing the data, and, if necessary, performing feature engineering to enhance the dataset's suitability for machine learning.

2. **Algorithm Evaluation**:
 - We evaluate the performance of a range of ML/DL algorithms on each dataset. The chosen algorithms include baseline methods such as ZeroRule and OneRule, traditional machine learning approaches such as Naive Bayes and Random Forest, and some of the most widely used deep learning algorithms for anomaly detection.
 - Cross-validation is applied to ensure the robustness of our results. Performance metrics such as precision, variance, and Mean Absolute Error (MAE) are calculated for each algorithm and dataset.

3. **Results and Insights**:
 - The results of this evaluation provide valuable insights into the strengths and weaknesses of different IDS algorithms under various conditions.
 - We analyze the performance of algorithms on both the original datasets and balanced datasets to address the challenge of class imbalance in intrusion detection.
 - Observations and additional details regarding the algorithms' performance are documented, providing a comprehensive overview of their behavior.

By conducting this experiment, we aim to contribute to the understanding of dataset generation for the cyber domain. The findings will support informed decisions when developing a cybersecurity AI application by deriving the necessary steps and procedures for selecting appropriate training data.

The following sections of this Jupyter notebook will provide a detailed walkthrough of the experiment, including code snippets, visualizations, and discussions of the results.

In [2]:
# Mount your Google Drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import warnings
from google.colab import files

# Suppress all warning messages
warnings.filterwarnings("ignore")

# Check if the Kaggle API credentials file already exists
kaggle_credentials_path = os.path.expanduser("~/.kaggle/kaggle.json")

if not os.path.exists(kaggle_credentials_path):

    if not os.path.exists(os.path.join("/content/drive/MyDrive/.kaggle/", "kaggle.json")):

      # Upload your Kaggle API credentials file (kaggle.json)
      files.upload()

      !mkdir "/content/drive/MyDrive/.kaggle/"
      !mv kaggle.json "/content/drive/MyDrive/.kaggle/"
      !chmod 600 "/content/drive/MyDrive/.kaggle/kaggle.json"

    # Move the Kaggle API Credentials File
    !mkdir -p ~/.kaggle
    !cp '/content/drive/MyDrive/.kaggle/kaggle.json' ~/.kaggle/

else:

    print("Kaggle API credentials file already exists.")

In [4]:
import tensorflow as tf
# tf.test.is_gpu_available() is deprecated; query the physical device list instead
print("GPU available:", bool(tf.config.list_physical_devices('GPU')))
print("GPU device name:", tf.test.gpu_device_name())

GPU available: True
GPU device name: /device:GPU:0


In [3]:
import os
from psutil import virtual_memory, cpu_count
from tabulate import tabulate

# Function to get CPU information
def get_cpu_info():
    cpu_info = os.popen('lscpu').read()
    return cpu_info

# Function to get RAM information
def get_ram_info():
    ram = virtual_memory()
    total_ram = f"{ram.total / 1e9:.2f} GB"
    available_ram = f"{ram.available / 1e9:.2f} GB"
    return total_ram, available_ram

# Function to get GPU information
def get_gpu_info():
    # Execute nvidia-smi and get its output
    gpu_info = os.popen('nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv,noheader,nounits').read().strip()

    # Split the output to get individual GPU details
    details = gpu_info.split(", ")

    # Return GPU name, total, used, and free memory
    return details[0], f"{details[1]} MB", f"{details[2]} MB", f"{details[3]} MB"

# Collect system information
cpu_info = get_cpu_info()
total_ram, available_ram = get_ram_info()
try:
    gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = get_gpu_info()
except Exception:  # e.g. nvidia-smi is unavailable or reports no GPU
    gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = 'null',0,0,0

# Extract relevant CPU information
cpu_type = ""
cpu_architecture = ""

for line in cpu_info.splitlines():
    if "Model name:" in line:
        cpu_type = line.split(":")[1].strip()
    elif "Architecture:" in line:
        cpu_architecture = line.split(":")[1].strip()

# Get the number of CPU cores
num_cpu_cores = cpu_count(logical=False)

# Create a table
table = [
    ["CPU Type", cpu_type],
    ["CPU Architecture", cpu_architecture],
    ["Number of CPU Cores", num_cpu_cores],
    ["Total RAM", total_ram],
    ["Available RAM", available_ram],
    ["GPU Name", gpu_name],
    ["GPU Total Memory", gpu_total_memory],
    ["GPU Used Memory", gpu_used_memory],
    ["GPU Free Memory", gpu_free_memory]
]

# Display the table
print(tabulate(table, headers=["Characteristic", "Value"], tablefmt="pretty"))


+---------------------+--------------------------------+
|   Characteristic    |             Value              |
+---------------------+--------------------------------+
|      CPU Type       | Intel(R) Xeon(R) CPU @ 2.20GHz |
|  CPU Architecture   |             x86_64             |
| Number of CPU Cores |               4                |
|      Total RAM      |            54.76 GB            |
|    Available RAM    |            53.05 GB            |
|      GPU Name       |            Tesla T4            |
|  GPU Total Memory   |            15360 MB            |
|   GPU Used Memory   |              0 MB              |
|   GPU Free Memory   |            15101 MB            |
+---------------------+--------------------------------+


## 1. Data Acquisition and Preprocessing

In this section, we focus on acquiring the datasets mentioned above.

### 1.7. CIDDS-001 dataset
CIDDS-001 (Coburg Intrusion Detection Data Sets, data set 001) is a resource for research and development in network intrusion detection, hosted and published by HS-Coburg (Germany). It contains a diverse and realistic collection of network traffic covering both benign and malicious activity. It is an academic intrusion detection dataset credited to Markus Ring, Sarah Wunderlich, Dominik Grüdl, Dr. Dieter Landes, and Dr. Andreas Hotho.

### Download and Unzip CIDDS-001 dataset

In [None]:
import os
import pandas as pd
import zipfile

# Specify the dataset name
dataset_name = "dhoogla/cidds001"

# Specify the destination folder in your Google Drive
destination_folder = "/content/drive/MyDrive/CIDDS-001-BM"

# Check if the dataset file already exists in your Google Drive
dataset_file_path = os.path.join(destination_folder, "cidds001.zip")

if not os.path.exists(dataset_file_path):

  # Download the dataset and save it to your Google Drive
  !kaggle datasets download -d $dataset_name -p $destination_folder

  print("Download complete.")

else:

  print("Dataset already exists. Skipping download.")

dest_file = f"{destination_folder}/cidds001.zip"

# Check if the dataset was downloaded and has not been unzipped yet
if os.path.exists(dest_file) and len(os.listdir(destination_folder))==1:

  # Unzip the downloaded dataset
  with zipfile.ZipFile(dest_file, "r") as zip_ref:
      zip_ref.extractall(destination_folder)

  print("Unzip complete.")

else:

  print("Dataset already exists. Skipping unzip.")

Downloading cidds001.zip to /content/drive/MyDrive/CIDDS-001-BM
 75% 17.0M/22.6M [00:00<00:00, 46.1MB/s]
100% 22.6M/22.6M [00:00<00:00, 54.3MB/s]
Download complete.
Unzip complete.


In [None]:
!ls -ahl '/content/drive/MyDrive/CIDDS-001-BM'

total 48M
-rw------- 1 root root 1.2M Oct 12 11:31 cidds-001-externalserver.parquet
-rw------- 1 root root  25M Oct 12 11:31 cidds-001-openstack.parquet
-rw------- 1 root root  23M Aug 12  2022 cidds001.zip


In [None]:
import pandas as pd
import os

# Specify the destination folder in your Google Drive
destination_folder = "/content/drive/MyDrive/CIDDS-001-BM"

# Check if the Dataset is saved
df_file_path = os.path.join(destination_folder, "cidds001.csv")

if os.path.exists(df_file_path):

  df = pd.read_csv(df_file_path)

else:

  # List to store DataFrames
  dfs = []

  # Walk through the directory and find .parquet files
  for root, dirs, files in os.walk(destination_folder):
      for file in files:
          if file.endswith('.parquet'):
              filepath = os.path.join(root, file)
              dfs.append(pd.read_parquet(filepath))

  # Concatenate the DataFrames
  df = pd.concat(dfs, copy=False, ignore_index=True, sort=False)

In [None]:
if not os.path.exists(df_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df.to_csv(df_file_path, index=False)

In [None]:
# Information about the starting CIDDS-001 DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4366182 entries, 0 to 4366181
Data columns (total 15 columns):
 #   Column       Dtype  
---  ------       -----  
 0   duration     float32
 1   proto        object 
 2   packets      int32  
 3   bytes        float32
 4   flows        int8   
 5   tcp_urg      int8   
 6   tcp_ack      int8   
 7   tcp_psh      int8   
 8   tcp_rst      int8   
 9   tcp_syn      float32
 10  tcp_fin      float32
 11  tos          int16  
 12  label        object 
 13  attack_type  object 
 14  attack_id    int8   
dtypes: float32(4), int16(1), int32(1), int8(6), object(3)
memory usage: 216.5+ MB


In [None]:
# Some basic statistical details like percentile, mean, std, etc. of the starting CIDDS-001 DataFrame
df.describe()

| statistic | duration | packets | bytes | flows | tcp_urg | tcp_ack | tcp_psh | tcp_rst | tcp_syn | tcp_fin | tos | attack_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4366182.0 | 4366182.0 | 4366182.0 | 4366182.0 | 4366182.0 | 4366182.0 | 4366182.0 | 4366182.0 | 4366182.0 | 4366182.0 | 4366182.0 | 4366182.0 |
| mean | 16.76576 | 99.41453 | 137395.6 | 1.0 | 0.0 | 0.988468 | 0.9698265 | 0.02354597 | 0.642525 | 0.1925286 | 12.3661 | 0.08654701 |
| std | 1996.72 | 2662.677 | 5510090.0 | 0.0 | 0.0 | 0.1067664 | 0.1710645 | 0.1516297 | 0.4792564 | 0.3942859 | 15.81508 | 1.993916 |
| min | 0.0 | 1.0 | 28.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.133 | 5.0 | 1112.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.375 | 8.0 | 2188.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1.057 | 14.0 | 5440.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 32.0 | 0.0 |
| max | 604817.1 | 208768.0 | 516200000.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 192.0 | 70.0 |


In [None]:
# Shape and columns
df.shape, df.columns

((4366182, 15),
 Index(['duration', 'proto', 'packets', 'bytes', 'flows', 'tcp_urg', 'tcp_ack',
        'tcp_psh', 'tcp_rst', 'tcp_syn', 'tcp_fin', 'tos', 'label',
        'attack_type', 'attack_id'],
       dtype='object'))

In [None]:
label_counts_df = df["label"].value_counts()

# Display the counts with labels for df
print("\nLabel counts for df:")
print(label_counts_df)


Label counts for df:
normal        4158132
suspicious     181406
unknown         14769
attacker         6822
victim           5053
Name: label, dtype: int64
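
The counts above already indicate a strong class imbalance. As a minimal sketch (the variable names are illustrative and the numbers are copied from the output above), the class shares and the ratio between `normal` and all other labels can be derived like this:

```python
import pandas as pd

# Label counts copied from the value_counts() output above
label_counts = pd.Series(
    {"normal": 4158132, "suspicious": 181406, "unknown": 14769,
     "attacker": 6822, "victim": 5053},
    name="label",
)

# Share of each class in the full dataset
label_share = label_counts / label_counts.sum()
print(label_share.round(4))

# Ratio between the majority class and all remaining classes combined
majority = label_counts.max()
imbalance_ratio = majority / (label_counts.sum() - majority)
print(f"normal : non-normal ratio is about {imbalance_ratio:.1f} : 1")
```

Roughly 95% of the flows are labelled `normal`, which is worth keeping in mind when reading the baseline scores later in the notebook.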


In [None]:
# Check duplicate records

df_copy = df.copy()

# Print the shape of the DataFrame 'df_copy' before removing duplicates
print(df_copy.shape)

# Remove duplicate rows from the DataFrame 'df_copy' while resetting the index
df_copy = df_copy.drop_duplicates()
df_copy.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df_copy' after removing duplicates and resetting the index
print(df_copy.shape)

(4366182, 15)
(4366157, 15)


In [None]:
# Find identical feature vectors but with different "label"

# Create a subset DataFrame with only feature columns
feature_columns = [col for col in df.columns if col != "label"]

feature_df = df[feature_columns]

# Find duplicate rows based on feature vectors
duplicate_rows = feature_df.duplicated(keep="first")

# Filter the DataFrame to show only duplicate rows
duplicate_records = df[duplicate_rows]

print("Duplicate Records:")
duplicate_records

Duplicate Records:


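| index | duration | proto | packets | bytes | flows | tcp_urg | tcp_ack | tcp_psh | tcp_rst | tcp_syn | tcp_fin | tos | label | attack_type | attack_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 390 | 31.000000 | TCP | 1 | 46.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0 | unknown | benign | 0 |
| 391 | 31.000000 | TCP | 6 | 264.0 | 1 | 0 | 1 | 0 | 0 | 1.0 | 0.0 | 0 | unknown | benign | 0 |
| 590 | 30.997999 | TCP | 1 | 46.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0 | unknown | benign | 0 |
| 591 | 30.997999 | TCP | 6 | 264.0 | 1 | 0 | 1 | 0 | 0 | 1.0 | 0.0 | 0 | unknown | benign | 0 |
| 854 | 0.000000 | TCP | 1 | 46.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0 | unknown | benign | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4286924 | 0.000000 | ICMP | 1 | 42.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | victim | pingScan | 38 |
| 4294189 | 0.000000 | ICMP | 1 | 42.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | attacker | portScan | 40 |
| 4303173 | 0.593000 | TCP | 13 | 1593.0 | 1 | 0 | 1 | 1 | 0 | 1.0 | 1.0 | 0 | normal | benign | 0 |
| 4316802 | 1.319000 | TCP | 2 | 120.0 | 1 | 0 | 1 | 0 | 0 | 1.0 | 0.0 | 0 | normal | benign | 0 |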


In [None]:
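# Filter the duplicate rows whose label differs from "benign".
# Note: "benign" is a value of the 'attack_type' column, while 'label' holds
# normal/suspicious/unknown/attacker/victim, so this filter keeps every row.
filtered_duplicate_records = df[duplicate_rows]
filtered_duplicate_records = filtered_duplicate_records[filtered_duplicate_records['label'] != 'benign']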

print("Duplicate Records with Different Labels than 'benign':")
filtered_duplicate_records

Duplicate Records with Different Labels than 'benign':


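| index | duration | proto | packets | bytes | flows | tcp_urg | tcp_ack | tcp_psh | tcp_rst | tcp_syn | tcp_fin | tos | label | attack_type | attack_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 390 | 31.000000 | TCP | 1 | 46.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0 | unknown | benign | 0 |
| 391 | 31.000000 | TCP | 6 | 264.0 | 1 | 0 | 1 | 0 | 0 | 1.0 | 0.0 | 0 | unknown | benign | 0 |
| 590 | 30.997999 | TCP | 1 | 46.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0 | unknown | benign | 0 |
| 591 | 30.997999 | TCP | 6 | 264.0 | 1 | 0 | 1 | 0 | 0 | 1.0 | 0.0 | 0 | unknown | benign | 0 |
| 854 | 0.000000 | TCP | 1 | 46.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0 | unknown | benign | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4286924 | 0.000000 | ICMP | 1 | 42.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | victim | pingScan | 38 |
| 4294189 | 0.000000 | ICMP | 1 | 42.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | attacker | portScan | 40 |
| 4303173 | 0.593000 | TCP | 13 | 1593.0 | 1 | 0 | 1 | 1 | 0 | 1.0 | 1.0 | 0 | normal | benign | 0 |
| 4316802 | 1.319000 | TCP | 2 | 120.0 | 1 | 0 | 1 | 0 | 0 | 1.0 | 0.0 | 0 | normal | benign | 0 |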


In [None]:
# Identify identical feature vectors, list the different labels associated with each vector, and provide the indices for each label

# Create a subset DataFrame with only feature columns
feature_columns = [col for col in df.columns if col not in ["label", "attack_type", "attack_id"]]
feature_df = df[feature_columns]

# Find duplicate rows based on feature vectors
duplicate_rows = feature_df.duplicated(keep="first")

# Get the duplicated feature vectors
duplicated_feature_vectors = feature_df[duplicate_rows]

# Initialize a dictionary to store the different labels and their indices
label_indices = {}

# Initialize a list to store the indexes to drop
indexes_to_drop = []

# Iterate through the duplicated feature vectors
for idx, row in duplicated_feature_vectors.iterrows():
    feature_vector = row.tolist()
    label = df.loc[idx, 'label']

    if tuple(feature_vector) in label_indices:
        label_indices[tuple(feature_vector)].append((label, idx))
    else:
        label_indices[tuple(feature_vector)] = [(label, idx)]

# Print feature vectors with different labels in groups
for feature_vector, labels_indices in label_indices.items():
    if len(labels_indices) > 1:
        unique_labels = set(label for label, _ in labels_indices)
        if len(unique_labels) > 1:
            print("Identical Feature Vector:", feature_vector)
            for label, idx in labels_indices:
                print(f"Label: {label}, Index: {idx}")
                indexes_to_drop.append(idx)
            print()

Identical Feature Vector: (0.0, 'TCP  ', 1, 46.0, 1, 0, 0, 0, 0, 1.0, 0.0, 0)
Label: unknown, Index: 854
Label: attacker, Index: 28386
Label: attacker, Index: 45421
Label: attacker, Index: 131797
Label: attacker, Index: 131854

Identical Feature Vector: (0.0, 'TCP  ', 1, 40.0, 1, 0, 1, 0, 1, 0.0, 0.0, 0)
Label: unknown, Index: 855
Label: victim, Index: 25625
Label: victim, Index: 28387
Label: victim, Index: 45422
Label: victim, Index: 131798
Label: victim, Index: 131855

Identical Feature Vector: (0.0010000000474974513, 'TCP  ', 1, 40.0, 1, 0, 1, 0, 1, 0.0, 0.0, 0)
Label: unknown, Index: 15913
Label: victim, Index: 28389
Label: victim, Index: 45426
Label: victim, Index: 131802
Label: victim, Index: 131865

Identical Feature Vector: (0.0, 'TCP  ', 1, 46.0, 1, 0, 1, 0, 0, 0.0, 0.0, 0)
Label: unknown, Index: 16544
Label: attacker, Index: 131795

Identical Feature Vector: (0.0, 'TCP  ', 1, 40.0, 1, 0, 0, 0, 1, 0.0, 0.0, 0)
Label: unknown, Index: 16545
Label: victim, Index: 131796

[output truncated]

In [None]:
# Drop the rows with different labels for the same feature vector from the original DataFrame as inconsistency affects learning
df.drop(indexes_to_drop, inplace=True)

In [None]:
df.shape

(4365454, 15)
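
The loop over duplicated feature vectors can be slow on several million rows. A possible vectorised alternative, shown here on a toy frame with made-up values, counts the distinct labels per feature vector with `groupby`. This is a sketch and not the notebook's method: unlike the loop, it also flags the first occurrence of each conflicting vector, so it would remove more rows.

```python
import pandas as pd

# Toy frame standing in for the CIDDS-001 DataFrame (values are illustrative)
toy = pd.DataFrame({
    "duration": [0.0, 0.0, 0.0, 1.5, 1.5],
    "proto":    ["TCP", "TCP", "TCP", "UDP", "UDP"],
    "packets":  [1, 1, 1, 4, 4],
    "label":    ["unknown", "attacker", "attacker", "normal", "normal"],
})

feature_cols = [c for c in toy.columns if c != "label"]

# Number of distinct labels observed for each identical feature vector
labels_per_vector = toy.groupby(feature_cols)["label"].transform("nunique")

# Rows whose feature vector is associated with more than one label
conflicting = toy[labels_per_vector > 1]
print(conflicting)

# Every row of a conflicting group is dropped, including its first occurrence
toy_clean = toy.drop(conflicting.index)
print(toy_clean.shape)
```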

### Preprocessing of the CIDDS-001 dataset

In [None]:
# If the dataset has not been preprocessed yet, do:
  # 1 # Handling Missing Values
  # 2 # Encode Categorical Features and Label
  # 3 # Normalization (Min-Max Scaling)
  # 4 # Removing duplicate records

import numpy as np  # needed below for np.nan
from sklearn.impute import SimpleImputer

df_encoded_file_path = os.path.join(destination_folder, "cidds001_encoded.csv")
if not os.path.exists(df_encoded_file_path):

  # Step 1: Handling Missing Values

  # Check for missing values, NAN
  check_nan = df.isna().sum().sum()

  # Check if missing values are represented as empty values (",,")
  missing_values_as_empty = df.applymap(lambda x: x == '')

  # Count the number of missing values in each column
  missing_values_count = missing_values_as_empty.sum()

  # Check whether any column contains empty-string values
  check_null = (missing_values_count != 0).any()

  # Replace empty values with NaN
  if (check_null):
    df.replace("", np.nan, inplace=True)

  # Impute missing values with the most frequent value for categorical columns and mean for numerical columns
  if (check_null or check_nan !=0):
    imputer = SimpleImputer(strategy='most_frequent', missing_values=pd.NA)
    for col in df.columns:
        if df[col].dtype == 'object':
            # fit_transform returns a 2D array; flatten it before assigning to the column
            df[col] = imputer.fit_transform(df[[col]]).ravel()
        else:
            df[col] = df[col].fillna(df[col].mean())

In [None]:
# Check again for missing values, NAN
display(df.isna().sum(axis=0))

duration       0
proto          0
packets        0
bytes          0
flows          0
tcp_urg        0
tcp_ack        0
tcp_psh        0
tcp_rst        0
tcp_syn        0
tcp_fin        0
tos            0
label          0
attack_type    0
attack_id      0
dtype: int64
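
Since CIDDS-001 turns out to have no missing values, the imputation branch above is never exercised. To illustrate what it is meant to do, here is a minimal sketch on a toy frame (column values are invented for illustration) that applies the same strategy: most frequent value for categorical columns, mean for numerical ones.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one empty string and one NaN (illustrative values)
toy = pd.DataFrame({
    "proto": ["TCP", "", "UDP", "TCP"],
    "bytes": [46.0, 264.0, np.nan, 120.0],
})

# Treat empty strings as missing values
toy = toy.replace("", np.nan)

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numerical column: fill with the column mean
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Categorical column: fill with the most frequent value
        imputer = SimpleImputer(strategy="most_frequent")
        toy[col] = imputer.fit_transform(toy[[col]]).ravel()

print(toy)
```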

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4365454 entries, 0 to 4366181
Data columns (total 15 columns):
 #   Column       Dtype  
---  ------       -----  
 0   duration     float32
 1   proto        object 
 2   packets      int32  
 3   bytes        float32
 4   flows        int8   
 5   tcp_urg      int8   
 6   tcp_ack      int8   
 7   tcp_psh      int8   
 8   tcp_rst      int8   
 9   tcp_syn      float32
 10  tcp_fin      float32
 11  tos          int16  
 12  label        object 
 13  attack_type  object 
 14  attack_id    int8   
dtypes: float32(4), int16(1), int32(1), int8(6), object(3)
memory usage: 249.8+ MB


In [None]:
# 2 # Encode Categorical Features and Label

import numpy as np

df['proto'] = df['proto'].astype('category').cat.codes
df['proto'] = df['proto'].astype(np.int32)
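
The feature vectors printed earlier show protocol strings padded with spaces (for example `'TCP  '`). Category codes on their own also hide which integer stands for which protocol. The following sketch, on toy values, strips the padding and keeps an explicit code-to-name mapping; `code_to_proto` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy values mimicking the padded protocol strings seen above (e.g. 'TCP  ')
proto = pd.Series(["TCP  ", "UDP  ", "ICMP ", "TCP  "], name="proto")

# Strip padding so that 'TCP' and 'TCP  ' cannot end up as separate categories
proto_cat = proto.str.strip().astype("category")

# Keep the code -> name mapping so encoded values can be decoded later
code_to_proto = dict(enumerate(proto_cat.cat.categories))
proto_codes = proto_cat.cat.codes.astype(np.int32)

print(code_to_proto)
print(proto_codes.tolist())
```

Categories are sorted lexically, so the integer codes carry no ordinal meaning; tree-based models tolerate this, while distance-based models may prefer one-hot encoding.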

In [None]:
# Display the top 10 most frequent values and their counts in the 'label' column of CIDDS-001
print(df.label.value_counts().head(10))

# Change the data type of the 'label' column to 'object' (string)
df['label'] = df['label'].astype(dtype='object')

# Check if the 'label' column starts with the string 'normal', and assign a Boolean value accordingly
df['label'] = df['label'].str.startswith('normal', na=False)

# Change the data type of the 'label' column to 'float32'
df['label'] = df['label'].astype(dtype='float32', copy=False)

# Display again the top 10 most frequent values and their counts in the 'label' column of CIDDS-001 after modifications
print(df.label.value_counts().head(10))

normal        4158076
suspicious     181364
unknown         14756
attacker         6523
victim           4735
Name: label, dtype: int64
1.0    4158076
0.0     207378
Name: label, dtype: int64


In [None]:
# Drop the 'attack_type' and 'attack_id' columns

df = df.drop(columns=['attack_type', 'attack_id'])
df.shape

(4365454, 13)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4365454 entries, 0 to 4366181
Data columns (total 13 columns):
 #   Column    Dtype  
---  ------    -----  
 0   duration  float32
 1   proto     int32  
 2   packets   int32  
 3   bytes     float32
 4   flows     int8   
 5   tcp_urg   int8   
 6   tcp_ack   int8   
 7   tcp_psh   int8   
 8   tcp_rst   int8   
 9   tcp_syn   float32
 10  tcp_fin   float32
 11  tos       int16  
 12  label     float32
dtypes: float32(5), int16(1), int32(2), int8(5)
memory usage: 179.0 MB


In [None]:
  # 3 # Normalization (Min-Max Scaling)

from sklearn.preprocessing import MinMaxScaler

# Check if the Dataset was not preprocessed:
if not os.path.exists(df_encoded_file_path):
    min_max_scaler = MinMaxScaler().fit(df)  # Fit the scaler to the data in 'df'
    df_scaled = pd.DataFrame(data=min_max_scaler.transform(df), columns=df.columns)  # Create a new DataFrame with scaled data
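
Min-max scaling maps each feature value x to (x - min) / (max - min). In the cell above the scaler is fitted once on the whole DataFrame, before any cross-validation split. A possible alternative, which is not what this notebook does, is to place the scaler inside a `Pipeline` so that it is refitted on the training part of every fold. The sketch below uses synthetic data and illustrative names.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix and the binary label
X_toy = rng.normal(size=(200, 4)) * [1.0, 10.0, 100.0, 1000.0]
y_toy = (X_toy[:, 0] > 0).astype(int)

# Manual min-max scaling agrees with MinMaxScaler
manual = (X_toy - X_toy.min(axis=0)) / (X_toy.max(axis=0) - X_toy.min(axis=0))
assert np.allclose(manual, MinMaxScaler().fit_transform(X_toy))

# The scaler is refitted on the training part of each fold only
model = Pipeline([
    ("scaler", MinMaxScaler()),
    ("clf", GaussianNB()),
])
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="precision")
print(scores.mean())
```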

In [None]:
  # 4 # Removing duplicate records

# Print the shape of the DataFrame 'df_scaled' before removing duplicates
print(df_scaled.shape)

# Remove duplicate rows from the DataFrame 'df_scaled' while resetting the index
df_scaled = df_scaled.drop_duplicates()
df_scaled.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df_scaled' after removing duplicates and resetting the index
print(df_scaled.shape)

(4365454, 13)
(4359644, 13)


In [None]:
# Print out the DataFrames loaded in the memory
%whos DataFrame

Variable                     Type         Data/Info
---------------------------------------------------
df                           DataFrame             duration  proto <...>365454 rows x 13 columns]
df_copy                      DataFrame             duration  proto <...>366157 rows x 15 columns]
df_scaled                    DataFrame                 duration  pr<...>359644 rows x 13 columns]
duplicate_records            DataFrame              duration  proto<...>\n[928 rows x 15 columns]
duplicated_feature_vectors   DataFrame              duration  proto<...>n[7381 rows x 12 columns]
feature_df                   DataFrame             duration  proto <...>366182 rows x 12 columns]
filtered_duplicate_records   DataFrame              duration  proto<...>\n[928 rows x 15 columns]
missing_values_as_empty      DataFrame             duration  proto <...>365454 rows x 15 columns]


In [None]:
try:
  del df_copy
  del duplicate_records
  del duplicated_feature_vectors
  del feature_df
  del filtered_duplicate_records
  del missing_values_as_empty
except NameError:  # some of the DataFrames were never created in this session
  pass

In [None]:
# Check if the Dataset is saved
if not os.path.exists(df_encoded_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df_scaled.to_csv(df_encoded_file_path, index=False)

## 2. Algorithm Evaluation

In this section, we assess the performance of various machine learning algorithms on the datasets mentioned above.

### 2.7. CIDDS-001 dataset evaluation with baseline and traditional ML algorithms

In this section, we use precision and F1 score as the main metrics of classification quality. We evaluate several machine learning algorithms, from baseline classifiers (Zero Rule and One Rule) and statistical techniques (Naive Bayes) to more advanced models such as Random Forest, using 10-fold cross-validation. Each algorithm is evaluated twice: on the 10 features chosen by feature selection and on all features of the preprocessed CIDDS-001 dataset.
These results offer valuable insights into the optimal dataset generation strategy, guiding the selection of effective feature extraction methods from raw data and helping determine the most suitable methodology for the specific dataset.
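
The evaluation protocol described above can be sketched compactly with `cross_validate`, which scores several metrics from a single set of fold fits. This is an illustrative alternative on synthetic data (`make_classification` with an assumed 95/5 class split), not the exact code used in the cells that follow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced binary problem standing in for CIDDS-001
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=12, weights=[0.05, 0.95], random_state=0
)

models = {
    "ZeroRule": DummyClassifier(strategy="most_frequent"),
    "Naive Bayes": GaussianNB(),
}

summary = {}
for name, model in models.items():
    # One call fits each fold once and scores both metrics
    cv = cross_validate(model, X_toy, y_toy, cv=10, scoring=["precision", "f1"])
    summary[name] = {
        "precision": np.mean(cv["test_precision"]),
        "f1": np.mean(cv["test_f1"]),
        "precision_variance": np.var(cv["test_precision"]),
    }
    print(name, summary[name])
```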

In [None]:
del df
df = df_scaled.copy()
del df_scaled

In [5]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('/content/drive/MyDrive/CIDDS-001-BM/cidds001_encoded.csv')

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

# Separate features (X) and labels (y)
X = df.drop('label', axis=1)  # Exclude the label column
y = df['label']

# Create a pipeline for feature selection on the preprocessed data
pipeline_10_features = Pipeline([
    ('selector_10', SelectKBest(score_func=f_classif, k=10))
])

# Fit and transform the data for 10 features
X_selected_10 = pipeline_10_features.fit_transform(X, y)

# Display the selected features
print(X_selected_10.shape)  # Check the shape of the selected 10 features

# Display the selected features
print("Selected 10 features:")
selected_feature_indices_10 = pipeline_10_features.named_steps['selector_10'].get_support(indices=True)
selected_features_10 = X.columns[selected_feature_indices_10]
print(selected_features_10)

(4359644, 10)
Selected 10 features:
Index(['duration', 'proto', 'packets', 'bytes', 'tcp_ack', 'tcp_psh',
       'tcp_rst', 'tcp_syn', 'tcp_fin', 'tos'],
      dtype='object')


  f = msb / msw
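
The stray `f = msb / msw` line appears to be the tail of a runtime warning raised inside `f_classif`. A likely cause is that `flows` and `tcp_urg` are constant in this dataset (their standard deviation is 0 in the `describe()` output), which makes the F-statistic undefined for them; both are absent from the 10 selected features. As a hedged sketch on toy data, constant columns can be removed with `VarianceThreshold` before feature selection:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

rng = np.random.default_rng(0)

# Toy frame: 'flows' and 'tcp_urg' are constant, as in the describe() output
X_toy = pd.DataFrame({
    "duration": rng.random(100),
    "packets": rng.integers(1, 20, size=100).astype(float),
    "flows": np.ones(100),
    "tcp_urg": np.zeros(100),
})
y_toy = (X_toy["duration"] > 0.5).astype(int)

# Drop zero-variance columns before computing ANOVA F-scores
vt = VarianceThreshold(threshold=0.0)
vt.fit(X_toy)
kept = X_toy.columns[vt.get_support()]
print("Kept features:", list(kept))

selector = SelectKBest(score_func=f_classif, k=2).fit(X_toy[kept], y_toy)
print("Selected:", list(kept[selector.get_support()]))
```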


In [7]:
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, mean_absolute_error, f1_score
from sklearn.dummy import DummyClassifier
from tabulate import tabulate
import time
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Define the number of desired folds for Cross-Validation (e.g., 10)
num_folds = 10

# Initialize performance metrics lists for 10 and all features
results_10_features = []
results_all_features = []

In [8]:
import os

# Define a file name for saving the results
results_file_name = os.path.join('/content/drive/MyDrive/CIDDS-001-BM/', "cidds001_results.pkl")

# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define ZeroRule classifier
  zero_rule = DummyClassifier(strategy="most_frequent")

  # Evaluate ZeroRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(zero_rule, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(zero_rule, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(zero_rule, X, y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(zero_rule, X, y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(zero_rule, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(zero_rule, X, y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display ZeroRule results for 10 features
  print("ZeroRule Precision (10 features):", np.mean(precision_scores_10))
  print("ZeroRule F1 Score (10 features):", np.mean(f1_scores_10))
  print("ZeroRule Variance (10 features):", variance_10)
  print("ZeroRule MAE (10 features):", mae_10)
  print("ZeroRule Execution Time:", elapsed_time_10)

  # Display ZeroRule results for all features
  print("ZeroRule Precision (all features):", np.mean(precision_scores_20))
  print("ZeroRule F1 Score (all features):", np.mean(f1_scores_20))
  print("ZeroRule Variance (all features):", variance_20)
  print("ZeroRule MAE (all features):", mae_20)
  print("ZeroRule Execution Time:", elapsed_time_20)

  results_10_features.append(["ZeroRule", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_all_features.append(["ZeroRule", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

ZeroRule Precision (10 features): 0.9537592977784136
ZeroRule F1 Score (10 features): 0.9763324467478158
ZeroRule Variance (10 features): 1.019999700573365e-12
ZeroRule MAE (10 features): 0.04624070222247505
ZeroRule Execution Time: 12.883915424346924
ZeroRule Precision (all features): 0.9537592977784136
ZeroRule F1 Score (all features): 0.9763324467478158
ZeroRule Variance (all features): 1.019999700573365e-12
ZeroRule MAE (all features): 0.04624070222247505
ZeroRule Execution Time: 13.03194785118103
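
These ZeroRule numbers follow directly from the class distribution. The positive class is `normal` (label 1.0), so a classifier that always predicts `normal` has a recall of 1 and a precision equal to the share of `normal` flows. The short check below, which uses only the precision value printed above, reproduces the reported F1 score and MAE:

```python
# Precision reported above for the always-"normal" baseline
precision = 0.9537592977784136
recall = 1.0  # every actual positive is predicted positive

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# With 0/1 labels, MAE equals the misclassification rate
mae = 1.0 - precision

print(f"F1  = {f1:.6f}")
print(f"MAE = {mae:.6f}")
```

A precision of about 0.95 is therefore the floor that any useful model has to beat on this dataset, not evidence of detection capability.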


In [9]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

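  # Define the "OneRule" baseline.
  # Note: DummyClassifier(strategy="stratified") predicts at random according to the
  # class distribution; it is not the classic OneR (single-feature rule) learner.
  one_rule = DummyClassifier(strategy="stratified")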

  # Evaluate OneRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(one_rule, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(one_rule, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(one_rule, X, y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(one_rule, X, y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(one_rule, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(one_rule, X, y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display OneRule results for 10 features
  print("OneRule Precision (10 features):", np.mean(precision_scores_10))
  print("OneRule F1 Score (10 features):", np.mean(f1_scores_10))
  print("OneRule Variance (10 features):", variance_10)
  print("OneRule MAE (10 features):", mae_10)
  print("OneRule Execution Time:", elapsed_time_10)

  # Display OneRule results for all features
  print("OneRule Precision (all features):", np.mean(precision_scores_20))
  print("OneRule F1 Score (all features):", np.mean(f1_scores_20))
  print("OneRule Variance (all features):", variance_20)
  print("OneRule MAE (all features):", mae_20)
  print("OneRule Execution Time:", elapsed_time_20)

  results_10_features.append(["OneRule", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_all_features.append(["OneRule", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

OneRule Precision (10 features): 0.9537773273918637
OneRule F1 Score (10 features): 0.9538478847474007
OneRule Variance (10 features): 4.413383010722671e-09
OneRule MAE (10 features): 0.0882918421779393
OneRule Execution Time: 13.098495483398438
OneRule Precision (all features): 0.9537679484558573
OneRule F1 Score (all features): 0.9537241920462541
OneRule Variance (all features): 3.1644697947857035e-09
OneRule MAE (all features): 0.08810535906142795
OneRule Execution Time: 13.448907136917114
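For comparison with the stratified dummy baseline above, a classic OneR learner picks the single feature whose value-to-class mapping makes the fewest training errors. The following is a minimal sketch only, assuming categorical features; the helper names `fit_one_rule` and `predict_one_rule` and the toy rows are hypothetical and are not part of this notebook. Continuous features such as those in X would first need to be binned.

```python
from collections import Counter, defaultdict


def fit_one_rule(rows, labels):
    """Return (feature_index, rule, default) for the single best feature.

    rows   : list of equal-length sequences of categorical (hashable) values
    labels : list of class labels, one per row
    """
    # Majority class, used as a fallback for feature values never seen in training
    default = Counter(labels).most_common(1)[0][0]
    best = None
    for j in range(len(rows[0])):
        # Count class frequencies for every distinct value of feature j
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[j]][label] += 1
        # Rule: each feature value predicts its most frequent class
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[j]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, j, rule)
    _, feature_index, rule = best
    return feature_index, rule, default


def predict_one_rule(model, rows):
    """Apply the single-feature rule, falling back to the majority class."""
    feature_index, rule, default = model
    return [rule.get(row[feature_index], default) for row in rows]


# Toy data, made up for illustration: columns are (protocol, flag)
toy_rows = [("tcp", "SF"), ("tcp", "S0"), ("udp", "SF"),
            ("tcp", "S0"), ("udp", "SF"), ("tcp", "SF")]
toy_labels = ["normal", "attack", "normal", "attack", "normal", "normal"]

model = fit_one_rule(toy_rows, toy_labels)
print("Selected feature index:", model[0])
print(predict_one_rule(model, [("udp", "S0"), ("tcp", "REJ")]))
```

On the toy data the second column separates the classes perfectly, so it is the feature selected; the unseen value "REJ" falls back to the majority class.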


In [10]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define Naive Bayes classifier
  naive_bayes = GaussianNB()

  # Evaluate Naive Bayes classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(naive_bayes, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(naive_bayes, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(naive_bayes, X, y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(naive_bayes, X, y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(naive_bayes, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(naive_bayes, X, y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display Naive Bayes results for 10 features
  print("Naive Bayes Precision (10 features):", np.mean(precision_scores_10))
  print("Naive Bayes F1 Score (10 features):", np.mean(f1_scores_10))
  print("Naive Bayes Variance (10 features):", variance_10)
  print("Naive Bayes MAE (10 features):", mae_10)
  print("Naive Bayes Execution Time:", elapsed_time_10)

  # Display Naive Bayes results for all features
  print("Naive Bayes Precision (all features):", np.mean(precision_scores_20))
  print("Naive Bayes F1 Score (all features):", np.mean(f1_scores_20))
  print("Naive Bayes Variance (all features):", variance_20)
  print("Naive Bayes MAE (all features):", mae_20)
  print("Naive Bayes Execution Time:", elapsed_time_20)

  results_10_features.append(["Naive Bayes", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_all_features.append(["Naive Bayes", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

Naive Bayes Precision (10 features): 0.9997120212972236
Naive Bayes F1 Score (10 features): 0.7879164635006519
Naive Bayes Variance (10 features): 2.004666101080834e-07
Naive Bayes MAE (10 features): 0.3312410371122046
Naive Bayes Execution Time: 29.050339698791504
Naive Bayes Precision (all features): 0.9997120212972236
Naive Bayes F1 Score (all features): 0.7879164635006519
Naive Bayes Variance (all features): 2.004666101080834e-07
Naive Bayes MAE (all features): 0.3312410371122046
Naive Bayes Execution Time: 30.547858238220215
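The gap between the Naive Bayes precision (about 0.9997) and F1 score (about 0.788) points to low recall. Because F1 = 2PR / (P + R), recall can be recovered as R = F1 * P / (2P - F1). The sketch below applies this to the values printed above; the function name is hypothetical, and since the identity holds exactly per fold but only approximately for fold averages, the result should be read as an estimate.

```python
def recall_from_precision_f1(precision, f1):
    """Solve F1 = 2PR / (P + R) for recall R."""
    return f1 * precision / (2 * precision - f1)


nb_precision = 0.9997120212972236  # Naive Bayes precision reported above
nb_f1 = 0.7879164635006519         # Naive Bayes F1 score reported above

nb_recall = recall_from_precision_f1(nb_precision, nb_f1)
print("Estimated Naive Bayes recall:", round(nb_recall, 3))
```

The estimate comes out at roughly 0.65, which suggests that the classifier misses about a third of the positive-class samples while almost never mislabeling a negative one as positive.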


In [12]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Create a Random Forest classifier; n_estimators and max_depth are fixed starting values that can be adjusted when tuning
  rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)  # n_jobs=-1 uses all available CPU cores

  # Evaluate Random Forest classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(rf_classifier, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(rf_classifier, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(rf_classifier, X, y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(rf_classifier, X, y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(rf_classifier, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(rf_classifier, X, y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display Random Forest results for 10 features
  print("Random Forest Precision (10 features):", np.mean(precision_scores_10))
  print("Random Forest F1 Score (10 features):", np.mean(f1_scores_10))
  print("Random Forest Variance (10 features):", variance_10)
  print("Random Forest MAE (10 features):", mae_10)
  print("Random Forest Execution Time:", elapsed_time_10)

  # Display Random Forest results for all features
  print("Random Forest Precision (all features):", np.mean(precision_scores_20))
  print("Random Forest F1 Score (all features):", np.mean(f1_scores_20))
  print("Random Forest Variance (all features):", variance_20)
  print("Random Forest MAE (all features):", mae_20)
  print("Random Forest Execution Time:", elapsed_time_20)

  results_10_features.append(["Random Forest", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_all_features.append(["Random Forest", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

Random Forest Precision (10 features): 0.9917123946520192
Random Forest F1 Score (10 features): 0.9931495419866885
Random Forest Variance (10 features): 6.16212472092088e-06
Random Forest MAE (10 features): 0.01292789044243062
Random Forest Execution Time: 1359.8828287124634
Random Forest Precision (all features): 0.991550803132497
Random Forest F1 Score (all features): 0.993007494486938
Random Forest Variance (all features): 5.994781715860138e-06
Random Forest MAE (all features): 0.013212317336002664
Random Forest Execution Time: 1249.062584400177
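Each cell above runs cross-validation three times per feature set: two `cross_val_score` calls and one `cross_val_predict` call. This dominates the Random Forest run time. One possible way to reduce it is scikit-learn's `cross_validate`, which computes several metrics in a single pass. The sketch below assumes synthetic data from `make_classification` as a stand-in for the notebook's X and y, and uses smaller, arbitrary hyperparameters to keep it quick.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; the notebook's X and y are not reproduced here
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Small forest so the example runs quickly (values chosen for illustration only)
clf = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)

start = time.time()
# One cross-validation pass that scores both metrics on every fold
cv_out = cross_validate(clf, X_demo, y_demo, cv=5, scoring=("precision", "f1"))
elapsed = time.time() - start

summary = {
    "precision": float(np.mean(cv_out["test_precision"])),
    "f1": float(np.mean(cv_out["test_f1"])),
    "precision_variance": float(np.var(cv_out["test_precision"])),
    "elapsed_seconds": elapsed,
}
print(summary)
```

Measured this way, the elapsed time covers one fit per fold instead of two, so timings would not be directly comparable with the figures reported above.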


In [17]:
import pickle
import os

if not os.path.exists(results_file_name):

  # Save the results lists to a file
  with open(results_file_name, 'wb') as file:
      results_dict = {
          'results_10_features': results_10_features,
          'results_all_features': results_all_features
      }
      pickle.dump(results_dict, file)


In [18]:
# Load the results from the file
with open(results_file_name, 'rb') as file:
    loaded_results = pickle.load(file)

# Access the loaded results lists
results_10_features = loaded_results['results_10_features']
results_all_features = loaded_results['results_all_features']


In [19]:
# Print the results in tabular format
headers_10 = ["Algorithm", "Precision (10 Features)", "F1 Score (10 Features)", "Variance (10 Features)", "MAE (10 Features)", "Execution Time"]
headers_all = ["Algorithm", "Precision (All Features)", "F1 Score (All Features)", "Variance (All Features)", "MAE (All Features)", "Execution Time"]

print(tabulate(results_10_features, headers_10, tablefmt="pretty"))
print(tabulate(results_all_features, headers_all, tablefmt="pretty"))

+---------------+-------------------------+------------------------+------------------------+---------------------+--------------------+
|   Algorithm   | Precision (10 Features) | F1 Score (10 Features) | Variance (10 Features) |  MAE (10 Features)  |   Execution Time   |
+---------------+-------------------------+------------------------+------------------------+---------------------+--------------------+
|   ZeroRule    |   0.9537592977784136    |   0.9763324467478158   | 1.019999700573365e-12  | 0.04624070222247505 | 12.883915424346924 |
|    OneRule    |   0.9537773273918637    |   0.9538478847474007   | 4.413383010722671e-09  | 0.0882918421779393  | 13.098495483398438 |
|  Naive Bayes  |   0.9997120212972236    |   0.7879164635006519   | 2.004666101080834e-07  | 0.3312410371122046  | 29.050339698791504 |
| Random Forest |   0.9917123946520192    |   0.9931495419866885   |  6.16212472092088e-06  | 0.01292789044243062 | 1359.8828287124634 |
+---------------+------------------------