# Intrusion Detection System (IDS) Public Datasets Benchmarking

In cybersecurity, the design, development, and implementation of effective Intrusion Detection Systems (IDS) are important for safeguarding IT&C infrastructures from unauthorized access, data breaches, and various forms of malicious activity. The selection of an appropriate ML/DL algorithm plays an essential role in ensuring the security and integrity of protected systems.

Before we can dive into the development of a cutting-edge algorithm, we need appropriate data, which must be studied and analysed in order to understand the reality and challenges of our ML problem. Following this approach, we chose to study the earlier datasets designed for IDS in order to derive lessons learned for future dataset development.

This experiment aims to comprehensively evaluate the performance of different ML and DL algorithms on a variety of datasets, encompassing a wide range of network traffic scenarios. The datasets used for this analysis include well-known benchmark datasets such as KDD, NSL-KDD, CTU-13, ISCXIDS2012, CIC-IDS2017, CSE-CIC-IDS2018, and Kyoto 2006+. Each dataset represents a distinct set of challenges and characteristics, making this evaluation both diverse and insightful.

The experiment is divided into three main phases:

1. **Data Acquisition and Preprocessing**:
 - In this phase, we acquire the selected datasets from reputable sources, ensuring the integrity and accuracy of the data.
 - Data preprocessing tasks include handling missing values, selecting the most relevant features using feature selection techniques, normalizing the data, and, if necessary, performing feature engineering to enhance the dataset's suitability for machine learning.

2. **Algorithm Evaluation**:
 - We evaluate the performance of a range of ML/DL algorithms on each dataset. The chosen algorithms include baseline methods like ZeroRule and OneRule, traditional machine learning approaches like Naive Bayes and Random Forest, as well as some of the most widely used deep learning algorithms for anomaly detection.
 - Cross-validation is applied to ensure the robustness of our results. Performance metrics such as precision, F1 score, variance, and Mean Absolute Error (MAE) are calculated for each algorithm and dataset.

3. **Results and Insights**:
 - The results of this evaluation provide valuable insights into the strengths and weaknesses of different IDS algorithms under various conditions.
 - We analyze the performance of algorithms on both the original datasets and balanced datasets to address the challenge of class imbalance in intrusion detection.
 - Observations and additional details regarding the algorithms' performance are documented, providing a comprehensive overview of their behavior.

By conducting this experiment, we aim to contribute to the understanding of cyber domain dataset generation. The findings will assist in making informed decisions when developing a cybersecurity AI application, by deriving the necessary steps and procedures for selecting appropriate training data.

The following sections of this Jupyter notebook will provide a detailed walkthrough of the experiment, including code snippets, visualizations, and discussions of the results.

In [1]:
# Mount your Google Drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
import warnings
from google.colab import files

# Suppress all warning messages
warnings.filterwarnings("ignore")

# Check if the Kaggle API credentials file already exists
kaggle_credentials_path = os.path.expanduser("~/.kaggle/kaggle.json")

if not os.path.exists(kaggle_credentials_path):

    if not os.path.exists(os.path.join("/content/drive/MyDrive/.kaggle/", "kaggle.json")):

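      # Upload your Kaggle API credentials file (kaggle.json)
      files.upload()

      # Keep a copy on Google Drive so later sessions can skip the upload
      !mkdir -p "/content/drive/MyDrive/.kaggle/"
      !mv kaggle.json "/content/drive/MyDrive/.kaggle/"
      !chmod 600 "/content/drive/MyDrive/.kaggle/kaggle.json"

    # Copy the Kaggle API credentials file to the location the Kaggle CLI expects
    !mkdir -p ~/.kaggle
    !cp '/content/drive/MyDrive/.kaggle/kaggle.json' ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json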

else:

    print("Kaggle API credentials file already exists.")

In [3]:
import tensorflow as tf
# tf.test.is_gpu_available() is deprecated; list the physical GPU devices instead
print("GPU available:", bool(tf.config.list_physical_devices('GPU')))
print("GPU device name:", tf.test.gpu_device_name())

GPU available: True
GPU device name: /device:GPU:0


In [4]:
import os
from psutil import virtual_memory
from tabulate import tabulate

# Function to get CPU information
def get_cpu_info():
    cpu_info = os.popen('lscpu').read()
    return cpu_info

# Function to get RAM information
def get_ram_info():
    ram = virtual_memory()
    total_ram = f"{ram.total / 1e9:.2f} GB"
    available_ram = f"{ram.available / 1e9:.2f} GB"
    return total_ram, available_ram

# Function to get GPU information
def get_gpu_info():
    # Execute nvidia-smi and get its output
    gpu_info = os.popen('nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv,noheader,nounits').read().strip()

    # Split the output to get individual GPU details
    details = gpu_info.split(", ")

    # Return GPU name, total, used, and free memory
    return details[0], f"{details[1]} MB", f"{details[2]} MB", f"{details[3]} MB"

# Collect system information
cpu_info = get_cpu_info()
total_ram, available_ram = get_ram_info()
try:
    gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = get_gpu_info()
except Exception:
    # nvidia-smi is missing or returned nothing (e.g. a CPU-only runtime)
    gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = 'null', 0, 0, 0

# Extract relevant CPU information
cpu_type = ""
cpu_architecture = ""

for line in cpu_info.splitlines():
    if "Model name:" in line:
        cpu_type = line.split(":")[1].strip()
    elif "Architecture:" in line:
        cpu_architecture = line.split(":")[1].strip()

# Create a table
table = [
    ["CPU Type", cpu_type],
    ["CPU Architecture", cpu_architecture],
    ["Total RAM", total_ram],
    ["Available RAM", available_ram],
    ["GPU Name", gpu_name],
    ["GPU Total Memory", gpu_total_memory],
    ["GPU Used Memory", gpu_used_memory],
    ["GPU Free Memory", gpu_free_memory]
]

# Display the table
print(tabulate(table, headers=["Characteristic", "Value"], tablefmt="pretty"))


+------------------+--------------------------------+
|  Characteristic  |             Value              |
+------------------+--------------------------------+
|     CPU Type     | Intel(R) Xeon(R) CPU @ 2.00GHz |
| CPU Architecture |             x86_64             |
|    Total RAM     |            54.76 GB            |
|  Available RAM   |            52.13 GB            |
|     GPU Name     |            Tesla T4            |
| GPU Total Memory |            15360 MB            |
| GPU Used Memory  |             359 MB             |
| GPU Free Memory  |            14742 MB            |
+------------------+--------------------------------+


## 1. Data Acquisition and Preprocessing

In this section, we focus on acquiring the above-mentioned datasets.

### 1.2. NSL-KDD dataset
The NSL-KDD dataset is a refined version of the original KDD dataset and serves as a widely used benchmark for assessing Intrusion Detection Systems (IDS). It addresses shortcomings of the original KDD dataset, notably its large number of redundant records, by providing a more balanced, less redundant, and better-organized dataset that allows fairer and more accurate evaluations of intrusion detection systems.

### Download and Unzip NSL-KDD dataset

In [5]:
import os
import pandas as pd

# Specify the dataset name
dataset_name = "hassan06/nslkdd"

# Specify the destination folder in your Google Drive
destination_folder = "/content/drive/MyDrive/NSL-KDD-BM"

# Check if the dataset file already exists in your Google Drive
dataset_file_path = os.path.join(destination_folder, "nslkdd.zip")

if not os.path.exists(dataset_file_path):

  # Download the dataset and save it to your Google Drive
  !kaggle datasets download -d $dataset_name -p $destination_folder

  # Unzip the downloaded dataset
  import zipfile
  with zipfile.ZipFile(f"{destination_folder}/nslkdd.zip", "r") as zip_ref:
      zip_ref.extractall(destination_folder)

  print("Download complete.")

else:
  print("Dataset already exists. Skipping download.")

# Remove the downloaded zip file (optional)
#!rm -f {destination_folder}/nslkdd.zip

Downloading nslkdd.zip to /content/drive/MyDrive/NSL-KDD-BM
 94% 13.0M/13.9M [00:01<00:00, 14.5MB/s]
100% 13.9M/13.9M [00:01<00:00, 7.80MB/s]
Download complete.


In [6]:
!ls -ahl '/content/drive/MyDrive/NSL-KDD-BM'

total 68M
-rw------- 1 root root  33K Oct 10 14:27 index.html
-rw------- 1 root root 8.5K Oct 10 14:27 KDDTest1.jpg
-rw------- 1 root root 1.7M Oct 10 14:27 KDDTest-21.arff
-rw------- 1 root root 1.8M Oct 10 14:27 KDDTest-21.txt
-rw------- 1 root root 3.3M Oct 10 14:27 KDDTest+.arff
-rw------- 1 root root 3.3M Oct 10 14:27 KDDTest+.txt
-rw------- 1 root root 8.4K Oct 10 14:27 KDDTrain1.jpg
-rw------- 1 root root 3.6M Oct 10 14:27 KDDTrain+_20Percent.arff
-rw------- 1 root root 3.7M Oct 10 14:27 KDDTrain+_20Percent.txt
-rw------- 1 root root  18M Oct 10 14:27 KDDTrain+.arff
-rw------- 1 root root  19M Oct 10 14:27 KDDTrain+.txt
drwx------ 2 root root 4.0K Oct 10 14:27 nsl-kdd
-rw------- 1 root root  14M Oct 22  2019 nslkdd.zip


In [7]:
# Check if nsl-kdd processed dataframe is in location
nsl_df_file_path = os.path.join(destination_folder, "nsl_kdd.csv")
if os.path.exists(nsl_df_file_path):
  nsl_df = pd.read_csv(nsl_df_file_path)
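else:
  # KDDTrain+.txt has no header row. Called without header=None, read_csv consumes
  # the first record as column names, so the outputs below contain one record fewer
  # than the file. Pass header=None (and names=columns) to keep every record.
  nsl_df = pd.read_csv('/content/drive/MyDrive/NSL-KDD-BM/KDDTrain+.txt')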

  # add the column labels
  columns = (['duration'
  ,'protocol_type'
  ,'service'
  ,'flag'
  ,'src_bytes'
  ,'dst_bytes'
  ,'land'
  ,'wrong_fragment'
  ,'urgent'
  ,'hot'
  ,'num_failed_logins'
  ,'logged_in'
  ,'num_compromised'
  ,'root_shell'
  ,'su_attempted'
  ,'num_root'
  ,'num_file_creations'
  ,'num_shells'
  ,'num_access_files'
  ,'num_outbound_cmds'
  ,'is_host_login'
  ,'is_guest_login'
  ,'count'
  ,'srv_count'
  ,'serror_rate'
  ,'srv_serror_rate'
  ,'rerror_rate'
  ,'srv_rerror_rate'
  ,'same_srv_rate'
  ,'diff_srv_rate'
  ,'srv_diff_host_rate'
  ,'dst_host_count'
  ,'dst_host_srv_count'
  ,'dst_host_same_srv_rate'
  ,'dst_host_diff_srv_rate'
  ,'dst_host_same_src_port_rate'
  ,'dst_host_srv_diff_host_rate'
  ,'dst_host_serror_rate'
  ,'dst_host_srv_serror_rate'
  ,'dst_host_rerror_rate'
  ,'dst_host_srv_rerror_rate'
  ,'Label'
  ,'level'])

  nsl_df.columns = columns

# Now, 'nsl_df' contains NSL-KDD data in a Pandas DataFrame
nsl_df

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,Label,level
0,0,udp,other,SF,146,0,0,0,0,0,...,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal,15
1,0,tcp,private,S0,0,0,0,0,0,0,...,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,neptune,19
2,0,tcp,http,SF,232,8153,0,0,0,0,...,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal,21
3,0,tcp,http,SF,199,420,0,0,0,0,...,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal,21
4,0,tcp,private,REJ,0,0,0,0,0,0,...,0.07,0.07,0.00,0.00,0.00,0.00,1.00,1.00,neptune,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125967,0,tcp,private,S0,0,0,0,0,0,0,...,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,neptune,20
125968,8,udp,private,SF,105,145,0,0,0,0,...,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal,21
125969,0,tcp,smtp,SF,2231,384,0,0,0,0,...,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,normal,18
125970,0,tcp,klogin,S0,0,0,0,0,0,0,...,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,neptune,20
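
The raw NSL-KDD text files come without a header row, so the way `read_csv` is called decides whether the first record survives. The sketch below illustrates this on an in-memory sample; the two rows and the shortened column list are illustrative assumptions, not the real 43-column schema.

```python
import io
import pandas as pd

# Two illustrative records in the headerless, comma-separated layout
raw = "0,tcp,ftp_data,SF,491,normal\n0,udp,other,SF,146,normal\n"
names = ["duration", "protocol_type", "service", "flag", "src_bytes", "Label"]

# Default call: the first record is consumed as the header
df_default = pd.read_csv(io.StringIO(raw))

# header=None together with names=...: every record is kept as data
df_named = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(len(df_default), len(df_named))
```

Passing `names` at read time also removes the need to assign `nsl_df.columns` afterwards.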


In [8]:
# Information about the starting NSL-KDD DataFrame
nsl_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125972 entries, 0 to 125971
Data columns (total 43 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   duration                     125972 non-null  int64  
 1   protocol_type                125972 non-null  object 
 2   service                      125972 non-null  object 
 3   flag                         125972 non-null  object 
 4   src_bytes                    125972 non-null  int64  
 5   dst_bytes                    125972 non-null  int64  
 6   land                         125972 non-null  int64  
 7   wrong_fragment               125972 non-null  int64  
 8   urgent                       125972 non-null  int64  
 9   hot                          125972 non-null  int64  
 10  num_failed_logins            125972 non-null  int64  
 11  logged_in                    125972 non-null  int64  
 12  num_compromised              125972 non-null  int64  
 13 

In [9]:
# Some basic statistical details like percentile, mean, std, etc. of the starting NSL-KDD DataFrame
nsl_df.describe()

Unnamed: 0,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,level
count,125972.0,125972.0,125972.0,125972.0,125972.0,125972.0,125972.0,125972.0,125972.0,125972.0,...,125972.0,125972.0,125972.0,125972.0,125972.0,125972.0,125972.0,125972.0,125972.0,125972.0
mean,287.146929,45567.1,19779.27,0.000198,0.022688,0.000111,0.204411,0.001222,0.395739,0.279253,...,115.653725,0.521244,0.082952,0.148379,0.032543,0.284455,0.278487,0.118832,0.120241,19.504056
std,2604.525522,5870354.0,4021285.0,0.014086,0.253531,0.014366,2.149977,0.045239,0.489011,23.942137,...,110.702886,0.44895,0.188922,0.308998,0.112564,0.444785,0.44567,0.306559,0.31946,2.291512
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18.0
50%,0.0,44.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,63.0,0.51,0.02,0.0,0.0,0.0,0.0,0.0,0.0,20.0
75%,0.0,276.0,516.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,255.0,1.0,0.07,0.06,0.02,1.0,1.0,0.0,0.0,21.0
max,42908.0,1379964000.0,1309937000.0,1.0,3.0,3.0,77.0,5.0,1.0,7479.0,...,255.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,21.0


### Preprocessing of the NSL-KDD dataset

In [10]:
# Import required libraries
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [11]:
# If the dataset has not been preprocessed yet, apply:
  # 1 # Handling Missing Values
  # 2 # Normalization (Min-Max Scaling)
  # 3 # Encode Categorical Features and Label
  # 4 # Removing duplicate records

import numpy as np  # needed for np.nan below

nsl_df_encoded_file_path = os.path.join(destination_folder, "nsl_kdd_encoded.csv")
if not os.path.exists(nsl_df_encoded_file_path):

  # Step 1: Handling Missing Values

  # Count missing values stored as NaN
  check_nan = nsl_df.isna().sum().sum()

  # Flag missing values stored as empty strings (",,")
  missing_values_as_empty = nsl_df.eq('')

  # Count the empty-string values in each column
  missing_values_count = missing_values_as_empty.sum()

  # True if at least one column contains an empty-string value
  check_null = (missing_values_count != 0).any()

  # Replace empty values with NaN
  if check_null:
    nsl_df.replace("", np.nan, inplace=True)

  # Impute missing values with the most frequent value for categorical columns and mean for numerical columns
  if check_null or check_nan != 0:
    imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
    for col in nsl_df.columns:
        if nsl_df[col].dtype == 'object':
            # fit_transform returns a 2-D array; flatten it back to one column
            nsl_df[col] = imputer.fit_transform(nsl_df[[col]]).ravel()
        else:
            nsl_df[col] = nsl_df[col].fillna(nsl_df[col].mean())
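
The per-column imputation loop can also be expressed with `ColumnTransformer`, which the notebook imports but does not use. The following is a minimal sketch on a toy frame (the values and the two-column layout are assumptions for illustration), not the preprocessing actually applied to NSL-KDD here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frame with one gap per column type (values are illustrative)
toy = pd.DataFrame({
    "src_bytes": [100.0, np.nan, 300.0],
    "protocol_type": pd.Series(["tcp", "tcp", np.nan], dtype=object),
})

numeric_cols = ["src_bytes"]
categorical_cols = ["protocol_type"]

imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric_cols),
    ("cat", SimpleImputer(strategy="most_frequent"), categorical_cols),
])

# Output columns follow the transformer order: numeric first, then categorical
imputed = pd.DataFrame(imputer.fit_transform(toy),
                       columns=numeric_cols + categorical_cols)
print(imputed)
```

Because the fitted transformer can be reused with `transform`, the same imputation statistics could later be applied to a held-out test file.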

In [12]:
# Run only if the dataset has not been preprocessed yet
if not os.path.exists(nsl_df_encoded_file_path):

  # Step 2: Normalization (Min-Max Scaling)

  columns = [col for col in nsl_df.columns if col not in ['protocol_type','service','flag','Label']]
  min_max_scaler = MinMaxScaler().fit(nsl_df[columns])
  nsl_df[columns] = min_max_scaler.transform(nsl_df[columns])

display(nsl_df.head())

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,Label,level
0,0.0,udp,other,SF,1.057999e-07,0.0,0.0,0.0,0.0,0.0,...,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal,0.714286
1,0.0,tcp,private,S0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune,0.904762
2,0.0,tcp,http,SF,1.681203e-07,6.223962e-06,0.0,0.0,0.0,0.0,...,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal,1.0
3,0.0,tcp,http,SF,1.442067e-07,3.20626e-07,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal,1.0
4,0.0,tcp,private,REJ,0.0,0.0,0.0,0.0,0.0,0.0,...,0.07,0.07,0.0,0.0,0.0,0.0,1.0,1.0,neptune,1.0
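
Min-max scaling maps each value to `(x - min) / (max - min)` per column. For `level`, whose range in the `describe()` output is 0 to 21, the first record's value of 15 becomes 15 / 21 ≈ 0.714286, which matches the table above. A small check of that arithmetic, using a hand-made column as an assumption:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hand-made 'level' column covering the observed range 0..21
level = np.array([[0.0], [15.0], [19.0], [21.0]])

scaled = MinMaxScaler().fit_transform(level)

# Manual formula: (x - min) / (max - min)
manual = (level - level.min()) / (level.max() - level.min())

print(scaled.ravel().round(6))
```

The same formula explains the very small `src_bytes` values: a few extreme maxima compress almost all records towards 0.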


In [13]:
# Run only if the dataset has not been preprocessed yet
if not os.path.exists(nsl_df_encoded_file_path):

  # Step 3: Encode Categorical Features and Label

  # Create a copy of the scaled DataFrame
  nsl_df_encoded = nsl_df.copy()

  # Identify categorical columns (non-numeric)
  categorical_columns = nsl_df_encoded.select_dtypes(include=['object']).columns.tolist()
  categorical_columns.remove('Label')

  # Encode the Label with 0 value for normal and 1 for the rest of the attacks
  nsl_df_encoded['Label'] =  nsl_df_encoded['Label'].apply(lambda x: 0 if x == 'normal' else 1)

  # Encode categorical columns using one-hot encoding (get_dummies)
  nsl_df_encoded = pd.get_dummies(nsl_df_encoded, columns=categorical_columns)

  # Now, nsl_df_encoded contains the encoded categorical features and the binary label

  unique_labels = nsl_df_encoded['Label'].unique()

  # Print the unique labels
  for label in unique_labels:
      print(label)
else:
  nsl_df_encoded = pd.read_csv(nsl_df_encoded_file_path)

0
1
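
One-hot encoding replaces each categorical column with one indicator column per distinct value, which is why the encoded frame grows from 43 to 124 columns. The toy example below (column values are illustrative assumptions) shows the same two steps, label binarization and `get_dummies`, on three rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp"],
    "flag": ["SF", "S0", "REJ"],
    "src_bytes": [0.1, 0.0, 0.3],
    "Label": ["normal", "neptune", "satan"],
})

# Binary target: 0 for normal traffic, 1 for any attack
toy["Label"] = (toy["Label"] != "normal").astype(int)

# One indicator column per category value
encoded = pd.get_dummies(toy, columns=["protocol_type", "flag"])
print(encoded.columns.tolist())
```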


In [14]:
# Step 4: Removing duplicate records

# Shape of 'nsl_df' before removing duplicates
print(nsl_df.shape)

# Remove duplicate rows from 'nsl_df' and reset the index
nsl_df = nsl_df.drop_duplicates()
nsl_df.reset_index(inplace=True, drop=True)

# Shape of 'nsl_df' after removing duplicates
print(nsl_df.shape)


# Shape of 'nsl_df_encoded' before removing duplicates
print(nsl_df_encoded.shape)

# Remove duplicate rows from 'nsl_df_encoded' and reset the index
nsl_df_encoded = nsl_df_encoded.drop_duplicates()
nsl_df_encoded.reset_index(inplace=True, drop=True)

# Shape of 'nsl_df_encoded' after removing duplicates
print(nsl_df_encoded.shape)

(125972, 43)
(125972, 43)
(125972, 124)
(125963, 124)
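
The shapes above show no duplicates in `nsl_df` but nine in `nsl_df_encoded` (125972 → 125963). One plausible explanation, not verified against the data here, is that records identical in every feature but carrying different attack names become identical once the label is binarized. A toy illustration of that effect, with made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "src_bytes": [0.0, 0.0, 0.5],
    "flag": ["S0", "S0", "SF"],
    "Label": ["neptune", "satan", "normal"],
})

# With the original attack names, the first two rows still differ
dups_before = int(toy.duplicated().sum())

# After binarization both rows carry Label == 1 and collapse into a duplicate
binary = toy.copy()
binary["Label"] = (binary["Label"] != "normal").astype(int)
dups_after = int(binary.duplicated().sum())

print(dups_before, dups_after)
```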


In [15]:
# Print out the DataFrames loaded in the memory
%whos DataFrame

Variable                  Type         Data/Info
------------------------------------------------
missing_values_as_empty   DataFrame            duration  protoco<...>125972 rows x 43 columns]
nsl_df                    DataFrame            duration protocol<...>125972 rows x 43 columns]
nsl_df_encoded            DataFrame            duration     src_<...>25963 rows x 124 columns]


In [16]:
# Delete missing_values_as_empty DataFrame from memory resulted from step 1 of preprocessing

for var_name in list(globals().keys()):
    if isinstance(globals()[var_name], pd.DataFrame) and var_name.startswith("missing_values_as_empty"):
        del globals()[var_name]

In [17]:
# Step 5: Feature Engineering (if necessary)
# You can perform additional feature engineering as needed based on your analysis

# Display the preprocessed dataset (X) and target (y)
#print(pd.DataFrame(X, columns=column_names[:-1]).head())
#print(y.head())

In [18]:
# Count the occurrences of each label in the original DataFrame 'nsl_df'
label_counts_df = nsl_df['Label'].value_counts()

# Print the label counts in 'nsl_df'
print("Label counts in df:")
print(label_counts_df)

# Count the occurrences of each label in the encoded DataFrame 'nsl_df_encoded'
label_counts_encoded = nsl_df_encoded['Label'].value_counts()

# Print the label counts in 'nsl_df_encoded'
print("Label counts in df_encoded:")
print(label_counts_encoded)

Label counts in df:
normal             67342
neptune            41214
satan               3633
ipsweep             3599
portsweep           2931
smurf               2646
nmap                1493
back                 956
teardrop             892
warezclient          890
pod                  201
guess_passwd          53
buffer_overflow       30
warezmaster           20
land                  18
imap                  11
rootkit               10
loadmodule             9
ftp_write              8
multihop               7
phf                    4
perl                   3
spy                    2
Name: Label, dtype: int64
Label counts in df_encoded:
0    67342
1    58621
Name: Label, dtype: int64


In [21]:
# Save each DataFrame to a CSV file unless that file already exists
if not os.path.exists(nsl_df_file_path):
  nsl_df.to_csv(nsl_df_file_path, index=False)
if not os.path.exists(nsl_df_encoded_file_path):
  nsl_df_encoded.to_csv(nsl_df_encoded_file_path, index=False)

## 2. Algorithm Evaluation

In this section, we assess the performance of various machine learning algorithms on the above-mentioned datasets.

### 2.2. NSL-KDD dataset evaluation with baseline and traditional ML algorithms

In this section we measure precision and F1 score for baseline classifiers (Zero Rule and One Rule), a statistical method (Naive Bayes), and a more complex model (Random Forest) on the NSL-KDD dataset. Each algorithm is evaluated with 10-fold cross-validation on both the 10 and the 20 features selected from the dataset; the variance of the per-fold precision, the MAE, and the execution time are recorded as well.
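
A variant worth considering is to put the feature selector and the classifier into one pipeline, so that `SelectKBest` is refit on the training part of every fold instead of once on the full dataset. This is a sketch of that alternative on synthetic data (`X_demo`, `y_demo`, and the sample sizes are assumptions); it is not how the cells in this notebook are organized.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the encoded NSL-KDD matrix (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=30,
                                     n_informative=8, random_state=0)

# Selector and classifier share one pipeline, so feature selection
# only ever sees the training part of each fold
model = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", GaussianNB()),
])

scores = cross_validate(model, X_demo, y_demo, cv=10,
                        scoring=["precision", "f1"])
print(scores["test_precision"].mean(), scores["test_f1"].mean())
```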

In [22]:
# Separate features (X) and labels (y)
X = nsl_df_encoded.drop('Label', axis=1)  # Exclude the label column
y = nsl_df_encoded['Label']

# Create a pipeline for feature selection on the preprocessed data
pipeline_10_features = Pipeline([
    ('selector_10', SelectKBest(score_func=f_classif, k=10))
])

pipeline_20_features = Pipeline([
    ('selector_20', SelectKBest(score_func=f_classif, k=20))
])

# Fit and transform the data for 10 and 20 features
X_selected_10 = pipeline_10_features.fit_transform(X, y)
X_selected_20 = pipeline_20_features.fit_transform(X, y)

# Display the shapes of the reduced feature matrices
print(X_selected_10.shape)  # Check the shape of the selected 10 features
print(X_selected_20.shape)  # Check the shape of the selected 20 features

# Display the selected features
print("Selected 10 features:")
selected_feature_indices_10 = pipeline_10_features.named_steps['selector_10'].get_support(indices=True)
selected_features_10 = X.columns[selected_feature_indices_10]
print(selected_features_10)

print("\nSelected 20 features:")
selected_feature_indices_20 = pipeline_20_features.named_steps['selector_20'].get_support(indices=True)
selected_features_20 = X.columns[selected_feature_indices_20]
print(selected_features_20)

(125963, 10)
(125963, 20)
Selected 10 features:
Index(['logged_in', 'serror_rate', 'srv_serror_rate', 'same_srv_rate',
       'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_serror_rate',
       'dst_host_srv_serror_rate', 'flag_S0', 'flag_SF'],
      dtype='object')

Selected 20 features:
Index(['logged_in', 'count', 'serror_rate', 'srv_serror_rate', 'rerror_rate',
       'srv_rerror_rate', 'same_srv_rate', 'dst_host_count',
       'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_serror_rate',
       'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
       'dst_host_srv_rerror_rate', 'level', 'service_domain_u', 'service_http',
       'service_private', 'flag_S0', 'flag_SF'],
      dtype='object')
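
`SelectKBest` with `f_classif` scores every feature by its ANOVA F-statistic against the label and keeps the `k` highest-scoring ones. The sketch below, on synthetic data rather than NSL-KDD (an assumption made to keep it self-contained), shows how the scores behind a selection can be inspected.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = make_classification(n_samples=400, n_features=12,
                                     n_informative=4, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)

# Indices of the kept features and of the 5 largest F-scores
kept = selector.get_support(indices=True)
top5 = np.sort(np.argsort(selector.scores_)[-5:])
print(kept, top5)
```

In the notebook, the same attribute is available as `pipeline_10_features.named_steps['selector_10'].scores_`.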


In [23]:
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, mean_absolute_error, f1_score
from sklearn.dummy import DummyClassifier
from tabulate import tabulate
import time
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Define the number of desired folds for Cross-Validation (e.g., 10)
num_folds = 10

# Initialize performance metrics lists for 10 and 20 features
results_10_features = []
results_20_features = []

In [24]:
label_counts = nsl_df_encoded['Label'].value_counts()
label_counts

0    67342
1    58621
Name: Label, dtype: int64

In [25]:
# Define a file name for saving the results
results_file_name = os.path.join(destination_folder, "nsl_kdd_results.pkl")

# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define ZeroRule classifier
  zero_rule = DummyClassifier(strategy="most_frequent")

  # Evaluate ZeroRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(zero_rule, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(zero_rule, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(zero_rule, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(zero_rule, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(zero_rule, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(zero_rule, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display ZeroRule results for 10 features
  print("ZeroRule Precision (10 features):", np.mean(precision_scores_10))
  print("ZeroRule F1 Score (10 features):", np.mean(f1_scores_10))
  print("ZeroRule Variance (10 features):", variance_10)
  print("ZeroRule MAE (10 features):", mae_10)
  print("ZeroRule Execution Time:", elapsed_time_10)

  # Display ZeroRule results for 20 features
  print("ZeroRule Precision (20 features):", np.mean(precision_scores_20))
  print("ZeroRule F1 Score (20 features):", np.mean(f1_scores_20))
  print("ZeroRule Variance (20 features):", variance_20)
  print("ZeroRule MAE (20 features):", mae_20)
  print("ZeroRule Execution Time:", elapsed_time_20)

  results_10_features.append(["ZeroRule", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["ZeroRule", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

ZeroRule Precision (10 features): 0.0
ZeroRule F1 Score (10 features): 0.0
ZeroRule Variance (10 features): 0.0
ZeroRule MAE (10 features): 0.4653826917428134
ZeroRule Execution Time: 0.33173060417175293
ZeroRule Precision (20 features): 0.0
ZeroRule F1 Score (20 features): 0.0
ZeroRule Variance (20 features): 0.0
ZeroRule MAE (20 features): 0.4653826917428134
ZeroRule Execution Time: 0.4054434299468994


In [26]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

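  # Define the OneRule baseline.
  # Note: DummyClassifier(strategy="stratified") predicts at random according to the
  # training class distribution; it is a stand-in here, not a true OneR rule learner.
  one_rule = DummyClassifier(strategy="stratified")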

  # Evaluate OneRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(one_rule, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(one_rule, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(one_rule, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(one_rule, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(one_rule, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(one_rule, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display OneRule results for 10 features
  print("OneRule Precision (10 features):", np.mean(precision_scores_10))
  print("OneRule F1 Score (10 features):", np.mean(f1_scores_10))
  print("OneRule Variance (10 features):", variance_10)
  print("OneRule MAE (10 features):", mae_10)
  print("OneRule Execution Time:", elapsed_time_10)

  # Display OneRule results for 20 features
  print("OneRule Precision (20 features):", np.mean(precision_scores_20))
  print("OneRule F1 Score (20 features):", np.mean(f1_scores_20))
  print("OneRule Variance (20 features):", variance_20)
  print("OneRule MAE (20 features):", mae_20)
  print("OneRule Execution Time:", elapsed_time_20)

  results_10_features.append(["OneRule", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["OneRule", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

OneRule Precision (10 features): 0.4643182608755576
OneRule F1 Score (10 features): 0.4676192412419926
OneRule Variance (10 features): 2.8877747352643335e-05
OneRule MAE (10 features): 0.49847177345728505
OneRule Execution Time: 0.35054945945739746
OneRule Precision (20 features): 0.4671072998119078
OneRule F1 Score (20 features): 0.46345174992136096
OneRule Variance (20 features): 1.0105616426021796e-05
OneRule MAE (20 features): 0.4969713328517104
OneRule Execution Time: 0.41045117378234863
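
`DummyClassifier(strategy="stratified")` predicts at random according to the class frequencies, which is why its precision sits near the 46.5% attack share. A classic OneR classifier instead derives its rule from a single feature. scikit-learn does not ship a OneR estimator; a depth-1 decision tree (a decision stump) is a close stand-in, sketched here on synthetic data as an assumption rather than as the method used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, random_state=0)

# max_depth=1 yields a single threshold test on one feature
stump = DecisionTreeClassifier(max_depth=1, random_state=0)

precision = cross_val_score(stump, X_demo, y_demo, cv=10, scoring="precision")
print(precision.mean())

# Inspect which feature the single rule uses
stump.fit(X_demo, y_demo)
print("feature used by the rule:", stump.tree_.feature[0])
```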


In [27]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define Naive Bayes classifier
  naive_bayes = GaussianNB()

  # Evaluate Naive Bayes on the 10-feature subset; the timer spans both
  # cross-validation runs (precision, then F1), so the model is fitted twice per fold
  start_time = time.time()
  precision_scores_10 = cross_val_score(naive_bayes, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(naive_bayes, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time

  # Repeat the evaluation on the 20-feature subset
  start_time = time.time()
  precision_scores_20 = cross_val_score(naive_bayes, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(naive_bayes, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time

  # Variance of the per-fold precision scores (stability across folds)
  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  # MAE on out-of-fold predictions; for 0/1 labels this equals the misclassification rate.
  # These extra cross-validation runs are not included in the execution time.
  predictions_10 = cross_val_predict(naive_bayes, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(naive_bayes, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display Naive Bayes results for 10 features
  print("Naive Bayes Precision (10 features):", np.mean(precision_scores_10))
  print("Naive Bayes F1 Score (10 features):", np.mean(f1_scores_10))
  print("Naive Bayes Variance (10 features):", variance_10)
  print("Naive Bayes MAE (10 features):", mae_10)
  print("Naive Bayes Execution Time:", elapsed_time_10)

  # Display Naive Bayes results for 20 features
  print("Naive Bayes Precision (20 features):", np.mean(precision_scores_20))
  print("Naive Bayes F1 Score (20 features):", np.mean(f1_scores_20))
  print("Naive Bayes Variance (20 features):", variance_20)
  print("Naive Bayes MAE (20 features):", mae_20)
  print("Naive Bayes Execution Time:", elapsed_time_20)

  results_10_features.append(["Naive Bayes", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["Naive Bayes", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

Naive Bayes Precision (10 features): 0.9483081080571318
Naive Bayes F1 Score (10 features): 0.8513815314023855
Naive Bayes Variance (10 features): 1.0650856542257133e-05
Naive Bayes MAE (10 features): 0.12549716980383127
Naive Bayes Execution Time: 0.871079683303833
Naive Bayes Precision (20 features): 0.9021495902238075
Naive Bayes F1 Score (20 features): 0.8914635876869642
Naive Bayes Variance (20 features): 3.2140929322479728e-06
Naive Bayes MAE (20 features): 0.09983884156458643
Naive Bayes Execution Time: 1.361576795578003
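
The cell above runs cross-validation once per metric. As a minimal sketch of an alternative, scikit-learn's `cross_validate` can compute several metrics in a single pass and also reports per-fold fit and score times. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `cv_results` are assumptions for illustration only; they stand in for the notebook's `X`, `y`, and feature subsets, and the numbers produced will not match the results reported here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): the notebook's X and y are not reproduced here.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# A single cross-validation pass scores both metrics and records per-fold timings.
cv_results = cross_validate(
    GaussianNB(),
    X_demo,
    y_demo,
    cv=5,
    scoring=["precision", "f1"],
)

mean_precision = np.mean(cv_results["test_precision"])
mean_f1 = np.mean(cv_results["test_f1"])
precision_variance = np.var(cv_results["test_precision"])
# Total time spent fitting and scoring across all folds, in seconds.
total_time = np.sum(cv_results["fit_time"]) + np.sum(cv_results["score_time"])

print("Precision:", mean_precision)
print("F1 Score:", mean_f1)
print("Variance:", precision_variance)
print("Fit + score time (s):", total_time)
```

With this approach each fold is fitted once rather than once per metric, so the measured time reflects a single evaluation of the model.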


In [28]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

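  # Create a Random Forest classifier with fixed (not tuned) hyperparameters:
  # 100 trees, depth capped at 10, trained on all CPU cores.
  # No random_state is set, so scores can vary slightly between runs.
  rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)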

  # Evaluate Random Forest classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(rf_classifier, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(rf_classifier, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(rf_classifier, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(rf_classifier, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(rf_classifier, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(rf_classifier, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display Random Forest results for 10 features
  print("Random Forest Precision (10 features):", np.mean(precision_scores_10))
  print("Random Forest F1 Score (10 features):", np.mean(f1_scores_10))
  print("Random Forest Variance (10 features):", variance_10)
  print("Random Forest MAE (10 features):", mae_10)
  print("Random Forest Execution Time:", elapsed_time_10)

  # Display Random Forest results for 20 features
  print("Random Forest Precision (20 features):", np.mean(precision_scores_20))
  print("Random Forest F1 Score (20 features):", np.mean(f1_scores_20))
  print("Random Forest Variance (20 features):", variance_20)
  print("Random Forest MAE (20 features):", mae_20)
  print("Random Forest Execution Time:", elapsed_time_20)

  results_10_features.append(["Random Forest", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["Random Forest", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

Random Forest Precision (10 features): 0.9796262301079197
Random Forest F1 Score (10 features): 0.9417613889654819
Random Forest Variance (10 features): 4.37806350207662e-06
Random Forest MAE (10 features): 0.05175329263355112
Random Forest Execution Time: 20.291851043701172
Random Forest Precision (20 features): 0.9960979771716556
Random Forest F1 Score (20 features): 0.9963425200700451
Random Forest Variance (20 features): 1.479463741300512e-06
Random Forest MAE (20 features): 0.0034137008486619085
Random Forest Execution Time: 24.775439739227295
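
The MAE values reported above are easier to interpret once it is clear what MAE means for class labels. The sketch below assumes the labels are encoded as 0 and 1 (an assumption; the encoding is not shown in this part of the notebook). Under that assumption every wrong prediction contributes an absolute error of exactly 1, so the MAE equals the misclassification rate, that is, 1 minus accuracy. The arrays are toy values, not data from the experiment.

```python
import numpy as np

# Toy labels and predictions (assumption: binary classes encoded as 0/1).
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Mean absolute error: each misclassified sample contributes |0 - 1| = 1.
mae = np.mean(np.abs(y_true - y_pred))

# Accuracy and the corresponding error rate.
accuracy = np.mean(y_true == y_pred)
error_rate = 1.0 - accuracy

print("MAE:", mae)
print("Error rate:", error_rate)
```

Read this way, and still assuming 0/1 labels, the Random Forest MAE of about 0.0034 with 20 features corresponds to an accuracy of about 0.9966.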


In [29]:
import pickle
import os

if not os.path.exists(results_file_name):

  # Save the results lists to a file
  with open(results_file_name, 'wb') as file:
      results_dict = {
          'results_10_features': results_10_features,
          'results_20_features': results_20_features
      }
      pickle.dump(results_dict, file)


In [30]:
# Load the results from the file
with open(results_file_name, 'rb') as file:
    loaded_results = pickle.load(file)

# Access the loaded results lists
results_10_features = loaded_results['results_10_features']
results_20_features = loaded_results['results_20_features']
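
The check-then-save-then-load pattern used across the cells above can be gathered into one helper. The following is a minimal sketch, not the notebook's own code: `load_or_compute`, `fake_benchmark`, and the temporary `results.pkl` path are hypothetical names introduced for illustration.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Return cached results from `path`, computing and saving them first if missing."""
    if not os.path.exists(path):
        # Compute before opening the file so a failure leaves no empty cache behind.
        results = compute_fn()
        with open(path, "wb") as file:
            pickle.dump(results, file)
    with open(path, "rb") as file:
        return pickle.load(file)


calls = []


def fake_benchmark():
    """Hypothetical stand-in for the expensive evaluation cells."""
    calls.append(1)
    return {"results_10_features": [["ZeroRule", 0.0]], "results_20_features": []}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "results.pkl")
    first = load_or_compute(cache_path, fake_benchmark)   # computes and saves
    second = load_or_compute(cache_path, fake_benchmark)  # loads from the cache

print("Benchmark runs:", len(calls))
```

Because `pickle.load` can execute arbitrary code, a cache file like this should only be loaded when it comes from a trusted source.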


In [31]:
# Print the results in tabular format
headers_10 = ["Algorithm", "Precision (10 Features)", "F1 Score (10 Features)", "Variance (10 Features)", "MAE (10 Features)", "Execution Time"]
headers_20 = ["Algorithm", "Precision (20 Features)", "F1 Score (20 Features)", "Variance (20 Features)", "MAE (20 Features)", "Execution Time"]

print(tabulate(results_10_features, headers_10, tablefmt="pretty"))
print(tabulate(results_20_features, headers_20, tablefmt="pretty"))

+---------------+-------------------------+------------------------+------------------------+---------------------+---------------------+
|   Algorithm   | Precision (10 Features) | F1 Score (10 Features) | Variance (10 Features) |  MAE (10 Features)  |   Execution Time    |
+---------------+-------------------------+------------------------+------------------------+---------------------+---------------------+
|   ZeroRule    |           0.0           |          0.0           |          0.0           | 0.4653826917428134  | 0.33173060417175293 |
|    OneRule    |   0.4643182608755576    |   0.4676192412419926   | 2.8877747352643335e-05 | 0.49847177345728505 | 0.35054945945739746 |
|  Naive Bayes  |   0.9483081080571318    |   0.8513815314023855   | 1.0650856542257133e-05 | 0.12549716980383127 |  0.871079683303833  |
| Random Forest |   0.9796262301079197    |   0.9417613889654819   |  4.37806350207662e-06  | 0.05175329263355112 | 20.291851043701172  |
+---------------+-----------------