# Intrusion Detection System (IDS) Public Datasets Benchmarking

In cybersecurity, the design, development, and implementation of effective Intrusion Detection Systems (IDS) are important for safeguarding IT&C infrastructures from unauthorized access, data breaches, and other forms of malicious activity. The selection of an appropriate ML/DL algorithm plays an essential role in ensuring the security and integrity of protected systems.

Before developing a cutting-edge algorithm, however, we need appropriate data, which must be studied and analysed in order to understand the reality and challenges of the ML problem. Following this approach, we chose to study the earliest datasets created for IDS in order to derive lessons learned for future dataset development.

This experiment aims to comprehensively evaluate the performance of different ML and DL algorithms on a variety of datasets, encompassing a wide range of network traffic scenarios. The datasets used for this analysis include well-known benchmark datasets such as KDD, NSL-KDD, CTU-13, ISCXIDS2012, CIC-IDS2017, CSE-CIC-IDS2018, and Kyoto 2006+. Each dataset represents a distinct set of challenges and characteristics, making this evaluation both diverse and insightful.

The experiment is divided into three main phases:

1. **Data Acquisition and Preprocessing**:
 - In this phase, we acquire the selected datasets from reputable sources, ensuring the integrity and accuracy of the data.
 - Data preprocessing tasks include handling missing values, selecting the most relevant features using feature selection techniques, normalizing the data, and, if necessary, performing feature engineering to enhance the dataset's suitability for machine learning.

2. **Algorithm Evaluation**:
 - We evaluate the performance of a range of ML/DL algorithms on each dataset. The chosen algorithms include baseline methods like ZeroRule and OneRule, traditional machine learning approaches like Naive Bayes and Random Forest, as well as some of the most used anomaly detection deep learning algorithms.
 - Cross-validation is applied to ensure the robustness of our results. Performance metrics such as precision, variance, and Mean Absolute Error (MAE) are calculated for each algorithm and dataset.

3. **Results and Insights**:
 - The results of this evaluation provide valuable insights into the strengths and weaknesses of different IDS algorithms under various conditions.
 - We analyze the performance of algorithms on both the original datasets and balanced datasets to address the challenge of class imbalance in intrusion detection.
 - Observations and additional details regarding the algorithms' performance are documented, providing a comprehensive overview of their behavior.
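
The three phases can be sketched end to end on a toy dataset. This is a minimal illustration rather than the notebook's actual code: the column names, the synthetic records, the label rule, and `k=3` are all assumptions made for the example. Wrapping imputation, scaling, encoding, and feature selection in a scikit-learn `Pipeline` means they are re-fitted on the training part of every cross-validation fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy traffic records (not taken from any of the datasets)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "duration": rng.integers(0, 100, n).astype(float),
    "src_bytes": rng.integers(0, 5000, n).astype(float),
    "protocol_type": rng.choice(["tcp", "udp", "icmp"], n),
})
toy.loc[::25, "duration"] = np.nan                  # inject a few missing values
toy_label = (toy["src_bytes"] > 2500).astype(int)   # synthetic binary label

# Phase 1: imputation, scaling and one-hot encoding
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                          ("scale", MinMaxScaler())]), ["duration", "src_bytes"]),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["protocol_type"]),
    ],
    sparse_threshold=0,  # always return a dense array (GaussianNB needs dense input)
)

# Phase 2: feature selection and classifier, evaluated with cross-validation
model = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=f_classif, k=3)),
    ("clf", GaussianNB()),
])
precision_scores = cross_val_score(model, toy, toy_label, cv=5, scoring="precision")

# Phase 3: summarize the per-fold scores
print("mean precision:", precision_scores.mean())
print("precision variance:", precision_scores.var())
```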

By conducting this experiment, we aim to contribute to the understanding of dataset generation in the cyber domain. The findings will support informed decisions when developing a cybersecurity AI application by deriving the necessary steps and procedures for selecting appropriate training data.

The following sections of this Jupyter notebook will provide a detailed walkthrough of the experiment, including code snippets, visualizations, and discussions of the results.

In [16]:
# Mount your Google Drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [17]:
import os
import warnings
from google.colab import files

# Suppress all warning messages
warnings.filterwarnings("ignore")

# Check if the Kaggle API credentials file already exists
kaggle_credentials_path = os.path.expanduser("~/.kaggle/kaggle.json")

if not os.path.exists(kaggle_credentials_path):

    if not os.path.exists(os.path.join("/content/drive/MyDrive/.kaggle/", "kaggle.json")):

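      # Upload your Kaggle API credentials file (kaggle.json)
      files.upload()

      # Make sure the target folder exists before moving the file into it
      !mkdir -p "/content/drive/MyDrive/.kaggle/"
      !mv kaggle.json "/content/drive/MyDrive/.kaggle/"
      !chmod 600 "/content/drive/MyDrive/.kaggle/kaggle.json"

    # Copy the Kaggle API credentials file to the location expected by the Kaggle CLI
    !mkdir -p ~/.kaggle
    !cp '/content/drive/MyDrive/.kaggle/kaggle.json' ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json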

else:

    print("Kaggle API credentials file already exists.")

Kaggle API credentials file already exists.


In [18]:
import tensorflow as tf
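# tf.test.is_gpu_available() is deprecated; list the physical GPU devices instead
print("GPU available:", len(tf.config.list_physical_devices('GPU')) > 0)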
print("GPU device name:", tf.test.gpu_device_name())

GPU available: True
GPU device name: /device:GPU:0


In [19]:
import os
from psutil import virtual_memory
from tabulate import tabulate

# Function to get CPU information
def get_cpu_info():
    cpu_info = os.popen('lscpu').read()
    return cpu_info

# Function to get RAM information
def get_ram_info():
    ram = virtual_memory()
    total_ram = f"{ram.total / 1e9:.2f} GB"
    available_ram = f"{ram.available / 1e9:.2f} GB"
    return total_ram, available_ram

# Function to get GPU information
def get_gpu_info():
    # Execute nvidia-smi and get its output
    gpu_info = os.popen('nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv,noheader,nounits').read().strip()

    # Split the output to get individual GPU details
    details = gpu_info.split(", ")

    # Return GPU name, total, used, and free memory
    return details[0], f"{details[1]} MB", f"{details[2]} MB", f"{details[3]} MB"

# Collect system information
cpu_info = get_cpu_info()
total_ram, available_ram = get_ram_info()
try:
  gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = get_gpu_info()
except Exception:
  # nvidia-smi is missing or returned no GPU (e.g. CPU-only runtime)
  gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = 'null',0,0,0

# Extract relevant CPU information
cpu_type = ""
cpu_architecture = ""

for line in cpu_info.splitlines():
    if "Model name:" in line:
        cpu_type = line.split(":")[1].strip()
    elif "Architecture:" in line:
        cpu_architecture = line.split(":")[1].strip()

# Create a table
table = [
    ["CPU Type", cpu_type],
    ["CPU Architecture", cpu_architecture],
    ["Total RAM", total_ram],
    ["Available RAM", available_ram],
    ["GPU Name", gpu_name],
    ["GPU Total Memory", gpu_total_memory],
    ["GPU Used Memory", gpu_used_memory],
    ["GPU Free Memory", gpu_free_memory]
]

# Display the table
print(tabulate(table, headers=["Characteristic", "Value"], tablefmt="pretty"))


+------------------+--------------------------------+
|  Characteristic  |             Value              |
+------------------+--------------------------------+
|     CPU Type     | Intel(R) Xeon(R) CPU @ 2.20GHz |
| CPU Architecture |             x86_64             |
|    Total RAM     |            54.76 GB            |
|  Available RAM   |            48.76 GB            |
|     GPU Name     |            Tesla T4            |
| GPU Total Memory |            15360 MB            |
| GPU Used Memory  |             359 MB             |
| GPU Free Memory  |            14742 MB            |
+------------------+--------------------------------+


## 1. Data Acquisition and Preprocessing

In this section, we focus on acquiring the above-mentioned datasets, starting with KDD99.

### 1.1. KDD99 dataset
The KDD dataset was originally created for the DARPA Intrusion Detection Evaluation program and has since become a benchmark for evaluating IDS algorithms. It comprises network traffic data captured from a simulated environment, including various types of attacks and normal activities. The dataset contains a diverse set of features, including duration, protocol type, service, and more.

### Download and Unzip KDD99 dataset

In [20]:
import os
import pandas as pd

# Specify the dataset name
dataset_name = "galaxyh/kdd-cup-1999-data"

# Specify the destination folder in your Google Drive
destination_folder = "/content/drive/MyDrive/KDD99-BM"

# Check if the dataset file already exists in your Google Drive
dataset_file_path = os.path.join(destination_folder, "kdd-cup-1999-data.zip")

if not os.path.exists(dataset_file_path):

  # Download the dataset and save it to your Google Drive
  !kaggle datasets download -d $dataset_name -p $destination_folder

  # Unzip the downloaded dataset
  import zipfile
  with zipfile.ZipFile(f"{destination_folder}/kdd-cup-1999-data.zip", "r") as zip_ref:
      zip_ref.extractall(destination_folder)

  print("Download complete.")

else:
  print("Dataset already exists. Skipping download.")

# Remove the downloaded zip file (optional)
#!rm -f {destination_folder}/kdd-cup-1999-data.zip

Downloading kdd-cup-1999-data.zip to /content/drive/MyDrive/KDD99-BM
100% 87.8M/87.8M [00:05<00:00, 22.5MB/s]
100% 87.8M/87.8M [00:05<00:00, 18.1MB/s]
Download complete.


In [21]:
!ls -ahl '/content/drive/MyDrive/KDD99-BM'

total 902M
drwx------ 2 root root 4.0K Oct 10 13:54 corrected
-rw------- 1 root root 1.4M Oct 10 13:54 corrected.gz
-rw------- 1 root root  88M Nov 21  2019 kdd-cup-1999-data.zip
drwx------ 2 root root 4.0K Oct 10 13:54 kddcup.data
drwx------ 2 root root 4.0K Oct 10 13:54 kddcup.data_10_percent
-rw------- 1 root root  72M Oct 10 13:54 kddcup.data_10_percent_corrected
-rw------- 1 root root 2.1M Oct 10 13:54 kddcup.data_10_percent.gz
-rw------- 1 root root 709M Oct 10 13:54 kddcup.data.corrected
-rw------- 1 root root  18M Oct 10 13:54 kddcup.data.gz
-rw------- 1 root root 1.3K Oct 10 13:54 kddcup.names
drwx------ 2 root root 4.0K Oct 10 13:54 kddcup.newtestdata_10_percent_unlabeled
-rw------- 1 root root 1.4M Oct 10 13:54 kddcup.newtestdata_10_percent_unlabeled.gz
drwx------ 2 root root 4.0K Oct 10 13:54 kddcup.testdata.unlabeled
drwx------ 2 root root 4.0K Oct 10 13:54 kddcup.testdata.unlabeled_10_percent
-rw------- 1 root root 1.4M Oct 10 13:54 kddcup.testdata.unlabeled_10_percent.gz

In [22]:
# Define the file path for kddcup.names
kddcup_names_file_path = '/content/drive/MyDrive/KDD99-BM/kddcup.names'

# Read the content of kddcup.names into a list of lines
with open(kddcup_names_file_path, 'r') as file:
    kddcup_names_content = file.readlines()

# Extract the column names from descriptions
column_names = [line.split(":")[0].strip() for line in kddcup_names_content if ':' in line]

# Add 'label' at the end
column_names.append('Label')

# Print the resulting column names list
print(column_names)

['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'Label']


In [23]:
import pandas as pd

# Check if kdd99 processed dataframe is in location
df_file_path = os.path.join(destination_folder, "kdd99.csv")
if os.path.exists(df_file_path):
  df = pd.read_csv(df_file_path)
else:
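  # Load the dataset into a Pandas DataFrame
  # Line 4817100 contains 56 fields instead of 42, so malformed lines are skipped
  # (on_bad_lines replaces error_bad_lines, which was deprecated in pandas 1.3 and removed in 2.0)
  df = pd.read_csv("/content/drive/MyDrive/KDD99-BM/kddcup.data/kddcup.data", header=None, names=column_names, on_bad_lines="skip")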

In [24]:
# Information about the starting KDD DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898431 entries, 0 to 4898430
Data columns (total 42 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   duration                     int64  
 1   protocol_type                object 
 2   service                      object 
 3   flag                         object 
 4   src_bytes                    int64  
 5   dst_bytes                    int64  
 6   land                         int64  
 7   wrong_fragment               int64  
 8   urgent                       int64  
 9   hot                          int64  
 10  num_failed_logins            int64  
 11  logged_in                    int64  
 12  num_compromised              int64  
 13  root_shell                   int64  
 14  su_attempted                 int64  
 15  num_root                     int64  
 16  num_file_creations           int64  
 17  num_shells                   int64  
 18  num_access_files             int64  
 19  

In [25]:
# Some basic statistical details like percentile, mean, std, etc. of the starting KDD DataFrame
df.describe()

Unnamed: 0,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
count,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,...,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0
mean,48.34243,1834.621,1093.623,5.716116e-06,0.0006487792,7.961733e-06,0.01243766,3.205108e-05,0.143529,0.008088304,...,232.9811,189.2142,0.7537132,0.03071111,0.605052,0.006464107,0.1780911,0.1778859,0.0579278,0.05765941
std,723.3298,941431.1,645012.3,0.002390833,0.04285434,0.007215084,0.4689782,0.007299408,0.3506116,3.856481,...,64.02094,105.9128,0.411186,0.1085432,0.4809877,0.04125978,0.3818382,0.3821774,0.2309428,0.2309777
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,45.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,49.0,0.41,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,520.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,255.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,1032.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,255.0,1.0,0.04,1.0,0.0,0.0,0.0,0.0,0.0
max,58329.0,1379964000.0,1309937000.0,1.0,3.0,14.0,77.0,5.0,1.0,7479.0,...,255.0,255.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### Preprocessing of the KDD dataset

In [26]:
# Import required libraries
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [27]:
# If the dataset has not been preprocessed yet, run the following steps:
  # 1 # Handling Missing Values
  # 2 # Normalization (Min-Max Scaling)
  # 3 # Encode Categorical Features and Label
  # 4 # Removing duplicate records

import numpy as np  # needed below for np.nan

df_encoded_file_path = os.path.join(destination_folder, "kdd99_encoded.csv")
if not os.path.exists(df_encoded_file_path):

  # Step 1: Handling Missing Values

  # Count the missing values stored as NaN
  check_nan = df.isna().sum().sum()

  # Flag the missing values stored as empty strings (",," in the raw file)
  missing_values_as_empty = df.applymap(lambda x: x == '')

  # Count the number of empty values in each column
  missing_values_count = missing_values_as_empty.sum()

  # True if at least one column contains empty values
  check_null = (missing_values_count != 0).any()

  # Replace empty values with NaN
  if check_null:
    df.replace("", np.nan, inplace=True)

  # Impute missing values with the most frequent value for categorical columns and the mean for numerical columns
  if check_null or check_nan != 0:
    imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
    for col in df.columns:
        if df[col].dtype == 'object':
            # fit_transform returns a 2D array; flatten it before assigning to the column
            df[col] = imputer.fit_transform(df[[col]]).ravel()
        else:
            df[col] = df[col].fillna(df[col].mean())

In [28]:
# Run this step only if the dataset has not been preprocessed yet
if not os.path.exists(df_encoded_file_path):

  # Step 2: Normalization (Min-Max Scaling)

  columns = [col for col in df.columns if col not in ['protocol_type','service','flag','Label']]
  min_max_scaler = MinMaxScaler().fit(df[columns])
  df[columns] = min_max_scaler.transform(df[columns])

display(df.head())

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,Label
0,0.0,tcp,http,SF,1.558012e-07,3.44108e-05,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal.
1,0.0,tcp,http,SF,1.173944e-07,3.456654e-06,0.0,0.0,0.0,0.0,...,0.003922,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,normal.
2,0.0,tcp,http,SF,1.71019e-07,9.374494e-07,0.0,0.0,0.0,0.0,...,0.007843,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,normal.
3,0.0,tcp,http,SF,1.68845e-07,1.551219e-06,0.0,0.0,0.0,0.0,...,0.011765,1.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,normal.
4,0.0,tcp,http,SF,1.731929e-07,3.710101e-07,0.0,0.0,0.0,0.0,...,0.015686,1.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,normal.
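
Min-max scaling maps every value `x` of a column to `(x - min) / (max - min)`, so the scaled values above can be mapped back to the original units. As a quick sanity check, the sketch below takes the `src_bytes` minimum and maximum from the `df.describe()` output and inverts the first scaled value shown in `df.head()`. Fitting the scaler on just the two extremes is a shortcut assumed for this example; the notebook fits it on the full column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Minimum and maximum of src_bytes as reported by df.describe()
src_min, src_max = 0.0, 1379964000.0

# Fitting on the two extremes reproduces the same data_min_ / data_max_
scaler = MinMaxScaler().fit(np.array([[src_min], [src_max]]))

scaled_first_row = np.array([[1.558012e-07]])  # src_bytes of row 0 after scaling
recovered = scaler.inverse_transform(scaled_first_row)[0, 0]

# The same inversion written out by hand: x = x_scaled * (max - min) + min
manual = 1.558012e-07 * (src_max - src_min) + src_min

print("original src_bytes of row 0 (approx.):", round(recovered))
```

Because the largest `src_bytes` value is above 1.3 billion, typical connections end up compressed into a very narrow range close to zero after scaling.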


In [29]:
# Run this step only if the dataset has not been preprocessed yet
if not os.path.exists(df_encoded_file_path):

  # Step 3: Encode Categorical Features and Label

  # Create a copy of the scaled DataFrame
  df_encoded = df.copy()

  # Identify categorical columns (non-numeric)
  categorical_columns = df_encoded.select_dtypes(include=['object']).columns.tolist()
  categorical_columns.remove('Label')

  # Encode the Label with 0 value for normal and 1 for the rest of the attacks
  df_encoded['Label'] = df_encoded['Label'].apply(lambda x: 0 if x == 'normal.' else 1)

  # Encode categorical columns using one-hot encoding (get_dummies)
  df_encoded = pd.get_dummies(df_encoded, columns=categorical_columns)

  # Now, df_encoded contains the encoded categorical features and label

  unique_labels = df_encoded['Label'].unique()

  # Print the unique labels
  for label in unique_labels:
      print(label)
else:
  df_encoded = pd.read_csv(df_encoded_file_path)

0
1


In [30]:
# Step 4: Removing duplicate records

# Print the shape of the DataFrame 'df' before removing duplicates
print(df.shape)

# Remove duplicate rows from the DataFrame 'df' while resetting the index
df = df.drop_duplicates()
df.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df' after removing duplicates and resetting the index
print(df.shape)


# Print the shape of the DataFrame 'df_encoded' before removing duplicates
print(df_encoded.shape)

# Remove duplicate rows from the DataFrame 'df_encoded' while resetting the index
df_encoded = df_encoded.drop_duplicates()
df_encoded.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df_encoded' after removing duplicates and resetting the index
print(df_encoded.shape)

(4898431, 42)
(1074992, 42)
(4898431, 123)
(1074983, 123)


In [31]:
# Print out the DataFrames loaded in the memory
%whos DataFrame

Variable                  Type         Data/Info
------------------------------------------------
df                        DataFrame             duration protoco<...>074992 rows x 42 columns]
df_encoded                DataFrame             duration     src<...>74983 rows x 123 columns]
missing_values_as_empty   DataFrame             duration  protoc<...>898431 rows x 42 columns]


In [32]:
# Delete missing_values_as_empty DataFrame from memory resulted from step 1 of preprocessing

for var_name in list(globals().keys()):
    if isinstance(globals()[var_name], pd.DataFrame) and var_name.startswith("missing_values_as_empty"):
        del globals()[var_name]

In [34]:
# Step 5: Feature Engineering (if necessary)
# You can perform additional feature engineering as needed based on your analysis

# Display the preprocessed dataset (X) and target (y)
#print(pd.DataFrame(X, columns=column_names[:-1]).head())
#print(y.head())

In [45]:
# Count the occurrences of each label in the original DataFrame 'df'
label_counts_df = df['Label'].value_counts()

# Print the label counts in 'df'
print("Label counts in df:")
print(label_counts_df)

# Count the occurrences of each label in the encoded DataFrame 'df_encoded'
label_counts_encoded = df_encoded['Label'].value_counts()

# Print the label counts in 'df_encoded'
print("Label counts in df_encoded:")
print(label_counts_encoded)

Label counts in df:
normal.             812814
neptune.            242149
satan.                5019
ipsweep.              3723
portsweep.            3564
smurf.                3007
nmap.                 1554
back.                  968
teardrop.              918
warezclient.           893
pod.                   206
guess_passwd.           53
buffer_overflow.        30
warezmaster.            20
land.                   19
imap.                   12
rootkit.                10
loadmodule.              9
ftp_write.               8
multihop.                7
phf.                     4
perl.                    3
spy.                     2
Name: Label, dtype: int64
Label counts in df_encoded:
0    812814
1    262169
Name: Label, dtype: int64
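
The counts above show that about three quarters of the deduplicated records are normal traffic. One simple way to derive a balanced variant of the dataset is random undersampling of the majority class. The sketch below assumes that strategy for illustration only; the helper `balance_by_undersampling` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def balance_by_undersampling(frame, label_col="Label", random_state=42):
    """Downsample every class to the size of the smallest class."""
    smallest = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in frame.groupby(label_col)
    ]
    # Concatenate the per-class samples and shuffle the rows
    return (
        pd.concat(parts)
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )

# Toy frame with 8 normal (0) and 3 attack (1) records
toy = pd.DataFrame({"count": range(11), "Label": [0] * 8 + [1] * 3})
balanced = balance_by_undersampling(toy)
print(balanced["Label"].value_counts())
```

Undersampling discards majority-class records, so it is cheap but loses information; oversampling the minority class is a common alternative.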


In [35]:
# Check if the Dataset is saved
if not os.path.exists(df_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df.to_csv('/content/drive/MyDrive/KDD99-BM/kdd99.csv', index=False)
if not os.path.exists(df_encoded_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df_encoded.to_csv('/content/drive/MyDrive/KDD99-BM/kdd99_encoded.csv', index=False)

## 2. Algorithm Evaluation

In this section, we assess the performance of various machine learning algorithms on the above-mentioned datasets.

### 2.1. KDD99 dataset evaluation with baseline and traditional ML algorithms

In this section, we measure precision and F1 score for a set of classifiers: baselines such as Zero Rule and One Rule, a statistical method (Naive Bayes), and a more complex ensemble model (Random Forest). Each classifier is evaluated with 10-fold cross-validation on the 10 and on the 20 best features selected from the dataset. The results inform the dataset generation strategy: they help identify effective ways of extracting features from raw data and the most suitable algorithm for a given dataset.
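
The evaluation cells that follow repeat the same measurement steps for each classifier. As a compact illustration of those steps, the sketch below gathers them into one helper. The function `evaluate` and the synthetic data from `make_classification` are assumptions made for this example and are not part of the notebook; `cross_validate` is used so that precision and F1 come from a single cross-validation run.

```python
import time
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict, cross_validate
from sklearn.naive_bayes import GaussianNB

def evaluate(model, X, y, cv=10):
    """Return mean precision, mean F1, precision variance, MAE and wall time."""
    start = time.time()
    scores = cross_validate(model, X, y, cv=cv, scoring=("precision", "f1"))
    elapsed = time.time() - start
    # Out-of-fold predictions are used for the mean absolute error
    predictions = cross_val_predict(model, X, y, cv=cv)
    return {
        "precision": scores["test_precision"].mean(),
        "f1": scores["test_f1"].mean(),
        "variance": scores["test_precision"].var(),
        "mae": mean_absolute_error(y, predictions),
        "time_s": elapsed,
    }

# Synthetic stand-in for the selected features (10 features, roughly 25% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75], random_state=0
)

results = {
    "ZeroRule": evaluate(DummyClassifier(strategy="most_frequent"), X_demo, y_demo),
    "Naive Bayes": evaluate(GaussianNB(), X_demo, y_demo),
}
for name, metrics in results.items():
    print(name, metrics)
```

For a binary 0/1 label, the MAE computed this way equals the misclassification rate, so it complements precision and F1 rather than measuring something unrelated.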

In [37]:
# Separate features (X) and labels (y)
X = df_encoded.drop('Label', axis=1)  # Exclude the label column
y = df_encoded['Label']

# Create a pipeline for feature selection on the preprocessed data
pipeline_10_features = Pipeline([
    ('selector_10', SelectKBest(score_func=f_classif, k=10))
])

pipeline_20_features = Pipeline([
    ('selector_20', SelectKBest(score_func=f_classif, k=20))
])

# Fit and transform the data for 10 and 20 features
X_selected_10 = pipeline_10_features.fit_transform(X, y)
X_selected_20 = pipeline_20_features.fit_transform(X, y)

# Display the shapes of the selected feature matrices
print(X_selected_10.shape)  # Check the shape of the selected 10 features
print(X_selected_20.shape)  # Check the shape of the selected 20 features

# Display the selected features
print("Selected 10 features:")
selected_feature_indices_10 = pipeline_10_features.named_steps['selector_10'].get_support(indices=True)
selected_features_10 = X.columns[selected_feature_indices_10]
print(selected_features_10)

print("\nSelected 20 features:")
selected_feature_indices_20 = pipeline_20_features.named_steps['selector_20'].get_support(indices=True)
selected_features_20 = X.columns[selected_feature_indices_20]
print(selected_features_20)

(1074983, 10)
(1074983, 20)
Selected 10 features:
Index(['count', 'serror_rate', 'srv_serror_rate', 'same_srv_rate',
       'dst_host_same_srv_rate', 'dst_host_serror_rate',
       'dst_host_srv_serror_rate', 'service_private', 'flag_S0', 'flag_SF'],
      dtype='object')

Selected 20 features:
Index(['logged_in', 'count', 'serror_rate', 'srv_serror_rate', 'rerror_rate',
       'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
       'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
       'dst_host_same_srv_rate', 'dst_host_serror_rate',
       'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
       'dst_host_srv_rerror_rate', 'service_http', 'service_private',
       'flag_S0', 'flag_SF'],
      dtype='object')


In [38]:
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, mean_absolute_error, f1_score
from sklearn.dummy import DummyClassifier
from tabulate import tabulate
import time
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Define the number of desired folds for Cross-Validation (e.g., 10)
num_folds = 10

# Initialize performance metrics lists for 10 and 20 features
results_10_features = []
results_20_features = []

In [39]:
# Define a file name for saving the results
results_file_name = os.path.join(destination_folder, "kdd_results.pkl")

# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define ZeroRule classifier
  zero_rule = DummyClassifier(strategy="most_frequent")

  # Evaluate ZeroRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(zero_rule, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(zero_rule, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(zero_rule, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(zero_rule, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(zero_rule, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(zero_rule, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display ZeroRule results for 10 features
  print("ZeroRule Precision (10 features):", np.mean(precision_scores_10))
  print("ZeroRule F1 Score (10 features):", np.mean(f1_scores_10))
  print("ZeroRule Variance (10 features):", variance_10)
  print("ZeroRule MAE (10 features):", mae_10)
  print("ZeroRule Execution Time:", elapsed_time_10)

  # Display ZeroRule results for 20 features
  print("ZeroRule Precision (20 features):", np.mean(precision_scores_20))
  print("ZeroRule F1 Score (20 features):", np.mean(f1_scores_20))
  print("ZeroRule Variance (20 features):", variance_20)
  print("ZeroRule MAE (20 features):", mae_20)
  print("ZeroRule Execution Time:", elapsed_time_20)

  results_10_features.append(["ZeroRule", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["ZeroRule", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

ZeroRule Precision (10 features): 0.0
ZeroRule F1 Score (10 features): 0.0
ZeroRule Variance (10 features): 0.0
ZeroRule MAE (10 features): 0.24388199627342944
ZeroRule Execution Time: 2.3789751529693604
ZeroRule Precision (20 features): 0.0
ZeroRule F1 Score (20 features): 0.0
ZeroRule Variance (20 features): 0.0
ZeroRule MAE (20 features): 0.24388199627342944
ZeroRule Execution Time: 3.2261602878570557


In [40]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

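  # Define the "OneRule" baseline
  # Note: DummyClassifier(strategy="stratified") predicts at random following the class distribution,
  # so it is a random baseline rather than the classic OneR (single-attribute rule) algorithm
  one_rule = DummyClassifier(strategy="stratified")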

  # Evaluate OneRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(one_rule, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(one_rule, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(one_rule, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(one_rule, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(one_rule, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(one_rule, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display OneRule results for 10 features
  print("OneRule Precision (10 features):", np.mean(precision_scores_10))
  print("OneRule F1 Score (10 features):", np.mean(f1_scores_10))
  print("OneRule Variance (10 features):", variance_10)
  print("OneRule MAE (10 features):", mae_10)
  print("OneRule Execution Time:", elapsed_time_10)

  # Display OneRule results for 20 features
  print("OneRule Precision (20 features):", np.mean(precision_scores_20))
  print("OneRule F1 Score (20 features):", np.mean(f1_scores_20))
  print("OneRule Variance (20 features):", variance_20)
  print("OneRule MAE (20 features):", mae_20)
  print("OneRule Execution Time:", elapsed_time_20)

  results_10_features.append(["OneRule", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["OneRule", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

OneRule Precision (10 features): 0.24312432230925718
OneRule F1 Score (10 features): 0.24395697630241897
OneRule Variance (10 features): 4.267845198669343e-06
OneRule MAE (10 features): 0.3688383909326938
OneRule Execution Time: 2.6188907623291016
OneRule Precision (20 features): 0.2441829909978038
OneRule F1 Score (20 features): 0.24404729271675052
OneRule Variance (20 features): 6.391651230681625e-06
OneRule MAE (20 features): 0.36822628822967435
OneRule Execution Time: 3.214223623275757


In [41]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define Naive Bayes classifier
  naive_bayes = GaussianNB()

  # Evaluate Naive Bayes classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(naive_bayes, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(naive_bayes, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(naive_bayes, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(naive_bayes, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(naive_bayes, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(naive_bayes, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display Naive Bayes results for 10 features
  print("Naive Bayes Precision (10 features):", np.mean(precision_scores_10))
  print("Naive Bayes F1 Score (10 features):", np.mean(f1_scores_10))
  print("Naive Bayes Variance (10 features):", variance_10)
  print("Naive Bayes MAE (10 features):", mae_10)
  print("Naive Bayes Execution Time:", elapsed_time_10)

  # Display Naive Bayes results for 20 features
  print("Naive Bayes Precision (20 features):", np.mean(precision_scores_20))
  print("Naive Bayes F1 Score (20 features):", np.mean(f1_scores_20))
  print("Naive Bayes Variance (20 features):", variance_20)
  print("Naive Bayes MAE (20 features):", mae_20)
  print("Naive Bayes Execution Time:", elapsed_time_20)

  results_10_features.append(["Naive Bayes", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["Naive Bayes", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

Naive Bayes Precision (10 features): 0.9479421362435112
Naive Bayes F1 Score (10 features): 0.9557347237986116
Naive Bayes Variance (10 features): 0.004133212375383223
Naive Bayes MAE (10 features): 0.022359423358322875
Naive Bayes Execution Time: 6.8374340534210205
Naive Bayes Precision (20 features): 0.9471108890560659
Naive Bayes F1 Score (20 features): 0.9555713098731035
Naive Bayes Variance (20 features): 0.007137480410507853
Naive Bayes MAE (20 features): 0.023257111972933526
Naive Bayes Execution Time: 10.366659879684448
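
The cell above runs cross-validation once per metric, so the reported execution time covers two full passes per feature subset. As a hedged alternative, scikit-learn's `cross_validate` can score several metrics in a single pass. The sketch below is a minimal illustration on synthetic data: `X_demo`, `y_demo`, and `evaluate_once` are hypothetical names introduced here and are not part of the notebook.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): the notebook's X and y are not reproduced here.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)


def evaluate_once(model, X, y, folds=5):
    """Run one cross-validation pass and collect precision and F1 together."""
    start = time.perf_counter()
    cv_out = cross_validate(model, X, y, cv=folds, scoring=["precision", "f1"])
    elapsed = time.perf_counter() - start
    return {
        "precision": float(np.mean(cv_out["test_precision"])),
        "f1": float(np.mean(cv_out["test_f1"])),
        # Same definition as in the notebook: variance of per-fold precision
        "precision_variance": float(np.var(cv_out["test_precision"])),
        "elapsed": elapsed,
        "n_folds": len(cv_out["test_precision"]),
    }


summary = evaluate_once(GaussianNB(), X_demo, y_demo)
print(summary)
```

Because each fold is fitted only once, the measured time is not directly comparable with the timings recorded above, which include one fit per metric.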


In [42]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Create a Random Forest classifier with fixed, hand-picked parameters:
  # 100 trees, depth capped at 10, trained on all available CPU cores
  rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)

  # Evaluate Random Forest classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(rf_classifier, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(rf_classifier, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(rf_classifier, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(rf_classifier, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  # "Variance" here is the spread of the per-fold precision scores
  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  # MAE is computed on out-of-fold predictions gathered across all folds
  predictions_10 = cross_val_predict(rf_classifier, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(rf_classifier, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display Random Forest results for 10 features
  print("Random Forest Precision (10 features):", np.mean(precision_scores_10))
  print("Random Forest F1 Score (10 features):", np.mean(f1_scores_10))
  print("Random Forest Variance (10 features):", variance_10)
  print("Random Forest MAE (10 features):", mae_10)
  print("Random Forest Execution Time:", elapsed_time_10)

  # Display Random Forest results for 20 features
  print("Random Forest Precision (20 features):", np.mean(precision_scores_20))
  print("Random Forest F1 Score (20 features):", np.mean(f1_scores_20))
  print("Random Forest Variance (20 features):", variance_20)
  print("Random Forest MAE (20 features):", mae_20)
  print("Random Forest Execution Time:", elapsed_time_20)

  results_10_features.append(["Random Forest", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["Random Forest", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

Random Forest Precision (10 features): 0.998262408241796
Random Forest F1 Score (10 features): 0.9799685338553014
Random Forest Variance (10 features): 5.231475371684771e-06
Random Forest MAE (10 features): 0.009168517083525972
Random Forest Execution Time: 169.15548610687256
Random Forest Precision (20 features): 0.9980822403663654
Random Forest F1 Score (20 features): 0.9884509989637585
Random Forest Variance (20 features): 5.116359534341992e-06
Random Forest MAE (20 features): 0.005495900865409034
Random Forest Execution Time: 210.78380870819092
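
A note on reading the MAE column: assuming the labels are encoded as 0 (benign) and 1 (attack), which the notebook does not state explicitly but which the default `scoring='precision'` setting suggests, the mean absolute error of hard class predictions is simply the misclassification rate, that is, 1 minus accuracy. The toy arrays below are made up purely to illustrate that identity.

```python
import numpy as np

# Toy binary labels and predictions (illustrative values, not notebook data)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# For 0/1 labels, |y_true - y_pred| is 1 on every error and 0 otherwise
mae = np.mean(np.abs(y_true - y_pred))
accuracy = np.mean(y_true == y_pred)

print("MAE:", mae)
print("1 - accuracy:", 1 - accuracy)
```

Under that assumption, an MAE of about 0.0055 for Random Forest with 20 features would correspond to roughly 99.45% of samples being classified correctly.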


In [43]:
import pickle
import os

if not os.path.exists(results_file_name):

  # Save the results lists to a file
  with open(results_file_name, 'wb') as file:
      results_dict = {
          'results_10_features': results_10_features,
          'results_20_features': results_20_features
      }
      pickle.dump(results_dict, file)


In [46]:
# Load the results from the file
with open(results_file_name, 'rb') as file:
    loaded_results = pickle.load(file)

# Access the loaded results lists
results_10_features = loaded_results['results_10_features']
results_20_features = loaded_results['results_20_features']


In [47]:
# Print the results in tabular format
headers_10 = ["Algorithm", "Precision (10 Features)", "F1 Score (10 Features)", "Variance (10 Features)", "MAE (10 Features)", "Execution Time"]
headers_20 = ["Algorithm", "Precision (20 Features)", "F1 Score (20 Features)", "Variance (20 Features)", "MAE (20 Features)", "Execution Time"]

# One table per feature subset: each results list is paired with its own headers
print(tabulate(results_10_features, headers_10, tablefmt="pretty"))
print(tabulate(results_20_features, headers_20, tablefmt="pretty"))

+---------------+-------------------------+------------------------+------------------------+----------------------+--------------------+
|   Algorithm   | Precision (10 Features) | F1 Score (10 Features) | Variance (10 Features) |  MAE (10 Features)   |   Execution Time   |
+---------------+-------------------------+------------------------+------------------------+----------------------+--------------------+
|   ZeroRule    |           0.0           |          0.0           |          0.0           | 0.24388199627342944  | 2.3789751529693604 |
|    OneRule    |   0.24312432230925718   |  0.24395697630241897   | 4.267845198669343e-06  |  0.3688383909326938  | 2.6188907623291016 |
|  Naive Bayes  |   0.9479421362435112    |   0.9557347237986116   |  0.004133212375383223  | 0.022359423358322875 | 6.8374340534210205 |
| Random Forest |    0.998262408241796    |   0.9799685338553014   | 5.231475371684771e-06  | 0.009168517083525972 | 169.15548610687256 |
+---------------+-----------------