# Intrusion Detection System (IDS) Public Datasets Benchmarking

In cybersecurity, the design, development, and implementation of effective Intrusion Detection Systems (IDS) are important for safeguarding IT&C infrastructures from unauthorized access, data breaches, and various forms of malicious activities. The selection of an appropriate ML/DL algorithm plays an essential role in ensuring the security and integrity of protected systems.

But before we can dive into the development of a cutting-edge algorithm, we need appropriate data, which must be studied and analysed in order to understand the reality and challenges of our ML problem. In accordance with this paradigm, we chose to study the earlier datasets designed for IDS in order to derive lessons learned for future dataset development.

This experiment aims to comprehensively evaluate the performance of different ML and DL algorithms on a variety of datasets, encompassing a wide range of network traffic scenarios. The datasets used for this analysis include well-known benchmark datasets such as KDD, NSL-KDD, CTU-13, ISCXIDS2012, CIC-IDS2017, CSE-CIC-IDS2018, and Kyoto 2006+. Each dataset represents a distinct set of challenges and characteristics, making this evaluation both diverse and insightful.

The experiment is divided into three main phases:

1. **Data Acquisition and Preprocessing**:
 - In this phase, we acquire the selected datasets from reputable sources, ensuring the integrity and accuracy of the data.
 - Data preprocessing tasks include handling missing values, selecting the most relevant features using feature selection techniques, normalizing the data, and, if necessary, performing feature engineering to enhance the dataset's suitability for machine learning.

2. **Algorithm Evaluation**:
 - We evaluate the performance of a range of ML/DL algorithms on each dataset. The chosen algorithms include baseline methods like ZeroRule and OneRule, traditional machine learning approaches like Naive Bayes and Random Forest, as well as some of the most used anomaly detection deep learning algorithms.
 - Cross-validation is applied to ensure the robustness of our results. Performance metrics such as precision, F1 score, variance, and Mean Absolute Error (MAE) are calculated for each algorithm and dataset.

3. **Results and Insights**:
 - The results of this evaluation provide valuable insights into the strengths and weaknesses of different IDS algorithms under various conditions.
 - We analyze the performance of algorithms on both the original datasets and balanced datasets to address the challenge of class imbalance in intrusion detection.
 - Observations and additional details regarding the algorithms' performance are documented, providing a comprehensive overview of their behavior.

By conducting this experiment, we aim to contribute to the understanding of cyber domain dataset generation. The findings will assist in making informed decisions when developing a cybersecurity AI application, by deriving the necessary steps and procedures for selecting appropriate learning data.

The following sections of this Jupyter notebook will provide a detailed walkthrough of the experiment, including code snippets, visualizations, and discussions of the results.

In [63]:
# Mount your Google Drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [64]:
import os
from psutil import virtual_memory
from tabulate import tabulate

# Function to get CPU information
def get_cpu_info():
    cpu_info = os.popen('lscpu').read()
    return cpu_info

# Function to get RAM information
def get_ram_info():
    ram = virtual_memory()
    total_ram = f"{ram.total / 1e9:.2f} GB"
    available_ram = f"{ram.available / 1e9:.2f} GB"
    return total_ram, available_ram

# Function to get GPU information
def get_gpu_info():
    # Execute nvidia-smi and get its output
    gpu_info = os.popen('nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv,noheader,nounits').read().strip()

    # Split the output to get individual GPU details
    details = gpu_info.split(", ")

    # Return GPU name, total, used, and free memory
    return details[0], f"{details[1]} MB", f"{details[2]} MB", f"{details[3]} MB"

# Collect system information
cpu_info = get_cpu_info()
total_ram, available_ram = get_ram_info()
try:
    gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = get_gpu_info()
except Exception:
    # No GPU available (or nvidia-smi missing): fall back to placeholder values
    gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = 'null', 0, 0, 0

# Extract relevant CPU information
cpu_type = ""
cpu_architecture = ""

for line in cpu_info.splitlines():
    if "Model name:" in line:
        cpu_type = line.split(":")[1].strip()
    elif "Architecture:" in line:
        cpu_architecture = line.split(":")[1].strip()

# Create a table
table = [
    ["CPU Type", cpu_type],
    ["CPU Architecture", cpu_architecture],
    ["Total RAM", total_ram],
    ["Available RAM", available_ram],
    ["GPU Name", gpu_name],
    ["GPU Total Memory", gpu_total_memory],
    ["GPU Used Memory", gpu_used_memory],
    ["GPU Free Memory", gpu_free_memory]
]

# Display the table
print(tabulate(table, headers=["Characteristic", "Value"], tablefmt="pretty"))


+------------------+--------------------------------+
|  Characteristic  |             Value              |
+------------------+--------------------------------+
|     CPU Type     | Intel(R) Xeon(R) CPU @ 2.20GHz |
| CPU Architecture |             x86_64             |
|    Total RAM     |            54.76 GB            |
|  Available RAM   |            19.26 GB            |
|     GPU Name     |              null              |
| GPU Total Memory |               0                |
| GPU Used Memory  |               0                |
| GPU Free Memory  |               0                |
+------------------+--------------------------------+


## 1. Data Acquisition and Preprocessing

In this section, we focus on acquiring the above-mentioned datasets.

### 1.4. ISCXIDS2012 dataset

The ISCXIDS2012 dataset was designed to assess how well intrusion detection systems identify and mitigate network-based cyber threats. It is curated from real-world network traffic data recorded in a controlled laboratory setting. The dataset comprises multiple distinct scenarios that represent various network attacks and intrusions, providing a comprehensive evaluation platform for cybersecurity research. These scenarios span a diverse spectrum of threats, including botnet activity, Distributed Denial of Service (DDoS) attacks, brute-force attacks, and an array of Web attacks, which makes the dataset a valuable asset for our study.

### Download and Unzip ISCXIDS2012 dataset

In [71]:
import os
import zipfile

import pandas as pd
import requests

# Define the source URL
src_url = "http://205.174.165.80/CICDataset/ISCX-IDS-2012/Dataset/labeled_flows_xml.zip"

# Define the destination folder
destination_folder = "/content/drive/MyDrive/ISCXIDS2012"

# Create the destination folder if it doesn't exist
if not os.path.exists(destination_folder):
    os.makedirs(destination_folder)

# Define the destination file path
dest_file = os.path.join(destination_folder, "labeled_flows_xml.zip")

# Download the file only if it is not already present
if not os.path.exists(dest_file):
    response = requests.get(src_url, stream=True)
    if response.status_code == 200:
        with open(dest_file, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)

        print("Download completed.")
    else:
        print("Failed to download the file.")

# Check if the dataset was downloaded and has not been extracted yet
if os.path.exists(dest_file) and len(os.listdir(destination_folder)) == 1:

    # Unzip the downloaded dataset
    with zipfile.ZipFile(dest_file, "r") as zip_ref:
        zip_ref.extractall(destination_folder)

    print("Unzip complete.")

else:

    print("Archive missing or already extracted. Skipping unzip.")

Unzip complete.
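
The first phase calls for ensuring the integrity of the acquired data, yet the cell above extracts the archive without checking it. A minimal sketch of such a check is shown here, assuming a hypothetical helper named `archive_is_intact` that is not part of the original notebook. It relies on the standard-library `zipfile.ZipFile.testzip()`, which returns the name of the first corrupt member, or `None` when all members pass their CRC check.

```python
import io
import zipfile

def archive_is_intact(zip_source):
    """Return True if every member of the archive passes its CRC check.

    `zip_source` may be a path or a file-like object.
    Hypothetical helper, shown for illustration only.
    """
    try:
        with zipfile.ZipFile(zip_source, "r") as zip_ref:
            # testzip() returns the first bad member name, or None if all are fine
            return zip_ref.testzip() is None
    except zipfile.BadZipFile:
        # Raised when the file is not a valid ZIP archive (e.g. a truncated download)
        return False

# Self-contained demonstration with in-memory data instead of the real archive
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("readme.txt", "demo content")

valid_result = archive_is_intact(buffer)
invalid_result = archive_is_intact(io.BytesIO(b"not a zip"))
print(valid_result, invalid_result)
```

In the notebook, such a check could be applied to `dest_file` before calling `extractall`.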


In [74]:
!ls -ahl '/content/drive/MyDrive/ISCXIDS2012'

total 3.5G
-rw------- 1 root root 394M Oct 10 17:36 labeled_flows_xml.zip
-rw------- 1 root root 1.9K Oct 10 18:47 readme.txt
-rw------- 1 root root 203M Oct 10 18:47 TestbedMonJun14Flows.xml
-rw------- 1 root root  22K Oct 10 18:47 TestbedMonJun14Flows.xsd
-rw------- 1 root root 139M Oct 10 18:47 TestbedSatJun12Flows.xml
-rw------- 1 root root  23K Oct 10 18:47 TestbedSatJun12Flows.xsd
-rw------- 1 root root 283M Oct 10 18:47 TestbedSunJun13Flows.xml
-rw------- 1 root root  24K Oct 10 18:47 TestbedSunJun13Flows.xsd
-rw------- 1 root root 298M Oct 10 18:47 TestbedThuJun17-1Flows.xml
-rw------- 1 root root  23K Oct 10 18:47 TestbedThuJun17-1Flows.xsd
-rw------- 1 root root 236M Oct 10 18:47 TestbedThuJun17-2Flows.xml
-rw------- 1 root root  24K Oct 10 18:47 TestbedThuJun17-2Flows.xsd
-rw------- 1 root root 117M Oct 10 18:47 TestbedThuJun17-3Flows.xml
-rw------- 1 root root  23K Oct 10 18:47 TestbedThuJun17-3Flows.xsd
-rw------- 1 root root 307M Oct 10 18:47 TestbedTueJun15-1Flows.xml
-r

In [75]:
import xml.etree.ElementTree as ET
import csv
import os

# Function to parse a single flow element
def parse_flow(flow):
    flow_data = {}
    flow_data['appName'] = flow.find('appName').text
    flow_data['totalSourceBytes'] = int(flow.find('totalSourceBytes').text)
    flow_data['totalDestinationBytes'] = int(flow.find('totalDestinationBytes').text)
    flow_data['totalDestinationPackets'] = int(flow.find('totalDestinationPackets').text)
    flow_data['totalSourcePackets'] = int(flow.find('totalSourcePackets').text)
    flow_data['direction'] = flow.find('direction').text
    flow_data['source'] = flow.find('source').text
    flow_data['protocolName'] = flow.find('protocolName').text
    flow_data['sourcePort'] = int(flow.find('sourcePort').text)
    flow_data['destination'] = flow.find('destination').text
    flow_data['destinationPort'] = int(flow.find('destinationPort').text)
    flow_data['startDateTime'] = flow.find('startDateTime').text
    flow_data['stopDateTime'] = flow.find('stopDateTime').text
    flow_data['Tag'] = flow.find('Tag').text
    return flow_data

# Folder on Google Drive that holds the extracted XML files
xml_folder = '/content/drive/MyDrive/ISCXIDS2012'

# Get a list of all XML files in the folder
xml_files = [f for f in os.listdir(xml_folder) if f.endswith('.xml')]

# Iterate through each XML file
for xml_file in xml_files:
    # Construct the full path to the XML file
    xml_file_path = os.path.join(xml_folder, xml_file)

    # Change the file extension from .xml to .csv
    csv_file_name = os.path.splitext(xml_file)[0] + '.csv'

    # Define the CSV file path for output
    csv_file_path = os.path.join(xml_folder, csv_file_name)

    # Initialize a list to store the extracted information
    results = []

    try:
        # Parse the XML data from the file
        tree = ET.parse(xml_file_path)
        root = tree.getroot()

        # Iterate through all elements in the XML
        for element in root.iter():
            # Check if the element's name starts with "Testbed"
            if element.tag.startswith("Testbed"):
                flow_data = parse_flow(element)
                if flow_data:
                    results.append(flow_data)

        # Write the extracted data to a CSV file
        with open(csv_file_path, mode='w', newline='') as csv_file:
            fieldnames = [
                'appName', 'totalSourceBytes', 'totalDestinationBytes',
                'totalDestinationPackets', 'totalSourcePackets', 'direction', 'source',
                'protocolName', 'sourcePort', 'destination', 'destinationPort',
                'startDateTime', 'stopDateTime', 'Tag'
            ]
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

            # Write the CSV header
            writer.writeheader()

            # Write the data for each flow
            for flow in results:
                try:
                    writer.writerow(flow)
                except Exception as e:
                    print(f"Error writing a row to '{csv_file_name}': {e}. Skipping this row.")

        print(f"Data from '{xml_file}' has been exported to '{csv_file_name}'")

    except ET.ParseError as e:
        print(f"Error parsing '{xml_file}': {e}. Skipping this file.")
        continue


Data from 'TestbedMonJun14Flows.xml' has been exported to 'TestbedMonJun14Flows.csv'
Data from 'TestbedSatJun12Flows.xml' has been exported to 'TestbedSatJun12Flows.csv'
Data from 'TestbedSunJun13Flows.xml' has been exported to 'TestbedSunJun13Flows.csv'
Error parsing 'TestbedThuJun17-1Flows.xml': not well-formed (invalid token): line 3135760, column 209. Skipping this file.
Data from 'TestbedThuJun17-2Flows.xml' has been exported to 'TestbedThuJun17-2Flows.csv'
Data from 'TestbedThuJun17-3Flows.xml' has been exported to 'TestbedThuJun17-3Flows.csv'
Data from 'TestbedTueJun15-1Flows.xml' has been exported to 'TestbedTueJun15-1Flows.csv'
Data from 'TestbedTueJun15-2Flows.xml' has been exported to 'TestbedTueJun15-2Flows.csv'
Data from 'TestbedTueJun15-3Flows.xml' has been exported to 'TestbedTueJun15-3Flows.csv'
Data from 'TestbedWedJun16-1Flows.xml' has been exported to 'TestbedWedJun16-1Flows.csv'
Data from 'TestbedWedJun16-2Flows.xml' has been exported to 'TestbedWedJun16-2Flows.csv'
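
The conversion cell above loads each XML file completely with `ET.parse`, which can be memory-hungry for files of several hundred megabytes. A possible alternative, sketched here under the assumption that flow records are elements whose tag starts with `Testbed` (as the cell above implies), is to stream the file with `ET.iterparse`. The sample document, the `iter_flows` name, and the reduced field list are illustrative only.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in document; the element layout is an assumption modeled on parse_flow
SAMPLE_XML = b"""<dataroot>
  <TestbedDemoFlows><appName>HTTPWeb</appName><Tag>Normal</Tag></TestbedDemoFlows>
  <TestbedDemoFlows><appName>SSH</appName><Tag>Attack</Tag></TestbedDemoFlows>
</dataroot>"""

def iter_flows(xml_source, fields=("appName", "Tag")):
    """Yield one dict per flow element while the file is being read."""
    for _event, element in ET.iterparse(xml_source, events=("end",)):
        if element.tag.startswith("Testbed"):
            yield {name: element.findtext(name) for name in fields}
            # Release the children of the flow that was just processed
            element.clear()

flows = list(iter_flows(io.BytesIO(SAMPLE_XML)))
print(flows)
```

Each yielded dictionary could be passed directly to `csv.DictWriter.writerow`, so the intermediate `results` list would no longer be needed.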

In [76]:
import os
import pandas as pd

encoding = 'ISO-8859-1'  # Specify the correct encoding


# Folder containing the converted CSV files
csv_folder = '/content/drive/MyDrive/ISCXIDS2012'

# List to store individual DataFrames
dfs = []

# Iterate over the CSV files in the folder
for filename in os.listdir(csv_folder):
    if filename.endswith(".csv") and not filename.startswith("ISCXIDS2012"):

        # Read the CSV file with the specified encoding
        try:
            df = pd.read_csv(os.path.join(csv_folder, filename), encoding=encoding)
        except UnicodeDecodeError:
            print(f'Error: Unable to read {filename} with encoding {encoding}')
            continue  # Skip files that cannot be decoded
        dfs.append(df)

# Concatenate all DataFrames into one
ISCXIDS2012_df = pd.concat(dfs, ignore_index=True)

In [77]:
# Information about the starting ISCXIDS2012 DataFrame
ISCXIDS2012_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885519 entries, 0 to 1885518
Data columns (total 14 columns):
 #   Column                   Dtype 
---  ------                   ----- 
 0   appName                  object
 1   totalSourceBytes         int64 
 2   totalDestinationBytes    int64 
 3   totalDestinationPackets  int64 
 4   totalSourcePackets       int64 
 5   direction                object
 6   source                   object
 7   protocolName             object
 8   sourcePort               int64 
 9   destination              object
 10  destinationPort          int64 
 11  startDateTime            object
 12  stopDateTime             object
 13  Tag                      object
dtypes: int64(6), object(8)
memory usage: 201.4+ MB


In [78]:
# Some basic statistical details like percentile, mean, std, etc. of the starting ISCXIDS2012 DataFrame
ISCXIDS2012_df.describe()

Unnamed: 0,totalSourceBytes,totalDestinationBytes,totalDestinationPackets,totalSourcePackets,sourcePort,destinationPort
count,1885519.0,1885519.0,1885519.0,1885519.0,1885519.0,1885519.0
mean,2565.597,35116.33,30.93461,20.13171,14610.91,2041.804
std,787915.3,1208891.0,1010.258,691.7955,20532.54,8981.55
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,247.0,360.0,2.0,3.0,2326.0,80.0
50%,444.0,1179.0,5.0,6.0,3784.0,80.0
75%,852.0,7530.0,12.0,10.0,18234.0,80.0
max,763277600.0,1254005000.0,872224.0,514794.0,65535.0,65535.0


In [79]:
ISCXIDS2012_df.shape, ISCXIDS2012_df.columns

((1885519, 14),
 Index(['appName', 'totalSourceBytes', 'totalDestinationBytes',
        'totalDestinationPackets', 'totalSourcePackets', 'direction', 'source',
        'protocolName', 'sourcePort', 'destination', 'destinationPort',
        'startDateTime', 'stopDateTime', 'Tag'],
       dtype='object'))

### Preprocessing of the ISCXIDS2012 dataset

In [80]:
import numpy as np
from sklearn.impute import SimpleImputer

# If the dataset has not been preprocessed yet, do:
  # 1 # Handling Missing Values
  # 2 # Encode Categorical Features and Label
  # 3 # Normalization (Min-Max Scaling)
  # 4 # Removing duplicate records

df_final_file_path = os.path.join(destination_folder, "ISCXIDS2012.csv")
if not os.path.exists(df_final_file_path):

  # Step 1: Handling Missing Values

  # Count the missing values stored as NaN
  check_nan = ISCXIDS2012_df.isna().sum().sum()

  # Check if missing values are represented as empty values (",,")
  missing_values_as_empty = ISCXIDS2012_df.applymap(lambda x: x == '')

  # Count the number of empty values in each column
  missing_values_count = missing_values_as_empty.sum()

  # True if at least one column contains empty values
  check_null = (missing_values_count != 0).any()

  # Replace empty values with NaN
  if check_null:
    ISCXIDS2012_df.replace("", np.nan, inplace=True)

  # Impute missing values with the most frequent value for categorical columns and mean for numerical columns
  if check_null or check_nan != 0:
    imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
    for col in ISCXIDS2012_df.columns:
        if ISCXIDS2012_df[col].dtype == 'object':
            # fit_transform returns a 2-D array, so flatten it back to one column
            ISCXIDS2012_df[col] = imputer.fit_transform(ISCXIDS2012_df[[col]]).ravel()
        else:
            ISCXIDS2012_df[col] = ISCXIDS2012_df[col].fillna(ISCXIDS2012_df[col].mean())

In [81]:
# Check again for missing values, NAN
ISCXIDS2012_df.isna().sum(axis=0)

appName                    0
totalSourceBytes           0
totalDestinationBytes      0
totalDestinationPackets    0
totalSourcePackets         0
direction                  0
source                     0
protocolName               0
sourcePort                 0
destination                0
destinationPort            0
startDateTime              0
stopDateTime               0
Tag                        0
dtype: int64
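
The real dataset reports no missing values, so the imputation branch is never exercised here. To illustrate what it is meant to do, the following minimal sketch applies the same idea (mode for categorical columns, mean for numerical ones) to a toy frame. The column values are made up, and plain pandas `fillna` is used in place of `SimpleImputer` for brevity.

```python
import numpy as np
import pandas as pd

# Toy frame: one empty string and one NaN stand in for missing values
toy = pd.DataFrame({
    "protocolName": ["tcp_ip", "", "tcp_ip", "udp_ip"],
    "totalSourceBytes": [100.0, 200.0, np.nan, 300.0],
})

# Treat empty strings as missing values
toy = toy.replace("", np.nan)

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numerical column: fill with the column mean
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Categorical column: fill with the most frequent value
        toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```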

In [82]:
  # 2 # Encode Categorical Features and Label

df = ISCXIDS2012_df.copy()

#['appName','direction','source','protocolName','destination','startDateTime','stopDateTime','Tag']

import numpy as np

# Encode each categorical feature as integer category codes
for col in ['appName', 'direction', 'source', 'destination', 'protocolName']:
    ISCXIDS2012_df[col] = ISCXIDS2012_df[col].astype('category').cat.codes.astype(np.int32)

# Drop startDateTime and stopDateTime
ISCXIDS2012_df.drop(['startDateTime', 'stopDateTime'], axis=1, inplace=True)
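
The encoding step above replaces every categorical value with an integer code. As a small illustration, using made-up sample values rather than the full dataset, the sketch below shows how pandas assigns these codes and how the mapping can be kept for later decoding. Note that the codes follow the sorted order of the values and carry no numeric meaning of their own, which is worth keeping in mind for identifier-like columns such as `source` and `destination`.

```python
import pandas as pd

# Illustrative values only
protocols = pd.Series(["tcp_ip", "udp_ip", "icmp_ip", "tcp_ip"])
as_category = protocols.astype("category")

# Categories are sorted, then numbered from 0
mapping = dict(enumerate(as_category.cat.categories))
codes = as_category.cat.codes.tolist()

print(mapping)  # code -> original value
print(codes)
```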

In [83]:
# Display the top 10 most frequent values and their counts in the 'Tag' column of ISCXIDS2012
print(ISCXIDS2012_df.Tag.value_counts().head(10))

# Change the data type of the 'Tag' column to 'object' (string)
ISCXIDS2012_df['Tag'] = ISCXIDS2012_df['Tag'].astype(dtype='object')

# Check if the 'Tag' column starts with the string 'Attack', and assign a Boolean value accordingly
ISCXIDS2012_df['Tag'] = ISCXIDS2012_df['Tag'].str.startswith('Attack', na=False)

# Change the data type of the 'Tag' column to 'float32'
ISCXIDS2012_df['Tag'] = ISCXIDS2012_df['Tag'].astype(dtype='float32', copy=False)

# Display again the top 10 most frequent values and their counts in the 'Tag' column of ISCXIDS2012 after modifications
print(ISCXIDS2012_df.Tag.value_counts().head(10))

Normal    1816609
Attack      68910
Name: Tag, dtype: int64
0.0    1816609
1.0      68910
Name: Tag, dtype: int64


In [84]:
from sklearn.preprocessing import MinMaxScaler

# If the dataset has not been preprocessed yet, do:
if not os.path.exists(df_final_file_path):

  # Step 3: Normalization (Min-Max Scaling)

  #columns = [col for col in ISCXIDS2012_df.columns if col not in ['appName','direction','source','protocolName','destination','startDateTime','stopDateTime','Tag']]
  min_max_scaler = MinMaxScaler().fit(ISCXIDS2012_df)
  ISCXIDS2012_df = pd.DataFrame(data=min_max_scaler.transform(ISCXIDS2012_df), columns=ISCXIDS2012_df.columns)

display(ISCXIDS2012_df.head())

Unnamed: 0,appName,totalSourceBytes,totalDestinationBytes,totalDestinationPackets,totalSourcePackets,direction,source,protocolName,sourcePort,destination,destinationPort,Tag
0,0.830189,2.10618e-05,0.0,0.0,0.000346,0.333333,0.269822,1.0,0.081682,0.438092,0.081682,0.0
1,0.179245,5.030935e-07,0.0,0.0,1.2e-05,0.333333,0.265372,0.8,0.067674,0.313352,0.001221,0.0
2,0.084906,2.240338e-07,5.119595e-07,5e-06,4e-06,0.0,0.268608,1.0,0.067567,0.23459,0.000809,0.0
3,0.179245,5.030935e-07,0.0,0.0,1.2e-05,0.333333,0.268608,0.8,0.055528,0.424007,0.001221,0.0
4,0.179245,2.436859e-07,1.020729e-07,2e-06,4e-06,0.333333,0.268608,0.8,0.055558,0.980615,0.001221,0.0
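
The byte and packet columns above end up with very small values after scaling. Min-max scaling maps each value with x' = (x - min) / (max - min), so a single very large maximum compresses all typical values towards zero. The sketch below reproduces this arithmetic in plain Python, using the quartiles and the (rounded) maximum of `totalSourceBytes` from the `describe()` output as illustrative inputs. The function name `min_max_scale` is hypothetical.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: there is no spread to scale, map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Minimum, quartiles, and rounded maximum of totalSourceBytes (illustrative inputs)
total_source_bytes = [0.0, 247.0, 444.0, 852.0, 763277600.0]
scaled = min_max_scale(total_source_bytes)

for original, value in zip(total_source_bytes, scaled):
    print(f"{original:>14.1f} -> {value:.3e}")
```

Even the 75th percentile is mapped to a value on the order of 1e-6, which is consistent with the scaled rows shown above.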


In [85]:
  # 4 # Removing duplicate records

# Print the shape of the DataFrame 'ISCXIDS2012_df' before removing duplicates
print(ISCXIDS2012_df.shape)

# Remove duplicate rows from the DataFrame 'ISCXIDS2012_df' while resetting the index
ISCXIDS2012_df = ISCXIDS2012_df.drop_duplicates()
ISCXIDS2012_df.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'ISCXIDS2012_df' after removing duplicates and resetting the index
print(ISCXIDS2012_df.shape)


# Print the shape of the DataFrame 'df' before removing duplicates
print(df.shape)

# Remove duplicate rows from the DataFrame 'df' while resetting the index
df = df.drop_duplicates()
df.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df' after removing duplicates and resetting the index
print(df.shape)

(1885519, 12)
(1622767, 12)
(1885519, 14)
(1746630, 14)
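
The encoded frame loses more rows to deduplication (1,885,519 to 1,622,767) than the raw copy `df` (1,885,519 to 1,746,630). A plausible explanation, not verified in this notebook, is that the encoded frame no longer contains `startDateTime` and `stopDateTime`, so flows that differ only in their timestamps become identical. The toy example below, with made-up values, illustrates the effect.

```python
import pandas as pd

# Two flows that differ only in their start time (illustrative values)
raw = pd.DataFrame({
    "appName": ["HTTPWeb", "HTTPWeb"],
    "totalSourceBytes": [384, 384],
    "startDateTime": ["2010-06-12T23:57:24", "2010-06-12T23:58:10"],
    "Tag": ["Normal", "Normal"],
})

# With the timestamp the rows are distinct; without it they collapse into one
rows_with_time = len(raw.drop_duplicates())
rows_without_time = len(raw.drop(columns=["startDateTime"]).drop_duplicates())

print(rows_with_time, rows_without_time)
```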


In [86]:
# Print out the DataFrames loaded in the memory
%whos DataFrame

Variable                  Type         Data/Info
------------------------------------------------
ISCXIDS2012_df            DataFrame              appName  totalS<...>622767 rows x 12 columns]
X                         DataFrame              appName  totalS<...>622752 rows x 11 columns]
df                        DataFrame                       appNam<...>746630 rows x 14 columns]
missing_values_as_empty   DataFrame             appName  totalSo<...>885519 rows x 14 columns]


In [87]:
del missing_values_as_empty

In [88]:
# Save the resulting dataframes

df_file_path = '/content/drive/MyDrive/ISCXIDS2012/ISCXIDS2012.csv'
df_encoded_file_path = '/content/drive/MyDrive/ISCXIDS2012/ISCXIDS2012_encoded.csv'

# Check if the Dataset is saved:
if not os.path.exists(df_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df.to_csv(df_file_path, index=False)

if not os.path.exists(df_encoded_file_path):
  # Convert your Pandas DataFrame to a CSV file
  ISCXIDS2012_df.to_csv(df_encoded_file_path, index=False)

## 2. Algorithm Evaluation

In this section, we assess the performance of various machine learning algorithms on the above-mentioned datasets.

### 2.4. ISCXIDS2012 dataset evaluation with baseline and traditional ML algorithms

In this section, we evaluate the performance of various machine learning algorithms on the ISCXIDS2012 dataset. We assess the precision and F1 scores, essential indicators of classification accuracy, for a range of algorithms, including fundamental classifiers like Zero Rule and One Rule, statistical approaches like Naive Bayes, and more advanced models such as Random Forest.

Given that the ISCXIDS2012 dataset has a limited number of features, we skip feature selection and test the algorithms on all the features of the dataset, using a 10-fold cross-validation methodology to ensure a robust assessment. These results offer valuable insights into the optimal dataset generation strategy, aiding in the selection of the most effective feature extraction methods for cybersecurity-specific datasets.
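
The evaluation cells that follow call `cross_val_score` once per metric and `cross_val_predict` once more, so every model is trained three times per fold. As a possible alternative, sketched here on synthetic data rather than on the real features, scikit-learn's `cross_validate` accepts several metrics at once and also reports the fit time of each fold. The names `X_demo` and `y_demo` and the `make_classification` settings are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the real features and labels
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=11, weights=[0.9, 0.1], random_state=0
)

# One call fits each fold once and scores it on every requested metric
cv_results = cross_validate(
    GaussianNB(), X_demo, y_demo, cv=10, scoring=["precision", "f1"]
)

print("Precision:", cv_results["test_precision"].mean())
print("F1 score :", cv_results["test_f1"].mean())
print("Fit time :", cv_results["fit_time"].sum())
```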

In [89]:
# Skip selection of best features due to low count of attributes

# Separate features (X) and labels (y)
X = ISCXIDS2012_df.drop('Tag', axis=1)  # Exclude the label column
y = ISCXIDS2012_df['Tag']

In [90]:
import pandas as pd
import numpy as np
import warnings
from sklearn.metrics import precision_score, mean_absolute_error, f1_score
from sklearn.dummy import DummyClassifier
from tabulate import tabulate
import time
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Define the number of desired folds for Cross-Validation (e.g., 10)
num_folds = 10

# Initialize performance metrics lists
results = []

# Suppress all warning messages
warnings.filterwarnings("ignore")

In [91]:
# Define a file name for saving the results
results_file_name = os.path.join(destination_folder, "iscxids2012_results.pkl")

# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define ZeroRule classifier
  zero_rule = DummyClassifier(strategy="most_frequent")

  # Evaluate ZeroRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores = cross_val_score(zero_rule, X, y, cv=num_folds, scoring='precision')
  f1_scores = cross_val_score(zero_rule, X, y, cv=num_folds, scoring='f1')
  elapsed_time = time.time() - start_time  # Calculate execution time

  variance = np.var(precision_scores)

  predictions = cross_val_predict(zero_rule, X, y, cv=num_folds)
  mae = mean_absolute_error(y, predictions)

  # Display ZeroRule results
  print("ZeroRule Precision :", np.mean(precision_scores))
  print("ZeroRule F1 Score :", np.mean(f1_scores))
  print("ZeroRule Variance :", variance)
  print("ZeroRule MAE :", mae)
  print("ZeroRule Execution Time:", elapsed_time)

  results.append(["ZeroRule", np.mean(precision_scores), np.mean(f1_scores), variance, mae, elapsed_time])

ZeroRule Precision : 0.0
ZeroRule F1 Score : 0.0
ZeroRule Variance : 0.0
ZeroRule MAE : 0.03279891691167001
ZeroRule Execution Time: 5.6921117305755615
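
These ZeroRule numbers can be checked by hand. The classifier always predicts the majority class (0.0, normal traffic), so it never produces a positive prediction, which is why precision and F1 are reported as 0. Its only errors are the attack flows, so its MAE equals the share of attacks in the data. A minimal check, using the label counts that `value_counts()` reports for the deduplicated, encoded frame:

```python
# Label counts of the deduplicated, encoded frame (from its value_counts() output)
normal_count = 1569542
attack_count = 53225
total = normal_count + attack_count

# ZeroRule always predicts 0.0, so each attack flow contributes an error of |1 - 0| = 1
zero_rule_mae = attack_count / total

print(f"Attack share / ZeroRule MAE: {zero_rule_mae:.6f}")
```

The result agrees with the MAE of roughly 0.0328 printed above.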


In [92]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

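  # Define the "OneRule" baseline.
  # Note: DummyClassifier(strategy="stratified") draws predictions at random according to
  # the class distribution; it is a stand-in baseline, not a true OneR rule learner.
  one_rule = DummyClassifier(strategy="stratified")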

  # Evaluate OneRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores = cross_val_score(one_rule, X, y, cv=num_folds, scoring='precision')
  f1_scores = cross_val_score(one_rule, X, y, cv=num_folds, scoring='f1')
  elapsed_time = time.time() - start_time  # Calculate execution time

  variance = np.var(precision_scores)

  predictions = cross_val_predict(one_rule, X, y, cv=num_folds)
  mae = mean_absolute_error(y, predictions)

  # Display OneRule results
  print("OneRule Precision :", np.mean(precision_scores))
  print("OneRule F1 Score :", np.mean(f1_scores))
  print("OneRule Variance :", variance)
  print("OneRule MAE :", mae)
  print("OneRule Execution Time:", elapsed_time)

  results.append(["OneRule", np.mean(precision_scores), np.mean(f1_scores), variance, mae, elapsed_time])

OneRule Precision : 0.0333212537210722
OneRule F1 Score : 0.03162443842440634
OneRule Variance : 4.3234639380297525e-06
OneRule MAE : 0.06329867442460932
OneRule Execution Time: 5.965609788894653
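
The baseline labelled OneRule above is `DummyClassifier(strategy="stratified")`, which guesses labels at random in proportion to the class frequencies. Its expected precision is therefore roughly the attack share (about 0.033), in line with the result above. For comparison, the sketch below shows what a classic OneR learner does on categorical data: for each feature it maps every value to its majority label and keeps the feature with the fewest training errors. This is an illustrative, simplified version with hypothetical names and toy data; numeric features would first need binning, which is omitted here.

```python
from collections import Counter, defaultdict

def fit_one_rule(rows, labels):
    """Pick the single feature whose value -> majority-label rule makes the fewest errors."""
    best = None
    for feature in range(len(rows[0])):
        # Count the labels observed for every value taken by this feature
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[feature]][label] += 1
        # The rule maps each value to its most frequent label
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[feature]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, feature, rule)
    return best

# Toy categorical data: (protocol, direction) -> label
rows = [("tcp", "L2R"), ("tcp", "R2L"), ("udp", "L2R"), ("udp", "R2L")]
labels = [1, 1, 0, 0]

errors, feature, rule = fit_one_rule(rows, labels)
print(errors, feature, rule)
```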


In [93]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define Naive Bayes classifier
  naive_bayes = GaussianNB()

  # Evaluate Naive Bayes classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores = cross_val_score(naive_bayes, X, y, cv=num_folds, scoring='precision')
  f1_scores = cross_val_score(naive_bayes, X, y, cv=num_folds, scoring='f1')
  elapsed_time = time.time() - start_time  # Calculate execution time

  variance = np.var(precision_scores)

  predictions = cross_val_predict(naive_bayes, X, y, cv=num_folds)
  mae = mean_absolute_error(y, predictions)

  # Display Naive Bayes results
  print("Naive Bayes Precision :", np.mean(precision_scores))
  print("Naive Bayes F1 Score :", np.mean(f1_scores))
  print("Naive Bayes Variance :", variance)
  print("Naive Bayes MAE :", mae)
  print("Naive Bayes Execution Time:", elapsed_time)

  results.append(["Naive Bayes", np.mean(precision_scores), np.mean(f1_scores), variance, mae, elapsed_time])

Naive Bayes Precision : 0.2786100231779066
Naive Bayes F1 Score : 0.4317346957207556
Naive Bayes Variance : 0.001716265973914469
Naive Bayes MAE : 0.08639872514045455
Naive Bayes Execution Time: 14.252134561538696


In [94]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Create a Random Forest classifier with optimized parameters
  rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)  # Adjust parameters for optimization

  # Evaluate Random Forest classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores = cross_val_score(rf_classifier, X, y, cv=num_folds, scoring='precision')
  f1_scores = cross_val_score(rf_classifier, X, y, cv=num_folds, scoring='f1')
  elapsed_time = time.time() - start_time  # Calculate execution time

  variance = np.var(precision_scores)

  predictions = cross_val_predict(rf_classifier, X, y, cv=num_folds)
  mae = mean_absolute_error(y, predictions)

  # Display Random Forest results
  print("Random Forest Precision :", np.mean(precision_scores))
  print("Random Forest F1 Score :", np.mean(f1_scores))
  print("Random Forest Variance :", variance)
  print("Random Forest MAE :", mae)
  print("Random Forest Execution Time:", elapsed_time)

  results.append(["Random Forest", np.mean(precision_scores), np.mean(f1_scores), variance, mae, elapsed_time])

Random Forest Precision : 0.9460547155556993
Random Forest F1 Score : 0.8910964816398075
Random Forest Variance : 0.025444219779817073
Random Forest MAE : 0.007293098762792194
Random Forest Execution Time: 1023.1033320426941


In [95]:
import pickle
import os

if not os.path.exists(results_file_name):

  # Save the results lists to a file
  with open(results_file_name, 'wb') as file:
      results_dict = {
          'results': results,
      }
      pickle.dump(results_dict, file)


In [96]:
# Load the results from the file
with open(results_file_name, 'rb') as file:
    loaded_results = pickle.load(file)

# Access the loaded results lists
results = loaded_results['results']


In [97]:
# Print the results in tabular format
headers = ["Algorithm", "Precision", "F1 Score", "Variance", "MAE", "Execution Time"]

print(tabulate(results, headers, tablefmt="pretty"))

+---------------+--------------------+---------------------+------------------------+----------------------+--------------------+
|   Algorithm   |     Precision      |      F1 Score       |        Variance        |         MAE          |   Execution Time   |
+---------------+--------------------+---------------------+------------------------+----------------------+--------------------+
|   ZeroRule    |        0.0         |         0.0         |          0.0           | 0.03279891691167001  | 5.6921117305755615 |
|    OneRule    | 0.0333212537210722 | 0.03162443842440634 | 4.3234639380297525e-06 | 0.06329867442460932  | 5.965609788894653  |
|  Naive Bayes  | 0.2786100231779066 | 0.4317346957207556  |  0.001716265973914469  | 0.08639872514045455  | 14.252134561538696 |
| Random Forest | 0.9460547155556993 | 0.8910964816398075  |  0.025444219779817073  | 0.007293098762792194 | 1023.1033320426941 |
+---------------+--------------------+---------------------+------------------------+----------------------+--------------------+

In [98]:
label_counts_ISCXIDS2012 = ISCXIDS2012_df['Tag'].value_counts()

# Display the counts with labels for ISCXIDS2012
print("Label counts for ISCXIDS2012:")
print(label_counts_ISCXIDS2012)

# Assuming 'Tag' is the name of the column containing your labels in df
label_counts_df = df['Tag'].value_counts()

# Display the counts with labels for df
print("\nLabel counts for df:")
print(label_counts_df)


Label counts for ISCXIDS2012:
0.0    1569542
1.0      53225
Name: Tag, dtype: int64

Label counts for df:
Normal    1687923
Attack      58707
Name: Tag, dtype: int64
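
These counts quantify the class imbalance mentioned in the introduction. As a rough illustration, the sketch below computes the imbalance ratio and the per-class weights given by the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn applies for `class_weight="balanced"`. Class weighting is shown only as one possible way to address imbalance; it is not used elsewhere in this notebook.

```python
# Label counts of the encoded frame, as printed above
label_counts = {0.0: 1569542, 1.0: 53225}

n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# Imbalance ratio: normal flows per attack flow
imbalance_ratio = label_counts[0.0] / label_counts[1.0]

# "Balanced" weights: n_samples / (n_classes * class_count)
class_weights = {
    label: n_samples / (n_classes * count) for label, count in label_counts.items()
}

print(f"Normal flows per attack flow: {imbalance_ratio:.1f}")
print(class_weights)
```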
