# Intrusion Detection System (IDS) Public Datasets Benchmarking

In cybersecurity, the design, development, and implementation of effective Intrusion Detection Systems (IDS) are important for safeguarding IT&C infrastructures from unauthorized access, data breaches, and various forms of malicious activity. The selection of an appropriate ML/DL algorithm plays an essential role in ensuring the security and integrity of protected systems.

Before developing a state-of-the-art algorithm, however, we need appropriate data, which must be studied and analysed in order to understand the reality and challenges of our ML problem. Following this approach, we chose to study the earliest datasets created for IDS in order to derive lessons learned for future dataset development.

This experiment aims to comprehensively evaluate the performance of different ML and DL algorithms on a variety of datasets, encompassing a wide range of network traffic scenarios. The datasets used for this analysis include well-known benchmark datasets such as KDD, NSL-KDD, CTU-13, ISCXIDS2012, CIC-IDS2017, CSE-CIC-IDS2018, and Kyoto 2006+. Each dataset represents a distinct set of challenges and characteristics, making this evaluation both diverse and insightful.

The experiment is divided into three main phases:

1. **Data Acquisition and Preprocessing**:
 - In this phase, we acquire the selected datasets from reputable sources, ensuring the integrity and accuracy of the data.
 - Data preprocessing tasks include handling missing values, selecting the most relevant features using feature selection techniques, normalizing the data, and, if necessary, performing feature engineering to enhance the dataset's suitability for machine learning.

2. **Algorithm Evaluation**:
 - We evaluate the performance of a range of ML/DL algorithms on each dataset. The chosen algorithms include baseline methods like ZeroRule and OneRule, traditional machine learning approaches like Naive Bayes and Random Forest, as well as some of the most used anomaly detection deep learning algorithms.
 - Cross-validation is applied to ensure the robustness of our results. Performance metrics such as precision, variance, and Mean Absolute Error (MAE) are calculated for each algorithm and dataset.

3. **Results and Insights**:
 - The results of this evaluation provide valuable insights into the strengths and weaknesses of different IDS algorithms under various conditions.
 - We analyze the performance of algorithms on both the original datasets and balanced datasets to address the challenge of class imbalance in intrusion detection.
 - Observations and additional details regarding the algorithms' performance are documented, providing a comprehensive overview of their behavior.
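
As a small illustration of the metrics named in phase 2, the sketch below computes precision, F1 score, and MAE for a toy set of binary predictions. The labels and predictions are made-up values, not results from this experiment. With 0/1 labels, MAE is simply the fraction of misclassified samples, which is why it can be read as an error rate here.

```python
from sklearn.metrics import precision_score, f1_score, mean_absolute_error

# Toy ground truth and predictions (hypothetical values; 1 = attack, 0 = benign)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
mae = mean_absolute_error(y_true, y_pred)    # mean of |y_true - y_pred|

# For binary 0/1 labels, MAE equals the misclassification rate
error_rate = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"precision={precision:.2f}  f1={f1:.2f}  mae={mae:.2f}  error_rate={error_rate:.2f}")
```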

By conducting this experiment, we aim to contribute to the understanding of cyber domain dataset generation. The findings will assist in making informed decisions when developing a cybersecurity AI application, by deriving the necessary steps and procedures for selecting appropriate learning data.

The following sections of this Jupyter notebook will provide a detailed walkthrough of the experiment, including code snippets, visualizations, and discussions of the results.

In [1]:
# Mount your Google Drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
import warnings
from google.colab import files

# Suppress all warning messages
warnings.filterwarnings("ignore")

# Check if the Kaggle API credentials file already exists
kaggle_credentials_path = os.path.expanduser("~/.kaggle/kaggle.json")

if not os.path.exists(kaggle_credentials_path):

    if not os.path.exists(os.path.join("/content/drive/MyDrive/.kaggle/", "kaggle.json")):

      # Upload your Kaggle API credentials file (kaggle.json)
      files.upload()

      # Keep a copy on Google Drive so later sessions can skip the upload
      !mkdir -p "/content/drive/MyDrive/.kaggle/"
      !mv kaggle.json "/content/drive/MyDrive/.kaggle/"
      !chmod 600 "/content/drive/MyDrive/.kaggle/kaggle.json"

    # Copy the Kaggle API credentials file to the location the Kaggle CLI reads
    !mkdir -p ~/.kaggle
    !cp '/content/drive/MyDrive/.kaggle/kaggle.json' ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json

else:

    print("Kaggle API credentials file already exists.")

In [3]:
import tensorflow as tf

# tf.test.is_gpu_available() is deprecated; query the physical devices instead
print("GPU available:", bool(tf.config.list_physical_devices('GPU')))
print("GPU device name:", tf.test.gpu_device_name())



GPU available: False
GPU device name: 


In [4]:
import os
from psutil import virtual_memory
from tabulate import tabulate

# Function to get CPU information
def get_cpu_info():
    cpu_info = os.popen('lscpu').read()
    return cpu_info

# Function to get RAM information
def get_ram_info():
    ram = virtual_memory()
    total_ram = f"{ram.total / 1e9:.2f} GB"
    available_ram = f"{ram.available / 1e9:.2f} GB"
    return total_ram, available_ram

# Function to get GPU information
def get_gpu_info():
    # Execute nvidia-smi and get its output
    gpu_info = os.popen('nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv,noheader,nounits').read().strip()

    # Split the output to get individual GPU details
    details = gpu_info.split(", ")

    # Return GPU name, total, used, and free memory
    return details[0], f"{details[1]} MB", f"{details[2]} MB", f"{details[3]} MB"

# Collect system information
cpu_info = get_cpu_info()
total_ram, available_ram = get_ram_info()
try:
  gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = get_gpu_info()
except Exception:
  # nvidia-smi is missing or returned nothing (no GPU attached)
  gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = 'null',0,0,0

# Extract relevant CPU information
cpu_type = ""
cpu_architecture = ""

for line in cpu_info.splitlines():
    if "Model name:" in line:
        cpu_type = line.split(":")[1].strip()
    elif "Architecture:" in line:
        cpu_architecture = line.split(":")[1].strip()

# Create a table
table = [
    ["CPU Type", cpu_type],
    ["CPU Architecture", cpu_architecture],
    ["Total RAM", total_ram],
    ["Available RAM", available_ram],
    ["GPU Name", gpu_name],
    ["GPU Total Memory", gpu_total_memory],
    ["GPU Used Memory", gpu_used_memory],
    ["GPU Free Memory", gpu_free_memory]
]

# Display the table
print(tabulate(table, headers=["Characteristic", "Value"], tablefmt="pretty"))


+------------------+--------------------------------+
|  Characteristic  |             Value              |
+------------------+--------------------------------+
|     CPU Type     | Intel(R) Xeon(R) CPU @ 2.30GHz |
| CPU Architecture |             x86_64             |
|    Total RAM     |            37.84 GB            |
|  Available RAM   |            35.71 GB            |
|     GPU Name     |              null              |
| GPU Total Memory |               0                |
| GPU Used Memory  |               0                |
| GPU Free Memory  |               0                |
+------------------+--------------------------------+


## 1. Data Acquisition and Preprocessing

In this section, we focus on acquiring the datasets mentioned above.

### 1.5. CIC-IDS2017 dataset
CIC-IDS2017, or the Canadian Institute for Cybersecurity Intrusion Detection System 2017 dataset, is a comprehensive cybersecurity dataset designed for research and development in the field of network intrusion detection. It consists of a diverse and realistic collection of network traffic data, including both benign and malicious activities.

### Download and Unzip CIC-IDS2017 dataset

In [7]:
import os
import pandas as pd
import zipfile

# Specify the dataset name
dataset_name = "cicdataset/cicids2017"

# Specify the destination folder in your Google Drive
destination_folder = "/content/drive/MyDrive/CIC-IDS2017-BM"

# Check if the dataset file already exists in your Google Drive
dataset_file_path = os.path.join(destination_folder, "cicids2017.zip")

if not os.path.exists(dataset_file_path):

  # Download the dataset and save it to your Google Drive
  !kaggle datasets download -d $dataset_name -p $destination_folder

  print("Download complete.")

else:

  print("Dataset already exists. Skipping download.")

dest_file = f"{destination_folder}/cicids2017.zip"

# Check if the dataset was downloaded and has not been extracted yet
if os.path.exists(dest_file) and len(os.listdir(destination_folder))==1:

  # Unzip the downloaded dataset
  with zipfile.ZipFile(dest_file, "r") as zip_ref:
      zip_ref.extractall(destination_folder)

  print("Unzip complete.")

else:

  print("Dataset already exists. Skipping unzip.")

Downloading cicids2017.zip to /content/drive/MyDrive/CIC-IDS2017-BM
 91% 209M/230M [00:01<00:00, 152MB/s]
100% 230M/230M [00:01<00:00, 132MB/s]
Download complete.
Unzip complete.


In [8]:
!ls -ahl '/content/drive/MyDrive/CIC-IDS2017-BM'

total 230M
-rw------- 1 root root 230M Jan  3  2020 cicids2017.zip
drwx------ 3 root root 4.0K Oct 11 06:05 MachineLearningCSV
-rw------- 1 root root   57 Oct 11 06:05 MachineLearningCSV.md5
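
The listing above includes a `MachineLearningCSV.md5` file. A minimal sketch of how such a checksum could be verified with the standard library follows. It assumes the `.md5` file starts with the hex digest, optionally followed by a file name; which archive the checksum refers to is not stated in this notebook, so the helper names and the paths passed to them are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's digest with the first token of an .md5 file.

    Assumption: the .md5 file has the form "<hex digest>  <file name>".
    """
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(data_path) == expected
```

A call such as `matches_md5_file(archive_path, md5_path)` returns `True` only when the digests agree, which is a cheap way to catch a truncated download before unzipping.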


In [10]:
!ls -ahl '/content/drive/MyDrive/CIC-IDS2017-BM/MachineLearningCSV/MachineLearningCVE'

total 844M
-rw------- 1 root root  74M Oct 11 06:05 Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
-rw------- 1 root root  74M Oct 11 06:05 Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
-rw------- 1 root root  56M Oct 11 06:05 Friday-WorkingHours-Morning.pcap_ISCX.csv
-rw------- 1 root root 169M Oct 11 06:05 Monday-WorkingHours.pcap_ISCX.csv
-rw------- 1 root root  80M Oct 11 06:05 Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
-rw------- 1 root root  50M Oct 11 06:05 Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
-rw------- 1 root root 129M Oct 11 06:05 Tuesday-WorkingHours.pcap_ISCX.csv
-rw------- 1 root root 215M Oct 11 06:05 Wednesday-workingHours.pcap_ISCX.csv


In [50]:
import os
import pandas as pd

# Check if the Dataset is saved
df_file_path = os.path.join(destination_folder, "cicids2017.csv")

if os.path.exists(df_file_path):

  df = pd.read_csv(df_file_path)

else:

  encoding = 'ISO-8859-1'  # Encoding used to read the raw CSV files

  # Folder that holds the extracted per-day CSV files
  csv_folder = '/content/drive/MyDrive/CIC-IDS2017-BM/MachineLearningCSV/MachineLearningCVE'

  # List to store individual DataFrames
  dfs = []

  # Iterate over the CSV files in the folder
  for filename in os.listdir(csv_folder):
      if filename.endswith(".csv"):

          # Read the CSV file with the specified encoding
          try:
              df_ = pd.read_csv(os.path.join(csv_folder, filename), encoding=encoding)
          except UnicodeDecodeError:
              print(f'Error: Unable to read {filename} with encoding {encoding}')
              continue  # Skip this file instead of appending a stale or undefined DataFrame
          dfs.append(df_)

  # Concatenate all DataFrames into one
  df = pd.concat(dfs, ignore_index=True)

In [12]:
if not os.path.exists(df_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df.to_csv(df_file_path, index=False)

In [51]:
# Information about the starting CIC-IDS2017 DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2830743 entries, 0 to 2830742
Data columns (total 79 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0    Destination Port             int64  
 1    Flow Duration                int64  
 2    Total Fwd Packets            int64  
 3    Total Backward Packets       int64  
 4   Total Length of Fwd Packets   int64  
 5    Total Length of Bwd Packets  int64  
 6    Fwd Packet Length Max        int64  
 7    Fwd Packet Length Min        int64  
 8    Fwd Packet Length Mean       float64
 9    Fwd Packet Length Std        float64
 10  Bwd Packet Length Max         int64  
 11   Bwd Packet Length Min        int64  
 12   Bwd Packet Length Mean       float64
 13   Bwd Packet Length Std        float64
 14  Flow Bytes/s                  float64
 15   Flow Packets/s               float64
 16   Flow IAT Mean                float64
 17   Flow IAT Std                 float64
 18   Flow IAT Max         

In [52]:
# Some basic statistical details like percentile, mean, std, etc. of the starting CIC-IDS2017 DataFrame
df.describe()

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min
count,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,...,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0,2830743.0
mean,8071.483,14785660.0,9.36116,10.39377,549.3024,16162.64,207.5999,18.71366,58.20194,68.91013,...,5.418218,-2741.688,81551.32,41134.12,153182.5,58295.82,8316037.0,503843.9,8695752.0,7920031.0
std,18283.63,33653740.0,749.6728,997.3883,9993.589,2263088.0,717.1848,60.33935,186.0912,281.1871,...,636.4257,1084989.0,648599.9,393381.5,1025825.0,577092.3,23630080.0,4602984.0,24366890.0,23363420.0
min,0.0,-13.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-536870700.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,53.0,155.0,2.0,1.0,12.0,0.0,6.0,0.0,6.0,0.0,...,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,80.0,31316.0,2.0,2.0,62.0,123.0,37.0,2.0,34.0,0.0,...,1.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,443.0,3204828.0,5.0,4.0,187.0,482.0,81.0,36.0,50.0,26.16295,...,2.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,65535.0,120000000.0,219759.0,291922.0,12900000.0,655453000.0,24820.0,2325.0,5940.857,7125.597,...,213557.0,138.0,110000000.0,74200000.0,110000000.0,110000000.0,120000000.0,76900000.0,120000000.0,120000000.0


In [53]:
# Shape and columns
df.shape, df.columns

((2830743, 79),
 Index([' Destination Port', ' Flow Duration', ' Total Fwd Packets',
        ' Total Backward Packets', 'Total Length of Fwd Packets',
        ' Total Length of Bwd Packets', ' Fwd Packet Length Max',
        ' Fwd Packet Length Min', ' Fwd Packet Length Mean',
        ' Fwd Packet Length Std', 'Bwd Packet Length Max',
        ' Bwd Packet Length Min', ' Bwd Packet Length Mean',
        ' Bwd Packet Length Std', 'Flow Bytes/s', ' Flow Packets/s',
        ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max', ' Flow IAT Min',
        'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std', ' Fwd IAT Max',
        ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean', ' Bwd IAT Std',
        ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags', ' Bwd PSH Flags',
        ' Fwd URG Flags', ' Bwd URG Flags', ' Fwd Header Length',
        ' Bwd Header Length', 'Fwd Packets/s', ' Bwd Packets/s',
        ' Min Packet Length', ' Max Packet Length', ' Packet Length Mean',
        ' Packet Length Std
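
Most column names in the output above carry a leading space (for example `' Destination Port'` and `' Label'`), which is why the cells in this notebook index the label as `" Label"`. The sketch below shows, on a toy frame with hypothetical values, how the names could be stripped. It is only an illustration: the remaining cells rely on the unstripped names.

```python
import pandas as pd

# Toy frame reproducing the leading-space column names seen above (values are made up)
toy = pd.DataFrame({
    " Destination Port": [80, 443],
    "Flow Bytes/s": [1.0, 2.0],
    " Label": ["BENIGN", "DDoS"],
})

# rename() accepts a callable that is applied to every column name
cleaned = toy.rename(columns=str.strip)
print(list(cleaned.columns))
```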

In [54]:
label_counts_df = df[" Label"].value_counts()

# Display the counts with labels for df
print("\nLabel counts for df:")
print(label_counts_df)


Label counts for df:
BENIGN                          2273097
DoS Hulk                         231073
PortScan                         158930
DDoS                             128027
DoS GoldenEye                     10293
FTP-Patator                        7938
SSH-Patator                        5897
DoS slowloris                      5796
DoS Slowhttptest                   5499
Bot                                1966
Web Attack ï¿½ Brute Force         1507
Web Attack ï¿½ XSS                  652
Infiltration                         36
Web Attack ï¿½ Sql Injection         21
Heartbleed                           11
Name:  Label, dtype: int64
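
The `ï¿½` sequence in the three "Web Attack" labels is an encoding artifact: it is what the UTF-8 bytes of the Unicode replacement character look like when read as ISO-8859-1. A minimal sketch of normalizing such labels follows. `normalize_label` is a hypothetical helper, and replacing the artifact with a plain hyphen is an assumption about the intended separator.

```python
import pandas as pd

# The three characters "ï¿½" written with escapes to keep the source ASCII-only
ARTIFACT = "\u00ef\u00bf\u00bd"


def normalize_label(label: str) -> str:
    """Replace the mis-decoded separator with a hyphen and collapse whitespace."""
    return " ".join(label.replace(ARTIFACT, "-").split())


labels = pd.Series([
    "BENIGN",
    "Web Attack " + ARTIFACT + " XSS",
    "Web Attack " + ARTIFACT + " Sql Injection",
])
cleaned = labels.map(normalize_label)
print(cleaned.tolist())
```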


In [55]:
# Check duplicate records

df_copy = df.copy()

# Print the shape of the DataFrame 'df_copy' before removing duplicates
print(df_copy.shape)

# Remove duplicate rows from the DataFrame 'df_copy' while resetting the index
df_copy = df_copy.drop_duplicates()
df_copy.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df_copy' after removing duplicates and resetting the index
print(df_copy.shape)

(2830743, 79)
(2522362, 79)


In [None]:
# Find identical feature vectors but with different " Label"

# Create a subset DataFrame with only feature columns
feature_columns = [col for col in df.columns if col != " Label"]
feature_df = df[feature_columns]

# Find duplicate rows based on feature vectors
duplicate_rows = feature_df.duplicated(keep="first")

# Filter the DataFrame to show only duplicate rows
duplicate_records = df[duplicate_rows]

print("Duplicate Records:")
duplicate_records

In [58]:
# Filter the DataFrame to show only duplicate rows with different labels than "BENIGN"
filtered_duplicate_records = df[duplicate_rows]
filtered_duplicate_records = filtered_duplicate_records[filtered_duplicate_records[' Label'] != 'BENIGN']

print("Duplicate Records with Different Labels than 'BENIGN':")
filtered_duplicate_records

Duplicate Records with Different Labels than 'BENIGN':


Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
43645,80,4,1,1,6,6,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,DDoS
89342,80,4,1,1,6,6,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,DDoS
99514,80,5,2,1,12,6,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,DDoS
132402,80,31,1,1,6,1375,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,DDoS
158001,80,4,1,1,6,6,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,DDoS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2680409,80,95625815,2,0,0,0,0,0,0.0,0.0,...,32,0.0,0.0,0,0,95600000.0,0.0,95600000,95600000,DoS GoldenEye
2707696,80,56011620,2,0,0,0,0,0,0.0,0.0,...,32,0.0,0.0,0,0,56000000.0,0.0,56000000,56000000,DoS GoldenEye
2722630,80,34918594,2,0,0,0,0,0,0.0,0.0,...,32,0.0,0.0,0,0,34900000.0,0.0,34900000,34900000,DoS GoldenEye
2723783,80,34918440,2,0,0,0,0,0,0.0,0.0,...,32,0.0,0.0,0,0,34900000.0,0.0,34900000,34900000,DoS GoldenEye


In [None]:
"""# Identify identical feature vectors, list the different labels associated with each vector, and provide the indices for each label

# Create a subset DataFrame with only feature columns
feature_columns = [col for col in df.columns if col != " Label"]
feature_df = df[feature_columns]

# Find duplicate rows based on feature vectors
duplicate_rows = feature_df.duplicated(keep="first")

# Get the duplicated feature vectors
duplicated_feature_vectors = feature_df[duplicate_rows]

# Initialize a dictionary to store the different labels and their indices
label_indices = {}

# Iterate through the duplicated feature vectors
for idx, row in duplicated_feature_vectors.iterrows():
    feature_vector = row.tolist()
    label = df.loc[idx, ' Label']

    if tuple(feature_vector) in label_indices:
        label_indices[tuple(feature_vector)].append((label, idx))
    else:
        label_indices[tuple(feature_vector)] = [(label, idx)]

# Print the identical feature vectors, different labels, and their indices
for feature_vector, labels_indices in label_indices.items():
    if len(labels_indices) > 1:
        print("Identical Feature Vector:", feature_vector)
        for label, idx in labels_indices:
            print(f" Label: {label}, Index: {idx}")
        print()"""

In [64]:
# Identify identical feature vectors, list the different labels associated with each vector, and provide the indices for each label

# Create a subset DataFrame with only feature columns
feature_columns = [col for col in df.columns if col != " Label"]
feature_df = df[feature_columns]

# Find duplicate rows based on feature vectors
duplicate_rows = feature_df.duplicated(keep="first")

# Get the duplicated feature vectors
duplicated_feature_vectors = feature_df[duplicate_rows]

# Initialize a dictionary to store the different labels and their indices
label_indices = {}

# Initialize a list to store the indexes to drop
indexes_to_drop = []

# Iterate through the duplicated feature vectors
for idx, row in duplicated_feature_vectors.iterrows():
    feature_vector = row.tolist()
    label = df.loc[idx, ' Label']

    if tuple(feature_vector) in label_indices:
        label_indices[tuple(feature_vector)].append((label, idx))
    else:
        label_indices[tuple(feature_vector)] = [(label, idx)]

# Print feature vectors with different labels in groups
for feature_vector, labels_indices in label_indices.items():
    if len(labels_indices) > 1:
        unique_labels = set(label for label, _ in labels_indices)
        if len(unique_labels) > 1:
            print("Identical Feature Vector:", feature_vector)
            for label, idx in labels_indices:
                print(f"Label: {label}, Index: {idx}")
                indexes_to_drop.append(idx)
            print()

Streaming output truncated to the last 5000 lines.
Label: PortScan, Index: 468300
Label: BENIGN, Index: 1284222

Identical Feature Vector: (7496.0, 52.0, 1.0, 1.0, 2.0, 6.0, 2.0, 2.0, 2.0, 0.0, 6.0, 6.0, 6.0, 0.0, 153846.1538, 38461.53846, 52.0, 0.0, 52.0, 52.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 24.0, 20.0, 19230.76923, 19230.76923, 2.0, 6.0, 3.333333333, 2.309401077, 5.333333333, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 2.0, 6.0, 24.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 6.0, 1024.0, 0.0, 0.0, 24.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
Label: PortScan, Index: 346954
Label: BENIGN, Index: 1284355

Identical Feature Vector: (2522.0, 54.0, 1.0, 1.0, 2.0, 6.0, 2.0, 2.0, 2.0, 0.0, 6.0, 6.0, 6.0, 0.0, 148148.1481, 37037.03704, 54.0, 0.0, 54.0, 54.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 24.0, 20.0, 18518.51852, 18518.51852, 2.0, 6.0, 3.333333333, 2.309401077, 5.333333333, 0.0, 0.0, 0.0, 1.

In [65]:
# Drop the rows with different labels for the same feature vector from the original DataFrame as inconsistency affects learning
df.drop(indexes_to_drop, inplace=True)

In [66]:
df.shape

(2825887, 79)
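
The loop above walks the duplicated rows with `iterrows()`, which is slow on millions of rows, and `duplicated(keep="first")` leaves the first occurrence of each vector out of the comparison. The following is a vectorized sketch of the same idea under stated assumptions: `conflicting_label_index` is a hypothetical helper, rows are grouped by a 64-bit hash of their feature values (collisions are possible in principle, though very unlikely), and every occurrence of a conflicting vector is reported, including the first. It is an alternative for comparison, not the procedure that produced the row count above.

```python
import pandas as pd


def conflicting_label_index(frame: pd.DataFrame, label_col: str) -> pd.Index:
    """Return the index of rows whose feature vector appears with more than one label."""
    feature_cols = [c for c in frame.columns if c != label_col]
    # One 64-bit hash per row, computed from the feature values only
    row_keys = pd.util.hash_pandas_object(frame[feature_cols], index=False).to_numpy()
    # Number of distinct labels seen for each feature vector, broadcast back to the rows
    n_labels = frame[label_col].groupby(row_keys).transform("nunique")
    return frame.index[(n_labels > 1).to_numpy()]


# Toy data (made-up values): rows 0 and 1 share features but disagree on the label
toy = pd.DataFrame({
    "f1": [1, 1, 2, 2, 3],
    "f2": [0, 0, 5, 5, 7],
    " Label": ["PortScan", "BENIGN", "DDoS", "DDoS", "BENIGN"],
})
conflicts = conflicting_label_index(toy, " Label")
print(list(conflicts))
```

Rows 2 and 3 are exact duplicates with the same label, so they are not reported; they would be handled by the ordinary duplicate removal step.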

### Preprocessing of the CIC-IDS2017 dataset

In [67]:
# If the dataset has not been preprocessed yet, do:
  # 1 # Handling Missing Values
  # 2 # Normalization (Min-Max Scaling)
  # 3 # Encode Categorical Label
  # 4 # Removing duplicate records

import numpy as np
from sklearn.impute import SimpleImputer

df_encoded_file_path = os.path.join(destination_folder, "cicids2017_encoded.csv")
if not os.path.exists(df_encoded_file_path):

  # Step 1: Handling Missing Values

  # Check for missing values, NAN
  check_nan = df.isna().sum().sum()

  # Check if missing values are represented as empty values (",,")
  missing_values_as_empty = df.applymap(lambda x: x == '')

  # Count the number of missing values in each column
  missing_values_count = missing_values_as_empty.sum()

  # Check if any column contains empty-string values
  check_null = (missing_values_count != 0).any()

  # Replace empty values with NaN
  if (check_null):
    df.replace("", np.nan, inplace=True)

  # Impute missing values with the most frequent value for categorical columns and mean for numerical columns
  if (check_null or check_nan !=0):
    imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
    for col in df.columns:
        if df[col].dtype == 'object':
            # fit_transform returns a 2-D array; flatten it back to one column
            df[col] = imputer.fit_transform(df[[col]]).ravel()
        else:
            df[col] = df[col].fillna(df[col].mean())

In [68]:
# Check again for missing values, NAN
print(df.isna().sum(axis=0))

 Destination Port              0
 Flow Duration                 0
 Total Fwd Packets             0
 Total Backward Packets        0
Total Length of Fwd Packets    0
                              ..
Idle Mean                      0
 Idle Std                      0
 Idle Max                      0
 Idle Min                      0
 Label                         0
Length: 79, dtype: int64
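
Note that `isna()` does not report infinite values, which the clipping step in the next cell is meant to handle. The sketch below counts infinite and missing entries per numeric column on a toy frame with made-up values; the column names only mimic those of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one +inf, one -inf and one NaN (hypothetical values)
toy = pd.DataFrame({
    "Flow Bytes/s": [10.0, np.inf, np.nan],
    " Flow Packets/s": [1.0, 2.0, -np.inf],
    " Label": ["BENIGN", "DDoS", "BENIGN"],
})

numeric = toy.select_dtypes(include="number")
inf_counts = np.isinf(numeric).sum()  # counts both +inf and -inf
nan_counts = numeric.isna().sum()     # counts NaN only

print(inf_counts)
print(nan_counts)
```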


In [69]:
# Step 2: Normalization (Scaling)

from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Check if the dataset was not preprocessed
if not os.path.exists(df_encoded_file_path):

  # Specify the columns to fix
  int64_columns = [col for col in df.columns if df[col].dtype == 'int64']
  float64_columns = [col for col in df.columns if df[col].dtype == 'float64']

  # Clean the data by replacing infinite or extremely large values
  max_allowed_value = 1e15  # Define a threshold for allowed values
  for col in float64_columns:
      df[col] = df[col].clip(lower=None, upper=max_allowed_value)

  # Calculate the number of rows to scale (10% of the total rows)
  #num_rows_to_scale = int(0.10 * len(df))
  # Randomly select 10% of rows
  #random_indices = np.random.choice(len(df), num_rows_to_scale, replace=False)
  #subset_df = df.iloc[random_indices]

  # Create a Min-Max scaler
  scaler = MinMaxScaler()

  # Specify the columns to scale
  columns = [col for col in df.columns if col not in [' Label']]

  # Fit and transform the selected columns
  df[columns] = scaler.fit_transform(df[columns])

# Display the scaled DataFrame
df

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,0.837186,1.333333e-07,0.000005,0.000000,9.302326e-07,0.000000e+00,0.000242,0.002581,0.001010,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
1,0.840070,1.016667e-06,0.000000,0.000003,4.651163e-07,9.153974e-09,0.000242,0.002581,0.001010,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
2,0.840085,5.416666e-07,0.000000,0.000003,4.651163e-07,9.153974e-09,0.000242,0.002581,0.001010,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
3,0.705516,3.916666e-07,0.000000,0.000003,4.651163e-07,9.153974e-09,0.000242,0.002581,0.001010,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
4,0.837156,1.333333e-07,0.000005,0.000000,9.302326e-07,0.000000e+00,0.000242,0.002581,0.001010,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2830738,0.000809,2.685666e-04,0.000014,0.000007,8.682171e-06,2.319007e-07,0.001128,0.012043,0.004713,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
2830739,0.000809,2.808333e-06,0.000005,0.000007,6.511628e-06,5.522898e-07,0.001692,0.018065,0.007070,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
2830740,0.885481,7.916666e-07,0.000005,0.000003,2.403101e-06,9.153974e-09,0.001249,0.000000,0.002609,0.003076,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN
2830741,0.000809,8.738733e-03,0.000023,0.000007,1.488372e-05,3.905696e-07,0.001289,0.013763,0.005386,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN


In [70]:
# If the dataset has not been preprocessed yet, do:
if not os.path.exists(df_encoded_file_path):

  # Step 3: Encode Label

  # Identify categorical columns (non-numeric)
  categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
  categorical_columns.remove(' Label')

  # Encode the Label with 0 value for normal and 1 for the rest of the attacks
  df[' Label'] = df[' Label'].apply(lambda x: 0 if x == 'BENIGN' else 1)

  # Encode categorical columns using one-hot encoding (get_dummies)
  df = pd.get_dummies(df, columns=categorical_columns)

  # Now, df contains the encoded categorical features and label

  unique_labels = df[' Label'].unique()

  # Print the unique labels
  for label in unique_labels:
      print(label)
else:
  # Reuse the previously saved preprocessed dataset
  df = pd.read_csv(df_encoded_file_path)

0
1


In [71]:
label_counts_df = df[' Label'].value_counts()

# Display the counts with labels for df
print("\nLabel counts for df:")
print(label_counts_df)


Label counts for df:
0    2272778
1     553109
Name:  Label, dtype: int64
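
The counts above show roughly four benign flows for every attack flow. One common way to obtain a balanced dataset is random undersampling of the majority class. The sketch below is an illustration on toy data, not necessarily how the balanced datasets in this experiment were built; `undersample_majority` is a hypothetical helper.

```python
import pandas as pd


def undersample_majority(frame: pd.DataFrame, label_col: str, random_state: int = 0) -> pd.DataFrame:
    """Randomly sample every class down to the size of the smallest class."""
    counts = frame[label_col].value_counts()
    n_min = counts.min()
    parts = [
        frame[frame[label_col] == cls].sample(n=n_min, random_state=random_state)
        for cls in counts.index
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1.0, random_state=random_state).reset_index(drop=True)


# Toy data: 8 benign (0) rows and 2 attack (1) rows
toy = pd.DataFrame({"x": range(10), " Label": [0] * 8 + [1] * 2})
balanced = undersample_majority(toy, " Label")
print(balanced[" Label"].value_counts())
```

Undersampling discards majority-class rows, so it trades information for balance; the random seed is fixed here only to make the sketch reproducible.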


In [72]:
# Step 4: Removing duplicate records

# Print the shape of the DataFrame 'df' before removing duplicates
print(df.shape)

# Remove duplicate rows from the DataFrame 'df' while resetting the index
df = df.drop_duplicates()
df.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df' after removing duplicates and resetting the index
print(df.shape)

(2825887, 79)
(2522050, 79)


In [77]:
# Print out the DataFrames loaded in the memory
%whos DataFrame

Variable   Type         Data/Info
---------------------------------
df         DataFrame              Destination Por<...>522050 rows x 79 columns]


In [78]:
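# Free memory held by intermediate DataFrames; names that are not defined are skipped
for name in ["df_", "df_copy", "duplicate_records", "duplicated_feature_vectors",
             "feature_df", "filtered_duplicate_records", "missing_values_as_empty"]:
  globals().pop(name, None)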

In [79]:
# Step 5: Feature Engineering (if necessary)
# You can perform additional feature engineering as needed based on your analysis

# Display the preprocessed dataset (X) and target (y)
#print(pd.DataFrame(X, columns=column_names[:-1]).head())
#print(y.head())

In [80]:
# Check if the Dataset is saved
if not os.path.exists(df_encoded_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df.to_csv(df_encoded_file_path, index=False)

## 2. Algorithm Evaluation

In this section, we assess the performance of various machine learning algorithms on the datasets mentioned above.

### 2.5. CIC-IDS2017 dataset evaluation with baseline and traditional ML algorithms

In this section, we use precision and F1 score as the main metrics of classification quality. We evaluate several machine learning algorithms, including baseline classifiers such as Zero Rule and One Rule, statistical techniques such as Naive Bayes, and more advanced models such as Random Forest, using 10-fold cross-validation. Each algorithm is evaluated on both the 10 and the 20 best-ranked features of the CIC-IDS2017 dataset.
These results offer valuable insights into the optimal dataset generation strategy, guiding the selection of effective feature extraction methods from raw data and helping determine the most suitable methodology for the specific dataset.
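
The cells in this section scale the data and select features on the full dataset before cross-validating. A possible variant, sketched below on synthetic data, wraps scaling, feature selection, and the classifier in one `Pipeline` so that each step is refit on the training part of every fold. The data from `make_classification`, the sample sizes, and the choice of `GaussianNB` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the preprocessed dataset (not CIC-IDS2017 data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=8, random_state=0
)

# Scaling and selection are fit only on the training folds inside cross_validate
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", GaussianNB()),
])

scores = cross_validate(pipeline, X_demo, y_demo, cv=10, scoring=["precision", "f1"])
print("mean precision:", scores["test_precision"].mean())
print("mean F1:", scores["test_f1"].mean())
```

Keeping the selector inside the pipeline means the held-out fold never influences which features are chosen, at the cost of repeating the selection in every fold.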

In [84]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

# Separate features (X) and labels (y)
X = df.drop(' Label', axis=1)  # Exclude the label column
y = df[' Label']

# Create a pipeline for feature selection on the preprocessed data
pipeline_10_features = Pipeline([
    ('selector_10', SelectKBest(score_func=f_classif, k=10))
])

pipeline_20_features = Pipeline([
    ('selector_20', SelectKBest(score_func=f_classif, k=20))
])

# Fit and transform the data for 10 and 20 features
X_selected_10 = pipeline_10_features.fit_transform(X, y)
X_selected_20 = pipeline_20_features.fit_transform(X, y)

# Display the selected features
print(X_selected_10.shape)  # Check the shape of the selected 10 features
print(X_selected_20.shape)  # Check the shape of the selected 20 features

# Display the selected features
print("Selected 10 features:")
selected_feature_indices_10 = pipeline_10_features.named_steps['selector_10'].get_support(indices=True)
selected_features_10 = X.columns[selected_feature_indices_10]
print(selected_features_10)

print("\nSelected 20 features:")
selected_feature_indices_20 = pipeline_20_features.named_steps['selector_20'].get_support(indices=True)
selected_features_20 = X.columns[selected_feature_indices_20]
print(selected_features_20)

(2522050, 10)
(2522050, 20)
Selected 10 features:
Index(['Bwd Packet Length Max', ' Bwd Packet Length Mean',
       ' Bwd Packet Length Std', ' Fwd IAT Std', ' Max Packet Length',
       ' Packet Length Mean', ' Packet Length Std', ' Packet Length Variance',
       ' Average Packet Size', ' Avg Bwd Segment Size'],
      dtype='object')

Selected 20 features:
Index([' Flow Duration', 'Bwd Packet Length Max', ' Bwd Packet Length Mean',
       ' Bwd Packet Length Std', ' Flow IAT Std', ' Flow IAT Max',
       'Fwd IAT Total', ' Fwd IAT Std', ' Fwd IAT Max', ' Min Packet Length',
       ' Max Packet Length', ' Packet Length Mean', ' Packet Length Std',
       ' Packet Length Variance', 'FIN Flag Count', ' Average Packet Size',
       ' Avg Bwd Segment Size', 'Idle Mean', ' Idle Max', ' Idle Min'],
      dtype='object')
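
In the cell above the selector is fitted on the full dataset, and the cross-validation further down reuses those columns, so every validation fold has already influenced which features were kept. A minimal sketch of one way to avoid this, assuming synthetic data from `make_classification` as a stand-in for `df` (the names `X_demo`, `y_demo`, and `leak_free` are hypothetical), places `SelectKBest` inside the pipeline that is cross-validated:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the CIC-IDS2017 frame (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=8, random_state=0
)

# The selector is re-fitted on the training part of every fold,
# so the held-out fold never takes part in choosing the features
leak_free = Pipeline([
    ("selector", SelectKBest(score_func=f_classif, k=10)),
    ("clf", GaussianNB()),
])

fold_f1 = cross_val_score(leak_free, X_demo, y_demo, cv=5, scoring="f1")
print(fold_f1.mean())
```

With this layout the selected columns may differ from fold to fold, which is the price of an unbiased estimate.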


In [85]:
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, mean_absolute_error, f1_score
from sklearn.dummy import DummyClassifier
from tabulate import tabulate
import time
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Define the number of desired folds for Cross-Validation (e.g., 10)
num_folds = 10

# Initialize performance metrics lists for 10 and 20 features
results_10_features = []
results_20_features = []

In [86]:
# Define a file name for saving the results
results_file_name = os.path.join(destination_folder, "cicids2017_results.pkl")

# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define ZeroRule classifier
  zero_rule = DummyClassifier(strategy="most_frequent")

  # Evaluate ZeroRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(zero_rule, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(zero_rule, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(zero_rule, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(zero_rule, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(zero_rule, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(zero_rule, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display ZeroRule results for 10 features
  print("ZeroRule Precision (10 features):", np.mean(precision_scores_10))
  print("ZeroRule F1 Score (10 features):", np.mean(f1_scores_10))
  print("ZeroRule Variance (10 features):", variance_10)
  print("ZeroRule MAE (10 features):", mae_10)
  print("ZeroRule Execution Time:", elapsed_time_10)

  # Display ZeroRule results for 20 features
  print("ZeroRule Precision (20 features):", np.mean(precision_scores_20))
  print("ZeroRule F1 Score (20 features):", np.mean(f1_scores_20))
  print("ZeroRule Variance (20 features):", variance_20)
  print("ZeroRule MAE (20 features):", mae_20)
  print("ZeroRule Execution Time:", elapsed_time_20)

  results_10_features.append(["ZeroRule", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["ZeroRule", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

ZeroRule Precision (10 features): 0.0
ZeroRule F1 Score (10 features): 0.0
ZeroRule Variance (10 features): 0.0
ZeroRule MAE (10 features): 0.16885152950972424
ZeroRule Execution Time: 5.741321802139282
ZeroRule Precision (20 features): 0.0
ZeroRule F1 Score (20 features): 0.0
ZeroRule Variance (20 features): 0.0
ZeroRule MAE (20 features): 0.16885152950972424
ZeroRule Execution Time: 8.154945850372314
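
The zero precision and F1 follow from the baseline never predicting the positive class (label 1), which is the minority here. The MAE then equals the share of positive labels. A small sketch on toy labels (hypothetical data, not taken from the dataset) reproduces the pattern:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import mean_absolute_error, precision_score

# Toy labels: 8 negatives and 2 positives
y_toy = np.array([0] * 8 + [1] * 2)
X_toy = np.zeros((10, 1))  # DummyClassifier ignores the feature values

zero_rule_demo = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = zero_rule_demo.predict(X_toy)  # always the majority class, 0

# No positive predictions exist, so precision is undefined;
# zero_division=0 reports it as 0 without a warning
precision = precision_score(y_toy, pred, zero_division=0)
mae = mean_absolute_error(y_toy, pred)  # fraction of positives
print(precision, mae)  # 0.0 0.2
```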


In [87]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

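  # Define the "OneRule" baseline. Note: DummyClassifier(strategy="stratified")
  # draws random predictions from the training class distribution; it is a
  # random-guess stand-in, not a true OneR (single-feature rule) learner.
  one_rule = DummyClassifier(strategy="stratified")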

  # Evaluate OneRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(one_rule, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(one_rule, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(one_rule, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(one_rule, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(one_rule, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(one_rule, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display OneRule results for 10 features
  print("OneRule Precision (10 features):", np.mean(precision_scores_10))
  print("OneRule F1 Score (10 features):", np.mean(f1_scores_10))
  print("OneRule Variance (10 features):", variance_10)
  print("OneRule MAE (10 features):", mae_10)
  print("OneRule Execution Time:", elapsed_time_10)

  # Display OneRule results for 20 features
  print("OneRule Precision (20 features):", np.mean(precision_scores_20))
  print("OneRule F1 Score (20 features):", np.mean(f1_scores_20))
  print("OneRule Variance (20 features):", variance_20)
  print("OneRule MAE (20 features):", mae_20)
  print("OneRule Execution Time:", elapsed_time_20)

  results_10_features.append(["OneRule", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["OneRule", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

OneRule Precision (10 features): 0.1688648087046783
OneRule F1 Score (10 features): 0.16824311673945386
OneRule Variance (10 features): 1.5092077832251044e-06
OneRule MAE (10 features): 0.28086754822465854
OneRule Execution Time: 6.198862314224243
OneRule Precision (20 features): 0.1690042791642603
OneRule F1 Score (20 features): 0.1690999895153959
OneRule Variance (20 features): 1.9157845758286896e-06
OneRule MAE (20 features): 0.2807485973711861
OneRule Execution Time: 8.574870109558105
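
The stratified baseline's scores can be checked by hand. Assuming binary 0/1 labels with positive share $p$ (about 0.16885 here, read off the ZeroRule MAE) and predictions drawn independently of the true label, a short derivation gives the expected precision and MAE:

```latex
\begin{align*}
\mathbb{E}[\text{precision}] &\approx P(y = 1 \mid \hat{y} = 1) = P(y = 1) = p \approx 0.169 \\
\mathbb{E}[\text{MAE}] &= P(\hat{y} \neq y) = p(1 - p) + (1 - p)p = 2p(1 - p) \\
&\approx 2 \times 0.16885 \times 0.83115 \approx 0.2807
\end{align*}
```

Both values are close to the reported precision (about 0.169) and MAE (about 0.281), which is what a label-independent random guess should produce.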


In [88]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define Naive Bayes classifier
  naive_bayes = GaussianNB()

  # Evaluate Naive Bayes classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(naive_bayes, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(naive_bayes, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(naive_bayes, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(naive_bayes, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(naive_bayes, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(naive_bayes, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display Naive Bayes results for 10 features
  print("Naive Bayes Precision (10 features):", np.mean(precision_scores_10))
  print("Naive Bayes F1 Score (10 features):", np.mean(f1_scores_10))
  print("Naive Bayes Variance (10 features):", variance_10)
  print("Naive Bayes MAE (10 features):", mae_10)
  print("Naive Bayes Execution Time:", elapsed_time_10)

  # Display Naive Bayes results for 20 features
  print("Naive Bayes Precision (20 features):", np.mean(precision_scores_20))
  print("Naive Bayes F1 Score (20 features):", np.mean(f1_scores_20))
  print("Naive Bayes Variance (20 features):", variance_20)
  print("Naive Bayes MAE (20 features):", mae_20)
  print("Naive Bayes Execution Time:", elapsed_time_20)

  results_10_features.append(["Naive Bayes", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["Naive Bayes", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

Naive Bayes Precision (10 features): 0.5401359055040736
Naive Bayes F1 Score (10 features): 0.5455170076570558
Naive Bayes Variance (10 features): 0.07693411707006614
Naive Bayes MAE (10 features): 0.1255934656331159
Naive Bayes Execution Time: 15.011318445205688
Naive Bayes Precision (20 features): 0.49268026231647066
Naive Bayes F1 Score (20 features): 0.5347929048673781
Naive Bayes Variance (20 features): 0.0633147173244346
Naive Bayes MAE (20 features): 0.14097301798140402
Naive Bayes Execution Time: 25.839256286621094
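
The fold-to-fold variance of the Naive Bayes precision is far larger than for the other models. One possible contributor, stated here as an assumption rather than a finding, is row ordering: passing an integer as `cv` uses `StratifiedKFold` without shuffling, so each fold is a contiguous slice of each class. A sketch on synthetic data (hypothetical names `X_demo`, `y_demo`, `shuffled_cv`) shows how shuffled folds could be requested instead:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real feature matrix (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)

# Shuffle before splitting so each fold mixes rows from the whole dataset;
# random_state keeps the split reproducible
shuffled_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores = cross_val_score(
    GaussianNB(), X_demo, y_demo, cv=shuffled_cv, scoring="precision"
)
print(scores.mean(), scores.var())
```

Comparing the variance with and without shuffling on the real data would show whether ordering plays a role.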


In [89]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Create a Random Forest classifier with fixed hyperparameters:
  # 100 trees, depth capped at 10, all CPU cores used (n_jobs=-1)
  rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)

  # Evaluate Random Forest classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_10 = cross_val_score(rf_classifier, X[selected_features_10], y, cv=num_folds, scoring='precision')
  f1_scores_10 = cross_val_score(rf_classifier, X[selected_features_10], y, cv=num_folds, scoring='f1')
  elapsed_time_10 = time.time() - start_time  # Calculate execution time

  start_time = time.time()  # Start measuring execution time
  precision_scores_20 = cross_val_score(rf_classifier, X[selected_features_20], y, cv=num_folds, scoring='precision')
  f1_scores_20 = cross_val_score(rf_classifier, X[selected_features_20], y, cv=num_folds, scoring='f1')
  elapsed_time_20 = time.time() - start_time  # Calculate execution time

  variance_10 = np.var(precision_scores_10)
  variance_20 = np.var(precision_scores_20)

  predictions_10 = cross_val_predict(rf_classifier, X[selected_features_10], y, cv=num_folds)
  mae_10 = mean_absolute_error(y, predictions_10)

  predictions_20 = cross_val_predict(rf_classifier, X[selected_features_20], y, cv=num_folds)
  mae_20 = mean_absolute_error(y, predictions_20)

  # Display Random Forest results for 10 features
  print("Random Forest Precision (10 features):", np.mean(precision_scores_10))
  print("Random Forest F1 Score (10 features):", np.mean(f1_scores_10))
  print("Random Forest Variance (10 features):", variance_10)
  print("Random Forest MAE (10 features):", mae_10)
  print("Random Forest Execution Time:", elapsed_time_10)

  # Display Random Forest results for 20 features
  print("Random Forest Precision (20 features):", np.mean(precision_scores_20))
  print("Random Forest F1 Score (20 features):", np.mean(f1_scores_20))
  print("Random Forest Variance (20 features):", variance_20)
  print("Random Forest MAE (20 features):", mae_20)
  print("Random Forest Execution Time:", elapsed_time_20)

  results_10_features.append(["Random Forest", np.mean(precision_scores_10), np.mean(f1_scores_10), variance_10, mae_10, elapsed_time_10])
  results_20_features.append(["Random Forest", np.mean(precision_scores_20), np.mean(f1_scores_20), variance_20, mae_20, elapsed_time_20])

Random Forest Precision (10 features): 0.9893039075931785
Random Forest F1 Score (10 features): 0.9277228280954881
Random Forest Variance (10 features): 4.655091528474845e-05
Random Forest MAE (10 features): 0.0188723459090819
Random Forest Execution Time: 521.3329920768738
Random Forest Precision (20 features): 0.9900622819449969
Random Forest F1 Score (20 features): 0.9311988646306373
Random Forest Variance (20 features): 0.00016571417330496935
Random Forest MAE (20 features): 0.018215340695069488
Random Forest Execution Time: 598.3411316871643
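
Each cell above runs the cross-validation three times per feature set (two `cross_val_score` calls and one `cross_val_predict`), and the timer covers only the first two. A possible consolidation, sketched with a hypothetical helper `evaluate_model` on synthetic data, uses `cross_validate` to collect several metrics in a single pass:

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB


def evaluate_model(name, model, X, y, cv=10):
    """Hypothetical helper: one CV pass producing a row like the result lists."""
    start = time.time()
    scores = cross_validate(
        model, X, y, cv=cv,
        scoring={"precision": "precision", "f1": "f1", "accuracy": "accuracy"},
    )
    elapsed = time.time() - start
    precision = scores["test_precision"]
    # For 0/1 labels the MAE is the misclassification rate. This is the mean
    # of per-fold error rates, equal to the pooled MAE when folds have equal size.
    mae = 1.0 - np.mean(scores["test_accuracy"])
    return [name, np.mean(precision), np.mean(scores["test_f1"]),
            np.var(precision), mae, elapsed]


# Synthetic stand-in for X[selected_features_10] and y (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
row = evaluate_model("Naive Bayes", GaussianNB(), X_demo, y_demo, cv=5)
print(row)
```

The returned row has the same column order as `results_10_features`, so it could be appended directly.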


In [90]:
import pickle
import os

if not os.path.exists(results_file_name):

  # Save the results lists to a file
  with open(results_file_name, 'wb') as file:
      results_dict = {
          'results_10_features': results_10_features,
          'results_20_features': results_20_features
      }
      pickle.dump(results_dict, file)


In [91]:
# Load the results from the file
with open(results_file_name, 'rb') as file:
    loaded_results = pickle.load(file)

# Access the loaded results lists
results_10_features = loaded_results['results_10_features']
results_20_features = loaded_results['results_20_features']


In [92]:
# Print the results in tabular format, one table per feature set
headers_10 = ["Algorithm", "Precision (10 Features)", "F1 Score (10 Features)", "Variance (10 Features)", "MAE (10 Features)", "Execution Time"]
headers_20 = ["Algorithm", "Precision (20 Features)", "F1 Score (20 Features)", "Variance (20 Features)", "MAE (20 Features)", "Execution Time"]

print(tabulate(results_10_features, headers_10, tablefmt="pretty"))
print(tabulate(results_20_features, headers_20, tablefmt="pretty"))

+---------------+-------------------------+------------------------+------------------------+---------------------+--------------------+
|   Algorithm   | Precision (10 Features) | F1 Score (10 Features) | Variance (10 Features) |  MAE (10 Features)  |   Execution Time   |
+---------------+-------------------------+------------------------+------------------------+---------------------+--------------------+
|   ZeroRule    |           0.0           |          0.0           |          0.0           | 0.16885152950972424 | 5.741321802139282  |
|    OneRule    |   0.1688648087046783    |  0.16824311673945386   | 1.5092077832251044e-06 | 0.28086754822465854 | 6.198862314224243  |
|  Naive Bayes  |   0.5401359055040736    |   0.5455170076570558   |  0.07693411707006614   | 0.1255934656331159  | 15.011318445205688 |
| Random Forest |   0.9893039075931785    |   0.9277228280954881   | 4.655091528474845e-05  | 0.0188723459090819  | 521.3329920768738  |
+---------------+------------------------