# Intrusion Detection System (IDS) Public Datasets Benchmarking

In cybersecurity, the design, development, and implementation of effective Intrusion Detection Systems (IDS) are important for safeguarding IT&C infrastructures from unauthorized access, data breaches, and other forms of malicious activity. The selection of an appropriate ML/DL algorithm plays an essential role in ensuring the security and integrity of protected systems.

Before developing a new, cutting-edge algorithm, we need appropriate data, which must be studied and analysed in order to understand the realities and challenges of the ML problem. Following this approach, we chose to study the earliest datasets created for IDS in order to derive lessons learned for future dataset development.

This experiment aims to comprehensively evaluate the performance of different ML and DL algorithms on a variety of datasets, encompassing a wide range of network traffic scenarios. The datasets used for this analysis include well-known benchmark datasets such as KDD, NSL-KDD, CTU-13, ISCXIDS2012, CIC-IDS2017, CSE-CIC-IDS2018, CIDDS-001/CIDDS-002, and Kyoto 2006+. Each dataset represents a distinct set of challenges and characteristics, making this evaluation both diverse and insightful.

The experiment is divided into three main phases:

1. **Data Acquisition and Preprocessing**:
 - In this phase, we acquire the selected datasets from reputable sources, ensuring the integrity and accuracy of the data.
 - Data preprocessing tasks include handling missing values, selecting the most relevant features using feature selection techniques, normalizing the data, and, if necessary, performing feature engineering to enhance the dataset's suitability for machine learning.

2. **Algorithm Evaluation**:
 - We evaluate the performance of a range of ML/DL algorithms on each dataset. The chosen algorithms include baseline methods like ZeroRule and OneRule, traditional machine learning approaches like Naive Bayes and Random Forest, as well as some of the most used anomaly detection deep learning algorithms.
 - Cross-validation is applied to ensure the robustness of our results. Performance metrics such as precision, F1 score, the variance of the per-fold precision, Mean Absolute Error (MAE), and execution time are calculated for each algorithm and dataset.

3. **Results and Insights**:
 - The results of this evaluation provide valuable insights into the strengths and weaknesses of different IDS algorithms under various conditions.
 - We analyze the performance of algorithms on both the original datasets and balanced datasets to address the challenge of class imbalance in intrusion detection.
 - Observations and additional details regarding the algorithms' performance are documented, providing a comprehensive overview of their behavior.
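
The balanced datasets mentioned in phase 3 can be produced in several ways. The sketch below illustrates one of them, random undersampling of the majority class with pandas. It is only an illustration under assumptions: the helper name `undersample_majority`, the toy data, and the choice of undersampling (rather than oversampling or synthetic sampling) are not taken from this notebook.

```python
import pandas as pd

def undersample_majority(frame: pd.DataFrame, label_col: str = "label", seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: sample every class down to the size of the rarest class."""
    smallest = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    # Shuffle so that the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1.0, random_state=seed).reset_index(drop=True)

# Toy data (made up): 8 benign flows (0.0) and 2 botnet flows (1.0)
toy = pd.DataFrame({"tot_pkts": range(10), "label": [0.0] * 8 + [1.0] * 2})
balanced = undersample_majority(toy)
print(balanced["label"].value_counts())
```

Undersampling discards data, so on a heavily imbalanced dataset the balanced copy can be much smaller than the original.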

By conducting this experiment, we aim to contribute to the understanding of cyber domain dataset generation. The findings will support informed decisions when developing a cybersecurity AI application by identifying the steps and procedures needed to select appropriate training data.

The following sections of this Jupyter notebook will provide a detailed walkthrough of the experiment, including code snippets, visualizations, and discussions of the results.

In [1]:
# Mount your Google Drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
import warnings
from google.colab import files

# Suppress all warning messages
warnings.filterwarnings("ignore")

# Check if the Kaggle API credentials file already exists
kaggle_credentials_path = os.path.expanduser("~/.kaggle/kaggle.json")

if not os.path.exists(kaggle_credentials_path):

    if not os.path.exists(os.path.join("/content/drive/MyDrive/.kaggle/", "kaggle.json")):

        # Upload your Kaggle API credentials file (kaggle.json)
        files.upload()

        # Create the backup folder on Drive if needed, then store the credentials there
        !mkdir -p "/content/drive/MyDrive/.kaggle/"
        !mv kaggle.json "/content/drive/MyDrive/.kaggle/"
        !chmod 600 "/content/drive/MyDrive/.kaggle/kaggle.json"

    # Copy the Kaggle API credentials file to the location the Kaggle CLI expects
    !mkdir -p ~/.kaggle
    !cp '/content/drive/MyDrive/.kaggle/kaggle.json' ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json

else:

    print("Kaggle API credentials file already exists.")

In [3]:
import tensorflow as tf

# tf.test.is_gpu_available() is deprecated; query the physical devices instead
print("GPU available:", len(tf.config.list_physical_devices('GPU')) > 0)
print("GPU device name:", tf.test.gpu_device_name())

GPU available: True
GPU device name: /device:GPU:0


In [4]:
import os
from psutil import virtual_memory
from tabulate import tabulate

# Function to get CPU information
def get_cpu_info():
    cpu_info = os.popen('lscpu').read()
    return cpu_info

# Function to get RAM information
def get_ram_info():
    ram = virtual_memory()
    total_ram = f"{ram.total / 1e9:.2f} GB"
    available_ram = f"{ram.available / 1e9:.2f} GB"
    return total_ram, available_ram

# Function to get GPU information
def get_gpu_info():
    # Execute nvidia-smi and get its output
    gpu_info = os.popen('nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv,noheader,nounits').read().strip()

    # Split the output to get individual GPU details
    details = gpu_info.split(", ")

    # Return GPU name, total, used, and free memory
    return details[0], f"{details[1]} MB", f"{details[2]} MB", f"{details[3]} MB"

# Collect system information
cpu_info = get_cpu_info()
total_ram, available_ram = get_ram_info()
try:
  gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = get_gpu_info()
except Exception:
  # No GPU (or nvidia-smi not available): fall back to placeholder values
  gpu_name, gpu_total_memory, gpu_used_memory, gpu_free_memory = 'null', 0, 0, 0

# Extract relevant CPU information
cpu_type = ""
cpu_architecture = ""

for line in cpu_info.splitlines():
    if "Model name:" in line:
        cpu_type = line.split(":")[1].strip()
    elif "Architecture:" in line:
        cpu_architecture = line.split(":")[1].strip()

# Create a table
table = [
    ["CPU Type", cpu_type],
    ["CPU Architecture", cpu_architecture],
    ["Total RAM", total_ram],
    ["Available RAM", available_ram],
    ["GPU Name", gpu_name],
    ["GPU Total Memory", gpu_total_memory],
    ["GPU Used Memory", gpu_used_memory],
    ["GPU Free Memory", gpu_free_memory]
]

# Display the table
print(tabulate(table, headers=["Characteristic", "Value"], tablefmt="pretty"))


+------------------+--------------------------------+
|  Characteristic  |             Value              |
+------------------+--------------------------------+
|     CPU Type     | Intel(R) Xeon(R) CPU @ 2.20GHz |
| CPU Architecture |             x86_64             |
|    Total RAM     |            54.76 GB            |
|  Available RAM   |            52.17 GB            |
|     GPU Name     |            Tesla T4            |
| GPU Total Memory |            15360 MB            |
| GPU Used Memory  |             359 MB             |
| GPU Free Memory  |            14742 MB            |
+------------------+--------------------------------+


## 1. Data Acquisition and Preprocessing

In this section, we focus on acquiring the datasets mentioned above.

### 1.3. CTU-13 dataset

The CTU-13 dataset was created specifically for evaluating botnet detection in network traffic. It contains real botnet traffic mixed with normal and background traffic, captured at the CTU University (Czech Republic) in 2011. It consists of thirteen scenarios, each a separate capture in which a specific botnet malware sample was executed (the families Neris, Rbot, Virut, Menti, Sogou, Murlo, and NsisAy appear in the file names of the downloaded dataset). The scenarios cover a range of botnet behaviours, including Distributed Denial of Service (DDoS) attacks, which makes the dataset well suited to our study.

### Download and Unzip CTU-13 dataset

In [5]:
import os
import pandas as pd

# Specify the dataset name
dataset_name = "dhoogla/ctu13"

# Specify the destination folder in your Google Drive
destination_folder = "/content/drive/MyDrive/CTU13-BM"

# Check if the dataset file already exists in your Google Drive
dataset_file_path = os.path.join(destination_folder, "ctu13.zip")

if not os.path.exists(dataset_file_path):

  # Download the dataset and save it to your Google Drive
  !kaggle datasets download -d $dataset_name -p $destination_folder

  # Unzip the downloaded dataset
  import zipfile
  with zipfile.ZipFile(f"{destination_folder}/ctu13.zip", "r") as zip_ref:
      zip_ref.extractall(destination_folder)

  print("Download complete.")

else:
  print("Dataset already exists. Skipping download.")

Downloading ctu13.zip to /content/drive/MyDrive/CTU13-BM
 98% 100M/102M [00:06<00:00, 21.2MB/s] 
100% 102M/102M [00:06<00:00, 16.4MB/s]
Download complete.


In [6]:
!ls -ahl '/content/drive/MyDrive/CTU13-BM'

total 219M
-rw------- 1 root root 9.0M Oct 10 16:43 10-Rbot-20110818.binetflow.parquet
-rw------- 1 root root 830K Oct 10 16:43 11-Rbot-20110818-2.binetflow.parquet
-rw------- 1 root root 2.6M Oct 10 16:43 12-NsisAy-20110819.binetflow.parquet
-rw------- 1 root root  11M Oct 10 16:43 13-Virut-20110815-3.binetflow.parquet
-rw------- 1 root root  18M Oct 10 16:43 1-Neris-20110810.binetflow.parquet
-rw------- 1 root root  12M Oct 10 16:43 2-Neris-20110811.binetflow.parquet
-rw------- 1 root root  21M Oct 10 16:43 3-Rbot-20110812.binetflow.parquet
-rw------- 1 root root 8.0M Oct 10 16:43 4-Rbot-20110815.binetflow.parquet
-rw------- 1 root root 1.1M Oct 10 16:43 5-Virut-20110815-2.binetflow.parquet
-rw------- 1 root root 4.2M Oct 10 16:43 6-Menti-20110816.binetflow.parquet
-rw------- 1 root root 993K Oct 10 16:43 7-Sogou-20110816-2.binetflow.parquet
-rw------- 1 root root  17M Oct 10 16:43 8-Murlo-20110816-3.binetflow.parquet
-rw------- 1 root root  15M Oct 10 16:43 9-Neris-20110817.binetflo

In [7]:
import pandas as pd
import os

# Specify the destination folder in your Google Drive
destination_folder = "/content/drive/MyDrive/CTU13-BM"

# List to store DataFrames
dfs = []

# Walk through the directory and find .parquet files
# ('filenames' is used instead of 'files' to avoid shadowing google.colab.files)
for root, dirs, filenames in os.walk(destination_folder):
    for filename in filenames:
        if filename.endswith('.parquet'):
            filepath = os.path.join(root, filename)
            dfs.append(pd.read_parquet(filepath))

# Concatenate the DataFrames
df = pd.concat(dfs, copy=False, ignore_index=True, sort=False)

In [8]:
# Information about the starting CTU-13 DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10598771 entries, 0 to 10598770
Data columns (total 11 columns):
 #   Column     Dtype  
---  ------     -----  
 0   dur        float32
 1   proto      object 
 2   dir        object 
 3   state      object 
 4   stos       float32
 5   dtos       float32
 6   tot_pkts   int32  
 7   tot_bytes  int64  
 8   src_bytes  int64  
 9   label      object 
 10  Family     object 
dtypes: float32(3), int32(1), int64(2), object(5)
memory usage: 727.8+ MB


In [9]:
# Some basic statistical details like percentile, mean, std, etc. of the starting CTU-13 DataFrame
df.describe()

                dur          stos          dtos      tot_pkts     tot_bytes     src_bytes
count    10598770.0    10512980.0     9690363.0    10598770.0    10598770.0    10598770.0
mean        540.132   0.008291562  0.0007676699      75.89311      60445.35      12029.09
std        1077.755       1.00699    0.04455093      7613.475     5468097.0     2289821.0
min             0.0           0.0           0.0           1.0          60.0           0.0
25%        0.005451           0.0           0.0           2.0         266.0          81.0
50%        0.716635           0.0           0.0           4.0         550.0         222.0
75%        133.9011           0.0           0.0          11.0        2179.0         952.0
max        3657.061         192.0           3.0    16580640.0  4376239000.0  3423408000.0


In [10]:
df.shape, df.columns

((10598771, 11),
 Index(['dur', 'proto', 'dir', 'state', 'stos', 'dtos', 'tot_pkts', 'tot_bytes',
        'src_bytes', 'label', 'Family'],
       dtype='object'))

In [11]:
# Drop the 'Family' column, which is not used as a feature in this experiment

df = df.drop(columns=['Family'])
df.shape

(10598771, 10)

### Preprocessing of the CTU-13 dataset

In [12]:
# If the dataset has not been preprocessed yet, perform the following steps:
  # 1 # Handle missing values
  # 2 # Encode categorical features and the label
  # 3 # Normalize (Min-Max scaling)
  # 4 # Remove duplicate records

from sklearn.impute import SimpleImputer

df_encoded_file_path = os.path.join(destination_folder, "ctu13_scaled.csv")
if not os.path.exists(df_encoded_file_path):

  # Step 1: Handling Missing Values

  # Check for missing values, NAN
  check_nan = df.isna().sum(axis=0)
  print(check_nan)

  # Impute missing values with the most frequent value for categorical columns and mean for numerical columns
  if (df.isna().sum().sum() !=0):
    imputer = SimpleImputer(strategy='most_frequent', missing_values=pd.NA)
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = imputer.fit_transform(df[[col]])
        else:
            df[col] = df[col].fillna(df[col].mean())

# If you want to simply remove the nan rows you could do the following

# Print the shape (number of rows and columns) of the CTU-13 DataFrame 'df' before any modifications
#print(df.shape)

# Remove rows with missing values (NaN) from the DataFrame 'df'
#df = df.dropna()

# Reprint the shape
#print(df.shape)

dur               0
proto             0
dir               0
state            76
stos          85794
dtos         908408
tot_pkts          0
tot_bytes         0
src_bytes         0
label             0
dtype: int64
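
The imputation rule used in the cell above (most frequent value for categorical columns, mean for numerical columns) can also be expressed with pandas alone. The following is a minimal sketch on a made-up toy frame; the column values are illustrative and are not taken from CTU-13.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with gaps in one categorical and one numeric column (values are made up)
toy = pd.DataFrame({
    "state": ["CON", "CON", None, "INT"],
    "stos": [0.0, np.nan, 0.0, 3.0],
})

for col in toy.columns:
    if is_numeric_dtype(toy[col]):
        # Column mean; mean() skips missing values by default
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Most frequent value; mode() ignores missing entries
        toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```

Mean imputation is sensitive to outliers, so for heavily skewed columns a median could be considered instead.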


In [13]:
# Check again for missing values, NAN
df.isna().sum(axis=0)

dur          0
proto        0
dir          0
state        0
stos         0
dtos         0
tot_pkts     0
tot_bytes    0
src_bytes    0
label        0
dtype: int64

In [14]:
  # 2 # Encode Categorical Features and Label

# Transform the categorical columns of the CTU-13 DataFrame to the 'category' data type, encode them as integer codes,
# and store the codes as 32-bit integers. This reduces memory consumption on large datasets with categorical features.

import numpy as np

df['proto'] = df['proto'].astype('category').cat.codes
df['proto'] = df['proto'].astype(np.int32)
df['dir'] = df['dir'].astype('category').cat.codes
df['dir'] = df['dir'].astype(np.int32)
df['state'] = df['state'].astype('category').cat.codes
df['state'] = df['state'].astype(np.int32)
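
Once the columns are overwritten with integer codes, the original category names are lost. A possible way to keep them, sketched here on a toy column standing in for `df['proto']`, is to build a lookup table before overwriting. The variable names and values below are illustrative assumptions.

```python
import pandas as pd

# Toy column standing in for df['proto'] (values are illustrative)
proto = pd.Series(["tcp", "udp", "icmp", "tcp"])

as_category = proto.astype("category")
codes = as_category.cat.codes.astype("int32")

# Lookup table: integer code -> original category name
code_to_proto = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(code_to_proto)
```

Such a table makes it possible to translate model inputs or feature values back to readable protocol, direction, or state names later on.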

In [15]:
# Display the top 10 most frequent values and their counts in the 'label' column of CTU-13
print(df.label.value_counts().head(10))

# Change the data type of the 'label' column to 'object' (string)
df['label'] = df['label'].astype(dtype='object')

# Check if the 'label' column starts with the string 'flow=From-Botnet', and assign a Boolean value accordingly
df['label'] = df['label'].str.startswith('flow=From-Botnet', na=False)

# Change the data type of the 'label' column to 'float32'
df['label'] = df['label'].astype(dtype='float32', copy=False)

# Display again the top 10 most frequent values and their counts in the 'label' column of CTU-13 after modifications
print(df.label.value_counts().head(10))

flow=Background-UDP-Established           4151793
flow=Background-TCP-Established           2054183
flow=To-Background-UDP-CVUT-DNS-Server    1659760
flow=Background-Established-cmpgw-CVUT     855498
flow=Background-UDP-Attempt                492881
flow=Background-TCP-Attempt                355420
flow=Background                            178154
flow=To-Background-CVUT-Proxy              149570
flow=Background-Attempt-cmpgw-CVUT          54956
flow=To-Background-CVUT-WebServer           45227
Name: label, dtype: int64
0.0    10336198
1.0      262573
Name: label, dtype: int64
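
The binary encoding above merges everything that is not `flow=From-Botnet...` into a single negative class. If a coarser three-way view (botnet, normal, background) is of interest, a mapping along the following lines could be applied to the raw label strings before they are converted. This is a sketch under assumptions: the helper `coarse_class` is hypothetical, and the rule that non-botnet labels containing `Normal` denote normal traffic, as well as the first two example labels, are assumptions that are not confirmed by the output shown above.

```python
import pandas as pd

def coarse_class(label: str) -> str:
    """Hypothetical helper mapping a raw CTU-13 label string to a coarse class."""
    if label.startswith("flow=From-Botnet"):
        return "botnet"
    if "Normal" in label:  # assumption: normal traffic is marked with 'Normal'
        return "normal"
    return "background"

labels = pd.Series([
    "flow=From-Botnet-V42-UDP-DNS",        # illustrative label
    "flow=From-Normal-V42-Jist",           # illustrative label
    "flow=Background-UDP-Established",     # appears in the output above
    "flow=To-Background-CVUT-Proxy",       # appears in the output above
])
print(labels.map(coarse_class).value_counts())
```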


In [16]:
  # 3 # Normalization (Min-Max Scaling)

from sklearn.preprocessing import MinMaxScaler

# Scale only if the preprocessed file does not exist yet; otherwise reuse the saved copy
if not os.path.exists(df_encoded_file_path):
    min_max_scaler = MinMaxScaler().fit(df)  # Fit the scaler to the data in 'df'
    df_scaled = pd.DataFrame(data=min_max_scaler.transform(df), columns=df.columns)  # Create a new DataFrame with scaled data
else:
    df_scaled = pd.read_csv(df_encoded_file_path)  # Later cells need 'df_scaled' in either case
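
The cell above fits the scaler once on the whole DataFrame, before any train/test split. A common alternative is to put the scaler inside a scikit-learn pipeline so that it is refit on the training part of each cross-validation fold. The sketch below shows this on synthetic data; the toy arrays and the choice of `GaussianNB` as the estimator are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on very different scales (not CTU-13 data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * [1.0, 100.0, 10000.0]
y_toy = (X_toy[:, 0] > 0).astype(int)

# The scaler is fit on the training folds only, then applied to the held-out fold
model = make_pipeline(MinMaxScaler(), GaussianNB())
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="f1")
print(scores.mean())
```

With this arrangement the minimum and maximum of the held-out fold never influence the scaling parameters.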

In [17]:
  # 4 # Removing duplicate records

# Print the shape of the DataFrame 'df' before removing duplicates
print(df.shape)

# Remove duplicate rows from the DataFrame 'df' while resetting the index
df = df.drop_duplicates()
df.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df' after removing duplicates and resetting the index
print(df.shape)


# Print the shape of the DataFrame 'df_scaled' before removing duplicates
print(df_scaled.shape)

# Remove duplicate rows from the DataFrame 'df_scaled' while resetting the index
df_scaled = df_scaled.drop_duplicates()
df_scaled.reset_index(inplace=True, drop=True)

# Print the shape of the DataFrame 'df_scaled' after removing duplicates and resetting the index
print(df_scaled.shape)

(10598771, 10)
(9404048, 10)
(10598771, 10)
(9404048, 10)


In [18]:
# Print out the DataFrames loaded in the memory
%whos DataFrame

Variable    Type         Data/Info
----------------------------------
df          DataFrame                  dur  proto <...>404048 rows x 10 columns]
df_scaled   DataFrame                  dur     pro<...>404048 rows x 10 columns]


In [19]:
# Save the resulting dataframes

df_file_path = '/content/drive/MyDrive/CTU13-BM/ctu13.csv'
df_encoded_file_path = '/content/drive/MyDrive/CTU13-BM/ctu13_scaled.csv'

# Check if the Dataset is saved:
if not os.path.exists(df_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df.to_csv(df_file_path, index=False)
if not os.path.exists(df_encoded_file_path):
  # Convert your Pandas DataFrame to a CSV file
  df_scaled.to_csv(df_encoded_file_path, index=False)

## 2. Algorithm Evaluation

In this section, we assess the performance of various machine learning algorithms on the datasets mentioned above.

### 2.3. CTU-13 dataset evaluation with baseline and traditional ML algorithms

In this section, we evaluate the performance of various machine learning algorithms on the CTU-13 botnet dataset. We assess the precision and F1 scores, essential indicators of classification accuracy, for a range of algorithms, including fundamental classifiers like Zero Rule and One Rule, statistical approaches like Naive Bayes, and more advanced models such as Random Forest.

Given the CTU-13 dataset's limited number of features (nine: 'dur', 'proto', 'dir', 'state', 'stos', 'dtos', 'tot_pkts', 'tot_bytes', and 'src_bytes', plus the 'label' target), our evaluation focuses on understanding the effectiveness of these algorithms in detecting and classifying network traffic patterns associated with botnet activities.

To ensure a robust assessment, we employ a 10-fold cross-validation methodology, testing the algorithms on all the features of the dataset. These results offer insights into dataset generation strategy and aid the selection of effective feature extraction methods for cybersecurity-specific datasets.
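
As a compact illustration of the 10-fold procedure, the sketch below uses scikit-learn's `cross_validate`, which can compute several metrics and the timings in a single pass over the folds. The synthetic data, the shuffled `StratifiedKFold` splitter, and the `GaussianNB` estimator are assumptions chosen for the example; when an integer is passed as `cv` for a classifier, scikit-learn uses stratified folds without shuffling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced data (10% positives); not CTU-13 data
rng = np.random.default_rng(0)
y_toy = (np.arange(500) % 10 == 0).astype(int)
X_toy = rng.normal(size=(500, 4))
X_toy[:, 0] += 2.0 * y_toy  # make the first feature informative

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
out = cross_validate(GaussianNB(), X_toy, y_toy, cv=cv, scoring=("precision", "f1"))

print("precision:", out["test_precision"].mean())
print("f1:", out["test_f1"].mean())
print("precision variance:", np.var(out["test_precision"]))
print("time (s):", out["fit_time"].sum() + out["score_time"].sum())
```

Computing all metrics in one pass fits each model once per fold instead of once per metric, which matters for slow estimators on a dataset of this size.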

In [20]:
# Skip selection of best features due to low count of attributes

# Separate features (X) and labels (y)
X = df_scaled.drop('label', axis=1)  # Exclude the label column
y = df_scaled['label']

In [21]:
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, mean_absolute_error, f1_score
from sklearn.dummy import DummyClassifier
from tabulate import tabulate
import time
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Define the number of desired folds for Cross-Validation (e.g., 10)
num_folds = 10

# Initialize performance metrics lists for 9 features due to limited attributes count
results_9_features = []

In [22]:
# Define a file name for saving the results
results_file_name = os.path.join(destination_folder, "ctu13_results.pkl")

# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define ZeroRule classifier
  zero_rule = DummyClassifier(strategy="most_frequent")

  # Evaluate ZeroRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_9 = cross_val_score(zero_rule, X, y, cv=num_folds, scoring='precision')
  f1_scores_9 = cross_val_score(zero_rule, X, y, cv=num_folds, scoring='f1')
  elapsed_time_9 = time.time() - start_time  # Calculate execution time

  variance_9 = np.var(precision_scores_9)

  predictions_9 = cross_val_predict(zero_rule, X, y, cv=num_folds)
  mae_9 = mean_absolute_error(y, predictions_9)

  # Display ZeroRule results for 9 features
  print("ZeroRule Precision (9 features):", np.mean(precision_scores_9))
  print("ZeroRule F1 Score (9 features):", np.mean(f1_scores_9))
  print("ZeroRule Variance (9 features):", variance_9)
  print("ZeroRule MAE (9 features):", mae_9)
  print("ZeroRule Execution Time:", elapsed_time_9)

  results_9_features.append(["ZeroRule", np.mean(precision_scores_9), np.mean(f1_scores_9), variance_9, mae_9, elapsed_time_9])

ZeroRule Precision (9 features): 0.0
ZeroRule F1 Score (9 features): 0.0
ZeroRule Variance (9 features): 0.0
ZeroRule MAE (9 features): 0.024866100215566744
ZeroRule Execution Time: 30.797607898712158
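
These numbers follow directly from how ZeroRule works: it always predicts the majority class (0.0), so it never predicts a botnet flow. Precision and F1 for the positive class are therefore reported as 0, and the MAE equals the share of positive samples. The toy example below, with made-up labels, illustrates this relationship.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import mean_absolute_error, precision_score

# Toy labels (made up): 95 benign flows and 5 botnet flows
y_toy = np.array([0.0] * 95 + [1.0] * 5)
X_toy = np.zeros((100, 1))  # features are ignored by DummyClassifier

zero_rule = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = zero_rule.predict(X_toy)

mae = mean_absolute_error(y_toy, pred)
precision = precision_score(y_toy, pred, zero_division=0)
print(mae, y_toy.mean(), precision)
```

Read this way, the MAE of about 0.0249 reported above corresponds to the fraction of botnet flows in the deduplicated data, and it is the error level that any useful classifier has to beat.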


In [23]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

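  # Define the "OneRule" baseline
  # Note: DummyClassifier(strategy="stratified") guesses labels at random according to the
  # training class distribution; it does not learn a single-feature rule as classic OneR does
  one_rule = DummyClassifier(strategy="stratified")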

  # Evaluate OneRule classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_9 = cross_val_score(one_rule, X, y, cv=num_folds, scoring='precision')
  f1_scores_9 = cross_val_score(one_rule, X, y, cv=num_folds, scoring='f1')
  elapsed_time_9 = time.time() - start_time  # Calculate execution time

  variance_9 = np.var(precision_scores_9)

  predictions_9 = cross_val_predict(one_rule, X, y, cv=num_folds)
  mae_9 = mean_absolute_error(y, predictions_9)

  # Display OneRule results for 9 features
  print("OneRule Precision (9 features):", np.mean(precision_scores_9))
  print("OneRule F1 Score (9 features):", np.mean(f1_scores_9))
  print("OneRule Variance (9 features):", variance_9)
  print("OneRule MAE (9 features):", mae_9)
  print("OneRule Execution Time:", elapsed_time_9)

  results_9_features.append(["OneRule", np.mean(precision_scores_9), np.mean(f1_scores_9), variance_9, mae_9, elapsed_time_9])

OneRule Precision (9 features): 0.024784734503328354
OneRule F1 Score (9 features): 0.02493287194553043
OneRule Variance (9 features): 8.980152971872171e-07
OneRule MAE (9 features): 0.04843010159029388
OneRule Execution Time: 31.0071804523468
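
`DummyClassifier(strategy="stratified")` draws predictions at random from the class distribution, which is why its precision stays close to the share of botnet flows. A classifier that actually learns a rule on a single feature can be approximated with a depth-1 decision tree (a decision stump). This is a stand-in chosen for illustration, not the classic OneR algorithm and not the method used in this notebook, and the data below is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only feature 1 carries signal
rng = np.random.default_rng(0)
X_toy = rng.random((300, 3))
y_toy = (X_toy[:, 1] > 0.7).astype(int)

# A depth-1 tree splits on exactly one feature with one threshold
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)
used_feature = stump.tree_.feature[0]
threshold = stump.tree_.threshold[0]

print("feature index:", used_feature)
print("threshold:", threshold)
print("training accuracy:", stump.score(X_toy, y_toy))
```

On the real data such a stump could be passed to `cross_val_score` in place of the dummy estimator to obtain a single-feature baseline.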


In [24]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Define Naive Bayes classifier
  naive_bayes = GaussianNB()

  # Evaluate Naive Bayes classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_9 = cross_val_score(naive_bayes, X, y, cv=num_folds, scoring='precision')
  f1_scores_9 = cross_val_score(naive_bayes, X, y, cv=num_folds, scoring='f1')
  elapsed_time_9 = time.time() - start_time  # Calculate execution time

  variance_9 = np.var(precision_scores_9)

  predictions_9 = cross_val_predict(naive_bayes, X, y, cv=num_folds)
  mae_9 = mean_absolute_error(y, predictions_9)

  # Display Naive Bayes results for 9 features
  print("Naive Bayes Precision (9 features):", np.mean(precision_scores_9))
  print("Naive Bayes F1 Score (9 features):", np.mean(f1_scores_9))
  print("Naive Bayes Variance (9 features):", variance_9)
  print("Naive Bayes MAE (9 features):", mae_9)
  print("Naive Bayes Execution Time:", elapsed_time_9)

  results_9_features.append(["Naive Bayes", np.mean(precision_scores_9), np.mean(f1_scores_9), variance_9, mae_9, elapsed_time_9])

Naive Bayes Precision (9 features): 0.02939298880167533
Naive Bayes F1 Score (9 features): 0.0570083052333196
Naive Bayes Variance (9 features): 2.0994859580674664e-05
Naive Bayes MAE (9 features): 0.7805218561198326
Naive Bayes Execution Time: 68.04527616500854
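
For 0/1 labels the MAE is simply the misclassification rate, so an MAE of about 0.78 means that roughly 78% of the flows are misclassified. Since only about 2.5% of the flows are botnet flows, most of these errors must be false positives. A confusion matrix makes this split explicit; the sketch below shows how it could be obtained from `cross_val_predict`, using synthetic data as a placeholder.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, mean_absolute_error
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced data (5% positives); not CTU-13 data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 2))
y_toy = (np.arange(400) % 20 == 0).astype(int)

pred = cross_val_predict(GaussianNB(), X_toy, y_toy, cv=5)
tn, fp, fn, tp = confusion_matrix(y_toy, pred, labels=[0, 1]).ravel()

# With binary labels, MAE equals (FP + FN) / N
error_rate = (fp + fn) / len(y_toy)
print(tn, fp, fn, tp)
print(error_rate, mean_absolute_error(y_toy, pred))
```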


In [25]:
# Check for results before rerunning the code snippet
if not os.path.exists(results_file_name):

  # Create a Random Forest classifier with fixed (not tuned) hyperparameters
  rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1)  # n_jobs=-1 uses all CPU cores

  # Evaluate Random Forest classifier
  start_time = time.time()  # Start measuring execution time
  precision_scores_9 = cross_val_score(rf_classifier, X, y, cv=num_folds, scoring='precision')
  f1_scores_9 = cross_val_score(rf_classifier, X, y, cv=num_folds, scoring='f1')
  elapsed_time_9 = time.time() - start_time  # Calculate execution time

  variance_9 = np.var(precision_scores_9)

  predictions_9 = cross_val_predict(rf_classifier, X, y, cv=num_folds)
  mae_9 = mean_absolute_error(y, predictions_9)

  # Display Random Forest results for 9 features
  print("Random Forest Precision (9 features):", np.mean(precision_scores_9))
  print("Random Forest F1 Score (9 features):", np.mean(f1_scores_9))
  print("Random Forest Variance (9 features):", variance_9)
  print("Random Forest MAE (9 features):", mae_9)
  print("Random Forest Execution Time:", elapsed_time_9)

  results_9_features.append(["Random Forest", np.mean(precision_scores_9), np.mean(f1_scores_9), variance_9, mae_9, elapsed_time_9])

Random Forest Precision (9 features): 0.6757534659433394
Random Forest F1 Score (9 features): 0.039739922209714736
Random Forest Variance (9 features): 0.18238869891010404
Random Forest MAE (9 features): 0.024287838598867212
Random Forest Execution Time: 4184.681151866913
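
The combination of a fairly high precision and a very low F1 score suggests that recall is very low: the model rarely flags a flow as botnet. Since F1 = 2PR / (P + R), recall can be estimated as R = F1 * P / (2P - F1). The helper below is hypothetical, and the estimate is only approximate because the reported precision and F1 are averages over folds rather than values from a single confusion matrix.

```python
def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Invert F1 = 2PR / (P + R) for R. Only valid when 2 * precision > f1."""
    return f1 * precision / (2.0 * precision - f1)

# Mean precision and mean F1 reported for Random Forest above
approx_recall = recall_from_precision_f1(0.6757534659433394, 0.039739922209714736)
print(approx_recall)
```

Under this approximation recall is on the order of 2%, which would mean that the large majority of botnet flows go undetected. The high variance of the per-fold precision points in the same direction, since precision is unstable when very few positive predictions are made.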


In [26]:
import pickle
import os

if not os.path.exists(results_file_name):

  # Save the results lists to a file
  with open(results_file_name, 'wb') as file:
      results_dict = {
          'results_9_features': results_9_features,
      }
      pickle.dump(results_dict, file)


In [27]:
# Load the results from the file
with open(results_file_name, 'rb') as file:
    loaded_results = pickle.load(file)

# Access the loaded results lists
results_9_features = loaded_results['results_9_features']


In [28]:
# Print the results in tabular format
headers_9 = ["Algorithm", "Precision", "F1 Score", "Variance", "MAE", "Execution Time"]

print(tabulate(results_9_features, headers_9, tablefmt="pretty"))

+---------------+----------------------+----------------------+------------------------+----------------------+--------------------+
|   Algorithm   |      Precision       |       F1 Score       |        Variance        |         MAE          |   Execution Time   |
+---------------+----------------------+----------------------+------------------------+----------------------+--------------------+
|   ZeroRule    |         0.0          |         0.0          |          0.0           | 0.024866100215566744 | 30.797607898712158 |
|    OneRule    | 0.024784734503328354 | 0.02493287194553043  | 8.980152971872171e-07  | 0.04843010159029388  |  31.0071804523468  |
|  Naive Bayes  | 0.02939298880167533  |  0.0570083052333196  | 2.0994859580674664e-05 |  0.7805218561198326  | 68.04527616500854  |
| Random Forest |  0.6757534659433394  | 0.039739922209714736 |  0.18238869891010404   | 0.024287838598867212 | 4184.681151866913  |
+---------------+----------------------+----------------------+------