## Data Preprocessing: CIC-IoT2023

This section covers the preprocessing of the CIC-IoT2023 dataset after its flow-level and packet-level features have been extracted with the `Feature_extractor_flow_packet_combined.py` script (extraction shown in `GNN4ID.ipynb`) and the additional features have been generated. Because CIC-IoT2023 is very large and highly imbalanced, with some classes having very few instances, we undersample and oversample the data to bring every class to a comparable size.

In [None]:
from Utility.Functions import split_csv, Combining_classes  # helpers used in this notebook
import pandas as pd
import glob
import os
from tqdm import tqdm

We split the data 80:20: 80% of the data is used for training and the remaining 20% for testing.
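As a rough illustration of an 80:20 split, the sketch below shuffles a table of flows and cuts it at the 80% mark. The helper name `split_80_20`, the toy DataFrame, and the fixed seed are assumptions made for illustration; the split actually used in this notebook is the one implemented by `split_csv` in `Utility.Functions`, which may differ.

```python
import pandas as pd

def split_80_20(df: pd.DataFrame, train_frac: float = 0.8, seed: int = 42):
    """Shuffle the rows, then cut at train_frac (hypothetical helper)."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cut = int(len(shuffled) * train_frac)
    return shuffled.iloc[:cut], shuffled.iloc[cut:]

# Toy stand-in for the flows of one extracted CSV file
flows = pd.DataFrame({"flow_id": range(10)})
train_df, test_df = split_80_20(flows)
print(len(train_df), len(test_df))
```

Shuffling before cutting avoids a split that simply follows the capture order of the flows.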

To set the sampling targets, we first identified the class with the fewest samples, which in our case was the BruteForce attack class with 2,336 samples. Using this class as the reference point, we derived the undersampling rate for the other classes from the minority-class sample count.

We applied an oversampling factor of roughly 10x to the minority class in the training data: the training portion of the BruteForce attack class (80% of 2,336 samples, about 1,869) was increased to 20,000 samples. Consequently, we set the number of training samples for every class to 20,000, and each class was either undersampled or oversampled to reach this target.
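The arithmetic behind these numbers, together with one possible way to bring a class to a fixed size, can be sketched as follows. The function `resample_to_target` and the seed are hypothetical; the sketch uses plain random sampling with pandas and is not necessarily how the helpers in `Utility.Functions` resample.

```python
import pandas as pd

MINORITY_TOTAL = 2336    # BruteForce samples reported above
TRAIN_FRACTION = 0.8     # 80:20 split
TRAIN_TARGET = 20_000    # per-class training target reported above

minority_train = round(MINORITY_TOTAL * TRAIN_FRACTION)
oversampling_factor = TRAIN_TARGET / minority_train
print(minority_train, round(oversampling_factor, 1))

def resample_to_target(df: pd.DataFrame, target: int, seed: int = 42) -> pd.DataFrame:
    """Return exactly `target` rows (hypothetical helper).

    Classes larger than the target are undersampled without replacement;
    smaller classes are oversampled by drawing rows with replacement.
    """
    return df.sample(n=target, replace=len(df) < target, random_state=seed)

small_class = pd.DataFrame({"value": range(5)})
large_class = pd.DataFrame({"value": range(50)})
oversampled = resample_to_target(small_class, target=12)
undersampled = resample_to_target(large_class, target=12)
```

Oversampling by duplication adds no new information, which is why it is applied to the training data only.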

The following steps outline the process of dividing the dataset into an 80:20 split and subsequently performing the necessary over/undersampling:

In [None]:
## Directory containing the extracted flow features + additional features as CSV files
directory = 'F:/CIC_IOT/Extracted_Flow_Features/'
List_of_CSV_File = glob.glob(directory + '*.csv')

### Dataset Division & Filtering

For the attack classes, we keep only those flow instances in which the attacker's MAC address appears as either the source or the destination. This excludes non-attack flows captured alongside the attack traffic, so each attack class contains only relevant attack flows.

Similarly, for the Benign class, we remove any samples that involve the attacker's MAC address, so that the benign data is free of attack traffic.
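The MAC-based filtering can be sketched as below. The column names `src_mac` and `dst_mac` appear in the extracted features, but the function `filter_flows` and the MAC addresses are placeholders invented for this example, not the dataset's real attacker addresses or the code inside `split_csv`.

```python
import pandas as pd

# Placeholder address: NOT the real attacker MAC from CIC-IoT2023
ATTACKER_MACS = {"aa:bb:cc:dd:ee:ff"}

def filter_flows(df: pd.DataFrame, attacker_macs, is_attack: bool) -> pd.DataFrame:
    """Keep attacker flows for attack classes, drop them for Benign (hypothetical helper)."""
    involves_attacker = df["src_mac"].isin(attacker_macs) | df["dst_mac"].isin(attacker_macs)
    return df[involves_attacker] if is_attack else df[~involves_attacker]

flows = pd.DataFrame({
    "src_mac": ["aa:bb:cc:dd:ee:ff", "11:22:33:44:55:66", "11:22:33:44:55:66"],
    "dst_mac": ["11:22:33:44:55:66", "aa:bb:cc:dd:ee:ff", "77:88:99:aa:bb:cc"],
})
attack_flows = filter_flows(flows, ATTACKER_MACS, is_attack=True)   # attacker is source or destination
benign_flows = filter_flows(flows, ATTACKER_MACS, is_attack=False)  # attacker not involved
print(len(attack_flows), len(benign_flows))
```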

We also cap the number of samples to keep the dataset manageable. The training set is limited to 20,000 samples per class, as described above, and the test set to a maximum of 4,000 samples. Capping keeps training and evaluation tractable on limited computational resources.
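One way to enforce such a cap is sketched below, under the assumption that the cap is applied per class and that the class is stored in a column named `label`. Both the column name and the helper `cap_per_class` are hypothetical and may not match the notebook's helpers.

```python
import pandas as pd

def cap_per_class(df: pd.DataFrame, cap: int, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Randomly keep at most `cap` rows for each class (hypothetical helper)."""
    capped = [
        group.sample(n=min(len(group), cap), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(capped, ignore_index=True)

# Toy data: class 0 has 7 rows, class 1 has 3 rows
toy = pd.DataFrame({"label": [0] * 7 + [1] * 3, "value": range(10)})
capped = cap_per_class(toy, cap=5)
counts = capped["label"].value_counts().to_dict()
print(counts)
```

Classes that are already below the cap pass through unchanged; only the larger classes are cut down.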

The code can be customized: you can process only a specific number of files or restrict processing to particular attack classes, depending on your requirements.


### Test & Train Split

In [None]:
# Filter each file and split it into train/test
for csv_file in tqdm(List_of_CSV_File):
    split_csv(csv_file)

### Combining Sub-Classes into Broad Classes

In [None]:
Attack_Classes = ['Benign', 'BruteForce','DDos','Dos','Mirai','Recon','Spoofing','WebBased']
label_dict = {'Benign': 0,'WebBased': 1,'Spoofing': 2,'Recon': 3,'Mirai': 4,'Dos': 5,'DDos': 6,'BruteForce': 7}

## Combining Same Class files into one file.
Combining_classes(directory,Attack_Classes,label_dict=label_dict)
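Conceptually, combining sub-classes means concatenating the files that belong to one broad class and attaching that class's numeric label. The sketch below illustrates the idea on toy data. The function `combine_class`, the `Label` column name, and the `feature` column are assumptions for illustration, and the actual `Combining_classes` implementation in `Utility.Functions` may work differently.

```python
import pandas as pd

label_dict = {'Benign': 0, 'WebBased': 1, 'Spoofing': 2, 'Recon': 3,
              'Mirai': 4, 'Dos': 5, 'DDos': 6, 'BruteForce': 7}

def combine_class(frames, class_name: str, label_dict: dict) -> pd.DataFrame:
    """Concatenate sub-class frames and tag them with the broad-class label (hypothetical helper)."""
    combined = pd.concat(frames, ignore_index=True)
    combined["Label"] = label_dict[class_name]
    return combined

# Toy frames standing in for two sub-class files of the same broad class
ddos_parts = [
    pd.DataFrame({"feature": [0.1, 0.2]}),
    pd.DataFrame({"feature": [0.3]}),
]
ddos_df = combine_class(ddos_parts, "DDos", label_dict)
print(ddos_df)
```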

### Transforming Data into Single Train and Test Files

In [None]:
## Train set: merge the per-class CSV files into a single DataFrame
directory_combined = directory + 'train/'  # `directory` already ends with '/'
List_of_CSV_File = glob.glob(directory_combined + "*")
df_list = []
for location in List_of_CSV_File:
    df = pd.read_csv(location)
    os.remove(location)  # delete the per-class file once it has been read
    df_list.append(df)
final_df = pd.concat(df_list, ignore_index=True)
final_df.to_csv(directory_combined + 'df_class_8_train.csv', index=False)

In [None]:
## Drop columns that contain biased or highly correlated data.
final_df.drop(columns=[
    'src_ip', 'src_port', 'dst_ip', 'dst_port', 'ip_version',
    'bidirectional_bytes', 'bidirectional_first_seen_ms', 'bidirectional_last_seen_ms',
    'bidirectional_duration_ms', 'bidirectional_packets',
    'src2dst_first_seen_ms', 'src2dst_last_seen_ms', 'dst2src_first_seen_ms', 'dst2src_last_seen_ms',
    'id', 'src_mac', 'src_oui', 'dst_mac', 'dst_oui', 'vlan_id', 'tunnel_id',
    'bidirectional_syn_packets', 'bidirectional_cwr_packets', 'bidirectional_ece_packets',
    'bidirectional_urg_packets', 'bidirectional_ack_packets', 'bidirectional_psh_packets',
    'bidirectional_rst_packets', 'bidirectional_fin_packets',
], inplace=True)
## Overwrite the combined training file with the reduced feature set
final_df.to_csv(directory_combined + 'df_class_8_train.csv', index=False)

In [None]:
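## Test set: merge the per-class CSV files into a single DataFrame
directory_combined = directory + 'test/'  # `directory` already ends with '/'
List_of_CSV_File = glob.glob(directory_combined + "*")
df_list = []
for location in List_of_CSV_File:
    df = pd.read_csv(location)
    os.remove(location)  # delete the per-class file once it has been read
    df_list.append(df)
final_df = pd.concat(df_list, ignore_index=True)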

In [None]:
## Drop columns that contain biased or highly correlated data.
final_df.drop(columns=[
    'src_ip', 'src_port', 'dst_ip', 'dst_port', 'ip_version',
    'bidirectional_bytes', 'bidirectional_first_seen_ms', 'bidirectional_last_seen_ms',
    'bidirectional_duration_ms', 'bidirectional_packets',
    'src2dst_first_seen_ms', 'src2dst_last_seen_ms', 'dst2src_first_seen_ms', 'dst2src_last_seen_ms',
    'id', 'src_mac', 'src_oui', 'dst_mac', 'dst_oui', 'vlan_id', 'tunnel_id',
    'bidirectional_syn_packets', 'bidirectional_cwr_packets', 'bidirectional_ece_packets',
    'bidirectional_urg_packets', 'bidirectional_ack_packets', 'bidirectional_psh_packets',
    'bidirectional_rst_packets', 'bidirectional_fin_packets',
], inplace=True)
## Save the combined test file with the reduced feature set
final_df.to_csv(directory_combined + 'df_class_8_test.csv', index=False)