## GNN4ID

This notebook demonstrates a comprehensive pipeline to convert raw PCAP files into graph data objects suitable for Graph Neural Network (GNN) models. The pipeline extracts flow-level information alongside packet-level details, ultimately producing two primary outputs:

1. <u>Extracted Flow-based Features with Packet-Level Information</u>: Detailed flow-based features are extracted, including comprehensive packet-level information.
2. <u>Graph Data Objects for GNN Models</u>: Flow-based features and packet-level details are transformed into individual graph data objects, suitable for graph-level predictions using GNN models.

This transformation enables the application of GNNs for advanced network traffic analysis and intrusion detection, leveraging the rich information from both flow and packet levels.

In [8]:
from utility.functions import *
from utility.flow_features import *
import glob
import os  # used below for path handling and file removal
import shutil
import subprocess
import tarfile

### Extraction of Compressed PCAP Files from the CIC-IoT2023 Dataset

For demonstration purposes, we will utilize the CIC-IoT2023 dataset, one of the latest and most comprehensive datasets available for IoT network traffic analysis. You can access and download the dataset from the following link: [CIC-IoT2023 Dataset](https://www.unb.ca/cic/datasets/iotdataset-2023.html).

While this example uses the CIC-IoT2023 dataset, any dataset with labeled raw packets can be used. If your PCAP files are not labeled, you can use the tool available at [Payload Byte](https://github.com/Yasir-ali-farrukh/Payload-Byte) for labeling the PCAP files.


In [134]:
# Path pattern matching the downloaded raw PCAP archives.
# The CIC-IoT2023 dataset is available as compressed .tar.gz archives.
Directory = "F:\\CIC IoT Dataset 2023\\*.tar.gz"
# Directory where the extracted PCAP files will be written
Out_Directory = 'F:\\CIC_IOT\\Packet_Level_Data'

Compressed_files = glob.glob(Directory)

In [137]:
for archive_path in Compressed_files:
    # The context manager closes each archive once it has been extracted
    with tarfile.open(archive_path) as archive:
        archive.extractall(Out_Directory)
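
Archives downloaded from the internet can contain members whose paths point outside the target directory. The sketch below wraps the extraction step in a helper that uses the `filter="data"` option of `extractall` when the running Python version provides it. The helper name `extract_archives` is hypothetical and not part of this repository.

```python
import glob
import os
import tarfile


def extract_archives(pattern, out_dir):
    """Extract every tar archive matching `pattern` into `out_dir`.

    Hypothetical helper for illustration; returns the archives processed.
    """
    os.makedirs(out_dir, exist_ok=True)
    processed = []
    for archive_path in sorted(glob.glob(pattern)):
        # The context manager closes the archive even if extraction fails.
        with tarfile.open(archive_path) as archive:
            if hasattr(tarfile, "data_filter"):
                # Newer Python versions can reject members that would be
                # written outside out_dir (absolute paths, "..", links).
                archive.extractall(out_dir, filter="data")
            else:
                archive.extractall(out_dir)
        processed.append(archive_path)
    return processed
```

It could be called as `extract_archives(Directory, Out_Directory)` in place of the loop above.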

### Renaming the PCAP Files

To facilitate easier differentiation between attack classes during the transformation into graph data objects, it is essential to rename the PCAP files appropriately.

**Alternatively:** You can generate a single file containing all data instances along with a "Label" column to categorize the instances.

In [138]:
name_mapping = {
    'Benign': 'Benign-Benign',
    'DDoS-ACK_Fragmentation': 'DDos-AckFrg',
    'DDoS-UDP_Flood': 'DDos-UDPFlood',
    'DDos-SlowLoris': 'DDos-SlowLoris',
    'DDoS-ICMP_Flood': 'DDos-ICMPFlood',
    'DDoS-RSTFINFlood': 'DDos-RSTFIN',
    'DDoS-PSHACK_Flood': 'DDos-PSHACK',
    'DDoS-HTTP_Flood': 'DDos-HTTPFlood',
    'DDoS-UDP_Fragmentation': 'DDos-UDPFrg',
    'DDoS-ICMP_Fragmentation': 'DDos-ICMPFrg',
    'DDoS-TCP_Flood': 'DDos-TCPFlood',
    'DDoS-SYN_Flood': 'DDos-SYNFlood',
    'DDoS-SynonymousIP_Flood': 'DDos-SynonymousIPFlood',
    'DoS-TCP_Flood': 'Dos-TCPFlood',
    'DoS-HTTP_Flood': 'Dos-HTTPFlood',
    'DoS-SYN_Flood': 'Dos-SYNFlood',
    'DoS-UDP_Flood': 'Dos-UDPFlood',
    'Recon-PingSweep': 'Recon-PingSweep',
    'Recon-OSScan': 'Recon-OSScan',
    'VulnerabilityScan': 'Recon-VulScan',
    'Recon-PortScan': 'Recon-PortScan',
    'Recon-HostDiscovery': 'Recon-HostDisc',
    'SqlInjection': 'WebBased-SqlInject',
    'CommandInjection': 'WebBased-CmmdInject',
    'Backdoor_Malware': 'WebBased-BckdoorMalware',
    'Uploading_Attack': 'WebBased-UploadAttack',
    'XSS': 'WebBased-XSS',
    'BrowserHijacking': 'WebBased-BrwserHijack',
    'DictionaryBruteForce': 'BruteForce-Dictionary',
    'MITM-ArpSpoofing': 'Spoofing-ARP',
    'DNS_Spoofing': 'Spoofing-DNS',
    'Mirai-greip_flood': 'Mirai-GREIP',
    'Mirai-greeth_flood': 'Mirai-Greeth',
    'Mirai-udpplain': 'Mirai-UDPPlain',
}

In [None]:
## Rename the extracted files according to name_mapping
rename_files(Out_Directory, name_mapping)
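
The `rename_files` helper comes from `utility.functions` and its implementation is not shown in this notebook. As a rough illustration only, a prefix-based renaming helper could look like the sketch below. The function name `rename_by_prefix_sketch` and its matching rule (longest matching prefix of the file stem) are assumptions, not the repository's actual behavior.

```python
import os


def rename_by_prefix_sketch(root_dir, mapping):
    """Hypothetical stand-in for rename_files.

    Renames every file under root_dir whose stem starts with a key of
    `mapping` so that it starts with the mapped value instead.
    Returns a list of (old_name, new_name) pairs.
    """
    renamed = []
    # Try longer keys first so a specific key wins over a shorter key
    # that shares the same prefix.
    keys = sorted(mapping, key=len, reverse=True)
    for folder, _subdirs, files in os.walk(root_dir):
        for name in files:
            stem, ext = os.path.splitext(name)
            for key in keys:
                if stem.startswith(key):
                    new_name = mapping[key] + stem[len(key):] + ext
                    if new_name != name:
                        os.rename(os.path.join(folder, name),
                                  os.path.join(folder, new_name))
                        renamed.append((name, new_name))
                    break
    return renamed
```

Note that this sketch is not idempotent for entries such as `'Benign': 'Benign-Benign'`, so it should be run only once per directory.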

### Extracting Features from PCAP Files

This step extracts flow-level features, together with the packet-level features of each flow, from the PCAP files.

The features are extracted using the `Feature_extractor_flow_packet_combined.py` script. These features can be utilized for various purposes beyond creating graph objects, as they offer complete information about each flow along with its associated packet details.


In [None]:
# Recursively collect every PCAP file under the extraction directory
directory = os.path.join(Out_Directory, "**", "*pcap")
List_of_PCAP = glob.glob(directory, recursive=True)

Out_path = 'F:/CIC_IOT/'  # Directory where the CSV files will be saved
feature_Extractor = '/Utility/Feature_extractor_flow_packet_combined.py'  # Script that extracts the features from PCAP files

for single_pcap_file in List_of_PCAP:
    print("Reading File: ", os.path.basename(single_pcap_file))
    # The extractor is run from the command line because it sometimes has issues inside the notebook due to multithreading.
    completed_process = subprocess.run(['python', feature_Extractor, single_pcap_file, Out_path], capture_output=True)
    if completed_process.returncode != 0:
        # Keep the PCAP file so that the extraction can be retried
        print("Extraction failed for: ", os.path.basename(single_pcap_file))
        print(completed_process.stderr.decode(errors="replace"))
        continue
    os.remove(single_pcap_file)  # Remove the processed PCAP file to save disk space
    # The list is not modified inside the loop: removing items while iterating would skip files.
    print("**Extraction Completed For: ", os.path.basename(single_pcap_file), "**")

### Transformation into Graph Data Objects
The data object built from the extracted flow-level features and their associated packets is a heterogeneous graph with two node types and two edge types. The node types are:

1. Flow Node: Contains all flow-level statistical features.
2. Packet Node: Contains payload information transformed into byte-wise values.

The edge types are:

1. Contain Edge: Links each Flow Node to its Packet Nodes and carries some edge features.
2. Link Edge: Links Packet Nodes together, with the time delta (t-delta) as its attribute.
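
To make this structure concrete, the sketch below builds the four components for a single toy flow using plain Python containers. The dictionary keys (`flow_x`, `packet_x`, `contain_edges`, `link_edges`), the fixed payload length of 8 bytes, and the zero-padding are illustrative assumptions and not the layout used by `NIDSDataset`. Contain-edge features are omitted for brevity.

```python
def build_flow_graph_sketch(flow_features, packets, payload_len=8):
    """Build a toy heterogeneous graph for one flow.

    `packets` is a list of (timestamp, payload_bytes) tuples in arrival order.
    """
    # Packet nodes: each payload becomes a fixed-length list of byte values
    # (0-255), truncated or zero-padded to payload_len.
    packet_x = []
    for _timestamp, payload in packets:
        values = list(payload[:payload_len])
        values += [0] * (payload_len - len(values))
        packet_x.append(values)

    # Contain edges: the single flow node (index 0) connects to every packet.
    contain_edges = [(0, i) for i in range(len(packets))]

    # Link edges: consecutive packets, with the time delta as the attribute.
    link_edges = []
    for i in range(len(packets) - 1):
        t_delta = packets[i + 1][0] - packets[i][0]
        link_edges.append((i, i + 1, t_delta))

    return {
        "flow_x": [list(flow_features)],
        "packet_x": packet_x,
        "contain_edges": contain_edges,
        "link_edges": link_edges,
    }


graph = build_flow_graph_sketch(
    flow_features=[3, 120.0],  # e.g. packet count and total bytes
    packets=[(0.0, b"\x01\x02"), (0.5, b"\xff"), (2.0, b"")],
)
print(graph["packet_x"][0])  # [1, 2, 0, 0, 0, 0, 0, 0]
print(graph["link_edges"])   # [(0, 1, 0.5), (1, 2, 1.5)]
```

In a GNN library, each of these lists would typically be converted to tensors and stored under the corresponding node or edge type.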

In [None]:
## Dictionary that maps each traffic class to a numeric class label
Dict_x = {'Benign': 0 , 
          'WebBased': 1, 
          'Spoofing': 2,
          'Recon' : 3,
          'Mirai' : 4,
          'Dos' : 5,
          'DDos' : 6,
          'BruteForce': 7
         }

## Directory where the graph data will be stored
dir = "F:/GNN_Project/data/"
## CSV files (extracted flow-level and packet-level information) from which the graph data objects will be created
Files = glob.glob(Out_path+"/*.csv")
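
The renamed files start with the class name followed by a hyphen (for example `DDos-SYNFlood`), and the class names match the keys of `Dict_x`. The sketch below shows how a class index could be derived from such a file name. Whether `NIDSDataset` resolves labels this way internally is an assumption, and `label_from_filename_sketch` is a hypothetical helper.

```python
import os


def label_from_filename_sketch(path, label_dict):
    """Return the class index for a file named '<Class>-<Attack>...'.

    Hypothetical helper: the class name is taken as everything before
    the first hyphen in the file stem.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    class_name = stem.split("-", 1)[0]
    if class_name not in label_dict:
        raise KeyError(f"Unknown class prefix {class_name!r} in {path!r}")
    return label_dict[class_name]
```

Because the lookup is case-sensitive, a prefix such as `Webbased` would not match the key `WebBased`, so the renamed prefixes need to agree exactly with the dictionary keys.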


Since the CIC-IoT2023 dataset is large and its classes are imbalanced, we applied over- and under-sampling to obtain a balanced dataset that is easier to train on. For the pre-processing steps, refer to the notebook `Data_preprocessing_CIC-IoT2023.ipynb`.
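
The exact procedure is in the referenced notebook and is not reproduced here. As a generic illustration of random over- and under-sampling, the sketch below resamples every class to the same target size. The function name `balance_by_resampling` and the choice of a single shared `target` are assumptions made for this example.

```python
import random
from collections import defaultdict


def balance_by_resampling(samples, labels, target, seed=0):
    """Resample each class to exactly `target` items.

    Classes with more than `target` items are under-sampled without
    replacement; smaller classes are over-sampled with replacement.
    Returns a shuffled list of (sample, label) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    balanced = []
    for label, items in by_class.items():
        if len(items) >= target:
            chosen = rng.sample(items, target)
        else:
            chosen = items + rng.choices(items, k=target - len(items))
        balanced.extend((item, label) for item in chosen)
    rng.shuffle(balanced)
    return balanced
```

Over-sampling with replacement duplicates minority-class instances, so any train/test split should be made before resampling to avoid leaking duplicates into the test set.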



In [None]:
## Generation of graph data objects
data_Hetero = NIDSDataset(root=dir, label_dict=Dict_x, filename=Files, skip_processing=False, test=True, single_file=True)

PARAMETERS 

- **root** (`str`): Root directory where the graph objects should be saved.

- **label_dict** (`Dict`): Dictionary for assigning labels to each attack class.

- **filename** (`List[str]`): List of CSV file paths to be used for the development of graph objects.

- **skip_processing** (`bool`): If set to `True`, skips the generation of graph objects and utilizes the ones present in the root directory. (default: `False`)

- **test** (`bool`): If set to `True`, the generated data objects are given a test suffix so that they can be used for testing. (default: `False`)

- **single_file** (`bool`): If set to `True`, the provided CSV file is treated as a single file that contains a `Label` column. (default: `False`)
