## Data Preprocessing: CIC-IoT2023

This section covers the preprocessing of the CIC-IoT2023 dataset after its flow-level and packet-level features have been extracted with the `Feature_extractor_flow_packet_combined.py` script (extraction is shown in `GNN4ID.ipynb`). Because the CIC-IoT2023 dataset is very large and highly imbalanced, with some classes having very few instances, we undersampled and oversampled the data to balance the class sizes.

In [None]:
import glob, os, re
import pandas as pd
## Repeat this step for each individual class (the Recon files are shown here)
List_of_CSV_File = glob.glob("F:/CIC_IOT/Recon*")

For our preprocessing, we split the data 80:20: 80% is used for training and the remaining 20% for testing.
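
The split itself is not shown in this notebook. A minimal sketch, assuming a random row-level split with pandas (the toy DataFrame, its column names, and the `random_state` value are illustrative assumptions, not the authors' settings):

```python
import pandas as pd

# Toy stand-in for one class file; the real data has many more columns.
df = pd.DataFrame({"feature": range(10), "label": [0] * 10})

# Draw 80% of the rows at random for training (seed fixed for repeatability).
train_df = df.sample(frac=0.8, random_state=42)
# The rows that were not drawn form the 20% test set.
test_df = df.drop(train_df.index)

print(len(train_df), len(test_df))
```

Dropping the sampled index from the original frame guarantees that no row appears in both splits.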

To set the sampling targets, we first identified the class with the fewest samples, which in our case was the BruteForce Attack class with 2,336 samples. Using this as a reference point, we determined the undersampling rate for the other classes.

We oversampled the minority class by roughly 10x for the training data: the BruteForce Attack training split (80% of 2,336 samples, about 1,869) was increased to 20,000 samples. Consequently, we fixed the number of training samples for each class at 20,000. Depending on the class, we either undersampled or oversampled to reach this target.
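
The arithmetic behind these figures can be checked with a short sketch. The numbers come from the text above; the variable names and the `sampling_action` helper are hypothetical and only illustrate the decision rule:

```python
# Figures taken from the text above.
minority_total = 2336      # BruteForce samples before splitting
train_fraction = 0.8       # training share of the 80:20 split
train_target = 20_000      # training samples per class after resampling

minority_train = round(minority_total * train_fraction)
oversampling_factor = train_target / minority_train

def sampling_action(n_train: int, target: int = train_target) -> str:
    """Say whether a class needs oversampling or undersampling to hit the target."""
    if n_train < target:
        return "oversample"
    if n_train > target:
        return "undersample"
    return "keep"

print(minority_train)                   # 1869
print(round(oversampling_factor, 1))    # 10.7
print(sampling_action(minority_train))  # oversample
```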

The following steps outline the process of dividing the dataset into an 80:20 split and subsequently performing the necessary over/undersampling:

### Dataset Division & Filtering

For attack data only, we filter the flow instances by removing every flow that does not have an attacker MAC address as either its source or its destination.

Additionally, we limit each class to a maximum of 4,000 test samples and 20,000 training samples.

In [None]:
## Repeat this step for each file in List_of_CSV_File (index 0 is shown here)
df = pd.read_csv(List_of_CSV_File[0])
print("Size Before Filtering:", df.shape)

## MAC addresses of the attacker devices
attacker_macs = ['dc:a6:32:dc:27:d5', 'e4:5f:01:55:90:c4', 'dc:a6:32:c9:e4:ab', 'ac:17:02:05:34:27', 'dc:a6:32:c9:e5:a4', 'dc:a6:32:c9:e4:d5', 'dc:a6:32:c9:e5:ef', 'dc:a6:32:c9:e4:90', 'b0:09:da:3e:82:6c']
## True for flows whose source or destination is an attacker device
involves_attacker = df['src_mac'].isin(attacker_macs) | df['dst_mac'].isin(attacker_macs)

## Comment this line if Using for Benign Data.
df = df[involves_attacker]

## Comment this line if Using for Attack Data
# df = df[~involves_attacker]

## Dropping Extra Columns that we are not utilizing in our graph data object
df.drop(['src_ip','src_port','dst_ip','dst_port','ip_version'], axis=1, inplace=True)
df.drop(['bidirectional_bytes','bidirectional_first_seen_ms','bidirectional_last_seen_ms','bidirectional_duration_ms',
         'bidirectional_packets','src2dst_first_seen_ms','src2dst_last_seen_ms','dst2src_first_seen_ms','dst2src_last_seen_ms',
         'id','src_mac','src_oui','dst_mac','dst_oui','vlan_id','tunnel_id','bidirectional_syn_packets','bidirectional_cwr_packets',
         'bidirectional_ece_packets','bidirectional_urg_packets','bidirectional_ack_packets','bidirectional_psh_packets',
         'bidirectional_rst_packets','bidirectional_fin_packets'], axis=1, inplace=True)

## Keep the feature space constant by declaring the categories beforehand when creating the dummy variables.
df['expiration_id']=pd.Categorical(df['expiration_id'], categories=[0,-1])
df['protocol']=pd.Categorical(df['protocol'], categories=[1,2,6,17,58])
df=pd.get_dummies(df, prefix=['Exp','proto'], columns=['expiration_id', 'protocol'],dtype=int)
print("Size After Filtering:", df.shape)
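
The effect of declaring the categories up front can be seen on a toy example. This is a minimal sketch with made-up data; only the category list and the `proto` prefix are taken from the cell above:

```python
import pandas as pd

# Toy frame that contains only two of the five protocol numbers.
toy = pd.DataFrame({"protocol": [6, 17, 6]})

# Declare the full category set, including values absent from this frame.
toy["protocol"] = pd.Categorical(toy["protocol"], categories=[1, 2, 6, 17, 58])
encoded = pd.get_dummies(toy, prefix=["proto"], columns=["protocol"], dtype=int)

# One column per declared category, even for protocols that never occur.
print(list(encoded.columns))
# ['proto_1', 'proto_2', 'proto_6', 'proto_17', 'proto_58']
```

Without the `pd.Categorical` step, `get_dummies` would create columns only for the values present in each file, so different files could end up with different feature sets.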



In [None]:
## Cap the file at 10,000 randomly chosen rows; files with fewer rows keep all of them
df = df.sample(n=min(10000, df.shape[0]))
df.to_csv(List_of_CSV_File[0], index=False)

### Under/Over Sampling

In [1064]:
List_of_CSV_File = glob.glob("F:/GNN_Project/data/raw/test/Benign*") 
List_of_CSV_File

['F:/GNN_Project/data/raw/test\\Benign_0_test.csv',
 'F:/GNN_Project/data/raw/test\\Benign_1_test.csv',
 'F:/GNN_Project/data/raw/test\\Benign_2_test.csv',
 'F:/GNN_Project/data/raw/test\\Benign_3_test.csv']

In [None]:
# 4000 for test samples & 20000 for training samples
Number_instances_in_each_Overall_Class = 4000
Number_in_individaul_class =int(Number_instances_in_each_Overall_Class/len(List_of_CSV_File))

In [None]:
for files in List_of_CSV_File:
    df = pd.read_csv(files)
    name_file = files.split('\\')[-1]
    name_file = name_file.split('.')[-2]
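    ## Remove _test if performing for training samples.
    ## Note: name_file is not used below; the write at the end overwrites the input file in place.
    name_file = os.path.dirname(files)+'\\'+name_file+'_test.csv'
    if df.shape[0]<Number_in_individaul_class:
        ## Too few rows: oversample by duplicating rows up to the per-file target
        df = duplicate_rows(df,Number_in_individaul_class)
    else:
        ## Enough rows: undersample down to the per-file target
        fraction = Number_in_individaul_class/df.shape[0]
        df = df.sample(frac=fraction)
    df.to_csv(files, index=False)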
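
The `duplicate_rows` helper is called above but its definition does not appear in this section. A hypothetical stand-in, assuming it oversamples by repeating the existing rows until the requested count is reached (the authors' version may differ, for example by sampling with replacement):

```python
import pandas as pd

def duplicate_rows(df: pd.DataFrame, target_rows: int) -> pd.DataFrame:
    """Repeat the rows of df until it has exactly target_rows rows.

    Hypothetical sketch; the notebook's own helper is not shown.
    """
    if df.empty:
        raise ValueError("cannot oversample an empty DataFrame")
    # Number of full copies needed to reach or exceed the target (ceiling division).
    repeats = -(-target_rows // len(df))
    repeated = pd.concat([df] * repeats, ignore_index=True)
    # Trim the surplus so the result has exactly target_rows rows.
    return repeated.iloc[:target_rows]

small = pd.DataFrame({"x": [1, 2, 3]})
print(len(duplicate_rows(small, 10)))  # 10
```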

#### Combining Each Broad Class Samples

In [None]:
List_of_CSV_File = glob.glob("F:/GNN_Project/data/raw/test/Benign*") 
List_of_CSV_File

In [None]:
dataframes = [pd.read_csv(location) for location in List_of_CSV_File]
concatenated_df = pd.concat(dataframes, ignore_index=True)
## Getting Name of the Broad_Class
old_name_search = List_of_CSV_File[0].split('\\')[-1]
old_name_search = old_name_search.split('.')[-2]
old_name_search = re.sub(r'\d+', '', old_name_search)
old_name_search = old_name_search[:-1] if old_name_search.endswith("_") else old_name_search
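## Keep only the broad-class prefix when the name contains a hyphen
try:
    old_name_search = old_name_search.split('-')[-2]
except IndexError:
    ## No hyphen in the name: keep it unchanged
    pass
## Saving the file with the broad class name and a combined suffix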
concatenated_df.to_csv(os.path.dirname(List_of_CSV_File[0])+'\\'+old_name_search+'-Combined_test.csv',index=False)
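
The name derivation above can be traced on a few example paths. The sketch below mirrors the same steps in a function; the function name `broad_class_name` and the `Recon-PortScan_3.csv` and `Benign_0.csv` file names are hypothetical, and the path is split on both separators so the example runs on any operating system:

```python
import re

def broad_class_name(path: str) -> str:
    """Derive the broad class name from a file path, step by step."""
    # File name without its directory (handles both '\\' and '/').
    stem = path.replace("\\", "/").split("/")[-1]
    # Drop the extension.
    stem = stem.split(".")[-2]
    # Remove all digits, then one trailing underscore if present.
    stem = re.sub(r"\d+", "", stem)
    if stem.endswith("_"):
        stem = stem[:-1]
    # With a hyphen, keep the second-to-last part; otherwise keep the name.
    parts = stem.split("-")
    return parts[-2] if len(parts) >= 2 else stem

print(broad_class_name("data/raw/train\\Recon-PortScan_3.csv"))  # Recon
print(broad_class_name("data/raw/train\\Benign_0.csv"))          # Benign
print(broad_class_name("data/raw/test\\Benign_0_test.csv"))      # Benign__test
```

The last case shows that a `_test` suffix survives these steps, because only digits and a single trailing underscore are stripped.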

#### Labeling and Combining All Classes

In [None]:
## Combined data for each class (the test folder is shown here)
List_of_CSV_File = glob.glob("F:/GNN_Project/data/raw/test/*.csv") 

In [None]:
# Repeat for each combined class file (loaded as df, df1, ..., df7)
df = pd.read_csv(List_of_CSV_File[0])
# Assign the integer label for this class as per the label dictionary (0 for the first file)
df['Label'] = 0

In [None]:
df_complete_test = pd.concat([df,df1,df2,df3,df4,df5,df6,df7])
df_complete_test.to_csv('/scratch/user/yasir.ali/GNN_Project/data/df_8_class_test.csv', index=False)
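
The per-file labeling and the final concatenation can also be written as a single loop. This is a minimal sketch on toy data; the `label_map` dictionary, its class names, and its integer values are hypothetical, since the notebook's actual label dictionary is not shown:

```python
import pandas as pd

# Hypothetical class-to-label mapping (the real dictionary is not shown).
label_map = {"Benign": 0, "BruteForce": 1, "Recon": 2}

# Toy stand-ins for the per-class combined CSV files.
class_frames = {
    "Benign": pd.DataFrame({"feature": [0.1, 0.2]}),
    "BruteForce": pd.DataFrame({"feature": [0.3]}),
    "Recon": pd.DataFrame({"feature": [0.4, 0.5]}),
}

labeled = []
for class_name, frame in class_frames.items():
    frame = frame.copy()                      # leave the input frame untouched
    frame["Label"] = label_map[class_name]    # same label for every row of the class
    labeled.append(frame)

# Stack all classes into one frame with a fresh 0..n-1 index.
df_complete_test = pd.concat(labeled, ignore_index=True)
print(df_complete_test["Label"].tolist())  # [0, 0, 1, 2, 2]
```

Looping over a mapping avoids the numbered variables `df1` to `df7` and keeps each label next to its class name.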