# Pipeline For Computing Complete Payload Data

This pipeline lets users generate the complete dataset on their own. Keep the following points in mind before running it.

>1. Make sure you have enough free disk space before running this notebook: roughly 400 GB at minimum for storing the PCAP files and saving the results.
>2. This notebook is compatible with Python 3.7.13.
>3. The parser is based on the Scapy module. Make sure it is installed.
>4. Processing may require a large amount of RAM, so if you are low on resources, consider another method.
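
The requirements above can be checked before starting a long run. The following is a minimal sketch, not part of the pipeline: `check_environment` and its parameters are hypothetical names, and the 400 GB threshold is simply the rough figure quoted in the list.

```python
import importlib.util
import shutil
import sys


def check_environment(storage_path, min_free_gb=400, expected_python=(3, 7)):
    """Return a list of warnings; an empty list means every check passed."""
    warnings = []

    # Free space on the drive that holds `storage_path` (the path must exist)
    free_gb = shutil.disk_usage(storage_path).free / 1024 ** 3
    if free_gb < min_free_gb:
        warnings.append(
            f"Only {free_gb:.0f} GB free at {storage_path!r}; "
            f"about {min_free_gb} GB is recommended."
        )

    # The notebook states compatibility with Python 3.7.13
    if sys.version_info[:2] != expected_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        warnings.append(f"Running Python {found}; the notebook targets 3.7.13.")

    # The parser depends on Scapy being importable
    if importlib.util.find_spec("scapy") is None:
        warnings.append("Scapy is not installed (for example: pip install scapy).")

    return warnings
```

Calling `check_environment("D:/")` on the drive that will hold the PCAP files and results, and printing whatever it returns, gives a quick overview before executing the remaining cells.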


In [None]:
import os
import pandas as pd
import numpy as np
from datetime import datetime
from imblearn.under_sampling import RandomUnderSampler 
from sklearn.preprocessing import LabelEncoder
from functions.Pipeline import *

#### There are four inputs for the pipeline:

>1. In_directory (`in_dir`) = The directory where the PCAP files are stored. For UNSW there are two folders, whereas for CICIDS there are five individual files.
>2. Out_directory (`out_dir`) = The directory where you want the output of the tool to be stored.
>3. Dataset Name (`Dataset_name`) = `UNSW` or `CICIDS`.
>4. Processed CSV File (`processed_csv_file`) = The path to the combined and processed CSV file. To process the CSV files, navigate to the `CSV_data_preprocessing` folder.
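
Because a full run can take a long time, it may help to validate the four inputs first. The sketch below is only an illustration: `validate_inputs` and `SUPPORTED_DATASETS` are hypothetical names, and it makes no assumption about what `pipeline` itself checks internally.

```python
import os

# Dataset names listed as valid in the input description
SUPPORTED_DATASETS = ("UNSW", "CICIDS")


def validate_inputs(in_dir, out_dir, dataset_name, processed_csv_file):
    """Return a list of problems found in the inputs (empty if none)."""
    problems = []

    if dataset_name not in SUPPORTED_DATASETS:
        problems.append(
            f"Unsupported dataset name {dataset_name!r}; "
            f"expected one of {SUPPORTED_DATASETS}."
        )
    if not os.path.isdir(in_dir):
        problems.append(f"Input directory not found: {in_dir!r}")
    if not os.path.isfile(processed_csv_file):
        problems.append(f"Processed CSV file not found: {processed_csv_file!r}")
    # An existing file at `out_dir` would block writing results into it
    if os.path.exists(out_dir) and not os.path.isdir(out_dir):
        problems.append(f"Output path exists but is not a directory: {out_dir!r}")

    return problems
```

An empty list from `validate_inputs(in_dir, out_dir, Dataset_name, processed_csv_file)` means the paths and dataset name look plausible. It does not verify the contents of the PCAP or CSV files.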

In [None]:
in_dir='D:/UNSW'
out_dir="D:/UNSW_results"
Dataset_name='UNSW'
processed_csv_file="E:/UNSW-NB15 Dataset/UNSW-NB15-CSV-Files/Preprocessed-CSV/UNSW-NB15_processed.csv"

In [None]:
df=pipeline(in_dir,out_dir,Dataset_name,processed_csv_file)

In [None]:
df.attack_cat.value_counts()

## Undersampling Normal Data Instances

Since the number of normal instances is far higher than the number of attack instances, the normal instances are undersampled as described in the paper. If you do not want to reduce the number of instances, skip this step.

Alternatively, to undersample according to your own approach, change the per-class instance counts provided in `dict`.
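
When editing the counts, note that `RandomUnderSampler` typically raises a `ValueError` if a requested count exceeds the number of samples actually present for a class, or if a class in the dictionary does not occur in the labels. The helper below is a hypothetical sketch (the name `cap_sampling_strategy` is not part of the pipeline) that adjusts a requested dictionary to what the data can provide.

```python
from collections import Counter


def cap_sampling_strategy(labels, requested):
    """Cap requested per-class counts at the available counts.

    Classes that do not appear in `labels` are dropped, and classes
    missing from `requested` are left out of the result.
    """
    available = Counter(labels)
    return {
        cls: min(count, available[cls])
        for cls, count in requested.items()
        if cls in available
    }
```

For example, `cap_sampling_strategy(df.iloc[:, -1], dict)` would return a dictionary that can be passed as `sampling_strategy`, assuming the last column of `df` holds the class labels, as in the cells of this notebook.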



In [None]:
## For UNSW
dict = {
    'generic': 17580,
    'exploits': 13992,
    'fuzzers': 12722,
    'reconnaissance': 7562,
    'dos': 3397,
    'backdoor': 1239,
    'analysis': 1208,
    'shellcode': 1088,
    'normal': 21000,
    'worms': 93,
}

In [None]:
## For CICIDS
dict = {
    'BENIGN': 362108,
    'DoS Hulk': 250000,
    'DDoS': 241405,
    'DoS GoldenEye': 128122,
    'DoS slowloris': 121097,
    'Infiltration': 115007,
    'DoS Slowhttptest': 80542,
    'SSH-Patator': 48165,
    'FTP-Patator': 31843,
    'Heartbleed': 13486,
    'Web Attack – Brute Force': 11754,
    'Web Attack – XSS': 3341,
    'Bot': 2543,
    'PortScan': 830,
    'Web Attack – Sql Injection': 12,
}

In [None]:
X_res=df.iloc[:,:-1]
y_res=df.iloc[:,-1]

In [None]:
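# Undersample each class down to the counts given in `dict`
rus = RandomUnderSampler(random_state=42, sampling_strategy=dict)
X_res, y_res = rus.fit_resample(df.iloc[:, :-1], df.iloc[:, -1])

# Re-attach the labels and replace the original (larger) DataFrame
X_res['label'] = y_res
df = X_res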

## Transformation of Hex Valued Payload into Byte-Wise Integers

Transform the data into 1504 features, following the feature vector described in the paper.
Each feature is an integer and can be used to train machine learning models.
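
To illustrate the idea of a byte-wise transformation, the sketch below converts one hex-encoded payload into a fixed-length list of integers in the range 0 to 255. It is not the implementation of `transform`: the function name `hex_payload_to_bytes`, the `n_bytes` parameter, and the choice to truncate long payloads and zero-pad short ones are all assumptions made for this example.

```python
def hex_payload_to_bytes(hex_payload, n_bytes):
    """Convert a hex string into exactly `n_bytes` integers (0-255).

    Assumption: payloads longer than `n_bytes` are truncated and
    shorter ones are padded with zeros at the end.
    """
    raw = bytes.fromhex(hex_payload)[:n_bytes]
    return list(raw) + [0] * (n_bytes - len(raw))
```

With this sketch, `hex_payload_to_bytes("0aff10", 5)` gives `[10, 255, 16, 0, 0]`: the three payload bytes followed by two padding zeros.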

In [None]:
# Out_dir => directory for saving the transformed data
out_dir="D:/UNSW_results/"

df_t=transform(df,out_dir)