# Pipeline For Computing Complete Payload Data

This pipeline lets users generate the complete dataset on their own. Keep the following points in mind before running it.

>1. Make sure your hard drive has enough free space before running this notebook. You need approximately 400 GB or more for storing the PCAP files and saving their results.
>2. This notebook is compatible with Python 3.7.13.
>3. The parser is based on the Scapy module. Make sure it is installed.
>4. Processing may require a large amount of RAM, so if you are short on resources, try another method.
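
The prerequisites above can be checked before launching the pipeline. The following is a minimal sketch that uses only the standard library; `check_environment` is a hypothetical helper (not part of this repository), and the 400 GB threshold is simply the figure quoted in the list above.

```python
import importlib.util
import shutil
import sys

REQUIRED_FREE_GB = 400  # figure quoted in the prerequisites list


def check_environment(path=".", required_gb=REQUIRED_FREE_GB):
    """Hypothetical pre-flight check; it only reads system state."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= required_gb,
        # find_spec returns None when the top-level package is not importable
        "scapy_installed": importlib.util.find_spec("scapy") is not None,
    }


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

Point `path` at the drive that will hold `out_dir` if it differs from the current working directory.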


In [2]:
from imblearn.under_sampling import RandomUnderSampler
from Functions.Pipeline import pipeline
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder



#### The pipeline takes four inputs:

>1. In_directory (`in_dir`) = The directory where the PCAP files are stored. For UNSW there are two folders, whereas for CICIDS there are five individual files.
>2. Out_directory (`out_dir`) = The directory where you want the output of the tool to be stored.
>3. Dataset name (`Dataset_name`) = `UNSW` or `CICIDS`.
>4. Processed CSV file (`processed_csv_file`) = The path to the combined and processed CSV file. To produce it, go to the `CSV_data_preprocessing` folder. That preprocessing MUST BE RUN BEFORE the pipeline.
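
Because the pipeline run is long, it may be worth validating these inputs first. The sketch below is an illustration only: `validate_inputs` is a hypothetical helper, not a function from `Functions.Pipeline`. It leaves `out_dir` unchecked, since this notebook does not state whether the pipeline creates that directory itself.

```python
import os

VALID_DATASETS = ("UNSW", "CICIDS")


def validate_inputs(in_dir, dataset_name, processed_csv_file):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if not os.path.isdir(in_dir):
        problems.append(f"in_dir is not an existing directory: {in_dir}")
    if dataset_name not in VALID_DATASETS:
        problems.append(f"Dataset name must be one of {VALID_DATASETS}: {dataset_name}")
    if not os.path.isfile(processed_csv_file):
        problems.append(f"processed CSV file not found: {processed_csv_file}")
    return problems


# Example call with placeholder paths; replace them with your own values.
for problem in validate_inputs("./pcaps", "UNSW", "./UNSW-NB15_preprocessed.csv"):
    print("WARNING:", problem)
```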

In [5]:
in_dir = "/home/jovyan/wire/DataSets/UNSW-NB15/UNSW-NB15 - pcap files"
out_dir = "./UNSW_results"
Dataset_name = "UNSW"
processed_csv_file = "./UNSW-NB15_preprocessed.csv"

In [None]:
df = pipeline(in_dir, out_dir, Dataset_name, processed_csv_file)

In [10]:
df.attack_cat.value_counts()

normal            90159797
exploits           1760669
dos                 460985
generic             256395
fuzzers             171920
reconnaissance       51240
backdoor             11710
worms                10400
analysis              8107
shellcode             4080
Name: attack_cat, dtype: int64

In [8]:
df.protocol_m.value_counts()

tcp            89918983
udp             2879507
ospf              60499
arp               12673
sctp              10065
others             8410
icmp                923
pim                 372
any                 276
sep                 162
mobile              127
sun-nd              127
swipe               127
gre                  98
pup                  92
nvp                  92
unas                 92
ib                   92
sps                  92
pipe                 92
iplt                 92
crudp                92
sccopmce             92
crtp                 92
fire                 92
snp                  92
pnni                 92
gmtp                 92
encap                92
etherip              92
aes-sp3-d            92
micp                 92
ipip                 92
ax.25                92
larp                 92
dgp                  92
secure-vmtp          92
ip                   92
rvd                  92
egp                  92
rsvp                 92
ipv6            

## Undersampling Normal Data Instances

Since the number of normal instances is far higher than the number of attack instances, the normal instances are undersampled as described in the paper. If you do not want to reduce the number of instances, skip this step.

Alternatively, to undersample according to your own approach, change the per-class instance counts provided in `dict`.
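
To make the idea concrete, here is a small standard-library sketch of random undersampling with per-class target counts. It is not the `imblearn` implementation used in this notebook: `undersample` is a hypothetical function, and the toy labels and targets are invented for illustration. Unlike a strict implementation, it clamps a target that exceeds the available rows instead of raising an error.

```python
import random
from collections import Counter, defaultdict


def undersample(labels, target_counts, seed=42):
    """Return sorted row indices, keeping at most target_counts[label] rows per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    kept = []
    for label, indices in by_class.items():
        # Classes missing from target_counts are kept in full.
        n = min(target_counts.get(label, len(indices)), len(indices))
        kept.extend(rng.sample(indices, n))
    return sorted(kept)


# Toy data: one dominant class and two rare ones.
labels = ["normal"] * 90 + ["dos"] * 8 + ["worms"] * 2
targets = {"normal": 10, "dos": 8, "worms": 2}

kept = undersample(labels, targets)
print(Counter(labels[i] for i in kept))
# Counter({'normal': 10, 'dos': 8, 'worms': 2})
```

Only the majority class shrinks here, which mirrors the dictionaries in the cells that follow: every attack class keeps its original count, and only the normal (or `BENIGN`) class is reduced.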



In [11]:
## For UNSW
dict = {
    "normal": 2641003,
    "exploits": 1760669,
    "dos": 460985,
    "generic": 256395,
    "fuzzers": 171920,
    "reconnaissance": 51240,
    "backdoor": 11710,
    "worms": 10400,
    "analysis": 8107,
    "shellcode": 4080,
}

In [7]:
## For CICIDS
dict = {
    "BENIGN": 3328591,
    "DoS Hulk": 2219061,
    "DDoS": 618544,
    "SSH-Patator": 181147,
    "FTP-Patator": 110636,
    "Infiltration": 41725,
    "Heartbleed": 41283,
    "DoS GoldenEye": 34293,
    "Web Attack – Brute Force": 28920,
    "DoS slowloris": 20877,
    "DoS Slowhttptest": 9778,
    "Web Attack – XSS": 6767,
    "Bot": 5143,
    "PortScan": 946,
    "Web Attack – Sql Injection": 45,
}

In [12]:
rus = RandomUnderSampler(random_state=42, sampling_strategy=dict)
X_res, y_res = rus.fit_resample(df.loc[:, df.columns != "attack_cat"], df.loc[:, "attack_cat"])
X_res["attack_cat"] = y_res
df_reduced = X_res

In [None]:
df_reduced.attack_cat.value_counts()

## Transformation of Hex Valued Payload into Byte-Wise Integers

Transform the data into 1504 features (1500 payload bytes plus `sttl`, `total_len`, `t_delta`, and `protocol_m`), following the feature vector described in the paper.
Each feature is in integer form and can be used to train machine learning models.
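
The per-packet conversion can be sketched without NumPy as follows. `hex_to_padded_bytes` is a hypothetical helper written for illustration, and the hex string is a made-up example. Like the `resize` call used in the cells that follow, it truncates longer payloads to 1500 bytes and pads shorter ones with zeros.

```python
PAYLOAD_LEN = 1500  # number of payload byte features per packet


def hex_to_padded_bytes(hex_payload, length=PAYLOAD_LEN):
    """Decode a hex string into a fixed-length list of integers in 0..255."""
    raw = bytes.fromhex(hex_payload)[:length]      # truncate long payloads
    return list(raw) + [0] * (length - len(raw))   # zero-pad short payloads


row = hex_to_padded_bytes("0201002cc0a8")
print(row[:8])   # [2, 1, 0, 44, 192, 168, 0, 0]
print(len(row))  # 1500
```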

In [14]:
X = [np.array(bytearray.fromhex(y)) for y in df["payload"].to_numpy()]

In [15]:
for x in X:
    x.resize(1500, refcheck=False)

In [16]:
X = np.vstack(X)  # same result as np.row_stack, which is deprecated in NumPy 2.0

In [17]:
y = np.array(df["label"]).reshape((-1, 1))

In [18]:
le = LabelEncoder()
# Column names payload_byte_1 ... payload_byte_1500
name = [f"payload_byte_{i}" for i in range(1, 1501)]
final = pd.DataFrame(X, columns=name)
final[["sttl", "total_len", "t_delta"]] = df.loc[:, ["sttl", "total_len", "t_delta"]]
final["protocol_m"] = le.fit_transform(df["protocol_m"])

# Re-order the columns to move payload to the end
cols = final.columns.tolist()
cols = cols[-4:] + cols[:-4]
final = final[cols]
final["label"] = y

In [19]:
final

Unnamed: 0,sttl,total_len,t_delta,protocol_m,payload_byte_1,payload_byte_2,payload_byte_3,payload_byte_4,payload_byte_5,payload_byte_6,...,payload_byte_1492,payload_byte_1493,payload_byte_1494,payload_byte_1495,payload_byte_1496,payload_byte_1497,payload_byte_1498,payload_byte_1499,payload_byte_1500,label
0,1,64,0.0,26,2,1,0,44,192,168,...,0,0,0,0,0,0,0,0,0,0.0
1,1,64,0.0,26,2,1,0,44,192,168,...,0,0,0,0,0,0,0,0,0,0.0
2,1,64,6.0,26,2,1,0,44,192,168,...,0,0,0,0,0,0,0,0,0,0.0
3,1,64,0.0,26,2,1,0,44,192,168,...,0,0,0,0,0,0,0,0,0,0.0
4,1,64,6.0,26,2,1,0,44,192,168,...,0,0,0,0,0,0,0,0,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92895298,30,106,0.0,43,49,50,53,32,68,97,...,0,0,0,0,0,0,0,0,0,0.0
92895299,29,106,0.0,43,49,50,53,32,68,97,...,0,0,0,0,0,0,0,0,0,0.0
92895300,30,1500,0.0,43,114,104,198,131,58,255,...,0,0,0,0,0,0,0,0,0,0.0
92895301,30,1500,0.0,43,152,220,163,131,47,58,...,0,0,0,0,0,0,0,0,0,0.0


In [None]:
final.to_csv(out_dir + f"/{Dataset_name}_converted_data.csv", index=False)