
  
***
***
# Written by Cooper Coldwell, June 23 2022
This code's purpose is to read in '.pcapng' files from 3 sources--Normal-1UE, Normal-2UE, and Attacks--and parse the data to use for machine learning model training.  
## Dataset Explanation
### Normal-1UE
The Normal-1UE sets represent normal 5G network traffic data collected on a simulated 5G Core connected to another computer simulating a Radio Access Network (RAN) connected to a single User Equipment (UE, basically a 5G-capable device like a cellphone). Within the Normal-1UE directory are log files--containing the terminal logs for each Network Function (NF, the components of the 5G network)--and '.pcapng' files containing the captured 5G network packets.  
The network traffic consisted of YouTube streaming, HTTP requests to popular websites, and data transfers to and from  FTP and SAMBA servers.
### Normal-2UE
The Normal-2UE captured data is very similar to the Normal-1UE data except with two simulated UEs. The network traffic was of the same type but divided between the two UEs. The goal here was to introduce more 'network regulation'-type data that was very weakly represented in the 1UE. Consider the following scenario:
> A physical 5G network: a user with a 5G cellphone is moving, so the connection strength between the user and cell tower A weakens while connection strength to tower B is increasing. The network would detect this and make decisions whether to end the user's session with A and begin another with B.  

With two UEs, we hope to see more of these types of intra-network communication packets.
### Attacks
The Attacks data were captured by executing 5G-specific attacks against the 5G Core from within the 5G Core, i.e. a Bad Actor has gained access to the Core and is mucking around. There is very little internet traffic in this set because the attacks were run while the simulated UEs were idle. There might be some incidental traffic, but not much.
## Data Handling
The data is saved across many files. For the normal data, we are pulling the data from the 'allcap\*.pcapng' files, which contain the combined data from all the network interfaces we recorded on; the allcap files represent the sum total of all the traffic inside the 5G Core as well as the data between the RAN and Core.
When examining the captured packets with Wireshark and Scapy, we discovered that the packet layers containing the attacks were labelled as 'Raw' by Scapy, so we decided to discard the other layers. To convert the packets to a format usable for training ML models, this notebook performs the following:
1. Read in the files with Scapy
2. Convert the raw bytes for each to a string
3. Add each successive packet to an array containing the other packets of the same classification (Normal-1UE, Normal-2UE, Attack)
4. Combine subsets of the processed sets together to create a set containing normal data of both varieties and another set that is 50% attack, 50% normal. The packets in the mixed normal-and-attacks set are labelled according to whether they are normal or attack.
    - These labels are not important for our training, because we use unsupervised learning to train a variational autoencoder on the normal data, but the labelled data is useful for comparing how well the VAE can differentiate between attacks and normal traffic.
5. Shuffle each set, then normalize the length of each string of bytes
6. Convert the strings of bytes to an array of bytes
7. Save the datasets
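
The steps above can be sketched in miniature with only the standard library. This is a toy illustration, not the notebook's actual pipeline: the payloads, the `to_hex` helper, and `TARGET_LEN` are made up for the example, and plain lists stand in for the DataFrames and numpy arrays used later.

```python
import binascii
import random

# Hypothetical stand-ins for the 'Raw' payloads that Scapy would return.
normal_payloads = [b"\x34\xff\x00\xc0", b"hello world", b"\x00\x01"]
attack_payloads = [b"GET /nnrf-disc/v1/nf-instances HTTP/1.1\r\n"]

def to_hex(payload: bytes) -> str:
    """Step 2: render raw bytes as a hex string (two characters per byte)."""
    return binascii.hexlify(payload).decode("ascii")

# Steps 3-4: collect the packets per class, then build a labelled mixed set.
mixed = [(to_hex(p), "normal") for p in normal_payloads]
mixed += [(to_hex(p), "attack") for p in attack_payloads]

# Step 5: shuffle, then pad with '0' or truncate to a fixed length.
random.Random(0).shuffle(mixed)
TARGET_LEN = 16  # illustrative only; the notebook uses a much larger value
fixed = [(h.ljust(TARGET_LEN, "0")[:TARGET_LEN], label) for h, label in mixed]

# Step 6: turn each fixed-length string into a list of byte values.
arrays = [list(h.encode("ascii")) for h, _ in fixed]
labels = [label for _, label in fixed]

print(arrays[0], labels[0])
```

Keeping the labels in a separate list mirrors how the mixed set is handled later: the model only sees the byte arrays, and the labels are kept aside for evaluation.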

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# import cupy as cp
import numpy as np
import pandas as pd
# import cudf as cd

import os, sys
from glob import glob
import binascii
import csv
import pickle
from scapy.all import *
from pathlib import Path
from tqdm.auto import tqdm

### Set directory paths pointing towards the datasets
The *processedPath* variable points to where the output files will be written. The *path\**  variables point to the data sources.

In [None]:
pathToNormal = 'Normal-1UE/'
pathToNormal2UE = 'Normal-2UE/'
pathToAttack = 'Attacks/'
processedPath = 'NEW-PREPPED-DATA/'
os.makedirs(processedPath, exist_ok=True)  # create the directory that processedPath actually points to

# Loading data from the .pcapng files

## Let's first look at the structure of packets:

In [None]:
example = rdpcap(pathToAttack+'AMFLookingForUDM/allcap_AMFLookingForUDM_00001_20220609151247.pcapng')

In [None]:
example[6].show()

###[ Ethernet ]### 
  dst       = 00:00:00:00:00:00
  src       = 00:00:00:00:00:00
  type      = IPv4
###[ IP ]### 
     version   = 4
     ihl       = 5
     tos       = 0x0
     len       = 197
     id        = 22163
     flags     = DF
     frag      = 0
     ttl       = 64
     proto     = tcp
     chksum    = 0xe594
     src       = 127.0.0.1
     dst       = 127.0.0.10
     \options   \
###[ TCP ]### 
        sport     = 37364
        dport     = irdmi
        seq       = 3683835274
        ack       = 4293697932
        dataofs   = 8
        reserved  = 0
        flags     = PA
        window    = 512
        chksum    = 0xfec2
        urgptr    = 0
        options   = [('NOP', None), ('NOP', None), ('Timestamp', (3708803956, 3529532140))]
###[ Raw ]### 
           load      = 'GET /nnrf-disc/v1/nf-instances?requester-nf-type=AMF&target-nf-type=UDM HTTP/1.1\r\nHost: 127.0.0.10:8000\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\n\r\n'



What Scapy shows as 'Raw' for this packet is everything after the IP and TCP headers, which turns out to be HTTP.  

This specific packet is an attack packet that pretends to be the AMF network function asking for information about the UDM network function. The attack itself is contained in the HTTP data. All of our attacks occur in HTTP or PFCP data; luckily for us, Scapy labels those portions as 'Raw'. The IP and TCP headers aren't part of the attack, but they might tip off the model based on commonalities between the attacks, so we will strip off those layers and only keep the 'Raw' portion.

## Open the Normal-1UE data and append it all together

### Close any running tqdm instances
The `tqdm` library provides a handy progress bar. The below section of code is only useful if you're rerunning cells in Jupyter because Jupyter maintains variables in memory, so rerunning a cell can open new instances of `tqdm`, causing the progress bar to not update in-line.

In [None]:
while len(tqdm._instances) > 0:
    tqdm._instances.pop().close()
print("Made it past clearing instances")

Made it past clearing instances


The Normal-1UE data is spread across several 'allcap*' files, so we need to iterate through the files, process them with Scapy, and combine the data into one array.
- we gather a list of .pcapng files (in the Normal-1UE directory) starting with 'allcap' using the `glob` function
- the `sniff` function is a Scapy method for reading capture files. Another possible method to use is `rdpcap`, but I found sniff to be faster for large sets.
- The Raw data output by Scapy is ugly, and not especially useful in its initial form. It will look like individual bytes represented in hexadecimal and separated by '\x'
    - To remedy this, we use `binascii.hexlify`, which converts each byte of the binary output of sniff() to its 2-digit hex representation, which is output as a byte string.
    
**NOTE: Reading in these pcapng files is not a quick process, so expect this section to take 10+ minutes with a decently fast CPU**

In [None]:
datasets = glob(pathToNormal+'allcap*.pcapng')
print(datasets)
payloads = []
for file in tqdm(datasets):
    pcap = sniff(offline=str(file))
    for packet in pcap:
        if not Raw in packet:
            continue
        payload = binascii.hexlify(packet[Raw].original)
        payloads.append(payload)
print(len(payloads))

['Normal-1UE/allcap_00006_20220607091008.pcapng', 'Normal-1UE/allcap_00003_20220606211007.pcapng', 'Normal-1UE/allcap_00001_20220606131007.pcapng', 'Normal-1UE/allcap_00002_20220606171007.pcapng', 'Normal-1UE/allcap_00005_20220607051008.pcapng', 'Normal-1UE/allcap_00001_20220606102554.pcapng', 'Normal-1UE/allcap_00004_20220607011007.pcapng']


  0%|          | 0/7 [00:00<?, ?it/s]

9339618


### Add labels to the data and save it as a CSV
We take the payloads pulled from the pcap files and put them into a `pandas` DataFrame. The DataFrame is convenient for both shuffling the data (done with `.sample(frac=1)`) and writing it to a CSV. Before we write the payloads to a CSV, we add a "label" column filled with 'normal' to simplify creating a mixed set later. The CSV makes the payloads human-readable in a way that a pickled or numpy-saved file would not be.

In [None]:
data = {'raw':payloads}
df = pd.DataFrame(data=data).sample(frac=1).reset_index(drop=True)
df.loc[:,'label'] = 'normal'
df.to_csv(f"{processedPath}normal_data.csv", index=False)
print(df.head(5))

                                                 raw   label
0  b'5497e16da1b9130133e5e732e67c0047910ac5fcee6b...  normal
1  b'5445177f2e2747f98c8e1bbf79528d030a242a515814...  normal
2  b'4eccbad439a0903d62084b152dc51a3e9c21bdc05ef5...  normal
3  b'34ff00c0000000010000008501000900456000b80000...  normal
4  b'0a061f1bbb205a0a14a33dc41faa033103ab6cf44fc6...  normal


## Open the 2UE normal data and append it together
The process used to handle the Normal-1UE data applies here as well, with a notable exception: speed.  
**Reading in the 2UE files is MUCH slower than the 1UE files because 2UE has roughly 25M packets vs. 1UE's 9M.**

In [None]:
# Close tqdm instances:
while len(tqdm._instances) > 0:
    tqdm._instances.pop().close()
print("Made it past clearing instances")

datasets = glob(pathToNormal2UE+'allcap*.pcapng')
payloads = []
for file in tqdm(datasets):
    pcap = sniff(offline=str(file))
    for packet in pcap:
        if not Raw in packet:
            continue
        payload = binascii.hexlify(packet[Raw].original)
        payloads.append(payload)
print(len(payloads))

Made it past clearing instances
24851445


The 2UE data is ***massive***, so it's important to save at this point to avoid accidental loss. We experienced memory overloads, which crashed the program while trying to save to either a .npy or CSV (though, when the crashes occurred, we were running the notebook cells out of order. YMMV). I discovered that saving as a pickle file used up less memory and helped to avoid crashes. *You don't want to crash before saving and have to rerun the 2 hour processing time.*

In [None]:
with open('2ue.p','wb') as file:
    pickle.dump(payloads,file)

In [None]:
with open('2ue.p','rb') as file:
    payloads = pickle.load(file)

In [None]:
data = {'raw':payloads,'label':['normal']*len(payloads)}
# print(data['label'][0])
df = pd.DataFrame(data=data).sample(frac=1).reset_index(drop=True)
df.to_csv(f"{processedPath}normal_data_2ue.csv", index=False)

## Open the malicious data and append it all together
The total data collected while running the attacks is much smaller than the collected normal data. The size of the isolated attack data is even smaller because we used Wireshark to filter out and export the packets performing the attacks. The filtered pcap files are labelled beginning with "Attacks_".  
Also of note is that each attack is within its own subdirectory of the Attacks directory. The folders are named for the attack type, and each pcap file is also named for the attack type.
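
To make the folder layout concrete, here is a small sketch that mimics it in a temporary directory and picks one 'Attacks*.pcapng' file per subfolder. The `find_attack_captures` helper and the file names are hypothetical; the notebook does the same job with `os.listdir` and `glob`.

```python
import tempfile
from pathlib import Path

def find_attack_captures(attack_root):
    """Return one 'Attacks*.pcapng' path per subfolder, plus the folders that have none."""
    found, missing = [], []
    for sub in sorted(Path(attack_root).iterdir()):
        if not sub.is_dir():
            continue
        matches = sorted(sub.glob("Attacks*.pcapng"))
        if matches:
            found.append(matches[0])
        else:
            missing.append(sub.name)
    return found, missing

# Build a throwaway tree that imitates the described layout (names are illustrative).
with tempfile.TemporaryDirectory() as root:
    attack_dir = Path(root) / "AMFLookingForUDM"
    attack_dir.mkdir()
    (attack_dir / "Attacks_AMFLookingForUDM.pcapng").touch()
    (Path(root) / ".ipynb_checkpoints").mkdir()

    found, missing = find_attack_captures(root)
    found_names = [p.name for p in found]

print(found_names, missing)
```

Returning the folders without a match, rather than only printing them, makes it easy to spot stray directories such as Jupyter's checkpoint folder.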

<!-- one packet attack, contents of packet trigger attack
run multiple times in capture
rest of packet is normal traffic -->

In [None]:
## Remove previously-used variables from memory if they exist. This helps to reduce memory usage, and perhaps equally as important, prevent variables remaining in memory from causing unintended behavior.
## This step isn't important if the notebook is run sequentially, but in our workflow, we would re-run certain sections as needed.
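# Drop each name only if it exists; a single `del a, b, c` stops at the first missing name.
for name in ('datasets', 'dataset', 'payload', 'payloads', 'data', 'df'):
    globals().pop(name, None)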

sets = []
# print(os.listdir(pathToAttack))
for i in os.listdir(pathToAttack):
    dataset = glob(pathToAttack+i+'/Attacks*.pcapng')
    if dataset:
        sets.append(str(dataset[0]))
    else:
        print("Failed to find 'Attacks*.pcapng' file in folder: ", str(pathToAttack+i))

# print(sets)
payloads = []
for file in sets:
    pcap = sniff(offline=str(file))

    for packet in pcap[Raw]:
        if not Raw in packet:
            continue
        payload = binascii.hexlify(packet[Raw].original)
        payloads.append(payload)
    # print(file,len(payloads)
print(len(payloads))

Failed to find 'Attacks*.pcapng' file in folder:  Attacks/.ipynb_checkpoints
24174


In [None]:
data = {'raw':payloads}
df = pd.DataFrame(data=data)
df.loc[:,'label'] = 'attack'
df.to_csv(f"{processedPath}malicious_data.csv", index=False)

try:
    del dataset, payload, payloads, data, df
except NameError:
    pass

## Import the data from the CSVs
Using cuDF and cuPy should increase the processing speed (by orders of magnitude) over using pandas and numpy because these new libraries use Nvidia CUDA cores for the processing. The documentation says cuDF and cuPy should implement most methods from pandas and numpy, but I had difficulty using the CUDA accelerated libraries by importing them under the same alias as pandas and numpy.

The issue I encountered was cuDF and cuPy expecting *very specific* data-types as function parameters, which I unsuccessfully tried to provide. You, the reader, may be able to figure it out if it piques your interest.

Back to pandas and numpy...  
### Importing CSVs...

In [None]:
# import cudf as pd
# import cupy as np

In [None]:
normal = pd.read_csv(f"{processedPath}normal_data.csv")
normal2UE = pd.read_csv(f"{processedPath}normal_data_2ue.csv")
malicious = pd.read_csv(f"{processedPath}malicious_data.csv")

In [None]:
print('Normal: ')
normal.head(4)

Normal: 


Unnamed: 0,raw,label
0,b'5497e16da1b9130133e5e732e67c0047910ac5fcee6b...,normal
1,b'5445177f2e2747f98c8e1bbf79528d030a242a515814...,normal
2,b'4eccbad439a0903d62084b152dc51a3e9c21bdc05ef5...,normal
3,b'34ff00c0000000010000008501000900456000b80000...,normal


In [None]:
print('Normal-2UE: ')
normal2UE.head()

Normal-2UE: 


Unnamed: 0,raw,label
0,b'34ff027d000000010000008501100900450002750000...,normal
1,b'591d7435daee582a77fab1fbf19331c573956854543c...,normal
2,b'34ff00440000000100000085010009004580003c50dc...,normal
3,b'34ff009100000001000000850110090045000089125e...,normal
4,b'34ff0030000000010000008501100900452000280000...,normal


In [None]:
print('Malicious: ')
malicious.head(4)

Malicious: 


Unnamed: 0,raw,label
0,b'474554202f6e6e72662d646973632f76312f6e662d69...,attack
1,b'474554202f6e6e72662d646973632f76312f6e662d69...,attack
2,b'474554202f6e6e72662d646973632f76312f6e662d69...,attack
3,b'474554202f6e6e72662d646973632f76312f6e662d69...,attack
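
The hex strings are easy to sanity-check by decoding them. A quick sketch using only the visible prefix of the first malicious row above (the rest is elided in the printout, so only the prefix is decoded):

```python
import binascii

# Visible prefix of the first malicious row; the remainder is cut off in the table.
hex_prefix = "474554202f6e6e72662d646973632f76312f6e662d69"

decoded = binascii.unhexlify(hex_prefix)
print(decoded)  # b'GET /nnrf-disc/v1/nf-i'
```

This lines up with the start of the HTTP request seen in the example attack packet earlier.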


## Create new sets from the old for training models
We want to have a set that is 50% attacks, 50% normal and a set of the two types of normal traffic. Let's look at the size of the sets so we can determine how best to make the 25-25-50 (1UE-2UE-Attack) dataset.

In [None]:
print(f'Normal size: {normal.shape}')
print(f'Normal2UE size: {normal2UE.shape}')
print(f'Malicious size: {malicious.shape}')

Normal size: (9339618, 2)
Normal2UE size: (24851445, 2)
Malicious size: (24174, 2)


### Create a mixed set of both attack and normal
We want a 50/50 split of normal/attack data, and the malicious set is significantly smaller than either of the normal sets. Therefore, we take **all** of malicious and then half as many samples each from Normal-1UE and Normal-2UE. To avoid some kind of data bias, normal and normal2UE are shuffled before sampling.

Also, delete variables from memory as we go to avoid crashes.

In [None]:
mixed = malicious.sample(frac=1,random_state=100) #take all the malicious
mixed = pd.concat([mixed, normal.sample(frac=1,random_state=100)[0:len(malicious)//2]]) #append the first {half the length of malicious} packets from normal-1ue
mixed = pd.concat([mixed, normal2UE.sample(frac=1,random_state=100)[0:len(malicious)//2]]) #append the first {half the length of malicious} packets from normal-2ue
mixed = mixed.sample(frac=1,random_state=1) #shuffle the data before processing
## Separate the labels (important for using the mixed data to evaluate an autoencoder)
mixed_labels = mixed.pop('label')
np.save(f'{processedPath}mixed_labels.npy',mixed_labels)
del mixed_labels
print('Packets in malicious: ',len(malicious))
print('Packets in mixed: ',len(mixed))
print('Mixed set is of the expected size: ',len(malicious)*2==len(mixed))

Packets in malicious:  24174
Packets in mixed:  48349
Mixed set is of the expected size:  False


## Normalize the packet lengths and reshape each packet's string of bytes to an array of bytes
- The length of the payloads can vary widely, from a few bytes to several thousand bytes. I checked a few dozen attack packets, and those usually weren't much longer than about 1000 bytes (+/- 20%). We have to use a square number for the length because our FPGAs don't like performing convolutions unless the inputs are square, i.e. 10x10, 25x25, 32x32, etc. If this is not desired, set the `reshape` argument to `False`.
    - to normalize the payload length, append zeros to the ends of packets shorter than the desired size and truncate longer packets to the desired size
    - to convert from byte string to byte array, we use the numpy function `frombuffer`
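
A small sketch of the padding, truncation, and `frombuffer` steps on toy inputs. The `normalize_hex` helper and the sample strings are made up for illustration. Note that the resulting values are the ASCII codes of the hex characters (52 for '4', 97 for 'a', 48 for the '0' padding), not the original byte values.

```python
import numpy as np

def normalize_hex(hex_str: str, length: int) -> np.ndarray:
    """Pad with '0' characters or truncate to `length`, then view the result as uint8."""
    fixed = hex_str.ljust(length, "0")[:length]
    return np.frombuffer(fixed.encode("ascii"), dtype=np.uint8)

short = normalize_hex("4aa", 16)      # shorter than 16, so it gets padded
long_ = normalize_hex("f" * 40, 16)   # longer than 16, so it gets truncated
print(short)

# With a square length such as 1024, the vector can also be viewed as a 32x32 grid.
square = normalize_hex("34ff00c0", 1024).reshape(32, 32)
print(square.shape)
```

Because each hex character becomes one array element, a length of 1024 as used here corresponds to the first 512 bytes of the original payload.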

#### Declare the desired, normalized size for the packets:

In [None]:
max_packet_length = 1024

In [None]:
def ReshapePackets(dataFrame, saveToFilename, max_packet_length, reshape=True):
    '''Converts from byte strings in a DataFrame to a numpy array of bytes'''
    array = np.array(dataFrame['raw'])
    array = np.ascontiguousarray(array)
    payloads = []
    array.shape
    for i in range(array.shape[0]):
#         print(array[i])
        # Standardize the length of the strings:
        payloadStr = array[i].split('\'')[1]
        payloadStr = payloadStr.ljust(max_packet_length+2, u'0')
        payloadStr = payloadStr[0:max_packet_length]
        array[i] = payloadStr.encode('utf8')
        # Convert to array:
        array[i] = np.frombuffer(array[i],dtype=np.uint8,count=max_packet_length)
        if reshape:
            payloads.append(np.reshape(array[i],(array[i].shape[0],1,1)))
        else:
            payloads.append(array[i])
    payloads = np.array(payloads)
    print('New data shape: ',payloads.shape)
    np.save(saveToFilename,payloads)

### Normalize and reshape the mixed data
Also delete it to free memory

In [None]:
ReshapePackets(mixed,f'{processedPath}mixed.npy',max_packet_length)
del mixed

New data shape:  (48349, 1024, 1, 1)


### Create a 50/50 split of the two types of normal data:
As before, delete the variables after we're done with them

In [None]:
totalNormal = pd.concat([normal.sample(frac=1,random_state=2022),
                         normal2UE.sample(frac=1,random_state=100)[0:len(normal)]
                         ])
totalNormal = totalNormal.sample(frac=1,random_state=2022)
ReshapePackets(normal,f'{processedPath}normal.npy',max_packet_length)
del normal
ReshapePackets(normal2UE,f'{processedPath}normal2UE.npy',max_packet_length)
del normal2UE
ReshapePackets(totalNormal,f'{processedPath}total_normal.npy',max_packet_length)
del totalNormal

New data shape:  (9339618, 1024, 1, 1)
New data shape:  (24851445, 1024, 1, 1)
New data shape:  (18679236, 1024, 1, 1)


In [None]:
mixed = np.load(f'{processedPath}mixed.npy',allow_pickle=True)
labels = np.load(f'{processedPath}mixed_labels.npy',allow_pickle=True)
print(mixed[0:5][1])
print(labels[0:5])

[[[52]]

 [[97]]

 [[97]]

 ...

 [[48]]

 [[48]]

 [[48]]]
['normal' 'normal' 'normal' 'attack' 'normal']
