# PCAP Clustering
## Data Generation
### Introduction
We used a 3GPP 5G lab to generate sample test traffic for machine learning experiments. The 3rd Generation Partnership Project (3GPP) is a partnership that unites seven telecommunications standards development organizations from around the world. Since the development of the third-generation (3G) mobile network standards, it has spearheaded standards development in this area. 5G is the technology standard defined by 3GPP starting with Release 15, which was fully specified by September 2019. Since mid-2019, numerous 5G networks have been deployed around the world, and as of 4Q2023 over 1.1B subscribers are using them.

One characteristic of 5G is the separation of network functionality between the control plane (signaling) and the user plane (packet forwarding). In addition, 5G uses the HTTP/2 protocol extensively: multiple network functions communicate with each other over a common service bus. 5G adopts recent improvements in system architecture and calls for network functions to be implemented in a containerized format. This has allowed these network functions to be implemented as open-source containers distributed for experimentation and testing. Examples of such open-source projects include [open5gs](https://github.com/open5gs) and [UERANSIM](https://github.com/aligungr/UERANSIM).

### Topology
For testing we used the following lab topology where various network functions as well as User Equipment (UE) were implemented as containers.

<img src="https://raw.githubusercontent.com/b-yond-infinite-network/sharkfest-europe-2023/main/assets/LAAS-Network-Slicing.png"> 



### Test Traffic
Test traffic was generated with the traffic generator [fortio](fortio), which can generate numerous types of synthetic traffic, including DNS, HTTP, TCP, UDP and gRPC. Fortio was deployed as a traffic server and was also included in the container image of the UE, where the fortio client generates test traffic towards the fortio server.

To make the test-bed traffic more diverse, we used the impairment tool [pumba](https://github.com/alexei-led/pumba). Pumba can generate process, network and performance impairments. Process impairments consist of pausing, stopping, killing and removing containers; we did not use these capabilities in this setup. Instead, we relied on pumba's network emulation capabilities, which provide the following network impairments:

* delay

* loss

* duplicate

* corrupt

* rate-limit

In addition to fortio, we used ICMP ping from the UE towards the fortio server to create test traffic. Traffic was captured on four Linux bridges simultaneously so that all control plane and user plane traffic ends up in a single pcap file.

```
$ sudo docker network list
NETWORK ID     NAME                   DRIVER    SCOPE
a3bb3f0b4bdc   bridge                 bridge    local
1dbfe33d5a4f   host                   host      local
6a5d4c99167f   laas-5gsa-docker_cp    bridge    local
cc7d4568c741   laas-5gsa-docker_oam   bridge    local
4dec7fb20e13   laas-5gsa-docker_sbi   bridge    local
30dc0e1ba2f9   laas-5gsa-docker_up    bridge    local
b84e483fa615   none                   null      local
tshark -l -i br-6a5d4c99167f -i br-30dc0e1ba2f9 -i br-cc7d4568c741 -i br-4dec7fb20e13 -w <filename.pcap>    
```
Traffic generation was initiated with the following commands (the example below is for TCP echo, where 100 requests were sent at a rate of 1 request per second):
```
ip route add 100.0.0.2 via 10.46.0.2 dev uesimtun0
fortio load -qps -1 -n 100 tcp://100.0.0.2:8078
```

To introduce impairments, the following commands were applied to the user plane function (UPF) container on its eth0 interface, which connects it to the fortio server.
```
pumba netem --duration 5m --interface eth0 delay --time 300 --jitter 30 --correlation 50 --distribution normal core_upf
pumba netem --duration 5m --interface eth0 loss --percent 50 --correlation 50 core_upf
pumba netem --duration 5m --interface eth0 rate --rate 10kbit core_upf
pumba netem --duration 5m --interface eth0 duplicate --percent 50 --correlation 50 core_upf
pumba netem --duration 5m --interface eth0 corrupt --percent 50 --correlation 50 core_upf
```

<img src="https://raw.githubusercontent.com/b-yond-infinite-network/sharkfest-europe-2023/main/assets/5g-laas-setup.png" width="70%">



To extract packet fields from a pcap into a CSV file:


```
tshark -r [file.pcap]  -T fields -e frame.number -e frame.interface_id -e frame.len -e frame.protocols -e frame.time_delta -e ip.hdr_len -e ip.len -e ip.proto -e ip.ttl -e ip.version -E aggregator="$" -E separator=";" -E header=y > data.csv
```
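
The preprocessing code later in this notebook expects a `file` column identifying the capture each packet came from, which the `tshark` command above does not produce. How that column was added is not shown here. One possible approach, sketched below under the assumption that each pcap is exported to its own CSV, is to tag each export with its file name and concatenate the results. The file names and packet values are made up for illustration, and in-memory strings stand in for the exported files.

```python
import io
import pandas as pd

# Hypothetical excerpts of tshark exports (values are invented for illustration).
# They mimic the ';' separator, the '$' aggregator and the header row requested above.
exports = {
    "delay.pcap": (
        "frame.number;frame.len;frame.protocols;ip.ttl\n"
        "1;98;eth:ethertype:ip:icmp;64\n"
        "2;142;eth:ethertype:ip:udp:gtp:ip:icmp;64$63\n"
    ),
    "loss.pcap": (
        "frame.number;frame.len;frame.protocols;ip.ttl\n"
        "1;74;eth:ethertype:ip:tcp;64\n"
    ),
}

frames = []
for name, text in exports.items():
    # dtype=str keeps aggregated values such as '64$63' intact for later cleaning
    part = pd.read_csv(io.StringIO(text), sep=";", dtype=str)
    part.insert(0, "file", name)  # remember which capture each packet came from
    frames.append(part)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With real exports, `io.StringIO(text)` would be replaced by the path of each CSV file.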


## Objectives
The following procedure shows how to apply an unsupervised approach (hierarchical clustering) to group pcap files.


### Verify runtime environment

In [None]:
try:
    import google.colab
    IN_COLAB = True
    # Load the autoreload extension for IPython
    %load_ext autoreload
    # Reload all modules before executing code, so that changes made to code in the src folder are reflected in the running code
    %autoreload 2
    %pip install scikit-learn==1.3.1
except ImportError:
    # google.colab can only be imported when running inside Google Colab
    IN_COLAB = False

### Basic installations and imports

In [None]:
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
import os

### Function to preprocess the data
For each packet:
* use one-hot encoding for protocols
* create an index from the file names
* clean nested data, e.g. `34$45` -> `34`
* fill missing values with a default value (`-1`)

In [None]:
def encode_protocols(df, colname):
    # One-hot encode the ':'-separated protocol stack (one 0/1 column per protocol)
    protocols_df = df[colname].str.get_dummies(sep=':')

    data_with_protocols = pd.concat([df, protocols_df], axis=1)

    return data_with_protocols.drop(colname, axis=1)


def create_index(df):
    # Index each packet by the name of the capture file it belongs to
    df.index = df['file'].astype(str).to_numpy()
    df.drop(['file', 'frame.number'], axis=1, inplace=True)
    return df

def clean_nested(df):
    # Fields that may hold several '$'-separated values (e.g. tunneled packets)
    nested_cols = ['ip.hdr_len', 'ip.len', 'ip.proto', 'ip.ttl', 'ip.version']
    for col in nested_cols:
        # Keep only the first value of each aggregated field
        df[col] = df[col].apply(lambda x: str(x).split('$')[0])
    # Convert to numbers; anything unparsable becomes NaN
    df[nested_cols] = df[nested_cols].apply(pd.to_numeric, errors='coerce')
    return df

def fill_missing_values(df):
    # Replace missing values (e.g. non-IP packets) with the default -1
    df.fillna(-1, inplace=True)
    return df

def preprocess(df):
    res = encode_protocols(df, 'frame.protocols')
    res = create_index(res)
    res = clean_nested(res)
    res = fill_missing_values(res)
    return res
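
The two less obvious steps above, one-hot encoding the protocol stack and keeping the first value of an aggregated field, can be illustrated on a toy table. This is a minimal sketch with invented packet values; only the column names follow the `tshark` export.

```python
import pandas as pd

# Toy packets with made-up values; column names follow the tshark export
toy_packets = pd.DataFrame({
    "frame.protocols": ["eth:ethertype:ip:icmp", "eth:ethertype:ip:udp:gtp:ip:icmp"],
    "ip.ttl": ["64", "64$63"],
})

# One-hot encode the protocol stack: one 0/1 column per protocol name
onehot = toy_packets["frame.protocols"].str.get_dummies(sep=":")

# Keep only the first occurrence of an aggregated field, then convert to numbers
first_ttl = pd.to_numeric(toy_packets["ip.ttl"].str.split("$").str[0], errors="coerce")

print(onehot)
print(first_ttl.tolist())
```

A protocol that appears twice in the same stack (here `ip` in the tunneled packet) still yields a single indicator column.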

### Function to create features
The objective is to create a dataframe where each row is a single file. To do so, we need to aggregate the data per file. We are using `mean` to aggregate the data.

In [None]:
def create_features(df):
    df = df.groupby(level=0).mean()

    return df
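
As a small illustration of this aggregation, the sketch below averages a toy packet table per file. The values are invented; note that the mean of a one-hot protocol column is the fraction of packets in the file that contain that protocol.

```python
import pandas as pd

# Toy per-packet table indexed by file name (values are made up)
toy_rows = pd.DataFrame(
    {"frame.len": [100, 200, 50], "icmp": [1, 0, 1]},
    index=["a.pcap", "a.pcap", "b.pcap"],
)

# One row per file: numeric fields are averaged over all packets of that file
per_file = toy_rows.groupby(level=0).mean()
print(per_file)
```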

### Reading data

In [None]:
%%time
data_path = "https://raw.githubusercontent.com/b-yond-infinite-network/sharkfest-europe-2023-data/main/network-traces-clustering/data.csv"
df = pd.read_csv(data_path, index_col=0)


In [None]:
df

### Apply preprocessing function

In [None]:
%%time
df = preprocess(df)

In [None]:
df

### Create the features

In [None]:
df = create_features(df)

In [None]:
df

### Normalize the data to avoid the scale effect when computing the distance

In [None]:
# Initializing the scaler
scaler = MinMaxScaler()
# Fitting and transforming the data
scaled_data = scaler.fit_transform(df)

In [None]:
scaled_data
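
To make the effect of `MinMaxScaler` concrete, the sketch below scales a small made-up feature matrix and compares the result with the column-wise formula `(x - min) / (max - min)`. The numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up features on very different scales (e.g. mean frame length, protocol share)
toy_features = np.array([[150.0, 0.5], [50.0, 1.0], [100.0, 0.0]])

toy_scaled = MinMaxScaler().fit_transform(toy_features)

# The same result computed by hand, column by column
col_min = toy_features.min(axis=0)
col_max = toy_features.max(axis=0)
toy_manual = (toy_features - col_min) / (col_max - col_min)

print(toy_scaled)
```

After scaling, every column lies in the range 0 to 1, so no single feature dominates the distance computation.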

### Create a hierarchical clustering

For more details on the different parameters of `linkage`, check the [docs here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html).

In [None]:
Z = linkage(scaled_data, method='average', metric='cityblock')
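
The structure of the linkage matrix can be inspected on a tiny example. The sketch below uses four made-up points forming two obvious pairs, with the same `average` method and `cityblock` metric as above. Each row of the result describes one merge: the two cluster ids being joined, the distance between them, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four illustrative points: two close pairs that are far from each other
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]])

toy_Z = linkage(toy_points, method="average", metric="cityblock")

# n points always produce n - 1 merges, each described by 4 values
print(toy_Z.shape)
# The last merge joins the two pairs at the average of the four cross-pair
# Manhattan distances: (2.0 + 1.9 + 1.9 + 1.8) / 4 = 1.9
print(toy_Z[-1])
```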



### Plotting the results

In [None]:
plt.figure(figsize=(20, 10))
dendrogram(Z, labels=df.index.astype(str).tolist(), leaf_rotation=90, leaf_font_size=15)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("PCAP Files")
plt.ylabel("Manhattan Distance")
plt.tight_layout()
plt.show()

### Extract clusters

For more details on the different parameters of `fcluster`, check the [docs here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster).

In [None]:
# Cut the dendrogram into at most 6 flat clusters
groups = fcluster(Z, t=6, criterion='maxclust')


In [None]:
groups

In [None]:
df['groups'] = groups

In [None]:
df[df['groups'] == 1].index

In [None]:
df[df['groups'] == 2].index

In [None]:
df[df['groups'] == 3].index
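
Rather than inspecting one group at a time, the members of every cluster can be collected in a single pass. The sketch below shows the idea on toy data; the file names and feature values are hypothetical, and `t=2` is chosen only because the toy data has two obvious groups.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Toy per-file features (made-up names and values)
toy_files = pd.DataFrame(
    [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]],
    index=["a.pcap", "b.pcap", "c.pcap", "d.pcap"],
    columns=["f1", "f2"],
)

toy_tree = linkage(toy_files.to_numpy(), method="average", metric="cityblock")
toy_labels = fcluster(toy_tree, t=2, criterion="maxclust")

# Map each cluster label to the list of files assigned to it
members = {}
for label, name in zip(toy_labels, toy_files.index):
    members.setdefault(int(label), []).append(name)

for label, names in sorted(members.items()):
    print(label, names)
```

With `criterion='maxclust'`, `t` is an upper bound on the number of clusters, so fewer groups than requested may be returned.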