# Packet Anomaly Detection

We used the dataset published by the Canadian Institute for Cybersecurity (the page is not always reachable): https://www.unb.ca/cic/datasets/iotdataset-2022.html

This dataset contains simulated data from the following setup:

<img src="https://raw.githubusercontent.com/b-yond-infinite-network/sharkfest-europe-2023/main/assets/iot-setup.jpg">

We focused on Idle and Active use cases in this experiment as described below:

- **Idle**: In this experiment, we captured the whole network traffic from late in the evening to early in the morning, which we call idle time. In this period, the whole lab was completely evacuated and there were no human interactions involved.

- **Active**: In addition to the idle time, the whole network communications were also captured throughout the day. All fellow researchers during this period were allowed to enter the lab whenever they wanted. They might interact with devices and generate network traffic either passively or actively.




To create chunks from a large PCAP file:


```
editcap -c [chunk size] [inputfile.pcap]  [outputdir]/[prefix].pcap
```

**In the following, the chunk size is fixed at 10,000 frames.**

To extract the fields of a PCAP file into a CSV file:


```
tshark -r [file.pcap] -T fields -e frame.number -e frame.interface_id -e frame.len -e frame.protocols -e frame.time_delta -e ip.hdr_len -e ip.len -e ip.proto -e ip.ttl -e ip.version -E aggregator="$" -E separator=";" -E header=y > [file.csv]
```
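
The preprocessing further down expects a `file` column that identifies the chunk each packet came from, which the `tshark` command does not emit by itself. The following is a minimal sketch of one way to add it while concatenating per-chunk exports. It assumes one CSV per chunk with the `;` separator used above; the function name `load_chunks` and the chunk names are hypothetical and not part of the original workflow.

```python
import io
import pandas as pd

def load_chunks(named_sources):
    """Concatenate per-chunk tshark CSV exports, tagging each row with its chunk name.

    named_sources maps a chunk name to a path or file-like object (hypothetical layout).
    """
    frames = []
    for name, source in named_sources.items():
        chunk = pd.read_csv(source, sep=";")
        chunk["file"] = name  # remember which chunk the packet belongs to
        frames.append(chunk)
    return pd.concat(frames, ignore_index=True)

# Tiny in-memory stand-ins for two chunk exports
chunk_a = io.StringIO(
    "frame.number;frame.len;frame.protocols\n"
    "1;60;eth:ethertype:ip:tcp\n"
    "2;74;eth:ethertype:ip:udp\n"
)
chunk_b = io.StringIO(
    "frame.number;frame.len;frame.protocols\n"
    "1;98;eth:ethertype:arp\n"
)
packets = load_chunks({"chunk_00000.pcap": chunk_a, "chunk_00001.pcap": chunk_b})
print(packets)
```

With real data, the in-memory buffers would be replaced by the paths of the CSV files written by `tshark`.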



## 1 - Verify runtime environment

In [None]:
import pandas as pd
import numpy as np
try:
    import google.colab
    IN_COLAB = True
    # Load the autoreload extension for IPython
    %load_ext autoreload
    # Mode 2 reloads all modules before each cell runs, so that changes made to code in the src folder are reflected in the running code
    %autoreload 2
    %pip install scikit-learn==1.3.1
except ImportError:
    IN_COLAB = False


## 2 - Helper functions

### 2.1 - Function to preprocess the data
For each packet:
* drop columns that are completely empty
* one-hot encode the protocols listed in `frame.protocols`
* use the file name as the index
* keep only the first value of nested fields, e.g. `34$45` -> `34`
* fill missing values with a default value (`-1`)
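
As an illustration of the encoding and cleaning steps listed above, here is a small sketch on a hypothetical three-packet frame. The values are made up; only the column names and the `$` aggregator come from the `tshark` export.

```python
import pandas as pd

# Hypothetical packets: a TCP packet, an ICMP packet carrying a nested IP header, and an ARP packet
toy = pd.DataFrame({
    "frame.protocols": [
        "eth:ethertype:ip:tcp",
        "eth:ethertype:ip:icmp:ip:udp",
        "eth:ethertype:arp",
    ],
    "ip.ttl": ["64", "64$128", None],
})

# One-hot encode the colon-separated protocol stack (one column per protocol)
protocols = toy["frame.protocols"].str.get_dummies(sep=":")

# Keep the first value of a nested field, then convert to numbers
ttl = pd.to_numeric(toy["ip.ttl"].str.split("$").str[0], errors="coerce")

# Packets without an IP layer have no TTL and receive the default value -1
ttl = ttl.fillna(-1)

print(protocols)
print(ttl.tolist())
```

Note that `get_dummies` marks presence only: the second packet lists `ip` twice but still gets a single `1` in the `ip` column.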

In [None]:
def keep_columns_with_data(df):
    # drop columns in which every value is missing
    return df.dropna(axis=1, how='all')

def encode_protocols(df, colname):
    protocols_df = df[colname].str.get_dummies(sep=':')

    data_with_protocols = pd.concat([df, protocols_df], axis=1)

    return data_with_protocols.drop(colname, axis=1)

def create_index(df):
    # index each packet by the file (PCAP chunk) it came from
    df.index = df['file'].astype(str)
    df.drop(['file', 'frame.number'], axis=1, inplace=True)
    return df

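def clean_nested(df):
    # tshark joins multiple occurrences of a field within one packet with '$' (see -E aggregator)
    nested_cols = ['ip.hdr_len', 'ip.len', 'ip.proto', 'ip.ttl', 'ip.version']
    for col in nested_cols:
        # keep only the first occurrence, e.g. '34$45' -> '34'
        df[col] = df[col].apply(lambda x: str(x).split('$')[0])
    # convert to numbers; values that cannot be parsed become NaN
    df[nested_cols] = df[nested_cols].apply(pd.to_numeric, errors='coerce')
    return df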

def fill_missing_values(df):
    df.fillna(-1, inplace=True)
    return df

def preprocess(df):
    res = encode_protocols(df, 'frame.protocols')
    res = create_index(res)
    res = clean_nested(res)
    res = keep_columns_with_data(res)
    res = fill_missing_values(res)
    return res

### 2.2 - Function to create features
The objective is to create a data frame in which each row summarizes a single file. To do so, we aggregate the packet-level data per file using `sum`.

In [None]:
def create_features(df):
    df = df.groupby(level=0).sum()

    return df
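
To make the aggregation concrete, here is a small sketch on hypothetical per-packet rows that are already indexed by file name. The file names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-packet rows, indexed by the chunk file they came from
packets = pd.DataFrame(
    {"frame.len": [60, 74, 98], "tcp": [1, 0, 0], "udp": [0, 1, 0]},
    index=["chunk_0.pcap", "chunk_0.pcap", "chunk_1.pcap"],
)

# One row per file: summed frame lengths and per-protocol packet counts
per_file = packets.groupby(level=0).sum()
print(per_file)
```

Summing a one-hot protocol column yields the number of packets carrying that protocol, and summing `frame.len` yields the total frame length per file. Values filled with `-1` during preprocessing are summed as well.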

## 3 - Load Data
The first step is to read the data. The data has been extracted using `tshark` and stored in two CSV files, `extract_active.csv` and `extract_idle.csv`; the cell below loads zipped copies from the project's data repository.

In [None]:
%%time
df_active = pd.read_csv("https://github.com/b-yond-infinite-network/sharkfest-europe-2023-data/raw/main/packet-anomaly-detection/packet-anomaly-detection-data-active.zip", compression='zip', index_col=0)
df_idle = pd.read_csv("https://github.com/b-yond-infinite-network/sharkfest-europe-2023-data/raw/main/packet-anomaly-detection/packet-anomaly-detection-data-idle.zip", compression='zip', index_col=0)


In [None]:
df_active.info()
df_idle.info()

## 4 - Preprocess and Create Features
Apply the `preprocess` and `create_features` functions to both data frames.

In [None]:
%%time
df_active = create_features(preprocess(df_active))

In [None]:
%%time
df_idle = create_features(preprocess(df_idle))

### 4.1 - Drop columns primarily composed of zeros

In [None]:
# drop columns that are primarily composed of zeros: keep only columns whose mean over the idle files exceeds 10
df_idle = df_idle.loc[:, df_idle.mean() > 10]

### 4.2 - Make sure both datasets have the same columns

In [None]:
# use the columns that remain in df_idle to find the columns common to both data frames
# (a list comprehension keeps the column order stable between runs, unlike a set)
columns = [col for col in df_idle.columns if col in df_active.columns]

# only keep columns that are present in both
df_active = df_active[columns]
df_idle = df_idle[columns]

In [None]:
df_active.head(10)

In [None]:
df_idle.head(10)

## 5 - Model Training & results

### 5.1 - Create a train test split 

In [None]:
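%%time
from sklearn.model_selection import train_test_split
# "pos" = idle (normal) files, "neg" = active files; random_state makes the split repeatable
# hold out 1% of the idle files for testing
pos_train, pos_test = train_test_split(df_idle, test_size=0.01, random_state=42)
# use 99% of the active files for testing; neg_train is not used afterwards
neg_train, neg_test = train_test_split(df_active, test_size=0.99, random_state=42)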

#### 5.1.1 - Create a training set
The training set contains only idle PCAP summaries (idle PCAPs will be our normal).

In [None]:
train = pos_train.copy()  # train on idle summaries only

#### 5.1.2 - Test set
The test set combines the held-out 1% of the idle PCAP summaries with 99% of the active PCAP summaries (mostly representing abnormal cases).

In [None]:
test = pd.concat([pos_test,neg_test])

### 5.2 - PCA dimensionality reduction
The features are reduced to 2 dimensions; PCA is used as a preprocessing step for the isolation forest in the next step.

In [None]:
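from sklearn.decomposition import PCA
# fit the projection on the idle training data only, then apply it to the test data
pca = PCA(n_components=2)
pca_train = pd.DataFrame(pca.fit_transform(train), columns=['PC1', 'PC2'])
pca_test = pd.DataFrame(pca.transform(test), columns=['PC1', 'PC2'])
# labels follow the IsolationForest convention: 1 = normal (idle), -1 = anomaly (active)
pca_train['label'] = len(train) * [1]
pca_test['label'] = len(pos_test) * [1] + len(neg_test) * [-1]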
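
Because the isolation forest only sees two components, it can be useful to check how much of the variance they retain. The sketch below shows the check on synthetic data; the matrix `X` is a made-up stand-in for the per-file features, not the notebook's data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the per-file feature matrix: 200 files, 5 features,
# where 3 features are noisy linear combinations of the other 2
base = rng.normal(size=(200, 2))
mixed = base @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(200, 3))
X = np.hstack([base, mixed])

pca = PCA(n_components=2)
pca.fit(X)

# Fraction of the total variance captured by the two retained components
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained by 2 components: {retained:.3f}")
```

In the notebook itself, the same attribute is available on the fitted `pca` object as `pca.explained_variance_ratio_`.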

In [None]:
from matplotlib import pyplot as plt
plt.style.use('ggplot')
# colors per label: 1 = idle, -1 = active
color_map = {
    1:'blue',
    -1:'red'
}

ax = pca_train.plot.scatter(x='PC1', y='PC2', c=color_map[1], label='Idle', alpha=0.2, figsize=(20, 10))
pca_test[pca_test['label'] == 1].plot.scatter(x='PC1', y='PC2', c=color_map[1], ax=ax, alpha=0.2)
pca_test[pca_test['label'] == -1].plot.scatter(x='PC1', y='PC2', c=color_map[-1], label='Active', ax=ax, alpha=0.2)

plt.legend()
plt.style.use('default')




### 5.3 - Training an Isolation Forest model
The model is trained using just the idle PCAP summaries.

In [None]:
%%time
from sklearn.ensemble import IsolationForest
iforest = IsolationForest(n_estimators=2, contamination=0.4, random_state=42)
iforest.fit(pca_train[['PC1', 'PC2']])  # fit the model on normal (idle) data only
preds_test = iforest.predict(pca_test[['PC1', 'PC2']])  # predict on the test set: 1 = normal, -1 = anomaly
pca_test['preds'] = preds_test
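
The `contamination` parameter sets the share of training points that fall on the anomalous side of the decision threshold. The sketch below illustrates this on synthetic two-dimensional data. It is an assumption-laden stand-in: the data is random, and it uses scikit-learn's default of 100 trees rather than the 2 trees used above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for the PCA-projected idle training data
X_train = rng.normal(size=(1000, 2))

model = IsolationForest(n_estimators=100, contamination=0.4, random_state=42)
model.fit(X_train)

# predict returns 1 for inliers and -1 for outliers
train_preds = model.predict(X_train)
outlier_fraction = (train_preds == -1).mean()
print(f"share of training points flagged as outliers: {outlier_fraction:.2f}")
```

With `contamination=0.4`, roughly 40% of the training points are flagged, so a comparable share of idle test files may be flagged as anomalous as well.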

#### 5.3.1 - Results from the Isolation Forest model

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

print(classification_report(pca_test['label'],pca_test['preds']))


In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

cm = confusion_matrix(pca_test['label'],pca_test['preds'])
disp = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=['Active','Idle'])
disp.plot()

#### 5.3.2 - Plot results of the Isolation Forest

In [None]:
%%time
from sklearn.inspection import DecisionBoundaryDisplay
from matplotlib.pyplot import figure


plt.figure(figsize=(10,10))
ax = plt.gca()
disp = DecisionBoundaryDisplay.from_estimator(
    estimator=iforest,
    X=pca_test[['PC1','PC2']],
    response_method="decision_function",
    ax=ax
)
disp.ax_.set_title("Path length decision boundary \nof IsolationForest")



In [None]:
labels = pca_test['label'].to_list()
plt.figure(figsize=(10,10))
ax = plt.gca()
disp = DecisionBoundaryDisplay.from_estimator(
    estimator=iforest,
    X=pca_test[['PC1','PC2']],
    response_method="decision_function",
    ax=ax
)
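# overlay the test points, colored by their true label (1 = idle, -1 = active)
scatter = disp.ax_.scatter(pca_test['PC1'], pca_test['PC2'], c=labels, s=20, edgecolor="w")

disp.ax_.set_title("Path length decision boundary \nof IsolationForest (test set)")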




In [None]:
# overlay the training (idle) points on the decision boundary; test labels are not needed here
plt.figure(figsize=(10,10))
ax = plt.gca()
disp = DecisionBoundaryDisplay.from_estimator(
    estimator=iforest,
    X=pca_train[['PC1','PC2']],
    response_method="decision_function",
    ax=ax
)
scatter = disp.ax_.scatter(pca_train['PC1'], pca_train['PC2'], s=20, edgecolor="w")

disp.ax_.set_title("Path length decision boundary \nof IsolationForest (training set)")


plt.show()  # render the plot above; display(figure()) would create and show a new, empty figure