# Train and evaluate graphs

We can use the [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/) library to build a graph model that classifies individual nodes in a graph.

The graph is constructed as described in the paper by David P. et al. [link](https://arxiv.org/pdf/2107.14756.pdf). We work with PyTorch Geometric to have access to more recent algorithms that run on graphs.

The graph contains two node types: one node for each IP address (corresponding to a device) and one node for each connection between two IP addresses. Graphs with multiple node types are called heterogeneous graphs.
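To make the structure concrete, here is a minimal pure-Python sketch of such a graph built from three made-up flows. The addresses and the variable names (`flows`, `ip_index`, `edges`) are illustrative assumptions, not taken from the dataset or from the class defined later.

```python
# Three hypothetical flows (source IP, destination IP); values are illustrative.
flows = [
    ("192.168.220.4", "192.168.100.6"),
    ("192.168.220.4", "192.168.100.5"),
    ("192.168.210.9", "192.168.100.6"),
]

# One 'ip' node per distinct address, indexed in sorted order.
ip_nodes = sorted({ip for flow in flows for ip in flow})
ip_index = {ip: i for i, ip in enumerate(ip_nodes)}

# One 'connection' node per flow; each is linked to both of its endpoints,
# once in each direction, which gives two edge types.
edges = {("ip", "to", "connection"): [], ("connection", "to", "ip"): []}
for conn_id, (src, dst) in enumerate(flows):
    for ip in (src, dst):
        edges[("ip", "to", "connection")].append((ip_index[ip], conn_id))
        edges[("connection", "to", "ip")].append((conn_id, ip_index[ip]))

print(len(ip_nodes), "ip nodes,", len(flows), "connection nodes")
for edge_type, pairs in edges.items():
    print(edge_type, pairs)
```

Every connection node has exactly two neighbours, so information between connections can only flow through the ip nodes they share.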

We use a transformer-based graph neural network that is designed for heterogeneous graphs and implemented in PyTorch Geometric. More details can be found in this paper: [link](https://arxiv.org/abs/2003.01332)

In [2]:
import pickle

with open('data/train/week1_prep_train.pkl','rb') as f:
    df_train = pickle.load(f)

with open('data/eval/week1_prep_val.pkl','rb') as f:
    df_test = pickle.load(f)

We define a class that converts the tabular data to graphs. Each node of type 'ip' corresponds to a device and gets a feature vector derived from its IP address: the address is mapped to one of ten categories (individual internal servers, the subnets of the network, public and local addresses) and one-hot encoded, as can be seen in the function `encode_ip`.
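As a worked illustration, the categories can also be written as lookup tables. This is only a sketch that mirrors the branch order of `encode_ip`; the names `one_hot_ip`, `CATEGORIES`, `EXACT` and `PREFIXES` are hypothetical and not part of the notebook's class.

```python
CATEGORIES = [
    "web server", "file server", "mail server", "backup server",
    "server subnet", "management subnet", "office subnet",
    "developer subnet", "public", "local",
]
# Exact server addresses are checked before their subnet, so order matters.
EXACT = {
    "192.168.100.6": 0, "192.168.100.5": 1,
    "192.168.100.4": 2, "192.168.100.3": 3,
}
PREFIXES = {
    "192.168.100": 4, "192.168.200": 5,
    "192.168.210": 6, "192.168.220": 7,
}

def one_hot_ip(value: str) -> list[float]:
    """Return a 10-dimensional one-hot vector for an address string."""
    vec = [0.0] * len(CATEGORIES)
    if value in EXACT:
        vec[EXACT[value]] = 1.0
    elif value[:11] in PREFIXES:
        vec[PREFIXES[value[:11]]] = 1.0
    elif value[5:6] == "_":  # public ip, same test as in encode_ip
        vec[8] = 1.0
    elif value in ("0.0.0.0", "255.255.255.255"):  # local ip
        vec[9] = 1.0
    return vec

for address in ("192.168.100.6", "192.168.210.17", "8.8.8.8"):
    vec = one_hot_ip(address)
    name = CATEGORIES[vec.index(1.0)] if 1.0 in vec else "no category"
    print(address, "->", name)
```

An address that matches none of the rules keeps an all-zero vector, which is also how `encode_ip` behaves.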

The nodes of type 'connection' carry the connection features listed in the constructor of the class (see `self.conn_feat`). Each connection node also has a one-hot label (see `self.label_cols_oh`).

This way we can train an algorithm that classifies the connection nodes using not only their own features but also the structure of the graph and the ip nodes. The graphs are built from consecutive chunks of the tabular dataset, for example 200 rows each, where every row describes one connection between two devices. Each graph is therefore a snapshot of the network. This is done in the function `process`.

In [3]:
# build the graphs ourselves, one per 200 connections
from torch_geometric.data import HeteroData
from torch_geometric.loader import DataLoader
import numpy as np
import torch
from tqdm import tqdm
import pickle

class CICdata():
    def __init__(self, path_data):
        # the context manager closes the file once the data is loaded
        with open(path_data, 'rb') as f:
            self.df = pickle.load(f)
        self.conn_feat = ['Duration', 'Packets', 'Bytes', 'Proto_ICMP ', 'Proto_IGMP ','Proto_TCP  ', 'Proto_UDP  ','flag_A', 'flag_P', 'flag_R', 'flag_S','flag_F', 'Tos_0', 'Tos_16', 'Tos_32', 'Tos_192']
        self.label_cols_oh = ['attack_benign','attack_bruteForce', 'attack_dos', 'attack_pingScan', 'attack_portScan']
        
        
    def make_ip_map(self, data):
        unique_ip = np.unique(np.append(data['Src IP Addr'].to_numpy(), 
                                        data['Dst IP Addr'].to_numpy()))
        return {ip:idx for idx, ip in enumerate(unique_ip)}
    
    def encode_ip(self, value):
        temp = [0]*10
        if value == '192.168.100.6': #internal web server
            temp[0] = 1.0
        elif value == '192.168.100.5': #internal file server
            temp[1] = 1.0
        elif value == '192.168.100.4': #internal mail server
            temp[2] = 1.0
        elif value == '192.168.100.3': #internal backup server
            temp[3] = 1.0
        elif value[:11] == '192.168.100': #server subnet
            temp[4] = 1.0
        elif value[:11] == '192.168.200': #management subnet
            temp[5] = 1.0
        elif value[:11] == '192.168.210': #office subnet
            temp[6] = 1.0
        elif value[:11] == '192.168.220': #developer subnet
            temp[7] = 1.0
        elif value[5:6]=='_': #public ip
            temp[8] = 1.0
        elif value in ['0.0.0.0', '255.255.255.255']: #local ip
            temp[9] = 1.0

        return temp
    
    def get_ip_feat(self, ip_map):
        ip_data = []
        for ip, idx in ip_map.items():
            ip_data.append(self.encode_ip(ip))
        
        return torch.tensor(ip_data).float()
                
    def make_edges(self, data, ip_map):
        src = []
        dst = []
        count = 0
        for _, row in data.iterrows():
            #source ip to connection
            src.append(ip_map[row['Src IP Addr']])
            dst.append(count)

            #destination ip to connection
            src.append(ip_map[row['Dst IP Addr']])
            dst.append(count)
            count +=1

        return torch.tensor([src, dst]), torch.tensor([dst, src])

    def get_info_conn(self, data, cols):
        return torch.tensor(data[cols].values)
                
    def process(self, n_rows=200):
        x_conn = self.get_info_conn(self.df, self.conn_feat)
        y = self.get_info_conn(self.df, self.label_cols_oh)
        data_list = []
        for i in tqdm(range(1, (len(self.df)//n_rows)+1), desc='processing'):
            start_idx = (i-1)*n_rows
            end_idx = i*n_rows
            sample = self.df[start_idx:end_idx]
            ip_map = self.make_ip_map(sample)
            ip_to_conn, conn_to_ip = self.make_edges(sample, ip_map)
            data = HeteroData()
            data['ip'].x = self.get_ip_feat(ip_map) #encode ip's from the map
            data['connection'].x = x_conn[start_idx:end_idx].float()
            data['connection'].y = y[start_idx:end_idx]
            data['ip','connection'].edge_index = ip_to_conn
            data['connection','ip'].edge_index = conn_to_ip
            data_list.append(data)
        
        return data_list
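The slicing in `process` keeps only full snapshots: `len(self.df) // n_rows` graphs are built and any trailing rows are skipped. A minimal sketch of that index arithmetic, using a hypothetical helper `snapshot_bounds` and an illustrative row count:

```python
def snapshot_bounds(n_total: int, n_rows: int = 200) -> list[tuple[int, int]]:
    """Return the (start, end) row range of every full snapshot."""
    return [(i * n_rows, (i + 1) * n_rows) for i in range(n_total // n_rows)]

bounds = snapshot_bounds(1050, n_rows=200)
leftover = 1050 - bounds[-1][1]

print(bounds)    # the row ranges of the full snapshots
print(leftover)  # rows at the end that do not fill a snapshot
```

With the default of 200 rows, the 2639 evaluation graphs reported further down cover 527,800 connections, which matches the total support in the classification report.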

In [4]:
cic_data = CICdata('data/train/week1_prep_train.pkl')

In [5]:
data = cic_data.process()

processing: 100%|██████████| 10559/10559 [01:06<00:00, 159.85it/s]


We can display the types of nodes in our dataset along with the edges.
* two node types: 'ip' and 'connection'.
* two edge types: from ip to connection node or from connection to ip node.

In [6]:
print(data[0].node_types)
print(data[0].metadata())

['ip', 'connection']
(['ip', 'connection'], [('ip', 'to', 'connection'), ('connection', 'to', 'ip')])


A heterogeneous graph transformer (HGT) model is easy to build from the built-in classes. Hyperparameters such as the number of attention heads, the number of hidden channels and the number of layers can be chosen freely.

The number of output channels must equal the number of unique label values (5 in the example dataset).

In [7]:
import torch_geometric.transforms as T
from torch_geometric.nn import HGTConv, Linear

class HGT(torch.nn.Module):
    def __init__(self, data_graph, hidden_channels, out_channels, num_heads, num_layers):
        super().__init__()

        self.lin_dict = torch.nn.ModuleDict()
        for node_type in data_graph.node_types:
            self.lin_dict[node_type] = Linear(-1, hidden_channels)

        print(self.lin_dict)
        self.convs = torch.nn.ModuleList()
        for _ in range(num_layers):
            conv = HGTConv(hidden_channels, hidden_channels, data_graph.metadata(),
                           num_heads, group='sum')
            self.convs.append(conv)

        self.lin = Linear(hidden_channels, out_channels)

    def forward(self, x_dict, edge_index_dict):        
        for node_type, x in x_dict.items():
            x_dict[node_type] = self.lin_dict[node_type](x).relu_()

        for conv in self.convs:
            x_dict = conv(x_dict, edge_index_dict)

        return self.lin(x_dict['connection'])

In [8]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


model = HGT(data_graph= data[0], hidden_channels=64, out_channels=5,
            num_heads=4, num_layers=2)
# Initialize lazy module, still on cpu
with torch.no_grad():
    out = model(data[0].x_dict, data[0].edge_index_dict)

ModuleDict(
  (ip): Linear(-1, 64, bias=True)
  (connection): Linear(-1, 64, bias=True)
)


The model is trained with boilerplate PyTorch code. The number of epochs and the batch size can be adapted.

In [9]:
from torch_geometric.loader import DataLoader, DataListLoader
from torch.optim import Adam
from torch.nn import functional as F
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#hyperparams
EPOCHS = 20
batch_size = 64

optimizer = Adam(model.parameters())
train_loader = DataLoader(data, batch_size=batch_size)


def train():
    model.to(device)
    model.train()
    for epoch in range(EPOCHS):
        total_examples = total_loss = 0

        for batch in train_loader:
            optimizer.zero_grad()
            batch = batch.to(device)
            out = model(batch.x_dict, batch.edge_index_dict)
            #print(batch['connection'].y, batch['connection'].y.size())
            # print(out, out.size())
            loss = F.cross_entropy(out, batch['connection'].y.float())
            #loss = focal_loss(out, batch['connection'].y.float())
            loss.backward()
            optimizer.step()

            # weight each batch loss by the nominal batch size (the last batch may be smaller)
            total_examples += batch_size
            total_loss += float(loss) * batch_size
            
        tqdm.write('EPOCH '+str(epoch)+' loss: '+ str(total_loss/total_examples))
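
The loss call above passes the one-hot labels as floats, so they are treated as class probabilities, which recent PyTorch versions accept as targets for `F.cross_entropy`. For a one-hot target this reduces to the usual negative log-softmax at the true class. The pure-Python sketch below checks that identity for a single sample; `log_softmax` and `cross_entropy_soft` are hypothetical helpers and the logits are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def cross_entropy_soft(logits, target_probs):
    """Cross-entropy of one sample against a probability vector."""
    return -sum(p * lp for p, lp in zip(target_probs, log_softmax(logits)))

logits = [2.0, 0.5, -1.0, 0.0, 0.3]   # illustrative scores for the 5 classes
one_hot = [0.0, 0.0, 0.0, 0.0, 1.0]   # true class: index 4

loss_soft = cross_entropy_soft(logits, one_hot)
loss_index = -log_softmax(logits)[4]  # index-based form of the same loss
print(abs(loss_soft - loss_index) < 1e-12)
```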

In [10]:
train()

EPOCH 0 loss: 1.0481664816086942
EPOCH 1 loss: 0.16742647788205156
EPOCH 2 loss: 0.016370172334030608
EPOCH 3 loss: 0.01377640049393063
EPOCH 4 loss: 0.012513390156549797
EPOCH 5 loss: 0.011639993952915326
EPOCH 6 loss: 0.010873084813912018
EPOCH 7 loss: 0.0097299537721979
EPOCH 8 loss: 0.008813250395916948
EPOCH 9 loss: 0.007899739290008834
EPOCH 10 loss: 0.0070540991671584905
EPOCH 11 loss: 0.006732296295415312
EPOCH 12 loss: 0.00582879560929679
EPOCH 13 loss: 0.004940631523398528
EPOCH 14 loss: 0.005108925253438374
EPOCH 15 loss: 0.0056151988995032185
EPOCH 16 loss: 0.004877394851813954
EPOCH 17 loss: 0.003961015903835439
EPOCH 18 loss: 0.003629152100824506
EPOCH 19 loss: 0.0037679633526116075


The tabular evaluation data is converted to graphs in the same way, and predictions are made with the trained model. Note that the labels are read from the `connection` nodes.

In [11]:
cic_val = CICdata('data/eval/week1_prep_val.pkl')
data_val = cic_val.process()

processing: 100%|██████████| 2639/2639 [00:16<00:00, 161.34it/s]


In [12]:
model.eval()
preds = []
labels = []
# gradients are not needed for inference
with torch.no_grad():
    for graph in tqdm(data_val):
        graph = graph.to(device)
        preds.append(model(graph.x_dict, graph.edge_index_dict).argmax(dim=1))
        labels.append(graph['connection'].y.argmax(dim=1))

preds = torch.cat(preds)
labels = torch.cat(labels)

100%|██████████| 2639/2639 [00:07<00:00, 332.47it/s]


In [13]:
#precise calculation of accuracy
correct = (preds == labels).sum()
acc = correct / len(preds)
acc

tensor(0.9970, device='cuda:0')

The label values were one-hot encoded, so after `argmax` the numbers [0, 1, 2, 3, 4] correspond to the different classes. We can convert these back to class names.
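A minimal sketch of that conversion, assuming the class order is the column order of `label_cols_oh` from the constructor; `index_to_class` and the two example rows are illustrative.

```python
label_cols_oh = ['attack_benign', 'attack_bruteForce', 'attack_dos',
                 'attack_pingScan', 'attack_portScan']

# Position in the one-hot vector -> readable class name (prefix stripped).
index_to_class = {i: col.split('_', 1)[1] for i, col in enumerate(label_cols_oh)}

one_hot_rows = [[1, 0, 0, 0, 0], [0, 0, 0, 1, 0]]  # illustrative labels
decoded = [index_to_class[row.index(max(row))] for row in one_hot_rows]
print(decoded)  # ['benign', 'pingScan']
```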

In [14]:
from sklearn.metrics import classification_report
print(classification_report(preds.cpu(), labels.cpu()))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    323295
           1       1.00      0.52      0.68       702
           2       1.00      1.00      1.00    155977
           3       0.55      0.52      0.53      1505
           4       0.98      0.99      0.99     46321

    accuracy                           1.00    527800
   macro avg       0.91      0.80      0.84    527800
weighted avg       1.00      1.00      1.00    527800



In [15]:
import pandas as pd
map_class = {'0':'benign', '1':'bruteforce', '2':'dos', '3':'pingscan', '4':'portscan'}
cr_df = pd.DataFrame(classification_report(preds.cpu(), labels.cpu(), output_dict=True)).transpose()
temp = cr_df.index[:5].map(map_class).append(cr_df.index[5:])
cr_df.index = temp
cr_df

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| benign | 0.999422 | 0.999991 | 0.999706 | 323295.0 |
| bruteforce | 1.0 | 0.518519 | 0.682927 | 702.0 |
| dos | 0.999981 | 0.99991 | 0.999946 | 155977.0 |
| pingscan | 0.5531 | 0.515615 | 0.5337 | 1505.0 |
| portscan | 0.983494 | 0.989163 | 0.98632 | 46321.0 |
| accuracy | 0.996995 | 0.996995 | 0.996995 | 0.996995 |
| macro avg | 0.907199 | 0.804639 | 0.84052 | 527800.0 |
| weighted avg | 0.996917 | 0.996995 | 0.996852 | 527800.0 |


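The results show that the model performs much worse on the `bruteforce` and `pingscan` attacks than on the other classes (F1 scores of 0.68 and 0.53). Both classes are also far rarer, judging by the support column. One caveat when reading the table: `classification_report` expects its arguments in the order `(y_true, y_pred)`, whereas the calls above pass the predictions first, so the precision and recall columns are interchanged and the support column counts predicted rather than true labels. The F1 scores and the accuracy are unaffected.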
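Precision and recall are not symmetric in their arguments, so the order in which labels and predictions are passed to a report function matters. The pure-Python sketch below, with a hypothetical helper `precision_recall` and made-up labels, shows the two values trading places when the arguments are swapped.

```python
def precision_recall(y_true, y_pred, cls):
    """Precision and recall of class `cls`, treating y_true as ground truth."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    predicted = sum(p == cls for p in y_pred)
    actual = sum(t == cls for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # illustrative labels
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]  # the model misses half of class 1

print(precision_recall(y_true, y_pred, cls=1))  # (1.0, 0.5)
print(precision_recall(y_pred, y_true, cls=1))  # (0.5, 1.0)
```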

### Visualization
We can visualize some of the detected attacks.

In [None]:
# TODO: use NetworkX for this