# Train and Evaluate your own GNN

To make this straightforward and efficient, I have created a few helper classes that do a lot of the heavy lifting for you!

There is a `GNNTrainer` class, which provides a nice interface to train your first Graph Neural Network (GNN), and there are two evaluator classes, `SearchPoolGNNEvaluator` and `NDayGNNEvaluator`, which provide a nice interface to evaluate your model!


**Note**: Running this on a laptop *without* a GPU is not the most performant. Depending on your hardware, loading the data may take 5 or so minutes, and the training (once you execute the `trainer.train()` cell) will take about 10 minutes to run!

In [None]:
from bin2mlpy.training.train import GNNTrainer
from bin2mlpy.eval.eval import SearchPoolGNNEvaluator, NDayGNNEvaluator
from bin2mlpy.training.gnn import GCN, GraphConvNet

In [None]:
TRAINING_DATA_PKL = "../train.pkl" # The path to the processed train data
EVAL_DATA_PKL = "../test.pkl" # The path to the processed test data

## Train your neural network


In [None]:
trainer = GNNTrainer(model_cls=GCN, 
                     training_dataset_pickle=TRAINING_DATA_PKL,
                     evaluation_dataset_pickle=EVAL_DATA_PKL,
                     num_samples_per_epoch=100000,
                     num_training_epochs=50,
                     hidden_dimension_size=128,
                     learning_rate=0.0001,
                     batch_size=512)

In [None]:
trainer.train()

# Evaluate your GNN

We are going to use the `SearchPoolGNNEvaluator` class to get an idea of how performant our GNN is. This class takes a collection of input data and creates *search pools*. A search pool is created by first selecting a function you want to search for (the query function). A single positive pair is then created by pairing the query function with another example of the same function, and 100 negative pairs are created by pairing the query function with random other functions.
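To make the idea of a search pool concrete, here is a minimal sketch in plain Python. It is not the evaluator's actual implementation: the `build_search_pool` helper, the toy function names, and the choice to sample negatives with replacement are all assumptions made for illustration.

```python
import random


def build_search_pool(functions, query_name, num_negatives=100, seed=0):
    """Build one search pool: 1 positive pair plus `num_negatives` negative pairs.

    `functions` maps a function name to a list of variants of that function
    (for example, the same function compiled for different architectures).
    This helper is hypothetical and only illustrates the idea.
    """
    rng = random.Random(seed)
    variants = functions[query_name]
    if len(variants) < 2:
        raise ValueError("the query function needs at least two variants")

    # Two distinct variants of the same function form the positive pair
    query, positive = rng.sample(variants, 2)

    # Negative candidates are variants of every *other* function
    others = [v for name, vs in functions.items() if name != query_name for v in vs]
    negatives = [rng.choice(others) for _ in range(num_negatives)]  # with replacement

    # Each entry is (query, candidate, label); label 1 marks the positive pair
    return [(query, positive, 1)] + [(query, neg, 0) for neg in negatives]


toy_functions = {
    "memcpy": ["memcpy@x86", "memcpy@arm"],
    "strlen": ["strlen@x86", "strlen@arm"],
    "sha1_update": ["sha1_update@x86", "sha1_update@mips"],
}
pool = build_search_pool(toy_functions, "memcpy", num_negatives=5)
print(len(pool))  # 1 positive + 5 negatives
```

A good model should score the single positive pair higher than every negative pair in the pool.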

Each pair within each search pool is then embedded using our model and the cosine similarity between the two embeddings is calculated. These similarity scores are then ranked with the highest first. The higher a pair is ranked, the more similar our model thinks the two functions are! We then use the metrics MRR@10 (the mean reciprocal rank of the positive pair, counted only when it lands in the top 10) and R@1 (recall at 1, the fraction of search pools where the positive pair is ranked first) to give us an idea of how well the model is performing. For both of these metrics, 1 is perfect performance!
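The ranking and the two metrics can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the evaluator's code: the helper names and the tiny two-dimensional embeddings are made up, and the library may handle ties or cut-offs differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def positive_rank(query_emb, candidate_embs, positive_index):
    """1-based rank of the positive candidate, sorted by similarity, highest first."""
    scores = [cosine_similarity(query_emb, c) for c in candidate_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(positive_index) + 1


def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank; a positive ranked below k contributes 0."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)


def recall_at_k(ranks, k=1):
    """Fraction of search pools whose positive is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Two toy search pools; the positive candidate sits at index 0 in both
rank_a = positive_rank([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]], 0)
rank_b = positive_rank([0.0, 1.0], [[1.0, 1.0], [0.0, 2.0], [1.0, 0.0]], 0)
ranks = [rank_a, rank_b]

print(ranks)               # [1, 2]
print(mrr_at_k(ranks))     # 0.75
print(recall_at_k(ranks))  # 0.5
```

In the first pool the positive is ranked first; in the second a negative outscores it, so it drops to rank 2 and pulls both metrics below 1.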

**Note:** Due to the difficulty of the cross-architecture function search task and the limited hardware available (no GPU) restricting how long we can train these models for, do not expect amazing performance!

In [None]:
evaluator = SearchPoolGNNEvaluator(model=trainer.model, 
                         eval_data=trainer.eval_dataset, 
                         num_search_pools=50, 
                         search_pool_size=100)

In [None]:
evaluator.evaluate()

MRR @ 10: 0.42270001769065857
Recall @ 1: 0.2857142984867096

# What about N-day detection performance?!

There is another evaluator included called `NDayGNNEvaluator`. This evaluator works by searching two `openssl` samples, taken from real devices, with a collection of known vulnerable sample versions.

In [None]:
evaluator = NDayGNNEvaluator(model=trainer.model, eval_data_dir="../../data/vuln-eval/graphs")

In [None]:
evaluator.tplink_evaluate()

In [None]:
evaluator.netgear_evaluate()

# Stretch Experimentation/After Hours

There are several ways we could improve this performance:

1) Increase the training time - This can be done in two different ways in our case. The first is increasing the `num_training_epochs` parameter of the `GNNTrainer`. This increases the number of epochs, that is, the number of passes the model makes over the sampled data. The second is increasing the `num_samples_per_epoch` parameter of the `GNNTrainer`. This determines how many samples are drawn from the dataset as a whole to make up the data for a given epoch. I would suggest doing *both* of these. I would increase `num_samples_per_epoch` to `250000` and `num_training_epochs` to `250`.
2) Increase the batch size - When the problem is formulated the way we have done it, as contrastive learning (learning by comparing two things and using the result as the loss), a very easy way to boost performance is to increase the batch size. The batch size determines how many samples are put into the model at once. What this really means in our case is that when we increase the batch size, we increase the number of negative samples that can be associated with each positive. This gives the model a better signal of what is good and what is bad!
3) Change the GNN layer used within the model - The model is currently using the `GCN` layer described on the online website. We could change this to the `GraphConv` layer, which has usually been shown to improve performance. I have created an equivalent model using the `GraphConv` layer called `GraphConvNet`. Try this out and see how it affects performance.
4) Change the learning rate - The learning rate determines how small or large the adjustments are that the optimiser makes given the loss. Experiment with lowering the learning rate to a smaller value and observe the loss values printed by the trainer class. It is likely you will see smaller but more consistent decreases in the loss. This is because the model is making smaller but more precise adjustments. That being said, a very small learning rate will make the training process take a very long time. Experiment with this!
5) Change the size of the GNN - The base `GCN` GNN has a hidden dimension of 64 and an output dimension of 64. Both of these can be adjusted to provide the model with more *power*. The word *power* basically means the size of the model's brain, or its ability to learn. Increasing both of these will, however, increase the computational cost and subsequently make training a bit slower. A rule of thumb is to have the output dimension equal to or less than the hidden dimension. If you read any literature after this and see something like *project down* or *projection layer*, the authors are typically referring to an output dimension that is smaller than the hidden dimension. The reasons behind this are varied, but the usual reason is computational efficiency. Very large models have hidden dimensions that make working with the output representations very computationally expensive. They train *projection* layers to make the representations smaller and more useful to work with.
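One way to keep track of these experiments is to hold the trainer settings in a dictionary and override only what you change. The sketch below reuses the keyword names and baseline values from the `GNNTrainer` cell earlier in this notebook; the `batch_size` of 1024 is an assumed value for illustration (the text only says to increase it), and the trainer call is left commented out so the snippet runs on its own.

```python
# Baseline settings, copied from the GNNTrainer cell earlier in this notebook
baseline = dict(
    num_samples_per_epoch=100000,
    num_training_epochs=50,
    hidden_dimension_size=128,
    learning_rate=0.0001,
    batch_size=512,
)

# Stretch settings: start from the baseline and override selected values
stretch = dict(
    baseline,
    num_samples_per_epoch=250000,  # suggested in point 1
    num_training_epochs=250,       # suggested in point 1
    batch_size=1024,               # assumption: any value above 512 fits point 2
)

# Show exactly what differs from the baseline as (old, new) pairs
changed = {k: (baseline[k], stretch[k]) for k in baseline if baseline[k] != stretch[k]}
print(changed)

# Inside the notebook, the settings could then be passed on like this:
# trainer = GNNTrainer(model_cls=GraphConvNet,
#                      training_dataset_pickle=TRAINING_DATA_PKL,
#                      evaluation_dataset_pickle=EVAL_DATA_PKL,
#                      **stretch)
# trainer.train()
```

Changing one setting at a time makes it easier to tell which change actually moved MRR@10 and R@1.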