S2Vul: Vulnerability Analysis Based on Self-supervised Information Integration

Abstract

The analysis of static vulnerabilities, which consists of detection, classification, and localization, is a perpetually significant concern in software security. The advancement of neural networks has led to a greater emphasis on vulnerability detection research. However, most research faced obstacles in attempting to perform satisfactorily on real-world datasets. Furthermore, an additional obstacle is the substantial reliance of the studies on labels, which requires considerable effort for labeling, restricts the model's scalability, and potentially results in adverse effects due to inaccurate labels.

This study introduces an approach for efficiently training the model to adjust to various vulnerability analysis tasks. By synthesizing a wide range of information extracted from the source code, the model allows the network to collect substantial data to identify vulnerabilities. In addition, unsupervised representation learning is employed to mitigate the influence of labels in the method. The model could be applied to a wide range of vulnerability-related tasks by training task-specific classifiers at a minimal cost. The model was evaluated using two public vulnerability datasets. The test results validate the efficacy of our model, which achieves an accuracy of more than 70% in localization, an accuracy of more than 98% in detection, and greater accuracy of greater than 70% in classification when evaluating the real-world dataset, surpassing the performance of peer approaches.

Overview

Dataset & Model

Link: zenodo

Dataset is in version 1, while Model weight is in version 2.

Download them and extract to ./data and ./model folder separately.

Source Code

Step 1: setup requirements

python -m venv -r ./requirements.txt

Step 2: Preparation of data
- step 2-1: data extraction
  - Compile using LLVM-11, gclang would be helpful.
  - Analysis using Joern, llvm2cpg would be helpful. The output should be CPG with semantics (.bin format).
  - Firstly obtain .dot function-level graphs from Joern CPG, then convert CPGs to network digraph using networkx, extract semantics with pre-order traversal, extract CPG flow information and convert to adjacency matrix.
- step 2-2: data processing
  - deduplication: using simhash algorithm for deduplication of samples as well as sentences, the k-threshold we set is 3.
  - parsing (for transformer pretraining): since the raw data is too big, the function-level semantics could be parsed into the sentences for next steps.
Step 3: Pretrain of model
- step 3-1: employing sentencepiece to train models for the tokenization of source and IR code separately, save the models.
- step 3-2: tokenizing the pretrained data and train the transformer
```
python ./script/0_pretrain/pretrain_transformer.py
```
- step 3-3: obtain the integrated semantic embedding with pretrained transformer, along with the flow adjacency matrix data to train GGNN using contrastive learning
```
python ./script/0_pretrain/pretrain_ggnn_dino.py
```
Step 4: Evaluation
- step 4-1: get the embeddings using the pretrained models.
```
python ./script/1_data_process/embedding.py
```
- step 4-2: preparation of evaluation data
  - label encoding:
```
python ./script/1_data_process/label_encode.py
```
  - dataset generation:
```
python ./script/1_data_process/gen_data.py
```
- step 4-3: evaluation
  - the configuration file for evaluation is ./script/2_evaluation/config.yaml
  - after the configuration, using the script for concrete downstream evaluation:
```
python ./script/2_evaluation/evaluation.py ./script/2_evaluation/config.yaml
```

Publication

@INPROCEEDINGS{10771195,
author={Yang, Shaojie and Xu, Haoran and Xu, Fangliang and Wang, Yongjun},
booktitle={2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE)}, 
title={S2Vul: Vulnerability Analysis Based on Self-supervised Information Integration}, 
year={2024},
volume={},
number={},
pages={84-95},
keywords={Representation learning;Location awareness;Training;Analytical models;Accuracy;Scalability;Source coding;Semantics;Software;Software reliability;Vulnerability Detection;Information Integration;Contrastive Learning;Transformer Model;Graph Neural Network},
doi={10.1109/ISSRE62328.2024.00019}}

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data		data
figs		figs
model		model
script		script
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

S2Vul: Vulnerability Analysis Based on Self-supervised Information Integration

Abstract

Overview

Dataset & Model

Source Code

Publication

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

S2Vul: Vulnerability Analysis Based on Self-supervised Information Integration

Abstract

Overview

Dataset & Model

Source Code

Publication

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages