Skip to content

Wir2Go/S2Vul

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

S2Vul: Vulnerability Analysis Based on Self-supervised Information Integration

Abstract

The analysis of static vulnerabilities, which consists of detection, classification, and localization, is a perpetually significant concern in software security. The advancement of neural networks has led to a greater emphasis on vulnerability detection research. However, most research faced obstacles in attempting to perform satisfactorily on real-world datasets. Furthermore, an additional obstacle is the substantial reliance of the studies on labels, which requires considerable effort for labeling, restricts the model's scalability, and potentially results in adverse effects due to inaccurate labels.

This study introduces an approach for efficiently training the model to adjust to various vulnerability analysis tasks. By synthesizing a wide range of information extracted from the source code, the model allows the network to collect substantial data to identify vulnerabilities. In addition, unsupervised representation learning is employed to mitigate the influence of labels in the method. The model could be applied to a wide range of vulnerability-related tasks by training task-specific classifiers at a minimal cost. The model was evaluated using two public vulnerability datasets. The test results validate the efficacy of our model, which achieves an accuracy of more than 70% in localization, an accuracy of more than 98% in detection, and greater accuracy of greater than 70% in classification when evaluating the real-world dataset, surpassing the performance of peer approaches.

Overview

Process of U2Vul

Dataset & Model

Link: zenodo

Dataset is in version 1, while Model weight is in version 2.

Download them and extract to ./data and ./model folder separately.

Source Code

  • Step 1: setup requirements
python -m venv -r ./requirements.txt
  • Step 2: Preparation of data

    • step 2-1: data extraction
      • Compile using LLVM-11, gclang would be helpful.
      • Analysis using Joern, llvm2cpg would be helpful. The output should be CPG with semantics (.bin format).
      • Firstly obtain .dot function-level graphs from Joern CPG, then convert CPGs to network digraph using networkx, extract semantics with pre-order traversal, extract CPG flow information and convert to adjacency matrix.
    • step 2-2: data processing
      • deduplication: using simhash algorithm for deduplication of samples as well as sentences, the k-threshold we set is 3.
      • parsing (for transformer pretraining): since the raw data is too big, the function-level semantics could be parsed into the sentences for next steps.
  • Step 3: Pretrain of model

    • step 3-1: employing sentencepiece to train models for the tokenization of source and IR code separately, save the models.
    • step 3-2: tokenizing the pretrained data and train the transformer
    python ./script/0_pretrain/pretrain_transformer.py
    
    • step 3-3: obtain the integrated semantic embedding with pretrained transformer, along with the flow adjacency matrix data to train GGNN using contrastive learning
    python ./script/0_pretrain/pretrain_ggnn_dino.py
    
  • Step 4: Evaluation

    • step 4-1: get the embeddings using the pretrained models.
    python ./script/1_data_process/embedding.py
    
    • step 4-2: preparation of evaluation data
      • label encoding:
      python ./script/1_data_process/label_encode.py
      
      • dataset generation:
      python ./script/1_data_process/gen_data.py
      
    • step 4-3: evaluation
      • the configuration file for evaluation is ./script/2_evaluation/config.yaml
      • after the configuration, using the script for concrete downstream evaluation:
      python ./script/2_evaluation/evaluation.py ./script/2_evaluation/config.yaml
      

Publication

@INPROCEEDINGS{10771195,
author={Yang, Shaojie and Xu, Haoran and Xu, Fangliang and Wang, Yongjun},
booktitle={2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE)}, 
title={S2Vul: Vulnerability Analysis Based on Self-supervised Information Integration}, 
year={2024},
volume={},
number={},
pages={84-95},
keywords={Representation learning;Location awareness;Training;Analytical models;Accuracy;Scalability;Source coding;Semantics;Software;Software reliability;Vulnerability Detection;Information Integration;Contrastive Learning;Transformer Model;Graph Neural Network},
doi={10.1109/ISSRE62328.2024.00019}}

About

Implementation of "S2Vul: Vulnerability Analysis Based on Self-supervised Information Integration" in ISSRE 2024

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages