
VPFinder

1 Project Description

Datasets and code for the submitted paper.

2 Environments

  1. OS: Ubuntu

    GPU: NVIDIA A100-SXM

  2. Language: Python (v3.8)

  3. CUDA: 11.3

  4. Python packages:

    Please refer to the official documentation for the use of these packages (especially AllenNLP).

  5. Setup:

    We use the approaches proposed by Pan et al. (Automated Unearthing of Dangerous Issue Reports, FSE 2022), Zhou et al. (Finding a Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes, ASE 2021), and Sun et al. (Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation, arXiv) as our baselines. Pan et al.'s work is archived at link.

    We use bert-base-uncased and mrm8488/codebert-base-finetuned-detect-insecure-code from the Hugging Face Transformers library.

3 Contents of the Folder

This section explains the files that each folder needs to contain.

3.1 data

The data folder needs the following files:

  • 1000.csv
  • all_samples.csv
  • CVE_dict.json
  • cwe_1_8_classes.json
  • cwe_1_8_classes_old.json
  • cwe_2_4_classes.json
  • cwe_2_4_clasees_old.json
  • dataset.csv
  • dataset_project_relation_layer.csv
  • embedded_cwe.json
  • embedded_from_bottom_cwe.json
  • test_project.csv
  • test_project.json
  • train_project.csv
  • train_project.json

The files all_samples.csv, CVE_dict.json, dataset.csv, and dataset_project_relation_layer.csv are shared anonymously here.

3.2 model_best

The model_best folder needs the following files:

  • VPFinder_multi_f1.txt
  • VPFinder_multi_model.pth
  • VPFinder-1+3n+4_multi_f1.txt
  • VPFinder-1+3n+4_multi_model.pth
  • VPFinder-h2+h3_multi_f1.txt
  • VPFinder-h2+h3_multi_model.pth

All files are shared anonymously here.

3.3 mybert

Download the Hugging Face model files for bert-base-uncased and save them to the mybert folder.

Click here to jump to the model file page.

Alternatively, you can modify the BERT model name in the training and testing Python files to download the files online.

3.4 mycodebert

Download the Hugging Face model files for mrm8488/codebert-base-finetuned-detect-insecure-code and save them to the mycodebert folder.

Click here to jump to the model file page.

Alternatively, you can modify the model name in the training and testing Python files to download the files online, as sketched below.
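
As a minimal sketch with the transformers library, switching between the local folders and an online download is just a matter of the name passed to from_pretrained (whether the scripts use AutoModel or a task-specific class follows the code files; this only illustrates the name switch):

from transformers import AutoModel, AutoTokenizer

# Load from the local folders prepared in Sections 3.3 and 3.4 ...
tokenizer = AutoTokenizer.from_pretrained("./mybert")
model = AutoModel.from_pretrained("./mybert")

# ... or pass the Hugging Face model names to download online instead.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
codebert = AutoModel.from_pretrained(
    "mrm8488/codebert-base-finetuned-detect-insecure-code")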

3.5 build

The build folder needs the following file:

  • clang_ast

The file is shared anonymously here.

4 Dataset

VPFinder uses the dataset data/dataset_project_relation_layer.csv, which is also used by Zhou et al.'s work and Sun et al.'s work. Pan et al.'s work uses data/train_project.json, data/validation.json, and data/test_project.json.

If you want to create a new dataset from scratch, start from the initial dataset dataset.csv. First, execute python utils.py to obtain the dataset used by MemVul. Then, execute python make_dataset.py to obtain the datasets for the remaining models.

5 Train & Test

Run the files starting with for_train, train, or test. For example:

python train_VPFinder_binary.py

Due to the limit on the size of uploaded files, we are unable to provide the files needed to run the MemVul baseline; see more details here.

Because three of the models converge slowly, we train alternative models instead; the parameters obtained from that training are then used to test the original models. The trained parameters have been saved in the model_best folder, and the code files for the alternative models begin with for_train. For example:

python for_train_VPFinder_multi.py
python test_VPFinder_multi.py
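
Conceptually, the for_train_* scripts save the alternative model's parameters and the test_* scripts load them into the original model. A minimal PyTorch sketch of that handoff (the nn.Linear stand-ins are placeholders for the real model classes):

import torch
import torch.nn as nn

alt_model = nn.Linear(768, 10)       # placeholder for the alternative model
original_model = nn.Linear(768, 10)  # placeholder for the original model

# for_train_*: after training, save the alternative model's parameters.
torch.save(alt_model.state_dict(), "model_best/VPFinder_multi_model.pth")

# test_*: load the saved parameters into the original model for evaluation.
original_model.load_state_dict(torch.load("model_best/VPFinder_multi_model.pth"))
original_model.eval()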

6 Production of the Dataset

We use the GitHub REST API to collect data. Please make sure you have a personal access token, and replace the Your Token Here fields in the Python files.
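
For example, each script authenticates by sending the token in the Authorization header. A minimal sketch with the requests library (OWNER/REPO and the issue number are placeholders; the scripts iterate over the issues in data/all_samples.csv):

import requests

TOKEN = "Your Token Here"  # replace with your personal access token
HEADERS = {"Authorization": f"token {TOKEN}",
           "Accept": "application/vnd.github+json"}

# Fetch a single issue as a smoke test for the token.
resp = requests.get("https://api.github.com/repos/OWNER/REPO/issues/1",
                    headers=HEADERS)
resp.raise_for_status()
print(resp.json()["title"])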

Get the URL of each issue from the original dataset data/all_samples.csv.

python get_url.py

Look up the commits referenced by the given URLs.

python get_SHA.py
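
One plausible way to recover the fixing commits (an assumption about how get_SHA.py works, not a transcript of it) is the issue events endpoint, whose "referenced" events carry a commit_id:

import requests

HEADERS = {"Authorization": "token Your Token Here"}  # as in the sketch above
events = requests.get(
    "https://api.github.com/repos/OWNER/REPO/issues/1/events",
    headers=HEADERS).json()
shas = [e["commit_id"] for e in events
        if e["event"] == "referenced" and e.get("commit_id")]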

Download the patched files, then find the parent commits and download their files.

python get_and_download_parent_code.py
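
The commits endpoint returns both the parent SHAs and the changed files, which is presumably what this step walks; a sketch under that assumption (SHA is a placeholder):

import requests

HEADERS = {"Authorization": "token Your Token Here"}  # as in the sketch above
commit = requests.get(
    "https://api.github.com/repos/OWNER/REPO/commits/SHA",
    headers=HEADERS).json()
parent_shas = [p["sha"] for p in commit["parents"]]       # pre-patch versions
patched_files = [f["filename"] for f in commit["files"]]  # files touched by the fix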

Extract snippets of Java and Python code and prepare for the extraction of C snippets. (Modify the path in tackle_C.py as needed.)

python slice.py
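
For the Python files, snippet extraction can be done with the standard ast module; a minimal sketch (the actual slicing logic in slice.py may be more involved):

import ast

source = open("example.py").read()  # one downloaded source file
lines = source.splitlines()

# Extract every function definition as one snippet (Python 3.8+ for end_lineno).
snippets = ["\n".join(lines[node.lineno - 1:node.end_lineno])
            for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.FunctionDef)]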

Move the slice.sh file to the build folder and execute it to extract the C snippets:

./slice.sh

Store the extracted code snippets in uniformly named files.

python merge_code.py

Add the code snippets to the original dataset data/all_samples.csv; the results are saved in data/all_samples_processed.csv.

python preprocess_dataset.py

Filter out samples with no code.

python filter_dataset.py

Get discussions of issue reports.

python get_comments.py
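
The discussions come from the issue comments endpoint, which is paginated (up to 100 comments per request); a hedged sketch:

import requests

HEADERS = {"Authorization": "token Your Token Here"}  # as in the sketch above
comments, page = [], 1
while True:
    batch = requests.get(
        "https://api.github.com/repos/OWNER/REPO/issues/1/comments",
        headers=HEADERS, params={"per_page": 100, "page": page}).json()
    if not batch:
        break
    comments.extend(c["body"] for c in batch)
    page += 1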

Add the discussions to the dataset.

python preprocess_dataset_with_comments.py

Get commit messages and patches.

python get_dataset_commits.py

Split the patches into deletions and additions.

python commit_to_patch.py
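
In unified-diff terms this amounts to splitting each patch on its leading +/- markers; a sketch of the idea (commit_to_patch.py may treat hunks differently):

def split_patch(patch: str):
    """Separate a unified diff into deleted and added lines."""
    deletions, additions = [], []
    for line in patch.splitlines():
        if line.startswith("-") and not line.startswith("---"):
            deletions.append(line[1:])
        elif line.startswith("+") and not line.startswith("+++"):
            additions.append(line[1:])
    return deletions, additions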

Add the commit messages and patches to the dataset.

python preprocess_dataset_patch_and_message.py

This yields the complete dataset data/dataset.csv.

7 CWE Embedding

With the CWE information ready, the workflow is as follows.

Concatenate the CWE information and build the CWE tree based on the relationships between nodes.

python CWE_relationship.py

Embed the CWE information and then aggregate the vectors bottom-up.

python emb_tree.py
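
As a rough illustration of the bottom-up aggregation (the toy tree and mean pooling are assumptions here; the exact operator is defined in emb_tree.py):

import numpy as np

# Toy fragment of the CWE tree: id -> child ids, with one embedding per node
# (e.g., a BERT vector over the concatenated CWE description).
children = {"CWE-74": ["CWE-79", "CWE-89"], "CWE-79": [], "CWE-89": []}
emb = {cwe: np.random.rand(768) for cwe in children}

def aggregate(cwe):
    # Bottom-up: average a node's own vector with its aggregated children.
    vecs = [aggregate(c) for c in children[cwe]]
    if vecs:
        emb[cwe] = np.mean([emb[cwe]] + vecs, axis=0)
    return emb[cwe]

aggregate("CWE-74")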

Select the nodes as required; their order corresponds to the labels in the dataset data/dataset_project_relation_layer.csv and serves as another part of the model input.

python multi_class_embedding.py
