Datasets and code for the submitted paper.

Environment:
- OS: Ubuntu
- GPU: NVIDIA A100-SXM
- Language: Python (v3.8)
- CUDA: 11.3
- Python packages: please refer to the official docs for the usage of these packages (especially AllenNLP).
Setup:
We use the approaches proposed by Pan et al. (Automated Unearthing of Dangerous Issue Reports, FSE 2022), Zhou et al. (Finding a Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes, ASE 2021), and Sun et al. (Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation, arXiv) as our baselines. Pan et al.'s work is archived at this link.
We use `bert-base-uncased` and `mrm8488/codebert-base-finetuned-detect-insecure-code` from the HuggingFace Transformers library.
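As a quick sanity check (a minimal sketch assuming the standard `transformers` API; the training scripts may load the models differently), the BERT model can be loaded and applied like this, and the same pattern works for the CodeBERT model name:

```python
# Minimal sketch: load one of the pretrained models used in this repository.
# Assumes the `transformers` package (and PyTorch) is installed; the model is
# fetched from the HuggingFace hub on first use (see the offline setup below).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("A sample issue report title.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```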
Below we explain the files that need to be present in each folder.
The `data` folder needs the following files:
- 1000.csv
- all_samples.csv
- CVE_dict.json
- cwe_1_8_classes.json
- cwe_1_8_classes_old.json
- cwe_2_4_classes.json
- cwe_2_4_clasees_old.json
- dataset.csv
- dataset_project_relation_layer.csv
- embedded_cwe.json
- embedded_from_bottom_cwe.json
- test_project.csv
- test_project.json
- train_project.csv
- train_project.json
The files `all_samples.csv`, `CVE_dict.json`, `dataset.csv`, and `dataset_project_relation_layer.csv` are shared anonymously here.
The `model_best` folder needs the following files:
- VPFinder_multi_f1.txt
- VPFinder_multi_model.pth
- VPFinder-1+3n+4_multi_f1.txt
- VPFinder-1+3n+4_multi_model.pth
- VPFinder-h2+h3_multi_f1.txt
- VPFinder-h2+h3_multi_model.pth
All files are shared anonymously here.
Download the HuggingFace model files for `bert-base-uncased` and save them to the `mybert` folder. Click here to jump to the model file page. Alternatively, you can modify the BERT model name in the training and testing Python files to download the files online.
Download the HuggingFace model files for `mrm8488/codebert-base-finetuned-detect-insecure-code` and save them to the `mycodebert` folder. Click here to jump to the model file page. Alternatively, you can modify the model name in the training and testing Python files to download the files online.
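For example (a sketch; the exact variable names in the training and testing scripts may differ), switching between the local folders and the online model names is just a matter of the path passed to `from_pretrained`:

```python
from transformers import AutoModel

# Load from the local folders prepared above ...
bert = AutoModel.from_pretrained("./mybert")
codebert = AutoModel.from_pretrained("./mycodebert")

# ... or change the names to download directly from the HuggingFace hub:
# bert = AutoModel.from_pretrained("bert-base-uncased")
# codebert = AutoModel.from_pretrained(
#     "mrm8488/codebert-base-finetuned-detect-insecure-code")
```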
The `build` folder needs the following file:
- clang_ast
The file is shared anonymously here.
VPFinder uses the dataset `data/dataset_project_relation_layer.csv`, which is also used by Zhou et al.'s work and Sun et al.'s work. Pan et al.'s work uses the datasets `data/train_project.json`, `data/validation.json`, and `data/test_project.json`.
If you want to create a new dataset from scratch, the initial dataset is `dataset.csv`. First, execute `python utils.py` to obtain the dataset applicable to MemVul. Then, execute `python make_dataset.py` to obtain the datasets for the remaining models.
Run the files starting with `for_train`, `train`, or `test`. For example:
python train_VPFinder_binary.py
For the baseline MemVul, we are unable to provide the relevant files due to the upload size limit; see more details here.
Because three of the models converge slowly, we use alternative models for training. The parameters obtained from training are then used to test the original models. The model parameters have been saved in the `model_best` folder, and the code files for the alternative models begin with `for_train`. For example:
python for_train_VPFinder_multi.py
python test_VPFinder_multi.py
We use the GitHub REST API to collect data. Please make sure you have a personal access token, and then replace the `Your Token Here` fields in the Python files.
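As an illustration (a sketch; the actual scripts may structure this differently), an authenticated request against the GitHub REST API looks like this:

```python
# Sketch: an authenticated request to the GitHub REST API.
# Replace the placeholder with your personal access token, as the scripts expect.
import requests

TOKEN = "Your Token Here"
headers = {"Authorization": f"token {TOKEN}"}

# Example endpoint: fetch a single issue (owner/repo/number are placeholders).
url = "https://api.github.com/repos/octocat/Hello-World/issues/1"
response = requests.get(url, headers=headers)
response.raise_for_status()
issue = response.json()
print(issue["title"])
```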
Get the URL of each issue from the original dataset `data/all_samples.csv`.
python get_url.py
Look for the commits referenced by the given URLs.
python get_SHA.py
Download the patched files, then find the parent commits and download the corresponding pre-patch files.
python get_and_download_parent_code.py
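For reference (a sketch with placeholder repository and commit values; `get_and_download_parent_code.py` may be structured differently), the parent commits of a patch commit can be read from the commits endpoint:

```python
# Sketch: look up a commit and its parent (pre-patch) commits via the GitHub REST API.
import requests

TOKEN = "Your Token Here"
headers = {"Authorization": f"token {TOKEN}"}

# "octocat/Hello-World" and the SHA are placeholders.
sha = "7fd1a60b01f91b314f59955a4e4d4e80d8edf11d"
url = f"https://api.github.com/repos/octocat/Hello-World/commits/{sha}"
commit = requests.get(url, headers=headers).json()

# Each parent entry carries the SHA of a pre-patch commit.
parent_shas = [p["sha"] for p in commit["parents"]]
print(parent_shas)
```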
Extract snippets of Java and Python code, and prepare for the extraction of C code snippets. (Modify the path in `tackle_C.py` as needed.)
python slice.py
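As a rough illustration of the Python side of this step (an assumption about the approach; the actual `slice.py` may slice differently), function-level snippets can be extracted with the standard `ast` module:

```python
# Sketch: extract function-level snippets from Python source (Python 3.8+).
import ast
from typing import List

def extract_function_snippets(source: str) -> List[str]:
    tree = ast.parse(source)
    snippets = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment recovers the exact source text of the node.
            snippets.append(ast.get_source_segment(source, node))
    return snippets

print(extract_function_snippets("def add(a, b):\n    return a + b\n"))
```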
Move the `slice.sh` file to the `build` folder and execute it to extract the C code snippets:
./slice.sh
Store the extracted code snippets in uniformly named files.
python merge_code.py
Add the code snippets to the original dataset `data/all_samples.csv`. The results are saved in `data/all_samples_processed.csv`.
python preprocess_dataset.py
Filter out samples with no code.
python filter_dataset.py
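For instance (a sketch assuming pandas and a hypothetical `code` column; check the real schema used by `filter_dataset.py`), the filtering could look like:

```python
# Sketch: drop samples whose code field is missing or empty.
# The column name "code" and the output filename are assumptions.
import pandas as pd

df = pd.read_csv("data/all_samples_processed.csv")
mask = df["code"].notna() & (df["code"].astype(str).str.strip() != "")
df[mask].to_csv("data/all_samples_filtered.csv", index=False)
```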
Get discussions of issue reports.
python get_comments.py
Add the discussions to the dataset.
python preprocess_dataset_with_comments.py
Get commit messages and patches.
python get_dataset_commits.py
Classify the patches by deletion and addition.
python commit_to_patch.py
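As an illustration of this step (a sketch; the real `commit_to_patch.py` may differ), a unified diff can be split into added and deleted lines like this:

```python
# Sketch: split a unified diff into added and deleted lines.
def split_patch(patch):
    additions, deletions = [], []
    for line in patch.splitlines():
        # Skip the file headers ("+++", "---"), keep real change lines.
        if line.startswith("+") and not line.startswith("+++"):
            additions.append(line[1:])
        elif line.startswith("-") and not line.startswith("---"):
            deletions.append(line[1:])
    return additions, deletions

patch = "--- a/f.c\n+++ b/f.c\n@@ -1 +1 @@\n-int x = 0;\n+int x = 1;\n"
adds, dels = split_patch(patch)
print(adds, dels)
```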
Add the commit messages and patches to the dataset.
python preprocess_dataset_patch_and_message.py
Now we have the complete dataset `data/dataset.csv`.
With our CWE information ready, the workflow is as follows.
Concatenate the CWE information and build the CWE tree based on the relationships between nodes.
python CWE_relationship.py
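As a minimal sketch (assuming simple child-parent relations; the actual CWE data format in `CWE_relationship.py` differs), a CWE tree can be built like this:

```python
# Sketch: build a CWE tree from (child, parent) relations.
from collections import defaultdict

# Hypothetical relations; the real ones come from the concatenated CWE data.
relations = [("CWE-79", "CWE-74"), ("CWE-89", "CWE-74"), ("CWE-74", "CWE-707")]

children = defaultdict(list)
for child, parent in relations:
    children[parent].append(child)

def print_tree(node, depth=0):
    print("  " * depth + node)
    for c in children[node]:
        print_tree(c, depth + 1)

print_tree("CWE-707")
```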
Embed the CWE information and then aggregate the vectors bottom-up.
python emb_tree.py
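For intuition (a sketch with toy vectors; the actual embedding and aggregation in `emb_tree.py` may differ), bottom-up aggregation combines each node's vector with its children's aggregated vectors:

```python
# Sketch: bottom-up aggregation of node vectors over a tree.
import numpy as np

children = {"CWE-707": ["CWE-74"], "CWE-74": ["CWE-79", "CWE-89"]}
# Hypothetical per-node embeddings (e.g., from BERT over the CWE descriptions).
emb = {n: np.random.rand(4) for n in ["CWE-707", "CWE-74", "CWE-79", "CWE-89"]}

def aggregate(node):
    vecs = [aggregate(c) for c in children.get(node, [])]
    vecs.append(emb[node])
    return np.mean(vecs, axis=0)  # average the node with its aggregated children

print(aggregate("CWE-707"))
```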
Find the nodes according to the requirements; their order corresponds to the labels in the dataset `data/dataset_project_relation_layer.csv`, which serves as another part of the model input.
python multi_class_embedding.py