This repository contains the labels and code for the paper *Where is it? Tracing the Vulnerability-Relevant Files from Vulnerability Reports*.
| File | Description |
|---|---|
| code/run_model.py | Code for the pre-trained-model classifier; the RoBERTa, CodeBERT, GPT-2, and GPTBigCode experiments are all run with this script. |
| data/cve_relevant_file | Cleaned CVE–vulnerability-relevant-file pairs |
| data/training_set | One fold of the 5-fold cross-validation splits used to train the classifier. Only part of the testing data is included here because the full dataset is too large; please contact the authors for the full data. |
| data/experiments/RQ1 | Manual labels for RQ1 |
| data/experiments/RQ2 | Manual labels for RQ2 |
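The training splits above come from 5-fold cross-validation. A minimal, self-contained sketch of producing such splits (the actual split files ship in data/training_set; this is only an illustration of the scheme, not the repository's code):

```python
import random

def k_fold_splits(items, k=5, seed=42):
    """Split item indices into k (train, test) folds, as in standard
    k-fold cross-validation. Deterministic given the seed."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
        for i in range(k)
    ]

# Each CVE-file pair lands in exactly one test fold.
splits = k_fold_splits(list(range(10)))
```

Each item appears in exactly one test fold and in the training set of the remaining four folds.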
The code is based on the HuggingFace Transformers project. All file paths in the scripts must be changed to match your local environment.
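Since the classifier builds on HuggingFace Transformers, the four model families can be loaded through the `Auto*` classes with a sequence-classification head. A minimal sketch (the checkpoint names below are assumptions for illustration and may differ from those used in run_model.py):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoints for the four model families; run_model.py's
# actual checkpoint names may differ.
CHECKPOINTS = {
    "roberta": "roberta-base",
    "codebert": "microsoft/codebert-base",
    "gpt2": "gpt2",
    "gptbigcode": "bigcode/gpt_bigcode-santacoder",
}

def load_classifier(name: str, num_labels: int = 2):
    """Load a pre-trained checkpoint with a binary classification head."""
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(
        ckpt, num_labels=num_labels
    )
    return tokenizer, model
```

Swapping the `name` argument is then enough to reproduce the same training loop across all four pre-trained models.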
For the raw Transformer baseline, please see this repository.
As reported in Section 3.1 of the paper, the accuracy of our collected CVE–relevant-file pairs is as high as 99.7%.
As reported in Section 3.2, our file candidate matching achieves 99.5% accuracy for file pairs that differ only in repository, and 76.3% for pairs that differ in both path and repository. Our classifier dataset collection approach obtains true vulnerability-relevant file candidates for 89.9% of CVEs.
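One simple way to realize such cross-repository candidate matching is to rank candidates by how long a path suffix they share with the known relevant file. This is only an illustrative sketch under that assumption, not the matching logic actually used in the paper:

```python
def match_candidates(target_path, candidate_paths):
    """Rank candidate file paths by the length of the trailing path
    suffix they share with the known vulnerability-relevant file."""
    def suffix_overlap(a, b):
        n = 0
        for x, y in zip(a.split("/")[::-1], b.split("/")[::-1]):
            if x != y:
                break
            n += 1
        return n
    return sorted(
        candidate_paths,
        key=lambda c: suffix_overlap(target_path, c),
        reverse=True,
    )

# A file moved into a different repository layout still ranks first.
best = match_candidates(
    "src/net/http/parser.c",
    ["lib/http/parser.c", "src/net/ftp/client.c", "docs/parser.md"],
)[0]
```

Pairs that differ only in repository keep their full path suffix and match easily; pairs that also differ in path share a shorter suffix, which is consistent with the accuracy gap reported above.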
The BERT-based models outperform the GPT-based models, with CodeBERT achieving the best results. All pre-trained models are far better than the raw Transformer.
Participants using our tool complete tasks faster and with higher correctness than the control group. The tool's impact on correctness is especially evident as tasks become more challenging. Interestingly, false negatives do not significantly influence participants' judgments, but they do require additional time for validation.



