This repository provides the code and data of the submitted paper: "Attacking Pre-Trained Language Models of Code with Multi-level Linguistic Representations"
Our approach consists of two parts: (1) probing tasks; (2) MindAC.
python 3.8.13
numpy 1.21.2
pandas 1.3.4
torch 2.0.0+cu118
tqdm 4.63.0
scikit-learn 1.0.1
transformers 4.20.1
TXL v10.8 (7.5.20)
We experiment on two open-source C/C++ datasets: (1) the Devign dataset; (2) the POJ-104 dataset. The Devign dataset consists of functions from the FFmpeg and QEMU projects and is available at https://drive.google.com/file/d/1x6hoF7G-tSYxg8AFybggypLZgMGDNHfF/view?pli=1.
POJ-104 dataset is available at https://drive.google.com/file/d/0B2i-vWnOu7MxVlJwQXN6eVNONUU/view?resourcekey=0-Po3kiAifLfCCYnanCBDMHw
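Once downloaded, the Devign data can be loaded as (code, label) pairs. The sketch below assumes the "function.json" file is a JSON array whose records store the source code under "func" and the vulnerability label under "target"; the field names are an assumption and should be checked against the downloaded file.

```python
import json

# Hypothetical loader for the Devign "function.json" file. We assume each
# record stores the source code under "func" and the vulnerability label
# under "target"; check the field names in the downloaded file.
def load_devign(path):
    with open(path) as f:
        records = json.load(f)
    return [(r["func"], int(r["target"])) for r in records]

# Demonstrate with a tiny inline record instead of the real file:
sample = [{"func": "int main() { return 0; }", "target": 0}]
with open("function.json", "w") as f:
    json.dump(sample, f)

pairs = load_devign("function.json")
print(pairs)  # [('int main() { return 0; }', 0)]
```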
In this paper, we set three research questions:
(1) RQ1: Which layers of the victim models are significant for linguistic features learning?
(2) RQ2: How effective is the linguistic representations-based attack compared with the state-of-the-art baselines?
(3) RQ3: Can we improve the robustness of victim models with adversarial examples?
Before the probing tasks, we need to fine-tune the CodeBERT model.
python codebert.py --train_eval train --layer 12
We perform the probing tasks (i.e., 2 surface probing tasks, 3 syntax probing tasks, and 3 semantics probing tasks) on the Devign dataset.
cd ./probing
Run the code in the "code_length" and "code_content" folders. For example, to perform the "CodeLength" task, run the following scripts:
cd ./code_length
python code_length.py # labeling each code snippet according to its length
python tokenization.py
python codebert.py --train_eval prob --layer 1 # evaluating the ability of the first layer
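The layer-wise probing step trains a lightweight classifier on frozen hidden states from one layer and reports its accuracy. A minimal sketch of this protocol, with synthetic vectors standing in for the layer-1 embeddings that the repo's codebert.py would actually extract (the setup details are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen layer-1 embeddings (32-dim here, 768-dim
# for the real CodeBERT) with a toy "length bucket" label hidden in dim 0.
X = rng.normal(size=(400, 32))
y = (X[:, 0] > 0).astype(int)

# The probe is a simple linear classifier: high accuracy means the layer's
# representations expose the property being probed.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"layer-1 probing accuracy: {acc:.2f}")
```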
We need to parse the ASTs from the code snippets with Joern.
cd ../preprocess
python code_preprocessing.py # filtering out the comments in the code snippets
python graph_generation.py # generating the ".dot" files of the ASTs and CFGs
cd ../probing
Derive the related information from the ASTs by running "identifier_num.py", "ctrstatement_num.py", and "tree_width.py". Then run the following script:
python codebert.py --train_eval prob --layer n # assessing the ability of the n-th layer
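Scripts like "identifier_num.py" presumably walk the exported ".dot" files and count node types. A minimal sketch under an assumed Joern-style node-label format (the exact label text varies by Joern version and must be checked):

```python
import re

# Hypothetical identifier counter over a Joern-style AST ".dot" file.
# We assume identifier nodes carry "IDENTIFIER" in their label text;
# verify against the labels your Joern version actually emits.
def count_identifiers(dot_text):
    return sum(1 for line in dot_text.splitlines()
               if re.search(r'label\s*=.*IDENTIFIER', line))

dot = '''digraph ast {
  "1" [label = "(METHOD,main)"]
  "2" [label = "(IDENTIFIER,x)"]
  "3" [label = "(IDENTIFIER,y)"]
  "1" -> "2"
}'''
print(count_identifiers(dot))  # 2
```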
We first need to perform the semantic-preserving transformations, i.e., "ForToWhile" (Transformation2), "SwitchTrans" (Transformation7), and "WhileToFor" (Transformation3).
cd ./Transformation1
python code_trans.py # performing the transformation
cd ..
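The repo performs these rewrites with TXL grammars; purely for illustration, the "WhileToFor" transformation can be sketched as a regex rewrite. This toy version only handles simple loops and ignores do-while, macros, and string literals:

```python
import re

# Illustrative "WhileToFor" rewrite: "while (cond)" -> "for (; cond; )".
# The repo uses TXL for this; a regex sketch covers only trivial cases.
def while_to_for(code):
    return re.sub(r'\bwhile\s*\(([^)]*)\)', r'for (; \1; )', code)

src = "int i = 0; while (i < 10) { i++; }"
print(while_to_for(src))  # int i = 0; for (; i < 10; ) { i++; }
```

The rewrite preserves semantics in C, since `for (; cond; )` and `while (cond)` are equivalent loop headers.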
Run the code in "switch_trans", "for_to_while", and "while_to_for". For example:
cd ./switch_trans
python switch_trans.py # labeling a code snippet as "1" if it is transformed or "0" if it is not
python tokenization.py
python codebert.py --train_eval prob --layer 1 # assessing the ability of the first layer
cd ./attack
cd "./vulnerability prediction"
cd ./CodeBERT
Download the Devign dataset and put the "function.json" file into the "dataset" folder.
cd ./dataset
python proprecess.py # splitting the dataset into training, validation, and testing sets
cd ../
python tokenization.py
python codebert.py # fine-tuning CodeBERT on Devign dataset
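The splitting step in proprecess.py can be sketched as follows; the 80/10/10 ratio and random shuffling are assumptions, and the repo may instead use a fixed split file:

```python
import random

# Hypothetical 80/10/10 split of the Devign records; the repo's
# proprecess.py may use a different ratio or a fixed split file.
def split_dataset(records, seed=42):
    records = list(records)
    random.Random(seed).shuffle(records)
    n = len(records)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return (records[:n_train],
            records[n_train:n_train + n_valid],
            records[n_train + n_valid:])

data = [{"func": f"f{i}", "target": i % 2} for i in range(100)]
train, valid, test = split_dataset(data)
print(len(train), len(valid), len(test))  # 80 10 10
```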
sh mkdir.sh
python original_cases.py # obtaining the code snippets which are correctly predicted by the fine-tuned CodeBERT
sh mutation.sh # run this script first
sh mutation1.sh
sh mutation2.sh # mutation1.sh and mutation2.sh can be run in parallel
python fuzzing.py
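The attack loop in fuzzing.py presumably queries the fine-tuned model with transformed variants of a correctly predicted snippet and keeps any variant that flips the prediction. A toy sketch of that loop, with hypothetical stand-ins for the victim model and the transformation family:

```python
# Illustrative adversarial search loop: apply candidate transformations and
# keep the first variant that flips the victim's prediction. The "victim"
# here is a toy stand-in for the fine-tuned CodeBERT classifier.
def attack(code, variants, victim):
    original = victim(code)
    for variant in variants(code):
        if victim(variant) != original:
            return variant  # an adversarial example
    return None

def toy_victim(code):    # toy model: predicts 1 iff the snippet uses "while"
    return int("while" in code)

def toy_variants(code):  # toy family: a single WhileToFor-style rewrite
    yield code.replace("while (", "for (; ").replace(") {", "; ) {", 1)

adv = attack("while (i < n) { i++; }", toy_variants, toy_victim)
print(adv)  # for (; i < n; ) { i++; }
```

Because every variant is semantics-preserving, any prediction flip is by construction a model error.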
cd ../GraphCodeBERT
cd ./dataset
python proprecess.py # splitting the dataset into training, validation, and testing sets
cd ../
sh run.sh # fine-tuning GraphCodeBERT on Devign dataset
sh mkdir.sh
python original_cases.py # obtaining the code snippets which are correctly predicted by the fine-tuned GraphCodeBERT
sh mutation.sh # run this script first
sh mutation1.sh
sh mutation2.sh # mutation1.sh and mutation2.sh can be run in parallel
python fuzzing.py
Run the scripts in the same way as for the attacks against vulnerability prediction above.
Run the corresponding scripts on the training sets and augment the training sets with the generated adversarial examples to obtain the "adversarial training sets". Then fine-tune the victim models on the "adversarial training sets". The attacks follow RQ2.
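The augmentation step can be sketched as follows: each adversarial variant keeps the label of its original snippet, and the union is shuffled before fine-tuning again (the shuffling is an assumption; the repo may simply concatenate):

```python
import random

# Sketch of building an "adversarial training set": keep each adversarial
# variant with the label of its original snippet and shuffle the union
# before fine-tuning the victim model again.
def augment(train_set, adversarial_pairs, seed=0):
    augmented = list(train_set) + list(adversarial_pairs)
    random.Random(seed).shuffle(augmented)
    return augmented

train = [("while (i<n) i++;", 1), ("return 0;", 0)]
adv = [("for (; i<n; ) i++;", 1)]  # (adversarial code, original label)
adv_train = augment(train, adv)
print(len(adv_train))  # 3
```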