
SCAE: Can Seq2Seq Code Transformation Evade Code Authorship Attribution?

Overview

Code authorship attribution [1] is the problem of identifying the author of a piece of source code from the stylistic features in the code, a topic that has recently attracted significant interest and achieved strong accuracy. To defeat attribution, the state-of-the-art approach uses Monte-Carlo Tree Search (MCTS) [2] to find code transformations that obfuscate the code. Although effective at misleading code authorship attribution, MCTS requires an exhaustive search to identify optimal code transformations. Can we efficiently evade code authorship attribution without MCTS?

We present SCAE, a code authorship obfuscation technique that leverages a Seq2Seq code transformer, StructCoder. Unlike MCTS, SCAE saves processing time while maintaining transformation performance. SCAE customizes StructCoder [3], a model originally designed for function-level code translation between languages (e.g., Java to C#), using transfer learning. To alleviate the need for manually transformed training data, we use the outputs of the MCTS method to construct a source-target code pair dataset and fine-tune StructCoder on it for C++-to-C++ code transformation that preserves the stylistic patterns of the target code. Our evaluation shows that SCAE improves efficiency at a slight cost in accuracy compared to MCTS: it reduces processing time by approximately 68% while maintaining an 85% transformation success rate and up to a 95.77% evasion success rate in the untargeted setting. We also show the limitations of Seq2Seq models in the targeted setting.
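The dataset-construction step above (pairing original code with its MCTS-transformed counterpart, then splitting for fine-tuning) can be sketched as follows. This is a minimal illustration, not the actual SCAE pipeline; the function names, split ratios, and toy snippets are assumptions for demonstration only.

```python
# Hypothetical sketch: build a source->target pair dataset from MCTS outputs.
# Each original C++ snippet (source) is paired with its MCTS-obfuscated
# version (target), then shuffled and split for StructCoder fine-tuning.
import random

def build_pairs(originals, transformed):
    """Pair each original snippet with its MCTS-transformed counterpart."""
    assert len(originals) == len(transformed), "one transformed file per original"
    return list(zip(originals, transformed))

def split(pairs, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle deterministically and cut into train/valid/test splits."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    i = int(n * train_frac)
    j = int(n * (train_frac + valid_frac))
    return pairs[:i], pairs[i:j], pairs[j:]

# Toy stand-ins for real C++ sources and their MCTS-transformed versions.
originals = [f"int f{i}() {{ return {i}; }}" for i in range(10)]
transformed = [f"int f{i}() {{ int x = {i}; return x; }}" for i in range(10)]

train_set, valid_set, test_set = split(build_pairs(originals, transformed))
```

In the real pipeline, each split would be written out as parallel source/target files that the fine-tuning script reads.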

SCAE

Set up the conda environment:

conda env create -f scoder.yml
conda activate scoder

Data preparation:

  1. Untargeted Transformation.
  • Move the training, validation, and testing files from Untargeted_transformation to the data folder under src.
  • In data, there should be train.cpp.text_1.cpp, train.cpp.text_2.cpp, valid.cpp.text_1.cpp, valid.cpp.text_2.cpp, test.cpp.text_1.cpp, and test.cpp.text_2.cpp.
  2. Targeted Transformation.
  • Move the training, validation, and testing files from Targeted_transformation to the data folder under src.
  • The expected file names are the same as for the untargeted set.

The Untargeted and Targeted data files share the same names, so rename or relocate one set before staging the other to avoid conflicts.
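One way to avoid the name clash is to move the first set into its own subfolder before staging the second. The sketch below is a hypothetical layout (the repository does not prescribe one); for demonstration it first creates empty stand-ins for the six untargeted data files.

```shell
# Simulate the six untargeted data files for this demo (touch creates
# empty placeholders; in practice these come from Untargeted_transformation).
mkdir -p src/data
for split in train valid test; do
  for idx in 1 2; do
    touch "src/data/${split}.cpp.text_${idx}.cpp"
  done
done

# Move the untargeted set into its own subfolder so the targeted files
# can be staged under src/data without overwriting anything.
mkdir -p src/data/untargeted
for split in train valid test; do
  for idx in 1 2; do
    f="src/data/${split}.cpp.text_${idx}.cpp"
    if [ -f "$f" ]; then mv "$f" "src/data/untargeted/"; fi
  done
done
```

After this, copy the Targeted_transformation files into src/data and run the fine-tuning step for that setting.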

Pre-trained StructCoder:

  1. Download the pre-trained checkpoint of StructCoder [3] from Google Drive.
  • You can also download it from the original work.
  2. Place it under the saved_models/pretrain folder.

Fine-tune on the translation task:

python3 run_translation.py --do_train --do_eval --do_test --source_lang cpp --target_lang cpp --alpha1_clip -4 --alpha2_clip -4

MAA's Targeted Attack Analysis is presented here.
