
SCAE: Can Seq2Seq Code Transformation Evade Code Authorship Attribution?

Overview

Code authorship attribution [1] is the problem of identifying the author of a piece of source code from the stylistic features in the code, a topic that has recently attracted significant interest and achieved strong accuracy. To defeat attribution, the state-of-the-art approach uses Monte-Carlo Tree Search (MCTS) [2] to find code transformations that obfuscate the code. Although effective at misleading code authorship attribution, MCTS requires an exhaustive search to identify optimal code transformations. Can we efficiently evade code authorship attribution without MCTS?

We present SCAE, a code authorship obfuscation technique that leverages a Seq2Seq code transformer, StructCoder. Unlike MCTS, SCAE saves processing time while maintaining transformation performance. SCAE customizes StructCoder [3], a model originally designed for function-level code translation between languages (e.g., Java to C#), using transfer learning. To alleviate the need for manually transformed training data, we use the outputs of the MCTS method to construct a source-target code pair dataset and fine-tune StructCoder on it for C++-to-C++ code transformation that preserves the stylistic patterns of the target code. Our evaluation shows that SCAE improves efficiency at a slight cost in accuracy compared to MCTS: it reduces processing time by approximately 68% while maintaining an 85% transformation success rate and up to a 95.77% evasion success rate in the untargeted setting. We also show the limitations of Seq2Seq models in the targeted setting.
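The dataset-construction step above (pairing original code with its MCTS-transformed counterpart, then splitting for fine-tuning) can be sketched as follows. This is a minimal illustration, not the actual SCAE pipeline; the function names, split ratios, and toy snippets are assumptions for demonstration only.

```python
# Hypothetical sketch: build a source->target pair dataset from MCTS outputs.
# Each original C++ snippet (source) is paired with its MCTS-obfuscated
# version (target), then shuffled and split for StructCoder fine-tuning.
import random

def build_pairs(originals, transformed):
    """Pair each original snippet with its MCTS-transformed counterpart."""
    assert len(originals) == len(transformed), "one transformed file per original"
    return list(zip(originals, transformed))

def split(pairs, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle deterministically and cut into train/valid/test splits."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    i = int(n * train_frac)
    j = int(n * (train_frac + valid_frac))
    return pairs[:i], pairs[i:j], pairs[j:]

# Toy stand-ins for real C++ sources and their MCTS-transformed versions.
originals = [f"int f{i}() {{ return {i}; }}" for i in range(10)]
transformed = [f"int f{i}() {{ int x = {i}; return x; }}" for i in range(10)]

train_set, valid_set, test_set = split(build_pairs(originals, transformed))
```

In the real pipeline, each split would be written out as parallel source/target files that the fine-tuning script reads.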

SCAE

Set up the conda environment:

conda env create -f scoder.yml
conda activate scoder

Data preparation:

  1. Untargeted Transformation.
  • Move the training, validation, and testing files from Untargeted_transformation to the data folder under src.
  • In data, there should be train.cpp.text_1.cpp, train.cpp.text_2.cpp, valid.cpp.text_1.cpp, valid.cpp.text_2.cpp, test.cpp.text_1.cpp, and test.cpp.text_2.cpp.
  2. Targeted Transformation.
  • Move the training, validation, and testing files from Targeted_transformation to the data folder under src.
  • The expected file names are the same as for the untargeted set.

The Untargeted and Targeted data files share the same names, so rename or relocate one set before staging the other to avoid conflicts.
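One way to avoid the name clash is to move the first set into its own subfolder before staging the second. The sketch below is a hypothetical layout (the repository does not prescribe one); for demonstration it first creates empty stand-ins for the six untargeted data files.

```shell
# Simulate the six untargeted data files for this demo (touch creates
# empty placeholders; in practice these come from Untargeted_transformation).
mkdir -p src/data
for split in train valid test; do
  for idx in 1 2; do
    touch "src/data/${split}.cpp.text_${idx}.cpp"
  done
done

# Move the untargeted set into its own subfolder so the targeted files
# can be staged under src/data without overwriting anything.
mkdir -p src/data/untargeted
for split in train valid test; do
  for idx in 1 2; do
    f="src/data/${split}.cpp.text_${idx}.cpp"
    if [ -f "$f" ]; then mv "$f" "src/data/untargeted/"; fi
  done
done
```

After this, copy the Targeted_transformation files into src/data and run the fine-tuning step for that setting.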

Pre-trained StructCoder:

  1. Download the pre-trained checkpoint of StructCoder [3] from Google Drive.
  • You can also download it from the original work.
  2. Place it under the saved_models/pretrain folder.

Fine-tune on the translation task:

python3 run_translation.py --do_train --do_eval --do_test --source_lang cpp --target_lang cpp --alpha1_clip -4 --alpha2_clip -4

MAA's Targeted Attack Analysis is presented here.
