Initial documentation for de-anonymizing programmers from executable binaries.
For details see the paper: https://www.princeton.edu/~aylinc/papers/caliskan_when.pdf
Please cite (BibTeX entry):
@inproceedings{caliskan2018coding,
  title={When coding style survives compilation: De-anonymizing programmers from executable binaries},
  author={Caliskan, Aylin and Yamaguchi, Fabian and Dauber, Edwin and Harang, Richard and Rieck, Konrad and Greenstadt, Rachel and Narayanan, Arvind},
  booktitle={Network and Distributed System Security Symposium (NDSS) 2018},
  year={2018},
  organization={Internet Society}
}
Requirements:
- IDA Pro and Hex-Rays https://www.hex-rays.com
- LLVM* for obfuscation - ObfuscateBinaries.java https://github.com/obfuscator-llvm/obfuscator/wiki
* optional
- Start from binaries, or, if you have source code, compile it first (CompileCode.java).
- Preprocess the binary:
  - Disassemble,
    - BinaryDisassemble.java
    - bjoernDisassemble.java
  - Decompile to obtain decompiled source code,
    - DecompileBinaries.java
  - Generate abstract syntax trees,
    - FeatureCalculators.java
  - Generate control flow graphs.
    - bjoernGenerateGraphmlCFG.java
- Extract features from four data sources (this produces about 700,000 features for 100 programmers with 9 files each):
  - assembly code,
  - decompiled source code,
  - abstract syntax trees, and
  - control flow graphs.
  FeatureExtractorAllFeatures.java: remove the feature types that you do not want in your feature set.
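This README does not list the individual feature types, so the following is only an illustrative sketch of what one simple feature type could look like: counts of tokens in disassembly text. The class name TokenCountSketch, the method tokenCounts, and the tokenization rule (split on whitespace and commas) are assumptions made for this example and are not taken from FeatureExtractorAllFeatures.java.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the repository.
class TokenCountSketch {

    // Counts how often each token occurs in a piece of disassembly text.
    // Tokens are separated by whitespace and/or commas (an assumed rule).
    static Map<String, Integer> tokenCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.trim().split("[\\s,]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Each distinct token becomes one feature, and its count in a file becomes that file's value for the feature, which is one way the feature count can grow very large across many files.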
- Apply the information gain criterion to keep only the highly effective features.
  - Extract features high in information gain.
    - AuthorClassificationBasic.java
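As a rough illustration of the information gain criterion, the sketch below scores a single feature by how much knowing whether it is present reduces the entropy of the author labels. It assumes the feature has been reduced to present/absent per file; the class name InfoGainSketch and its methods are hypothetical, and AuthorClassificationBasic.java may compute the score differently (for example on numeric feature values).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the repository.
class InfoGainSketch {

    // Shannon entropy (in bits) of a list of author labels; 0 for an empty list.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / labels.size();
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain of a present/absent feature with respect to the authors:
    // H(authors) - P(present) * H(authors | present) - P(absent) * H(authors | absent).
    static double informationGain(String[] authors, boolean[] present) {
        if (authors.length == 0) {
            return 0.0;
        }
        List<String> all = new ArrayList<>();
        List<String> with = new ArrayList<>();
        List<String> without = new ArrayList<>();
        for (int i = 0; i < authors.length; i++) {
            all.add(authors[i]);
            (present[i] ? with : without).add(authors[i]);
        }
        double n = authors.length;
        return entropy(all)
                - (with.size() / n) * entropy(with)
                - (without.size() / n) * entropy(without);
    }
}
```

A feature that separates the authors perfectly gets the highest score, while a feature spread evenly across authors scores zero, so ranking features by this value and keeping the top ones is one way to shrink a very large feature set.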