Obfuscation_Tool

Introduction

Obfuscation_Tool is a Python-based framework for analyzing and quantifying obfuscation in Ethereum smart contract bytecode. This repository implements the methods described in the paper “Obfuscation Unmasked: Revealing Hidden Logic in Ethereum Scam Contracts via Bytecode-Level Transfer Analysis.” The tool extracts seven bytecode-level obfuscation features (F1–F7) from smart contracts, computes a Z-score for each contract, and outputs detailed per-contract metrics.

Key capabilities include:

Single-contract analysis: Parse a raw bytecode string and extract F1–F7 as defined in the paper:
1. F1. Number of steps in address generation
  Backward dataflow analysis on the address variable, counting distinct arithmetic, hash, bitwise, and external-call steps.
2. F2. Number of string operations
  Count of all string-manipulation and hash instructions involved in address generation.
3. F3. Presence of external call
  Binary flag indicating whether any CALL, DELEGATECALL, or STATICCALL appears in the addr/value dataflow.
4. F4. Height of branch tree
  Maximum nesting depth of conditional branches (JUMPI) along the transfer’s control-flow path.
5. F5. Transfer-related instruction ratio (TIR)
  Ratio of effective transfer- and state-update instructions to total instructions in the transfer-residing function.
6. F6. Transfer operation similarity
  Cosine similarity between R-GCN–embedded PDG representations of transfer-containing functions.
7. F7. Relevance of log events
  Binary flag indicating whether logs emitted within two CFG hops of a transfer are semantically relevant.

Batch analysis (multi-threaded): Use Dask to process a CSV of bytecode strings in parallel and store all per-contract results (F1–F7 and Z-score) in output.csv.

Reproducibility: Includes scripts for training Word2Vec embeddings (word2vec_Train/) and for running on a Dask cluster.

Note: The current code outputs eight values rather than seven. The 5th value indicates whether the arguments of the transfer instruction are externally controlled, and the 8th indicates whether the conditions under which the transfer executes are externally controlled. Together, these two values make up feature F3.
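
To make three of the feature definitions concrete, here is a minimal sketch on toy inputs. It is not the repository's implementation: the function names, the dictionary-based branch tree, the choice of CALL and SSTORE as "transfer-related" opcodes, and the plain-vector stand-ins for R-GCN embeddings are all assumptions made for illustration.

```python
import math


def branch_tree_height(children, root):
    """Count the nodes on the longest root-to-leaf chain of nested branches.

    `children` maps a branch node to the branches nested inside it
    (hypothetical representation; a leaf has height 1 by this convention).
    """
    nested = children.get(root, [])
    if not nested:
        return 1
    return 1 + max(branch_tree_height(children, child) for child in nested)


def transfer_instruction_ratio(opcodes, relevant):
    """Fraction of a function's instructions that fall in the `relevant` set."""
    if not opcodes:
        return 0.0
    return sum(1 for op in opcodes if op in relevant) / len(opcodes)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Toy branch tree: "a" contains "b" and "c"; "b" contains "d".
    print(branch_tree_height({"a": ["b", "c"], "b": ["d"]}, "a"))
    # Assumed set of transfer/state-update opcodes, for illustration only.
    print(transfer_instruction_ratio(["PUSH1", "CALL", "SSTORE", "POP"], {"CALL", "SSTORE"}))
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

In the tool itself these quantities are derived from the recovered control flow and dataflow of the bytecode, so the inputs are considerably richer than the toy structures used here.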

rattle-cli.py: The single-contract CLI entry point. Reads a raw bytecode string from STDIN, invokes the analysis pipeline, and prints a human-readable report followed by a final numeric summary (a list of values).

rundask.py: The batch-mode script. It expects a CSV file with a header row and two columns:

bytecode,address
<hex string without 0x>,<contract address>
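
If your bytecode comes from another source (for example, with a leading 0x), a small script can produce a CSV in this layout. The following is a sketch using only the standard library; `normalize_bytecode` and `write_input_csv` are hypothetical helper names, not part of the repository.

```python
import csv


def normalize_bytecode(raw):
    """Strip surrounding whitespace and an optional 0x prefix, then validate the hex."""
    text = raw.strip()
    if text[:2].lower() == "0x":
        text = text[2:]
    bytes.fromhex(text)  # raises ValueError on non-hex characters or odd length
    return text


def write_input_csv(path, contracts):
    """Write (address, bytecode) pairs as a CSV with the header bytecode,address."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["bytecode", "address"])
        for address, raw in contracts:
            writer.writerow([normalize_bytecode(raw), address])


if __name__ == "__main__":
    # Placeholder address and bytecode fragment, for illustration only.
    write_input_csv("my_contracts.csv", [("0xExampleAddress", "0x6080604052")])
```

Validating the hex up front means a malformed row fails here rather than partway through a long batch run.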

Modify the csv_path variable near the top of rundask.py to point to your CSV (e.g., test.csv). Then run with:

python rundask.py

This will:

- Spin up a local Dask cluster.
- Repartition the dataset (by default, from 1 partition to 10).
- Dispatch each partition to a worker, run the same analysis pipeline as rattle-cli.py, and collect the results.
- Save all per-contract metrics to output.csv in the repository root.
- Print progress messages like:

Dashboard: http://127.0.0.1:8787/status
Initial partitions: 1
Final partitions: 10
✅ Process 0 completed: ...
…
✅ All tasks finished! Saved to output.csv
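
The partition, dispatch, and collect pattern described above can be illustrated without Dask. The sketch below deliberately swaps in the standard library's ThreadPoolExecutor as a stand-in, so it is not how rundask.py is written; `split_into_partitions`, `run_batch`, and the `analyze` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, n_partitions):
    """Split rows into at most n_partitions contiguous, near-equal chunks."""
    count = max(1, min(n_partitions, len(rows)))
    size, extra = divmod(len(rows), count)
    parts, start = [], 0
    for index in range(count):
        end = start + size + (1 if index < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def analyze_partition(partition, analyze):
    """Run the per-contract analysis on every (bytecode, address) row of one chunk."""
    return [(address, analyze(bytecode)) for bytecode, address in partition]


def run_batch(rows, analyze, n_partitions=10):
    """Dispatch each partition to a worker thread and collect results in input order."""
    parts = split_into_partitions(rows, n_partitions)
    with ThreadPoolExecutor() as pool:
        chunks = list(pool.map(lambda part: analyze_partition(part, analyze), parts))
    return [item for chunk in chunks for item in chunk]


if __name__ == "__main__":
    demo_rows = [("6080", "addr0"), ("60806040", "addr1")]
    # `len` stands in for the real analysis pipeline.
    print(run_batch(demo_rows, len))
```

Dask adds what this stand-in lacks, such as process-based or distributed workers and the monitoring dashboard shown in the progress output.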

Installation

Clone the repository:

git clone https://github.com/dcszhang/Obfuscation_Tool.git
cd Obfuscation_Tool

Set up a Python 3.10+ environment (recommended: use venv or conda):

python3 -m venv venv
source venv/bin/activate    # macOS/Linux
# OR
# venv\Scripts\activate     # Windows PowerShell

Install required packages. Run:
```
pip install \
    "dask[complete]" \
    pandas \
    numpy \
    networkx \
    scikit-learn \
    gensim \
    pyparsing \
    tqdm
```
Adjust versions as needed. For GPU-accelerated experiments (R-GCN, t-SNE with CUDA), install the appropriate CUDA-enabled libraries.

Usage

1. Single‐Contract Analysis

This mode is useful when you have a single raw bytecode string (no file extension). For example:

# Suppose you have a file `example_bytecode.txt` containing:
# 608060405234801561001057600080fd5b506040516101003803806101008339818101604052
# You can run:
python rattle-cli.py < example_bytecode.txt

The command prints a formatted report followed by the numeric summary, for example:

--------------------------------------------------------------------------------------
                         Smart contract analysis process
--------------------------------------------------------------------------------------
This is the 1 transfer
    (1) Found Transfer Address instruction:
            %145 = AND(%144, #4c36d2919e407f0cc2ee3c993ccf8ac26d9ce64e)
    (2) trace_step: 3
    (3) Tree height: 4
    (4) String Operation times: 0
-----------------------------------------------------------
This is the 2 transfer
    …

--------------------------------------------------------------------------------------
                                  END
--------------------------------------------------------------------------------------
[3, 4, 0, 91.01775288581848, False, 0.5882352941176471, False, False]
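
Because the final line is a Python-style list, it can be recovered programmatically. The following is a minimal sketch, assuming the summary is always the last non-empty line of standard output, as in the example above; `parse_summary` and `run_cli` are hypothetical helper names, not part of the repository.

```python
import ast
import subprocess
import sys


def parse_summary(report_text):
    """Return the list printed on the last non-empty line of a report."""
    lines = [line.strip() for line in report_text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty report")
    values = ast.literal_eval(lines[-1])  # safe parsing of literals only
    if not isinstance(values, list):
        raise ValueError("last line is not a list")
    return values


def run_cli(bytecode_path, cli_path="rattle-cli.py"):
    """Feed a bytecode file to the CLI on STDIN and parse the summary it prints."""
    with open(bytecode_path, "rb") as handle:
        completed = subprocess.run(
            [sys.executable, cli_path],
            stdin=handle,
            capture_output=True,
            text=True,
            check=True,
        )
    return parse_summary(completed.stdout)


if __name__ == "__main__":
    sample = "END\n[3, 4, 0, 91.01775288581848, False, 0.5882352941176471, False, False]\n"
    print(parse_summary(sample))
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to plain literals such as numbers and booleans.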

2. Batch Analysis (Multi‐Threaded via Dask)

When you have many contracts to analyze, you can process them in parallel:

Prepare a CSV (test.csv or your own) with two columns and a header row:
```
bytecode,address
6080604052348015610010...,0xAbCdEf123...
60806040526004361...,0xDeFaCe456...
…
```
- bytecode: Hex string (no 0x prefix, no file extension) per row.
- address: Contract address or identifier (used for labeling output).
Edit rundask.py:
- Open rundask.py in a text editor.
- Modify the csv_path variable (near the top) to point to your CSV file. For example:
```
csv_path = "test.csv"
```
Run the batch script:
```
python rundask.py
```
You will see output like:
```
Dashboard: http://127.0.0.1:8787/status
Initial partitions: 1
Final partitions: 10
✅ Process 0 completed: <contract_address_1>
✅ Process 1 completed: <contract_address_2>
✅ Process 2 completed: <contract_address_3>
…
✅ All tasks finished! Saved to output.csv
```
- A Dask dashboard will be available at http://127.0.0.1:8787/status (open in browser to monitor real‐time progress).
- The script automatically repartitions the dataset (default: from 1 to 10 partitions) and distributes tasks across available CPU cores.
- Once all partitions are processed, a single file output.csv will be created in the repository root.

Contributing

Contributions are welcome! If you find a bug or want to add a new feature (e.g., additional obfuscation metrics, support for alternative EVM versions), please:

Fork the repository.
Create a new feature branch (git checkout -b feature/YourFeature).
Make your changes, ensuring all existing tests pass.
Submit a Pull Request with a clear description of your changes.

Please follow PEP 8 style guidelines and add appropriate documentation or unit tests for new modules.

License

This project is released under the MIT License. See LICENSE for details.
