CovBooster: Coverage Booster for Binary Code Clone Detection by Reduced Signatures.
This work has been accepted for presentation at The 41st ACM/SIGAPP Symposium On Applied Computing (SAC 2026).
This repository contains the implementation and evaluation code for the CovBooster approach, which uses dominating set algorithms to improve binary function detection coverage.
CovBooster is a novel approach for binary function detection that leverages dominating set algorithms to select optimal binary sets for function matching.
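To give a rough idea of what a dominating set selection looks like, the sketch below runs a generic greedy approximation on a small undirected graph. It is not taken from ds_algo.py: the function name `greedy_dominating_set`, the adjacency-dictionary input, and the reading of nodes as binary variants and edges as "similar enough to match" are all assumptions made for illustration.

```python
def greedy_dominating_set(adjacency):
    """Return nodes such that every node is selected or adjacent to a selected node.

    `adjacency` maps each node to the set of its neighbours (undirected graph).
    Generic greedy heuristic: repeatedly pick the node that covers the most
    still-uncovered nodes. The result is valid but not guaranteed to be minimum.
    """
    uncovered = set(adjacency)
    chosen = set()
    while uncovered:
        # A node covers itself and its neighbours; count only uncovered ones.
        # Sorting makes tie-breaking deterministic.
        best = max(
            sorted(adjacency),
            key=lambda n: len(({n} | adjacency[n]) & uncovered),
        )
        chosen.add(best)
        uncovered -= {best} | adjacency[best]
    return chosen


# Hypothetical graph: nodes are binary variants, edges mean "similar enough".
graph = {
    "gcc_O0": {"gcc_O1"},
    "gcc_O1": {"gcc_O0", "gcc_O2"},
    "gcc_O2": {"gcc_O1", "gcc_O3"},
    "gcc_O3": {"gcc_O2"},
    "clang_O0": set(),
}
selected = greedy_dominating_set(graph)
```

An isolated node such as `clang_O0` must always be selected, since no other node can cover it.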
- Python 3.7 or higher
- Required Python packages (see requirements.txt)
# Clone the repository
git clone <repository-url>
cd CovBooster-public
# Install dependencies
pip install -r requirements.txt
The code expects TLSH hash files organized in the following structure:
<db_root>/
├── <binary_group_1>/
│   ├── <binary_1>/
│   │   ├── <function_1>.tlsh
│   │   ├── <function_2>.tlsh
│   │   └── ...
│   └── <binary_2>/
│       └── ...
└── <binary_group_2>/
    └── ...
Each .tlsh file should contain:
- Line 1: TLSH hash value
- Line 2: Function size (strand size)
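The layout and file format above could be read with a loader along these lines. This is a minimal sketch, not the repository's own loader: the helper names `read_tlsh_file` and `load_database` are hypothetical, and it assumes the size on line 2 is an integer.

```python
from pathlib import Path


def read_tlsh_file(path):
    """Parse one .tlsh file: line 1 is the TLSH hash, line 2 is the function size."""
    lines = Path(path).read_text().splitlines()
    if len(lines) < 2:
        raise ValueError(f"{path}: expected two lines (hash, size)")
    # Assumption: the size is stored as a plain integer.
    return lines[0].strip(), int(lines[1].strip())


def load_database(db_root):
    """Walk <db_root>/<binary_group>/<binary>/<function>.tlsh into nested dicts.

    Returns {group: {binary: {function: (tlsh_hash, size)}}}.
    """
    database = {}
    for tlsh_path in sorted(Path(db_root).glob("*/*/*.tlsh")):
        binary_dir = tlsh_path.parent
        group = binary_dir.parent.name
        functions = database.setdefault(group, {}).setdefault(binary_dir.name, {})
        functions[tlsh_path.stem] = read_tlsh_file(tlsh_path)
    return database
```

Calling `load_database("sample_data")` would then return one top-level key per binary group found under that directory.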
Sample Data: This repository includes a sample_data/ directory containing TLSH hash files for testing. The sample data includes:
- 5 binary groups: bool, direvent, gmp, libcrypto, libssl
- Multiple compiler versions (clang 4.0-7.0, gcc 4.9.4-8.2.0)
- Multiple architectures (arm_32, arm_64, x86_32, x86_64)
- Multiple optimization levels (O0, O1, O2, O3)
- TLSH hash files across multiple binary variants
To use the sample data:
python3 evaluation_dominating.py sample_data 30 test_results
Repository structure:
Github/
├── README.md # This file
├── requirements.txt # Python dependencies
├── ds_algo.py # Dominating set algorithm implementation
├── dominating_set.py # Dominating set construction and evaluation
├── evaluation_dominating.py # Main evaluation script with dominating set approach
├── threshold_sensitivity_analysis.py # Threshold sensitivity analysis
├── THRESHOLD_ANALYSIS_README.md # Detailed threshold analysis documentation
└── sample_data/ # Sample TLSH hash files for testing
    ├── bool/ # bool binary group
    ├── direvent/ # direvent binary group
    ├── gmp/ # gmp binary group
    ├── libcrypto/ # libcrypto binary group
    └── libssl/ # libssl binary group
Run the main evaluation script with dominating set approach:
python3 evaluation_dominating.py <db_root> <base_result_directory>
This will:
- Test multiple threshold values (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40)
- Generate results for each threshold automatically
- Save results in timestamped directories under <base_result_directory>/exp_<timestamp>/threshold_<value>/
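For scripting around the results, the output layout described above can be reconstructed as follows. This is only a sketch: the helper `threshold_result_dirs` is hypothetical, and the timestamp format `%Y%m%d_%H%M%S` is an assumption, since the format the script actually uses is not stated here.

```python
from datetime import datetime
from pathlib import Path

# Threshold values listed above.
THRESHOLDS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40]


def threshold_result_dirs(base_result_directory, timestamp=None):
    """Return {threshold: Path} following <base>/exp_<timestamp>/threshold_<value>/."""
    if timestamp is None:
        # Assumed format; check an actual exp_* directory name before relying on it.
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    exp_dir = Path(base_result_directory) / f"exp_{timestamp}"
    return {t: exp_dir / f"threshold_{t}" for t in THRESHOLDS}
```

The parent of any returned path is the experiment directory that the threshold sensitivity analysis takes as `<exp_dir>`.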
Generate ROC/PR curves and detailed analysis from evaluation results:
python3 threshold_sensitivity_analysis.py <exp_dir>
This generates:
- ROC and PR curves
- Detailed performance analysis
- Threshold sensitivity results CSV
For each binary group, the following files are generated:
- dominating_set_metrics.csv: Main metrics (Precision, Recall, F1-score, etc.)
- false_positives.csv: False positive cases
- topk_matches.csv: Top-K matching results
- grid_search_results.csv: Grid search parameter optimization results
The threshold sensitivity analysis generates:
- threshold_sensitivity_results.csv: Threshold sensitivity analysis
- threshold_roc_pr_curves.png: ROC and PR curves
- threshold_detailed_analysis.png: Detailed performance analysis
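The Precision, Recall, and F1-score values in dominating_set_metrics.csv presumably follow the standard definitions, which the sketch below spells out. How the evaluation scripts count true positives, false positives, and false negatives is not described here, so the function name and its inputs are illustrative assumptions.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard precision, recall, and F1 from raw match counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, 8 correct matches with 2 false positives and 8 missed functions give a precision of 0.8 and a recall of 0.5.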
If you find CovBooster useful in your research, please cite:
CovBooster: Coverage Booster for Binary Code Clone Detection by Reduced Signatures.
To appear in The 41st ACM/SIGAPP Symposium on Applied Computing (SAC 2026).
Key parameters that can be adjusted:
- TLSH_THRESHOLD: TLSH similarity threshold (default: 0-40)
- SIZE_DIFF_THRESHOLD: Maximum size difference ratio (default: 0.3)
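One plausible way these two parameters combine into a match decision is sketched below. TLSH comparison yields a distance score where 0 means identical hashes, so the threshold is treated here as an upper bound. The function `is_candidate_match`, the inclusive comparisons, and the normalization of the size difference by the larger size are assumptions, not the repository's confirmed logic.

```python
TLSH_THRESHOLD = 30        # one value from the 0-40 range listed above
SIZE_DIFF_THRESHOLD = 0.3  # default listed above


def is_candidate_match(tlsh_distance, size_a, size_b,
                       tlsh_threshold=TLSH_THRESHOLD,
                       size_diff_threshold=SIZE_DIFF_THRESHOLD):
    """Accept a function pair when both the TLSH score and the size ratio are in bounds.

    `tlsh_distance` is a precomputed TLSH distance (lower means more similar).
    """
    larger = max(size_a, size_b)
    if larger == 0:
        return False
    # Assumed definition: absolute size difference relative to the larger function.
    size_diff_ratio = abs(size_a - size_b) / larger
    return tlsh_distance <= tlsh_threshold and size_diff_ratio <= size_diff_threshold
```

Under these assumptions, raising TLSH_THRESHOLD admits more distant hash pairs, while SIZE_DIFF_THRESHOLD filters out pairs whose sizes differ too much regardless of hash distance.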
For questions or issues, please open an issue on the repository or contact me by email (jeongwoo@korea.ac.kr).