A machine learning-based detector for identifying UPX-packed ELF malware using n-gram feature extraction and Support Vector Machine (SVM) classification.
upxelfdet is a Python tool designed for malware analysis and research. It extracts features from ELF binary sections, vectorizes them using n-gram methods, and applies machine learning models either to classify whether binaries are packed with UPX or to identify malware families.
Key Features:
- ELF Binary Analysis: Extracts features from specific sections of ELF files
- N-gram Vectorization: Converts binary features into numeric vectors using configurable n-gram sizes
- SVM Classification: Trains and evaluates Support Vector Machine models
- Flexible Configuration: JSON-based configuration for easy experimentation
- CLI Interface: Command-line tools for training, evaluation, and prediction
- Structured Logging: Comprehensive logging with both human-readable and JSON formats
- Python >= 3.12
- pip or uv (recommended)
# Clone the repository
git clone https://github.com/bolin8017/upxelfdet.git
cd upxelfdet
# Install dependencies (using uv - recommended)
uv pip install -e .
# Or using pip
pip install -e .
pip install upxelfdet
- Prepare your dataset: Organize ELF binaries in input/dataset/ and create CSV files with labels.
- Configure the detector: Edit config.json to set paths and parameters.
- Train the model:
upxelfdet train --config config.json
- Evaluate performance:
upxelfdet evaluate --config config.json
- Make predictions:
upxelfdet predict --config config.json
Create or modify config.json:
{
  "data": {
    "train": "./input/train.csv",
    "test": "./input/test.csv",
    "predict": "./input/test.csv",
    "dataset": "./data/samples"
  },
  "output": {
    "feature": "./output/features",
    "model": "./output/model",
    "prediction": "./output/predictions/predictions.csv",
    "log": "./output/logs"
  },
  "feature": {
    "section_name": ".block_1"
  },
  "vectorize": {
    "method": "ngram_numeric",
    "size_features": 256,
    "offset": 0,
    "ngram_size": 2,
    "encoding": "TF"
  },
  "model": {
    "type": "SVM",
    "params": {
      "C": 100,
      "gamma": 0.001,
      "kernel": "rbf"
    }
  },
  "classify": true,
  "seed": 8017
}
Configuration Options:
- data.train: Path to training CSV file
- data.test: Path to test CSV file
- data.dataset: Directory containing ELF binary files
- feature.section_name: ELF section to extract features from (e.g., .block_1)
- vectorize.method: Vectorization method (ngram_numeric or raw_bytes)
- vectorize.ngram_size: Size of n-grams (typically 2-4)
- vectorize.encoding: Encoding method (TF for term frequency)
- model.type: Model type (currently SVM)
- classify: If true, performs multi-class classification; if false, binary classification
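Before starting a long training run, it can help to check that a configuration file contains the keys described above. The following is a minimal sketch using only the standard library. The helper `check_config` and the `REQUIRED_KEYS` table are hypothetical and not part of upxelfdet; the tool's own validation in `config.py` may behave differently.

```python
import json

# Hypothetical pre-flight check; NOT part of upxelfdet's API.
# The key names mirror the configuration options documented above.
REQUIRED_KEYS = {
    "data": ("train", "test", "dataset"),
    "vectorize": ("method", "ngram_size", "encoding"),
    "model": ("type", "params"),
}
KNOWN_METHODS = ("ngram_numeric", "raw_bytes")


def check_config(cfg: dict) -> list[str]:
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        block = cfg.get(section)
        if not isinstance(block, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in block:
                problems.append(f"missing key: {section}.{key}")
    vectorize = cfg.get("vectorize")
    method = vectorize.get("method") if isinstance(vectorize, dict) else None
    if method is not None and method not in KNOWN_METHODS:
        problems.append(f"unknown vectorize.method: {method}")
    return problems


# A trimmed copy of the example configuration, inlined so the sketch runs as-is.
sample = json.loads("""
{
  "data": {"train": "./input/train.csv", "test": "./input/test.csv",
           "dataset": "./data/samples"},
  "vectorize": {"method": "ngram_numeric", "ngram_size": 2, "encoding": "TF"},
  "model": {"type": "SVM", "params": {"C": 100, "gamma": 0.001, "kernel": "rbf"}}
}
""")
print(check_config(sample))
```

To check a real file, replace the inline string with `json.load(open("config.json"))`.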
Train a new model using your dataset:
upxelfdet train --config config.json
What happens during training:
- Loads training data from CSV
- Extracts features from ELF binaries in the dataset directory
- Vectorizes features using the specified method
- Trains an SVM model with configured parameters
- Saves the trained model to output/model/
Output:
- Trained model files in output/model/
- Feature extraction results in output/features/
- Vectorization results in output/vectorize/
- Training logs in output/logs/
Evaluate model performance on test data:
upxelfdet evaluate --config config.json
Metrics reported:
- Accuracy
- Precision
- Recall
- F1 Score
- Confusion Matrix
- Classification Report (for multi-class)
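To make the relationship between these metrics concrete, here is a small standard-library sketch for the binary case. It is an illustration only: the function `binary_metrics` is hypothetical, the label convention (1 = UPX-packed, 0 = not packed) is an assumption, and upxelfdet itself presumably computes these values through scikit-learn. The confusion matrix uses the `[[TN, FP], [FN, TP]]` layout that scikit-learn produces for labels `[0, 1]`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F1 and a 2x2 confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    total = tp + fp + fn + tn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Toy labels, invented for illustration (assumed: 1 = packed, 0 = not packed).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```

For multi-class runs (`classify: true`), the same quantities are computed per class and then averaged, which is what a classification report summarizes.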
Make predictions on new samples:
upxelfdet predict --config config.json
Predictions are saved to the path specified in config.output.prediction.
You can also use the detector programmatically:
from upxelfdet import UpxElfDetector
from upxelfdet.config import UpxElfDetectorConfig
# Load configuration
config = UpxElfDetectorConfig.from_file("config.json")
# Initialize detector
detector = UpxElfDetector(config)
# Train model
model_path = detector.train()
# Evaluate model
metrics = detector.evaluate()
print(f"Accuracy: {metrics['accuracy']:.4f}")
# Make predictions
predictions_path = detector.predict()
See examples/basic_usage.py for a complete example.
upxelfdet/
├── src/
│   └── upxelfdet/
│       ├── __init__.py
│       ├── cli.py              # Command-line interface
│       ├── config.py           # Configuration management
│       ├── detector.py         # Main detector class
│       ├── constants.py        # Constants and defaults
│       ├── exceptions.py       # Custom exceptions
│       ├── logging.py          # Logging configuration
│       ├── feature/            # Feature extraction
│       │   ├── __init__.py
│       │   └── extractor.py
│       ├── vectorizer/         # Vectorization methods
│       │   ├── __init__.py
│       │   ├── base.py
│       │   ├── ngram_numeric.py
│       │   ├── raw_bytes.py
│       │   └── factory.py
│       ├── model/              # ML models
│       │   ├── __init__.py
│       │   ├── base.py
│       │   ├── svm.py
│       │   └── factory.py
│       └── predictor/          # Prediction logic
│           ├── __init__.py
│           └── predictor.py
├── tests/                      # Unit tests
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_config.py
│   └── test_detector.py
├── examples/                   # Usage examples
│   └── basic_usage.py
├── data/                       # Example data (see data/README.md)
│   ├── samples/
│   └── README.md
├── input/                      # Input data (not in repo)
│   ├── dataset/                # ELF binaries (excluded)
│   ├── train.csv               # Training labels (excluded)
│   └── test.csv                # Test labels (excluded)
├── output/                     # Output directories
│   ├── features/               # Extracted features
│   ├── vectorize/              # Vectorized features
│   ├── model/                  # Trained models
│   ├── predictions/            # Prediction results
│   └── logs/                   # Log files
├── config.json                 # Configuration file
├── pyproject.toml              # Project metadata and dependencies
├── LICENSE                     # MIT License
├── README.md                   # This file
└── .gitignore                  # Git ignore rules
- Input: ELF binary files + CSV with labels
- Feature Extraction: Extract specified section (e.g., .block_1) from ELF
- Vectorization: Convert binary data to numeric vectors using n-grams
- Model Training: Train SVM classifier on vectorized features
- Evaluation/Prediction: Apply trained model to new samples
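The vectorization step in this pipeline can be sketched as follows. This is one plausible reading of the `ngram_size`, `size_features`, `offset`, and `TF` settings, not the actual `ngram_numeric` implementation, which may differ. The function name `ngram_tf_vector` and the use of CRC32 to map each n-gram into a fixed number of buckets are assumptions made for illustration.

```python
import zlib


def ngram_tf_vector(data: bytes, ngram_size: int = 2,
                    size_features: int = 256, offset: int = 0) -> list[float]:
    """Map raw section bytes to a fixed-length term-frequency vector.

    Each overlapping byte n-gram is hashed into one of `size_features`
    buckets (assumed scheme), and bucket counts are divided by the total
    number of n-grams so the vector sums to 1.
    """
    payload = data[offset:]
    total = len(payload) - ngram_size + 1
    if total <= 0:
        # Section too short to contain a single n-gram.
        return [0.0] * size_features

    counts = [0] * size_features
    for i in range(total):
        gram = payload[i:i + ngram_size]
        counts[zlib.crc32(gram) % size_features] += 1
    return [count / total for count in counts]


# Toy input standing in for the bytes of an extracted ELF section.
section_bytes = b"UPX!UP"
vector = ngram_tf_vector(section_bytes, ngram_size=2, size_features=256)
nonzero = {index: value for index, value in enumerate(vector) if value}
print(len(vector), nonzero)
```

Hashing keeps the vector length fixed regardless of `ngram_size`, at the cost of occasional bucket collisions between different n-grams.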
- FeatureExtractor: Extracts binary sections from ELF files using upx-elf-parser
- Vectorizer: Implements different vectorization strategies (n-gram, raw bytes)
- Model: Wraps scikit-learn models with consistent interface
- Predictor: Handles the complete prediction pipeline
- UpxElfDetector: Main orchestrator class that coordinates all components
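The project layout pairs each family of components with a `base.py` and a `factory.py`, which suggests a strategy-plus-factory arrangement. The sketch below shows how such a design is commonly written; every name in it (`Vectorizer`, `RawBytesVectorizer`, `create_vectorizer`, `transform`) is hypothetical and is not taken from upxelfdet's source, so treat it as a picture of the pattern rather than of the real classes.

```python
from abc import ABC, abstractmethod


class Vectorizer(ABC):
    """Hypothetical common interface for vectorization strategies."""

    @abstractmethod
    def transform(self, data: bytes) -> list[float]:
        """Turn raw section bytes into a fixed-length numeric vector."""


class RawBytesVectorizer(Vectorizer):
    """Uses the byte values themselves, truncated or zero-padded to a fixed size."""

    def __init__(self, size_features: int = 256, offset: int = 0):
        self.size_features = size_features
        self.offset = offset

    def transform(self, data: bytes) -> list[float]:
        window = data[self.offset:self.offset + self.size_features]
        values = [float(byte) for byte in window]
        # Pad short inputs so every sample has the same dimensionality.
        return values + [0.0] * (self.size_features - len(values))


# Registry mapping a configuration string to a strategy class (assumed design).
_REGISTRY: dict[str, type[Vectorizer]] = {
    "raw_bytes": RawBytesVectorizer,
}


def create_vectorizer(method: str, **params) -> Vectorizer:
    """Look up the strategy named in the configuration and instantiate it."""
    try:
        cls = _REGISTRY[method]
    except KeyError:
        raise ValueError(f"unknown vectorization method: {method!r}") from None
    return cls(**params)


vectorizer = create_vectorizer("raw_bytes", size_features=4)
print(vectorizer.transform(b"\x01\x02"))
```

With this arrangement, supporting another method means adding one class and one registry entry, while the code that calls `create_vectorizer` stays unchanged.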
from upxelfdet import UpxElfDetector
from upxelfdet.config import UpxElfDetectorConfig
config = UpxElfDetectorConfig.from_file("config.json")
detector = UpxElfDetector(config)
# Train and evaluate
detector.train()
metrics = detector.evaluate()
# Build the configuration in code instead of loading config.json
from upxelfdet import UpxElfDetector
from upxelfdet.config import (
    UpxElfDetectorConfig,
    DataConfig,
    VectorizeConfig,
    ModelConfig,
)
config = UpxElfDetectorConfig(
    data=DataConfig(
        train="./my_train.csv",
        test="./my_test.csv",
        dataset="./my_dataset",
    ),
    vectorize=VectorizeConfig(
        method="ngram_numeric",
        ngram_size=3,
        size_features=512,
    ),
    model=ModelConfig(
        type="SVM",
        params={"C": 10, "kernel": "linear"},
    ),
)
detector = UpxElfDetector(config)
detector.train()
See examples/basic_usage.py for a complete working example.
# Clone repository
git clone https://github.com/bolin8017/upxelfdet.git
cd upxelfdet
# Install with development dependencies
uv pip install -e ".[dev]"
# Run tests
pytest tests/
This project uses:
- ruff: For linting and formatting
- mypy: For type checking
- pytest: For testing
# Lint code
ruff check src/ tests/
# Format code
ruff format src/ tests/
# Type check
mypy src/
This project is licensed under the MIT License. See LICENSE for details.
If you use this tool in your research, please cite:
@software{upxelfdet,
  author = {bolin8017},
  title = {upxelfdet: Machine Learning-Based Detection for UPX-Packed ELF Malware},
  year = {2025},
  url = {https://github.com/bolin8017/upxelfdet}
}
This project builds upon:
- islab-malware-detector: Base malware detection framework
- upx-elf-parser: ELF parsing utilities
- scikit-learn: Machine learning library
- Do not use this tool for malicious activities
- Handle malware samples with extreme caution
- Use isolated environments when analyzing malicious binaries
- Comply with all applicable laws and regulations
For questions, issues, or contributions:
- Issues: GitHub Issues (https://github.com/bolin8017/upxelfdet/issues)
- Repository: GitHub (https://github.com/bolin8017/upxelfdet)
Note: This project is under active development. APIs and features may change.