These results have been (partially) funded by the Cátedra Internacional UMA 2023, which is part of the Global Security Innovation Programme for the promotion of Cybersecurity Chairs in Spain, financed by the European Union NextGeneration-EU funds through the Instituto Nacional de Ciberseguridad (INCIBE).
This repository provides an end-to-end pipeline for malware family classification using multimodal data:
- Static features
- Dynamic behavior features
- Visual features from binary-to-image representations
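As an illustration of the binary-to-image idea, the sketch below maps each byte of a file to one grayscale pixel and wraps the byte stream into fixed-width rows. The function name `bytes_to_grayscale_rows`, the default width of 256, and the zero padding are assumptions made for this example; the repository's own image extractor may use a different width, resizing step, or channel layout.

```python
def bytes_to_grayscale_rows(data: bytes, width: int = 256) -> list[list[int]]:
    """Map each byte to one pixel value (0-255) and wrap rows at `width`.

    Hypothetical helper: the width and zero padding are illustrative choices.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    remainder = len(data) % width
    if remainder or not data:
        # Zero-pad the tail so the last row is complete (empty input gives one blank row).
        data += bytes(width - remainder)
    return [list(data[i:i + width]) for i in range(0, len(data), width)]
```

The resulting rows can be handed to any image library as an 8-bit single-channel image before being fed to a CNN.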
The project compares fusion strategies at scale:
- Early fusion
- Expert selection
- Late fusion
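To make the three strategies concrete, here is a minimal sketch in plain Python. It rests on assumptions: early fusion is shown as feature concatenation, late fusion as an unweighted average of per-modality class probabilities, and expert selection as picking the most confident per-modality model. These are common readings of the terms, not necessarily the exact rules implemented in this repository, and all function names are hypothetical.

```python
def early_fusion(static, dynamic, visual):
    """Concatenate per-modality feature vectors into one input vector."""
    return list(static) + list(dynamic) + list(visual)


def late_fusion(prob_vectors):
    """Average class-probability vectors from independent per-modality models.

    Assumption: unweighted mean; weighted or learned combiners are also possible.
    """
    if len({len(p) for p in prob_vectors}) != 1:
        raise ValueError("need at least one vector, all of the same length")
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


def expert_selection(prob_vectors):
    """Return the output of the most confident model (highest top probability).

    Assumption: confidence is measured by the maximum class probability.
    """
    return max(prob_vectors, key=max)


def predict_class(probs):
    """Index of the highest-probability class."""
    return max(range(len(probs)), key=probs.__getitem__)
```

With early fusion a single model is trained on the concatenated vector, while the other two strategies combine the outputs of models trained separately per modality.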
It includes data preparation, feature extraction, model training, explainability scripts, statistical comparison, and visualization utilities.
- 0_organize_dataset.py: Organizes raw samples into the expected dataset structure.
- 1_split_dataset.py: Creates train/test splits.
- 2_train_cnn.py: Trains the CNN model for visual features.
- 3_build_features.py: Builds feature datasets from extractors.
- 4_train_model.py: Trains single models.
- 5_train_multiclassifier.py: Trains multi-classifier pipelines.
- predict.py: Runs inference on new samples.
- config.py: Central configuration.
- src/: Core implementation.
  - feature_extraction/: Static, dynamic, image, and multi-feature extractors.
  - models/: ML models (SVM, Random Forest, LightGBM, XGBoost, Voting Ensemble).
  - experiment.py: Experiment orchestration.
- dataset/: Processed CSVs, notebook, and split sample directories.
- tests/: Evaluation scripts for fusion strategies, XAI, and McNemar testing.
- graphics/: Plotting scripts and generated figures.
- weights/: Saved model weights/checkpoints.
- dataset/dataset_processed.csv: Full processed dataset.
- dataset/dataset_processed_train.csv: Training split metadata.
- dataset/dataset_processed_test.csv: Test split metadata.
- dataset/samples_split/train/: Files organized by malware family for training.
- dataset/samples_split/test/: Files organized by malware family for testing.
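A quick way to sanity-check the split directories is to count the files under each family folder. The sketch below assumes one sub-directory per family directly under the split directory, as described above; the helper name `count_samples_per_family` is hypothetical and not part of the repository.

```python
from pathlib import Path


def count_samples_per_family(split_dir):
    """Return {family_name: number_of_files} for one split directory.

    Assumes each immediate sub-directory is a family and holds sample files directly.
    """
    root = Path(split_dir)
    return {
        family.name: sum(1 for entry in family.iterdir() if entry.is_file())
        for family in sorted(root.iterdir())
        if family.is_dir()
    }
```

Calling it on dataset/samples_split/train and dataset/samples_split/test and comparing the two sets of keys shows whether any family is missing from one of the splits.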
Current families include:
- Adware
- Downloaders
- Goodware
- Infectors
- Malware_Generic
- Obfuscated
- Other_Malware
- Ransomware
- Spying_Stealing
- Trojan
- Install dependencies:
pip install -r requirements.txt
- Prepare and split data:
python 0_organize_dataset.py
python 1_split_dataset.py
- Train visual expert (CNN):
python 2_train_cnn.py
- Build multimodal features:
python 3_build_features.py
- Train baseline or fusion models:
python 4_train_model.py
python 5_train_multiclassifier.py
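The steps above can also be chained from a small driver script. This is a sketch under assumptions: it supposes every script runs without command-line arguments from the repository root and that both training scripts should be executed, neither of which the steps confirm. `build_commands` and `run_pipeline` are hypothetical helpers.

```python
import subprocess
import sys

# Script names come from the steps above; the fixed order and the absence of
# arguments are assumptions made for this sketch.
PIPELINE = [
    "0_organize_dataset.py",
    "1_split_dataset.py",
    "2_train_cnn.py",
    "3_build_features.py",
    "4_train_model.py",
    "5_train_multiclassifier.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one argv list per script, using the current interpreter by default."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE):
    """Run each script in order and stop at the first failure."""
    for cmd in build_commands(scripts):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` after installing the dependencies runs the whole sequence; passing a shorter list re-runs only selected stages.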
The tests/ folder contains experiment scripts for:
- Early fusion evaluation
- Expert selection evaluation
- Late fusion evaluation
- XAI-based model interpretation (global and per-file)
- Statistical significance testing
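For the significance testing, the repository structure mentions McNemar testing. A minimal sketch of the exact (binomial) form of McNemar's test on two classifiers' predictions is shown below. It is an illustration under assumptions, not the repository's script: the function name `mcnemar_exact` is hypothetical, and the actual tests may use a library implementation or the chi-square approximation instead.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts samples only model A classifies
    correctly and c counts samples only model B classifies correctly.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        # The models never disagree on correctness: no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

A small p-value suggests the two models' error patterns differ by more than chance; only the discordant pairs (b and c) influence the result.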
The graphics/ folder contains scripts to generate publication-ready plots for the three fusion strategies.
Typical generated artifacts:
- Trained model files and checkpoints in weights/
- Intermediate features and processed metadata in dataset/
- Comparative plots in graphics/early_fusion/, graphics/expert_selection/, and graphics/late_fusion/
- Update paths and parameters in config.py before running long experiments.
- Some pipelines are computationally expensive and may require GPU acceleration.
- Keep dataset organization consistent with samples_split/train and samples_split/test.