
antoniol00/ai-malware-classification


Multimodal Malware Family Classification

These results have been (partially) funded by the Cátedra Internacional UMA 2023, which is part of the Global Security Innovation Program for the promotion of Cybersecurity Chairs in Spain, funded by the European Union through NextGeneration-EU funds, via the Spanish National Cybersecurity Institute (INCIBE).

This repository provides an end-to-end pipeline for malware family classification using multimodal data:

  • Static features
  • Dynamic behavior features
  • Visual features from binary-to-image representations
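As an illustration of the binary-to-image idea, the sketch below reshapes raw bytes into rows of grayscale pixel values. The function name `bytes_to_grayscale_rows`, the fixed row width, and the zero-padding are assumptions made for this example; the repository's own image extractor may use a different layout.

```python
import math


def bytes_to_grayscale_rows(data: bytes, width: int = 256) -> list[list[int]]:
    """Reshape raw bytes into rows of `width` pixels (values 0-255).

    Hypothetical helper: the last row is zero-padded so every row has
    the same length.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # bytes(n) yields n zero bytes, used here as padding.
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Five bytes laid out on a 4-pixel-wide grid give two rows.
image = bytes_to_grayscale_rows(b"MZ\x90\x00\x03", width=4)
print(image)
```

Each inner list is one image row, so the result can be passed to an imaging library or converted to an array before being fed to a CNN.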

The project compares three fusion strategies at scale:

  • Early fusion
  • Expert selection
  • Late fusion
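The three strategies can be contrasted with a minimal, dependency-free sketch. It assumes early fusion concatenates per-modality feature vectors, late fusion averages per-class probabilities from independent experts, and expert selection keeps the prediction of the most confident expert. These definitions, and all function names and numbers below, are illustrative assumptions rather than the repository's actual implementation.

```python
def early_fusion(static, dynamic, visual):
    """Concatenate per-modality feature vectors into one input vector."""
    return list(static) + list(dynamic) + list(visual)


def late_fusion(prob_vectors):
    """Average the per-class probabilities predicted by each expert."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


def expert_selection(prob_vectors):
    """Keep the output of the expert with the highest top-class probability."""
    return max(prob_vectors, key=max)


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up class probabilities from three hypothetical experts.
static_probs = [0.60, 0.30, 0.10]
dynamic_probs = [0.30, 0.65, 0.05]
visual_probs = [0.60, 0.30, 0.10]
experts = [static_probs, dynamic_probs, visual_probs]

fused = late_fusion(experts)
selected = expert_selection(experts)
print("late fusion class:", argmax(fused))
print("expert selection class:", argmax(selected))
print("early fusion input size:", len(early_fusion([1, 2], [3], [4, 5, 6])))
```

With these toy numbers the two decision-level strategies disagree: averaging favors the class that two experts agree on, while selection follows the single most confident expert.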

It includes data preparation, feature extraction, model training, explainability scripts, statistical comparison, and visualization utilities.

Repository Contents

Root Scripts

  • 0_organize_dataset.py: Organizes raw samples into the expected dataset structure.
  • 1_split_dataset.py: Creates train/test splits.
  • 2_train_cnn.py: Trains the CNN model for visual features.
  • 3_build_features.py: Builds feature datasets from extractors.
  • 4_train_model.py: Trains single models.
  • 5_train_multiclassifier.py: Trains multi-classifier pipelines.
  • predict.py: Runs inference on new samples.
  • config.py: Central configuration.

Main Folders

  • src/: Core implementation.
    • feature_extraction/: Static, dynamic, image, and multi-feature extractors.
    • models/: ML models (SVM, Random Forest, LightGBM, XGBoost, Voting Ensemble).
    • experiment.py: Experiment orchestration.
  • dataset/: Processed CSVs, notebook, and split sample directories.
  • tests/: Evaluation scripts for fusion strategies, XAI, and McNemar testing.
  • graphics/: Plotting scripts and generated figures.
  • weights/: Saved model weights/checkpoints.

Dataset Layout

  • dataset/dataset_processed.csv: Full processed dataset.
  • dataset/dataset_processed_train.csv: Training split metadata.
  • dataset/dataset_processed_test.csv: Test split metadata.
  • dataset/samples_split/train/: Files organized by malware family for training.
  • dataset/samples_split/test/: Files organized by malware family for testing.
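A quick way to sanity-check this layout is to count the files per family in a split directory. The sketch below assumes one subdirectory per family directly under `dataset/samples_split/train/`; the helper name `count_samples_per_family` is hypothetical and not part of the repository.

```python
from collections import Counter
from pathlib import Path


def count_samples_per_family(split_dir):
    """Count files under each family subdirectory of a split directory.

    Assumes the layout <split_dir>/<family>/<sample files>; files placed
    directly in split_dir are ignored.
    """
    counts = Counter()
    for family_dir in sorted(Path(split_dir).iterdir()):
        if family_dir.is_dir():
            counts[family_dir.name] = sum(
                1 for path in family_dir.rglob("*") if path.is_file()
            )
    return counts


if __name__ == "__main__":
    train_dir = Path("dataset/samples_split/train")
    if train_dir.is_dir():
        for family, n_files in count_samples_per_family(train_dir).items():
            print(f"{family}: {n_files}")
```

Running the same function on the test split makes it easy to spot families that are missing or heavily under-represented on one side of the split.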

Current families include:

  • Adware
  • Downloaders
  • Goodware
  • Infectors
  • Malware_Generic
  • Obfuscated
  • Other_Malware
  • Ransomware
  • Spying_Stealing
  • Trojan

Quick Start

  1. Install dependencies:
    pip install -r requirements.txt
  2. Prepare and split data:
    python 0_organize_dataset.py
    python 1_split_dataset.py
  3. Train visual expert (CNN):
    python 2_train_cnn.py
  4. Build multimodal features:
    python 3_build_features.py
  5. Train baseline or fusion models:
    python 4_train_model.py
    python 5_train_multiclassifier.py
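The steps above can also be chained from a small driver script. This is a sketch, not a file shipped with the repository: the `run_pipeline` and `build_commands` helpers and the `dry_run` flag are hypothetical, and it assumes every script runs without command-line arguments, as in the listing above.

```python
import subprocess
import sys

# Script order taken from the Quick Start steps.
PIPELINE = [
    "0_organize_dataset.py",
    "1_split_dataset.py",
    "2_train_cnn.py",
    "3_build_features.py",
    "4_train_model.py",
    "5_train_multiclassifier.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one argv list per script, using the current interpreter."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE, dry_run=False):
    """Run each script in order and stop at the first failure."""
    for command in build_commands(scripts):
        print("Running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run by default: prints the commands without executing them.
    run_pipeline(dry_run=True)
```

Switching `dry_run` to `False` executes the stages for real; because training is expensive, it is worth confirming the printed order first.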

Evaluation and Analysis

The tests/ folder contains experiment scripts for:

  • Early fusion evaluation
  • Expert selection evaluation
  • Late fusion evaluation
  • XAI-based model interpretation (global and per-file)
  • Statistical significance testing
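For the significance testing, McNemar's test compares two classifiers on the same test samples by looking only at the cases where exactly one of them is correct. The sketch below implements the exact (binomial) two-sided variant with the standard library; whether the repository uses this variant or the chi-squared approximation is not stated, so treat the function `mcnemar_exact` and the toy labels as assumptions.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts samples only model A gets
    right and c counts samples only model B gets right.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for truth, a, m in rows if a == truth and m != truth)
    c = sum(1 for truth, a, m in rows if a != truth and m == truth)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    # Under H0 the discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy labels: model A is right on five samples that model B gets wrong.
y_true = [1, 1, 1, 1, 1, 0]
model_a = [1, 1, 1, 1, 1, 0]
model_b = [0, 0, 0, 0, 0, 0]
b, c, p_value = mcnemar_exact(y_true, model_a, model_b)
print(b, c, p_value)
```

A small p-value suggests the two models' error patterns differ beyond what chance would explain; samples where both models are right or both are wrong do not affect the result.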

The graphics/ folder contains scripts to generate publication-ready plots for the three fusion strategies.

Output Artifacts

Typical generated artifacts:

  • Trained model files and checkpoints in weights/
  • Intermediate features and processed metadata in dataset/
  • Comparative plots in graphics/early_fusion/, graphics/expert_selection/, and graphics/late_fusion/

Notes

  • Update paths and parameters in config.py before running long experiments.
  • Some pipelines are computationally expensive and may require GPU acceleration.
  • Keep dataset organization consistent with samples_split/train and samples_split/test.
