A handful of statistical metrics to better understand and qualify malware datasets
.gitignore
LICENSE.txt
README.md
ouroboros.py
output.json
requirements.txt
sample.csv.gz
stase.py


What is STASE?

STASE provides a set of metrics to describe a dataset of malware labels.

Goal:

  • evaluate the properties of malware datasets
  • identify potential bias in experimental studies
  • analyze the decision and classification of antivirus products

Usage

Input: a dataset of labels formatted as a CSV or CSV.GZ file

  • columns: antivirus products
  • rows: malware files
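
To make the expected layout concrete, here is a minimal sketch that writes and reads a gzip-compressed CSV with one row per malware file and one column per antivirus product. The helper names `write_labels` and `read_labels` are hypothetical and not part of STASE. The header row of product names and the use of an empty cell for "no detection" are assumptions; check sample.csv.gz for the exact conventions stase.py expects.

```python
import csv
import gzip


def write_labels(path, av_names, rows):
    """Write a label matrix: one column per antivirus, one row per file.

    Assumption: the first line is a header listing the antivirus names,
    and an empty string means the product reported nothing for that file.
    """
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(av_names)
        writer.writerows(rows)


def read_labels(path):
    """Read the matrix back as (header, list of rows)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        return header, [row for row in reader]
```

For example, `write_labels("labels.csv.gz", ["AV1", "AV2"], [["Trojan.A", ""], ["", "Adware.B"]])` produces a two-file, two-product dataset in this layout.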

Output: the metrics introduced in this research paper (soon to be released)

Example:

python3 stase.py sample.csv.gz output.json

{
    "equiponderance": 0.2422919148,
    "equiponderance_idx": 8.0,
    "exclusivity": 0.2626262626,
    "recognition": 0.1051423324,
    "synchronicity": 0.1677210336,
    "genericity": 0.5233236152,
    "uniformity": 0.2926562999,
    "uniformity_idx": 48.0,
    "divergence": 0.7568027211,
    "consensuality": 0.2227891156,
    "resemblance": 0.6406466991,
    "labels": 328.0,
    "apps": 99.0,
    "avs": 66.0
}
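
A report like the one above can be post-processed with the standard library. The sketch below assumes output.json is valid JSON with flat keys, and it treats "labels", "apps" and "avs" as dataset sizes rather than metrics. That reading is an assumption based on the example, and the helper names `load_report` and `split_report` are hypothetical.

```python
import json

# Assumed to be dataset sizes (counts), not metrics.
COUNT_KEYS = ("labels", "apps", "avs")


def load_report(path):
    """Load a STASE report from a JSON file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def split_report(report):
    """Separate the assumed count fields from the remaining metric fields."""
    counts = {key: int(report[key]) for key in COUNT_KEYS if key in report}
    metrics = {key: value for key, value in report.items()
               if key not in COUNT_KEYS}
    return counts, metrics
```

Calling `split_report(load_report("output.json"))` returns the counts as integers and leaves every other field, including the `_idx` values, untouched.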

Technical details:

  • implemented in Python 3 (dependencies in requirements.txt)
  • uses multiprocessing for performance
  • shipped with Ouroboros

TODO

  • Handle more input formats and options

Pull requests accepted!
