Dataset

denishkhetan edited this page Feb 11, 2019 · 2 revisions

The data comes from the Microsoft Malware Classification Challenge. Uncompressed, it is nearly half a terabyte.

The data files are in two formats:

  1. hexadecimal format
  2. asm format

The data can be found at the following Google Storage path: gs://uga-dsp/project1/files

Each BinaryHex file looks something like the following:

The first hexadecimal token of each line, looking something like 00401000, is just the line pointer and can be safely ignored (as these will appear throughout many of the files, not unlike articles in natural language). The other hexadecimal pairs, however, are the code of the malware instance itself and should be used to predict the malware’s family.
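As a minimal sketch of the rule above, the following Python drops the first token of each line and counts the remaining hexadecimal pairs. The function names and the two sample lines are made up for illustration; only the layout (a line pointer followed by byte pairs) comes from the description above.

```python
from collections import Counter


def parse_bytes_line(line):
    """Return the byte tokens of one line, dropping the leading line pointer."""
    tokens = line.split()
    # tokens[0] is the line pointer (e.g. 00401000); the rest are byte pairs.
    return tokens[1:]


def count_byte_tokens(lines):
    """Count how often each hexadecimal pair occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(parse_bytes_line(line))
    return counts


# Illustrative lines only, not taken from a real file in the dataset.
sample_lines = [
    "00401000 56 8D 44 24",
    "00401010 8D 00 FF 56",
]
byte_counts = count_byte_tokens(sample_lines)
```

Counts of this kind are one possible feature representation for predicting the malware family; the same parsing step applies whichever features are built on top of it.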

Each binary file is uniquely identified by its hash, such as 01SuzwMJEIXsK7A8dQbl. The files are named with their hashes followed by a “.bytes” file extension, so a full filename would look like 01SuzwMJEIXsK7A8dQbl.bytes. In a different directory are text files that contain these hashes, one per line, to indicate which files (or documents) are part of which dataset. The dataset definitions are located here: gs://uga-dsp/project1/files

Each asm file looks something like the following:

The asm files contain the opcode instructions, such as mov and jmp.
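One way to use the opcodes is to count how often each one appears in a file. The sketch below rests on assumptions: the exact layout of an asm line is not shown on this page, so the sample lines are invented, and the opcode set is a small hypothetical subset (only mov and jmp are named above). It simply scans whitespace-separated tokens for known opcodes.

```python
from collections import Counter

# Hypothetical subset of opcodes; extend it to suit the real data.
KNOWN_OPCODES = {"mov", "jmp", "push", "call"}


def count_opcodes(lines, opcodes=KNOWN_OPCODES):
    """Count tokens that match a known opcode, ignoring case."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            token = token.lower()
            if token in opcodes:
                counts[token] += 1
    return counts


# Invented lines; real asm files may be laid out differently.
sample_asm = [
    ".text:00401000 56 push esi",
    ".text:00401001 8B F1 mov esi, ecx",
    ".text:00401003 EB 05 jmp short loc_40100A",
]
opcode_counts = count_opcodes(sample_asm)
```

Matching whole tokens against a fixed set is crude, but it avoids depending on column positions that this page does not specify.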

Specifically, here are the available files in the files directory:

X_train_small.txt, y_train_small.txt
X_test_small.txt, y_test_small.txt
X_train.txt, y_train.txt
X_test.txt

Each X* file contains a list of hashes, one per line. Each corresponding y* file is a list of integers, one per line, indicating the malware family to which the binary file with the corresponding hash belongs.
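Because the X* and y* files line up one entry per line, they can be paired by position. The sketch below shows one way to do that in Python; the function name is hypothetical, and the hashes and label values in the sample are illustrative rather than taken from the real files. It also appends the “.bytes” extension to turn each hash into a filename.

```python
def pair_hashes_with_labels(x_lines, y_lines):
    """Pair each hash with its integer label and build the .bytes filename."""
    hashes = [line.strip() for line in x_lines if line.strip()]
    labels = [int(line) for line in y_lines if line.strip()]
    if len(hashes) != len(labels):
        raise ValueError(
            f"{len(hashes)} hashes but {len(labels)} labels; files do not align"
        )
    return [(h + ".bytes", label) for h, label in zip(hashes, labels)]


# Illustrative contents; in practice these lines would be read from
# X_train_small.txt and y_train_small.txt.
x_sample = ["01SuzwMJEIXsK7A8dQbl\n", "exampleHash0000000002\n"]
y_sample = ["2\n", "7\n"]
labeled_files = pair_hashes_with_labels(x_sample, y_sample)
```

The length check is there because a silent mismatch between the two files would shift every label onto the wrong binary. X_test.txt has no matching y file listed, so only the hash list applies there.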

There are “small” and “large” versions of the data available. The “small” version is meant to help with initial algorithm development, as it is small enough to fit on one machine. Otherwise, you can use the large dataset.