Dataset
The data comes from the Microsoft Malware Classification Challenge. Uncompressed, it is nearly half a terabyte. The files are provided in two formats:
- hexadecimal format
- asm format
The data can be found at the following Google Storage path: gs://uga-dsp/project1/files
The first hexadecimal token of each line, looking something like 00401000, is just the line pointer and can be safely ignored (as these will appear throughout many of the files, not unlike articles in natural language). The other hexadecimal pairs, however, are the code of the malware instance itself and should be used to predict the malware’s family.
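As a rough illustration of the paragraph above, the sketch below drops the leading line pointer from each line and counts the remaining hexadecimal pairs. The function names `parse_bytes_line` and `count_bytes` and the sample lines are made up for this example; they are not part of the dataset or of this project's code.

```python
from collections import Counter


def parse_bytes_line(line):
    """Return the byte tokens of one line, without the leading line pointer."""
    tokens = line.split()
    # tokens[0] is the line pointer (e.g. 00401000); everything after it is code.
    return tokens[1:]


def count_bytes(lines):
    """Count how often each hexadecimal pair occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(parse_bytes_line(line))
    return counts


# Illustrative lines only, not taken from a real file.
sample = [
    "00401000 56 8D 44 24 08",
    "00401010 8D 44 00 FF",
]
byte_counts = count_bytes(sample)
```

Counts like these could serve as a simple bag-of-bytes feature vector for each file. Whether single pairs or longer n-grams work better is left open here.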
Each binary file is independently identified by its hash, such as 01SuzwMJEIXsK7A8dQbl. The files are named with their hashes (and have “.bytes” file extensions, after the hash; so a full filename would look like 01SuzwMJEIXsK7A8dQbl.bytes); in a different directory are text files that contain these hashes, one per line, to indicate which files (or documents) are part of which dataset. The dataset definitions are located here: gs://uga-dsp/project1/files
The asm files contain the opcode instructions, such as mov and jmp.
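One possible way to turn the opcode instructions into features is to count how often each opcode appears. The sketch below makes two assumptions: the line layout in `sample_asm` is hypothetical, and `OPCODES` is a small illustrative subset rather than a complete instruction list. It treats the first whitespace-separated token that matches a known opcode as the instruction for that line.

```python
from collections import Counter

# Hypothetical subset of opcodes; a real list would be much longer.
OPCODES = {"mov", "jmp", "push", "pop", "call"}


def count_opcodes(lines, opcodes=OPCODES):
    """Count the first known opcode found on each line."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if token in opcodes:
                counts[token] += 1
                break  # count at most one opcode per line
    return counts


# Assumed line layout, for illustration only.
sample_asm = [
    ".text:00401000 56 push esi",
    ".text:00401001 8B F1 mov esi, ecx",
    ".text:00401003 E9 00 00 00 00 jmp loc_401008",
    ".text:00401008 8B C6 mov eax, esi",
]
opcode_counts = count_opcodes(sample_asm)
```

Matching on whole tokens keeps the sketch simple, but it would need adjusting if the real files format instructions differently.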
Specifically, here are the available files in the files directory:
X_train_small.txt, y_train_small.txt
X_test_small.txt, y_test_small.txt
X_train.txt, y_train.txt
X_test.txt
Each X* file contains a list of hashes, one per line. Each corresponding y* file is a list of integers, one per line, indicating the malware family to which the binary file with the corresponding hash belongs.
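The pairing between an X* file and its y* file is positional: the n-th hash goes with the n-th label. A minimal sketch of that pairing follows, assuming the lines of both files have already been read into lists. The helper `pair_hashes_with_labels`, the second hash, and the label values are invented for illustration.

```python
def pair_hashes_with_labels(x_lines, y_lines, extension=".bytes"):
    """Pair each hash with its label and build the corresponding filename."""
    hashes = [line.strip() for line in x_lines if line.strip()]
    labels = [int(line) for line in y_lines if line.strip()]
    if len(hashes) != len(labels):
        raise ValueError("X and y files must have the same number of lines")
    # Filenames are the hash followed by the file extension.
    return [(h + extension, label) for h, label in zip(hashes, labels)]


# Illustrative contents; the second hash and both labels are made up.
x_lines = ["01SuzwMJEIXsK7A8dQbl\n", "hypotheticalHash0002\n"]
y_lines = ["2\n", "7\n"]
pairs = pair_hashes_with_labels(x_lines, y_lines)
```

X_test.txt has no matching y file in the list above, so this pairing applies only to the training sets and the small test set.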
There are “small” and “large” versions of the data available. The “small” version is meant to help with initial algorithm development, as these data are small enough to fit on one machine. Otherwise, you can use the large dataset.