Project Overview

In this project, the data comes from the Microsoft Malware Classification Challenge. Uncompressed, the data is nearly half a terabyte. The files are in hexadecimal format and can be found at the following Google Storage path:

gs://uga-dsp/project1/files

The goal is to classify each file into one of several possible malware families. There are 9 classes of malware, and each instance of malware belongs to exactly one of the following categories:

Ramnit
Lollipop
Kelihos_ver3
Vundo
Simda
Tracur
Kelihos_ver1
Obfuscator.ACY
Gatak

Each file looks something like the following:
...
00401060 53 8F 48 00 A9 88 40 00 04 4E 00 00 F9 31 4F 00
00401070 1D 99 02 47 D5 4F 00 00 03 05 B5 42 CE 88 65 43
00401080 6F 3D 4D 00 77 73 CD 47 21 A5 F0 48 87 8E 4A 00
...

The first hexadecimal token of each line, looking something like 00401060, is just the line pointer and can be safely ignored (as these will appear throughout many of the files, not unlike articles in natural language). The other hexadecimal pairs, however, are the code of the malware instance itself and should be used to predict the malware’s family.
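As a rough illustration of the rule above, the following Python sketch drops the first token of each line and counts the remaining hexadecimal pairs. The function names byte_tokens and count_bytes are hypothetical, and raw byte counts are only one possible feature; nothing here is the project's actual pipeline.

```python
from collections import Counter


def byte_tokens(line):
    """Return the hexadecimal pairs on one line, dropping the leading line pointer."""
    parts = line.split()
    # parts[0] is the line pointer (e.g. 00401060); everything after it is code.
    return parts[1:]


def count_bytes(lines):
    """Count how often each hexadecimal pair occurs across the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(byte_tokens(line))
    return counts


# Two lines copied from the sample file excerpt.
sample = [
    "00401060 53 8F 48 00 A9 88 40 00 04 4E 00 00 F9 31 4F 00",
    "00401070 1D 99 02 47 D5 4F 00 00 03 05 B5 42 CE 88 65 43",
]
counts = count_bytes(sample)
```

Lines with no tokens after the pointer, or with no tokens at all, simply contribute nothing to the counts.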

Each binary file is uniquely identified by its hash, such as 01SuzwMJEIXsK7A8dQbl. The files are named with their hashes followed by a “.bytes” file extension, so a full filename would look like 01SuzwMJEIXsK7A8dQbl.bytes. In a different directory are text files that contain these hashes, one per line, to indicate which files (or documents) are part of which dataset. The dataset definitions are located here:

gs://uga-dsp/project1/files

Specifically, here are the available files in the files directory:

X_train_small.txt, y_train_small.txt
X_test_small.txt, y_test_small.txt
X_train.txt, y_train.txt
X_test.txt

Each X* contains a list of hashes, one per line. Each corresponding y* file is a list of integers, one per line, indicating the malware family to which the binary file with the corresponding hash belongs.
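The line-by-line correspondence between an X* file and its y* file can be sketched as follows. This is a minimal sketch that assumes the file contents have already been read into strings; the helper names pair_hashes_with_labels and bytes_filename, the second hash, and both label values are made up for illustration.

```python
def pair_hashes_with_labels(x_text, y_text):
    """Pair each hash in an X* file with the integer on the same line of the y* file."""
    hashes = [h.strip() for h in x_text.splitlines() if h.strip()]
    labels = [int(v) for v in y_text.splitlines() if v.strip()]
    if len(hashes) != len(labels):
        raise ValueError(
            "expected one label per hash, got %d hashes and %d labels"
            % (len(hashes), len(labels))
        )
    return list(zip(hashes, labels))


def bytes_filename(file_hash):
    """Build the name of the binary file for a hash, e.g. <hash>.bytes."""
    return file_hash + ".bytes"


# Hypothetical file contents: only the first hash appears in the text above,
# and the label values are invented for this example.
x_text = "01SuzwMJEIXsK7A8dQbl\nHYPOTHETICALHASH0001\n"
y_text = "2\n7\n"

pairs = pair_hashes_with_labels(x_text, y_text)
names = [bytes_filename(h) for h, _ in pairs]
```

Checking that the two files have the same number of lines catches a mismatched pair early, for example an X_train file loaded alongside a y_train_small file.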

There are “small” and “large” versions of the data available. The “small” version is meant to help in initial algorithm development, as these data are small enough to fit on one machine. Otherwise, you can use the large dataset.
