
Project Overview

denishkhetan edited this page Feb 10, 2019 · 2 revisions

In this project, data was taken from the Microsoft Malware Classification Challenge. Uncompressed, the data amounts to nearly half a terabyte. The data files come in two formats, hexadecimal (.bytes) and disassembly (.asm), and can be found at the following Google Storage path: gs://uga-dsp/project1/files. Here, we are trying to classify each file into one of several possible malware families. There are 9 classes of malware, and each instance of malware belongs to exactly one of the following categories.

  1. Ramnit
  2. Lollipop
  3. Kelihos_ver3
  4. Vundo
  5. Simda
  6. Tracur
  7. Kelihos_ver1
  8. Obfuscator.ACY
  9. Gatak
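
The labels in the y_* files (described below) are integers; a small lookup table makes them human-readable. This is an illustrative sketch (the dictionary name is our own, and the 1-based numbering is assumed to match the list above):

```python
# Hypothetical mapping from the integer labels used in the y_* files
# to malware family names; numbering assumed 1-based, per the list above.
MALWARE_FAMILIES = {
    1: "Ramnit",
    2: "Lollipop",
    3: "Kelihos_ver3",
    4: "Vundo",
    5: "Simda",
    6: "Tracur",
    7: "Kelihos_ver1",
    8: "Obfuscator.ACY",
    9: "Gatak",
}

print(MALWARE_FAMILIES[3])  # → Kelihos_ver3
```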

Each hexadecimal (.bytes) file looks something like the following:

The first hexadecimal token on each line, looking something like 00401000, is just the line pointer (an address) and can be safely ignored, as such tokens appear throughout many of the files, not unlike articles in natural language. The remaining hexadecimal pairs, however, are the code of the malware instance itself and should be used to predict the malware's family.
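A minimal Python sketch of this preprocessing step (the function name is our own; the sample line follows the format described above):

```python
def parse_bytes_line(line):
    """Drop the leading address token and return the hex byte pairs.

    Each line of a .bytes file starts with a line pointer (e.g. 00401000)
    followed by the hexadecimal byte pairs of the malware code itself.
    """
    tokens = line.split()
    return tokens[1:]  # the first token is the line pointer; ignore it

sample = "00401000 56 8D 44 24 08 50 8B F1"
print(parse_bytes_line(sample))  # → ['56', '8D', '44', '24', '08', '50', '8B', 'F1']
```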

Each .asm file looks something like the following:

Each binary file is uniquely identified by its hash, such as 01SuzwMJEIXsK7A8dQbl. The files are named with their hashes and carry a ".bytes" file extension, so a full filename looks like 01SuzwMJEIXsK7A8dQbl.bytes. In a separate directory are text files that contain these hashes, one per line, indicating which files (or documents) belong to which dataset. The dataset definitions are located here:

gs://uga-dsp/project1/files

Specifically, here are the available files in the files directory:

  • X_train_small.txt, y_train_small.txt
  • X_test_small.txt, y_test_small.txt
  • X_train.txt, y_train.txt
  • X_test.txt

Each X* contains a list of hashes, one per line. Each corresponding y* file is a list of integers, one per line, indicating the malware family to which the binary file with the corresponding hash belongs.
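For illustration, here is one way to pair each hash in an X_* file with its label from the corresponding y_* file. This is a sketch under the stated assumption that both files have one entry per line, in matching order; the function name and paths are our own:

```python
def load_dataset(x_path, y_path):
    """Pair each hash from an X_* file with its integer label from the y_* file.

    Assumes one entry per line in each file, with matching order, as
    described in the dataset layout.
    """
    with open(x_path) as xf, open(y_path) as yf:
        hashes = [line.strip() for line in xf if line.strip()]
        labels = [int(line.strip()) for line in yf if line.strip()]
    if len(hashes) != len(labels):
        raise ValueError("X and y files must have the same number of lines")
    return list(zip(hashes, labels))
```

The returned list of (hash, label) pairs can then be used to locate each .bytes or .asm file by name and attach its malware family for training.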

There are “small” and “large” versions of the data available. The “small” version is meant to help with initial algorithm development, as it is small enough to fit on one machine. Once your approach works there, you can scale up to the large dataset.
