1st Place Solution in Malware Classification Challenge

Problem Statement

The detection of malicious software (malware) is an important problem in cyber security, especially as more of society becomes dependent on computing systems.

In the past few years, the malware industry has grown so rapidly that syndicates invest heavily in technologies to evade traditional protection, forcing anti-malware groups and communities to build more robust software to detect and stop these attacks. A major part of protecting a computer system from a malware attack is identifying whether a given file or piece of software is malware.

Source

Microsoft has been very active in building anti-malware products over the years, and it runs its anti-malware utilities on over 150 million computers around the world. This generates tens of millions of daily data points to be analyzed as potential malware. To analyze and classify such large amounts of data effectively, we need to be able to group the samples and identify their respective families.

This dataset provided by Microsoft contains about 10 classes of malware.

How to get the train/test data

Run the download_data.py file in the repository root:
python download_data.py

How to get the train/test image data

Run the src/bi_2_img.py file once for each split:
python bi_2_img.py --mode train
python bi_2_img.py --mode test
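The conversion from a .bytes file to an image can be illustrated with a minimal sketch. It is not the code in src/bi_2_img.py. It assumes each line of a .bytes file holds an address followed by two-digit hex bytes, with "??" marking unknown bytes, and the names parse_hex_dump, to_rows, and save_grayscale are hypothetical.

```python
def parse_hex_dump(text):
    """Turn a hex-dump string into raw bytes.

    Assumed line format: an address token followed by two-digit hex bytes,
    e.g. "00401000 56 8D 44 24". Unknown bytes written as "??" become 0.
    """
    values = []
    for line in text.splitlines():
        for token in line.split()[1:]:  # skip the leading address token
            values.append(0 if token == "??" else int(token, 16))
    return bytes(values)


def to_rows(data, width):
    """Cut the byte string into full rows of `width` pixels (0-255 each)."""
    height = len(data) // width  # a trailing partial row is dropped
    return [list(data[r * width:(r + 1) * width]) for r in range(height)]


def save_grayscale(data, width, path):
    """Write the bytes as an 8-bit grayscale image using Pillow."""
    from PIL import Image  # imported lazily so parsing works without Pillow
    height = len(data) // width
    Image.frombytes("L", (width, height), data[:width * height]).save(path)


sample = "00401000 56 8D 44 24\n00401004 ?? FF 00 10\n"
pixels = to_rows(parse_hex_dump(sample), width=4)
print(pixels)  # [[86, 141, 68, 36], [0, 255, 0, 16]]
```

The image width is a free choice in this sketch; a fixed width keeps rows aligned across files, while the height varies with file size until the image is resized for the network.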

How to train the model and make predictions?

Take TextCNN as an example

Training

Run src/textcnn_train.py with the following arguments: --dataset (input data folder path) --label_file (label file path) --mode train

Predicting

Run src/textcnn_eval.py with the following arguments: --dataset (input data folder path) --label_file (label file path) --mode eval
The predictions are saved as tcnn_submission.csv in the current working directory.
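For readers who want to see how the three flags fit together, here is a minimal argparse sketch. It only mirrors the flags named above; the actual scripts may define more options or different defaults, and the example paths are hypothetical.

```python
import argparse


def build_parser():
    """Parser mirroring the three flags described in this README (sketch only)."""
    parser = argparse.ArgumentParser(description="TextCNN train/eval flags (sketch)")
    parser.add_argument("--dataset", required=True, help="input data folder path")
    parser.add_argument("--label_file", required=True, help="label file path")
    parser.add_argument("--mode", choices=["train", "eval"], default="train")
    return parser


# Hypothetical paths, used only to show the call shape.
args = build_parser().parse_args(
    ["--dataset", "data/train", "--label_file", "data/labels.csv", "--mode", "train"]
)
print(args.dataset, args.label_file, args.mode)
```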

Model/Algorithm

This repository contains the source code for detecting different types of malware using deep-learning-based feature extraction and a wrapper-based feature selection technique. A research paper describing how it works is available at "to be updated".

We used three major approaches for malware classification:

1- ML-Base

Following the top-1 and top-2 solutions in Microsoft Malware Prediction, we experimented with different feature engineering methods to extract sequential information from the raw bytes files, such as n-grams, tf-idf, and calculations on the block size. The resulting train and test data have around 60500 features.
For feature selection, we apply a random forest to keep around 3860 of the 60500 features. LightGBM is the main model we used to produce the submission predictions.
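As an illustration of the n-gram and tf-idf features mentioned above, the following standard-library sketch counts byte n-grams and weights them with a plain tf-idf formula. It is a simplified stand-in, not the repository's feature pipeline: the function names are hypothetical, and library implementations often use smoothed idf variants instead of the plain log(N / df) used here.

```python
import math
from collections import Counter


def byte_ngrams(data, n):
    """Count every contiguous run of n bytes in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def tfidf(docs, n=2):
    """Return one {ngram: weight} dict per document.

    tf is the n-gram's share of the document's n-grams;
    idf is log(number of documents / documents containing the n-gram).
    """
    counts = [byte_ngrams(d, n) for d in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per n-gram
    num_docs = len(docs)
    weighted = []
    for c in counts:
        total = sum(c.values())
        weighted.append(
            {g: (k / total) * math.log(num_docs / doc_freq[g]) for g, k in c.items()}
        )
    return weighted


docs = [b"\x00\x01\x00\x01", b"\x00\x01\xff"]
weights = tfidf(docs, n=2)
print(weights[0])
```

An n-gram that appears in every document gets weight 0 under this formula, which is why rare byte patterns dominate the resulting feature vectors.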

Our final prediction is a blend of the predictions from the models we experimented with and the output of a stacking model (Stage 1 models: RandomForest and two LightGBM models; Stage 2 model: LR).
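The blending step can be sketched as a weighted average of the class probabilities predicted by each model. This is an assumption about the general technique, not the exact scheme used for the submission; the function name blend and the weights are hypothetical, and the stacking stage is not shown.

```python
def blend(predictions, weights):
    """Weighted average of per-model class probabilities.

    predictions: one entry per model; each entry is a list of rows,
                 and each row is a list of class probabilities.
    weights:     one non-negative weight per model.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total_weight = sum(weights)
    blended = []
    for rows in zip(*predictions):  # the same sample as seen by every model
        mixed = [
            sum(w * p for w, p in zip(weights, column)) / total_weight
            for column in zip(*rows)  # one class across all models
        ]
        norm = sum(mixed)
        blended.append([m / norm for m in mixed])  # keep each row summing to 1
    return blended


model_a = [[0.8, 0.2], [0.1, 0.9]]
model_b = [[0.4, 0.6], [0.3, 0.7]]
result = blend([model_a, model_b], weights=[3, 1])
print(result)  # roughly [[0.7, 0.3], [0.15, 0.85]]
```

In practice the weights would be tuned on held-out data rather than fixed by hand.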

2- CNN-Base

Image representations of .bytes files show distinct patterns for different malware classes. Following the paper Malware Classification using Deep Convolutional Neural Networks, we experimented with a pretrained VGG16 and a self-defined CNN-Base model to extract features from .bytes images for classifying the malware files.

VGG16 Structure

3- TextCNN-Base

The sequence representation of a .bytes file contains significant information about the malware. We implement MalConv, proposed in Malware Detection by Eating a Whole EXE, but because of limited time and computing resources we adjust the network structure of the original paper to take a shorter input sequence (from 2000000 to 4096) and change the window size of the dilated CNN.

(Graph reference: haimgil1/Deep_Learning_Malware_Classification)
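The shorter input described above implies that every file is cut or padded to 4096 byte tokens before it reaches the network. The sketch below shows one way to do that. It assumes the first 4096 bytes are kept and that padding uses index 256; neither detail is confirmed by this README, and to_fixed_length is a hypothetical name.

```python
PAD_ID = 256  # assumed padding index, one past the largest byte value (255)


def to_fixed_length(data, max_len=4096, pad_id=PAD_ID):
    """Map raw bytes to a list of exactly `max_len` integer token ids."""
    ids = list(data[:max_len])  # truncate long files (assumption: keep the head)
    ids.extend([pad_id] * (max_len - len(ids)))  # right-pad short files
    return ids


short = to_fixed_length(b"\x4d\x5a\x90")
long_ = to_fixed_length(bytes(5000))
print(len(short), short[:4], len(long_))
```

With a padding index outside the 0-255 range, an embedding layer fed by these ids would need 257 entries so that padding stays distinct from real byte values.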

Dependency

Please make sure each of the following packages is installed. Use the latest version unless a version is pinned below.

tqdm
numpy
scipy
pandas
torch 0.4.0
torchvision
Pillow
scikit-learn 0.22.0 (imported as sklearn)
lightgbm

We used Python 3.7 for this project.
