areomoon/malware_detection

1st Place Solution in Malware Classification Challenge

Problem Statement

The detection of malicious software (malware) is an important problem in cyber security, especially as more of society becomes dependent on computing systems.

In the past few years, the malware industry has grown so rapidly that syndicates invest heavily in technologies to evade traditional protection, forcing anti-malware groups and communities to build more robust software to detect and stop these attacks. A major part of protecting a computer system from a malware attack is identifying whether a given file or piece of software is malware.

Source

Microsoft has been very active in building anti-malware products over the years, and it runs its anti-malware utilities on over 150 million computers around the world. This generates tens of millions of data points every day to be analyzed as potential malware. To analyze and classify such large amounts of data effectively, we need to be able to group the samples and identify their respective families.

This dataset provided by Microsoft contains about 10 classes of malware.

How to get the train/test data

Run the download.py file
python download.py

How to get the train/test image data

Run the src/bi_2_img.py file with --mode train or --mode test:
python bi_2_img.py --mode train
python bi_2_img.py --mode test
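
For orientation, the conversion step can be sketched as follows. This is a minimal sketch and not the contents of src/bi_2_img.py. It assumes each line of a .bytes file is a hex dump with a leading address followed by two-character hex bytes, with `??` marking an unreadable byte. The function name `bytes_text_to_rows`, the `fill` value, and the row width are hypothetical choices.

```python
def bytes_text_to_rows(text, width=16, fill=0):
    """Parse hex-dump text into rows of grayscale pixel values (0-255)."""
    pixels = []
    for line in text.splitlines():
        tokens = line.split()
        # Assumption: the first token on each line is an address, not data.
        for tok in tokens[1:]:
            # Assumption: "??" marks an unreadable byte; map it to `fill`.
            pixels.append(fill if tok == "??" else int(tok, 16))
    # Pad the tail so the pixel list divides evenly into rows.
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([fill] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Tiny made-up sample: two lines, four bytes each.
sample = "00401000 56 8D 44 24\n00401010 FF ?? 00 10"
rows = bytes_text_to_rows(sample, width=4)
print(rows)
```

The resulting rows could then be turned into an image file, for instance by converting them to a uint8 numpy array and passing that to Pillow's Image.fromarray.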

How to train the model and make predictions

Take TextCNN as an example

  • Training

Run src/textcnn_train.py with the following arguments: --dataset (input data folder path) --label_file (label data loader path) --mode train

  • Predicting

Run src/textcnn_eval.py with the following arguments: --dataset (input data folder path) --label_file (label data loader path) --mode eval
The prediction is saved as tcnn_submission.csv in the current working directory.
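
The command line described above can be sketched with argparse. The flag names --dataset, --label_file, and --mode come from this README. The example paths, the `build_parser` helper, and the default for --mode are hypothetical, and the actual scripts may define their arguments differently.

```python
import argparse


def build_parser():
    """Build a parser with the three flags named in this README."""
    parser = argparse.ArgumentParser(description="TextCNN train/eval (sketch)")
    parser.add_argument("--dataset", required=True, help="input data folder path")
    parser.add_argument("--label_file", required=True, help="label data loader path")
    # Assumption: only the two modes mentioned in the README are accepted.
    parser.add_argument("--mode", choices=["train", "eval"], default="train")
    return parser


# Hypothetical paths, for illustration only.
args = build_parser().parse_args(
    ["--dataset", "data/train", "--label_file", "data/labels.csv", "--mode", "train"]
)
print(args.dataset, args.label_file, args.mode)
```

With a parser like this, a training run would look like python src/textcnn_train.py --dataset data/train --label_file data/labels.csv --mode train, where both paths are placeholders.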

Model/Algorithm

This repository contains the source code for detecting different types of malware using deep-learning-based feature extraction and a wrapper-based feature selection technique. A research paper describing how it works is available at "to be updated"

Three major approaches we used for malware classification:

  • 1-ML-Base

    Referring to the top-1 and top-2 solutions in Microsoft Malware Prediction, we experimented with different feature engineering methods to extract sequential information from the raw bytes files, such as n-gram, tf-idf, and block-size calculations. We generated train and test data with around 60500 features.
    For feature selection, we applied RandomForest to keep around 3860 of the 60500 features. LightGBM is the main SOTA model we used to make the submission prediction.

    Our final prediction blends the predictions from our experiments with the output of a stacking method (Stage 1 models: RandomForest and 2 LightGBM models; Stage 2 model: LR).

  • 2- CNN-Base
    Image representations of .bytes files show distinct patterns for different malware classes. We referred to the paper Malware Classification using Deep Convolutional Neural Networks and experimented with a pretrained VGG16 and a self-defined CNN-Base model to extract features from .bytes images for classifying the malware files.

[Figure: VGG16 structure (cnn)]

  • 3- TextCNN-Base
    The sequence representation of a .bytes file contains significant information about the malware. We implemented MalConv, proposed in Malware Detection by Eating a Whole EXE, but because of limited time and computational resources we adjusted the network structure from the original paper to take a shorter input sequence (4096 instead of 2000000) and adjusted the window size of the dilated CNN.
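
Shortening the input to 4096 implies that every file is cut or padded to a fixed length before it reaches the network. The following is a minimal sketch of that preprocessing step, not the repository's code. The padding id 256 (one value beyond the byte range 0-255) and the function name `to_fixed_length` are assumptions made for illustration.

```python
MAX_LEN = 4096   # shortened input length stated in this README
PAD_ID = 256     # assumption: one extra token id beyond byte values 0-255


def to_fixed_length(raw: bytes, max_len: int = MAX_LEN, pad_id: int = PAD_ID):
    """Truncate or right-pad a byte string to exactly `max_len` token ids."""
    ids = list(raw[:max_len])                     # bytes -> list of ints 0-255
    ids.extend([pad_id] * (max_len - len(ids)))   # no-op when already full
    return ids


short = to_fixed_length(b"MZ\x90\x00")   # shorter than 4096: gets padded
long_ = to_fixed_length(bytes(5000))     # longer than 4096: gets truncated
print(len(short), len(long_))
```

Token ids of this form would typically feed an embedding layer with 257 entries, for instance torch.nn.Embedding(257, dim, padding_idx=256), where dim is a free choice.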

[Figure: TextCNN structure (text_cnn)]

(Graph reference: haimgil1/Deep_Learning_Malware_Classification)
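
Returning to the ML-Base approach, the byte n-gram and tf-idf features mentioned there can be sketched in plain Python. This is a sketch under stated assumptions rather than the feature pipeline used for the submission: the function names, the choice of n = 2, and the unsmoothed idf formula are all illustrative.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in one file."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def tfidf(docs, n=2):
    """Return one {ngram: weight} dict per document.

    Uses tf = count / total n-grams and idf = log(N / df). Libraries such as
    scikit-learn apply a smoothed variant, so their values will differ.
    """
    counts = [byte_ngrams(d, n) for d in docs]
    # Document frequency: in how many files each n-gram appears.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    out = []
    for c in counts:
        total = sum(c.values()) or 1
        out.append({g: (k / total) * math.log(n_docs / df[g]) for g, k in c.items()})
    return out


# Two tiny made-up "files".
docs = [b"\x00\x01\x00\x01", b"\x00\x01\xff\xff"]
weights = tfidf(docs)
```

An n-gram that appears in every file gets weight zero under this formula, which is one reason a feature selection step is useful after generating a large number of such features.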

Dependency

Please make sure each of the following is installed, using the latest version unless a version is listed:

  • tqdm
  • numpy
  • scipy
  • pandas
  • torch 0.4.0
  • torchvision
  • Pillow
  • scikit-learn 0.22.0 (imported as sklearn)
  • lightgbm

We used Python 3.7 for this project.
