The detection of malicious software (malware) is an important problem in cyber security, especially as more of society becomes dependent on computing systems.
In the past few years, the malware industry has grown so rapidly that its syndicates invest heavily in technologies to evade traditional protection, forcing anti-malware groups and communities to build more robust software to detect and terminate these attacks. A major part of protecting a computer system from a malware attack is identifying whether a given file or piece of software is malware.
Microsoft has been very active in building anti-malware products over the years, and it runs its anti-malware utilities on over 150 million computers around the world. This generates tens of millions of daily data points to be analyzed as potential malware. To analyze and classify such large amounts of data effectively, we need to be able to group the samples and identify their respective families.
This dataset provided by Microsoft contains about 10 classes of malware.
Run the download.py file:
python download.py
Run the src/bi_2_img.py file:
python bi_2_img.py --mode train
or
python bi_2_img.py --mode test
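The conversion step above turns each .bytes file into an image. The following is only a minimal sketch of that idea, not the code in src/bi_2_img.py. It assumes each line of a .bytes file starts with an address followed by hex byte values, and that "??" marks an unreadable byte; the function name bytes_text_to_rows and the width parameter are hypothetical.

```python
def bytes_text_to_rows(text, width=16):
    """Turn the text of a hex dump into rows of grayscale pixel values (0-255)."""
    pixels = []
    for line in text.splitlines():
        tokens = line.split()
        # Assumption: the first token on each line is an address, not data.
        for token in tokens[1:]:
            # Assumption: "??" marks an unreadable byte; map it to 0 (black).
            pixels.append(0 if token == "??" else int(token, 16))
    # Pad the last row with zeros so that every row has the same width.
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


sample = "00401000 56 8D 44 24\n00401004 ?? FF 00 10\n"
rows = bytes_text_to_rows(sample, width=4)
print(rows)  # [[86, 141, 68, 36], [0, 255, 0, 16]]
```

The resulting rows can be handed to an imaging library such as Pillow to save a grayscale image; the image width is a free choice and affects how the byte patterns appear.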
Take TextCNN as an example:
- Training
Run src/textcnn_train.py with the arguments --dataset (input data folder path), --label_file (label file path), and --mode train.
- Predicting
Run src/textcnn_eval.py with the arguments --dataset (input data folder path), --label_file (label file path), and --mode eval.
The prediction is saved as tcnn_submission.csv in the current working directory.
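To illustrate how the three arguments fit together, here is a hedged sketch of a command-line parser that accepts them. It is not taken from the repository's scripts: only the flag names --dataset, --label_file, and --mode come from the text above, while build_parser and the example paths are hypothetical placeholders.

```python
import argparse


def build_parser():
    """Build an illustrative parser for the three flags named in this README."""
    parser = argparse.ArgumentParser(description="TextCNN train/eval (illustrative)")
    parser.add_argument("--dataset", required=True, help="input data folder path")
    parser.add_argument("--label_file", required=True, help="label file path")
    parser.add_argument("--mode", choices=["train", "eval"], default="train")
    return parser


# Hypothetical paths, used only to show the shape of a training invocation.
args = build_parser().parse_args(
    ["--dataset", "data/train", "--label_file", "data/labels.csv", "--mode", "train"]
)
print(args.dataset, args.label_file, args.mode)
```

Replace the placeholder paths with the locations of your own data folder and label file.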
This repository contains the source code for detecting different types of malware using deep-learning-based feature extraction and a wrapper-based feature selection technique. A research paper describing how it works is available at "to be updated".
We used three major approaches for malware classification:
- 1- ML-Base
Referring to the top-1 and top-2 solutions in Microsoft Malware Prediction, we experimented with different feature engineering methods to extract sequential information from the raw bytes files, such as n-grams, tf-idf, and calculations on the block size. We generated the train and test data with around 60500 features.
For feature selection, we applied a random forest to select around 3860 features out of the 60500. LightGBM is the main state-of-the-art model we used to make the submission prediction. Our final prediction combines a blend of the predictions from our experiments with a stacking method (Stage 1 models: RandomForest and 2 LightGBM; Stage 2 model: LR).
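As a small illustration of the n-gram features mentioned above, the sketch below counts overlapping byte n-grams in a raw byte string. It is an assumption-laden example rather than the repository's feature pipeline: the function name byte_ngram_counts and the choice of n=2 are hypothetical.

```python
from collections import Counter


def byte_ngram_counts(data, n=2):
    """Count overlapping n-grams (as bytes objects) in a bytes sequence."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


sample = bytes([0x56, 0x8D, 0x56, 0x8D, 0x44])
counts = byte_ngram_counts(sample, n=2)
# The pair (0x56, 0x8D) occurs twice in the sample.
print(counts[bytes([0x56, 0x8D])])  # 2
```

Counts like these, computed per file, can form the columns of a feature matrix, which is where a selection step becomes useful because the number of distinct n-grams grows quickly with n.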
- 2- CNN-Base
The image representation of a .bytes file shows distinct patterns for different malware classes. We referred to the paper Malware Classification using Deep Convolutional Neural Networks and experimented with a pretrained VGG16 and a self-defined CNN-Base model to extract features from the .bytes images for classifying the malware files.
VGG16 Structure
- 3- TextCNN-Base
The sequence representation of a .bytes file contains significant information about the malware. We implemented the MalConv model proposed in Malware Detection by Eating a Whole EXE, but because of limited time and computational resources, we adjusted the network structure of the original paper to take a shorter input sequence length (from 2000000 to 4096) and changed the window size of the dilated CNN.
(Graph Reference: haimgil1/Deep_Learning_Malware_Classification)
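To make the shorter input length concrete, the sketch below shows one way to turn raw bytes into a fixed-length sequence of 4096 token ids. It is a hedged illustration, not the repository's preprocessing: reserving id 0 for padding and shifting byte values by one are assumptions, and encode_bytes, MAX_LEN, and PAD_ID are hypothetical names.

```python
MAX_LEN = 4096  # sequence length stated in the text above
PAD_ID = 0      # assumption: id 0 is reserved for padding


def encode_bytes(data, max_len=MAX_LEN):
    """Map raw bytes to token ids 1..256, then truncate or right-pad to max_len."""
    # Shift each byte by one so that a real 0x00 byte differs from padding.
    ids = [b + 1 for b in data[:max_len]]
    ids.extend([PAD_ID] * (max_len - len(ids)))
    return ids


short = encode_bytes(b"\x00\xff", max_len=4)
long_ = encode_bytes(bytes(range(10)), max_len=4)
print(short)  # [1, 256, 0, 0]
print(long_)  # [1, 2, 3, 4]
```

With this encoding, an embedding layer would need 257 entries (256 byte values plus the padding id); truncation keeps only the first max_len bytes of each file.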
Please make sure each of the following packages is installed, using the latest version unless a version is specified:
- tqdm
- numpy
- scipy
- pandas
- torch 0.4.0
- torchvision
- Pillow
- scikit-learn 0.22.0
- lightgbm
We used Python 3.7 for this project.