
Approach


We have taken two approaches.

  1. N-gram analysis of the binary hex files with Random Forest.
  2. Opcode n-gram analysis.

Approach 1

For our first approach, we performed unigram analysis with Random Forest on the binary hex files available in the training data. For preprocessing, we removed the line numbers and other characters such as "?" using the RegexTokenizer available in pyspark.ml.feature. Next, we extracted the hexadecimal bytes and performed the unigram analysis. We then used CountVectorizer to build the feature vectors and trained a Random Forest model on the small dataset (20 trees, depth 8), which achieved an accuracy of 89.49%.
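As a rough illustration, the pipeline for this approach could look like the sketch below. The column names ("text", "label") and the DataFrames train_df/test_df are assumptions for the sketch, not the project's actual code; the regex pattern is one possible way to keep only standalone two-character hex bytes while dropping the line addresses and "?" placeholders.

```python
# Minimal sketch of Approach 1, assuming each hex dump is loaded as one
# string in column "text" and the class label in column "label".
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, CountVectorizer
from pyspark.ml.classification import RandomForestClassifier

# Match only standalone two-character hex tokens; the 8-character line
# addresses and "??" placeholders produce no matches and are dropped.
tokenizer = RegexTokenizer(
    inputCol="text", outputCol="bytes",
    pattern="\\b[0-9A-Fa-f]{2}\\b", gaps=False)

# Turn the byte-token lists into count (unigram) feature vectors.
cv = CountVectorizer(inputCol="bytes", outputCol="features")

# Small-dataset settings reported above: 20 trees, depth 8.
rf = RandomForestClassifier(
    labelCol="label", featuresCol="features",
    numTrees=20, maxDepth=8)

pipeline = Pipeline(stages=[tokenizer, cv, rf])
model = pipeline.fit(train_df)          # train_df assumed to exist
predictions = model.transform(test_df)  # test_df assumed to exist
```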

We then trained the model on the large dataset, this time with 50 trees and a depth of 25, and obtained an accuracy of 98.01%.
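For the large-dataset run only the Random Forest hyperparameters change; assuming the same pipeline as in the sketch above, the classifier stage would become:

```python
# Large-dataset settings reported above (sketch; same pipeline otherwise).
rf_large = RandomForestClassifier(
    labelCol="label", featuresCol="features",
    numTrees=50, maxDepth=25)
```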

Approach 2

For the second approach, we used opcode n-gram analysis. For preprocessing, all spaces and unneeded characters were removed using regular expressions. Next, we filtered the tokens to obtain the list of opcode instructions from each file. CountVectorizer was applied to extract features, which were then passed to a Random Forest model with 20 trees and a depth of 8. The accuracy obtained on the small dataset was 92.97%.
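A minimal sketch of this approach is shown below. The opcode whitelist, the regex, and the DataFrame/column names (asm_df, "text", "label") are illustrative assumptions rather than the project's exact preprocessing.

```python
# Minimal sketch of Approach 2, assuming each .asm file is loaded as one
# string in column "text" of asm_df, with the class label in "label".
import re
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Illustrative (incomplete) whitelist of x86 opcode mnemonics.
OPCODES = {"mov", "push", "pop", "call", "jmp", "add", "sub", "xor",
           "cmp", "jz", "jnz", "lea", "ret", "test", "and", "or"}

def extract_opcodes(text):
    # Strip everything except alphabetic tokens, then keep only known
    # opcodes, preserving their order within the file.
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return [t for t in tokens if t in OPCODES]

opcode_udf = udf(extract_opcodes, ArrayType(StringType()))
opcode_df = asm_df.withColumn("opcodes", opcode_udf("text"))

# Count features over the opcode lists, then the small-dataset forest.
cv = CountVectorizer(inputCol="opcodes", outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=20, maxDepth=8)

model = Pipeline(stages=[cv, rf]).fit(opcode_df)
```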

On the large dataset the model was changed to 50 trees and a depth of 25, which gave an accuracy of 98.45%.
