aapplebaum/kipple-data

kipple-data

This repository houses the data associated with the kipple project. It has two primary folders:

  • data, which contains files of zipped memmap'd feature arrays for adversarial malware, and
  • records, which contains a list of the associated md5/sha256 values (more below) for each dat file.

Each dat file in data has a matching txt file in records, which lists the md5/sha256 values of the samples encoded in the array.
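
One way to sanity-check this pairing is to compare the size of a dat file with the number of lines in its txt file. The sketch below is an illustration, not part of kipple: the helper name `check_store` is hypothetical, and it assumes each row is a float32 vector of 2381 values (the EMBER feature version 2 dimension), in line with the dtype used in the Usage example.

```python
import os

# Assumptions (not stated by the repository): each row holds FEATURE_DIM
# float32 values; 2381 is the EMBER feature-version-2 dimension.
FEATURE_DIM = 2381
BYTES_PER_VALUE = 4  # size of one float32


def check_store(name, data_dir="data", records_dir="records", ndim=FEATURE_DIM):
    """Return (rows_in_dat, lines_in_txt) for one data store.

    Hypothetical helper: the two numbers should match if the dat file
    holds one feature row per line of the record file.
    """
    dat_path = os.path.join(data_dir, name + ".dat")
    txt_path = os.path.join(records_dir, name + ".txt")

    row_bytes = ndim * BYTES_PER_VALUE
    size = os.path.getsize(dat_path)
    if size % row_bytes != 0:
        raise ValueError(f"{dat_path}: size {size} is not a whole number of rows")

    with open(txt_path) as f:
        lines = sum(1 for _ in f)
    return size // row_bytes, lines
```

For example, `check_store("msf_normal")` should return two equal numbers once the data has been unzipped; a mismatch would suggest a wrong feature dimension or a truncated file.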

In total, there are 13 data stores, matching the following table:

| Name | Description | Count |
| --- | --- | --- |
| msf_normal | Randomly generated implants from msfvenom, no added-code parameter | 5884 |
| msf_sorel | Randomly generated implants from msfvenom, added-code from the SoReL dataset | 33633 |
| msf_vs | Randomly generated implants from msfvenom, added-code from VirusShare | 7614 |
| sorel_malware_rl | Adversarial malware generated using Malware RL over the SoReL dataset | 37553 |
| sorel_sml_gamma | Adversarial malware generated using the GAMMA attack from SecML Malware on the SoReL dataset | 5167 |
| sorel_small_pad | Adversarial malware generated using the padding attack with a small pad from SecML Malware on the SoReL dataset | 225 |
| sorel_large_pad | Adversarial malware generated using the padding attack with a large pad from SecML Malware on the SoReL dataset | 277 |
| sorel_header_ev | Adversarial malware generated using the DOS Header attack from SecML Malware on the SoReL dataset | 2590 |
| vs_malware_rl | Adversarial malware generated using Malware RL over malware from VirusShare | 24581 |
| vs_sml_gamma | Adversarial malware generated using the GAMMA attack from SecML Malware on malware from VirusShare | 5629 |
| vs_small_pad | Adversarial malware generated using the padding attack with a small pad from SecML Malware on malware from VirusShare | 2347 |
| vs_large_pad | Adversarial malware generated using the padding attack with a large pad from SecML Malware on malware from VirusShare | 2815 |
| vs_header_ev | Adversarial malware generated using the DOS Header attack from SecML Malware on malware from VirusShare | 2814 |
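
If it helps to have the counts in code, the table can be captured as a dictionary and summed per source. This is only a convenience sketch: the counts are copied from the table, while the `COUNTS` name, the `totals_by_source` helper, and the idea of grouping by the prefix before the first underscore are assumptions made for this example.

```python
# Sample counts copied from the table above.
COUNTS = {
    "msf_normal": 5884,
    "msf_sorel": 33633,
    "msf_vs": 7614,
    "sorel_malware_rl": 37553,
    "sorel_sml_gamma": 5167,
    "sorel_small_pad": 225,
    "sorel_large_pad": 277,
    "sorel_header_ev": 2590,
    "vs_malware_rl": 24581,
    "vs_sml_gamma": 5629,
    "vs_small_pad": 2347,
    "vs_large_pad": 2815,
    "vs_header_ev": 2814,
}


def totals_by_source(counts):
    """Sum sample counts by the prefix before the first underscore."""
    totals = {}
    for name, count in counts.items():
        source = name.split("_", 1)[0]  # "msf", "sorel", or "vs"
        totals[source] = totals.get(source, 0) + count
    return totals


if __name__ == "__main__":
    for source, total in totals_by_source(COUNTS).items():
        print(source, total)
    print("all", sum(COUNTS.values()))
```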

Pre-requisites

The data is zipped. The main kipple repo assumes it has been unzipped, so we strongly recommend unzipping right after you download this repo. The data is zipped only to stay within file size requirements.
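
The unzip step can be scripted. A minimal sketch, assuming the archives are ordinary .zip files sitting directly in the data folder (the helper name `unzip_all` is made up for this example and is not part of kipple):

```python
import zipfile
from pathlib import Path


def unzip_all(data_dir="data"):
    """Extract every .zip archive found directly in data_dir, in place.

    Assumption: the zipped arrays are standard .zip files stored in the
    data folder; adjust the pattern if the archives are named differently.
    Returns the list of extracted member names.
    """
    extracted = []
    for archive in sorted(Path(data_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
            extracted.extend(zf.namelist())
    return extracted
```

Running `unzip_all()` from the root of this repo would leave the extracted files next to their archives, which is where the Usage example looks for them.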

File Hashes

The records directory contains files listing the file hashes associated with each data array. Due to the different data sources, and some small code hiccups, there are some nuances in the naming convention:

  • All hashes under the "msf" category are the MD5 file hashes of the implant generated by msfvenom.
  • All hashes under the "vs" category are the MD5 file hashes of the original malware downloaded from VirusShare.
    • In some cases, multiple variants of the same original sample were created; in these cases, every variant after the first has "-ABC-N.exe" appended to its name, where N is the variant number.
    • In some cases, a sha256 value may be used in place of an MD5.
  • All hashes under the "sorel" category are the hashes of the original malware.
    • SoReL modifies the malware binaries to be non-executable, giving them a different hash than the "active"/original malware.
    • The sha256 values correspond to the original version.
  • There may be some names solely consisting of "-".
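
Because a record file can mix these conventions, it may help to classify each line before using it. The sketch below is one possible approach, not something kipple ships: the `classify_record` name is hypothetical, and it assumes every real entry starts with the hex digest, with any variant marker following it.

```python
import re

# 64 hex characters (sha256) is tried before 32 (MD5) so that a sha256 is
# not mistaken for an MD5 followed by a suffix.
_HASH_RE = re.compile(r"^([0-9a-fA-F]{64}|[0-9a-fA-F]{32})(.*)$")


def classify_record(line):
    """Return (kind, digest, suffix) for one line of a record file.

    kind is "sha256", "md5", "placeholder" (the bare "-" entries), or
    "unknown". suffix is whatever follows the digest, e.g. a variant tag.
    """
    entry = line.strip()
    if entry == "-":
        return ("placeholder", None, "")
    match = _HASH_RE.match(entry)
    if match is None:
        return ("unknown", None, entry)
    digest, suffix = match.groups()
    kind = "sha256" if len(digest) == 64 else "md5"
    return (kind, digest.lower(), suffix)
```

Counting the kinds over a whole record file is a quick way to see how many placeholder or sha256 entries a given store contains.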

Why this format?

The memmap'd storage format probably isn't ideal; it would have been better to store and share the malware as feature sets, similar to how EMBER stores its data. However, to save time during testing, we would effectively add all newly generated malware samples to the existing memmap'd set, letting us run quicker tests. Hopefully at some point in the future I'll go through and revise the storage format.

Usage

Assuming you've already unzipped the data, the following code is an example of running a classifier over the kipple data:

import gzip

import lightgbm as lgb
import numpy as np
from ember.features import PEFeatureExtractor

# Load the EMBER feature extractor to get the number of feature dimensions
extractor = PEFeatureExtractor(feature_version=2, print_feature_warning=False)
ndim = extractor.dim

# Memory-map the array we want to use; the record file has one line per row
target_data = "msf_normal"
with open("records/" + target_data + ".txt") as f:
    num_entries = sum(1 for _ in f)
malware_data = np.memmap("data/" + target_data + ".dat", dtype=np.float32, mode="r", shape=(num_entries, ndim))

# Load a local, gzipped LightGBM model (adjust the path to your own checkout)
model_location = "/exes/kipple_repo/kipple/models/initial.txt.gz"
with gzip.open(model_location, "rb") as f:
    md = f.read().decode("ascii")
mdl = lgb.Booster(model_str=md)

# Score every sample in one call; a score above .85 counts as a correct classification
scores = mdl.predict(np.asarray(malware_data))
num_correct = int((scores > .85).sum())
print(num_correct / num_entries)
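
To see which samples fall below the threshold rather than just the overall rate, the scores can be paired with the record file. This is a sketch under an assumption the repository does not state explicitly: that row i of the dat file corresponds to line i of the txt file. The `missed_samples` helper is hypothetical.

```python
def missed_samples(names, scores, threshold=0.85):
    """Return the record names whose score is at or below the threshold.

    names: lines from records/<store>.txt, in file order.
    scores: one model score per row of the matching dat file.
    Assumes row i of the array corresponds to line i of the record file.
    """
    names = [n.strip() for n in names]
    if len(names) != len(scores):
        raise ValueError("expected one score per record line")
    return [name for name, score in zip(names, scores) if score <= threshold]
```

With the variables from the example above, this could be called as `missed_samples(open("records/" + target_data + ".txt"), mdl.predict(np.asarray(malware_data)))`.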

There are more examples in the primary kipple directory.
