Using sklearn #3

Closed · mozark24 opened this issue Apr 30, 2018 · 9 comments

mozark24 commented Apr 30, 2018

First off, thank you for this awesome dataset! I completely agree that this level of control over both benign and malware sets of this size has been a shortfall, based on my research. As someone relatively new to ML, I would like to use the dataset with more traditional sklearn modules instead of the provided Ember ones. Apologies if this isn't a great place to ask, but what steps can I take to prep the dataset so I can take over and apply, say, a basic LogisticRegression model to the import calls? Thanks!

mozark24 commented May 1, 2018

As I look through the code with fresh eyes, I think I can work it out. What was throwing me yesterday was the feature hashing/importing steps.

mrphilroth (Contributor) commented

You're probably on the right track, but I'll just stress that you can use this code to vectorize the features only. At that point, you'll have a large feature matrix that you can read in and hand to whatever model you're interested in, including all those in scikit-learn.

mozark24 commented May 1, 2018

To confirm: if I just want to vectorize and go from there, I should follow the "Import usage" steps up to
metadata_dataframe = ember.read_metadata("/data/ember/")?

I might even want to take over earlier. Can't wait to dig in!

mrphilroth (Contributor) commented

That's right. If you complete those steps, then the X_train, y_train, X_test, y_test variables in your environment will be immediately ready to be handed to scikit-learn models.
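
For example, a minimal end-to-end sketch (the data directory is a placeholder, and this assumes the "Import usage" steps that create the vectorized feature files have been run at least once):

import ember
from sklearn.linear_model import LogisticRegression

data_dir = "/data/ember/"  # placeholder path

# One-time step: writes the vectorized feature files into data_dir
ember.create_vectorized_features(data_dir)

# Load the fixed-length feature matrices and label vectors
X_train, y_train, X_test, y_test = ember.read_vectorized_features(data_dir)

# Hand them straight to any scikit-learn estimator
# (note: y_train still contains -1 "unlabeled" rows; see later in this thread)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))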

mozark24 closed this as completed May 1, 2018
mozark24 commented May 11, 2018

(Temp re-open) I imagine some data pre-processing is required to select only certain features, or to remove the "-1" labeled rows from the datasets for a purely supervised approach. I can remove them at the dataframe step, but the X_train, y_train, X_test, y_test = ember.read_vectorized_features(data_dir) operation reads straight from the already-written vectorized data files rather than from the metadata dataframe. The only way I can see to make this adjustment would be to strip those rows from the JSONL files prior to vectorization. Does that seem accurate?

mozark24 reopened this May 11, 2018
mrphilroth (Contributor) commented

"""remove the "-1" labeled rows"""

Check out the train_ember function for how I filter out the -1 labeled rows from the ember benchmark model training: https://github.com/endgameinc/ember/blob/master/ember/__init__.py#L146-L160
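
In short, that filter is plain boolean indexing on the labels, something like:

# Keep only labeled rows: 0 = benign, 1 = malicious, -1 = unlabeled
train_rows = (y_train != -1)
X_train_labeled = X_train[train_rows]
y_train_labeled = y_train[train_rows]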

"""select certain features"""

If you want to select only certain columns from X_train or X_test, you can use numpy indexing to achieve this:
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html
https://stackoverflow.com/questions/8386675/extracting-specific-columns-in-numpy-array
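
For example, to grab a few (hypothetical) columns from every row:

cols = [0, 5, 10]                  # hypothetical column indices
X_train_subset = X_train[:, cols]  # all rows, just those columns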

If you want to select rows from the feature matrix based on the metadata dataframe, I would suggest doing something like:

selected_rows = metadata["appeared"] < "2017-08"
X_train_filtered = X_train[selected_rows]

Good luck!

mozark24 commented May 12, 2018

What is not clear is the mapping from the header columns to the numpy array indices needed to slice out a particular feature. For instance, if I only want to look at the imports info, how can I determine which array indices it corresponds to in X_train? After all, there are 2,351 feature columns to choose from in X_train/X_test. The FeatureHasher also makes it very difficult to characterize feature importance.

drhyrum commented May 12, 2018

Since the hashing trick is used to convert, e.g., a ragged count of imports into a fixed-length vector, you'd only be able to back out "these columns are imports"; within that block you have a many-to-one problem, with many imported names mapping to any one column.

Hashing trick: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html

For any one feature type, you can figure out where its block begins by noting the order of features:
https://github.com/endgameinc/ember/blob/master/ember/features.py#L441-L443

features = [
    ByteHistogram(), ByteEntropyHistogram(), StringExtractor(), GeneralFileInfo(), 
    HeaderFileInfo(), SectionInfo(), ImportsInfo(), ExportsInfo()
]

and noting that every FeatureType has a dim attribute that specifies the number of columns it spans in the feature matrix. So, the ImportsInfo features begin at

imports_offset = sum(fe.dim for fe in features[:6])

and is features[6].dim columns long.
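
Putting it together, a sketch for slicing out just the hashed import columns:

imports_offset = sum(fe.dim for fe in features[:6])
imports_dim = features[6].dim  # width of the ImportsInfo block

# All rows, only the ImportsInfo columns
X_train_imports = X_train[:, imports_offset:imports_offset + imports_dim]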

mozark24 (Author) commented

I have it working now. I dug into the code and figured out where each of those fixed-length feature blocks sits, so I can compare the various categories against each other. It's working well!
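
For anyone who lands here later, the bookkeeping looks roughly like this (a sketch, assuming each FeatureType in ember/features.py exposes name and dim attributes):

# Map each feature type's name to its column slice in the feature matrix
slices = {}
offset = 0
for fe in features:
    slices[fe.name] = slice(offset, offset + fe.dim)
    offset += fe.dim

# e.g. compare two categories against each other
X_imports = X_train[:, slices["imports"]]  # assumes ImportsInfo.name == "imports"
X_exports = X_train[:, slices["exports"]]  # assumes ExportsInfo.name == "exports"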

choisungwook referenced this issue in choisungwook/ember Dec 2, 2018