# Task 1 - Training
In this task, you will create and train a deep neural network based on the
MalConv architecture to classify PE files as malware or benign. The dataset is
EMBER-2017 v2 (https://github.com/endgameinc/ember). Besides the references provided in that
repository, two talks, at BSides San Francisco 2018 and CAMLIS 2019,
present detailed overviews of this dataset, as well as hints on how to use EMBER to train malware
classifiers.


In [2]:
!pip install numpy
!pip install pandas
!pip install keras
!pip install scikit-learn
!pip install altair vega_datasets
!pip install tensorflow==2.0





In [3]:
!pip install js.ember



Creating the Vectorized data

In [4]:
import ember
ember.create_vectorized_features("C:\\Users\\gouru\\Downloads\\ember_2017_2")
ember.create_metadata("C:\\Users\\gouru\\Downloads\\ember_2017_2")



Vectorizing training set


100%|████████████████████████████████████████████████████████████████████████| 900000/900000 [11:31<00:00, 1301.28it/s]


Vectorizing test set


100%|████████████████████████████████████████████████████████████████████████| 200000/200000 [02:34<00:00, 1291.92it/s]


Unnamed: 0,sha256,appeared,subset,label
0,0abb4fda7d5b13801d63bee53e5e256be43e141faa077a...,2006-12,train,0
1,d4206650743b3d519106dea10a38a55c30467c3d9f7875...,2006-12,train,0
2,c9cafff8a596ba8a80bafb4ba8ae6f2ef3329d95b85f15...,2007-01,train,0
3,7f513818bcc276c531af2e641c597744da807e21cc1160...,2007-02,train,0
4,ca65e1c387a4cc9e7d8a8ce12bf1bcf9f534c9032b9d95...,2007-02,train,0
...,...,...,...,...
1099995,fffe314f23cee3a68ccab272934877d3bc18ec3bd905df...,2017-12,test,0
1099996,fffe7a1b23e04facc9ca91a93ac4a34e8b3040e023dbde...,2017-12,test,1
1099997,fffe801f51e7ec931515aa49a3d157a9c0fbcdca8c9d80...,2017-12,test,0
1099998,fffe92f9593649c4a7050302368189de45e2c1c06b04ea...,2017-12,test,1


In [5]:
X_train, y_train, X_test, y_test = ember.read_vectorized_features("C:\\Users\\gouru\\Downloads\\ember_2017_2")
metadata_dataframe = ember.read_metadata("C:\\Users\\gouru\\Downloads\\ember_2017_2")



In [6]:
metadata_dataframe.tail()


Unnamed: 0,sha256,appeared,subset,label
1099995,fffe314f23cee3a68ccab272934877d3bc18ec3bd905df...,2017-12,test,0
1099996,fffe7a1b23e04facc9ca91a93ac4a34e8b3040e023dbde...,2017-12,test,1
1099997,fffe801f51e7ec931515aa49a3d157a9c0fbcdca8c9d80...,2017-12,test,0
1099998,fffe92f9593649c4a7050302368189de45e2c1c06b04ea...,2017-12,test,1
1099999,ffffb259a4c5e25ae1437af59caafb718cf8879187cc8c...,2017-12,test,1


Normalizing the data and keeping only the labeled samples:
Without normalization, the model reached only about 41% accuracy. Rows labeled -1 are unlabeled and are dropped from the training set.

In [7]:
labelrows = (y_train != -1)

In [8]:
X_train = X_train[labelrows]
y_train = y_train[labelrows]

In [10]:
len(X_train)

600000

In [11]:
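from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
# Fit the scaler incrementally, 100,000 rows at a time, so the whole array is never processed at once
chunk = 100000
for start in range(0, len(X_train), chunk):
    ss.partial_fit(X_train[start:start + chunk])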

In [12]:
X_train = ss.transform(X_train)

In [13]:
len(X_train)

600000
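
The filtering and chunked scaling steps above can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual run: the array sizes, the random data, and the `chunk` size are made up, and it assumes unlabeled rows carry the label -1 as in the cells above. It checks that fitting with `partial_fit` in chunks yields the same statistics as a single `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # synthetic feature matrix
y = rng.choice([-1, 0, 1], size=1000)               # -1 marks unlabeled rows

# Keep only the labeled rows
labeled = (y != -1)
X_lab, y_lab = X[labeled], y[labeled]

# Fit the scaler incrementally, one chunk at a time
chunk = 100  # hypothetical chunk size for this toy example
ss_chunked = StandardScaler()
for start in range(0, len(X_lab), chunk):
    ss_chunked.partial_fit(X_lab[start:start + chunk])

# Reference: fit on the full array in one call
ss_full = StandardScaler().fit(X_lab)

X_scaled = ss_chunked.transform(X_lab)
print(np.allclose(ss_chunked.mean_, ss_full.mean_))
print(np.allclose(ss_chunked.scale_, ss_full.scale_))
```

The same fitted scaler must later be applied to the test set with `transform` only, so that test data are scaled with training statistics.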

Training Model

In [14]:
from keras import optimizers, metrics
from keras import backend as K
from keras.layers import Dense, Conv1D, Activation, GlobalMaxPooling1D, Input, Embedding, Multiply
from keras.models import Model, load_model
from keras.optimizers import SGD
from keras.callbacks import LearningRateScheduler

Using TensorFlow backend.


In [15]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
maxLen = 200000

In [16]:
def build_model(feature_size=2381):
    # feature_size: length of each EMBER feature vector
    tf.compat.v1.disable_eager_execution()
    keras.backend.clear_session()

    # Model architecture: dropout -> dense (ReLU, L1 activity regularization) -> dropout -> sigmoid output
    model = tf.keras.Sequential()
    model.add(layers.InputLayer(input_shape=(1, feature_size)))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(1500, activation='relu', activity_regularizer=tf.keras.regularizers.l1(l=0.01)))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation='sigmoid'))

    model.compile(tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy', tf.keras.metrics.AUC(), tf.keras.metrics.Precision()])
    print(model.summary())
    return model

In [17]:
model = build_model()

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 1, 2381)           0         
_________________________________________________________________
dense (Dense)                (None, 1, 1500)           3573000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1, 1500)           0         
_________________________________________________________________
dense_1 (Dense)              (None, 1, 1)              1501      
Total params: 3,574,501
Trainable params: 3,574,501
Non-trainable params: 0
_________________________________________________________________
None


In [18]:
import numpy as np
X_train = np.reshape(X_train,(-1,1,2381))
y_train = np.reshape(y_train,(-1,1,1))
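
The reshape above adds a length-1 middle axis so that the arrays match the model's `(None, 1, 2381)` input and `(None, 1, 1)` output shapes shown in the summary. A small NumPy sketch with made-up sizes (4 samples, with 3 features standing in for 2381):

```python
import numpy as np

n_samples, n_features = 4, 3  # toy sizes, for illustration only
X = np.arange(n_samples * n_features, dtype=float).reshape(n_samples, n_features)
y = np.array([0, 1, 1, 0])

# Insert a length-1 axis between the sample axis and the feature axis
X3 = np.reshape(X, (-1, 1, n_features))
y3 = np.reshape(y, (-1, 1, 1))
print(X3.shape, y3.shape)

# The values are unchanged; flattening the extra axis recovers the original
X_back = X3.reshape(-1, n_features)
```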

In [19]:
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

# Recompile with a learning rate of 0.01 (the model was first compiled with 0.001)
model.compile(tf.keras.optimizers.Adam(learning_rate=0.01),
              loss='binary_crossentropy',
              metrics=['accuracy', tf.keras.metrics.AUC(), tf.keras.metrics.Precision()])

history = model.fit(X_train, y_train,
                    batch_size=128,
                    epochs=5,
                    validation_split=.2,
                    callbacks=None)


Train on 480000 samples, validate on 120000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Testing the Model

In [20]:
X_test = ss.transform(X_test)

In [21]:
X_test = np.reshape(X_test,(-1,1,2381))
y_test = np.reshape(y_test,(-1,1,1))

Accuracy on test set

In [22]:
results = model.evaluate(X_test, y_test)
print("loss: %g, acc: %g" % (results[0], results[1]))

loss: 616.169, acc: 0.944365


In [23]:
y_pred = model.predict(X_test)

In [25]:
y_test = np.reshape(y_test,(-1))
y_pred = np.reshape(y_pred,(-1))

False positive rate

In [26]:
from sklearn.metrics import roc_auc_score, roc_curve
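def get_fpr(y_test, y_pred):
    # False positive rate: share of benign samples (label 0) predicted as malware.
    # Note: y_pred holds sigmoid scores, so `== 1` only counts scores of exactly 1.0.
    nbenign = (y_test == 0).sum()
    nfalse = (y_pred[y_test == 0] == 1).sum()
    return nfalse / float(nbenign)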




In [27]:
get_fpr(y_test, y_pred)

0.00613

F1 Score

In [28]:
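# Note: casting float scores to int truncates toward zero, so only scores of
# exactly 1.0 become class 1; every lower score is counted as class 0.
y_test_int = np.asarray(y_test, dtype=int)
y_pred_int = np.asarray(y_pred, dtype=int)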
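
Casting sigmoid scores to `int` truncates them, which amounts to a decision threshold of 1.0. A common alternative, shown here as a sketch and not what the cells in this notebook do, is to threshold the scores at 0.5. The scores below are made up for illustration.

```python
import numpy as np

scores = np.array([0.02, 0.49, 0.51, 0.97, 1.0])  # hypothetical sigmoid outputs

# Integer cast: truncates toward zero, so only a score of exactly 1.0 maps to class 1
truncated = np.asarray(scores, dtype=int)

# Explicit threshold at 0.5: scores at or above 0.5 map to class 1
thresholded = (scores >= 0.5).astype(int)

print(truncated)
print(thresholded)
```

The metrics reported in this notebook come from the truncating cast, so they describe the classifier at that much stricter operating point.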


In [29]:
from sklearn.metrics import f1_score

fscore = f1_score(y_test_int, y_pred_int) 
fscore

0.7965743880486332

Precision

In [30]:
from sklearn.metrics import precision_score
precision = precision_score(y_test_int, y_pred_int)
precision

0.9908794691345166

Confusion Matrix

In [31]:
from sklearn.metrics import confusion_matrix
confmat = confusion_matrix(y_test_int,y_pred_int,labels=[0,1])

In [32]:
confmat

array([[99387,   613],
       [33402, 66598]], dtype=int64)
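
The metrics reported above can be re-derived from this confusion matrix. The sketch below uses only the four counts printed above, with rows as true labels and columns as predicted labels in the order `[0, 1]`; the variable names are chosen for illustration.

```python
# Counts taken from the confusion matrix above
tn, fp = 99387, 613     # true label 0 (benign)
fn, tp = 33402, 66598   # true label 1 (malware)

fpr = fp / (fp + tn)                                 # false positive rate
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(fpr, precision, recall, f1, accuracy)
```

The false positive rate, precision, and F1 score agree with the values printed earlier (0.00613, about 0.9909, and about 0.7966). The accuracy implied by these counts is about 0.83, lower than the 0.944 reported by `model.evaluate`. That gap is presumably because Keras computes accuracy with a 0.5 threshold, while these counts come from predictions cast to `int`.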

Saving the Model 

In [33]:


!pip install h5py



In [35]:
import os
save_path = "C:\\Users\\gouru\\Downloads\\ember_2017_2"

model.save_weights(os.path.join(save_path,"my_weights.h5"))

# save neural network structure to JSON (no weights)
model_json = model.to_json()
with open(os.path.join(save_path,"my_model.json"), "w") as json_file:
    json_file.write(model_json)

Set up and load the Keras model using the json and weights file

In [1]:
import boto3, re
from sagemaker import get_execution_role

role = get_execution_role()

In [3]:
import keras
from keras.models import model_from_json

Using TensorFlow backend.





In [4]:
!mkdir keras_model

In [5]:
!ls keras_model

my_model.json  my_weights.h5


In [6]:
import tensorflow as tf

with open('/home/ec2-user/SageMaker/keras_model/my_model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json, custom_objects={"GlorotUniform": tf.keras.initializers.glorot_uniform})




Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [7]:
loaded_model.load_weights('/home/ec2-user/SageMaker/keras_model/my_weights.h5')
print("Loaded model from disk")







Loaded model from disk


Export the Keras model to the TensorFlow ProtoBuf format

In [8]:
from tensorflow.python.saved_model import builder
from tensorflow.python.saved_model.signature_def_utils import predict_signature_def
from tensorflow.python.saved_model import tag_constants

# Note: This directory structure will need to be followed - see notes for the next section
model_version = '1'
export_dir = 'export/Servo/' + model_version

In [9]:

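import shutil
# Remove any previous export; ignore_errors avoids a failure when the directory does not exist yet
shutil.rmtree(export_dir, ignore_errors=True)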

In [10]:
# Build the Protocol Buffer SavedModel at 'export_dir'
build = builder.SavedModelBuilder(export_dir)

In [11]:

# Create prediction signature to be used by TensorFlow Serving Predict API
signature = predict_signature_def(
    inputs={"inputs": loaded_model.input}, outputs={"score": loaded_model.output})

Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.


In [12]:
from keras import backend as K

with K.get_session() as sess:
    # Save the meta graph and variables
    build.add_meta_graph_and_variables(
        sess=sess, tags=[tag_constants.SERVING], signature_def_map={"serving_default": signature})
    build.save()

INFO:tensorflow:No assets to save.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: export/Servo/1/saved_model.pb


Convert TensorFlow model to a SageMaker readable format 

In [13]:
!ls export

Servo


In [14]:

!ls export/Servo

1


In [15]:

!ls export/Servo/1/variables

variables.data-00000-of-00001  variables.index


In [16]:
import tarfile
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
    archive.add('export', recursive=True)
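
As a sanity check on the archive layout, the sketch below rebuilds the same `export/Servo/<version>/` tree in a temporary directory with empty placeholder files and lists what ends up in the tarball. It assumes, as the note in the export cell says, that this directory structure must be preserved inside `model.tar.gz`. The helper name `build_archive` and the placeholder files are hypothetical and stand in for the real SavedModel output.

```python
import os
import tarfile
import tempfile

def build_archive(workdir, version="1"):
    """Create a placeholder export tree, tar it, and return the archive member names."""
    export_dir = os.path.join(workdir, "export", "Servo", version)
    os.makedirs(os.path.join(export_dir, "variables"))

    # Empty stand-ins for the files written by the SavedModel builder
    for rel in ("saved_model.pb", os.path.join("variables", "variables.index")):
        with open(os.path.join(export_dir, rel), "wb") as f:
            f.write(b"")

    archive_path = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive_path, mode="w:gz") as archive:
        # arcname keeps paths inside the archive relative, starting at "export"
        archive.add(os.path.join(workdir, "export"), arcname="export")

    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(archive.getnames())

with tempfile.TemporaryDirectory() as tmp:
    names = build_archive(tmp)

for name in names:
    print(name)
```

Listing the members before uploading makes it easy to confirm that `saved_model.pb` sits under `export/Servo/1/` and not at the archive root.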

In [17]:
import sagemaker

sagemaker_session = sagemaker.Session()
inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')

Deploy the trained model (must use AWS SageMaker Notebook)

In [18]:
!touch train.py

In [19]:
from sagemaker.tensorflow.model import TensorFlowModel
sagemaker_model = TensorFlowModel(model_data = 's3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role = role,
                                  framework_version = '1.12',
                                  entry_point = 'train.py')

2.1.0 is the latest version of tensorflow that supports Python 2. Newer versions of tensorflow will only be available for Python 3.Please set the argument "py_version='py3'" to use the Python 3 tensorflow image.


In [20]:
%%time
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.m4.xlarge')

-------------!CPU times: user 462 ms, sys: 21.8 ms, total: 483 ms
Wall time: 6min 34s


In [22]:
predictor.endpoint

'sagemaker-tensorflow-2020-04-28-00-12-37-103'

In [23]:
endpoint_name = 'sagemaker-tensorflow-2020-04-28-00-12-37-103'

In [28]:
from sagemaker.tensorflow.model import TensorFlowPredictor
predictor = TensorFlowPredictor(endpoint_name, sagemaker_session)

In [35]:
import json
import boto3
import numpy as np
import io

endpoint_name = 'sagemaker-tensorflow-2020-04-28-00-12-37-103'