<a href="https://colab.research.google.com/github/cmikke97/Automatic-Malware-Signature-Generation/blob/main/src/DetectionBase/DetectionBase_Github.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Train and Evaluate ML detection model**

(code available at https://github.com/cmikke97/Automatic-Malware-Signature-Generation)

# **Needed packages**

In [None]:
!pip install boto3
!pip install baker
!pip install -U logzero
!pip install lmdb

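
Before moving on, it can help to confirm that the packages installed above are importable from the notebook kernel. The following is a minimal sketch, assuming the import names match the pip package names; `find_missing_modules` is a hypothetical helper and not part of the repository.

```python
import importlib.util

# Import names for the packages installed above
# (assumed to be identical to the pip package names).
REQUIRED_MODULES = ["boto3", "baker", "logzero", "lmdb"]


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, re-running the install cell (or restarting the runtime) is the usual first step.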

# **Set up Drive**

In [None]:
from google.colab import drive

# set the path where Drive will be mounted
drive_path = "/content/drive"

# mount drive
drive.mount(drive_path)

Mounted at /content/drive


# **Set base path**

In [None]:
import os

# set base path (if using google Drive)
base_path = os.path.join(drive_path, "MyDrive/thesis")

# **Clone git repository**

In [None]:
# remove any previous clone (-f avoids an error when the directory does not exist yet)
!rm -rf /content/Automatic-Malware-Signature-Generation


In [None]:
!git clone https://github.com/cmikke97/Automatic-Malware-Signature-Generation.git

Cloning into 'Automatic-Malware-Signature-Generation'...
remote: Enumerating objects: 456, done.
remote: Counting objects: 100% (456/456), done.
remote: Compressing objects: 100% (433/433), done.
Receiving objects: 100% (456/456), 127.15 KiB | 6.05 MiB/s, done.
remote: Total 456 (delta 143), reused 0 (delta 0), pack-reused 0
Resolving deltas: 100% (143/143), done.


# **Download SOREL20M dataset**

In [None]:
# set destination dir
dataset_dir = "/content/Dataset"

In [None]:
# execute downloader
!python Automatic-Malware-Signature-Generation/src/DatasetDownloader/sorel20mDownloader.py sorel20m_download $dataset_dir

Now downloading 09-DEC-2020/processed-data/meta.db from s3 bucket..
100% 3788979200/3788979200 [00:33<00:00, 113085201.69it/s]
1/3 done.
Now downloading 09-DEC-2020/processed-data/ember_features/lock.mdb from s3 bucket..
100% 65664/65664 [00:00<00:00, 181365.99it/s]
2/3 done.
Now downloading 09-DEC-2020/processed-data/ember_features/data.mdb from s3 bucket..
100% 76865335296/76865335296 [15:58<00:00, 80169334.77it/s]
3/3 done.
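
Since the download takes several minutes and roughly 80 GB, a quick check that all three files landed on disk can save a failed training run later. This is a minimal sketch, assuming the downloader keeps the S3 key layout shown in the log above under the destination directory; `missing_dataset_files` is a hypothetical helper, not part of the repository.

```python
import os

# Relative paths copied from the downloader log above; the assumption is that
# they are recreated unchanged below the destination directory.
EXPECTED_FILES = [
    "09-DEC-2020/processed-data/meta.db",
    "09-DEC-2020/processed-data/ember_features/lock.mdb",
    "09-DEC-2020/processed-data/ember_features/data.mdb",
]


def missing_dataset_files(root_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not regular files under root_dir."""
    return [rel for rel in expected
            if not os.path.isfile(os.path.join(root_dir, rel))]


# "/content/Dataset" is the destination directory set earlier in this notebook
print("Missing files:", missing_dataset_files("/content/Dataset"))
```

An empty list means all three files are present; it does not verify their size or integrity.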


# **Configuration**

To change the configuration, edit the values in the local copy of "config.py" located at "/content/Automatic-Malware-Signature-Generation/src/DetectionBase/config.py".
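
To see which settings this notebook actually reads, the values can be listed before starting a run. The sketch below is illustrative only: `summarize_config` is a hypothetical helper, and the stand-in object uses made-up values in place of the real "config.py" module, which may define more settings than the three used in the cells of this notebook.

```python
from types import SimpleNamespace

# Attribute names that the cells in this notebook read from config.py
USED_SETTINGS = ("runs", "checkpoint_dir", "results_dir")


def summarize_config(cfg, names=USED_SETTINGS):
    """Return a dict of the selected settings; absent ones map to None."""
    return {name: getattr(cfg, name, None) for name in names}


# Stand-in for the real module, with made-up values for illustration only
example_cfg = SimpleNamespace(runs=2,
                              checkpoint_dir="/tmp/checkpoints",
                              results_dir="/tmp/results")
print(summarize_config(example_cfg))
```

In the notebook, passing the imported `config` module instead of `example_cfg` should work the same way. Note that Python caches imported modules, so if "config.py" is edited after `import config` has already run, `importlib.reload(config)` is needed for the notebook process to see the new values; scripts launched with `!python` start a fresh interpreter and read the file again.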

# **Train Network**

In [None]:
os.chdir("/content/Automatic-Malware-Signature-Generation/src/DetectionBase")

import config

checkpoint_base_dir = config.checkpoint_dir

# for the number of configured runs
for i in range(config.runs):
    checkpoint_dir = os.path.join(checkpoint_base_dir, str(i))
    remove_missing_features = os.path.join(base_path, "Dataset/09-DEC-2020/processed-data/shas_missing_ember_features.json")

    # execute train.py script
    !python train.py train_network --checkpoint_dir $checkpoint_dir --remove_missing_features $remove_missing_features

[I 210404 10:27:39 train:151] ...instantiating network
[I 210404 10:27:46 dataset:151] Opening Dataset at /content/Dataset/09-DEC-2020/processed-data/meta.db in train mode.
[I 210404 10:27:47 dataset:161] 400000 samples loaded.
[I 210404 10:27:47 dataset:209] Trying to load shas to ignore from /content/drive/MyDrive/thesis/Dataset/09-DEC-2020/processed-data/shas_missing_ember_features.json...
[I 210404 10:27:51 dataset:220] Dataset now has 393310 samples.
[I 210404 10:27:52 dataset:151] Opening Dataset at /content/Dataset/09-DEC-2020/processed-data/meta.db in validation mode.
[I 210404 10:28:25 dataset:161] 76923 samples loaded.
[I 210404 10:28:25 dataset:209] Trying to load shas to ignore from /content/drive/MyDrive/thesis/Dataset/09-DEC-2020/processed-data/shas_missing_ember_features.json...
[I 210404 10:28:26 dataset:220] Dataset now has 75118 samples.
[I 210404 10:28:26 train:188] St
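
A shell command started with `!` does not raise a Python exception when it exits with an error, so the loop above would continue to the next run even if one training run failed. One possible alternative is sketched below with `subprocess`; the command-line arguments are the ones used in the cell above, while `build_train_command` and `run_training` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys


def build_train_command(checkpoint_dir, missing_features_json):
    """Build the argument list for one train.py run (same flags as the cell above)."""
    return [
        sys.executable, "train.py", "train_network",
        "--checkpoint_dir", checkpoint_dir,
        "--remove_missing_features", missing_features_json,
    ]


def run_training(checkpoint_dir, missing_features_json):
    """Run one training process; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_train_command(checkpoint_dir, missing_features_json),
                   check=True)
```

Calling `run_training(checkpoint_dir, remove_missing_features)` inside the loop, from the directory containing "train.py", would stop at the first failing run instead of silently moving on.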

# **Evaluate Network**

In [None]:
import json

results_base_dir = config.results_dir

# instantiate results_files dictionary
results_files = {}

# for the number of configured runs
for i in range(config.runs):
    results_dir = os.path.join(results_base_dir, str(i))

    # add file path to results_files dictionary (used for plotting results)
    results_files["run_id_" + str(i)] = os.path.join(results_dir, "results.csv")

    checkpoint_file = os.path.join(config.checkpoint_dir, str(i), "epoch_10.pt")
    remove_missing_features = os.path.join(base_path, "Dataset/09-DEC-2020/processed-data/shas_missing_ember_features.json")

    # execute evaluate.py script
    !python evaluate.py evaluate_network --results_dir $results_dir --checkpoint_file $checkpoint_file --remove_missing_features $remove_missing_features
    
# create and open the results.json file in write mode
with open(os.path.join(results_base_dir, "results.json"), "w") as output_file:
    # save results_files dictionary as a json file
    json.dump(results_files, output_file)

[I 210405 01:15:02 dataset:151] Opening Dataset at /content/Dataset/09-DEC-2020/processed-data/meta.db in test mode.
[I 210405 01:15:44 dataset:161] 123077 samples loaded.
[I 210405 01:15:44 dataset:209] Trying to load shas to ignore from /content/drive/MyDrive/thesis/Dataset/09-DEC-2020/processed-data/shas_missing_ember_features.json...
[I 210405 01:15:44 dataset:220] Dataset now has 120760 samples.
[I 210405 01:15:45 evaluate:127] ...running network evaluation
100% 15/15 [13:37<00:00, 54.51s/it]
...done
[I 210405 01:29:35 dataset:151] Opening Dataset at /content/Dataset/09-DEC-2020/processed-data/meta.db in test mode.
[I 210405 01:30:16 dataset:161] 123077 samples loaded.
[I 210405 01:30:16 dataset:209] Trying to load shas to ignore from /content/drive/MyDrive/thesis/Dataset/09-DEC-2020/processed-data/shas_missing_ember_features.json...
[I 210405 01:30:17 dataset:220] Dataset now has 120760 samp
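
The "results.json" file written above maps each run id to the path of its "results.csv". Before plotting, it may be worth checking that every listed file exists, since a failed evaluation would otherwise only surface later. The sketch below rebuilds the same mapping and reports missing files; `build_results_index` and `missing_results` are hypothetical helpers, not part of the repository.

```python
import os


def build_results_index(results_base_dir, runs):
    """Map 'run_id_<i>' to '<results_base_dir>/<i>/results.csv', as in the cell above."""
    return {
        "run_id_" + str(i): os.path.join(results_base_dir, str(i), "results.csv")
        for i in range(runs)
    }


def missing_results(index):
    """Return the run ids whose results file does not exist on disk."""
    return [run_id for run_id, path in index.items()
            if not os.path.isfile(path)]
```

In the notebook, `missing_results(results_files)` could be called right after the evaluation loop; an empty list means every run produced its results file.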

# **Plot Results**

In [None]:
# for the number of configured runs
for i in range(config.runs):
    results_file = os.path.join(results_base_dir, str(i), "results.csv")
    output_filename = os.path.join(results_base_dir, str(i), "results.png")

    # execute plots.py to plot per-tag results for the single run
    !python plots.py plot_tag_result --results_file $results_file --output_filename $output_filename


run_to_filename_json = os.path.join(results_base_dir, "results.json")
output_filename = os.path.join(results_base_dir, "results.png")
tag_to_plot = 'malware'

# execute plots.py to plot the mean and confidence of the model results across runs (at least 2 runs are needed)
!python plots.py plot_roc_distribution_for_tag --run_to_filename_json $run_to_filename_json --output_filename $output_filename --tag_to_plot $tag_to_plot
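
One plausible reason for the two-run minimum is that a spread across runs cannot be estimated from a single value. The sketch below illustrates the idea with the standard library only; it is not the computation performed by "plots.py", `summarize_runs` is a hypothetical helper, and the input values are made up.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) of one metric measured across runs."""
    if len(values) < 2:
        # statistics.stdev also needs at least two data points
        raise ValueError("at least 2 runs are needed to estimate the spread")
    return statistics.mean(values), statistics.stdev(values)


# Made-up per-run values of some metric, for illustration only
mean_value, std_value = summarize_runs([0.90, 0.94])
print("mean:", mean_value, "std:", std_value)
```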