# Black Hat USA Training (Early draft)

## Lab 3: Dynamic Malware Detection using Machine Learning with PyTorch

We follow a "top-down" teaching methodology. We start with higher-level concepts familiar to students from the cybersecurity domain, for instance by introducing a specific library and demonstrating its use. We then dig into the methods and parameters of these applications. Finally, we explore the underlying fundamentals, such as specific PE format properties or the mathematical concepts at the core of these ideas.

**NOTE: This is a raw draft. It will be populated with more material (especially visual) and with explanations that introduce AI/ML concepts more gradually.**

Contents:

- Download PlugX Sample
- Machine Learning Model with Dynamic Malware Analysis
- Speakeasy Emulator
- PyTorch Introduction

In [1]:
# force reimport of lab_helpers
import sys
if 'lab_helpers' in sys.modules:
    del sys.modules['lab_helpers']

from lab_helpers import *

## Download PlugX Sample

In [13]:
# NOTE: for some reason the server denies direct downloads from vx-underground via requests.get;
# the link works from a browser, and mimicking a browser User-Agent does not help
vx_link = "https://samples.vx-underground.org/Samples/Families/PlugX/0D219AA54B1D417DA61BD4AED5EEB53D6CBA91B3287D53186B21FED450248215.7z"

# NOTE: go to https://vx-underground.org/Samples/Families/PlugX and download the sample manually
local_path = "./0D219AA54B1D417DA61BD4AED5EEB53D6CBA91B3287D53186B21FED450248215.7z"
plugx_rat_bytez = get_encrypted_archive(local_path, password="infected")
print(plugx_rat_bytez[0:20])

b'MZP\x00\x02\x00\x00\x00\x04\x00\x0f\x00\xff\xff\x00\x00\xb8\x00\x00\x00'
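
The `MZ` prefix in the output above is the DOS header magic of a PE file. As a minimal sketch of how the extracted bytes could be sanity-checked before analysis, the helpers below (`sha256_hex` and `looks_like_pe` are hypothetical names, not part of `lab_helpers`) compute the SHA-256 digest and verify the `MZ` and `PE\0\0` signatures. The synthetic `fake_pe` buffer only exists to make the example runnable without a real sample.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def looks_like_pe(data: bytes) -> bool:
    """Check the DOS 'MZ' magic and the 'PE\\0\\0' signature at e_lfanew."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew (offset 0x3C) holds the file offset of the PE signature
    e_lfanew = int.from_bytes(data[0x3C:0x40], "little")
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"


# Synthetic, minimal buffer used only for illustration (not a valid executable)
fake_pe = bytearray(0x80)
fake_pe[:2] = b"MZ"
fake_pe[0x3C:0x40] = (0x40).to_bytes(4, "little")
fake_pe[0x40:0x44] = b"PE\x00\x00"
fake_pe = bytes(fake_pe)

print(looks_like_pe(fake_pe))
print(sha256_hex(fake_pe))
```

With the real sample, `sha256_hex(plugx_rat_bytez)` could be compared against the lowercase hash in the archive name, assuming the archive is named after the sample's SHA-256.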


## Machine Learning Model with Dynamic Malware Analysis 

In [None]:
%pip install git+https://github.com/dtrizna/nebula

In [11]:
import torch
torch.manual_seed(0)

import nebula

nebula_transformer = nebula.Nebula(tokenizer='bpe')
plugx_rat_report = nebula_transformer.dynamic_analysis_pe_file(plugx_rat_bytez)

print("First 5 API calls invoked by PlugX RAT:\n")
for val in plugx_rat_report['apis'][0:5]:
    print(val)

First 5 API calls invoked by PlugX RAT:

{'api_name': 'kernel32.GetModuleHandleA', 'args': ['0x0'], 'ret_val': '0x400000'}
{'api_name': 'user32.GetKeyboardType', 'args': ['0x0'], 'ret_val': '0x4'}
{'api_name': 'kernel32.GetCommandLineA', 'args': [], 'ret_val': '0x45f0'}
{'api_name': 'kernel32.GetStartupInfoA', 'args': ['0x1211f30'], 'ret_val': None}
{'api_name': 'kernel32.GetVersion', 'args': [], 'ret_val': '0x1db10106'}


In [12]:
plugx_dynamic_features = nebula_transformer.preprocess(plugx_rat_report)

print(f"Shape of dynamic features: {plugx_dynamic_features.shape}\n")

print(f"First 20 dynamic features:\n\n{plugx_dynamic_features[0, 0:20]}")

Shape of dynamic features: (1, 512)

First 20 dynamic features:

[ 2235  1036   530  1203   778  1078   125  1103 15492 49966   932   530
  1203   778   560  1176  1103 15492 49966   932]
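
To build intuition for how an API call report can become a fixed-length vector of integers like the one above, here is a deliberately simplified sketch. It is not Nebula's actual preprocessing (which uses a BPE tokenizer): the functions `flatten_api_call` and `encode`, the toy vocabulary, and the reserved ids for padding and unknown tokens are all assumptions made for illustration.

```python
def flatten_api_call(call):
    """Turn one API call record into a flat list of string tokens."""
    tokens = [call["api_name"].lower()]
    tokens += [str(arg) for arg in call["args"]]
    tokens.append(str(call["ret_val"]))
    return tokens


def encode(calls, vocab, seq_len=512, pad_id=0, unk_id=1):
    """Map tokens to integer ids, then truncate or pad to seq_len."""
    tokens = [tok for call in calls for tok in flatten_api_call(call)]
    ids = [vocab.get(tok, unk_id) for tok in tokens][:seq_len]
    return ids + [pad_id] * (seq_len - len(ids))


calls = [
    {'api_name': 'kernel32.GetModuleHandleA', 'args': ['0x0'], 'ret_val': '0x400000'},
    {'api_name': 'user32.GetKeyboardType', 'args': ['0x0'], 'ret_val': '0x4'},
]

# Toy vocabulary built from the tokens we see; ids 0 and 1 are reserved
vocab = {}
for call in calls:
    for tok in flatten_api_call(call):
        vocab.setdefault(tok, len(vocab) + 2)

features = encode(calls, vocab)
print(len(features))
print(features[:8])
```

A real tokenizer learns its vocabulary from a large training corpus and splits rare strings into sub-word pieces, which is why the ids in the Nebula output range up to roughly 50,000.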


In [19]:
import os

torch.manual_seed(0)
prob = nebula_transformer.predict_proba(plugx_dynamic_features)

hhash = vx_link.split("/")[-1].split(".")[0].lower()
print(f"Probability of '{hhash}' being malware: {prob*100:.2f}%")

if os.path.exists(r"C:\windows\system32\calc.exe"):
    with open(r"C:\windows\system32\calc.exe", "rb") as f:
        calc_bytez = f.read()
    report = nebula_transformer.dynamic_analysis_pe_file(calc_bytez)
    dynamic_features = nebula_transformer.preprocess(report)
    
    torch.manual_seed(0)
    prob = nebula_transformer.predict_proba(dynamic_features)
    print(f"Probability of 'calc.exe' being malware: {prob*100:.2f}%")

Probability of '0d219aa54b1d417da61bd4aed5eeb53d6cba91b3287d53186b21fed450248215' being malware: 93.69%
Probability of 'calc.exe' being malware: 0.06%


## Speakeasy Emulator

How is this achieved under the hood?

Speakeasy is a Python-based emulator built and actively maintained by Mandiant.

It is built on top of the Unicorn emulator framework and emulates the x86 architecture with a specific focus on malware analysis.


In [24]:
import speakeasy
emulator = speakeasy.Speakeasy()

module = emulator.load_module(data=plugx_rat_bytez)
emulator.run_module(module)
report = emulator.get_report()

# drop report to disk
import json
with open("plugx_report.json", "w") as f:
    json.dump(report, f, indent=4)

In [26]:
api_nr = len(report['entry_points'][0]['apis'])
print(f"Number of API calls invoked by PlugX RAT: {api_nr}")

Number of API calls invoked by PlugX RAT: 197


In [27]:
for val in report['entry_points'][0]['apis'][0:5]:
    print(val)

{'pc': '0x4069ad', 'api_name': 'kernel32.GetModuleHandleA', 'args': ['0x0'], 'ret_val': '0x400000'}
{'pc': '0x40398a', 'api_name': 'user32.GetKeyboardType', 'args': ['0x0'], 'ret_val': '0x4'}
{'pc': '0x406870', 'api_name': 'kernel32.GetCommandLineA', 'args': [], 'ret_val': '0x45f0'}
{'pc': '0x401333', 'api_name': 'kernel32.GetStartupInfoA', 'args': ['0x1211f30'], 'ret_val': None}
{'pc': '0x406884', 'api_name': 'kernel32.GetVersion', 'args': [], 'ret_val': '0x1db10106'}


These are the same API calls we got from the `nebula` analysis above, with an additional `pc` (program counter) field. The Speakeasy report also contains extra fields, such as registry access, that are a potential source of information for our model.
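
The overlap between the two outputs can be checked mechanically. The sketch below copies the first two records from each output shown above and uses a hypothetical helper, `drop_fields`, to strip the `pc` key from the Speakeasy records before comparing. Whether Nebula filters its report in exactly this way is an assumption.

```python
def drop_fields(apis, fields=("pc",)):
    """Return copies of the API records without the given keys."""
    return [{k: v for k, v in api.items() if k not in fields} for api in apis]


# First two records from the raw Speakeasy report shown above
speakeasy_apis = [
    {'pc': '0x4069ad', 'api_name': 'kernel32.GetModuleHandleA', 'args': ['0x0'], 'ret_val': '0x400000'},
    {'pc': '0x40398a', 'api_name': 'user32.GetKeyboardType', 'args': ['0x0'], 'ret_val': '0x4'},
]

# First two records from the nebula report shown earlier
nebula_apis = [
    {'api_name': 'kernel32.GetModuleHandleA', 'args': ['0x0'], 'ret_val': '0x400000'},
    {'api_name': 'user32.GetKeyboardType', 'args': ['0x0'], 'ret_val': '0x4'},
]

same = drop_fields(speakeasy_apis) == nebula_apis
print(f"Records match after dropping 'pc': {same}")
```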

In [28]:
report['entry_points'][0]['registry_access']

[{'event': 'open_key',
  'path': 'HKEY_CURRENT_USER\\Software\\Borland\\Locales'},
 {'event': 'open_key',
  'path': 'HKEY_LOCAL_MACHINE\\Software\\Borland\\Locales'},
 {'event': 'open_key',
  'path': 'HKEY_CURRENT_USER\\Software\\Borland\\Delphi\\Locales'}]

Speakeasy is a configurable tool. For instance, it is possible to modify environment variables, user and domain information, the simulated network state, and more.

In [31]:
import json
import os

speakeasy_config = os.path.join(os.path.dirname(nebula.__file__), "objects", "speakeasy_config.json")

with open(speakeasy_config, "r") as f:
    speakeasy_config = json.load(f)

speakeasy_config['env']

{'comspec': 'C:\\Windows\\system32\\cmd.exe',
 'systemroot': 'C:\\Windows',
 'windir': 'C:\\Windows',
 'temp': 'C:\\Windows\\temp\\',
 'userprofile': 'C:\\Users\\dtrizna',
 'systemdrive': 'C:',
 'allusersprofile': 'C:\\ProgramData',
 'programfiles': 'C:\\Program Files'}

In [32]:
print(speakeasy_config['domain'])
print(speakeasy_config['user'])

foo.bar
{'name': 'dtrizna', 'is_admin': True}


In [33]:
speakeasy_config['network']

{'dns': {'names': {'work.foo.bar': '127.0.0.1',
   'foo.bar': '10.1.2.3',
   'default': '10.1.2.3',
   'google.com': '8.8.8.8',
   'localhost': '127.0.0.1'},
  'txt': [{'name': 'default', 'path': '$ROOT$/resources/web/default.bin'}]},
 'http': {'responses': [{'verb': 'GET',
    'files': [{'mode': 'default', 'path': '$ROOT$/resources/web/default.bin'},
     {'mode': 'by_ext',
      'ext': 'gif',
      'path': '$ROOT$/resources/web/decoy.gif'},
     {'mode': 'by_ext',
      'ext': 'jpg',
      'path': '$ROOT$/resources/web/decoy.jpg'}]}]},
 'winsock': {'responses': [{'mode': 'default',
    'path': '$ROOT$/resources/web/stager.bin'}]}}
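
As a minimal sketch of how such a configuration could be customized, the hypothetical helper `override_config` below copies a config dictionary and changes the user name and DNS entries without touching the original. The `base_config` here is a trimmed stand-in containing only keys shown above; a real run would start from the full `speakeasy_config` loaded from disk.

```python
import copy


def override_config(base, user_name=None, dns_names=None):
    """Return a modified deep copy of a Speakeasy-style config dict."""
    cfg = copy.deepcopy(base)
    if user_name is not None:
        cfg["user"]["name"] = user_name
    if dns_names:
        cfg["network"]["dns"]["names"].update(dns_names)
    return cfg


# Trimmed stand-in for the full configuration (illustration only)
base_config = {
    "domain": "foo.bar",
    "user": {"name": "dtrizna", "is_admin": True},
    "network": {"dns": {"names": {"default": "10.1.2.3", "localhost": "127.0.0.1"}}},
}

custom_config = override_config(
    base_config,
    user_name="analyst",
    dns_names={"c2.example.com": "10.1.2.3"},  # made-up domain for the example
)

print(custom_config["user"])
print(custom_config["network"]["dns"]["names"])

# Assumption: the emulator accepts the dict through its `config` argument, e.g.
# emulator = speakeasy.Speakeasy(config=full_custom_config)
```

Working on a deep copy keeps the baseline configuration intact, which makes it easier to compare how a sample behaves under different emulated environments.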

## Transformer Model

Nebula uses a Transformer model to classify malware, the same family of architectures that underlies GPT. The Transformer is a deep learning model based on the attention mechanism.
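
To make "attention" a bit more concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside a Transformer layer. It is a simplification: real layers such as `nn.MultiheadAttention` first apply learned linear projections to obtain queries, keys and values, and split them across several heads. The function name and the random input are illustrative assumptions.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the weights."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights


torch.manual_seed(0)
# Batch of 1 sequence with 5 tokens, each a 64-dimensional embedding
x = torch.randn(1, 5, 64)

# Self-attention: queries, keys and values all come from the same sequence
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape)
print(weights.shape)
```

Each row of `weights` describes how much one token "attends" to every other token in the sequence, so the output for a token is a weighted mix of all token representations.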

In [12]:
nebula_transformer.model

TransformerEncoderChunks(
  (encoder): Embedding(50001, 64)
  (pos_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.3, inplace=False)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
        )
        (linear1): Linear(in_features=64, out_features=256, bias=True)
        (dropout): Dropout(p=0.3, inplace=False)
        (linear2): Linear(in_features=256, out_features=64, bias=True)
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.3, inplace=False)
        (dropout2): Dropout(p=0.3, inplace=False)
      )
    )
  )
  (ffnn): Sequential(
    (0): Sequential(
      (0): Linear(in_features=32768, out_features=64, bias=True)
      (1): ReLU()
      (2): Dropou
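
The `in_features=32768` of the first feed-forward layer is consistent with flattening a sequence of 512 tokens, each represented by a 64-dimensional embedding (512 × 64 = 32768). The sketch below reproduces only that shape arithmetic. It skips the Transformer encoder layers (which preserve the shape) and is an assumption about how the dimensions line up, not a description of Nebula's internal code.

```python
import torch
from torch import nn

seq_len, vocab_size, d_model = 512, 50001, 64

embedding = nn.Embedding(vocab_size, d_model)

# One sequence of random token ids, standing in for the preprocessed report
token_ids = torch.randint(0, vocab_size, (1, seq_len))

embedded = embedding(token_ids)       # shape: (1, 512, 64)
flat = embedded.flatten(start_dim=1)  # shape: (1, 32768)

print(embedded.shape)
print(flat.shape)
```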

### Layers

In [44]:
from torch import nn

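# fully connected layer: 3 input features -> 1 output feature
layer_a = nn.Linear(3, 1)
# tensor_b must already be defined as a float tensor with 3 features per row
layer_a(tensor_b)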

tensor([[0.2902],
        [0.8687],
        [1.4472]], grad_fn=<AddmmBackward0>)
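
A linear layer computes `y = x @ W.T + b`, where `W` and `b` are its learnable weight and bias. The sketch below verifies this by hand. Since `tensor_b` is not defined in this section, it uses a hypothetical 3×3 input `x`; the values will therefore differ from the output above.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(3, 1)  # weight shape: (1, 3), bias shape: (1,)

# Hypothetical input: 3 samples with 3 features each
x = torch.arange(9, dtype=torch.float32).reshape(3, 3)

# Manual computation of the same affine transformation
manual = x @ layer.weight.T + layer.bias

print(layer(x).shape)
print(torch.allclose(layer(x), manual))
```

The `grad_fn` attribute in the output above indicates that PyTorch recorded this operation, so gradients with respect to `W` and `b` can be computed during training.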