# Black Hat USA Training (Early draft)

## Lab 1: Static Malware Detection using Machine Learning with Gradient Boosted Decision Trees

We will follow a "top-down" teaching methodology. We start with higher-level concepts familiar to students from the cybersecurity domain, for instance by introducing a specific library and demonstrating its use. Then we delve deeper into the methods and parameters of these applications. Finally, we explore the underlying fundamentals, such as specific PE format properties or the mathematical concepts at the core of these ideas.

**NOTE: This is a raw draft that will be populated with more material (especially visuals) and explanations that build AI/ML intuition and allow a more gradual familiarization with the concepts.**

Contents:
- Downloading AsyncRAT Sample
- Machine Learning in Commercial EDRs
- Why XGBoost? EMBER
- Feature Extraction
- Explainability (???)

First, install pre-requisites and import necessary libraries:

In [None]:
%pip install --upgrade pip
%pip install git+https://github.com/dtrizna/ember.git py7zr torch requests numpy lightgbm

In [1]:
# force reimport of lab_helpers
import sys
if 'lab_helpers' in sys.modules:
    del sys.modules['lab_helpers']

from lab_helpers import *

## Downloading AsyncRAT Sample

AsyncRAT seems to be on the rise according to the [Recorded Future Adversary Report](https://www.recordedfuture.com/2023-adversary-infrastructure-report):


<img src="./img/recorded_future_malware_bargraph.png" width="600">


It is a remote access trojan (RAT) written in C# that has been around since 2014. It emerged from the QuasarRAT malware strain and was used as a starting point for RevengeRAT and BoratRAT. It is a simple, easy-to-use RAT that [has a lot of features](https://www.blackberry.com/us/en/solutions/endpoint-security/ransomware-protection/asyncrat), such as:

- Remotely record a target’s screen;
- Keylogger;
- Import and exec DLLs;
- File exfiltration;
- Persistence;
- Launch botnet-enabled DoS attacks.

Let's get one from [vx-underground](https://twitter.com/vxunderground):


In [6]:
# NOTE: for some reason download from vx-underground is denied by the server 
# works from browser, but not if using requests.get, user-agent browser mimic does not help
vx_link = "https://samples.vx-underground.org/Samples/Families/AsyncRAT/5e3588e8ddebd61c2bd6dab4b87f601bd6a4857b33eb281cb5059c29cfe62b80.7z"
# using a private hosted copy
async_rat_path = "http://malware-training.us.to/5e3588e8ddebd61c2bd6dab4b87f601bd6a4857b33eb281cb5059c29cfe62b80.7z"
async_rat_bytez = get_encrypted_archive(async_rat_path, password="infected")
async_rat_bytez[0:20]

b'MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff\x00\x00\xb8\x00\x00\x00'

## Machine Learning in Commercial EDRs

ML/AI components in commercial malware detection products are usually just a part of a multi-modular heuristic.

Consider this discussion, initiated by a malware detection vendor on [Twitter](https://twitter.com/joshua_saxe/status/1550545466072264704), that depicts the complexity of the problem:

<img src="./img/sophos_concerns.png" width="800">

We will discuss the holistic picture in future labs, but for now let's focus on the ML/AI components.

How do they work?
What are commercial vendors using?

### Job Description Reconnaissance

A well-known red team methodology can be used to answer these questions. For instance, job/career description reconnaissance may yield some insight into what a target vendor uses as its ML/AI component.

A simple Google dork like `<vendor> careers job "malware" "machine learning"` yields interesting results:

- vendor 1:

<img src="./img/job1_name.png" width="400"><br>
<img src="./img/job1_reqs.png" width="400">

- vendor 2:

<img src="./img/job2_name.png" width="400"><br>
<img src="./img/job2_reqs.png" width="400">

- vendor 3:

<img src="./img/job3_name.png" width="400"><br>
<img src="./img/job3_reqs.png" width="400">


### Why XGBoost? EMBER. <a name="why-xgboost"></a>

XGBoost implements the gradient boosted decision tree (GBDT) algorithm. As we saw above, malware detectors in production rely heavily on GBDT, influenced by the [EMBER research](https://arxiv.org/abs/1804.04637): EMBER uses a gradient boosted decision tree model (LightGBM, similar to XGBoost) with a set of static features extracted from the PE headers and the file's bytes. Let's explore what modeling strategies are used in EMBER.
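To build some intuition for what "gradient boosting" means before loading the pretrained model, here is a minimal sketch of boosting with one-split trees (stumps) on a single feature. It uses squared-error loss for simplicity, so each round fits a stump to the residuals of the current ensemble; LightGBM and XGBoost apply the same additive idea with a logistic loss, many features, and deeper trees. The names `fit_stump`, `boost`, and `predict` and the toy data are illustrative assumptions, not part of EMBER or LightGBM.

```python
def fit_stump(xs, residuals):
    """Pick the threshold that best splits the residuals (least squared error)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must leave samples on both sides
        l_mean = sum(left) / len(left)
        r_mean = sum(right) / len(right)
        sse = (sum((r - l_mean) ** 2 for r in left)
               + sum((r - r_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]  # (threshold, left value, right value)


def boost(xs, ys, n_rounds=20, lr=0.5):
    """Additively fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # round 0: predict the mean label
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, l_val, r_val = fit_stump(xs, residuals)
        stumps.append((t, l_val, r_val))
        preds = [p + lr * (l_val if x <= t else r_val) for x, p in zip(xs, preds)]
    return {"base": base, "lr": lr, "stumps": stumps}


def predict(model, x):
    """Sum the base score and the scaled contribution of every stump."""
    score = model["base"]
    for t, l_val, r_val in model["stumps"]:
        score += model["lr"] * (l_val if x <= t else r_val)
    return score


# Toy data: one hypothetical feature, label 1 = malicious.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = boost(xs, ys)
print(predict(model, 2.5), predict(model, 7.5))
```

Each stump on its own is a weak predictor; the ensemble becomes accurate because every new stump corrects what the previous ones still get wrong.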


In [8]:
ember_weights_link = "https://github.com/dtrizna/quo.vadis/raw/main/modules/sota/ember/parameters/ember_model.txt.7z"
ember_pretrained_weights = get_encrypted_archive(ember_weights_link)

In [9]:
import lightgbm as lgb
lgbm_model = lgb.Booster(model_str=ember_pretrained_weights.decode("utf-8"))

In [11]:
import os

import ember
prob = ember.predict_sample(lgbm_model, async_rat_bytez, feature_version=2)

hhash = vx_link.split("/")[-1].split(".")[0]
print(f"[!] Probability malware: {prob*100:>5.2f}% | File: {hhash}")

if os.path.exists(r"C:\windows\system32\calc.exe"):
    with open(r"C:\windows\system32\calc.exe", "rb") as f:
        calc_bytez = f.read()
    prob = ember.predict_sample(lgbm_model, calc_bytez, feature_version=2)
    print(f"[!] Probability malware: {prob*100:>5.2f}% | File: calc.exe")

[!] Probability malware: 88.82% | File: 5e3588e8ddebd61c2bd6dab4b87f601bd6a4857b33eb281cb5059c29cfe62b80
[!] Probability malware:  0.01% | File: calc.exe


## Static Feature Engineering <a name="feature"></a>

GBDT is a tabular model, meaning it requires a fixed number of features. Malware samples are not tabular: they are open-ended files of variable length, so we need to extract a fixed set of features from each sample. This process is called **Feature Engineering**.
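The step from a variable-length file to a fixed-length vector can be sketched with the simplest format-agnostic feature, a normalized byte histogram. Whatever the input size, the output always has 256 entries. The function name `byte_histogram` and the toy inputs are assumptions for illustration; EMBER's own `ByteHistogram` follows the same idea but is implemented differently.

```python
from collections import Counter


def byte_histogram(data: bytes) -> list:
    """Return the relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)   # iterating over bytes yields integers 0..255
    total = len(data) or 1   # avoid division by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]


short_file = b"MZ\x90\x00"          # 4 bytes
long_file = b"MZ\x90\x00" * 1000    # 4000 bytes

short_vec = byte_histogram(short_file)
long_vec = byte_histogram(long_file)
print(len(short_vec), len(long_vec))
```

Both inputs yield a vector of the same length, which is exactly the property a tabular model needs.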

There are two types of features used for ML modeling of malware: 

- static,
- dynamic.

In this section we will focus on static features; we will discuss dynamic features in a later lab.

Static features are extracted from the file itself, without running it. The PE file has a well-defined format:

<img src="img/PE_Structure.jpg" width="600">

[[Image Source]](https://en.wikipedia.org/wiki/Portable_Executable)

It is possible to define features based on the PE structure, such as headers, imports, and exports. One such feature extraction methodology is EMBER, which is [open sourced and freely available](https://github.com/elastic/ember/blob/master/ember/features.py). It extracts:

- PE format specific features:
  - Imported and exported functions
  - Section information
  - Header information
- Format agnostic features:
  - Byte and entropy histograms
  - String information


In [12]:
from ember.features import *

extractor = PEFeatureExtractor()

features_async_rat = extractor.feature_vector(async_rat_bytez)
features_calc = extractor.feature_vector(calc_bytez)

print(f"Shape of Async RAT feature vector: {features_async_rat.shape}\n")
print(f"Shape of calc.exe feature vector: {features_calc.shape}\n")

print("First 10 feature values of Async RAT:\n")
print(features_async_rat[0:10])

Shape of Async RAT feature vector: (2381,)

Shape of calc.exe feature vector: (2381,)

First 10 feature values of Async RAT:

[0.10524162 0.0175666  0.01379335 0.00907949 0.01335176 0.00698284
 0.01272708 0.00688591 0.00507288 0.00523444]


This vector is what the model actually expects as input. Given it, the model returns the probability of the sample being malicious:

In [13]:
lgbm_model.predict(features_async_rat.reshape(1, -1))

array([0.88821438])

### Detailed Analysis of EMBER Features <a name="ember-features"></a>

What exactly happens under the hood?

`PEFeatureExtractor()` loads the following features:

In [14]:
features = {
    'ByteHistogram': ByteHistogram(),
    'ByteEntropyHistogram': ByteEntropyHistogram(),
    'StringExtractor': StringExtractor(),
    'GeneralFileInfo': GeneralFileInfo(),
    'HeaderFileInfo': HeaderFileInfo(),
    'SectionInfo': SectionInfo(),
    'ImportsInfo': ImportsInfo(),
    'ExportsInfo': ExportsInfo()
}

Let's take a look at what some of them represent:

In [39]:
import lief

lief_binary = lief.PE.parse(list(async_rat_bytez))

HeaderFileInfo().raw_features(async_rat_bytez, lief_binary)

{'coff': {'timestamp': 1589088291,
  'machine': 'I386',
  'characteristics': ['CHARA_32BIT_MACHINE', 'EXECUTABLE_IMAGE']},
 'optional': {'subsystem': 'WINDOWS_GUI',
  'dll_characteristics': ['DYNAMIC_BASE',
   'NX_COMPAT',
   'TERMINAL_SERVER_AWARE',
   'NO_SEH'],
  'magic': 'PE32',
  'major_image_version': 0,
  'minor_image_version': 0,
  'major_linker_version': 8,
  'minor_linker_version': 0,
  'major_operating_system_version': 4,
  'minor_operating_system_version': 0,
  'major_subsystem_version': 4,
  'minor_subsystem_version': 0,
  'sizeof_code': 43008,
  'sizeof_headers': 512,
  'sizeof_heap_commit': 4096}}
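The `coff` values above come from a fixed-layout header, so they can also be read straight from the raw bytes with `struct`. The sketch below follows the PE layout (the DOS header stores the offset of the `PE\0\0` signature at 0x3C, and the COFF header follows that signature) and decodes the machine type, section count, and timestamp. The helper name `parse_coff_header`, the two-entry machine table, and the synthetic header used for the demo are assumptions for illustration; on the real sample you would pass `async_rat_bytez` instead.

```python
import struct
from datetime import datetime, timezone

# Small subset of machine types, enough for this illustration.
MACHINE_NAMES = {0x014C: "I386", 0x8664: "AMD64"}


def parse_coff_header(data: bytes) -> dict:
    """Decode a few COFF header fields directly from PE bytes."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ magic")
    # e_lfanew: offset of the PE signature, stored at 0x3C in the DOS header.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # COFF header starts right after the signature:
    # Machine (2 bytes), NumberOfSections (2 bytes), TimeDateStamp (4 bytes).
    machine, n_sections, timestamp = struct.unpack_from("<HHI", data, e_lfanew + 4)
    return {
        "machine": MACHINE_NAMES.get(machine, hex(machine)),
        "sections": n_sections,
        "timestamp": timestamp,
        "timestamp_utc": datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat(),
    }


# Synthetic header that reuses the values reported for the sample above.
demo = bytearray(0x80 + 24)
demo[0:2] = b"MZ"
struct.pack_into("<I", demo, 0x3C, 0x80)
demo[0x80:0x84] = b"PE\x00\x00"
struct.pack_into("<HHI", demo, 0x84, 0x014C, 3, 1589088291)

info = parse_coff_header(bytes(demo))
print(info)
```

Decoded as UTC, the timestamp `1589088291` corresponds to 2020-05-10. Keep in mind that this field is written by the linker and can be set to any value by whoever builds the binary.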

In [40]:
ImportsInfo().raw_features(async_rat_bytez, lief_binary)

{'mscoree.dll': ['_CorExeMain']}

Let's open the specimen in PE-bear and observe how this correlates with what static malware analysis tools show for the sample:

- Imports:

<img src="./img/async_rat_imports.png" width="500">
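An import table has no fixed size, so it has to be squeezed into a fixed-length vector somehow. EMBER does this with feature hashing; to my understanding it hashes library names into 256 buckets and `library:function` pairs into 1024 buckets, which is an assumption worth checking against `features.py`. The sketch below illustrates the hashing trick itself with `hashlib` as a stand-in hash function, so the resulting indices will not match EMBER's. `hash_tokens` and `hash_imports` are hypothetical names.

```python
import hashlib


def hash_tokens(tokens, n_buckets):
    """Hashing trick: map any number of strings onto a fixed-length vector."""
    vec = [0.0] * n_buckets
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "little")
        index = value % n_buckets  # which bucket the token lands in
        # Signed updates let colliding tokens partially cancel out.
        sign = 1.0 if digest[8] % 2 == 0 else -1.0
        vec[index] += sign
    return vec


def hash_imports(imports, n_lib=256, n_func=1024):
    """Turn {library: [functions]} into one vector of length n_lib + n_func."""
    libraries = [lib.lower() for lib in imports]
    functions = [f"{lib.lower()}:{fn}" for lib, fns in imports.items() for fn in fns]
    return hash_tokens(libraries, n_lib) + hash_tokens(functions, n_func)


imports = {"mscoree.dll": ["_CorExeMain"]}
vec = hash_imports(imports)
print(len(vec), sum(1 for v in vec if v != 0.0))
```

The price of a fixed length is that distinct names can collide in the same bucket, so the mapping is not reversible.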

In [41]:
SectionInfo().raw_features(async_rat_bytez, lief_binary)

{'entry': '.text',
 'sections': [{'name': '.text',
   'size': 43008,
   'entropy': 5.533967866846488,
   'vsize': 42916,
   'props': ['CNT_CODE', 'MEM_EXECUTE', 'MEM_READ']},
  {'name': '.rsrc',
   'size': 2048,
   'entropy': 4.88653168864938,
   'vsize': 2047,
   'props': ['CNT_INITIALIZED_DATA', 'MEM_READ']},
  {'name': '.reloc',
   'size': 512,
   'entropy': 1.584962500721156,
   'vsize': 12,
   'props': ['CNT_INITIALIZED_DATA', 'MEM_DISCARDABLE', 'MEM_READ']}]}

I prefer how PEStudio presents section information, so let's open the same sample there and observe it:

  - `.text` with size of 43008 and entropy ~`5.53`;
  - `.rsrc` with size of 2048 and entropy ~`4.49`;
  - `.reloc` with size of 512 and low entropy in both cases.

<img src="./img/async_rat_sections.png" width="600">
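The entropy values reported for each section are Shannon entropy over byte frequencies, measured in bits per byte (0 for constant data, 8 for uniformly random data). A minimal sketch of that computation is shown below; the function name `shannon_entropy` is an assumption, and tools may differ in exactly which bytes of a section they include, so small differences between tools are plausible.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((count / n) * math.log2(count / n) for count in Counter(data).values())


constant = shannon_entropy(bytes(512))           # a single repeated byte value
three_values = shannon_entropy(b"abc" * 4)       # three equally frequent values
all_values = shannon_entropy(bytes(range(256)))  # every byte value exactly once
print(constant, three_values, all_values)
```

High entropy often indicates packed or encrypted content, which is why it is a popular static feature. As a side note, the `.reloc` value of `1.584962500721156` reported above is numerically equal to `log2(3)`.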


In [42]:
string_info = StringExtractor().raw_features(async_rat_bytez, lief_binary)
del string_info['printabledist'] # removing verbose component
string_info

{'numstrings': 553,
 'avlength': 14.985533453887884,
 'printables': 8287,
 'entropy': 5.218674659729004,
 'paths': 0,
 'urls': 1,
 'registry': 0,
 'MZ': 1}
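Statistics like the ones above can be approximated with a few regular expressions over the raw bytes. The sketch below counts runs of at least five printable ASCII characters, URL prefixes, and `MZ` markers. The exact patterns, the minimum length of five, and the name `string_stats` are assumptions for illustration and may differ from what `StringExtractor` does internally.

```python
import re

PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{5,}")  # 5+ printable ASCII characters
URL_PREFIX = re.compile(rb"https?://", re.IGNORECASE)
MZ_MARKER = re.compile(rb"MZ")


def string_stats(data: bytes) -> dict:
    """Summarize printable strings found in a blob of bytes."""
    strings = PRINTABLE_RUN.findall(data)
    total = sum(len(s) for s in strings)
    return {
        "numstrings": len(strings),
        "avlength": total / len(strings) if strings else 0.0,
        "printables": total,
        "urls": len(URL_PREFIX.findall(data)),
        "MZ": len(MZ_MARKER.findall(data)),
    }


# Toy input: a fake header, one URL, one short word, and a run that is too short.
sample = b"MZ\x90\x00" + b"http://example.test/a" + b"\x00\x01" + b"hello" + b"\x00abc\x00"
stats = string_stats(sample)
print(stats)
```

Note that only aggregate counts survive; the strings themselves, including the URL, are not part of the summary.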

This way it is possible to see what information EMBER keeps from the PE structure, and what it discards as irrelevant for the model.

This allows us to infer what an adversary might want to modify in order to create an adversarial example.

## Explainability

In [173]:
import shap

explainer = shap.Explainer(lgbm_model)
shap_values = explainer(features_async_rat.reshape(1, -1))
