# Flash Evaluation on the DARPA OpTC Dataset

This notebook evaluates Flash on the DARPA OpTC dataset. OpTC is a node-level dataset, so Flash is configured to run in a node-level setting. Because the dataset is rich in node attributes, Flash can run in a decoupled manner: GNN embeddings are computed offline and then consumed by a downstream classifier. Concretely, Flash generates word2vec embeddings and uses them as input feature vectors for the GNN. The resulting GNN embeddings are saved to a datastore and combined with the word2vec features by the downstream model to improve detection results.

## Accessing the Dataset:
- The OpTC dataset can be accessed via this link: [OpTC Dataset](https://drive.google.com/drive/u/0/folders/1n3kkS3KR31KUegn42yk3-e6JkZvf0Caa).
- Dataset files for evaluation will be downloaded automatically by the script.
- While we provide pre-trained weights, you also have the option to download benign data files for training the models from the ground up.

## Data Parsing and Execution:
- The script parses the downloaded data files automatically.
- To reproduce the evaluation results, run all cells in this notebook.

## Model Training and Execution Options:
- By default, the notebook uses pre-trained model weights.
- It also offers settings to train the Graph Neural Network (GNN), word2vec, and XGBoost models independently.
- The independently trained models can then be used to evaluate the system.

Following these guidelines will ensure a thorough and effective analysis of the OpTC dataset using Flash.

In [35]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import torch
from torch_geometric.data import Data
import os
import torch.nn.functional as F
import pickle
import json
import warnings
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
warnings.filterwarnings('ignore')
from torch_geometric.loader import NeighborLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
%matplotlib inline

In [36]:
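# Paths to the model weights.
gnn_weights = "trained_weights/optc/gnn_temp.pth"
xgboost_weights = "trained_weights/optc/xgb.pkl"
word2vec_weights = 'w2v_optc.model'

# Set a flag to False to reuse the saved artifact instead of rebuilding it.
create_store = True   # rebuild the GNN embedding store (data_files/emb_store.json)
gnnTrain = True       # train the GNN instead of loading gnn_weights
xgbTrain = True       # train XGBoost instead of loading xgboost_weights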

In [37]:
from pprint import pprint
import gzip
from sklearn.manifold import TSNE
import json
import copy
import os
import xgboost as xgb

import gensim
from gensim.models import Word2Vec
from multiprocessing import Pool
from itertools import compress
from tqdm import tqdm
import time

import multiprocessing

In [38]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from collections import Counter

In [39]:
import gzip

def extract_logs(filepath, hostid):
    """Append every log line that mentions the given host to a per-host text file."""
    search_pattern = f'SysClient{hostid}'
    output_filename = f'SysClient{hostid}.systemia.com.txt'

    # Stream the gzip archive line by line so the full log never sits in memory.
    with gzip.open(filepath, 'rt', encoding='utf-8') as fin, \
         open(output_filename, 'a', encoding='utf-8', newline='') as fout:
        for line in fin:
            if search_pattern in line:
                fout.write(line)

In [40]:
import gdown
from tqdm import tqdm
    
def prepare_test_set():
    urls = [
        "https://drive.google.com/file/d/1HFSyvmgH0jvdnnnTdKfWRjZYOrLWoIkv/view?usp=drive_link",
        "https://drive.google.com/file/d/1pJLxJsDV8sngiedbfVajMetczIgM3PQd/view?usp=drive_link",
        "https://drive.google.com/file/d/1fRQqc68r8-z5BL7H_eAKIDOeHp7okDuM/view?usp=drive_link",
        "https://drive.google.com/file/d/1VfyGr8wfSe8LBIHBWuYBlU8c2CyEgO5C/view?usp=drive_link",
        "https://drive.google.com/file/d/10N9ZPolq_L8HivBqzf_jFKbwjSxddsZp/view?usp=drive_link",
        "https://drive.google.com/file/d/1xIr8gw-4zc8ESjUpYtrFsbOwhPGUSd15/view?usp=drive_link",
        "https://drive.google.com/file/d/1PvlCp2oQaxEBEFGSQWfcFVj19zLOe7yH/view?usp=drive_link"
    ]

    for url in urls:
        gdown.download(url, quiet=False, use_cookies=False, fuzzy=True)

    log_files = [
        ("AIA-201-225.ecar-2019-12-08T11-05-10.046.json.gz", "0201"),
        ("AIA-201-225.ecar-last.json.gz", "0201"),
        ("AIA-501-525.ecar-2019-11-17T04-01-58.625.json.gz", "0501"),
        ("AIA-501-525.ecar-last.json.gz", "0501"),
        ("AIA-51-75.ecar-last.json.gz", "0051")
    ]
    
    # Remove extracts left over from a previous run, because extract_logs appends to these files.
    for host in ("0201", "0501", "0051"):
        stale = f"SysClient{host}.systemia.com.txt"
        if os.path.exists(stale):
            os.remove(stale)

    for file, code in tqdm(log_files, desc="Extracting logs", unit="file"):
        extract_logs(file, code)

prepare_test_set()

Downloading...
From (original): https://drive.google.com/uc?id=1HFSyvmgH0jvdnnnTdKfWRjZYOrLWoIkv
From (redirected): https://drive.google.com/uc?id=1HFSyvmgH0jvdnnnTdKfWRjZYOrLWoIkv&confirm=t&uuid=344202e0-b295-4237-828e-c8ffef91d172
To: /home/tpiuser2/prov_project/amaan_flash/Flash-IDS-main/AIA-201-225.ecar-last.json.gz
100%|██████████| 2.22G/2.22G [01:11<00:00, 30.8MB/s]
Downloading...
From (original): https://drive.google.com/uc?id=1pJLxJsDV8sngiedbfVajMetczIgM3PQd
From (redirected): https://drive.google.com/uc?id=1pJLxJsDV8sngiedbfVajMetczIgM3PQd&confirm=t&uuid=b757c88a-4f99-4c20-a1f5-86c3a47facc0
To: /home/tpiuser2/prov_project/amaan_flash/Flash-IDS-main/AIA-201-225.ecar-2019-12-08T11-05-10.046.json.gz
100%|██████████| 110M/110M [00:07<00:00, 14.2MB/s] 
Downloading...
From (original): https://drive.google.com/uc?id=1fRQqc68r8-z5BL7H_eAKIDOeHp7okDuM
From (redirected): https://drive.google.com/uc?id=1fRQqc68r8-z5BL7H_eAKIDOeHp7okDuM&confirm=t&uuid=474873c3-91eb-4b96-831f-efe635b42dfd

In [41]:
def is_valid_entry(entry):
    valid_objects = {'PROCESS', 'FILE', 'FLOW', 'MODULE'}
    invalid_actions = {'START', 'TERMINATE'}

    object_valid = entry['object'] in valid_objects
    action_valid = entry['action'] not in invalid_actions
    actor_object_different = entry['actorID'] != entry['objectID']

    return object_valid and action_valid and actor_object_different

def Traversal_Rules(data):
    filtered_data = {}

    for entry in data:
        if is_valid_entry(entry):
            key = (
                entry['action'], 
                entry['actorID'], 
                entry['objectID'], 
                entry['object'], 
                entry['pid'], 
                entry['ppid']
            )
            filtered_data[key] = entry

    return list(filtered_data.values())
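
The filtering and de-duplication in the cell above can be illustrated on a few toy events. This is a minimal sketch: the event values and the helper name `dedupe_events` are invented for the example, and the object-type check is left out for brevity.

```python
# Toy events; every field value here is made up for illustration.
toy_events = [
    {"action": "READ", "actorID": "a1", "objectID": "f1", "object": "FILE", "pid": 4, "ppid": 1, "timestamp": 1},
    {"action": "READ", "actorID": "a1", "objectID": "f1", "object": "FILE", "pid": 4, "ppid": 1, "timestamp": 2},
    {"action": "WRITE", "actorID": "a1", "objectID": "f1", "object": "FILE", "pid": 4, "ppid": 1, "timestamp": 3},
    {"action": "START", "actorID": "a1", "objectID": "p2", "object": "PROCESS", "pid": 4, "ppid": 1, "timestamp": 4},
]

def dedupe_events(events):
    """Drop START/TERMINATE and self-loop events, then keep one event per key."""
    kept = {}
    for event in events:
        if event["action"] in {"START", "TERMINATE"}:
            continue
        if event["actorID"] == event["objectID"]:
            continue
        key = (event["action"], event["actorID"], event["objectID"],
               event["object"], event["pid"], event["ppid"])
        # A later event with the same key overwrites the earlier one.
        kept[key] = event
    return list(kept.values())

deduped = dedupe_events(toy_events)
print([e["timestamp"] for e in deduped])
```

Because the dictionary value is overwritten on a repeated key, the last occurrence of each duplicate survives, while its position in the output follows the first occurrence.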

In [42]:
def Sentence_Construction(entry):
    action = entry["action"]
    properties = entry['properties']
    object_type = entry['object']

    format_strings = {
        'PROCESS': "{parent_image_path} {action} {image_path} {command_line}",
        'FILE': "{image_path} {action} {file_path}",
        'FLOW': "{image_path} {action} {src_ip} {src_port} {dest_ip} {dest_port} {direction}",
        'MODULE': "{image_path} {action} {module_path}"
    }

    default_format = "{image_path} {action} {module_path}"

    try:
        format_str = format_strings.get(object_type, default_format)
        phrase = format_str.format(action=action, **properties)
    except KeyError:
        phrase = ''

    return phrase.split(' ')
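
As a rough illustration of the sentence format above, the sketch below builds the token list for a single FILE event. The event values and the helper name `build_phrase` are made up for the example; only the format string mirrors the cell above.

```python
# Hypothetical FILE event; the field values are invented for illustration.
sample_event = {
    "action": "READ",
    "object": "FILE",
    "properties": {
        "image_path": "C:\\Windows\\System32\\svchost.exe",
        "file_path": "C:\\Windows\\Temp\\cache.dat",
    },
}

# Same template the notebook uses for FILE objects.
FILE_TEMPLATE = "{image_path} {action} {file_path}"

def build_phrase(event):
    """Return the space-split sentence for one FILE event."""
    sentence = FILE_TEMPLATE.format(action=event["action"], **event["properties"])
    return sentence.split(" ")

tokens = build_phrase(sample_event)
print(tokens)
```

Since the sentence is split on single spaces, a path that itself contains spaces contributes several tokens rather than one.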

In [43]:
import pandas as pd
import json

def Extract_Semantic_Info(event):
    object_type = event['object']
    properties = event['properties']

    label_mapping = {
        "PROCESS": ('parent_image_path', 'image_path'),
        "FILE": ('image_path', 'file_path'),
        "MODULE": ('image_path', 'module_path'),
        "FLOW": ('image_path', 'dest_ip', 'dest_port')
    }

    label_keys = label_mapping.get(object_type, None)
    if label_keys:
        labels = [properties.get(key) for key in label_keys]
        if all(labels):
            event["actorname"], event["objectname"] = labels[0], ' '.join(labels[1:])
            return event
    return None

def transform(text):
    labeled_data = [event for event in (Extract_Semantic_Info(x) for x in text) if event]
    data = Traversal_Rules(labeled_data)

    # Build each sentence once and attach it to its own event.
    for datum in data:
        datum['phrase'] = Sentence_Construction(datum)

    df = pd.DataFrame(data)
    # Strip the last six characters of each timestamp string before parsing.
    df['timestamp'] = pd.to_datetime(df['timestamp'].str[:-6])
    df.sort_values(by='timestamp', inplace=True)

    return df

def load_data(file_path):
    with open(file_path, 'r') as file:
        content = [json.loads(line) for line in file]
    
    return Featurize(transform(content))

In [44]:
import numpy as np

def Featurize(df):
    dummies = {'PROCESS': 0, 'FLOW': 1, 'FILE': 2, 'MODULE': 3}

    nodes = {}
    labels = {}
    lblmap = {}
    neimap = {}
    edges = []

    for index, row in df.iterrows():
        actor_id, object_id = row['actorID'], row["objectID"]
        object_type = row['object']

        nodes.setdefault(actor_id, []).extend(row['phrase'])
        nodes.setdefault(object_id, []).extend(row['phrase'])

        labels[actor_id] = dummies.get('PROCESS', -1)
        labels[object_id] = dummies.get(object_type, -1)

        lblmap[actor_id] = row['actorname']
        lblmap[object_id] = row['objectname']

        neimap.setdefault(actor_id, set()).add(row['objectname'])
        neimap.setdefault(object_id, set()).add(row['actorname'])

        edge_type = row['properties']['direction'] if object_type == 'FLOW' else row['action']
        edges.append((actor_id, object_id, edge_type))

    features, feat_labels, edge_index = [], [], [[], []]
    node_index = {}

    for node, phrases in nodes.items():
        if not (len(phrases) == 1 and phrases[0] == 'DELETE'):
            features.append(infer(phrases))
            feat_labels.append(labels[node])
            node_index[node] = len(features) - 1

    for src, dst, _ in edges:
        # Skip edges whose endpoints were filtered out above.
        if src in node_index and dst in node_index:
            edge_index[0].append(node_index[src])
            edge_index[1].append(node_index[dst])

    mapp = list(node_index.keys())

    return features, np.array(feat_labels), edge_index, mapp, lblmap, neimap
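
The node indexing used by `Featurize` can be sketched on a toy event list. This is a simplified illustration under stated assumptions: the node IDs are invented, and the feature vectors, labels, and the `DELETE` filter from the cell above are omitted.

```python
# Hypothetical (actor, object) pairs in time order; the IDs are invented.
toy_pairs = [
    ("proc-A", "file-1"),
    ("proc-A", "proc-B"),
    ("proc-B", "file-1"),
]

node_index = {}          # node ID -> contiguous row index
edge_index = [[], []]    # [source rows, destination rows]

for actor, obj in toy_pairs:
    for node in (actor, obj):
        # First appearance decides the row index of a node.
        node_index.setdefault(node, len(node_index))
    edge_index[0].append(node_index[actor])
    edge_index[1].append(node_index[obj])

# Row index -> node ID, the role played by `mapp` in the notebook.
row_to_node = list(node_index)
print(node_index)
print(edge_index)
```

The two parallel lists in `edge_index` follow the layout that `torch_geometric.data.Data` expects once they are converted to a tensor.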

In [45]:
from gensim.models.callbacks import CallbackAny2Vec

class EpochSaver(CallbackAny2Vec):

    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        model.save('trained_weights/optc/w2v_optc.model')
        self.epoch += 1

In [46]:
class EpochLogger(CallbackAny2Vec):

    def __init__(self):
        self.epoch = 0

    def on_epoch_begin(self, model):
        print("Epoch #{} start".format(self.epoch))

    def on_epoch_end(self, model):
        print("Epoch #{} end".format(self.epoch))
        self.epoch += 1

In [47]:
import json
from gensim.models import Word2Vec

def prepare_sentences(df):
    nodes = {}
    for index, row in df.iterrows():
        for key in ['actorID', 'objectID']:
            node_id = row[key]
            nodes.setdefault(node_id, []).extend(row['phrase'])
    return list(nodes.values())

def train_word2vec_model(train_file_path):
    with open(train_file_path, 'r') as file:
        content = [json.loads(line) for line in file]

    events = transform(content)
    phrases = prepare_sentences(events)

    logger = EpochLogger()
    saver = EpochSaver()
    word2vec = Word2Vec(sentences=phrases, vector_size=20, window=5, min_count=1, workers=8, epochs=300, callbacks=[saver, logger])

    return word2vec

In [48]:
import math
import torch
import numpy as np
from gensim.models import Word2Vec

class PositionalEncoder:

    def __init__(self, d_model, max_len=100000):
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        self.pe = torch.zeros(max_len, d_model)
        self.pe[:, 0::2] = torch.sin(position * div_term)
        self.pe[:, 1::2] = torch.cos(position * div_term)

    def embed(self, x):
        return x + self.pe[:x.size(0)]


def infer(document):
    word_embeddings = [w2vmodel.wv[word] for word in document if word in  w2vmodel.wv]
    
    if not word_embeddings:
        return np.zeros(20)

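    # Stack the per-word vectors into one array first; this avoids the slow list-of-arrays path.
    output_embedding = torch.tensor(np.array(word_embeddings), dtype=torch.float)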
    if len(document) < 100000:
        output_embedding = encoder.embed(output_embedding)

    output_embedding = output_embedding.detach().cpu().numpy()
    return np.mean(output_embedding, axis=0)

encoder = PositionalEncoder(20)
w2vmodel = Word2Vec.load(word2vec_weights)
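
The pooling performed by `infer` can be reproduced without torch or gensim. The sketch below is an assumption-laden stand-in: `positional_row` and `pooled_embedding` are hypothetical helper names, and the word vectors are toy values rather than real word2vec output.

```python
import math

def positional_row(pos, d_model):
    """Sinusoidal encoding for one position (sin on even slots, cos on odd slots)."""
    row = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos * math.exp(i * (-math.log(10000.0) / d_model))
        row[i] = math.sin(angle)
        if i + 1 < d_model:
            row[i + 1] = math.cos(angle)
    return row

def pooled_embedding(word_vectors):
    """Add the positional encoding to each word vector, then average over words."""
    d_model = len(word_vectors[0])
    shifted = [
        [w + p for w, p in zip(vec, positional_row(pos, d_model))]
        for pos, vec in enumerate(word_vectors)
    ]
    return [sum(column) / len(shifted) for column in zip(*shifted)]

# Two toy 4-dimensional "word vectors".
toy_sentence = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]
forward = pooled_embedding(toy_sentence)
reverse = pooled_embedding(toy_sentence[::-1])
print(forward)
```

One consequence worth noting: because the encoding is added before a plain mean, the pooled vector equals the mean word vector plus a term that depends only on the number of words, so `forward` and `reverse` come out the same.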

In [49]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class GCN(torch.nn.Module):
    def __init__(self):
        super(GCN, self).__init__()
        self.conv1 = SAGEConv(20, 32, normalize=True)
        self.conv2 = SAGEConv(32, 20, normalize=True)
        self.linear = nn.Linear(in_features=20, out_features=4)

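    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # Two GraphSAGE layers followed by a linear head over the four node types.
        x = self.encode(x, edge_index)
        x = self.linear(x)
        # Returns class probabilities. Note that CrossEntropyLoss, used for training
        # below, expects raw logits and applies log-softmax itself.
        return F.softmax(x, dim=1)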
    
    def encode(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

In [50]:
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss

model = GCN().to(device)
if not gnnTrain:
    # map_location lets weights saved on a GPU load on a CPU-only machine.
    model.load_state_dict(torch.load(gnn_weights, map_location=device))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

In [51]:
from sklearn.utils import class_weight

if gnnTrain or create_store:
    file_path = 'Enter Path to Train File Here'
    nodes,labels,edges,mapp,lbl,nemap = load_data(file_path)

    l = np.array(labels)
    class_weights = class_weight.compute_class_weight(class_weight = "balanced",classes = np.unique(l),y = l)
    class_weights = torch.tensor(class_weights,dtype=torch.float).to(device)
    criterion = CrossEntropyLoss(weight=class_weights,reduction='mean')

    graph = Data(x=torch.tensor(nodes,dtype=torch.float).to(device),y=torch.tensor(labels,dtype=torch.long).to(device), edge_index=torch.tensor(edges,dtype=torch.long).to(device))

FileNotFoundError: [Errno 2] No such file or directory: 'Enter Path to Train File Here'

In [18]:
from torch_geometric.loader import NeighborLoader

def train_model(batch):
    model.train()
    optimizer.zero_grad()
    predictions = model(batch.x, batch.edge_index)
    loss = criterion(predictions, batch.y)
    loss.backward()
    optimizer.step()
    return loss.item(), batch.x.size(0)

def evaluate_batch(batch):
    model.eval()
    with torch.no_grad():
        predictions = model(batch.x, batch.edge_index)
        pred_labels = predictions.argmax(dim=1)
        correct_predictions = int((pred_labels == batch.y).sum())
    return correct_predictions

if gnnTrain:
    loader = NeighborLoader(graph, num_neighbors=[-1, -1], batch_size=5000)

    for epoch in range(100):
        total_loss = total_correct = total_nodes = 0

        for batch in loader:
            loss, batch_nodes = train_model(batch)
            # criterion averages over the batch, so weight the loss by the batch size.
            total_loss += loss * batch_nodes
            total_nodes += batch_nodes
            total_correct += evaluate_batch(batch)

        average_loss = total_loss / total_nodes
        accuracy = total_correct / total_nodes

        print(f"Epoch #{epoch}. Training Loss: {average_loss:.5f}, Accuracy: {accuracy:.5f}")
        torch.save(model.state_dict(), gnn_weights)


In [19]:
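if create_store:
    model.eval()
    with torch.no_grad():
        out = model.encode(graph.x, graph.edge_index).tolist()

    # Map each node label to its GNN embedding and the names of its neighbors.
    gnn_map = {}
    for i in range(len(mapp)):
        gnn_map[lbl[mapp[i]]] = (out[i], list(nemap[mapp[i]]))

    os.makedirs("data_files", exist_ok=True)
    with open("data_files/emb_store.json", "w") as file:
        json.dump(gnn_map, file)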

In [20]:
with open("data_files/emb_store.json", "r") as file:
    gnn_map = json.load(file)

In [21]:
import numpy as np

def load_features(filename=None, similarity=1):
    nodes, y_train, edges, mapp, lbl, nemap = load_data(filename)
    zero_vector = np.zeros(20)

    X_train = []
    for idx, map_item in enumerate(mapp):
        label = lbl[map_item]
        node_feature = nodes[idx]

        if label in gnn_map:
            emb, stored_set = gnn_map[label]
            current_set = nemap[map_item]
            jaccard_similarity = len(current_set.intersection(stored_set)) / len(current_set.union(stored_set))

            feature_vector = emb if jaccard_similarity >= similarity else zero_vector
        else:
            feature_vector = zero_vector

        X_train.append(np.hstack((node_feature, feature_vector)))

    return np.array(X_train), y_train, edges, mapp
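
The lookup in `load_features` reuses a stored GNN embedding only when the node's current neighborhood matches the stored one closely enough. A minimal sketch of that gate follows; the neighbor names, the embedding values, and the helper names `jaccard` and `select_embedding` are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_embedding(stored_embedding, stored_neighbors, current_neighbors,
                     threshold=1.0, dim=20):
    """Return the stored embedding on a match, otherwise a zero vector."""
    if jaccard(current_neighbors, stored_neighbors) >= threshold:
        return list(stored_embedding)
    return [0.0] * dim

# Stored neighbors arrive as a list because the store is loaded from JSON.
stored_neighbors = ["C:\\a.dll", "C:\\b.dll"]
stored_embedding = [0.5] * 20

same = select_embedding(stored_embedding, stored_neighbors, {"C:\\a.dll", "C:\\b.dll"})
drifted = select_embedding(stored_embedding, stored_neighbors, {"C:\\a.dll", "C:\\evil.dll"})
print(jaccard({"C:\\a.dll", "C:\\evil.dll"}, stored_neighbors))
```

With the default threshold of 1, any change in the neighbor set falls back to the zero vector, which leaves the classifier with the word2vec features alone for that node.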

In [22]:
from sklearn.metrics import accuracy_score
from collections import Counter
import xgboost as xgb

if xgbTrain:
    file_path = 'Enter Path to Train File Here'
    x,y,_,_ = load_features(file_path)

    xgb_cl = xgb.XGBClassifier()

    xgb_cl.fit(x,y)
    # Use a context manager so the weights file is flushed and closed.
    with open(xgboost_weights, "wb") as f:
        pickle.dump(xgb_cl, f)

    preds = xgb_cl.predict(x)
    print(accuracy_score(y, preds))

In [23]:
def load_pkl(fname):
    with open(fname, 'rb') as f:
        obj = pickle.load(f)
    return obj

In [24]:
def validate(file_path):
    x,y,_,_ = load_features(file_path)
    xgb_cl = load_pkl(xgboost_weights)

    pred = xgb_cl.predict(x)
    proba = xgb_cl.predict_proba(x)

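    # Confidence is the relative margin between the two most likely classes.
    sorted_proba = np.sort(proba, axis=1)
    conf = (sorted_proba[:, -1] - sorted_proba[:, -2]) / sorted_proba[:, -1]
    conf = (conf - conf.min()) / conf.max()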

    check = (pred == y)
    flag = ~torch.tensor(check)
    scores = conf[flag].tolist()
    return scores
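
The confidence score computed in `validate` can be worked through by hand on two probability rows. The sketch below mirrors the same arithmetic in plain Python; the probabilities and the helper name `margin_confidence` are invented for the example.

```python
def margin_confidence(proba_rows):
    """Relative margin between the top two classes, rescaled as in the notebook."""
    raw = []
    for row in proba_rows:
        ordered = sorted(row)
        top, runner_up = ordered[-1], ordered[-2]
        raw.append((top - runner_up) / top)
    low, high = min(raw), max(raw)
    if high == 0:
        return [0.0 for _ in raw]
    # Same rescaling as the notebook: subtract the minimum, divide by the maximum.
    return [(value - low) / high for value in raw]

# Hypothetical class probabilities for two nodes over four node types.
example_proba = [
    [0.90, 0.05, 0.03, 0.02],   # clear winner
    [0.40, 0.35, 0.15, 0.10],   # close call
]
scores = margin_confidence(example_proba)
print(scores)
```

The rescaling divides by the maximum rather than by the range, so the highest score reaches 1 only when the lowest raw margin is 0.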

In [25]:
from itertools import compress
from torch_geometric import utils

def Get_Adjacent(ids, mapp, edges, hops):
    if hops == 0:
        return set()
    
    neighbors = set()
    for edge in zip(edges[0], edges[1]):
        if any(mapp[node] in ids for node in edge):
            neighbors.update(mapp[node] for node in edge)

    if hops > 1:
        neighbors = neighbors.union(Get_Adjacent(neighbors, mapp, edges, hops - 1))
    
    return neighbors

def calculate_metrics(TP, FP, FN, TN):
    FPR = FP / (FP + TN) if FP + TN > 0 else 0
    TPR = TP / (TP + FN) if TP + FN > 0 else 0

    prec = TP / (TP + FP) if TP + FP > 0 else 0
    rec = TP / (TP + FN) if TP + FN > 0 else 0
    fscore = (2 * prec * rec) / (prec + rec) if prec + rec > 0 else 0

    return prec, rec, fscore, FPR, TPR

def helper(MP, all_pids, GP, edges, mapp):
    TP = MP.intersection(GP)
    FP = MP - GP
    FN = GP - MP
    TN = all_pids - (GP | MP)

    two_hop_gp = Get_Adjacent(GP, mapp, edges, 2)
    two_hop_tp = Get_Adjacent(TP, mapp, edges, 2)
    FPL = FP - two_hop_gp
    TPL = TP.union(FN.intersection(two_hop_tp))
    FN = FN - two_hop_tp

    TP, FP, FN, TN = len(TPL), len(FPL), len(FN), len(TN)

    prec, rec, fscore, FPR, TPR = calculate_metrics(TP, FP, FN, TN)
    print(f"Precision: {round(prec, 2)}, Recall: {round(rec, 2)}, Fscore: {round(fscore, 2)}")
    
    return TPL, FPL
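
The two-hop relaxation applied in `helper` is easier to follow on a toy graph. The sketch below is an illustration under assumptions: the node names and edges are invented, and `within_hops` is a hypothetical stand-in that follows the same logic as `Get_Adjacent`.

```python
def within_hops(ids, mapp, edges, hops):
    """Nodes that share an edge with `ids`, expanded up to `hops` times."""
    if hops == 0:
        return set()
    reached = set()
    for src, dst in zip(edges[0], edges[1]):
        if mapp[src] in ids or mapp[dst] in ids:
            reached.update((mapp[src], mapp[dst]))
    if hops > 1:
        reached |= within_hops(reached, mapp, edges, hops - 1)
    return reached

# Toy graph: a chain p1-p2-p3-p4 plus an unrelated edge p5-p6.
mapp = ["p1", "p2", "p3", "p4", "p5", "p6"]
edges = [[0, 1, 2, 4], [1, 2, 3, 5]]

ground_truth = {"p1", "p2"}
flagged = {"p1", "p3", "p5"}

tp = flagged & ground_truth
fp = flagged - ground_truth
fn = ground_truth - flagged

near_gt = within_hops(ground_truth, mapp, edges, 2)
near_tp = within_hops(tp, mapp, edges, 2)

fp_relaxed = fp - near_gt              # false alarms near an attack node are forgiven
tp_relaxed = tp | (fn & near_tp)       # missed attack nodes near a detection count as found
fn_relaxed = fn - near_tp

precision = len(tp_relaxed) / (len(tp_relaxed) + len(fp_relaxed))
recall = len(tp_relaxed) / (len(tp_relaxed) + len(fn_relaxed))
print(precision, recall)
```

In this toy case the flagged node `p3` is within two hops of the ground truth and is not counted as a false positive, while `p5` sits in a separate component and still is.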

In [26]:
import numpy as np

def load_features_test(dataframe, similarity_threshold=1):
    nodes, y_train, edges, mapping, label_map, node_entity_map = Featurize(dataframe)
    X_train = []

    for i, map_id in enumerate(mapping):
        label = label_map[map_id]
        node_embedding = np.zeros(20)  

        if label in gnn_map:
            embedding, stored_set = gnn_map[label]
            current_set = node_entity_map[map_id]
            similarity_metric = len(current_set.intersection(stored_set)) / len(current_set.union(stored_set))

            if similarity_metric >= similarity_threshold:
                node_embedding = np.array(embedding)

        X_train.append(np.hstack((nodes[i], node_embedding)))

    return np.array(X_train), y_train, edges, mapping

In [27]:
import json
import numpy as np
import torch
from torch_geometric import utils

In [28]:
def load_events_from_hosts(hosts):
    all_events = []
    for host in hosts:
        path = f'SysClient0{host}.systemia.com.txt'
        with open(path, 'r') as file:
            raw_events = [json.loads(line) for line in file]
        all_events.extend(raw_events)
    return all_events

def load_ground_truth(gt_file):
    with open(gt_file, 'r') as file:
        gt_nodes = set(file.read().split())
    return gt_nodes

def evaluate_model(df, xgb_cl, similarity_threshold, confidence_threshold):
    x, y, edges, mapp = load_features_test(df, similarity_threshold)

    pred = xgb_cl.predict(x)
    proba = xgb_cl.predict_proba(x)

    sorted_proba = np.sort(proba, axis=1)
    conf = (sorted_proba[:, -1] - sorted_proba[:, -2]) / sorted_proba[:, -1]
    normalized_conf = (conf - conf.min()) / conf.max()

    check = (pred == y) & (normalized_conf > confidence_threshold)
    flag = ~torch.tensor(check)

    index = utils.mask_to_index(flag).tolist()
    ids = {mapp[idx] for idx in index}
    return ids,edges,mapp

In [29]:
import json
import numpy as np
import torch

def read_event_data(host):
    file_path = f'SysClient0{host}.systemia.com.txt'
    with open(file_path, 'r') as file:
        return [json.loads(line) for line in file]
        
def stream_events(batch_size, window_size):
    event_buffer = {}
    hosts = ['051']
    positions = {host: 0 for host in hosts}
    while True:
        for host in hosts:
            if host not in event_buffer or len(event_buffer[host]) < positions[host] + batch_size:
                events = read_event_data(host)
                dframe = transform(events)
                if host in event_buffer:
                    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
                    event_buffer[host] = pd.concat([event_buffer[host], dframe], ignore_index=True)
                else:
                    event_buffer[host] = dframe
            start = positions[host]
            end = start + batch_size
            yield event_buffer[host][start:end]
            positions[host] += window_size
            if positions[host] >= len(event_buffer[host]):
                return

def analyze_events(data_frame, ground_truth_nodes):
    
    if data_frame['properties'].apply(lambda x: isinstance(x, str)).any():
        data_frame['properties'] = data_frame['properties'].apply(json.loads)
        
    actor_and_object_ids = set(data_frame['actorID']) | set(data_frame['objectID'])
    relevant_ground_truth = {x for x in ground_truth_nodes if x in actor_and_object_ids}

    features, labels, edges, mapping = load_features_test(data_frame)
    model = load_pkl(xgboost_weights)

    predictions = model.predict(features)
    probabilities = model.predict_proba(features)

    sorted_probabilities = np.sort(probabilities, axis=1)
    confidence_scores = (sorted_probabilities[:, -1] - sorted_probabilities[:, -2]) / sorted_probabilities[:, -1]
    normalized_confidence = (confidence_scores - confidence_scores.min()) / confidence_scores.max()

    misclassified = ~torch.tensor(predictions == labels)
    misclassified_indices = utils.mask_to_index(misclassified).tolist()
    misclassified_ids = {mapping[idx] for idx in misclassified_indices}

    helper(misclassified_ids, actor_and_object_ids, relevant_ground_truth, edges, mapping)

In [30]:
def traverse(ids, mapping, edges, hops, visited=None):
    if hops == 0:
        return set()

    if visited is None:
        visited = set()

    neighbors = set()
    for src, dst in zip(edges[0], edges[1]):
        src_mapped, dst_mapped = mapping[src], mapping[dst]

        if (src_mapped in ids and dst_mapped not in visited) or \
           (dst_mapped in ids and src_mapped not in visited):
            neighbors.add(src_mapped)
            neighbors.add(dst_mapped)

        visited.add(src_mapped)
        visited.add(dst_mapped)

    neighbors.difference_update(ids) 
    return ids.union(traverse(neighbors, mapping, edges, hops - 1, visited))

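# Named load_json so it does not shadow the load_data defined earlier, which load_features relies on.
def load_json(file_path):
    with open(file_path, 'r') as file:
        return json.load(file)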

def find_connected_alerts(start_alert, mapping, edges, depth, remaining_alerts):
    connected_path = traverse({start_alert}, mapping, edges, depth)
    return connected_path.intersection(remaining_alerts)

def generate_incident_graphs(alerts, edges, mapping, depth):
    incident_graphs = []
    remaining_alerts = set(alerts)

    while remaining_alerts:
        alert = remaining_alerts.pop()
        connected_alerts = find_connected_alerts(alert, mapping, edges, depth, remaining_alerts)

        if len(connected_alerts) > 1:
            incident_graphs.append(connected_alerts)
            remaining_alerts -= connected_alerts

    return incident_graphs

### Testing Flash on OpTC Malicious Upgrade Attack

In [31]:
all_events = load_events_from_hosts(['051'])

EnActIds = [x['actorID'] for x in all_events]
EnObjIds = [x['objectID'] for x in all_events]
EntitySet = set(EnActIds).union(set(EnObjIds))

df = transform(all_events)

gt_nodes = load_ground_truth('optc.txt')
gt_nodes = [x for x in gt_nodes if x in EntitySet]
gt_nodes = set(gt_nodes)

xgboost_model = load_pkl(xgboost_weights)
identified_ids,edges,mapp = evaluate_model(df, xgboost_model, 1, 0.6)

alerts = helper(identified_ids, EntitySet, gt_nodes, edges, mapp)

Precision: 0.93, Recall: 0.92, Fscore: 0.93


### Testing Flash on OpTC Plain PowerShell Empire Attack

In [32]:
all_events = load_events_from_hosts(['201'])

EnActIds = [x['actorID'] for x in all_events]
EnObjIds = [x['objectID'] for x in all_events]
EntitySet = set(EnActIds).union(set(EnObjIds))

df = transform(all_events)

gt_nodes = load_ground_truth('optc.txt')
gt_nodes = [x for x in gt_nodes if x in EntitySet]
gt_nodes = set(gt_nodes)

xgboost_model = load_pkl(xgboost_weights)
identified_ids,edges,mapp = evaluate_model(df, xgboost_model, 1, 0)

alerts = helper(identified_ids, EntitySet, gt_nodes, edges, mapp)

Precision: 0.81, Recall: 0.95, Fscore: 0.87


### Testing Flash on OpTC Custom PowerShell Empire Attack

In [33]:
all_events = load_events_from_hosts(['501'])

EnActIds = [x['actorID'] for x in all_events]
EnObjIds = [x['objectID'] for x in all_events]
EntitySet = set(EnActIds).union(set(EnObjIds))

df = transform(all_events)

gt_nodes = load_ground_truth('optc.txt')
gt_nodes = [x for x in gt_nodes if x in EntitySet]
gt_nodes = set(gt_nodes)

xgboost_model = load_pkl(xgboost_weights)
identified_ids,edges,mapp = evaluate_model(df, xgboost_model, 1, 0.98)

alerts = helper(identified_ids, EntitySet, gt_nodes, edges, mapp)

Precision: 0.94, Recall: 0.92, Fscore: 0.93


### Testing Flash on Streaming Batches Generated from OpTC Attack Logs

In [34]:
stream = False
if stream:
    for data_frame in stream_events(250000, 250):
        gt_nodes = load_ground_truth('optc.txt')
        analyze_events(data_frame, gt_nodes)