# Threat Hunting Report with Logarithms

### Logarithms

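The default threshold is -3, but expect to tune it for your dataset. The scripts score each value by the base-10 logarithm of its share of all entries. The logarithm is used because it compresses large datasets and helps rare events stand out. "Compress" in this context means that common data points are grouped together: their scores cluster close to 0, while rare values fall well below it. A threshold of -3 flags any value that accounts for less than 0.1% of the entries, which provides a good balance between noise and missed anomalies.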

Raising or lowering the threshold changes which values are flagged, so try several values to help identify anomalies.
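
To make the threshold concrete, the short sketch below converts a count into the score the notebook cells use (`log10(count / total)`) and prints the proportion each threshold corresponds to. The `log_proportion` helper name and the total of 100,000 entries are assumptions for illustration only.

```python
import math

def log_proportion(count: int, total: int) -> float:
    """Return log10 of the share of all entries that this value accounts for."""
    return math.log10(count / total)

# A threshold t flags values whose share of the dataset is below 10**t.
for threshold in (-2, -3, -4):
    print(f"threshold {threshold}: flags values under {10 ** threshold:.4%} of entries")

# Hypothetical dataset of 100,000 entries.
total = 100_000
for count in (5_000, 150, 40):
    score = log_proportion(count, total)
    print(count, round(score, 3), "anomaly" if score < -3 else "common")
```

With these made-up numbers, only the value seen 40 times falls below -3; the value seen 150 times scores about -2.8 and would need a threshold of -2.5 or higher to be flagged.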

#### Heterogeneous Dataset

If the dataset is heterogeneous, you will likely need a higher threshold; start testing at around -2 or -2.5. There is more variability in the data, so the cutoff has to sit closer to 0 for anomalies to stand out. Keep in mind that with the comparison used in the code (`log_proportion < threshold`), a higher threshold flags more values, so expect more results to review.

This applies if you are analyzing processes, for example, across different operating system versions or across systems with different roles (e.g., IIS web servers versus Exchange servers). Mixing such systems in one analysis is not recommended, however.
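
One way to avoid mixing roles in a single run is to score each group separately. The sketch below is not part of the original notebook: it assumes each record carries a hypothetical `role` field (the sample dataset may not have one) and computes the log proportion within each role. The `anomalies_by_group` helper and the records are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def anomalies_by_group(records, group_key, value_key, threshold):
    """Score values within each group instead of across the whole dataset."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[group_key]][record[value_key]] += 1

    flagged = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        flagged[group] = sorted(
            (math.log10(n / total), value)
            for value, n in counter.items()
            if math.log10(n / total) < threshold
        )
    return flagged

# Hypothetical records: "role" is an assumed field, not part of the sample dataset.
records = (
    [{"role": "iis", "name": "w3wp.exe"}] * 300
    + [{"role": "iis", "name": "mshta.exe"}] * 1
    + [{"role": "exchange", "name": "store.exe"}] * 200
    + [{"role": "exchange", "name": "w3wp.exe"}] * 150
)

# The per-role groups are small, so a higher threshold is used here.
result = anomalies_by_group(records, "role", "name", threshold=-2)
print(result)
```

In this toy data, `mshta.exe` is flagged only within the `iis` group, and nothing is flagged for `exchange`.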

#### Homogeneous Dataset

If the systems in the dataset are similar, use a lower threshold; -3 is a good starting point.

In the sample analysis below, a threshold of -3 finds 2 anomalies. In each output line, the first value is the number of hosts the file is on and the second is the logarithmic value. Since the dataset is homogeneous, the lower threshold shows the rare processes running.

#### Dataset Size

Larger datasets provide higher confidence in detecting anomalies, so a lower threshold works well (-3, -3.5, or -4).

Smaller datasets provide less confidence, so a higher threshold is necessary: -2, or -1 for very small datasets.
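
The link between dataset size and threshold follows from the arithmetic: the rarest possible value appears once, so its score is `log10(1 / total)`. If the threshold is at or below that floor, nothing can be flagged. The sketch below prints the floor for a few dataset sizes; the `rarest_possible_score` name and the sizes are assumptions for illustration.

```python
import math

def rarest_possible_score(total_entries: int) -> float:
    """Log proportion of a value that appears exactly once."""
    return math.log10(1 / total_entries)

for total in (50, 500, 5_000, 500_000):
    floor = rarest_possible_score(total)
    # The notebook flags values whose log proportion is strictly below the threshold.
    can_flag_at_minus_3 = floor < -3
    print(f"{total:>7} entries: floor {floor:.2f}, -3 can flag something: {can_flag_at_minus_3}")
```

With 500 entries, even a single occurrence scores about -2.7, so a threshold of -3 returns nothing and -2 or -2.5 is the practical starting point.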

### Sample Analysis

```
[snipped for brevity]
[200 -1.838629 C:\ProgramData\Ashampoo Winzip AshampooWinZip.exe]
[200 -1.838629 C:\Program Files\WebServer_11.24.0.0_x64__8wekyb3d8bbwe\WebServer.exe]
[200 -1.838629 C:\Program Files\SessionManager_11.24.0.0_x64__8wekyb3d8bbwe\SessionManager.exe]
[200 -1.838629 C:\ProgramData\Dell\Supportassist\DellSupportAssist.exe]
[200 -1.838629 C:\Program Files (x86)\Twitch Twitch.exe]
[200 -1.838629 C:\Program Files (x86)\Microsoft Edge\Application MicrosoftEdge.exe]
[200 -1.838629 C:\Windows\SystemHealth.exe]
[200 -1.838629 C:\Windows\system32\taskmgr.exe]
[199 -1.840806 C:\Program Files (x86)\iPod\ iTunes AppleiTunes.exe]
[193 -1.854101 C:\Windows\taskmgr.exe]
[2 -3.838629 C:\ProgramData\system32\csrss.exe]
[1 -4.139659 C:\Program Files (x86)\iPod\ iTunes AppleTunes.exe]
```

Note that `C:\Windows\taskmgr.exe` is on 193 hosts but does not show up as an anomaly with a threshold of -3. That is why you need to change the threshold: some anomalies sit outside any single cutoff. Larger datasets give higher confidence at a lower threshold, but some anomalies can still be missed, so adjust the threshold throughout your analysis. Also, the legitimate `taskmgr.exe` file is located at `C:\Windows\system32\taskmgr.exe`. If this were a real system, this would likely be a mass compromise UNLESS a custom program lives in that path or a third-party program uses a similar process name. *CONTEXT! CONTEXT! CONTEXT!*
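
The effect of moving the threshold can be reproduced from the sample output. The sketch below reuses a few of the host counts shown above. The total of 13,793 entries is an assumption inferred from the single-host line (`10 ** 4.139659`), not a figure stated in the report, and the `flagged` helper is hypothetical.

```python
import math

# Host counts taken from the sample output; TOTAL is inferred, not stated in the report.
TOTAL = 13_793
sample = {
    r"C:\Windows\system32\taskmgr.exe": 200,
    r"C:\Program Files (x86)\iPod\ iTunes AppleiTunes.exe": 199,
    r"C:\Windows\taskmgr.exe": 193,
    r"C:\ProgramData\system32\csrss.exe": 2,
    r"C:\Program Files (x86)\iPod\ iTunes AppleTunes.exe": 1,
}

def flagged(counts, total, threshold):
    """Return paths whose log10 proportion falls below the threshold."""
    return [path for path, n in counts.items() if math.log10(n / total) < threshold]

for threshold in (-3, -1.85):
    print(threshold, flagged(sample, TOTAL, threshold))
```

At -3 only the two rarest paths are returned. Raising the threshold to -1.85 also surfaces `C:\Windows\taskmgr.exe`, while the entries on 199 and 200 hosts stay above the cutoff.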

In the above output, you can also see the difference between the `AppleiTunes.exe` path (199 hosts) and the `AppleTunes.exe` path (1 host).
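
Look-alike names such as these can also be surfaced programmatically. The sketch below is not part of the original notebook. It assumes Python's standard `difflib` module is an acceptable similarity measure and compares each rare file name against the common ones. The 0.9 cutoff and the `lookalikes` helper are illustrative assumptions.

```python
from difflib import SequenceMatcher

def lookalikes(rare_names, common_names, cutoff=0.9):
    """Pair each rare name with common names it closely resembles but does not equal."""
    pairs = []
    for rare in rare_names:
        for common in common_names:
            ratio = SequenceMatcher(None, rare.lower(), common.lower()).ratio()
            if rare.lower() != common.lower() and ratio >= cutoff:
                pairs.append((rare, common, round(ratio, 3)))
    return pairs

# File names taken from the sample output above.
common = ["AppleiTunes.exe", "taskmgr.exe", "Twitch.exe"]
rare = ["AppleTunes.exe", "csrss.exe"]
matches = lookalikes(rare, common)
print(matches)
```

A high similarity between a rare name and a common one is only a lead to investigate; the file path and the surrounding context still decide whether it is malicious.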

The GitHub repo contains a Go-based version of the logarithmic functions below, along with another tool that creates a baseline and then compares the remaining systems against it: https://github.com/thedunston/goMeeb/


## Prints only the summary of the selected header, sorted by log proportion in ascending order. The rarest values (anomalies) appear at the top.

In [6]:
import csv
import math
import threading
import queue
import pandas as pd
from pathlib import Path
from collections import defaultdict
from IPython.display import display

# Directory path where the CSV files are stored.
# Download the threat hunting dataset: https://github.com/mosse-security/threat-hunting-samples and update the path.
directory = "PATH_TO_DATASET"

# Header in the CSV files to filter on. Update this as needed based on the CSV file you use.
header = "name"

# Threshold for identifying anomalies. This value will need to be adjusted 
# based on the size of the dataset and the variability in the data.
# Careful here because rendering the results in the browser can
# cause it to crash if there are a lot of results.
threshold = -3.0

# Retrieve list of CSV files from the specified directory.
def get_csv_files(directory):
    # Recursively gather all CSV files.
    files = [str(path) for path in Path(directory).rglob('*.csv')]
    if not files:
        raise FileNotFoundError(f"No CSV files found in directory {directory}")
    
    return files

# Determine the number of threads to use based on the number of files.
def determine_num_threads(files):
    
    # Basic method to determine number of threads.
    # This is helpful for dozens of CSV files.
    num_threads = len(files) // 2
    if num_threads > 10:
        num_threads = 10
    elif num_threads < 1:
        num_threads = 1
    return num_threads

# Process each CSV file and return the count of occurrences for the specified header.
def process_file(file, header, data_queue):
    data = defaultdict(int)
    total_entries = 0

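    # Open the CSV file. The csv module expects files opened with newline=''.
    with open(file, 'r', newline='') as f:
        reader = csv.DictReader(f)
        # Check if the header is present (fieldnames is None for an empty file).
        if not reader.fieldnames or header not in reader.fieldnames:
            print(f"Header {header} not found in file {file}")
            return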
        
        # Count occurrences of each value under the specified header.
        for row in reader:
            value = row[header]
            data[value] += 1
            total_entries += 1
    
    # Put the processed data into the queue.
    data_queue.put((data, total_entries))

# Aggregate data from all files for the specified header.
def aggregate_data(files, header, num_threads):
    data_queue = queue.Queue()
    threads = []

    # Create and start threads for concurrent processing.
    for i in range(num_threads):
        t = threading.Thread(target=process_files_thread, args=(files[i::num_threads], header, data_queue))
        threads.append(t)
        t.start()

    # Wait for all threads to finish.
    for t in threads:
        t.join()

    aggregated_data = defaultdict(int)
    total_entries = 0

    # Collect data from the queue and aggregate it.
    while not data_queue.empty():
        data, count = data_queue.get()
        for key, value in data.items():
            aggregated_data[key] += value
        total_entries += count

    return aggregated_data, total_entries

# Process files.
def process_files_thread(files, header, data_queue):
    for file in files:
        process_file(file, header, data_queue)

# Identify anomalies based on the specified threshold.
# ChatGPT help with this...math. :)
def identify_anomalies(data, total_entries, threshold):
    anomalies = []
    for value, count in data.items():
        
        # Calculate the log proportion of each value.
        proportion = float(count) / float(total_entries)
        log_proportion = math.log10(proportion)
        
        # Identify anomalies based on the threshold.
        if log_proportion < threshold:
            anomalies.append([count, log_proportion, value])
    
    # Sort anomalies by log proportion in ascending order.
    anomalies.sort(key=lambda x: x[1])
    return anomalies

# Print results as a pandas DataFrame.
def print_results_dataframe(results):
    df = pd.DataFrame(results, columns=['Count', 'Log Proportion', 'Value'])
    display(df)

try:
    
    files = get_csv_files(directory)

    num_threads = determine_num_threads(files)
    
    aggregated_data, total_entries = aggregate_data(files, header, num_threads)
    
    results = identify_anomalies(aggregated_data, total_entries, threshold)
    
    print_results_dataframe(results)
    
except Exception as e:
    print(f"An error occurred: {e}")


| | Count | Log Proportion | Value |
|---|---|---|---|
| 0 | 50 | -3.499038 | sbdinst.exe |
| 1 | 54 | -3.465614 | paexec.exe |
| 2 | 96 | -3.215737 | Powershell.exe |
| 3 | 100 | -3.198008 | schtasks.exe |
| 4 | 106 | -3.172702 | mshta.exe |


In [19]:
import csv
import math
import pandas as pd
from pathlib import Path
from collections import defaultdict
import threading
import queue
from IPython.display import display

# Download the threat hunting dataset: https://github.com/mosse-security/threat-hunting-samples and update the path.
directory = "PATH_TO_DATASET"

# Header(s) to filter on. Separate multiple headers with commas (multiple headers are still being tested).
headers_input = "name"
headers = [header.strip() for header in headers_input.split(",")]

# Threshold for identifying anomalies. This value will need to be adjusted 
# based on the size of the dataset and the variability in the data.
# Careful here because rendering the results in the browser can
# cause it to crash if there are a lot of results.
threshold = -3

def get_csv_files(directory):
    files = [str(path) for path in Path(directory).rglob('*.csv')]
    if not files:
        raise FileNotFoundError(f"No CSV files found in directory {directory}")
    return files

# Basic method to determine number of threads.
# This is helpful for dozens of CSV files.
# Determine the number of threads to use based on the number of files.
def determine_num_threads(files):

    num_threads = len(files) // 2
    if num_threads > 10:
        num_threads = 10
    elif num_threads < 1:
        num_threads = 1
    return num_threads

# Process each CSV file and collect the full rows for each value under the specified headers.
def process_file(file, headers, data_queue):
    data = defaultdict(list)
    total_entries = 0

    # The csv module expects files opened with newline=''.
    with open(file, 'r', newline='') as f:
        reader = csv.DictReader(f)
        # fieldnames is None for an empty file.
        fieldnames = reader.fieldnames or []
        missing_headers = [header for header in headers if header not in fieldnames]
        if missing_headers:
            print(f"Headers {missing_headers} not found in file {file}")
            return

        # Group each full row under the value it has for every requested header.
        for row in reader:
            for header in headers:
                data[row[header]].append(row)
            total_entries += 1
    
    data_queue.put((data, total_entries))

# Aggregate data from all files for the specified header.
def aggregate_data(files, headers, num_threads):

    data_queue = queue.Queue()
    threads = []

    for i in range(num_threads):
        t = threading.Thread(target=process_files_thread, args=(files[i::num_threads], headers, data_queue))
        threads.append(t)
        t.start()

    for t in threads:
        t.join()

    aggregated_data = defaultdict(list)
    total_entries = 0

    while not data_queue.empty():
        data, count = data_queue.get()
        for key, value in data.items():
            aggregated_data[key].extend(value)
        total_entries += count

    return aggregated_data, total_entries

# Process files.
def process_files_thread(files, headers, data_queue):
    for file in files:
        process_file(file, headers, data_queue)

# Identify anomalies based on the specified threshold.
# ChatGPT help with this...math. :)
def identify_anomalies(data, total_entries, threshold):
    anomalies = []
    for value, rows in data.items():
        count = len(rows)
        proportion = float(count) / float(total_entries)
        log_proportion = math.log10(proportion)
        if log_proportion < threshold:
            for row in rows:
                anomalies.append([count, log_proportion, row])
    
    anomalies.sort(key=lambda x: x[1])
    return anomalies

# Print results as a pandas DataFrame.
def print_results_dataframe(results):
    records = []
    for result in results:
        count, log_proportion, record = result
        flattened_record = {**record, 'Count': count, 'Log Proportion': log_proportion}
        records.append(flattened_record)
    
    df = pd.DataFrame(records)
    
    # Set pandas options to display all records and not truncate the data.
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)
    
    # Show 'Count' and 'Log Proportion' on the left and then show the data from the dataset.
    cols = ['Count', 'Log Proportion'] + [col for col in df.columns if col not in ['Count', 'Log Proportion']]
    df = df[cols]
    
    # Apply left justification to all columns
    df_style = df.style.set_properties(**{'text-align': 'left'})

    # Display the styled DataFrame.
    display(df_style)

try:
    files = get_csv_files(directory)
    num_threads = determine_num_threads(files)
    aggregated_data, total_entries = aggregate_data(files, headers, num_threads)
    results = identify_anomalies(aggregated_data, total_entries, threshold)
    print_results_dataframe(results)
except Exception as e:
    print(f"An error occurred: {e}")


| | Count | Log Proportion | arguments | hostname | name | path | pid | username |
|---|---|---|---|---|---|---|---|---|
| 0 | 50 | -3.499038 | C:\Windows\System32\sbdinst.exe -q C:\Windows\AppPatch\Custom\Custom64\8b1cbd46-c17f-4600-a6ef-4e60c7babbb0.sdb | HLC469ES | sbdinst.exe | C:\Windows\System32\sbdinst.exe | 2820 | NT AUTHORITY\SYSTEM |
| 1 | 50 | -3.499038 | C:\Windows\System32\sbdinst.exe -q C:\Windows\AppPatch\Custom\Custom64\3f1fbe1b-a0d4-490d-a838-4dda6648855f.sdb | HLC291SE | sbdinst.exe | C:\Windows\System32\sbdinst.exe | 1132 | NT AUTHORITY\SYSTEM |
| 2 | 50 | -3.499038 | C:\Windows\System32\sbdinst.exe -q C:\Windows\AppPatch\Custom\Custom64\93ad1550-8e79-4907-990f-df900efd0a1c.sdb | ZVQ262NB | sbdinst.exe | C:\Windows\System32\sbdinst.exe | 3788 | NT AUTHORITY\SYSTEM |
| 3 | 50 | -3.499038 | C:\Windows\System32\sbdinst.exe -q C:\Windows\AppPatch\Custom\Custom64\c8262d20-5b61-493d-ad1b-daa391ff8556.sdb | ZVQ988WC | sbdinst.exe | C:\Windows\System32\sbdinst.exe | 120 | NT AUTHORITY\SYSTEM |
| 4 | 50 | -3.499038 | C:\Windows\System32\sbdinst.exe -q C:\Windows\AppPatch\Custom\Custom64\ecae7015-457f-453d-93e5-f523d6e148ee.sdb | ZAX298WC | sbdinst.exe | C:\Windows\System32\sbdinst.exe | 6036 | NT AUTHORITY\SYSTEM |
| 5 | 50 | -3.499038 | C:\Windows\System32\sbdinst.exe -q C:\Windows\AppPatch\Custom\Custom64\7d253014-689f-4587-8f3c-a3aae5cba72c.sdb | YBI898NE | sbdinst.exe | C:\Windows\System32\sbdinst.exe | 5932 | NT AUTHORITY\SYSTEM |
| 6 | 50 | -3.499038 | C:\Windows\System32\sbdinst.exe -q C:\Windows\AppPatch\Custom\Custom64\786e9325-cb8d-47f9-86bd-e067f6519628.sdb | YBI729SE | sbdinst.exe | C:\Windows\System32\sbdinst.exe | 1512 | NT AUTHORITY\SYSTEM |
| 7 | 50 | -3.499038 | C:\Windows\System32\sbdinst.exe -q C:\Windows\AppPatch\Custom\Custom64\6b7b9d30-c9f6-47fd-b72e-9824972caed5.sdb | GUX432SB | sbdinst.exe | C:\Windows\System32\sbdinst.exe | 208 | NT AUTHORITY\SYSTEM |
| 8 | 50 | -3.499038 | C:\Windows\System32\sbdinst.exe -q C:\Windows\AppPatch\Custom\Custom64\4c0c1581-1953-40e6-b472-ae649ebfcb2f.sdb | NJP855EL | sbdinst.exe | C:\Windows\System32\sbdinst.exe | 6388 | NT AUTHORITY\SYSTEM |
| 9 | 50 | -3.499038 | C:\Windows\System32\sbdinst.exe -q C:\Windows\AppPatch\Custom\Custom64\aab94830-ca29-42ac-9bd2-4b8c21f59ef9.sdb | RLR202SC | sbdinst.exe | C:\Windows\System32\sbdinst.exe | 1548 | NT AUTHORITY\SYSTEM |
