### Evaluation Notebook for LLM Responses

This notebook contains scripts to evaluate the properly formatted responses from large language models (LLMs) for four tasks:

- **CTI-MCQ Task**: The response should be a letter corresponding to the correct answer (A, B, C, D or X which indicates error, such as refusing to answer).
- **CTI-RCM and CTI-RCM-2021 Tasks**: The response should be a properly formatted CWE ID.
- **CTI-VSP Task**: The response should be a properly formatted CVSS v3 string.
- **CTI-TAA Task**: The response should be the name of the threat actor.

Check the files in the `responses` folder for examples. This notebook will evaluate the responses from the ChatGPT-4 model and calculate the evaluation metrics.

Libraries required
- pandas
- cvss (https://pypi.org/project/cvss/)

Note: you may need to restart the kernel to use updated packages.


In [94]:
import pandas as pd
from cvss import CVSS3
import pickle

In [96]:
model_name = 'GT'  # corresponds to the column name in the respone sheet

### CTI-MCQ evaluation

In [99]:
def compute_mcq_accuracy(fname, col):
    df = pd.read_csv(fname, sep='\t')
    correct = 0
    total = 0
    for idx, row in df.iterrows():
        pred = row[col].upper()
        gt = row['GT'].upper()
        if pred in ['A', 'B', 'C', 'D', 'X']:
            total += 1
        else:
            print('Invalid response at row {}'.format(idx+1))
        if pred == gt:
            correct += 1
    return correct/total*100

In [101]:
print('Accuracy:', compute_mcq_accuracy('responses/cti-mcq-responses.tsv', model_name))

Accuracy: 100.0


### CTI-RCM evaluation

In [104]:
def compute_rcm_accuracy(fname, col):
    df = pd.read_csv(fname, sep='\t')
    correct = 0
    total = 0
    for idx, row in df.iterrows():
        pred = row[col].upper()
        gt = row['GT'].upper()
        if pred.startswith('CWE-'):
            total += 1
        else:
            print('Invalid response at row {}'.format(idx+1))
        if pred == gt:
            correct += 1
    return correct/total*100

In [106]:
print('Accuracy:', compute_rcm_accuracy('responses/cti-rcm-responses.tsv', model_name))

Accuracy: 100.0


In [108]:
print('Accuracy:', compute_rcm_accuracy('responses/cti-rcm-2021-responses.tsv', model_name))

Accuracy: 100.0


### CTI-VSP evaluation

In [111]:
def get_cvss_score(cvss_vector):
    c = CVSS3(cvss_vector)
    cvss_score = c.scores()[0]
    return cvss_score

In [113]:
def compute_vsp_mad(fname, col):
    cvss_prefix = 'CVSS:3.0/'   # should be empty string if the model responds with the prefix
    df = pd.read_csv(fname, sep='\t')
    error = 0
    total = 0
    for idx, row in df.iterrows():
        pred = row[col].upper()
        gt = row['GT'].upper()
        try:
            pred_vector = cvss_prefix + pred
            pred_score = get_cvss_score(pred_vector)
            gt_score = get_cvss_score(gt)
            error += abs(pred_score-gt_score)
        except Exception as e:
            print('Invalid response at row {}'.format(idx+1))
            print(e)
            continue
        total += 1
            
    return error/total

In [115]:
print('Mean Absolute Deviation:', compute_vsp_mad('responses/cti-vsp-responses.tsv', model_name))

Invalid response at row 1
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 2
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 3
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 4
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 5
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 6
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 7
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 8
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 9
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 10
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 11
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 12
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 13
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 14
Unknown metric "CVSS" in field "CVSS:3.1"
Invalid response at row 15
Unknown metric "

ZeroDivisionError: division by zero

### CTI-TAA evaluation

In [25]:
with open('alias_dict.pickle', 'rb') as handle:
    alias_dict = pickle.load(handle)

In [27]:
with open('related_dict.pickle', 'rb') as handle:
    related_dict = pickle.load(handle)

In [29]:
def threat_actor_connection(actor1, actor2, alias_dict, related_dict):
    """
    Determines the connection type between two threat actors based on alias and related group information.

    Args:
        actor1: The first threat actor.
        actor2: The second threat actor.
        alias_dict: A dictionary where keys are threat actors and values are lists of their aliases.
        related_dict: A dictionary where keys are threat actors and values are lists of related threat actors.

    Returns:
        "C" if the actors are connected via an alias chain.
        "P" if the actors are connected via a related group chain.
        "I" if no connection is found.
    """

    actor1 = actor1.strip().lower()
    actor2 = actor2.strip().lower()

    # Normalize dictionaries and ensure bidirectional alias relationships
    alias_dict = {k.strip().lower(): [v.strip().lower() for v in val] for k, val in alias_dict.items()}
    for actor in list(alias_dict): # Iterate over a copy of the keys
        aliases = alias_dict[actor]
        for alias in aliases:
            if actor not in alias_dict.setdefault(alias, []):  # Avoid duplicates
                alias_dict[alias].append(actor)

    related_dict = {k.strip().lower(): [v.strip().lower() for v in val] for k, val in related_dict.items()}
    for actor in list(related_dict):  # Iterate over a copy of the keys
        related_groups = related_dict[actor]
        for related_actor in related_groups:
            if actor not in related_dict.setdefault(related_actor, []):
                related_dict[related_actor].append(actor)

    if is_alias_connected(actor1, actor2, alias_dict):
        return "C"

    if is_related_connected(actor1, actor2, alias_dict, related_dict):
        return "P"

    return "I"


def is_alias_connected(actor1, actor2, alias_dict):
    """
    Checks if two actors are connected through an alias chain using Breadth First Search (BFS).
    """
    visited = set()
    queue = [actor1]

    while queue:
        current_actor = queue.pop(0)
        visited.add(current_actor)

        for alias in alias_dict.get(current_actor, []):
            if alias == actor2:
                return True
            if alias not in visited:
                queue.append(alias)

    return False


def is_related_connected(actor1, actor2, alias_dict, related_dict) :
    """
    Checks if two actors are connected through a chain of aliases and related groups using BFS.
    """
    visited = set()
    queue = [actor1]

    while queue:
        current_actor = queue.pop(0)
        visited.add(current_actor)

        for alias in alias_dict.get(current_actor, []):
            if alias == actor2:
                return True
            if alias not in visited:
                queue.append(alias)

        for related_actor in related_dict.get(current_actor, []):
            if related_actor == actor2:
                return True
            if related_actor not in visited:
                queue.append(related_actor)

    return False

In [31]:
def compute_taa_accuracy(fname, col):
    """
    Returns Correct & Plausible Accuracy
    """
    df = pd.read_csv(fname, sep='\t')
    correct = 0
    plausible = 0
    total = 0
    
    for idx, row in df.iterrows():
        pred = row[col].lower().strip()
        gt = row['GT'].lower().strip()
        
        res = threat_actor_connection(gt, pred, alias_dict, related_dict)

        if res == 'C':
            correct += 1
        elif res == 'P':
            plausible += 1
        total += 1

    return correct/total*100, (correct+plausible)/total*100

In [33]:
print('Correct & Plausible Accuracy:', compute_taa_accuracy('responses/cti-taa-responses.tsv', model_name))

Correct & Plausible Accuracy: (52.0, 86.0)


In [51]:
import matplotlib.pyplot as plt

In [None]:
models = ["ChatGPT-3.5","ChatGPT-4","Gemini-1.5"]

In [59]:
mcq_accuracy = [compute_mcq_accuracy('responses/cti-mcq-responses.tsv', i) for i in models]
vsp_mad = [compute_vsp_mad('responses/cti-mcq-responses.tsv', i) for i in models]
thractor_connection = [threat_actor_connection('responses/cti-mcq-responses.tsv', i) for i in models]
taa_accuracy = [compute_taa_accuracy('responses/cti-mcq-responses.tsv', i) for i in models]
plt.bar(models,res)

plt.show()
plt.bar(models,res)

plt.show()

NameError: name 'compute_vsp_mad' is not defined