### Evaluation Notebook for LLM Responses

This notebook contains scripts that evaluate properly formatted responses from large language models (LLMs) for four tasks:

- **CTI-MCQ Task**: The response should be a single letter: the chosen answer (A, B, C, or D), or X to indicate an error such as a refusal to answer.
- **CTI-RCM and CTI-RCM-2021 Tasks**: The response should be a properly formatted CWE ID (for example, `CWE-79`).
- **CTI-VSP Task**: The response should be a properly formatted CVSS v3 vector string.
- **CTI-TAA Task**: The response should be the name of the threat actor.

Check the files in the `responses` folder for examples. This notebook evaluates the responses of the models named in `model_name` and `model_name2` below and calculates the evaluation metrics.
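
The response formats listed above can be checked before scoring. The following is a minimal sketch and not part of the original evaluation code: the function name `is_well_formed` and the task keys are hypothetical, the check is case-insensitive, and the CVSS pattern covers only the eight base metrics of a v3 vector without the `CVSS:3.x/` prefix.

```python
import re

# Hypothetical patterns for the response formats described above.
MCQ_PATTERN = re.compile(r"[ABCDX]")
CWE_PATTERN = re.compile(r"CWE-\d+")
# CVSS v3 base metrics only, without the "CVSS:3.x/" prefix.
CVSS3_BASE_PATTERN = re.compile(
    r"AV:[NALP]/AC:[LH]/PR:[NLH]/UI:[NR]/S:[UC]/C:[NLH]/I:[NLH]/A:[NLH]"
)

PATTERNS = {"mcq": MCQ_PATTERN, "rcm": CWE_PATTERN, "vsp": CVSS3_BASE_PATTERN}


def is_well_formed(task, response):
    """Return True if `response` matches the expected format for `task`."""
    if not isinstance(response, str):
        return False  # e.g. NaN read from an empty spreadsheet cell
    text = response.strip().upper()
    return PATTERNS[task].fullmatch(text) is not None


print(is_well_formed("mcq", "B"))
print(is_well_formed("rcm", "cwe-79"))
print(is_well_formed("vsp", "AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))
print(is_well_formed("rcm", "Improper Input Validation"))
```

The threat actor names of the CTI-TAA task are free text, so no pattern is given for them here.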

Libraries required:
- pandas
- cvss (https://pypi.org/project/cvss/)
- matplotlib

In [1]:
import pandas as pd
from cvss import CVSS3
import pickle

In [12]:
model_name = 'LLAMA3-70B'  # corresponds to the column name in the response sheet
model_name2 = 'LLAMA3-8B'  # corresponds to the column name in the response sheet

### CTI-MCQ evaluation

In [52]:
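def compute_mcq_accuracy(fname, col):
    """Return the MCQ accuracy (in percent) for the model column `col`.

    Rows whose response is not one of A, B, C, D, or X are reported and
    excluded from the denominator.
    """
    df = pd.read_csv(fname, sep='\t')
    correct = 0
    total = 0
    for idx, row in df.iterrows():
        pred = row[col]
        gt = row['GT'].upper()
        if pred in ['A', 'B', 'C', 'D', 'X']:
            total += 1
        else:
            print('Invalid response at row {}'.format(idx+1))
        if pred == gt:
            correct += 1
    return correct/total*100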

In [54]:
print('Accuracy:', compute_mcq_accuracy('responses/EvaluationFinale_MCQ.tsv', model_name))

Invalid response at row 1664
Invalid response at row 1694
Invalid response at row 2020
Accuracy: 63.3809758501725
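
`compute_mcq_accuracy` divides by the number of valid responses only, so rows flagged as invalid do not lower the score. As a sketch of how that choice affects the result, the hypothetical helper below works on plain lists instead of the TSV file and computes accuracy with either denominator. The toy predictions are made up for illustration.

```python
VALID_LETTERS = {"A", "B", "C", "D", "X"}


def mcq_accuracy(preds, gts, count_invalid=False):
    """Accuracy in percent; invalid predictions are excluded from the
    denominator unless count_invalid is True."""
    correct = sum(p == g.upper() for p, g in zip(preds, gts))
    if count_invalid:
        total = len(preds)
    else:
        total = sum(p in VALID_LETTERS for p in preds)
    if total == 0:
        raise ValueError("no responses to score")
    return correct / total * 100


# Toy data (hypothetical): the third prediction is not a valid letter.
preds = ["A", "B", "I cannot answer", "D"]
gts = ["A", "C", "B", "D"]
print(mcq_accuracy(preds, gts))                      # 2 correct out of 3 valid responses
print(mcq_accuracy(preds, gts, count_invalid=True))  # 2 correct out of 4 rows
```

With only three invalid rows reported in the run above, the two denominators should differ very little for this model.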


### CTI-RCM evaluation

In [64]:
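def compute_rcm_accuracy(fname, col):
    """Return the RCM accuracy (in percent) for the model column `col`.

    A response counts as valid if it starts with 'CWE-' (case-insensitive);
    other rows are reported and excluded from the denominator.
    """
    df = pd.read_csv(fname, sep='\t')
    correct = 0
    total = 0
    for idx, row in df.iterrows():
        pred = row[col].upper()
        gt = row['GT'].upper()
        if pred.startswith('CWE-'):
            total += 1
        else:
            print('Invalid response at row {}'.format(idx+1))
        if pred == gt:
            correct += 1
    return correct/total*100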

In [66]:
print('Accuracy:', compute_rcm_accuracy('responses/EvaluationFinale_RCM.csv', model_name))

KeyError: 'LLAMA3-70B'

In [62]:
print('Accuracy:', compute_rcm_accuracy('responses/EvaluationFinale_RCM.csv', model_name2))

KeyError: 'LLAMA3-8B'
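
A `KeyError` on the model name means pandas did not find a column with that label after parsing the file. One way to narrow this down is to look at the header as it is parsed with the delimiter the function uses. The sketch below relies only on the standard library; `header_columns` is a hypothetical helper, and the in-memory sample stands in for the real file, whose contents are not shown here.

```python
import csv
import io


def header_columns(handle, delimiter="\t"):
    """Return the first row of a delimited text stream as a list of column names."""
    reader = csv.reader(handle, delimiter=delimiter)
    return next(reader)


# Hypothetical sample: a comma-separated header read with a tab delimiter
# collapses into a single column, so no per-model column can be found.
sample = "GT,LLAMA3-70B,LLAMA3-8B\nCWE-79,CWE-79,CWE-20\n"
print(header_columns(io.StringIO(sample), delimiter="\t"))
print(header_columns(io.StringIO(sample), delimiter=","))
```

For a file on disk, pass `open(path, newline='')` as the handle. A mismatched delimiter is only one possible cause; a misspelled column label or a different file produces the same error.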

### CTI-VSP evaluation

In [144]:
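def get_cvss_score(cvss_vector):
    """Return the base score of a CVSS v3 vector string."""
    c = CVSS3(cvss_vector)
    cvss_score = c.scores()[0]  # scores() returns (base, temporal, environmental)
    return cvss_score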

In [146]:
def compute_vsp_mad(fname, col):
    """Return the mean absolute deviation between predicted and ground-truth
    CVSS v3 base scores; rows that fail to parse are reported and skipped."""
    cvss_prefix = 'CVSS:3.0/'   # should be empty string if the model responds with the prefix
    df = pd.read_csv(fname, sep='\t')
    error = 0
    total = 0
    for idx, row in df.iterrows():
        pred = row[col].upper()
        gt = row['GT'].upper()
        try:
            pred_vector = cvss_prefix + pred
            pred_score = get_cvss_score(pred_vector)
            gt_score = get_cvss_score(gt)
            error += abs(pred_score-gt_score)
        except Exception as e:
            print('Invalid response at row {}'.format(idx+1))
            print(e)
            continue
        total += 1

    return error/total

In [148]:
print('Mean Absolute Deviation:', compute_vsp_mad('responses/EvaluationFinale_VSP.tsv', model_name))

Mean Absolute Deviation: 1.3100000000000027
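
The comment on `cvss_prefix` notes that the prefix must be set to an empty string when the model already includes it. A sketch of handling both cases per response, together with the metric itself computed on already-parsed scores, might look like this. Both helper names are hypothetical, and the scores are made-up numbers rather than output of the `cvss` library.

```python
def with_cvss_prefix(vector, prefix="CVSS:3.0/"):
    """Prepend the prefix only when the vector does not already carry one."""
    vector = vector.strip().upper()
    if vector.startswith("CVSS:"):
        return vector
    return prefix + vector


def mean_absolute_deviation(score_pairs):
    """Mean of |predicted - ground truth| over (predicted, ground_truth) pairs."""
    if not score_pairs:
        raise ValueError("no valid score pairs")
    return sum(abs(p - g) for p, g in score_pairs) / len(score_pairs)


print(with_cvss_prefix("AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))
print(with_cvss_prefix("cvss:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))
# Hypothetical (predicted, ground-truth) base scores.
print(mean_absolute_deviation([(9.8, 7.5), (5.0, 5.0), (4.0, 6.0)]))
```

As in `compute_vsp_mad`, only pairs where both vectors parse should reach the averaging step, so the denominator is the number of valid pairs rather than the number of rows.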


### CTI-TAA evaluation

In [32]:
with open('alias_dict_NEW.pickle', 'rb') as handle:
    alias_dict = pickle.load(handle)

In [34]:
with open('related_dict.pickle', 'rb') as handle:
    related_dict = pickle.load(handle)

In [36]:
def threat_actor_connection(actor1, actor2, alias_dict, related_dict):
    """
    Determines the connection type between two threat actors based on alias and related group information.

    Args:
        actor1: The first threat actor.
        actor2: The second threat actor.
        alias_dict: A dictionary where keys are threat actors and values are lists of their aliases.
        related_dict: A dictionary where keys are threat actors and values are lists of related threat actors.

    Returns:
        "C" if the actors are connected via an alias chain.
        "P" if the actors are connected via a related group chain.
        "I" if no connection is found.
    """

    actor1 = actor1.strip().lower()
    actor2 = actor2.strip().lower()

    # Normalize dictionaries and ensure bidirectional alias relationships
    alias_dict = {k.strip().lower(): [v.strip().lower() for v in val] for k, val in alias_dict.items()}
    for actor in list(alias_dict): # Iterate over a copy of the keys
        aliases = alias_dict[actor]
        for alias in aliases:
            if actor not in alias_dict.setdefault(alias, []):  # Avoid duplicates
                alias_dict[alias].append(actor)

    related_dict = {k.strip().lower(): [v.strip().lower() for v in val] for k, val in related_dict.items()}
    for actor in list(related_dict):  # Iterate over a copy of the keys
        related_groups = related_dict[actor]
        for related_actor in related_groups:
            if actor not in related_dict.setdefault(related_actor, []):
                related_dict[related_actor].append(actor)

    if is_alias_connected(actor1, actor2, alias_dict):
        return "C"

    if is_related_connected(actor1, actor2, alias_dict, related_dict):
        return "P"

    return "I"


from collections import deque


def is_alias_connected(actor1, actor2, alias_dict):
    """
    Checks if two actors are connected through an alias chain using Breadth First Search (BFS).
    """
    visited = {actor1}
    queue = deque([actor1])

    while queue:
        current_actor = queue.popleft()

        for alias in alias_dict.get(current_actor, []):
            if alias == actor2:
                return True
            if alias not in visited:
                visited.add(alias)  # mark on enqueue so each actor is queued once
                queue.append(alias)

    return False


def is_related_connected(actor1, actor2, alias_dict, related_dict):
    """
    Checks if two actors are connected through a chain of aliases and related groups using BFS.
    """
    visited = {actor1}
    queue = deque([actor1])

    while queue:
        current_actor = queue.popleft()

        # Follow alias edges first, then related-group edges.
        for alias in alias_dict.get(current_actor, []):
            if alias == actor2:
                return True
            if alias not in visited:
                visited.add(alias)
                queue.append(alias)

        for related_actor in related_dict.get(current_actor, []):
            if related_actor == actor2:
                return True
            if related_actor not in visited:
                visited.add(related_actor)
                queue.append(related_actor)

    return False

In [38]:
def compute_taa_accuracy(fname, col):
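    """
    Returns a tuple (correct accuracy, plausible accuracy) in percent.
    Plausible accuracy counts both correct ('C') and plausible ('P') predictions.
    """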
    df = pd.read_csv(fname, sep='\t')
    correct = 0
    plausible = 0
    total = 0
    
    for idx, row in df.iterrows():
        pred = row[col].lower().strip()
        gt = row['GT'].lower().strip()
        
        res = threat_actor_connection(gt, pred, alias_dict, related_dict)

        if res == 'C':
            correct += 1
        elif res == 'P':
            plausible += 1
        total += 1

    return correct/total*100, (correct+plausible)/total*100

In [42]:
print('Correct & Plausible Accuracy [LLAMA 70B]:', compute_taa_accuracy('responses/EvaluationFinale_TAA.tsv', model_name))
print('Correct & Plausible Accuracy [LLAMA 8B]:', compute_taa_accuracy('responses/EvaluationFinale_TAA.tsv', model_name2))

Correct & Plausible Accuracy [LLAMA 70B]: (48.64864864864865, 62.16216216216216)
Correct & Plausible Accuracy [LLAMA 8B]: (20.27027027027027, 29.72972972972973)
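
To make the three labels concrete, here is a compact, self-contained sketch of the same idea on toy data: alias edges alone decide "C", alias plus related-group edges decide "P", and anything else is "I". The helper names (`build_graph`, `reachable`, `classify`) and the actor names are hypothetical and are not taken from the pickled dictionaries.

```python
from collections import deque


def build_graph(mapping):
    """Lower-case all names and make every edge bidirectional."""
    graph = {}
    for name, others in mapping.items():
        a = name.strip().lower()
        graph.setdefault(a, set())
        for other in others:
            b = other.strip().lower()
            graph[a].add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def reachable(start, goal, *graphs):
    """BFS over the union of the given graphs; True if goal is reached via an edge."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for graph in graphs:
            for nxt in graph.get(node, ()):
                if nxt == goal:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False


def classify(gt, pred, aliases, related):
    """Return 'C' (alias chain), 'P' (alias/related chain) or 'I' (no connection)."""
    gt, pred = gt.strip().lower(), pred.strip().lower()
    if reachable(gt, pred, aliases):
        return "C"
    if reachable(gt, pred, aliases, related):
        return "P"
    return "I"


# Hypothetical toy data.
aliases = build_graph({"Actor A": ["Actor A-alias"]})
related = build_graph({"Actor A": ["Actor B"]})

print(classify("Actor A", "actor a-alias", aliases, related))  # C
print(classify("Actor A", "Actor B", aliases, related))        # P
print(classify("Actor A", "Actor Z", aliases, related))        # I
```

One property worth knowing: in this sketch, as in the BFS functions above, a match is only detected when the target is reached through an edge. A prediction identical to the ground truth is therefore labeled "C" only if that name has at least one alias entry; `classify("Actor Z", "Actor Z", aliases, related)` returns "I".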


In [161]:
import matplotlib.pyplot as plt

In [262]:
models = ["ChatGPT-4","ChatGPT-3.5","Gemini-1.5","LLAMA3-70B","LLAMA3-8B"]

In [264]:
mcq_accuracy = [compute_mcq_accuracy('responses/EvaluationFinale_MCQ.tsv', i) for i in models]
rcm_accuracy = [compute_rcm_accuracy('responses/EvaluationFinale_RCM.tsv', i) for i in models]
vsp_mad = [compute_vsp_mad('responses/EvaluationFinale_VSP.tsv', i) for i in models]
taa_accuracy = [compute_taa_accuracy('responses/EvaluationFinale_TAA.tsv', i) for i in models]
for i, model in enumerate(models):
    print(f"{model}: MCQ = {mcq_accuracy[i]:.2f}, RCM = {rcm_accuracy[i]:.2f}, "
          f"VSP = {vsp_mad[i]:.2f}, TAA (Correct) = {taa_accuracy[i][0]:.2f}, "
          f"TAA (Plausible) = {taa_accuracy[i][1]:.2f}")

Invalid response at row 49
Invalid response at row 89
Invalid response at row 139
Invalid response at row 140
Invalid response at row 163
Invalid response at row 167
Invalid response at row 184
Invalid response at row 197
Invalid response at row 204
Invalid response at row 214
Invalid response at row 215
Invalid response at row 218
Invalid response at row 230
Invalid response at row 250
Invalid response at row 280
Invalid response at row 283
Invalid response at row 286
Invalid response at row 301
Invalid response at row 306
Invalid response at row 310
Invalid response at row 320
Invalid response at row 325
Invalid response at row 327
Invalid response at row 328
Invalid response at row 334
Invalid response at row 337
Invalid response at row 339
Invalid response at row 363
Invalid response at row 369
Invalid response at row 380
Invalid response at row 395
Invalid response at row 402
Invalid response at row 425
Invalid response at row 434
Invalid response at row 447
Invalid response at ro