# Preliminary BERT results, 19 April 2023

This notebook contains the experiment that was presented at the 19 April 2023 participant sync. The first few cells apply basic transformations that prepare the data for the model.

In [1]:
import json
data_path = '/projects/TRAM2023/tram-private/data/training/refreshed_dataset_march_2023.json'
with open(data_path) as f:
    data = json.load(f)
    
import pandas as pd
raw = pd.DataFrame(data['sentences'])
raw

,text,order,disposition,mappings,annotator,reference
0,has overwritten the function pointer in the ex...,,accept,[{'technique_name': 'Extra Window Memory Injec...,"[{'organization_name': 'MITRE', 'annotator_nam...",[[{'url': 'https://recon.cx/2018/brussels/reso...
1,overwrites Explorers Shell_TrayWnd extra windo...,,accept,[{'technique_name': 'Extra Window Memory Injec...,"[{'organization_name': 'MITRE', 'annotator_nam...",[[{'url': 'https://www.malwaretech.com/2013/08...
2,has used scheduled tasks to maintain persistence.,,accept,"[{'technique_name': 'Scheduled Task', 'attack_...","[{'organization_name': 'MITRE', 'annotator_nam...",[[{'url': 'https://www.microsoft.com/security/...
3,has the ability to launch scheduled tasks to e...,,accept,"[{'technique_name': 'Scheduled Task', 'attack_...","[{'organization_name': 'MITRE', 'annotator_nam...",[[{'url': 'https://www.crowdstrike.com/blog/ca...
4,has used scheduled tasks to maintain persistence.,,accept,"[{'technique_name': 'Scheduled Task', 'attack_...","[{'organization_name': 'MITRE', 'annotator_nam...",[[{'url': 'https://go.crowdstrike.com/rs/281-O...
...,...,...,...,...,...,...
24599,"""My God"" was one of the first songs recorded b...",12583,accept,[],"[{'organization_name': 'unknown', 'annotator_n...",[{'url': 'https://github.com/center-for-threat...
24600,It initially had seven students.,12584,accept,[],"[{'organization_name': 'unknown', 'annotator_n...",[{'url': 'https://github.com/center-for-threat...
24601,Vellarikundu is a hillside town and taluk head...,12585,accept,[],"[{'organization_name': 'unknown', 'annotator_n...",[{'url': 'https://github.com/center-for-threat...
24602,This earned the score a parental advisory warn...,12586,accept,[],"[{'organization_name': 'unknown', 'annotator_n...",[{'url': 'https://github.com/center-for-threat...


In [2]:
mappings = raw['mappings'].explode().dropna().apply(pd.Series)
mappings

,technique_name,attack_id,confidence
0,Extra Window Memory Injection,T1055.011,100.0
1,Extra Window Memory Injection,T1055.011,100.0
2,Scheduled Task,T1053.005,100.0
3,Scheduled Task,T1053.005,100.0
4,Scheduled Task,T1053.005,100.0
...,...,...,...
13536,Emond,T1546.014,100.0
13537,Control Panel,T1218.002,100.0
13538,Control Panel,T1218.002,100.0
13539,Application Shimming,T1546.011,100.0


In [3]:
df = pd.concat((raw['text'], mappings['attack_id'].str.extract(r"(?P<attack_id>T\d+)\.(?P<subclass_id>\d+)")), axis=1)
df

,text,attack_id,subclass_id
0,has overwritten the function pointer in the ex...,T1055,011
1,overwrites Explorers Shell_TrayWnd extra windo...,T1055,011
2,has used scheduled tasks to maintain persistence.,T1053,005
3,has the ability to launch scheduled tasks to e...,T1053,005
4,has used scheduled tasks to maintain persistence.,T1053,005
...,...,...,...
24599,"""My God"" was one of the first songs recorded b...",,
24600,It initially had seven students.,,
24601,Vellarikundu is a hillside town and taluk head...,,
24602,This earned the score a parental advisory warn...,,
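
One property of the pattern in cell 3 is worth spelling out. Because the `\.` and the sub-technique digits are both mandatory, an ID with no sub-technique suffix (for example `T1041`) does not match, and `str.extract` returns NaN for it, exactly as it does for text with no label at all. If the dataset contains such IDs, those rows would be indistinguishable from unlabeled rows in `df`. The sketch below illustrates the behaviour on a toy series and contrasts it with a variant whose suffix is optional; the variant and the names `ids`, `strict`, and `lenient` are assumptions for illustration and were not part of the experiment.

```python
import pandas as pd

# Toy IDs: one with a sub-technique suffix, one without, and one missing label.
ids = pd.Series(["T1055.011", "T1041", None])

# Pattern from cell 3: the ".NNN" suffix is required for a match.
strict = ids.str.extract(r"(?P<attack_id>T\d+)\.(?P<subclass_id>\d+)")

# Hypothetical variant: the ".NNN" suffix is optional.
lenient = ids.str.extract(r"(?P<attack_id>T\d+)(?:\.(?P<subclass_id>\d+))?")

print(strict)   # the "T1041" row has no attack_id here
print(lenient)  # the "T1041" row keeps its attack_id; only subclass_id is missing
```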


In [4]:
df['attack_id'].value_counts(dropna=False)

attack_id
NaN      17932
T1059      706
T1071      397
T1070      368
T1547      337
         ...  
T1011        1
T1499        1
T1216        1
T1597        1
T1601        1
Name: count, Length: 89, dtype: int64

Here we see that text segments without a technique label (NaN, 17,932 rows) far outnumber any individual technique; the most frequent technique, T1059, has 706 instances. We will include 1000 randomly sampled unlabeled segments in the data for this experiment, along with all instances of the following ATT&CK techniques:

'T1041', 'T1106', 'T1082', 'T1033', 'T1112', 'T1070', 'T1090', 'T1021', 'T1218', 'T1095', 'T1548', 'T1053', 'T1071', 'T1574', 'T1562', 'T1204', 'T1012', 'T1140', 'T1055', 'T1105', 'T1552', 'T1486', 'T1083', 'T1078', 'T1047', 'T1190', 'T1543', 'T1113', 'T1003', 'T1059', 'T1057', 'T1027', 'T1219', 'T1036', 'T1005'

Note that not all of these techniques appear in the processed data (18 of the 35 do, as the label mapping further down shows), but for each one that appears, we use every instance.

In [5]:
classes_of_interest = ['T1041', 'T1106', 'T1082', 'T1033', 'T1112', 'T1070', 'T1090', 'T1021', 'T1218', 'T1095', 'T1548', 'T1053', 'T1071', 'T1574', 'T1562', 'T1204', 'T1012', 'T1140', 'T1055', 'T1105', 'T1552', 'T1486', 'T1083', 'T1078', 'T1047', 'T1190', 'T1543', 'T1113', 'T1003', 'T1059', 'T1057', 'T1027', 'T1219', 'T1036', 'T1005']
positive_data = df[df['attack_id'].isin(classes_of_interest)]
negative_data = df[df['attack_id'].isna()].sample(1000).fillna('none')
data = pd.concat((positive_data, negative_data))
data

,text,attack_id,subclass_id
0,has overwritten the function pointer in the ex...,T1055,011
1,overwrites Explorers Shell_TrayWnd extra windo...,T1055,011
2,has used scheduled tasks to maintain persistence.,T1053,005
3,has the ability to launch scheduled tasks to e...,T1053,005
4,has used scheduled tasks to maintain persistence.,T1053,005
...,...,...,...
23524,"In 2017, Waterhouse shifted to Australian Rule...",none,none
21207,The district was formed in 1973 under the Loca...,none,none
6738,has a command to list the victim's processes.,none,none
19345,"Neophile or Neophiliac, a term popularized by ...",none,none
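
Because `sample(1000)` in cell 5 is called without a seed, each run of the notebook draws a different set of negatives. A minimal sketch of a repeatable draw follows, using a toy frame; the seed value `0` and the names `toy`, `negatives`, and `again` are assumptions for illustration.

```python
import pandas as pd

# Toy frame: two labeled rows and eight unlabeled ones.
toy = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(10)],
    "attack_id": ["T1055", "T1053"] + [None] * 8,
})

unlabeled = toy[toy["attack_id"].isna()]

# Passing random_state makes the draw identical across runs.
negatives = unlabeled.sample(3, random_state=0).fillna("none")
again = unlabeled.sample(3, random_state=0).fillna("none")

print(negatives)
print(negatives.equals(again))
```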


For this preliminary experiment, we will use the model `bert-base-cased`, the cased variant of the BERT-base model introduced in the original BERT paper by Devlin et al.

In [6]:
import transformers
import torch

cuda = torch.device('cuda')
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-cased', max_length=512)

In [7]:
x_tokens = tokenizer(data['text'].tolist(), return_tensors='pt', padding='max_length', truncation=True, max_length=512).input_ids
x_tokens

tensor([[  101,  1144,  1166,  ...,     0,     0,     0],
        [  101,  1166,  2246,  ...,     0,     0,     0],
        [  101,  1144,  1215,  ...,     0,     0,     0],
        ...,
        [  101,  1144,   170,  ...,     0,     0,     0],
        [  101, 14521, 27008,  ...,     0,     0,     0],
        [  101,  1109,  1419,  ...,     0,     0,     0]])

In [8]:
x_tokens.shape

torch.Size([4527, 512])

In [9]:
index_to_label = dict(enumerate(data['attack_id'].unique()))
index_to_label

{0: 'T1055',
 1: 'T1053',
 2: 'T1021',
 3: 'T1218',
 4: 'T1027',
 5: 'T1574',
 6: 'T1059',
 7: 'T1036',
 8: 'T1548',
 9: 'T1003',
 10: 'T1071',
 11: 'T1552',
 12: 'T1204',
 13: 'T1562',
 14: 'T1543',
 15: 'T1070',
 16: 'T1078',
 17: 'T1090',
 18: 'none'}

In [10]:
label_to_index = {label: index for index, label in index_to_label.items()}
label_to_index

{'T1055': 0,
 'T1053': 1,
 'T1021': 2,
 'T1218': 3,
 'T1027': 4,
 'T1574': 5,
 'T1059': 6,
 'T1036': 7,
 'T1548': 8,
 'T1003': 9,
 'T1071': 10,
 'T1552': 11,
 'T1204': 12,
 'T1562': 13,
 'T1543': 14,
 'T1070': 15,
 'T1078': 16,
 'T1090': 17,
 'none': 18}

In [11]:
y_all = torch.tensor(data['attack_id'].map(label_to_index).to_numpy(), dtype=torch.long)
y_all

tensor([ 0,  0,  1,  ..., 18, 18, 18])

We split the data 80/20 between train and test.

In [12]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_tokens, y_all, test_size=0.2, shuffle=True)

def _load_data(x, y, batch_size=10):
    x_len, y_len = x.shape[0], y.shape[0]
    assert x_len == y_len
    for i in range(0, x_len, batch_size):
        slc = slice(i, i + batch_size)
        yield x[slc].to(cuda), y[slc].to(cuda)
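
Because the classes differ widely in size, a purely random split can leave a small class unevenly represented between train and test, and an unseeded split changes from run to run. The sketch below shows a seeded, stratified variant of the 80/20 split on toy NumPy arrays. `stratify` and `random_state` are existing `train_test_split` parameters, but using them here is an assumption; the experiment itself used the plain shuffled split shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for x_tokens and y_all: 20 instances, 2 balanced classes.
x_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy,
    test_size=0.2,
    shuffle=True,
    stratify=y_toy,   # keep class proportions the same in both splits
    random_state=0,   # fixed seed so the split is repeatable
)

print(x_tr.shape, x_te.shape, sorted(y_te.tolist()))
```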

Each token-ID vector is padded with trailing zeros so that every vector has the same length, 512. In the expression `bert(x, attention_mask=x.ne(0).to(int), labels=y)` used in the training loop below, the attention mask tells the model to attend only to the non-zero (non-padding) positions. (`ne` means "not equal", so `x.ne(0).to(int)` returns a tensor with 1 at the non-zero elements of `x` and 0 elsewhere.)
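
To make the mask construction concrete, here is a small sketch on a toy batch. The token IDs are illustrative only (101 and 102 are the `[CLS]` and `[SEP]` IDs in the BERT vocabulary, and 0 is padding); the explicit `torch.long` dtype is used here in place of `int`. The tokenizer call in cell 7 also returns an `attention_mask` field that should carry the same information, so recomputing it from the IDs is a convenience rather than a necessity.

```python
import torch

# Toy batch of token IDs, right-padded with zeros to a common length of 6.
x = torch.tensor([
    [101, 1144, 1215, 102, 0, 0],
    [101, 1109, 102, 0, 0, 0],
])

# 1 where the position holds a real token, 0 where it holds padding.
attention_mask = x.ne(0).to(torch.long)

print(attention_mask)
```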

We train for five epochs with the AdamW optimizer, using a learning rate of $2 \times 10^{-5}$, $\epsilon = 1 \times 10^{-8}$, and a batch size of 10.

Note that the loss is (rather unusually) computed inside `bert` itself, whenever `labels` is passed to the forward call. According to the source code of `transformers.BertForSequenceClassification`, the loss used in this experiment is cross-entropy. (My reading of the source code also showed that `BertForSequenceClassification` supports multi-label classification, which is good news, as we expect to do that later in the project.)
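
For reference, the cross-entropy loss for one instance with logits $z$ and true class $c$ is $-\log \frac{e^{z_c}}{\sum_j e^{z_j}}$, averaged over the batch. The sketch below checks this on toy values by comparing PyTorch's built-in function against a manual computation; the logits and labels are made up for illustration and do not come from the model.

```python
import torch
import torch.nn.functional as F

# Toy logits for 2 instances over 3 classes, with their true class indices.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])

# Library version: mean cross-entropy over the batch.
loss = F.cross_entropy(logits, labels)

# Manual version: negative log-softmax probability of the true class, averaged.
log_probs = torch.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(loss.item(), manual.item())
```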

In [13]:
from statistics import mean

from tqdm import tqdm
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.optim import AdamW

bert = transformers.BertForSequenceClassification.from_pretrained(
    'bert-base-cased',
    num_labels=data['attack_id'].nunique(),
    output_attentions=False, 
    output_hidden_states=False,
)
bert.train().to(cuda)
optim = AdamW(bert.parameters(), lr=2e-5, eps=1e-8)

for epoch in range(5):
    epoch_losses = []
    for x, y in tqdm(_load_data(x_train, y_train, batch_size=10)):
        bert.zero_grad()
        out = bert(x, attention_mask=x.ne(0).to(int), labels=y)
        epoch_losses.append(out.loss.item())
        out.loss.backward()
        optim.step()
    print(f"epoch {epoch} loss: {mean(epoch_losses)}")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

epoch 0 loss: 1.6576010820868916
363it [01:54,  3.16it/s]
epoch 1 loss: 0.5818408697474101
363it [01:57,  3.10it/s]
epoch 2 loss: 0.2602325600902896
363it [01:57,  3.09it/s]
epoch 3 loss: 0.12686511959377728
363it [01:57,  3.09it/s]
epoch 4 loss: 0.07342217404128516

In [14]:
from sklearn.metrics import precision_recall_fscore_support as calculate_score

bert.eval()

batch_size = 20
preds = []

with torch.no_grad():
    for i in range(0, x_test.shape[0], batch_size):
        x = x_test[i : i + batch_size].to(cuda)
        out = bert(x, attention_mask=x.ne(0).to(int))
        preds.extend(out.logits.argmax(-1).to('cpu').numpy())

As we are doing single-label classification, the prediction is considered to be the argmax of the logits.
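
Taking the argmax directly on the logits is sufficient because softmax is monotonic: converting logits to probabilities does not change which class scores highest. A brief sketch with made-up logits:

```python
import torch

# Toy logits for 2 instances over 3 classes.
logits = torch.tensor([[0.3, 2.1, -0.7],
                       [1.5, 0.2, 0.9]])

# Softmax rescales the scores into probabilities but preserves their order.
probs = torch.softmax(logits, dim=-1)

preds_from_logits = logits.argmax(-1)
preds_from_probs = probs.argmax(-1)

print(preds_from_logits.tolist(), preds_from_probs.tolist())
```

The predicted indices can then be mapped back to technique IDs with `index_to_label`, as the next cell does.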

In [30]:
def calculate_scores_df(actual: list[str], predicted: list[str]):
    scores = calculate_score(actual, predicted)
    scores_df = pd.DataFrame(scores).T
    scores_df.columns = ['P', 'R', 'F1', '#']
    scores_df.index = sorted(set(actual) | set(predicted))
    scores_df.loc['(micro)'] = calculate_score(actual, predicted, average='micro')
    scores_df.loc['(macro)'] = calculate_score(actual, predicted, average='macro')
    return scores_df

y_test_list = pd.Series(y_test.tolist()).replace(index_to_label)
preds_list = pd.Series(preds).replace(index_to_label)

results = calculate_scores_df(y_test_list, preds_list)
results

,P,R,F1,#
T1003,0.90625,0.828571,0.865672,35.0
T1021,0.913043,0.75,0.823529,28.0
T1027,0.857143,0.789474,0.821918,38.0
T1036,0.925,0.925,0.925,40.0
T1053,0.939394,0.96875,0.953846,32.0
T1055,0.888889,0.888889,0.888889,27.0
T1059,0.917647,0.981132,0.948328,159.0
T1070,0.9625,0.9625,0.9625,80.0
T1071,0.945946,0.958904,0.952381,73.0
T1078,0.7,0.875,0.777778,8.0
