# SciBERT for Single-Label Classification

[![image](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/center-for-threat-informed-defense/tram/blob/main/user_notebooks/fine_tune_single_label.ipynb)

This notebook lets you continue fine-tuning our provided SciBERT model for single-label sequence classification on custom data.

In [46]:
!mkdir -p scibert_single_label_model
!wget https://ctidtram.blob.core.windows.net/tram-models/single-label-202308303/config.json -O scibert_single_label_model/config.json
!wget https://ctidtram.blob.core.windows.net/tram-models/single-label-202308303/pytorch_model.bin -O scibert_single_label_model/pytorch_model.bin
!pip install torch transformers pandas scikit-learn tqdm

--2025-11-19 13:04:16--  https://ctidtram.blob.core.windows.net/tram-models/single-label-202308303/config.json
Resolving ctidtram.blob.core.windows.net (ctidtram.blob.core.windows.net)... 57.150.154.65
Connecting to ctidtram.blob.core.windows.net (ctidtram.blob.core.windows.net)|57.150.154.65|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2806 (2.7K) [application/json]
Saving to: ‘scibert_single_label_model/config.json’


2025-11-19 13:04:16 (1.37 GB/s) - ‘scibert_single_label_model/config.json’ saved [2806/2806]

--2025-11-19 13:04:16--  https://ctidtram.blob.core.windows.net/tram-models/single-label-202308303/pytorch_model.bin
Resolving ctidtram.blob.core.windows.net (ctidtram.blob.core.windows.net)... 57.150.154.65
Connecting to ctidtram.blob.core.windows.net (ctidtram.blob.core.windows.net)|57.150.154.65|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 439900

This cell instantiates the label encoder. Do not modify it: the classes (i.e., ATT&CK techniques) and their order must match what the model expects.

In [47]:
from sklearn.preprocessing import OneHotEncoder as OHE

CLASSES = [
   'T1003.001', 'T1005', 'T1012', 'T1016', 'T1021.001', 'T1027',
   'T1033', 'T1036.005', 'T1041', 'T1047', 'T1053.005', 'T1055',
   'T1056.001', 'T1057', 'T1059.003', 'T1068', 'T1070.004',
   'T1071.001', 'T1072', 'T1074.001', 'T1078', 'T1082', 'T1083',
   'T1090', 'T1095', 'T1105', 'T1106', 'T1110', 'T1112', 'T1113',
   'T1140', 'T1190', 'T1204.002', 'T1210', 'T1218.011', 'T1219',
   'T1484.001', 'T1518.001', 'T1543.003', 'T1547.001', 'T1548.002',
   'T1552.001', 'T1557.001', 'T1562.001', 'T1564.001', 'T1566.001',
   'T1569.002', 'T1570', 'T1573.001', 'T1574.002'
]

encoder = OHE(sparse_output=False)
encoder.fit([[c] for c in CLASSES])

encoder.categories_

[array(['T1003.001', 'T1005', 'T1012', 'T1016', 'T1021.001', 'T1027',
        'T1033', 'T1036.005', 'T1041', 'T1047', 'T1053.005', 'T1055',
        'T1056.001', 'T1057', 'T1059.003', 'T1068', 'T1070.004',
        'T1071.001', 'T1072', 'T1074.001', 'T1078', 'T1082', 'T1083',
        'T1090', 'T1095', 'T1105', 'T1106', 'T1110', 'T1112', 'T1113',
        'T1140', 'T1190', 'T1204.002', 'T1210', 'T1218.011', 'T1219',
        'T1484.001', 'T1518.001', 'T1543.003', 'T1547.001', 'T1548.002',
        'T1552.001', 'T1557.001', 'T1562.001', 'T1564.001', 'T1566.001',
        'T1569.002', 'T1570', 'T1573.001', 'T1574.002'], dtype=object)]
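
To illustrate why the class order matters, here is a minimal sketch (plain Python, not the scikit-learn encoder itself) of how a label maps to a one-hot column and back. The three-item `classes` list and the helper names `one_hot` and `decode` are illustrative only; the assumption is that column `i` corresponds to the `i`-th entry of the sorted class list, as the `encoder.categories_` output above shows.

```python
# Illustrative subset of the class list; the real list has 50 entries.
classes = sorted(['T1003.001', 'T1005', 'T1012'])

def one_hot(label, classes):
    """Return a one-hot row with a 1 at the label's position in `classes`."""
    if label not in classes:
        raise ValueError(f"unknown label: {label}")
    return [1.0 if c == label else 0.0 for c in classes]

def decode(row, classes):
    """Map a one-hot (or score) row back to the label with the largest value."""
    return classes[max(range(len(row)), key=row.__getitem__)]

row = one_hot('T1005', classes)
print(row)                   # [0.0, 1.0, 0.0]
print(decode(row, classes))  # T1005
```

If the columns were permuted, the same model output would decode to a different technique, which is why the encoder cell should stay unchanged.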

This cell loads the training data; modify it to load your own. By the end of the cell, the variable `data` must hold a DataFrame with a `text` column containing the segments and a `label` column containing one string per row, each an ATT&CK ID that this model can classify. How the DataFrame is indexed, and whether it has other columns, does not matter. The cells below read these two columns through the variables `TEXT_COL` and `LABEL_COL`, which the example sets to `input` and `output`.

For demonstration purposes, we will use the same single-label data that was produced during this TRAM effort, even though the model was trained on this data already. This cell is only present to show the expected format of the `data` DataFrame, and is not intended to be run as shown.
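
The format requirement described above can be checked before training. The following is a rough sketch that works on plain records rather than a DataFrame; the helper `find_bad_rows`, the sample rows, and the two-item `known` set are hypothetical, and the default column names `text` and `label` follow the description above.

```python
def find_bad_rows(records, known_labels, text_col='text', label_col='label'):
    """Return indices of rows whose text is not a non-empty string
    or whose label is not one of the known ATT&CK IDs."""
    bad = []
    for i, row in enumerate(records):
        text, label = row.get(text_col), row.get(label_col)
        text_ok = isinstance(text, str) and text.strip() != ''
        label_ok = isinstance(label, str) and label in known_labels
        if not (text_ok and label_ok):
            bad.append(i)
    return bad

known = {'T1005', 'T1057'}  # illustrative subset of CLASSES
records = [
    {'text': 'Collects files from the local system.', 'label': 'T1005'},
    {'text': '', 'label': 'T1057'},                           # empty text
    {'text': 'Lists running processes.', 'label': 'T9999'},   # unknown ID
]
print(find_bad_rows(records, known))  # [1, 2]
```

With a DataFrame, `data.to_dict('records')` yields the same row structure, and the column names can be passed through `text_col` and `label_col`.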

In [48]:
!wget https://raw.githubusercontent.com/hoangcuongnguyen2001/SciBERT-for-Technique-Classification/main/training_dataset.json

--2025-11-19 13:05:09--  https://raw.githubusercontent.com/hoangcuongnguyen2001/SciBERT-for-Technique-Classification/main/training_dataset.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3391311 (3.2M) [text/plain]
Saving to: ‘training_dataset.json.4’


2025-11-19 13:05:09 (83.6 MB/s) - ‘training_dataset.json.4’ saved [3391311/3391311]



In [49]:
import pandas as pd

# Read the training_dataset.json that was just downloaded into the current directory
data = pd.read_json('training_dataset.json')

# Drop the doc_title column if it exists; otherwise skip
if 'doc_title' in data.columns:
    data = data.drop(columns='doc_title')

data.head(5)

| | instruction | input | output |
|---|---|---|---|
| 0 | Detect the technique in MITRE ATT&CK framework. | TrickBot has used macros in Excel documents to... | T1059: Command and Scripting Interpreter |
| 1 | Detect the technique in MITRE ATT&CK framework. | SombRAT has the ability to use an embedded SOC... | T1090: Proxy |
| 2 | Detect the technique in MITRE ATT&CK framework. | Silent Librarian has exfiltrated entire mailbo... | T1114: Email Collection |
| 3 | Detect the technique in MITRE ATT&CK framework. | Azorult can collect a list of running processe... | T1057: Process Discovery |
| 4 | Detect the technique in MITRE ATT&CK framework. | SeaDuke is capable of executing commands. | T1059: Command and Scripting Interpreter |


In [50]:
# Keep only the ATT&CK ID that precedes the colon in the output column
def normalize_label(s):
    if isinstance(s, str):
        return s.split(':')[0].strip()
    return s

LABEL_COL = 'output'   # name of the label column; reused by later cells
data[LABEL_COL] = data[LABEL_COL].apply(normalize_label)

# Inspect the first few rows after cleaning
data[[LABEL_COL]].head()


| | output |
|---|---|
| 0 | T1059 |
| 1 | T1090 |
| 2 | T1114 |
| 3 | T1057 |
| 4 | T1059 |
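
The cleaning step keeps whatever precedes the colon, so a malformed value would pass through unchanged. As a hedged sketch, the variant below also checks the result against the usual ATT&CK ID shape (`T` plus four digits, optionally followed by a three-digit sub-technique suffix). The function name `extract_attack_id` and the regular expression are assumptions made for illustration, not part of the notebook.

```python
import re

# Assumed ID shape: 'T1059' or 'T1059.003'.
ATTACK_ID = re.compile(r'T\d{4}(?:\.\d{3})?')

def extract_attack_id(value):
    """Return the ATT&CK ID before the first colon, or None if it is malformed."""
    if not isinstance(value, str):
        return None
    candidate = value.split(':')[0].strip()
    return candidate if ATTACK_ID.fullmatch(candidate) else None

print(extract_attack_id('T1059: Command and Scripting Interpreter'))  # T1059
print(extract_attack_id('T1059.003: Windows Command Shell'))          # T1059.003
print(extract_attack_id('Command and Scripting Interpreter'))         # None
```

Rows that come back as `None` can then be inspected or dropped before the labels are compared with `CLASSES`.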


In [51]:
unknown = set(data[LABEL_COL]) - set(CLASSES)
print("unknown count:", len(unknown))
print(sorted(list(unknown))[:20])


unknown count: 164
['T1001', 'T1003', 'T1006', 'T1007', 'T1008', 'T1010', 'T1011', 'T1014', 'T1018', 'T1020', 'T1021', 'T1025', 'T1029', 'T1030', 'T1036', 'T1037', 'T1039', 'T1040', 'T1046', 'T1048']
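
Before dropping rows with unsupported labels, it can help to see how many rows each such label accounts for. Below is a minimal sketch using `collections.Counter`; the `labels` and `classes` values are made up for the example, and `count_unsupported` is a hypothetical helper.

```python
from collections import Counter

def count_unsupported(labels, classes):
    """Count rows per label that the model cannot classify, most frequent first."""
    known = set(classes)
    return Counter(l for l in labels if l not in known).most_common()

classes = ['T1057', 'T1090']                              # illustrative subset
labels = ['T1059', 'T1090', 'T1114', 'T1057', 'T1059']    # made-up label column
print(count_unsupported(labels, classes))  # [('T1059', 2), ('T1114', 1)]
```

In the example data, several unknown labels are parent techniques (such as `T1003` or `T1021`) whose sub-techniques (`T1003.001`, `T1021.001`) are in `CLASSES`, which likely accounts for part of the unknown count.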


In [52]:
import transformers
import torch

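# Use the GPU when one is available; otherwise fall back to the (much slower) CPU
cuda = torch.device('cuda' if torch.cuda.is_available() else 'cpu')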

tokenizer = transformers.BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased", max_length=512)
bert = transformers.BertForSequenceClassification.from_pretrained('scibert_single_label_model').to(cuda).train()

In [53]:
# Tell the later cells which column holds the text (the label column is LABEL_COL)
TEXT_COL = 'input'    # text column


In [54]:
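# This assumes the ID cleanup has already been done: data['output'] = data['output'].apply(normalize_label)

LABEL_COL = 'output'   # keep consistent with the earlier definition

# 1. Keep only rows whose label is in CLASSES
known_labels = set(CLASSES)
mask = data[LABEL_COL].isin(known_labels)

print("original sample count:", len(data))
print("retained sample count:", mask.sum())

data = data[mask].reset_index(drop=True)

# Inspect the label distribution after filtering
print("remaining distinct labels:", data[LABEL_COL].nunique())


original sample count: 14426
retained sample count: 4162
remaining distinct labels: 26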


In [55]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, shuffle=True)

def _load_data(x, y, batch_size=10):
    x_len, y_len = x.shape[0], y.shape[0]
    assert x_len == y_len
    for i in range(0, x_len, batch_size):
        slc = slice(i, i + batch_size)
        yield x[slc].to(cuda), y[slc].to(cuda)

def _tokenize(instances: list[str]):
    return tokenizer(instances, return_tensors='pt', padding='max_length', truncation=True, max_length=512).input_ids

def _encode_labels(labels):
    """labels: the label column of the DataFrame (a pandas Series)."""
    # Reshape the Series into a 2-D array of shape (n_samples, 1)
    labels_2d = labels.to_numpy().reshape(-1, 1)
    # One-hot encode with the fitted OneHotEncoder
    encoded = encoder.transform(labels_2d)
    return torch.Tensor(encoded)


In [56]:
x_train = _tokenize(train[TEXT_COL].tolist())
x_train

tensor([[  102,  1041,  6919,  ...,     0,     0,     0],
        [  102,   238,  3329,  ...,     0,     0,     0],
        [  102,  6493, 15700,  ...,     0,     0,     0],
        ...,
        [  102,  7897,   126,  ...,     0,     0,     0],
        [  102,  3581, 30137,  ...,     0,     0,     0],
        [  102,  7940,  6236,  ...,     0,     0,     0]])
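
The trailing zeros in each row are padding. The training loop later derives the attention mask by comparing each token ID with the pad token ID; the sketch below shows the same idea on plain lists, assuming a pad ID of 0 (as the output above suggests) and made-up short sequences.

```python
PAD_ID = 0  # assumed pad token ID, matching the trailing zeros above

def attention_mask(input_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    return [[int(tok != pad_id) for tok in row] for row in input_ids]

batch = [
    [102, 1041, 6919, 103, 0, 0],   # illustrative IDs, padded to length 6
    [102, 238, 103, 0, 0, 0],
]
print(attention_mask(batch))
# [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]]
```

With the mask in place, the model ignores the padded positions, so padding every segment to 512 tokens does not change what it attends to.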

In [57]:
y_train = _encode_labels(train[LABEL_COL])

The one-hot label matrix is mostly zeros, so it may appear empty when displayed. Its sum equals the number of training rows, which is consistent with one `1` per row.

In [58]:
y_train.sum()

tensor(3329.)
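
The total only shows that the number of ones equals the number of rows; checking each row separately is a stricter test. A small sketch on plain lists follows (the helper name `is_one_hot` is hypothetical). On the tensor above, `y_train.sum(dim=1)` would give the corresponding per-row sums.

```python
def is_one_hot(rows):
    """True if every row contains only 0s and 1s and exactly one 1."""
    return all(
        all(v in (0, 1) for v in row) and sum(row) == 1
        for row in rows
    )

good = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
bad = [[0.0, 1.0, 1.0], [0.0, 0.0, 0.0]]  # same total as `good`, but not one-hot
print(is_one_hot(good))  # True
print(is_one_hot(bad))   # False
```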

This cell contains the training loop. You may set `NUM_EPOCHS` to any positive integer.

In [59]:
NUM_EPOCHS = 3

from statistics import mean

from tqdm import tqdm
from torch.optim import AdamW

optim = AdamW(bert.parameters(), lr=2e-5, eps=1e-8)

for epoch in range(NUM_EPOCHS):
    epoch_losses = []
    for x, y in tqdm(_load_data(x_train, y_train, batch_size=10)):
        bert.zero_grad()
        out = bert(x, attention_mask=x.ne(tokenizer.pad_token_id).to(int), labels=y)
        epoch_losses.append(out.loss.item())
        out.loss.backward()
        optim.step()
    print(f"epoch {epoch + 1} loss: {mean(epoch_losses)}")

333it [04:48,  1.16it/s]


epoch 1 loss: 0.01097429969401152


333it [04:45,  1.17it/s]


epoch 2 loss: 0.006462818357391513


333it [04:45,  1.17it/s]

epoch 3 loss: 0.004852539871598373
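
The `333it` counts above follow from the data sizes. As a worked check, assuming the 80/20 split rounds the test share up: the 4,162 retained rows leave 3,329 for training, and batches of 10 give 333 iterations per epoch, the last one holding 9 rows.

```python
import math

n_rows = 4162        # rows kept after filtering
test_size = 0.2
batch_size = 10

n_test = math.ceil(n_rows * test_size)   # assumed rounding of the test share
n_train = n_rows - n_test
n_batches = math.ceil(n_train / batch_size)
last_batch = n_train - (n_batches - 1) * batch_size

print(n_test, n_train, n_batches, last_batch)  # 833 3329 333 9
```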





If the final loss is higher than you would like, do not re-run the previous cell, since that would re-create the optimizer. Instead, uncomment the following cell and run it for as many additional epochs as you need.

In [60]:
# NUM_EXTRA_EPOCHS = 1
# for epoch in range(NUM_EXTRA_EPOCHS):
#     epoch_losses = []
#     for x, y in tqdm(_load_data(x_train, y_train, batch_size=10)):
#         bert.zero_grad()
#         out = bert(x, attention_mask=x.ne(tokenizer.pad_token_id).to(int), labels=y)
#         epoch_losses.append(out.loss.item())
#         out.loss.backward()
#         optim.step()
#     print(f"epoch {epoch + 1} loss: {mean(epoch_losses)}")

The next cells evaluate the performance after the additional fine-tuning. The performance scores on the example data will be high, as the model has already been trained on most of these instances.

In [62]:
bert.eval()

x_test = _tokenize(test[TEXT_COL].tolist())
y_test  = _encode_labels(test[LABEL_COL])

batch_size = 20
preds = []

with torch.no_grad():
    for i in range(0, x_test.shape[0], batch_size):
        x = x_test[i : i + batch_size].to(cuda)
        out = bert(x, attention_mask=x.ne(tokenizer.pad_token_id).to(int))
        preds.extend(out.logits.to('cpu'))

import torch.nn.functional as F
from sklearn.metrics import precision_recall_fscore_support as calculate_score

predicted_labels = (
    encoder.inverse_transform(
        F.one_hot(
            torch.vstack(preds).softmax(-1).argmax(-1),
            num_classes=len(encoder.categories_[0])
        ).numpy()
    )
    .reshape(-1)
)

# Predictions: 1-D array of label strings -> list
predicted = list(predicted_labels)

# Ground truth: decode the one-hot y_test back into label strings
actual_labels = encoder.inverse_transform(y_test.numpy()).reshape(-1)
actual = list(actual_labels)

# Score over every label that appears in either the ground truth or the predictions
labels = sorted(set(actual) | set(predicted))


scores = calculate_score(actual, predicted, labels=labels)

scores_df = pd.DataFrame(scores).T
scores_df.columns = ['P', 'R', 'F1', '#']
scores_df.index = labels
scores_df.loc['(micro)'] = calculate_score(actual, predicted, average='micro', labels=labels)
scores_df.loc['(macro)'] = calculate_score(actual, predicted, average='macro', labels=labels)

scores_df



| | P | R | F1 | # |
|---|---|---|---|---|
| T1005 | 0.681818 | 0.833333 | 0.75 | 18.0 |
| T1012 | 0.846154 | 1.0 | 0.916667 | 11.0 |
| T1016 | 1.0 | 1.0 | 1.0 | 24.0 |
| T1027 | 1.0 | 0.962406 | 0.980843 | 133.0 |
| T1033 | 0.96875 | 0.911765 | 0.939394 | 34.0 |
| T1041 | 1.0 | 0.84 | 0.913043 | 25.0 |
| T1047 | 1.0 | 0.952381 | 0.97561 | 21.0 |
| T1055 | 0.966102 | 0.934426 | 0.95 | 61.0 |
| T1057 | 0.982456 | 0.982456 | 0.982456 | 57.0 |
| T1068 | 0.818182 | 1.0 | 0.9 | 9.0 |
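
Each row of the table can be reproduced from raw counts. The sketch below computes precision, recall, F1, and support for one label from lists of actual and predicted labels. The helper `prf_for_label` and the toy lists are illustrative; in the notebook itself, scikit-learn's `precision_recall_fscore_support` does this work.

```python
def prf_for_label(actual, predicted, label):
    """Precision, recall, F1, and support for a single label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == label and p == label for a, p in pairs)
    fp = sum(a != label and p == label for a, p in pairs)
    fn = sum(a == label and p != label for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, tp + fn

actual    = ['T1005', 'T1005', 'T1012', 'T1012']   # toy ground truth
predicted = ['T1005', 'T1012', 'T1012', 'T1012']   # toy predictions
result = prf_for_label(actual, predicted, 'T1012')
print(tuple(round(v, 3) for v in result))  # (0.667, 1.0, 0.8, 2)
```

For the T1005 row above, the reported values are consistent with 15 true positives, 7 false positives, and 3 false negatives: 15/22 ≈ 0.682, 15/18 ≈ 0.833, and F1 = 30/40 = 0.75.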


In [None]:
from google.colab import sheets
sheet = sheets.InteractiveSheet(df=scores_df)