Data Annotation

Data Overview

Data annotation is cognitively taxing, time-intensive, and resource-expensive. Because the training data has the largest impact on the success of our classifier, we created a high-quality, open-source dataset using the practices described in our Data Annotation Best Practices Guide. We collaborated with Feedly, a web-based news aggregator, to leverage their enterprise Feedly AI engine to intelligently collect data for our annotation effort. Feedly lets us filter the types of articles we view and use its API to download the data in JSON format, including the full text of each report and other metadata.
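
For illustration, below is a minimal sketch of pulling a batch of articles from the Feedly streams API with Python. The stream ID, token, and response field names are placeholders and assumptions; they would need to match your Feedly Enterprise account and the fields returned for your plan.

```python
import requests

FEEDLY_TOKEN = "YOUR_ENTERPRISE_TOKEN"                   # placeholder token
STREAM_ID = "enterprise/yourteam/category/threat-intel"  # placeholder stream ID

def fetch_articles(count=100):
    """Download one batch of articles from a Feedly stream as JSON."""
    resp = requests.get(
        "https://cloud.feedly.com/v3/streams/contents",
        headers={"Authorization": f"Bearer {FEEDLY_TOKEN}"},
        params={"streamId": STREAM_ID, "count": count},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

for item in fetch_articles():
    # Field names vary by plan and feature set; fall back from full text to summary.
    text = item.get("fullContent") or item.get("content", {}).get("content", "")
    print(item.get("title", "<untitled>"), len(text))
```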

The full CTI reports and the indices (along with technique labels) are provided in MJSON format, which comes directly from the MITRE Annotation Toolkit. We also include the single-label.json and multi-label.json files for training models.
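
As a quick way to inspect one of the label files, the sketch below loads multi-label.json and counts technique labels. The record fields (`text`, `labels`) are hypothetical placeholders, not the actual schema; check the files in the repository for the real structure.

```python
import json
from collections import Counter

# "text" and "labels" are assumed field names; adjust to the actual schema.
with open("multi-label.json", encoding="utf-8") as f:
    records = json.load(f)

label_counts = Counter()
for record in records:
    label_counts.update(record.get("labels", []))

print(f"{len(records)} annotated units")
for technique, count in label_counts.most_common(10):
    print(technique, count)
```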

Comparison of multi-label annotation at sentence and phrase-level

The multi-label.json file contains sentence-level annotations, whereas the single-label.json file contains phrase-level annotations. Multi-label annotation is more flexible because the model can detect text that references multiple techniques in close proximity, but we only had enough data for the single-label experiments to yield acceptable results. As we continue to label additional training data, we can improve the model's ability to detect multiple adversary TTPs in a single sentence or phrase.

Single-Label vs Multi-Label

Another important aspect of building training data is unitizing it consistently. While models can classify text, they cannot decide where unit boundaries are, so the documents must be divided into units before any data is passed to the model. A common approach is to divide the documents into sentences using a sentence-splitting tool, called a sentencizer. We can do this with the documents produced during this annotation effort because we have the complete documents as well as the segments of each document that our annotators highlighted. In comparison, the training dataset embedded in TRAM is composed of phrases with associated ATT&CK techniques; these segments of text are not complete sentences and cannot be used to recreate the original documents. Dividing the documents consistently makes the positive samples reproducible and also produces negative samples. The model needs to be trained on negative samples so that it can best determine when part of a document does not contain an ATT&CK technique.
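
As a sketch of this kind of unitizing, the snippet below uses spaCy's rule-based sentencizer to split a document into sentences while keeping character offsets, so highlighted spans can later be mapped back onto the sentences. The example text is illustrative.

```python
import spacy

# Rule-based sentence splitting; no statistical model download required.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def split_into_sentences(document_text):
    """Return (start_char, end_char, text) for each sentence in the document."""
    doc = nlp(document_text)
    return [(sent.start_char, sent.end_char, sent.text) for sent in doc.sents]

sentences = split_into_sentences(
    "The actor sent a spearphishing attachment. The payload was executed via PowerShell."
)
for start, end, text in sentences:
    print(start, end, text)
```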

With the annotations produced during this effort, we have the whole documents and the sections of text that were highlighted and associated with an ATT&CK technique. A highlighted section may be contained entirely within one sentence, span multiple sentences, or partially overlap two sentences. We can divide the document into sentences and assign each sentence every ATT&CK label whose highlighted section is wholly or partially contained in that sentence. This means that some sentences will have no labels, some will have exactly one, and some will have more than one. Training a classifier where an instance can have more than one label is called multi-label classification (“multi-class classification” refers to a problem with multiple classes).
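
Continuing the sketch above, a sentence can be assigned every technique whose highlighted span overlaps it by comparing character offsets. The annotation structure (start, end, technique ID) is an assumption about how the highlights might be represented.

```python
def label_sentences(sentences, annotations):
    """Assign each sentence every technique whose highlighted span overlaps it.

    sentences:   list of (start_char, end_char, text) tuples
    annotations: list of (start_char, end_char, technique_id) tuples
    """
    labeled = []
    for s_start, s_end, text in sentences:
        labels = sorted({
            technique
            for a_start, a_end, technique in annotations
            if a_start < s_end and a_end > s_start  # any overlap, full or partial
        })
        # No labels -> negative sample; one label -> single-label instance;
        # several labels -> multi-label instance.
        labeled.append({"text": text, "labels": labels})
    return labeled
```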

The table below shows multi-label results from experiments run with SciBERT on the annotation dataset.

precision (p), recall (r), and F1-scores for SciBERT multi-class multi-label model

The multi-label approach is more flexible, but the model's performance is not yet good enough for production use and would likely improve with more data.
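
For reference, a multi-label fine-tuning setup with SciBERT can be built with Hugging Face transformers roughly as below. The number of labels, example sentence, and training details are placeholders, not the configuration behind the reported results.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "allenai/scibert_scivocab_uncased"
NUM_TECHNIQUES = 50  # placeholder: size of the ATT&CK technique label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_TECHNIQUES,
    problem_type="multi_label_classification",  # independent sigmoid per label with BCE loss
)

# During training, each sentence would carry a multi-hot label vector
# (1.0 for every technique it mentions, all zeros for a negative sample).
inputs = tokenizer(
    "The payload was executed via PowerShell and persisted as a scheduled task.",
    return_tensors="pt",
    truncation=True,
)
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, NUM_TECHNIQUES)
```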
