
Data Annotation


Data annotation is a cognitively taxing, time-intensive, and resource-expensive task; in this iteration of TRAM we focused on improving how we annotate training data from CTI reports. Understanding that the training data would have the most impact on the success of our classifier, our research team created a data annotation process, using the conventions in our Data Annotation Best Practices Guide, to produce annotations that aligned with how the model makes predictions.

To help professionalize TRAM, we streamlined not only how we annotated data, but also where we collected it. To ensure the reports we analyzed were relevant and contained TTPs of interest, we explored several options for report collection and settled on Feedly. We collaborated with Feedly, a web-based news aggregator, to leverage its enterprise Feedly AI engine to intelligently collect data for our annotation effort. Feedly lets you filter on the types of articles you want to view and use its API to download the data in JSON format, including the full text of each report and other metadata.
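As a rough illustration, the sketch below pulls article JSON over the Feedly Cloud API with Python's requests library. The stream ID, token, and parameter values are placeholders, and the exact endpoint options available depend on your Feedly plan; consult Feedly's API documentation for the authoritative details.

```python
# Minimal sketch: downloading article JSON from the Feedly Cloud API.
# The token and stream ID below are placeholders, and the auth scheme and
# query parameters should be checked against Feedly's API documentation.
import json
import requests

FEEDLY_TOKEN = "YOUR_ACCESS_TOKEN"                 # placeholder
STREAM_ID = "enterprise/yourorg/category/cti"      # placeholder stream ID

resp = requests.get(
    "https://cloud.feedly.com/v3/streams/contents",
    headers={"Authorization": f"Bearer {FEEDLY_TOKEN}"},
    params={"streamId": STREAM_ID, "count": 100},
    timeout=30,
)
resp.raise_for_status()

# The streams/contents response carries articles under "items".
articles = resp.json().get("items", [])
with open("feedly_articles.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, indent=2)

print(f"Saved {len(articles)} articles")
```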

The full CTI reports and the sentence indices (along with technique labels) are provided in MJSON format, which comes directly from MAT (the MITRE annotation tool). The other formats we've included for model training are the single-label.json and multi-label.json files.
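As a rough sketch, the training files can be loaded with Python's json module; the "labels" key used below is purely illustrative, so inspect the actual files for their exact schema.

```python
# Hypothetical sketch of loading the training files; the per-example field
# names (e.g. "labels") are assumptions, not the documented schema.
import json

with open("multi-label.json", encoding="utf-8") as f:
    multi_label = json.load(f)

with open("single-label.json", encoding="utf-8") as f:
    single_label = json.load(f)

# Example: count sentences carrying more than one technique label.
multi_technique = sum(1 for ex in multi_label if len(ex.get("labels", [])) > 1)
print(f"{multi_technique} of {len(multi_label)} examples have multiple labels")
```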

Comparison of multi-label annotation at sentence and phrase-level

The multi-label.json file contains sentence-level annotations, whereas the single-label.json file contains phrase-level annotations. For large language model experiments, multi-label is more representative of what we want the model to do, but we only had sufficient data for single-label experiments to yield acceptable results. As we continue to label additional training data, we can improve the model's ability to detect multiple adversary TTPs in a sentence or phrase.

Preparing Data for Model Training

One other aspect of building training data is the importance of consistently unitizing the data. While models are able to classify text, they cannot decide where unit boundaries are; the documents must be divided into units before any data is passed to the model. A common approach is to divide the documents into sentences using a sentence-splitting tool, called a sentencizer. We can do this with the documents produced during this annotation effort, since we have the complete documents as well as the segments of each document that our annotators highlighted. By comparison, the embedded TRAM training dataset comprises phrases with associated ATT&CK techniques; those segments of text are not complete sentences and cannot be used to recreate the original documents. Consistently dividing the documents makes the positive samples reproducible and also produces negative samples. The model needs to be trained on negative samples so that it can best determine when part of a document doesn't contain an ATT&CK technique.
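As a minimal sketch, a rule-based splitter such as spaCy's "sentencizer" component can perform this unitization; any comparable sentence splitter could be substituted.

```python
# Minimal sentence-splitting sketch using spaCy's rule-based sentencizer.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def split_into_sentences(document_text: str) -> list[str]:
    """Divide a report into sentence units before labeling or inference."""
    doc = nlp(document_text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

report = (
    "The actor sent a spearphishing email with a malicious attachment. "
    "Once opened, the payload established persistence via a scheduled task."
)
for sentence in split_into_sentences(report):
    print(sentence)
```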

With the annotations produced during this effort, we have the whole documents and the sections of text that were highlighted to associate an ATT&CK technique. A highlighted section may be contained entirely within a sentence, span multiple sentences, or partially overlap two sentences. We can divide the document into sentences and, for each sentence, assign it every ATT&CK label whose highlight overlaps that sentence, wholly or partially. This means that some sentences will have no labels, some will have exactly one, and some will have more than one. Training a classifier where instances can carry more than one label is called multi-label classification (as distinct from multi-class classification, which refers to having more than two classes).
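The sketch below illustrates that assignment, assuming each annotation is stored as a character-offset span with a technique ID; the class and function names here are illustrative and not part of the TRAM codebase.

```python
# Sketch of turning highlighted spans into multi-label sentence examples.
# Assumes each annotation is a (start, end, technique_id) character-offset span
# over the original document text; sentence offsets come from the splitter.
from dataclasses import dataclass, field

@dataclass
class Sentence:
    start: int
    end: int
    text: str
    labels: set[str] = field(default_factory=set)

def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two character spans share at least one character."""
    return a_start < b_end and b_start < a_end

def assign_labels(sentences: list[Sentence], annotations: list[tuple[int, int, str]]):
    # A sentence receives every technique whose highlight overlaps it,
    # wholly or partially; unlabeled sentences become negative samples.
    for ann_start, ann_end, technique in annotations:
        for sent in sentences:
            if overlaps(ann_start, ann_end, sent.start, sent.end):
                sent.labels.add(technique)
    return sentences

# Illustrative offsets only: one highlight spans two sentences, and the
# final sentence receives no label, making it a negative sample.
sentences = [
    Sentence(0, 62, "The actor sent a spearphishing email with a malicious file."),
    Sentence(63, 130, "Once opened, the payload set up a scheduled task."),
    Sentence(131, 170, "The report was published in 2023."),
]
annotations = [(10, 40, "T1566.001"), (55, 100, "T1053.005")]
for sent in assign_labels(sentences, annotations):
    print(sorted(sent.labels) or "NEGATIVE SAMPLE", "->", sent.text)
```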

This modeling approach more closely aligns with what the TRAM tool is intended to do than the single-label modeling approach, but it is more complicated for the model to learn. We also have less data available for the multi-label approach, because only the data produced during this annotation effort can be used.

The annotation dataset table below shows multi-label results from experiments run with SciBERT. The multi-label analysis is more indicative of real-world reports, but the model's performance is not yet good enough for production use and would likely improve with more data.

precision (p), recall (r), and F1-scores for SciBERT multi-class multi-label model
