# Phase 1: Preprocessing Cyber Threat Intelligence Dataset for Llama-3 Model

This notebook demonstrates how to load, preprocess, and prepare the *Cyber Threat Intelligence (CTI)* dataset from Hugging Face for training a decoder-only model. The model is trained to extract entities and their relationships from threat reports and to generate a diagnosis of each threat.

## Dataset Overview

- **Dataset Name**: "swaption2009/cyber-threat-intelligence-custom-data"
- **Content**: The dataset consists of cybersecurity threat reports. Each report contains detailed information about a cyber attack or threat, including entities (e.g., attack types, actors, systems), their relationships (e.g., attack vector, methods), and the diagnosis of the threat. The goal is to train the model to process these reports, extract meaningful entities and relations, and generate a comprehensive diagnosis.

## Data Preparation

1. **Data Splitting**
   - The dataset is split into three sets:
     - **80% for training**
     - **10% for validation**
     - **10% for testing**
   - This split ensures the model is trained on a large portion of the data while maintaining sufficient data for evaluation and tuning.

2. **Data Preprocessing (with the custom `Pipeline` library)**
   - The input prompt given to the model is constructed as follows:

     ```
     You are a skilled AI Agent capable of doing CTI Analysis.
     Given this threat report:
     
     {threat report}
     
     You will extract the main entities and their relations; finally, you will generate a diagnosis of the threat.
     ```

   - The expected output lists the entities, their relations, and the diagnosis in the following format:
     
     ```
     Entities: {list of entities}
     Relations: {list of relations} 
     Diagnosis: {diagnosis}
     ```
     
   - Since a decoder-only model is used, the expected output is appended directly to the input prompt, so that prompt and answer form a single training sequence.
   - The label sequence has the same length as this input sequence: the positions that belong to the prompt are set to `-100`, the label value that the loss function ignores, so only the expected-output tokens contribute to the training loss. The model therefore learns to predict the expected output rather than to reproduce the input prompt.

3. **Tokenization and Dataset Packing**
   - Both the input prompt and expected output are tokenized and packed into a dataset using the Hugging Face dataset format.
   - This format ensures that the data is ready for training using Hugging Face's `Trainer` API or any other model training utilities.
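
The prompt masking described in step 2 can be sketched as follows. This is a minimal illustration on toy token ids, not the actual `CTI_Preprocessor` implementation; the helper name `build_training_example` and the id values are assumptions made for the example.

```python
# Label value that PyTorch's cross-entropy loss skips by default (ignore_index=-100)
IGNORE_INDEX = -100


def build_training_example(prompt_ids, answer_ids):
    """Hypothetical helper: join prompt and answer, and mask the prompt in the labels."""
    input_ids = list(prompt_ids) + list(answer_ids)
    # Prompt positions get IGNORE_INDEX, answer positions keep their token ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Toy token ids, not produced by a real tokenizer
prompt_ids = [101, 7, 8, 9]
answer_ids = [42, 43, 2]

input_ids, labels = build_training_example(prompt_ids, answer_ids)
print(input_ids)  # [101, 7, 8, 9, 42, 43, 2]
print(labels)     # [-100, -100, -100, -100, 42, 43, 2]
```

Both lists have the same length, which is what a causal language model expects when `input_ids` and `labels` are passed together.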


---

### Load the data

In [1]:
import sys
sys.path.insert(0,'../')

from Pipeline.data_retrieving.HuggingFace_DataRetriever import *
from datasets import DatasetDict

In [2]:
huggingface_dataset_name = "swaption2009/cyber-threat-intelligence-custom-data"

data_retriever = HuggingFace_DataRetriever(huggingface_dataset_name)
dataset = data_retriever.retrieve_data()

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'entities', 'relations', 'diagnosis', 'solutions'],
        num_rows: 476
    })
})

The dataset ships with a single split, `train`, containing 476 rows. We therefore use these rows to create our own splits:
- 80% training
- 10% validation
- 10% test

In [4]:
dataset = dataset['train']


In [5]:
# 80% of the data will be used for training, the remaining 20% will be further split into 50% for validation and 50% for testing
first_split = dataset.train_test_split(train_size=0.8,test_size=0.2,shuffle=False)

# split the remaining 20% into 50% for validation and 50% for testing
second_split = first_split['test'].train_test_split(train_size=0.5,test_size=0.5,shuffle=False)

# create the new dataset
dataset = DatasetDict({
    'train': first_split['train'],
    'validation': second_split['train'],
    'test': second_split['test']
})

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'entities', 'relations', 'diagnosis', 'solutions'],
        num_rows: 380
    })
    validation: Dataset({
        features: ['id', 'text', 'entities', 'relations', 'diagnosis', 'solutions'],
        num_rows: 48
    })
    test: Dataset({
        features: ['id', 'text', 'entities', 'relations', 'diagnosis', 'solutions'],
        num_rows: 48
    })
})
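
The row counts above can be reproduced by hand. The sketch below assumes the scikit-learn rounding convention (ceiling for the test share, floor for the train share), which is consistent with the sizes shown in the output; the helper `split_sizes` is hypothetical and only illustrates the arithmetic.

```python
import math


def split_sizes(n_samples, train_size, test_size):
    """Row counts for a fractional split, assuming ceil for test and floor for train."""
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor(train_size * n_samples)
    return n_train, n_test


# First split: 476 rows into 80% train and 20% held-out
n_train, n_holdout = split_sizes(476, 0.8, 0.2)

# Second split: the held-out rows into 50% validation and 50% test
n_validation, n_test = split_sizes(n_holdout, 0.5, 0.5)

print(n_train, n_validation, n_test)  # 380 48 48
```

Because 476 is not divisible by 5, the effective proportions are roughly 79.8% / 10.1% / 10.1% rather than exactly 80/10/10.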

---

### Preprocess the data

In [6]:
from Pipeline.preprocessing.CTI_Preprocessor import *

from transformers import AutoTokenizer

In [7]:
model_id = "meta-llama/Llama-3.2-3B-Instruct"

In [8]:
access_token = "YOUR HUGGING FACE ACCESS TOKEN"

In [9]:
# define the necessary preprocessing parameters and create an instance of the preprocessor
tokenizer = AutoTokenizer.from_pretrained(model_id, token=access_token)
tokenizer.pad_token = tokenizer.eos_token

input_preprocessing_params={
        'tokenizer': tokenizer,
        'truncation': True
    }

output_preprocessing_params={
        'tokenizer': tokenizer,
        'truncation': True
    }

preprocessor = CTI_Preprocessor(
    input_preprocessing_params,
    output_preprocessing_params
)

In [10]:
# preprocess the training set
tokenized_training_inputs = preprocessor.preprocess_input_data(dataset['train'].to_pandas())
tokenized_training_outputs = preprocessor.preprocess_output_data(dataset['train'].to_pandas())

# preprocess the validation set
tokenized_validation_inputs = preprocessor.preprocess_input_data(dataset['validation'].to_pandas())
tokenized_validation_outputs = preprocessor.preprocess_output_data(dataset['validation'].to_pandas())

# preprocess the test set
tokenized_test_inputs = preprocessor.preprocess_input_data(dataset['test'].to_pandas(), for_training=False)
tokenized_test_outputs = preprocessor.preprocess_output_data(dataset['test'].to_pandas())

### Let's show an example of the training data that will be fed to the model

In [19]:
print('Prompt example:\n\n')
print(tokenizer.decode(tokenized_training_inputs[0], skip_special_tokens=True))

print('\n\n')

print('Expected response of the model:\n\n')
# drop the -100 label values (ignored by the loss), which decode() cannot handle
output = [token for token in tokenized_training_outputs[0] if token != -100]
print(tokenizer.decode(output, skip_special_tokens=True))

Prompt example:


You are a skilled AI Agent capable of doing CTI Analysis.

Given this threat report:
A cybersquatting domain save-russia[.]today is launching DoS attacks on Ukrainian news sites.

You will extract the main entities and their relations; finally, you will generate a diagnosis of the threat.

Entities: cybersquatting (attack-pattern), save-russia[.]today (url), DoS attacks (attack-pattern), Ukrainian news sites (identity)
Relations: DoS attacks to Ukrainian news sites (targets), cybersquatting to save-russia[.]today (uses), DoS attacks to save-russia[.]today (uses)
Diagnosis: The diagnosis is a cyber attack that involves the use of a cybersquatting domain save-russia[.]today to launch DoS attacks on Ukrainian news sites. The attacker targets the Ukrainian news sites as the victim, using the cybersquatting



Expected response of the model:



Entities: cybersquatting (attack-pattern), save-russia[.]today (url), DoS attacks (attack-pattern), Ukrainian news sites (identity)

### If the prompt and the output are prepared correctly, the tokenized prompt and the (masked) output should have the same length

In [12]:
# check the length of the first example
len(tokenized_training_inputs[0]) == len(tokenized_training_outputs[0])

True

Everything looks alright!
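
The cell above checks only the first example. A sketch of how the same check could be extended to every example is given below; it runs on toy lists, and the helper names `find_length_mismatches` and `count_supervised_tokens` are assumptions, not part of the `Pipeline` library.

```python
IGNORE_INDEX = -100  # label value ignored by the loss


def find_length_mismatches(inputs, labels):
    """Return the indices of examples whose input and label lengths differ."""
    return [i for i, (x, y) in enumerate(zip(inputs, labels)) if len(x) != len(y)]


def count_supervised_tokens(label_row):
    """Count the label positions that actually contribute to the loss."""
    return sum(1 for token in label_row if token != IGNORE_INDEX)


# Toy data: the second example is deliberately misaligned
toy_inputs = [[1, 2, 3, 4], [5, 6, 7]]
toy_labels = [[-100, -100, 3, 4], [-100, 6]]

print(find_length_mismatches(toy_inputs, toy_labels))       # [1]
print([count_supervised_tokens(row) for row in toy_labels])  # [2, 1]
```

Applied to the notebook's variables, `find_length_mismatches(tokenized_training_inputs, tokenized_training_outputs)` would be expected to return an empty list if every training example is aligned.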

### Let's print an example of the testing prompt and its output

In [13]:
print('Prompt example:\n\n')
print(tokenizer.decode(tokenized_test_inputs[0], skip_special_tokens=True))

print('\n\n')

print('Expected response of the model:\n\n')
# drop the -100 label values (ignored by the loss), which decode() cannot handle
output = [token for token in tokenized_test_outputs[0] if token != -100]
print(tokenizer.decode(output, skip_special_tokens=True))

Prompt example:


You are a skilled AI Agent capable of doing CTI Analysis.

Given this threat report:
The PDF  exploits CVE-2013-2729 to download a binary which also installed CryptoWall 2.0.

You will extract the main entities and their relations; finally, you will generate a diagnosis of the threat.




Expected response of the model:



Entities: CryptoWall 2.0 (malware), CVE-2013-2729 (vulnerability)
Relations: CryptoWall 2.0 to CVE-2013-2729 (exploits)
Diagnosis: The entity is a PDF file and the cybersecurity issue is a vulnerability (CVE-2013-2729) that is being exploited. The PDF file is being used as a delivery mechanism for malware (CryptoWall 2.0). The diagnosis is
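
The two printed examples suggest that the training prompt ends with the expected answer, while the test prompt (built with `for_training=False`) stops after the instruction so that the model has to generate the answer itself. The sketch below illustrates that difference on plain strings. The templates, the `build_prompt` helper, and the exact whitespace are assumptions modeled on the printed output, not the actual `CTI_Preprocessor` code.

```python
# Templates reconstructed from the printed examples (layout is an assumption)
PROMPT_TEMPLATE = (
    "You are a skilled AI Agent capable of doing CTI Analysis.\n\n"
    "Given this threat report:\n{report}\n\n"
    "You will extract the main entities and their relations; "
    "finally, you will generate a diagnosis of the threat.\n\n"
)
ANSWER_TEMPLATE = "Entities: {entities}\nRelations: {relations}\nDiagnosis: {diagnosis}"


def build_prompt(row, for_training=True):
    """Hypothetical helper: append the expected answer only for training prompts."""
    prompt = PROMPT_TEMPLATE.format(report=row["text"])
    if for_training:
        prompt += ANSWER_TEMPLATE.format(
            entities=row["entities"],
            relations=row["relations"],
            diagnosis=row["diagnosis"],
        )
    return prompt


# Toy row with string fields; the real column types may differ
row = {
    "text": "The PDF exploits CVE-2013-2729 to download a binary which also installed CryptoWall 2.0.",
    "entities": "CryptoWall 2.0 (malware), CVE-2013-2729 (vulnerability)",
    "relations": "CryptoWall 2.0 to CVE-2013-2729 (exploits)",
    "diagnosis": "The entity is a PDF file and the cybersecurity issue is a vulnerability (CVE-2013-2729) that is being exploited.",
}

training_prompt = build_prompt(row, for_training=True)
inference_prompt = build_prompt(row, for_training=False)
print(inference_prompt)
```

Keeping the answer out of the test prompt means that generation at evaluation time starts exactly where the training answer would have started.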


---


### Save the dataset locally, using the Hugging Face Dataset format

In [14]:
from datasets import Dataset, DatasetDict

In [15]:
# first create a dictionary
dataset = {
    'train_set': {
                    'input_ids': tokenized_training_inputs,
                    'labels': tokenized_training_outputs        
                },

    'validation_set': {
                    'input_ids': tokenized_validation_inputs,
                    'labels': tokenized_validation_outputs        
                },
    
    'test_set': {
                    'input_ids': tokenized_test_inputs,
                    'labels': tokenized_test_outputs        
                }
}

In [16]:
# then convert each split into a Hugging Face Dataset
train_dataset = Dataset.from_dict(dataset["train_set"])
validation_dataset = Dataset.from_dict(dataset["validation_set"])
test_dataset = Dataset.from_dict(dataset["test_set"])

In [17]:
# and finally pack it into a DatasetDict structure
dataset = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    "test": test_dataset
})

In [None]:
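# save locally
import os
import pickle

# make sure the target directory exists before writing
os.makedirs("data", exist_ok=True)

# open a file in write-binary mode
with open("data/dataset_CTI_llama3_2-3B.pkl", "wb") as file:
    # serialize the DatasetDict and save it to the file
    pickle.dump(dataset, file)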
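
To reload the pickled dataset later, the file can be read back with `pickle.load`. The sketch below shows the round trip on a small stand-in dictionary written to a temporary directory; the file name and the toy content are assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the DatasetDict built above; any picklable object behaves the same way
toy_dataset = {"train": {"input_ids": [[1, 2, 3]], "labels": [[-100, 2, 3]]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "dataset_example.pkl"  # hypothetical file name

    # write the object in binary mode
    with open(path, "wb") as file:
        pickle.dump(toy_dataset, file)

    # read it back in binary mode
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == toy_dataset)  # True
```

As an alternative to pickle, the `datasets` library provides `DatasetDict.save_to_disk` and `load_from_disk`, which store the data in Arrow format and avoid unpickling files from untrusted sources.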
