# Phase 3: Model Evaluation

In this final phase of the project, we evaluate the performance of the fine-tuned model against the base model using the prepared test set. The evaluation includes both quantitative metrics and qualitative analysis through a prediction example.

## Evaluation Process

1. **Model Loading**
   - Both the **base model** ("meta-llama/Llama-3.2-3B-Instruct") and the **fine-tuned model** were loaded for evaluation.

2. **Performance Metrics**
   - The models were evaluated using **ROUGE scores** to measure the quality of the generated outputs compared to the ground truth. The following scores were obtained:

| Metric    | Base Model | Fine-Tuned Model |
|-----------|------------|------------------|
| **ROUGE-1**   | 0.2895     | **0.6537**         |
| **ROUGE-2**   | 0.1145     | **0.4022**         |
| **ROUGE-L**   | 0.1946     | **0.5090**         |
| **ROUGE-Lsum**| 0.2620     | **0.5719**         |

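All four ROUGE scores improve substantially after fine-tuning: ROUGE-1 more than doubles (0.2895 to 0.6537) and ROUGE-2 more than triples (0.1145 to 0.4022). Since ROUGE measures n-gram overlap with the reference, this indicates that the fine-tuned model's outputs follow the ground-truth CTI analyses much more closely than the base model's.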
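
To put the table in perspective, the short sketch below computes the absolute gain and the fine-tuned/base ratio for each metric. The values are copied from the table; the helper name `improvement` is only illustrative.

```python
# ROUGE scores copied from the table above: (base model, fine-tuned model)
scores = {
    "ROUGE-1": (0.2895, 0.6537),
    "ROUGE-2": (0.1145, 0.4022),
    "ROUGE-L": (0.1946, 0.5090),
    "ROUGE-Lsum": (0.2620, 0.5719),
}


def improvement(base, tuned):
    """Return (absolute gain, ratio of fine-tuned score to base score)."""
    return tuned - base, tuned / base


gains = {name: improvement(base, tuned) for name, (base, tuned) in scores.items()}

for name, (delta, ratio) in gains.items():
    print(f"{name:<11} +{delta:.4f}  ({ratio:.2f}x)")
```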


## Example of Prediction

### **Input Prompt**
```
You are a skilled AI Agent capable of doing CTI Analysis.

Given this threat report: Three of the backdoors, NFlog, PoisonIvy, and NewCT have previously been publicly associated with DragonOK.

You will extract the main entities and their relations; finally, you will generate a diagnosis of the threat.
```

### **Ground Truth**
```
Entities: NFlog (malware), PoisonIvy (malware), NewCT (malware), DragonOK (threat-actor)
Relations: NFlog to DragonOK (authored-by), PoisonIvy to DragonOK (authored-by), NewCT to DragonOK (authored-by)
Diagnosis: Possible diagnosis: The entity DragonOK has been associated with the creation of three malware backdoors, NFlog, PoisonIvy, and NewCT, as they were previously publicly linked to this threat actor. This suggests that DragonOK may be involved in...
```

### **Fine-Tuned Model Prediction**
```
Entities: DragonOK (threat-actor), NFlog (malware), PoisonIvy (malware), NewCT (malware)
Relations: DragonOK to NFlog (uses), DragonOK to PoisonIvy (uses), DragonOK to NewCT (uses)
Diagnosis: The threat actor DragonOK has previously been associated with the use of three malware backdoors: NFlog, PoisonIvy, and NewCT. This indicates a potential cybersecurity breach and highlights the importance of monitoring and securing networks against such threats.
```


## Conclusion

- In this example, the fine-tuned model extracts all four entities with the correct types and generates a coherent diagnosis.  
- The relations differ from the ground truth in both label and direction ("DragonOK to NFlog (uses)" instead of "NFlog to DragonOK (authored-by)"). Each predicted relation still links the threat actor to the right malware, so the overall context and threat assessment remain coherent.  
- Together with the improved ROUGE scores, this example suggests that the fine-tuned model has learned the CTI analysis task and clearly outperforms the base model on it.
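
The relation mismatch in the example can be made explicit with a small comparison script. This is a minimal sketch that assumes the `A to B (label)` format shown in the example above; `parse_relations` and `undirected_pairs` are hypothetical helpers, not part of the evaluation code in this notebook.

```python
import re


def parse_relations(line):
    """Parse 'Relations: A to B (label), ...' into a set of (source, target, label)."""
    body = line.split(":", 1)[1]
    triples = re.findall(r"([^,()]+?) to ([^,()]+?) \(([^)]+)\)", body)
    return {(s.strip(), t.strip(), label.strip()) for s, t, label in triples}


def undirected_pairs(triples):
    """Drop label and direction, keeping only which two entities are linked."""
    return {frozenset((s, t)) for s, t, _ in triples}


truth = parse_relations(
    "Relations: NFlog to DragonOK (authored-by), "
    "PoisonIvy to DragonOK (authored-by), NewCT to DragonOK (authored-by)"
)
pred = parse_relations(
    "Relations: DragonOK to NFlog (uses), "
    "DragonOK to PoisonIvy (uses), DragonOK to NewCT (uses)"
)

exact_matches = truth & pred
same_pairs = undirected_pairs(truth) == undirected_pairs(pred)

print(f"exact triple matches: {len(exact_matches)} of {len(truth)}")
print(f"same entity pairs linked: {same_pairs}")
```

No triple matches exactly, but the same three entity pairs are linked, which is the sense in which the prediction stays close to the ground truth.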

---

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

import torch
import pandas as pd
import numpy as np

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

### Load the original model and the fine-tuned model (with LoRA)

In [None]:
model_path="/content/drive/My Drive/Git_Portfolio/CTI/peft_model_CTI"

model_name = "meta-llama/Llama-3.2-3B-Instruct"
access_token = "YOUR HUGGING FACE ACCESS TOKEN"

# `token` replaces the deprecated `use_auth_token` argument
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(model_name, token=access_token)

peft_model = PeftModel.from_pretrained(base_model, model_path, is_trainable=False)
peft_model = peft_model.to(device)
peft_model.eval()

In [None]:
# reload a clean copy of the base model: PeftModel.from_pretrained() injected
# the LoRA layers into the instance it wrapped above
base_model = AutoModelForCausalLM.from_pretrained(model_name, token=access_token)
base_model = base_model.to(device)
base_model.eval()
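
Reloading the base model keeps two copies of the 3B-parameter weights in memory. As a possible alternative, PEFT's `PeftModel` provides a `disable_adapter()` context manager that temporarily bypasses the LoRA layers. The sketch below assumes a `peft_model` loaded as above; `generate_both` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def generate_both(peft_model, **generate_kwargs):
    """Return (base_output, fine_tuned_output) from a single PeftModel instance."""
    # Inside disable_adapter() the LoRA layers are bypassed, so the wrapped
    # model behaves like the original base model.
    with peft_model.disable_adapter():
        base_output = peft_model.generate(**generate_kwargs)

    # Outside the context manager the adapter is active again.
    tuned_output = peft_model.generate(**generate_kwargs)
    return base_output, tuned_output
```

The trade-off is that both predictions then come from the same object, so the two models can no longer be placed on different devices.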

---

### Run the original model and the fine-tuned model on the test set and save the predictions

In [None]:
import pickle

# Load the dataset
with open('/content/drive/My Drive/Git_Portfolio/CTI/data/dataset_CTI_llama3_2-3B.pkl', 'rb') as file:
    dataset = pickle.load(file)

In [None]:
from tqdm import tqdm
model_predictions = pd.DataFrame([],columns=['ground_truth', 'base_model', 'fine_tuned_model'])

# iterate over the test set
for i in tqdm(range(len(dataset['test']))):
  input_ids = dataset['test']['input_ids'][i]
  labels = dataset['test']['labels'][i]

  # drop the -100 entries: they mark positions ignored by the loss (e.g. padding),
  # not real token ids, so decode() would fail on them
  labels = [token for token in labels if token != -100]

  # save the decoded ground truth
  model_predictions.loc[i,'ground_truth'] = tokenizer.decode(labels, skip_special_tokens=True)

  # use the base model to predict
  outputs_base_model = base_model.generate(
        input_ids = torch.tensor(input_ids).unsqueeze(0).to(device),
        eos_token_id = tokenizer.eos_token_id,
        max_new_tokens=400
  )

  # the model used is a decoder-only model, so we need to remove the input_ids from the output
  outputs_base_model = outputs_base_model[0][len(input_ids):]

  # decode the output and put it in the dataframe
  model_predictions.loc[i,'base_model'] = tokenizer.decode(outputs_base_model, skip_special_tokens=True)

  # use the fine-tuned model to predict
  outputs_peft_model = peft_model.generate(
        input_ids = torch.tensor(input_ids).unsqueeze(0).to(device),
        eos_token_id = tokenizer.eos_token_id,
        max_new_tokens=400
  )

  # the model used is a decoder-only model, so we need to remove the input_ids from the output
  outputs_peft_model = outputs_peft_model[0][len(input_ids):]

  # decode the output and put it in the dataframe
  model_predictions.loc[i,'fine_tuned_model'] = tokenizer.decode(outputs_peft_model, skip_special_tokens=True)
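
The two list manipulations in the loop above (dropping `-100` labels and cutting the prompt off the generated sequence) are easier to see on toy data. The sketch below uses plain Python lists with made-up token ids; `strip_ignored` and `strip_prompt` are hypothetical helper names.

```python
IGNORE_INDEX = -100  # label value skipped by the loss; not a real token id


def strip_ignored(labels):
    """Keep only real token ids so the labels can be decoded."""
    return [token for token in labels if token != IGNORE_INDEX]


def strip_prompt(generated, prompt_len):
    """A decoder-only model returns prompt + continuation; keep the continuation."""
    return generated[prompt_len:]


# toy example with made-up token ids
prompt = [101, 7, 8, 9]
labels = [-100, -100, -100, -100, 42, 43, 2]
generated = prompt + [42, 44, 2]

print(strip_ignored(labels))
print(strip_prompt(generated, len(prompt)))
```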


In [None]:
# save the predictions on the Test Set
model_predictions.to_csv('/content/drive/My Drive/Git_Portfolio/CTI/data/model_predictions.csv', index=False)

### Compute the ROUGE metric
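
For intuition about what is being measured, ROUGE-1 can be approximated by hand as the F1 score over shared unigrams. The sketch below is a simplification: it only lowercases and splits on whitespace, whereas the `rouge` metric loaded from `evaluate` applies its own tokenization (and stemming when `use_stemmer=True`), so its values will not match this function exactly. The name `rouge1_f1` is illustrative.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # clipped overlap: each unigram counts at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("DragonOK uses NFlog", "NFlog authored-by DragonOK"))
```

Because only shared words are counted, a prediction can score well while still getting a relation label or direction wrong, which is why the notebook also inspects an example by hand.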

In [None]:
# load the predictions of the test set
model_predictions = pd.read_csv('/content/drive/My Drive/Git_Portfolio/CTI/data/model_predictions.csv')

In [None]:
import evaluate

In [None]:
rouge = evaluate.load('rouge')

In [None]:
# compute the ROUGE scores for the base model and for the fine-tuned model

base_model_results = rouge.compute(
    predictions=model_predictions['base_model'].to_list(),
    references=model_predictions['ground_truth'].to_list(),
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=model_predictions['fine_tuned_model'].to_list(),
    references=model_predictions['ground_truth'].to_list(),
    use_aggregator=True,
    use_stemmer=True,
)

In [None]:
print('Base Model scores:')
for score in base_model_results:
  print(f'{score}: {base_model_results[score]}')

print('\n --------------- \n')

print('Fine-tuned Model scores:')
for score in peft_model_results:
  print(f'{score}: {peft_model_results[score]}')

#### **We can notice that all the ROUGE scores improved with fine-tuning!**

---

In [None]:
import random

# select one random report from the test set
indeces = random.sample(range(len(model_predictions)), 1)
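
The sample above changes on every run. If the same report should be shown each time, one option is a locally seeded generator, sketched below; the helper `sample_indices` and the seed value are arbitrary choices for illustration.

```python
import random


def sample_indices(n_items, k=1, seed=0):
    """Draw k distinct indices from range(n_items), reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(range(n_items), k)


print(sample_indices(100, k=3, seed=42))
```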

### Human evaluation of the prediction

#### **We can notice that the output of the fine-tuned model is extensive and very close to the ground truth**

In [None]:

# print the original report and compare the ground truth with the fine-tuned model's prediction
for idx in indeces:
  print()
  print('-'*150)
  print('-'*150)
  print()
  print(f'Report {idx}')
  print()

  report = tokenizer.decode(dataset['test']['input_ids'][idx], skip_special_tokens=True)
  print(report)

  print()
  print('-'*150)
  print()

  print('Ground truth:\n')
  print(model_predictions['ground_truth'][idx])

  print()
  print('-'*150)
  print()

  print('Fine-tuned model:\n')
  print(model_predictions['fine_tuned_model'][idx])