# Table of Contents

1. [RAG Evaluation](#RAG-Evaluation)
    1. [Introduction](#Introduction)
    2. [Dataset Format](#Dataset-Format)
    3. [Code](#Code)
        1. [Prompts](#Prompts)
        2. [Chunks](#Chunks)
        3. [Generation](#Generation)

# RAG Evaluation

## Introduction

Our objective is to monitor and improve the RAG pipeline for **AI-OPS**, which requires context-specific data from the *Cybersecurity* and *Penetration Testing* fields. We also want the evaluation process to be as automated as possible, so the evaluation output will be written to [EVALUATION.md](../../../EVALUTATION.md) by **GitHub Actions**.


The workflow is split into two steps:

1. **Dataset Generation** (what you're reading): this step won't be automated for now. Generating the dataset requires an LLM, but this project aims to be fully open-source (and I do not have OpenAI API keys), and running inference with Ollama or similar inside GitHub Actions would be too slow.

2. **Evaluation** ([evaluation.py](./evaluation.py)): this step will be automated with GitHub Actions. Ollama will still be used to embed the chunks uploaded to the vector database (Qdrant) and to generate the responses, but this won't be as expensive as generating a dataset.

## Dataset Format

| Context                                                                                  | Question                   | Answer                                              | Ground Truth                                                              |
|------------------------------------------------------------------------------------------|----------------------------|-----------------------------------------------------|---------------------------------------------------------------------------|
| Used to generate Questions and Ground Truth.<br> Won't be present in the output dataset. | Mockup of a user question. | Answers will be generated during <br> evaluation ([evaluation.py](./evaluation.py)). | The "real" answer for the question.<br> Manual review should be performed. |


(...) given a list of chunks, we will generate one or more questions for each chunk; we will also generate a Ground Truth, even if it is not optimal (...)
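
To make the format above concrete, here is a minimal sketch of one dataset record. The names `EvalItem` and `to_output_row` are hypothetical and only illustrate the columns of the table; the notebook itself works with plain dictionaries.

```python
from dataclasses import dataclass, asdict


@dataclass
class EvalItem:
    """One generated record (hypothetical helper, not part of the project)."""
    context: str       # chunk used to generate the question and ground truth
    question: str      # mockup of a user question
    ground_truth: str  # reference answer, to be reviewed manually


def to_output_row(item: EvalItem) -> dict:
    """Drop the context, which is not present in the output dataset.

    The Answer column is not stored here either: it is produced later,
    during evaluation.
    """
    row = asdict(item)
    row.pop("context")
    return row


item = EvalItem(
    context="Injection slides down to the third position.",
    question="Which position does Injection hold?",
    ground_truth="The third position.",
)
print(to_output_row(item))
```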

## Code

In [23]:
import os
import json
import time
import textwrap
import random
from json import JSONDecodeError

import pandas as pd
import google.generativeai as genai
from google.generativeai.types import HarmCategory, HarmBlockThreshold
from tqdm import tqdm
from dotenv import load_dotenv

load_dotenv()

True

### Prompts

In [2]:
# Question Generation Prompts
GEN_QUERY_PROMPT = textwrap.dedent("""
    As a question-generating assistant specializing in cybersecurity, your task is to generate simple, domain-specific questions based on the information given in a provided document, with a focus on Penetration Testing. 
    
    You will be provided with the text of a document, surrounded by input tags. Please read the document, extract relevant information, and generate a simple, clear question based on the content. The question should be a maximum of two sentences long.
    
    Your response should be in the following JSON format:
    {{"QUESTION": "Your question here."}}
    
    <input>{document}</input>
""")

In [19]:
# Question Answering (Ground Truth) Prompts
GEN_ANSWER_PROMPT = textwrap.dedent("""
    As an answer-generating assistant specializing in cybersecurity, your task is to provide accurate answers for given questions in the context of Penetration Testing. You will be provided with a question and contextual information to generate a precise and relevant answer.
    
    Your answer should be in the following JSON format:
    {{"ANSWER": "Your answer here."}}
    
    Take a deep breath and work on this problem step by step.
    
    Query:
    <input>{query}</input>
    
    Context:
    <input>{context}</input>
""")

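Both prompts are filled in with `str.format`, which is why the JSON example inside them uses doubled braces. The short sketch below, using a trimmed-down template (`TEMPLATE` and `render` are illustrative names), shows that `{{` and `}}` come out as literal braces while `{document}` is substituted.

```python
import textwrap

# Trimmed-down version of the question prompt, for illustration only
TEMPLATE = textwrap.dedent("""
    Your response should be in the following JSON format:
    {{"QUESTION": "Your question here."}}

    <input>{document}</input>
""")


def render(document: str) -> str:
    """Fill the template; doubled braces become single literal braces."""
    return TEMPLATE.format(document=document)


rendered = render("Injection slides down to the third position.")
print(rendered)
```

Braces inside the substituted document are left untouched, since `str.format` only parses the template itself.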
### Chunks

In [4]:
from src.agent.knowledge import chunk_str

In [5]:
owasp_df = pd.read_json('../../../data/json/owasp.json')
owasp_df = owasp_df[['title', 'content']]
owasp_df

|   | title | content |
|---|-------|---------|
| 0 | Broken Access Control | Moving up from the fifth position, 94% of appl... |
| 1 | Cryptographic Failures | Shifting up one position to #2, previously kno... |
| 2 | Injection | Injection slides down to the third position. 9... |
| 3 | Insecure Design | A new category for 2021 focuses on risks relat... |
| 4 | Security Misconfiguration | Moving up from #6 in the previous edition, 90%... |
| 5 | Vulnerable and Outdated Components | It was #2 from the Top 10 community survey but... |
| 6 | Identification and Authentication Failures | Previously known as Broken Authentication, thi... |
| 7 | Software and Data Integrity Failures | A new category for 2021 focuses on making assu... |
| 8 | Security Logging and Monitoring Failures | Security logging and monitoring came from the ... |
| 9 | Server Side Request Forgery (SSRF) | This category is added from the Top 10 communi... |


In [6]:
chunks = []
for idx, item in owasp_df.iterrows():
    chunks.extend(chunk_str(item.content))

In [7]:
chunks[:1]

['Moving up from the fifth position, 94% of applications were tested for some form of broken access control with the average incidence rate of 3.81%, and has the most occurrences in the contributed dataset with over 318k.']
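
The implementation of `chunk_str` lives in `src.agent.knowledge` and is not shown here. Purely as an illustration of what a chunking step can look like, the sketch below groups sentences into chunks of bounded length. `chunk_sentences` and `max_chars` are assumptions made for this example and may differ from how `chunk_str` actually works.

```python
import re


def chunk_sentences(text: str, max_chars: int = 300) -> list[str]:
    """Hypothetical stand-in for a chunker: pack whole sentences into
    chunks of at most max_chars characters (a single longer sentence
    still becomes its own chunk)."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_sentences("One. Two. Three.", max_chars=9))
```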

### Generation

In [8]:
# Gemini Setup
GEMINI_KEY = os.getenv('GEMINI_API_KEY')
genai.configure(api_key=GEMINI_KEY)

llm = genai.GenerativeModel(
    'gemini-1.5-flash',
    generation_config={"response_mime_type": "application/json"}
)

safety_settings = {
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE
}

In [25]:
def gen_data(_chunks):
    """Generate a (question, ground_truth) pair for the given chunk text."""
    # Generate Question
    gen_query = llm.generate_content(
        GEN_QUERY_PROMPT.format(document=_chunks),
        safety_settings=safety_settings
    )

    # Fall back to the raw text if the reply is not the expected JSON object
    try:
        question = json.loads(gen_query.text)['QUESTION']
    except (JSONDecodeError, KeyError):
        question = gen_query.text

    # Generate Ground Truth
    gen_answer = llm.generate_content(
        GEN_ANSWER_PROMPT.format(query=question, context=_chunks),
        safety_settings=safety_settings
    )

    try:
        answer = json.loads(gen_answer.text)['ANSWER']
    except (JSONDecodeError, KeyError):
        answer = gen_answer.text

    return question, answer
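
The two parsing steps in `gen_data` follow the same pattern, so they could be factored into a single helper. The sketch below is one possible version; `extract_field` is a hypothetical name, and it also falls back to the raw text when the reply is valid JSON but is not an object holding a string under the expected key.

```python
import json


def extract_field(raw: str, key: str) -> str:
    """Return payload[key] when raw is a JSON object with a string at key;
    otherwise return the raw text so it can be repaired later."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return raw.strip()
    if isinstance(payload, dict) and isinstance(payload.get(key), str):
        return payload[key]
    return raw.strip()


print(extract_field('{"QUESTION": "What is SSRF?"}', "QUESTION"))
print(extract_field('{"QUESTION": "What is SS', "QUESTION"))
```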

In [27]:
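dataset_size = 50
data = []
i = 3  # countdown used to pause after every 3 generated items

for _ in tqdm(range(dataset_size), total=dataset_size, desc='Generating q&a'):
    # Pick one chunk at random from a window of 1 to 3 consecutive chunks
    start = random.choice(range(len(chunks) - 3))
    n = random.choice([1, 2, 3])
    chosen_chunks = random.choice(chunks[start:start+n])
    
    # On failure (e.g. rate limiting), wait and retry once;
    # a second failure is not caught and stops the loop
    try:
        q, a = gen_data(chosen_chunks)
    except Exception:
        time.sleep(20)
        q, a = gen_data(chosen_chunks)
    
    # Throttle: sleep 6 seconds after every 3 items
    i -= 1
    if i == 0:
        i = 3
        time.sleep(6)
        
    data.append({
        'context': chosen_chunks,
        'question': q,
        'ground_truth': a
    })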

Generating q&a:  64%|██████▍   | 32/50 [03:44<02:06,  7.02s/it]


ResourceExhausted: 429 Resource has been exhausted (e.g. check quota).
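
The run above stopped because the retry inside the loop is attempted only once and is not guarded itself. A common way to handle quota errors is to retry several times with exponentially growing delays. The sketch below shows the idea with a hypothetical `with_retries` helper; the attempt count and delays are assumptions, and like the notebook it catches `Exception` broadly, where in practice one would likely narrow it to the quota error.

```python
import time


def with_retries(fn, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on any exception with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between attempts and re-raises the
    last exception once max_attempts is reached.
    """
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Example with a function that fails twice before succeeding
calls = {"n": 0}


def flaky(x):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Resource has been exhausted")
    return x * 2


delays = []
result = with_retries(flaky, 21, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

In the generation loop this would replace the `try`/`except` block with something like `q, a = with_retries(gen_data, chosen_chunks, base_delay=20)`.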

In [28]:
data

[{'context': 'Injection slides down to the third position. 94% of the applications were tested for some form of injection with a max incidence rate of 19%, an average incidence rate of 3%, and 274k occurrences.',
  'question': 'While injection vulnerabilities are still prevalent, their occurrence has decreased, with only 19% of applications showing the highest incidence rate. What factors might have contributed to this decrease in injection vulnerabilities?',
  'ground_truth': "The decrease in injection vulnerabilities could be attributed to several factors:\n\n* **Increased Awareness and Education:**  The cybersecurity community has become more aware of injection vulnerabilities, leading to better understanding and practices among developers. Educational resources and training programs have helped raise awareness and promote secure coding practices.\n* **Improved Development Tools and Frameworks:** Modern development tools and frameworks often incorporate security features that help m

In [10]:
# Repair items where gen_data stored the raw reply because the JSON was
# truncated before its closing brace
for item in data:
    q: str = item['question']
    a: str = item['ground_truth']
    
    if q.startswith('{') and not q.endswith('}'):
        q += '}'
        q = json.loads(q)["QUESTION"]
    if a.startswith('{') and not a.endswith('}'):
        # Close the string value first if its closing quote is missing too
        if not a.strip().endswith('"'):
            a += '"}'
        else:
            a += '}'
        a = a.replace('\n', '')
        a = json.loads(a)["ANSWER"]
    
    item['question'] = q
    item['ground_truth'] = a

In [11]:
output = pd.DataFrame(data)[['question', 'ground_truth']]
output

|    | question | ground_truth |
|----|----------|--------------|
| 0  | What are the most common vulnerabilities found... | According to the OWASP Cheat Sheet, the most c... |
| 1  | How can a patch management process be implemen... | A patch management process can be implemented ... |
| 2  | What are some example exploitable component vu... | Some example exploitable component vulnerabili... |
| 3  | What are some common authentication weaknesses... | CWE-297: Improper Validation of Certificate w... |
| 4  | How can digital signatures or similar mechanis... | Implement digital signatures or similar mechan... |
| ... | ... | ... |
| 95 | How can an attacker take over an application b... | The context does not provide sufficient inform... |
| 96 | What security vulnerabilities were exploited b... | The context does not provide information about... |
| 97 | Are deprecated cryptographic padding methods s... | The context does not provide information about... |
| 98 | What are some key concepts related to secure d... | Insecure design encompasses various weaknesses... |


In [12]:
output.to_json('../../../data/rag_eval/owasp_100-200.json')

In [12]:
# TODO: 
#   the generated dataset should be cleaned of:
#   items with ground_truth == 'the context does not provide sufficient information'
#   items with no groundedness: the question can't be answered from the given context
#   items where the question isn't standalone, i.e. it can't be understood without the context
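
The first item of the TODO can be approximated with a simple text filter. The sketch below is one way to do it on a list of records shaped like `data`; `UNANSWERABLE_MARKERS`, `is_answerable` and `clean` are hypothetical names, and the marker text is taken from the ground truths visible in the output above. The other two items (groundedness and standalone questions) need manual or model-based review and are not covered.

```python
# Prefixes that signal the model could not answer from the context
# (assumption: based on the ground truths seen in the generated output)
UNANSWERABLE_MARKERS = (
    "the context does not provide",
)


def is_answerable(item: dict) -> bool:
    """True when the ground truth is non-empty and not a refusal."""
    ground_truth = item["ground_truth"].strip().lower()
    return bool(ground_truth) and not any(
        ground_truth.startswith(marker) for marker in UNANSWERABLE_MARKERS
    )


def clean(items: list) -> list:
    """Keep only the records whose ground truth looks usable."""
    return [item for item in items if is_answerable(item)]


sample = [
    {"question": "What is SSRF?",
     "ground_truth": "Server Side Request Forgery."},
    {"question": "Which tool was used?",
     "ground_truth": "The context does not provide sufficient information."},
]
print(clean(sample))
```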