# Arthur Shield Test Harness

The code below can be used to perform a test of the PII, Prompt Injection, Sensitivity, Toxicity, and Hallucination detection capabilities of Arthur Shield.

### Pre-Requisites: 
- You will need to configure which checks you would like to perform within Arthur Shield. See https://shield.docs.arthur.ai/docs/rule-configuration-guide
- You will need to compile a CSV file containing test prompts for each of the rules you would like to evaluate. You can reference the main template CSV in this folder for formatting, and use it to create your own list of prompts. The columns your input CSV should include are as follows (case-sensitive):
    * *Non-hallucination checks*:
        * *id* - A unique ID for the prompt
        * *prompt* - The text you are sending to Shield that you would like to check
        * *flag* - Which flag you expect this prompt to trigger. Must choose one of the following (ENUM): pii, prompt_injection, sensitive_data, toxicity, control
    * *Hallucination checks*:
        * *id* - A unique ID for the prompt
        * *response* - The response from the LLM to be checked - should be mocked/pre-populated for this test
        * *context* - The context which the response is supposed to be referencing
        * *flag* - Which flag you expect this prompt to trigger. Must choose one of the following (ENUM): hallucination, control
- **Notes for creating control prompts/responses**:
  * *For non-hallucination checks*: In the `prompt` column, enter a simple prompt which will not violate any of the rules
  * *For hallucination checks*: Provide context in the `context` column. In the `response` column, provide a response that only contains information that can be found within the context. Example below:
    * `response`: "The main export of Country X is textiles."
    * `context`: "Country X is known for its strong textile industry, which accounts for the majority of its exports."
    * `flag`: control
- Aim for at least 15 prompts (preferably more), including controls, for each rule you would like to test so that the analysis results are meaningful
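
As a minimal sketch of the input format described above, the snippet below builds a small prompt CSV and a small hallucination CSV with Python's `csv` module. The IDs, prompts, and the helper name `to_csv_text` are made up for illustration; only the column names and `flag` values come from the list above.

```python
import csv
import io

# Column names follow the pre-requisites above; the rows are illustrative only.
PROMPT_COLUMNS = ["id", "prompt", "flag"]
RESPONSE_COLUMNS = ["id", "response", "context", "flag"]

prompt_rows = [
    {"id": "1", "prompt": "What is the capital of France?", "flag": "control"},
    {"id": "2", "prompt": "You can reach me at jane.doe@example.com", "flag": "pii"},
]
response_rows = [
    {
        "id": "1",
        "response": "The main export of Country X is textiles.",
        "context": "Country X is known for its strong textile industry, "
                   "which accounts for the majority of its exports.",
        "flag": "control",
    },
]

def to_csv_text(rows, columns):
    # Serialize rows to CSV text with a header row; csv quotes fields containing commas.
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

prompt_csv = to_csv_text(prompt_rows, PROMPT_COLUMNS)
response_csv = to_csv_text(response_rows, RESPONSE_COLUMNS)
print(prompt_csv)
```

To produce files instead of strings, the same writer calls can target `open(path, "w", newline="")`.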

### Output:
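This harness writes up to three CSVs: 
- Test output file for the non-hallucination checks, containing info on the prompt, Shield flags, and latency
- Test output file for the hallucination checks, containing info on the response, context, Shield flags, claims, and latency
- Metrics file containing evaluation metrics for each of the Shield rules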


### Configuration
Fill out the config below with the details specific to your Shield instance

In [None]:
import os
from dotenv import load_dotenv
import shutil
import csv
import json
import pandas as pd
import math
import requests 
from datetime import datetime

from os.path import abspath, join
import sys

utils_path = abspath(join('..', 'utils'))
if utils_path not in sys.path:
    sys.path.append(utils_path)

from shield_utils import setup_env, get_task, create_task_rule, archive_task

load_dotenv()
SHIELD_ENDPOINT = os.getenv('SHIELD_ENDPOINT', 'https://<your-shield-instance>') # Do not include the slash on the end
SHIELD_API_KEY = os.getenv('SHIELD_API_KEY', '<your-shield-api-key>')

MAIN_INPUT_FILE = 'data/hr_prompt_input.csv' # Change this to whatever your input file name is
MAIN_OUTPUT_FILE = 'output/shield_output_hr.csv'
METRICS_FILE = 'output/shield_test_metrics_hr.csv'

shield_headers = {
    'Authorization': f'Bearer {SHIELD_API_KEY}'
}

setup_env(base_url=SHIELD_ENDPOINT, api_key=SHIELD_API_KEY)

Start by creating a test task to hold the rules you would like to run the harness against. A newly created task starts with the default rules you have set up for your Shield instance.

In [None]:
TASK_ENDPOINT = f"{SHIELD_ENDPOINT}/api/v2/task"
payload = {"name": "test-harness"}

response = requests.post(TASK_ENDPOINT, headers=shield_headers, json=payload).json()
task_id = response["id"]

response

Review the response output from the previous cell to see which rules are associated with the task you just created. If you would like to delete those rules and/or add additional rules for testing, you can use the cells below:

#### Delete existing rule(s)

In [None]:
# Delete one rule from the test-harness task
def delete_one_rule(task_id, rule_id):
    endpoint = f"{SHIELD_ENDPOINT}/api/v2/tasks/{task_id}/rules/{rule_id}"
    return requests.delete(endpoint, headers=shield_headers)

# Delete all rules from the test-harness task
def delete_all_rules(task_id):
    get_endpoint = f"{SHIELD_ENDPOINT}/api/v2/tasks/{task_id}"
    response = requests.get(get_endpoint, headers=shield_headers).json()
    rule_ids=[]
    for rule in response["rules"]:
        rule_ids.append(rule["id"])
    for rule_id in rule_ids:
        delete_one_rule(task_id, rule_id)

# Run either of the above functions to delete whichever rules you'd like to delete
#delete_one_rule(task_id, <insert_rule_id>)
delete_all_rules(task_id)

# See the updated task configuration
get_task(task_id)

#### Add new rules 
See https://shield.docs.arthur.ai/docs/rule-configuration-guide for rule setup instructions. Templates are provided below, but you can modify each of these rules as desired.

In [None]:
pii = {
  "name": "Test PII Rule",
  "type": "PIIDataRule",
  "apply_to_prompt": True,
  "apply_to_response": False,
  "config": {}
}
prompt_injection = {
  "name": "Test Prompt Injection Rule",
  "type": "PromptInjectionRule",
  "apply_to_prompt": True,
  "apply_to_response": False
}
toxicity = {
  "name": "Test Toxicity Rule",
  "type": "ToxicityRule",
  "apply_to_prompt": True,
  "apply_to_response": False
}
hallucination = {
  "name": "Test HallucinationV2 Rule",
  "type": "ModelHallucinationRuleV2",
  "apply_to_prompt": False,
  "apply_to_response": True
}
# You can reconfigure the sensitive data rule to your needs by providing positive/negative examples as shown below
sensitive_data = { 
    "name": "Test Sensitive Data Password Rule",
    "type": "ModelSensitiveDataRule",
    "apply_to_prompt": True,
    "apply_to_response": False,
    "config": {
        "examples": [
            {
                "example": "Hi, Your password 5d9kpQDXRiVf for our online therapy portal has been set successfully.",
                "result": True
            },
            {
                "example": "Mr. Conroy's password, h78WwcpT3s6B has leaked online.",
                "result": True
            },
            {
                "example": "During this time, your login credentials Remington31 and VXBDni7EO5od will be reset..",
                "result": True
            },
            {
                "example": "Hey Agnes, your login details for the upcoming clinical trial are as follows: IP: 73ac:d53c:f2bb:94a5:86f3:1dca:e1ec:2b4e, Password: WlBi79j58UsW.",
                "result": True
            },
            {
                "example": "Can't access database for financial report. Hassie34's password last updated to mWGhCVGWuB8o.",
                "result": True
            },
            {
                "example": "Linnea_OConnell, please submit your final Psychopharmacology thesis for review by 11/32.",
                "result": False
            },
            {
                "example": "Dear client, thank you for your payment of NT$200 for our acupuncture services.",
                "result": False
            },
            {
                "example": "Surely, Nolan and Sons would not want to restrict its healthcare services to people less than 91 years old years of age in the 40753 region?",
                "result": False
            },
            {
                "example": "The name tag for the new staff member, Angel has arrived. You will find it in the Apt. 634.",
                "result": False
            },
            {
                "example": "Your password has been reset.",
                "result": False
            }
        ]
    }
}

Once you've configured the rule settings, choose which rules you would like to test and add them to the task:

In [None]:
rules_to_test = [pii, prompt_injection, sensitive_data, toxicity, hallucination]

for rule in rules_to_test:
    create_task_rule(task_id, rule)

In [None]:
get_task(task_id)

## i. Non-Hallucination Checks

### Support Functions

Shield's response schema can be found in the API docs: https://\<your-shield-instance>/docs#/Default%20Validation/default_validate_prompt_api_v2_validate_prompt_post
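
The support functions in this section read only a few fields of that response. The mock below is a hypothetical, trimmed-down payload containing just those fields (`rule_results`, `rule_type`, `result`, and `details.pii_entities`); all values are invented, and the schema in your instance's API docs remains authoritative. The helper `failed_flags` is an illustrative name, not part of the harness.

```python
# Hypothetical, trimmed-down validate_prompt response; all values are invented.
mock_response = {
    "inference_id": "00000000-0000-0000-0000-000000000000",
    "rule_results": [
        {
            "rule_type": "PIIDataRule",
            "result": "Fail",
            "details": {
                "pii_entities": [
                    {"entity": "EMAIL_ADDRESS", "span": "jane.doe@example.com"}
                ]
            },
        },
        {"rule_type": "ToxicityRule", "result": "Pass"},
    ],
}

# Rule types mapped to the flag names used in the input CSV.
RULE_TYPE_TO_FLAG = {
    "PIIDataRule": "pii",
    "PromptInjectionRule": "prompt_injection",
    "ToxicityRule": "toxicity",
    "ModelSensitiveDataRule": "sensitive_data",
}

def failed_flags(resp_dict):
    # Collect one flag per rule type whose result is "Fail".
    flags = []
    for rule in resp_dict["rule_results"]:
        flag = RULE_TYPE_TO_FLAG.get(rule["rule_type"])
        if rule["result"] == "Fail" and flag and flag not in flags:
            flags.append(flag)
    return flags

print(failed_flags(mock_response))
```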

In [None]:
SHIELD_VAL_ENDPOINT = f"{SHIELD_ENDPOINT}/api/v2/tasks/{task_id}/validate_prompt"

def get_firewall_flags(resp_dict):
    # Determines which rules (PII, Prompt Injection, Toxicity, Sensitivity) were flagged
    flags = []
    for rule in resp_dict["rule_results"]:
        if rule["result"] == "Fail":
            if rule["rule_type"] in ["ModelHallucinationRuleV2", "ModelHallucinationRuleV3"] and "hallucination" not in flags:
                # Only used for responses - does not check for other flags
                flags.append("hallucination")
                break
            elif rule["rule_type"] == "PIIDataRule" and "pii" not in flags:
                flags.append("pii")
            elif rule["rule_type"] == "PromptInjectionRule" and "prompt_injection" not in flags:
                flags.append("prompt_injection")
            elif rule["rule_type"] == "ToxicityRule" and "toxicity" not in flags:
                flags.append("toxicity")
            elif rule["rule_type"] == "ModelSensitiveDataRule" and "sensitive_data" not in flags:
                flags.append("sensitive_data")
    return flags

def get_gt_output_match(row):
    # Returns True if GT label matches Shield flags and False otherwise
    if row["ground_truth"] == "control":
        # A control passes only when Shield raised no flags at all
        return not row["shield_response"]
    return row["ground_truth"] in row["shield_response"]
    

def get_pii_triggers(shield_response):
    # Isolate the text which triggered the PII rule to fire
    output = []
    for rule in shield_response["rule_results"]:
        if rule["rule_type"]== "PIIDataRule" and rule["result"]=="Fail":
            for flag in rule["details"]["pii_entities"]:
                output.append({
                    "string_trigger": flag["span"],
                    "flag_type": flag["entity"]
                })
    return output

def get_shield_response(row):
    # Calls Shield API and updates the row dictionary based upon the results from Shield
    shield_start = datetime.now()
    response = requests.post(SHIELD_VAL_ENDPOINT, headers=shield_headers, json={"prompt": row["prompt"]})
    shield_end = datetime.now()
    # total_seconds() covers the full elapsed time; .microseconds alone drops whole seconds
    row["latency_ms"] = int((shield_end - shield_start).total_seconds() * 1000)
    resp_dict = response.json()  # Parse the body once and reuse it
    row["shield_response"] = get_firewall_flags(resp_dict)
    row["gt_output_match"] = get_gt_output_match(row)
    row["sub_flags"] = get_pii_triggers(resp_dict) if "pii" in row["shield_response"] else None
    return row


### Generate Output File

The output file generated will contain: 
  * *id* - The ID of the corresponding input file prompt
  * *prompt* - The prompt that was passed
  * *ground_truth* - The flag that was expected
  * *shield_response* - The list of flags that Shield raised (empty if none)
  * *gt_output_match* - Whether the ground truth label matches the flags that Shield raised (True or False)
  * *sub_flags* - (For PII) Which text spans were flagged and which PII entity type was raised
  * *latency_ms* - Latency of Shield call in ms
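
Because the harness writes rows with `csv.DictWriter`, list-valued columns such as `shield_response` and `sub_flags` end up stored as their Python string form (for example `['pii']`), and `None` is written as an empty cell. A minimal sketch, assuming you want real lists back when post-processing the file, is to parse those cells with `ast.literal_eval`. The sample CSV text and the helper name `parse_list_cell` are invented for illustration.

```python
import ast
import csv
import io

# Invented sample in the shape of the output file described above.
sample_csv = (
    "id,prompt,ground_truth,shield_response,gt_output_match,sub_flags,latency_ms\r\n"
    "1,Hello there,control,[],True,,120\r\n"
    "2,Email me at jane.doe@example.com,pii,['pii'],True,"
    "\"[{'string_trigger': 'jane.doe@example.com', 'flag_type': 'EMAIL_ADDRESS'}]\",95\r\n"
)

def parse_list_cell(cell):
    # Empty cells (written for None) become an empty list.
    return ast.literal_eval(cell) if cell else []

rows = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    row["shield_response"] = parse_list_cell(row["shield_response"])
    row["sub_flags"] = parse_list_cell(row["sub_flags"])
    rows.append(row)

print(rows[1]["shield_response"])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice here than `eval`.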

In [None]:
output_folder_name = "output"
if os.path.exists(output_folder_name):
    shutil.rmtree(output_folder_name)
os.makedirs(output_folder_name)

with open(MAIN_INPUT_FILE, newline='') as infile:
    reader= csv.DictReader(infile)
    with open(MAIN_OUTPUT_FILE, 'w', newline='') as outfile:
        # List of columns for the output file
        fieldnames = [
            "id",
            "prompt",
            "ground_truth",
            "shield_response",
            "gt_output_match",
            "sub_flags",
            "latency_ms"
        ]
        writer= csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        
        for row in reader:
            out_row = {
                "id": row["id"],
                "ground_truth": row["flag"],
                "prompt": row["prompt"]
                }

            out_row = get_shield_response(out_row)
            writer.writerow(out_row)
            

In [None]:
output_df = pd.read_csv(MAIN_OUTPUT_FILE)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
output_df

## ii. Hallucination Checks

The hallucination checks hit a different endpoint and require a slightly different input format, so we run those checks separately here.

In [None]:
HALLUCINATION_INPUT = 'data/hr_response_input.csv' # Change this to whatever your input file name is 
HALLUCINATION_OUTPUT = 'output/shield_output_hallucination_hr.csv'

SHIELD_RESP_ENDPOINT = f"{SHIELD_ENDPOINT}/api/v2/tasks/{task_id}/validate_response/"

### Support Functions

In [None]:
def get_claims(response):
    # Extracts the claims from the hallucination rule result, if there is one
    for rule in response["rule_results"]:
        if rule["rule_type"] not in ["ModelHallucinationRuleV2", "ModelHallucinationRuleV3"]:
            continue
        if rule["result"] == "Pass":
            return []
        return rule["details"]["claims"]
    # No hallucination rule was present in the response
    return []

def get_inference_id(prompt):
    # Conduct an initial Shield call to get an inference ID for the response endpoint
    response = (requests.post(
        SHIELD_VAL_ENDPOINT, 
        headers=shield_headers, 
        json={"prompt": prompt})).json()
    return str(response["inference_id"])
        
def get_hallucination_shield_response(row):
    # Calls Shield API and updates the row dictionary based upon the results from Shield
    # A placeholder prompt ("null") is validated first, only to obtain an inference ID
    resp_endpoint = SHIELD_RESP_ENDPOINT + get_inference_id("null")
    shield_start = datetime.now()
    response = requests.post(
        resp_endpoint, 
        headers=shield_headers, 
        json={"response": row["llm_response"], "context": row["context"]}
    )
    shield_end = datetime.now()
    # total_seconds() covers the full elapsed time; .microseconds alone drops whole seconds
    row["latency_ms"] = int((shield_end - shield_start).total_seconds() * 1000)
    resp_dict = response.json()  # Parse the body once and reuse it
    row["shield_response"] = get_firewall_flags(resp_dict)
    row["claims"] = get_claims(resp_dict)
    return row
    

### Generate Output File

In [None]:
with open(HALLUCINATION_INPUT, newline='') as infile:
    reader= csv.DictReader(infile)
    with open(HALLUCINATION_OUTPUT, 'w', newline='') as outfile:
        # List of columns for the output file
        fieldnames = [
            "id",
            "context",
            "llm_response",
            "ground_truth",
            "shield_response",
            "claims",
            "latency_ms"
        ]
        writer= csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        
        for row in reader:
            out_row = {
                "id": row["id"],
                "llm_response": row["response"],
                "context": row["context"],
                "ground_truth": row["flag"]
                }

            out_row = get_hallucination_shield_response(out_row)
            writer.writerow(out_row)

In [None]:
output_df = pd.read_csv(HALLUCINATION_OUTPUT)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
output_df

## iii. Analysis

The metrics file generated will contain:
  * True Positive Count
  * False Positive Count
  * Precision
  * Recall
  * Specificity
  * Miss Rate
  * False Positive Rate
  * F1 Score
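
For reference, all of these metrics derive from the four confusion-matrix counts (true/false positives and negatives). The sketch below uses the standard definitions with invented counts; the names `safe_div` and `derive_metrics` are illustrative and are not part of the harness, which writes "N/A" rather than `None` for undefined values.

```python
def safe_div(numerator, denominator):
    # Return None instead of raising when a metric is undefined.
    return numerator / denominator if denominator else None

def derive_metrics(tp, fp, fn, tn):
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = None
    if precision is not None and recall is not None:
        f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "miss_rate": safe_div(fn, fn + tp),
        "false_positive_rate": safe_div(fp, fp + tn),
        "f1": f1,
    }

# Invented counts: 15 prompts expected to be flagged and 15 controls.
example = derive_metrics(tp=12, fp=3, fn=3, tn=12)
print(example)
```

Note that miss rate is `1 - recall` and false positive rate is `1 - specificity`, so each pair always sums to 1 when defined.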

In [None]:
def round_row(data_row, precision=3):
    return [round(item, precision) if isinstance(item, (float, int)) else item for item in data_row]

def create_metrics_dict(metric_name):
    if metric_name == "hallucination":
        hdf= pd.read_csv(HALLUCINATION_OUTPUT)
        metrics = {
            "tp": hdf[
            (hdf["ground_truth"] == "hallucination") & (hdf['shield_response'].apply(lambda x: metric_name in x))
            ].shape[0],
            "fp": hdf[
            (hdf["ground_truth"] == "control") & (hdf['shield_response'].apply(lambda x: metric_name in x))
            ].shape[0],
            "fn": hdf[
            (hdf["ground_truth"] == "hallucination") & (hdf['shield_response'].apply(lambda x: metric_name not in x))
            ].shape[0],
            "tn": hdf[
            (hdf["ground_truth"] == "control") & (hdf['shield_response'].apply(lambda x: metric_name not in x))
            ].shape[0]
        }
    else:
        df = pd.read_csv(MAIN_OUTPUT_FILE)
        metrics = {
            "tp": df[(df['gt_output_match']==True) & (df['ground_truth']==metric_name)].shape[0],
            "fp": df[(df['ground_truth']!=metric_name) & (df['shield_response'].apply(lambda x: metric_name in x))].shape[0],
            "fn": df[(df['ground_truth']==metric_name) & (df['gt_output_match']==False)].shape[0],
            "tn": df[(df['ground_truth']!=metric_name) & (df['shield_response'].apply(lambda x: metric_name not in x))].shape[0]
        }
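    try:
        metrics["prec"] = metrics["tp"]/(metrics["tp"]+metrics["fp"])
    except ZeroDivisionError:
        # No positive predictions, so precision is undefined
        metrics["prec"] = "N/A"
    try:
        metrics["recall"] = metrics["tp"]/(metrics["tp"]+metrics["fn"])
    except ZeroDivisionError:
        # No ground-truth positives, so recall is undefined
        metrics["recall"] = "N/A"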
    return metrics

def run_analysis(checks=None):
    # Runs analysis on the selected checks. Runs on all by default
    if checks is None:
        checks = ["pii", "prompt_injection", "toxicity", "sensitive_data", "hallucination"]
    with open(METRICS_FILE, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        metrics = [create_metrics_dict(check) for check in checks]
        writer.writerow([""] + list(checks))
    
        # True Positive Count
        tpc= ["True Positive Count"]
        for metric in metrics:
            tpc.append(metric["tp"])
        writer.writerow(round_row(tpc))
    
        # False Positive Count
        fpc= ["False Positive Count"]
        for metric in metrics:
            fpc.append(metric["fp"])
        writer.writerow(round_row(fpc))
                        
        # Precision
        prec = ["Precision"]
        for metric in metrics:
            prec.append(metric["prec"])
        writer.writerow(round_row(prec))
        
        # Recall
        recall = ["Recall"]
        for metric in metrics:
            recall.append(metric["recall"])
        writer.writerow(round_row(recall))

        # Specificity
        spec = ["Specificity"]
        for metric in metrics:
            try:
                spec.append(metric["tn"]/(metric["tn"]+metric["fp"]))
            except ZeroDivisionError:
                spec.append("N/A")
        writer.writerow(round_row(spec))
    
        # Miss Rate
        miss = ["Miss Rate"]
        for metric in metrics:
            try:
                miss.append(metric["fn"]/(metric["fn"]+metric["tp"]))
            except ZeroDivisionError:
                miss.append("N/A")
        writer.writerow(round_row(miss))
        
        # False Positive Rate
        fpr = ["False Positive Rate"]
        for metric in metrics:
            try:
                fpr.append(metric["fp"]/(metric["fp"]+metric["tn"]))
            except ZeroDivisionError:
                fpr.append("N/A")
        writer.writerow(round_row(fpr))
    
        # F1 Score
        f1 = ["F1 Score"]
        for metric in metrics:
            try:
                f1.append((2*metric["prec"]*metric["recall"])/(metric["prec"]+metric["recall"]))
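            except (TypeError, ZeroDivisionError):
                # TypeError: precision or recall is "N/A"; ZeroDivisionError: both are 0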
                f1.append("N/A")
        writer.writerow(round_row(f1))


The cell below will run analysis on all of the flags included in the list (the list can contain "pii", "prompt_injection", "toxicity", "sensitive_data", and/or "hallucination"). **If any of these rules is not enabled in the Shield instance you are evaluating, omit its flag from the input list to avoid generating errors.**

In [None]:
run_analysis([
    "pii",
    "prompt_injection",
    "toxicity",
    "sensitive_data",
    "hallucination"
])

## iv. Cleanup

Once you've finished testing, you can use the following cell to archive the test task:

In [None]:
archive_task(task_id)