# Hallucination Rule Demonstration Notebook (V2)

This notebook demonstrates the Shield Hallucination Rule V2 by walking through the steps listed below.

**Check Outputs:**
- Hallucination detected.
- No hallucination detected. 
- Not evaluated for hallucination.

**Prerequisites:** 
- Shield environment URL
- API key

**Steps**
1. Create a new task for Hallucination evaluation 
2. Arthur Benchmark dataset evaluation 
   1. Run the examples against a pre-configured Shield task from Step 1 
   2. View our results 
3. Load an additional (non-benchmark) dataset to demonstrate the 'Not evaluated for hallucination' label

#### Configure Shield Test Env Details

In [None]:
from datetime import datetime
import pandas as pd
from os.path import abspath, join
import sys

utils_path = abspath(join('..', 'utils'))
if utils_path not in sys.path:
    sys.path.append(utils_path)

from shield_utils import setup_env, set_up_task_and_rule, task_prompt_validation, task_response_validation, archive_task

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

setup_env(base_url="<URL>", api_key="<API_KEY>")

---
### 1. Setup: Configure a test task and enable the Hallucination Rule

In [None]:
hallucination_rule_config =  {
    "name": "Hallucination Rule",
    "type": "ModelHallucinationRuleV2",
    "apply_to_prompt": False,
    "apply_to_response": True
}

# Create task, archive all rules except the one we pass, create the rule we pass 
hallucination_rule, hallucination_task = set_up_task_and_rule(hallucination_rule_config, "hallucination-task-example-notebook")

print(hallucination_rule)
print(hallucination_task)

---
### 2. Arthur benchmark dataset evaluation

## Description

### Label for each claim

Hallucination detection in Shield involves labeling each claim within an LLM response as valid or invalid. Behind the scenes, we use our own custom parser to convert text into claims, such that one claim typically corresponds to one sentence or list item within the LLM response.

The label `False` for a claim indicates that the claim is not supported by the context. It does NOT mean the claim is false on its own - true claims are considered hallucinations if they lack evidence in the provided context.
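
To make the claim-level labeling concrete, here is a minimal sketch that splits a response into one claim per sentence or list item and pairs each claim with a label. The splitter `naive_split_claims` is a hypothetical stand-in, not Shield's parser, and the labels are assigned by hand for illustration rather than returned by Shield.

```python
import re

def naive_split_claims(text):
    """Split a response into rough claims: one per list item or sentence.

    Hypothetical illustration only; Shield uses its own custom parser.
    """
    claims = []
    for line in text.splitlines():
        # Drop a leading bullet ("-", "*") or numbered-list marker ("1.", "2)")
        line = re.sub(r"^\s*(?:[-*]|\d+[.)])\s+", "", line).strip()
        if not line:
            continue
        # Split on sentence-ending punctuation followed by whitespace
        claims.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", line) if s.strip())
    return claims

context = "Paris is the capital of France. The Seine runs through Paris."
response = (
    "Paris is the capital of France. It has 90 million residents.\n"
    "- The Seine runs through it"
)

claims = naive_split_claims(response)

# Hand-assigned labels: False means "not supported by the context",
# regardless of whether the claim happens to be true in the real world.
labels = [True, False, True]

for claim, label in zip(claims, labels):
    print(f"{label!s:<5} | {claim}")
```

The second claim is labeled `False` because nothing in the context mentions a population figure; the label says nothing about whether the figure is correct.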

### Datasets chosen
The two datasets we use here to benchmark the hallucination check are based on public datasets available on HuggingFace.

Each row contains a context, an llm_response, and a binary_label for each claim in the LLM response (the binary_label for each row is a string of one or more labels, one label for each claim). All other columns can be ignored.

The first is based on the [Dolly dataset by Databricks](https://huggingface.co/datasets/databricks/databricks-dolly-15k) - this dataset was compiled with LLM responses based on user-provided contexts made by Databricks employees in early 2023. We have extended the dataset with examples of hallucinations (synthetically generated by gpt-3.5-turbo).

The second is based on the [WikiBio GPT-3 Hallucination dataset by the University of Cambridge](https://huggingface.co/datasets/potsawee/wiki_bio_gpt3_hallucination) - this dataset was compiled by prompting GPT-3 to write biographies, many of which contain hallucinated facts.

## Preprocessing

First, we load the two benchmark datasets, each of which has 1 or more labels per response (for each 'claim' within the response).

The hallucination check returns a label for each claim. 

We format the labels here to be consistent across datasets: comma-separated with no list markers like `[` or `]`
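
As a small illustration of the target format, the sketch below normalizes a single raw label string. The helper name `normalize_labels` and the sample values are assumptions made for this example; they do not come from the benchmark files.

```python
def normalize_labels(raw):
    """Remove list brackets and quotes so labels read like 'True, False'."""
    # str() lets the helper accept a bare boolean as well as a string
    return str(raw).translate(str.maketrans("", "", "[]'"))

raw_examples = ["['True', 'False']", "['True']", True]
normalized = [normalize_labels(x) for x in raw_examples]
print(normalized)
```

A helper like this could be applied to a whole column with `df[c].map(normalize_labels)`.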

In [None]:
df_names = ['dolly', 'wikibio']
dfs = [pd.read_csv(f"./arthur_benchmark_datasets/hallucination_{name}.csv") for name in df_names]
for df in dfs:
    for c in ['label', 'binary_label']:
        if c in df.columns:
            # Strip quotes and list brackets in one vectorized pass;
            # astype(str) guards against cells that pandas parsed as non-strings
            df[c] = df[c].astype(str).str.replace(r"[\[\]']", "", regex=True)

In [None]:
dfs[0].head(2)

In [None]:
dfs[1].head(2)

#### 2.1  Run the examples against a pre-configured Shield task from Step 1 

First let's validate that the task was created correctly:

In [None]:
rules = hallucination_task["rules"]
# Exactly one rule must be enabled, and it must be the V2 hallucination rule
if len(rules) != 1:
    raise Exception("Exactly one rule must be enabled for this test.")
if rules[0]["type"] != "ModelHallucinationRuleV2":
    raise Exception("Invalid rule type enabled. Must be ModelHallucinationRuleV2.")
print(f"Valid task {hallucination_task}")
task_id = hallucination_task["id"]
rule_id = hallucination_rule["id"]

Here we define two evaluation functions:

`shield_hallucination_evaluation` returns two comma-separated strings for a single row: the labels that Shield returns for each claim in the response, and the reasons given for those labels.

`daily_shield_hallucination_benchmark` runs `shield_hallucination_evaluation` on every row and saves the results to a new CSV file stamped with today's date.
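
The sketch below shows how per-claim results can be flattened into the comma-separated label string, using a mocked rule result. The `details` / `claims` / `valid` / `reason` keys mirror the ones the notebook code reads; the `id`, the reason texts, and the helper name `summarize_claims` are hypothetical, and a real Shield response may contain additional fields.

```python
# Mocked rule result; values are invented for illustration
mock_rule_result = {
    "id": "hypothetical-rule-id",
    "details": {
        "claims": [
            {"valid": True, "reason": "Supported by the context."},
            {"valid": False, "reason": "No supporting evidence in the context."},
        ]
    },
}

def summarize_claims(rule_result):
    """Flatten per-claim results into a label string and a reason string."""
    claims = rule_result["details"]["claims"]
    labels = ", ".join(str(c["valid"]) for c in claims)
    # Reasons can contain commas, so this sketch joins them with " | " instead
    reasons = " | ".join(str(c["reason"]) for c in claims)
    return labels, reasons

labels, reasons = summarize_claims(mock_rule_result)
print(labels)
print(reasons)
```

The label string has the same `True, False` shape as the `binary_label` column, which is what allows the two to be compared claim by claim later on.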

In [None]:
def shield_hallucination_evaluation(row, task_id, rule_id, verbose=True):
    """Returns the per-claim results of the Shield hallucination rule for one row

    Args:
        row: row from pandas dataframe with columns `llm_response` and `context`
        task_id: str, UUID for the hallucination task created above
        rule_id: str, UUID for the hallucination rule created above
        verbose: bool, print the row and the resulting labels
    Returns:
        (result_string, result_explanations): comma-separated strings of the labels
        and of the reasons that Shield returns for each claim in the response
    """
    error_result = "Error when running."
    # Send a placeholder prompt to obtain the inference_id passed to response validation
    shield_prompt_inference = task_prompt_validation("dummy", 1, task_id)
    inference_id = shield_prompt_inference["inference_id"]
    if verbose: print("===============\n", row)
    shield_result = task_response_validation(row.llm_response, row.context, inference_id, task_id)
    for rule_result in shield_result["rule_results"]:
        if rule_result["id"] == rule_id:
            try:
                claims = rule_result['details']['claims']
                result_string = str([x['valid'] for x in claims]).replace("[","").replace("]","")
                result_explanations = str([x['reason'] for x in claims]).replace("[","").replace("]","")
            except Exception:
                result_string = error_result
                result_explanations = error_result
            if verbose: print(result_string)
            return result_string, result_explanations
    # No matching rule result: return the error marker so callers can still unpack a pair
    return error_result, error_result

def daily_shield_hallucination_benchmark(dfs, df_names, task_id, rule_id, verbose=True, n_test=None):
    """Run the benchmark datasets and save them with Shield hallucination results as new CSVs marked with today's date

    Args:
        dfs: list of pandas dataframes with columns `llm_response` and `context`
        df_names: list of str, dataset names used in the output filenames
        task_id: str, UUID for the hallucination task created above
        rule_id: str, UUID for the hallucination rule created above
        verbose: bool, passed through to `shield_hallucination_evaluation`
        n_test: int or None, if set, only the first n_test rows of each dataset are evaluated
    """
    current_date = datetime.now().strftime("%Y-%m-%d")
    for df, name in zip(dfs, df_names):
        if n_test:
            # Copy the slice so the new columns are not assigned to a view of the original dataframe
            df = df.head(n_test).copy()
        print("Benchmarking on dataset", name)
        shield_results = [shield_hallucination_evaluation(row, task_id, rule_id, verbose=verbose) for _, row in df.iterrows()]
        df["shield_result"] = [x[0] for x in shield_results]
        df["shield_reason"] = [x[1] for x in shield_results]
        df.to_csv(f"./results/hallucination_v2_benchmark_{name}_{current_date}.csv")

Now we run the benchmark function, which saves the results we analyze below.

**TO RUN THE ENTIRE DATASET EVALUATION, REMOVE THE FIRST LINE AND UNCOMMENT THE SECOND - IT WILL TAKE TIME AND CONSUME AOAI CREDITS**

In [None]:
daily_shield_hallucination_benchmark(dfs, df_names, task_id, rule_id, n_test=5)
# daily_shield_hallucination_benchmark(dfs, df_names, task_id, rule_id)

#### 2.2 Analyze Results

Since we get multiple labels per response (for each claim in the response), we can measure performance both at response level and at the claim level.

Response-level performance looks only at the overall Shield result: did the Shield rule fail or not? Was there a hallucination *somewhere* in the LLM response? Perfect response-level performance does not require labeling every individual claim correctly. It only requires that the Shield rule fails exactly when a hallucination is present anywhere in the LLM response.

Claim-level performance is more strict - which claims did the shield result label as valid or invalid? Are those the claims that are actually faithful to the context?

You can typically expect response-level performance to be stronger than claim-level performance.
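
The difference between the two levels can be seen on a toy example. The sketch below uses three made-up rows of ground-truth and predicted label strings (not taken from the benchmark datasets) and a small pure-Python helper `prf`, introduced here for illustration, in place of scikit-learn. As in the rest of this notebook, a `False` label is treated as the positive class (a hallucination).

```python
def prf(y_true, y_pred):
    """Precision, recall, and F1 for boolean lists, with True as the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def is_hallucination_flags(label_string):
    """Convert 'True, False' into [False, True] (True = claim is a hallucination)."""
    return [x.strip().lower() == "false" for x in label_string.split(",")]

# (ground-truth labels, predicted labels) for three made-up responses
rows = [
    ("True, False", "True, True"),                # hallucination missed entirely
    ("True, True", "True, True"),                 # no hallucination, none predicted
    ("False, True, False", "True, True, False"),  # one of two hallucinated claims caught
]

# Response level: a response is positive if any of its claims is a hallucination
resp_true = [any(is_hallucination_flags(t)) for t, _ in rows]
resp_pred = [any(is_hallucination_flags(p)) for _, p in rows]
response_scores = prf(resp_true, resp_pred)

# Claim level: every claim is scored individually
claim_true = [f for t, _ in rows for f in is_hallucination_flags(t)]
claim_pred = [f for _, p in rows for f in is_hallucination_flags(p)]
claim_scores = prf(claim_true, claim_pred)

print("Response level (P, R, F1):", response_scores)
print("Claim level    (P, R, F1):", claim_scores)
```

In this toy data the third response is flagged correctly at the response level even though only one of its two hallucinated claims was caught, so response-level recall (1/2) is higher than claim-level recall (1/3).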

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
def response_level_performance(df):
    """Evaluates the performance of shield hallucination labels at the response level"""
    y_true, y_pred = [], []
    for _, row in df.iterrows():
        # A response counts as hallucinated if any of its claims is labeled False;
        # str() covers single-label columns that pandas parsed as booleans
        y_true.append('false' in str(row.binary_label).lower())
        y_pred.append('false' in str(row.shield_result).lower())
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    return precision, recall, f1

def claim_level_performance(df):
    """Evaluates the performance of shield hallucination labels at the claim level"""
    y_true, y_pred = [], []
    for i, row in df.iterrows():
        if row.shield_result != "Error when running.": 
            labels = row.binary_label.split(",")
            preds = row.shield_result.split(",")
            if len(labels) == len(preds):
                y_true.extend([x.strip().lower()=='false' for x in labels])
                y_pred.extend([x.strip().lower()=='false' for x in preds])
            else:
                print("not including index", i, "(inconsistent # of labels)")
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    return precision, recall, f1

In [None]:
current_date = datetime.now().strftime("%Y-%m-%d")
for name in df_names:
    print("~~~~~~~~~~~~~~~~~~~~~~\n\n%%%", name, "%%%\n")
    result_df = pd.read_csv(f"results/hallucination_v2_benchmark_{name}_{current_date}.csv")
    p, r, f = response_level_performance(result_df)
    print(f"\n>>>>Response level hallucination detection performance:\n\nPrecision: {p}\nRecall: {r}\nF1: {f}")
    p, r, f = claim_level_performance(result_df)
    print(f"\n>>>>Claim level hallucination detection performance:\n\nPrecision: {p}\nRecall: {r}\nF1: {f}")

---
### 3. Load additional datasets to demonstrate the "Not evaluated for hallucination" label

In [None]:
more_df_names = ['python_dialogue']
more_dfs = [pd.read_csv(f"./datasets/hallucination_{name}.csv") for name in more_df_names]
for df in more_dfs:
    for c in ['label', 'binary_label']:
        if c in df.columns:
            # Strip quotes and list brackets in one vectorized pass;
            # astype(str) guards against cells that pandas parsed as non-strings
            df[c] = df[c].astype(str).str.replace(r"[\[\]']", "", regex=True)

**TO RUN THE ENTIRE DATASET EVALUATION, REMOVE THE FIRST LINE AND UNCOMMENT THE SECOND - IT WILL TAKE TIME AND CONSUME AOAI CREDITS**

In [None]:
daily_shield_hallucination_benchmark(more_dfs, more_df_names, task_id, rule_id, n_test=5)
# daily_shield_hallucination_benchmark(more_dfs, more_df_names, task_id, rule_id)

In [None]:
result_df = pd.read_csv(f"results/hallucination_v2_benchmark_python_dialogue_{current_date}.csv")

In [None]:
result_df

---

### 4. Delete Test Task

In [None]:
archive_task(hallucination_task["id"])