# Hallucination Rule Demonstration Notebook

This notebook walks through the steps below to demonstrate what the Shield Hallucination Rule can do. The examples cover three outcomes:

- Hallucination detected.
- No hallucination detected.
- Not evaluated for hallucination.

Prerequisites:
- A Shield env and API key.

1. Create a new task for Hallucination evaluation 
2. Arthur Benchmark dataset evaluation 
   1. Run the examples against a pre-configured Shield task from Step 1 
   2. View our results 
3. Additional examples evaluation using datasets referenced in our documentation: https://shield.docs.arthur.ai/docs/hallucination#benchmarks
   1. Run the examples against a pre-configured Shield task from Step 1 
   2. View our results

#### Configure Shield Test Env Details

In [169]:
# %pip install datasets
# %pip install scikit-learn
from datasets import load_dataset, concatenate_datasets
from datetime import datetime
import pandas as pd
from os.path import abspath, join
import sys
import random

utils_path = abspath(join('..', 'utils'))
if utils_path not in sys.path:
    sys.path.append(utils_path)

from shield_utils import setup_env, set_up_task_and_rule, run_shield_evaluation, task_prompt_validation, task_response_validation
from analysis_utils import print_performance_metrics, granular_result_dfs


pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

setup_env(base_url="http://127.0.0.1:8000", api_key="SuperSafe")

'Setup against: http://127.0.0.1:8000/api/v2'
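
The cell above hardcodes the base URL and API key. As a hedged alternative, the sketch below reads them from environment variables instead. `SHIELD_BASE_URL`, `SHIELD_API_KEY`, and `load_shield_settings` are hypothetical names introduced for this example; only the `base_url` and `api_key` keyword arguments come from the `setup_env` call above.

```python
import os

def load_shield_settings(env=None):
    """Read Shield connection settings from environment variables.

    SHIELD_BASE_URL and SHIELD_API_KEY are assumed names for this sketch,
    not variables defined by Shield itself.
    """
    env = os.environ if env is None else env
    base_url = env.get("SHIELD_BASE_URL", "http://127.0.0.1:8000")
    api_key = env.get("SHIELD_API_KEY")
    if not api_key:
        raise RuntimeError("Set SHIELD_API_KEY before running the notebook.")
    # Drop a trailing slash so later URL joins stay predictable
    return {"base_url": base_url.rstrip("/"), "api_key": api_key}

# Demonstrate with an explicit mapping instead of the real environment
settings = load_shield_settings({"SHIELD_API_KEY": "example-key"})
print(settings["base_url"])
```

With the variables exported, the result could be passed on as `setup_env(**load_shield_settings())`.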

---
### 1. Setup: Configure a test task and enable the Hallucination Rule

In [33]:
hallucination_rule_config =  {
    "name": "Hallucination Rule",
    "type": "ModelHallucinationRuleV2",
    "apply_to_prompt": False,
    "apply_to_response": True
}

# Create task, archive all rules except the one we pass, create the rule we pass 
hallucination_rule, hallucination_task = set_up_task_and_rule(hallucination_rule_config, "hallucination-task")

print(hallucination_rule)
print(hallucination_task)

{'id': '0d313e21-dc5a-432c-a540-dd88f5f1aa10', 'name': 'Hallucination Rule', 'type': 'ModelHallucinationRuleV2', 'apply_to_prompt': False, 'apply_to_response': True, 'enabled': True, 'scope': 'task', 'created_at': 1711970896368, 'updated_at': 1711970896368, 'config': None}
{'id': 'd9945390-5c71-460b-bd78-7741a821029b', 'name': 'hallucination-task-4786', 'created_at': 1711970896345, 'updated_at': 1711970896345, 'rules': [{'id': '0d313e21-dc5a-432c-a540-dd88f5f1aa10', 'name': 'Hallucination Rule', 'type': 'ModelHallucinationRuleV2', 'apply_to_prompt': False, 'apply_to_response': True, 'enabled': True, 'scope': 'task', 'created_at': 1711970896368, 'updated_at': 1711970896368, 'config': None}]}


---
### 2. Arthur benchmark dataset evaluation

First, we load two benchmark datasets. Each response has one or more labels, one for each 'claim' within the response.

The hallucination check returns a label for each claim. 

We format the labels here to be consistent across datasets: comma-separated with no list markers like `[` or `]`

In [141]:
df_names = ['dolly', 'wikibio']
dfs = [pd.read_csv(f"./arthur_benchmark_datasets/hallucination_{name}.csv") for name in df_names]
for df in dfs:
    for c in ['label', 'binary_label']:
        if c in df.columns:
            # Strip quotes and list brackets so labels read as "a, b, c"
            df[c] = df[c].astype(str).str.replace(r"['\[\]]", "", regex=True)
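
The label clean-up above can be checked on a single value. This is a minimal sketch: `normalize_label_string` is a hypothetical helper, and the bracketed input strings are an assumption about how the raw CSV stores its labels, inferred from the characters the cell strips.

```python
def normalize_label_string(raw):
    """Strip quotes and list brackets so labels read as 'a, b, c'."""
    return str(raw).replace("'", "").replace("[", "").replace("]", "")

# Assumed raw formats: a stringified Python list of labels
print(normalize_label_string("['accurate', 'minor_inaccurate']"))
print(normalize_label_string("[True, False]"))
```

Both datasets end up with the same comma-separated layout, which is what the later `split(",")` calls rely on.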

In [167]:
dfs[0].head(2)

Unnamed: 0,context,user_input,llm_response,binary_label,shield_result
0,"""Needles and Pins"" is a rock song credited to American writers Jack Nitzsche and Sonny Bono. Jackie DeShannon recorded it in 1963 and other versions followed. The most successful ones were by the Searchers, whose version reached No. 1 on the UK singles chart in 1964, and Smokie, who had a worldwide hit in 1977. Others who recorded the song include the Ramones, Gene Clark, and Tom Petty and the Heartbreakers with Stevie Nicks.\n\nJackie DeShannon version (1963)\nIn his autobiography, Bono states that he sang along with Nitzsche's guitar-playing, thus creating both the tune and the lyrics, being guided by the chord progressions. However, Jackie DeShannon claims that the song was written at the piano, and that she was a full participant in the song's creation, along with Nitzsche and Bono, although she did not get formal credit.\n\nDeShannon was the first to record the song; in the US it peaked at number 84 on the Billboard Hot 100 singles chart in May 1963. Though it was only a minor US hit, DeShannon's recording of the song topped the charts in Canada, hitting number one on the CHUM Chart in July 1963",Who was the first to record the Needles and Pins song?,The first artist to record Needles and Pins song was DeShannon.,True,True
1,"""Needles and Pins"" is a rock song credited to American writers Jack Nitzsche and Sonny Bono. Jackie DeShannon recorded it in 1963 and other versions followed. The most successful ones were by the Searchers, whose version reached No. 1 on the UK singles chart in 1964, and Smokie, who had a worldwide hit in 1977. Others who recorded the song include the Ramones, Gene Clark, and Tom Petty and the Heartbreakers with Stevie Nicks.\n\nJackie DeShannon version (1963)\nIn his autobiography, Bono states that he sang along with Nitzsche's guitar-playing, thus creating both the tune and the lyrics, being guided by the chord progressions. However, Jackie DeShannon claims that the song was written at the piano, and that she was a full participant in the song's creation, along with Nitzsche and Bono, although she did not get formal credit.\n\nDeShannon was the first to record the song; in the US it peaked at number 84 on the Billboard Hot 100 singles chart in May 1963. Though it was only a minor US hit, DeShannon's recording of the song topped the charts in Canada, hitting number one on the CHUM Chart in July 1963",Who was the first to record the Needles and Pins song?,The first artist to record Needles and Pins song was Jackie Edwards.,False,False


In [168]:
dfs[1].head(2)

Unnamed: 0.1,Unnamed: 0,llm_response,context,label,binary_label,shield_result
0,83,"Nodar Kumaritashvili (Georgian: ნოდარ ქუმარითაშვილი; 12 December 1988 – 12 February 2010) was a Georgian luger who died during a training run prior to the 2010 Winter Olympics in Vancouver. He was the first athlete to die in competition at the Olympic Games since the death of Danish cyclist Knud Enemark Jensen at the 1960 Summer Olympics.\n\nKumaritashvili was born in Bakuriani, Georgia, and began competing in luge in 2003. He was the Georgian national champion in 2008 and 2009, and was the 2009 Junior World Champion. He was considered a medal contender for the 2010 Winter Olympics.\n\nOn 12 February 2010, Kumaritashvili was killed during a training run at the Whistler Sliding Centre, the venue for the luge events at the 2010 Winter Olympics. He lost control of his sled at","Nodar Kumaritashvili (25 November 1988 – 12 February 2010) was a Georgian luger who suffered a fatal crash during a training run for the 2010 Winter Olympics competition in Whistler, Canada, on the day of the opening ceremony. He became the fourth athlete to have died during Winter Olympics preparations, after British luger Kazimierz Kay-Skrzypeski, Australian skier Ross Milne (both Innsbruck 1964), and Swiss speed skier Nicolas Bochatay (Albertville 1992), and the seventh athlete to die in either a Summer or Winter Olympic Games. Kumaritashvili, who first began to luge when he was 13, came from a family of seasoned lugers; a relative of his was the founder of organised sledding in Georgia, and his father competed when he was younger. A cousin of Kumaritashvili on his father's side was the head of the Georgian Luge Federation; Kumaritashvili himself began competing in the 2008–09 Luge World Cup, where he finished 55th out of 62 racers. 
Outside of luge, Kumaritashvili had been a student at the Georgian Technical University, where he earned an economics degree in 2009.","minor_inaccurate, minor_inaccurate, minor_inaccurate, minor_inaccurate, minor_inaccurate, accurate, accurate","False, False, False, False, False, True, True","True, False, True, True, True, True, False"
1,106,"Malcolm Brogdon (born December 11, 1992) is an American professional basketball player for the Indiana Pacers of the National Basketball Association (NBA). He played college basketball for the Virginia Cavaliers, where he was the ACC Player of the Year and an All-American in 2016. He was selected in the second round of the 2016 NBA draft by the Milwaukee Bucks with the 36th overall pick. Brogdon was named the NBA Rookie of the Year in 2017. He was traded to the Pacers in 2019.\n\nBrogdon is a two-time NBA All-Star and was named to the All-Defensive Second Team in 2019. He is known for his defensive prowess and his ability to shoot from long range. He is also an advocate for social justice and has been involved in several initiatives to promote racial equality.","Malcolm Moses Adams Brogdon (born December 11, 1992) is an American basketball player who currently plays for the Virginia Cavaliers men's basketball team. He was named to the All-Atlantic Coast Conference (ACC) First Team in 2014 by the league's coaches and to the Third Team by the media. Brogdon redshirted his sophomore year after suffering a serious foot injury the prior season. He was known as one of the top contributors to the team's successful 2013-14 and 2014-15 seasons. In the 2013-14 season, Brogdon averaged 12.7 points, 5.4 rebounds, and 2.7 assists per game. He is a member of the Academic Honor Roll and is currently pursuing a Master's degree in Public Policy at the Frank Batten School of Leadership and Public Policy. In 2015, he was named a consensus Second-Team All American, as well as the All-ACC First Team and ACC Co-Defensive Player of the Year. 
In July 2015, he participated in the training camp for the United States men's national basketball team, and represented the United States at the 2015 Pan American Games, where the team took the bronze medal.","accurate, accurate, accurate, accurate, accurate, minor_inaccurate, accurate, major_inaccurate","True, True, True, True, True, False, True, False","True, False, True, True, False, True, True, True"


#### 2.1  Run the examples against a pre-configured Shield task from Step 1 

First, let's validate that the task was created correctly.

In [157]:
rules = hallucination_task["rules"]
if len(rules) > 1:
    raise Exception("Cannot have more than one rule enabled for this test.")
if rules[0]["type"] != "ModelHallucinationRuleV2":
    raise Exception("Invalid rule type enabled. Must be ModelHallucinationRuleV2.")
print(f"Valid task {hallucination_task}")
task_id = hallucination_task["id"]
rule_id = hallucination_rule["id"]

Valid task {'id': 'd9945390-5c71-460b-bd78-7741a821029b', 'name': 'hallucination-task-4786', 'created_at': 1711970896345, 'updated_at': 1711970896345, 'rules': [{'id': '0d313e21-dc5a-432c-a540-dd88f5f1aa10', 'name': 'Hallucination Rule', 'type': 'ModelHallucinationRuleV2', 'apply_to_prompt': False, 'apply_to_response': True, 'enabled': True, 'scope': 'task', 'created_at': 1711970896368, 'updated_at': 1711970896368, 'config': None}]}
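
The check above could also be wrapped in a small reusable helper. This is a sketch under stated assumptions: `validate_single_rule_task` is a hypothetical name, and the dictionary shape simply mirrors the task printed above. Unlike the cell, it also rejects a task with no rules.

```python
def validate_single_rule_task(task, expected_type="ModelHallucinationRuleV2"):
    """Return (task_id, rule_id) if the task has exactly one rule of the expected type."""
    rules = task.get("rules", [])
    if len(rules) != 1:
        raise ValueError(f"Expected exactly one rule, found {len(rules)}.")
    if rules[0]["type"] != expected_type:
        raise ValueError(f"Invalid rule type {rules[0]['type']!r}. Must be {expected_type}.")
    return task["id"], rules[0]["id"]

# Mock task with placeholder IDs, shaped like the printed output above
mock_task = {
    "id": "task-1",
    "rules": [{"id": "rule-1", "type": "ModelHallucinationRuleV2"}],
}
print(validate_single_rule_task(mock_task))
```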


Here we define two evaluation functions:

`shield_hallucination_evaluation` returns a comma-separated string of the labels that Shield returns for each claim in the response.

`daily_shield_hallucination_benchmark` runs `shield_hallucination_evaluation` and saves the results to a new dataframe stamped with today's date.

In [159]:
def shield_hallucination_evaluation(row, task_id, rule_id, verbose=True): 
    """Returns the result of a shield hallucination rule as a string

    Args:
        row: row from pandas dataframe with columns `llm_response` and `context`
        task_id: str, UUID for the hallucination task created above
        rule_id: str, UUID for the hallucination rule created above
        verbose: bool, whether to print each row and its result
    Returns:
        result_string: str, comma-separated string of the labels that shield returns for each claim in the response
    """
    shield_prompt_inference = task_prompt_validation("dummy", 1, task_id)
    inference_id = shield_prompt_inference["inference_id"]
    if verbose: print("===============\n", row)
    shield_result = task_response_validation(row.llm_response, row.context, inference_id, task_id)
    for rule_result in shield_result["rule_results"]:
        if rule_result["id"] == rule_id:
            # Join the per-claim validity flags as "True, False, ..."
            result_string = ", ".join(str(x['valid']) for x in rule_result['details']['claims'])
            if verbose: print(result_string)
            return result_string

def daily_shield_hallucination_benchmark(task_id, rule_id, verbose=True):
    """Run the benchmark datasets and save them with shield hallucination results as new csvs marked with today's date
    
    Args:
        task_id: str, UUID for the hallucination task created above
        rule_id: str, UUID for the hallucination rule created above
    """
    current_date = datetime.now().strftime("%Y-%m-%d")
    for df, name in zip(dfs, df_names):
        print("Benchmarking on dataset", name)
        df['shield_result'] = [shield_hallucination_evaluation(row, task_id, rule_id, verbose=verbose) for _, row in df.iterrows()]
        df.to_csv(f"./results/hallucination_benchmark_{name}_{current_date}.csv")
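
To see how the result string is built without calling Shield, the extraction step can be run on a mocked result. This is a minimal sketch: the `rule_results`, `id`, `details`, `claims`, and `valid` keys are taken from the function above, while the `claim` text field, the placeholder IDs, and the helper name `claim_validity_string` are assumptions made for illustration.

```python
def claim_validity_string(shield_result, rule_id):
    """Return 'True, False, ...' for the claims of the matching rule, or None if absent."""
    for rule_result in shield_result["rule_results"]:
        if rule_result["id"] == rule_id:
            claims = rule_result["details"]["claims"]
            return ", ".join(str(claim["valid"]) for claim in claims)
    return None

# Mocked response; only the keys read by the helper mirror the notebook code
mock_result = {
    "rule_results": [
        {
            "id": "rule-123",
            "details": {
                "claims": [
                    {"claim": "first sentence", "valid": True},
                    {"claim": "second sentence", "valid": False},
                ]
            },
        }
    ]
}
print(claim_validity_string(mock_result, "rule-123"))
```

Returning `None` when no rule matches mirrors the implicit behaviour of `shield_hallucination_evaluation`, so downstream code should be ready for missing results.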

Now we run the benchmark function, which saves the results that we analyze below.

In [165]:
daily_shield_hallucination_benchmark(task_id, rule_id)

#### 2.2 Analyze Results

Since we get multiple labels per response (one for each claim in the response), we can measure performance at both the response level and the claim level.

Response-level performance looks only at the overall Shield result: did the rule fail or not? Was there a hallucination *somewhere* in the LLM response? Perfect response-level performance does not require labelling individual claims accurately. It only requires the Shield rule to fail exactly when a hallucination is present anywhere in the LLM response.

Claim-level performance is stricter: which claims did Shield label as valid or invalid? Are those the claims that are actually faithful to the context?

You can typically expect response-level performance to be stronger than claim-level performance.
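
The difference between the two levels can be illustrated on a single made-up response. In this sketch the label strings are invented for illustration, and `response_flag` and `claim_flags` are hypothetical helpers that follow the same convention as the evaluation code: a `False` label marks a hallucinated claim.

```python
def response_flag(label_string):
    """True if any claim in the response is labelled False (hallucinated)."""
    return "false" in label_string.lower()

def claim_flags(label_string):
    """One hallucination flag per claim."""
    return [x.strip().lower() == "false" for x in label_string.split(",")]

truth = "True, False, True"   # ground truth: the second claim is hallucinated
pred = "True, True, False"    # prediction: flags the third claim instead

# Response level: both sides say "hallucination somewhere", so they agree
print(response_flag(truth), response_flag(pred))
# Claim level: the flagged claims differ, so the prediction is wrong twice
print(claim_flags(truth))
print(claim_flags(pred))
```

Here the response-level prediction is correct even though no individual claim was flagged correctly, which is why claim-level scores tend to be lower.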

In [162]:
from sklearn.metrics import precision_score, recall_score, f1_score
def response_level_performance(df):
    """Evaluates the performance of shield hallucination labels at the response level"""
    y_true, y_pred = [], []
    for _, row in df.iterrows():
        # A response counts as hallucinated if any of its claim labels is False.
        # str() guards against single-label columns that pandas parses as booleans.
        y_true.append('false' in str(row.binary_label).lower())
        y_pred.append('false' in str(row.shield_result).lower())
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    return precision, recall, f1

def claim_level_performance(df):
    """Evaluates the performance of shield hallucination labels at the claim level"""
    y_true, y_pred = [], []
    for i, row in df.iterrows():
        # str() guards against single-label columns that pandas parses as booleans
        labels = str(row.binary_label).split(",")
        preds = str(row.shield_result).split(",")
        if len(labels) == len(preds):
            y_true.extend([x.strip().lower()=='false' for x in labels])
            y_pred.extend([x.strip().lower()=='false' for x in preds])
        else:
            print("no including index", i, "(inconsistent # of labels)")
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    return precision, recall, f1
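
For reference, the three scores can be reproduced by hand. The sketch below is a plain-Python stand-in for the scikit-learn calls, using the same convention as the functions above: `True` means "hallucination" and is the positive class. The helper name and the toy inputs are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 with True (hallucination) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # hallucinations correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)    # faithful items wrongly flagged
    fn = sum(1 for t, p in pairs if t and not p)    # hallucinations missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy data: 2 true positives, 1 false positive, 1 false negative
y_true = [True, True, True, False]
y_pred = [True, True, False, True]
p, r, f = precision_recall_f1(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f, 3))
```

High precision with lower recall, as in the results below, means the rule rarely flags faithful content but misses some hallucinations.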

In [164]:
for name in df_names:
    print("~~~~~~~~~~~~~~~~~~~~~~\n\n%%%", name, "%%%\n")
    result_df = pd.read_csv(f"results/hallucination_benchmark_{name}_2024-04-01.csv")
    p, r, f = response_level_performance(result_df)
    print(f"\n>>>>Response level hallucination detection performance:\n\nPrecision: {p}\nRecall: {r}\nF1: {f}")
    p, r, f = claim_level_performance(result_df)
    print(f"\n>>>>Claim level hallucination detection performance:\n\nPrecision: {p}\nRecall: {r}\nF1: {f}")

~~~~~~~~~~~~~~~~~~~~~~

%%% dolly %%%


>>>>Response level hallucination detection performance:

Precision: 0.9342105263157895
Recall: 0.7717391304347826
F1: 0.8452380952380952

>>>>Claim level hallucination detection performance:

Precision: 0.9146341463414634
Recall: 0.6410256410256411
F1: 0.7537688442211056
~~~~~~~~~~~~~~~~~~~~~~

%%% wikibio %%%


>>>>Response level hallucination detection performance:

Precision: 1.0
Recall: 0.9148936170212766
F1: 0.9555555555555556
no including index 9 (inconsistent # of labels)
no including index 11 (inconsistent # of labels)
no including index 12 (inconsistent # of labels)
no including index 23 (inconsistent # of labels)
no including index 24 (inconsistent # of labels)
no including index 25 (inconsistent # of labels)
no including index 26 (inconsistent # of labels)
no including index 27 (inconsistent # of labels)
no including index 31 (inconsistent # of labels)
no including index 33 (inconsistent # of labels)
no including index 37 (inconsistent