<center><img src="images/MLU-NEW-logo.png" alt="drawing" width="400" style="background-color:white; padding:1em;" /></center> <br/>

# <a name="0">Improve Factual Consistency Part 1 </a>
## <a name="0">Improving Factual Consistency and Explainability using standalone LLM  </a>

### Glossary of Terms
- Naive Judge: This LLM has **no** access to the transcript; it sees only the question and the two summaries. It is used to measure baseline performance.
- Expert Judge: This LLM has access to the transcript along with the question and the two summaries.
- Question asked to the LLM (the same in all experiments): `Which one of these summaries is the most factually consistent one?`

## Dataset
Our dataset is distilled from the Amazon Science evaluation benchmark dataset called <a href="https://github.com/amazon-science/tofueval">TofuEval</a>. For this notebook, 10 summaries have been curated from the [MediaSum documents](https://github.com/zcgzcgzcg1/MediaSum) inside the TofuEval dataset.

MediaSum is a large-scale media interview dataset containing 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview/topic descriptions from NPR and CNN.

## LLM Access

We will need access to the Anthropic Claude v3 Sonnet, Mistral 7B and Mixtral 8x7B LLMs for this notebook.

[Anthropic Claude v3 (Sonnet)](https://www.anthropic.com/news/claude-3-family), [Mixtral 8x7B](https://mistral.ai/news/mixtral-of-experts/), [Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/) - all of them pre-trained on general text summarization tasks.

## Notebook Overview

In this notebook series, we explore the LLM debating technique from "Debating with More Persuasive LLMs": two expert debater LLMs (Claude and Mixtral) argue before one judge LLM (Claude here; others such as Mistral/Mixtral or Titan Premier can be used). We measure its performance and compare it against other techniques such as self-consistency (with naive and expert judges) and LLM consultancy. This notebook is an adapted and partial implementation of one of the ICML 2024 best papers, <a href="https://arxiv.org/pdf/2402.06782"> Debating with More Persuasive LLMs Leads to More Truthful Answers </a>, applied to a new and different Amazon Science evaluation dataset, <a href="https://github.com/amazon-science/tofueval">TofuEval</a>.


- Part 1.  **[THIS notebook]** Demonstrate typical Standalone LLM approach

- Part 2.  Demonstrate the LLM Consultancy approach and compare with Part 1.

- Part 3.  Demonstrate the LLM Debate approach and compare with other methods.


<div style="border: 4px solid coral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px">
    While these notebooks (Parts 1, 2 and 3) compare various methods and demonstrate the efficacy of LLM debates (in Part 3) on a supervised dataset, the greater benefit is likely in unsupervised scenarios where the ground truth is unknown and ground-truth alignment and/or curation is required. Human annotation can be expensive and slow, and agreement among human annotators adds another level of intricacy. A possible `scalable oversight direction could be this LLM debating technique to align on the ground truth options`, using debate and critique to establish factual consistency (veracity). Aligning and curating ground truth for unsupervised data in this way could be where the debating technique pays off in a cost-versus-benefit analysis.
</div>
<br/>


#### Notebook Kernel
Please choose `conda_python3` as the kernel type in the top right corner of the notebook if it is not selected by default.


## Use-Case Overview

To demonstrate the measurement and improvement of factual consistency (veracity) with explainability in this notebook, we conduct a series of experiments to choose the best summary for each transcript. In each experiment, we measure the veracity and correctness of the summaries generated from transcripts and improve upon the decision to choose the correct one via methods like LLM consultancy and LLM debates.

The <b>overall task in this notebook</b> is to choose which of the two summaries is more appropriate for a given transcript. There are 10 transcripts in total, and each transcript has 2 summaries: one correct and one incorrect. The incorrect summaries contain various classes of errors, such as `Nuanced Meaning Shift`, `Extrinsic Information` and `Reasoning errors`.

In this notebook we will conduct the following set of experiment combinations to measure, compare and contrast LLM debating techniques with others.

<div style="border: 4px solid coral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px">
    If you see a throttling exception, increase the delay in `time.sleep(10)` from 10 seconds to, say, 20 seconds and retry.
</div>
<br/>
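
As an alternative to a fixed sleep, throttling can also be handled by retrying with a growing delay. The following is only a minimal sketch, not part of this notebook's utilities: `call_with_backoff` is a hypothetical helper, and catching a bare `Exception` is an assumption that should be narrowed to the throttling error you actually observe.

```python
import random
import time


def call_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call` with exponential backoff plus jitter (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # assumption: narrow this to the throttling error you see
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the original error
            # Wait 1x, 2x, 4x, ... the base delay, plus up to 1 second of jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
```

Any zero-argument callable can be passed in, for example a `lambda` that wraps one of the judge invocations used later in this notebook.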

## Experiments
For each of these experiments we flip the side of the argument the LLM takes to account for `position bias` and `verbosity bias` and re-run each experiment.
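
To make the flipping idea concrete, here is a small sketch of how a judge's pick in a flipped run can be mapped back to the original labelling. The helper `unflip` and the `"A"`/`"B"` labels are assumptions made for illustration; the notebook's own `extract_final_answer(..., flipped=...)` may handle this differently.

```python
def unflip(choice, flipped):
    """Map a judge's 'A'/'B' pick back to the original labelling (hypothetical helper)."""
    if not flipped:
        return choice
    return {"A": "B", "B": "A"}[choice]


# Regular run: the judge sees (A, B) and picks "A"
regular = unflip("A", flipped=False)
# Flipped run: the judge sees (B, A); picking "B" there refers to the original "A"
flipped = unflip("B", flipped=True)
print(regular, flipped)  # A A
```

If the judge picks the same underlying summary in both orderings, its decision was not driven by the position of the summary in the prompt.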

**Note** We always use the same Judge LLM (Mistral 7B) across all the experiments in this notebook




### Experiment 1: (Naive judge - this judge has no access to transcripts): 
<center><img src="images/veracitylab01-llm-naive-judge.png" alt="In this image, we depict the flow of Naive LLM judge. First the naive judge LLM has NO access to transcripts just the question and two summaries to choose from
as the more factually consistent. Next the naive judge makes a random guess
which of the two summaries are more factually consistent for 3 rounds. Majority answer is chosen based on self-consistency technique."  height="700" width="700" style="background-color:white; padding:1em;" /></center> <br/>

Mistral acts as the naive judge with no access to transcripts. The judge is queried for N (=3 in this notebook) rounds to ensure self-consistency, and the majority answer is taken as its final answer. We use this to establish the baseline performance for this series of experiments.
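
The majority vote behind self-consistency can be sketched in a few lines. This is an illustrative sketch only; `majority_answer` is a hypothetical name, and the answers are placeholder labels rather than real judge output.

```python
from collections import Counter


def majority_answer(per_round_answers):
    """Return the most frequent answer across rounds.

    Ties resolve to the answer that was seen first.
    """
    return Counter(per_round_answers).most_common(1)[0][0]


print(majority_answer(["A", "B", "A"]))  # A
```

With an odd number of rounds and two answer options, a tie can only occur if a round produces something other than one of the two options.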

---

### Experiment 2: (Expert judge: This LLM has access to transcripts): 
<center><img src="images/veracitylab01-llm-expert-judge.png" alt="In this image, we depict the flow of LLM Expert Judge. First the expert Judge LLM has access to transcripts along with the question and two summaries to choose from
as more factually consistent. Next the expert judge uses the transcript contents to decide which of the two summaries are more factually consistent for 3 rounds. Majority answer is chosen based on self-consistency technique"  height="700" width="700" style="background-color:white; padding:1em;" /></center> <br/>


Mistral acts as the expert judge with access to transcripts. The judge is queried for N (=3 in this notebook) rounds to ensure self-consistency, and the majority answer is taken as its final answer.

---

## Evaluation Metrics
For each type of experiment we evaluate the accuracy of the answers for that experiment/method type to compare and contrast each method at the end.

For the final experiment on LLM Debate, we also calculate the `win rate` of the LLM debaters to evaluate which of the LLMs actually got most of the answers right as adjudicated by the judge. This can be considered a mechanism to choose one LLM over the other given this use-case.

---


This notebook has the following sections:

1. <a href="#1">Dataset exploration</a>
2. <a href="#2">Naive Judge: no access to transcripts - Arguing for 1st summary</a>
3. <a href="#3">Naive Judge: no access to transcripts - Arguing for 2nd summary</a>
4. <a href="#4">Accuracy of Naive Judge</a>
5. <a href="#5">Expert Judge: access to transcripts - Arguing for 1st summary</a>
6. <a href="#6">Expert Judge: access to transcripts - Arguing for 2nd summary</a>
7. <a href="#7">Accuracy of Expert Judge</a>
8. <a href="#16">Challenge exercise and notebook quiz</a>
    
Please work through this notebook from top to bottom and don't skip sections, as this could lead to error messages due to missing code.

---

In [None]:
%%html

<a class="github-button" href="https://github.com/aws-samples/improve-factual-consistency-with-llm-debate-technique" data-color-scheme="no-preference: light; light: light; dark: dark;" data-icon="octicon-star" data-size="large" data-show-count="true" aria-label="Star Improve Factual Consistency with LLM Debates on GitHub">Star</a>
<script async defer src="https://buttons.github.io/buttons.js"></script>

In [1]:
%%capture
!pip3 install setuptools==70.0.0

In [2]:
%%capture
!pip install -q -U pip --root-user-action=ignore
!pip3 install -q -r requirements.txt --root-user-action=ignore

In [None]:
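# Enable autoreload so edits to the helper modules are picked up without restarting the kernel
%load_ext autoreload
%autoreload 2

# We load all prompts from a separate file prompts.py
from prompts import *
# Helper functions used throughout this notebook
from mlu_utils.veracity_utils import *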

In [None]:
clean_up_files_in_dir("./transcripts")
clear_file_contents("./log_files/notebook_run_logs.log")

In [5]:
import logging
import random
import re
import time
import warnings
from collections import Counter

import boto3
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

from IPython.display import Markdown, display
from langchain.llms.bedrock import Bedrock
from langchain.prompts import PromptTemplate

# Suppress warnings
warnings.filterwarnings("ignore")
# Write notebook run logs to a file instead of the cell output
logging.basicConfig(filename='log_files/notebook_run_logs.log', encoding='utf-8', level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("----- Test logging setup -----")


### Bedrock Model Access check

In [None]:
# Test whether access to all required Bedrock models has been enabled
test_llm_calls()

### Constants used in this notebook

In [10]:
number_of_rounds = 3
question = "Which one of these summaries is the most factually consistent one?"
total_data_points = 10

### <a name="1">Dataset Exploration</a>
(<a href="#0">Go to top</a>)


In [None]:
# pre-process the dataset
answers_df = pd.read_csv("./tofueval_dataset/mediasum_dev_doc_id_group_final_dual_summaries_manual_final_dataset.csv")
#answers_df.head()
interview_df = pd.read_csv("./tofueval_dataset/mediasum_dev_doc_complete_final.csv")
#interview_df.head()

result = pd.merge(answers_df, interview_df, on="doc_id")
final_dataset = result[["doc_id", "topic", "summ_sent_incorrect_original", "summ_sent_correct_manual", "exp", "type", "source"]]
final_dataset

### <a name="2">Naive Judge: no access to transcripts - Arguing for 1st summary</a>
(<a href="#0">Go to top</a>)


The naive judge has no access to the actual transcripts; it only sees the question and the 2 summaries/answers. We use the `self-consistency` technique, querying this judge for 3 rounds and taking the majority answer. The naive judge may well be guessing randomly. In the next experiment we flip the answer options, so that together the two runs determine the baseline accuracy of a naive judge.

In [None]:
%%time

naive_judge_regular_answers = list()
for index, row in final_dataset.iterrows():
    naive_judge_per_round = list()
    debate_id = row['doc_id']
    answer_a = row['summ_sent_correct_manual']
    answer_b = row['summ_sent_incorrect_original']
    complete_interview_transcript = row['source']
    logger.info(f"-------------NAIVE JUDGE Debate_id {debate_id}-------------------")
    for round_number in range(number_of_rounds):
        logger.info(f"START OF Naive Judge Round #{round_number + 1} for debate_id {debate_id} >>>>>> \n")
        judge_response = invoke_mistral_standalone_naive(
            debate_id = debate_id,
            question = question,
            answer_a = answer_a,
            answer_b = answer_b
        )
        naive_judge_per_round.append(extract_final_answer(judge_response, flipped=False))
        logger.info(f">>>>>>> judge_response Round #{round_number + 1}>>>>> ::  {judge_response}")
        # Display the judge's response for this round
        format_final_response(debate_id,
                              round_number + 1, 
                              question=question, 
                              answer_a=answer_a, 
                              answer_b=answer_b, 
                              judge_response=judge_response)
        logger.info(f"END OF Naive Judge Round #{round_number + 1} for debate_id {debate_id} >>>>>> \n")
    print(f"=========== END OF Naive Judge Round #{round_number + 1} for debate_id {debate_id} ======= \n")
    naive_judge_regular_answers.append(Counter(naive_judge_per_round).most_common()[0][0]) # majority answer across rounds
    print(f"naive_judge_regular_answers :: {naive_judge_regular_answers}")


### <a name="3">Naive Judge: no access to transcripts - Arguing for 2nd summary </a>
(<a href="#0">Go to top</a>)


Naive Judge (with 3 rounds of self-consistency) :: Flip the answers to account for any position bias of the summaries and re-run the experiment.


In [None]:
%%time

naive_judge_flipped_answers = list()
for index, row in final_dataset.iterrows():
    time.sleep(10) # avoid throttling exceptions
    naive_judge_flipped_per_round = list()
    debate_id = row['doc_id']
    answer_a = row['summ_sent_correct_manual']
    answer_b = row['summ_sent_incorrect_original']
    complete_interview_transcript = row['source']
    logger.info(f"-------------NAIVE JUDGE Debate_id {debate_id}-------------------")

    for round_number in range(number_of_rounds):
        time.sleep(10) # avoid throttling exceptions
        logger.info(f"START OF Naive Judge Round #{round_number + 1} >>>>>> \n")
        judge_response = invoke_mistral_standalone_naive(
            debate_id = debate_id,
            question = question,
            answer_a = answer_b, # flipped ans
            answer_b = answer_a  # flipped ans
        )
        naive_judge_flipped_per_round.append(extract_final_answer(
            judge_response=judge_response, 
            flipped=True))
        logger.info(f">>>>>>> judge_response Round #{round_number + 1}>>>>> ::  {judge_response}")
        # Display the judge's response for this round
        format_final_response(debate_id,
                              round_number + 1, 
                              question=question, 
                              answer_a=answer_b, 
                              answer_b=answer_a, 
                              judge_response=judge_response)
        logger.info(f"END OF Naive Judge Round #{round_number + 1} for debate_id {debate_id} >>>>>> \n")
    print(f"=========== END OF Naive Judge Round #{round_number + 1} for debate_id {debate_id} ======= \n")
    naive_judge_flipped_answers.append(Counter(naive_judge_flipped_per_round).most_common()[0][0]) # majority answer across rounds
    print(f"naive_judge_flipped_answers :: {naive_judge_flipped_answers}")


### <a name="4">Accuracy of Naive Judge</a>
(<a href="#0">Go to top</a>)

Accuracy is computed from the number of data points on which the judge's majority answers match between the regular run and the flipped run, divided by the total number of data points.
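
The calculation can be illustrated on toy data. This is a sketch under assumptions: `count_matching` is a hypothetical stand-in, and the real `find_num_matching_elements` in `mlu_utils.veracity_utils` may be implemented differently. The answer lists are made up for illustration.

```python
def count_matching(regular, flipped):
    """Count positions where both runs produced the same answer (hypothetical stand-in)."""
    return sum(1 for r, f in zip(regular, flipped) if r == f)


# Made-up majority answers for four data points
regular = ["A", "A", "B", "A"]
flipped = ["A", "B", "B", "A"]

accuracy = count_matching(regular, flipped) / len(regular)
print(accuracy)  # 0.75
```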

In [14]:
accuracy_naive_judge = find_num_matching_elements(naive_judge_regular_answers, naive_judge_flipped_answers)/total_data_points

In [None]:
accuracy_naive_judge

----

### <a name="5">Expert Judge: access to transcripts - Arguing for 1st summary</a>
(<a href="#0">Go to top</a>)

The expert judge (with 3 rounds of self-consistency) has access to the actual transcripts.


In [None]:
%%time

expert_judge_regular_answers = list()
for index, row in final_dataset.iterrows():
    time.sleep(10) # avoid throttling exceptions
    expert_judge_per_round = list()
    debate_id = row['doc_id']
    answer_a = row['summ_sent_correct_manual']
    answer_b = row['summ_sent_incorrect_original']
    complete_interview_transcript = row['source']
    logger.info(f"-------------EXPERT JUDGE Debate_id {debate_id}-------------------")
    for round_number in range(number_of_rounds):
        time.sleep(10) # avoid throttling exceptions
        logger.info(f"Expert Judge Round #{round_number + 1} >>>>>> \n")
        judge_response = invoke_mistral_standalone_expert(
            debate_id = debate_id,
            question = question,
            answer_a = answer_a,
            answer_b = answer_b,
            complete_interview = complete_interview_transcript
        )
        expert_judge_per_round.append(extract_final_answer(judge_response, flipped=False))
        logger.info(f">>>>>>> judge_response Round #{round_number + 1}>>>>> ::  {judge_response}")
        # Display the judge's response for this round
        format_final_response(debate_id, 
                              round_number + 1, 
                              question=question, 
                              answer_a=answer_a, 
                              answer_b=answer_b, 
                              judge_response=judge_response)
        logger.info(f"END OF Expert Judge Round #{round_number + 1} >>>>>> \n")
    print(f"=========== END OF Expert Judge Round #{round_number + 1} for debate_id {debate_id} ======= \n")
    expert_judge_regular_answers.append(Counter(expert_judge_per_round).most_common()[0][0]) # majority answer across rounds
    print(f"expert_judge_regular_answers :: {expert_judge_regular_answers}")



### <a name="6">Expert Judge: access to transcripts - Arguing for 2nd summary</a>
(<a href="#0">Go to top</a>)

Expert judge with access to transcripts (with 3 rounds of self-consistency): flip the answers to account for any position bias of the summaries and re-run the experiment.

In [None]:
%%time

expert_judge_flipped_answers = list()
for index, row in final_dataset.iterrows():
    time.sleep(10) # avoid throttling exceptions
    expert_judge_flipped_per_round = list()
    debate_id = row['doc_id']
    answer_a = row['summ_sent_correct_manual']
    answer_b = row['summ_sent_incorrect_original']
    complete_interview_transcript = row['source']
    logger.info(f"-------------EXPERT JUDGE Debate_id {debate_id}-------------------")
    
    for round_number in range(number_of_rounds):
        time.sleep(10) # avoid throttling exceptions
        logger.info(f"Expert Judge Round #{round_number + 1} >>>>>> \n")
        judge_response = invoke_mistral_standalone_expert(
            debate_id = debate_id,
            question = question,
            answer_a = answer_b, # flipped
            answer_b = answer_a, # flipped
            complete_interview = complete_interview_transcript
        )
        expert_judge_flipped_per_round.append(extract_final_answer(judge_response, flipped=True))
        logger.info(f">>>>>>> judge_response Round #{round_number + 1}>>>>> ::  {judge_response}")
        # Display the judge's response for this round
        format_final_response(debate_id,
                              round_number + 1, 
                              question=question, 
                              answer_a=answer_b, 
                              answer_b=answer_a, 
                              judge_response=judge_response)
        logger.info(f"END OF Expert Judge Round #{round_number + 1} >>>>>> \n")
    print(f"=========== END OF Expert Judge Round #{round_number + 1} for debate_id {debate_id} ======= \n")
    expert_judge_flipped_answers.append(Counter(expert_judge_flipped_per_round).most_common()[0][0]) # majority answer across rounds
    print(f"expert_judge_flipped_answers :: {expert_judge_flipped_answers}")



### <a name="7">Accuracy of Expert Judge</a>
(<a href="#0">Go to top</a>)

In [None]:
expert_judge_regular_answers

In [None]:
expert_judge_flipped_answers

In [20]:
accuracy_expert_judge = find_num_matching_elements(expert_judge_regular_answers, expert_judge_flipped_answers)/total_data_points

In [None]:
accuracy_expert_judge

In [None]:
# save the results

%load_ext autoreload
%autoreload 2
from mlu_utils.veracity_utils import *

init_results_file()
results_dict = {"accuracy_naive_judge":accuracy_naive_judge, "accuracy_expert_judge": accuracy_expert_judge}
save_each_experiment_result(results_dict)
print("notebook results saved in results folder")

## <a name="14">Compare Accuracies across experiments/methods.</a>
(<a href="#0">Go to top</a>)

Here we compare the accuracies of each method/experiment to understand how they perform relative to one another.

In [None]:
accuracy_consultant_judge = None
accuracy_debate_judge = None

final_accuracy_comparison_judge(
    accuracy_naive_judge = accuracy_naive_judge,
    accuracy_expert_judge = accuracy_expert_judge
)

In [None]:
# Build the plot
%matplotlib inline
x_values = [ "Naive Judge", "Expert Judge"]
y_values = [ accuracy_naive_judge, accuracy_expert_judge]
plt.bar(x_values, y_values)
plt.title('Compare Accuracies across experiments')
plt.xlabel('Experiment Type')
plt.ylabel('Accuracy')
 
plt.show()