<center><img src="images/MLU-NEW-logo.png" alt="drawing" width="400" style="background-color:white; padding:1em;" /></center> <br/>


# <a name="0">Improve Factual Consistency Part 3 </a>
## <a name="0">Improving Factual Consistency and Explainability using LLM Debates </a>

### Glossary of Terms
- Naive Judge: This LLM has **no** access to the transcript; it sees only the question and the two summaries. It is used to measure baseline performance.
- Expert Judge: This LLM has access to the transcript along with the question and the two summaries.
- Question asked to the LLM (in all experiments): It is always the same: `Which one of these summaries is the most factually consistent one?`
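
The difference between the two judge types comes down to what goes into the prompt. The sketch below illustrates that idea only; `build_judge_prompt` and the prompt layout are hypothetical and are not the templates stored in `prompts.py`.

```python
QUESTION = "Which one of these summaries is the most factually consistent one?"


def build_judge_prompt(summary_a, summary_b, transcript=None):
    """Hypothetical prompt builder: pass transcript=None for a naive judge."""
    parts = []
    if transcript is not None:
        # Expert judge: the source transcript is included in the prompt.
        parts.append(f"<transcript>\n{transcript}\n</transcript>")
    # Both judge types see the same question and the same two summaries.
    parts.append(f"Question: {QUESTION}")
    parts.append(f"Summary A: {summary_a}")
    parts.append(f"Summary B: {summary_b}")
    return "\n\n".join(parts)


naive_prompt = build_judge_prompt("The guest agreed.", "The guest declined.")
expert_prompt = build_judge_prompt(
    "The guest agreed.",
    "The guest declined.",
    transcript="HOST: Will you join us? GUEST: No.",
)
```

The naive prompt gives the judge nothing to check the summaries against, which is why it serves as the baseline.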

## Dataset
Our dataset is distilled from the Amazon Science evaluation benchmark dataset called <a href="https://github.com/amazon-science/tofueval">TofuEval</a>. Ten summaries have been curated from the [MediaSum documents](https://github.com/zcgzcgzcg1/MediaSum) inside the TofuEval dataset for this notebook.

MediaSum is a large-scale media interview dataset containing 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview/topic descriptions from NPR and CNN.


## Notebook Overview

In this notebook, we explore the LLM debate technique with two expert debater LLMs (Claude and Mixtral) and one judge (Claude here; other models such as Mistral/Mixtral or Titan Premier could be used instead). We measure its performance and compare it against other techniques such as self-consistency (with naive and expert judges) and LLM consultancy. This notebook is an adapted, partial implementation of one of the ICML 2024 best papers, <a href="https://arxiv.org/pdf/2402.06782">Debating with More Persuasive LLMs Leads to More Truthful Answers</a>, applied to a different Amazon Science evaluation dataset, <a href="https://github.com/amazon-science/tofueval">TofuEval</a>.


- Part 1.  Demonstrate typical Standalone LLM approach

- Part 2.  Demonstrate the LLM Consultancy approach and compare with Part 1.

- Part 3.  **[THIS notebook]**  Demonstrate the LLM Debate approach and compare with other methods.


<div style="border: 4px solid coral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px">
    While this notebook series (parts 1, 2 and 3) compares various methods and demonstrates the efficacy of LLM debates in part 3 with a supervised dataset, the greater benefit is likely in unsupervised scenarios where the ground truth is unknown and ground truth alignment and/or curation is required. Human annotation can be expensive and slow, and reaching agreement among human annotators adds another level of complexity. A possible `scalable oversight direction could be this LLM debating technique to align on the ground truth options` through its debate-and-critique mechanism, by establishing factual consistency (veracity). This alignment and curation of ground truth for unsupervised data could be where the debating technique pays off in a cost-versus-benefit analysis.
</div>
<br/>


#### Notebook Kernel
Please choose `conda_python3` as the kernel type in the top right corner of the notebook if it is not selected by default.


## LLM Access

We will need access to the Anthropic Claude v3 Sonnet, Mistral 7B and Mixtral 8x7B LLMs for this notebook.

[Anthropic Claude v3 (Sonnet)](https://www.anthropic.com/news/claude-3-family), [Mixtral 8x7B](https://mistral.ai/news/mixtral-of-experts/), [Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/) - all of them pre-trained models that can handle general text summarization tasks.

## Use-Case Overview

To demonstrate the measurement and improvement of factual consistency (veracity) with explainability in this notebook, we conduct a series of experiments to choose the best summary for each transcript. In each experiment, we measure the veracity and correctness of the summaries generated from transcripts and improve upon the decision to choose the correct one via methods like LLM consultancy and LLM debates.

The <b>overall task in this notebook</b> is to choose which one of the two summaries is most appropriate for a given transcript. There are a total of 10 transcripts and each transcript has 2 summaries: one correct and the other incorrect. The incorrect summaries have various classes of errors like `Nuanced Meaning Shift`, `Extrinsic Information` and `Reasoning errors`.

In this notebook we will conduct the following set of experiment combinations to measure, compare and contrast LLM debating techniques with others.


## Experiments
For each of these experiments we flip the side of the argument the LLM takes to account for `position bias` and `verbosity bias` and re-run each experiment.

**Note:** We always use the same Judge LLM (Claude) across all the experiments in this notebook.

<div style="border: 4px solid coral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px">
    If you see a throttling exception, please increase the wait time in `time.sleep(10)` from 10 seconds to, say, 20 seconds and retry.
</div>
<br/>
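
As an alternative to raising a fixed sleep by hand, throttled calls can be retried with an increasing delay. The sketch below is illustrative only: `ThrottlingError` and `call_with_backoff` are hypothetical names, and the exception actually raised by the notebook's helper functions is not shown here, so the `except` clause would need to be adapted.

```python
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for the service's throttling exception."""


def call_with_backoff(fn, max_attempts=4, base_delay=10.0, sleep=time.sleep):
    """Call fn(), doubling the wait after every throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * (2 ** attempt))


# Usage with a fake call that is throttled twice before it succeeds.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ThrottlingError("rate exceeded")
    return "ok"


waits = []  # record the requested waits instead of really sleeping
result = call_with_backoff(flaky_call, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the example fast to run; in the notebook the default `time.sleep` would be used.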

---

### Experiment 4: (LLM Debate) 
<center><img src="images/veracitylab01-llm-debate.png" alt="In this image, we depict the flow of LLM Debate. First Debater LLMs like Claude and Mixtral argue their side
based on transcript contents. Next each argument is saved to a file and
the next debater picks up the entire argument history before posting their next argument. Finally, once all 3 rounds of arguments are over, the Judge LLM reads all the arguments and decides which summary is the most factually consistent answer."  height="700" width="700" style="background-color:white; padding:1em;" /></center> <br/>

We use Claude 3 as the first debater and Mixtral as the second debater, with Claude as the judge. The debaters argue their sides for N (=3 in this notebook) rounds, and the judge then decides which argument is better. We flip the sides that Claude and Mixtral argue in experiments 4a and 4b and take the average of the two results as the final accuracy. This accounts for errors due to position bias (choosing an answer because of its order/position) and verbosity bias (favoring one answer because it is longer than the other).

##### Experiment 4a: (Claude v3 argues for answer A, Mixtral argues for answer B):
Claude v3 (Sonnet) argues for answer A (Ground Truth: False Answer) and generates a rationale for why that answer is correct. Mixtral 8X7B argues for answer B (Ground Truth: True Answer) and generates a rationale for why that answer is correct. This continues for N (=3 in this notebook) rounds. At the end of the debate, Claude as the judge adjudicates whether Claude's or Mixtral's rationale is correct and chooses a side to give the final answer.

##### Experiment 4b: (Claude v3 argues for answer B, Mixtral argues for answer A):
Claude v3 (Sonnet) argues for answer B (Ground Truth: True Answer) and generates a rationale for why that answer is correct. Mixtral 8X7B argues for answer A (Ground Truth: False Answer) and generates a rationale for why that answer is correct. This continues for N (=3 in this notebook) rounds. At the end of the debate, Claude as the judge adjudicates whether Claude's or Mixtral's rationale is correct and chooses a side to give the final answer.
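
The turn-taking described above can be sketched without any model calls. In this minimal illustration the debaters are stub functions standing in for the LLMs; `run_debate` and `make_stub_debater` are hypothetical names, and the notebook itself keeps the shared history in a transcript file rather than in a list.

```python
def run_debate(debaters, num_rounds):
    """Run a turn-based debate; every debater sees the full history so far."""
    history = []  # shared record of (round_number, debater_name, argument)
    for round_number in range(1, num_rounds + 1):
        for name, argue in debaters:
            # Pass a copy so a debater cannot modify the shared history.
            argument = argue(round_number, list(history))
            history.append((round_number, name, argument))
    return history


def make_stub_debater(summary_label):
    """Hypothetical stand-in for an LLM call that defends one summary."""
    def argue(round_number, history):
        return (f"Round {round_number}: summary {summary_label} is consistent "
                f"({len(history)} prior arguments read)")
    return argue


history = run_debate(
    [("Claude", make_stub_debater("A")), ("Mixtral", make_stub_debater("B"))],
    num_rounds=3,
)
for round_number, name, argument in history:
    print(f"[{name}] {argument}")
```

A judge would then read the complete `history` once all rounds are over. Swapping the summary labels between the two stub debaters corresponds to moving from experiment 4a to 4b.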

---
## Evaluation Metrics
For each type of experiment we evaluate the accuracy of the answers for that experiment/method type to compare and contrast each method at the end.
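
As a rough illustration of this metric, the sketch below scores a regular run and a flipped run against the ground truth and averages the two. The function names and the label encoding ("A"/"B") are assumptions for the example; the notebook's own helper `find_num_matching_elements` may compute the score differently.

```python
def accuracy(judge_picks, correct_labels):
    """Fraction of data points where the judge picked the correct summary."""
    if len(judge_picks) != len(correct_labels):
        raise ValueError("judge_picks and correct_labels must have equal length")
    hits = sum(pick == label for pick, label in zip(judge_picks, correct_labels))
    return hits / len(correct_labels)


def averaged_accuracy(regular_picks, flipped_picks, correct_labels):
    """Average the accuracy of the regular and the side-flipped runs."""
    return (accuracy(regular_picks, correct_labels)
            + accuracy(flipped_picks, correct_labels)) / 2


# Toy data (not results from this notebook): 4 data points, "A" is always correct.
correct = ["A", "A", "A", "A"]
regular = ["A", "A", "B", "A"]  # 3 of 4 correct
flipped = ["A", "B", "B", "A"]  # 2 of 4 correct
avg = averaged_accuracy(regular, flipped, correct)
print(avg)  # 0.625
```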

For the final experiment on LLM Debate, we also calculate the `win rate` of the LLM debaters to evaluate which of the LLMs actually got most of the answers right as adjudicated by the judge. This can be considered a mechanism to choose one LLM over the other given this use-case.
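
One way to read the win rate is sketched below, under the assumption that a debater "wins" a data point when the judge picks the summary that debater defended. The function name and the boolean encoding are hypothetical; the notebook's `get_win_rate_per_model` helper may define the metric differently.

```python
def win_rates(regular_judge_correct, flipped_judge_correct):
    """
    Each list holds one boolean per data point: did the judge pick the
    correct summary?
    Regular run: Claude defends the incorrect summary, Mixtral the correct one.
    Flipped run: Claude defends the correct summary, Mixtral the incorrect one.
    Returns (claude_avg_win_rate, mixtral_avg_win_rate).
    """
    n_regular = len(regular_judge_correct)
    n_flipped = len(flipped_judge_correct)
    # Regular run: a correct judgment is a win for Mixtral.
    mixtral_regular = sum(regular_judge_correct) / n_regular
    claude_regular = 1 - mixtral_regular
    # Flipped run: a correct judgment is a win for Claude.
    claude_flipped = sum(flipped_judge_correct) / n_flipped
    mixtral_flipped = 1 - claude_flipped
    return ((claude_regular + claude_flipped) / 2,
            (mixtral_regular + mixtral_flipped) / 2)


# Toy data (not results from this notebook).
regular = [True, True, False, True]  # judge correct on 3 of 4
flipped = [True, True, True, True]   # judge correct on 4 of 4
claude_rate, mixtral_rate = win_rates(regular, flipped)
print(claude_rate, mixtral_rate)  # 0.625 0.375
```

Because each debater defends the incorrect summary in one of the two runs, a higher average win rate under this definition reflects persuasiveness across both sides rather than simply having been assigned the true answer.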

---


This notebook has the following sections:

1. <a href="#1">Dataset exploration</a>
2. <a href="#2">Accuracy of LLM Debate</a>
3. <a href="#3">Compare Accuracies across experiments</a>
4. <a href="#4">Choose expert LLM using Win Rate measured during LLM Debate (Experiment 4) </a>
5. <a href="#5">Challenge exercise and notebook quiz</a>
    
Please work top to bottom of this notebook and don't skip sections as this could lead to error messages due to missing code.

---

In [1]:
%%capture
!pip3 install setuptools==70.0.0

In [2]:
%%capture
!pip install -q -U pip --root-user-action=ignore
!pip3 install -q -r requirements.txt --root-user-action=ignore

In [None]:
# Reload edited modules automatically, then load the prompts (prompts.py) and the helper utilities
%load_ext autoreload
%autoreload 2
from prompts import *
from mlu_utils.veracity_utils import *

In [None]:
clean_up_files_in_dir("./transcripts")
clear_file_contents("./log_files/notebook_run_logs.log")

In [5]:
import logging
import random
import re
import time
import warnings
from collections import Counter

import boto3
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import Markdown, display
from langchain.llms.bedrock import Bedrock
from langchain.prompts import PromptTemplate

%matplotlib inline

# Suppress warnings
warnings.filterwarnings("ignore")
# Write run logs to a file so the notebook output stays readable
logging.basicConfig(filename='log_files/notebook_run_logs.log', encoding='utf-8', level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info("----- Test logging setup -----")


### Bedrock Model Access check

In [None]:
# Test that access to all required Bedrock models has been enabled
test_llm_calls()

### Constants used in this notebook

In [7]:
number_of_rounds = 3
question = "Which one of these summaries is the most factually consistent one?"
total_data_points = 10

### <a name="1">Dataset Exploration</a>
(<a href="#0">Go to top</a>)


In [None]:
# pre-process the dataset
answers_df = pd.read_csv("./tofueval_dataset/mediasum_dev_doc_id_group_final_dual_summaries_manual_final_dataset.csv")
#answers_df.head()
interview_df = pd.read_csv("./tofueval_dataset/mediasum_dev_doc_complete_final.csv")
#interview_df.head()

result = pd.merge(answers_df, interview_df, on="doc_id")
final_dataset = result[["doc_id", "topic", "summ_sent_incorrect_original", "summ_sent_correct_manual", "exp", "type", "source"]]
final_dataset.head()

### <a name="2">LLM Debate: 2 expert LLMs, 1 naive judge - LLM-1 arguing for 1st summary</a>
(<a href="#0">Go to top</a>)

In this LLM Debate, Claude (LLM-1) defends the incorrect summary and Mixtral (LLM-2) defends the correct summary.

Claude v3 (Sonnet) argues for answer A (Ground Truth: False Answer) and generates a rationale for why that answer is correct. Mixtral 8X7B argues for answer B (Ground Truth: True Answer) and generates a rationale for why that answer is correct. This continues for N (=3 in this notebook) rounds. At the end of the debate, Claude as the judge adjudicates whether Claude's or Mixtral's rationale is correct and chooses a side to give the final answer.

In [None]:
%%time

for index, row in final_dataset.iterrows():
    time.sleep(20) # avoid throttling exceptions
    debate_id = row['doc_id']
    answer_a = row['summ_sent_correct_manual']
    answer_b = row['summ_sent_incorrect_original']
    complete_interview_transcript = row['source']
    
    #### defending false - Claude
    claude_defending_summary=answer_b
    claude_opposing_summary=answer_a

    #### defending true - Mixtral
    mixtral_defending_summary=answer_a
    mixtral_opposing_summary=answer_b

    logger.info(f"-------------2 model Debate -> Debate_id {debate_id}-------------------")

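    #### Debate: Claude defends the incorrect summary, Mixtral defends the correct one - number_of_rounds rounds
    #### (in this cell, answer_a holds the correct summary and answer_b the incorrect one)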
    delete_file(f"./transcripts/full_transcript_debate_{debate_id}.log")
    for round_number in range(number_of_rounds):
        time.sleep(10) # avoid throttling exceptions
        print(f"=========== START OF 2 model DEBATE debate_id {debate_id} Round #1..{round_number + 1} ======= \n")
        logger.info(f"START Debate with Claude Debate_id {debate_id} Round #{round_number + 1} >>>>>> \n") 
        claude_debate_response = invoke_claude_v3(debate_id = debate_id,
                         question=question,
                         round_number = round_number + 1,
                         summary_defending = claude_defending_summary, 
                         summary_opposing = claude_opposing_summary, 
                         complete_interview = complete_interview_transcript,
                         debate=True
                         )

        logger.info(f" >>>>> claude_debate_response Round #{round_number + 1} >>>>> {claude_debate_response}")
        logger.info(f"END Debate with Claude Round #{round_number + 1} >>>>>> \n")

        mixtral_debate_response = invoke_mistral(debate_id = debate_id,
                     question=question,
                     round_number = round_number + 1,
                     summary_defending = mixtral_defending_summary, 
                     summary_opposing = mixtral_opposing_summary, 
                     complete_interview = complete_interview_transcript, 
                     )

        logger.info(f" >>>>> mixtral_debate_response Round #{round_number + 1} >>>>> {mixtral_debate_response}")
        logger.info(f"END Debate with Mixtral Round #{round_number + 1} >>>>>> \n")
    print(f"=========== END OF 2 model DEBATE debate_id {debate_id} Round #1..{round_number + 1} ======= \n")
    

## JUDGE for Regular Debate: LLM-Claude arguing for 1st summary, LLM-Mixtral arguing for 2nd summary

In [None]:
%%time

debate_judge_regular_answers = list()
for index, row in final_dataset.iterrows():
    time.sleep(10) # avoid throttling exceptions
    debate_id = row['doc_id']
    answer_a = row['summ_sent_correct_manual']
    answer_b = row['summ_sent_incorrect_original']
    complete_interview_transcript = row['source']
    logger.info(f"-------------DEBATE  JUDGE Debate_id {debate_id}-------------------")

    judge_response = invoke_claude_judge_debate(debate_id = debate_id,
                              question=question,
                 answer_a = answer_a,
                 answer_b = answer_b)
    debate_judge_regular_answers.append(extract_final_answer(judge_response, flipped=False))
    logger.info(f" >>>>> invoke_claude_judge_debate - judge_response  >>>>> {judge_response}")
    # Print the final response 
    format_final_response(debate_id, 
                          round_num=1, 
                          question=question, 
                          answer_a=answer_a, 
                          answer_b=answer_b, 
                          judge_response=judge_response)
print(debate_judge_regular_answers)

### <a name="3">LLM Debate: 2 expert LLMs, 1 naive judge - LLM-1 arguing for 2nd summary</a>
(<a href="#0">Go to top</a>)

In this **flipped LLM Debate**, Claude (LLM-1) defends the correct summary and Mixtral (LLM-2) defends the incorrect summary.


Claude v3 (Sonnet) argues for answer B (Ground Truth: True Answer) and generates a rationale for why that answer is correct. Mixtral 8X7B argues for answer A (Ground Truth: False Answer) and generates a rationale for why that answer is correct. This continues for N (=3 in this notebook) rounds. At the end of the debate, Claude as the judge adjudicates whether Claude's or Mixtral's rationale is correct and chooses a side to give the final answer.

In [None]:
%%time

for index, row in final_dataset.iterrows():
    time.sleep(20) # avoid throttling exceptions
    debate_id = row['doc_id']
    answer_a = row['summ_sent_correct_manual']
    answer_b = row['summ_sent_incorrect_original']
    complete_interview_transcript = row['source']
    
    #### defending True - Claude
    claude_defending_summary=answer_a
    claude_opposing_summary=answer_b

    #### defending False - Mixtral
    mixtral_defending_summary=answer_b
    mixtral_opposing_summary=answer_a
    
    delete_file(f"./transcripts/full_transcript_debate_{debate_id}{FLIPPED_FILE_SUFFIX}.log")

    logger.info(f"-------------2 model FLIPPED Debate -> Debate_id {debate_id}-------------------")
    # Use number_of_rounds here: round_number is only assigned inside the loop below
    print(f"=========== START OF 2 model FLIPPED DEBATE debate_id {debate_id} Round #1..{number_of_rounds} ======= \n")
    for round_number in range(number_of_rounds):
        time.sleep(10) # avoid throttling exceptions
        logger.info(f"START Debate with Claude Round #{round_number + 1} >>>>>> \n") 
        claude_debate_response = invoke_claude_v3(debate_id = debate_id + FLIPPED_FILE_SUFFIX,
                         question=question,
                         round_number = round_number + 1,
                         summary_defending = claude_defending_summary, 
                         summary_opposing = claude_opposing_summary, 
                         complete_interview = complete_interview_transcript,
                         debate=True
                         )

        logger.info(f" >>>>> claude_debate_response Round #{round_number + 1} >>>>> {claude_debate_response}")
        logger.info(f"END Debate with Claude Round #{round_number + 1} >>>>>> \n")

        mixtral_debate_response = invoke_mistral(debate_id = debate_id + FLIPPED_FILE_SUFFIX,
                     question=question,
                     round_number = round_number + 1,
                     summary_defending = mixtral_defending_summary, 
                     summary_opposing = mixtral_opposing_summary, 
                     complete_interview = complete_interview_transcript, 
                     )

        logger.info(f" >>>>> mixtral_debate_response Round #{round_number + 1} >>>>> {mixtral_debate_response}")
        logger.info(f"END Debate with Mixtral Round #{round_number + 1} >>>>>> \n")
    print(f"=========== END OF 2 model FLIPPED DEBATE debate_id {debate_id} Round #1..{round_number + 1} ======= \n")
    

## JUDGE for flipped LLM Debate: LLM-Claude arguing for 2nd summary, LLM-Mixtral arguing for 1st summary

In [None]:
%%time

debate_judge_flipped_answers = list()
for index, row in final_dataset.iterrows():
    time.sleep(10) # avoid throttling exceptions
    debate_id = row['doc_id']
    answer_a = row['summ_sent_correct_manual']
    answer_b = row['summ_sent_incorrect_original']
    complete_interview_transcript = row['source']
    logger.info(f"-------------DEBATE FLIPPED JUDGE Debate_id {debate_id}-------------------")

    judge_response = invoke_claude_judge_debate(debate_id = debate_id + FLIPPED_FILE_SUFFIX,
                              question=question,
                              answer_a = answer_a,
                              answer_b = answer_b)
    debate_judge_flipped_answers.append(extract_final_answer(judge_response, flipped=False))
    logger.info(f" >>>>> Flipped invoke_claude_judge_debate - judge_response  >>>>> {judge_response}")
    
    # Print the final response 
    format_final_response(debate_id, 
                          round_num=1, 
                          question=question, 
                          answer_a=answer_a, 
                          answer_b=answer_b, 
                          judge_response=judge_response)
print(debate_judge_flipped_answers)

## <a name="4">Accuracy of LLM Debate</a>
(<a href="#0">Go to top</a>)

In [None]:
debate_judge_regular_answers

In [None]:
debate_judge_flipped_answers

In [15]:
accuracy_debate_judge = find_num_matching_elements(debate_judge_regular_answers, debate_judge_flipped_answers)/total_data_points

In [None]:
accuracy_debate_judge

In [None]:
# save the results
results_dict = {"accuracy_debate_judge" : accuracy_debate_judge}
save_each_experiment_result(results_dict)
print("notebook results saved in results folder")

## <a name="5">Compare Accuracies across experiments/methods.</a>
(<a href="#0">Go to top</a>)

Here we compare the accuracies of each method/experiment to understand which approach most often selects the factually consistent summary.

In [None]:
accuracy_naive_judge = get_each_experiment_result("accuracy_naive_judge")
accuracy_expert_judge = get_each_experiment_result("accuracy_expert_judge")
accuracy_consultant_judge = get_each_experiment_result("accuracy_consultant_judge")

final_accuracy_comparison(
    accuracy_naive_judge = accuracy_naive_judge,
    accuracy_expert_judge = accuracy_expert_judge,
    accuracy_consultant_judge = accuracy_consultant_judge,
    accuracy_debate_judge = accuracy_debate_judge
)

In [None]:
# Build the plot
x_values = [ "Naive Judge", "Expert Judge", "LLM Consultant", "LLM Debate"]
y_values = [ accuracy_naive_judge, accuracy_expert_judge, accuracy_consultant_judge, accuracy_debate_judge]
plt.bar(x_values, y_values)
plt.title('Compare Accuracies across experiments')
plt.xlabel('Experiment Type')
plt.ylabel('Accuracy')
 
plt.show()

### <a name="6">Choose expert LLM using Win Rate measured during LLM Debate (Experiment 4) </a>
(<a href="#0">Go to top</a>)

With this win rate of the expert models, we empirically understand which LLM is the more successful debater.

In [None]:
claude_avg_win_rate, mixtral_avg_win_rate = get_win_rate_per_model(
    debate_judge_regular_answers, 
    debate_judge_flipped_answers)

In [None]:
win_rate_comparison(claude_avg_win_rate, mixtral_avg_win_rate)

In [None]:
# Build the plot
%matplotlib inline
x_values = [ "Claude v3 Sonnet", "Mixtral"]
y_values = [claude_avg_win_rate, mixtral_avg_win_rate]
plt.bar(x_values, y_values)
plt.title('Compare average win-rate across expert LLMs')
plt.xlabel('Expert LLM')
plt.ylabel('Average win rate')
 
plt.show()