# Introduction

The Question-Answer Generation (QAG) framework is an evaluation technique recently proposed to assess the factual consistency of summaries. As initially proposed, the process starts by generating questions from the summary with LLMs, then asks an LLM to answer each question twice: once using the summary and once using the original document. The more answers agree, the more factually consistent the summary is.
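
As a minimal sketch of the scoring idea, assuming the answers have already been collected as two equal-length lists (the name `agreement_score` is illustrative and not part of any library):

```python
def agreement_score(answers_from_document, answers_from_summary):
    """Return the fraction of questions answered identically from both texts."""
    if len(answers_from_document) != len(answers_from_summary):
        raise ValueError("Both answer lists must cover the same questions.")
    if not answers_from_document:
        return 0.0
    matches = sum(
        doc.strip().lower() == summ.strip().lower()
        for doc, summ in zip(answers_from_document, answers_from_summary)
    )
    return matches / len(answers_from_document)


# Two of the three answers agree, so the score is 2/3.
print(agreement_score(["yes", "no", "yes"], ["yes", "yes", "yes"]))
```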

Confident.ai builds a summarisation metric on this process with two modifications.
1. Questions are closed-ended so that they can be answered with 'yes' or 'no', which makes scoring easier.
2. Questions can be generated from either the summary or the source document. When questions come from the original text, the score measures **inclusion** of details; when they come from the summary, the score measures factual **alignment**.
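
The two directions can be sketched as follows. This is an illustration rather than Confident.ai's implementation: `ask` and `answer` are hypothetical callables standing in for the LLM calls, and the toy versions below only do string matching so that the example runs without an API key.

```python
def qag_scores(source, summary, ask, answer, n=5):
    """Score a summary in both directions.

    ask(text, n) -> list of questions (hypothetical callable, e.g. an LLM wrapper)
    answer(text, questions) -> list of 'yes'/'no'/'idk' answers, one per question
    """
    scores = {}
    # The text used to generate the questions decides what the score measures.
    for name, question_text in (("inclusion", source), ("alignment", summary)):
        questions = ask(question_text, n)
        from_source = answer(source, questions)
        from_summary = answer(summary, questions)
        matches = sum(a == b for a, b in zip(from_source, from_summary))
        scores[name] = matches / len(questions) if questions else 0.0
    return scores


# Toy stand-ins for the LLM calls: each sentence acts as a "question",
# and the answer is 'yes' only if the sentence appears verbatim in the text.
def toy_ask(text, n):
    return [s.strip() for s in text.split(".") if s.strip()][:n]


def toy_answer(text, questions):
    return ["yes" if q in text else "idk" for q in questions]


source = "Trees win. Data is tabular. Rotation matters."
summary = "Trees win. Nets win."
print(qag_scores(source, summary, toy_ask, toy_answer))
```

In this toy case the summary covers one of three source sentences (inclusion 1/3), and one of its two sentences is supported by the source (alignment 1/2).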

In this notebook, I experiment with this technique and make further adjustments, mostly by changing the prompts.
1. Questions are still closed-ended, but they must relate to distinct and important information in the given text.
2. Instead of calling an LLM to answer questions one at a time (which can be costly), all questions are answered in one call.
3. The LLM is asked to return quotes supporting each answer to aid understanding.
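
Since the quotes are produced by an LLM, it can be worth checking that they really occur in the text. The helper below is a sketch that is not part of the original notebook; the function names are hypothetical, and it assumes quotes may carry a leading list marker such as `1. ` and surrounding quote marks.

```python
import re


def normalise(text):
    """Lower-case and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_quote(quote):
    """Drop a leading list marker such as '1. ' or '1: ' and surrounding quote marks."""
    quote = re.sub(r"^\s*\d+[.:]\s*", "", quote)
    return quote.strip().strip('"\'')


def unsupported_quotes(quotes, text):
    """Return the quotes that cannot be found verbatim in the text."""
    haystack = normalise(text)
    return [q for q in quotes if normalise(clean_quote(q)) not in haystack]


text = "Tree-based models remain state-of-the-art\non medium-sized data."
quotes = [
    '1. "Tree-based models remain state-of-the-art on medium-sized data."',
    '2. "Neural networks are always better."',
]
# Only the second quote is missing from the text.
print(unsupported_quotes(quotes, text))
```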



In [1]:
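# Swap the standard-library sqlite3 module for pysqlite3
# (a common workaround when a dependency needs a newer SQLite build)
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
# Make the local project package (summarisation) importable
sys.path.append('/mnt/d/Projects/papersurvey_tool/src/')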
from openai import OpenAI
import pandas as pd
pd.set_option('display.max_columns', None)

In [2]:
# use the custom summariser for the summarisation task
from summarisation.summariser import PaperSummariser
file_path = "../example_paper1.pdf"
autosum = PaperSummariser()
final_summary = autosum.summarise(file_path)
full_doc = "\n".join(autosum.text_chunks)


In [3]:
print(final_summary['summary'])

Summary: The research paper demonstrates that tree-based models consistently surpass neural networks in performance on tabular data due to their abilities to handle irregular target function patterns, disregard uninformative features, and correctly interpret non-rotationally invariant data. The findings also emphasized the difference in performance between the models could vary significantly based on dataset size, missing data, and high-cardinality categorical features.

Findings: 
- Tree-based models greatly outperform neural networks in dealing with tabular data due to their ability to handle irregular data patterns and uninformative features.
- There could be a variance in performance between neural networks and tree-based models concerning missing data and high cardinality categorical features.
- The effectiveness of these models could be impacted by the size of the data set used.

Methods: The study utilized numerical feature-based classification, performing Gaussianization of all

## Question generation

In [4]:
def get_questions(text, n=5):

    closed_end_questions_template = """
    For the given text below, please follow the Guidance to generate {n} questions. 
    
    Text: {text}

    Guidance:
    - questions should be closed-ended that can be answered by 'yes' or 'no'. 
    - questions should be related to the important facts of the text.
    - use distinct information from different parts of the text to generate questions.
    - Return only the questions in JSON as shown in the example output below.

    Example Output: {{questions: [list of questions]}}

    """
    prompt= closed_end_questions_template.format(n=n, text=text)

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages = [{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


In [5]:
questions = get_questions(text=full_doc, n=5)
print(questions)
    

{ "questions": [ "Do tree-based models still outperform deep learning on tabular data?", "Has deep learning enabled progress on text and image datasets?", "Was an empirical investigation conducted to understand the gap between tree-based models and Neural Networks?", "Did the study show that tree-based models remain state-of-the-art on medium-sized data?", "Did the researchers contribute a standard benchmark and raw data for baselines?" ] }
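
`get_questions` returns a string, so it has to be parsed before the questions can be used. A small sketch of a defensive parser, assuming the model may wrap the JSON in a code fence or add a preamble (`parse_questions` is a hypothetical helper, not part of the original notebook):

```python
import json
import re


def parse_questions(raw):
    """Extract the list of questions from the model's JSON response."""
    # Keep only the outermost {...} in case the model adds a code fence or preamble.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the response.")
    payload = json.loads(match.group(0))
    questions = payload["questions"]
    if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
        raise ValueError("'questions' must be a list of strings.")
    return questions


fence = "`" * 3  # simulate a response wrapped in a Markdown code fence
raw = fence + 'json\n{"questions": ["Is A true?", "Is B true?"]}\n' + fence
print(parse_questions(raw))
```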


## Answer questions

In [6]:
def get_answers(text, questions):

    closed_end_answers_template = """
    You are given several questions separated by blank lines and a text. 
    Answer each question with 'yes', 'no', or 'idk'.
    For each question, find one or two quotes from the text that are most relevant to answering the question, then print them in numbered order. 
    Quotes should be relatively short. 
    Follow the example output to format your response.

    If there are no relevant quotes, print 'no quotes found'.

    Text: {text}

    Questions: {questions}

    
    Example Output: [{{'question': question, 'answer': answer, 'quotes': [list of quotes]}}]

    """
   
    prompt = closed_end_answers_template.format(text=text, questions=questions)
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages = [{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
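
Because all questions are answered in a single call, it helps to check that the response is well formed and covers every question. The sketch below assumes the model follows the example output and returns a Python-style list of dictionaries; `parse_answers` and `ALLOWED_ANSWERS` are hypothetical names introduced for illustration.

```python
import ast

ALLOWED_ANSWERS = {"yes", "no", "idk"}


def parse_answers(raw, questions):
    """Parse the model's list-of-dicts response and check it covers every question."""
    records = ast.literal_eval(raw.strip())  # evaluates literals only, never code
    if not isinstance(records, list):
        raise ValueError("Expected a list of answer records.")
    for record in records:
        record["answer"] = str(record["answer"]).strip().lower()
        if record["answer"] not in ALLOWED_ANSWERS:
            raise ValueError(f"Unexpected answer: {record['answer']!r}")
    missing = set(questions) - {record["question"] for record in records}
    if missing:
        raise ValueError(f"No answer returned for: {sorted(missing)}")
    return records


raw = "[{'question': 'Is A true?', 'answer': 'Yes', 'quotes': ['A is true.']}]"
print(parse_answers(raw, ["Is A true?"]))
```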


In [10]:
import json
# The model returned valid JSON above, so parse it without executing it as code
questions_str = "\n\n".join(json.loads(questions)['questions'])

In [11]:
qa_from_summary = get_answers(text=final_summary, questions=questions_str)
print(qa_from_summary)

[{'question': 'Do tree-based models still outperform deep learning on tabular data?', 'answer': 'yes', 'quotes': ['1. "Tree-based models greatly outperform neural networks in dealing with tabular data due to their ability to handle irregular data patterns and uninformative features."', '2. "The research paper demonstrates that tree-based models consistently surpass neural networks in performance on tabular data due to their abilities to handle irregular target function patterns, disregard uninformative features, and correctly interpret non-rotationally invariant data."']}, 
{'question': 'Has deep learning enabled progress on text and image datasets?', 'answer': 'idk', 'quotes': []}, 
{'question': 'Was an empirical investigation conducted to understand the gap between tree-based models and Neural Networks?', 'answer': 'yes', 'quotes': ['1. "The study utilized numerical feature-based classification, performing Gaussianization of all features before random rotations."']}, 
{'question': 'D

In [12]:
qa_from_source = get_answers(text=full_doc, questions=questions_str)
print(qa_from_source)

[{'question': 'Do tree-based models still outperform deep learning on tabular data?', 
  'answer': 'yes', 
  'quotes': 
    ['1. "While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear."',
     '2. "Results show that treebased models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed."']}, 
    
 {'question': 'Has deep learning enabled progress on text and image datasets?', 
  'answer': 'yes', 
  'quotes': 
    ['1. "While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear."']}, 
   
 {'question': 'Was an empirical investigation conducted to understand the gap between tree-based models and Neural Networks?', 
  'answer': 'yes', 
  'quotes': 
    ['1. "To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs)

## Compare answers and scoring

In [27]:
import ast
import json

def evaluate(source_doc, summary, n):
    # Generate questions from the source, so the score measures inclusion
    questions = get_questions(text=source_doc, n=n)
    questions_list = json.loads(questions)['questions']
    questions_str = "\n\n".join(questions_list)

    # Answer the same questions from the source and from the summary;
    # literal_eval parses the Python-style list without executing code
    qa_source = get_answers(text=source_doc, questions=questions_str)
    qa_source_df = pd.DataFrame(ast.literal_eval(qa_source))
    qa_summary = get_answers(text=summary, questions=questions_str)
    qa_summary_df = pd.DataFrame(ast.literal_eval(qa_summary))

    # Align both sets of answers on the question text
    comparison_df = pd.merge(qa_source_df, qa_summary_df, on='question', how='inner')
    comparison_df.rename(columns={"answer_x": "answer_source", "answer_y": "answer_summary"}, inplace=True)
    # Share of questions whose answers agree
    inclusion_score = sum(comparison_df["answer_source"] == comparison_df["answer_summary"]) / len(comparison_df)

    return inclusion_score, comparison_df


In [28]:
inclusion_score, comparison_df = evaluate(source_doc=full_doc, summary=final_summary, n=5)

In [29]:
inclusion_score

0.2

In [30]:
comparison_df

| | question | answer_source | quotes_x | answer_summary | quotes_y |
|---|---|---|---|---|---|
| 0 | Do tree-based models still outperform deep lea... | yes | [1. Tree-based models remain state-of-the-art ... | yes | [1: "Tree-based models greatly outperform neur... |
| 1 | Does tuning hyperparameters make Neural Networ... | no | [1. Tree-based models are superior for every r... | idk | [no quotes found] |
| 2 | Are Categorical variables considered the main ... | no | [1. Our results on numerical variables only do... | idk | [1: "There could be a variance in performance ... |
| 3 | Are Neural Networks biased towards overly smoo... | yes | [1. Such results suggest that the target funct... | idk | [no quotes found] |
| 4 | Do data rotations impact the performance of th... | yes | [1. Fig. 6a, which shows the change in test ac... | idk | [no quotes found] |
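
A single score hides why answers differ. As a rough sketch, the mismatches can be split into details the summary omits (it answers 'idk') and details it contradicts. The labels and the `classify` helper are illustrative choices, and the answer pairs are copied from the table above.

```python
from collections import Counter


def classify(answer_source, answer_summary):
    """Label one question by how the summary's answer relates to the source's."""
    if answer_source == answer_summary:
        return "agree"
    if answer_summary == "idk":
        return "omitted"        # the summary does not cover this detail
    if answer_source == "idk":
        return "unverifiable"   # the source itself gives no answer
    return "contradicted"       # the summary gives the opposite answer


# (answer_source, answer_summary) pairs from the comparison table
pairs = [("yes", "yes"), ("no", "idk"), ("no", "idk"), ("yes", "idk"), ("yes", "idk")]
breakdown = Counter(classify(src, summ) for src, summ in pairs)
print(dict(breakdown))
```

For this run, all four mismatches are omissions rather than contradictions, which suggests the score of 0.2 reflects missing detail rather than factual errors.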


## References and useful resources

1. [A Step-By-Step Guide to Evaluating an LLM Text Summarization Task](https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task)
2. [Asking and Answering Questions to Evaluate the Factual Consistency of Summaries](https://arxiv.org/pdf/2004.04228.pdf)