# Generate and Filter Confusing Questions for Documents

Given a CSV file with short documents (a few text paragraphs each), we want to generate confusing questions about these documents. A question is _confusing_ if it contains a false premise and therefore has no good answer. We want to use confusing questions to improve an LLM's ability to detect false premises and point them out to the user who asked the question, rather than play along and create even more confusion by trying to answer an unanswerable question.

Our confusing questions must be complex enough that the "unprepared" LLM plays along and attempts to answer them, but simple enough that the same LLM detects the false premise when specifically asked to search for it. In other words, each question must satisfy the following constraints:

* Be based on the contents of the document
* Start off with a false premise or assumption, not just ask if the assumption is true
* Have no good answer (positive or negative) other than pointing out the false assumption to the user
* The "unprepared" LLM must answer the question _AS IF_ it is answerable (creating even more confusion)
* The same LLM must detect the false premise when specifically asked to search for it
* Moreover, the same LLM must be able to detect that its original answer has failed to point out the false premise

This notebook uses two CSV formats: one for the table with documents, and one for the table with individual questions and responses. The input CSV table is expected to have three columns: document ID, document source (e.g. URL), and the document itself. We keep track of document sources to avoid licensing issues. (Maybe we should add more columns for the timestamp, the license, etc.) Once the lists of questions are generated, we add them as extra columns for easy review. Below is the schema of the table with documents:

In [1]:
doc_csv_schema = {
    "doc_id" : "doc_id",           # Column with a unique document ID
    "source" : "source",           # Column with document source (e.g. URL)
    "document" : "document",       # Column with the text of the original document
    "LLM_q" : "LLM_q",             # Column with name of the LLM that generated confusing questions
    "orig_qs" : "orig_questions",  # Column with the original (non-confusing) questions
    "conf_qs" : "conf_questions"   # Column with the modified (confusing) questions
}

Note that `LLM_q` refers to the LLM used to _generate_ the confusing questions, which does not have to be the same LLM as the one (`LLM_r`) used to respond to the questions and test for the false premises. In fact, it is better if `LLM_q` is stronger than `LLM_r`.
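Before running the pipeline, it can help to check that the input CSV really has the three expected columns. The following is a minimal sketch using only the standard library; the helper `missing_input_columns` and the constant `INPUT_KEYS` are hypothetical names introduced here for illustration and are not part of `datagen`.

```python
import csv
import io

# Copy of the document schema from the cell above (logical key -> CSV column name).
doc_csv_schema = {
    "doc_id": "doc_id",
    "source": "source",
    "document": "document",
    "LLM_q": "LLM_q",
    "orig_qs": "orig_questions",
    "conf_qs": "conf_questions",
}

# Hypothetical constant: the schema keys that the input table is expected to provide.
INPUT_KEYS = ("doc_id", "source", "document")


def missing_input_columns(csv_text, schema=doc_csv_schema):
    """Return the expected input columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return [schema[key] for key in INPUT_KEYS if schema[key] not in header]


# Tiny made-up tables, for illustration only.
sample_ok = "doc_id,source,document\n1,https://example.com/a,Some text.\n"
sample_bad = "doc_id,document\n1,Some text.\n"

print(missing_input_columns(sample_ok))   # → []
print(missing_input_columns(sample_bad))  # → ['source']
```

Going through the schema dictionary rather than hard-coding column names means the check keeps working if the CSV columns are renamed in the schema.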

Below is the schema of the CSV table used to store individual questions (one question per row), responses, false premises, and their detection information:

In [2]:
qrc_csv_schema = {
    "doc_id" : "doc_id",           # Column with document ID, same as the other table
    "q_id" : "q_id",               # Column with question ID (for this document)
    "is_conf" : "is_confusing",    # Column with "yes" or "no" indicating if the question is confusing
    "question" : "question",       # Column with the question (either original or confusing)
    "LLM_r" : "LLM_r",             # Column with name of the LLM that generated the responses
    "response" : "response",       # Column with response generated given the document and the question
    "confusion" : "confusion",     # Column with LLM-found confusion in the question (or "none")
    "defusion" : "defusion",       # Column with LLM's reply on whether its own response detected the confusion
    "is_defused" : "is_defused"    # Column with "yes" or "no" as LLM checks if response detected the confusion
}
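To illustrate how the two tables relate, here is a rough sketch of expanding one document row into one row per question. It assumes the question lists have already been parsed into Python lists (how `datagen` serializes them inside a CSV cell is not shown in this notebook), and it assumes that a confusing question shares its `q_id` with the original question it was derived from. Both are assumptions, and `expand_to_question_rows` is a hypothetical helper.

```python
# Subset of the question-response schema from the cell above.
qrc_csv_schema = {
    "doc_id": "doc_id",
    "q_id": "q_id",
    "is_conf": "is_confusing",
    "question": "question",
}


def expand_to_question_rows(doc_id, orig_questions, conf_questions, schema=qrc_csv_schema):
    """Build one row per question: original questions first, then confusing ones."""
    rows = []
    for is_conf, questions in (("no", orig_questions), ("yes", conf_questions)):
        # Assumption: q_id restarts at 1 for each group, pairing a confusing
        # question with the original question at the same position.
        for q_id, question in enumerate(questions, start=1):
            rows.append({
                schema["doc_id"]: doc_id,
                schema["q_id"]: q_id,
                schema["is_conf"]: is_conf,
                schema["question"]: question,
            })
    return rows


# Made-up questions, for illustration only.
rows = expand_to_question_rows("d1", ["Q1?", "Q2?"], ["C1?", "C2?"])
print(len(rows))  # → 4
print(rows[2])    # → {'doc_id': 'd1', 'q_id': 1, 'is_confusing': 'yes', 'question': 'C1?'}
```

With 20 documents and 10 original plus 10 confusing questions each, this layout gives 20 × 20 = 400 rows.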

Here we specify the two LLMs, `LLM_q` and `LLM_r`, by short names defined as keys in `llmlib.py`. We also specify the input CSV table with documents and the output CSV tables: one with a row per document and one with a row per question/response.

In [3]:
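import os

from promptlib import read_prompts

# Import the pipeline steps explicitly instead of using a wildcard import
from datagen import (
    generate_questions_for_documents,
    infuse_questions_with_false_assumptions,
    generate_RAG_responses,
    find_false_assumptions_in_questions,
    check_if_response_defused_confusion,
    filter_undefused_confusions_and_compute_metrics,
)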

llm_q = "llama3-8B-in"  # LLM for generating questions (alternatives: "gpt-3.5", "gpt-4o")
llm_r = "llama3-8B-in"  # LLM for generating responses (alternative: "gpt-3.5")

num_q = 10  # Number of questions generated per document

prompts_folder = "prompts"
data_folder = "experiments/2024-07-18-llama3"

input_doc_path  = os.path.join(data_folder, "docs_in.csv")
output_doc_path = os.path.join(data_folder, "docs_out.csv")
output_qrc_path = os.path.join(data_folder, "qrc_out.csv")
filter_qrc_path = os.path.join(data_folder, "qrc_filter.csv")

read_prompts(prompts_folder)

**STEP 1:**  For each document, ask `LLM_q` to write `num_q` questions answered in the document.

In [5]:
doc_path_1 = os.path.join(data_folder, "docs_1.csv")
generate_questions_for_documents(llm_q, num_q, doc_csv_schema, input_doc_path, doc_path_1)

Read the input document table from CSV file:
    experiments/2024-07-18-llama3/docs_in.csv
    Rows: 20,  Cols: 3
    Index(['doc_id', 'source', 'document'], dtype='object')
Generate 10 questions for each document


100%|██████████| 20/20 [02:16<00:00,  6.84s/it]


Write the table with questions to CSV file:
    experiments/2024-07-18-llama3/docs_1.csv
    Rows: 20,  Cols: 5
    Index(['doc_id', 'source', 'document', 'LLM_q', 'orig_questions'], dtype='object')


**STEP 2:**  Ask `LLM_q` to modify each question in the list so that it makes a false assumption. This step is tricky and requires few-shot examples to get right. Any improvements are welcome! Since `LLM_q` may be a strong and expensive LLM, we call it once per document with the whole list of questions, rather than once per question.
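The idea of one call per list of questions can be sketched roughly as follows. This is not the prompt used by `datagen`: the instruction text, the few-shot pair, and the helpers `build_batch_prompt` and `parse_numbered_list` are all made up for illustration, and the real prompts live in the prompts folder.

```python
import re

# Made-up few-shot pair (original question, question with a false assumption).
FEW_SHOT = [
    ("When was the bridge opened?",
     "Why was the bridge demolished right after it opened?"),
]


def build_batch_prompt(document, questions):
    """Pack the document and all of its questions into a single prompt string."""
    lines = ["Rewrite each question so that it assumes something the document contradicts.", ""]
    for original, rewritten in FEW_SHOT:
        lines += [f"Original: {original}", f"Rewritten: {rewritten}", ""]
    lines += ["Document:", document, "", "Questions:"]
    lines += [f"{i}. {q}" for i, q in enumerate(questions, start=1)]
    return "\n".join(lines)


def parse_numbered_list(reply, expected):
    """Extract items like '1. ...' or '2) ...' and check that none went missing."""
    items = re.findall(r"^\s*\d+[.)]\s*(.+?)\s*$", reply, flags=re.MULTILINE)
    if len(items) != expected:
        raise ValueError(f"expected {expected} questions, got {len(items)}")
    return items


prompt = build_batch_prompt("Some document text.", ["Q1?", "Q2?"])
print(prompt.splitlines()[-1])                  # → 2. Q2?
print(parse_numbered_list("1. A?\n2) B?", 2))   # → ['A?', 'B?']
```

Checking the item count after parsing matters in a batched setup, because a dropped or merged question would otherwise shift every later question out of alignment with its original.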

In [6]:
doc_path_1 = os.path.join(data_folder, "docs_1.csv")
infuse_questions_with_false_assumptions(doc_csv_schema, doc_path_1, output_doc_path)

Read the document-and-questions table from CSV file:
    experiments/2024-07-18-llama3/docs_1.csv
    Rows: 20,  Cols: 5
    Index(['doc_id', 'source', 'document', 'LLM_q', 'orig_questions'], dtype='object')
Modify each question by adding confusing (false) assumptions


100%|██████████| 20/20 [02:02<00:00,  6.15s/it]

Write the table with confusing questions to CSV file:
    experiments/2024-07-18-llama3/docs_out.csv
    Rows: 20,  Cols: 6
    Index(['doc_id', 'source', 'document', 'LLM_q', 'orig_questions',
       'conf_questions'],
      dtype='object')

**STEP 3:**  Switch to `LLM_r`. Give it the document and each question (both original and confusing), and record its response, as in a RAG setting.

In [7]:
qrc_path_1 = os.path.join(data_folder, "qrc_1.csv")
generate_RAG_responses(llm_r, doc_csv_schema, output_doc_path, qrc_csv_schema, qrc_path_1)

Read the document-and-questions table from CSV file:
    experiments/2024-07-18-llama3/docs_out.csv
    Rows: 20,  Cols: 6
    Index(['doc_id', 'source', 'document', 'LLM_q', 'orig_questions',
       'conf_questions'],
      dtype='object')
Generate RAG response for each question, both original and confusing


100%|██████████| 20/20 [09:10<00:00, 27.54s/it]

Write the question-response table to CSV file:
    experiments/2024-07-18-llama3/qrc_1.csv
    Rows: 400,  Cols: 6
    Index(['doc_id', 'q_id', 'is_confusing', 'question', 'LLM_r', 'response'], dtype='object')

**STEP 4:**  Ask `LLM_r` to find the false assumption in each question (including the original "clean" questions, used as control).

In [8]:
qrc_path_2 = os.path.join(data_folder, "qrc_2.csv")
find_false_assumptions_in_questions(doc_csv_schema, output_doc_path, qrc_csv_schema, qrc_path_1, qrc_path_2)

Read the document table from CSV file:
    experiments/2024-07-18-llama3/docs_out.csv
    Rows: 20,  Cols: 6
    Index(['doc_id', 'source', 'document', 'LLM_q', 'orig_questions',
       'conf_questions'],
      dtype='object')
Create a dictionary of indexed documents


100%|██████████| 20/20 [00:00<00:00, 234.19it/s]


Read the question-response table from CSV file:
    experiments/2024-07-18-llama3/qrc_1.csv
    Rows: 400,  Cols: 6
    Index(['doc_id', 'q_id', 'is_confusing', 'question', 'LLM_r', 'response'], dtype='object')
Ask LLM to find a false assumption in each question, or say 'none'


100%|██████████| 400/400 [10:31<00:00,  1.58s/it]

Write the question-response table to CSV file:
    experiments/2024-07-18-llama3/qrc_2.csv
    Rows: 400,  Cols: 7
    Index(['doc_id', 'q_id', 'is_confusing', 'question', 'LLM_r', 'response',
       'confusion'],
      dtype='object')

**STEP 5:**  Ask the LLM to check if its initial response pointed out the false assumption. We are interested in the questions where (a) the question contained a false assumption, (b) the LLM's response missed it, and (c) the LLM can now see that its response missed it.
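In terms of the question-response table, the three conditions above can be approximated by a simple row predicate. This is a minimal sketch, assuming the columns hold exactly the values described in the schema ("yes"/"no", and "none" when no confusion was found); `is_undefused_confusion` is a hypothetical helper, not a `datagen` function.

```python
def is_undefused_confusion(row):
    """True if the row looks like a confusing question whose response missed the confusion."""
    is_confusing = row["is_confusing"].strip().lower() == "yes"     # (a) false assumption was injected
    confusion_found = row["confusion"].strip().lower() != "none"    # the LLM found it when asked
    not_defused = row["is_defused"].strip().lower() == "no"         # (b), (c) response missed it
    return is_confusing and confusion_found and not_defused


# Made-up rows, for illustration only.
rows = [
    {"q_id": 1, "is_confusing": "yes", "confusion": "Assumes X.", "is_defused": "no"},
    {"q_id": 2, "is_confusing": "yes", "confusion": "Assumes Y.", "is_defused": "yes"},
    {"q_id": 3, "is_confusing": "yes", "confusion": "none", "is_defused": "no"},
    {"q_id": 4, "is_confusing": "no", "confusion": "Assumes Z.", "is_defused": "no"},
]

kept = [row for row in rows if is_undefused_confusion(row)]
print([row["q_id"] for row in kept])  # → [1]
```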

In [9]:
check_if_response_defused_confusion(doc_csv_schema, output_doc_path, qrc_csv_schema, qrc_path_2, output_qrc_path)

Read the document table from CSV file:
    experiments/2024-07-18-llama3/docs_out.csv
    Rows: 20,  Cols: 6
    Index(['doc_id', 'source', 'document', 'LLM_q', 'orig_questions',
       'conf_questions'],
      dtype='object')
Create a dictionary of indexed documents


100%|██████████| 20/20 [00:00<00:00, 1242.02it/s]


Read the question-response table from CSV file:
    experiments/2024-07-18-llama3/qrc_2.csv
    Rows: 400,  Cols: 7
    Index(['doc_id', 'q_id', 'is_confusing', 'question', 'LLM_r', 'response',
       'confusion'],
      dtype='object')
Ask LLM to check if its own response defused the confusion


100%|██████████| 400/400 [06:54<00:00,  1.04s/it]

Write the question-response table to CSV file:
    experiments/2024-07-18-llama3/qrc_out.csv
    Rows: 400,  Cols: 9
    Index(['doc_id', 'q_id', 'is_confusing', 'question', 'LLM_r', 'response',
       'confusion', 'defusion', 'is_defused'],
      dtype='object')

**STEP 6:**  Count the original and the confusing questions in which confusion was detected. These counts serve as performance metrics (true and false positives and negatives). Then keep only the confusing questions whose confusion was detected but not defused.

In [10]:
filter_undefused_confusions_and_compute_metrics(qrc_csv_schema, output_qrc_path, filter_qrc_path)

Read the question-response table from CSV file:
    experiments/2024-07-18-llama3/qrc_out.csv
    Rows: 400,  Cols: 9
    Index(['doc_id', 'q_id', 'is_confusing', 'question', 'LLM_r', 'response',
       'confusion', 'defusion', 'is_defused'],
      dtype='object')


100%|██████████| 400/400 [00:00<00:00, 6491.20it/s]

Original (non-confusing) questions:
    Total questions = 200
    With confusion detected = 122
    With confusion detected and defused = 0
Confusing questions:
    Total questions = 200
    With confusion detected = 74
    With confusion detected and defused = 4
    With confusion detected, but not defused = 70
Write the filtered question-response table to CSV file:
    experiments/2024-07-18-llama3/qrc_filter.csv
    Rows: 70,  Cols: 9
    Index(['doc_id', 'q_id', 'is_confusing', 'question', 'LLM_r', 'response',
       'confusion', 'defusion', 'is_defused'],
      dtype='object')
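The counts printed above can be turned into standard detection metrics. The sketch below treats "confusion detected" as the positive prediction and "question is confusing" as the positive label, which is one possible reading of the output; the helper `detection_metrics` is hypothetical.

```python
def detection_metrics(n_orig, orig_detected, n_conf, conf_detected):
    """Compute basic metrics for detecting false premises in questions."""
    tp = conf_detected            # confusing questions flagged as confusing
    fn = n_conf - conf_detected   # confusing questions not flagged
    fp = orig_detected            # original questions wrongly flagged
    tn = n_orig - orig_detected   # original questions correctly left alone
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Counts taken from the output of this run.
metrics = detection_metrics(n_orig=200, orig_detected=122, n_conf=200, conf_detected=74)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# → precision: 0.378
# → recall: 0.370
# → false_positive_rate: 0.610
# → accuracy: 0.380
```

In this run, confusion was flagged in more original questions (122) than confusing ones (74). That suggests the "confusion detected" signal alone is a weak filter here, and that the 70 retained questions are worth reviewing by hand.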



