boschresearch/DelucionQA

This repository contains the dataset proposed in the EMNLP 2023 (Findings) paper titled "DelucionQA: Detecting Hallucinations in Domain-specific Question Answering."

If you face any difficulties downloading the dataset, please raise an issue in this repository or contact us at sadat.mobashir@gmail.com.

Abstract

Hallucination is a well-known phenomenon in language generation for large language models (LLMs). The existence of hallucinatory responses is found in almost all application scenarios e.g., summarization, question-answering (QA) etc. For applications requiring high reliability (e.g., customer-facing assistants), the potential existence of hallucination in LLM-generated text is a critical problem. The amount of hallucination can be reduced by leveraging information retrieval to provide relevant background information to the LLM. However, LLMs can still generate hallucinatory content for various reasons (e.g., prioritizing its parametric knowledge over the context, failure to capture the relevant information from the context, etc.). Detecting hallucinations through automated methods is thus paramount. To facilitate research in this direction, we introduce a sophisticated dataset, DelucionQA, that captures hallucinations made by retrieval-augmented LLMs for a domain-specific QA task. Furthermore, we propose a set of hallucination detection methods to serve as baselines for future works from the research community. Analysis and case study are also provided to share valuable insights on hallucination phenomena in the target scenario.

Dataset Description

The DelucionQA dataset is derived from the 2023 Jeep Gladiator car manual. First, we use an LLM to generate candidate questions based on the car manual. Then, using various information retrieval methods, we retrieve context relevant to each question. Next, we prompt ChatGPT to answer the question based on the information available in the context. After generating the <question, retrieved context, generated answer> triples, we use the Amazon Mechanical Turk (MTurk) platform to annotate whether each sentence in the generated answer of a given triple is supported/conflicted/neither supported nor conflicted with respect to the context. Based on the sentence-level annotations, we assign a binary "Hallucinated"/"Not Hallucinated" label to each triple. For an in-depth description of our dataset construction process, we refer the reader to our paper.

Data Statistics

| Split | #Questions | #Triples | #Hallucinated | #Not Hallucinated |
| ----- | ---------- | -------- | ------------- | ----------------- |
| Train | 513        | 1,151    | 392           | 759               |
| Dev   | 100        | 216      | 94            | 122               |
| Test  | 300        | 671      | 252           | 419               |
| Total | 913        | 2,038    | 738           | 1,300             |

Table 1: Number of unique questions, number of triples, and label distribution in each split of DelucionQA.

Files

=> Our dataset is located in the "./data/DelucionQA_final/" directory. The files "train.csv", "dev.csv", and "test.csv" contain the training, development, and test sets, respectively. Each file has the following columns:

* 'sample_id': a unique id for each sample.
* 'Retreival Setting': the information retrieval method that was used to retrieve the context in each sample.
* 'Question': the question posed about the car manual.
* 'Context': the retrieved context relevant to the question, in numeric format. Please follow the instructions below to convert the context to textual format.
* 'Answer': answer generated by the LLM for the given Question and Context.
* 'Answer_sent_tokenized': sentence tokenized version of the generated Answer.
* 'Sentence_labels': labels for each sentence in the answer indicating whether it is supported/conflicted/neither supported nor conflicted with respect to the Context.
* 'Label': label indicating whether the answer contains hallucination or not.
* 'Answerable': True/False labels indicating whether the Question is answerable based on the retrieved Context.
* 'Does_not_answer': True/False labels indicating whether the answer implies "I don't know" i.e., the LLM refuses to provide an answer.
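The split files can be loaded with any CSV reader. Below is a minimal sketch using Python's standard library; the rows are invented placeholders (not actual dataset entries), and only a subset of the columns listed above is shown:

```python
import csv
import io

# Illustrative only: a tiny in-memory snippet mimicking a subset of the
# dataset's columns. The real files contain more columns and rows.
snippet = io.StringIO(
    "sample_id,Question,Label\n"
    "1,How do I check the tire pressure?,Not Hallucinated\n"
    "2,How do I reset the oil-change light?,Hallucinated\n"
)

rows = list(csv.DictReader(snippet))
hallucinated_ids = [r["sample_id"] for r in rows if r["Label"] == "Hallucinated"]
print(hallucinated_ids)  # ['2']
```

Reading the real files works the same way, with `open("./data/DelucionQA_final/train.csv")` in place of the in-memory snippet.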

=> We provide a collection of 240 samples that we excluded from the three splits of DelucionQA. These samples were selected using a rule-based method that detects whether the answer provided by ChatGPT implies "I don't know." They are included in a file named "unanswerable.csv" along with their 'Answerable' labels. Note that the train/dev/test splits may still contain unanswerable examples that went undetected by our rule-based method (i.e., their answers imply "I don't know" in a more subtle manner).
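The actual rule-based method is described in the paper; the sketch below only illustrates the general idea of phrase-based refusal detection. The phrase list is an assumption for illustration, not the rule set used to build "unanswerable.csv":

```python
# Hypothetical phrase list for illustration only; not the authors' rules.
REFUSAL_PHRASES = (
    "i don't know",
    "i do not know",
    "not mentioned in the context",
    "cannot be answered based on",
)

def looks_like_refusal(answer: str) -> bool:
    """Return True if the answer appears to imply "I don't know"."""
    text = answer.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

print(looks_like_refusal("I don't know; this is not mentioned in the context."))  # True
print(looks_like_refusal("Press the brake pedal and shift into Park."))           # False
```

As the note above explains, simple phrase matching of this kind misses refusals expressed in subtler wording.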

Reconstructing the context

Due to licensing issues, we release the context retrieved for each question from the car manual in a numeric format. Here are the instructions to convert it to textual format:

  • Create a Python virtual environment (e.g., with conda), then install the required Python packages (we tested our code with Python 3.11):

    install.sh
    

    The installation might need sudo privileges to run "playwright install-deps".

  • From the project root, run the code to (1) crawl the data and (2) perform the conversion:

    run.sh
    
  • run.sh contains two Python commands. The first crawls the car manual data into a file named 'data.jsonl':

    PYTHONPATH=src python -m crawler.main
    
    • params:
      • input_url_file: each line contains metadata for a target vehicle, tab-separated in the format (BrandName ModelName Year URL_to_Vehicle).
      • output_folder: location of the crawled data. A 'data.jsonl' file will be generated in the './data/data_for_index' folder (Jeep Gladiator 2023).
  • After the data has been crawled, the second step is to run the following command:

    PYTHONPATH=src python -m context_reconstruction.main_convert_context_to_textual
    
    • params:
      • base: location of a directory containing the train, dev, and test files in CSV format.
      • full_text_location: location of the 'data.jsonl' file containing the full text of the Jeep Gladiator 2023 manual, produced by the crawling step.
  • Please note:

  1. We use hydra for configuration; run the commands from the project root so that the hydra config can be read. Crawling (the first step) creates a 'data.jsonl' file as output, which is one of the inputs for the second step (context conversion).
  2. The crawler is implemented against the website's current content and may need updating if the target page changes significantly.
  3. The crawler has a 'headless' parameter in 'config/config_crawl.yaml', which is set to True by default to visit the website headlessly (no browser GUI is opened). If 'headless' is set to False, a Chromium browser opens to visit the target URL and the links on the target page during crawling.
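The resulting 'data.jsonl' file can be read line by line as JSON objects. A minimal sketch follows; the field names in the demo records ("section", "text") are invented for illustration, since the crawler's actual schema is not documented here:

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Yield one JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Demo on an invented two-record file; the real 'data.jsonl' schema may differ.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write('{"section": "Tires", "text": "Check the tire pressure monthly."}\n')
    tmp.write('{"section": "Oil", "text": "Use the recommended engine oil."}\n')
    path = tmp.name

records = list(read_jsonl(path))
os.remove(path)
print(len(records))  # 2
```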

Baseline Performance

We experiment with several baseline methods for DelucionQA. Macro-F1 scores on all three splits are shown below; the baseline methods are described in detail in our paper.

| Method        | Train | Dev   | Test  |
| ------------- | ----- | ----- | ----- |
| Sim-cosine    | 70.03 | 74.78 | 69.45 |
| Sim-overlap   | 75.59 | 76.84 | 71.09 |
| Sim-hybrid    | 75.94 | 76.84 | 70.81 |
| Keyword-match | 53.86 | 50.57 | 52.77 |

Table 2: Macro F1 scores of our baseline methods on the three splits of DelucionQA.
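The exact baselines are defined in the paper; the sketch below only illustrates the general shape of a similarity-based detector in the spirit of Sim-cosine, using a simple bag-of-words cosine similarity and an invented threshold (the paper's baselines are not implemented this way verbatim):

```python
import math
import re
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity over lowercase alphanumeric tokens."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0

def flag_hallucination(answer_sentences, context, threshold=0.5):
    """Flag the answer if any sentence is insufficiently similar to the context.

    The 0.5 threshold is an arbitrary illustrative choice, not a tuned value.
    """
    return any(cosine_sim(s, context) < threshold for s in answer_sentences)

context = "Check the tire pressure monthly using the gauge in the glove box."
print(flag_hallucination(["Check the tire pressure monthly."], context))            # False
print(flag_hallucination(["The tires inflate themselves automatically."], context))  # True
```

This operates on the 'Answer_sent_tokenized' and 'Context' columns described above: sentences well supported by the context score high, while unsupported sentences fall below the threshold and flag the triple.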

Citation

If you use this dataset, please cite our paper:

@inproceedings{sadat-etal-2023-delucionqa,
    title = "{D}elucion{QA}: Detecting Hallucinations in Domain-specific Question Answering",
    author = "Sadat, Mobashir  and
      Zhou, Zhengyu  and
      Lange, Lukas  and
      Araki, Jun  and
      Gundroo, Arsalan  and
      Wang, Bingqing  and
      Menon, Rakesh  and
      Parvez, Md  and
      Feng, Zhe",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.59",
    doi = "10.18653/v1/2023.findings-emnlp.59",
    pages = "822--835",
}

License

The code in this repository is open-sourced under the AGPL-3.0 license. See the LICENSE file for details. For a list of other open source components included in this project, see the file 3rd-party-licenses.txt.

The data folder contains files that are used for reconstructing the DelucionQA data, which are licensed under Creative Commons Attribution 4.0 International License (CC-BY-4.0).

Contact

Please contact us at zhengyu.zhou2@bosch.com, msadat3@uic.edu, sadat.mobashir@gmail.com with any questions.
