FaithScore

This is the official release accompanying our paper, FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models. FAITHSCORE is also available as a pip package.

If you find FAITHSCORE useful, please cite:

@misc{faithscore,
      title={FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models}, 
      author={Liqiang Jing and Ruosen Li and Yunmo Chen and Mengzhao Jia and Xinya Du},
      year={2023},
      eprint={2311.01477},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Process

(Figure: overview of the FAITHSCORE evaluation process.)

Install

  1. Install LLaVA 1.5.

    Note that you don't need to download the LLaVA 1.5 parameter weights if you use OFA for fact verification.

  2. Install ModelScope:

    pip install modelscope
    pip install "modelscope[multi-modal]" 
  3. Install our package.

    pip install faithscore==0.0.9
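
As an optional sanity check that the installation succeeded, the scorer class used in the examples below should be importable. A minimal check:

# Should import without errors once faithscore is installed.
from faithscore.framework import FaithScore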

Running FaithScore using the pip Package

You can evaluate answers generated by large vision-language models via our metric.

from faithscore.framework import FaithScore

images = ["./COCO_val2014_000000164255.jpg"]
answers = ["The main object in the image is a colorful beach umbrella."]

scorer = FaithScore(
    vem_type="...",        # "ofa-ve", "ofa", or "llava" (see parameters below)
    api_key="...",         # OpenAI API key
    llava_path=".../llava/eval/checkpoints/llava-v1.5-13b",
    use_llama=False,       # False: use ChatGPT for sub-sentence identification
    llama_path="...llama2/llama-2-7b-hf",  # only used when use_llama is True
)
score, sentence_score = scorer.faithscore(answers, images)

Parameters for the FaithScore class:

  • vem_type: You can set this parameter to ofa-ve, ofa, or llava. It decides which model is used for fact verification (a configuration sketch combining these options follows this list).
  • api_key: OpenAI API key.
  • llava_path: The model folder for LLaVA 1.5. If you don't set vem_type to llava, you don't need to set it.
  • use_llama: Whether to use LLaMA for sub-sentence identification. If it is False, the code uses ChatGPT for this stage.
  • llama_path: The model folder for LLaMA 2 7B. Before using it, please convert the model weights into the Hugging Face format (for details, refer to this link). If you don't use LLaMA, skip this parameter.
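
For example, here is a minimal configuration sketch that pairs OFA-based fact verification with LLaMA-based sub-sentence identification. The paths and key are placeholders, and passing an empty llava_path when it is unused is an assumption about the constructor, not a documented requirement:

from faithscore.framework import FaithScore

# Sketch: OFA for fact verification, LLaMA 2 for sub-sentence identification.
scorer = FaithScore(
    vem_type="ofa",                  # no LLaVA weights needed in this setting
    api_key="YOUR_OPENAI_API_KEY",   # placeholder OpenAI API key
    llava_path="",                   # assumed safe to leave unset when vem_type is not "llava"
    use_llama=True,                  # use LLaMA instead of ChatGPT for sub-sentence identification
    llama_path="/path/to/llama-2-7b-hf",  # weights converted to Hugging Face format
)

score, sentence_score = scorer.faithscore(
    ["The main object in the image is a colorful beach umbrella."],
    ["./COCO_val2014_000000164255.jpg"],
)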

Running FaithScore using the Command Line

You can also evaluate generated answers with the following command line:

python run.py --answer_path {answer_path} --openai_key {openai_key} --vem_type {vem_type} --llava_path {llava_path} --llama_path {llama_path} --use_llama {use_llama}

Parameters:

  • --answer_path: The answer file, e.g., test.jsonl. It should be in .jsonl format, where each line is a dict {"answer": "xxx", "image": "xxxx"} containing an answer generated by an LVLM and the corresponding image path (see the sketch after this list).
  • --openai_key: OpenAI API key.
  • --vem_type: ofa-ve, ofa, or llava. It decides which model is used for fact verification.
  • --llava_path: The model folder for LLaVA 1.5. If you don't set vem_type to llava, you don't need to set it.
  • --use_llama: Whether to use LLaMA for sub-sentence identification. If it is False, the code uses ChatGPT for this stage.
  • --llama_path: The model folder for LLaMA 2 7B. Before using it, please convert the model weights into the Hugging Face format (for details, refer to this link). If you don't use LLaMA, skip this parameter.
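
As a sketch of how such an answer file can be produced (the file name and the record below are illustrative only):

import json

# Each line of the answer file is a JSON object with the LVLM answer and the image path.
records = [
    {"answer": "The main object in the image is a colorful beach umbrella.",
     "image": "./COCO_val2014_000000164255.jpg"},
]

with open("test.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")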

Data

Annotation Data

The data is given in a json format file. For example,

{"id": "000000525439", "answer": "The skateboard is positioned on a ramp, with the skateboarder standing on it.", "stage 1": {"The skateboard is positioned on a ramp": 1, " with the skateboarder standing on it": 1}, "stage 2": {"There is a skateboard.": 1, "There is a ramp.": 0, "There is a skateboarder.": 1, "The skateboarder is standing on a skateboard.": 0}}

Data Format:

  • id: the question ID. You can get the corresponding image name in the COCO validation dataset (2014) via the format COCO_val2014_{id}.jpg.
  • answer: the answer generated by the Large Vision-Language Model.
  • stage 1: all the sub-sentences and their corresponding labels. 1 denotes an analytical sub-sentence.
  • stage 2: all the atomic facts and their corresponding labels. 1 denotes that the atomic fact contains a hallucination.

You can download our annotation dataset.
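
A minimal sketch for reading the annotation file and tallying the stage labels, assuming one JSON object per line as in the example above (the file name is a placeholder):

import json

analytical_subsentences = 0
hallucinated_facts = 0
total_facts = 0

with open("annotations.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        # stage 1: sub-sentence -> label (1 = analytical sub-sentence)
        analytical_subsentences += sum(entry["stage 1"].values())
        # stage 2: atomic fact -> label (1 = the fact contains a hallucination)
        hallucinated_facts += sum(entry["stage 2"].values())
        total_facts += len(entry["stage 2"])

print(analytical_subsentences, hallucinated_facts, total_facts)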

Automatic Evaluation Benchmarks

You can download our automatic evaluation benchmarks.

Leaderboard

Public LVLM leaderboard computed on LLaVA-1k.

Model             FaithScore   Sentence-level FaithScore
Multimodal-GPT    0.53         0.49
MiniGPT-4         0.57         0.65
mPLUG-Owl         0.72         0.70
InstructBLIP      0.81         0.72
LLaVA             0.84         0.73
LLaVA-1.5         0.86         0.77

Public LVLM leaderboard computed on MSCOCO-Cap.

Model             FaithScore   Sentence-level FaithScore
Multimodal-GPT    0.54         0.63
MiniGPT-4         0.64         0.60
mPLUG-Owl         0.85         0.67
InstructBLIP      0.94         0.80
LLaVA             0.87         0.64
LLaVA-1.5         0.94         0.83
