Automated LLM-based evaluation has emerged as a scalable and efficient alternative to human evaluation of large language models. This repo collects papers on LLM-based evaluators.
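To make the idea concrete, below is a minimal sketch of the pairwise "LLM-as-a-judge" setup studied in several of the papers listed here (e.g., MT-Bench-style pairwise comparison, with a swapped second pass to mitigate the position bias discussed in works such as "Split and merge"). The prompt wording, the `complete` callable, and the verdict parsing are illustrative assumptions, not the exact protocol of any specific paper.

```python
from typing import Callable

# Illustrative judge prompt; real papers use more detailed rubrics.
JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user question and decide which one is better.

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}

Answer with a single token: "A", "B", or "tie"."""


def judge_pair(question: str, answer_a: str, answer_b: str,
               complete: Callable[[str], str]) -> str:
    """Ask an LLM which answer is better.

    `complete` is assumed to be a user-supplied wrapper that sends a prompt
    string to a chat/completion API and returns the model's text.
    The verdict parsing below is deliberately naive.
    """
    prompt = JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
    verdict = complete(prompt).strip().lower()
    if verdict.startswith("a"):
        return "A"
    if verdict.startswith("b"):
        return "B"
    return "tie"


def judge_pair_debiased(question: str, answer_a: str, answer_b: str,
                        complete: Callable[[str], str]) -> str:
    """Run the judge twice with swapped positions and keep the verdict only if
    both orderings agree, otherwise call it a tie (a simple position-bias check)."""
    first = judge_pair(question, answer_a, answer_b, complete)
    second = judge_pair(question, answer_b, answer_a, complete)
    swapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == swapped else "tie"
```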
Title & Authors | Venue | Year | Citation Count | Code |
---|---|---|---|---|
ChatEval: Towards better LLM-based evaluators through multi-agent debate by Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu | ICLR | 2024 | 107 | GitHub |
DyVal: Graph-informed dynamic evaluation of large language models by Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie | ICLR | 2024 | 6 | GitHub |
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization by Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Wenjin Yao, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang | ICLR | 2024 | 52 | GitHub |
Can large language models be an alternative to human evaluations? by Cheng-Han Chiang, Hung-yi Lee | ACL | 2023 | 133 | - |
LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models by Yen-Ting Lin, Yun-Nung Chen | NLP4ConvAI | 2023 | 42 | - |
Are large language model-based evaluators the solution to scaling up multilingual evaluation? by Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram | EACL Findings | 2024 | 12 | GitHub |
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena by Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica | NeurIPS | 2023 | 510 | GitHub |
Calibrating LLM-Based Evaluator by Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang | arXiv | 2023 | 7 | - |
LLM-based NLG Evaluation: Current Status and Challenges by Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, Xiaojun Wan | arXiv | 2024 | 1 | - |
Are LLM-based Evaluators Confusing NLG Quality Criteria? by Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan | arXiv | 2024 | - | - |
PRE: A Peer Review Based Large Language Model Evaluator by Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, Yiqun Liu | arXiv | 2024 | 1 | GitHub |
ALLURE: A systematic protocol for auditing and improving LLM-based evaluation of text using iterative in-context-learning by Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, Ida Momennejad | arXiv | 2023 | 4 | - |
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate by Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu | arXiv | 2024 | - | GitHub |
Split and merge: Aligning position biases in large language model based evaluators by Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, Yang Liu | arXiv | 2023 | 8 | - |
One Prompt To Rule Them All: LLMs for Opinion Summary Evaluation by Tejpalsingh Siledar, Swaroop Nath, Sankara Sri Raghava Ravindra Muddu, Rupasai Rangaraju, Swaprava Nath, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Sudhanshu Shekhar Singh, Muthusamy Chelliah, Nikesh Garera | arXiv | 2024 | - | GitHub |
CritiqueLLM: Scaling LLM-as-Critic for effective and explainable evaluation of large language model generation by Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang | arXiv | 2023 | 9 | GitHub |
Is ChatGPT a good NLG evaluator? A preliminary study by Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou | NewSumm@EMNLP | 2023 | 179 | GitHub |
G-Eval: NLG evaluation using GPT-4 with better human alignment by Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu | EMNLP | 2023 | 381 | GitHub |
GPTScore: Evaluate as You Desire by Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu | EMNLP | 2023 | 228 | GitHub |
Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study by Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu | arXiv | 2023 | 39 | - |
Evaluating General-Purpose AI with Psychometrics by Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, David Stillwell, Luning Sun, Fang Luo, Xing Xie | arXiv | 2023 | 3 | - |
Title & Authors | Venue | Year | Citation Count | Code |
---|---|---|---|---|
InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews by Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, Yanghua Xiao | arXiv | 2023 | 16 | GitHub |
Who is ChatGPT? Benchmarking LLMs’ Psychological Portrayal Using PsychoBench by Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu | ICLR | 2024 | 10 | GitHub |
Evaluating and Inducing Personality in Pre-trained Language Models by Guangyuan Jiang, Manjie Xu, Song-Chun Zhu, Wenjuan Han, Chi Zhang, Yixin Zhu | NeurIPS | 2023 | 47 | GitHub |
Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective by Yan Zhuang, Qi Liu, Yuting Ning, Weizhe Huang, Rui Lv, Zhenya Huang, Guanhao Zhao, Zheng Zhang, Qingyang Mao, Shijin Wang, Enhong Chen | arXiv | 2023 | 19 | - |
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts by Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao | arXiv | 2023 | 49 | GitHub |
LLM Agents for Psychology: A Study on Gamified Assessments by Qisen Yang, Zekun Wang, Honghui Chen, Shenzhi Wang, Yifan Pu, Xin Gao, Wenhao Huang, Shiji Song, Gao Huang | arXiv | 2024 | - | - |
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback by Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji | ICLR | 2024 | 37 | GitHub |
GPT-4’s assessment of its performance in a USMLE-based case study by Uttam Dhakal, Aniket Kumar Singh, Suman Devkota, Yogesh Sapkota, Bishal Lamichhane, Suprinsa Paudyal, Chandra Dhakal | arXiv | 2024 | 1 | - |