aliang-chatgpt/EvaluationPapers4ChatGPT

 
 

Evaluation Papers for ChatGPT

License: MIT

Introduction

This repository stores Dataset Resources, Evaluation Papers and Detection Tools for ChatGPT.

0. Survey

  1. ChatGPT: A Meta-Analysis after 2.5 Months.

    Christoph Leiter, Ran Zhang, Yanran Chen, Jonas Belouadi, Daniil Larionov, Vivian Fresen, Steffen Eger. [abs], 2023.2

1. Dataset Resource

  1. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection.

    Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, Yupeng Wu. [abs],[github], 2023.1

  2. ChatGPT: Jack of all trades, master of none.

    Jan Kocoń, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, Anna Kocoń, Bartłomiej Koptyra, Wiktoria Mieleszczenko-Kowszewicz, Piotr Miłkowski, Marcin Oleksy, Maciej Piasecki, Łukasz Radliński, Konrad Wojtasik, Stanisław Woźniak and Przemysław Kazienko. [abs],[github], 2023.2

  3. Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT.

    Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao. [abs],[github], 2023.2

  4. Is ChatGPT A Good Translator? A Preliminary Study.

    Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Zhaopeng Tu. [abs],[github], 2023.1

  5. On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective.

    Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang, Wei Ye, Xiubo Geng, Binxin Jiao, Yue Zhang, Xing Xie. [abs],[github], 2023.2

  6. An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP).

    Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, Lakshmivihari Mareedu. [abs],[github], 2023.2

Data statistics of these resources:

| Paper with Dataset | Task | #Examples |
| --- | --- | --- |
| How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection | QA + Dialog | 40,000 |
| ChatGPT: Jack of all trades, master of none | 25 classification / QA / reasoning tasks | 38,000 |
| Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT | Sentiment analysis / Paraphrase / NLI | 475 |
| Is ChatGPT A Good Translator? A Preliminary Study | Translation | 5,609 |
| On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective | Robustness | 2,237 |
| An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP) | Reasoning | 1,000 |

2. Evaluation Papers

2.1 Natural Language Understanding

  1. Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT.

    Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao. [abs],[github], 2023.2

  2. ChatGPT: Jack of all trades, master of none.

    Jan Kocoń, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, Marcin Gruza, Arkadiusz Janz, Kamil Kanclerz, Anna Kocoń, Bartłomiej Koptyra, Wiktoria Mieleszczenko-Kowszewicz, Piotr Miłkowski, Marcin Oleksy, Maciej Piasecki, Łukasz Radliński, Konrad Wojtasik, Stanisław Woźniak and Przemysław Kazienko. [abs],[github], 2023.2

  3. How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks.

    Xuanting Chen, Junjie Ye, Can Zu, Nuo Xu, Rui Zheng, Minlong Peng, Jie Zhou, Tao Gui, Qi Zhang, Xuanjing Huang. [abs], 2023.3

2.2 Open-ended Generation

  1. Exploring AI Ethics of ChatGPT: A Diagnostic Analysis.

    Terry Yue Zhuo, Yujin Huang, Chunyang Chen, Zhenchang Xing. [abs], 2023.2

  2. Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech.

    Fan Huang, Haewoon Kwak, Jisun An. [abs], 2023.2

2.3 Long Text Summarization

  1. Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization.

    Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, Wei Cheng. [abs], 2023.2

  2. Can ChatGPT Write a Good Boolean Query for Systematic Review Literature Search?

    Shuai Wang, Harrisen Scells, Bevan Koopman, Guido Zuccon. [abs], 2023.2

  3. ChatGPT Makes Medicine Easy to Swallow: An Exploratory Case Study on Simplified Radiology Reports.

    Katharina Jeblick, Balthasar Schachtner, Jakob Dexl, Andreas Mittermeier, Anna Theresa Stüber, Johanna Topalis, Tobias Weber, Philipp Wesp, Bastian Sabel, Jens Ricke, Michael Ingrisch. [abs], 2022.12

  4. Cross-Lingual Summarization via ChatGPT.

    Jiaan Wang, Yunlong Liang, Fandong Meng, Zhixu Li, Jianfeng Qu, Jie Zhou. [abs], 2023.2

2.4 Reasoning

  1. Mathematical Capabilities of ChatGPT.

    Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, Julius Berner. [abs], 2023.1

  2. Is ChatGPT a General-Purpose Natural Language Processing Task Solver?

    Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, Diyi Yang. [abs], 2023.2

  3. A Categorical Archive of ChatGPT Failures.

    Ali Borji. [abs], 2023.2

  4. An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP).

    Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, Lakshmivihari Mareedu. [abs],[github], 2023.2

2.5 Multimodal

  1. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity.

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, Pascale Fung. [abs], 2023.2

  2. A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning.

    Zhisheng Tang, Mayank Kejriwal. [abs], 2023.2

2.6 Information Extraction

  1. Zero-Shot Information Extraction via Chatting with ChatGPT.

    Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, Yong Jiang, Wenjuan Han. [abs],[github],[demo], 2023.2

2.7 Other Domains

Education

  1. ChatGPT: The End of Online Exam Integrity?

    Teo Susnjak. [abs], 2022.12

  2. ChatGPT: Bullshit spewer or the end of traditional assessments in higher education?

    Jürgen Rudolph, Samson Tan, Shannon Tan. [pdf], 2023.1

  3. Will ChatGPT get you caught? Rethinking of Plagiarism Detection.

    Mohammad Khalil, Erkan Er. [abs], 2023.2

Biology

  1. How Does ChatGPT Perform on the Medical Licensing Exams? The Implications of Large Language Models for Medical Education and Knowledge Assessment.

    Aidan Gilson, Conrad Safranek, Thomas Huang, Vimig Socrates, Ling Chi, R. Andrew Taylor, David Chartash. [pdf], 2022.12

  2. Evaluating ChatGPT as an Adjunct for Radiologic Decision-Making.

    Arya Rao, John Kim, Meghana Kamineni, Michael Pang, Winston Lie, Marc D. Succi. [pdf], 2023.2

  3. Dr ChatGPT, tell me what I want to hear: How prompt knowledge impacts health answer correctness.

    Guido Zuccon, Bevan Koopman. [abs], 2023.2

Law

  1. ChatGPT Goes to Law School.

    Jonathan H. Choi, Kristin E. Hickman, Amy Monahan, Daniel Schwarcz. [abs], 2023

3. Detection Tools

3.1 Metrics

Metrics Before ChatGPT

  1. DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature.

    Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, Chelsea Finn. [abs],[demo], 2023.1

  2. GPTScore: Evaluate as You Desire.

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu. [abs],[github], 2023.2

  3. MAUVE Scores for Generative Models: Theory and Practice.

    Krishna Pillutla, Lang Liu, John Thickstun, Sean Welleck, Swabha Swayamdipta, Rowan Zellers, Sewoong Oh, Yejin Choi, Zaid Harchaoui. [abs], 2022.12

Using ChatGPT as evaluation metric

  1. Large Language Models Are State-of-the-Art Evaluators of Translation Quality.

    Tom Kocmi, Christian Federmann. [abs],[github], 2023.2

Metrics for detecting ChatGPT

  1. AI vs. Human -- Differentiation Analysis of Scientific Content Generation.

    Yongqiang Ma, Jiawei Liu, Fan Yi, Qikai Cheng, Yong Huang, Wei Lu, Xiaozhong Liu. [abs], 2023.1

3.2 Available Tools

  1. Hello-SimpleAI ChatGPT Detector: An open-source detection project consisting of three models for detecting ChatGPT-generated text: a QA version, a Single-text version, and a Linguistic version.
  2. GPTZero: A demo for detecting writing generated by ChatGPT. Its creator built it as a safeguard after seeing students use the technology to cheat on assignments.
  3. OpenAI Classifier: A classifier fine-tuned on a dataset of pairs of human-written text and AI-written text on the same topic.
  4. Contentatscale AI Content Detector: A tool that gives the submitted text a Human or AI Content score and reports a probability for each sentence.
  5. Writers AI Content Detector: A tool similar to Contentatscale. It takes either a page URL or raw text and calculates a “Human-Generated Content” score.

Statistics of these tools:

| Tool | Detection Target | Language | Input Range (# characters) |
| --- | --- | --- | --- |
| Hello-SimpleAI ChatGPT Detector | ChatGPT | en/zh | (0, ~1,500] (512 tokens) |
| GPTZero | LLM | en | (250, ∞) |
| OpenAI Classifier | LLM | en | (0, ∞) |
| Contentatscale AI Content Detector | AI Content (NLP+SERP) | en | (0, 25,000] |
| Writers AI Content Detector | AI Content | en | (0, 1,500] |
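
The input ranges in the table above can be used to check which tools would accept a text of a given length. The following is a minimal sketch, not part of any of these tools: the `INPUT_RANGES` dictionary and the `tools_accepting` helper are hypothetical names introduced here, the bounds are copied from the table, and the Hello-SimpleAI limit is treated as exactly 1,500 characters even though the table gives it only approximately (~1,500 characters, 512 tokens).

```python
import math

# (exclusive lower bound, inclusive upper bound) in characters, copied from the table.
# Assumption: the approximate "~1500" limit for Hello-SimpleAI is treated as 1500.
INPUT_RANGES = {
    "Hello-SimpleAI ChatGPT Detector": (0, 1500),
    "GPTZero": (250, math.inf),
    "OpenAI Classifier": (0, math.inf),
    "Contentatscale AI Content Detector": (0, 25000),
    "Writers AI Content Detector": (0, 1500),
}


def tools_accepting(text):
    """Return the names of the tools whose listed input range covers len(text)."""
    n = len(text)
    return [name for name, (low, high) in INPUT_RANGES.items() if low < n <= high]


if __name__ == "__main__":
    sample = "x" * 2000  # a 2,000-character placeholder text
    for name in tools_accepting(sample):
        print(name)
```

The ranges may have changed since the table was compiled, so treat the result as a first filter rather than a guarantee that a tool will accept the text.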
