All files in this folder are test results for large language models (especially OpenAI's ChatGPT and GPT-4), showing how these models perform on the test tasks. The tests are designed mainly from a linguistic perspective, and the questions are mostly in Mandarin Chinese.
Since OpenAI released ChatGPT at the end of November 2022, I have designed a variety of questions grounded in Chinese linguistics to investigate ChatGPT's language performance. Some of the test results reflect the remarkable ability of large language models to understand Chinese, while others show that their understanding of natural language remains insufficient. Testing the language capabilities of large language models may help us gain a deeper understanding of the nature of natural language.
Weidong Zhan
Department of Chinese Language and Literature,
Peking University
http://ccl.pku.edu.cn/doubtfire
Various evaluations of large language models
ChatGPT/LLM Errors Tracker
https://researchrabbit.typeform.com/llmerrors?typeform-source=garymarcus.substack.com
https://docs.google.com/spreadsheets/d/1kDSERnROv5FgHbVN8z_bXH9gak2IXRtoqz0nwhrviCw/edit?usp=sharing
Sparks of Artificial General Intelligence: Early experiments with GPT-4
https://arxiv.org/abs/2303.12712
https://www.youtube.com/watch?v=qbIk7-JPB2c
Theory of Mind May Have Spontaneously Emerged in Large Language Models
https://arxiv.org/abs/2302.02083
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
https://arxiv.org/abs/2304.06364