
LLM_Evaulation

The files in this repository are test results for large language models (especially ChatGPT and GPT-4, developed by OpenAI), showing how these models perform on the test tasks. The tests are designed mainly from a linguistic perspective, and the questions are mostly in Mandarin Chinese.

Since OpenAI released ChatGPT at the end of November 2022, I have designed a variety of questions based on Chinese linguistics to investigate the language performance of ChatGPT. Some of the test results reflect the remarkable ability of large language models to understand Chinese, while others show that their understanding of natural language is still insufficient. Testing the language capabilities of large language models may help us gain a deeper understanding of the nature of natural language.

Weidong Zhan
Dept. of Chinese Language & Literature,
Peking University
http://ccl.pku.edu.cn/doubtfire

Various evaluations of large language models

ChatGPT/LLM Errors Tracker
https://researchrabbit.typeform.com/llmerrors?typeform-source=garymarcus.substack.com
https://docs.google.com/spreadsheets/d/1kDSERnROv5FgHbVN8z_bXH9gak2IXRtoqz0nwhrviCw/edit?usp=sharing

Sparks of Artificial General Intelligence: Early experiments with GPT-4
https://arxiv.org/abs/2303.12712
https://www.youtube.com/watch?v=qbIk7-JPB2c

Theory of Mind May Have Spontaneously Emerged in Large Language Models
https://arxiv.org/abs/2302.02083

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
https://arxiv.org/abs/2304.06364
