autogenai/easy-problems-that-llms-get-wrong

Code for the "Easy Problems That LLMs Get Wrong" Paper

ArXiv Paper: https://arxiv.org/abs/2405.19616

Benchmark Results

2024-07-20-Multi-Benchmark

[Benchmark results chart from the 2024-07-20 multi-benchmark run]

Hotz-Reflection

Description - LinkedIn Post

Basic prompt template

f"""
{question["multi_choice_question"]}

INITIAL ANSWER
{question["model_answer"]}

REFLECTION TASK
Review the question carfully and assess your initial answer. You can amend the answer if you wish too, otherwise return the original answer. Return in JSON format, for example:
{{"ANSWER": {random.choice(['A','B','C','D'])}}}
"""

Results

Full JSON results

[Hotz-Reflection benchmark results chart]

LLM Linguistic Benchmark Tool

This tool facilitates benchmarking and statistical analysis of various large language models (LLMs) against a set of linguistic benchmark questions. It provides functionality to asynchronously query different LLMs, evaluate their responses, and perform statistical analysis to gauge the performance of each model.

Features

  • LLM Query Interface: Send queries to different LLMs such as OpenAI's GPT models, Mistral, and others.
  • Asynchronous Processing: Batch processing of queries to LLMs for efficient data handling.
  • Benchmark and Evaluation: Load benchmark questions, obtain model responses, and evaluate them against a predefined rubric.
  • Statistical Analysis: Calculate mean scores, standard deviations, and confidence intervals of model performance (see the sketch below).
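As an illustration of the statistics involved, a minimal sketch of a mean, standard deviation, and 95% confidence interval calculation over one model's per-question scores (assuming NumPy and SciPy; not necessarily the tool's exact implementation):

import numpy as np
from scipy import stats

def summarise_scores(scores: list[float]) -> dict:
    """Mean, sample standard deviation and 95% confidence interval for one model's scores."""
    arr = np.asarray(scores, dtype=float)
    mean = arr.mean()
    std = arr.std(ddof=1)                    # sample standard deviation
    sem = std / np.sqrt(len(arr))            # standard error of the mean
    t = stats.t.ppf(0.975, df=len(arr) - 1)  # two-sided 95% t critical value
    return {"mean": mean, "std": std, "ci95": (mean - t * sem, mean + t * sem)}

# Example: summarise_scores([100, 80, 0, 60, 100])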

Installation

First, clone this repository to your local machine:

git clone https://github.com/autogenai/easy-problems-that-llms-get-wrong.git
cd easy-problems-that-llms-get-wrong

Then, install the required Python packages:

pip install -r requirements.txt

LLM API Access

To access the various LLM services, you will need valid API keys and credentials.

Place them in a .env file in the project root (use the ".env copy" file as a template):

OPENAI_API_KEY=your_openai_api_key_here
COHERE_API_KEY=your_cohere_api_key_here
MISTRAL_API_KEY=your_mistral_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here
....

See LiteLLM for more details on how to set up for various LLM providers.
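As a quick check that the keys are picked up before a full run, a minimal sketch assuming the python-dotenv package (the notebook may already handle this for you):

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into environment variables

# Print which provider keys are available before running the full benchmark.
for key in ("OPENAI_API_KEY", "COHERE_API_KEY", "MISTRAL_API_KEY", "ANTHROPIC_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")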

Usage

To run the benchmark tool, jump into the main.ipynb notebook and run all of the cells.

Make changes to the #Variables notebook cell, which includes:

  • LLM models to test
  • Model hyperparameters
  • Method of answer evaluation
  • Whether to include reflection
  • The various save paths
  • The execution steps to conduct (perhaps you only want to get answers, for example)

Ultimately, this will process the benchmark questions, query the LLMs, analyse the responses, and output the statistical summary and graph.
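For illustration only, the kinds of settings that cell exposes might look like the following; the variable names here are hypothetical and the actual names in main.ipynb may differ:

# Hypothetical configuration sketch -- check the #Variables cell in main.ipynb for the real names.
MODELS = ["gpt-4o", "claude-3-opus-20240229", "mistral-large-latest"]  # LLMs to test
HYPERPARAMS = {"temperature": 0.0, "max_tokens": 1024}                 # model hyperparameters
EVAL_METHOD = "multi_choice"      # or LLM-marked evaluation for open-ended answers
USE_REFLECTION = True             # whether to include the Hotz-Reflection step
SAVE_DIR = "./benchmark_results"  # base path for saved outputs
STEPS = ["get_answers", "evaluate", "generate_statistics"]  # execution steps to conduct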

Most Accurate Results

The multiple-choice questions are the most deterministic and the most reliable to evaluate, as there is a clear set answer to measure against. Open-ended questions often expose illogical and inconsistent behavior more readily, but they are harder to evaluate.

For open-ended questions (non-multiple-choice), it is best for a person to mark the LLM responses rather than relying on the auto-generated scores in the auto_eval_outputs folder (marked by GPT-4o by default). You can edit the scores in the auto_eval_outputs JSON files directly and then re-run the "generate_statistics" execution step in the main.ipynb notebook to get the final results. This is how the authors did it for the paper, which resulted in much lower scores than the less reliable LLM-based auto-evaluation.
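A minimal sketch of overriding an auto-generated score before re-running generate_statistics; the file name and field names below are hypothetical, so inspect an actual file in auto_eval_outputs for the real schema:

import json
from pathlib import Path

# Hypothetical path and keys -- mirror the structure of a real auto_eval_outputs file.
path = Path("auto_eval_outputs/gpt-4o.json")
results = json.loads(path.read_text())

for entry in results:
    if entry["question_id"] == 7:  # a response you have re-marked by hand
        entry["score"] = 0         # overwrite the GPT-4o-assigned score

path.write_text(json.dumps(results, indent=2))
# Then re-run the "generate_statistics" execution step in main.ipynb.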

Modifying the Benchmark Questions

The benchmark can be modified or extended by editing the linguistic_benchmark.json and linguistic_benchmark_multi_choice.json files in the root directory. Ensure the format remains consistent with existing entries.
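For example, a new open-ended question could be appended programmatically; the field names below are hypothetical, so mirror the keys used by an existing entry in linguistic_benchmark.json:

import json

with open("linguistic_benchmark.json") as f:
    questions = json.load(f)

# Hypothetical keys -- copy the exact field names from an existing entry.
questions.append({
    "index": len(questions) + 1,
    "category": "Spatial reasoning",
    "question": "A new puzzle that slightly perturbs a well-known phrasing...",
    "human_answer": "The expected correct answer, used as the marking rubric.",
})

with open("linguistic_benchmark.json", "w") as f:
    json.dump(questions, f, indent=2, ensure_ascii=False)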

Future Work and Limitations

This approach has substantial limitations, but further improvements might include:

  • Using multiple-choice questions to make evaluation more reliable.
  • Running inference multiple times with the temperature for each model set above zero (standardised and equivalent across all architectures) and generating aggregate statistics.
  • Building in "Hotz Reflection" to allow the model to reflect and potentially change its answer.
  • Expanding the Linguistic Benchmark beyond thirty questions to increase statistical significance and test a more diverse range of inputs.
  • Testing on a sample of smaller LLMs to see if performance is correlated to model size.
  • Fine-tuning models with a training dataset of perturbed variations of well-known logic-type problems found in the training corpora (on the internet) to see if this decreases overfitting variance.
  • Testing advanced regularisation techniques for LLMs during the pre-training process.
  • Finding better methodologies to keep LLM outputs deterministic.

Contributing

Contributions to enhance or extend the functionality of this tool are welcomed with open arms. Please adhere to conventional coding standards and include unit tests with your pull requests.
