This is the official implementation of the paper "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners" in PyTorch.

💜 Do LLMs have genuine reasoning capabilities? How can we evaluate them?

Large language models (LLMs) have achieved remarkable progress in understanding and generating human-like text, but there is ongoing debate about whether LLMs possess genuine reasoning capabilities. This work reconceptualizes the evaluation of LLMs' reasoning capabilities as a general and rigorous testing framework with statistical guarantees.

We say that an LLM is subject to token bias in a reasoning task if systematic changes to some or all tokens in the task description, while keeping the underlying logic intact, allow us to predict the direction of the shift in the model's output. A strong token bias suggests that the LLM relies on superficial patterns in the input rather than truly understanding the underlying reasoning task, leading to brittle performance that fails to generalize.

Comprehensive experiments on both commercial and open-source LLMs over large-scale synthetic datasets uncover a critical insight: performance improvements on reasoning tasks, where they appear at all, are driven mostly by token bias rather than by genuine advances in reasoning capability.

We use several well-known logical fallacy problems from the cognitive science literature as a clean experimental playground. The following is an example of the classic Linda Problem.

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?

(a) Linda is a bank teller. 🙆‍♀️

(b) Linda is a bank teller and is active in the feminist movement. 💁‍♀️

Experiments in behavioral psychology show that people typically judge the second option as more likely than the first, even though this violates the conjunction rule of probability: P(A and B) can never exceed P(A). Advanced LLMs such as GPT-4 usually recognize this fallacy, since it is a classic problem that appears frequently in the cognitive science literature. However, altering seemingly irrelevant tokens, such as changing the name 🙆‍♀️ "Linda" to 🙆 "Bob" in the problem statement while keeping the logical structure the same, surprisingly confuses most LLMs, raising the concern that LLMs are not yet genuine reasoners. Please see our paper for the detailed token perturbations.

Dependencies

Please check requirements.txt. You can run the following commands to create a virtual environment and install all the requirements:

python -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt

Citation

This bunny 🐰 will be happy if you cite our work. Thank you!

@article{jiang2024peek,
  title={A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners},
  author={Jiang, Bowen and Xie, Yangxinyu and Hao, Zhuoqun and Wang, Xiaomeng and Mallick, Tanwi and Su, Weijie J and Taylor, Camillo J and Roth, Dan},
  journal={arXiv preprint arXiv:2406.11050},
  year={2024}
}

Dataset

We provide our synthetic dataset under data/, which contains a comprehensive set of logical-fallacy problems. The dataset file is in JSON format, and each item is a dictionary containing question_id, question, target_answer, and incorrect_answer. You can also follow the instructions below to generate more synthetic data on the fly.
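As a quick illustration, the items can be inspected with standard JSON tooling. The sketch below assumes the top-level JSON object is a list of such dictionaries and uses the file name from the inference example later in this README; adjust the path to whichever file under data/ you want to inspect.

import json

# Path assumed for illustration; point this at any dataset file under data/.
with open("data/synthetic_dataset_linda_original_gold.json", "r") as f:
    dataset = json.load(f)

# Each item carries question_id, question, target_answer, and incorrect_answer.
for item in dataset[:3]:
    print(item["question_id"])
    print(item["question"])
    print("target:", item["target_answer"], "| incorrect:", item["incorrect_answer"])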

LLM Setups

❤️ Always set up the OpenAI ChatGPT models. Follow OpenAI's Developer quickstart to set up your OpenAI API access, create a new api_tokens/openai_key.txt file, and copy and paste your API key into it.
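As a minimal sanity check that the key file is picked up (a sketch only; the repository's own client code in query_llm.py may differ):

from openai import OpenAI

# Read the API key from the file created above instead of an environment variable.
with open("api_tokens/openai_key.txt", "r") as f:
    api_key = f.read().strip()

client = OpenAI(api_key=api_key)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Which is more probable? ..."}],
)
print(response.choices[0].message.content)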

🧡 To use Google Gemini models with an API for inference, follow the instructions in the Try Gemini 1.0 Pro (Python) section of the Google Vertex AI documentation. Note that your school's Gmail account may not allow you to make payments. A minimal Python sketch tying these steps together is shown after the list below.

  • Step 1: Following their instructions, first install the Vertex AI client libraries, create a project with a project ID, enable the Vertex AI API, create a service account, and generate your account key. You don't need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS yourself, since our code in query_llm.py already does that for you.
  • Step 2: Install or update the Vertex AI SDK for Python.
  • Step 3: Authenticate to Vertex AI and set up Application Default Credentials.
    • Follow the Local development environment - Provide user credentials for your Google Account section to install and initialize the gcloud CLI. This step downloads a google-cloud-sdk folder into your project's top-level directory.

    • After installation, run

      gcloud init
      

      to initialize the gcloud CLI. You will be able to choose your account and project ID. Create a new api_tokens/gemini_project_id.txt file, and copy and paste your project ID into it.

    • To create your credential file, run

      gcloud auth application-default login
      

      You will see a prompt like Credentials saved to file: [/path/to/your/home/.config/gcloud/application_default_credentials.json].

    • To match the credential file path set in our config.yaml, run

      mv /path/to/your/home/.config/gcloud/application_default_credentials.json google-cloud-sdk/google_gemini_credential.json
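Putting these pieces together, a minimal Python sketch of the resulting setup looks roughly like the following (for illustration only; query_llm.py handles this for you, and the location string is an assumption):

import os
import vertexai
from vertexai.generative_models import GenerativeModel

# Point Google's client at the credential file moved above; our code already
# sets this variable for you, so this line is shown only for illustration.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "google-cloud-sdk/google_gemini_credential.json"

# Read the project ID saved in the api_tokens/ folder.
with open("api_tokens/gemini_project_id.txt", "r") as f:
    project_id = f.read().strip()

vertexai.init(project=project_id, location="us-central1")  # location assumed
model = GenerativeModel("gemini-1.0-pro")
print(model.generate_content("Which is more probable? ...").text)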
      

💛 To use Meta Llama models with an API for inference, follow the instructions in the Running Llama 3 with Python section of Replicate's Run Llama 3 with an API guide to set up your API token, create a new api_tokens/llama_key.txt file, and copy and paste your token into it.
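A minimal sketch of calling Llama 3 through Replicate with the token file above (the model identifier is Replicate's public slug for meta-llama-3-70b-instruct; the repository's own client code may differ):

import replicate

# Read the Replicate API token from the file created above.
with open("api_tokens/llama_key.txt", "r") as f:
    api_token = f.read().strip()

client = replicate.Client(api_token=api_token)
output = client.run(
    "meta/meta-llama-3-70b-instruct",
    input={"prompt": "Which is more probable? ..."},
)
print("".join(output))  # Replicate returns the completion in chunks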

💚 To use Anthropic Claude models with an API for inference, follow its Quickstart Guide to install the Anthropic Python SDK, set up an account with API access, get your API key, create a new api_tokens/claude_key.txt file, and copy and paste your key into it. You don't need to set the environment variable ANTHROPIC_API_KEY.
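A minimal sketch with the Anthropic SDK and the key file above (for illustration; the repository's code may differ):

import anthropic

# Read the Anthropic API key from the file created above.
with open("api_tokens/claude_key.txt", "r") as f:
    api_key = f.read().strip()

client = anthropic.Anthropic(api_key=api_key)
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=256,
    messages=[{"role": "user", "content": "Which is more probable? ..."}],
)
print(message.content[0].text)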

💙 To use Mistral models with an API for inference, follow its Quickstart to install the mistralai library, set up an account with API access, get your API key from the Mistral console, create a new api_tokens/mistral_key.txt file, and copy and paste your key into it. You don't need to set the environment variable MISTRAL_API_KEY.
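A minimal sketch with the mistralai client and the key file above (the mistralai interface has changed across library versions; this follows the 0.x client and is for illustration only):

from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

# Read the Mistral API key from the file created above.
with open("api_tokens/mistral_key.txt", "r") as f:
    api_key = f.read().strip()

client = MistralClient(api_key=api_key)
response = client.chat(
    model="mistral-large-latest",
    messages=[ChatMessage(role="user", content="Which is more probable? ...")],
)
print(response.choices[0].message.content)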

Quick Start

The command-line argument parser supports the following arguments:

  • --model to select the LLM for inference. The list below was last updated on 06-29-2024, but our code should be compatible with more recent model names as well.

    • OpenAI ChatGPT family. Check OpenAI's continuous model upgrades.
      • gpt3.5 or equivalently gpt-3.5-turbo, gpt-3.5-turbo-0125
      • gpt-3.5-turbo-1106
      • gpt-3.5-turbo-0613
      • gpt-4o
      • gpt4 or equivalently gpt-4-turbo, gpt-4-turbo-2024-04-09
      • gpt-4-0125-preview
      • gpt-4-1106-preview
      • gpt-4-0613
    • Google Gemini family. Check Gemini model versions and lifecycle. Note that Google currently imposes a relatively low requests-per-minute limit on API usage, so you may encounter rate-limit errors when running the inference code.
      • gemini or equivalently gemini-1.0-pro, gemini-1.0-pro-002
      • gemini-1.0-pro-001
      • gemini-1.5-pro-preview-0409
    • Meta Llama family. Check Choosing which model to use for Llama-3 and Llama-2.
      • llama or equivalently llama3-70b, meta-llama-3-70b-instruct
      • llama3-8b or equivalently meta-llama-3-8b-instruct
      • llama-2-70b-chat
      • llama-2-13b-chat
      • llama-2-7b-chat
    • Anthropic Claude family. Check Models overview.
      • claude or equivalently claude-3-opus-20240229
      • claude-3-sonnet-20240229
      • claude-3-haiku-20240307
    • Mistral family. Check API versioning.
      • mistral or equivalently mistral-large-latest, mistral-large-2402
      • mistral-medium-latest or equivalently mistral-medium-2312
      • mistral-small-latest or equivalently mistral-small-2402
      • open-mixtral-8x22b or equivalently open-mixtral-8x22b-2404
      • open-mixtral-8x7b or equivalently mistral-small-2312
      • open-mistral-7b or equivalently mistral-tiny-2312
  • --task to choose between data (generate synthetic datasets) and inference (evaluate the LLM's ability to answer the questions).

  • --verbose to print detailed data information and model responses during inference.

  • [For Data Generation Only] --fallacy to select the type of logical fallacy. We currently support linda for the Linda Problem and its variants, and sets for syllogistic problems.

  • [For Data Generation Only] --gen_mode to select the mode for generating the synthetic dataset when --task is data. Options are baseline (simple in-context learning with limited instructions) and control (step-by-step guidance that generates both gold samples and random samples with irrelevant information).

  • [For Data Generation Only] --variant to select the variant of the Linda problems, such as the default original, variant_one, variant_two, ..., variant_six. Detailed information about each variant can be found in the def linda_problem() function in prompts.py. Include this argument only when --fallacy is linda.

  • [For Data Generation Only] --conn to select the logical connective used to generate new data: because, sothat, or to. Add this argument only when --fallacy is linda and --variant is variant_one or variant_two.

  • [For Data Generation Only] --n to set the number of synthetic data problems to generate.

  • [For Inference Only] --data_file to set the data file path for inference.

  • [For Inference Only] --eval_mode to set the evaluation mode for the model to answer questions. Options are

    • baseline for direct prompting
    • zs_cot for zero-shot chain-of-thought (CoT) prompting
    • os for one-shot in-context learning (ICL) prompting with the original Linda Problem (default)
    • os_cot for one-shot ICL plus CoT prompting
    • os_bob for one-shot ICL prompting but with a rephrased Bob Problem
    • os_bob_cot for one-shot ICL prompting plus CoT but with a rephrased Bob Problem
    • os_incorrect for one-shot ICL but with an incorrect answer and a rephrased Bob Problem
    • os_incorrect_cot for one-shot ICL plus CoT but with an incorrect answer and a rephrased Bob Problem
    • fs for few-shot ICL prompting
    • fs_cot for few-shot ICL plus CoT prompting
    • weak_control_zs_cot for weakly controlled zero-shot CoT prompting, leaking the hint that it is a Linda Problem but without detailed instructions
    • weak_control_os_cot for weakly controlled one-shot CoT prompting, leaking the hint that it is a Linda Problem but without detailed instructions
    • control_zs_cot for controlled zero-shot CoT prompting, leaking the hint that it is a Linda Problem with detailed and carefully curated instructions
    • control_os_cot for controlled one-shot CoT prompting, leaking the hint that it is a Linda Problem with detailed and carefully curated instructions

To generate synthetic data, run

python main.py --model gpt3.5 --task data --fallacy linda --gen_mode control --variant original --n 100 --verbose

and adjust model, fallacy, gen_mode, variant, and n accordingly. All other hyper-parameters can be set in config.yaml. Generated files will be saved to the data/ directory.

To start the inference, run

python main.py --model gpt3.5 --task inference --eval_mode os_cot --data_file synthetic_dataset_linda_original_gold.json --verbose

and adjust model, eval_mode, and data_file accordingly.

To efficiently run the evaluation with multiple prompting methods, models, and/or data files in parallel, modify the number of available GPU devices and adjust the code in run.sh accordingly. Then run

bash run.sh

All results and final accuracies will be automatically saved to the outputs/ directory.
