We recommend Python 3.13, but any version >= 3.10 should work.
Installing our tool: Open VS Code (or a VS Code-based editor like Cursor). In the VS Code marketplace, search for "agops agent copilot" and install the first extension:

Then run `pip install agops-bird` and `pip install openai`.
Run `aco-config` and configure as follows:
- Check that the default path displayed is the `bird-bench` directory.
- Enable telemetry: yes. If you don't have the URL and key below, set telemetry to no.
- Telemetry URL: XXX
- Telemetry key: XXX
- Telemetry username: enter a name or accept the default.
Setting up API keys: Run `export OPENAI_API_KEY=XXX`. If you need a key, email ferdi.kossmann@gmail.com.
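To make sure the key is visible to Python processes started from the same shell, a quick sanity check like the following can help (this snippet is not part of the tool, just a check):

```python
import os

# Prints True if OPENAI_API_KEY was exported in the current shell.
print("OPENAI_API_KEY" in os.environ)
```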
Our tool will show you runs of the evaluation set (Sample 0, Sample 1, ...). Correct samples are shown with a green bar, incorrect ones with a red bar. You can click on a sample to inspect the inputs and outputs of LLM calls and database calls, and even modify them to see what would have happened if the input to or output from an LLM had been different.
If you have any questions, please reach out to me at ferdi [dot] kossmann [at] gmail [dot] com.
Caution
For this user study, you may only invoke LLMs through the OpenAI `chat.completions.create` API call (also see `workflow/example.py`):
```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(...)
```
If you want to make a call to the database, you have to use the `utils.call_db(sql_str)` function (no need to create a connection or cursor). The function returns the same object as an actual call to the SQL DB would.
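As a rough sketch of how this is used (the import below assumes `utils` is importable from your workflow code, and the query is only an example):

```python
import utils  # assumption: the utils module is on the import path of your workflow

# call_db handles the connection and cursor; just pass the SQL string.
rows = utils.call_db("SELECT COUNT(*) FROM singer")  # example query
print(rows)
```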
When running our tool, the only difference is that you type `develop script.py` instead of `python script.py`. Practically, this means the following:
- If you want to run and evaluate an individual sample `X`, do `develop workflow/main.py --sample_id X`.
- When you want to evaluate several samples, run `python run_and_evaluate.py --num_samples X`, which will run the first `X` samples of the benchmark. `run_and_evaluate.py` spawns the workflow runs using the `develop` command, so the workflows are run with our tool (see the sketch after this list).
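Conceptually, `run_and_evaluate.py` does something along the lines of the sketch below. This is only an illustration of how the `develop` command gets invoked per sample, not the actual implementation:

```python
import subprocess

num_samples = 5  # what you would pass via --num_samples

# Illustration only: run the first num_samples samples through the tool,
# one `develop` invocation per sample.
for sample_id in range(num_samples):
    subprocess.run(
        ["develop", "workflow/main.py", "--sample_id", str(sample_id)],
        check=True,
    )
```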
Getting started:
- Run `python run_and_evaluate.py` to process all 40 samples in the benchmark. You should see the following output:
- You should be able to hover over nodes in the graph (which consists of a single node) and edit the input or output. If you click on the green rerun button, the workflow will rerun with the changes you made. This allows you to ask "what-if" questions (what if the input/output had been different?).
- If you click on the eraser symbol, all your input/output edits will be removed and the workflow is run as defined in the code.
- The code of the workflow is in `workflow/example.py`. It simply calls gpt-3.5 with the input.
- The workflow is called from `workflow/main.py`. We recommend leaving most of the logic in `workflow/main.py` the same and using it to call your workflow (see the sketch after this list).
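For orientation, a workflow you call from `workflow/main.py` could look roughly like the sketch below. The function name, model string, and prompt handling are illustrative assumptions, not the contents of `workflow/example.py`; the sketch only shows how the two allowed primitives (`chat.completions.create` and `utils.call_db`) fit together:

```python
from openai import OpenAI

import utils  # assumption: provides call_db as described above

client = OpenAI()

def run_workflow(question: str):
    """Illustrative sketch: ask the LLM for a SQL query, then execute it via call_db."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # "gpt-3.5" as referenced above; the exact model string is an assumption
        messages=[{"role": "user", "content": question}],
    )
    sql_str = response.choices[0].message.content
    return utils.call_db(sql_str)
```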
Understanding correctness: In the list view on the left of the UI, samples that failed the benchmark test are shown with a red bar; the ones that passed, with a green bar.