<a href="https://colab.research.google.com/github/ethanachi/api_use/blob/main/demo_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The **Synthetic API Use** benchmark aims to evaluate language models' ability to
perform tool use during code generation.
The benchmark consists of synthetic APIs: synthetic Python libraries,
each of which tests various facets of reasoning in code synthesis.
An accompanying Python testbed generates **test cases** which must be
solved using functions from these synthetic APIs.
At evaluation time, a large language model is first presented with the definition of the API(s),
then prompted to complete the test case using the functions it has seen.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import json

In [3]:
import ipywidgets
from ipywidgets import interact

In [4]:
!git clone https://github.com/ethanachi/api_use/

Cloning into 'api-use-benchmark'...
remote: Enumerating objects: 116, done.
remote: Counting objects: 100% (116/116), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 116 (delta 60), reused 75 (delta 30), pack-reused 0
Receiving objects: 100% (116/116), 124.97 KiB | 5.21 MiB/s, done.
Resolving deltas: 100% (60/60), done.


In [5]:
%cd api_use

/content/api-use-benchmark


In [6]:
import api_use

In [7]:
import api_use.api_use_tasks

In [8]:
from IPython.display import display, HTML
codestyle = "font-size: 0.85em; line-height: 1.2em; display: block; border: 1px solid #ddd; padding: 1em; margin-left: 0.5em;"
def view(result, show_test=False):
  display(HTML('<b style="color:blue; font-size: 1.25em; margin-top: 1em;">Prompt: </b>'))
  display(HTML(f'<pre><code style="{codestyle}">{result.prompt}</code></pre>'))
  display(HTML('<b style="color:red; font-size: 1.25em;">Target: </b>'))
  display(HTML(f'<pre><code style="{codestyle}">{result.target}</code></pre>'))
  if show_test:
    display(HTML('<b style="color:orange">Test: </b>'))
    display(HTML(f'<pre><code style="{codestyle}">{result.test}</code></pre>'))

## Synthetic APIs

Each API is defined as a mapping between Python functions (`rotate()`) and English descriptions (*"Rotates an image by `n` degrees"*).
No code is required.
To create a new synthetic API, one defines a new mapping, either programmatically or by creating a JSON file.
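
As a rough illustration of what such a mapping could look like, the sketch below builds one in Python and round-trips it through JSON. The schema (the `library`, `functions`, `name`, `args`, and `description` keys) is a hypothetical layout chosen for this example, not necessarily the format the benchmark expects.

```python
import json

# Hypothetical schema: these field names are illustrative assumptions,
# not the benchmark's actual JSON format.
image_api = {
    "library": "image",
    "functions": [
        {
            "name": "rotate",
            "args": ["degrees"],
            "description": "Rotates an image by the given number of degrees",
        },
        {
            "name": "blur",
            "args": ["pixels"],
            "description": "Blurs an image by the given number of pixels",
        },
    ],
}

# Serialize the mapping to JSON and read it back, as one would with a file.
serialized = json.dumps(image_api, indent=2)
restored = json.loads(serialized)

# Print each function signature next to its English description.
for func in restored["functions"]:
    arglist = ", ".join(func["args"])
    print(f'{restored["library"]}.{func["name"]}({arglist}): {func["description"]}')
```

Note that the mapping holds only names, arguments, and descriptions; no function bodies are involved.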

We provide several synthetic APIs, which aim to probe a variety of reasoning capabilities while simulating real-life coding tasks.

| Library | Task | Example call |
| --- | --- | --- |
| Image manipulation | Manipulating images, PIL-style | `image.rotate(45).blur()` |
| Chemistry | Manipulating and searching through molecules | `molecule.get_atom_with_atomic_weight(75).get_label()` |
| Music | Searching through songs and melodies for notes and converting them to other formats | `melody.get_note_with_pitch("A").to_midi()` |
| Solids | Computing physical attributes of *n*-dimensional solids | `solids.get_volume_of_cylinder(1, 3, 7)` |

See below for more details.

## Test cases

Creating a test case requires two arguments:

- the **function signature**, a call (or composition of calls) to function(s) from one or more synthetic APIs
- the **description**, a specification in plain English of what the function signature does

Solving the resulting test case requires a model to, given the description, generate code *equivalent* to the function signature.
Specifically, the generated code must follow the same control flow and arguments as the reference solution.
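
One way to picture this kind of equivalence check is to run both the reference and the candidate against dummy objects that record calls instead of computing values, then compare the recorded traces. The sketch below is a simplified illustration of that idea only; the `Recorder` class and `trace_of` helper are hypothetical and are not the testbed's actual implementation.

```python
class Recorder:
    """Dummy object that records calls made on it instead of computing anything."""

    def __init__(self, name, trace):
        self._name = name
        self._trace = trace

    def __getattr__(self, attr):
        # Any unknown attribute is treated as an API function.
        def call(*args, **kwargs):
            self._trace.append((self._name, attr, args, tuple(sorted(kwargs.items()))))
            return Recorder(f"{self._name}.{attr}()", self._trace)
        return call


def trace_of(func):
    """Runs `func` on dummy inputs and returns the list of recorded API calls."""
    trace = []
    solids = Recorder("solids", trace)
    func(solids, "dummy_radius", "dummy_height")
    return trace


def reference(solids, radius, height):
    return solids.volume_of_cone(radius, height)


def candidate_ok(solids, radius, height):
    # Different surface form, same call with the same arguments.
    val = solids.volume_of_cone(radius, height)
    return val


def candidate_bad(solids, radius, height):
    # Misspelled function name, so the recorded call differs.
    return solids.volume_of_conee(radius, height)


print(trace_of(candidate_ok) == trace_of(reference))   # True
print(trace_of(candidate_bad) == trace_of(reference))  # False
```

Because only the recorded calls are compared, superficial differences such as an intermediate variable do not affect the outcome.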

### Generation

To generate a test case, call **`api_use.get_example(signature, description)`**:


In [9]:
result = api_use.get_example(
  signature='solids.volume_of_cone()',
  description='gets the volume of a cone',
)
view(result, show_test=True)


This returns an `api_use.TestCase` object with three fields:

-   `.prompt`: English text which presents the API and gives the function to be written. You should pass this to the model being
    evaluated as the prefix for generation.

-   `.target`: the correct answer to the problem. You don't need this string to
    evaluate model performance, but it's helpful for examining the correct answer.

-   `.test`: a piece of code that can be used to verify the answer to this
    problem.

### Evaluation

`api_use.execute(samples, test)` takes a list of generated samples and the test code, and returns a tuple containing whether the test passed and the error message, if
any:

In [10]:
# the correct target
good_sample = "return solids.volume_of_cone(radius, height)"
is_correct, error_message = api_use.execute([good_sample], result.test)
print(is_correct, error_message)

True None


The testbed checks correctness of the control flow rather than exact match, so alternative solutions are possible:

In [11]:
# another correct target
alternative_sample = "val = solids.volume_of_cone(radius, height)\n    return val"
is_correct, error_message = api_use.execute([alternative_sample], result.test)
print(is_correct, error_message)

True None


In [12]:
# wrong target
bad_sample = "val = solids.volume_of_conee(radius, height)\n    return val"
is_correct, error_message = api_use.execute([bad_sample], result.test)
print(is_correct)
print(error_message)

False
An exception occurred while calling exec. Traceback:

Traceback (most recent call last):
  File "/content/api-use-benchmark/api_use/execution_utils.py", line 266, in exec_with_timeout
    exec(code, {}, var_dict)  # pylint: disable=exec-used
  File "<string>", line 31, in <module>
Test failure: D<main.volume_of_conee(dummy_radius_0,dummy_height_1)> != D<main.volume_of_cone(dummy_radius_0,dummy_height_1)>
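
Putting generation and evaluation together, a pass rate over several test cases could be computed roughly as follows. This is a minimal sketch: `pass_rate`, `generate`, and the stand-in objects are hypothetical names introduced here, and `execute` is assumed to have the `(samples, test) -> (is_correct, error_message)` shape used in the cells above.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, List, Optional, Tuple


def pass_rate(
    test_cases: Iterable,
    generate: Callable[[str], str],
    execute: Callable[[List[str], str], Tuple[bool, Optional[str]]],
) -> float:
    """Fraction of test cases whose generated completion passes its test."""
    results = []
    for case in test_cases:
        completion = generate(case.prompt)        # model under evaluation
        is_correct, _error = execute([completion], case.test)
        results.append(is_correct)
    return sum(results) / len(results) if results else 0.0


# Stand-ins so the sketch runs on its own; in the notebook one would pass
# real test cases, a real model call, and `api_use.execute` instead.
fake_cases = [
    SimpleNamespace(prompt="p1", test="t1"),
    SimpleNamespace(prompt="p2", test="t2"),
]
fake_generate = lambda prompt: "answer to " + prompt
fake_execute = lambda samples, test: (test == "t1", None)

print(pass_rate(fake_cases, fake_generate, fake_execute))  # 0.5
```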



### Customizing Test Cases

The difficulty of code synthesis can be affected by a variety of factors.
The dependence of model reasoning on these factors can be probed by dialing various
attributes of the test case.
Test cases can be customized by passing configuration
arguments to the `get_example` function, as described below.

#### Distractors
When presented with a large number of functions from the same API, can models still select the correct function to complete a task?

In [14]:
@interact(num_distractors=ipywidgets.IntSlider(min=0, max=8, step=1, value=5),
          target_func_location=ipywidgets.FloatSlider(min=0, max=1, step=0.01, value=0.5))
def vary_n_distractors(num_distractors, target_func_location):
    results = api_use.get_example(signature="solids.volume_of_cone()", 
                                  description="gets the volume of a cone",
                                  func_name="get_volume_of_cone",
                                  num_distractors=num_distractors,
                                  target_func_location=target_func_location)
    view(results)


#### Semantic invariance

When reading documentation, do models base their decisions on the function name, the function description, or both?

In [15]:
@interact(func_noise_type=['number', 'none', 'swap'], arg_noise_type=['none', 'number'], desc_noise_type=['none', 'swap'])
def vary_noise(func_noise_type, arg_noise_type, desc_noise_type):
    results = api_use.get_example(signature="solids.volume_of_cone()", 
                                  description="gets the volume of a cone",
                                  func_name="get_volume_of_cone",
                                  function_noise_type=func_noise_type,
                                  arg_noise_type=arg_noise_type,
                                  description_noise_type=desc_noise_type,
                                  num_distractors=8)
    view(results)


#### Argument Order

Can models flexibly work with argument order, passing arguments to an API in a different order or with a different degree of specification compared to the input?

In [16]:
@interact(arg_order=[[0, 1, 2], [2, 1, 0], [1, 0, 2]])
def vary_arg_order(arg_order):
    results = api_use.get_example(signature="solids3.volume_of_cone()", 
                                  description="gets the volume of a cone",
                                  func_name="get_volume_of_cone",
                                  arg_order=arg_order)
    view(results)


#### Composition

Can models effectively compose functions, using the output of one function call as the input to the next? To what extent do the *length* and the *width* of this chain affect generation correctness? 

In [17]:
import random
func_pool = [
  ("rotate()", "rotate", "rotates $ by the given number of degrees"),
  ("flip_horizontal()", "flip_horizontal", "flips $ horizontally"),
  ("blur()", "blur", "blurs $ by the given number of pixels"),
  ("distort(pixels=5)", "distort_5_px", "distorts $ by 5 pixels"),
]

functions_used = random.sample(func_pool, k=4)

@interact(depth=[1, 2, 3, 4])
def vary_composition(depth):
    function_call_list, func_name_list, desc_list = zip(*functions_used[:depth])
    function_call = "image." + '.'.join(function_call_list)
    func_name = "_then_".join(func_name_list)
    desc = ", then ".join([desc_list[0].replace('$', 'an image')] + [x.replace('$', 'it') for x in desc_list[1:]])
    
    results = api_use.get_example(signature=function_call, 
                                  description=desc,
                                  func_name=func_name)
    view(results)


## Synthetic APIs

The provided synthetic APIs are described in more detail below.

- `solids`: This library tests **models' abilities to work with arguments.**
It provides functions which compute the volume and surface area of various physical objects given their dimensions.
By default, these functions take two arguments (radius and height);
we also provide variants which take between 1 and 4 arguments
(`solids[1-4]`, e.g. `solids3`).

In [18]:
result = api_use.get_example(
  signature='solids3.volume_of_cone(height=2)',
  description='gets the volume of a cone with height 2',
)
view(result)

- `chem` and `music`: These libraries test **models' abilities to perform search-like operations.**

In [None]:
result = api_use.get_example(
  signature='molecule.get_atom_with_atomic_num(atomic_num=2)',
  description='gets the atom with atomic number 2',
)
view(result)

- `image`: This library tests **models' abilities to chain functions.**

In [None]:
view(api_use.get_example(signature="image.rotate(degrees=image.get_width())", 
                         func_name="rotate_by_width", 
                         description="rotates the given image by a number of degrees equal to its width"))

# API Reference

The core of the API Use benchmark is a **test case**: a definition of
one or more APIs, plus a prompt that describes the function to be completed
by the model using the previously defined APIs.
Test cases can be generated using the `api_use.get_example()` function.

In [None]:
results = api_use.get_example(signature="solids.volume_of_cone()", 
                              description="gets the volume of a cone given its height and radius")
view(results)

## Dialing attributes

Test cases are **dialable**: we aim to examine the dependence of model
reasoning abilities on a variety of confounding factors.
Nearly every attribute of the default test case output can be dialed;
the available options are briefly described below.

### Function prompt

#### Composition

Requiring models to use multiple functions and connect
their results to each other tests their ability to perform **composition**,
a core part of reasoning.
The benchmark supports generating test cases with arbitrary composition;
simply pass a function composition as the `signature`.
The results of one function can be passed as an argument to another function:

In [None]:
view(api_use.get_example(signature="image.rotate(degrees=image.get_width())", 
                         func_name="rotate_by_width", 
                         description="rotates the given image by a number of degrees equal to its width"))

or as the subject of another function call.


In [None]:
view(api_use.get_example(signature="molecule.get_atom_with_atomic_num(55).get_atomic_weight()", 
                         func_name="get_atomic_weight_of_atom_with_atomic_num_55", 
                         description="given a molecule, gets the atomic weight of its atom with atomic number 55"))

In this second case, the return type of the first function call is used to
disambiguate the library for the next function call.
In the example above, the return type of
`get_atom_with_atomic_num` in library `molecule` is defined to be `atom`,
so the `atom.get_atomic_weight()` function is used.
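
The lookup described above can be pictured as walking the call chain while tracking the library of each intermediate result. The following is a minimal sketch under assumed names: the `RETURN_TYPES` table and `resolve_chain` helper are hypothetical and only illustrate the idea, not the benchmark's internals.

```python
# Hypothetical return-type table: (library, function) -> library of the result.
RETURN_TYPES = {
    ("molecule", "get_atom_with_atomic_num"): "atom",
    ("atom", "get_atomic_weight"): None,  # returns a plain value, not an API object
}


def resolve_chain(start_library, function_names):
    """Returns the library-qualified name of each call in a method chain."""
    library = start_library
    resolved = []
    for name in function_names:
        if library is None:
            raise ValueError(f"cannot call {name}() on a plain value")
        resolved.append(f"{library}.{name}")
        # The return type of this call selects the library for the next call.
        library = RETURN_TYPES[(library, name)]
    return resolved


print(resolve_chain("molecule", ["get_atom_with_atomic_num", "get_atomic_weight"]))
# ['molecule.get_atom_with_atomic_num', 'atom.get_atomic_weight']
```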

Additionally, multiple levels of composition are possible:

In [None]:
view(api_use.get_example(signature="image.blur(pixels=5).flip_horizontal().rotate(image.get_width())", 
                         func_name="func", 
                         description="given an image, blurs it 5 pixels, flips it horizontally, then rotates it by the width of the image"))

#### Argument order
A core part of working with APIs is the partial
application of function arguments. Using the
`solids.volume_of_cone(radius, height)` API as an example,
suppose one wants to compute the volume of a cone with a known height
(but unknown radius).
Successfully solving this task requires passing the `radius` argument
from the outer function definition into the inner API call,
while supplying a fixed height inside the function body.
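
Concretely, a solution to such a task has the shape sketched below. The `volume_of_cone` body is a stand-in written for this example so that the code runs (the synthetic APIs themselves are defined by descriptions only), and the outer function name is illustrative.

```python
import math


# Stand-in implementation so the example is runnable; it is an assumption
# of this sketch, not part of the benchmark.
def volume_of_cone(radius, height):
    return math.pi * radius ** 2 * height / 3


# Outer function for the task "gets the volume of a cone with height 2":
# `height` is fixed inside the body, while `radius` is passed through
# from the outer signature to the inner API call.
def get_volume_of_cone_with_height_2(radius):
    return volume_of_cone(radius=radius, height=2)


print(get_volume_of_cone_with_height_2(3))
```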

More generally, working flexibly with APIs requires
setting any number of inner API call arguments to fixed, pre-known values,
while matching the remaining inner arguments with their corresponding
outer arguments in arbitrary order.
Learning this order-invariant matching between arguments is an important
prior for successfully generating code.
The API Use benchmark provides multiple ways to probe this capability,
briefly described below:

- **Fixing arguments**:
Any inner API call arguments can be set to fixed values by specifying
the key and value appropriately in the function signature.

In [None]:
# Fixing 0 arguments
view(api_use.get_example(signature="solids.volume_of_cylinder()", 
                         description="gets the volume of a cylinder",
                         func_name="get_volume_of_cylinder"))

In [None]:
# Fixing first argument
view(api_use.get_example(signature="solids.volume_of_cylinder(radius=5)", 
                         description="gets the volume of a cylinder with radius 5",
                         func_name="get_volume_of_cylinder_radius_5"))

Additionally, the arguments can be fixed in arbitrary order:

In [None]:
view(api_use.get_example(signature="solids.volume_of_cylinder(height=4, radius=5)", 
                         description="gets the volume of a cylinder with height 4 and radius 5",
                         func_name="get_volume_of_cylinder_with_height_4_radius_5"))

- **Outer argument order**: Any arguments not fixed are propagated
to the outer function definition. By default, the arguments are listed in
the order they appear in the signature:


In [None]:
view(api_use.get_example(signature="image.rotate().blur()", 
                         func_name="rotate_then_blur", 
                         description="rotates the given image by the given number of degrees, then blurs it"))

This can be modified by the `arg_order` argument:

In [None]:
view(api_use.get_example(signature="image.rotate().blur()", 
                         func_name="rotate_then_blur", 
                         description="rotates the given image by the given number of degrees, then blurs it",
                         arg_order=[0, 2, 1]))
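
To make the effect of `arg_order` concrete, the sketch below applies a permutation to a list of outer argument names. Both the default argument list (`image`, `degrees`, `pixels`) and the interpretation that position `i` of the result holds the argument at index `arg_order[i]` are assumptions made for illustration; `reorder` is a hypothetical helper.

```python
def reorder(args, arg_order):
    """Returns `args` rearranged so that position i holds args[arg_order[i]]."""
    if sorted(arg_order) != list(range(len(args))):
        raise ValueError("arg_order must be a permutation of the argument indices")
    return [args[i] for i in arg_order]


default_args = ["image", "degrees", "pixels"]  # assumed default outer arguments
print(reorder(default_args, [0, 2, 1]))  # ['image', 'pixels', 'degrees']
```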

### Distractors

By default, there are no distractors,
i.e. only the functions needed to complete the task are provided in the
API definition.
However, to force the model to disambiguate between multiple candidate functions,
distractors can be provided as part of the prompt.

For each library in the test case, distractors are sampled from the set of
functions not evaluated in the test case, as follows:

- **Number of distractors** (`num_distractors : Union[int, Dict[str, int]]`):
  - If an `int`, samples a total of `num_distractors` distractors.
  The distractors are sampled randomly from each library
  used in the task, weighted by the number of times each library is used in
  the signature.
  - If a `Dict[str, int]`, samples `num_distractors[library]` distractors
  for each `library`.
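
For the `int` case, the weighted split across libraries might work along the lines of the sketch below. This is an assumption-laden illustration: `allocate_distractors`, its `seed` parameter, and the usage counts are hypothetical, and the benchmark's actual sampling procedure may differ.

```python
import random
from collections import Counter


def allocate_distractors(library_usage, num_distractors, seed=0):
    """Splits `num_distractors` across libraries, weighted by usage count."""
    rng = random.Random(seed)
    libraries = list(library_usage)
    weights = [library_usage[lib] for lib in libraries]
    # Draw one library per distractor, proportionally to how often it is used.
    picks = rng.choices(libraries, weights=weights, k=num_distractors)
    counts = Counter(picks)
    return {lib: counts.get(lib, 0) for lib in libraries}


# The chained signature above uses one `molecule` call and one `atom` call.
usage = {"molecule": 1, "atom": 1}
allocation = allocate_distractors(usage, num_distractors=4)
print(allocation)
```

The resulting per-library counts have the same shape as the `Dict[str, int]` form of `num_distractors`.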

In [None]:
view(api_use.get_example(signature="solids.volume_of_cone()", 
                         description="gets the volume of a cone",
                         func_name="get_volume_of_cone",
                         num_distractors=4))

In [None]:
# Weighted random sampling
view(api_use.get_example(signature="molecule.get_atom_with_atomic_num(55).get_atomic_weight()", 
                         func_name="get_atomic_weight_of_atom_with_atomic_num_55", 
                         description="given a molecule, gets the atomic weight of its atom with atomic number 55",
                         num_distractors=4))

- **Position of target function** (`target_func_location : Union[int, float, Dict[str, Union[int, float]]] = -1`):
  - By default (if `-1`), spaces the target functions evenly in the list
  of distractors.
  - If an `int` and a single function is used in the test case,
  inserts that function at the given index.
  If multiple functions are used in the test case, throws an error.
  - If a `float` and a single function is used in the test case,
  inserts that function at the closest approximate fractional
  location (e.g. `0.5` = halfway through the list of functions).
  If multiple functions are used in the test case, throws an error.
  - If a `Dict[str, int]`, inserts a given function `function_name` at index
  `target_func_location[function_name]`.
  - If a `Dict[str, float]`, inserts a given function `function_name` at
  fractional index `target_func_location[function_name]`.
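
The `int` and `float` cases can be illustrated with a small helper that converts a location into an insertion index. This is a sketch under assumptions: `insertion_index` and `insert_target` are hypothetical names, rounding to the nearest index is one plausible reading of "closest approximate fractional location", and the `-1` default and dictionary forms are not handled.

```python
def insertion_index(location, num_distractors):
    """Maps an int or float location to an index between 0 and num_distractors."""
    if isinstance(location, float):
        if not 0.0 <= location <= 1.0:
            raise ValueError("fractional location must lie in [0, 1]")
        # Assumed rule: round to the nearest available index.
        return round(location * num_distractors)
    if not 0 <= location <= num_distractors:
        raise ValueError("index out of range")
    return location


def insert_target(distractors, target, location):
    """Returns a new list with `target` inserted among the distractors."""
    index = insertion_index(location, len(distractors))
    return distractors[:index] + [target] + distractors[index:]


distractors = ["d0", "d1", "d2", "d3"]
print(insert_target(distractors, "target", 0.5))  # ['d0', 'd1', 'target', 'd2', 'd3']
print(insert_target(distractors, "target", 0))    # ['target', 'd0', 'd1', 'd2', 'd3']
```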


### Formatting

- **Global indent** (`indent : int = 4`):
Controls the indent for all Python code + descriptions.


In [None]:
view(api_use.get_example(signature="solids.volume_of_cone()", 
                         description="gets the volume of a cone",
                         func_name="get_volume_of_cone",
                         num_distractors=4,
                         indent=2))

- **Intro** (`intro : str = "Consider the following functions:"`): Controls the introductory sentence that goes before the list of functions.


In [None]:
view(api_use.get_example(signature="solids.volume_of_cone()", 
                         description="gets the volume of a cone",
                         func_name="get_volume_of_cone",
                         num_distractors=4,
                         intro='Have a look at these wacky functions:'))

- **Function description**:
By default, documentation is on a new line, indented.
This can be modified with the following arguments:
  - `use_quotes : bool = False`:
  If `True`, inserts triple quotes around each function description
  so that it more closely resembles a function definition.

In [None]:
view(api_use.get_example(signature="solids.volume_of_cone()", 
                         description="gets the volume of a cone",
                         func_name="get_volume_of_cone",
                         num_distractors=4,
                         use_quotes=True))

  - `format_function : Callable`:
  If this argument is provided, it will be used to format each function.
  Example:


In [None]:
def ff(func):
    args = func.args
    arglist = ", ".join(args)
    return f"- {func.library_name}.{func.name}({arglist}): {func.definition}"

view(api_use.get_example(signature="solids.volume_of_cone()", 
                         description="gets the volume of a cone",
                         func_name="get_volume_of_cone",
                         num_distractors=4,
                         format_function=ff))

In [None]:
def ff(func):
    args = func.args
    arglist = ", ".join(args)
    return f"The function {func.name} from the {func.library_name} library takes in the arguments {arglist} and {func.definition}"

view(api_use.get_example(signature="solids.volume_of_cone()", 
                         description="gets the volume of a cone",
                         func_name="get_volume_of_cone",
                         num_distractors=4,
                         format_function=ff))

- **Preamble joiner** (`joiner : str`): Controls the separator between each function definition. By default, this is a single newline.

In [None]:
view(api_use.get_example(signature="solids.volume_of_cone()", 
                         description="gets the volume of a cone",
                         func_name="get_volume_of_cone",
                         num_distractors=4,
                         joiner='\n---\n'))
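
To see how the formatting options could fit together, the sketch below assembles a preamble from an intro sentence, a `format_function`, and a joiner. The `build_preamble` helper and the stand-in function objects are hypothetical; only the attribute names (`library_name`, `name`, `args`, `definition`) and the default intro are taken from the examples above, and the real prompt layout may differ.

```python
from types import SimpleNamespace


def build_preamble(functions, format_function,
                   intro="Consider the following functions:", joiner="\n"):
    """Joins the formatted functions and places the intro sentence on top."""
    body = joiner.join(format_function(func) for func in functions)
    return f"{intro}\n{body}"


def ff(func):
    arglist = ", ".join(func.args)
    return f"- {func.library_name}.{func.name}({arglist}): {func.definition}"


# Stand-in function objects, used only so the sketch runs on its own.
functions = [
    SimpleNamespace(library_name="solids", name="volume_of_cone",
                    args=["radius", "height"],
                    definition="gets the volume of a cone"),
    SimpleNamespace(library_name="solids", name="volume_of_cylinder",
                    args=["radius", "height"],
                    definition="gets the volume of a cylinder"),
]

preamble = build_preamble(functions, ff, joiner="\n---\n")
print(preamble)
```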