This is Arjun's attempt at building a long-context benchmark based on HumanEvalPlus,
as imagined by Leandro.

In [1]:
import datasets
import random
import bounded_subprocess
import tempfile
from pathlib import Path
import os

In [2]:
print(os.getenv("HF_DATASETS_CACHE"))

None


Let's start by thanking Loubna for uploading this to the Hub.

In [3]:
humanevalplus = datasets.load_dataset("loubnabnl/humaneval_plus", split="train")

Found cached dataset parquet (/home/arjun/.cache/huggingface/datasets/loubnabnl___parquet/loubnabnl--humaneval_plus-d3a2da5c53783cd1/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


The tests in HumanEvalPlus are in the same style as HumanEval:

```
def check(candidate):
    assert candidate(x) == y
    ...
```

The code below extracts the assertions, unindents them, and renames `candidate`
to the name of the function being tested. Not every line in `check` is a simple
assertion, so we skip any line that does not start with `assert`. There is a
possibility of error: an assertion may span several lines, in which case only
its first line is kept. But that is fairly unlikely, and the models we are
testing shouldn't fall apart on a little noise like that.

Finally, we strip out the docstring from the prompt by keeping only the text
before the first indented line.


In [4]:
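def extract_and_unindent(s, entrypoint):
    # Locate the body of the HumanEval-style check function.
    header = "def check(candidate):"
    idx = s.find(header)
    if idx == -1:
        return None
    extracted = s[idx + len(header):]
    tests = []
    for line in extracted.split("\n"):
        # Keep only lines that start a top-level assertion; blank lines,
        # setup code, and continuation lines are skipped.
        if not line.startswith("    assert"):
            continue
        # Unindent and call the function under test by its real name.
        tests.append(line.strip().replace("candidate(", entrypoint + "("))
    return tests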

def clean_item(item):
    tests = extract_and_unindent(item["test"], item["entry_point"])
    prompt = item["prompt"][:item["prompt"].find("\n    ")]
    return {
        "tests": tests,
        "prompt": prompt,
        "canonical": item["canonical_solution"].strip()
    }

processed_humaneval_plus = humanevalplus.map(clean_item).filter(lambda item: len(item["tests"]) > 0)

Loading cached processed dataset at /home/arjun/.cache/huggingface/datasets/loubnabnl___parquet/loubnabnl--humaneval_plus-d3a2da5c53783cd1/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-25a2ca89ddab56cb.arrow
Loading cached processed dataset at /home/arjun/.cache/huggingface/datasets/loubnabnl___parquet/loubnabnl--humaneval_plus-d3a2da5c53783cd1/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-f082645a2f7cca9b.arrow
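
To make the extraction step concrete, here is a small self-contained sketch that applies the same line filtering to a made-up test string. The helper name `extract_assertions` and the sample `check` body are hypothetical and are not taken from the dataset. It also illustrates the caveat mentioned above: only the first line of a multi-line assertion survives.

```python
def extract_assertions(test_src, entrypoint):
    """Hypothetical stand-in mirroring the filtering in extract_and_unindent."""
    header = "def check(candidate):"
    idx = test_src.find(header)
    if idx == -1:
        return None
    tests = []
    for line in test_src[idx + len(header):].split("\n"):
        # Only lines that start a top-level assertion are kept.
        if not line.startswith("    assert"):
            continue
        tests.append(line.strip().replace("candidate(", entrypoint + "("))
    return tests


# Made-up test source in the HumanEval style (not from the dataset).
sample = (
    "def check(candidate):\n"
    "    inputs = [1, 2, 3]\n"          # setup line: skipped
    "    assert candidate([]) == 0\n"
    "    assert candidate([1, 2,\n"     # first line of a multi-line assertion
    "        3]) == 6\n"                # continuation line: dropped
    "    assert candidate(inputs) == 6\n"
)

for t in extract_assertions(sample, "total"):
    print(t)
```

In this toy case the second extracted line, `assert total([1, 2,`, is not valid Python on its own, and the third refers to a variable whose setup line was skipped. Both are examples of the kind of noise the extraction can leave in a prompt.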


Given `processed_humaneval_plus`, we turn each item into a benchmark:

- `prompt` contains the assertions for the target function mixed with
  distractor assertions for other functions, in random order, and concludes
  with the target's function signature (for example, `def f(x):`).
- `size` is the length of `prompt` in characters.
- `target_tests` is the subset of the assertions that test the target function.
- `canonical_prompt` is the prompt without distractors and assertions.
- `canonical_solution` is a canonical solution that should pass the tests.

In [5]:
def build_benchmark(ds, other_indices, target_index):
    canonical_prompt = ds[target_index]["prompt"]
    canonical_solution = ds[target_index]["canonical"]

    tests = []
    tests.extend(ds[target_index]["tests"])
    for ix in other_indices:
        tests.extend(ds[ix]["tests"])
    random.shuffle(tests)
    prompt = "\n".join(tests)
    prompt = prompt + "\n\n" + canonical_prompt
    target_tests = "\n".join(ds[target_index]["tests"])
    return {
        "prompt": prompt, 
        "target_tests": target_tests,
        "canonical_prompt": canonical_prompt,
        "canonical_solution": "\n    " + canonical_solution,
        "size": len(prompt)
    }

def random_benchmark(ds, size: int):
    assert size > 0
    indices = random.sample(range(len(ds)), size)
    return build_benchmark(ds, indices[1:], indices[0])

def validate_benchmark(item):
    program = item["canonical_prompt"] + item["canonical_solution"] + "\n\n" + item["target_tests"]
    with tempfile.NamedTemporaryFile(suffix=".py", delete=True) as f:
        Path(f.name).write_text(program)
        r = bounded_subprocess.run(["python3", f.name])
        return r.exit_code == 0
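
As a quick illustration of the record layout, the following sketch builds one benchmark item from a tiny hand-written dataset and then checks the canonical solution against the target tests. Everything in it is an assumption made for illustration: the two toy entries, the helper names `toy_build` and `toy_validate`, and the use of the standard library's `subprocess.run` with a timeout in place of `bounded_subprocess`.

```python
import os
import random
import subprocess
import sys
import tempfile

# Hand-written toy entries in the shape produced by clean_item (not real data).
toy_ds = [
    {"prompt": "def double(x):",
     "canonical": "return 2 * x",
     "tests": ["assert double(2) == 4", "assert double(0) == 0"]},
    {"prompt": "def negate(x):",
     "canonical": "return -x",
     "tests": ["assert negate(3) == -3"]},
]


def toy_build(ds, other_indices, target_index, rng):
    """Mix target and distractor assertions, then append the target signature."""
    target = ds[target_index]
    tests = list(target["tests"])
    for ix in other_indices:
        tests.extend(ds[ix]["tests"])
    rng.shuffle(tests)
    prompt = "\n".join(tests) + "\n\n" + target["prompt"]
    return {
        "prompt": prompt,
        "target_tests": "\n".join(target["tests"]),
        "canonical_prompt": target["prompt"],
        "canonical_solution": "\n    " + target["canonical"],
        "size": len(prompt),
    }


def toy_validate(item, timeout_s=10):
    """Run signature + solution + target tests in a fresh interpreter."""
    program = (item["canonical_prompt"] + item["canonical_solution"]
               + "\n\n" + item["target_tests"] + "\n")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "program.py")
        with open(path, "w") as f:
            f.write(program)
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


item = toy_build(toy_ds, other_indices=[1], target_index=0, rng=random.Random(0))
print(item["prompt"])
print("valid:", toy_validate(item))
```

Passing an explicit `random.Random` instance keeps the shuffle reproducible without touching the global random state; the notebook itself uses the module-level `random` functions.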

The cell below shows a decent way to prompt an instruction-tuned model with one of these benchmarks, but we aren't going to do it right now.

In [7]:
b = random_benchmark(processed_humaneval_plus, 2)
b_assertions = b["prompt"].split("\n")
b_signature = b_assertions[-1]
b_assertions = "\n".join(b_assertions[:-1]).rstrip()
print(f"These are several assertions:\n\n```\n{b_assertions}\n```\n\nComplete the following function so that the assertions pass:\n\n```\n{b_signature}\n```")

These are several assertions:

```
assert hex_key("112233445566778899AABBCCDDEEFF00") == 12, "Sixth test error: " + str(hex_key("112233445566778899AABBCCDDEEFF00"))
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
assert hex_key([]) == 0
assert hex_key("1077E") == 2, "Second test error: " + str(hex_key("1077E"))
assert hex_key("ABED1A33") == 4, "Third test error: " + str(hex_key("ABED1A33"))
assert hex_key("123456789ABCDEF0") == 6, "Fifth test error: " + str(hex_key("123456789ABCDEF0"))
assert has_close_elements([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
assert has_close_elements([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
assert has_close_elements([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
assert hex_key("AB") == 1, "First test error: " + str(hex_key("AB"))
assert hex_key("2020") == 2, "Fourth test error: " + str(hex_key("2020"))
assert has_close_elements([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
assert has_close_elements([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True

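Now we build benchmarks of varying size: five attempts at each of 10, 20, 40, 80, and 160 functions (one target plus distractors), keeping only those whose canonical solution passes its target tests.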

In [6]:
items = [ ]
counter = 0
for size in [10, 20, 40, 80, 160]:
    for i in range(5):
        b = random_benchmark(processed_humaneval_plus, size)
        if validate_benchmark(b):
            b["benchmark_name"] = f"Test2Code_Long_Context_{counter}"
            items.append(b)
            counter += 1
        else:
            print(f"Failed to generate benchmark of size {size}")

Failed to generate benchmark of size 20
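
The loop above drops a failed attempt rather than replacing it, so the run shown yields 24 items from 25 attempts. If a fixed number of items per size is wanted, one option (not what this notebook does) is to retry until enough valid items are collected, and to seed the random generator so the selection is reproducible. The sketch below uses hypothetical stand-ins for `random_benchmark` and `validate_benchmark`; the name `collect_valid` and the `max_attempts` cap are assumptions as well.

```python
import random


def collect_valid(make_item, is_valid, count, max_attempts=50):
    """Call make_item() until `count` items pass is_valid, or attempts run out."""
    items = []
    attempts = 0
    while len(items) < count and attempts < max_attempts:
        attempts += 1
        candidate = make_item()
        if is_valid(candidate):
            items.append(candidate)
    return items


# Hypothetical stand-ins: an "item" is a random integer, valid when even.
rng = random.Random(0)  # fixed seed so reruns draw the same candidates
items = collect_valid(lambda: rng.randint(0, 99), lambda n: n % 2 == 0, count=5)
print(len(items))
```

The attempt cap keeps the loop from running forever if validation can never succeed for a given size.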


In [8]:
longtest_benchmark = datasets.Dataset.from_list(items)
longtest_benchmark.to_json("test2code_long_context.jsonl", lines=True)


652060

How long is the longest benchmark prompt (in characters, not tokens)?

In [13]:
max(longtest_benchmark["size"])

64079

In [9]:
import pandas as pd

In [10]:
x = pd.read_json("../test2code_long_context.jsonl", lines=True)

In [14]:
x.to_dict(orient="records")

[{'prompt': 'assert valid_date(\'2003-04-12\') == False\nassert simplify("1/5", "1/5") == False, \'test13\'\nassert simplify("2/3", "5/2") == False, \'test8\'\nassert parse_music(\'.| .| .| .|\') == [1, 1, 1, 1]\nassert valid_date(\'04-2003\') == False\nassert valid_date(\'04-12-2003\') == True\nassert split_words("Hello,world!") == ["Hello","world!"]\nassert simplify("2/4", "8/4") == True, \'test10\'\nassert even_odd_palindrome(9) == (4, 5), "This prints if this assert fails 1 (good for debugging!)"\nassert intersperse([], 7) == []\nassert valid_date(\'15-01-2012\') == False\nassert change_base(16, 2) == "10000"\nassert simplify("7/10", "10/2") == False, \'test4\'\nassert numerical_letter_grade([4.0, 3, 1.7, 2, 3.5]) == [\'A+\', \'B\', \'C-\', \'C\', \'A-\']\nassert even_odd_palindrome(25) == (5, 6)\nassert split_words("aaaBb") == 1\nassert valid_date(\'03-11-2000\') == True\nassert valid_date(\'04-31-3000\') == False\nassert valid_date(\'2003-04\') == False\nassert intersperse([5, 6,