## Overview

This notebook builds a long-context benchmark, as discussed by Federico, Arjun,
Harm, and Leandro. Unlike typical Code LLM benchmarks, this is a test
generation benchmark: we prompt the model with the implementation of a Python
function (and its docstring) and ask for a test suite. The result is scored in
two steps: if any test in the test suite fails, the score is zero. Otherwise,
the tests are scored based on their coverage of the function's
implementation. To make the problem harder, we add several other functions to
the prompt to serve as distractors. There are enough distractors to exercise
models with very long context lengths (up to 128K tokens).

We use two datasets: HumanEval and MultiPL-T. Both contain Python functions.
The HumanEval functions should be decontaminated before training: their
docstrings should not appear in the training data. The MultiPL-T functions are
extracted from the Stack v1.2. Thus they are very likely to appear in models'
training data, but they serve only as distractors.
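
The two-step scoring rule can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function name `score_test_suite` and its parameters are hypothetical, and it assumes coverage is reported as the fraction of covered lines, which the description above does not specify.

```python
from typing import List


def score_test_suite(test_results: List[bool], covered_lines: int, total_lines: int) -> float:
    """Hypothetical scorer: zero if any test fails, otherwise line coverage in [0, 1]."""
    # Assumption: an empty test suite earns no credit.
    if not test_results or not all(test_results):
        return 0.0
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return covered_lines / total_lines


# One failing test zeroes the score, even with full coverage.
print(score_test_suite([True, True, False], 10, 10))
# All tests pass, so the score is the coverage fraction.
print(score_test_suite([True, True], 3, 4))
```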

In [1]:
import datasets
import random
import os
from typing import List
import itertools



In [2]:
# In case you're in an environment where you really want this to be set.
print(os.getenv("HF_DATASETS_CACHE"))

random.seed(42)

# This is likely an overestimate. But, it should be close enough and we don't need
# to be exact.
CHARS_PER_TOKEN = 2.5

None


In [18]:
humaneval = datasets.load_dataset("openai_humaneval", split="test")
# The MultiPL-T dataset is currently private, but will be public soon.
multiplt = datasets.load_dataset("nuprl/stack-dedup-python-testgen-starcoder-filter-inferred-v2", split="train")
# mutant generated dataset
mutant_ds = datasets.load_dataset("nuprl/humaneval-py-mutants", split="train")

Found cached dataset openai_humaneval (/home/elleven/.cache/huggingface/datasets/openai_humaneval/openai_humaneval/1.0.0/2955cebd73602e828fa8c0a424c594e5fab4ec863b316ca98f3d8fdb6a626e75)
Found cached dataset parquet (/home/elleven/.cache/huggingface/datasets/nuprl___parquet/nuprl--stack-dedup-python-testgen-starcoder-filter-inferred-v2-8a147987b4874669/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)




In [34]:
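def get_distractors(approximate_token_count: int) -> List[str]:
    """Sample MultiPL-T functions (with replacement) until their total length
    reaches roughly approximate_token_count tokens."""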
    result = []
    result_chars = 0
    target_chars = int(approximate_token_count * CHARS_PER_TOKEN)
    while result_chars < target_chars:
        fn = random.choice(multiplt)["content"]
        result.append(fn)
        result_chars += len(fn)
    return result

def find_mutant_with_idx(humaneval_problem_index: int):
    for ex in mutant_ds:
        if f"_{humaneval_problem_index}_" in ex["name"]:
            return ex
    return None

def build_prompt(
        approximate_token_count: int,
        humaneval_problem_index: int,
        insert_where: str):
    distractors = get_distractors(approximate_token_count)
    target_problem = humaneval[humaneval_problem_index]
    target_function = target_problem["prompt"] + target_problem["canonical_solution"]
    if insert_where == "first half":
        insert_index = random.randint(0, len(distractors) // 2)    
    elif insert_where == "second half":
        insert_index = random.randint(len(distractors) // 2, len(distractors))
    else:
        raise ValueError(f"Unknown insert_where: {insert_where}")
    distractors.insert(insert_index, target_function)
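    # Look up the mutants for this problem. Fall back to an empty list if the
    # problem has no entry in mutant_ds or the entry has no mutants.
    mutant_entry = find_mutant_with_idx(humaneval_problem_index)
    mutants = mutant_entry["mutants"] if mutant_entry is not None else None
    mutants = mutants if mutants is not None else []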
    return { 
        "prompt": "\n\n".join(distractors),
        "target_function": target_function,
        "humaneval_task_id": target_problem["task_id"],
        "task_id": f"LongBench_{target_problem['task_id']}_{approximate_token_count}_{insert_where}",
        "approx_token_count": approximate_token_count,
        "mutants": mutants,
        "target_function_name": target_problem["entry_point"]
    }
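
To see how `insert_where` constrains the target's position without loading any datasets, the placement logic can be exercised on placeholder strings. This is only a sketch: `place_target` is a hypothetical helper that mirrors the insertion branch of `build_prompt`, and the distractor strings are stand-ins.

```python
import random
from typing import List, Tuple


def place_target(distractors: List[str], target: str, insert_where: str,
                 rng: random.Random) -> Tuple[List[str], int]:
    """Hypothetical helper mirroring the insertion branch of build_prompt."""
    half = len(distractors) // 2
    if insert_where == "first half":
        index = rng.randint(0, half)  # randint is inclusive on both ends
    elif insert_where == "second half":
        index = rng.randint(half, len(distractors))
    else:
        raise ValueError(f"Unknown insert_where: {insert_where}")
    items = list(distractors)  # copy so the caller's list is not mutated
    items.insert(index, target)
    return items, index


rng = random.Random(0)
# Placeholder distractors; the real ones come from MultiPL-T.
fake_distractors = [f"def distractor_{i}(): pass" for i in range(10)]
for where in ("first half", "second half"):
    items, index = place_target(fake_distractors, "def target(): pass", where, rng)
    print(where, index, items[index])
```

Both ranges include the midpoint, so the two settings overlap at index `len(distractors) // 2`. The position is also counted in functions rather than characters, so one long distractor can shift the target's character offset across the middle of the prompt.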


## Example Prompts

Some examples of prompts that we can construct.

With 0 as the number of target tokens, we get no distractors.

In [35]:
print(build_prompt(0, 53, "first half")["prompt"])



def add(x: int, y: int):
    """Add two numbers x and y
    >>> add(2, 3)
    5
    >>> add(5, 7)
    12
    """
    return x + y



With 400 target tokens, we get a handful of distractors and the `insert_where`
argument starts to matter.
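
For a sense of scale, 400 target tokens at 2.5 characters per token gives a budget of 1,000 characters. The sketch below reproduces the sampling loop on a plain list of placeholder strings instead of the MultiPL-T dataset; `sample_until_budget` and the pool contents are hypothetical stand-ins.

```python
import random
from typing import List

CHARS_PER_TOKEN = 2.5  # same constant as above


def sample_until_budget(pool: List[str], approximate_token_count: int,
                        rng: random.Random) -> List[str]:
    """Hypothetical stand-in for get_distractors that samples from a plain list."""
    target_chars = int(approximate_token_count * CHARS_PER_TOKEN)
    result: List[str] = []
    result_chars = 0
    while result_chars < target_chars:
        fn = rng.choice(pool)
        result.append(fn)
        result_chars += len(fn)
    return result


rng = random.Random(42)
# Placeholder "functions" of 200, 350, and 600 characters.
pool = ["x" * n for n in (200, 350, 600)]
sampled = sample_until_budget(pool, 400, rng)
print(len(sampled), sum(len(s) for s in sampled))
```

The loop stops only once the budget has been reached, so the last sampled function usually pushes the total somewhat past 1,000 characters. A target of 0 tokens yields an empty list.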

In [36]:
print(build_prompt(400, 53, "first half")["prompt"])

def speed_convert(size):
    """
    Hi human, you can't read bytes?
    """
    power = 2**10
    zero = 0
    units = {0: "", 1: "Kb/s", 2: "Mb/s", 3: "Gb/s", 4: "Tb/s"}
    while size > power:
        size /= power
        zero += 1
    return f"{round(size, 2)} {units[zero]}"



def add(x: int, y: int):
    """Add two numbers x and y
    >>> add(2, 3)
    5
    >>> add(5, 7)
    12
    """
    return x + y


def parse_info_date_str(info_date_str):
    """Returns an info_date string modified in such a way that Elasticsearch would not attempt to interpret it as a date.
    Currently there are several different formats of info_date used.
    If no modification is applied Elasticseach will interpret part of the values as a string and another part as a date
    which causes a value error and should be avoided.

    :param info_date_str:
    :return:
    """
    return 'str:' + info_date_str

def handle_prices(data: dict) -> dict:
    """Set price according to rate."""
    for reservatio

In [15]:
print(build_prompt(400, 53, "second half")["prompt"])

def _LJ_rminepsilon_to_ab(coeffs):
    """
    Convert rmin/epsilon representation to AB representation of the LJ
    potential
    """
    A = coeffs['epsilon'] * coeffs['Rmin']**12.0
    B = 2 * coeffs['epsilon'] * coeffs['Rmin']**6.0
    return {"A": A, "B": B}

def check_if_odd(num):
    """
    Checks if number is odd

    Args:
        num :


    (Generated by docly)
    """
    return True if num % 2 != 0 else False



def add(x: int, y: int):
    """Add two numbers x and y
    >>> add(2, 3)
    5
    >>> add(5, 7)
    12
    """
    return x + y


def declare(var_name, var_type): 
	"""Creates an SMTLIB declaration formatted string

	Parameters
		----------
		var_name: string
			The name of the variable
		var_type: 
			The type of the variable (Int, Bool, etc.)
	"""
	return "(declare-fun " + var_name + " () " + var_type + ")\n"

def dict_scale(d, scl):
    """scales all values in dict and returns a new dict"""
    return dict([(k, v * scl) for k, v in d.items()])

def dasherize

## The Benchmark

There are a number of trivial problems in HumanEval, such as #53 shown above.
We want a subset of problems that have a range of difficulties. The following
ten problems have varying difficulty in several programming languages
and were picked by Francesca Lucchetti for MultiPL-T.

- HumanEval_100_make_a_pile
- HumanEval_13_greatest_common_divisor
- HumanEval_152_compare
- HumanEval_157_right_angle_triangle
- HumanEval_27_flip_case
- HumanEval_40_triples_sum_to_zero
- HumanEval_55_fib
- HumanEval_66_digitSum
- HumanEval_72_will_it_fly
- HumanEval_74_total_match

In [37]:
APPROXIMATE_TOKEN_COUNTS = [0, 8_000, 64_000, 128_000]
HUMANEVAL_PROBLEM_INDICES = [100, 13, 152, 157, 27, 40, 55, 66, 72, 74]
INSERT_WHERES = [ "first half", "second half"]

benchmark = datasets.Dataset.from_list(
    [build_prompt(*x) for x in itertools.product(APPROXIMATE_TOKEN_COUNTS, HUMANEVAL_PROBLEM_INDICES, INSERT_WHERES)])
benchmark

Dataset({
    features: ['prompt', 'target_function', 'humaneval_task_id', 'task_id', 'approx_token_count', 'mutants', 'target_function_name'],
    num_rows: 80
})
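
The 80 rows follow from the Cartesian product of the three configuration lists (4 × 10 × 2). The sketch below reproduces only the enumeration and the `task_id` format. It does not call `build_prompt` or load any dataset, and it assumes that HumanEval task ids have the form `HumanEval/<index>`.

```python
import itertools

APPROXIMATE_TOKEN_COUNTS = [0, 8_000, 64_000, 128_000]
HUMANEVAL_PROBLEM_INDICES = [100, 13, 152, 157, 27, 40, 55, 66, 72, 74]
INSERT_WHERES = ["first half", "second half"]

grid = list(itertools.product(
    APPROXIMATE_TOKEN_COUNTS, HUMANEVAL_PROBLEM_INDICES, INSERT_WHERES))

# Assumption: HumanEval task ids have the form "HumanEval/<index>".
task_ids = [f"LongBench_HumanEval/{idx}_{tokens}_{where}"
            for tokens, idx, where in grid]

print(len(grid))
print(task_ids[0])
print(len(set(task_ids)) == len(task_ids))  # every configuration gets a distinct id
```

The 20 rows with a token count of 0 contain no distractors, so for each problem the "first half" and "second half" variants have the same prompt.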

In [16]:
benchmark.to_json("benchmark.jsonl", lines=True)

Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 12.38ba/s]


10746864
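
Each line of `benchmark.jsonl` is one JSON object with the fields listed above. A consumer could read the file with the standard library alone. The sketch below writes a one-record stand-in file to a temporary directory so that it runs without the real benchmark. The `read_jsonl` helper is hypothetical and the record values are placeholders.

```python
import json
import os
import tempfile

FIELDS = ["prompt", "target_function", "humaneval_task_id", "task_id",
          "approx_token_count", "mutants", "target_function_name"]


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Placeholder record with the same field names as the benchmark rows.
placeholder = {
    "prompt": "def add(x, y):\n    return x + y\n",
    "target_function": "def add(x, y):\n    return x + y\n",
    "humaneval_task_id": "HumanEval/53",
    "task_id": "LongBench_HumanEval/53_0_first half",
    "approx_token_count": 0,
    "mutants": [],
    "target_function_name": "add",
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(placeholder) + "\n")
    rows = list(read_jsonl(path))

print(sorted(rows[0]) == sorted(FIELDS))
```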

In [17]:
benchmark[0]

{'prompt': '\ndef make_a_pile(n):\n    """\n    Given a positive integer n, you have to make a pile of n levels of stones.\n    The first level has n stones.\n    The number of stones in the next level is:\n        - the next odd number if n is odd.\n        - the next even number if n is even.\n    Return the number of stones in each level in a list, where element at index\n    i represents the number of stones in the level (i+1).\n\n    Examples:\n    >>> make_a_pile(3)\n    [3, 5, 7]\n    """\n    return [n + 2*i for i in range(n)]\n',
 'target_function': '\ndef make_a_pile(n):\n    """\n    Given a positive integer n, you have to make a pile of n levels of stones.\n    The first level has n stones.\n    The number of stones in the next level is:\n        - the next odd number if n is odd.\n        - the next even number if n is even.\n    Return the number of stones in each level in a list, where element at index\n    i represents the number of stones in the level (i+1).\n\n    Exa