## Overview

This notebook builds a long-context benchmark, as discussed by Federico, Arjun,
Harm, and Leandro. Unlike typical Code LLM benchmarks, this is a test
generation benchmark: we prompt the model with the implementation of a Python
function (and its docstring) and ask for a test suite. The result is scored in
two steps: if any test in the test suite fails, the score is zero. Otherwise,
the tests are scored based on their coverage of the function's
implementation. To make the problem harder, we add several other functions to
the prompt to serve as distractors. There are enough distractors to exercise
models with very long context lengths (up to 128K tokens).

We use two datasets: HumanEval and MultiPL-T. Both contain Python functions.
The HumanEval functions should be decontaminated before training: their
docstrings should not appear in the training data. The MultiPL-T functions were
extracted from The Stack v1.2, so they are very likely to appear in models'
training data, but they serve only as distractors.
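
The two-step scoring described above can be sketched as follows. This is only an illustration, not the benchmark's scoring code: the function name `score_test_suite`, its parameters, the use of line coverage as the coverage measure, and the choice to give an empty test suite a score of zero are all assumptions.

```python
from typing import List


def score_test_suite(test_results: List[bool], lines_covered: int, lines_total: int) -> float:
    """Hypothetical scorer: zero if any test fails, otherwise coverage in [0, 1]."""
    # Assumption: an empty test suite earns no credit.
    if not test_results:
        return 0.0
    # Step 1: a single failing test zeroes the score.
    if not all(test_results):
        return 0.0
    # Step 2: all tests pass, so score by coverage of the target function.
    # Assumption: coverage is measured as the fraction of lines executed.
    if lines_total <= 0:
        return 0.0
    return lines_covered / lines_total


failing_score = score_test_suite([True, False, True], lines_covered=5, lines_total=5)
passing_score = score_test_suite([True, True], lines_covered=3, lines_total=4)
print(failing_score, passing_score)
```

Under these assumptions, a suite with full coverage but one failing test scores 0.0, while a fully passing suite that executes three of four lines scores 0.75.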

In [1]:
import datasets
import random
import os
from typing import List
import itertools



In [2]:
# In case you're in an environment where you really want this to be set.
print(os.getenv("HF_DATASETS_CACHE"))

random.seed(42)

# This is likely an overestimate. But, it should be close enough and we don't need
# to be exact.
CHARS_PER_TOKEN = 2.5

None
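
As a rough illustration of what `CHARS_PER_TOKEN` implies, the sketch below converts the token targets used later in this notebook into character budgets. The helper name `target_chars` is hypothetical; the arithmetic mirrors the `int(approximate_token_count * CHARS_PER_TOKEN)` expression in `get_distractors`.

```python
CHARS_PER_TOKEN = 2.5  # same value as in the notebook


def target_chars(approximate_token_count: int) -> int:
    """Hypothetical helper: character budget for a given token target."""
    return int(approximate_token_count * CHARS_PER_TOKEN)


# Token targets taken from APPROXIMATE_TOKEN_COUNTS later in the notebook.
budgets = {tokens: target_chars(tokens) for tokens in [0, 8_000, 64_000, 128_000]}
for tokens, chars in budgets.items():
    print(f"{tokens:>7} tokens -> {chars:>7} characters of distractors")
```

The largest setting therefore asks for about 320,000 characters of distractor code per prompt.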


In [3]:
humaneval = datasets.load_dataset("openai_humaneval", split="test")
# The MultiPL-T dataset is currently private, but will be public soon.
multiplt = datasets.load_dataset("nuprl/stack-dedup-python-testgen-starcoder-filter-inferred-v2", split="train")
# Dataset of mutants generated from the HumanEval problems.
mutant_ds = datasets.load_dataset("nuprl/humaneval-py-mutants", split="train")


In [8]:
def get_distractors(approximate_token_count: int) -> List[str]:
    # Sample MultiPL-T functions (with replacement) until their combined length
    # reaches the character budget implied by CHARS_PER_TOKEN.
    result = []
    result_chars = 0
    target_chars = int(approximate_token_count * CHARS_PER_TOKEN)
    while result_chars < target_chars:
        fn = random.choice(multiplt)["content"]
        result.append(fn)
        result_chars += len(fn)
    return result

def find_mutant_with_idx(humaneval_problem_index: int):
    for ex in mutant_ds:
        if f"_{humaneval_problem_index}_" in ex["name"]:
            return ex
    return None

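def remove_doctests(code: str) -> str:
    # Drop every ">>>" line together with the single line that follows it
    # (the expected output). This assumes each doctest has exactly one line
    # of expected output.
    result = []
    in_doctest = False
    for line in code.splitlines():
        if ">>>" in line:
            in_doctest = True
            continue
        if in_doctest:
            in_doctest = False
            continue
        result.append(line)
    return "\n".join(result)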

def build_prompt(
        approximate_token_count: int,
        humaneval_problem_index: int,
        insert_where: str):
    distractors = get_distractors(approximate_token_count)
    target_problem = humaneval[humaneval_problem_index]
    target_function = remove_doctests(target_problem["prompt"]) + "\n" + target_problem["canonical_solution"]
    if insert_where == "first half":
        insert_index = random.randint(0, len(distractors) // 2)    
    elif insert_where == "second half":
        insert_index = random.randint(len(distractors) // 2, len(distractors))
    else:
        raise ValueError(f"Unknown insert_where: {insert_where}")
    distractors.insert(insert_index, target_function)
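    # find_mutant_with_idx returns None when no row matches this problem,
    # so guard against that before reading the "mutants" field.
    mutant_row = find_mutant_with_idx(humaneval_problem_index)
    mutants = mutant_row["mutants"] if mutant_row is not None else None
    mutants = mutants if mutants is not None else []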
    return { 
        "prompt": "\n\n".join(distractors),
        "target_function": target_function,
        "humaneval_task_id": target_problem["task_id"],
        "task_id": f"LongBench_{target_problem['task_id']}_{approximate_token_count}_{insert_where}",
        "approx_token_count": approximate_token_count,
        "mutants": mutants,
        "target_function_name": target_problem["entry_point"]
    }
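
To make the behavior of `remove_doctests` concrete, here is a small self-contained check. The function body is a copy of the helper above; the `sample` string is a hand-written stand-in shaped like a HumanEval prompt, not data loaded from the dataset.

```python
def remove_doctests(code: str) -> str:
    # Copy of the notebook's helper: skip each ">>>" line and the line after it.
    result = []
    in_doctest = False
    for line in code.splitlines():
        if ">>>" in line:
            in_doctest = True
            continue
        if in_doctest:
            in_doctest = False
            continue
        result.append(line)
    return "\n".join(result)


# Hand-written stand-in for a prompt with two single-line doctests.
sample = '''def add(x: int, y: int):
    """Add two numbers x and y
    >>> add(2, 3)
    5
    >>> add(5, 7)
    12
    """
'''
stripped = remove_doctests(sample)
print(stripped)

# Limitation: only one line of expected output is dropped per doctest,
# so the second output line of a multi-line result survives.
leftover = remove_doctests(">>> f()\na\nb")
print(repr(leftover))
```

The first call leaves only the signature and the docstring text, which is why the prompt for problem #53 shows no examples. The second call shows that a doctest with multi-line output would leave part of that output behind.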


## Example Prompts

Some examples of prompts that we can construct.

With 0 as the number of target tokens, we get no distractors.

In [9]:
print(build_prompt(0, 53, "first half")["prompt"])



def add(x: int, y: int):
    """Add two numbers x and y
    """
    return x + y



With 400 target tokens, we get a handful of distractors, and the `insert_where`
argument starts to matter.

In [13]:
print(build_prompt(400, 53, "first half")["prompt"])

def dasherize(word):
    """Replace underscores with dashes in the string.

    Example::

        >>> dasherize("lower_case")
        "lower-case"

    """
    return word.replace("_", "-")

def calc_dDdc_fn(c, dc, D_fn):
    """
    Computes dD/dc given a functional form to estimate D(c).
    """
    # computes diffusivity at given concentration and one step size away [m^2/s]
    D1 = D_fn(c)
    D2 = D_fn(c + dc)
    # computes dD/dc using forward difference formula [m^2/s / kg/m^3]
    dDdc = (D2 - D1) / dc

    return dDdc



def add(x: int, y: int):
    """Add two numbers x and y
    """
    return x + y


def group_normalized_count(arr):
    """
    aggregation: inverse of array length
    """
    return 1.0/float(len(arr))

def average(values):
    """Computes the arithmetic mean of a list of numbers.

    >>> print(average([20, 30, 70]))
    40.0
    """
    return sum(values) / len(values)

def get_cli_fname(lon, lat, scenario=0):
    """Get the climate file name for the give

In [11]:
print(build_prompt(400, 53, "second half")["prompt"])

def kewley_agn_oi(log_oi_ha):
    """Seyfert/LINER classification line for log([OI]/Ha)."""
    return 1.18 * log_oi_ha + 1.30

def atoi(text):
    """
    Checks if the file names contain numbers.

    Parameters
    ----------
    text
        This parameter could be a str or int.

    Returns
    -------

    flow  :  int, str
    """
    flow = int(text) if text.isdigit() else text
    return flow

def get_total_time_in_sec(last_user_stats):
    """Calculates the total time you listen to music"""
    total_time_sec = 0
    for song in last_user_stats:
        try:
            total_time_sec += int(song['play_count']) * (int(song['duration_millis']) / 1000)
        except:
            continue
    return total_time_sec

def encode_varint(number: int) -> bytes:
    """
    Encode varint into bytes
    """
    # Shift to int64
    number = number << 1
    buf = b""
    while True:
        towrite = number & 0x7F
        number >>= 7
        if number:
            buf += bytes((towrite

## The Benchmark

There are a number of trivial problems in HumanEval, such as #53 shown above.
We want a subset of problems that have a range of difficulties. The following
ten problems have varying difficulty in several programming languages
and were picked by Francesca Lucchetti for MultiPL-T.

- HumanEval_100_make_a_pile
- HumanEval_13_greatest_common_divisor
- HumanEval_152_compare
- HumanEval_157_right_angle_triangle
- HumanEval_27_flip_case
- HumanEval_40_triples_sum_to_zero
- HumanEval_55_fib
- HumanEval_66_digitSum
- HumanEval_72_will_it_fly
- HumanEval_74_total_match
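
As a quick sanity check on the size of the benchmark, the sketch below derives the problem indices from the names listed above and enumerates every combination with `itertools.product`. The token counts and insertion positions are the ones this notebook uses when building the benchmark; the variable names here are only for illustration.

```python
import itertools

problem_names = [
    "HumanEval_100_make_a_pile",
    "HumanEval_13_greatest_common_divisor",
    "HumanEval_152_compare",
    "HumanEval_157_right_angle_triangle",
    "HumanEval_27_flip_case",
    "HumanEval_40_triples_sum_to_zero",
    "HumanEval_55_fib",
    "HumanEval_66_digitSum",
    "HumanEval_72_will_it_fly",
    "HumanEval_74_total_match",
]

# The problem index is the second underscore-separated field of each name.
indices = [int(name.split("_")[1]) for name in problem_names]

token_counts = [0, 8_000, 64_000, 128_000]
insert_wheres = ["first half", "second half"]

# product varies the last argument fastest, so rows are grouped by token count,
# then by problem, then by insertion position.
grid = list(itertools.product(token_counts, indices, insert_wheres))
print(indices)
print(len(grid))   # 4 token counts * 10 problems * 2 positions
print(grid[41])    # configuration behind row 41
```

With this ordering, row 41 corresponds to problem 100 with a 64,000-token target and the target function in the second half of the prompt.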

In [14]:
APPROXIMATE_TOKEN_COUNTS = [0, 8_000, 64_000, 128_000]
HUMANEVAL_PROBLEM_INDICES = [100, 13, 152, 157, 27, 40, 55, 66, 72, 74]
INSERT_WHERES = [ "first half", "second half"]

benchmark = datasets.Dataset.from_list(
    [build_prompt(*x) for x in itertools.product(APPROXIMATE_TOKEN_COUNTS, HUMANEVAL_PROBLEM_INDICES, INSERT_WHERES)])
benchmark

Dataset({
    features: ['prompt', 'target_function', 'humaneval_task_id', 'task_id', 'approx_token_count', 'mutants', 'target_function_name'],
    num_rows: 80
})

In [15]:
benchmark.to_json("benchmark.jsonl", lines=True)

10961958
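
The exported file holds one JSON object per line, so it can be read back without the `datasets` library. The sketch below is a minimal reader using only the standard library; `read_jsonl` is a hypothetical helper, and the rows are small stand-ins written to a temporary file rather than the real `benchmark.jsonl`.

```python
import json
import os
import tempfile
from typing import Dict, Iterator, List


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Stand-in rows: real rows also carry the prompt, mutants, and other fields.
rows: List[Dict] = [
    {"task_id": "LongBench_HumanEval/100_0_first half", "approx_token_count": 0},
    {"task_id": "LongBench_HumanEval/100_0_second half", "approx_token_count": 0},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "benchmark_sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    loaded = list(read_jsonl(path))

print(len(loaded), loaded[0]["task_id"])
```

Pointing `read_jsonl` at `benchmark.jsonl` should yield the 80 benchmark rows in the same way.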

In [17]:
print(benchmark[41]["prompt"])

def func(x):
    """

    >>> func(2)
    32
       
    >>> func(0)
    1

    >>> func(-1)
    1.25

    """

    return x**4 + 4**x

def to_alternating_case(string: str) -> str:
    """
    each lowercase letter becomes uppercase and
    each uppercase letter becomes lowercase
    :param string:
    :return:
    """

    return ''.join((char.upper() if char.islower() else char.lower()) for char in string)

def flatten_name(names):
    """Return a flattened form of an element name.

    """
    return u':'.join(map(lambda x: x.replace('\\', '\\\\')
                                    .replace(':', '\\:'), names))

def one_zero_boolean_to_string(value: str) -> str:
    """Helper function to convert arguments with 1/0 string values to true or false"""
    return 'true' if value == '1' else 'false'

def filename_ext(filename):
    """ Function that returns filename extension """
    # Taken from http://flask.pocoo.org/docs/1.0/patterns/fileuploads/

    return '.' in filename and filena