
A common interface for APIs and Models. #161

Open
Anindyadeep opened this issue Nov 12, 2023 · 9 comments

Comments

@Anindyadeep

Summary of the issue

First of all, thanks for the awesome effort in building this code-evaluation package. Highly appreciate it. However, right now it is integrated only with Hugging Face models. It would be great if we could run the same evaluation for closed-source models. For example, something like this would be awesome:

accelerate launch  main.py \
  --model gpt-3.5-turbo \
  --max_length_generation 512 \
  --tasks instruct-humaneval \
  --instruction_tokens <user_token>,<end_token>,<assistant_token> \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 10 \
  --allow_code_execution

So, with the same interface and post-processing logic of the code-evaluation harness, we could evaluate and compare code generation for both open-source and closed-source models.

What is the motivation

The motivation behind this is that open-source models are great. However, researchers and LLM enthusiasts are always striving to build models that surpass GPT with fewer parameters and better performance on certain tasks, and a library like this would be really helpful for that comparison.

How can I contribute:

Well, I already have most of this code ready. If you are aligned with the motivation of this issue, I can create the PR. However, the problem I am facing is that the evaluation scores for API-based models are very low. For example, gpt-3.5 gets a score of 0.006 on the HumanEval benchmark, even though the generations are correct. The problem is in the indentation and the post-processing of the generations. For example, one generation from gpt-3.5-turbo looks like this:

from typing import List


def separate_paren_groups(paren_string: str) -> List[str]:
    """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
    separate those group into separate strings and return the list of those.
    Separate groups are balanced (each open brace is properly closed) and not nested within each other
    Ignore any spaces in the input string.
    >>> separate_paren_groups('( ) (( )) (( )( ))')
    ['()', '(())', '(()())']
    """    paren_string = paren_string.replace(' ', '')
    stack = []
    result = []
    group = ''
    for char in paren_string:
        if char == '(':
            if stack:
                group += char
            stack.append(char)
        elif char == ')':
            stack.pop()
            group += char
            if not stack:
                result.append(group)
                group = ''
    return result

If we look at the code above, we can see the problem is the indentation: the first statement ends up on the same line as the closing docstring quotes, so the solution gets marked as wrong during evaluation. I tried to apply bigcode's post-processing for the different tasks, but it was not working, so I would highly appreciate some help there.
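For reference, the heuristic I am currently experimenting with looks roughly like this (just a sketch on my side, not the harness's post-processing): it moves any code that the model glues onto the line that closes the docstring back onto its own line.

def fix_docstring_indentation(generation: str) -> str:
    """Heuristic sketch: if the line that closes a docstring also carries
    code after the closing triple quotes, move that code to its own line."""
    fixed_lines = []
    in_docstring = False
    for line in generation.splitlines():
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        if not in_docstring and stripped.startswith('"""'):
            # Docstring opens here; it stays open unless a second """ closes
            # it on the same line.
            if stripped.count('"""') == 1:
                in_docstring = True
            fixed_lines.append(line)
        elif in_docstring and '"""' in stripped:
            # Docstring closes here; anything after the quotes is code that
            # the model glued onto the same line.
            before, _, after = stripped.partition('"""')
            fixed_lines.append(indent + before + '"""')
            if after.strip():
                fixed_lines.append(indent + after.strip())
            in_docstring = False
        else:
            fixed_lines.append(line)
    return "\n".join(fixed_lines)

Applied to the generation above, this splits the line starting with """ followed by paren_string into a properly closed docstring and a correctly indented statement, but it is only a band-aid for this particular failure mode.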

@loubnabnl
Collaborator

Hi, thanks for the suggestion. I think some challenges in evaluating these models are that they might change and evolve behind the API, which makes the evaluation numbers not very relevant over time. They also might require different post-processing to extract the code snippet, since they tend to generate natural text before and after, so I'm not sure the current approach we have will work out of the box for most tasks.

However, if you do tests and find your implementation to work/match public numbers for certain tasks like instruct-humaneval or other tasks like HumanEvalSynthesize, then feel free to open a PR and we can consider adding this setup for a restricted set of benchmarks if it integrates well with the codebase.

Regarding your indentation issue, I think the prompt is stripped by default and doesn't have a \n at the end in instruct-humaneval. For the humaneval task we have both humaneval and humaneval-unstripped, because we've noticed that GPT-4's tokenizer and a few others like Phind's require keeping the last \n in the prompt to work properly. Can you try evaluating again while adding the \n? You can do that here or in the context of the task.
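For illustration only, assuming you build the prompt yourself before sending it to the API, restoring the newline is as simple as:

# Minimal sketch: keep (or restore) the trailing newline that a stripped
# prompt loses, since some tokenizers need it to continue the function body.
prompt = prompt if prompt.endswith("\n") else prompt + "\n"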

@Anindyadeep
Author

Anindyadeep commented Nov 16, 2023

I tried some of the things mentioned above, but everything was solved simply by adding a prompt. Does that make it a valid solution? For example, for HumanEval the problem was solved when I added this prompt:

# make an instruction prompt
instruction_prompt = """
Complete the given code. First write whatever is given to you, followed by just completing the rest.
Ensure you have wrote the full function. Do not Write anything else other than completing the function.\n
"""

And the model I used was gpt-3.5-turbo for HumanEval.
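For context, the way I combine this with each task prompt is roughly the following (a sketch of my setup, not harness code; it uses the instruction_prompt defined above, the standard OpenAI chat completions client, and assumes OPENAI_API_KEY is set in the environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_completion(task_prompt: str) -> str:
    """Prepend the instruction prompt, then ask gpt-3.5-turbo to complete it."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,  # deterministic generation for pass@1
        messages=[{"role": "user", "content": instruction_prompt + task_prompt}],
    )
    return response.choices[0].message.content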

@loubnabnl
Collaborator

Maybe check this code that the OctoCoder authors submitted for evaluating OpenAI models on HumanEvalSynthesize: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack_openai.py

@Anindyadeep
Author

However, if you do tests and find your implementation to work/match public numbers

Very interesting and weird: I used gpt-3.5-turbo with deterministic generation and pass@1 for HumanEval with the above prompt, and it got a score of 0.62, whereas the CodeLlama and Mistral papers report 48.1.

But just using prompt + \n got me a result of 0.0016.

One reason for this could be that gpt-3.5 has evolved between the time CodeLlama was evaluated and the time I am running this evaluation. And I am not sure whether it was contaminated with the same examples or not.

@ALLISWELL8

@Anindyadeep Can you open source your project?

@Anindyadeep
Author

@Anindyadeep Can you open source your project?

Yeah, we will do that shortly :)

@Anindyadeep
Author

However, if you do tests and find your implementation to work/match public numbers for certain tasks like instruct-humaneval or other tasks like HumanEvalSynthesize, then feel free to open a PR and we can consider adding this setup for a restricted set of benchmarks if it integrates well with the codebase.

@loubnabnl I did not check instruct-humaneval, but I did check humaneval, and the results were higher than the results from the CodeLlama implementation and very similar to the latest DeepSeek Coder.

So, can you confirm whether the interface below is okay if I open the PR? Feel free to suggest changes if any.

python3  main.py \
  --model gpt-3.5-turbo \
  --max_length_generation 512 \
  --tasks instruct-humaneval \
  --instruction_tokens <user_token>,<end_token>,<assistant_token> \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 10 \
  --allow_code_execution

@loubnabnl
Collaborator

Yes, feel free to open a PR and add the scores you got.

@Anindyadeep
Author

Hi @loubnabnl, I started a PR. Let me know which benchmarks I need to evaluate with this, so that I can add the results too.

Thanks
