
A common interface for APIs and Models. #161

Open
Anindyadeep opened this issue Nov 12, 2023 · 9 comments

Comments

@Anindyadeep

Summary of the issue

First of all, thanks for the awesome effort in building this code-evaluation package. Highly appreciate it. However, right now it is integrated only with Hugging Face models. It would be great if we could run the same evaluation for closed-source models. For example, something like this would be awesome:

accelerate launch  main.py \
  --model gpt-3.5-turbo \
  --max_length_generation 512 \
  --tasks instruct-humaneval \
  --instruction_tokens <user_token>,<end_token>,<assistant_token> \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 10 \
  --allow_code_execution

So, with the same interface and post-processing logic of the code-evaluation harness, we could evaluate and compare code generation for both open-source and closed-source models.

What is the motivation

The motivation behind this is that open-source models are great. However, researchers and LLM enthusiasts are always striving to build models that surpass GPT with fewer parameters and better performance on certain tasks, and a library like this would be really helpful for that comparison.

How can I contribute:

Well, I already have most of this code ready. If you are aligned with the motivation of this issue, I can create the PR. However, the problem I am facing is that the evaluation scores for API-based models are very low. For example, gpt-3.5 gets a score of 0.006 on the HumanEval benchmark, even though the generations are correct. The problem is in the indentation and the post-processing of the generations. For example, one generation from gpt-3.5-turbo looks like this:

from typing import List


def separate_paren_groups(paren_string: str) -> List[str]:
    """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
    separate those group into separate strings and return the list of those.
    Separate groups are balanced (each open brace is properly closed) and not nested within each other
    Ignore any spaces in the input string.
    >>> separate_paren_groups('( ) (( )) (( )( ))')
    ['()', '(())', '(()())']
    """    paren_string = paren_string.replace(' ', '')
    stack = []
    result = []
    group = ''
    for char in paren_string:
        if char == '(':
            if stack:
                group += char
            stack.append(char)
        elif char == ')':
            stack.pop()
            group += char
            if not stack:
                result.append(group)
                group = ''
    return result

If we look at the code above, we can see the problem is the indentation: the first statement ends up on the same line as the closing docstring quotes, so the solution gets marked as wrong during evaluation. I tried to apply bigcode's post-processing for the different tasks, but it was not working, so I would highly appreciate some help there.
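For reference, the heuristic I am currently experimenting with looks roughly like this (just a sketch on my side, not the harness's post-processing): it moves any code that the model glues onto the line that closes the docstring back onto its own line.

def fix_docstring_indentation(generation: str) -> str:
    """Heuristic sketch: if the line that closes a docstring also carries
    code after the closing triple quotes, move that code to its own line."""
    fixed_lines = []
    in_docstring = False
    for line in generation.splitlines():
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        if not in_docstring and stripped.startswith('"""'):
            # Docstring opens here; it stays open unless a second """ closes
            # it on the same line.
            if stripped.count('"""') == 1:
                in_docstring = True
            fixed_lines.append(line)
        elif in_docstring and '"""' in stripped:
            # Docstring closes here; anything after the quotes is code that
            # the model glued onto the same line.
            before, _, after = stripped.partition('"""')
            fixed_lines.append(indent + before + '"""')
            if after.strip():
                fixed_lines.append(indent + after.strip())
            in_docstring = False
        else:
            fixed_lines.append(line)
    return "\n".join(fixed_lines)

Applied to the generation above, this splits the line starting with """ followed by paren_string into a properly closed docstring and a correctly indented statement, but it is only a band-aid for this particular failure mode.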

@loubnabnl
Collaborator

Hi, thanks for the suggestion. I think some challenges in evaluating these models are that they might change and evolve behind the API, which makes the evaluation numbers not very relevant over time. They also might require different post-processing to extract the code snippet, since they tend to generate natural text before and after, so I'm not sure the current approach we have will work out of the box for most tasks.

However, if you do tests and find your implementation to work/match public numbers for certain tasks like instruct-humaneval or other tasks like HumanEvalSynthesize, then feel free to open a PR and we can consider adding this setup for a restricted set of benchmarks if it integrates well with the codebase.

Regarding your indentation issue, I think the prompt is stripped by default and doesn't have a \n at the end in instruct-humaneval. For the humaneval task we have both humaneval and humaneval-unstripped, because we've noticed that GPT-4's tokenizer and a few others like Phind's require keeping the last \n in the prompt to work properly. Can you try evaluating again while adding the \n? You can do that here or in the context of the task.
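For illustration only, assuming you build the prompt yourself before sending it to the API, restoring the newline is as simple as:

# Minimal sketch: keep (or restore) the trailing newline that a stripped
# prompt loses, since some tokenizers need it to continue the function body.
prompt = prompt if prompt.endswith("\n") else prompt + "\n"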

@Anindyadeep
Author

Anindyadeep commented Nov 16, 2023

I tried some of the things mentioned above, but everything was solved simply by adding a prompt. Does that make it a valid solution? For example, for HumanEval the problem was solved when I added this prompt:

# make an instruction prompt
instruction_prompt = """
Complete the given code. First write whatever is given to you, followed by just completing the rest.
Ensure you have wrote the full function. Do not Write anything else other than completing the function.\n
"""

And the model I used was gpt-3.5-turbo for HumanEval.
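For context, the way I combine this with each task prompt is roughly the following (a sketch of my setup, not harness code; it uses the instruction_prompt defined above, the standard OpenAI chat completions client, and assumes OPENAI_API_KEY is set in the environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_completion(task_prompt: str) -> str:
    """Prepend the instruction prompt, then ask gpt-3.5-turbo to complete it."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,  # deterministic generation for pass@1
        messages=[{"role": "user", "content": instruction_prompt + task_prompt}],
    )
    return response.choices[0].message.content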

@loubnabnl
Collaborator

Maybe check this code that the OctoCoder authors submitted for evaluating OpenAI models on HumanEvalSynthesize: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack_openai.py

@Anindyadeep
Author

However, if you do tests and find your implementation to work/match public numbers

Very interesting and weird: I used gpt-3.5-turbo with deterministic generation and pass@1 for HumanEval with the above prompt, and it got a score of 0.62, whereas the CodeLlama and Mistral papers report 48.1.

But just using prompt + \n got me a result of 0.0016.

One reason for this could be that gpt-3.5 has evolved between the time CodeLlama was evaluated and the time I am running this evaluation. And I am not sure whether it was contaminated with the same examples or not.

@ALLISWELL8

@Anindyadeep Can you open source your project?

@Anindyadeep
Author

@Anindyadeep Can you open source your project?

Yeah, we will do that shortly :)

@Anindyadeep
Author

However, if you do tests and find your implementation to work/match public numbers for certain tasks like instruct-humaneval or other tasks like HumanEvalSynthesize, then feel free to open a PR and we can consider adding this setup for a restricted set of benchmarks if it integrates well with the codebase.

@loubnabnl I did not check instruct-humaneval, but I did check humaneval, and the results were higher than the results from the CodeLlama implementation and very similar to the latest DeepSeek Coder.

So, can you confirm whether the interface below is okay if I open the PR? Feel free to suggest changes if any.

python3  main.py \
  --model gpt-3.5-turbo \
  --max_length_generation 512 \
  --tasks instruct-humaneval \
  --instruction_tokens <user_token>,<end_token>,<assistant_token> \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 10 \
  --allow_code_execution

@loubnabnl
Collaborator

Yes, feel free to open a PR and add the scores you got.

@Anindyadeep
Author

Hi @loubnabnl, I started a PR. Let me know which benchmarks I need to evaluate with this, so that I can add the results too.

Thanks
