Integrate MultiPL-E #44

Merged · 27 commits · Apr 22, 2023

Conversation

@loubnabnl (Collaborator) commented Feb 8, 2023

Integration of the MultiPL-E HumanEval version in 18 programming languages.

@arjunguha (Contributor) left a comment

I've made some notes, but overall this LGTM.

I ran a small number of problems with both Python and C++.

4 resolved review comments on lm_eval/tasks/multiple.py
@arjunguha (Contributor)

This PR pulls out code that we normally run inside the MultiPL-E evaluation container.

I think the easiest way to address the dependency problem is the following:

  1. Tell the user: "you had better have the dependencies installed!" (a minimal pre-flight check is sketched below).
  2. We can give them a container with both the PL toolchains and the eval-harness dependencies installed, along with some instructions on how to run commands in a container.
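
To illustrate option 1, here is a minimal, hypothetical pre-flight check (not part of this PR); the executable names below are assumptions and would need to match whatever MultiPL-E's per-language evaluators actually invoke:

```python
import shutil

# Hypothetical mapping from a MultiPL-E language suffix to the executable its
# evaluator needs on PATH; the entries here are illustrative assumptions.
REQUIRED_TOOLCHAINS = {
    "py": "python3",
    "cpp": "g++",
    "java": "javac",
    "js": "node",
}

def check_toolchain(lang: str) -> None:
    """Fail loudly if the toolchain needed to execute `lang` generations is missing."""
    exe = REQUIRED_TOOLCHAINS.get(lang)
    if exe is None:
        return  # language not covered by the illustrative table above
    if shutil.which(exe) is None:
        raise RuntimeError(
            f"multiple-{lang} needs `{exe}` on PATH to execute generations; "
            "install it locally or run the harness inside the MultiPL-E "
            "evaluation container."
        )

# Example: check_toolchain("cpp") before executing C++ generations.
```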

@loubnabnl (Collaborator, Author)

Yes, exactly! I'll upload some code and instructions for using the container.

@loubnabnl marked this pull request as ready for review on March 14, 2023 at 16:24.
@ytzi commented Mar 31, 2023

Re: performance issues.

I have obtained the following results for Python and Java on HumanEval:

Python:
  pass@1:   0.181  (temp 0.2)
  pass@10:  0.284  (temp 0.8)
  pass@100: 0.466  (temp 0.8)
Java:
  pass@1:   0.143  (temp 0.2)
  pass@10:  0.252  (temp 0.8)
  pass@100: 0.416  (temp 0.8)

These are pretty consistent with previously self-reported numbers (off by < 0.02).
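
For context on how such numbers are computed, pass@k is typically estimated per problem with the unbiased estimator from the Codex/HumanEval paper and then averaged over problems. A minimal sketch (the harness's actual code may differ in details; the sample counts below are made up):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), for n samples of which c pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: with n_samples = 200 per problem, average over problems.
correct_counts = [120, 37, 0, 200]  # hypothetical per-problem pass counts
print(np.mean([pass_at_k(200, c, 100) for c in correct_counts]))  # pass@100
```

This is why a large n_samples (e.g. 200) matters for pass@100: the estimator needs n well above k to have low variance.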

@loubnabnl (Collaborator, Author) commented Apr 22, 2023

This implementation now matches the original MultiPL-E on all scores, including pass@100, after this fix:

{
  "multiple-py": {
    "pass@10": 0.29917045146858745,
    "pass@100": 0.4996997700167089
  },
  "config": {
    "model": "bigcode/santacoder",
    "temperature": 0.8,
    "n_samples": 200
  }
}

merging the PR 🥳

@loubnabnl merged commit 3ad3b8d into bigcode-project:main on Apr 22, 2023 (1 check passed).
@loubnabnl mentioned this pull request on Apr 30, 2023.