## Main updates
- HumanEval+ and MBPP+ datasets are now on the Hugging Face Hub (see the loading sketch after this list).
- HumanEval+ has been ported to the original HumanEval format, and the release files have a new home.
- You can now use EvalPlus through bigcode-evaluation-harness.
- The Docker image now uses Python 3.10, since some models may generate Python code using the latest syntax, leading to false positives under older Python versions.
- The sanitizer is now merged into the package (a usage sketch follows this list).
- Several improvements and bug fixes to the sanitizer.
- Test suite reduction has moved to `tools`.
- Fixed the `CACHE_DIR` non-existence issue.
- Simplified the format of `eval_results.json` for readability.
- Use the `EVALPLUS_TIMEOUT_PER_TASK` environment variable to set the maximum testing time for each task (see the sketch after this list).
- The timeout per test is set to 0.5s by default.
- Fixed argument validation for `inputgen.py`.
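As a quick illustration of the Hub release, here is a minimal loading sketch; the dataset IDs, the `test` split, and the field name are assumptions for illustration, not confirmed by this release note:

```python
# Minimal loading sketch. The dataset IDs ("evalplus/humanevalplus",
# "evalplus/mbppplus"), the "test" split, and the "task_id" field are
# assumptions for illustration.
from datasets import load_dataset

humanevalplus = load_dataset("evalplus/humanevalplus", split="test")
mbppplus = load_dataset("evalplus/mbppplus", split="test")

print(humanevalplus[0]["task_id"])  # e.g. "HumanEval/0"
```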
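Since the sanitizer now ships inside the package, it can be invoked as a module. A hedged sketch follows; the module path and the `--samples` flag are assumptions based on the package layout:

```python
# Hedged sketch: run the bundled sanitizer over a samples file.
# The module path "evalplus.sanitize" and the "--samples" flag are
# assumptions for illustration.
import subprocess
import sys

subprocess.run(
    [sys.executable, "-m", "evalplus.sanitize", "--samples", "samples.jsonl"],
    check=True,
)
```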
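And a sketch of raising the per-task testing budget via the environment variable; only `EVALPLUS_TIMEOUT_PER_TASK` comes from this release note, while the evaluator module path and CLI flags are assumptions for illustration:

```python
# Sketch: raise the per-task testing cap to 60 seconds via the env var
# named in this release. The evaluator module path and CLI flags below
# are assumptions for illustration.
import os
import subprocess
import sys

env = dict(os.environ, EVALPLUS_TIMEOUT_PER_TASK="60")
subprocess.run(
    [sys.executable, "-m", "evalplus.evaluate",
     "--dataset", "humaneval", "--samples", "samples.jsonl"],
    env=env,
    check=True,
)
```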
## Dataset maintenance
- `HumanEval/32`: fixed the oracle.
## Supported codegen models
- The EvalPlus leaderboard now lists 82 models, including:
  - WizardCoders
  - Stable Code
  - OpenCodeInterpreter
  - Anthropic API
  - Mistral API
  - CodeLlama Instruct
  - Phi-2
  - Solar
  - Dolphin
  - OpenChat
  - CodeMillenials
  - Speechless
  - xdan-l1-chat
  - etc.
PyPI: https://pypi.org/project/evalplus/0.2.1/
Docker Hub: https://hub.docker.com/layers/ganler/evalplus/v0.2.1/images/sha256-2bb315e40ea502b4f47ebf1f93561ef88280d251bdc6f394578c63d90e1825d7