Commit 7a89464: Sync with upstream (#9)
* Update generate_until_template_yaml (EleutherAI#1546)

* Update ifeval.yaml (EleutherAI#1506)

* add Arabic EXAMS benchmark (EleutherAI#1498)

* add Arabic EXAMS benchmark

* fixed the linter issue and added more information to the README

* Update README.md

---------

Co-authored-by: Lintang Sutawika <lintang@sutawika.com>

* AGIEval (EleutherAI#1359)

* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------

Co-authored-by: Alex Bäuerle <alex@a13x.io>

* cli_evaluate calls simple_evaluate with the same verbosity. (EleutherAI#1563)

* add manual tqdm disabling management (EleutherAI#1569)

* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix README section on vllm integration (EleutherAI#1579)

* Link to vllm integration

* add pip install .[vllm] cmd

* Fix Jinja template for Advanced AI Risk (EleutherAI#1587)

* Proposed approach for testing CLI arg parsing (EleutherAI#1566)

* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true

* Patch for Seq2Seq Model predictions (EleutherAI#1584)

* Differentiate _encode_pair setting for decoder and enc-dec models

* tok_decode to not skip special tokens so that eos doesn't become an empty string

* Update model.py

* Update model.py

* Update huggingface.py

* Update lm_eval/models/huggingface.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Update model.py

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Add start date in results.json (EleutherAI#1592)

* Cleanup for v0.4.2 release (EleutherAI#1573)

* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key

* Fix eval_logger import for mmlu/_generate_configs.py (EleutherAI#1593)

* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* use BOS token in loglikelihood (EleutherAI#1588)

* use BOS token in loglikelihood

* improve comments

* add model arg

* log prefix token id

* log prefix token id

* Update lm_eval/api/model.py

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* change name to prefix_token_id

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Revert "Patch for Seq2Seq Model predictions (EleutherAI#1584)" (EleutherAI#1601)

This reverts commit b7923a8.

* fix gen_kwargs arg reading (EleutherAI#1607)

* fix until arg processing (EleutherAI#1608)

* Fixes to Loglikelihood prefix token / VLLM (EleutherAI#1611)

* make vllm use prefix_token_id ; have prefix_token_id be optional method to define

* custom_prefix_token_id wasn't set if not passed

* Add ACLUE task (EleutherAI#1614)

* Add task ACLUE

* fix minor bug

* fix code style

* fix code style

* OpenAI Completions -- fix passing of unexpected 'until' arg (EleutherAI#1612)

* add logging of model args (EleutherAI#1619)

* add logging of model args

* nit

* Add warnings.

* nit

* add warning

* nit

* Add vLLM FAQs to README (EleutherAI#1625) (EleutherAI#1633)

* peft Version Assertion (EleutherAI#1635)

* peft Version Assertion

* fix the linter issue

* Seq2seq fix (EleutherAI#1604)

* fix on --task list

* add fixes to tokenization

* differentiate encoding for seq2seq and decoder

* return token setting

* format for pre-commit

* Seq2seq fix, pt2 (EleutherAI#1630)

* getting model class only when defined

* encode_pair handles None, add_special_tokens turned into dict with default value

---------

Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>

* Integration of NeMo models into LM Evaluation Harness library (EleutherAI#1598)

* Integration of NeMo models into LM Evaluation Harness library

* rename nemo model as nemo_lm

* move nemo section in readme after hf section

* use self.eot_token_id in get_until()

* improve progress bar showing loglikelihood requests

* data replication or tensor/pipeline replication working fine within one node

* run pre-commit on modified files

* check whether dependencies are installed

* clarify usage of torchrun in README

* Fix conditional import for Nemo LM class (EleutherAI#1641)

* Fix SuperGlue's ReCoRD task following regression in v0.4 refactoring (EleutherAI#1647)

* Add Latxa paper evaluation tasks for Basque (EleutherAI#1654)

* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit

* Fix CLI --batch_size arg for openai-completions/local-completions (EleutherAI#1656)

The OpenAI interface supports batch size as an argument to the completions API, but does not support specifying it on the CLI, i.e. `lm_eval --model openai-completions --batch_size 16 ...`, because of a simple missing str->int conversion.

This is confirmed by my usage and the stack trace from running `OPENAI_API_KEY=dummy lm_eval --model local-completions --tasks gsm8k --batch_size 16 --model_args model=nm-testing/zephyr-beta-7b-gptq-g128,tokenizer_backend=huggingface,base_url=http://localhost:8000/v1`:
```
Traceback (most recent call last):
  File "/home/michael/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/michael/code/lm-evaluation-harness/lm_eval/__main__.py", line 341, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 251, in simple_evaluate
    results = evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 390, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 263, in generate_until
    list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size)),
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 251, in sameuntil_chunks
    if len(ret) >= size or x[1] != lastuntil:
TypeError: '>=' not supported between instances of 'int' and 'str'
```
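
For illustration only, the missing str->int coercion amounts to something like the following sketch (the helper name is hypothetical, not the actual patch):

```python
# Hypothetical sketch of the missing coercion: the CLI hands batch_size
# over as a string, so it must be converted before integer comparisons.
def coerce_batch_size(value):
    """Return an int batch size, passing 'auto' and None through unchanged."""
    if value is None or value == "auto":
        return value
    return int(value)  # "16" -> 16, so `len(ret) >= size` compares ints


assert coerce_batch_size("16") == 16
assert coerce_batch_size("auto") == "auto"
```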

* Patch QQP prompt (EleutherAI#1661)

* TMMLU+ implementation (EleutherAI#1394)

* implementation of TMMLU+

* implemented: TMMLU+

**TMMLU+: Large-scale Traditional Chinese Massive Multitask Language Understanding**

- 4 categories
    - STEM
    - Social Science
    - Humanities
    - Other

The TMMLU+ dataset, encompassing over 67 subjects and 20,160 tasks, is six times larger and more balanced than its predecessor, TMMLU, and includes benchmark results from closed-source models as well as 20 open-weight Chinese large language models ranging from 1.8B to 72B parameters. However, Traditional Chinese variants continue to underperform compared to major Simplified Chinese models.

```markdown
Total number of tasks in the 'test' sets: 20160
Total number of tasks in the 'validation' sets: 2247
Total number of tasks in the 'train' sets: 335
```

* Remove print from __init__.py

I forgot to remove a debug print from the code.

* update: move TMMLU+ config generation program into default

* fix: use the training set as the few-shot examples

* update: README for TMMLU+

* update: a small changes of TMMLU+ README file

* pre-commit run-through

* Add README for TMMLU+ dataset

* run precommit

* trigger precommit again

* trigger precommit again

* isort is fussy

* isort is fussy

* format, again

* oops

* oops

---------

Co-authored-by: lintang <lintang@eleuther.ai>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Anthropic Chat API (EleutherAI#1594)

* claude3

* supply for anthropic claude3

* supply for anthropic claude3

* anthropic config changes

* add callback options on anthropic

* line passed

* claude3 tiny change

* help anthropic installation

* mention sysprompt / being careful with format in readme

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* fix bug EleutherAI#1664 (EleutherAI#1670)

* fix bug EleutherAI#1664

* handle invalid characters in filenames for Windows and Unix-like systems

see:
https://gist.github.com/doctaphred/d01d05291546186941e1b7ddc02034d3?permalink_comment_id=3958715

* Update lm_eval/__main__.py

* Update scripts/zeno_visualize.py

* fix format

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Update README.md (EleutherAI#1680)

* Add delta weights model loading (EleutherAI#1712)

* added delta weights

* removed debug

* readme update

* better error handling

* autogptq warn

* warn update

* peft and delta error, explicitly deleting _model_delta

* linter fix

* Add `neuralmagic` models for `sparseml` and `deepsparse` (EleutherAI#1674)

* Add neuralmagic models for SparseML and DeepSparse

* Update to latest and add test

* Format

* Fix list to List

* Format

* Add deepsparse/sparseml to automated testing

* Update pyproject.toml

* Update pyproject.toml

* Update README

* Fixes for dtype and device

* Format

* Fix test

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Address review comments!

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix error when appending eot_token_id for generate_until tasks (EleutherAI#1699)

* Adding retries and rate limit to toxicity tasks  (EleutherAI#1620)

* reference `--tasks list` in README (EleutherAI#1726)

EleutherAI#1698

* Add XNLIeu: a dataset for cross-lingual NLI in Basque (EleutherAI#1694)

* add xnli_eu tasks

* update tasks readme

* update readme

* Fix Parameter Propagation for Tasks that have `include`  (EleutherAI#1749)

* Update task.py

* Update __init__.py

* Support individual scrolls datasets (EleutherAI#1740)

* Support individual scrolls datasets

* Add qmsum context

* Fix formatting

* Add filter registry decorator (EleutherAI#1750)

* Add register_filter decorator

* Add register_filter docs
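
As a hedged usage sketch (module paths and the `Filter.apply` signature are assumptions based on the harness's filter API, not verbatim from this PR):

```python
# Hypothetical example of registering a custom filter with the new decorator.
from lm_eval.api.filter import Filter
from lm_eval.api.registry import register_filter


@register_filter("lowercase")
class LowercaseFilter(Filter):
    def apply(self, resps, docs):
        # resps is a per-document list of lists of model responses
        return [[r.lower() for r in resp] for resp in resps]
```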

* remove duplicated `num_fewshot: 0` (EleutherAI#1769)

* Pile 10k new task (EleutherAI#1758)

* Add Pile-10k readme

* Add Pile-10k task configuration file

* Fix m_arc choices (EleutherAI#1760)

* Update utils.py

This is a 4-choice task; option_e is null for all but 3 samples

* Fix options

Adaptive choices

* add option e

* bump multilingual arc version

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* upload new tasks (EleutherAI#1728)

* upload new tasks

* add readmes

* run linters

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* vllm lora support (EleutherAI#1756)

* vllm lora support

* remove print

* version check, rename lora kwarg

* Add option to set OpenVINO config (EleutherAI#1730)

* Add option to set OpenVINO config

* Use utils.eval_logger for logging

* evaluation tracker implementation (EleutherAI#1766)

* evaluation tracker implementation

* OVModelForCausalLM test fix

* typo fix

* moved methods args

* multiple args in one flag

* loggers moved to dedicated dir

* improved filename sanitization

* eval tracker args fix (EleutherAI#1777)

* limit fix (EleutherAI#1785)

* remove echo parameter in OpenAI completions API (EleutherAI#1779)

* remove echo parameter in OpenAI completions API

* remove context length parameter doc string

* Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args` (EleutherAI#1776)

fix `----hf_hub_log_args` to `--hf_hub_log_args`

* Fix bug in setting until kwarg in openai completions (EleutherAI#1784)

* Provide ability for custom sampler for ConfigurableTask (EleutherAI#1616)

* Added fewshot sampling seeds to evaluator.simple_evaluate signature

A way to control the seed of fewshot sampling; may help with EleutherAI#1591

* Added ability for custom sampler for ConfigurableTask

It may be set in the config like:
```
fewshot_config:
  sampler: !function utils.MyFewshotSampler
```
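
A rough sketch of what such a sampler might look like (in the harness the custom sampler subclasses `lm_eval.api.samplers.ContextSampler`; the standalone class and constructor below are simplifying assumptions):

```python
# Minimal illustrative sampler: draws n distinct few-shot docs with its own RNG.
import random


class MyFewshotSampler:
    def __init__(self, docs, rnd=None):
        self.docs = list(docs)
        self.rnd = rnd or random.Random(1234)

    def sample(self, n):
        # return n distinct documents to use as few-shot examples
        return self.rnd.sample(self.docs, n)
```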

* explicitly set fewshot random generator seed for HFLM generate_until_task test

* add backward compatibility for three args seed setup

* save seeds info to logs/reports

* Update `--tasks list` option in interface documentation (EleutherAI#1792)

* Fix Caching Tests ; Remove `pretrained=gpt2` default (EleutherAI#1775)

* link to the example output on the hub (EleutherAI#1798)

* Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant (EleutherAI#1793)

* add Hendrycks MATH (no sympy checking) variant

* add readmes for MATH tasks

* Logging Updates (Alphabetize table printouts, fix eval tracker bug) (EleutherAI#1774) (EleutherAI#1791)

* fix auto-batch size bug for seq2seq models

* alphabetize task + group tables ; fix eval tracker bug

* fix eval tracker bug

* Initial integration of the Unitxt to LM eval harness (EleutherAI#1615)

* Initial support for Unitxt datasets in LM Eval Harness

See  https://github.com/IBM/unitxt

The script 'generate_yamls.py' creates LM Eval Harness yaml files corresponding to Unitxt datasets specified in the 'unitxt_datasets' file.

The glue code required to register Unitxt metrics is in 'unitxt_wrapper.py'.

* Added dataset loading check to generate_yaml

Improved error messages.

* Speed up generate_yaml

Added printouts and improved error message

* Added output printout

* Simplified integration of unitxt datasets

Store all the common yaml configuration in a yaml include shared by all datasets of the same task.

* Post code review comments - part 1

1. Made sure include files don't end with 'yaml' so they won't be marked as tasks
2. Added more datasets and tasks (NER, GEC)
3. Added README

* Post code review comments - part 2

1. Added a unitxt install option in pyproject.toml:
`pip install 'lm_eval[unitxt]'`
2. Added a check that unitxt is installed, printing a clear error message if not
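
A minimal sketch of the kind of check described (exact wording and placement assumed):

```python
# Illustrative import guard: fail early with a clear message if unitxt is absent.
try:
    import unitxt  # noqa: F401
except ImportError as exc:
    raise ImportError(
        "Unitxt tasks require the unitxt extra: pip install 'lm_eval[unitxt]'"
    ) from exc
```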

* Committed missing pyproject change

* Added documentation on adding datasets

* More doc changes

* add unitxt extra to readme

* run precommit

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* add task for mmlu evaluation in arc multiple choice format (EleutherAI#1745)

* add mmlu arc style evaluation

* rename arc_style to continuation

---------

Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>

* Update flag `--hf_hub_log_args` in interface documentation (EleutherAI#1806)

* update interface documentation with flag --hf_hub_logs_arg

* update interface documentation with flag --hf_hub_logs_arg 2

* Copal task (EleutherAI#1803)

* add copal

* change name to copal id for clarity and the task name

* remove `copal_id...` from the yaml to make it work

* checkmark on README

* change group name to `copal_id`

* Adding tinyBenchmarks datasets (EleutherAI#1545)

* Add tinyBenchmarks

* Add acknowledgements

* Add ordering of outputs for data-parallel

* Run pre-commit

* Add few_shot specifications

* Add tinyBenchmarks post-processing

* add conditional import ; fix task names

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* interface doc update (EleutherAI#1807)

* Fix links in README guiding to another branch (EleutherAI#1838)

* Fix: support PEFT/LoRA with added tokens (EleutherAI#1828)

* resize model embeddings

* resize only

* tokenizer help

* load tokenizer before model

* add comment and run precommit lint

* Add log message

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fixed incorrect check for task type (replace `~` with `not`) (EleutherAI#1865)

* fixed docs typos (EleutherAI#1863)

* Update polemo2_out.yaml (EleutherAI#1871)

* Unpin vllm in dependencies (EleutherAI#1874)

* Fix outdated links to the latest links in `docs` (EleutherAI#1876)

* [HFLM]Use Accelerate's API to reduce hard-coded CUDA code (EleutherAI#1880)

* Fix `batch_size=auto` for HF Seq2Seq models (EleutherAI#1765) (EleutherAI#1790)

* fix auto-batch size bug for seq2seq models

* run linter

* Fix Brier Score (EleutherAI#1847)

`gold_one_hot` needs to follow the class dimension of predictions so that it still works when `--limit` is used and the indices in gold do not cover all possible gold classes.
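
An illustrative sketch of the fix (not the harness's exact code): size the one-hot matrix by the predictions' class dimension rather than by the gold labels that happen to be present.

```python
# Build gold_one_hot with as many columns as the predictions have, so a
# --limit'ed subset that never shows some classes still lines up.
import numpy as np

preds = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])  # (n_samples, n_classes)
gold = np.array([0, 1])              # class 2 never appears in this subset
gold_one_hot = np.eye(preds.shape[1])[gold]
brier = np.mean(np.sum((preds - gold_one_hot) ** 2, axis=1))
```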

* Fix for bootstrap_iters = 0 case (EleutherAI#1715) (EleutherAI#1789)

* add handling for bootstrap_iters=0 case

* add more detail to docstring

* run precommit

* add mmlu tasks from pile-t5 (EleutherAI#1710)

* add mmlu tasks from pile-t5

* Update _mmlu_flan_cot_fewshot_template_yaml

* Update _mmlu_flan_cot_zeroshot_template_yaml

* Update _mmlu_flan_generative_template_yaml

* Update _mmlu_flan_loglikelihood_template_yaml

* Update _default_template_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Bigbench fix (EleutherAI#1686)

* edit process multiple-choice

* split template yaml

* remove

* modified multiple_choice tasks

* update

* Update multiple_choice_template_b_yaml

* Update multiple_choice_template_a_yaml

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Rename `lm_eval.logging -> lm_eval.loggers` (EleutherAI#1858)

* rename lm_eval.logging module

* fix evaluation tracker args

* Updated vllm imports in vllm_causallms.py (EleutherAI#1890)

* Reorder vllm imports in vllm_causallms.py

* Update vllm_causallms.py

* [HFLM]Add support for Ascend NPU (EleutherAI#1886)

* [HFLM]Add support for Ascend NPU

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>

* bump accelerate dependency version to 0.26.0 for NPU compat.

---------

Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* `higher_is_better` tickers in output table (EleutherAI#1893)

* Higher is better tickers in output table

* add extra check for `higher_is_better` not being None already

* Update lm_eval/evaluator.py

* fixup format I messed up

* add comment (and retrigger tests)

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add dataset card when pushing to HF hub (EleutherAI#1898)

* dataset card initial

* few fixes

* adds groups for math, mmlu, gpqa

* added summary args

* moved sanitize_list to utils

* readme update

* recreate metadata moved

* multiple model support

* results latest split fix

* readme update and small refactor

* fix grouping

* add comments

* added pathlib

* corrected pathlib approach

* check whether to create a metadata card

* convert posix paths to str

* default hf org from token

* hf token value error

* Add logs after successful upload

* logging updates

* dataset card example in the readme

---------

Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>

* Making hardcoded few shots compatible with the chat template mechanism (EleutherAI#1895)

* init test 1

* fix

* this format seems to be working - need to update all other tasks with the new format

* bbh with few shot format

* fix fewshot bbh

* add mmlu flan cot

* samples of cot

* kmmlu

* fix gsm8k

* update keys for mmlu

* minerva math

* bbh

* fix

* fix samples

* small fixes to templates

* last prompt format change

* fixing prompt

* fixed minerva math format

* rm accidental commited file

* added doc for few shot samples

* Update lm_eval/loggers/evaluation_tracker.py

* Update lm_eval/loggers/evaluation_tracker.py

* Update docs/new_task_guide.md

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* added check in sampler per code review

* added the system from a function, plus an example in minerva math

* style

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* fix unit tests 1

* forcing use of test split

---------

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Try to make existing tests run little bit faster (EleutherAI#1905)

* Fix fewshot seed only set when overriding num_fewshot (EleutherAI#1914)

Fix EleutherAI#1906

* Complete task list from pr 1727 (EleutherAI#1901)

* added tasks and task family descriptors

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* apply format

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Add chat template (EleutherAI#1873)

* initial chat template

* tokenizer attribute check

* variable rename

* interface update

* system instruction

* system inst default update

* fewshot as multiturn

* typing update

* indent update

* added comments

* Adding a fewshot in a more readable way

* linting

* Moved apply chat template to LM

* multiturn alternation fix

* cache key update

* apply chat template method fix

* add system prompt hash to cache_key

* tokenizer name property for cache_key

* property name fix

* linting backward compatibility fix

* docs and errors update

* add documentation on adding chat template compatibility to model_guide

* fewshot as multiturn check fix

* saving system inst and chat template in results

* eval tracker update

* docs update

* Apply suggestions from code review

Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>

* Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data (EleutherAI#1867)

* glianorex tasks

* Create README.md

* Update README.md

* Update README.md

* fix formatting

* fix internal formatting

* Modify pre-commit hook to check merge conflicts accidentally committed not at current merge commit (EleutherAI#1927)

* [add] fld logical formula task (EleutherAI#1931)

* Add new Lambada translations (EleutherAI#1897)

* added tasks and task family descriptors

* configs for the new lambada translations

* continue work on task list w/ links; slightly reorganize README

* Apply suggestions from code review

* Rename file so that it'll preview in Github when viewing lm_eval/tasks folder

* Update new_task_guide.md

* Update README.md

* run linter

* Add language column to task table; Add missing tasks to task table; fix nq_open and storycloze READMEs

* fix typo

* update `lm_eval/tasks/README.md` with task description

---------

Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: anthony <anthonydipofi@gmail.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Implement NoticIA (EleutherAI#1912)

* Noticia

* test

* Final tests implementation

* Fixes

* Fix linters

* Add the Arabic version of the PICA benchmark (EleutherAI#1917)

* Update siqa.yaml (EleutherAI#1909)

* Update basque-glue (EleutherAI#1913)

* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml

* Test output table layout consistency (EleutherAI#1916)

* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`

* Update __main__.py (EleutherAI#1939)

* Add the Arabic version, refactoring Arabic pica into the alghafa folder (EleutherAI#1940)

* Results filenames handling fix (EleutherAI#1926)

* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils

* Remove AMMLU Due to Translation (EleutherAI#1948)

* Update README.md

* Delete lm_eval/tasks/ammlu directory

* add include_defaults kwarg to taskmanager, add tests for include_path (EleutherAI#1856)

* add hacky add_bos_token forcing for Gemma to VLLM too (EleutherAI#1857)

* Update interface.md (EleutherAI#1955)

* Fix self.max_tokens in anthropic_llms.py (EleutherAI#1848)

Fix bug where `self.max_tokens` was not set

* `samples` is newline delimited (EleutherAI#1930)

* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix `--gen_kwargs` and VLLM (`temperature` not respected) (EleutherAI#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* make write_out.py explicitly error if no splits match (EleutherAI#1796)

Co-authored-by: lintangsutawika <lintang@eleuther.ai>

* fix: add directory filter to os.walk to ignore 'ipynb_checkpoints' (EleutherAI#1956)

* fix: add filter to os.walk to ignore 'ipynb_checkpoints'

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* add trust_remote_code  for piqa (EleutherAI#1983)

Signed-off-by: changwangss <chang1.wang@intel.com>

* Fix self assignment in neuron_optimum.py (EleutherAI#1990)

* [New Task] Add Paloma benchmark (EleutherAI#1928)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Fix Paloma Template yaml (EleutherAI#1993)

* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Log `fewshot_as_multiturn` in results files (EleutherAI#1995)

* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>

* Added ArabicMMLU (EleutherAI#1987)

* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`

* Fix Datasets `--trust_remote_code` (EleutherAI#1998)

* Add BertaQA dataset tasks (EleutherAI#1964)

* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>

* Fix precommit hook, update run_models.sh

---------

Signed-off-by: changwangss <chang1.wang@intel.com>
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
Co-authored-by: khalil <90086758+khalil-Hennara@users.noreply.github.com>
Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
Co-authored-by: Alex Bäuerle <alex@a13x.io>
Co-authored-by: Wongboo <44860323+Wongboo@users.noreply.github.com>
Co-authored-by: achervyakov <77295913+artemorloff@users.noreply.github.com>
Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com>
Co-authored-by: Rylan Schaeffer <rylanschaeffer@gmail.com>
Co-authored-by: Vicki Boykis <vicki@mozilla.ai>
Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
Co-authored-by: kwrobel.eth <djstrong@gmail.com>
Co-authored-by: Nouf M. Alotaibi <63472979+noufmitla@users.noreply.github.com>
Co-authored-by: Haonan Li <nathan.8270.n@gmail.com>
Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
Co-authored-by: WoosungMyung <115716986+LameloBally@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezperez24@gmail.com>
Co-authored-by: Or Sharir <or@sharir.org>
Co-authored-by: Julen Etxaniz <juletxara@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: ZoneTwelve <zonetwelve159@gmail.com>
Co-authored-by: Seungwoo Ryu <seungwoo.ryu.94@gmail.com>
Co-authored-by: nicho2 <nicho2@laposte.net>
Co-authored-by: KonradSzafer <61851539+KonradSzafer@users.noreply.github.com>
Co-authored-by: Sergio Perez <sergioperezpersonal@gmail.com>
Co-authored-by: sator-labs <129434630+sator-labs@users.noreply.github.com>
Co-authored-by: Brian Vaughan <nairbv@users.noreply.github.com>
Co-authored-by: giorgossideris <56915448+giorgossideris@users.noreply.github.com>
Co-authored-by: Nikita Lozhnikov <nikitml@gmail.com>
Co-authored-by: Chujie Zheng <chujiezhengchn@gmail.com>
Co-authored-by: Gabriel Mukobi <gabrielmukobi@gmail.com>
Co-authored-by: Zehan Li <69186130+jordane95@users.noreply.github.com>
Co-authored-by: Simran Arora <emailsimran@gmail.com>
Co-authored-by: bcicc <142823000+bcicc@users.noreply.github.com>
Co-authored-by: Helena Kloosterman <helena.kloosterman@intel.com>
Co-authored-by: Muhammad Bin Usman <muhammadbin.2003@gmail.com>
Co-authored-by: ciaranby <48831615+ciaranby@users.noreply.github.com>
Co-authored-by: LSinev <LSinev@users.noreply.github.com>
Co-authored-by: aditya thomas <aditya.thomas@alum.mit.edu>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: jonabur <135807120+jonabur@users.noreply.github.com>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login11.mahti.csc.fi>
Co-authored-by: Jonathan Burdge <jburdge@mahti-login12.mahti.csc.fi>
Co-authored-by: Edd <68678137+Erland366@users.noreply.github.com>
Co-authored-by: Lucas Weber <35227161+LucWeber@users.noreply.github.com>
Co-authored-by: Nick Doiron <ndoiron@mapmeld.com>
Co-authored-by: Zafir Stojanovski <zafir.stojanovski@icloud.com>
Co-authored-by: zhabuye <74179177+zhabuye@users.noreply.github.com>
Co-authored-by: Edward Gan <efuzzy@gmail.com>
Co-authored-by: DongGeon Lee <dg.lee@postech.ac.kr>
Co-authored-by: Huazhong Ji <hzji210@gmail.com>
Co-authored-by: jiaqiw09 <jiaqiw960714@gmail.com>
Co-authored-by: zhabuye <2947436155@qq.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.com>
Co-authored-by: Alina Lozovskaia <alinailozovskaya@gmail.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: anthony-dipofi <anthonydipofi@gmail.com>
Co-authored-by: Harish Vadaparty <harishvadaparty@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: MorishT <106973776+MorishT@users.noreply.github.com>
Co-authored-by: Iker García-Ferrero <i.garciaferrerosanpelayo@gmail.com>
Co-authored-by: Zafir Stojanovski <zaf.stojano@gmail.com>
Co-authored-by: Sadra Barikbin <sadraqazvin1@yahoo.com>
Co-authored-by: johnwee1 <91670254+johnwee1@users.noreply.github.com>
Co-authored-by: Wang, Chang <491521017@qq.com>
Co-authored-by: Yazeed Alnumay <61038456+Yazeed7@users.noreply.github.com>
Showing 996 changed files with 22,541 additions and 9,706 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/new_tasks.yml
@@ -20,13 +20,13 @@ jobs:
with:
fetch-depth: 2 # OR "2" -> To retrieve the preceding commit.

-# Uses the tj-actions/changed-files@v37 action to check for changes.
+# Uses the tj-actions/changed-files action to check for changes.
# Outputs provided here: https://github.com/tj-actions/changed-files#outputs
# The `files_yaml` input optionally takes a yaml string to specify filters,
# and prepends the filter name to the standard output names.
- name: Check task folders
id: changed-tasks
-uses: tj-actions/changed-files@v37.1.2
+uses: tj-actions/changed-files@v44.5.2
with:
# tasks checks the tasks folder and api checks the api folder for changes
files_yaml: |
4 changes: 2 additions & 2 deletions .github/workflows/unit_tests.yml
@@ -32,7 +32,7 @@ jobs:
env:
SKIP: "no-commit-to-branch,mypy"

-uses: pre-commit/action@v3.0.0
+uses: pre-commit/action@v3.0.1
# # mypy turned off for now
# - name: Lint with mypy
# run: mypy . --ignore-missing-imports --check-untyped-defs --explicit-package-bases --warn-unreachable
@@ -56,7 +56,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
-pip install -e '.[dev,anthropic,sentencepiece,optimum]' --extra-index-url https://download.pytorch.org/whl/cpu
+pip install -e '.[dev,anthropic,sentencepiece,optimum,deepsparse,sparseml]' --extra-index-url https://download.pytorch.org/whl/cpu
# Install optional git dependencies
# pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
# if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
18 changes: 9 additions & 9 deletions .pre-commit-config.yaml
@@ -10,6 +10,7 @@ repos:
- id: check-case-conflict
- id: check-json
- id: check-merge-conflict
+args: [--assume-in-merge]
- id: check-symlinks
- id: check-yaml
args: ["--unsafe"]
@@ -28,8 +29,7 @@ repos:
- id: mixed-line-ending
args: [--fix=lf]
- repo: https://github.com/astral-sh/ruff-pre-commit
-# Ruff version.
-rev: v0.2.2
+rev: v0.4.8
hooks:
# Run the linter.
- id: ruff
@@ -38,17 +38,17 @@
# Run the formatter.
- id: ruff-format
- repo: https://github.com/codespell-project/codespell
-rev: v2.2.6
+rev: v2.3.0
hooks:
- id: codespell
exclude: >
(?x)^(
.*\.json|ignore.txt|lm_eval/tasks/.*|.*yaml|.*\.ipynb
)$
args: [--check-filenames, --check-hidden, --ignore-words=ignore.txt]
-- repo: https://github.com/pre-commit/mirrors-mypy
-rev: v1.5.1
-hooks:
-- id: mypy
-additional_dependencies: [".[sentencepiece,multilingual,promptsource,gptq]", "types-PyYAML", "types-requests"]
-exclude: ^tests/.*$
+# - repo: https://github.com/pre-commit/mirrors-mypy
+# rev: v1.5.1
+# hooks:
+# - id: mypy
+# additional_dependencies: [".[sentencepiece,multilingual,promptsource,gptq]", "types-PyYAML", "types-requests"]
+# exclude: ^tests/.*$
104 changes: 97 additions & 7 deletions README.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/README.md
@@ -4,7 +4,7 @@ Welcome to the docs for the LM Evaluation Harness!

## Table of Contents

-* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/interface.md)
-* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
-* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
-* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/task_guide.md).
+* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](./interface.md)
+* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](./model_guide.md).
+* For a crash course on adding new tasks to the library, see our [New Task Guide](./new_task_guide.md).
+* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](./task_guide.md).
35 changes: 25 additions & 10 deletions docs/interface.md
@@ -10,11 +10,11 @@ Equivalently, running the library can be done via the `lm-eval` entrypoint at the

This mode supports a number of command-line arguments, the details of which can also be seen by running with `-h` or `--help`:

-- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#commercial-apis) for a full list of enabled model names and supported libraries or APIs.
+- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#model-apis-and-inference-servers) for a full list of enabled model names and supported libraries or APIs.

- `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, such as, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of what keyword arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)

-- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups.
+- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups. A list of supported tasks can be viewed with `--tasks list`.

- `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.

@@ -42,13 +42,28 @@ This mode supports a number of command-line arguments, the details of which can

- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings in the task YAML file) for each task which was run, at the completion of an evaluation. Useful for when one is modifying a task's configuration YAML locally to transmit the exact configurations used for debugging or for reproducibility purposes.

-- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing ` lm-eval`` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`
+- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.

- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.

- `--apply_chat_template` : If this flag is on, a chat template will be applied to the prompt. For Hugging Face models, the chat template is taken from the tokenizer; if the tokenizer does not have a chat template, a default one will be applied. For other models, chat templating is not currently implemented.

- `--fewshot_as_multiturn` : If this flag is on, the Fewshot examples are treated as a multi-turn conversation. Questions are provided as user content and answers are provided as assistant responses. Requires `--num_fewshot` to be set to be greater than 0, and `--apply_chat_template` to be on.

- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.

* `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g, `--seed 42` sets all three seeds to 42. (A sketch of this mapping follows after this list.)

-* `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list (here.)[https://docs.wandb.ai/ref/python/init]. e.g., ```--wandb_args project=test-project,name=test-run```
+* `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list [here](https://docs.wandb.ai/ref/python/init). e.g., ```--wandb_args project=test-project,name=test-run```

* `--hf_hub_log_args` : Logs evaluation results to Hugging Face Hub. Accepts a string with the arguments separated by commas. Available arguments:
* `hub_results_org` - organization name on Hugging Face Hub, e.g., `EleutherAI`. If not provided, the results will be pushed to the owner of the Hugging Face token,
* `hub_repo_name` - repository name on Hugging Face Hub, e.g., `lm-eval-results`,
* `push_results_to_hub` - whether to push results to Hugging Face Hub, can be `True` or `False`,
* `push_samples_to_hub` - whether to push samples results to Hugging Face Hub, can be `True` or `False`. Requires `--log_samples` to be set,
* `public_repo` - whether the repository is public, can be `True` or `False`,
* `leaderboard_url` - URL to the leaderboard, e.g., `https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard`.
* `point_of_contact` - Point of contact for the results dataset, e.g., `yourname@example.com`.
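
As a hedged illustration of the `--seed` flag described above, `--seed 0,None,8` corresponds roughly to:

```python
# Rough Python equivalent of `--seed 0,None,8` (illustrative only).
import random

import torch

random.seed(0)        # first value: python's random
# second value is None, so numpy's seed is left unset
torch.manual_seed(8)  # third value: torch
```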

## External Library Usage

@@ -77,7 +92,7 @@ task_manager = lm_eval.tasks.TaskManager()

# Setting `task_manager` to the one above is optional and should generally be done
# if you want to include tasks from paths other than ones in `lm_eval/tasks`.
-# `simple_evaluate` will instantiate its own task_manager is the it is set to None here.
+# `simple_evaluate` will instantiate its own task_manager if it is set to None here.
results = lm_eval.simple_evaluate( # call simple_evaluate
model=lm_obj,
tasks=["taskname1", "taskname2"],
@@ -112,8 +127,8 @@ my_model = initialize_my_model()
# - `Your_LM.generate_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)

-# The task_manager indexes tasks including ones
-# specified by the user through `include_path`
+# optional: the task_manager indexes tasks including ones
+# specified by the user through `include_path`.
task_manager = lm_eval.tasks.TaskManager(
include_path="/path/to/custom/yaml"
)
@@ -132,15 +147,15 @@ task_dict = lm_eval.tasks.get_task_dict(
task_manager # A task manager that allows lm_eval to
# load the task during evaluation.
# If none is provided, `get_task_dict`
-# will instantiated one itself, but this
+# will instantiate one itself, but this
# only includes the stock tasks so users
# will need to set this if including
# custom paths is required.
)

-def evaluate(
+results = evaluate(
    lm=lm_obj,
    task_dict=task_dict,
-    ...
-):
+)
```
49 changes: 48 additions & 1 deletion docs/model_guide.md
@@ -6,7 +6,7 @@ In order to properly evaluate a given LM, we require implementation of a wrapper

## Setup

-To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:
+To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your model, and install the project requirements in your environment:

```sh
# After forking...
@@ -107,6 +107,53 @@ Using this decorator results in the class being added to an accounting of the us

We also recommend that new model contributions be accompanied by short tests of their 3 core functionalities, at minimum. To see an example of such tests, look at https://github.com/EleutherAI/lm-evaluation-harness/blob/35bdecd379c0cefad6897e67db892f4a6026a128/tests/test_ggml.py .

## Chat Templating

Many models are fine-tuned with a [Chat Template](https://huggingface.co/docs/transformers/main/en/chat_templating) in order to enable back-and-forth interaction between a "User"'s queries and the model (often called "Assistant")'s responses. It can be desirable to evaluate fine-tuned models on evaluation tasks while wrapped in the conversational format they expect.

In order to make your model optionally compatible with a chat format, three additional methods must be implemented:

```python
class MyCustomLM(LM):
#...
@property
def tokenizer_name(self) -> str:
# should return a string denoting the name of the model's tokenizer and/or the accompanying chat template.

@property
def chat_template(self) -> str:
# should return a chat template formatting string that is used to build prompt from a user/assistant chat history.
# this will be saved in the evaluation results for reproducibility.

def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
# responsible for taking as input a chat history that would be fed into the model, and
# rendering it as a string that can be then tokenized and input into the model.
#...
```

- `apply_chat_template`
- This method performs the bulk of the work required for chat-formatting.
- As input, a `chat_history: List[Dict[str, str]]` is passed in. This is a transcript of a conversation of a form similar to
```
[
{"system": <user-provided system message such as "You are a helpful math-focused chatbot">},
{"user": <task example - a few-shot example 'input'>}
{"assistant": <correct response to the above example>},
# ... more few-shot examples, potentially
{"user": <test set query--response on which we will evaluate>},
]
```
which can then be converted into a string input.
- The output is a string representing this conversation that can be fed into the model.
- For example, this consists of simply calling `tokenizer.apply_chat_template` for HFLM--see the implementation there for reference.
- `tokenizer_name`
- LM Eval Harness supports [caching requests](https://github.com/EleutherAI/lm-evaluation-harness/blob/4902aaaf1f374682f95ac25fe2e13b23faddc91a/lm_eval/__main__.py#L140) that are sent to a model, for faster setup when repeating an already-performed evaluation.
- However, we don't want to use the cache of chat transcripts rendered using one chat template or system prompt to send to a model with a different template! So, we use this `lm.tokenizer_name` string to distinguish caches for a given model (and chat template) from one another.
- `chat_template`
- Chat templates are typically provided as a Jinja template string or a string formatted with str.format to include user and assistant messages in a single prompt. This template string is saved in the evaluation results to ensure reproducibility.

If not implemented for a given model type, the flags `--apply_chat_template` , `--fewshot_as_multiturn`, and `--system_instruction` cannot be used.
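
For reference, a minimal sketch of these three methods for a model backed by a Hugging Face tokenizer might look like the following (the real `HFLM` implementation differs in detail, and `self.tokenizer` is assumed to exist):

```python
from typing import Dict, List


class MyCustomLM(LM):
    @property
    def tokenizer_name(self) -> str:
        # used to keep request caches for different chat templates apart
        return self.tokenizer.name_or_path.replace("/", "__")

    @property
    def chat_template(self) -> str:
        # stored in the results file for reproducibility
        return self.tokenizer.chat_template or ""

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # render the transcript into a single prompt string for the model
        return self.tokenizer.apply_chat_template(
            chat_history, tokenize=False, add_generation_prompt=True
        )
```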

## Other

**Pro tip**: In order to make the Evaluation Harness overestimate total runtimes rather than underestimate them, HuggingFace models come in-built with the ability to provide responses on data points in *descending order by total input length* via `lm_eval.utils.Reorderer`. Take a look at `lm_eval.models.hf_causal.HFLM` to see how this is done, and see if you can implement it in your own model!